# Unit 3, Exercise 2: Standardization

This exercise is an extension of Exercise 1. Here, the goal is to add code to standardize the features such that they have a mean of 0 and a standard deviation of 1 as discussed in Unit 3.7.
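
For reference, the transformation can be written per feature as shown below. The symbols are chosen here for illustration and are not taken from the lecture: $x_{ij}$ is feature $j$ of training example $i$, $n$ is the number of training examples, and $\mu_j$ and $\sigma_j$ are the training-set mean and standard deviation of feature $j$. The $n-1$ denominator is an assumption that matches the sample standard deviation computed in Section 4.

$$
x'_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},
\qquad
\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij},
\qquad
\sigma_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_{ij} - \mu_j\right)^2}
$$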

Most of the code below is identical to Exercise 1. To avoid spoiling the solution for Exercise 1, the same code parts are missing.

## 1) Installing Libraries

You likely already have all libraries installed and don't need to do anything here.

In [1]:
# !conda install numpy pandas matplotlib --yes

In [2]:
# !pip install torch

In [3]:
# !conda install watermark

In [4]:
#%load_ext watermark
#%watermark -v -p numpy,pandas,matplotlib,torch

## 2) Loading the Dataset

We are using the familiar `read_csv` function from pandas to load the dataset:

In [5]:
import pandas as pd

In [6]:
df = pd.read_csv("data_banknote_authentication.txt", header=None)
df.head()

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |


In [7]:
X_features = df[[0, 1, 2, 3]].values
y_labels = df[4].values

Number of examples and features:

In [8]:
X_features.shape

(1372, 4)

It is usually a good idea to look at the label distribution:

In [9]:
import numpy as np

np.bincount(y_labels)

array([762, 610])
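
One reason to look at the label distribution is that it gives us a reference point for the accuracies we compute later. The following is a small sketch that reuses the two counts printed above (hard-coded here so the snippet runs on its own) to compute the class proportions and the accuracy of a trivial classifier that always predicts the most frequent class. The variable names are illustrative.

```python
import numpy as np

# Class counts as reported by np.bincount above (hard-coded for this sketch).
counts = np.array([762, 610])

# Fraction of examples that belong to each class.
proportions = counts / counts.sum()

# Accuracy of a trivial classifier that always predicts the most frequent class.
majority_baseline = counts.max() / counts.sum()

print("Class proportions:", proportions.round(4))
print(f"Majority-class baseline accuracy: {majority_baseline*100:.2f}%")
```

Any useful model should do clearly better than this baseline of roughly 55.5%.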

## 3) Defining a DataLoader

The `DataLoader` code is the same code we used in Unit 3.6:

In [10]:
import torch  # needed for torch.tensor in MyDataset below
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self, X, y):

        self.features = torch.tensor(X, dtype=torch.float32)
        self.labels = torch.tensor(y, dtype=torch.float32)

    def __getitem__(self, index):
        x = self.features[index]
        y = self.labels[index]        
        return x, y

    def __len__(self):
        return self.labels.shape[0]



We will be using 80% of the data for training and 20% for validation. In a real project, we would also have a separate dataset for the final test set (in this case, we do not have an explicit test set).

In [11]:
train_size = int(X_features.shape[0]*0.80)
train_size

1097

In [12]:
val_size = X_features.shape[0] - train_size
val_size

275

Using `torch.utils.data.random_split`, we generate the training and validation sets along with the respective data loaders:

In [13]:
import torch

dataset = MyDataset(X_features, y_labels)

torch.manual_seed(1)
train_set, val_set = torch.utils.data.random_split(dataset, [train_size, val_size])

train_loader = DataLoader(
    dataset=train_set,
    batch_size=10,
    shuffle=True,
)

val_loader = DataLoader(
    dataset=val_set,
    batch_size=10,
    shuffle=False,
)
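
As a quick sanity check on the split and the loaders, the sketch below rebuilds the same setup on placeholder data. It assumes that only the sizes matter, so zero tensors and a `TensorDataset` stand in for the real features and `MyDataset`. All `toy_` names are hypothetical. It shows how many mini-batches each loader yields per epoch.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Placeholder data with the same number of rows as the banknote dataset
# (assumption: zeros stand in for the real features and labels).
toy_dataset = TensorDataset(torch.zeros(1372, 4), torch.zeros(1372))

toy_train_size = int(len(toy_dataset) * 0.80)
toy_val_size = len(toy_dataset) - toy_train_size

torch.manual_seed(1)
toy_train, toy_val = random_split(toy_dataset, [toy_train_size, toy_val_size])

toy_train_loader = DataLoader(toy_train, batch_size=10, shuffle=True)
toy_val_loader = DataLoader(toy_val, batch_size=10, shuffle=False)

# With drop_last=False (the default), the final batch may be smaller than
# batch_size, so the number of batches is the size divided by 10, rounded up.
print(len(toy_train_loader), math.ceil(toy_train_size / 10))
print(len(toy_val_loader), math.ceil(toy_val_size / 10))
```

The training loader yields 110 batches per epoch, which matches the `Batch .../110` counter in the training log later on.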

## 4) Standardization

There are multiple ways to implement the standardization procedure. For this exercise, we are going to implement a procedure that standardizes the features after the data loaders have been created, that is, batch by batch inside the training and evaluation loops.

Since this dataset has 4 features, there should be 4 means and 4 standard deviations we compute from the training set. We can do this as follows:

In [14]:
train_mean = torch.zeros(X_features.shape[1])

for x, y in train_loader:
    train_mean += x.sum(dim=0)
    
train_mean /= len(train_set)

train_std = torch.zeros(X_features.shape[1])
for x, y in train_loader:
    train_std += ((x - train_mean)**2).sum(dim=0)

train_std = torch.sqrt(train_std / (len(train_set)-1))

In [15]:
print("Feature means:", train_mean)
print("Feature std. devs:", train_std)

Feature means: tensor([ 0.3854,  1.8680,  1.4923, -1.1999])
Feature std. devs: tensor([2.8575, 5.9216, 4.3869, 2.1041])


We compute the means and standard deviations by iterating over the training loader. This approach works even for large datasets that don't fit into memory as a whole.

A simpler approach, which only works for smaller datasets that fit into memory, is as follows:

In [16]:
all_x = []
for x, y in train_loader:
    all_x.append(x)
    
train_std = torch.concat(all_x).std(dim=0)
train_mean = torch.concat(all_x).mean(dim=0)

In [17]:
print("Feature means:", train_mean)
print("Feature std. devs:", train_std)

Feature means: tensor([ 0.3854,  1.8680,  1.4923, -1.1999])
Feature std. devs: tensor([2.8575, 5.9216, 4.3869, 2.1041])
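
Both approaches print the same values because `torch.std` divides by the number of examples minus one by default, which is the same denominator as in the loop-based version. The sketch below checks this on synthetic data. The tensor `toy_x`, its size, and the batch size of 10 are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)

# Synthetic stand-in for the training features (assumption: 50 examples, 4 features).
toy_x = torch.randn(50, 4) * 3.0 + 1.5
n = toy_x.shape[0]

# Split into mini-batches of 10 to mimic iterating over a data loader.
batches = toy_x.split(10)

# First pass: accumulate per-feature sums to obtain the means.
loop_mean = torch.zeros(4)
for batch in batches:
    loop_mean += batch.sum(dim=0)
loop_mean /= n

# Second pass: accumulate squared deviations from the means.
loop_sq = torch.zeros(4)
for batch in batches:
    loop_sq += ((batch - loop_mean) ** 2).sum(dim=0)
loop_std = torch.sqrt(loop_sq / (n - 1))

# Built-in reductions over the full tensor for comparison.
print(torch.allclose(loop_mean, toy_x.mean(dim=0), atol=1e-4))
print(torch.allclose(loop_std, toy_x.std(dim=0), atol=1e-4))
```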


<font color='red'>YOUR TASK is now to implement a standardization function based on these training set parameters above:</font>

In [18]:
def standardize(features, train_mean, train_std): # YOUR CODE
    # YOUR CODE
    # Subtract the per-feature training mean, then divide by the training std. dev.
    return (features - train_mean) / train_std
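
To see what this function does to a mini-batch, here is a small sketch on hand-picked numbers. The helper `standardize_sketch` and the `toy_` tensors are hypothetical names used only for this illustration. The helper performs the same arithmetic as the function above. The point is that a statistics vector of shape `(num_features,)` broadcasts across all rows of a `(batch_size, num_features)` batch.

```python
import torch


def standardize_sketch(features, mean, std):
    # (batch, num_features) minus (num_features,) broadcasts row-wise.
    return (features - mean) / std


# Toy batch with 3 examples and 2 features (values chosen for illustration).
toy_batch = torch.tensor([[1.0, 10.0],
                          [2.0, 20.0],
                          [3.0, 30.0]])

toy_mean = toy_batch.mean(dim=0)  # per-feature means: 2 and 20
toy_std = toy_batch.std(dim=0)    # per-feature sample std. devs: 1 and 10

toy_scaled = standardize_sketch(toy_batch, toy_mean, toy_std)
print(toy_scaled)
```

After the transformation, both columns contain the values -1, 0, and 1, even though the raw features differ in scale by a factor of 10.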

## 5) Implementing the model

Here, we are reusing the same model code we used in Unit 3.6:

In [19]:
import torch

class LogisticRegression(torch.nn.Module):
    
    def __init__(self, num_features):
        super().__init__()
        self.linear = torch.nn.Linear(num_features, 1)
    
    def forward(self, x):
        logits = self.linear(x)
        probas = torch.sigmoid(logits)
        return probas
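
The sketch below illustrates the shapes involved in a forward pass, which explains why the training loop later reshapes the labels with `class_labels.view(probas.shape)`. It uses a standalone linear layer plus sigmoid with the same structure as the class above. The random inputs and the `toy_` names are assumptions for illustration only.

```python
import torch

torch.manual_seed(1)

# A single linear layer followed by a sigmoid, structured like the model above.
linear = torch.nn.Linear(4, 1)

toy_features = torch.randn(10, 4)  # one mini-batch of 10 examples with 4 features
toy_labels = torch.zeros(10)       # labels arrive as a flat vector of shape (10,)

with torch.no_grad():
    toy_probas = torch.sigmoid(linear(toy_features))

# The model returns one probability per example as a column vector.
print(toy_probas.shape)
# Reshaping the labels gives them the same shape as the model output.
print(toy_labels.view(toy_probas.shape).shape)
```

Both printed shapes are `torch.Size([10, 1])`, which is what `F.binary_cross_entropy` needs in order to compare predictions and targets element by element.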

## 6) The training loop

In this section, we are using the training loop from Unit 3.6. It's the exact same code except for one small modification: we added the line `if not batch_idx % 20` to print the loss only for every 20th batch (to reduce the number of output lines).
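
To make the logging condition concrete, the following sketch lists which batch indices pass the check. It assumes the training set size of 1097 and the batch size of 10 from the sections above. The variable names are illustrative.

```python
import math

num_train_examples = 1097  # training set size computed earlier
batch_size = 10

num_batches = math.ceil(num_train_examples / batch_size)

# `not batch_idx % 20` is True exactly when batch_idx is a multiple of 20,
# because the remainder 0 is falsy in Python.
logged = [batch_idx for batch_idx in range(num_batches) if not batch_idx % 20]

print(num_batches)  # 110
print(logged)       # [0, 20, 40, 60, 80, 100]
```

This corresponds to six log lines per epoch instead of 110.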

<font color='red'>YOUR TASK is to use the standardization code correctly in the for loop. Then, find a good learning rate and epoch number so that you achieve a training and validation performance of at least 98%.</font>

In [20]:
import torch.nn.functional as F


torch.manual_seed(1)
model = LogisticRegression(num_features=4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.03) ## possible SOLUTION

num_epochs = 75 ## possible SOLUTION

for epoch in range(num_epochs):
    
    model = model.train()
    for batch_idx, (features, class_labels) in enumerate(train_loader):

        features = standardize(features, train_mean, train_std) ## SOLUTION
        probas = model(features)
        
        loss = F.binary_cross_entropy(probas, class_labels.view(probas.shape))
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        ### LOGGING
        if not batch_idx % 20: # log every 20th batch
            print(f'Epoch: {epoch+1:03d}/{num_epochs:03d}'
                   f' | Batch {batch_idx:03d}/{len(train_loader):03d}'
                   f' | Loss: {loss:.2f}')

Epoch: 001/075 | Batch 000/110 | Loss: 0.93
Epoch: 001/075 | Batch 020/110 | Loss: 0.74
Epoch: 001/075 | Batch 040/110 | Loss: 0.67
Epoch: 001/075 | Batch 060/110 | Loss: 0.49
Epoch: 001/075 | Batch 080/110 | Loss: 0.43
Epoch: 001/075 | Batch 100/110 | Loss: 0.44
Epoch: 002/075 | Batch 000/110 | Loss: 0.50
Epoch: 002/075 | Batch 020/110 | Loss: 0.38
Epoch: 002/075 | Batch 040/110 | Loss: 0.44
Epoch: 002/075 | Batch 060/110 | Loss: 0.33
Epoch: 002/075 | Batch 080/110 | Loss: 0.31
Epoch: 002/075 | Batch 100/110 | Loss: 0.32
Epoch: 003/075 | Batch 000/110 | Loss: 0.25
Epoch: 003/075 | Batch 020/110 | Loss: 0.35
Epoch: 003/075 | Batch 040/110 | Loss: 0.33
Epoch: 003/075 | Batch 060/110 | Loss: 0.44
Epoch: 003/075 | Batch 080/110 | Loss: 0.26
Epoch: 003/075 | Batch 100/110 | Loss: 0.28
Epoch: 004/075 | Batch 000/110 | Loss: 0.19
Epoch: 004/075 | Batch 020/110 | Loss: 0.15
Epoch: 004/075 | Batch 040/110 | Loss: 0.36
Epoch: 004/075 | Batch 060/110 | Loss: 0.34
Epoch: 004/075 | Batch 080/110 |

## 7) Evaluating the results

Again, reusing the code from Unit 3.6, we will calculate the training and validation set accuracy.

<font color='red'>Use the code below as is. What do you observe? And why?</font>

In [21]:
def compute_accuracy(model, dataloader):

    model = model.eval()
    
    correct = 0.0
    total_examples = 0
    
    for idx, (features, class_labels) in enumerate(dataloader):
        
        with torch.no_grad():
            probas = model(features)
        
        pred = torch.where(probas > 0.5, 1, 0)
        lab = class_labels.view(pred.shape).to(pred.dtype)

        compare = lab == pred
        correct += torch.sum(compare)
        total_examples += len(compare)

    return correct / total_examples

In [22]:
train_acc = compute_accuracy(model, train_loader)
print(f"Accuracy: {train_acc*100:.2f}%")

Accuracy: 85.41%


<font color='red'>Notice that the code for the validation accuracy is not shown? It's part of the exercise to implement it :)</font>

In [23]:
## SOLUTION

val_acc = compute_accuracy(model, val_loader)
print(f"Accuracy: {val_acc*100:.2f}%")

Accuracy: 81.45%


<font color='red'>Now, add the standardization to the `compute_accuracy` function above and recompute the training and validation accuracy. What do you observe?</font>

In [24]:
def compute_accuracy(model, dataloader):

    model = model.eval()
    
    correct = 0.0
    total_examples = 0
    
    for idx, (features, class_labels) in enumerate(dataloader):
        
        features = standardize(features, train_mean, train_std) ## SOLUTION
        with torch.no_grad():
            probas = model(features)
        
        pred = torch.where(probas > 0.5, 1, 0)
        lab = class_labels.view(pred.shape).to(pred.dtype)

        compare = lab == pred
        correct += torch.sum(compare)
        total_examples += len(compare)

    return correct / total_examples

train_acc = compute_accuracy(model, train_loader)
print(f"\nAccuracy: {train_acc*100:.2f}%")


val_acc = compute_accuracy(model, val_loader)
print(f"Accuracy: {val_acc*100:.2f}%")


Accuracy: 98.09%
Accuracy: 98.18%
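
The gap between the two sets of accuracies can be reproduced on a tiny, fully hypothetical example. The sketch below uses hand-set (not trained) logistic regression parameters that suit standardized inputs, plus six made-up one-feature examples. It is only meant to illustrate what happens when a model that expects standardized inputs receives raw ones.

```python
import torch

# Hypothetical 1-feature data: the label is 1 when the raw value exceeds 100.
toy_raw = torch.tensor([[80.0], [90.0], [95.0], [105.0], [110.0], [120.0]])
toy_labels = torch.tensor([0, 0, 0, 1, 1, 1])

toy_mean = toy_raw.mean(dim=0)
toy_std = toy_raw.std(dim=0)

# Hand-set parameters (assumption): a decision boundary at 0 on the standardized scale.
weight, bias = 5.0, 0.0


def toy_predict(x):
    probas = torch.sigmoid(weight * x + bias)
    return torch.where(probas > 0.5, 1, 0).view(-1)


acc_standardized = (toy_predict((toy_raw - toy_mean) / toy_std) == toy_labels).float().mean()
acc_raw = (toy_predict(toy_raw) == toy_labels).float().mean()

print(f"Standardized inputs: {acc_standardized.item()*100:.0f}%")
print(f"Raw inputs: {acc_raw.item()*100:.0f}%")
```

On the raw inputs, every value is far above the decision boundary, so the model predicts class 1 for all examples and is right only half of the time. On the standardized inputs, all six predictions are correct.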


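## OBSERVATIONS
The loss decreases steadily over the course of training, which indicates that the model is learning from the standardized features. When the accuracy is computed on unstandardized features, it is only 85.41% (training) and 81.45% (validation), because the model was trained on standardized inputs and now receives inputs on a different scale. After applying the same training-set mean and standard deviation inside `compute_accuracy`, the accuracy rises to 98.09% (training) and 98.18% (validation). Because the training and validation accuracies are so close, there is no sign of overfitting, which suggests that the model generalizes well.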