# **Going Deeper -- the Mechanics of PyTorch (Part 2/3)**

## **Project one - predicting the fuel efficiency of a car**

- A real-world project: predicting the fuel efficiency of a car in miles per gallon (MPG).
- It covers:
  - data preprocessing
  - feature engineering
  - training
  - prediction (inference), and
  - evaluation.

**Working with feature columns**

- `Numeric` data in PyTorch specifically refers to continuous data of floating-point type.
- `Feature sets` usually comprise a mixture of different feature types.


![Auto MPG Data Structure](./figures/Auto-MPG.png)

- The features shown in the figure `(model year, cylinders, displacement, horsepower, weight, acceleration, and origin)` were obtained from the `Auto MPG` dataset, which is a common machine learning benchmark dataset for predicting the fuel efficiency of a car in MPG. The full dataset and its description are available from UCI’s machine learning repository at https://archive.ics.uci.edu/ml/datasets/auto+mpg.

- We are going to treat five features from the `Auto MPG` dataset (number of cylinders, displacement, horsepower, weight, and acceleration) as `“numeric”` (here, continuous) features. The model year can be regarded as an `ordered categorical (ordinal)` feature. Lastly, the manufacturing origin can be regarded as an `unordered categorical (nominal)` feature with three possible discrete values, `1, 2, and 3`, which correspond to the `US, Europe, and Japan`, respectively.


- **Data Preprocessing**
  - dropping the incomplete rows,
  - partitioning the dataset into training and test datasets, and
  - standardizing the continuous features.
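
The three preprocessing steps can be sketched on a tiny made-up table before applying them to the real dataset. This is a minimal sketch under stated assumptions: the `toy` DataFrame and its values are invented for illustration, and the split uses `DataFrame.sample` instead of scikit-learn's `train_test_split` only to keep the example dependency-light. The key point is that the mean and standard deviation come from the training rows alone and are then reused for the test rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from the Auto MPG file); one row has a missing value
toy = pd.DataFrame({
    "horsepower": [130.0, 165.0, np.nan, 150.0, 140.0, 86.0, 52.0, 84.0, 79.0, 82.0, 95.0],
    "mpg": [18.0, 15.0, 25.0, 16.0, 17.0, 27.0, 44.0, 32.0, 28.0, 31.0, 24.0],
})

# 1) Drop incomplete rows and rebuild a clean 0..n-1 index
toy = toy.dropna().reset_index(drop=True)

# 2) Partition into 80% training rows and 20% test rows
toy_train = toy.sample(frac=0.8, random_state=1)
toy_test = toy.drop(toy_train.index)

# 3) Standardize with statistics computed on the training rows only
mean = toy_train["horsepower"].mean()
std = toy_train["horsepower"].std()

toy_train_norm, toy_test_norm = toy_train.copy(), toy_test.copy()
toy_train_norm["horsepower"] = (toy_train["horsepower"] - mean) / std
toy_test_norm["horsepower"] = (toy_test["horsepower"] - mean) / std

print(len(toy), len(toy_train), len(toy_test))
print(toy_train_norm["horsepower"].mean(), toy_train_norm["horsepower"].std())
```

After standardization, the training column has a mean of approximately zero and a standard deviation of one; the test column generally does not, because it is scaled with the training statistics.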

In [1]:
import numpy as np
import torch
import torch.nn as nn
import pandas as pd

from IPython.display import Image

In [2]:
data_path = "../data/auto-mpg.csv"
column_names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                'acceleration', 'model year', 'origin']


df = pd.read_csv(data_path, usecols=column_names, 
                 na_values= "?", comment='\t')
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin
0,18.0,8,307.0,130.0,3504,12.0,70,1
1,15.0,8,350.0,165.0,3693,11.5,70,1
2,18.0,8,318.0,150.0,3436,11.0,70,1
3,16.0,8,304.0,150.0,3433,12.0,70,1
4,17.0,8,302.0,140.0,3449,10.5,70,1


In [3]:
df.tail()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin
393,27.0,4,140.0,86.0,2790,15.6,82,1
394,44.0,4,97.0,52.0,2130,24.6,82,2
395,32.0,4,135.0,84.0,2295,11.6,82,1
396,28.0,4,120.0,79.0,2625,18.6,82,1
397,31.0,4,119.0,82.0,2720,19.4,82,1


In [4]:
df.shape

(398, 8)

- Drop the rows that contain missing (`NA`) values:

In [5]:
print(df.isna().sum())

mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model year      0
origin          0
dtype: int64


In [6]:
df = df.dropna()
df = df.reset_index(drop=True)
df.tail()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin
387,27.0,4,140.0,86.0,2790,15.6,82,1
388,44.0,4,97.0,52.0,2130,24.6,82,2
389,32.0,4,135.0,84.0,2295,11.6,82,1
390,28.0,4,120.0,79.0,2625,18.6,82,1
391,31.0,4,119.0,82.0,2720,19.4,82,1


- Split into training and test datasets

In [7]:
import sklearn
import sklearn.model_selection

df_train, df_test = sklearn.model_selection.train_test_split(df, train_size=0.8, random_state=1)
train_stats = df_train.describe().transpose()
train_stats

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
mpg,313.0,23.404153,7.666909,9.0,17.5,23.0,29.0,46.6
cylinders,313.0,5.402556,1.701506,3.0,4.0,4.0,8.0,8.0
displacement,313.0,189.51278,102.675646,68.0,104.0,140.0,260.0,455.0
horsepower,313.0,102.929712,37.919046,46.0,75.0,92.0,120.0,230.0
weight,313.0,2961.198083,848.602146,1613.0,2219.0,2755.0,3574.0,5140.0
acceleration,313.0,15.704473,2.725399,8.5,14.0,15.5,17.3,24.8
model year,313.0,75.929712,3.675305,70.0,73.0,76.0,79.0,82.0
origin,313.0,1.591054,0.807923,1.0,1.0,1.0,2.0,3.0


In [8]:
df_train.describe()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin
count,313.0,313.0,313.0,313.0,313.0,313.0,313.0,313.0
mean,23.404153,5.402556,189.51278,102.929712,2961.198083,15.704473,75.929712,1.591054
std,7.666909,1.701506,102.675646,37.919046,848.602146,2.725399,3.675305,0.807923
min,9.0,3.0,68.0,46.0,1613.0,8.5,70.0,1.0
25%,17.5,4.0,104.0,75.0,2219.0,14.0,73.0,1.0
50%,23.0,4.0,140.0,92.0,2755.0,15.5,76.0,1.0
75%,29.0,8.0,260.0,120.0,3574.0,17.3,79.0,2.0
max,46.6,8.0,455.0,230.0,5140.0,24.8,82.0,3.0


In [16]:
from packaging import version


numeric_column_names = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration']

df_train_norm, df_test_norm = df_train.copy(), df_test.copy()


if version.parse(pd.__version__) >= version.parse("2.0.0"):

    for col_name in numeric_column_names:
        mean = train_stats.loc[col_name, 'mean']
        std = train_stats.loc[col_name, 'std']
        df_train_norm[col_name] = (df_train_norm[col_name] - mean) / std
        df_test_norm[col_name] = (df_test_norm[col_name] - mean) / std

else:

    for col_name in numeric_column_names:
        mean = train_stats.loc[col_name, 'mean']
        std  = train_stats.loc[col_name, 'std']
        df_train_norm.loc[:, col_name] = (df_train_norm.loc[:, col_name] - mean) / std
        df_test_norm.loc[:, col_name] = (df_test_norm.loc[:, col_name] - mean) / std
        
df_train_norm.tail()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin
203,28.0,-0.824303,-0.90102,-0.736562,-0.950031,0.255202,76,3
255,19.4,0.351127,0.4138,-0.340982,0.29319,0.548737,78,1
72,13.0,1.526556,1.144256,0.713897,1.339617,-0.625403,72,1
235,30.5,-0.824303,-0.89128,-1.053025,-1.072585,0.475353,77,1
37,14.0,1.526556,1.563051,1.636916,1.47042,-1.35924,71,1


In [18]:
df_train_norm['displacement'].mean(), df_train_norm['displacement'].std()

(np.float64(1.2485575229011345e-16), np.float64(1.0))

In [20]:
df_train_norm.info()

<class 'pandas.core.frame.DataFrame'>
Index: 313 entries, 334 to 37
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           313 non-null    float64
 1   cylinders     313 non-null    float64
 2   displacement  313 non-null    float64
 3   horsepower    313 non-null    float64
 4   weight        313 non-null    float64
 5   acceleration  313 non-null    float64
 6   model year    313 non-null    int64  
 7   origin        313 non-null    int64  
dtypes: float64(6), int64(2)
memory usage: 22.0 KB


In [22]:
df_train_norm['origin'].value_counts()

origin
1    192
3     64
2     57
Name: count, dtype: int64

In [23]:
df_train_norm['model year'].value_counts()

model year
73    32
78    31
76    27
71    26
81    25
74    23
75    23
82    22
79    21
72    21
77    21
70    21
80    20
Name: count, dtype: int64

- Let's group the rather fine-grained model year `(model year)` information into buckets to simplify the learning task for the model that we are going to train later.
- We are going to assign each car to one of four `year` buckets, as follows:

$$ 
bucket = \begin{cases}
0 & \text{if } year \lt 73 \\
1 & \text{if } 73 \le year \lt 76 \\
2 & \text{if } 76 \le year \lt 79 \\
3 & \text{if } year \ge 79
\end{cases}
$$

- The chosen intervals were selected arbitrarily to illustrate the concept of "bucketing".

- To group the cars into these buckets, we first define three cut-off values, `[73, 76, 79]`, for the model year feature.

- These cut-off values specify half-closed intervals:
  - (-$\infty$, 73),
  - [73, 76),
  - [76, 79), and
  - [79, $\infty$)

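- The original model year values are passed to the `torch.bucketize` function, which returns the index of the bucket that each value falls into. With `right=True`, a value equal to a boundary is placed in the bucket that starts at that boundary.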
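
To make the boundary handling concrete, here is a small sketch that runs `torch.bucketize` on a handful of hand-picked years. The `years` tensor is made up for illustration; the boundaries are the ones used in this section. It compares `right=True` with the default `right=False` to show which side of each interval is closed.

```python
import torch

boundaries = torch.tensor([73, 76, 79])

# Hypothetical sample of model years, chosen to hit every boundary
years = torch.tensor([70, 72, 73, 75, 76, 78, 79, 82])

# right=True: boundaries[i-1] <= year < boundaries[i]
buckets_right = torch.bucketize(years, boundaries, right=True)

# right=False (default): boundaries[i-1] < year <= boundaries[i]
buckets_left = torch.bucketize(years, boundaries, right=False)

print(buckets_right.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
print(buckets_left.tolist())   # [0, 0, 0, 1, 1, 2, 2, 3]
```

Only `right=True` reproduces the intervals that are closed on the left, for example placing the year 73 in bucket 1 rather than bucket 0.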

In [25]:
boundaries = torch.tensor([73, 76, 79])
 
v = torch.tensor(df_train_norm['model year'].values)
df_train_norm['Model Year Bucketed'] = torch.bucketize(v, boundaries, right=True)

v = torch.tensor(df_test_norm['model year'].values)
df_test_norm['Model Year Bucketed'] = torch.bucketize(v, boundaries, right=True)

numeric_column_names.append('Model Year Bucketed')

In [26]:
numeric_column_names

['cylinders',
 'displacement',
 'horsepower',
 'weight',
 'acceleration',
 'Model Year Bucketed']

- We added this `bucketized` feature column to the Python list `numeric_column_names`.

- Next, we define a list for the unordered categorical feature, `origin`.

- In PyTorch, there are two ways of working with a categorical feature:
  - using an embedding layer via `nn.Embedding`, or
  - using `one-hot-encoded` vectors (also called indicator vectors).

- With one-hot encoding,
  - `index 0` will be encoded as `[1, 0, 0]`
  - `index 1` will be encoded as `[0, 1, 0]`
  - and so on.

- On the other hand, the embedding layer maps each index to a vector of random floating-point numbers, which can be trained. (You can think of the embedding layer as a more efficient implementation of a one-hot encoding multiplied by a trainable weight matrix.)

- When the number of categories is large, using the embedding layer with fewer dimensions than the number of categories can improve the performance.
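
The claim that an embedding lookup behaves like a one-hot vector multiplied by a trainable weight matrix can be checked directly. The following is a minimal sketch; the sizes (`3` categories, `2` embedding dimensions) and the index tensor are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
from torch.nn.functional import one_hot

torch.manual_seed(0)

num_categories, embed_dim = 3, 2  # illustrative sizes
embedding = nn.Embedding(num_categories, embed_dim)

idx = torch.tensor([0, 2, 1, 2])  # hypothetical category indices

# Direct lookup: selects rows of the weight matrix
looked_up = embedding(idx)

# Same result via one-hot vectors times the weight matrix
one_hot_vecs = one_hot(idx, num_classes=num_categories).float()
via_matmul = one_hot_vecs @ embedding.weight

print(looked_up.shape)
print(torch.allclose(looked_up, via_matmul))
```

The lookup avoids building the one-hot matrix and the matrix product, which is why it is the more efficient of the two when there are many categories.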

In [28]:
df['origin'].value_counts()

origin
1    245
3     79
2     68
Name: count, dtype: int64

- We will use the one-hot encoding approach on the categorical feature in order to convert it into the dense format:

In [29]:
from torch.nn.functional import one_hot

total_origin = len(set(df_train_norm['origin']))

origin_encoded = one_hot(torch.from_numpy(df_train_norm['origin'].values) % total_origin)
x_train_numeric = torch.tensor(df_train_norm[numeric_column_names].values)
x_train = torch.cat([x_train_numeric, origin_encoded], 1).float()
 
origin_encoded = one_hot(torch.from_numpy(df_test_norm['origin'].values) % total_origin)
x_test_numeric = torch.tensor(df_test_norm[numeric_column_names].values)
x_test = torch.cat([x_test_numeric, origin_encoded], 1).float()

In [35]:
x_train.shape

torch.Size([313, 9])

In [36]:
x_test.shape

torch.Size([79, 9])

In [40]:
x_train.ndim, x_test.ndim, x_train_numeric.ndim, origin_encoded.ndim

(2, 2, 2, 2)
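
The one-hot step above applies `% total_origin` because `origin` takes the values `1, 2, 3`, while `one_hot` expects class indices starting at `0`. The sketch below, on a made-up `origin` tensor, shows the mapping this produces and compares it with the alternative of subtracting `1`. It also passes `num_classes` explicitly, which is an addition not used in the cell above: without it, `one_hot` infers the number of classes from the largest index present, so a split that happens to lack the largest index would yield fewer columns.

```python
import torch
from torch.nn.functional import one_hot

origin = torch.tensor([1, 2, 3, 1, 3])  # hypothetical origin codes
num_origins = 3

# Modulo mapping used in this section: 1 -> 1, 2 -> 2, 3 -> 0
idx_mod = origin % num_origins
encoded_mod = one_hot(idx_mod, num_classes=num_origins)

# Alternative mapping that keeps the original order: 1 -> 0, 2 -> 1, 3 -> 2
idx_shift = origin - 1
encoded_shift = one_hot(idx_shift, num_classes=num_origins)

# Without num_classes, the width is inferred as (largest index + 1)
inferred = one_hot(torch.tensor([0, 1]))

print(idx_mod.tolist())
print(encoded_mod.tolist())
print(tuple(inferred.shape))
```

Both mappings give each origin its own column; they differ only in which column each origin occupies, which does not matter for a nominal feature as long as the same mapping is used for training and test data.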

- Create the label tensors from the ground truth `MPG` values as follows:

In [41]:
y_train = torch.tensor(df_train_norm['mpg'].values).float()
y_test = torch.tensor(df_test_norm['mpg'].values).float()

In [45]:
y_train.shape

torch.Size([313])

We covered the most common approaches for preprocessing and creating features in PyTorch.

---

### **Training a DNN regression model**

- Create a data loader that uses a batch size of `8` for the training data:

In [46]:
from torch.utils.data import TensorDataset, DataLoader

train_ds = TensorDataset(x_train, y_train)
batch_size = 8
torch.manual_seed(1)
train_dl = DataLoader(train_ds, batch_size, shuffle=True)
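
To see what such a loader yields, the sketch below builds one over random stand-in tensors with the same shapes as the training data in this section (`313` rows, `9` features). The tensors `features` and `targets` are placeholders, not the real data.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(1)

# Random stand-ins with the same shapes as x_train and y_train
features = torch.randn(313, 9)
targets = torch.randn(313)

ds = TensorDataset(features, targets)
dl = DataLoader(ds, batch_size=8, shuffle=True)

# One batch: a tuple of (features, targets)
x_batch, y_batch = next(iter(dl))
print(x_batch.shape, y_batch.shape)

# 313 = 39 * 8 + 1, so there are 40 batches and the last one holds a single row
batch_sizes = [len(xb) for xb, _ in dl]
print(len(dl), batch_sizes[-1])
```

If a trailing batch of one sample is undesirable, `DataLoader` accepts `drop_last=True` to discard it.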

- Next, we will build a model with two fully connected layers where one has `8` hidden units and another has `4`:

In [None]:
hidden_units = [8, 4]
input_size = x_train.shape[1]

# Stack one Linear + ReLU pair per hidden layer
all_layers = []
for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit

# Single output unit for the regression target
all_layers.append(nn.Linear(hidden_units[-1], 1))

model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=9, out_features=8, bias=True)
  (1): ReLU()
  (2): Linear(in_features=8, out_features=4, bias=True)
  (3): ReLU()
  (4): Linear(in_features=4, out_features=1, bias=True)
)

In [54]:
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_size, hidden_units):
        super().__init__()
        
        layers = []
        in_features = input_size
        
        for h in hidden_units:
            layers.append(nn.Linear(in_features, h))
            layers.append(nn.ReLU())
            in_features = h
        
        layers.append(nn.Linear(in_features, 1))  # final layer
        
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = MLP(input_size=x_train.shape[1], hidden_units=[8, 4])
model


MLP(
  (net): Sequential(
    (0): Linear(in_features=9, out_features=8, bias=True)
    (1): ReLU()
    (2): Linear(in_features=8, out_features=4, bias=True)
    (3): ReLU()
    (4): Linear(in_features=4, out_features=1, bias=True)
  )
)
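
As a quick sanity check on the printed architecture, the following sketch rebuilds the same stack of layers by hand and counts its trainable parameters. The variable names `net`, `n_params`, and `out` are illustrative only.

```python
import torch
import torch.nn as nn

# Same layer sizes as in the printed model: 9 -> 8 -> 4 -> 1
net = nn.Sequential(
    nn.Linear(9, 8), nn.ReLU(),
    nn.Linear(8, 4), nn.ReLU(),
    nn.Linear(4, 1),
)

# Weights plus biases: (9*8 + 8) + (8*4 + 4) + (4*1 + 1) = 80 + 36 + 5
n_params = sum(p.numel() for p in net.parameters())
print(n_params)  # 121

# A batch of 5 rows with 9 features gives one prediction per row
out = net(torch.zeros(5, 9))
print(out.shape)  # torch.Size([5, 1])
```

The output has a trailing dimension of size one, which is why the training code later selects column `0` before computing the loss.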

- Define the `MSE` loss function for regression and use `SGD` for optimization:

In [55]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

- Train the model for `200` epochs and display the train loss every `15` epochs:

In [58]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 15

for epoch in range(num_epochs):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = loss_fn(pred, y_batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    if epoch % log_epochs==0:
        print(f'Epoch {epoch}  Loss {loss_hist_train/len(train_dl):.4f}')

Epoch 0  Loss 5.7565
Epoch 15  Loss 5.4374
Epoch 30  Loss 5.6641
Epoch 45  Loss 5.7004
Epoch 60  Loss 5.4297
Epoch 75  Loss 5.3291
Epoch 90  Loss 5.3576
Epoch 105  Loss 5.7076
Epoch 120  Loss 5.3187
Epoch 135  Loss 5.3564
Epoch 150  Loss 6.2201
Epoch 165  Loss 5.9075
Epoch 180  Loss 5.3062
Epoch 195  Loss 5.8284
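
The training loop above indexes the model output with `[:, 0]` so that predictions of shape `[batch, 1]` match targets of shape `[batch]`. The sketch below illustrates why this matters, using made-up numbers. It assumes the behavior of current PyTorch releases, where `nn.MSELoss` issues a warning and broadcasts mismatched shapes rather than raising an error; the warning is silenced here only to keep the output clean.

```python
import warnings
import torch
import torch.nn as nn

loss_fn = nn.MSELoss()

# Hypothetical perfect predictions, shaped like a model output: [batch, 1]
pred_2d = torch.tensor([[1.0], [2.0], [3.0]])
target = torch.tensor([1.0, 2.0, 3.0])  # shape [batch]

# Matching shapes: the loss is zero, as expected
loss_ok = loss_fn(pred_2d[:, 0], target)

# Mismatched shapes: [3, 1] vs [3] broadcasts to a 3 x 3 grid of differences
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    loss_broadcast = loss_fn(pred_2d, target)

print(loss_ok.item())
print(loss_broadcast.item())
```

With broadcasting, every prediction is compared against every target, so the loss is nonzero even though the predictions are exact.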


- Let's evaluate the regression performance of the trained model on the test dataset.
- To predict the target values on new data points, we can feed their features to the model:

In [59]:
with torch.no_grad():
    pred = model(x_test.float())[:, 0]
    loss = loss_fn(pred, y_test)
    print(f'Test MSE: {loss.item():.4f}')
    print(f'Test MAE: {nn.L1Loss()(pred, y_test).item():.4f}')

Test MSE: 8.8530
Test MAE: 1.9443
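
For reference, the two metrics can be computed by hand on a few made-up values and compared with `nn.MSELoss` and `nn.L1Loss`. The tensors below are illustrative and unrelated to the actual test set.

```python
import torch
import torch.nn as nn

pred = torch.tensor([20.0, 25.0, 30.0])    # hypothetical predictions (MPG)
target = torch.tensor([22.0, 24.0, 33.0])  # hypothetical ground truth (MPG)

errors = pred - target                 # [-2, 1, -3]
mse_manual = (errors ** 2).mean()      # (4 + 1 + 9) / 3
mae_manual = errors.abs().mean()       # (2 + 1 + 3) / 3

mse = nn.MSELoss()(pred, target)
mae = nn.L1Loss()(pred, target)

print(mse.item(), mae.item())
```

MSE is in squared units, so its square root is easier to interpret: the test MSE of `8.8530` reported above corresponds to a root-mean-square error of roughly `2.98` MPG, alongside the mean absolute error of about `1.94` MPG.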


---

## **Project two - classifying MNIST hand-written digits**

- Categorize `MNIST` handwritten digits.

**1.** The setup step includes loading the dataset and specifying hyperparameters (the size of the train set and test set, and the size of mini-batches):

In [60]:
import torchvision
from torchvision import transforms

image_path = "../data"
transform = transforms.Compose([
    transforms.ToTensor()
])

mnist_train_dataset = torchvision.datasets.MNIST(
    root=image_path, train=True,
    transform=transform, download=False
)

mnist_test_dataset = torchvision.datasets.MNIST(
    root=image_path, train=False,
    transform=transform, download=False
)

In [61]:
batch_size = 64
torch.manual_seed(1)
train_dl = DataLoader(mnist_train_dataset, batch_size, shuffle=True)

In [67]:
mnist_train_dataset.data.shape, mnist_test_dataset.classes

(torch.Size([60000, 28, 28]),
 ['0 - zero',
  '1 - one',
  '2 - two',
  '3 - three',
  '4 - four',
  '5 - five',
  '6 - six',
  '7 - seven',
  '8 - eight',
  '9 - nine'])

- In the preceding code, we constructed a data loader with batches of `64` samples.

**2.** We preprocess the input features and the labels. 

- The features are the pixels of the images we read from **Step 1**.
- The `ToTensor()` transform converts the pixel features into a floating-point tensor and also scales the pixel values from the `[0, 255]` range to the `[0, 1]` range.
- The labels are integers from `0` to `9` representing ten digits.
- We don't need to do any scaling or further conversion.

**3.** Construct the `NN` model:

In [68]:
mnist_train_dataset[0][0].shape

torch.Size([1, 28, 28])

In [69]:
import torch
import torch.nn as nn

class MLPClassifier(nn.Module):
    def __init__(self, image_shape, hidden_units):
        super().__init__()
        
        c, h, w = image_shape
        input_size = c * h * w
        
        layers = [nn.Flatten()]
        
        in_features = input_size
        for units in hidden_units:
            layers.append(nn.Linear(in_features, units))
            layers.append(nn.ReLU())
            in_features = units
        
        layers.append(nn.Linear(in_features, 10))  # output logits
        
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

image_shape = mnist_train_dataset[0][0].shape
model = MLPClassifier(image_shape, hidden_units=[32, 16])
model

MLPClassifier(
  (net): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=784, out_features=32, bias=True)
    (2): ReLU()
    (3): Linear(in_features=32, out_features=16, bias=True)
    (4): ReLU()
    (5): Linear(in_features=16, out_features=10, bias=True)
  )
)

- The model starts with a flatten layer that flattens each input image into a one-dimensional tensor of `784` values.
- The input images have the shape `[1, 28, 28]`.
- The model has two hidden layers, with `32` and `16` units respectively.
- It ends with an output layer of ten units, one per class. This layer outputs raw logits; the model itself contains no softmax layer because `nn.CrossEntropyLoss` applies log-softmax internally.
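
Two details of this architecture are worth checking in isolation: what `nn.Flatten` does to a batch of images, and how the ten output values relate to softmax. The sketch below uses random stand-in tensors. It shows that `nn.CrossEntropyLoss` on raw logits equals the negative log-likelihood of the log-softmax, so softmax is only needed when class probabilities are wanted explicitly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Flatten keeps the batch dimension and merges the rest: [4, 1, 28, 28] -> [4, 784]
batch = torch.rand(4, 1, 28, 28)  # random stand-in for four MNIST images
flat = nn.Flatten()(batch)
print(flat.shape)

# Random stand-in for the ten raw outputs (logits) per image
logits = torch.randn(4, 10)
targets = torch.tensor([3, 0, 9, 1])  # hypothetical labels

# CrossEntropyLoss applies log-softmax internally
ce = nn.CrossEntropyLoss()(logits, targets)
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(ce, manual))

# Softmax turns logits into probabilities without changing the predicted class
probs = F.softmax(logits, dim=1)
print(probs.sum(dim=1))
print(torch.equal(probs.argmax(dim=1), logits.argmax(dim=1)))
```

Because softmax preserves the ordering of the logits, `torch.argmax` can be applied to the raw outputs directly when only the predicted class is needed.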

**4.** Use the model for training, evaluation, and prediction:

In [72]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

In [73]:
torch.manual_seed(1)
num_epochs = 20

for epoch in range(num_epochs):
    accuracy_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)
        loss = loss_fn(pred, y_batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        is_correct = (torch.argmax(pred, dim=1) == y_batch).float()
        accuracy_hist_train += is_correct.sum()
        
    accuracy_hist_train /= len(train_dl.dataset)
    print(f'Epoch {epoch}  Accuracy {accuracy_hist_train:.4f}')

Epoch 0  Accuracy 0.8531
Epoch 1  Accuracy 0.9287
Epoch 2  Accuracy 0.9413
Epoch 3  Accuracy 0.9506
Epoch 4  Accuracy 0.9558
Epoch 5  Accuracy 0.9592
Epoch 6  Accuracy 0.9627
Epoch 7  Accuracy 0.9650
Epoch 8  Accuracy 0.9674
Epoch 9  Accuracy 0.9690
Epoch 10  Accuracy 0.9710
Epoch 11  Accuracy 0.9729
Epoch 12  Accuracy 0.9739
Epoch 13  Accuracy 0.9750
Epoch 14  Accuracy 0.9765
Epoch 15  Accuracy 0.9778
Epoch 16  Accuracy 0.9778
Epoch 17  Accuracy 0.9798
Epoch 18  Accuracy 0.9806
Epoch 19  Accuracy 0.9811


- We used the `cross-entropy` loss function for `multiclass` classification and the `Adam optimizer` for gradient descent. 
- The model was trained for `20` epochs, and the training accuracy was displayed after every epoch.
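
The accuracy bookkeeping in the training loop can be traced on a toy batch. The `logits` and `labels` below are made up for illustration, with three classes instead of ten to keep the numbers readable.

```python
import torch

# Hypothetical model outputs for 4 samples and 3 classes
logits = torch.tensor([[2.0, 0.1, -1.0],
                       [0.2, 0.3, 3.0],
                       [1.5, 1.0, 0.0],
                       [0.0, 2.5, 0.5]])
labels = torch.tensor([0, 2, 1, 1])

# The predicted class is the index of the largest logit in each row
predicted = torch.argmax(logits, dim=1)

# 1.0 where the prediction matches the label, 0.0 otherwise
is_correct = (predicted == labels).float()

# Fraction of correct predictions
accuracy = is_correct.sum() / len(labels)

print(predicted.tolist())   # [0, 2, 0, 1]
print(is_correct.tolist())  # [1.0, 1.0, 0.0, 1.0]
print(accuracy.item())      # 0.75
```

In the training loop, the per-batch sums of `is_correct` are accumulated over the epoch and divided by the dataset size, which gives the same fraction over all training samples.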

- Evaluate on the testing set:

In [74]:
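# The raw .data tensor bypasses the ToTensor() transform, so scale to [0, 1] by hand
with torch.no_grad():  # no gradients are needed for evaluation
    pred = model(mnist_test_dataset.data / 255.)
is_correct = (torch.argmax(pred, dim=1) == mnist_test_dataset.targets).float()
print(f'Test accuracy: {is_correct.mean():.4f}')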

Test accuracy: 0.9649


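- The model reaches roughly `96` percent accuracy on the test set, compared with about `98` percent on the training set in the final epoch.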