# Training a Classifier on the *Salammbô* Dataset with PyTorch
Here we use `CrossEntropyLoss` for binary classification: we treat the binary case as a multiclass problem with two classes.

Author: Pierre Nugues

We first import the modules we need.

In [1]:
import torch
import torch.nn as nn
import numpy as np

## Reading the Dataset
We can read the data from a file in the svmlight format or create the NumPy arrays directly.

In [2]:
X = np.array(
    [[35680, 2217], [42514, 2761], [15162, 990], [35298, 2274],
     [29800, 1865], [40255, 2606], [74532, 4805], [37464, 2396],
     [31030, 1993], [24843, 1627], [36172, 2375], [39552, 2560],
     [72545, 4597], [75352, 4871], [18031, 1119], [36961, 2503],
     [43621, 2992], [15694, 1042], [36231, 2487], [29945, 2014],
     [40588, 2805], [75255, 5062], [37709, 2643], [30899, 2126],
     [25486, 1784], [37497, 2641], [40398, 2766], [74105, 5047],
     [76725, 5312], [18317, 1215]
     ])

y = np.array(
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
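As a minimal sketch of the svmlight alternative, we can write a few rows of the dataset to a temporary file and read them back with scikit-learn's `load_svmlight_file`. The file name `salammbo_sample.libsvm` and the three-row excerpt are assumptions made for the demonstration; the original data file is not shown here.

```python
import os
import tempfile

from sklearn.datasets import load_svmlight_file

# Three rows of the dataset above in svmlight format:
# <label> <feature index>:<value> ...  (feature indices start at 1 here)
svmlight_text = (
    "0 1:35680 2:2217\n"
    "0 1:42514 2:2761\n"
    "1 1:36961 2:2503\n"
)

# Hypothetical file name, created in a temporary directory for the demo
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "salammbo_sample.libsvm")
    with open(path, "w") as f:
        f.write(svmlight_text)
    X_sparse, y_loaded = load_svmlight_file(path)

# load_svmlight_file returns a SciPy sparse matrix and a label array
X_loaded = X_sparse.toarray()
print(X_loaded)
print(y_loaded)
```

The labels come back as floats, so they would need a cast to integers before being used as class indices.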

## Scaling the Data
Scaling and normalizing the inputs usually matter a great deal with neural networks. We use scikit-learn transformers, which expose two main methods: `fit()` and `transform()`.

### Normalizing

In [3]:
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
X_norm = normalizer.fit_transform(X)
X_norm[:4]

array([[0.99807515, 0.06201605],
       [0.99789783, 0.06480679],
       [0.99787509, 0.06515607],
       [0.99793128, 0.06428964]])
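To make the normalization explicit, here is a small sketch that reproduces `Normalizer` by hand on the first two rows of `X`. With its default `norm='l2'`, the transformer divides each row by its own Euclidean norm; the names `X_sample`, `X_manual`, and `X_sklearn` are only used for this illustration.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X_sample = np.array([[35680, 2217], [42514, 2761]], dtype=float)

# Divide each row by its own Euclidean (L2) norm
row_norms = np.linalg.norm(X_sample, axis=1, keepdims=True)
X_manual = X_sample / row_norms

# Same result as the scikit-learn transformer with its default norm='l2'
X_sklearn = Normalizer().fit_transform(X_sample)
print(np.allclose(X_manual, X_sklearn))
```

Because the normalization works row by row, `fit()` has nothing to learn from the data here.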

### Standardizing

In [4]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=True, with_std=True)
X_scaled = scaler.fit_transform(X_norm)
X_scaled[:4]

array([[ 1.68336574, -1.7197772 ],
       [ 0.57376529, -0.56145427],
       [ 0.43143908, -0.41648279],
       [ 0.78308579, -0.77610221]])
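The two steps hidden in `fit_transform()` can be sketched separately: `fit()` estimates the mean and standard deviation of each column, and `transform()` applies them. The example below uses only the four rows of `X_norm` printed above as a stand-in, so its values differ from those of the full dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in matrix: the four rows of X_norm printed above
X_demo = np.array([[0.99807515, 0.06201605],
                   [0.99789783, 0.06480679],
                   [0.99787509, 0.06515607],
                   [0.99793128, 0.06428964]])

scaler_demo = StandardScaler()
scaler_demo.fit(X_demo)                  # learns mean_ and scale_ for each column
X_std = scaler_demo.transform(X_demo)    # applies (x - mean) / std

# Manual equivalent; StandardScaler uses the population std (ddof=0)
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
print(np.allclose(X_std, X_manual))
```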

## Creating PyTorch Tensors
PyTorch has its own implementation of multidimensional arrays called tensors. They are more or less equivalent to NumPy arrays. With `CrossEntropyLoss`, $\mathbf{y}$ is a vector of class indices.

In [5]:
Y = torch.LongTensor(y)
Y

tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1])

In [6]:
X_scaled = torch.Tensor(X_scaled)
X_scaled

tensor([[ 1.6834, -1.7198],
        [ 0.5738, -0.5615],
        [ 0.4314, -0.4165],
        [ 0.7831, -0.7761],
        [ 1.5095, -1.5348],
        [ 0.6568, -0.6464],
        [ 0.7646, -0.7571],
        [ 0.9700, -0.9692],
        [ 0.8610, -0.8564],
        [ 0.3516, -0.3355],
        [ 0.2834, -0.2665],
        [ 0.6618, -0.6515],
        [ 1.2025, -1.2115],
        [ 0.6947, -0.6852],
        [ 1.7127, -1.7511],
        [-0.5712,  0.5835],
        [-0.9400,  0.9424],
        [-0.0189,  0.0371],
        [-0.9622,  0.9639],
        [-0.3768,  0.3925],
        [-1.1617,  1.1560],
        [-0.3802,  0.3957],
        [-1.5856,  1.5599],
        [-1.0314,  1.0306],
        [-1.5463,  1.5228],
        [-1.7352,  1.7012],
        [-0.8880,  0.8921],
        [-0.7341,  0.7426],
        [-1.2155,  1.2076],
        [ 0.0071,  0.0112]])
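An alternative way to build the tensors, sketched below, is `torch.tensor()` with an explicit `dtype`. This makes the two requirements visible: float inputs for the linear layers and 64-bit integer class indices for `CrossEntropyLoss`. The two-row arrays `X_np` and `y_np` are stand-ins for the data above.

```python
import numpy as np
import torch

X_np = np.array([[1.6834, -1.7198], [-0.5712, 0.5835]])  # float64 by default
y_np = np.array([0, 1])

# Inputs as float32, the default parameter type of nn.Linear
X_t = torch.tensor(X_np, dtype=torch.float32)
# Targets as class indices in int64 (torch.long)
y_t = torch.tensor(y_np, dtype=torch.long)

print(X_t.dtype, y_t.dtype)
```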

## Creating a Model

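We set a seed to have reproducible results. PyTorch initializes the weights with its own random number generator, so we seed it as well.

In [7]:
np.random.seed(1337)
torch.manual_seed(1337)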

We define a classifier equivalent to a logistic regression with two classes

In [8]:
class Model(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 2)
        
    def forward(self, x):
        return self.fc1(x)

And a model with one hidden layer

In [9]:
class Model2(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 10)
        self.fc2 = nn.Linear(10, 2)
        
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

We create the model. To try the network with one hidden layer, set `complex` to `True`.

In [10]:
complex = True

In [11]:
input_dim = X_scaled.shape[1]
if not complex:
    model = Model(input_dim)
else:
    model = Model2(input_dim)
loss_fn = nn.CrossEntropyLoss()    # softmax cross entropy over the two class logits
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
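Neither model ends with a softmax layer because `CrossEntropyLoss` expects raw logits and applies the log-softmax itself. As a small sketch with made-up logits and targets, we can check that the loss is the mean negative log-probability of the correct class:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up logits for three items and two classes, with their true classes
logits = torch.tensor([[2.0, -1.0], [0.5, 1.5], [-0.3, 0.3]])
targets = torch.tensor([0, 1, 1])

loss_builtin = nn.CrossEntropyLoss()(logits, targets)

# Manual version: log-softmax, pick the log-probability of the true class, average
log_probs = F.log_softmax(logits, dim=1)
picked = log_probs[torch.arange(len(targets)), targets]
loss_manual = -picked.mean()

print(torch.allclose(loss_builtin, loss_manual))
```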

## Fitting the Model
We will show three methods: batch gradient descent, stochastic gradient descent, and minibatch gradient descent.

### Batch Gradient Descent

We fit the whole dataset (batch gradient descent)

In [12]:
model.train()               # sets PyTorch in the train mode
for epoch in range(100):
    Y_pred = model(X_scaled)
    loss = loss_fn(Y_pred, Y)
    optimizer.zero_grad()   # resets the gradients
    loss.backward()         # gradient backpropagation
    optimizer.step()        # weight updates
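To check that such a loop converges, a common habit is to record the loss at each epoch. The sketch below does this on a toy stand-in dataset with a single linear layer; the names `X_toy`, `Y_toy`, `toy_model`, and the learning rate of 0.1 are assumptions for the illustration, not the settings used in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in data: two well-separated classes in two dimensions
X_toy = torch.tensor([[1.5, -1.5], [0.8, -0.7], [1.0, -1.1],
                      [-1.2, 1.3], [-0.6, 0.5], [-1.6, 1.4]])
Y_toy = torch.tensor([0, 0, 0, 1, 1, 1])

toy_model = nn.Linear(2, 2)
toy_loss_fn = nn.CrossEntropyLoss()
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

losses = []
for epoch in range(100):
    loss = toy_loss_fn(toy_model(X_toy), Y_toy)   # whole dataset at once
    toy_optimizer.zero_grad()
    loss.backward()
    toy_optimizer.step()
    losses.append(loss.item())                    # .item() extracts a Python float

print(f"first epoch: {losses[0]:.4f}, last epoch: {losses[-1]:.4f}")
```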

### Stochastic Gradient Descent

Or we fit the model with a batch size of one item (stochastic gradient descent).

In [13]:
model.train()
for epoch in range(50):
    for x_scaled, y in zip(X_scaled, Y):
        y_pred = model(x_scaled)
        loss = loss_fn(y_pred, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

### Minibatch Gradient Descent

Or we fit it with minibatches, first with a simple inner loop.

In [14]:
batch_size = 4
model.train()
for epoch in range(50):
    # Note: X_scaled and Y should be shuffled at each epoch; omitted here
    for i in range(0, X_scaled.size()[0], batch_size):
        Y_batch_pred = model(X_scaled[i:i + batch_size])
        loss = loss_fn(Y_batch_pred, Y[i:i + batch_size])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
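The shuffling that the simple loop leaves out could be sketched with `torch.randperm`, applying the same permutation to the inputs and the labels so that the pairs stay aligned. The tensors `X_demo` and `Y_demo` are stand-ins; in a real loop, the permutation would be redrawn at the start of each epoch.

```python
import torch

torch.manual_seed(0)

X_demo = torch.arange(20, dtype=torch.float32).reshape(10, 2)  # stand-in features
Y_demo = torch.arange(10) % 2                                   # stand-in labels
batch_size = 4

# Draw a random order of the row indices and apply it to both tensors
perm = torch.randperm(X_demo.size(0))
X_shuffled, Y_shuffled = X_demo[perm], Y_demo[perm]

batch_sizes = []
for i in range(0, X_shuffled.size(0), batch_size):
    X_batch = X_shuffled[i:i + batch_size]
    Y_batch = Y_shuffled[i:i + batch_size]
    batch_sizes.append(X_batch.size(0))   # a training step would go here

print(batch_sizes)
```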

Then with a dataloader

In [15]:
from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(X_scaled, Y)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

In [16]:
model.train()
for epoch in range(50):
    for X_scaled_batch, Y_batch in dataloader:
        Y_batch_pred = model(X_scaled_batch)
        loss = loss_fn(Y_batch_pred, Y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
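With 30 items and a batch size of 4, the dataloader yields seven full batches and a last one with two items. The sketch below checks this on zero-filled stand-in tensors of the same shape as the data.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

X_demo = torch.zeros(30, 2)                   # stand-in with the shape of X_scaled
Y_demo = torch.zeros(30, dtype=torch.long)    # stand-in with the shape of Y

loader = DataLoader(TensorDataset(X_demo, Y_demo), batch_size=4, shuffle=True)

# Size of each batch produced in one pass over the data
sizes = [len(Y_batch) for _, Y_batch in loader]
print(len(loader), sizes)
```

If the smaller last batch is unwanted, `DataLoader` accepts `drop_last=True` to discard it.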

## The Weights

In [17]:
list(model.parameters())

[Parameter containing:
 tensor([[-0.3396,  0.4205],
         [ 0.1931, -0.5328],
         [ 0.0524,  0.2045],
         [ 0.5766, -0.4635],
         [-0.3821,  1.0696],
         [-0.2679,  0.1251],
         [-0.6548,  0.7661],
         [ 0.3985, -0.4925],
         [ 1.1248, -0.9981],
         [ 0.4658, -0.1773]], requires_grad=True),
 Parameter containing:
 tensor([-0.2447,  0.2726, -0.2628,  0.4770,  0.6425, -0.4957,  0.7729, -0.3846,
          0.1056,  0.2783], requires_grad=True),
 Parameter containing:
 tensor([[ 0.0772,  0.4339, -0.2504,  0.6221, -0.5545,  0.1489, -1.0388, -0.0928,
           0.8791,  0.3127],
         [ 0.3546, -0.2893, -0.1068, -0.3945,  1.0181,  0.2261,  0.6665, -0.2697,
          -0.8445, -0.4070]], requires_grad=True),
 Parameter containing:
 tensor([-0.0949,  0.0394], requires_grad=True)]

Also in the form of a dictionary

In [18]:
model.state_dict()

OrderedDict([('fc1.weight',
              tensor([[-0.3396,  0.4205],
                      [ 0.1931, -0.5328],
                      [ 0.0524,  0.2045],
                      [ 0.5766, -0.4635],
                      [-0.3821,  1.0696],
                      [-0.2679,  0.1251],
                      [-0.6548,  0.7661],
                      [ 0.3985, -0.4925],
                      [ 1.1248, -0.9981],
                      [ 0.4658, -0.1773]])),
             ('fc1.bias',
              tensor([-0.2447,  0.2726, -0.2628,  0.4770,  0.6425, -0.4957,  0.7729, -0.3846,
                       0.1056,  0.2783])),
             ('fc2.weight',
              tensor([[ 0.0772,  0.4339, -0.2504,  0.6221, -0.5545,  0.1489, -1.0388, -0.0928,
                        0.8791,  0.3127],
                      [ 0.3546, -0.2893, -0.1068, -0.3945,  1.0181,  0.2261,  0.6665, -0.2697,
                       -0.8445, -0.4070]])),
             ('fc2.bias', tensor([-0.0949,  0.0394]))])
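The state dictionary is also the usual way to save a model and restore it later. Here is a minimal sketch that uses an in-memory buffer as a stand-in for a file on disk, and a small `nn.Linear` layer instead of the models above:

```python
import io

import torch
import torch.nn as nn

source = nn.Linear(2, 2)

# Save the parameters; a file path could be used instead of the buffer
buffer = io.BytesIO()
torch.save(source.state_dict(), buffer)

# Restore them into a new module, which must have the same architecture
buffer.seek(0)
restored = nn.Linear(2, 2)
restored.load_state_dict(torch.load(buffer))

print(torch.equal(source.weight, restored.weight))
```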

## Prediction

### Probabilities

We compute the model outputs for the whole training set. Despite the name `y_pred_proba`, these are logits (unnormalized scores), one per class, not probabilities.

In [19]:
model.eval()
y_pred_proba = model(X_scaled)
y_pred_proba[:4]

tensor([[ 5.5468, -5.2841],
        [ 2.2089, -1.9820],
        [ 1.5893, -1.4184],
        [ 2.8332, -2.5995]], grad_fn=<SliceBackward0>)

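We recompute the outputs with matrix products and pass each logit through a sigmoid. The sigmoid treats the two logits independently, so the two columns need not sum to 1; the class probabilities that `CrossEntropyLoss` assumes come from a softmax over the two logits.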

In [20]:
m_params = list(model.parameters())

In [21]:
if complex:
    print(torch.sigmoid(torch.relu(X_scaled @ m_params[0].T + m_params[1]) @ m_params[2].T + m_params[3])[:4])
else:
    print(torch.sigmoid(X_scaled @ m_params[0].T + m_params[1])[:4])

tensor([[0.9961, 0.0050],
        [0.9011, 0.1211],
        [0.8305, 0.1949],
        [0.9444, 0.0692]], grad_fn=<SliceBackward0>)
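For comparison, here is a sketch of the softmax probabilities computed from the four logit rows printed earlier. With two classes, the softmax probability of class 1 equals the sigmoid of the difference between the two logits, which links this setup back to logistic regression.

```python
import torch

# First four logit rows printed by the model above
logits = torch.tensor([[5.5468, -5.2841],
                       [2.2089, -1.9820],
                       [1.5893, -1.4184],
                       [2.8332, -2.5995]])

# Softmax over the class dimension: each row sums to 1
probs = torch.softmax(logits, dim=-1)

# Two-class identity: P(class 1) = sigmoid(logit_1 - logit_0)
p_class1 = torch.sigmoid(logits[:, 1] - logits[:, 0])

print(probs)
print(p_class1)
```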


### Classes

In [25]:
y_pred = torch.argmax(y_pred_proba, dim=-1)
y_pred

tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1])

## Evaluation

In [26]:
from sklearn.metrics import classification_report

print(classification_report(Y, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        15
           1       1.00      1.00      1.00        15

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



We computed these scores on the training set, which is not good practice: such scores tend to overestimate how well the model generalizes. We should use a dedicated test set instead.
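One possible way to set aside a test set is scikit-learn's `train_test_split`, sketched below on random stand-in features with the same class balance as our data. The 20% test size and the variable names are assumptions for the illustration. Note that the scaler is fitted on the training part only and then applied to both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))        # stand-in features
y_demo = np.array([0] * 15 + [1] * 15)   # same class balance as above

# stratify keeps the class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

# Fit the scaling statistics on the training part only
scaler_demo = StandardScaler().fit(X_train)
X_train_std = scaler_demo.transform(X_train)
X_test_std = scaler_demo.transform(X_test)

print(X_train_std.shape, X_test_std.shape)
```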