# PyTorch Neural Network Classification

This notebook closely follows the material available at learnpytorch.io [[1]](https://www.learnpytorch.io/), with occasional refactoring and extension for consistency of style and to make connections with other parts of the package. It also offers more examples and spends less time revisiting lower-level concepts once they have been discussed.

### Classification Problems

A classification problem refers to the identification of which class an observation belongs to amongst a set of classes. We distinguish two major cases.

* Binary. There are only two classes to choose from.
* Multi-class. There are more than two classes.

Note that binary classification can be viewed as a special case of multi-class classification if the definition is relaxed. However, treating the binary case separately allows for the development of specialized methodology that is often more powerful.

There is also a variant named multi-label classification, where multiple non-exclusive labels can be assigned to the same observation. Strictly speaking, classification is not an entirely accurate term here, since classes are mutually exclusive by definition in most mathematical terminology.
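To make the three cases more tangible, here is a minimal sketch of how the targets could be encoded as tensors. The label values are hypothetical and serve illustration only.

```python
import torch

# Hypothetical targets for four observations, for illustration only.
binary_labels = torch.tensor([0, 1, 1, 0])        # one of two classes
multi_class_labels = torch.tensor([2, 0, 1, 2])   # one of three classes
multi_label_targets = torch.tensor([              # any subset of three labels
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
])

# In the multi-class case every observation belongs to exactly one class,
# so each row of the one-hot encoding sums to 1.
one_hot = torch.nn.functional.one_hot(multi_class_labels, num_classes=3)
print(one_hot.sum(dim=1))

# In the multi-label case a row may carry anywhere from 0 to 3 labels.
print(multi_label_targets.sum(dim=1))
```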

In [None]:
import matplotlib.pyplot as plt
import torch
from sklearn.datasets import make_circles
from torch import nn

### Sample problem

We will rely on `sklearn`'s `make_circles` function to generate a toy dataset of two concentric circles in two dimensions. The total number of points generated across the two circles is fixed at a sample size of 1000, and Gaussian noise with a standard deviation of 0.03 is added to the coordinates. The random state is also fixed for reproducibility of the problem. The output consists of the coordinates and a classification using 0's and 1's.

In [None]:
sample_size = 1000
points, classification = make_circles(
    sample_size,
    noise=0.03,
    random_state=42,
)
print(points[:10])
print()
print(classification[:10])
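As a quick sanity check on the generated data, the sketch below inspects the shapes, the number of points per class, and the average distance from the origin per class. The names `sample_points`, `sample_labels` and `radius` are introduced here for illustration and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import make_circles

# Same call as above, stored under separate illustrative names.
sample_points, sample_labels = make_circles(1000, noise=0.03, random_state=42)

print(sample_points.shape, sample_labels.shape)   # (1000, 2) (1000,)
print(np.bincount(sample_labels))                 # number of points per class

# Distance of every point from the origin, averaged per class.
radius = np.linalg.norm(sample_points, axis=1)
for label in (0, 1):
    print(label, radius[sample_labels == label].mean())
```

The class with the smaller average radius lies on the inner circle, so the two classes differ in their distance from the origin rather than in their position along either axis.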

We can readily visualize the circles using a scatter plot. The first column of `points` serves as the horizontal coordinate and the second column as the vertical coordinate. The colormap is `RdYlBu`, which maps the low end of the value range to red, the high end to blue, and mid-range values to *yellowish* tones. Since our classification only takes the values 0 and 1, the points appear in red and blue.

In [None]:
plt.figure(figsize=(12, 8))
plt.scatter(
    x=points[:, 0], 
    y=points[:, 1], 
    c=classification,
    cmap=plt.cm.RdYlBu,
)
plt.show()

Recall that in a generic machine learning workflow, we first take historical data and then turn it into tensors. Our data at the moment is represented by `numpy` arrays.

In [None]:
print(type(points))
print(type(classification))

In [None]:
points = torch.from_numpy(points).type(torch.float)
classification = torch.from_numpy(classification).type(torch.float)
print(points[:10])
print()
print(classification[:10])
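The explicit cast to `torch.float` matters because `numpy` defaults to 64-bit floats, while `torch` layers hold 32-bit weights by default. A minimal sketch on a small hypothetical array:

```python
import numpy as np
import torch

array = np.array([[0.5, -0.25], [1.0, 0.75]])   # hypothetical data, float64 by default
tensor = torch.from_numpy(array)
print(tensor.dtype)                              # torch.float64

tensor = tensor.type(torch.float)                # torch.float is an alias of torch.float32
print(tensor.dtype)                              # torch.float32

# The weights of a default linear layer are float32, so the input should match.
layer = torch.nn.Linear(in_features=2, out_features=1)
print(layer.weight.dtype)                        # torch.float32
```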

The next step is to split the data into training and testing sets. The `sklearn` package offers a randomized split by means of the `train_test_split` utility. This needs to be used carefully, see [[2]](https://towardsdatascience.com/3-things-you-need-to-know-before-you-train-test-split-869dfabb7e50), but in our case we ignore the finer details. We pick the common 80/20 split, that is, 80 percent of the data for training and 20 percent for testing.

In [None]:
from sklearn.model_selection import train_test_split


train_input, test_input, train_output, test_output = train_test_split(
    points,
    classification,
    test_size=0.2,
    random_state=42,
)
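One of the finer details is whether the class proportions survive the split. The sketch below uses placeholder data and the `stratify` argument of `train_test_split`, which the cell above does not use. It is shown only as an option, under the assumption that equal class shares in both parts are desired.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and perfectly balanced placeholder labels.
features = np.arange(2000, dtype=np.float32).reshape(1000, 2)
labels = np.tile([0, 1], 500)

train_x, test_x, train_y, test_y = train_test_split(
    features,
    labels,
    test_size=0.2,
    random_state=42,
    stratify=labels,  # assumption: keep the class shares equal in both parts
)
print(len(train_x), len(test_x))       # 800 200
print(train_y.mean(), test_y.mean())   # share of class 1 in each part
```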

We continue with building the model itself, which requires a short detour into its architecture. We will make use of a hyperparameter, the number of hidden neurons. Instead of mapping the points to a classification through a single layer, our model will consist of two linear layers. The first takes in 2 features, standing for the x and y coordinates, and outputs 5 features, which serve as the hidden neurons. The second takes these 5 features as input and outputs a single feature, the classification itself.

In [None]:
class CircleModelLinearLayers(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.layer_1 = nn.Linear(in_features=2, out_features=5)
        self.layer_2 = nn.Linear(in_features=5, out_features=1)
    
    def forward(self, points):
        return self.layer_2(self.layer_1(points))
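For a feel of the size of this architecture, the following sketch rebuilds the same two layers with `nn.Sequential` and counts the trainable parameters. The name `sequential_model` and the random batch are illustrative only, and the sequential form is an alternative way of writing the layers, not a replacement for the class above.

```python
import torch
from torch import nn

sequential_model = nn.Sequential(
    nn.Linear(in_features=2, out_features=5),   # 2 * 5 weights + 5 biases
    nn.Linear(in_features=5, out_features=1),   # 5 * 1 weights + 1 bias
)

for name, parameter in sequential_model.named_parameters():
    print(name, tuple(parameter.shape))

parameter_count = sum(p.numel() for p in sequential_model.parameters())
print(parameter_count)                           # 21

batch = torch.rand(8, 2)                         # 8 hypothetical points
print(sequential_model(batch).shape)             # torch.Size([8, 1])
```

The trailing dimension of size 1 in the output is the reason the predictions are passed through `squeeze` before being compared with the labels.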

Let us make optional use of CUDA if available, create an instance of the model, and then make an untrained prediction.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

model = CircleModelLinearLayers().to(device)
with torch.inference_mode():
    prediction = model(train_input.to(device)).squeeze()
print(prediction[:10])

Notice that the outputs are not in the form of a classification. These are the raw outputs of the two-layer sequential linear model and are referred to as logits, see [[3]](https://en.wikipedia.org/wiki/Logit) for more details. The logit function is the quantile function of the standard logistic distribution, or in another formulation, the inverse of its cumulative distribution function. The values of the cumulative distribution function are interpreted as probabilities. Consequently, applying the inverse of the logit function transforms the raw outputs into values that can be interpreted as probabilities. The inverse of the logit function is the expit function, which is available as the `sigmoid` function in `torch`.

In [None]:
probabilities = torch.sigmoid(prediction)
print(probabilities[:10])
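The inverse relationship between the logit and the sigmoid can be checked numerically. The sketch below uses a few hypothetical probabilities and `torch.logit`, and compares the result against the formula log(p / (1 - p)).

```python
import torch

example_probabilities = torch.tensor([0.1, 0.5, 0.9])   # hypothetical values

# logit(p) = log(p / (1 - p))
example_logits = torch.logit(example_probabilities)
print(example_logits)

# Applying the sigmoid recovers the original probabilities.
print(torch.sigmoid(example_logits))

manual_logits = torch.log(example_probabilities / (1 - example_probabilities))
print(torch.allclose(example_logits, manual_logits))
```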

To obtain a particular classification output, we still need to map the probabilities to 0's and 1's. As a rule of thumb, we can simply go with 0 for probabilities less than 0.5 and 1 for probabilities above 0.5. Note that `torch.round` rounds half to even, so a probability of exactly 0.5 is mapped to 0.

In [None]:
# A separate name keeps the original `classification` labels intact.
predicted_classification = torch.round(probabilities)
print(predicted_classification[:10])
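If the boundary case matters, an explicit threshold is an alternative to rounding. The following sketch compares the two on hypothetical probabilities and also shows that thresholding the logits at 0 gives the same result as thresholding the probabilities at 0.5.

```python
import torch

edge_probabilities = torch.tensor([0.2, 0.5, 0.5001, 0.8])   # hypothetical values

rounded = torch.round(edge_probabilities)            # half rounds to even: 0.5 -> 0
thresholded = (edge_probabilities >= 0.5).float()    # 0.5 -> 1

print(rounded)       # 0, 0, 1, 1
print(thresholded)   # 0, 1, 1, 1

# sigmoid(0) = 0.5, so the sign of the logit carries the same information.
edge_logits = torch.tensor([-1.0, 0.0, 2.0])
from_logits = (edge_logits >= 0).float()
from_probabilities = (torch.sigmoid(edge_logits) >= 0.5).float()
print(torch.equal(from_logits, from_probabilities))
```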

The training will require a loss function and an optimizer. Optimizers are mostly agnostic to the problem space, whereas the loss function needs to be adjusted to the scenario. In the case of linear regression, the loss was defined as the mean absolute deviation. This would not be appropriate here, as on 0 and 1 labels it would essentially only reward perfect matches. For binary classification problems, the binary cross entropy is one of the standard choices for the loss function, see [[4]](https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a). We use `BCEWithLogitsLoss`, which applies the sigmoid internally and therefore takes the raw logits as input.

In [None]:
loss_function = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(params=model.parameters(), lr=0.1)
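To see what the loss computes, the sketch below evaluates the binary cross entropy in three ways on hypothetical logits and targets: with `BCEWithLogitsLoss`, with `BCELoss` applied to the sigmoid of the logits, and by hand from the formula -mean(y log(p) + (1 - y) log(1 - p)).

```python
import torch
from torch import nn

example_logits = torch.tensor([-2.0, 0.5, 3.0])    # hypothetical raw outputs
example_targets = torch.tensor([0.0, 1.0, 1.0])    # hypothetical labels

# Sigmoid and cross entropy in a single step.
with_logits = nn.BCEWithLogitsLoss()(example_logits, example_targets)

# Sigmoid first, cross entropy on the probabilities second.
p = torch.sigmoid(example_logits)
two_step = nn.BCELoss()(p, example_targets)

# The same quantity written out by hand.
manual = -(
    example_targets * torch.log(p) + (1 - example_targets) * torch.log(1 - p)
).mean()

print(with_logits, two_step, manual)
```

All three values agree up to floating point error. The combined version is generally the more numerically stable choice for large positive or negative logits.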

We are left to build the training and testing loop.

In [None]:
torch.manual_seed(42)

train_input, test_input = train_input.to(device), test_input.to(device)
train_output, test_output = train_output.to(device), test_output.to(device)
epoch_count = 100

for epoch in range(epoch_count):
    model.train()
    train_logits = model(train_input).squeeze()
    train_classification = torch.round(torch.sigmoid(train_logits))
    
    train_loss = loss_function(train_logits, train_output)
    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()
    model.eval()
    
    with torch.inference_mode():
        test_logits = model(test_input).squeeze()
        test_classification = torch.round(torch.sigmoid(test_logits))
        test_loss = loss_function(test_logits, test_output)
    
    if epoch % 10 == 0:
        print(f"Epoch: {epoch} | Loss: {train_loss:.5f} | Test loss: {test_loss:.5f}")
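The loop computes `train_classification` and `test_classification` but reports only the losses. A possible way to make use of them is an accuracy measure. The helper `accuracy` below is a hypothetical addition rather than part of the original loop, shown on made-up logits and targets.

```python
import torch


def accuracy(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Share of observations whose rounded prediction matches the target."""
    predicted = torch.round(torch.sigmoid(logits))
    return (predicted == targets).float().mean().item()


example_logits = torch.tensor([2.0, -1.0, 0.3, -0.2])   # hypothetical raw outputs
example_targets = torch.tensor([1.0, 0.0, 0.0, 0.0])    # hypothetical labels
print(accuracy(example_logits, example_targets))        # 0.75
```

For a balanced binary problem such as the circles, an accuracy close to 0.5 means that the model does no better than guessing.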

Observe that the loss is not visibly decreasing as the training process progresses. We have multiple options to improve the model, most of them by tweaking the hyperparameters.

* More layers
* More hidden units
* More epochs
* Activation functions to introduce non-linearity
* Adjust learning rate
* Change loss function
* Use transfer learning

Here, the most probable issue is that our model works with lines: being a composition of linear layers, it can only "cut" the data along a straight line. However, the two classes are arranged in concentric circles, so any line leaves roughly half of each class on either side, and this is not going to lead anywhere. We need to introduce non-linearity. This can be done through a rectifier, see [[5]](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)). There is an out-of-the-box solution by means of `torch`'s `ReLU` class.
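As a side check on the linearity argument, the sketch below shows that two stacked linear layers collapse into a single one, with weight W2 W1 and bias W2 b1 + b2. The layer names and the random batch are illustrative only.

```python
import torch
from torch import nn

torch.manual_seed(0)
first = nn.Linear(in_features=2, out_features=5)
second = nn.Linear(in_features=5, out_features=1)

# Compose the two affine maps by hand: W = W2 W1, b = W2 b1 + b2.
combined = nn.Linear(in_features=2, out_features=1)
with torch.no_grad():
    combined.weight.copy_(second.weight @ first.weight)
    combined.bias.copy_(second.weight @ first.bias + second.bias)

batch = torch.rand(16, 2)   # 16 hypothetical points
with torch.no_grad():
    stacked_output = second(first(batch))
    combined_output = combined(batch)

print(torch.allclose(stacked_output, combined_output, atol=1e-6))
```

Adding further linear layers therefore does not change the family of decision boundaries. The `CircleModelNonLinear` class below places a `ReLU` between the two layers, which prevents this collapse.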

In [None]:
class CircleModelNonLinear(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.layer_1 = nn.Linear(in_features=2, out_features=5)
        self.layer_2 = nn.Linear(in_features=5, out_features=1)
        self.relu = nn.ReLU()
    
    def forward(self, points):
        return self.layer_2(self.relu(self.layer_1(points)))
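For reference, the rectifier simply replaces negative values with 0 and leaves the rest unchanged. A minimal sketch on hypothetical values, including a check that the function is not additive and hence not linear:

```python
import torch
from torch import nn

relu = nn.ReLU()
values = torch.tensor([-2.0, -0.5, 0.0, 1.5])   # hypothetical inputs

rectified = relu(values)
print(rectified)                                 # negatives become 0, 1.5 is kept
print(torch.equal(rectified, torch.clamp(values, min=0)))

# A linear map f would satisfy f(a + b) = f(a) + f(b). The rectifier does not.
a, b = torch.tensor(-1.0), torch.tensor(1.0)
print(relu(a + b).item(), (relu(a) + relu(b)).item())   # 0.0 1.0
```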

In [None]:
model = CircleModelNonLinear().to(device)

In [None]:
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

In [None]:
torch.manual_seed(42)

train_input, test_input = train_input.to(device), test_input.to(device)
train_output, test_output = train_output.to(device), test_output.to(device)
epoch_count = 100

for epoch in range(epoch_count):
    model.train()
    train_logits = model(train_input).squeeze()
    train_classification = torch.round(torch.sigmoid(train_logits))
    
    train_loss = loss_function(train_logits, train_output)
    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()
    model.eval()
    
    with torch.inference_mode():
        test_logits = model(test_input).squeeze()
        test_classification = torch.round(torch.sigmoid(test_logits))
        test_loss = loss_function(test_logits, test_output)
    
    if epoch % 10 == 0:
        print(f"Epoch: {epoch} | Loss: {train_loss:.5f} | Test loss: {test_loss:.5f}")

The result is slightly better, but not by much. We can also increase the epoch count.

In [None]:
# Seed before creating the model so that its initial weights are reproducible.
torch.manual_seed(42)

model = CircleModelNonLinear().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

train_input, test_input = train_input.to(device), test_input.to(device)
train_output, test_output = train_output.to(device), test_output.to(device)
epoch_count = 10000

for epoch in range(epoch_count):
    model.train()
    train_logits = model(train_input).squeeze()
    train_classification = torch.round(torch.sigmoid(train_logits))
    
    train_loss = loss_function(train_logits, train_output)
    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()
    model.eval()
    
    with torch.inference_mode():
        test_logits = model(test_input).squeeze()
        test_classification = torch.round(torch.sigmoid(test_logits))
        test_loss = loss_function(test_logits, test_output)
    
    if epoch % 1000 == 0:
        print(f"Epoch: {epoch} | Loss: {train_loss:.5f} | Test loss: {test_loss:.5f}")
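Since the same loop has now been written out three times, it could be wrapped into a function. The sketch below is one possible refactoring, not the notebook's own method. The function `train`, the names prefixed with `demo_` and the positional 800/200 split are assumptions made to keep the example self-contained.

```python
import torch
from sklearn.datasets import make_circles
from torch import nn


def train(model, loss_function, optimizer, train_data, test_data, epoch_count):
    """Run the training and testing loop and return the losses of every epoch."""
    (train_input, train_output), (test_input, test_output) = train_data, test_data
    history = []
    for _ in range(epoch_count):
        model.train()
        train_loss = loss_function(model(train_input).squeeze(), train_output)
        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()

        model.eval()
        with torch.inference_mode():
            test_loss = loss_function(model(test_input).squeeze(), test_output)
        history.append((train_loss.item(), test_loss.item()))
    return history


torch.manual_seed(42)  # seed before the model is created so its weights repeat
raw_points, raw_labels = make_circles(1000, noise=0.03, random_state=42)
demo_inputs = torch.from_numpy(raw_points).type(torch.float)
demo_outputs = torch.from_numpy(raw_labels).type(torch.float)

demo_model = nn.Sequential(nn.Linear(2, 5), nn.ReLU(), nn.Linear(5, 1))
demo_history = train(
    demo_model,
    nn.BCEWithLogitsLoss(),
    torch.optim.SGD(demo_model.parameters(), lr=0.1),
    (demo_inputs[:800], demo_outputs[:800]),   # positional split, illustration only
    (demo_inputs[800:], demo_outputs[800:]),
    epoch_count=50,
)
print(demo_history[0], demo_history[-1])
```

Returning the loss history instead of printing inside the loop keeps the reporting frequency, such as every 10 or every 1000 epochs, a decision of the caller.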

### References

[1] Learn PyTorch for Deep Learning: Zero to Mastery book, accessed online on 2023.04.25 at https://www.learnpytorch.io/

[2] Mayukh Bhattacharyya, 3 Things You Need To Know Before You Train-Test Split, Towards Data Science article, accessed online on 2023.04.25 at https://towardsdatascience.com/3-things-you-need-to-know-before-you-train-test-split-869dfabb7e50

[3] Logit, Wikipedia article, accessed online on 2023.04.25 at https://en.wikipedia.org/wiki/Logit

[4] Daniel Godoy, Understanding binary cross-entropy/log loss: a visual explanation, Towards Data Science article, accessed online on 2023.04.25 at https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a

[5] Rectifier, Wikipedia article, accessed online on 2023.04.25 at https://en.wikipedia.org/wiki/Rectifier_(neural_networks)