### Developing a neural network with PyTorch - Study of the Breast Cancer Wisconsin Dataset

The goal here is to get familiar with both virtual environments and PyTorch. We also want to compare the performance of a PyTorch network with the one we built using only NumPy in [Building a Neural Network](https://github.com/Gdeterline/Neural-Network-Build/blob/main/breast_cancer_classification.ipynb), where we evaluated the model on the Breast Cancer Wisconsin dataset.
Developing a PyTorch network on the same dataset is therefore a way to kill two birds with one stone.

Import required packages

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import torch
import sklearn.preprocessing as preprocessing
from sklearn.model_selection import train_test_split
import torch.nn as nn
from torch.optim import SGD
from PyTorchModel import Model

In [2]:
# Prepare data
data = pd.read_csv('./datasets/breast_cancer_data.csv')

In [3]:
data.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


We will preprocess the data the same way we did in the previous notebook.
This time, we will use PyTorch to build the neural network.

<ins>Nota Bene:</ins> The absolute performance of the model is not the main focus. In the previous project, we chose to train on a subset of the input features, and we use the same subset here. This may lower the model's performance, but it lets us compare the two models on equal terms.

In [4]:
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})

In [5]:
selected_columns = [
    'diagnosis', 
    'radius_mean', 
    'perimeter_mean', 
    'area_mean', 
    'concavity_mean', 
    'concave points_mean', 
    'radius_se', 
    'perimeter_se', 
    'area_se', 
    'radius_worst', 
    'perimeter_worst', 
    'area_worst', 
    'compactness_worst', 
    'concavity_worst', 
    'concave points_worst'
]

# Create the new DataFrame with the selected columns
data_postpr = data[selected_columns].copy(deep=True)

# Split X and y data
y = data_postpr['diagnosis']
X = data_postpr.drop('diagnosis', axis=1)

print(type(X), '\n', type(y))

# Standardize the features (zero mean, unit variance)
scaler = preprocessing.StandardScaler()
X = scaler.fit_transform(X)

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
y_train = np.array(y_train)
y_test = np.array(y_test)

<class 'pandas.core.frame.DataFrame'> 
 <class 'pandas.core.series.Series'>


In [6]:
print(type(X_train), '\n', type(X_test), '\n', type(y_train), '\n', type(y_test)) 

<class 'numpy.ndarray'> 
 <class 'numpy.ndarray'> 
 <class 'numpy.ndarray'> 
 <class 'numpy.ndarray'>
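
Note that the cell above fits `StandardScaler` on the full dataset before splitting, so test-set statistics influence the scaling. That matches the previous notebook, which keeps the comparison fair. A common alternative, sketched below on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration), is to fit the scaler on the training split only and reuse it for the test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Split first, then fit the scaler on the training portion only
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

# StandardScaler computes (x - mean) / std with the population standard deviation (ddof=0)
manual = (X_tr - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print(np.allclose(X_tr_scaled, manual))
```

With this ordering, the test split is scaled using the training mean and standard deviation, so its own mean is only approximately zero.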


The data is now loaded and preprocessed exactly as in the previous notebook. We will now build the neural network using PyTorch. 

First, we need to convert the arrays to PyTorch tensors.

In [7]:
X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32)
y_train = y_train.unsqueeze(1)
y_test = torch.tensor(y_test, dtype=torch.float32)
y_test = y_test.unsqueeze(1)

In [8]:
print(X_train.size())
print(y_train.size())

torch.Size([455, 14])
torch.Size([455, 1])
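
The `unsqueeze(1)` calls matter because a network with a single output unit produces predictions of shape `(N, 1)`, and `nn.BCELoss` expects the target to have the same shape as its input. A small illustration with made-up values:

```python
import torch
import torch.nn as nn

# Made-up labels for four samples, as they come out of a NumPy array: shape (4,)
labels = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(labels.shape)

# unsqueeze(1) inserts a dimension of size 1 at position 1: shape (4, 1)
labels_col = labels.unsqueeze(1)
print(labels_col.shape)

# Made-up probabilities with the (4, 1) shape a single-output model would produce
probs = torch.tensor([[0.9], [0.2], [0.4], [0.7]])
loss = nn.BCELoss()(probs, labels_col)
print(loss.item())
```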


The data is ready to be used in the neural network. We will now build the neural network, train it, and evaluate its performance.

We will apply the following steps:
- Define the neural network
- Define the loss function
- Define the optimizer
- Train the network
- Evaluate the performance of the network

In [9]:
# Defining the model
model = Model()
print(model)

Model(
  (layer1): Linear(in_features=14, out_features=128, bias=True)
  (layer2): Linear(in_features=128, out_features=64, bias=True)
  (layer3): Linear(in_features=64, out_features=1, bias=True)
)
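
The `Model` class is imported from `PyTorchModel`, whose source is not shown here. A minimal sketch consistent with the printed architecture might look like the following. The layer sizes come from the output above; the ReLU hidden activations are an assumption, and the final sigmoid is inferred from the use of `nn.BCELoss`, which requires outputs in [0, 1]. The class is named `SketchModel` to make clear it is not the original.

```python
import torch
import torch.nn as nn

class SketchModel(nn.Module):
    """Hypothetical stand-in for PyTorchModel.Model (the original source is not shown)."""

    def __init__(self, in_features: int = 14):
        super().__init__()
        # Layer sizes taken from the printed model summary
        self.layer1 = nn.Linear(in_features, 128)
        self.layer2 = nn.Linear(128, 64)
        self.layer3 = nn.Linear(64, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU is an assumed choice of hidden activation
        x = torch.relu(self.layer1(x))
        x = torch.relu(self.layer2(x))
        # Sigmoid maps the single output to a probability, as nn.BCELoss requires
        return torch.sigmoid(self.layer3(x))

sketch = SketchModel()
out = sketch(torch.randn(5, 14))  # batch of 5 random samples with 14 features
print(out.shape)
```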


In [10]:
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # SGD with the same learning rate as the NumPy network
loss_func = nn.BCELoss()
epochs = 20000

for epoch in range(epochs):
    optimizer.zero_grad()
    y_hat = model(X_train)
    loss = loss_func(y_hat, y_train)
    loss.backward()
    optimizer.step()
    
    if epoch % 500 == 0:
        print(f"Epoch {epoch} - Loss: {loss.item()}")




Epoch 0 - Loss: 0.6624242663383484
Epoch 500 - Loss: 0.6580744981765747
Epoch 1000 - Loss: 0.654190182685852
Epoch 1500 - Loss: 0.6502673625946045
Epoch 2000 - Loss: 0.6462366580963135
Epoch 2500 - Loss: 0.6420353651046753
Epoch 3000 - Loss: 0.6376010775566101
Epoch 3500 - Loss: 0.6328703165054321
Epoch 4000 - Loss: 0.6277782917022705
Epoch 4500 - Loss: 0.6222587823867798
Epoch 5000 - Loss: 0.6162433624267578
Epoch 5500 - Loss: 0.6096624732017517
Epoch 6000 - Loss: 0.6024459004402161
Epoch 6500 - Loss: 0.5945242643356323
Epoch 7000 - Loss: 0.5858305096626282
Epoch 7500 - Loss: 0.576303243637085
Epoch 8000 - Loss: 0.5658897757530212
Epoch 8500 - Loss: 0.55455082654953
Epoch 9000 - Loss: 0.5422652959823608
Epoch 9500 - Loss: 0.5290355086326599
Epoch 10000 - Loss: 0.5148926377296448
Epoch 10500 - Loss: 0.4999004304409027
Epoch 11000 - Loss: 0.4841579496860504
Epoch 11500 - Loss: 0.46779853105545044
Epoch 12000 - Loss: 0.4509868323802948
Epoch 12500 - Loss: 0.43391114473342896
Epoch 13000 
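
The loop above passes the whole training set through the model at each step, so although the optimizer is `SGD`, each update is a full-batch gradient descent step. A mini-batch variant could be sketched with `TensorDataset` and `DataLoader`, shown here on synthetic data with a small stand-in model. All names, sizes, the learning rate, and the batch size of 32 are illustrative assumptions, not the notebook's settings.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic data standing in for X_train / y_train (hypothetical values)
X_demo = torch.randn(128, 14)
y_demo = (X_demo[:, :1] > 0).float()  # shape (128, 1), label depends on the first feature

# Small stand-in network ending in a sigmoid so that nn.BCELoss applies
demo_model = nn.Sequential(nn.Linear(14, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
optimizer = torch.optim.SGD(demo_model.parameters(), lr=0.1)
loss_func = nn.BCELoss()

loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=32, shuffle=True)

with torch.no_grad():
    initial_loss = loss_func(demo_model(X_demo), y_demo).item()

for epoch in range(50):
    for X_batch, y_batch in loader:
        optimizer.zero_grad()
        loss = loss_func(demo_model(X_batch), y_batch)
        loss.backward()
        optimizer.step()  # one parameter update per mini-batch

with torch.no_grad():
    final_loss = loss_func(demo_model(X_demo), y_demo).item()

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Mini-batches give several parameter updates per pass over the data, which often reduces the number of epochs needed, at the cost of noisier gradients.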

The model is now trained. We will now evaluate its accuracy on both the training and test sets.

In [14]:
# Disable gradient tracking during evaluation
with torch.no_grad():
    y_pred_train = model(X_train)
    y_pred_test = model(X_test)

# Round the predicted probabilities to 0/1 and compare them with the labels
accuracy_train = (y_pred_train.round() == y_train).float().mean().item()
print(f'Accuracy on the training dataset : {accuracy_train:.2%}')

accuracy_test = (y_pred_test.round() == y_test).float().mean().item()
print(f'Accuracy on the testing dataset : {accuracy_test:.2%}')


Accuracy on the training dataset : 92.97%
Accuracy on the testing dataset : 95.61%
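
The accuracy computation turns each predicted probability into a 0/1 class and averages the matches with the labels. A small helper, sketched here with made-up tensors (`binary_accuracy` is a hypothetical name, not part of the notebook), makes the thresholding explicit:

```python
import torch

def binary_accuracy(probs: torch.Tensor, targets: torch.Tensor, threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded probability matches the 0/1 target."""
    preds = (probs >= threshold).float()
    return (preds == targets).float().mean().item()

# Made-up probabilities and labels, shape (5, 1)
probs = torch.tensor([[0.91], [0.12], [0.55], [0.40], [0.75]])
targets = torch.tensor([[1.0], [0.0], [0.0], [1.0], [1.0]])

acc = binary_accuracy(probs, targets)
print(f"Accuracy: {acc:.1%}")  # 3 of 5 predictions match -> 60.0%
```

One small difference: `torch.round` sends a value of exactly 0.5 to 0 (round half to even), whereas the `>=` comparison here sends it to 1. This only matters for predictions of exactly 0.5.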


Under the exact same conditions, the neural network we built ourselves performs better than the one built with PyTorch.
Yet there are two areas where our own model falls short:
- Computation time: ours takes about 1 min 30 s to train, while the PyTorch model takes about 30 s (both remain short for such a small dataset).
- Flexibility: the PyTorch model is much easier to modify, for example by changing the activation function from one layer to the next.
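
Both points can be illustrated with a short sketch: `time.perf_counter` measures the training time, and `nn.Sequential` shows how each layer can take a different activation. The data, the model, the choice of Tanh and ReLU, and the epoch count are illustrative assumptions, so the measured time will not match the figures above.

```python
import time
import torch
import torch.nn as nn

torch.manual_seed(0)
X_demo = torch.randn(455, 14)                   # same shape as X_train, random values
y_demo = torch.randint(0, 2, (455, 1)).float()  # random 0/1 labels

# The activation of each layer can be swapped independently, e.g. Tanh then ReLU
demo_model = nn.Sequential(
    nn.Linear(14, 128), nn.Tanh(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)
optimizer = torch.optim.SGD(demo_model.parameters(), lr=0.001)
loss_func = nn.BCELoss()

start = time.perf_counter()
for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_func(demo_model(X_demo), y_demo)
    loss.backward()
    optimizer.step()
elapsed = time.perf_counter() - start

print(f"Trained 200 epochs in {elapsed:.3f} s")
```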