## Model Retraining

We will now load the model that we created in the previous lab and retrain it, changing some hyperparameters to try to improve its accuracy.

To start, we import the required libraries and packages. We will also load our diabetes data set and create training and test sets.

In [10]:
# Diabetes model using PyTorch
# Uses the data file:  diabetes.csv
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

After we have imported our required libraries and packages, we load the data into a dataframe (df).

In [11]:
# This code assumes that the file is in a data folder up one dir from this file.
data_file_name = '../data/diabetes.csv'
model_saved_name = '../model/PytorchDiabetesModel.pt'

df = pd.read_csv(data_file_name)
X = df.drop('Outcome', axis=1)  # Independent features (inputs)
y = df['Outcome']                 # Dependent feature (target)

Before we can retrain the model, we apply the same data processing as before, so that the data is consistent with what the model was originally trained on.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Create tensors (multidimensional arrays): X holds the input data, y the output labels
X_train = torch.FloatTensor(X_train.values)
X_test = torch.FloatTensor(X_test.values)
y_train = torch.LongTensor(y_train.values)
y_test = torch.LongTensor(y_test.values)
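
The tensor types chosen above are not arbitrary: `nn.CrossEntropyLoss` takes floating-point scores (logits) as input and integer class indices as targets. The following is a minimal sketch of that pairing. The feature values and the single `nn.Linear` layer are made-up stand-ins for illustration, not the diabetes data or the lab's model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical mini-batch: 3 samples with 8 features each (values are made up).
features = torch.FloatTensor([[0.1] * 8, [0.2] * 8, [0.3] * 8])
labels = torch.LongTensor([1, 0, 1])  # One class index per sample

# Stand-in for a network: produces one score (logit) per class.
layer = nn.Linear(8, 2)
logits = layer(features)

# The loss pairs float logits with integer (int64) class labels.
loss = nn.CrossEntropyLoss()(logits, labels)
print(features.dtype, labels.dtype, tuple(logits.shape), loss.dim())
```

If the labels were passed as a float tensor of class indices, the loss would interpret them differently or raise an error, which is why `y_train` and `y_test` are created with `torch.LongTensor`.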

Now we can load our model and try to retrain it by changing the epochs hyperparameter. The number of epochs is the number of complete passes that the learning algorithm makes over the training data set. During one epoch, every sample in the training data set has had a chance to update the internal model parameters once.

The intent behind increasing the number of epochs is to give the algorithm more time to fit the training data, in the hope that it can then make better predictions on data that it has not seen before.
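
To make the relationship between epochs and parameter updates concrete, here is a small sketch in plain Python. The helper `count_updates` is hypothetical, and the figure of 614 training rows is an assumption (an 80% split of a 768-row file), so adjust it to your own data. The training loop in this lab passes the whole training set through the model in a single step, which corresponds to the full-batch case.

```python
import math

def count_updates(n_samples, batch_size, epochs):
    """Number of optimizer steps taken over a whole training run."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    return batches_per_epoch * epochs

# Assumed figure: 614 training rows (80% of a 768-row file).
n_train = 614

# Full-batch training: one update per epoch.
full_batch = count_updates(n_train, batch_size=n_train, epochs=1000)

# Mini-batch training with a hypothetical batch size of 64: several updates per epoch.
mini_batch = count_updates(n_train, batch_size=64, epochs=1000)

print(full_batch, mini_batch)  # prints: 1000 10000
```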

In [44]:
class ANN_model(nn.Module):
    def __init__(self, input_features=8, hidden1=20, hidden2=10, out_features=2):
        super().__init__()
        self.f_connected1 = nn.Linear(input_features, hidden1)
        self.f_connected2 = nn.Linear(hidden1, hidden2)
        self.out = nn.Linear(hidden2, out_features)
            
    def forward(self, x):
        x = F.relu(self.f_connected1(x))
        x = F.relu(self.f_connected2(x))
        x = self.out(x)
        return x

    def save(self, model_path):
        torch.save(self.state_dict(), model_path)  # Save this instance's weights, not a global variable's

    def load(self, model_path):
        self.load_state_dict(torch.load(model_path))
        self.eval()


torch.manual_seed(20)

model = ANN_model()
model.load(model_saved_name)


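# Backward propagation - loss and optimizer
loss_function = nn.CrossEntropyLoss()  # Expects raw logits and integer class labels
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)  # RMSprop with a learning rate of 0.01

model.train()  # load() put the model in eval mode; switch back before retraining

epochs = 1000
final_losses = []
for epoch in range(epochs):
    y_pred = model(X_train)                # Forward pass over the full training set
    loss = loss_function(y_pred, y_train)
    final_losses.append(loss.item())       # Store a plain float, not the graph-attached tensor
    optimizer.zero_grad()                  # Clear gradients from the previous step
    loss.backward()                        # Backpropagate the loss
    optimizer.step()                       # Update the weights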
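
The loop above records only the training loss. One way to see whether extra epochs are actually helping is to record the loss on held-out data after every epoch as well. The sketch below illustrates the idea on randomly generated stand-in data (not the diabetes set) with a network of the same layer sizes as `ANN_model`. The names `net`, `X_val`, and `y_val` are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in data (random, NOT the diabetes set): 8 features, 2 classes.
X_tr, y_tr = torch.randn(80, 8), torch.randint(0, 2, (80,))
X_val, y_val = torch.randn(20, 8), torch.randint(0, 2, (20,))

# Small network with the same layer sizes as ANN_model.
net = nn.Sequential(nn.Linear(8, 20), nn.ReLU(),
                    nn.Linear(20, 10), nn.ReLU(),
                    nn.Linear(10, 2))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.RMSprop(net.parameters(), lr=0.01)

train_losses, val_losses = [], []
for epoch in range(200):
    # One full-batch training step.
    net.train()
    loss = loss_fn(net(X_tr), y_tr)
    opt.zero_grad()
    loss.backward()
    opt.step()
    train_losses.append(loss.item())

    # Held-out loss, computed without tracking gradients.
    net.eval()
    with torch.no_grad():
        val_losses.append(loss_fn(net(X_val), y_val).item())

# Epoch (0-based index) at which the held-out loss was lowest.
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
print(f"lowest held-out loss at epoch {best_epoch + 1} of {len(val_losses)}")
```

If the training loss keeps falling while the held-out loss starts to rise, further epochs are probably fitting noise rather than improving the model.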

Once our model is retrained, we should test its accuracy. We can do this by comparing its predictions on the test data set with the true labels.

In [45]:
# Accuracy - comparing the results from the test data

predictions = []
with torch.no_grad():
    for data in X_test:
        y_pred = model(data)
        predictions.append(y_pred.argmax().item())  # Index of the highest-scoring class

score = accuracy_score(y_test.numpy(), predictions)  # Number of hits / length of y_test
print(score)

0.7012987012987013
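
The same accuracy figure can also be computed without a Python loop, by taking the `argmax` over the class dimension for the whole batch at once. The sketch below uses made-up logits and targets to show the calculation. With the lab's objects, the equivalent call would presumably be `model(X_test).argmax(dim=1)`.

```python
import torch

# Hypothetical logits for 4 samples and 2 classes (made-up numbers).
logits = torch.tensor([[2.0, -1.0],
                       [0.3, 0.9],
                       [-0.5, 1.5],
                       [1.2, 0.1]])
targets = torch.tensor([0, 1, 0, 0])

# Class with the highest score in each row.
predicted = logits.argmax(dim=1)

# Fraction of rows where the predicted class matches the target.
accuracy = (predicted == targets).float().mean().item()
print(predicted.tolist(), accuracy)  # prints: [0, 1, 1, 0] 0.75
```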


Here we see that even though we trained our model for double the time that it trained for previously, it still produced an accuracy that was lower than before. A drop in test accuracy after longer training is a typical symptom of overfitting.

A machine learning model only gives accurate predictions if it generalizes to all types of data within its domain. Overfitting occurs when the model cannot generalize and instead fits too closely to the training data set. Overfitting can happen for several reasons, such as:

    • The training data size is too small and does not contain enough data samples to accurately represent all possible input data values.

    • The training data contains large amounts of irrelevant information, called noisy data.

    • The model trains for too long on a single sample set of data.
    
    • The model complexity is high, so it learns the noise within the training data.
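
One common guard against training for too long is early stopping: halt training once the loss on held-out data has stopped improving for a set number of epochs. The sketch below shows only the stopping rule, in plain Python. The function name `early_stop_epoch`, the `patience` parameter, and the loss values are illustrative assumptions, not part of the lab code.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the held-out loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Never ran out of patience: stop at the last recorded epoch.
    return len(val_losses), best_epoch

# Made-up held-out losses: improving at first, then drifting upward.
losses = [0.70, 0.62, 0.58, 0.57, 0.59, 0.61, 0.64, 0.66]
print(early_stop_epoch(losses, patience=3))  # prints: (7, 4)
```

In this example the rule stops at epoch 7 and points back to epoch 4 as the best one, so in practice the weights saved at the best epoch would be the ones to keep.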