Import libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import zscore
import torch
import torch.nn as nn

Exercise #2:

Cross Entropy Loss

In [2]:
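fulldataset = pd.read_csv('./iris.csv')

# standardize every feature column (all columns except the last, which holds the label)
np_x = fulldataset[list(fulldataset.columns)[0:-1]].apply(zscore).to_numpy()
# one-hot encode the class label
np_y = pd.get_dummies(fulldataset['variety']).to_numpy()

n_classes = 3
n_features = np_x.shape[1]

# the full dataset is used for training; there is no held-out test split
x_train = np_x
y_train = np_y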
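
As a rough illustration of what the preprocessing above produces, here is a minimal sketch on a tiny made-up table. The feature column names and values are hypothetical; only the `variety` label column name comes from the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical four-row stand-in for iris.csv (values invented for illustration)
toy = pd.DataFrame({
    "sepal.length": [5.0, 6.0, 7.0, 6.0],
    "petal.length": [1.0, 4.0, 6.0, 5.0],
    "variety": ["Setosa", "Versicolor", "Virginica", "Versicolor"],
})

# z-score each feature column: subtract the column mean, divide by the column std
toy_x = toy[toy.columns[:-1]].apply(zscore).to_numpy()

# one-hot encode the label: one column per class, in sorted class order
toy_y = pd.get_dummies(toy["variety"]).to_numpy().astype(float)

print(toy_x.mean(axis=0), toy_x.std(axis=0))
print(toy_y)
```

After standardization each feature column has mean 0 and standard deviation 1, and each row of the one-hot matrix contains a single 1 marking its class.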



Now we create tensors for the input features and the one-hot targets.

In [3]:
t_x_train = torch.tensor(x_train, requires_grad=False, dtype=torch.float64, device='cpu')
t_y_train = torch.tensor(y_train, requires_grad=False, dtype=torch.float64, device='cpu')

Initialize variables for Gradient Descent:

In [4]:
init_std_dev = 0.01;
initialW=init_std_dev*np.random.randn(n_features,n_classes)

Creating variables for weights:

In [5]:
W = torch.tensor(initialW,requires_grad=True,device='cpu');
b = torch.zeros((1,n_classes),requires_grad=True,device='cpu');

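Here let's define our CrossEntropyLoss and Softmax functions. Note that nn.CrossEntropyLoss applies log-softmax internally, so it is given the raw linear predictions rather than softmax outputs.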

In [6]:
lossModel = nn.CrossEntropyLoss()
sm = nn.Softmax(dim=1)  # normalize across the class axis, so each sample's probabilities sum to 1
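
To make the relationship between the two objects concrete, here is a small sketch with made-up scores. It assumes a PyTorch version in which `nn.CrossEntropyLoss` accepts class-probability (one-hot) targets, as the notebook itself relies on. It checks that the built-in loss on raw scores matches a manual softmax followed by the averaged negative log-likelihood, and shows what the softmax axis does for a (samples, classes) matrix.

```python
import torch
import torch.nn as nn

# Made-up raw scores (logits) for 2 samples and 3 classes, with one-hot targets
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]], dtype=torch.float64)
targets = torch.tensor([[1.0, 0.0, 0.0],
                        [0.0, 0.0, 1.0]], dtype=torch.float64)

# Softmax over dim=1 normalizes across classes: every row sums to 1
probs = nn.Softmax(dim=1)(logits)

# Softmax over dim=0 normalizes across samples instead: every column sums to 1
across_samples = nn.Softmax(dim=0)(logits)

# The built-in loss takes raw logits and averages over samples by default
builtin = nn.CrossEntropyLoss()(logits, targets)

# Manual equivalent: -sum_k y_k * log(p_k) per sample, then the mean over samples
manual = -(targets * torch.log(probs)).sum(dim=1).mean()

print(builtin.item(), manual.item())
```

Because the loss already includes the softmax step, passing softmax outputs into it would apply the normalization twice.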

For this experiment, our loop is virtually identical to the one in Exercise #1. We use the same learning rates and the same accuracy tolerance. The only thing that has changed is the loss function.

As such, we can directly compare how quickly this method converges against the previous one. A lower iteration count at the same learning rate means it reaches the target accuracy faster.

In [7]:
learning_rate = [0.5, 0.05, 0.005, 0.0005]
for rate in learning_rate:
    W = torch.tensor(initialW,requires_grad=True,device='cpu');
    b = torch.zeros((1,n_classes),requires_grad=True,device='cpu');
    optimizer = torch.optim.Adam([W,b],lr=rate)
    iteration_limit = 100000; #Desired maximum iterations
    tol = 0.95 # Desired Accuracy
    i = 0
    accuracy = 0
    while accuracy < tol and i < iteration_limit:
        # clear previous gradient calculations
        optimizer.zero_grad()
        # calculate model predictions (raw class scores, i.e. logits)
        linear_predictions = torch.matmul(t_x_train,W)+b
        # sigmoid activations are only used for the accuracy and error reporting below
        activations = 1.0 / (1.0 + torch.exp(-linear_predictions))
        norm_predictions = sm(linear_predictions)
        # calculate loss; CrossEntropyLoss takes the raw scores and already averages over samples
        loss = lossModel(linear_predictions, t_y_train)
        risk = torch.mean(loss)
        # calculate gradients of risk w.r.t. W,b and propagate them back
        risk.backward()
        # use the gradient to change W, b
        optimizer.step()
        # calculate accuracy (on the training set!)
        true_class = np.argmax(t_y_train.detach().cpu().numpy(),axis=1)
        pred_class = np.argmax(activations.detach().cpu().numpy(),axis=1)
        accuracy = np.count_nonzero(true_class == pred_class)/pred_class.shape[0]
        # absolute value of the mean signed difference between targets and sigmoid activations
        prediction_error = np.abs(np.mean(t_y_train.detach().numpy()-activations.detach().numpy()))
        i = i+1
    print('End of loop results:')
    print('Completed in '+str(i)+' iterations with '+str(round(accuracy*100,4))+' percent accuracy and an error of '+str(round(prediction_error,4))+' on training data.')
    print('--------------------------------')

End of loop results:
Completed in 10 iterations with 95.3333 percent accuracy and an error of 0.275 on training data.
--------------------------------
End of loop results:
Completed in 34 iterations with 95.3333 percent accuracy and an error of 0.1877 on training data.
--------------------------------
End of loop results:
Completed in 328 iterations with 95.3333 percent accuracy and an error of 0.1894 on training data.
--------------------------------
End of loop results:
Completed in 2864 iterations with 95.3333 percent accuracy and an error of 0.2064 on training data.
--------------------------------


Cross entropy loss converges much faster than mean squared error did in Exercise #1. However, it's interesting to note that the reported error is noticeably higher than in the previous exercise. This is an expected result: when optimizing mean squared error directly, one would expect a lower mean squared error.
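
One common explanation for the speed difference is the size of the gradient on confidently wrong predictions. The sketch below uses a single made-up sample and assumes the earlier exercise applied mean squared error to sigmoid activations (an assumption, since that code is not shown here). With cross entropy, the gradient with respect to the raw scores is simply softmax(z) - y, while the sigmoid-plus-MSE gradient carries an extra factor s(1 - s) that shrinks toward zero when the sigmoid saturates.

```python
import torch
import torch.nn as nn

# One made-up sample whose true class (index 0) gets a very low raw score
z = torch.tensor([[-6.0, 6.0, 0.0]], dtype=torch.float64, requires_grad=True)
y = torch.tensor([[1.0, 0.0, 0.0]], dtype=torch.float64)

# Cross entropy on the raw scores; its gradient w.r.t. z is softmax(z) - y
ce = nn.CrossEntropyLoss()(z, y)
(ce_grad,) = torch.autograd.grad(ce, z)

# Mean squared error on sigmoid activations (assumed setup of the earlier exercise)
mse = torch.mean((torch.sigmoid(z) - y) ** 2)
(mse_grad,) = torch.autograd.grad(mse, z)

print("cross entropy gradient:", ce_grad)
print("sigmoid + MSE gradient:", mse_grad)
```

For the misclassified true class, the cross entropy gradient stays close to -1, whereas the sigmoid-plus-MSE gradient is orders of magnitude smaller, so each update moves the weights far less.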

The speed advantage of cross entropy loss makes this version of the model much more viable for high-accuracy models. As demonstrated in the final section of Exercise #1, a model that attains 99.9% accuracy was unattainable in fewer than 100,000 iterations.

In [13]:
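learning_rate = 0.05
W = torch.tensor(initialW,requires_grad=True,device='cpu');
b = torch.zeros((1,n_classes),requires_grad=True,device='cpu');
# NOTE: the optimizer is given `rate`, the variable left over from the learning-rate loop above
# (its last value there is 0.0005), not the `learning_rate` defined in this cell
optimizer = torch.optim.Adam([W,b],lr=rate)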
iteration_limit = 100000; #Desired maximum iterations
tol = 0.9866 # Desired Accuracy
i = 0
accuracy = 0
while accuracy < tol and i < iteration_limit:
    # clear previous gradient calculations
    optimizer.zero_grad()
    # calculate model predictions (raw class scores, i.e. logits)
    linear_predictions = torch.matmul(t_x_train,W)+b
    # sigmoid activations are only used for the accuracy and error reporting below
    activations = 1.0 / (1.0 + torch.exp(-linear_predictions))
    norm_predictions = sm(linear_predictions)
    # calculate loss; CrossEntropyLoss takes the raw scores and already averages over samples
    loss = lossModel(linear_predictions, t_y_train)
    risk = torch.mean(loss)
    # calculate gradients of risk w.r.t. W,b and propagate them back
    risk.backward()
    # use the gradient to change W, b
    optimizer.step()
    # calculate accuracy (on the training set!)
    true_class = np.argmax(t_y_train.detach().cpu().numpy(),axis=1)
    pred_class = np.argmax(activations.detach().cpu().numpy(),axis=1)
    accuracy = np.count_nonzero(true_class == pred_class)/pred_class.shape[0]
    # absolute value of the mean signed difference between targets and sigmoid activations
    prediction_error = np.abs(np.mean(t_y_train.detach().numpy()-activations.detach().numpy()))
    i = i+1
print('End of loop results:')
print('Completed in '+str(i)+' iterations with '+str(round(accuracy*100,4))+' percent accuracy and an error of '+str(round(prediction_error,4))+' on training data.')
print('--------------------------------')

End of loop results:
Completed in 18007 iterations with 98.6667 percent accuracy and an error of 0.2529 on training data.
--------------------------------


As you can see above, we are able to attain 98.67% accuracy in only ~18,000 iterations, less than half the iterations mean squared error needed to achieve the same accuracy. Despite the higher prediction error, this speed advantage is probably more than worth the trade-off.
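
Since both training cells repeat the same loop, it could be wrapped in a function so the learning rate and tolerance are explicit arguments. The sketch below is one possible way to do that, not the notebook's own code: `train_softmax` is a hypothetical helper, and the three-cluster data is synthetic so the block runs without `iris.csv`.

```python
import numpy as np
import torch
import torch.nn as nn


def train_softmax(x, y_onehot, lr, tol, iteration_limit=100000,
                  init_std_dev=0.01, seed=0):
    """Hypothetical helper: fit W, b with Adam until training accuracy reaches tol."""
    rng = np.random.default_rng(seed)
    n_features, n_classes = x.shape[1], y_onehot.shape[1]
    # small random initial weights and zero bias, both float64 leaf tensors
    W = torch.tensor(init_std_dev * rng.standard_normal((n_features, n_classes)),
                     requires_grad=True)
    b = torch.zeros((1, n_classes), dtype=torch.float64, requires_grad=True)
    optimizer = torch.optim.Adam([W, b], lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    true_class = y_onehot.argmax(dim=1)
    accuracy, i = 0.0, 0
    while accuracy < tol and i < iteration_limit:
        optimizer.zero_grad()                # clear previous gradients
        logits = torch.matmul(x, W) + b      # raw class scores
        loss = loss_fn(logits, y_onehot)     # averaged cross entropy
        loss.backward()                      # gradients w.r.t. W and b
        optimizer.step()                     # Adam update
        # training accuracy of the predictions made before this update
        accuracy = (logits.argmax(dim=1) == true_class).double().mean().item()
        i += 1
    return i, accuracy


# Synthetic stand-in data: three well-separated 2-D clusters, 30 points each
data_rng = np.random.default_rng(42)
centers = np.array([[-3.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
labels = np.repeat(np.arange(3), 30)
x = torch.tensor(centers[labels] + 0.5 * data_rng.standard_normal((90, 2)))
y = torch.eye(3, dtype=torch.float64)[torch.tensor(labels)]

results = {}
for lr in [0.5, 0.05]:
    results[lr] = train_softmax(x, y, lr=lr, tol=0.95)
    print(lr, results[lr])
```

Passing the learning rate as an argument avoids accidentally reusing a variable from an earlier cell, and returning the iteration count makes runs easy to compare side by side.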