# Deep Learning Assignment

**Question:** What is categorical cross-entropy and explain why it's a much better cost function than accuracy, and why it might sometimes be a better choice than mean squared error?

**Answer:** Categorical cross-entropy is a loss function defined by $-\sum_{i=1}^{n} y_i \cdot \log \widehat{y_i}$, where $\widehat{y_i}$ is the $i$-th scalar value in the model output, $y_i$ is the corresponding target value, and $n$ is the number of scalar values in the model output. It is used for single-label classification, i.e. when an example can belong to one class only. It is a much better cost function than accuracy because accuracy only counts whether the predicted class is right: it changes in jumps and gives the optimizer no gradient to follow, whereas cross-entropy changes smoothly with the predicted probabilities. Categorical cross-entropy is generally better suited to classification tasks, while mean squared error is better suited to regression problems. For classification, MSE does not punish confident misclassifications enough: its error on each output is bounded, whereas the cross-entropy grows without bound as the probability assigned to the true class approaches zero. This is why it is sometimes a better choice than mean squared error.
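
As a minimal worked sketch of the formula, the NumPy code below evaluates the loss for a three-class example. The probability vectors are made up for illustration and are not outputs of any model in this notebook, and the helper names are my own. It compares how cross-entropy and MSE score a hesitant wrong prediction and a confident wrong one; accuracy treats both as simply wrong.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """-sum_i y_i * log(y_hat_i) for a single example."""
    y_pred = np.clip(y_pred, eps, 1.0)  # guard against log(0)
    return float(-np.sum(y_true * np.log(y_pred)))

def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences over the output values."""
    return float(np.mean((y_true - y_pred) ** 2))

# One-hot target: the true class is class 1.
y_true = np.array([0.0, 1.0, 0.0])

# Two illustrative (made-up) predictions, both of which pick the wrong class.
hesitant_wrong = np.array([0.50, 0.40, 0.10])
confident_wrong = np.array([0.98, 0.01, 0.01])

for name, y_pred in [("hesitant", hesitant_wrong), ("confident", confident_wrong)]:
    correct = np.argmax(y_pred) == np.argmax(y_true)
    print(
        f"{name:>9}: cross-entropy={categorical_cross_entropy(y_true, y_pred):.4f}, "
        f"MSE={mean_squared_error(y_true, y_pred):.4f}, correct={correct}"
    )
```

The cross-entropy rises much more steeply than the MSE between the two predictions, which is the sense in which it punishes confident mistakes harder.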

In [21]:
#Import modules 
import tensorflow as tf
from tensorflow import keras
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from keras import Sequential
from keras.layers import Dense
%matplotlib inline

In [22]:
#Load the dataset
df = pd.read_csv('MNIST_train.csv')

#Inspect the dataset
df.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In this dataset the column 'label' is the target variable: it tells us which digit is actually in the picture. The rest of the columns are the predictors and hold the pixel values of the image. Knowing this, let's split the data into the targets and predictors.

In [23]:
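#splitting data into target and predictors
labels = df['label'].values
# one-hot encode the 10 digit classes so they match the 10-unit softmax output
target = keras.utils.to_categorical(labels,10)
# If 'Unnamed: 0' is a real column rather than the displayed index, it should be
# dropped here as well so that exactly 784 pixel columns remain.
predictors = df.drop('label',axis=1).values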
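
To illustrate what the one-hot encoding step produces, here is a minimal NumPy sketch. `one_hot` is a hypothetical helper written for this illustration, not the Keras implementation, although for integer labels in range I would expect it to give the same array as `keras.utils.to_categorical`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return an array of shape (len(labels), num_classes) with a single 1.0 per row."""
    labels = np.asarray(labels, dtype=int)
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(num_classes, dtype="float32")[labels]

# The five labels shown in the df.head() output.
example_labels = [1, 0, 1, 4, 0]
encoded = one_hot(example_labels, 10)

print(encoded.shape)
print(encoded[3])  # the row for label 4
```

Each row sums to 1 and has its only non-zero entry at the index of the digit, which is the target format that `categorical_crossentropy` expects.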

In [24]:
#let's split the data into test data and training data
X_train, X_test, y_train, y_test = train_test_split(predictors,target, test_size=0.3, random_state=0)

In [25]:
# Create the model: model
model = Sequential()

# Add the first hidden layer
model.add(Dense(50,activation='relu',input_shape=(784,)))

# Add the second hidden layer
model.add(Dense(50,activation='relu'))

# Add the output layer
model.add(Dense(10,activation='softmax'))

# Compile the model
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

# Fit the model
model.fit(X_train,y_train,validation_split=0.3)
model.evaluate(X_test,y_test)



[0.937828779220581, 0.7979364991188049]

First, let's define what the loss function actually does. The loss function is the quantity the optimizer minimises during training. It is computed on both the training and validation sets, and comparing the two shows how well the model fits the data it has seen and how well it generalises. It aggregates the errors made on each example (Keras reports the mean over examples), so the lower the loss, the better the model.
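
A small NumPy sketch of this idea follows. It uses made-up predictions for three examples rather than outputs of the model above, and the function name is my own. The per-example cross-entropy values are combined into a single number (here by averaging), and the loss keeps improving as the predictions become more confident even though the accuracy is identical.

```python
import numpy as np

def batch_loss_and_accuracy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy and accuracy over a batch of one-hot targets."""
    y_pred = np.clip(y_pred, eps, 1.0)  # guard against log(0)
    per_example = -np.sum(y_true * np.log(y_pred), axis=1)
    accuracy = np.mean(np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1))
    return float(per_example.mean()), float(accuracy)

# Three examples, one per class (illustrative values only).
y_true = np.eye(3)
hesitant = np.array([[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]])
confident = np.array([[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]])

hesitant_loss, hesitant_acc = batch_loss_and_accuracy(y_true, hesitant)
confident_loss, confident_acc = batch_loss_and_accuracy(y_true, confident)

print(f"hesitant:  loss={hesitant_loss:.4f}, accuracy={hesitant_acc:.2f}")
print(f"confident: loss={confident_loss:.4f}, accuracy={confident_acc:.2f}")
```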

As we can see, after training for one epoch we have a loss of 0.9378 and an accuracy of 79.79%. Let's see how the accuracy and loss are affected when we run the same model with one hidden layer instead of two. My initial expectation is that the accuracy and loss will be worse, as a more complex model usually achieves better loss and accuracy.

In [26]:
# Create the model: model
model = Sequential()

# Add the first hidden layer
model.add(Dense(50,activation='relu',input_shape=(784,)))

# Add the output layer
model.add(Dense(10,activation='softmax'))

# Compile the model
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

# Fit the model
model.fit(X_train,y_train,validation_split=0.3)
model.evaluate(X_test,y_test)



[1.1241025924682617, 0.6874603033065796]

As we can see, our guess was correct: the loss is now 1.1241 and the accuracy is 68.75%, which is worse than before. Let's look at the effect of running the model for more than one epoch and see whether this changes the performance.

# Running for more Epochs

In [27]:
# Create the model: model
model = Sequential()

# Add the first hidden layer
model.add(Dense(50,activation='relu',input_shape=(784,)))

# Add the second hidden layer
model.add(Dense(50,activation='relu'))

# Add the output layer
model.add(Dense(10,activation='softmax'))

# Compile the model
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

# Fit the model
model.fit(X_train,y_train,validation_split=0.3, epochs=20)
model.evaluate(X_test,y_test)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


[0.42005637288093567, 0.9305555820465088]

After running the model with two hidden layers for 20 epochs, the loss is now 0.4201 and the accuracy is 93.06%. This is a significant improvement on before. Let's see if the model with one hidden layer follows the same pattern.

In [28]:
# Create the model: model
model = Sequential()

# Add the first hidden layer
model.add(Dense(50,activation='relu',input_shape=(784,)))

# Add the output layer
model.add(Dense(10,activation='softmax'))

# Compile the model
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

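# Fit the model
# Note: validation_data takes precedence over validation_split in Keras, so this run
# validates on the test set and trains on all of X_train, unlike the other cells.
model.fit(X_train,y_train, validation_data=(X_test,y_test), validation_split=0.3, epochs=20)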
model.evaluate(X_test,y_test)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


[0.3921052813529968, 0.9253174662590027]

As we can see, the accuracy and loss after 20 epochs have also improved for the model with one hidden layer. This model now has a loss of 0.3921 and an accuracy of 92.53%, compared with a loss of 1.1241 and an accuracy of 68.75% when run for one epoch.

# Implementing EarlyStopping

Models usually converge to a certain level of performance, and it is often the case that the time taken to run an extra epoch is not worth the small improvement. This raises the question: what is a good number of epochs to train for? Keras provides a callback called EarlyStopping, which stops training when the monitored quantity (the validation loss by default) has not improved for n epochs. Here I set the patience (n) to 2 for experimental purposes, which means training stops once the validation loss has not improved for two consecutive epochs. I also specify epochs=50; it is fine to set this to a large number because training will be stopped early anyway.
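
The patience rule can be sketched in plain Python. This is a simplified illustration of the idea and not the Keras implementation; the function name and the validation losses are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch after which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0  # epochs since the best validation loss was last improved
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: improvement stalls after epoch 6.
hypothetical_val_losses = [0.90, 0.60, 0.50, 0.45, 0.44, 0.43, 0.44, 0.45, 0.40]
print(early_stopping_epoch(hypothetical_val_losses, patience=2))
```

With these made-up numbers training stops after epoch 8, even though epoch 9 would have been better. That is the trade-off a small patience makes: less training time at the risk of stopping just before a further improvement.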

In [29]:
#import EarlyStopping
from keras.callbacks import EarlyStopping

# Create the model: model
model = Sequential()

# Add the first hidden layer
model.add(Dense(50,activation='relu',input_shape=(784,)))

# Add the second hidden layer
model.add(Dense(50,activation='relu'))

# Add the output layer
model.add(Dense(10,activation='softmax'))

# Compile the model
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

#define early_stopping_monitor
early_stopping_monitor = EarlyStopping(patience=2)

# Fit the model (specify epochs=50 here as it will stop early regardless)
model.fit(X_train,y_train,validation_split=0.3, epochs=50,callbacks=[early_stopping_monitor])
model.evaluate(X_test,y_test)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50


[0.43639492988586426, 0.9123809337615967]

As we can see, our model with two hidden layers stopped training after 8 epochs. The accuracy is now lower than before (91.24% versus 93.06% after 20 epochs), but if we were running a model with a few thousand hidden layers, the extra run time of 20 epochs over 8 may not be worth the small increase in accuracy. Alternatively, you could experiment with different values for the patience, which could give a better accuracy.

In [30]:
# Create the model: model
model = Sequential()

# Add the first hidden layer
model.add(Dense(50,activation='relu',input_shape=(784,)))

# Add the output layer
model.add(Dense(10,activation='softmax'))

# Compile the model
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

#define early_stopping_monitor
early_stopping_monitor = EarlyStopping(patience=2)

# Fit the model (specify epochs=50 here as it will stop early regardless)
model.fit(X_train,y_train,validation_split=0.3, epochs=50,callbacks=[early_stopping_monitor])
model.evaluate(X_test,y_test)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50


[0.3984324336051941, 0.9171428680419922]

The model with one hidden layer ran for 12 epochs, compared with 8 epochs for the model with two hidden layers. Although the model with one hidden layer got a better accuracy in this case, it ran for 4 more epochs and the gain was only 0.47 percentage points. To conclude, I believe the model with two hidden layers converges more quickly to a comparable accuracy.

# Using Fewer Neurons

As we can see, we got our best accuracies when we ran our models for 20 epochs. Let's keep 20 epochs for both the model with one hidden layer and the model with two hidden layers, and first decrease the number of neurons in these layers to see how this affects the accuracy.

In [31]:
# Create the model: model
model = Sequential()

# Add the first hidden layer
model.add(Dense(30,activation='relu',input_shape=(784,)))

# Add the second hidden layer
model.add(Dense(30,activation='relu'))

# Add the output layer
model.add(Dense(10,activation='softmax'))

# Compile the model
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

# Fit the model
model.fit(X_train,y_train,validation_split=0.3, epochs=20)
model.evaluate(X_test,y_test)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


[0.4507608413696289, 0.9082539677619934]

With two hidden layers we got a loss of 0.4201 when we used 50 neurons and a loss of 0.4508 when we used 30 neurons. Both the loss and the accuracy are better when using 50 neurons. The accuracy is 2.23 percentage points higher (93.06% versus 90.83%), which is quite a lot. Let's see if this follows a similar pattern for our model with only one hidden layer.

In [33]:
# Create the model: model
model = Sequential()

# Add the first hidden layer
model.add(Dense(30,activation='relu',input_shape=(784,)))

# Add the output layer
model.add(Dense(10,activation='softmax'))

# Compile the model
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

# Fit the model
model.fit(X_train,y_train,validation_split=0.3, epochs=20)
model.evaluate(X_test,y_test)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


[0.4794556796550751, 0.8876984119415283]

We see a similar pattern for the model with one hidden layer: with 50 neurons we got a noticeably better loss and accuracy (0.3921 and 92.53%, versus 0.4795 and 88.77% with 30 neurons). One thing to note is that the models didn't take as long to run with fewer neurons, which once again comes back to the problem of balancing run time against performance. Going off this, you would presume that using more neurons would improve the accuracy and loss in both cases, but for argument's sake let's have a look.

# Using More Neurons

In [34]:
# Create the model: model
model = Sequential()

# Add the first hidden layer
model.add(Dense(70,activation='relu',input_shape=(784,)))

# Add the second hidden layer
model.add(Dense(70,activation='relu'))

# Add the output layer
model.add(Dense(10,activation='softmax'))

# Compile the model
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

# Fit the model
model.fit(X_train,y_train,validation_split=0.3, epochs=20)
model.evaluate(X_test,y_test)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


[0.2829357981681824, 0.9526190757751465]

Unsurprisingly, both the accuracy and the loss are much better when we use 70 neurons. But also look at the time it took to run the model. Compare it with the two-hidden-layer model that we ran for 20 epochs with 30 neurons in each hidden layer: the maximum time reported across those 20 epochs was 953 µs, while for the same model with 70 neurons it was 2 ms, which is more than twice as long. In our case the difference is barely noticeable while the model is running, but suppose we were running a model with thousands of hidden layers. It could be the difference between one epoch taking 2 hours and one epoch taking an hour. Let's now look at the output for a model with only one hidden layer of 70 neurons, trained for 20 epochs.
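
One rough way to see why the wider model is slower is to count its trainable parameters. The sketch below assumes fully connected layers with a bias per unit and 784 inputs, as in the `input_shape` used above; `dense_param_count` is a hypothetical helper, and parameter count is only a crude proxy for run time.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers.

    layer_sizes lists the input size followed by the size of each Dense layer.
    """
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

small = dense_param_count([784, 30, 30, 10])  # two hidden layers of 30 neurons
large = dense_param_count([784, 70, 70, 10])  # two hidden layers of 70 neurons

print(small, large, round(large / small, 2))
```

Under these assumptions the 70-neuron model has roughly two and a half times as many parameters as the 30-neuron one, which is at least consistent with it taking more than twice as long.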

In [35]:
# Create the model: model
model = Sequential()

# Add the first hidden layer
model.add(Dense(70,activation='relu',input_shape=(784,)))

# Add the output layer
model.add(Dense(10,activation='softmax'))

# Compile the model
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

# Fit the model
model.fit(X_train,y_train,validation_split=0.3, epochs=20)
model.evaluate(X_test,y_test)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


[0.40199121832847595, 0.9276190400123596]

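With one hidden layer, 70 neurons clearly beat 30 neurons on both loss and accuracy. Compared with 50 neurons, the accuracy is slightly higher (92.76% versus 92.53%) but the loss is slightly worse (0.4020 versus 0.3921), so the gain from 50 to 70 neurons is marginal here. When it comes to run time there isn't much of a difference, but as I mentioned, for large models the difference will be noticeable.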
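
To compare the 20-epoch runs side by side, the short sketch below collects the `evaluate()` outputs reported above (rounded to four decimals) and sorts them by accuracy. The dictionary layout is my own choice; the numbers are copied from the cells above, and each comes from a single training run.

```python
# (hidden layers, neurons per layer) -> (test loss, test accuracy), 20 epochs each.
# Note: the (1, 50) run was fitted with validation_data=(X_test, y_test),
# so it is not strictly comparable with the others.
results = {
    (2, 30): (0.4508, 0.9083),
    (1, 30): (0.4795, 0.8877),
    (2, 50): (0.4201, 0.9306),
    (1, 50): (0.3921, 0.9253),
    (2, 70): (0.2829, 0.9526),
    (1, 70): (0.4020, 0.9276),
}

# Sort the configurations from highest to lowest test accuracy.
ranked = sorted(results.items(), key=lambda item: item[1][1], reverse=True)
for (layers, neurons), (loss, acc) in ranked:
    print(f"{layers} hidden layer(s) x {neurons} neurons: loss={loss:.4f}, accuracy={acc:.2%}")

best_config = ranked[0][0]
```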

# Conclusion

To conclude, getting a model with the best raw performance is fairly straightforward in principle. From our examples we could say that you should use a high number of epochs, a high number of neurons per layer and a high number of hidden layers. In reality, though, there is no point in having a model that takes far too long to run. It is therefore essential to find a 'happy medium', i.e. a model that doesn't take ages to run but still performs well. I mentioned one method that can help with that (EarlyStopping), but there are many more.