## Creating a Neural Network
### Catherine Al Aswad (305541)
### CS 4120 Assignment 3


## 1) Most Code:

In [1]:
# Reading in the necessary packages and setting a seed.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
import sklearn.preprocessing
import sklearn.linear_model
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics


%matplotlib inline

np.random.seed(1) # set a seed so that the results are consistent

In [2]:
# activFunc takes a numeric input x so that x can be inputed into a function 0,1,2,3,4
# option == 0 : plugs x into a sigmoid function
# option == 1 : plugs x into a tanh function
# option == 2 : plugs x into a ReLu function
# option == 3 : plugs x into a leaky ReLu function
# option == 4 : plugs x into a softmax function
# returns 'a' for the chosen function
def activFunc(x, option):
    a = 0
    if option == 0:           # sigmoid
        a = 1/(1+np.exp(-x))
        
    elif option == 1:    # tanh
        a = (np.exp(x) - np.exp(-x))/(np.exp(x) +np.exp(-x))
   

    elif option == 2:   # ReLu
        a = np.maximum(0,x)
            
    elif option == 3:   # leaky ReLu
        a = np.maximum(0.01*x,x)
        
    elif option == 4:    # softmax
        a = np.exp(x)/ np.exp(x).sum(axis = 0) 

    return a

# activFuncDeriv takes a numeric input x so that x can be inputed into a function 0,1,2,3
# option == 0 : plugs x into the derivative of the sigmoid function
# option == 1 : plugs x into the derivative of the tanh function
# option == 2 : plugs x into the derivative of the ReLu function
# option == 3 : plugs x into the derivative of the leaky ReLu function
# returns 'da' for the chosen function
def activFuncDeriv(x, option):
    da = 0
    if option == 0:           # sigmoid derivative
        da = x - np.power(x, 2)
        
    elif option == 1:    # tanh derivative
        da =  1 - np.power(x, 2)
        
    elif option == 2:   # ReLu derivative,  x>0 then it is 1, otherwize 0
        da = np.greater(x, 0).astype(int)

    elif option == 3:   # leaky ReLu derivative
        da = np.where(-5 <= 0, 0.01, 1)
        
    return da

# load_mnist_dataset: loads the mnist dataset. 
# The X features are scaled
# The target Y is changed from one column to 10 columns, one for each digit 0 to 9, to create a multiclass label
# The data is split into 40% test data and 60% train data 
# The method returns the feautures and labels of the train and test data
def load_mnist_dataset():
    np.random.seed(1)
    mnist = fetch_openml('mnist_784', version=1)
    # print(mnist.keys())
    X = mnist['data'].astype(np.float32)
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    Y = mnist['target'].astype(np.float32)
    multilabel = OneHotEncoder(sparse=False)
    Y = multilabel.fit_transform(Y.reshape(-1, 1))
    
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=4)

    
    return X_train.T, X_test.T, y_train.T, y_test.T

In [3]:
# loading the mnist dataset and creating the test and train datasets
X_train, X_test, y_train, y_test = load_mnist_dataset()

In [4]:
# init_params initializes the parameters of our network to small non-zero random weights
# n_x represents the number of input features
# n_h represents the number of hidden units (in the hidden layer)
# n_y represents the number of output units
# n_hl represents the number of hidden layers; there is at least 2 hidden layers
# The function can initialize weights for more than 2 layers - for as much as inducated by n_hl
# The function returns a dictionary params that stores the weights and b's
def init_params(n_x, n_h, n_y, n_hl):
    
    params = {}
    np.random.seed(2)
    
    # first hidder layer
    W1 = np.random.randn(n_h, n_x)* 0.01
    b1 = np.zeros((n_h, 1))
    params["W1"] = W1
    params["b1"] = b1

    # The size of the weights and b matrices for the 2nd to the n_hl - 1 hidder layer are the same in this simple neural network
    for i in range(2,n_hl):
        W = np.random.randn(n_h, n_h)* 0.01
        b = np.zeros((n_h, 1))
        params["W"+str(i)] = W
        params["b"+str(i)] = b
    
    # last hidder layer
    Wn = np.random.randn(n_y, n_h)* 0.01
    bn = np.zeros((n_y, 1))
    params["W"+str(n_hl)] = Wn
    params["b"+str(n_hl)] = bn
     
    return params

In [5]:
# This function performs forward propagation
# it receives as parameters the matrix X containing the input features for the entire training set
# and the paramters of the network in the dictionary params
# actFun specifies which activation function to use and takes a value of 0 (sigmid) ,1(tanh),2(relu),3(leaky relu)
# Includes inverted dropout (for regularization) with a default keep_prob = 0.6, so 60% of the units will be kept
# Applied to as many layers as the number of weights in the params dictionary, and can be greater than or equal to 2 layers
# The softmax function is applied to the last layer of the neural network 
#       to calculate the final output (adapting for multiclass classification).
# The function returns the final output A_temp, with a dictionary containig all Zi and Ai matrices
def forward_propagation(X, params, actFun, keep_prob = 0.6):
    
    forwd = {}
    
    # first hidder layer
    W1 = params['W1']
    b1 = params['b1']
    Z1 = np.dot(W1, X) + b1
    A1 = activFunc(Z1, actFun)  
    forwd["Z1"] = Z1
    forwd["A1"] = A1
    A_temp = A1
    
    for i in range(2, int(len(params)/2)):      # hidder layer 2 to lastHiddenLayer-1
        W = params['W'+str(i)]
        b = params['b'+str(i)]
        Z = np.dot(W, A_temp) + b
        A = activFunc(Z, actFun)
        drop = np.random.randn(A.shape[0], A.shape[1]) < keep_prob   # applying inverted dropout (not applied on 1st and last layer)
        A = (np.multiply(A, drop))/keep_prob
        
        forwd["Z"+str(i)] = Z
        forwd["A"+str(i)] = A
        A_temp = A

    # last hidder layer
    Wn = params['W'+str(int(len(params)/2))]
    bn = params['b'+str(int(len(params)/2))]
    Zn = np.dot(Wn, A_temp) + bn
    An = activFunc(Zn, 4)  # softmax layer
    forwd["Z"+str(int(len(params)/2))] = Zn
    forwd["A"+str(int(len(params)/2))] = An
    A_temp = An

    return A_temp, forwd

In [6]:
# Here we compute the cost function over the entire training set
# all we need is the predicted value by the network (Y_pred) 
# and the actual class of the training examples (Y)
def compute_cost(Y_pred, Y):
    m = Y.shape[1] # number of example
    logprobs = np.multiply(np.log(Y_pred),Y) + np.multiply(np.log(1 - Y_pred), (1 - Y))
    cost = - (1/m) * np.sum(logprobs) 
    cost = float(np.squeeze(cost))  # makes sure cost is a real number.
    
    return cost

In [7]:
# This function performs backward propagation
# it receives as parameters the matrix X containing the input features for the entire training set
# and the paramters of the network in the dictionary params
# actFun specifies which activation function to use and takes a value of 0 (sigmid drivative) ,1(tanh drivative),
#        2 (relu drivative),3 (leaky relu drivative)
# Y is the observed labels for the trainign dataset, and 
# forwd is the dictionary that contains the Z ans A matrices.
# Applied to as many layers as the number of weights in the params dictionary, and can be greater than or equal to 2 layers
# The function returns a dictionary grads of the gradients dWi and dbi for all the layers
def backward_propagation(params, forwd, X, Y, actFun):
    m = X.shape[1]
    grads = {}
    
    An = forwd['A'+str(int(len(forwd)/2))]
    dZn = An - Y  #dz in slide 10, result of applying the chain rule (see also slide 18)
    
    dZ_current = dZn     
    for i in range(int(len(params)/2),1,-1):     
        Ak_next = forwd['A'+str(i-1)]
        dW = 1/m*np.dot(dZ_current, Ak_next.T)                 # dW for last hidder layers to 2nd hidder layer
        db = 1/m*np.sum(dZ_current, axis=1, keepdims=True)     # db for last hidder layers to 2nd hidder layer
        grads["dW"+str(i)] = dW
        grads["db"+str(i)] = db
        
        Wcurrent = params['W'+str(i)]
        dZ_next = np.multiply(np.dot(Wcurrent.T, dZ_current), activFuncDeriv(Ak_next , actFun))   # applies the derivative of the activation function
        dZ_current = dZ_next
        
    dW1 = 1/m*np.dot(dZ_current, X.T)     # first hidder layer
    db1 = 1/m*np.sum(dZ_current, axis=1, keepdims=True)
    grads["dW1"] = dW1
    grads["db1"] = db1
    
    return grads

In [8]:
# This function uses the gradients and the learning rate to update the parameters
# learn_rate = 1.2 is the default
# params is a dictionary containing the weight and b's of the layers
# Applied to as many layers as the number of weights in the params dictionary, and can be greater than or equal to 2 layers
# returns a dictionary of the updated parameters
def update_params(params, grads, learn_rate = 1.2):
    params2 = {}
    for i in range(1,int(len(params)/2)+1):
        W = params['W'+str(i)]
        b = params['b'+str(i)]
        dW = grads['dW'+str(i)]
        db = grads['db'+str(i)]
        W = W - learn_rate*dW
        b = b - learn_rate*db
        params2["W"+str(i)] = W
        params2["b"+str(i)] = b
        
    return params2

In [9]:
# Here we create and train the actual Neural Network model
# We receive the dataset features X and classes Y
# we receive the number of hidden units as a hyperparameter (n_h)
# and we also get as a hyperparameter how many iterations to train
# actFun specified the activation function 0 (sigmid) ,1(tanh),2(relu),3(leaky relu), default is sigmoid function
# n_hl represents the number of hidden layers; there is at least 2 hidden layers
# keep_prob for tuning the degree of regularization for the inverted dropout regularization, with a 0.6 default value
# learn_rate for tuning the learning rate of the model, set to a default rate of 3
# retruns the learnt parameters params of the model
def nn_model(X, Y, n_h, actFun = 0, num_iterations = 1, print_cost=False, n_hl=2, keep_prob=0.6 , learn_rate = 3):
    np.random.seed(3)
    n_x = X.shape[0]
    n_y = Y.shape[0] 

    params = init_params(n_x, n_h, n_y, n_hl)

    # This loop is to perform the forward and backward iterations
    for i in range(0, num_iterations):
        # Inside the loop all the computations (forward and backward computations) are vectorized
        A2, forwd = forward_propagation(X, params, actFun, keep_prob)
        cost = compute_cost(A2, Y)
        grads = backward_propagation(params, forwd, X, Y, actFun)
        params = update_params(params, grads, learn_rate)

        if print_cost and i % 100 == 0:
            print ("Cost after iteration %i: %f" %(i, cost))

    return params

In [10]:
# This method uses the learned parameters params and a set of input values X to perform a prediction 
# using activation function actFun 
# and default regularization hyperparameter keep_prob=0.6
def predict(params, X, actFun, keep_prob=0.6):
    Y_pred, forwd = forward_propagation(X, params, actFun, keep_prob)
    
    # turns to a hardmax layer: the digit with the highest probability for a given observation
    # is turned to 1, while the other digit probabilities are replaces with a probability of 0 
    predictions = (Y_pred == Y_pred.T.max(axis=1)[:,None].T).astype(int)  > 0.5

    return predictions

In [11]:
# This function returns a table with the accuracy score (in %) of predicting each digit, the overall average accuracy%, and the 
# mean squared error between the predicted and observed labels.
# modelLabel gives the data summary a model name to be associated with once it is returned
def getAccuracy_MSE (pred, labels, modelLabel = ''):
    totalAccuracy = 0
    # percentages
    Acc_MSE = pd.DataFrame(columns  = ['0','1','2','3','4','5','6','7','8','9','TotalAcc','MSE', 'Model','dataset'])

    for i in range(0,10):   # for all 10 classes  0 to 9
        newAccuracy = float((np.dot(labels[i],pred[i].T))) + np.dot(1-labels[i],1-pred[i].T)
        totalAccuracy = totalAccuracy + newAccuracy
        Acc_MSE.at[0, str(i)] = round(float(newAccuracy/float(labels[i].size)*100), 1)  
        
    Acc_MSE.at[0, 'TotalAcc'] = round(float(totalAccuracy/float(labels.T.size)*100), 1)    # avergae accuracy
    Acc_MSE.at[0, 'MSE'] = round(metrics.mean_squared_error(labels, pred), 4)       # MSE
    Acc_MSE.at[0, 'Model'] = modelLabel
    
    return Acc_MSE

In [49]:
# modelStructure : creates a table with summarized information for a neural network it fits with different structures
# function: the activation function to use in the model: 0 (sigmid) ,1(tanh),2(relu),3(leaky relu)
# HU_LowerLimit : lower limit for the number of hidden units in a layer
# HL_LowerLimit : lower limit for the number of hidden layers
# learnRate : learning rate 
# keepProb : inverted dropout regularization parameter
# nbIterations : number of iterations to use to train the model
# Modelsummary : table that contains the basic information about the neural network structure, 
#                 the accuracy scores, and MSE scores for the train and test data

def modelStructure(function, HU_LowerLimit, HL_LowerLimit, learnRate, keepProb, nbIterations):
    
    # contains the basic information about the neural network structure
    # Model :  model name signified by a number (or model index)
    # dataset : wether the information in the row is for the training or test data
    # HU : number of hidden units in a layer
    # HL : number of hidden layers
    # LR : learning rate
    # nbIter : number of iterations
    # KP : keepProb : inverted dropout regularization parameter
    # Function: the activation function to use in the model: 0 (sigmid) ,1(tanh),2(relu),3(leaky relu)
    Modelsummary = pd.DataFrame(columns  = ['Model','dataset', 'HU','HL', 'LR','nbIter','KP','Function'])
    
    # will contain the accuracy scores and MSE scores for the train and test data
    AnalysisSummary = pd.DataFrame()

    for k in range(0,6):    # training 6 models with randomized number of hiddenUnits and hiddenLayers
        hiddenUnits = np.random.randint(HU_LowerLimit,HU_LowerLimit+40)
        hiddenLayers = np.random.randint(HL_LowerLimit,HL_LowerLimit+40)
        Modelsummary.loc[len(Modelsummary.index)] = [str(k),'train', hiddenUnits, hiddenLayers, learnRate, nbIterations, keepProb, function]
        Modelsummary.loc[len(Modelsummary.index)] = [str(k),'test', hiddenUnits, hiddenLayers, learnRate, nbIterations, keepProb, function]

        # training model
        params = nn_model(X_train, y_train, n_h = hiddenUnits, num_iterations = nbIterations, print_cost=True, actFun = function, n_hl=hiddenLayers, keep_prob=keepProb, learn_rate = learnRate)

        # Computing the total accuracy (%), accuracy for each digit, and MSE
        tr_pred = predict(params, X_train, function, keep_prob = keepProb)    # train data
        tr = getAccuracy_MSE(tr_pred, y_train, str(k))
        tr.at[0, 'dataset'] = 'train'

        te_pred = predict(params, X_test, function, keep_prob = keepProb)    # test data
        te = getAccuracy_MSE(te_pred, y_test, str(k))
        te.at[0, 'dataset'] = 'test'

        AnalysisSummary = pd.concat([AnalysisSummary, (pd.concat([tr, te], axis=0))], axis=0)                        
    
    Modelsummary = pd.merge(Modelsummary, AnalysisSummary, on = ['Model','dataset'])
        
    return Modelsummary

## 2) Preliminary NN Structure Exploration:

There is a large number of hyperparamters: activation function type, number of iterations, learning rate, number of hidden layers, number of hidden units, inverted dropout regularization paramter. Due to time constraints and the use of an average computer with limited space, a grid search cannot be done and completed in time. So, an alterative method of exploring hyperparamter combinations is adopted.

In an attempt to decide on the best structure for the neural network, some simple models are explored. First, 6 models are fitted with the sigmoid function, with a randomized number of hidden units (range from 4 to 44), a randomized number of hidden layers (range from 4 to 42), learnRate = 3, inverted dropdout drops 40% of the units, and 10 iterations. The information is summarized below for the train and test data. Note that the model was not fit with the test data, but the row of the test data does show the neural network structure characteristics they were used on. With a small number of iterations, we see that the models already fit the train and test data very well. The MSE for the train and test data are very small and close to eachother, which shows there may not be any overfitting or underfitting. The models have an approximate total accuracy of 82%. The digit '1' has the lowest accuracy for all the models. There is no significant difference between the prediction accuracies of the models, even though they have different numbers of hidden units and layers.

Note that the cost for each of these models is high, so we will later try to increase the number of iterations to try and reduce the cost.

In [50]:
function = 0    # sigmoid
hiddenUnits_LL = 4
hiddenLayers_LL = 2
learnRate = 3
keepProb = 0.6
nbIter = 10  
modelStructure(function, hiddenUnits_LL, hiddenLayers_LL, learnRate, keepProb, nbIter)

Cost after iteration 0: 3.250903
Cost after iteration 0: 3.250913
Cost after iteration 0: 3.252495
Cost after iteration 0: 3.252330
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.251373


Unnamed: 0,Model,dataset,HU,HL,LR,nbIter,KP,Function,0,1,2,3,4,5,6,7,8,9,TotalAcc,MSE
0,0,train,7,32,3,10,0.6,0,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
1,0,test,7,32,3,10,0.6,0,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
2,1,train,8,16,3,10,0.6,0,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
3,1,test,8,16,3,10,0.6,0,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
4,2,train,36,27,3,10,0.6,0,90.1,16.3,90.1,89.8,90.2,91.1,90.2,84.4,90.3,90.1,82.3,0.1775
5,2,test,36,27,3,10,0.6,0,90.1,16.5,89.9,89.8,90.4,90.8,90.1,84.6,90.2,89.9,82.3,0.1775
6,3,train,25,20,3,10,0.6,0,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
7,3,test,25,20,3,10,0.6,0,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
8,4,train,4,23,3,10,0.6,0,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
9,4,test,4,23,3,10,0.6,0,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774


With the same neural network characteristics as above, we try to use a tanh activation function instead. (The number of hidden units and layers are still randomized here). The information is summarized below for the train and test data. With a small number of iterations, we see that the models already fit the train and test data very well. The MSE for the train and test data are very small and close to eachother, which shows there may not be any overfitting or underfitting. The models have an approximate total accuracy of 82%. The digit '1' has the lowest accuracy for all the models. These observations are very similar to the observations using the sigmoid activation function.

A significant observation that we can mention is about model 1 (2nd model). This model has an accuracy of about 82.7% with a lower MSE score and a slightly smaller cost than the other models. What makes this model different than the others is that it has a large number of hidden units, and a relalively small number of hidden layers. All the other models that use the sigmoid and tanh activation functions do not have this structural property. Later, when we try to improve the model, we can explore a large number of hidden units, and a relalively small number of hidden layers.

Note that the cost for each of these models is high, so we will later try to increase the number of iterations to try and reduce the cost.

In [51]:
function = 1    # tanh
hiddenUnits_LL = 4
hiddenLayers_LL = 2
learnRate = 3
keepProb = 0.6
nbIter = 10 
modelStructure(function, hiddenUnits_LL, hiddenLayers_LL, learnRate, keepProb, nbIter)

Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250646
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830


Unnamed: 0,Model,dataset,HU,HL,LR,nbIter,KP,Function,0,1,2,3,4,5,6,7,8,9,TotalAcc,MSE
0,0,train,29,32,3,10,0.6,1,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
1,0,test,29,32,3,10,0.6,1,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
2,1,train,42,3,3,10,0.6,1,89.0,75.6,86.0,89.8,88.5,85.6,76.6,84.5,71.9,79.5,82.7,0.173
3,1,test,42,3,3,10,0.6,1,89.0,75.7,85.8,89.8,88.7,85.5,76.6,85.0,71.8,79.3,82.7,0.1728
4,2,train,16,26,3,10,0.6,1,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
5,2,test,16,26,3,10,0.6,1,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
6,3,train,19,5,3,10,0.6,1,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
7,3,test,19,5,3,10,0.6,1,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
8,4,train,26,30,3,10,0.6,1,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
9,4,test,26,30,3,10,0.6,1,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774


With the same neural network characteristics used so far, we try to use a ReLu activation function instead. (The number of hidden units and layers are still randomized here). The information is summarized below for the train and test data. We see that the observations that can be made here are identical to the observations made about the first set of models that use the sigmoid activation function.

In [15]:
function = 2    # relu
hiddenUnits_LL = 4
hiddenLayers_LL = 2
learnRate = 3
keepProb = 0.6
nbIter = 10 
modelStructure(function, hiddenUnits_LL, hiddenLayers_LL, learnRate, keepProb, nbIter)

Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830


Unnamed: 0,Model,dataset,HU,HL,LR,nbIter,KP,Function,0,1,2,3,4,5,6,7,8,9,TotalAcc,MSE
0,0,train,27,6,3,10,0.6,2,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
1,0,test,27,6,3,10,0.6,2,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
2,1,train,33,12,3,10,0.6,2,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
3,1,test,33,12,3,10,0.6,2,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
4,2,train,10,13,3,10,0.6,2,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
5,2,test,10,13,3,10,0.6,2,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
6,3,train,14,13,3,10,0.6,2,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
7,3,test,14,13,3,10,0.6,2,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
8,4,train,27,11,3,10,0.6,2,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
9,4,test,27,11,3,10,0.6,2,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774


With the same neural network characteristics used so far, we try to use a leaky ReLu activation function instead. (The number of hidden units and layers are still randomized here). The information is summarized below for the train and test data. We see that the observations that can be made here are identical to the observations made about the first set of models that use the sigmoid function.

In [16]:
function = 3  # leaky relu    
hiddenUnits_LL = 4
hiddenLayers_LL = 2
learnRate = 3
keepProb = 0.6
nbIter = 10 
modelStructure(function, hiddenUnits_LL, hiddenLayers_LL, learnRate, keepProb, nbIter)

Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830


Unnamed: 0,Model,dataset,HU,HL,LR,nbIter,KP,Function,0,1,2,3,4,5,6,7,8,9,TotalAcc,MSE
0,0,train,28,20,3,10,0.6,3,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
1,0,test,28,20,3,10,0.6,3,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
2,1,train,36,8,3,10,0.6,3,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
3,1,test,36,8,3,10,0.6,3,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
4,2,train,26,5,3,10,0.6,3,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
5,2,test,26,5,3,10,0.6,3,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
6,3,train,25,21,3,10,0.6,3,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
7,3,test,25,21,3,10,0.6,3,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
8,4,train,10,39,3,10,0.6,3,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
9,4,test,10,39,3,10,0.6,3,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774


## 3) Number of Hidden Units and Layers Tunning:

From the initial models, we saw that we need to explore a model structure with a large number of hidden units, and a relalively small number of hidden layers. This observation was made when looking at the models that use the tanh function. The other models did not show any significant indications as to what model structure is better. Below we re-fit 6 models with each of the sigmoid, tanh, relu, and leaky relu activation functions in an attempt to increase model accuracy, improve prediction capability, and reduce MSE from the train and test dataset. This approach is taken to reduce any presence of underfitting.

Below, we fit 6 models using the sigmoid activation function, with learnRate = 3, keepProb = 0.6, and 10 iterations. The number of hidden layers were randomly chosen, such that they are reasonaly less than the number of hidden units in a layer. The total accuracy for the models is still about 82% and almost the same MSE as before. 

In [17]:
function = 0    # sigmoid
hiddenUnits_LL = 40
hiddenLayers_LL = 4
learnRate = 3
keepProb = 0.6
nbIter = 10  
modelStructure(function, hiddenUnits_LL, hiddenLayers_LL, learnRate, keepProb, nbIter)

Cost after iteration 0: 3.254604
Cost after iteration 0: 3.253265
Cost after iteration 0: 3.252980
Cost after iteration 0: 3.254291
Cost after iteration 0: 3.253270
Cost after iteration 0: 3.252078


Unnamed: 0,Model,dataset,HU,HL,LR,nbIter,KP,Function,0,1,2,3,4,5,6,7,8,9,TotalAcc,MSE
0,0,train,78,10,3,10,0.6,0,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1775
1,0,test,78,10,3,10,0.6,0,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
2,1,train,51,15,3,10,0.6,0,89.7,88.8,90.0,86.7,90.2,91.1,90.2,14.2,90.3,90.1,82.1,0.1787
3,1,test,51,15,3,10,0.6,0,89.7,88.7,89.9,86.8,90.4,90.8,90.1,13.5,90.2,89.9,82.0,0.1799
4,2,train,65,28,3,10,0.6,0,88.6,19.5,90.1,89.8,90.2,91.1,90.2,82.5,90.3,90.1,82.2,0.1776
5,2,test,65,28,3,10,0.6,0,88.7,19.9,89.9,89.8,90.4,90.8,90.1,82.9,90.2,89.9,82.3,0.1773
6,3,train,75,43,3,10,0.6,0,90.1,11.3,90.1,89.7,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
7,3,test,75,43,3,10,0.6,0,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
8,4,train,76,31,3,10,0.6,0,90.1,14.8,90.1,89.8,86.4,91.1,90.2,89.3,90.3,90.1,82.2,0.1777
9,4,test,76,31,3,10,0.6,0,90.1,14.8,89.9,89.8,86.6,90.8,90.1,90.0,90.2,89.9,82.2,0.1777


Below, we fit 6 models using the tanh activation function, with learnRate = 3, keepProb = 0.6, and 10 iterations. The number of hidden layers were randomly chosen, such that they are reasonaly less than the number of hidden units in a layer. The total accuracy for the models is still about 82% and almost the same MSE as before. It does not appear that the difference between the number of hidden units and hidden layers made any improvements to the model. 

This is also the case with the models that use the Relu and leaky Relu activation functions in the next two tables.

In [18]:
function = 1    # tanh
hiddenUnits_LL = 40
hiddenLayers_LL = 4
learnRate = 3
keepProb = 0.6
nbIter = 10 
modelStructure(function, hiddenUnits_LL, hiddenLayers_LL, learnRate, keepProb, nbIter)

Cost after iteration 0: 3.250829
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830


Unnamed: 0,Model,dataset,HU,HL,LR,nbIter,KP,Function,0,1,2,3,4,5,6,7,8,9,TotalAcc,MSE
0,0,train,70,6,3,10,0.6,1,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
1,0,test,70,6,3,10,0.6,1,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
2,1,train,63,39,3,10,0.6,1,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
3,1,test,63,39,3,10,0.6,1,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
4,2,train,78,8,3,10,0.6,1,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
5,2,test,78,8,3,10,0.6,1,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
6,3,train,50,36,3,10,0.6,1,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
7,3,test,50,36,3,10,0.6,1,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
8,4,train,62,25,3,10,0.6,1,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
9,4,test,62,25,3,10,0.6,1,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774


In [19]:
function = 2    # relu
hiddenUnits_LL = 40
hiddenLayers_LL = 10
learnRate = 3
keepProb = 0.6
nbIter = 10 
modelStructure(function, hiddenUnits_LL, hiddenLayers_LL, learnRate, keepProb, nbIter)

Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830


Unnamed: 0,Model,dataset,HU,HL,LR,nbIter,KP,Function,0,1,2,3,4,5,6,7,8,9,TotalAcc,MSE
0,0,train,43,26,3,10,0.6,2,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
1,0,test,43,26,3,10,0.6,2,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
2,1,train,45,45,3,10,0.6,2,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
3,1,test,45,45,3,10,0.6,2,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
4,2,train,63,31,3,10,0.6,2,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
5,2,test,63,31,3,10,0.6,2,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
6,3,train,69,32,3,10,0.6,2,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
7,3,test,69,32,3,10,0.6,2,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
8,4,train,57,25,3,10,0.6,2,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
9,4,test,57,25,3,10,0.6,2,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774


In [20]:
function = 3  # leaky relu    
hiddenUnits_LL = 40
hiddenLayers_LL = 10
learnRate = 3
keepProb = 0.6
nbIter = 10 
modelStructure(function, hiddenUnits_LL, hiddenLayers_LL, learnRate, keepProb, nbIter)

Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830
Cost after iteration 0: 3.250830


Unnamed: 0,Model,dataset,HU,HL,LR,nbIter,KP,Function,0,1,2,3,4,5,6,7,8,9,TotalAcc,MSE
0,0,train,61,25,3,10,0.6,3,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
1,0,test,61,25,3,10,0.6,3,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
2,1,train,56,15,3,10,0.6,3,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
3,1,test,56,15,3,10,0.6,3,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
4,2,train,79,23,3,10,0.6,3,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
5,2,test,79,23,3,10,0.6,3,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
6,3,train,77,34,3,10,0.6,3,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
7,3,test,77,34,3,10,0.6,3,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774
8,4,train,48,15,3,10,0.6,3,90.1,11.2,90.1,89.8,90.2,91.1,90.2,89.3,90.3,90.1,82.2,0.1776
9,4,test,48,15,3,10,0.6,3,90.1,11.3,89.9,89.8,90.4,90.8,90.1,90.0,90.2,89.9,82.3,0.1774


The MSE scores for the train and test dataset are close to each other, in the previous models. For now, we can assume that this closeness indicates little to no overfitting. So, we will try to improve the accuracy of our best model structure so far by increasing the number of iterations. 

The best model so far has:
- an overall accuracy of 82.7%
- train MSE= 0.173 and test MSE= 0.1728
- uses the tanh activation function
- 41 hidden units
- 3 hidden layers
- learning rate equal to 3
- inverted dropout keep_prob parameter = 0.6
- 10 iterations


## 4) Number of Iterations Tuning:

Moving forward, we will explore and try to improve on the best model so far, with the following structure:
- uses the tanh activation function
- 41 hidden units
- 3 hidden layers

As mentioned before, we are restiricted by time and computer power. The activation functions sigmoid, relu, and leaky relu showed similar results, but nothing exceptional. Although we will not continue exploring them in this assignment, it does not mean better models cannot be found with these activation funcitons.

We will set the learning rate = 0.9 and the inverted dropout keep_prob parameter = 0.6, untill we get to tuning them.

In [21]:
# learnIter : creates a table with summarized information for a neural network it fits with a fixed structure, but different nb of iterations
# function: the activation function to use in the model: 0 (sigmid) ,1(tanh),2(relu),3(leaky relu)
# hiddenUnits : number of hidden units in a layer
# hiddenLayers : number of hidden layers
# learnRate : learning rate 
# keepProb : inverted dropout regularization parameter
# nbIterations : integer, numbers of iterations to train the models with
# Modelsummary : table that contains the basic information about the neural network structure, and 
#                 the accuracy scores and MSE scores for the train and test data

def learnIter(function, hiddenUnits, hiddenLayers, learnRate, keepProb, nbIterations):
    
    # contains the basic information about the neural network structure
    # Model :  model name signified by a number (or model index)
    # dataset : wether the information in the row is for the training or test data
    # HU : number of hidden units in a layer
    # HL : number of hidden layers
    # LR : learning rate
    # nbIter : number of iterations
    # KP : keepProb : inverted dropout regularization parameter
    # Function: the activation function to use in the model: 0 (sigmid) ,1(tanh),2(relu),3(leaky relu)
    Modelsummary = pd.DataFrame(columns  = ['Model','dataset', 'HU','HL', 'LR','nbIter','KP','Function'])
    
    # will contain the accuracy scores and MSE scores for the train and test data
    AnalysisSummary = pd.DataFrame()

    k = 0    # counter for the model being fitted
    Modelsummary.loc[len(Modelsummary.index)] = [str(k),'train', hiddenUnits, hiddenLayers, learnRate, nbIterations, keepProb, function]
    Modelsummary.loc[len(Modelsummary.index)] = [str(k),'test', hiddenUnits, hiddenLayers, learnRate, nbIterations, keepProb, function]

    # training model
    params = nn_model(X_train, y_train, n_h = hiddenUnits, num_iterations = nbIterations, print_cost=True, actFun = function, n_hl=hiddenLayers, keep_prob=keepProb, learn_rate = learnRate)

    # Computing the total accuracy (%), accuracy for each digit, and MSE
    tr_pred = predict(params, X_train, function, keep_prob = keepProb)    # test dataset
    tr = getAccuracy_MSE(tr_pred, y_train, str(k))
    tr.at[0, 'dataset'] = 'train'

    te_pred = predict(params, X_test, function, keep_prob = keepProb)     # train dataset
    te = getAccuracy_MSE(te_pred, y_test, str(k))
    te.at[0, 'dataset'] = 'test'

    AnalysisSummary = pd.concat([AnalysisSummary, (pd.concat([tr, te], axis=0))], axis=0) 
    
    Modelsummary = pd.merge(Modelsummary, AnalysisSummary, on = ['Model','dataset'])
        
    return Modelsummary

Below, we see the cost of the model for every 100 iterations. We took a maximum of 200 iterations. The cost decreased drastically from 10 to 200 iterations. Additionally, the model's accuracy is about 98% for the test data, with an overall high accuracy for each of the digits. The MSE is smaller that what we had before, so the additional iterations did improve the model. Lastly, the MSE for the train and test datasets are close to each other and small, showing that there is little signs of overfitting and underfitting.

Before deducing the best model, let us try running some hyperparameter tunning for the learning rate and the inverted dropout keep_prob parameter, with 200 iterations.

In [22]:
function = 1    # tanh
hiddenUnits = 41
hiddenLayers = 3
learnRate = 0.9
keepProb = 0.6
nbIter = 200
learnIter(function, hiddenUnits, hiddenLayers, learnRate, keepProb, nbIter)

Cost after iteration 0: 3.250680
Cost after iteration 100: 0.454086


Unnamed: 0,Model,dataset,HU,HL,LR,nbIter,KP,Function,0,1,2,3,4,5,6,7,8,9,TotalAcc,MSE
0,0,train,41,3,0.9,200,0.6,1,99.3,99.3,98.8,98.1,98.9,98.2,99.3,98.7,98.3,98.2,98.7,0.0131
1,0,test,41,3,0.9,200,0.6,1,99.3,99.2,98.5,97.8,98.5,97.7,99.0,98.4,98.0,97.9,98.4,0.0156


## 5) Learning Rate and Regularization Tunning:

Building on the best model so far, this section explores different random combinations of learning rates and regularization paramter values for the model with:
- tanh activation function
- 41 hidden units
- 3 hidden layers
- 200 iterations

This model structure is the best so far, since it has the highest train and test accuracies of 98.7% and 98.4% respectively, with small MSE values.
When we trained the model with 200 iterations, the train and test MSE were very close to each other with a difference of 0.0025. However, the difference between the train and test MSE was 0.0002, when the model was trained with 10 iterations. This shows that the variance did slightly increase. To decrease the variance, we will look at inverted dropout keep_prob parameters less than 0.6, which is what we used before. This way, the inverted dropout regularization method with keep less than 60% of the units in each internal layer.

In [45]:
# fixLearningRate_reg : creates a table with summarized information for a neural network it fits with different learning rates and regularization parameter values
# function: the activation function to use in the model: 0 (sigmid) ,1(tanh),2(relu),3(leaky relu)
# hiddenUnits : number of hidden units in a layer
# hiddenLayers : number of hidden layers
# keepProb_UpperLimit : upper limit for the inverted dropout regularization parameter, at most should be equal to 1
# nbIterations : integer, numbers of iterations to train the models with
# learning rate option is a randomized number in the interval [0,0.9)
# keepProb regularization parameter is a randomized number in the interval [0.3, keepProb_UpperLimit)
# Modelsummary : table that contains the basic information about the neural network structure, and 
#                 the accuracy scores and MSE scores for the train and test data

def fixLearningRate_reg(function, hiddenUnits, hiddenLayers, keepProb_UpperLimit , nbIterations):
    
    # contains the basic information about the neural network structure
    # Model :  model name signified by a number (or model index)
    # dataset : wether the information in the row is for the training or test data
    # HU : number of hidden units in a layer
    # HL : number of hidden layers
    # LR : learning rate
    # nbIter : number of iterations
    # KP : keepProb : inverted dropout regularization parameter
    # Function: the activation function to use in the model: 0 (sigmid) ,1(tanh),2(relu),3(leaky relu)
    Modelsummary = pd.DataFrame(columns  = ['Model','dataset', 'HU','HL', 'LR','nbIter','KP','Function'])
    
    # will contain the accuracy scores and MSE scores for the train and test data
    AnalysisSummary = pd.DataFrame()

    for k in range(0,5):       # training 5 models with randomized learning rate and regularization values
        np.random.seed(np.random.randint(100000, size=5)[k])
        learnRate = round(np.random.uniform(0, 0.9),3)
        keepProb = round(np.random.uniform(0.3, keepProb_UpperLimit),3)
        Modelsummary.loc[len(Modelsummary.index)] = [str(k),'train', hiddenUnits, hiddenLayers, learnRate, nbIterations, keepProb, function]
        Modelsummary.loc[len(Modelsummary.index)] = [str(k),'test', hiddenUnits, hiddenLayers, learnRate, nbIterations, keepProb, function]

        # training model
        params = nn_model(X_train, y_train, n_h = hiddenUnits, num_iterations = nbIterations, print_cost=True, actFun = function, n_hl=hiddenLayers, keep_prob=keepProb, learn_rate = learnRate)

        # Computing the total accuracy (%), accuracy for each digit, and MSE
        tr_pred = predict(params, X_train, function, keep_prob = keepProb)     # train dataset
        tr = getAccuracy_MSE(tr_pred, y_train, str(k))
        tr.at[0, 'dataset'] = 'train'

        te_pred = predict(params, X_test, function, keep_prob = keepProb)      # test dataset
        te = getAccuracy_MSE(te_pred, y_test, str(k))
        te.at[0, 'dataset'] = 'test'

        AnalysisSummary = pd.concat([AnalysisSummary, (pd.concat([tr, te], axis=0))], axis=0)                        
    
    Modelsummary = pd.merge(Modelsummary, AnalysisSummary, on = ['Model','dataset'])
        
    return Modelsummary

The table below gives the summary for all the different models. The total accuracy for the first 4 models is not better than the 98.7% train datast and 98.4% test dataset accuracy found with learnRate = 0.9 and keepProb = 0.6. Similarly, their MSE values are not smaller nor closer to what we had before. 

The last model (model 4) has a 98.6% train datast and 98.4% test dataset accuracy, with a 0.0024 MSE difference. This model has slightly less accuracy with the train dataset and slightly larger MSE values, compared to the previous model with learnRate = 0.9 and keepProb = 0.6. It appears that the model preforms better with a learning rate around 0.8 or 0.9, while dropping around 60% of the units in each hidden layer. 

In [46]:
function = 1    # tanh
hiddenUnits = 41
hiddenLayers = 3
keepProb = 0.6
nbIter = 200  
fixLearningRate_reg(function, hiddenUnits, hiddenLayers, keepProb, nbIter)

Cost after iteration 0: 3.250658
Cost after iteration 100: 1.030961
Cost after iteration 0: 3.250612
Cost after iteration 100: 0.818649
Cost after iteration 0: 3.250650
Cost after iteration 100: 2.477276
Cost after iteration 0: 3.250622
Cost after iteration 100: 0.733049
Cost after iteration 0: 3.250675
Cost after iteration 100: 0.489039


Unnamed: 0,Model,dataset,HU,HL,LR,nbIter,KP,Function,0,1,2,3,4,5,6,7,8,9,TotalAcc,MSE
0,0,train,41,3,0.326,200,0.497,1,98.9,98.9,98.3,97.7,98.1,97.6,98.8,98.4,97.4,97.2,98.1,0.0186
1,0,test,41,3,0.326,200,0.497,1,98.9,98.9,98.2,97.5,97.9,97.5,98.7,98.2,97.4,97.2,98.0,0.0196
2,1,train,41,3,0.398,200,0.374,1,90.1,88.8,90.1,89.8,90.2,52.2,90.2,89.3,46.4,90.1,81.7,0.1828
3,1,test,41,3,0.398,200,0.374,1,90.1,88.7,89.9,89.8,90.4,52.1,90.1,89.9,47.0,89.9,81.8,0.182
4,2,train,41,3,0.154,200,0.472,1,98.1,97.8,97.2,95.7,96.2,95.9,98.1,97.0,95.2,94.8,96.6,0.034
5,2,test,41,3,0.154,200,0.472,1,98.2,97.9,97.0,95.5,96.1,96.0,98.1,97.1,95.2,95.1,96.6,0.0338
6,3,train,41,3,0.458,200,0.396,1,90.1,88.8,90.1,89.3,90.2,53.1,89.5,86.9,53.4,89.1,82.0,0.1797
7,3,test,41,3,0.458,200,0.396,1,90.1,88.7,89.9,89.2,90.4,52.7,89.3,87.4,53.0,89.1,82.0,0.1801
8,4,train,41,3,0.799,200,0.578,1,99.3,99.3,98.7,98.0,98.8,98.1,99.2,98.7,98.2,98.1,98.6,0.0136
9,4,test,41,3,0.799,200,0.578,1,99.2,99.2,98.4,97.8,98.5,97.7,99.0,98.3,98.0,97.8,98.4,0.016


In a last attempt to improve the model, a model with the characteristics below is fitted:
- tanh activation function
- 41 hidden units
- 3 hidden layers
- 201 iterations
- learnRate = 0.8
- keepProb = 0.6

The learning rate was taken to be 0.8 since we established that a learnRate of about 0.8, coupled with the above structure characteristics, yields a model with high accuracies. Taking a learning rate = 0.799 is too specific, so learnRate = 0.8 may help the neural network be generalized.
The resulting model has the best accuracies so far with 98.9% train datast and 98.6% test dataset accuracy. Additionally, it has a 0.0115 - 0.0138 = 0.0023 MSE difference.

The best model we had before with the characteristics below:
- tanh activation function
- 41 hidden units
- 3 hidden layers
- 200 iterations
- learnRate = 0.9
- keepProb = 0.6

has a 98.7% train datast and 98.4% test dataset accuracy, and a 0.0131 - 0.0156 = 0.0025 MSE difference.

In [52]:
function = 1    # tanh
hiddenUnits = 41
hiddenLayers = 3
learnRate = 0.8
keepProb = 0.6
nbIter = 201
learnIter(function, hiddenUnits, hiddenLayers, learnRate, keepProb, nbIter)

Cost after iteration 0: 3.250680
Cost after iteration 100: 0.485554
Cost after iteration 200: 0.348898


Unnamed: 0,Model,dataset,HU,HL,LR,nbIter,KP,Function,0,1,2,3,4,5,6,7,8,9,TotalAcc,MSE
0,0,train,41,3,0.8,201,0.6,1,99.4,99.4,98.9,98.4,99.0,98.5,99.4,98.9,98.5,98.3,98.9,0.0115
1,0,test,41,3,0.8,201,0.6,1,99.3,99.3,98.6,98.1,98.6,98.2,99.1,98.6,98.2,98.2,98.6,0.0138


## Conclusion:

The neural network with the characteristics below is the best model found in this assignment:
- tanh activation function
- 41 hidden units
- 3 hidden layers
- 201 iterations
- learnRate = 0.8
- keepProb = 0.6

This model has the best accuracies with 98.9% train datast and 98.6% test dataset accuracy, coupled with the smallest train and test MSE values of 0.0115, 0.0138 respectively. The MSE values are small and close, which shows little to no signs of underfitting. The MSE difference of 0.0023 is small, which shows little to no signs of overfitting.


## Further Questions and Difficulties:

This assignment does not take the best approach to tune all 6 hyperparameters: activation function, nb of hidden units, nb if hidden layers, learning rate, nb of iterations, and the inverted dropout regularization paramter. This is due to time constraints and limited computing power. By applying randomized grid searches, other well-performing and more efficient neural networks may be found. For the same limitations, cross-validation was not performed. However, our confidence in the generalizability of the neural network would be higher if we apply cross-validation.

The inner layers of the neural network, in this assignment, where all taken to have the same number of units and dimensions, for simplicity. Other neural network structures can be explored that have varying number of units in each layer.

Although a neural netowrk was found by increasing the number of iterations, a large number of iterations was not tried. This led to reducing the cost, without seeing a turning point for when it starts to increase. This means that the neural netowrk that was found may still be improved if more iterations are run. However, this is was not further explored in the assignment, since the accuracy was already very high, and the train and test MSE statrted to deviate from each other as the number of iterations increased, which signify overfitting. 

## Bibliography:

- This report builds and adds on the code from the 'Backpropagation, NN_backpropagation' file provided in the CS4120 moodle course page.

Bolufe-Rohler, Anotnio. CS 4120, Machine Learning and Data Minig Course. Concepts taken from lectures: 
- 'Lecture 12 - Intro Neural Networks'
- 'Lecture 13 - Neural Networks Implementation'
- 'Lecture 14 - Deep Neural Networks'
- 'Lecture 16 - Important Concept Deep Learning'
- 'Lecture 17 - Deep Learning Optimization'