# Stochastic Gradient Descent and Mini-Batch Gradient Descent Modifications in HW-3
### Esra Kantarcı - 20160808023

**Details of the task:**

*Hi Everyone, for the previous task you created multi layer perceptron to classify rice data. During the train, you used backpropagation method to get update information and updated the weights. The way you updated the weights is called gradient descent and you used whole dataset as your reference is called batch gradient descent. You already implemented that. But in real-life scenarios generally we can't fit all of the training data into memory. Therefore, we use methods called stochastic gradient descent and mini-batch gradient descent. These methods use single or given size(usually referred to as batch size) samples to update the network.*

*For this task, we expect you to implement SGD and Mini-Batch Gradient Descent into your multilayer perceptron you created last week and test them with rice data.*

In [67]:
import numpy as np
import pandas as pd
import time
import os
from sklearn.utils import shuffle

## Data: 

First, everything we used in Homework-3 stays the same: the simple max-based normalization and the way we take the training and test values are unchanged.
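As a side note on the normalization step, here is a minimal sketch of a common alternative: scaling the test split by the per-column maxima of the training split, so that both splits share one scale. The helper `max_normalize` and the demo arrays are hypothetical names used only for illustration; they are not part of the HW-3 code, which scales each split by its own maxima.

```python
import numpy as np

def max_normalize(train, test):
    """Scale both splits by the per-column maxima of the training split."""
    col_max = train.max(axis=0)          # statistics come from the training data only
    return train / col_max, test / col_max

# Tiny illustrative arrays (not the rice data)
train_demo = np.array([[2.0, 10.0], [4.0, 5.0]])
test_demo = np.array([[1.0, 20.0]])
train_scaled, test_scaled = max_normalize(train_demo, test_demo)
```

With this choice, test values can exceed 1 when a test sample is larger than anything seen during training; that is expected and not an error.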

In [68]:
os.chdir("C:\\Users\\esrac\\Downloads")
df = pd.read_csv('Ricetrain.csv', header=None)
df_shuffled = df.sample(frac=1)
y = df_shuffled.iloc[0:,7].values
y = np.where(y == "Osmancik", -1 ,1)
X = df_shuffled.iloc[0:, 0:7].values

df_test = pd.read_csv('Ricetest.csv', header=None)
df_shuffled_test = df_test.sample(frac=1)

y_test = df_shuffled_test.iloc[0:,7].values
print(y_test)
y_test = np.where(y_test == "Osmancik", -1 ,1)
X_test = df_shuffled_test.iloc[0:, 0:7].values

X_normed = X / X.max(axis=0)
X_test_normed = X_test / X_test.max(axis=0)

['Osmancik' 'Osmancik' 'Osmancik' ... 'Osmancik' 'Cammeo' 'Cammeo']


In [69]:
x_train = X_normed
x_test = X_test_normed
y_train = y
print('Train - Predictors shape', x_train.shape)
print('Test - Predictors shape', x_test.shape)
print('Train - Target shape', y_train.shape)
print('Test - Target shape', y_test.shape)

Train - Predictors shape (2667, 7)
Test - Predictors shape (1143, 7)
Train - Target shape (2667,)
Test - Target shape (1143,)


# Neural Network from HW-3 

The neural network from HW-3 uses a simple gradient descent algorithm.


# Gradient Descent

Let's recall gradient descent from HW-3. In the backpropagation phase we use gradient descent for the weight update: each update moves the weights a small step in the direction that reduces the error on the training samples, with the aim of reaching a minimum of the loss (ideally the global one). We already explained the backbone in HW-3, so I am only adding 2 representative graphics as a reminder.


Here is the neural network from HW-3; we will implement the stochastic and mini-batch versions afterwards. As you might expect, we only need to modify the gradient descent function, and purely for readability we will add a separate training method for each version. (We could override the original instead.)

What we used in our initial gradient descent implementation was batch gradient descent. It updates the weights once per training epoch, so every update uses all of the training examples. It is stable, straightforward to apply, and easy to parallelize. It is rather slow but very effective. Let's take a look at the mini-batch and stochastic versions.
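To make the "one update per epoch" idea concrete, here is a minimal sketch of batch gradient descent on a small linear least-squares problem. It is deliberately not the multilayer perceptron from this notebook; the function `batch_gradient_descent` and the synthetic `X_demo`, `y_demo`, `true_w` are assumptions made up for the illustration.

```python
import numpy as np

def batch_gradient_descent(X, y, learning_rate=0.5, epochs=2000):
    """Fit y ~ X @ w with exactly one weight update per epoch."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        residual = X @ w - y                   # errors for ALL samples at once
        gradient = X.T @ residual / len(y)     # mean gradient of 0.5 * squared error
        w -= learning_rate * gradient          # single update for the whole epoch
    return w

# Synthetic, noise-free data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.random((200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y_demo = X_demo @ true_w

w_batch = batch_gradient_descent(X_demo, y_demo)
```

Because the gradient is averaged over the whole dataset before the step is taken, the update direction is the same no matter how the samples are ordered.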


## Stochastic Gradient Descent

Since the name is a little long, I will refer to it as SGD.
It is a gradient descent variant that updates the weights after every single training sample instead of once per pass over the data. For example, the weights are updated after the first sample, again after the second, and so on, so one epoch performs as many updates as there are samples. Because each update is based on only one sample, the updates are noisy and happen much more often, and finding good weight values can take more time than with batch updates.

The advantages of SGD are that the model improves immediately with every sample and that it is easy to apply to many algorithms. But it also has disadvantages. Because it performs one update per sample, a full pass over the data is computationally more expensive. It takes more time, and since the many updates are noisy, it does not always end up at the best minimum.

To implement this, I only added an `iterations` variable, so that inside 2 nested for-loops the weights get updated on each sample. You can see similar implementations in the references below.
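The per-sample update schedule can be sketched on the same kind of small linear least-squares problem. This is a hedged illustration, not the notebook's `sgdTrain`; the function `sgd` and the synthetic demo arrays are hypothetical names introduced here.

```python
import numpy as np

def sgd(X, y, learning_rate=0.1, epochs=100, seed=0):
    """Fit y ~ X @ w, updating the weights after every single sample."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):      # fresh random order each epoch
            residual = X[i] @ w - y[i]         # error of ONE sample
            w -= learning_rate * residual * X[i]   # immediate update
    return w

# Synthetic, noise-free data (illustrative only)
rng = np.random.default_rng(1)
X_demo = rng.random((200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y_demo = X_demo @ true_w

w_sgd = sgd(X_demo, y_demo)
```

One epoch here performs 200 updates, one per sample, which is why SGD makes many more (and noisier) steps than the batch version for the same number of epochs.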


## Mini-Batch Gradient Descent

Mini-batch gradient descent is a neat compromise: it makes batch gradient descent more robust against getting stuck in local minima and speeds up the computation by working on slices of the training data.

It takes a number of shuffled samples and calculates the weight updates from them in that iteration cycle. Batch gradient descent does the same with all of the training samples; mini-batch does it with a fixed number of shuffled training samples at each epoch.

Batch size itself is a hyperparameter. Small values tend to converge quickly but with a higher error rate. Large values tend to converge more slowly but end up with more accurate predictions.
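A minimal sketch of the mini-batch idea on a small linear least-squares problem follows. It assumes the common variant that sweeps over all mini-batches of a shuffled dataset in every epoch, which differs from the notebook's `miniBatchTrain`, where a single batch of `batchSize` samples is drawn per epoch. The function name and the demo data are hypothetical.

```python
import numpy as np

def minibatch_gradient_descent(X, y, batch_size=32, learning_rate=0.5,
                               epochs=500, seed=0):
    """Fit y ~ X @ w with one weight update per mini-batch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # indices of this mini-batch
            residual = X[idx] @ w - y[idx]
            gradient = X[idx].T @ residual / len(idx)  # mean gradient over the batch
            w -= learning_rate * gradient
    return w

# Synthetic, noise-free data (illustrative only)
rng = np.random.default_rng(2)
X_demo = rng.random((200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y_demo = X_demo @ true_w

w_minibatch = minibatch_gradient_descent(X_demo, y_demo)
```

Setting `batch_size=1` gives the SGD schedule, and setting it to the dataset size gives batch gradient descent, so the batch size moves between the two extremes.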



# Modified Implementation:

In [108]:
class ModifiedNeuralNetwork(object):

    def __init__(self, inputLayer=3, hiddenLayers=[3, 5], lastLayer=2):
       

        self.inputLayer = inputLayer
        self.hiddenLayers = hiddenLayers
        self.lastLayer = lastLayer
        layers = [inputLayer] + hiddenLayers + [lastLayer]
        weights = []
        for i in range(len(layers) - 1):
            w = np.random.rand(layers[i], layers[i + 1])
            weights.append(w)
        self.weights = weights
        
        activations = []
        for i in range(len(layers)):
            a = np.zeros(layers[i])
            activations.append(a)
        self.activations = activations

        derivatives = []
        for i in range(len(layers) - 1):
            d = np.zeros((layers[i], layers[i + 1]))
            derivatives.append(d)
        self.derivatives = derivatives
        
    def forwardProp(self, inputs):
        
        activations = inputs
        self.activations[0] = activations
        
        for i, w in enumerate(self.weights):
            net_inputs = np.dot(activations, w)
            activations = self._sigmoid(net_inputs)
            self.activations[i + 1] = activations
        return activations

    def backProp(self, error):
        for i in reversed(range(len(self.derivatives))):

            activations = self.activations[i+1]
            delta = error * self._sigmoid_derivative(activations)
            delta_re = delta.reshape(delta.shape[0], -1).T

            current_activations = self.activations[i]
            current_activations = current_activations.reshape(current_activations.shape[0],-1)

            self.derivatives[i] = np.dot(current_activations, delta_re)
            error = np.dot(delta, self.weights[i].T)
            
    def gradientDescent(self, learningRate=0.4):
        # update the weights by stepping down the gradient
        for i in range(len(self.weights)):
            weights = self.weights[i]
            derivatives = self.derivatives[i]
            weights += derivatives * learningRate
             
                
    def train(self, inputs, targets, epochs=50, learningRate=0.2):
        #epochs and learning rate are hyperparameters
        #training is similar to perceptron but needs forward feed
        #backpropagation and gradient descent as well
        for i in range(epochs):
            sumErrors = 0
            #we will loop through the inputs and its indexes
            for j, input in enumerate(inputs):
                target = targets[j]
                #activating the layer by forward propagation
                output = self.forwardProp(input)
                #calculating the error 
                error = target - output
                
                self.backProp(error)

                # now perform gradient descent on the derivatives
                # (this updates the weights right after this sample)
                self.gradientDescent(learningRate)
                sumErrors += self._mse(target, output)
                # sumErrors could be reported at each epoch to check the
                # error, but I did not display it as the video suggested.

        print("Trained the model.")
            
                
                
    def miniBatchGradientDescent(self, learningRate=0.4):
        #same update rule as gradientDescent, kept separate for readability
        for i in range(len(self.weights)):
            weights = self.weights[i]
            derivatives = self.derivatives[i]
            weights += derivatives * learningRate
            
    def miniBatchTrain(self, inputs, targets, epochs=50, learningRate=0.2, batchSize=30):
        #epochs, learning rate and batch size are hyperparameters
        #training is similar to perceptron but needs forward feed
        #backpropagation and gradient descent as well
        for i in range(epochs):
            sumErrors = 0
            #draw a fresh random mini-batch from the full training set;
            #inputs and targets are left untouched for the next epoch
            X, y = shuffle(inputs, targets)
            x_batch = X[:batchSize]
            y_batch = y[:batchSize]
        
            #we will loop through the mini-batch samples and their indexes
            for j, sample in enumerate(x_batch):
                target = y_batch[j]
                #activating the layer by forward propagation
                output = self.forwardProp(sample)
                #calculating the error 
                error = target - output
                
                self.backProp(error)

                # now perform gradient descent on the derivatives
                # (this will update the weights)
                self.miniBatchGradientDescent(learningRate)
                sumErrors += self._mse(target, output)
                #we can report using sumErrors at each epoch to 
                #check the error, but I did not display it as the video suggested.

        print("Trained the model with minibatch gradient descent with batch size:", batchSize )
    
             
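    def stochasticGradientDescent(self, learningRate=0.4,iterations=10):
        # applies the stored derivatives `iterations` times; they are not
        # recomputed in between, so this equals one step of size
        # learningRate * iterations
        for i in range(iterations):
            for j in range(len(self.weights)):
                weights = self.weights[j]
                derivatives = self.derivatives[j]
                weights += derivatives * learningRate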
    
    def sgdTrain(self, inputs, targets, epochs=50, learningRate=0.2, iterations=10):
        #epochs and learning rate are hyperparameters
        #training is similar to perceptron but needs forward feed
        #backpropagation and gradient descent as well
        for i in range(epochs):
            sumErrors = 0
            X, y = shuffle(inputs, targets)
            inputs = X
            targets = y

            #we will loop through the inputs and its indexes
            for j, input in enumerate(inputs):
                target = targets[j]
                #activating the layer by forward propagation
                output = self.forwardProp(input)
                #calculating the error 
                error = target - output
                
                self.backProp(error)

                # now perform gradient descent on the derivatives
                # (this will update the weights); note that `iterations`
                # is not forwarded here, so the default of 10 is used
                self.stochasticGradientDescent(learningRate)
                sumErrors += self._mse(target, output)
                # sumErrors could be reported at each epoch to check the
                # error, but I did not display it as the video suggested.

        print("Trained the model with stochastic gradient descent")
    
    def classification(self,output): 
        return np.where(output/np.average(output) < 1, 1, -1)

    def _mse(self, target, output):
        return np.average((target - output) ** 2)

    def _sigmoid(self, x):
        y = 1.0 / (1 + np.exp(-x))
        return y

    def _sigmoid_derivative(self, x):
        return x * (1.0 - x)

# Results:

## Batch Gradient Descent

The original code used batch-style gradient descent. The overall accuracy is stable between 84% and 87%. It takes 4 seconds on average.

## Stochastic Gradient Descent
The SGD overall accuracy is stable between 81% and 85%. It takes 8 seconds on average. The number of iterations does not have a big impact on the running time. Shuffling affects the prediction results.


## Mini-Batch Gradient Descent
Batch size really affects the resulting accuracy. With a low batch size and an unlucky shuffle, the accuracy can drop below 65%. But if the batch size, which is a hyperparameter, is chosen well, the accuracy can be higher than with SGD, too. The dramatic improvement is in the running time: training and evaluation take only 0.03-0.08 seconds in the runs below, because each epoch processes only `batchSize` samples instead of the whole training set.
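The accuracy figures reported here are the share of test samples whose predicted label equals the actual label. A compact way to compute the same quantity is sketched below; `accuracy_percent` and the toy labels are hypothetical and only illustrate the calculation, they are not part of the evaluation cells that follow.

```python
import numpy as np

def accuracy_percent(predicted, actual):
    """Percentage of positions where the predicted label matches the actual one."""
    predicted = np.asarray(predicted).ravel()   # accepts shape (1, n) or (n,)
    actual = np.asarray(actual).ravel()
    return 100.0 * np.mean(predicted == actual)

# Toy labels in the same -1 / 1 encoding used for the rice classes
demo_accuracy = accuracy_percent([[-1, 1, 1, -1]], [-1, 1, -1, -1])
```

In the toy example three of the four labels match, so `demo_accuracy` is 75.0.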

In [109]:
# Adding datasets for training the network
items = X_normed
targets = y


start_time = time.time()

#Creating the 1 hidden layered network with 5 neurons inside
#we want to get the first layer's size as the number of the features
inputSize = items.shape[1]
nn = ModifiedNeuralNetwork(inputSize, [5], 1)
nn.train(items, targets, 30, 0.1)

#After training, adding the test data
input = X_test_normed
target = y_test

#Predictions and classifications.
output = nn.forwardProp(input)
result = nn.classification(output).T

print("--- %s seconds ---" % (time.time() - start_time))
print(result)


dataset = pd.DataFrame({'Predicted': result[len(result)-1], 'Actual': target})
dataset['Score_diff'] = dataset['Predicted'].sub(dataset['Actual'], axis = 0) 
dataset["Score_diff"] = np.where(dataset["Score_diff"] == 0 , 0 , 1)
dataset
column_sums = dataset.sum(axis=0)
accuracy = 1 - (column_sums['Score_diff'] / target.shape[0])
print("Accuracy is:", accuracy*100, "%.")

Trained the model.
--- 4.763277530670166 seconds ---
[[-1 -1 -1 ...  1  1  1]]
Accuracy is: 85.03937007874016 %.


In [110]:
# Adding datasets for training the network
items = X_normed
targets = y


start_time = time.time()
#Creating the 1 hidden layered network with 5 neurons inside
#we want to get the first layer's size as the number of the features
inputSize = items.shape[1]
nn = ModifiedNeuralNetwork(inputSize, [5], 1)
iterations = items.shape[0]
nn.sgdTrain(items, targets, 30, 0.1, iterations)

#After training, adding the test data
input = X_test_normed
target = y_test

#Predictions and classifications.
output = nn.forwardProp(input)
result = nn.classification(output).T
print("--- %s seconds ---" % (time.time() - start_time))
print(result)

dataset = pd.DataFrame({'Predicted': result[len(result)-1], 'Actual': target})
dataset['Score_diff'] = dataset['Predicted'].sub(dataset['Actual'], axis = 0) 
dataset["Score_diff"] = np.where(dataset["Score_diff"] == 0 , 0 , 1)
dataset
column_sums = dataset.sum(axis=0)
accuracy = 1 - (column_sums['Score_diff'] / target.shape[0])
print("Accuracy is:", accuracy*100, "%.")

Trained the model with stochastic gradient descent
--- 8.56310510635376 seconds ---
[[-1 -1 -1 ...  1  1  1]]
Accuracy is: 82.1522309711286 %.


In [111]:
# Adding datasets for training the network
items = X_normed
targets = y

start_time = time.time()
#Creating the 1 hidden layered network with 5 neurons inside
#we want to get the first layer's size as the number of the features
inputSize = items.shape[1]
nn = ModifiedNeuralNetwork(inputSize, [5], 1)
nn.miniBatchTrain(items, targets, 30, 0.1, 10)

#After training, adding the test data
input = X_test_normed
target = y_test

#Predictions and classifications.
output = nn.forwardProp(input)
result = nn.classification(output).T
print("--- %s seconds ---" % (time.time() - start_time))
print(result)

dataset = pd.DataFrame({'Predicted': result[len(result)-1], 'Actual': target})
dataset['Score_diff'] = dataset['Predicted'].sub(dataset['Actual'], axis = 0) 
dataset["Score_diff"] = np.where(dataset["Score_diff"] == 0 , 0 , 1)
dataset
column_sums = dataset.sum(axis=0)
accuracy = 1 - (column_sums['Score_diff'] / target.shape[0])
print("Accuracy is:", accuracy*100, "%.")

Trained the model with minibatch gradient descent with batch size: 10
--- 0.027922630310058594 seconds ---
[[-1  1  1 ...  1  1  1]]
Accuracy is: 62.467191601049876 %.


In [113]:
# Adding datasets for training the network
items = X_normed
targets = y

start_time = time.time()

#Creating the 1 hidden layered network with 5 neurons inside
#we want to get the first layer's size as the number of the features
inputSize = items.shape[1]
nn = ModifiedNeuralNetwork(inputSize, [5], 1)
nn.miniBatchTrain(items, targets, 30, 0.1, 32)

#After training, adding the test data
input = X_test_normed
target = y_test

#Predictions and classifications.
output = nn.forwardProp(input)
result = nn.classification(output).T
print("--- %s seconds ---" % (time.time() - start_time))
print(result)

dataset = pd.DataFrame({'Predicted': result[len(result)-1], 'Actual': target})
dataset['Score_diff'] = dataset['Predicted'].sub(dataset['Actual'], axis = 0) 
dataset["Score_diff"] = np.where(dataset["Score_diff"] == 0 , 0 , 1)
dataset
column_sums = dataset.sum(axis=0)
accuracy = 1 - (column_sums['Score_diff'] / target.shape[0])
print("Accuracy is:", accuracy*100, "%.")

Trained the model with minibatch gradient descent with batch size: 32
--- 0.07283854484558105 seconds ---
[[-1 -1 -1 ...  1 -1  1]]
Accuracy is: 85.8267716535433 %.


In [116]:
# Adding datasets for training the network
items = X_normed
targets = y

start_time = time.time()

#Creating the 1 hidden layered network with 5 neurons inside
#we want to get the first layer's size as the number of the features
inputSize = items.shape[1]
nn = ModifiedNeuralNetwork(inputSize, [5], 1)
nn.miniBatchTrain(items, targets, 30, 0.1, 24)

#After training, adding the test data
input = X_test_normed
target = y_test

#Predictions and classifications.
output = nn.forwardProp(input)
result = nn.classification(output).T
print("--- %s seconds ---" % (time.time() - start_time))
print(result)

dataset = pd.DataFrame({'Predicted': result[len(result)-1], 'Actual': target})
dataset['Score_diff'] = dataset['Predicted'].sub(dataset['Actual'], axis = 0) 
dataset["Score_diff"] = np.where(dataset["Score_diff"] == 0 , 0 , 1)
dataset
column_sums = dataset.sum(axis=0)
accuracy = 1 - (column_sums['Score_diff'] / target.shape[0])
print("Accuracy is:", accuracy*100, "%.")

Trained the model with minibatch gradient descent with batch size: 24
--- 0.05485367774963379 seconds ---
[[-1 -1 -1 ...  1 -1  1]]
Accuracy is: 86.43919510061242 %.


In [117]:
# Adding datasets for training the network
items = X_normed
targets = y


start_time = time.time()
#Creating the 1 hidden layered network with 5 neurons inside
#we want to get the first layer's size as the number of the features
inputSize = items.shape[1]
nn = ModifiedNeuralNetwork(inputSize, [5], 1)
iterations = 100
nn.sgdTrain(items, targets, 30, 0.1, iterations)

#After training, adding the test data
input = X_test_normed
target = y_test

#Predictions and classifications.
output = nn.forwardProp(input)
result = nn.classification(output).T
print("--- %s seconds ---" % (time.time() - start_time))
print(result)

dataset = pd.DataFrame({'Predicted': result[len(result)-1], 'Actual': target})
dataset['Score_diff'] = dataset['Predicted'].sub(dataset['Actual'], axis = 0) 
dataset["Score_diff"] = np.where(dataset["Score_diff"] == 0 , 0 , 1)
dataset
column_sums = dataset.sum(axis=0)
accuracy = 1 - (column_sums['Score_diff'] / target.shape[0])
print("Accuracy is:", accuracy*100, "%.")

Trained the model with stochastic gradient descent
--- 8.339909076690674 seconds ---
[[-1 -1 -1 ...  1  1  1]]
Accuracy is: 85.12685914260717 %.


# Conclusion:

In conclusion, when we consider the trade-off between running time and accuracy, the mini-batch approach gives the best results in the shortest time. All the variants are good optimizers, but my preference would be a mini-batch implementation with a batch size of 32-64.

Noise and the risk of missing a good minimum are also problems for the mini-batch algorithm, so for those who want to take no risks, the safest approach is simple (batch) gradient descent.


# Bibliography 


- Imad Dabbura, Gradient Descent Algorithm and Its Variants, https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3

- Artem Oppermann, Stochastic Batch and Mini Batch Gradient Descent, https://towardsdatascience.com/stochastic-batch-and-mini-batch-gradient-descent-demystified-8b28978f7f5

- Hamza Mahmood, Gradient Descent, https://towardsdatascience.com/gradient-descent-3a7db7520711

- Sushrut Bidwai, Week 10 Machine Learning, http://sushrutbidwai.com/week-10---machine-learning

- Jasmeet Singh, Implementing SGD from Scratch, https://towardsdatascience.com/implementing-sgd-from-scratch-d425db18a72c

- Jason Brownlee, A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size, https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/

- Jason Brownlee, How to Code Neural Network with Backpropagation in Python (from scratch): https://machinelearningmastery.com/implement-backpropagation-algorithm-scratch-python/

- Valerio Velardo, Training a Neural Network 6-7-8: https://www.youtube.com/watch?v=Z97XGNUUx9o

- Github, mattm, Simple Neural Network: https://github.com/mattm/simple-neural-network

- Github, emilwallner, Deep Learning From Scratch: https://github.com/emilwallner/Deep-Learning-From-Scratch

- Github, KarnageKnight, Neural Network with n Hidden Layers: https://github.com/KarnageKnight/Neural-Network-with-n-hidden-layers

- Github, JGuymont, Numpy Multilayer Perceptron: https://github.com/JGuymont/numpy-multilayer-perceptron

- Github, musikalkemist, Deep Learning for Audio with Python (my main code reference): https://github.com/musikalkemist/DeepLearningForAudioWithPython 