
# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint

The objective of this experiment is to understand a modern backpropagation implementation.

In this experiment we will use the MNIST database, a dataset of handwritten digits. It has 60,000 training samples and 10,000 test samples. Each image is 28 x 28 pixels, and each pixel holds a grayscale value from 0 to 255.
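
As a minimal sketch of this layout, the snippet below turns one flattened row of 784 = 28 x 28 pixel values back into an image and rescales it to the range [0, 1]. The row here is synthetic random data standing in for a real MNIST sample, and the names flat_image, image and scaled are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one MNIST row: 784 grayscale values in 0..255
flat_image = rng.integers(0, 256, size=784).astype(np.uint8)

# Recover the 28 x 28 pixel grid from the flattened row
image = flat_image.reshape(28, 28)

# Rescale pixel intensities from 0..255 to 0..1 (a common preprocessing step)
scaled = image.astype(np.float64) / 255.0

print(image.shape, scaled.min(), scaled.max())
```

Note that the notebook itself feeds the raw 0 to 255 values to the network; the rescaling above is only an optional illustration.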

It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.

It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.

### Keywords

MLP

BackPropagation

Chain Rule

Softmax

Activation Function

Solvers

Gradient descent

Zero Weight Initialization

Xavier Weight Initialization

Hyper Parameters

### Setup Steps

In [0]:
#@title Please enter your registration id to start: (e.g. P181900101) { run: "auto", display-mode: "form" }
Id = "P19A06E_test" #@param {type:"string"}


In [0]:
#@title Please enter your password (normally your phone number) to continue: { run: "auto", display-mode: "form" }
password = "981234567" #@param {type:"string"}


In [0]:
#@title Run this cell to complete the setup for this Notebook

from IPython import get_ipython
ipython = get_ipython()
  
notebook="BLR_M3W13_SAT_EXP_5" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")
   
    print ("Setup completed successfully")
    return

def submit_notebook():
    
    ipython.magic("notebook -e "+ notebook + ".ipynb")
    
    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:        
        print(r["err"])
        return None        
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getComplexity() and getAdditional() and getConcepts():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional, 
              "concepts" : Concepts, "record_id" : submission_id, 
              "id" : Id, "file_hash" : file_hash, "notebook" : notebook}

      r = requests.post(url, data = data)
      print("Your submission is successful.")
      print("Ref Id:", submission_id)
      print("Date of submission: ", datetime.datetime.now().date().strftime("%d %b %Y"))
      print("Time of submission: ", datetime.datetime.now().time().strftime("%H:%M:%S"))
      print("View your submissions: https://iiith-aiml.talentsprint.com/notebook_submissions")
      print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
      return submission_id
    else: return submission_id
    

def getAdditional():
  try:
    if Additional: return Additional      
    else: raise NameError('')
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None
  
def getConcepts():
  try:
    return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None

def getId():
  try: 
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup 
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
  
else:
  print ("Please complete Id and Password cells before running setup")



Setup completed successfully


In [0]:
# Importing Required Packages
import numpy as np
from scipy import ndimage
from matplotlib import pyplot as plt
from sklearn import manifold, datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.datasets import fetch_openml   # fetch_mldata was removed in scikit-learn 0.22

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier


Loading the dataset with scikit-learn

In [0]:
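# Load the MNIST dataset
from sklearn.datasets import fetch_openml
# as_frame=False (scikit-learn >= 0.22) returns NumPy arrays, which the slicing below relies on
mnist = fetch_openml('mnist_784', as_frame=False)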
X, Y = mnist.data, mnist.target
Y = Y.astype(int)

X = X[::10, :]     ## taking the whole data will take a lot of processing time
Y = Y[::10]
# digits = datasets.load_digits(n_class=10)
# # Create our X and y data
# X = digits.data
# Y = digits.target
print(X.shape, Y.shape)
num_examples = X.shape[0]      ## training set size
nn_input_dim = X.shape[1]      ## input layer dimensionality
nn_output_dim = len(np.unique(Y))       ## output layer dimensionality

params = {
    "lr":0.0001,        ## learning_rate
    "max_iter":500,
    "weight_init":"xavier",
    "h_dimn":100,     ## hidden_layer_size
}

(7000, 784) (7000,)


#### Weight Initializations


Note that we do not know what the final value of every weight should be in the trained network, but with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative.

Zero Weight Initialization: Initializing every weight to zero turns out to be a mistake, because if every neuron in the network computes the same output, then they will also all compute the same gradients during backpropagation and undergo the exact same parameter updates. In other words, there is no source of asymmetry between neurons if their weights are initialized to be the same.
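
The symmetry argument can be checked numerically. The sketch below is a toy version of a tanh + softmax network on random data (the sizes, the gradients helper and the constant values are assumptions made for illustration, not taken from the notebook). With all-zero weights the hidden activations and W2 are zero, so the weight gradients vanish entirely; with identical non-zero hidden units, every hidden unit receives the same gradient.

```python
import numpy as np

def gradients(W1, b1, W2, b2, x, y):
    """One forward/backward pass of a tanh + softmax network (summed cross-entropy)."""
    a1 = np.tanh(x.dot(W1) + b1)
    z2 = a1.dot(W2) + b2
    e = np.exp(z2 - z2.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    delta3 = probs.copy()
    delta3[np.arange(len(y)), y] -= 1
    dW2 = a1.T.dot(delta3)
    delta2 = delta3.dot(W2.T) * (1 - a1 ** 2)
    dW1 = x.T.dot(delta2)
    return dW1, dW2

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))        # 8 toy samples with 5 features
y = rng.integers(0, 3, size=8)     # 3 toy classes

# Case 1: all-zero initialization -> weight gradients are exactly zero
dW1_zero, dW2_zero = gradients(np.zeros((5, 4)), np.zeros((1, 4)),
                               np.zeros((4, 3)), np.zeros((1, 3)), x, y)

# Case 2: four identical hidden units (same incoming and outgoing weights)
W1_same = np.full((5, 4), 0.1)
W2_same = np.tile(np.array([0.1, -0.2, 0.3]), (4, 1))
dW1_same, dW2_same = gradients(W1_same, np.zeros((1, 4)),
                               W2_same, np.zeros((1, 3)), x, y)

print(np.allclose(dW1_zero, 0), np.allclose(dW2_zero, 0))
print(np.allclose(dW1_same, dW1_same[:, :1]), np.allclose(dW2_same, dW2_same[:1, :]))
```

In the first case only the output bias can learn, which is consistent with the near-chance test accuracy the notebook reports for the zeros setting.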



As a solution, it is common to initialize the weights of the neurons to small random numbers and refer to doing so as symmetry breaking. The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network. Instead of drawing the small random numbers from a Gaussian distribution, it is also possible to draw them from a uniform distribution, but this seems to have relatively little impact on the final performance in practice.

It is worth mentioning that if you do not know which weight initialization method to choose, Xavier is often chosen as a first try.



In [0]:
def xavier_init(fan_in, fan_out):
    ## using FanAvg variation
    n = (fan_in+fan_out)/2
    limit = np.sqrt(3.0 * 1 / n)
    return np.random.uniform(size = (fan_in, fan_out), low = -limit, high = +limit)
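
To make the choice of limit concrete: a uniform distribution on [-limit, +limit] has variance limit^2 / 3, so limit = sqrt(3 / n) with n = (fan_in + fan_out) / 2 gives a weight variance of 2 / (fan_in + fan_out). The sketch below checks this empirically; xavier_uniform is an illustrative re-statement of the function above that takes an explicit random generator, not a library API.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    # Same limit as above: sqrt(3 / n) with n = (fan_in + fan_out) / 2
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(low=-limit, high=limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
fan_in, fan_out = 784, 100          # input and hidden sizes used in this notebook
W = xavier_uniform(fan_in, fan_out, rng)

expected_var = 2.0 / (fan_in + fan_out)   # equals limit**2 / 3
empirical_var = W.var()
print(expected_var, empirical_var)
```

The two printed values should agree closely, since the matrix holds 78,400 samples.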

In [0]:
def weight_initialization(params):
    hdim = params["h_dimn"]
    winit = params["weight_init"]
    if winit == "random":
        np.random.seed(0)
        W1 = np.random.randn(nn_input_dim, hdim)
        b1 = np.random.randn(1, hdim)
        W2 = np.random.randn(hdim, nn_output_dim)
        b2 = np.random.randn(1, nn_output_dim)
    elif winit == "zeros":
        W1 = np.zeros((nn_input_dim, hdim))
        b1 = np.zeros((1, hdim))
        W2 = np.zeros((hdim, nn_output_dim))
        b2 = np.zeros((1, nn_output_dim))
    elif winit == "xavier":
        W1 = xavier_init(nn_input_dim, hdim)
        b1 = xavier_init(1, hdim)
        W2 = xavier_init(hdim, nn_output_dim)
        b2 = xavier_init(1, nn_output_dim)
    elif winit == "uniform":
        W1 = np.random.uniform(size=(nn_input_dim, hdim), low=-1, high=1)/np.sqrt(nn_input_dim)
        b1 = np.random.uniform(size=(1, hdim), low=-1, high=1)
        W2 = np.random.uniform(size=(hdim, nn_output_dim), low=-1, high=1)/np.sqrt(hdim)
        b2 = np.random.uniform(size=(1, nn_output_dim), low=-1, high=1)
    elif winit == "normal":
        W1 = np.random.normal(loc = 0, scale = 0.5, size = (nn_input_dim, hdim))
        b1 = np.random.normal(loc = 0, scale = 0.5, size=(1, hdim))
        W2 = np.random.normal(loc = 0, scale = 0.5, size = (hdim, nn_output_dim))
        b2 = np.random.normal(loc = 0, scale = 0.5, size=(1, nn_output_dim))
    else:
        raise ValueError("Unknown weight_init: %r" % winit)
    return W1, b1, W2, b2


In [0]:
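def softmax(x):
    # Subtract the row-wise maximum before exponentiating to avoid overflow;
    # this does not change the resulting probabilities.
    exp_scores = np.exp(x - np.max(x, axis=1, keepdims=True))
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    return probs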
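
Softmax is unchanged when the same constant is added to every score in a row, which is why subtracting the row maximum is a common way to keep np.exp from overflowing. The sketch below illustrates this; stable_softmax is an illustrative name, and the two rows of scores differ only by a constant offset of 999.

```python
import numpy as np

def stable_softmax(scores):
    # Shift each row so its largest score is 0; exp() then never exceeds 1
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]])
probs = stable_softmax(scores)
print(probs)
```

Both rows yield the same probabilities, whereas calling np.exp directly on the second row would overflow.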

In [0]:
def build_model():
    W1, b1, W2, b2 = weight_initialization(params)
    # This is what we return at the end
    model = { 'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
    return model

In [0]:
def feedforward(model, x):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    z1 = x.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    probs = softmax(z2)
    return a1, probs


In [0]:
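def backpropagation(model, x, y, a1, probs):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']

    # Gradient of the summed cross-entropy loss w.r.t. the output scores: probs - one_hot(y).
    # Copy so that the caller's probs array is not modified in place.
    delta3 = probs.copy()
    delta3[range(y.shape[0]), y] -= 1
    dW2 = (a1.T).dot(delta3)
    db2 = np.sum(delta3, axis=0, keepdims=True)
    # Chain rule through tanh, whose derivative is 1 - tanh(z)^2 = 1 - a1^2
    delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
    dW1 = np.dot(x.T, delta2)
    db1 = np.sum(delta2, axis=0, keepdims=True)
    return dW2, db2, dW1, db1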
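
A common way to gain confidence in chain-rule gradients like these is a finite-difference check. The sketch below is a self-contained toy version (small random data and helper names chosen here for illustration) that compares the analytic gradient for W1 with a central-difference estimate. It uses the summed cross-entropy loss, since the gradients above are not divided by the number of samples.

```python
import numpy as np

def forward(W1, b1, W2, b2, x):
    a1 = np.tanh(x.dot(W1) + b1)
    z2 = a1.dot(W2) + b2
    e = np.exp(z2 - z2.max(axis=1, keepdims=True))
    return a1, e / e.sum(axis=1, keepdims=True)

def summed_loss(W1, b1, W2, b2, x, y):
    _, probs = forward(W1, b1, W2, b2, x)
    return -np.sum(np.log(probs[np.arange(len(y)), y]))

def analytic_dW1(W1, b1, W2, b2, x, y):
    a1, probs = forward(W1, b1, W2, b2, x)
    delta3 = probs.copy()
    delta3[np.arange(len(y)), y] -= 1
    delta2 = delta3.dot(W2.T) * (1 - a1 ** 2)
    return x.T.dot(delta2)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))            # 6 toy samples, 4 features
y = rng.integers(0, 3, size=6)         # 3 toy classes
W1 = rng.normal(scale=0.5, size=(4, 5))
b1 = np.zeros((1, 5))
W2 = rng.normal(scale=0.5, size=(5, 3))
b2 = np.zeros((1, 3))

# Central differences: perturb one weight at a time by +/- eps
eps = 1e-5
numeric = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (summed_loss(W_plus, b1, W2, b2, x, y)
                         - summed_loss(W_minus, b1, W2, b2, x, y)) / (2 * eps)

max_abs_diff = np.max(np.abs(numeric - analytic_dW1(W1, b1, W2, b2, x, y)))
print(max_abs_diff)
```

A very small difference indicates that the analytic and numerical gradients agree; the same loop can be repeated for W2 and the biases.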


In [0]:
def calculate_loss(model, x, y):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    
    # Forward propagation to calculate predictions
    _, probs = feedforward(model, x)
    
    # Calculating the cross-entropy loss
    correct_logprobs = -np.log(probs[range(y.shape[0]), y])
    data_loss = np.sum(correct_logprobs)
    
    return 1./y.shape[0] * data_loss


In [0]:
def test(model, x, y):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation to calculate predictions
    _, probs = feedforward(model, x)
    preds = np.argmax(probs, axis=1)
    return np.count_nonzero(y==preds)/y.shape[0]


In [0]:
def train(model, X_train, X_test, Y_train, Y_test, verbose=True):
    # Full-batch gradient descent: each iteration uses the whole training set
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    for i in range(0, params["max_iter"]):

        # Forward propagation
        a1, probs = feedforward(model, X_train)

        # Backpropagation
        dW2, db2, dW1, db1 = backpropagation(model, X_train, Y_train, a1, probs)

        # Gradient descent parameter update
        W1 += -params["lr"] * dW1
        b1 += -params["lr"] * db1
        W2 += -params["lr"] * dW2
        b2 += -params["lr"] * db2
        
        # Assign new parameters to the model
        model = { 'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
        if verbose and i % 50 == 0:
            print("Loss after iteration %i: %f" %(i, calculate_loss(model, X_train, Y_train)),
                  ", Test accuracy:", test(model, X_test, Y_test), "\n")
    return model

#### Experimenting with different weight initializations and evaluating the corresponding test accuracies

In [0]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4)
t = ["xavier","uniform","normal","zeros","random"]

for i in range(5):
    params["weight_init"] = t[i]
    model = build_model()
    model = train(model, X_train, X_test, Y_train, Y_test, verbose=False)
    print(params, "TestAccuracy=", test(model,X_test, Y_test))
    

{'lr': 0.0001, 'max_iter': 500, 'weight_init': 'xavier', 'h_dimn': 100} TestAccuracy= 0.8989285714285714
{'lr': 0.0001, 'max_iter': 500, 'weight_init': 'uniform', 'h_dimn': 100} TestAccuracy= 0.9010714285714285
{'lr': 0.0001, 'max_iter': 500, 'weight_init': 'normal', 'h_dimn': 100} TestAccuracy= 0.7753571428571429
{'lr': 0.0001, 'max_iter': 500, 'weight_init': 'zeros', 'h_dimn': 100} TestAccuracy= 0.0925
{'lr': 0.0001, 'max_iter': 500, 'weight_init': 'random', 'h_dimn': 100} TestAccuracy= 0.7146428571428571


####  Selecting Hyperparameters

scikit-learn provides the GridSearchCV class to tune your neural network's hyperparameters automatically. We just provide the list of candidate values for each hyperparameter as the parameter grid, and every combination is evaluated with cross-validation.
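
To see how quickly a grid grows, the sketch below counts the combinations with scikit-learn's ParameterGrid. The dictionary grid_example simply mirrors the grid used in this notebook, and n_folds = 2 matches its cv=2 setting; both names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the grid searched in this notebook
grid_example = {
    "activation": ["tanh", "relu"],
    "learning_rate_init": [0.0001, 0.001],
    "hidden_layer_sizes": [(300,), (300, 100), (100, 50)],
    "solver": ["adam", "sgd"],
}

n_candidates = len(ParameterGrid(grid_example))   # 2 * 2 * 3 * 2 combinations
n_folds = 2
n_fits = n_candidates * n_folds                   # one fit per candidate per fold

print(n_candidates, n_fits)
```

Every extra value in any list multiplies the number of fits, so it usually pays to keep each list short.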

In [21]:
parameters = {'activation' : ["tanh", "relu"],
            'learning_rate_init' : [0.0001, 0.001],
            'hidden_layer_sizes' : [(300,), (300, 100), (100, 50)],
            'solver' : ["adam","sgd"]
             }
clf = MLPClassifier()
clf = GridSearchCV(estimator=clf, param_grid=parameters, verbose=2, cv=2)
clf.fit(X_train, Y_train)   ## might take about 10 minutes depending on number of total parameters

Fitting 2 folds for each of 24 candidates, totalling 48 fits
[CV] activation=tanh, hidden_layer_sizes=(300,), learning_rate_init=0.0001, solver=adam 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   39.5s remaining:    0.0s


[CV]  activation=tanh, hidden_layer_sizes=(300,), learning_rate_init=0.0001, solver=adam, total=  39.4s
[CV] activation=tanh, hidden_layer_sizes=(300,), learning_rate_init=0.0001, solver=adam 




[CV]  activation=tanh, hidden_layer_sizes=(300,), learning_rate_init=0.0001, solver=adam, total=  39.6s
[CV] activation=tanh, hidden_layer_sizes=(300,), learning_rate_init=0.0001, solver=sgd 




[CV]  activation=tanh, hidden_layer_sizes=(300,), learning_rate_init=0.0001, solver=sgd, total=  35.4s
[CV] activation=tanh, hidden_layer_sizes=(300,), learning_rate_init=0.0001, solver=sgd 




[CV]  activation=tanh, hidden_layer_sizes=(300,), learning_rate_init=0.0001, solver=sgd, total=  35.3s
[CV] activation=tanh, hidden_layer_sizes=(300,), learning_rate_init=0.001, solver=adam 
[CV]  activation=tanh, hidden_layer_sizes=(300,), learning_rate_init=0.001, solver=adam, total=  38.7s
[CV] activation=tanh, hidden_layer_sizes=(300,), learning_rate_init=0.001, solver=adam 
[CV]  activation=tanh, hidden_layer_sizes=(300,), learning_rate_init=0.001, solver=adam, total=  27.1s
[CV] activation=tanh, hidden_layer_sizes=(300,), learning_rate_init=0.001, solver=sgd 




[CV]  activation=tanh, hidden_layer_sizes=(300,), learning_rate_init=0.001, solver=sgd, total=  34.7s
[CV] activation=tanh, hidden_layer_sizes=(300,), learning_rate_init=0.001, solver=sgd 




[CV]  activation=tanh, hidden_layer_sizes=(300,), learning_rate_init=0.001, solver=sgd, total=  35.4s
[CV] activation=tanh, hidden_layer_sizes=(300, 100), learning_rate_init=0.0001, solver=adam 




[CV]  activation=tanh, hidden_layer_sizes=(300, 100), learning_rate_init=0.0001, solver=adam, total=  46.7s




[CV] activation=tanh, hidden_layer_sizes=(300, 100), learning_rate_init=0.0001, solver=adam 
[CV]  activation=tanh, hidden_layer_sizes=(300, 100), learning_rate_init=0.0001, solver=adam, total=  46.6s
[CV] activation=tanh, hidden_layer_sizes=(300, 100), learning_rate_init=0.0001, solver=sgd 




[CV]  activation=tanh, hidden_layer_sizes=(300, 100), learning_rate_init=0.0001, solver=sgd, total=  42.3s
[CV] activation=tanh, hidden_layer_sizes=(300, 100), learning_rate_init=0.0001, solver=sgd 




[CV]  activation=tanh, hidden_layer_sizes=(300, 100), learning_rate_init=0.0001, solver=sgd, total=  42.1s
[CV] activation=tanh, hidden_layer_sizes=(300, 100), learning_rate_init=0.001, solver=adam 
[CV]  activation=tanh, hidden_layer_sizes=(300, 100), learning_rate_init=0.001, solver=adam, total=  10.8s
[CV] activation=tanh, hidden_layer_sizes=(300, 100), learning_rate_init=0.001, solver=adam 
[CV]  activation=tanh, hidden_layer_sizes=(300, 100), learning_rate_init=0.001, solver=adam, total=   9.5s
[CV] activation=tanh, hidden_layer_sizes=(300, 100), learning_rate_init=0.001, solver=sgd 




[CV]  activation=tanh, hidden_layer_sizes=(300, 100), learning_rate_init=0.001, solver=sgd, total=  42.1s
[CV] activation=tanh, hidden_layer_sizes=(300, 100), learning_rate_init=0.001, solver=sgd 




[CV]  activation=tanh, hidden_layer_sizes=(300, 100), learning_rate_init=0.001, solver=sgd, total=  41.9s
[CV] activation=tanh, hidden_layer_sizes=(100, 50), learning_rate_init=0.0001, solver=adam 




[CV]  activation=tanh, hidden_layer_sizes=(100, 50), learning_rate_init=0.0001, solver=adam, total=  18.0s
[CV] activation=tanh, hidden_layer_sizes=(100, 50), learning_rate_init=0.0001, solver=adam 




[CV]  activation=tanh, hidden_layer_sizes=(100, 50), learning_rate_init=0.0001, solver=adam, total=  17.5s
[CV] activation=tanh, hidden_layer_sizes=(100, 50), learning_rate_init=0.0001, solver=sgd 




[CV]  activation=tanh, hidden_layer_sizes=(100, 50), learning_rate_init=0.0001, solver=sgd, total=  15.9s
[CV] activation=tanh, hidden_layer_sizes=(100, 50), learning_rate_init=0.0001, solver=sgd 




[CV]  activation=tanh, hidden_layer_sizes=(100, 50), learning_rate_init=0.0001, solver=sgd, total=  15.6s
[CV] activation=tanh, hidden_layer_sizes=(100, 50), learning_rate_init=0.001, solver=adam 
[CV]  activation=tanh, hidden_layer_sizes=(100, 50), learning_rate_init=0.001, solver=adam, total=   8.1s
[CV] activation=tanh, hidden_layer_sizes=(100, 50), learning_rate_init=0.001, solver=adam 
[CV]  activation=tanh, hidden_layer_sizes=(100, 50), learning_rate_init=0.001, solver=adam, total=   5.9s
[CV] activation=tanh, hidden_layer_sizes=(100, 50), learning_rate_init=0.001, solver=sgd 




[CV]  activation=tanh, hidden_layer_sizes=(100, 50), learning_rate_init=0.001, solver=sgd, total=  15.7s
[CV] activation=tanh, hidden_layer_sizes=(100, 50), learning_rate_init=0.001, solver=sgd 




[CV]  activation=tanh, hidden_layer_sizes=(100, 50), learning_rate_init=0.001, solver=sgd, total=  15.6s
[CV] activation=relu, hidden_layer_sizes=(300,), learning_rate_init=0.0001, solver=adam 
[CV]  activation=relu, hidden_layer_sizes=(300,), learning_rate_init=0.0001, solver=adam, total=  10.5s
[CV] activation=relu, hidden_layer_sizes=(300,), learning_rate_init=0.0001, solver=adam 
[CV]  activation=relu, hidden_layer_sizes=(300,), learning_rate_init=0.0001, solver=adam, total=  12.8s
[CV] activation=relu, hidden_layer_sizes=(300,), learning_rate_init=0.0001, solver=sgd 
[CV]  activation=relu, hidden_layer_sizes=(300,), learning_rate_init=0.0001, solver=sgd, total=   4.4s
[CV] activation=relu, hidden_layer_sizes=(300,), learning_rate_init=0.0001, solver=sgd 
[CV]  activation=relu, hidden_layer_sizes=(300,), learning_rate_init=0.0001, solver=sgd, total=   4.7s
[CV] activation=relu, hidden_layer_sizes=(300,), learning_rate_init=0.001, solver=adam 
[CV]  activation=relu, hidden_layer_siz



[CV]  activation=relu, hidden_layer_sizes=(300, 100), learning_rate_init=0.001, solver=sgd, total=  37.6s
[CV] activation=relu, hidden_layer_sizes=(300, 100), learning_rate_init=0.001, solver=sgd 




[CV]  activation=relu, hidden_layer_sizes=(300, 100), learning_rate_init=0.001, solver=sgd, total=  39.0s
[CV] activation=relu, hidden_layer_sizes=(100, 50), learning_rate_init=0.0001, solver=adam 
[CV]  activation=relu, hidden_layer_sizes=(100, 50), learning_rate_init=0.0001, solver=adam, total=  13.9s
[CV] activation=relu, hidden_layer_sizes=(100, 50), learning_rate_init=0.0001, solver=adam 
[CV]  activation=relu, hidden_layer_sizes=(100, 50), learning_rate_init=0.0001, solver=adam, total=  11.5s
[CV] activation=relu, hidden_layer_sizes=(100, 50), learning_rate_init=0.0001, solver=sgd 
[CV]  activation=relu, hidden_layer_sizes=(100, 50), learning_rate_init=0.0001, solver=sgd, total=   6.3s
[CV] activation=relu, hidden_layer_sizes=(100, 50), learning_rate_init=0.0001, solver=sgd 
[CV]  activation=relu, hidden_layer_sizes=(100, 50), learning_rate_init=0.0001, solver=sgd, total=   7.2s
[CV] activation=relu, hidden_layer_sizes=(100, 50), learning_rate_init=0.001, solver=adam 
[CV]  activ



[CV]  activation=relu, hidden_layer_sizes=(100, 50), learning_rate_init=0.001, solver=sgd, total=  14.1s
[CV] activation=relu, hidden_layer_sizes=(100, 50), learning_rate_init=0.001, solver=sgd 
[CV]  activation=relu, hidden_layer_sizes=(100, 50), learning_rate_init=0.001, solver=sgd, total=   7.9s


[Parallel(n_jobs=1)]: Done  48 out of  48 | elapsed: 16.4min finished


GridSearchCV(cv=2, error_score='raise-deprecating',
       estimator=MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'activation': ['tanh', 'relu'], 'learning_rate_init': [0.0001, 0.001], 'hidden_layer_sizes': [(300,), (300, 100), (100, 50)], 'solver': ['adam', 'sgd']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=2)

In [0]:
print(clf.best_estimator_)
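
Besides best_estimator_, a fitted GridSearchCV object also exposes the winning parameter combination and its mean cross-validated score, and it can score held-out data with the refitted best model. The sketch below shows this on scikit-learn's small built-in digits dataset with a deliberately tiny grid so that it runs quickly; the dataset, grid and variable names are assumptions for illustration and do not reproduce the search above.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

# Small 8 x 8 digits dataset, used here only to keep the example fast
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_small, y_small, test_size=0.4, random_state=0)

search = GridSearchCV(
    estimator=MLPClassifier(max_iter=300, random_state=0),
    param_grid={"hidden_layer_sizes": [(20,), (50,)]},
    cv=2,
)
search.fit(X_tr, y_tr)

best_params = search.best_params_        # dict holding the winning combination
best_cv_score = search.best_score_       # mean cross-validated accuracy of that combination
test_accuracy = search.score(X_te, y_te) # accuracy of the refitted best model on held-out data

print(best_params, best_cv_score, test_accuracy)
```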

### Please answer the questions below to complete the experiment:

In [0]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "" #@param ["Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging me", "Was Tough, but I did it", "Too Difficult for me"]


In [0]:
#@title If it was very easy, what more you would have liked to have been added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "" #@param {type:"string"}

In [0]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "" #@param ["Yes", "No"]

In [0]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id =return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")