# Deep learning from scratch: homework 2

### Haoyang Han hhr8512

### General instructions

Complete the exercise listed below in this Jupyter notebook, leaving all of your code in Python cells in the notebook itself. Feel free to add any necessary cells.

### When submitting this homework:

**Make sure you have put your name at the top of each file**
    
**Make sure all output is present in your notebook prior to submission**

**If possible, please do not zip your files when uploading to Canvas**

#### <span style="color:#a50e3e;">Exercise 1. </span>  Perform multiclass classification on the MNIST dataset

Use the *multiclass softmax* cost function detailed in [Section 10.2 of the course notes](https://jermwatt.github.io/mlrefined/blog_posts/10_Linear_multiclass_classification/10_2_Multiclass_classification.html) to perform multiclass classification on a preprocessed subset of $10,000$ images from the [MNIST handwritten digit dataset](https://en.wikipedia.org/wiki/MNIST_database), which is located in the same folder as this notebook and called

``mnist_contrast_normalized.csv``
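For reference, here is a sketch of this cost in the form implemented by `multiclass_softmax` later in this notebook. The notation is assumed here: $P$ is the number of points, $C = 10$ the number of classes, $\mathring{\mathbf{x}}_p$ the $p$-th input with a $1$ stacked on top, $\mathbf{w}_c$ the $c$-th column of the weight matrix $\mathbf{W}$, $y_p$ the label of the $p$-th point, and $\boldsymbol{\omega}$ the weights without the bias row.

$$
g(\mathbf{W}) = \frac{1}{P}\left[\sum_{p=1}^{P}\left(\log\sum_{c=0}^{C-1} e^{\mathring{\mathbf{x}}_p^T\mathbf{w}_c} - \mathring{\mathbf{x}}_p^T\mathbf{w}_{y_p}\right) + \lambda\,\left\Vert\boldsymbol{\omega}\right\Vert_F^2\right]
$$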

Make sure you

- Set the regularization parameter `lam` from the multiclass softmax to zero for your experiments


- Use the gradient descent `Python` code block shown in [Section 6.4 of the course notes](https://jermwatt.github.io/mlrefined/blog_posts/6_First_order_methods/6_4_Gradient_descent.html). 


- Standard normalize each feature of the input to greatly speed up gradient descent. This simply involves subtracting the mean and dividing by the standard deviation of each feature, as discussed in Sections 8.4, 9.4, and 10.3 of the course notes


- Write a custom two-panel function in `Python` to show the cost function value per iteration of gradient descent in one panel, and the corresponding number of misclassifications per iteration in the other.   You can find an efficient implementation of the multiclass misclassification counting function in Section 10.2 of the course notes 


- Use a steplength of the form $10^{\gamma}$ where $\gamma$ is an integer, and try to find the largest steplength of this form that produces reasonable convergence. Having normalized your input, you might be surprised how large a steplength value you can use in practice! One way to find a working steplength is to try various values for just a few steps (e.g., 5 or 10) of gradient descent and plot the cost function / misclassification histories over these short runs to visually confirm that the trend is decreasing. Pick the largest steplength value that produces an overall decreasing trend, then make a new run with it for a larger number of steps.


- Using at most 300 iterations of gradient descent you should be able to learn parameters that give fewer than 300 misclassifications (around 97% accuracy).
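The steplength search described in the list above can be sketched on a small synthetic problem. This is only an illustration under stated assumptions: the toy data, the hand-written gradient (used here instead of `autograd`), and the helper names `softmax_cost`, `softmax_grad`, and `short_run` are all hypothetical and not part of the course code. "Decreasing trend" is approximated by comparing the last cost value with the first.

```python
import numpy as np

def softmax_cost(w, x, y):
    # x: (N, P) inputs, y: (P,) integer labels, w: (N + 1, C) weights
    x_aug = np.vstack((np.ones((1, x.shape[1])), x))
    evals = x_aug.T @ w                                   # (P, C)
    m = evals.max(axis=1, keepdims=True)                  # shift for numerical stability
    lse = m[:, 0] + np.log(np.exp(evals - m).sum(axis=1))
    return np.mean(lse - evals[np.arange(y.size), y])

def softmax_grad(w, x, y):
    # analytic gradient of softmax_cost: inputs times (probabilities - one-hot labels)
    x_aug = np.vstack((np.ones((1, x.shape[1])), x))
    evals = x_aug.T @ w
    evals = evals - evals.max(axis=1, keepdims=True)
    probs = np.exp(evals)
    probs /= probs.sum(axis=1, keepdims=True)
    probs[np.arange(y.size), y] -= 1.0
    return x_aug @ probs / y.size                         # (N + 1, C)

def short_run(alpha, w, x, y, steps=10):
    # take a few gradient descent steps and record the cost history
    history = [softmax_cost(w, x, y)]
    for _ in range(steps):
        w = w - alpha * softmax_grad(w, x, y)
        history.append(softmax_cost(w, x, y))
    return history

# synthetic 3-class data standing in for the real dataset (assumption)
rng = np.random.default_rng(0)
P, N, C = 300, 5, 3
y_toy = rng.integers(0, C, size=P)
centers = 3.0 * rng.standard_normal((N, C))
x_toy = centers[:, y_toy] + rng.standard_normal((N, P))
x_toy = (x_toy - x_toy.mean(axis=1, keepdims=True)) / x_toy.std(axis=1, keepdims=True)
w0 = 0.1 * rng.standard_normal((N + 1, C))

# try steplengths 10**gamma over short runs, keep those whose cost went down overall
results = {gamma: short_run(10.0 ** gamma, w0, x_toy, y_toy) for gamma in range(-3, 4)}
decreasing_gammas = [g for g, h in results.items() if np.isfinite(h[-1]) and h[-1] < h[0]]
best_gamma = max(decreasing_gammas)
print("largest steplength with a decreasing trend: 10 **", best_gamma)
```

On the real data the same loop would call the notebook's `gradient_descent` with a small `max_its`, and the histories would be inspected visually rather than by the crude first-versus-last comparison used here.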

Below are a few `Python` cells, including one that loads the basic `autograd` and `matplotlib` libraries, one that loads the dataset, and a suggested initialization for gradient descent.

Import necessary libraries.

In [5]:
# import necessary library
import autograd.numpy as np   
from autograd import value_and_grad 
import matplotlib.pyplot as plt

# this is needed to compensate for %matplotlib notebook's tendency to blow up images when plotted inline
%matplotlib notebook
from matplotlib import rcParams
rcParams['figure.autolayout'] = True

Import data.

In [6]:
data = np.loadtxt('mnist_test_contrast_normalized.csv',delimiter = ',')
x = data[:,:-1].T
y = data[:,-1:]

You can use an initialization for your runs of gradient descent of the following form.
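The initialization used in the gradient descent cell later in this notebook is a small random matrix with one row per input feature plus a bias row, and one column per class. A minimal sketch of that form follows; the feature count of 784 is an assumption based on 28 x 28 pixel images, and in the notebook it is read from `x.shape[0]` instead.

```python
import numpy as np

num_features = 784   # assumption: 28 x 28 pixel inputs; the notebook uses x.shape[0]
num_classes = 10     # digits 0 through 9

# small random weights: one bias row plus one row per feature, one column per class
w_init = 0.1 * np.random.randn(num_features + 1, num_classes)
print(w_init.shape)  # (785, 10)
```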

In [26]:
# How to show pictures of each number?
# Here I think I didn't fully understand the question.

In [27]:
# Upload following functions given by repo

In [28]:
# gradient descent

# using an automatic differentiator - like the one imported via the statement below - makes coding up gradient descent a breeze
from autograd import value_and_grad 

# gradient descent function - inputs: g (input function), alpha_choice (steplength value or 'diminishing'), max_its (maximum number of iterations), w (initialization)
def gradient_descent(g,alpha_choice,max_its,w):
    # compute the gradient function of our input function - note this is a function too
    # that - when evaluated - returns both the gradient and function evaluations (remember
    # as discussed in Chapter 3 we always get the function evaluation 'for free' when we use
    # an Automatic Differentiator to evaluate the gradient)
    gradient = value_and_grad(g)

    # run the gradient descent loop
    weight_history = []      # container for weight history
    cost_history = []        # container for corresponding cost function history
    alpha = 0
    for k in range(1,max_its+1):
        # check if diminishing steplength rule used
        if alpha_choice == 'diminishing':
            alpha = 1/float(k)
        else:
            alpha = alpha_choice
        
        # evaluate the gradient, store current weights and cost function value
        cost_eval,grad_eval = gradient(w)
        weight_history.append(w)
        cost_history.append(cost_eval)

        # take gradient descent step
        w = w - alpha*grad_eval
            
    # collect final weights
    weight_history.append(w)
    # compute final cost function value via g itself (since we aren't computing 
    # the gradient at the final step we don't get the final cost function value 
    # via the Automatic Differentiator) 
    cost_history.append(g(w))  
    return weight_history,cost_history



def model(x,w):
    # tack a 1 onto the top of each input point all at once
    o = np.ones((1,np.shape(x)[1]))
    x = np.vstack((o,x))
    
    # compute linear combination and return
    a = np.dot(x.T,w)
    return a


# multiclass perceptron regularized by the summed length of all normal vectors

lam = 10**(-1)  # our regularization parameter (note: the exercise asks for lam = 0)

def multiclass_softmax(w):        
    # pre-compute predictions on all points
    all_evals = model(x_normalized,w)
    
    # compute softmax across data points
    a = np.log(np.sum(np.exp(all_evals),axis = 1)) 
    
    # compute cost in compact form using numpy broadcasting
    b = all_evals[np.arange(len(y)),y.astype(int).flatten()]
    cost = np.sum(a - b)
    
    # add regularizer
    cost = cost + lam*np.linalg.norm(w[1:,:],'fro')**2
    
    # return average
    return cost/float(len(y))

def multiclass_perceptron(w):        
    # pre-compute predictions on all points
    all_evals = model(x,w)
    
    # compute maximum across data points
    a =  np.max(all_evals,axis = 1)        
    
    # compute cost in compact form using numpy broadcasting
    b = all_evals[np.arange(len(y)),y.astype(int).flatten()]
    cost = np.sum(a - b)
    
    # add regularizer
    cost = cost + lam*np.linalg.norm(w[1:,:],'fro')**2
    
    # return average
    return cost/float(len(y))

def multiclass_counting_cost(w):                
    # pre-compute predictions on all points
    all_evals = model(x_normalized,w)

    # compute predictions of each input point
    y_predict = (np.argmax(all_evals,axis = 1))[:,np.newaxis]

    # compare predicted label to actual label
    count = np.sum(np.abs(np.sign(y - y_predict)))

    # return number of misclassifications
    return count



def standard_normalizer(x):
    # compute the mean and standard deviation of the input
    x_means = np.mean(x,axis = 1)[:,np.newaxis]
    x_stds = np.std(x,axis = 1)[:,np.newaxis]   

    # create standard normalizer function based on input data statistics
    normalizer = lambda data: (data - x_means)/x_stds
    
    # return normalizer
    return normalizer
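One caveat with this normalizer: a feature that is constant across all points has zero standard deviation, and the division then produces NaN or infinite values. Whether this dataset contains such features is not checked here. Below is a minimal sketch of a guarded variant. The name `safe_standard_normalizer` and the choice to replace zero deviations with 1 are assumptions, not part of the course code.

```python
import numpy as np

def safe_standard_normalizer(x):
    # per-feature mean and standard deviation (features are rows, points are columns)
    x_means = np.mean(x, axis=1)[:, np.newaxis]
    x_stds = np.std(x, axis=1)[:, np.newaxis]

    # constant features are already mapped to zero by mean-centering, so divide them by 1
    x_stds = np.where(x_stds == 0, 1.0, x_stds)

    return lambda data: (data - x_means) / x_stds

# toy input: the second feature is constant
x_demo = np.array([[1.0, 2.0, 3.0],
                   [5.0, 5.0, 5.0]])
x_demo_normalized = safe_standard_normalizer(x_demo)(x_demo)
print(x_demo_normalized)
```

The constant row comes out as all zeros instead of NaN, while the other row is normalized as before.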





In [32]:
# normalization
normalizer = standard_normalizer(x)
x_normalized = normalizer(x)

# gradient descent
w = 0.1*np.random.randn(x.shape[0] + 1,10)
g = multiclass_softmax
max_its = 300
alpha_choice_1 = 0.1    # the two workable steplengths of the form 10**gamma are 0.1 and 1
alpha_choice_2 = 1
# Because alpha = 0.1 converges slowly, it needs about 1600 iterations to reach 97% accuracy.
weight_history_1,cost_history_1 = gradient_descent(g,alpha_choice_1,max_its,w)
weight_history_2,cost_history_2 = gradient_descent(g,alpha_choice_2,max_its,w)
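With a large steplength the entries of `all_evals` can grow large, and `np.exp` may then overflow inside the softmax cost. A common remedy is the log-sum-exp trick, which subtracts the row-wise maximum before exponentiating and adds it back afterwards. Here is a minimal sketch in plain NumPy; the function name `stable_log_sum_exp` is hypothetical, and the notebook's own cost does not use it.

```python
import numpy as np

def stable_log_sum_exp(all_evals):
    # subtract each row's maximum so the largest exponent is exp(0) = 1
    row_max = np.max(all_evals, axis=1, keepdims=True)
    return row_max[:, 0] + np.log(np.sum(np.exp(all_evals - row_max), axis=1))

# second row would overflow with a direct np.exp call
evals_demo = np.array([[1.0, 2.0, 3.0],
                       [1000.0, 1001.0, 1002.0]])

naive_first_row = np.log(np.sum(np.exp(evals_demo[:1]), axis=1))
stable = stable_log_sum_exp(evals_demo)
print(stable)
```

Mathematically the shift leaves the value unchanged, so it could replace the line computing `a` in `multiclass_softmax` without altering the cost.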

In [29]:
'''
w = 0.1*np.random.randn(x.shape[0] + 1,10)
g = multiclass_perceptron
max_its = 300
alpha_choice = 2
weight_history,cost_history = gradient_descent(g,alpha_choice,max_its,w)
'''
#compute misclassification history
#counting_cost = cost_lib.Setup(x,y,'multiclass_counter').cost_func
#count_history = [counting_cost(v) for v in weight_history]  # compute misclassification history


## Note that alpha = 1 is the right answer: of the two steplengths, only it gives fewer than 300 misclassifications within 300 iterations.

In [39]:
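def plot(w_hist,max_its):
    # misclassification count and softmax cost for every weight in the history
    nu = [multiclass_counting_cost(w) for w in w_hist]
    soft = [multiclass_softmax(w) for w in w_hist]
    x_vals = np.arange(max_its + 1)
    fig, ax = plt.subplots(1, 2, figsize=(10,4)) 
    # report the final number of misclassifications
    print(nu[-1])
    # left panel: cost per iteration
    ax[0].plot(x_vals,soft)
    ax[0].set_xlabel('iteration')
    ax[0].set_ylabel('softmax cost')
    # right panel: misclassifications per iteration
    ax[1].plot(x_vals,nu,color = 'r')
    ax[1].set_xlabel('iteration')
    ax[1].set_ylabel('misclassifications')
    plt.show()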

# result
plot(weight_history_1, 300)
plot(weight_history_2, 300)



690.0



216.0
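The two printed values are the final misclassification counts for the steplengths $0.1$ and $1$, in that order. Converting them to accuracy on the $10,000$ images makes the comparison with the 97% target direct:

```python
num_points = 10000   # size of the MNIST subset stated in the exercise

# final misclassification counts printed above, keyed by steplength
final_counts = {0.1: 690, 1: 216}

accuracies = {alpha: 1 - count / num_points for alpha, count in final_counts.items()}
for alpha, acc in accuracies.items():
    print(f"alpha = {alpha}: accuracy = {acc:.2%}")
# alpha = 0.1: accuracy = 93.10%
# alpha = 1: accuracy = 97.84%
```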
