# Deep learning from scratch: homework 3

### General instructions

Complete the exercises listed below in this Jupyter notebook - leaving all of your code in Python cells in the notebook itself.  Feel free to add any necessary cells.  

Included with the notebook are 

- a custom utilities file called `custom_utilities.py` that provides various plotting functions (used in unit tests to help you debug) as well as some other processing code


- datasets for exercises: `unnorm_linregress_data.csv`, `highdim_multirange_linregress.csv`, `student_debt.csv`, and  `noisy_sin_sample.csv`

Be sure these files are located in the same directory as this notebook.

### When submitting this homework:
    
**Make sure all output is present in your notebook prior to submission**

In [217]:
# import autograd functionality
import autograd.numpy as np
# note: newer autograd releases expose this as autograd.misc.flatten.flatten_func
from autograd.util import flatten_func
from autograd import grad as compute_grad   

# import custom utilities
import custom_utilities as util

# import various other libraries
import copy
import matplotlib.pyplot as plt

# this is needed to compensate for %matplotlib notebook's tendency to blow up images when plotted inline
from matplotlib import rcParams
rcParams['figure.autolayout'] = True

%matplotlib notebook
%load_ext autoreload
%autoreload 2



Feel free to use the ``gradient_descent`` function below for this exercise.

In [218]:
# gradient descent function
def gradient_descent(g,w,alpha,max_its,beta,version):    
    # flatten the input function, create gradient based on flat function
    g_flat, unflatten, w = flatten_func(g, w)
    grad = compute_grad(g_flat)

    # record history
    w_hist = []
    w_hist.append(unflatten(w))

    # initialize the momentum term
    z = np.zeros((np.shape(w)))
    
    # gradient descent loop
    for k in range(max_its):   
        # plug in value into func and derivative
        grad_eval = grad(w)
        grad_eval.shape = np.shape(w)

        ### normalized or unnormalized descent step? ###
        if version == 'normalized':
            grad_norm = np.linalg.norm(grad_eval)
            if grad_norm == 0:
                grad_norm += 10**-6*np.sign(2*np.random.rand(1) - 1)
            grad_eval /= grad_norm
            
        # take descent step with momentum
        z = beta*z + grad_eval
        w = w - alpha*z

        # record weight update
        w_hist.append(unflatten(w))

    return w_hist
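The update inside the loop can be tried in isolation. The following is a minimal sketch, not the notebook's implementation: it uses only the standard library, a hand-written gradient instead of autograd, and a hypothetical test function `toy_grad`. Unlike the function above, it simply skips the division when the gradient is exactly zero.

```python
import math

def descent_step(w, z, grad, alpha, beta, normalized):
    """One momentum step: z <- beta*z + grad, then w <- w - alpha*z."""
    if normalized:
        norm = math.sqrt(sum(g * g for g in grad))
        if norm > 0:  # simplification: leave an exactly-zero gradient untouched
            grad = [g / norm for g in grad]
    z = [beta * zi + gi for zi, gi in zip(z, grad)]
    w = [wi - alpha * zi for wi, zi in zip(w, z)]
    return w, z

# hypothetical test function g(w) = w0^2 + 10*w1^2 with its gradient written by hand
def toy_grad(w):
    return [2 * w[0], 20 * w[1]]

w, z = [1.0, 1.0], [0.0, 0.0]
w, z = descent_step(w, z, toy_grad(w), alpha=0.1, beta=0.0, normalized=True)
print(w)
```

With `normalized=True` and `beta=0`, every step has length exactly `alpha` regardless of how steep the function is; with `normalized=False` the step length scales with the gradient magnitude.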

#### <span style="color:#a50e3e;">Exercise 2. </span>  Normalizing the input of a student debt dataset

The cell below loads in and visualizes a student debt dataset.  Here the input is in increments of time, and the output is the corresponding total amount of student debt held in the United States.

In [219]:
# load data
data = np.loadtxt('student_debt.csv',delimiter = ',')
x = data[:,:-1]
y = data[:,-1:]

# make copy of input and output (for later)
x_orig = copy.deepcopy(x)
y_orig = copy.deepcopy(y)

# plot everything
demo = util.Visualizer()
demo.plot_data_fit(x,y)


**TO DO**

Compare the performance of gradient descent in tuning the Least Squares cost function on this dataset when you use the raw dataset versus when you normalize the input.  Use only $25$ iterations of gradient descent in each instance, and in each instance use the largest steplength value $\alpha$ of the form $10^{-\gamma}$ (where $\gamma$ is a positive integer) that produces convergence.  Use an initial point $\mathbf{w}^0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$.
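One way to find the largest workable steplength is to try $\gamma = 1, 2, 3, \ldots$ in turn and stop at the first run that converges. The sketch below makes several assumptions: it uses hypothetical stand-in data rather than `student_debt.csv`, a hand-written gradient instead of autograd, and a simple heuristic for "convergence" (the final cost is finite and lower than the starting cost). The names `run_descent` and `largest_stable_gamma` are illustrative only.

```python
import math

def least_squares(w, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        r = w[0] + w[1] * x - y
        total += r * r  # r * r overflows to inf rather than raising an error
    return total

def least_squares_grad(w, xs, ys):
    residuals = [w[0] + w[1] * x - y for x, y in zip(xs, ys)]
    return [2 * sum(residuals),
            2 * sum(r * x for r, x in zip(residuals, xs))]

def run_descent(xs, ys, alpha, max_its=25):
    w = [0.0, 0.0]  # initial point from the exercise
    costs = [least_squares(w, xs, ys)]
    for _ in range(max_its):
        g = least_squares_grad(w, xs, ys)
        w = [w[0] - alpha * g[0], w[1] - alpha * g[1]]
        costs.append(least_squares(w, xs, ys))
    return w, costs

def largest_stable_gamma(xs, ys, max_gamma=12):
    """Return the smallest gamma (largest alpha = 10**-gamma) whose run converges."""
    for gamma in range(1, max_gamma + 1):
        _, costs = run_descent(xs, ys, 10.0 ** (-gamma))
        # heuristic convergence test, an assumption of this sketch
        if math.isfinite(costs[-1]) and costs[-1] < costs[0]:
            return gamma
    return None

# hypothetical stand-in data: quarterly time stamps and a slowly growing output
xs = [2004.0 + 0.25 * i for i in range(40)]
ys = [0.25 + 0.02 * i for i in range(40)]
print(largest_stable_gamma(xs, ys))
```

The heuristic can be fooled by a run that diverges very slowly, so it is worth confirming the chosen value by looking at the cost function plot.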

**You should turn in:**
    
**1)** a contour plot in each instance with the gradient descent path drawn on top (as shown in ``Exercise 1``)


**2)** a cost function plot for each run of gradient descent



**3)** a short explanation summarizing what input normalization has done in this instance in terms of speeding up gradient descent

**Hint:**

Feel free to steal useful code chunks from the previous exercise!

In [220]:
# make our predict function
def predict(x,w):
    return w[0] + x*w[1]

# make predictions for the entire set of inputs simultaneously
w = np.random.randn(2,1)   # make random weights for our prediction
print('predictions for all our points')
print(predict(x,w).T)

# least squares
least_squares = lambda w: np.sum((predict(x,w) - y)**2)

# plot contour of cost
demo.draw_setup(least_squares, num_contours = 7,xmin = -2,xmax = 4,ymin = 7,ymax = 12)

predictions for all our points
[[ 2583.07579693  2583.46248921  2583.8491815   2584.10697636
   2584.36477121  2584.7514635   2585.13815579  2585.39595064  2585.6537455
   2586.04043778  2586.42713007  2586.68492493  2586.94271978
   2587.32941207  2587.71610436  2587.97389921  2588.23169407
   2588.61838636  2589.00507864  2589.2628735   2589.52066835
   2589.90736064  2590.29405293  2590.55184778  2590.80964264
   2591.19633493  2591.58302721  2591.84082207  2592.09861693
   2592.48530921  2592.8720015   2593.12979635  2593.38759121  2593.7742835
   2594.16097578  2594.41877064  2594.6765655   2595.06325778
   2595.44995007  2595.70774492]]


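For reference, the `predict` and `least_squares` functions in the cell above correspond to the model and cost written out below. The symbols $P$ (number of data points) and $g$ (the cost) are notation assumed for this note; the partial derivatives are what autograd computes automatically inside `gradient_descent`.

```latex
\text{predict}(x_p, \mathbf{w}) = w_0 + x_p w_1,
\qquad
g(\mathbf{w}) = \sum_{p=1}^{P} \left( w_0 + x_p w_1 - y_p \right)^2

\frac{\partial g}{\partial w_0} = 2 \sum_{p=1}^{P} \left( w_0 + x_p w_1 - y_p \right),
\qquad
\frac{\partial g}{\partial w_1} = 2 \sum_{p=1}^{P} \left( w_0 + x_p w_1 - y_p \right) x_p
```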

In [221]:
# run gradient descent initialized at 0
alpha = 10**(-9)
max_its = 25
w_init = np.zeros((2,1))

# run gradient descent
weight_history = gradient_descent(least_squares,w_init,alpha,max_its,beta = 0,version = 'unnormalized')

# plot cost function history
cost_history = [least_squares(v) for v in weight_history]
histories = [cost_history]
demo.compare_regression_histories(histories)


In [222]:
# plot history on contours
demo.draw_setup(least_squares,num_contours = 7,weight_history = weight_history, xmin = -3,xmax = 7,ymin = -1,ymax = 12)


In [223]:
# the original data and best fit line learned from our gradient descent run
w = weight_history[-1]  # take the final weight learned from our history
demo.plot_data_fit(x,y,predict = predict,weights = w)


In [224]:
# compute the mean and standard deviation of the input
x_mean = np.mean(x)
x_std = np.std(x)

# a normalization function 
def normalize(data,data_mean,data_std):
    normalized_data = (data - data_mean)/data_std
    return normalized_data

# cache a copy of the original input, then normalize
x_orig = copy.deepcopy(x)
x = normalize(x,x_mean,x_std)

# show contour plot
demo.draw_setup(least_squares,num_contours = 7, xmin = -1,xmax = 10,ymin = -1,ymax = 7)

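Weights learned on the normalized input can be converted back to the original input scale, which is handy when plotting a fit against raw data. The sketch below is a standard-library illustration with hypothetical inputs and hypothetical weights; `standardize` mirrors the `normalize` function above (population standard deviation, as `np.std` uses by default), and `to_original_units` is a helper introduced only for this example.

```python
import math

def standardize(xs):
    """Subtract the mean and divide by the population standard deviation."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs], mean, std

def to_original_units(w, mean, std):
    """Convert weights fit on standardized input into weights on raw input.

    w0 + w1*(x - mean)/std  ==  (w0 - w1*mean/std) + (w1/std)*x
    """
    slope = w[1] / std
    intercept = w[0] - w[1] * mean / std
    return [intercept, slope]

xs = [2004.0 + 0.25 * i for i in range(40)]  # hypothetical raw inputs
xn, mean, std = standardize(xs)

w_norm = [0.6, 0.2]                          # hypothetical weights on standardized input
w_raw = to_original_units(w_norm, mean, std)

# both parameterizations give the same prediction at any raw input
x_new = 2010.0
pred_norm = w_norm[0] + w_norm[1] * (x_new - mean) / std
pred_raw = w_raw[0] + w_raw[1] * x_new
print(abs(pred_norm - pred_raw) < 1e-9)
```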

In [225]:
# run gradient descent initialized at 0
alpha = 10**(-2)
max_its = 25
w_init = np.zeros((2,1))

# run gradient descent
weight_history = gradient_descent(least_squares,w_init,alpha,max_its,beta = 0,version = 'unnormalized')

# plot cost function history
cost_history = [least_squares(v) for v in weight_history]
histories = [cost_history]
demo.compare_regression_histories(histories)


In [226]:
# plot history on contours
demo.draw_setup(least_squares,num_contours = 7,weight_history = weight_history, xmin = -1,xmax = 10,ymin = -1,ymax = 7)


In [227]:
# the normalized data and best fit line learned from our gradient descent run
w = weight_history[-1]  # take the final weight learned from our history
demo.plot_data_fit(x,y,predict = predict,weights = w)


In [228]:
# a short explanation summarizing what input normalization has done in this instance in terms of speeding up gradient descent

# Answer
# Input normalization forces the cost function to treat the slope w1 and the bias w0 more equally
# (it is no longer far more sensitive to the value of one than to the other).
# Replacing the elliptical contours with nearly circular ones removes the long narrow valley we had before,
# so gradient descent has a much easier time finding the minimum of the adjusted cost function.
# Before normalization the contours are long, narrow ellipses: only a tiny steplength (10**-9) converges,
# and gradient descent needs many more iterations to reach the minimum cost value.
# After normalization the contours are roughly circular: a much larger steplength (10**-2) converges,
# and gradient descent needs far fewer iterations to reach the minimum cost value.
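The contour-shape argument can be made quantitative. For the model `w[0] + x*w[1]`, the Least Squares cost is quadratic with Hessian $2\begin{bmatrix} P & \sum x_p \\ \sum x_p & \sum x_p^2 \end{bmatrix}$, and the ratio of its largest to smallest eigenvalue (the condition number) measures how stretched the contours are. The sketch below uses hypothetical stand-in inputs rather than the actual dataset, and `hessian_condition_number` is a helper written only for this illustration.

```python
import math

def hessian_condition_number(xs):
    """Eigenvalue ratio of 2 * [[P, sum(x)], [sum(x), sum(x^2)]]."""
    p = float(len(xs))
    a = 2 * p                        # entries of the symmetric matrix [[a, b], [b, c]]
    b = 2 * sum(xs)
    c = 2 * sum(x * x for x in xs)
    half_trace = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b * b)
    lam_max = half_trace + radius
    lam_min = (a * c - b * b) / lam_max  # determinant / lam_max avoids cancellation
    return lam_max / lam_min

def standardize(xs):
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]

xs_raw = [2004.0 + 0.25 * i for i in range(40)]  # hypothetical raw inputs
xs_norm = standardize(xs_raw)

cond_raw = hessian_condition_number(xs_raw)
cond_norm = hessian_condition_number(xs_norm)
print(cond_raw, cond_norm)
```

A condition number near 1 corresponds to circular contours; a very large one corresponds to the long narrow valley seen with the raw input, where the steplength must be small enough for the steepest direction and is then far too small for the flattest one.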
