# HW01

## Getting started

Throughout the semester, most of the homeworks will be a blend between mathematical analysis and coding implementations. 

To run this homework, as well as future homeworks, you will need to have Python 3, Jupyter notebook, NumPy, and a few other packages. The easiest way to get this installed and setup on your computer is to install Anaconda, which will automatically install everything you need. This can be found at https://www.anaconda.com/download/. Additionally, homeworks will require some knowledge of Markdown notation in Jupyter notebook.

In addition to the technical content of this homework, the goal of this assignment is to familiarize you with Python, NumPy, Jupyter notebooks, Markdown notation and the LaTeX commands used within, as well as the Gradescope interface for homework submissions.

If you have any questions about installing Anaconda or running the Jupyter notebooks, or about Markdown notation and the Gradescope interface, please post the questions on Piazza. For convenience, here is a link to this course's Piazza: https://piazza.com/class/jc6motxzn7s1qy.

First, we're going to implement the evaluation of a cost function. Then, we will calculate the gradient analytically. From this point on, we will implement the calculation of the gradient at a point, and verify this with a finite differences approximation. Next, we will implement gradient descent, and finally, we will calculate necessary optimality conditions for these problems analytically. This will be done for two common cost functions.

## Linear regression

Suppose we're given data points $(x_i,y_i)$ for $i = 1,\dots,N$. Here, $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$.

We want to form a linear approximation for $y_i$ given $x_i$, so we'll take functions of the form $f(x;\theta) = \theta^T x$. We'll suppose that the classification cost for each data point is:
$$
\ell(x_i,y_i;\theta) = 
\frac{1}{2} \| f(x_i;\theta) - y_i \|_2^2 = 
\frac{1}{2} \| \theta^T x_i - y_i\|_2^2
$$

Thus, the cost for the classification error across the whole dataset is given by:
$$
\sum_{i = 1}^N \ell(x_i,y_i;\theta) = 
\sum_{i = 1}^N \frac{1}{2} \| \theta^T x_i - y_i\|_2^2
$$

Let's define a matrix $X$ such that the $i$th row of $X$ is $x_i^\top$. Thus, $X$ is a $N \times n$ matrix. Similarly, define a vector $Y$ where the $i$th entry is $y_i$, so $Y$ is an $N \times 1$ vector. Then the classification error can be written:
$$
\ell(X,Y;\theta) =
\sum_{i = 1}^N \ell(x_i,y_i;\theta) = 
\frac{1}{2} \| X \theta - Y \|_2^2
$$
This is often referred to as the *empirical risk*.

When we are given the dataset, each particular $\theta$ incurs some cost. First, write code to evaluate the empirical risk for the provided dataset.

### Cost evaluation implementation

You will be implementing a function which evaluates the empirical risk. 

First, run the '`Load packages.`' cell to load packages. The code has already been written for you. (This only needs to be done when you first open the Jupyter notebook shell.) 

Next, run the '`Generate dummy data.`' cell to generate dummy data. Time-permitting, it's always best to simulate data when first implementing and testing machine learning algorithms. Afterward, real data can be fed in and one can separate bugs in the implementation of the algorithm from quirks of the dataset more easily.

The '`Function definition.`' cell for the function definition has the function interface defined, but needs to be implemented.

The '`Call the linear regression cost evaluation function.`' cell evaluates your function on a test point and outputs the value. Run this after running the cell defining the function definition.

In [1]:
# Load packages.

import numpy as np
import csv

In [2]:
# Generate dummy data.

N = 50
n = 5

np.random.seed(15)

X = np.random.rand(N,n)
true_theta = np.random.rand(n)
Y = np.dot( X, true_theta ) + (np.random.randn(N) / 100)

random_theta = np.random.rand(n)

# Task 1: Complete the function implementation below.

In [3]:
# Function definition.

def linear_regression_cost(X,Y,theta):
    
    return np.linalg.norm(X@theta - Y)/2

In [4]:
# Call the linear regression cost evaluation function.

print('cost at true theta:')
print(linear_regression_cost(X,Y,true_theta))

print('cost at a random theta:')
print(linear_regression_cost(X,Y,random_theta))

cost at true theta:
0.0334452721755
cost at a random theta:
1.0882985358


### Gradient calculation

Now, let's calculate the gradient. Please type in the gradient of $\ell(X,Y;\theta)$ with respect to $\theta$ in the Markdown cell below. Note that $X$ and $Y$ are the data, which is seen as fixed in this context.

The gradient must be typed in LaTeX notation. LaTeX is a typesetting language which is standard for usage in mathematical typesetting. If you have not used LaTeX before, this Jupyter Notebook itself has most of the notation required for this problem, so double-click any Markdown cell to see the syntax that generates different equations. Then, Run the Markdown cell to format it.

# Task 2: Calculate the gradient

Modify the LaTeX equation below to express the gradient of the loss function with respect to parameters $\theta$. (You can edit this Markdown cell by double clicking.)

$$
\nabla_\theta \ell(X,Y;\theta) = 2 \theta^T \theta X - 2 \theta^T Y
$$

### Gradient implementation

Now that you've explicitly calculated an expression for the gradient, let's write a function that evaluates this explicit expression.

# Task 3: Implement the gradient

In [5]:
# Function definition. 

def linear_regression_gradient(X,Y,theta):
        
    return X.T@X@theta - Y.T@X

In [6]:
# Call the linear regression gradient function.

print('gradient at true theta:')
print(linear_regression_gradient(X,Y,true_theta))

print('gradient at a random theta:')
print(linear_regression_gradient(X,Y,random_theta))

gradient at true theta:
[-0.00534415  0.00533884  0.00657914 -0.01886039 -0.04082041]
gradient at a random theta:
[ 1.63013637  1.83725472  2.57619965  4.72305271 -0.14001314]


### Finite differences approximations of the gradient

For more general machine learning problems, it's often difficult to analytically calculate the gradient. (Linear regression is one of the easiest in this regard.) In these settings, a common trick is to use a finite-difference approximation of the gradient.

Additionally, even when the gradient can be calculated, finite difference implementations can help validate that there were no errors in implementation.

Recall that the gradient $\nabla f(x)$ has $\frac{\partial}{\partial x_i} f(x)$ as its $i$th element. By definition:
$$
\frac{\partial}{\partial x_i} f(x) = \lim_{h \downarrow 0} \frac{f(x + h e_i) - f(x)}{h}
$$
Recall that $e_i$ is the vector which has 1 in the $i$th coordinate and 0 everywhere else. So, if we do this for $i = 1,\dots,n$, we can approximate each element of the gradient vector.

There are a few methods to estimate finite differences, but usually the centered difference method is best. We will implement this finite-differences method now.
So, we take $h$ very small and calculate:
$$
\frac{\partial}{\partial x_i} f(x) \approx \frac{f(x + \frac{1}{2}h e_i) - f(x - \frac{1}{2}h e_i)}{h}
$$

# Task 4: Implement the finite differences gradient approximation

In [7]:
# Function definition. 

def finite_differences(f, x, h):
    
    return (f(x+h*np.ones(x.shape)/2)-f(x-h*np.ones(x.shape)/2))/h

In [8]:
# Compare the finite differences method with the analytic gradient.

print('analytic gradient at true theta:')
print(linear_regression_gradient(X,Y,true_theta))

print('finite difference approximation at true theta:')
print(finite_differences((lambda theta : linear_regression_cost(X,Y,theta)), true_theta, 1e-3))

print('analytic gradient at random theta:')
print(linear_regression_gradient(X,Y,random_theta))

print('finite difference approximation at random theta:')
print(finite_differences((lambda theta : linear_regression_cost(X,Y,theta)), random_theta, 1e-3))

analytic gradient at true theta:
[-0.00534415  0.00533884  0.00657914 -0.01886039 -0.04082041]
finite difference approximation at true theta:
-0.393593444872
analytic gradient at random theta:
[ 1.63013637  1.83725472  2.57619965  4.72305271 -0.14001314]
finite difference approximation at random theta:
2.44109268283


### Implement gradient descent

In class, we discussed gradient descent algorithms. Let us implement the gradient update now. Recall that the gradient step is simply:
$$
x^+ = x - \alpha \nabla f(x)
$$

# Task 5: Implement gradient descent

In [9]:
# Function definition.

def gradient_step( x, gradient, alpha ):
    
    return x - alpha*gradient

In [10]:
# Use the gradient_step function to iterate several times and perform gradient descent.

theta_s = random_theta
print('initial theta:')
print(theta_s)

for t in range(500):
    theta_s = gradient_step(theta_s, linear_regression_gradient(X,Y,theta_s),0.01)

print('final theta:')
print(theta_s)

print('true theta:')
print(true_theta)

print('cost at true theta:')
print(linear_regression_cost(X,Y,true_theta))

print('cost at gradient descent theta:')
print(linear_regression_cost(X,Y,theta_s))

initial theta:
[ 0.83041891  0.2196513   0.43810492  0.88428638  0.04987925]
final theta:
[ 0.92450041  0.25729156  0.0781832   0.05390196  0.81262707]
true theta:
[ 0.92547864  0.26051745  0.08258404  0.05352811  0.80512636]
cost at true theta:
0.0334452721755
cost at gradient descent theta:
0.0320942329426


\pagebreak

### Calculate necessary optimality conditions

Recall that the first-order necessary condition for optimality is that $\nabla f(x^{opt}) = 0$. 

In our previous example, suppose $X^\top X$ is invertible. What is the optimal $\theta$ then?

# Task 6: Calculate necessary condition

Modify the LaTeX equation below to express the first-order necessary condition for optimality. (You can edit this Markdown cell by double clicking.)

The necessary condition becomes:
$$
\nabla f(x) = \vec{0}
$$

### Implement first-order condition to find optimum

We've found an expression for the optimal $\theta$, so now implement this in code and compare it with the results from gradient descent.

Note that inverting a matrix is frequently a very expensive operation, and, in practice, it is often computationally cheaper to left-divide or right-divide by a matrix. 

# Task 7: Implement necessary condition

In [11]:
def solve_linear_regression(X,Y):
    
    return np.linalg.inv(X.T@X)@X.T@Y

In [12]:
print('cost at true theta:')
print(linear_regression_cost(X,Y,true_theta))

print('cost at gradient descent theta:')
print(linear_regression_cost(X,Y,theta_s))

opt_theta = solve_linear_regression(X,Y)
print('optimal theta:')
print(opt_theta)

print('cost at optimal theta:')
print(linear_regression_cost(X,Y,opt_theta))

cost at true theta:
0.0334452721755
cost at gradient descent theta:
0.0320942329426
optimal theta:
[ 0.92450044  0.25729161  0.07818307  0.05390184  0.81262722]
cost at optimal theta:
0.0320942329419


# Task 8: Submit your HW

Now that you've finished, please click "File -> Download as -> PDF via LaTeX", and submit the PDF file on Gradescope. The Gradescope for this course can be found at https://gradescope.com/courses/14561. Homeworks will be collected through Gradescope throughout this semester, so take this opportunity to familiarize yourself with the interface.