In [None]:
#source : http://nbviewer.jupyter.org/github/jdwittenauer/ipython-notebooks/blob/master/notebooks/ml/ML-Exercise2.ipynb

# Logistic Regression

This notebook covers a Python-based solution for the second programming exercise of the machine learning class on Coursera. Please refer to the [exercise text](doc/tp4_0_doc.pdf) for detailed descriptions and equations.

[J. Brajard's course (fr)](doc/tp4_0_cours_J.Brajard.pdf) is helpful here. [Andrew Ng's course material (en)](doc/tp4_0_more_explainations[cs229].pdf) is interesting too.


In this exercise we'll implement logistic regression and apply it to a classification task.  We'll also improve the robustness of our implementation by adding regularization to the training algorithm and testing it on a more difficult problem.

## Logistic regression

In the first part of this exercise, we'll build a logistic regression model to predict whether a student gets admitted to a university.  Suppose that you are the administrator of a university department and you want to determine each applicant's chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression.  For each training example, you have the applicant's scores on two exams and the admissions decision.  To accomplish this, we're going to build a classification model that estimates the probability of admission based on the exam scores.

Let's start by examining the data.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import os
path = 'data' + os.sep + 'tp4_0_LogiReg_data.txt'
data = pd.read_csv(path, header=None, names=['Exam 1', 'Exam 2', 'Admitted'])
data.head()

Let's create a scatter plot of the two scores and use color coding to visualize if the example is positive (admitted) or negative (not admitted).
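
One way to draw this plot is sketched below. The function name `plot_admissions` and the marker and color choices are assumptions for illustration; only the column names come from the `read_csv` call above.

```python
import matplotlib.pyplot as plt


def plot_admissions(data):
    """Scatter the two exam scores, colored by admission outcome."""
    # Split the rows by label so each class gets its own color and marker.
    positive = data[data['Admitted'] == 1]
    negative = data[data['Admitted'] == 0]

    fig, ax = plt.subplots(figsize=(12, 8))
    ax.scatter(positive['Exam 1'], positive['Exam 2'],
               s=50, c='b', marker='o', label='Admitted')
    ax.scatter(negative['Exam 1'], negative['Exam 2'],
               s=50, c='r', marker='x', label='Not Admitted')
    ax.legend()
    ax.set_xlabel('Exam 1 Score')
    ax.set_ylabel('Exam 2 Score')
    return ax
```

In the notebook, calling `plot_admissions(data)` with the DataFrame loaded above would display the figure inline.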

It looks like there is a clear decision boundary between the two classes.  Now we need to implement logistic regression so we can train a model to predict the outcome.  The equations implemented in the following code samples are detailed in "ex2.pdf" in the "exercises" folder.

First we need to create a sigmoid function.  The code for this is pretty simple.
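
For reference, the sigmoid (logistic) function maps any real input to the interval (0, 1), and the logistic regression hypothesis applies it to the linear combination of the inputs. The notation below follows the usual convention for this exercise and is assumed to match the exercise text:

$$
g(z) = \frac{1}{1 + e^{-z}}, \qquad h_\theta(x) = g(\theta^T x)
$$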

In [None]:
def sigmoid(z):
    # TODO

Let's do a quick sanity check to make sure the function is working.
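
A minimal sketch of such a check, assuming the standard definition of the sigmoid. The implementation shown is one possible way to fill in the TODO above, not necessarily the intended one, and only the numeric properties are checked here rather than a plot.

```python
import numpy as np


def sigmoid(z):
    # Standard logistic function, applied element-wise to arrays.
    return 1.0 / (1.0 + np.exp(-z))


nums = np.arange(-10, 10, step=1)
values = sigmoid(nums)

# The curve should pass through 0.5 at z = 0, stay strictly inside (0, 1),
# and increase monotonically from left to right.
print(sigmoid(0))
print(values.min() > 0, values.max() < 1)
print(np.all(np.diff(values) > 0))
```

Plotting `nums` against `values` with `plt.plot` should then show the familiar S-shaped curve.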

In [None]:
#TODO : plot the sigmoid function 
nums = np.arange(-10, 10, step=1)
#...

Excellent!  Now we need to write the cost function to evaluate a solution.
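
For reference, the cost being implemented is the average cross-entropy over the $m$ training examples. This is the standard unregularized logistic regression cost, written with the hypothesis $h_\theta(x) = g(\theta^T x)$, where $g$ is the sigmoid; it is assumed to match the exercise text:

$$
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]
$$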

In [None]:
def cost(theta, X, y):
    #TODO

Now we need to do some setup, similar to what we did in TP3 for linear regression.
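
A minimal sketch of this setup on a small made-up DataFrame (the values in `toy` are invented for illustration and are not the exercise data). It assumes the same column layout as `data`, with the label in the last column, and uses plain NumPy arrays; the intended solution may use `np.matrix` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`: two exam scores and a 0/1 label.
toy = pd.DataFrame({'Exam 1': [35.0, 60.0, 80.0],
                    'Exam 2': [78.0, 86.0, 44.0],
                    'Admitted': [0, 1, 1]})

# Add a ones column so the intercept is handled by the matrix product.
toy.insert(0, 'Ones', 1)

# Features are every column except the last; the label is the last column.
cols = toy.shape[1]
X = toy.iloc[:, :cols - 1].values
y = toy.iloc[:, cols - 1:cols].values

# One parameter per column of X, initialized to zero.
theta = np.zeros(X.shape[1])

print(X.shape, theta.shape, y.shape)
```

Keeping `y` as a column (two-dimensional) or flattening it to one dimension is a design choice; whichever is used, the cost and gradient functions need to be written consistently with it.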

In [None]:
# add a ones column - this makes the matrix multiplication work out more easily
#TODO

# set X (training data) and y (target variable)
#TODO

# convert to numpy arrays and initialize the parameter array theta
#TODO

Let's quickly check the shape of our arrays to make sure everything looks good.

In [None]:
X.shape, theta.shape, y.shape

Now let's compute the cost for our initial solution (0 values for theta).
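
As a check on the expected value: with $\theta = 0$ the hypothesis equals $g(0) = 0.5$ for every example, so (assuming the standard cross-entropy cost) the result does not depend on the data:

$$
J(0) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log 0.5 + (1 - y^{(i)}) \log 0.5 \right] = -\log 0.5 = \log 2 \approx 0.693
$$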

In [None]:
#TODO

Looks good.  Next we need a function to compute the gradient (parameter updates) given our training data, labels, and some parameters theta.
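
For reference, the partial derivative of the standard unregularized cost with respect to each parameter $\theta_j$ is shown below, where $h_\theta(x) = g(\theta^T x)$ is the sigmoid of the linear combination and $m$ is the number of training examples. The notation is assumed to match the exercise text:

$$
\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
$$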

In [None]:
def gradient(theta, X, y):
    #TODO
    
    return grad

Note that we don't actually perform gradient descent in this function - we just compute a single gradient step.  In the exercise, an Octave function called "fminunc" is used to optimize the parameters given functions to compute the cost and the gradients.  Since we're using Python, we can use SciPy's "optimize" namespace to do the same thing.
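
To make the distinction concrete, the sketch below shows how a gradient function would be used inside a plain gradient descent loop, which is the part that an optimizer such as `fminunc` or SciPy takes care of. Everything here is an assumption for illustration: the toy data, the learning rate `alpha`, the iteration count, the helper `gradient_descent`, and the vectorized `cost` and `gradient` written for 1-D NumPy arrays (one possible way to fill in the TODOs, not necessarily the intended one).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cost(theta, X, y):
    # Average cross-entropy between labels y and predicted probabilities h.
    h = sigmoid(X.dot(theta))
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))


def gradient(theta, X, y):
    # Gradient of the cost only; no parameter update happens here.
    h = sigmoid(X.dot(theta))
    return X.T.dot(h - y) / len(y)


def gradient_descent(theta, X, y, alpha=0.1, iters=1000):
    # Hypothetical helper: repeatedly step against the gradient.
    for _ in range(iters):
        theta = theta - alpha * gradient(theta, X, y)
    return theta


# Made-up, non-separable toy data: a ones column plus one feature.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, -0.5, 0.5])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0])

theta0 = np.zeros(2)
theta_gd = gradient_descent(theta0, X, y)
print(cost(theta0, X, y), cost(theta_gd, X, y))
```

The second printed cost should be lower than the first. A dedicated optimizer reaches a comparable result without a hand-tuned learning rate, which is the reason for delegating this step.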

Let's look at a single call to the gradient method using our data and initial parameter values of 0.

In [None]:
gradient(theta, X, y)

Now we can use SciPy's truncated Newton (TNC) implementation to find the optimal parameters.

In [None]:
import scipy.optimize as opt
result = opt.fmin_tnc(func=cost, x0=theta, fprime=gradient, args=(X, y))
result
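
`fmin_tnc` returns a tuple whose first element is the optimized parameter array, followed by the number of function evaluations and a return code, which is why `result[0]` is used further down. The sketch below runs the same call on made-up toy data; the data and the vectorized `cost` and `gradient` (written for 1-D NumPy arrays) are assumptions for illustration, not the exercise solution.

```python
import numpy as np
import scipy.optimize as opt


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cost(theta, X, y):
    # Average cross-entropy for parameters theta.
    h = sigmoid(X.dot(theta))
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))


def gradient(theta, X, y):
    # Gradient of the cost with respect to theta.
    h = sigmoid(X.dot(theta))
    return X.T.dot(h - y) / len(y)


# Made-up, non-separable toy data: a ones column plus one feature.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, -0.5, 0.5])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0])

# Unpack the tuple: parameters, evaluation count, return code.
theta_opt, n_evals, return_code = opt.fmin_tnc(
    func=cost, x0=np.zeros(2), fprime=gradient, args=(X, y))
print(theta_opt, cost(theta_opt, X, y))
```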

Let's see what our cost looks like with this solution.

In [None]:
#TODO

Next we need to write a function that will output predictions for a dataset X using our learned parameters theta.  We can then use this function to score the training accuracy of our classifier.
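
A minimal sketch of such a function, assuming the usual rule of predicting 1 when the estimated probability is at least 0.5. The threshold, the 1-D array layout, and the toy parameters are assumptions for illustration; the course notes may define the rule differently.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def predict(theta, X):
    # Probability of the positive class for each row of X.
    probability = sigmoid(X.dot(theta))
    # Threshold at 0.5 to turn probabilities into 0/1 labels.
    return (probability >= 0.5).astype(int)


# Made-up parameters and inputs: a ones column plus one feature.
theta = np.array([0.0, 1.0])
X = np.array([[1.0, -3.0], [1.0, 2.0], [1.0, 0.5]])
y = np.array([0, 1, 0])

predictions = predict(theta, X)
accuracy = 100.0 * np.mean(predictions == y)  # percentage of matches
print(predictions, accuracy)
```

Comparing `predictions` with `y` element-wise and averaging gives the accuracy directly, without building an intermediate list of 0/1 flags.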

In [None]:
def predict(theta, X):
#TODO - see Julien's course p.6

In [None]:
theta_min = np.matrix(result[0])
predictions = predict(theta_min, X)
correct = [1 if ((a == 1 and b == 1) or (a == 0 and b == 0)) else 0 for (a, b) in zip(predictions, y)]
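# percentage of correct predictions (the original modulo only worked by accident for 100 examples)
accuracy = 100.0 * sum(correct) / len(correct)
print('accuracy = {0:.0f}%'.format(accuracy))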

Our logistic regression classifier correctly predicted whether a student was admitted 89% of the time.  Not bad!  Keep in mind that this is training set accuracy, though.  We didn't keep a hold-out set or use cross-validation to get an unbiased estimate of the accuracy, so this number is likely higher than the true performance (this topic is covered in a later exercise).