# Programming Assignment 3 - Logistic Regression (50 points)

For this assignment, you will be using the Breast Cancer Wisconsin dataset to create a classifier that can help diagnose patients.

Your task is to determine whether logistic regression on the features can predict if a tumor is benign or malignant. This is an important task, as it can save lives.

"Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image."

The ten real-valued features are different measurements of the cell nucleus. The official documentation describes the features as:

* radius (mean of distances from center to points on the perimeter)
* texture (standard deviation of gray-scale values)
* perimeter
* area
* smoothness (local variation in radius lengths)
* compactness (perimeter^2 / area - 1.0)
* concavity (severity of concave portions of the contour)
* concave points (number of concave portions of the contour)
* symmetry
* fractal dimension ("coastline approximation" - 1)

## Before you start

For this semester, the teaching staff of this course will be using Autograder to grade programming assignments. Here are a few things we would like you to know before starting. `PLEASE READ CAREFULLY.` Otherwise, you might lose points on some questions.

* If you see any blocks containing statements like `grader.check("Qxx")`, please `do not modify` them. You can add new cells to the notebook, but make sure there are `no other cells` between the answer cells containing the tag `# TODO Qxx` and grading cells like `grader.check("Qxx")`.

* If the instructions say that you are required to use certain names for output variables, please `follow the instructions`, and do not change the names of any given variables. You can still create new variables, but don't forget to `assign the output variables to correct values`. If the `type` of an output variable is specified, make sure the type of the variable is correct.

* You can use print statements to print out results throughout the notebook. However, if you have any `print statements within functions`, please make sure to put them `in comments` before you submit.

* For questions that require you to plot, please **_DO NOT MODIFY_** statements like `plt.show(block=False)`. Changing the statement would block the execution of the autograder and you might lose points on that question.

* Please `APPEND YOUR NYU NETID` to the name of your submission (for example, name your submission "HW1_prog_abc1234.ipynb" when you submit on Gradescope, replacing `abc1234` with your NYU NetID).

Good luck with this programming assignment!

# Start by importing the libraries

In [None]:
# Import Important Libraries
import sklearn
from sklearn.datasets import load_breast_cancer # taking included data set from Sklearn http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html
from sklearn.linear_model import LogisticRegression # importing Sklearn's logistic regression's module
from sklearn import preprocessing # preprocessing is what we do with the data before we run the learning algorithm
from sklearn.model_selection import train_test_split 
import numpy as np
# import math

import matplotlib.pyplot as plt
%matplotlib inline

# Load the data set.

In the code cell below, load the data from sklearn using the function imported above. Check the import statements and use the given function.

In [None]:
# TODO Q01
cancer = ...   # type in load_breast_cancer()

In [None]:
grader.check("Q01")

In [None]:
# VERIFY - Print the shape of data and target
print('Q01 - cancer.target.shape: ', cancer.target.shape)
print('Q01 - cancer.data.shape: ', cancer.data.shape)

In [None]:
# Read through the description of the dataset by uncommenting the line of code below
#print(cancer.DESCR)

# Data Pre-Processing
Scale after splitting the data into train and test sets; scaling matters here because we will be using gradient ascent.
* Use `train_test_split` to split the data (`75% train` and `25% test`) to `X_train`, `X_test`, `y_train`, `y_test` with `random_state` of 42
* Reshape `y_train` into 2D array `y_2d_train` and `y_test` into 2D array `y_2d_test`
* Use `preprocessing` to scale the data. Fit the scaler on the training data first, then apply that same fitted scaler to the test data.
* Augment the dataset with a column of ones
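
The order of these steps can be sketched on a small synthetic array (not the cancer data). All names ending in `_toy` or `_demo` are made up for this illustration, and the toy data is an assumption; only the sequence matters: split, fit the scaler on the training portion, reuse it on the test portion, then prepend the column of ones.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 20 examples, 2 features, binary labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=3.0, size=(20, 2))
y_toy = rng.integers(0, 2, size=20)

# 1. Split first (75% train, 25% test).
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

# 2. Reshape the labels into column vectors of shape (n, 1).
y_tr_2d_demo = y_tr_demo.reshape(-1, 1)
y_te_2d_demo = y_te_demo.reshape(-1, 1)

# 3. Fit the scaler on the training data only, then reuse it on the test data.
scaler_demo = StandardScaler().fit(X_tr_demo)
X_tr_scaled_demo = scaler_demo.transform(X_tr_demo)
X_te_scaled_demo = scaler_demo.transform(X_te_demo)

# 4. Prepend a column of ones so the first weight acts as the intercept.
ones_demo = np.ones((X_tr_scaled_demo.shape[0], 1))
X_tr_1_demo = np.hstack([ones_demo, X_tr_scaled_demo])
```

Fitting the scaler on the training portion alone keeps information about the test set out of the model; the test features will therefore not have exactly zero mean after scaling, which is expected.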

In [None]:
# TODO Q02
from sklearn.preprocessing import StandardScaler

...  

y_2d_train = ...
y_2d_test = ...

In [None]:
grader.check("Q02")

In [None]:
# VERIFY - Print the shape of X_train and y_2d_train
print('Q02 - X_train.shape: ', X_train.shape)
print('Q02 - y_2d_train.shape: ', y_2d_train.shape)

In [None]:
# VERIFY - Printing the names of all the features
print('Q02 - cancer.feature_names: ', cancer.feature_names)

In [None]:
# TODO Q03
# Append a column of ones to X_train
# ones is a vector of shape (n, 1)
ones = ...
# Append a column of ones at the beginning of X_train and save it in variable X_train_1(<np.ndarray>).
X_train_1 = ...

In [None]:
grader.check("Q03")

In [None]:
# VERIFY
print('Q03 - X_train_1.shape: ', X_train_1.shape)
print('Q03 - X_train_1: ', X_train_1)

# Implementing Logistic Regression Using Gradient Ascent

You will perform the following steps:
* write the sigmoid function $\sigma(z)=\frac{1}{1+e^{-z}}$
* initialize ${\bf w}$
* prediction: write the function to compute the probability of every example in $X$ belonging to class one
* write the log likelihood function (see lecture notes for the formula)
* write the gradient ascent algorithm
* plot the log likelihood vs. the number of iterations
* predict the class label (i.e. $0$ or $1$) for every example in $X$ for a given ${\bf w}$ and threshold $t$
* evaluate your hypothesis by using it to predict the class of the examples in the test set. Using these predicted values, you will then determine the precision, recall and F1 score on the test set
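
For reference, a commonly used form of the quantities in the steps above is written out here. This is the standard textbook formulation and is an assumption about notation; your lecture notes may scale the log likelihood or the gradient by $1/N$, so check them before relying on it. With $\hat{y}_i = \sigma({\bf x}_i^\top {\bf w})$ and $N$ training examples:

$$\ell({\bf w}) = \sum_{i=1}^{N} \Big[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \Big]$$

$$\nabla_{\bf w}\, \ell({\bf w}) = X^\top ({\bf y} - \hat{\bf y})$$

$$\textrm{gradient ascent update with learning rate } \alpha: \qquad {\bf w} \leftarrow {\bf w} + \alpha\, X^\top ({\bf y} - \hat{\bf y})$$

Because the update adds the gradient, each sufficiently small step moves ${\bf w}$ in a direction that increases $\ell({\bf w})$.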


### Sigmoid function

In [None]:
# TODO Q04
# Write the sigmoid function
def sigmoid(z):
    ...
    return ...

In [None]:
grader.check("Q04")

In [None]:
# VERIFY - Sigmoid of 0 should be equal to half
print('Q04 - sigmoid(0): ', sigmoid(0))

### Initialize ${\bf w}$

In [None]:
# TODO Q05
# Initialize w_init to a zero matrix with shape (X_train_1.shape[1],1)
w_init = ...

In [None]:
grader.check("Q05")

In [None]:
# VERIFY
print('Q05 - w_init.shape: ', w_init.shape)

### Prediction
Finish writing the function, `hypothesis`, that computes the probability of each example in $X$ belonging to class one (i.e. $\hat{\bf y}=\sigma(X{\bf w})$).
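
To make the shapes concrete, here is a small sketch of $\sigma(X{\bf w})$ on a hand-made matrix. The arrays `X_demo` and `w_demo` are made up for illustration and are not part of the assignment; the first column of `X_demo` is the column of ones.

```python
import numpy as np

# Hypothetical design matrix: 3 examples, 1 feature, plus a leading column of ones.
X_demo = np.array([[1.0, 0.0],
                   [1.0, 2.0],
                   [1.0, -2.0]])

# Hypothetical weights: intercept 0, feature weight 1. Shape (2, 1).
w_demo = np.array([[0.0],
                   [1.0]])

# Linear scores, shape (3, 1): one score per example.
z_demo = X_demo @ w_demo

# Sigmoid applied elementwise turns each score into a probability in (0, 1).
probs_demo = 1.0 / (1.0 + np.exp(-z_demo))
print(probs_demo.shape)
print(probs_demo.ravel())
```

A score of 0 maps to probability 0.5, and scores of equal size and opposite sign map to probabilities that sum to 1.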

In [None]:
# TODO Q06
# Write the hypothesis function which assumes the design matrix X is augmented with a column of ones
def hypothesis(X, w):
    ...
    return ...

In [None]:
grader.check("Q06")

In [None]:
# TODO Q07 
# Compute y_hat_init(<np.ndarray>) using your hypothesis function with arguments X_train_1 and w_init
y_hat_init = hypothesis(X_train_1,w_init)

In [None]:
grader.check("Q07")

In [None]:
# VERIFY
# print('Q07 - y_hat_init: ', y_hat_init)

### Log Likelihood Function
Write the code to calculate the log likelihood as discussed in class.
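
A quick sanity check is possible before any training, assuming the log likelihood is the unnormalized sum $\sum_i [y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)]$. With all weights at zero, every predicted probability is 0.5, so the value is $N \log 0.5$ regardless of the labels. The sketch below assumes $N = 426$ training examples, which is what a 75/25 split of the 569-example dataset produces; the labels are random placeholders.

```python
import numpy as np

# Assumption: 569 examples with 25% held out leaves 426 for training.
n_train_demo = 426

# Placeholder labels; with p = 0.5 everywhere the labels do not affect the result.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=(n_train_demo, 1)).astype(float)

# Zero weights give a probability of 0.5 for every example.
p_demo = np.full((n_train_demo, 1), 0.5)

ll_demo = float(np.sum(y_demo * np.log(p_demo) + (1 - y_demo) * np.log(1 - p_demo)))
print(ll_demo)
print(n_train_demo * np.log(0.5))
```

Both printed numbers should agree with each other up to floating-point rounding.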

In [None]:
# TODO Q08
# Write the log likelihood function
def log_likelihood(X, y, w):
 
    ...
    return ...

In [None]:
grader.check("Q08")

In [None]:
# VERIFY - With X_train_1, y_2d_train, and w_init, the value should be equal to -295.2806989185367.
print('Q08 - likelihood: ', log_likelihood(X_train_1, y_2d_train, w_init))

### Gradient Ascent

In [None]:
# TODO Q09
# Write the gradient ascent function
def Gradient_Ascent(X, y, learning_rate, num_iters):
    # We assume X has been augmented with a column of ones
    
    # Number of training examples.
    N = X.shape[0]
    
    # Initialize w(<np.ndarray>). Zeros vector of shape X.shape[1],1
    w = np.zeros((X.shape[1],1))
    
    # Initialize a list(<list>) to store the log likelihood, recorded every 10 iterations.
    log_likelihood_values = []
    
    # Gradient Ascent - local optimization technique
    ...
        
        # Computing log likelihood of seeing examples for current value of w
        if (i % 10) == 0:
            log_likelihood_values.append(log_likelihood(X, y, w))
            print(log_likelihood(X, y, w))
        
    return w, log_likelihood_values

In [None]:
grader.check("Q09")

In [None]:
# Please try many different values for the learning rate (including very small values).
learning_rate = ...
num_iters = ...
# Calculate w and likelihood values using Gradient_Ascent with X_train_1, y_2d_train
w, log_likelihood_values = Gradient_Ascent(X_train_1, y_2d_train, learning_rate, num_iters)
print(w, log_likelihood_values)

In [None]:
grader.check("Q10")

### Plotting Log-Likelihood vs. Number of Iterations

In [None]:
# Run this cell to plot Log-Likelihood vs. Number of Iterations.
iters = np.array(range(0,num_iters,10))
plt.plot(iters,log_likelihood_values,'.-',color='green')
plt.xlabel('Number of iterations')
plt.ylabel('Log-Likelihood')
plt.title("Log-Likelihood vs Number of Iterations.")
plt.grid()
plt.show(block=False)

You should see the log-likelihood increase as the number of iterations increases.
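
The behavior of the curve can be illustrated on synthetic data. The sketch below is not the assignment's `Gradient_Ascent`; it is a minimal loop on made-up data (`X_syn`, `y_syn`, `true_w_syn` are all hypothetical) that assumes the unnormalized log likelihood and the update ${\bf w} \leftarrow {\bf w} + \alpha X^\top({\bf y} - \hat{\bf y})$, with a step size chosen small enough for this toy problem.

```python
import numpy as np

def sigmoid_demo(z):
    # Elementwise logistic function.
    return 1.0 / (1.0 + np.exp(-z))

def log_lik_demo(X, y, w):
    # Unnormalized Bernoulli log likelihood (an assumed convention).
    p = sigmoid_demo(X @ w)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Hypothetical synthetic data: 50 examples, a column of ones plus 2 features.
rng = np.random.default_rng(0)
X_syn = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
true_w_syn = np.array([[0.5], [2.0], [-1.0]])
y_syn = (rng.random((50, 1)) < sigmoid_demo(X_syn @ true_w_syn)).astype(float)

w_syn = np.zeros((3, 1))
alpha_demo = 0.01  # small step size chosen for this toy problem
history_demo = []
for _ in range(200):
    # Gradient of the log likelihood with respect to w.
    grad = X_syn.T @ (y_syn - sigmoid_demo(X_syn @ w_syn))
    # Ascent: move along the gradient to increase the log likelihood.
    w_syn = w_syn + alpha_demo * grad
    history_demo.append(log_lik_demo(X_syn, y_syn, w_syn))

print(history_demo[0], history_demo[-1])
```

With a step size that is too large the recorded values can oscillate or drop instead, which is one reason to try several learning rates.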

### Predict the class label for every example in $X$ for a given ${\bf w}$ and $t$

In [None]:
# TODO - Given a set of examples, write the function that predicts the class of each example: 0 if the probability of belonging to class 1 is < t, and 1 otherwise - 10 points
def predict_class(X, w, t):
    ...
    return ...

In [None]:
grader.check("Q11")

### Precision, recall and F1: Evaluating your hypothesis using the test dataset
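
As a worked example of the three metrics, the sketch below uses the counts mentioned in the hints of the template code (88 true positives, 1 false positive, 1 false negative). Your own counts depend on the weights you learned, so treat these numbers as illustrative; the variable names are made up for this example.

```python
# Counts taken from the hints in the template; yours may differ.
true_pos_demo = 88
false_pos_demo = 1
false_neg_demo = 1

# Precision: of the examples predicted positive, the fraction that are truly positive.
precision_demo = true_pos_demo / (true_pos_demo + false_pos_demo)

# Recall: of the truly positive examples, the fraction that were predicted positive.
recall_demo = true_pos_demo / (true_pos_demo + false_neg_demo)

# F1: harmonic mean of precision and recall.
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(precision_demo, recall_demo, f1_demo)
```

When precision and recall are equal, as they are here, the F1 score equals that same value.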

In [None]:
# TODO Q12
# Predict the class y_hat using X_test and the w you just calculated, with threshold t = 0.5

# First augment the test dataset with a column of ones.
ones = ...
X_test_1 = ...
# Now predict the label of each example in your test set
y_hat = ...

In [None]:
grader.check("Q12")

In [None]:
# TODO Q13
# Write the precision_recall function by first calculating false_pos, false_neg and true_pos. Using these numbers, compute the precision and recall.
def precision_recall(y_hat, y, threshold):  

    # Calculate false positive and false negative
    # HINT: if done correctly, false_pos should be 1 and false_neg should be 1
    false_pos = ...
    false_neg = ...

    # Calculate true positives
    # HINT: if done correctly, true_pos should be 88
    true_pos = ...

    precision = ...
    recall = ...
    return precision,recall

In [None]:
grader.check("Q13")

In [None]:
# TODO Q14
# Calculate precision and recall on the test data where the threshold is 0.5

precision, recall = ...

print('Q14 - precision: ', precision)
print('Q14 - recall: ', recall)

In [None]:
grader.check("Q14")

In [None]:
# TODO Q15
# Write the F1_score function
def f1_score(precision, recall):
    ...
    return ...

In [None]:
# Computing the F1 score on the test data set using the precision and recall you computed above.
f1_score(precision, recall)

In [None]:
grader.check("Q15")

# Sklearn's implementation of Logistic regression

Next, use Sklearn's implementation of logistic regression. Once you have fit the model, you will apply it to the test data and then evaluate how well it did using Sklearn's built-in functions to compute the accuracy, precision, recall and F1 score.

### Fitting Model using Sklearn Library. 

In [None]:
# TODO - Create the logistic regression model object. To turn off regularization, set the penalty to `none` or set C to a very large value (for example, C = 100000000),
# which makes lambda (C = 1/lambda) nearly 0.
# Note: newer scikit-learn releases (1.2 and later) expect penalty=None instead of the string 'none'.
from sklearn import linear_model
logreg = linear_model.LogisticRegression(penalty = 'none')

In [None]:
grader.check("Q16")

In [None]:
# TODO Q17
# Fit the model
# Don't use matrix X_train_1. Instead, use X_train.
logreg.fit(..., ... )

In [None]:
grader.check("Q17")

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: Q19
manual: true
points:
  each: 1
-->

In [None]:
# TODO Q18
# Print out all the coefficients
w_logreg = ...
intercept_logreg = ...

In [None]:
grader.check("Q19")

<!-- END QUESTION -->

In [None]:
# VERIFY - Compare the parameters computed by the logreg model and by gradient ascent. They should be nearly the same.
print('Q18 - w_logreg: ', w_logreg)
print('Q18 - intercept_logreg: ', intercept_logreg)

### Performance measure: accuracy

In [None]:
# TODO Q19
# Find the predicted values on test set (X_test not X_test_1) using logreg.predict
y_hat_logreg = ...

# Find the accuracy achieved on test set using logreg.score and y_test 
acc_logreg = ...

print("Q19 - Accuracy on test data = %f" % acc_logreg)

In [None]:
grader.check("Q18")

### Performance Metrics: precision, recall, F1 score
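
The return shape of `precision_recall_fscore_support` is worth knowing before using it. By default it returns one value per class for each metric; passing `average="binary"` returns scalars for the positive class and `None` for the support. The sketch below uses made-up labels (`y_true_demo`, `y_pred_demo`) purely to show both call styles.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels for illustration only.
y_true_demo = np.array([0, 1, 1, 0, 1, 1])
y_pred_demo = np.array([0, 1, 0, 0, 1, 1])

# Default (average=None): arrays with one entry per class, ordered by label (0, then 1).
prec_pc, rec_pc, f1_pc, sup_pc = precision_recall_fscore_support(y_true_demo, y_pred_demo)
print(prec_pc, rec_pc, f1_pc, sup_pc)

# average="binary": scalar scores for the positive class (label 1); support is None.
prec_b, rec_b, f1_b, sup_b = precision_recall_fscore_support(
    y_true_demo, y_pred_demo, average="binary"
)
print(prec_b, rec_b, f1_b, sup_b)
```

Which form you want depends on whether you are comparing against a single precision and recall for the positive class or inspecting both classes.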


In [None]:
from sklearn.metrics import precision_recall_fscore_support
# TODO Q20
# Find precision, recall and fscore using the precision_recall_fscore_support function of sklearn
# Using y_test and y_hat_logreg
prec, recal, fscore, sup = ...

In [None]:
grader.check("Q20")

In [None]:
# VERIFY
print('Q20 - prec: ', prec)
print('Q20 - recal: ', recal)
print('Q20 - fscore: ', fscore)

# Experiment!  Run your gradient ascent algorithm without scaling the training dataset.
What did you notice? Describe the best hyperparameters you found (i.e. `learning_rate` and `num_iters`).