In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab13.ipynb")

# Lab 13: Logistic Regression

In this lab you will build a logistic regression model and evaluate the performance of your model.

### Due Date

This assignment is due on **Tuesday, November 23, at 11:59PM PST**.

### Collaboration Policy

Data science is a collaborative activity. While you may talk with others about
the lab, we ask that you **write your solutions individually**. If you do
discuss the assignment with others, please **include their names** at the top
of your solution.


**Collaborators:** *list names here*

In [1]:
# Run this cell to set up your notebook
import numpy as np
import pandas as pd
import sklearn
import sklearn.datasets
import sklearn.linear_model
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.offline as py
import plotly.graph_objs as go
import plotly.figure_factory as ff
import cufflinks as cf


%matplotlib inline
sns.set()
sns.set_context("talk")
py.init_notebook_mode(connected=False)
cf.set_config_file(offline=False, world_readable=True, theme='ggplot')

In this lab we will be working with the breast cancer dataset. This dataset can be loaded using the `sklearn.datasets.load_breast_cancer()` function.

In [2]:
data = sklearn.datasets.load_breast_cancer()
# data is a dictionary-like object (a scikit-learn Bunch)
print(data.keys())
print(data.DESCR)

The data is not a `pandas.DataFrame`, so we need to do some preprocessing to create a new DataFrame from it.

In [3]:
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head()

Let us try to fit a simple model with only one feature.

In [4]:
# Define our features/target
X = df[["mean radius"]]
# In data.target, 0 is malignant and 1 is benign, so Y is True for malignant tumors
Y = (data.target == 0)


In [5]:
# Create a 75-25 train-test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

print(f"Training Data Size: {len(x_train)}")
print(f"Test Data Size: {len(x_test)}")
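
To make the idea of a 75-25 split concrete, here is a minimal standard-library sketch of what a shuffled hold-out split does conceptually. The helper `simple_train_test_split` is a hypothetical name used only for illustration; it is not how `sklearn.model_selection.train_test_split` is implemented, and its rounding of the test size may differ from sklearn's.

```python
import random

def simple_train_test_split(rows, test_size=0.25, seed=42):
    """Shuffle a copy of `rows` and hold out a `test_size` fraction for testing."""
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Toy data: 20 row indices instead of the real dataset
train_rows, test_rows = simple_train_test_split(range(20))
print(len(train_rows), len(test_rows))  # 15 5
```

Fixing the seed plays the same role as `random_state=42` above: rerunning the cell yields the same partition.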

### Question 1

Let's first fit a logistic regression model using the training set. 

For this problem, we will use the existing `LogisticRegression` implementation in sklearn.

Fill in the code below to compute the training and testing accuracy, defined as:

$$
\text{Training Accuracy} = \frac{1}{n_{train\_set}} \sum_{i \in {train\_set}} {\mathbb{1}_{y_i == \hat{y_i}}}
$$

$$
\text{Testing Accuracy} = \frac{1}{n_{test\_set}} \sum_{i \in {test\_set}} {\mathbb{1}_{y_i == \hat{y_i}}}
$$

where $\hat{y_i}$ is the prediction of our model, $y_i$ is the true value, and $\mathbb{1}_{y_i == \hat{y_i}}$ is an indicator function: $\mathbb{1}_{y_i == \hat{y_i}} = 1$ if ${y_i} = \hat{y_i}$, and $\mathbb{1}_{y_i == \hat{y_i}} = 0$ if ${y_i} \neq \hat{y_i}$.
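
As a worked instance of the accuracy formula, the sketch below evaluates the indicator on five toy labels. The values in `y_true` and `y_hat` are made up for illustration and are unrelated to the lab's data or to the model fitted below.

```python
# Toy labels and predictions (hypothetical values, not from the lab's dataset)
y_true = [True, False, True, True, False]
y_hat = [True, False, False, True, True]

# Indicator: 1 when the prediction matches the label, 0 otherwise
matches = [int(y == y_h) for y, y_h in zip(y_true, y_hat)]

# Accuracy is the mean of the indicator values
accuracy = sum(matches) / len(matches)
print(matches, accuracy)  # [1, 1, 0, 1, 0] 0.6
```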

<!--
BEGIN QUESTION
name: q1
-->

In [6]:
lr = sklearn.linear_model.LogisticRegression(fit_intercept=True, solver='lbfgs')

lr.fit(x_train, y_train)
train_accuracy = ...
test_accuracy = ...

print(f"Train accuracy: {train_accuracy:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")

In [None]:
grader.check("q1")

### Question 2
It seems we can get a very high test accuracy. How about precision and recall?  
- Precision (also called positive predictive value) is the fraction of true positives among the total number of data points predicted as positive.  
- Recall (also known as sensitivity) is the fraction of true positives among the total number of data points with positive labels.

Precision measures the ability of our classifier to not predict negative samples as positive, while recall is the ability of the classifier to find all the positive samples.
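
To make the vocabulary above concrete, the following sketch tallies the four possible outcomes for a handful of toy predictions. The function `confusion_counts` and the label lists are hypothetical and exist only for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for boolean labels and boolean predictions."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual and predicted:
            tp += 1          # positive label, predicted positive
        elif actual and not predicted:
            fn += 1          # positive label, predicted negative
        elif not actual and predicted:
            fp += 1          # negative label, predicted positive
        else:
            tn += 1          # negative label, predicted negative
    return tn, fp, fn, tp

# Toy data (hypothetical values)
y_true = [True, True, True, False, False, False, False, True]
y_pred = [True, False, True, False, True, False, False, False]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(tn, fp, fn, tp)  # 3 1 2 2
```

Precision looks only at the points predicted positive (`tp` and `fp`), while recall looks only at the points whose label is positive (`tp` and `fn`).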

To understand the link between precision and recall, it's useful to create a confusion matrix of our predictions. Luckily, `sklearn.metrics` provides us with such a function!

In [10]:
from sklearn.metrics import confusion_matrix

cnf_matrix = confusion_matrix(y_test, lr.predict(x_test))

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    import itertools
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    
class_names = ['False', 'True']
# Plot non-normalized confusion matrix
plt.figure()
plt.grid(False)
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plt.grid(False)
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')
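
The `normalize=True` branch divides each row of the matrix by its row total. A small pure-Python sketch of that step, using made-up counts rather than the lab's actual results:

```python
# Hypothetical 2x2 confusion matrix: rows are true labels, columns are predictions
counts = [[50, 10],
          [5, 35]]

# Divide every cell by the total of its row, so each row sums to 1
normalized = [[cell / sum(row) for cell in row] for row in counts]

for row in normalized:
    print([round(cell, 3) for cell in row])
# [0.833, 0.167]
# [0.125, 0.875]
```

With rows as true labels and classes ordered `['False', 'True']`, the bottom-right entry of the normalized matrix is the fraction of positive-labeled points that were predicted positive.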

Mathematically, Precision and Recall are defined as:
$$
\text{Precision} = \frac{n_{true\_positives}}{n_{true\_positives} + n_{false\_positives}}
$$

$$
\text{Recall} = \frac{n_{true\_positives}}{n_{true\_positives} + n_{false\_negatives}}
$$
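
As a quick arithmetic check of the two formulas, the sketch below plugs in made-up counts. The helper `precision_recall` is a hypothetical name, and the guard against empty denominators is an added assumption, not part of the definitions above.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from raw counts."""
    # Return NaN when a denominator is zero (an assumption made for this sketch)
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    return precision, recall

# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives
precision, recall = precision_recall(tp=40, fp=10, fn=20)
print(f"precision = {precision:.4f}")  # precision = 0.8000
print(f"recall = {recall:.4f}")        # recall = 0.6667
```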

Below is a graphical illustration of precision and recall:
![precision_recall](precision_recall.png)

Now let's compute the precision and recall for the test set using the model we got from Question 1.  

**Do not** use `sklearn.metrics` for this computation.

<!--
BEGIN QUESTION
name: q2
-->

In [11]:
y_pred = lr.predict(x_test) 

precision = ...
recall = ...

print(f'precision = {precision:.4f}')
print(f'recall = {recall:.4f}')

In [None]:
grader.check("q2")

Our precision is fairly high while our recall is a bit lower. Why might we observe these results? Please consider the following plots, which display the distribution of the target variable in the training and testing sets. 

In [14]:
fig, axes = plt.subplots(1, 2)
sns.countplot(x=y_train, ax=axes[0]);
sns.countplot(x=y_test, ax=axes[1]);

axes[0].set_title('Train')
axes[1].set_title('Test')
plt.tight_layout();
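
The count plots can also be read numerically. Here is a small sketch of computing class proportions, using a made-up label list rather than `y_train` or `y_test`:

```python
from collections import Counter

# Hypothetical labels: 6 negatives and 4 positives
labels = [False] * 6 + [True] * 4

counts = Counter(labels)
proportions = {label: n / len(labels) for label, n in counts.items()}
print(proportions)  # {False: 0.6, True: 0.4}
```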

_Type your answer here, replacing this text._

### Question 3
Now let's try to analyze the cross entropy loss from logistic regression. The average loss across our entire dataset is:

$$R(\theta) = -\frac{1}{n} \sum_{i=1}^n \left( y_i \log(\hat{y_i}) + (1 - y_i) \log(1 - \hat{y_i})  \right) $$

where $\hat{y_i} = \sigma(X_i^T \theta)$. Here, $X_i$ is the $i$-th row of our design matrix $X$; $\theta$ is our weight vector $[\theta_1, \theta_2]^T$, where $\theta_1$ is the weight for the mean radius feature and $\theta_2$ is the bias term; and $\sigma$ is the sigmoid activation function defined below:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
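
To get a feel for the sigmoid, this scalar sketch evaluates it at a few points. The name `sigmoid_scalar` is hypothetical and uses `math.exp`, so it accepts single numbers only, unlike a NumPy version.

```python
import math

def sigmoid_scalar(z):
    """Sigmoid of a single number: 1 / (1 + e^(-z))."""
    return 1 / (1 + math.exp(-z))

for z in (-4, 0, 4):
    print(z, round(sigmoid_scalar(z), 4))

# The output always lies strictly between 0 and 1, and sigmoid(0) is exactly 0.5
print(sigmoid_scalar(0))  # 0.5
```

The function is symmetric around 0 in the sense that $\sigma(-z) = 1 - \sigma(z)$, which is why large negative inputs map close to 0 and large positive inputs map close to 1.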

**Note**: In this class, when performing linear algebra operations we interpret both rows and columns as column vectors. So if we wish to calculate the dot product between row $X_i$ and a vector $v$, we would write $X_i^Tv$.
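
Putting these pieces together for a single data point, the sketch below computes $X_i^T \theta$, the prediction $\hat{y_i}$, and that point's contribution to the loss. All numbers and the names `sigmoid_scalar` and `point_loss` are hypothetical, and the code deliberately handles one row at a time rather than the whole design matrix.

```python
import math

def sigmoid_scalar(z):
    """Sigmoid of a single number."""
    return 1 / (1 + math.exp(-z))

def point_loss(y, y_hat):
    """Cross-entropy contribution of one point with label y in {0, 1}."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Hypothetical values chosen only for illustration
x_i = [15.0, 1.0]      # [mean radius, constant 1 for the bias term]
theta = [1.0, -14.0]   # [theta_1, theta_2]

z = sum(x * t for x, t in zip(x_i, theta))  # dot product X_i^T theta
y_hat = sigmoid_scalar(z)

loss_if_positive = point_loss(1, y_hat)  # loss if the true label is 1
loss_if_negative = point_loss(0, y_hat)  # loss if the true label is 0
print(z, round(y_hat, 4))
print(round(loss_if_positive, 4), round(loss_if_negative, 4))
```

Because $\hat{y_i} > 0.5$ here, the point is penalized less when its true label is 1 than when it is 0.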

In [15]:
theta = np.array([lr.coef_[0][0],
                  lr.intercept_[0]])
X_new = np.hstack([X,
                   np.ones([len(X), 1])])  # Append a column of ones so that theta[1] acts as the intercept
print(theta, '\n')
print(X_new)

In [16]:
def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def lr_loss(theta, X, Y):
    '''
    Compute the cross entropy loss using X, Y, and theta. You should not need to use a for loop. 
    Hint: The notation B @ v means: compute the matrix multiplication Bv.

    Args:
        theta: The model parameters. 
        X: The design matrix
        Y: The label 

    Return:
        The cross entropy loss.
    '''
    ...

In [None]:
grader.check("q3")

Below is a plot showing the cross-entropy loss for various values of $\theta_1$ and $\theta_2$ (note that they are represented as x and y in the graph).

In [19]:
with np.errstate(invalid='ignore', divide='ignore'):
    uvalues = np.linspace(-8,8,70)
    vvalues = np.linspace(-5,5,70)
    (u,v) = np.meshgrid(uvalues, vvalues)
    thetas = np.vstack((u.flatten(),v.flatten()))
    lr_loss_values = np.array([lr_loss(t, X_new, Y) for t in thetas.T])
    lr_loss_surface = go.Surface(name="Logistic Regression Loss",
            x=u, y=v, z=np.reshape(lr_loss_values, u.shape),
            contours=dict(z=dict(show=True, color="gray", project=dict(z=True)))
        )
    fig = go.Figure(data=[lr_loss_surface])
    fig.update_layout(
        scene=dict(
            xaxis_title='theta_1',
            yaxis_title='theta_2',
            zaxis_title='Loss'),
        width=700,
        margin=dict(r=20, l=10, b=10, t=10))
    py.iplot(fig)

Describe one interesting observation about the plot.

_Type your answer here, replacing this text._

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)