# <img style="float: left; padding-right: 10px; width: 45px" src="https://github.com/Harvard-IACS/2018-CS109A/blob/master/content/styles/iacs.png?raw=true"> CS-S109A Introduction to Data Science 

## Lecture 6: Classification and Logistic Regression

**Harvard University**<br>
**Summer 2020**<br>
**Instructors:** Kevin Rader<br>
**Authors:** Rahul Dave, David Sondak, Will Claybaugh, Pavlos Protopapas, Chris Tanner, Kevin Rader

---

In [None]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
import requests
from IPython.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

# Table of Contents 
<ol start="0">
<li> Learning Goals </li>
<li> Logistic Regression</li> 
<li> More Logistic Regression </li> 
<li> Classification Accuracy</li>
</ol>

    

## Learning Goals

This Jupyter notebook accompanies Lecture 6. By the end of this lecture, you should be able to:

- Fit, plot, and interpret logistic regression models and their coefficients
- Determine classification boundaries for logistic regression models
- Evaluate classification methods via misclassification rate, ROC curves, and AUC


In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
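import sklearn as sk
import sklearn.metrics          # makes sk.metrics available below
import sklearn.model_selection  # makes sk.model_selection available below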
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression

pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

## Part 0: Reading the data 

In this notebook, we will be using the Heart dataset from lecture.  The variables we will be using today include:

- `AHD`: whether or not the patient presents atherosclerotic heart disease (a heart attack): `Yes` or `No`
- `Sex`: a binary indicator for whether the patient is male (Sex=1) or female (Sex=0)
- `Age`: age of patient, in years
- `MaxHR`: the maximum heart rate of patient based on exercise testing
- `RestBP`: the resting systolic blood pressure of the patient
- `Chol`: the serum cholesterol level of the patient

For further information on the dataset, please see the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Heart+Disease).

In [None]:
df_heart = pd.read_csv('../data/Heart.csv')

# Force the response into a binary indicator:
df_heart['AHD'] = 1*(df_heart['AHD'] == "Yes")

print(df_heart.shape)
df_heart.head()

**Q0.1** Do some EDA to see how each of the predictors listed above relates to the response, `AHD`.  Consider looking at summary statistics, contingency tables, and relevant visuals comparing the two groups in the response variable.
Hint: [`pd.crosstab`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html) could be very useful for creating contingency tables.
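
As a rough illustration of the hint, here is a minimal sketch of `pd.crosstab` on a small made-up table. The data frame `toy` and its columns `group` and `outcome` are hypothetical stand-ins, not columns of the Heart data.

```python
import pandas as pd

# Hypothetical toy data (not the Heart dataset): a binary predictor and a binary response.
toy = pd.DataFrame({
    "group":   [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "outcome": [0, 0, 0, 1, 0, 0, 1, 1, 1, 1],
})

# Raw counts: rows are predictor levels, columns are response levels.
counts = pd.crosstab(toy["group"], toy["outcome"])

# Row proportions: within each predictor level, the share with each outcome.
props = pd.crosstab(toy["group"], toy["outcome"], normalize="index")

print(counts)
print(props)
```

With `normalize="index"` each row sums to 1, which makes the two groups directly comparable even when they differ in size.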

In [None]:
######
# your code here
######

**Q0.2** Interpret your EDA in the previous part.  Which of the predictors would be most useful in a classification model to predict `AHD`?

*your answer here*

---

## Part 1: Logistic Regression Modeling

Below are both a linear regression model and a [logistic regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) fit using sklearn to predict `AHD` from `Age`.

In [None]:
data_x = df_heart[['Age']]
data_y = df_heart['AHD']

regress1 = LinearRegression(fit_intercept=True).fit(data_x, data_y)
logit1 = LogisticRegression(C=10000,fit_intercept=True).fit(data_x, data_y)

print("Linear Regression Estimated Betas:",regress1.intercept_,regress1.coef_)
print("Logistic Regression Estimated Betas:",logit1.intercept_,logit1.coef_)

Two different prediction methods are available on a fitted logistic regression model in sklearn (pay attention to the shape of what each returns):
- `model.predict(X)`: returns the predicted classifications (0 or 1, here) as a 1-D array
- `model.predict_proba(X)`: returns the predicted probabilities as a 2-D array with one column per class; the second column (index 1) is the probability of 'success' (`AHD` = 1)
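
To make the difference in output shape concrete, here is a small sketch on made-up data. The names `X_toy`, `y_toy`, and `toy_model` are hypothetical and unrelated to `logit1`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one predictor, binary response (not perfectly separable).
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y_toy = np.array([0, 0, 0, 1, 0, 1, 1, 1])

toy_model = LogisticRegression(C=10000).fit(X_toy, y_toy)

labels = toy_model.predict(X_toy)        # shape (n,): hard 0/1 classifications
probs = toy_model.predict_proba(X_toy)   # shape (n, 2): one column per class

print(toy_model.classes_)                # column order used by predict_proba
print(labels.shape, probs.shape)

# Column 1 holds P(y = 1); thresholding it at 0.5 reproduces predict().
p1 = probs[:, 1]
manual = (p1 > 0.5).astype(int)
print(np.array_equal(manual, labels))
```

Each row of `probs` sums to 1, so in the binary case only one of the two columns carries information.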
        
**Q1.1** Calculate both types of predictions for the patients in the data set for `logit1`.  What do you notice?

In [None]:
######
# your code here
######


**Q1.2** Use the array of predicted probabilities to perform the classifications manually (feel free to check your answers with sklearn's classifications).

In [None]:
######
# your code here
######


**Q1.3** Determine the classification boundary mathematically (using the estimated coefficients): for what range of values of `Age` would a patient be predicted to have a heart attack?
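
One way to think about the boundary: with a single predictor, the predicted probability equals 0.5 exactly where the log-odds $\beta_0 + \beta_1 x$ equal zero, that is, at $x = -\beta_0/\beta_1$. The sketch below checks this on made-up data; `X_toy`, `y_toy`, and `toy_model` are hypothetical names, and the numbers have nothing to do with the Heart data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one predictor, binary response (not perfectly separable).
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y_toy = np.array([0, 0, 0, 1, 0, 1, 1, 1])

toy_model = LogisticRegression(C=10000).fit(X_toy, y_toy)

b0 = toy_model.intercept_[0]   # estimated intercept
b1 = toy_model.coef_[0, 0]     # estimated slope

# Log-odds are zero (probability 0.5) at x = -b0 / b1.
boundary = -b0 / b1
p_at_boundary = toy_model.predict_proba([[boundary]])[0, 1]

print("boundary:", boundary)
print("P(y=1) at boundary:", p_at_boundary)
```

Which side of the boundary is classified as 1 depends on the sign of the slope: with a positive slope it is the values above the boundary, with a negative slope the values below.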

In [None]:
######
# your code here
######

Below is some code to plot the predictions from the linear regression model, on the probability scale, on top of the scatterplot of points.

**Q1.4** Add the logistic curve for the predicted probabilities from `logit1`.  Which function is better to describe `AHD` from `Age`?  Why?

In [None]:
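# Grid of ages extending 10 years past the observed range (scalar endpoints keep dummy_x 1-D)
dummy_x = np.linspace(data_x['Age'].min() - 10, data_x['Age'].max() + 10)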
yhat_regress = regress1.predict(dummy_x.reshape(-1,1))
plt.plot(data_x, data_y, 'o' ,alpha=0.2, label='Data')
plt.plot(dummy_x, yhat_regress, label = "OLS")

######
# your code here
######



plt.ylim(-0.01,1.01)
plt.legend()  # show the labels passed to plt.plot above
plt.show()

---

## Part 2: More Logistic Regression Modeling 

**Q2.1** Fit a logistic regression model (`logit2`) to predict `AHD` from `Sex`.  Confirm that these estimates are correct based on the contingency table.
Hint: What proportion of women had heart attacks in the dataset?  What proportion of men?

In [None]:
######
# your code here
######



**Q2.2** Fit two more logistic regression models:
- `logit3` to predict `AHD` from `Sex` and `Age`.  
- `logit4` to predict `AHD` from `Sex` and `Age` and the interaction between the two predictors.

What is the difference between these two modeling choices (i.e., what does the interaction term allow for)?
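
As a generic illustration of how an interaction column is built and read, here is a sketch on simulated data. The variables `g` (a 0/1 group) and `x` (a numeric predictor) are hypothetical, and the simulated coefficients are arbitrary choices, not estimates from the Heart data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400

# Hypothetical predictors: a binary group g and a numeric predictor x.
g = rng.integers(0, 2, size=n)
x = rng.normal(size=n)

# Simulated response whose log-odds slope in x differs between the two groups.
true_logodds = -0.5 + 0.5 * g + (0.5 + 1.5 * g) * x
y = rng.binomial(1, 1 / (1 + np.exp(-true_logodds)))

X_add = np.column_stack([g, x])          # main effects only
X_int = np.column_stack([g, x, g * x])   # main effects plus the product column

additive = LogisticRegression(C=10000).fit(X_add, y)
interact = LogisticRegression(C=10000).fit(X_int, y)

b_g, b_x, b_gx = interact.coef_[0]
slope_g0 = b_x           # log-odds slope in x when g = 0
slope_g1 = b_x + b_gx    # log-odds slope in x when g = 1

print("additive model slope in x (shared):", additive.coef_[0, 1])
print("interaction model slopes:", slope_g0, slope_g1)
```

In the additive fit both groups share one slope on the log-odds scale and differ only by a shift; the product column is what lets the slope itself differ by group.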

In [None]:
######
# your code here
######



*your answer here*

**Q2.3** From `logit4` plot the predicted probability of a heart attack as a function of age separately for females and males (2 separate curves).  What do you notice in these curves?

In [None]:
######
# your code here
######



*your answer here*

**Q2.4** Using `logit4`, at what ages will males be predicted to have a heart attack in a classification?  At what ages will females be predicted to have a heart attack?  Justify based on the plot above.

In [None]:
######
# your code here
######




*your answer here*

---

## Part 3: Classification Accuracy

We split the relevant data into train and test (67-33 split) below for you.  Use this to help score several models we suggest below.

In [None]:
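# Interaction columns; Sex_Age is (re)created here in case it was not added to df_heart in Part 2
df_heart['Sex_Age']=df_heart['Sex']*df_heart['Age']
df_heart['Sex_MaxHR']=df_heart['Sex']*df_heart['MaxHR']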
df_heart['Age_MaxHR']=df_heart['Age']*df_heart['MaxHR']

X_data = df_heart[['Sex','Age','MaxHR','RestBP','Chol','Sex_Age','Sex_MaxHR','Age_MaxHR']]
y_data = df_heart['AHD']

X_train, X_test, y_train, y_test = sk.model_selection.train_test_split(X_data, y_data, test_size=0.33, random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

**Q3.1** Fit an 'unregularized' logistic regression model (`logit5`) to predict `AHD` from all the predictors in the training set.  Determine the misclassification rate in both the train and test sets.

In [None]:
######
# your code here
######



**Q3.2** Fit a 'regularized' logistic regression model (`logit6`) to predict `AHD` from all the predictors in the training set (with `C=0.001`).  Compare the coefficient estimates in `logit6` to `logit5`, and determine the misclassification rate in both the train and test sets.  How have things changed?
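
For orientation: in sklearn's `LogisticRegression`, `C` is the inverse of the regularization strength, so a small `C` means a strong penalty and a very large `C` approximates an unregularized fit. The sketch below shows the effect on simulated data; `X_toy`, `y_toy`, `weak`, and `strong` are hypothetical names and the simulated coefficients are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical simulated data: three standard-normal predictors, binary response.
X_toy = rng.normal(size=(300, 3))
true_logodds = 0.3 + X_toy @ np.array([1.5, -1.0, 0.0])
y_toy = rng.binomial(1, 1 / (1 + np.exp(-true_logodds)))

weak = LogisticRegression(C=10000).fit(X_toy, y_toy)    # nearly unregularized
strong = LogisticRegression(C=0.001).fit(X_toy, y_toy)  # heavily penalized

for name, model in [("C=10000", weak), ("C=0.001", strong)]:
    # score() returns accuracy, so the misclassification rate is its complement.
    misclass = 1 - model.score(X_toy, y_toy)
    print(name, np.round(model.coef_, 3), round(misclass, 3))
```

The heavily penalized coefficients are pulled toward zero; whether that helps or hurts the misclassification rate depends on the data, which is why both train and test sets are worth checking.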

In [None]:
######
# your code here
######



*your answer here*

**Q3.3** Calculate the confusion tables in the test set for `logit5` when the cut-off is the typical 0.5 and when it is 0.8.  Calculate the sensitivity and specificity of this classification algorithm for each of these cut-offs.
Hint:  [sk.metrics.confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) will be useful for this task.
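
As a sketch of the mechanics, the example below applies two cut-offs to made-up probabilities. The arrays `y_true` and `p_hat` and the helper `sens_spec` are hypothetical, not output from `logit5`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
p_hat = np.array([0.10, 0.20, 0.30, 0.60, 0.85, 0.40, 0.55, 0.70, 0.90, 0.95])

def sens_spec(y_true, p_hat, cutoff):
    """Classify as 1 when p_hat >= cutoff; return (sensitivity, specificity)."""
    y_pred = (p_hat >= cutoff).astype(int)
    # With labels=[0, 1], ravel() yields counts in the order tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return sensitivity, specificity

for cutoff in (0.5, 0.8):
    print(cutoff, sens_spec(y_true, p_hat, cutoff))
```

Raising the cut-off makes the classifier more reluctant to predict 1, so sensitivity can only stay the same or fall while specificity can only stay the same or rise.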

In [None]:
yhat_test_logit5 = logit5.predict_proba(X_test)[:,1]
print('The average predicted probability is',np.mean(yhat_test_logit5))

######
# your code here
######



The ROC curve for `logit5` on the test set is shown below, computed using `sk.metrics.roc_curve`.

In [None]:
# Use the test-set predicted probabilities computed in the Q3.3 cell
fpr, tpr, thresholds = sk.metrics.roc_curve(y_test, yhat_test_logit5)

x=np.arange(0,100)/100
plt.plot(x,x,'--',color="gray",alpha=0.3)
plt.plot(fpr,tpr)
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.title("ROC Curve for Predicting AHD in a Logistic Regression Model")
plt.show()

**Q3.4** For `logit6`, determine the predicted probabilities in the test set, then calculate and plot the ROC curve for this model (it's helpful to plot the ROC curves from `logit5` and `logit6` together).

In [None]:
######
# your code here
######


**Q3.5** Use the ROC curves above to eyeball which of `logit5` and `logit6` is a better classification model based on Area Under the ROC Curve (AUC).  Then, calculate the actual AUC for these two models. 

Hint: use [sklearn.metrics.auc](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html)
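
A minimal sketch of the two-step calculation (ROC points first, then the area), using made-up values. The arrays `y_true` and `p_hat` are hypothetical and do not come from either model.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
p_hat = np.array([0.10, 0.20, 0.30, 0.60, 0.85, 0.40, 0.55, 0.70, 0.90, 0.95])

# Step 1: false/true positive rates at each distinct threshold.
fpr, tpr, thresholds = roc_curve(y_true, p_hat)

# Step 2: trapezoidal area under the (fpr, tpr) points.
auc_from_curve = auc(fpr, tpr)

# roc_auc_score does both steps at once from labels and scores.
auc_direct = roc_auc_score(y_true, p_hat)

print(auc_from_curve, auc_direct)
```

Here 20 of the 25 (positive, negative) pairs are ranked correctly, so both calculations give an AUC of 0.8.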

In [None]:
######
# your code here
######


*your answer here*