<img align=right src="images/inmas.png" width=130x />

# Notebook 04b - Logistic Regression - Supplement

Material covered in this notebook:

This notebook follows along with the notes [here](Notes/4_LogisticRegression.pdf).


### Prerequisite
Notebook 04a

------------------------------------

In [None]:
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

In [None]:
# Read in Data
uci_adult_df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    sep=",",
    header=None,
    na_values=["NA", "?"],
)

# Add all column names
uci_adult_df.columns = [
    "Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
    "MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"
]

display(uci_adult_df)

## Splitting Data

The standard practice when modeling is to split a single data set into two:

- **Training**: Data that should be used to train the model.
- **Testing**: Data that should be used to evaluate the predictions made by the trained model.

A common choice is an 80/20 split, with 80% of the rows going to training and 20% going to testing.
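As a minimal sketch of such a split, assuming a made-up toy data frame (`toy_df` and its columns are hypothetical), the snippet below passes `random_state` so the split is reproducible and `stratify` so both pieces keep the same share of each label. Neither argument is used in this notebook's own cells; they are optional extras.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 25% of them labeled 1
toy_df = pd.DataFrame({
    "x": range(100),
    "label": [1] * 25 + [0] * 75,
})

# random_state fixes the shuffle so the split is reproducible;
# stratify keeps the label proportions the same in both pieces.
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=0, stratify=toy_df["label"]
)

print(len(toy_train), len(toy_test))  # 80 20
print(toy_train["label"].mean(), toy_test["label"].mean())
```

Stratifying is mostly useful when one class is rare, since a purely random split can leave the test set with very few positive cases.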

In [None]:
# Load train_test_split function in the model_selection module for sklearn
from sklearn.model_selection import train_test_split

# Split the data with 20% in testing and 80% in training.
train, test = train_test_split(uci_adult_df, test_size=0.2)

train.info()
test.info()

In [None]:
## logistic regression expects 0s and 1s as the response variable
tempVec = pd.get_dummies(train['Income'])
tempVec

In [None]:
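# Cast to int so the response is numeric 0/1
# (newer pandas versions return booleans from get_dummies)
train['Income_Binary'] = tempVec[' >50K'].astype(int)

tempVec = pd.get_dummies(test['Income'])

test['Income_Binary'] = tempVec[' >50K'].astype(int)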

Let's fit a model on the training data and see how well it does on the testing data.

In [None]:

# Specify the desired model of y regressing onto x
model_formula = smf.logit('Income_Binary ~ Age + Race + Gender + HoursPerWeek', data = train)

# Fit the model to the data
results = model_formula.fit()

results.params

In [None]:
## Test accuracy
# Predicted probabilities that Income_Binary is 1
predV = results.predict(test)
# Turn the probabilities into 0/1 predictions using a 0.5 cutoff
pred = (predV > 0.5).astype(int)
test_acc = (test['Income_Binary'] == pred).mean()
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(test_acc))

Let's break that down to decide in what ways we are wrong and right.

In [None]:
from sklearn.metrics import confusion_matrix       ## logistic regression report packages
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Confusion matrix (rows: true class, columns: predicted class)
# Stored as cm so the imported confusion_matrix function is not overwritten
cm = confusion_matrix(test['Income_Binary'], pred)
print(cm)

scikit-learn orders the classes as 0 then 1, with rows for the true values and columns for the predictions. The top left value means that the true value is 0 and we correctly predicted a 0 (true negative). The top right value means that the true value is 0 and we incorrectly predicted a 1 (false positive). The bottom left value means that the true value is 1 and we incorrectly predicted a 0 (false negative). The bottom right value means that the true value is 1 and we correctly predicted a 1 (true positive).
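To make the layout concrete, here is a small sketch on made-up labels (`demo_true` and `demo_pred` are hypothetical, not taken from the adult data). It unpacks the four cells with `ravel()` and derives a few common summary rates from them.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and 0/1 predictions
demo_true = np.array([0, 0, 1, 1, 1])
demo_pred = np.array([0, 1, 0, 1, 1])

# Rows are true classes (0 then 1), columns are predicted classes (0 then 1)
demo_cm = confusion_matrix(demo_true, demo_pred)
print(demo_cm)
# [[1 1]
#  [1 2]]

# Flattening row by row gives: true neg, false pos, false neg, true pos
tn, fp, fn, tp = demo_cm.ravel()

accuracy = (tp + tn) / demo_cm.sum()   # share of all predictions that are correct
sensitivity = tp / (tp + fn)           # share of true 1s that we caught
specificity = tn / (tn + fp)           # share of true 0s that we caught
print(accuracy, sensitivity, specificity)
```

Sensitivity and specificity separate the two kinds of error, which accuracy alone hides when one class is much more common than the other.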

All of this has assumed that 0.5 is the appropriate cutoff value to turn predicted probabilities into binary predictions. What if we change this cutoff value? The ROC curve plots the true positive rate against the false positive rate across all possible cutoff values, so it shows how well the model does under each one. Informally, we want the area under the curve to be large, so we want the curve to be as close to the upper left corner as possible. Read more about how to interpret this type of curve [here](https://en.wikipedia.org/wiki/Receiver_operating_characteristic).
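The effect of the cutoff can be sketched by hand. The snippet below uses made-up labels and probabilities (`toy_true`, `toy_prob`, and the helper `rates_at_cutoff` are all hypothetical) and computes one (false positive rate, true positive rate) pair per cutoff. Each pair is one point on a ROC curve.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities
toy_true = np.array([0, 0, 1, 0, 1, 1])
toy_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])

def rates_at_cutoff(y_true, y_prob, cutoff):
    """Return (false positive rate, true positive rate) for one cutoff."""
    y_hat = (y_prob > cutoff).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    return fp / (fp + tn), tp / (tp + fn)

# A lower cutoff predicts more 1s, which raises both rates
for cutoff in [0.3, 0.5, 0.7]:
    fpr_c, tpr_c = rates_at_cutoff(toy_true, toy_prob, cutoff)
    print(f"cutoff={cutoff:.1f}  FPR={fpr_c:.2f}  TPR={tpr_c:.2f}")
```

Choosing a cutoff is then a trade-off: moving it down catches more true 1s at the cost of more false alarms.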



In [None]:
# ROC curve
# Compute the AUC from the predicted probabilities, not the 0/1 predictions
logit_roc_auc = roc_auc_score(test['Income_Binary'], predV)
fpr, tpr, thresholds = roc_curve(test['Income_Binary'], predV)
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

## Your Turn

Fit your own logistic regression model, choosing a different binary response. Remember that you will need to convert the categorical variable into a 0/1 variable before proceeding. Be sure to evaluate the fit of the model using a test dataset that the model has *not* been trained on.
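As a starting hint, one possible way to build a 0/1 variable is to compare the column to the level of interest and cast the result to an integer. The sketch below uses a made-up data frame (`toy` and `Gender_Binary` are hypothetical names), and its values mimic the leading space seen in the raw file's categories.

```python
import pandas as pd

# Hypothetical toy data with a leading space in each category, as in the raw file
toy = pd.DataFrame({"Gender": [" Male", " Female", " Female", " Male"]})

# Strip whitespace, compare to the level of interest, and cast True/False to 1/0
toy["Gender_Binary"] = (toy["Gender"].str.strip() == "Male").astype(int)

print(toy["Gender_Binary"].tolist())  # [1, 0, 0, 1]
```

The same pattern should carry over to any two-level column; which level is coded as 1 is your choice, but it determines how the fitted coefficients are read.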