<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Evaluating Classification Models on Humor Styles Data

---

In this lab you will practice evaluating classification models (logistic regression in particular) on a "Humor Styles" survey.

This survey is designed to evaluate what "style" of humor subjects have. Your goal is to classify gender from the survey responses.

## Humor styles questions encoding reference

### 32 questions:

Subjects answered **32** different questions outlined below:

    1. I usually don't laugh or joke with other people.
    2. If I feel depressed, I can cheer myself up with humor.
    3. If someone makes a mistake, I will tease them about it.
    4. I let people laugh at me or make fun of me at my expense more than I should.
    5. I don't have to work very hard to make other people laugh. I am a naturally humorous person.
    6. Even when I'm alone, I am often amused by the absurdities of life.
    7. People are never offended or hurt by my sense of humor.
    8. I will often get carried away in putting myself down if it makes family or friends laugh.
    9. I rarely make other people laugh by telling funny stories about myself.
    10. If I am feeling upset or unhappy I usually try to think of something funny about the situation to make myself feel better.
    11. When telling jokes or saying funny things, I am usually not concerned about how other people are taking it.
    12. I often try to make people like or accept me more by saying something funny about my own weaknesses, blunders, or faults.
    13. I laugh and joke a lot with my closest friends.
    14. My humorous outlook on life keeps me from getting overly upset or depressed about things.
    15. I do not like it when people use humor as a way of criticizing or putting someone down.
    16. I don't often say funny things to put myself down.
    17. I usually don't like to tell jokes or amuse people.
    18. If I'm by myself and I'm feeling unhappy, I make an effort to think of something funny to cheer myself up.
    19. Sometimes I think of something that is so funny that I can't stop myself from saying it, even if it is not appropriate for the situation.
    20. I often go overboard in putting myself down when I am making jokes or trying to be funny.
    21. I enjoy making people laugh.
    22. If I am feeling sad or upset, I usually lose my sense of humor.
    23. I never participate in laughing at others even if all my friends are doing it.
    24. When I am with friends or family, I often seem to be the one that other people make fun of or joke about.
    25. I don't often joke around with my friends.
    26. It is my experience that thinking about some amusing aspect of a situation is often a very effective way of coping with problems.
    27. If I don't like someone, I often use humor or teasing to put them down.
    28. If I am having problems or feeling unhappy, I often cover it up by joking around, so that even my closest friends don't know how I really feel.
    29. I usually can't think of witty things to say when I'm with other people.
    30. I don't need to be with other people to feel amused. I can usually find things to laugh about even when I'm by myself.
    31. Even if something is really funny to me, I will not laugh or joke about it if someone will be offended.
    32. Letting others laugh at me is my way of keeping my friends and family in good spirits.

---

### Response scale:

For each question, there are 5 possible response codes (a Likert scale) that correspond to different answers. There is also a code that indicates that the subject gave no response.

    1 == "Never or very rarely true"
    2 == "Rarely true"
    3 == "Sometimes true"
    4 == "Often true"
    5 == "Very often or always true"
    [-1 == Did not select an answer]
    
---

### Demographics:

    age: entered as text, then parsed to an integer.
    gender: chosen from a drop-down list (1=male, 2=female, 3=other, 0=declined)
    accuracy: how accurate subjects thought their answers were, on a scale from 0 to 100. Answers were entered as text and parsed to an integer. Subjects were instructed to enter 0 if they did not want to be included in research.

In [6]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

### 1. Load the data and perform any EDA and cleaning you think is necessary.

It is worth reading over the description of the data columns above for this.

In [9]:
hsq = pd.read_csv('./humor_styles/hsq_data.csv')
df = pd.DataFrame(hsq)

FileNotFoundError: [Errno 2] File b'./humor_styles/hsq_data.csv' does not exist: b'./humor_styles/hsq_data.csv'

In [None]:
hsq.head()

In [None]:
hsq.isnull().sum() #counts NaN values only; unanswered questions are coded as -1, so they do not show up here
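The response-scale reference says unanswered questions are stored as `-1`, and the demographics note says an `accuracy` of 0 marks subjects who did not want to be included, so `isnull()` alone will not flag either case. The following is a minimal sketch of one way to handle both on a toy frame. The column names `Q1`, `Q2`, and `accuracy` are assumptions about the CSV layout, and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey; the column names are assumed, not read from the CSV.
toy = pd.DataFrame({
    "Q1": [1, -1, 3, 5],
    "Q2": [2, 4, -1, 1],
    "accuracy": [90, 0, 75, 100],
})

question_cols = ["Q1", "Q2"]

# -1 means "did not select an answer", so treat it as missing.
cleaned = toy.copy()
cleaned[question_cols] = cleaned[question_cols].replace(-1, np.nan)

# accuracy == 0 marks subjects who asked to be left out of research.
cleaned = cleaned[cleaned["accuracy"] != 0]

# Missing answers are now visible to isnull().
missing_per_column = cleaned.isnull().sum()
print(missing_per_column)
```

Whether to drop or impute the rows that still contain missing answers is a separate judgment call that depends on how many there are.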

### 2. Set up a predictor matrix to predict `gender` (only male vs. female)

Choice of predictors is up to you. Justify which variables you include.

In [None]:
hsq['gender'].unique()

In [None]:
#keep only male (1) and female (2); this drops 'gender' = {3=other, 0=declined}
df = hsq[hsq['gender'].isin([1, 2])].copy()

In [8]:
#choose features whose absolute correlation with 'gender' is >= 0.1

gender_corr = df.corr()['gender']
a_mask = gender_corr.abs() >= 0.1
gender_corr[a_mask]

NameError: name 'df' is not defined

In [10]:
#get a list of the features to be used
features = df.corr()['gender'][a_mask].index.tolist()

#plot a heatmap for correlation
corr = df[features].corr()

fig, ax = plt.subplots(figsize=(10,8))

mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from NumPy; the builtin bool works
mask[np.triu_indices_from(mask)] = True

ax = sns.heatmap(corr, mask=mask, ax=ax, annot=True)

ax.set_xticklabels(ax.xaxis.get_ticklabels(), fontsize=14)
ax.set_yticklabels(ax.yaxis.get_ticklabels(), fontsize=14)

plt.show()

NameError: name 'df' is not defined

In [None]:
#drop 'gender' from list 'features'
features.remove('gender')

#construct X, the predictor matrix, and y, the labels
X = df[features]
y = df['gender']

### 3. Fit a Logistic Regression model and compare your cross-validated accuracy to the baseline.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.model_selection import train_test_split

In [None]:
lr_model = LogisticRegression()

In [None]:
#baseline: accuracy of always predicting the most common gender
baseline_score = y.value_counts(normalize=True).max()
baseline_score

In [None]:
cv_score = cross_val_score(lr_model, X, y, cv=4)

In [None]:
cv_score > baseline_score
#the model is better than the baseline
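The baseline is the accuracy you would get by always predicting the most frequent class. As a hedged illustration on made-up labels (not the survey data), the sketch below computes that share by hand and checks it against scikit-learn's `DummyClassifier`, which is one convenient way to get the same number.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Made-up labels: six of class 1 and four of class 2.
toy_y = pd.Series([1, 1, 1, 1, 1, 1, 2, 2, 2, 2])
toy_X = np.zeros((len(toy_y), 1))  # the dummy model ignores the features

# Majority-class share, computed by hand.
manual_baseline = toy_y.value_counts(normalize=True).max()

# The same number from a classifier that always predicts the majority class.
dummy = DummyClassifier(strategy="most_frequent")
dummy_baseline = dummy.fit(toy_X, toy_y).score(toy_X, toy_y)

print(manual_baseline, dummy_baseline)
```

A model is only worth keeping if its cross-validated accuracy clears this number by a margin that is stable across folds.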

### 4. Create a 50-50 train-test split. Fit the model on the training data and get the predictions and predicted probabilities on the test data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, stratify=y, shuffle=True)
lr_model.fit(X_train, y_train)

In [None]:
lr_model.score(X_train, y_train)

In [None]:
lr_model.score(X_test, y_test)

In [None]:
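#predicted labels for every row of X; reuse y's index so comparisons with y line up row by row
y_preds = pd.Series(lr_model.predict(X), index=y.index)
y_preds.value_counts()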

In [None]:
preds_prob = pd.DataFrame(lr_model.predict_proba(X_test), columns=['male','female']) 
preds_prob.head()

### 5. Manually calculate the true positives, false positives, true negatives, and false negatives.

In [None]:
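#positive class is female (2); compare by position so differing pandas indexes cannot misalign the rows
tp = np.sum((y.values == 2) & (y_preds.values == 2))
fp = np.sum((y.values == 1) & (y_preds.values == 2))
tn = np.sum((y.values == 1) & (y_preds.values == 1))
fn = np.sum((y.values == 2) & (y_preds.values == 1))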
print("true positives:", tp)
print("false positives:", fp)
print("true negatives:", tn) 
print("false negatives:", fn)
print("Number of classification errors:", fp+fn)

### 6. Construct the confusion matrix. 

In [None]:
from sklearn.metrics import confusion_matrix
#rows are true labels, columns are predicted labels, both in the order [1 (male), 2 (female)]
C_M = confusion_matrix(y, y_preds)
print(C_M)
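To connect the manual counts with the matrix layout, here is a small sketch on made-up labels that uses the survey's coding (1 = male, 2 = female) and, like the manual counts above, treats 2 as the positive class. The arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels using the survey's coding: 1 = male, 2 = female.
toy_true = np.array([1, 1, 1, 2, 2, 2, 2, 1])
toy_pred = np.array([1, 2, 1, 2, 2, 1, 2, 1])

# Manual counts with 2 (female) as the positive class.
tp = int(np.sum((toy_true == 2) & (toy_pred == 2)))
fp = int(np.sum((toy_true == 1) & (toy_pred == 2)))
tn = int(np.sum((toy_true == 1) & (toy_pred == 1)))
fn = int(np.sum((toy_true == 2) & (toy_pred == 1)))

# Rows are true labels, columns are predicted labels, both ordered [1, 2].
cm = confusion_matrix(toy_true, toy_pred, labels=[1, 2])

# With that ordering, flattening the matrix gives tn, fp, fn, tp.
cm_tn, cm_fp, cm_fn, cm_tp = cm.ravel()
print(cm)
```

Passing `labels` explicitly fixes the row and column order, which avoids guessing which cell holds which count.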

### 7. Print out the false positive count as you change your threshold for predicting label 1.

In [None]:
Y_pp = pd.DataFrame(lr_model.predict_proba(X_test), columns=['male','female']) 
Y_pp.head()

In [None]:
Y_pp['male'].mean(), Y_pp['female'].mean()

In [None]:
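#sweep the threshold for predicting label 1 (male) and count the false positives:
#rows predicted male (P(male) >= threshold) whose true gender is not 1
for thresh in np.arange(0.1, 1.0, 0.1):
    pred_male = Y_pp['male'].values >= thresh
    false_pos = np.sum(pred_male & (y_test.values != 1))
    print('threshold: %.1f, false positives: %d' % (thresh, false_pos))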

### 8. Plot an ROC curve using your predicted probabilities on the test data.

Calculate the area under the curve.

> *Hint: go back to the lesson to find code for plotting the ROC curve.*

In [None]:
from sklearn.metrics import roc_curve, auc

In [None]:
# For class 1, find the area under the curve
fpr, tpr, threshold = roc_curve(y_test, Y_pp.male, pos_label=1)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve for class 1 (male)
plt.figure(figsize=[6,6])               # plt.figure returns a single Figure, so it cannot be unpacked into fig, ax (plt.subplots can)
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc, linewidth=4)
plt.plot([0, 1], [0, 1], 'k--', linewidth=4)
plt.xlim([-0.05, 1.0])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate', fontsize=18)
plt.ylabel('True Positive Rate', fontsize=18)
plt.title('gender prediction from humor style data', fontsize=18)
plt.legend(loc="lower right")
plt.show()
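An ROC curve is traced by sweeping a threshold over continuous scores, so `roc_curve` should be given predicted probabilities rather than hard class labels. The sketch below illustrates the difference on made-up 0/1 labels and scores (not the survey data). With hard labels the curve collapses to a single interior point and the area is generally understated.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up test labels (1 = positive) and predicted probabilities of class 1.
toy_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
toy_prob = np.array([0.9, 0.8, 0.45, 0.55, 0.3, 0.2, 0.7, 0.4])

# ROC from continuous scores: the curve has a point for each useful threshold.
fpr_prob, tpr_prob, _ = roc_curve(toy_true, toy_prob)
auc_prob = roc_auc_score(toy_true, toy_prob)

# ROC from hard labels (probabilities cut at 0.5): only one interior point is left.
toy_label = (toy_prob >= 0.5).astype(int)
fpr_label, tpr_label, _ = roc_curve(toy_true, toy_label)
auc_label = roc_auc_score(toy_true, toy_label)

print(auc_prob, auc_label)
```

When the labels are coded 1 and 2, as in this lab, the score passed to `roc_curve` must also be the probability of whichever class is named in `pos_label`.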

### 9. Cross-validate a logistic regression with a Ridge penalty.

Logistic regression can also use the Ridge penalty. Sklearn's [`LogisticRegressionCV`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html) class will help you cross-validate an appropriate regularization strength.

**Important `LogisticRegressionCV` arguments:**
- `penalty`: this can be one of `'l1'` or `'l2'`. L1 is the Lasso, and L2 is the Ridge.
- `Cs`: How many different (automatically-selected) regularization strengths should be tested.
- `cv`: How many cross-validation folds should be used to test regularization strength.
- `solver`: When using the lasso penalty, this should be set to `'liblinear'`

> **Note:** The `C` regularization strength is the *inverse* of alpha. That is to say, `C = 1./alpha`
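As a hedged sketch of how `Cs` and `cv` fit together, the example below cross-validates over an explicit grid of `C` values on synthetic data from `make_classification` (a stand-in, not the survey data) and converts the selected `C` back to the equivalent alpha. The grid bounds and `max_iter` value are arbitrary choices for illustration, and the default penalty (L2) is used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in data; the survey data is not used here.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Cs can be an int (size of an automatic grid) or an explicit array of C values.
# Small C means a strong penalty (large alpha), because C = 1 / alpha.
candidate_Cs = np.logspace(-3, 3, 7)

ridge_cv = LogisticRegressionCV(Cs=candidate_Cs, cv=5, max_iter=1000)
ridge_cv.fit(X_demo, y_demo)

# C_ holds the value picked by cross-validation; ravel copes with array or scalar.
best_C = float(np.ravel(ridge_cv.C_)[0])
best_alpha = 1.0 / best_C
print(best_C, best_alpha)
```

If the selected value sits at either end of the grid, it is usually worth widening the grid and fitting again.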

In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.linear_model import RidgeClassifier

In [None]:
#fit on the training half only, so the test-set evaluation below stays out-of-sample
lr = LogisticRegressionCV(cv=10, penalty='l2')
lr.fit(X_train, y_train)

In [None]:
lr.score(X,y)

#### 9.A Calculate the predicted labels and predicted probabilities on the test set with the Ridge logistic regression.

In [None]:
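Y_pp['pred_Rlr'] = lr.predict(X_test)                    #predicted labels
Y_pp['prob_male_Rlr'] = lr.predict_proba(X_test)[:, 0]   #predicted P(male); column 0 is class 1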


#### 9.B Construct the confusion matrix for the Ridge LR.

In [None]:
confusion_matrix(y_test, Y_pp['pred_Rlr'])

### 10. Plot the ROC curve for the original and Ridge logistic regressions on the same plot.

Which performs better?

In [None]:
# For class 1, find the area under the curve. The score must be the predicted
# probability of class 1 (column 0 of predict_proba), not the hard label.
fpr_Rlr, tpr_Rlr, threshold = roc_curve(y_test, lr.predict_proba(X_test)[:, 0], pos_label=1)
roc_auc_Rlr = auc(fpr_Rlr, tpr_Rlr)

# Plot the ROC curves for class 1 (male)
plt.figure(figsize=[6,6])               # plt.figure returns a single Figure, so it cannot be unpacked into fig, ax (plt.subplots can)
plt.plot(fpr, tpr, fpr_Rlr, tpr_Rlr, linewidth=4)
plt.plot([0, 1], [0, 1], 'k--', linewidth=4)
plt.xlim([-0.05, 1.0])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate', fontsize=18)
plt.ylabel('True Positive Rate', fontsize=18)
plt.title('gender prediction from humor style data', fontsize=18)
plt.legend(['ROC_original (area = %0.2f)' % roc_auc, 'ROC_Ridge (area = %0.2f)' % roc_auc_Rlr], loc="lower right")
plt.show()

In [None]:
#the L2-regularized model performs worse than the original

### 11. Cross-validate a Lasso logistic regression.

**Hint:**
- `penalty` must be set to `'l1'`
- `solver` must be set to `'liblinear'`

> **Note:** The lasso penalty can be considerably slower. You may want to try fewer Cs or use fewer cv folds.
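What distinguishes the lasso from the ridge penalty in practice is that L1 can push coefficients to exactly zero, which is why it doubles as a feature selector. The sketch below shows this on synthetic data from `make_classification` (a stand-in, not the survey data). The value `C=0.05` is an arbitrary strong-penalty setting chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 20 features, only 3 of which carry signal.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=3, n_redundant=0, random_state=0
)

# Same solver and same regularization strength, different penalty.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.05)
l1_model.fit(X_demo, y_demo)
l2_model.fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty.
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print("zero coefficients with L1:", n_zero_l1)
print("zero coefficients with L2:", n_zero_l2)
```

The ridge penalty shrinks coefficients toward zero but typically leaves them non-zero, so the L1 count is expected to be the larger of the two.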

In [None]:
#fit on the training half only, then score on the held-out half
Rlr_l1 = LogisticRegressionCV(cv=10, penalty='l1', solver='liblinear')
Rlr_l1.fit(X_train, y_train)
Rlr_l1.score(X_test, y_test)

In [None]:
cross_val_score(Rlr_l1, X, y, cv=4)

### 12. Make the confusion matrix for the Lasso model.

In [None]:
#evaluate on the test set, as for the Ridge model in 9.B
confusion_matrix(y_test, Rlr_l1.predict(X_test))


### 13. Plot all three logistic regression models on the same ROC plot.

Which is the best (if any)?

In [None]:
# Score the test set with the Lasso model's predicted probability of class 1 (male),
# which is column 0 of predict_proba; hard labels would give a one-point curve.
fpr_Rlr_l1, tpr_Rlr_l1, threshold = roc_curve(y_test, Rlr_l1.predict_proba(X_test)[:, 0], pos_label=1)
roc_auc_Rlr_l1 = auc(fpr_Rlr_l1, tpr_Rlr_l1)

# Plot the ROC curves for class 1 (male)
plt.figure(figsize=[6,6])               # plt.figure returns a single Figure, so it cannot be unpacked into fig, ax (plt.subplots can)
plt.plot(fpr, tpr, fpr_Rlr, tpr_Rlr, fpr_Rlr_l1, tpr_Rlr_l1, linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', linewidth=2)
plt.xlim([-0.05, 1.0])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate', fontsize=18)
plt.ylabel('True Positive Rate', fontsize=18)
plt.title('gender prediction from humor style data', fontsize=18)
plt.legend(['ROC_original (area = %0.2f)' % roc_auc, 'ROC_Ridge (area = %0.2f)' % roc_auc_Rlr,
            'ROC_Lasso (area = %0.2f)' % roc_auc_Rlr_l1], loc="lower right")
plt.show()

In [None]:
#the original model works best

### 14. Look at the coefficients for the Lasso logistic regression model. Which variables are the most important?

In [None]:
Rlr_l1.coef_

In [None]:
# Which variables are the most important?

variables = pd.DataFrame({'feature': features, 'l1_coefficient': Rlr_l1.coef_.ravel() })

variables['|l1_coefficient|'] = abs(variables['l1_coefficient'])

variables.sort_values(by='|l1_coefficient|', ascending=False, inplace=True)

variables

In [None]:
#the most important feature for predicting the gender is Q15!
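Ranking features by absolute coefficient is only a fair comparison when the features share a scale. The Likert items all run from 1 to 5, but a column such as `age` does not, so if it is among the selected features its coefficient is not directly comparable. The sketch below shows one common remedy, standardizing inside a pipeline before fitting. The data is synthetic, and the column names `likert_item` and `age_like` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Hypothetical columns: a 1-5 Likert item and an age-like column on a wider scale.
demo = pd.DataFrame({
    "likert_item": rng.integers(1, 6, size=n),
    "age_like": rng.integers(18, 70, size=n),
})

# In this made-up example the label depends only on the Likert item.
logit = 1.5 * (demo["likert_item"] - 3)
demo_y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardize first, so each coefficient is per standard deviation of its feature.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear"),
)
pipe.fit(demo, demo_y)

coefs = pd.Series(pipe[-1].coef_.ravel(), index=demo.columns)
ranking = coefs.abs().sort_values(ascending=False)
print(ranking)
```

Even on a common scale, a large coefficient says the feature is useful to this model, not that it causes the outcome.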