In this exercise, we'll continue working with the Titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

For all of the models you create, choose a threshold that optimizes for accuracy.
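
A minimal sketch of what choosing a threshold can look like, assuming you already have predicted probabilities for the positive class (with scikit-learn these would typically come from `predict_proba(X)[:, 1]`, and `.predict()` on a binary model corresponds to a threshold of 0.5). The probabilities, labels, and helper names below are made up for illustration.

```python
def accuracy_at(threshold, probs, labels):
    """Share of correct predictions when predicting 1 if prob >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    correct = sum(1 for pred, y in zip(preds, labels) if pred == y)
    return correct / len(labels)


def best_threshold(probs, labels, candidates):
    """Return the (threshold, accuracy) pair with the highest accuracy.

    Ties go to the first candidate in the list.
    """
    best = None
    for t in candidates:
        acc = accuracy_at(t, probs, labels)
        if best is None or acc > best[1]:
            best = (t, acc)
    return best


# Hypothetical predicted probabilities of survival and the true labels
probs = [0.10, 0.35, 0.40, 0.45, 0.55, 0.62, 0.80, 0.90]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

candidates = [round(0.1 * i, 1) for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
threshold, accuracy = best_threshold(probs, labels, candidates)
print(threshold, accuracy)
```

Pick the threshold using train (or validate) only; the test split should not influence this choice.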

Do your work for these exercises in either a notebook or a Python script named model within your classification-exercises repository. Add, commit, and push your work.

1. Start by defining your baseline model.

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

import prepare

In [2]:
train, validate, test = prepare.prep_titanic()




### About X_ and y_

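X_train, X_validate, and X_test each need to be a DataFrame (two-dimensional), even when you use only one feature.
X_train can be defined with `train[[columns]]` or `train.drop(columns=[...])`; the same goes for validate and test.
y_train can be a Series, defined with `train[col]` or `train.col`; the same goes for validate and test.

You only need to split a dataset into X and y right before you use it.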

In [None]:
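demo_train = train[['age', 'survived']]
demo_train2 = train[['age', 'pclass', 'survived']]

# two equivalent ways to build a one-feature X (both return a DataFrame)
X_demo_train = demo_train[['age']]
X_demo_train = demo_train.drop(columns=['survived'])
# the line below will not work in your algorithm (e.g. LogisticRegression.fit())
# X_demo_train = demo_train['age']
# because single brackets return a Series, and X needs to be a DataFrame
# ___________

# same idea with two features; note these use demo_train2, which has pclass
X_demo_train2 = demo_train2[['age', 'pclass']]
X_demo_train2 = demo_train2.drop(columns=['survived'])

# y can be a Series; double brackets would give a one-column DataFrame instead
# y_demo_train = demo_train[['survived']]
y_demo_train = demo_train['survived']
y_demo_train = demo_train.survived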

In [None]:
train.info()

In [3]:
train.survived.value_counts(normalize=True)

0    0.617706
1    0.382294
Name: survived, dtype: float64

Baseline accuracy with no model: if I always guess that the passenger did not survive, I would be right about 62% of the time on the training data.
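
The baseline is just the share of the most common class, which is what the largest value from `value_counts(normalize=True)` reports. A small plain-Python sketch of the same idea, using a toy list of labels (not the real training data) and a hypothetical helper name:

```python
from collections import Counter


def baseline_accuracy(labels):
    """Return (most_common_class, accuracy of always predicting it)."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)


# Toy labels standing in for train.survived (0 = did not survive, 1 = survived)
toy_survived = [0, 0, 1, 0, 1, 0, 0, 1, 0, 1]

majority_class, accuracy = baseline_accuracy(toy_survived)
print(majority_class, accuracy)
```

Any model worth keeping should beat this number on the same split.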

In [23]:
logit1 = LogisticRegression()

X_train1 = train.drop(columns=['survived'])
y_train = train.survived

logit1 = logit1.fit(X_train1, y_train)
print(logit1.coef_)
print(X_train1.columns)

[[-0.73460016 -0.0160103  -0.44806166 -0.13603857  0.00431481 -0.75917746
  -2.25825746  0.50855606  0.24301123]]
Index(['pclass', 'age', 'sibsp', 'parch', 'fare', 'alone', 'sex_male',
       'embarked_Q', 'embarked_S'],
      dtype='object')




In [5]:
y_pred1 = logit1.predict(X_train1)

In [6]:
logit1.score(X_train1, y_train)

0.7987927565392354

Model 1, my baseline logistic regression with all features: 80% accuracy on train, much better than the 62% from always guessing "did not survive".

2. Create another model that includes age in addition to fare and pclass. Does this model perform better than your baseline?

In [7]:
X_train2 = train[['age', 'fare', 'pclass']]
y_train = train.survived

logit2 = LogisticRegression()
logit2 = logit2.fit(X_train2, y_train)

print(logit2.coef_)
print(X_train2.columns)

[[-0.02427755  0.00415002 -0.83379906]]
Index(['age', 'fare', 'pclass'], dtype='object')




In [8]:
logit2.score(X_train2, y_train)

0.7082494969818913

This one did not perform nearly as well as model 1 on train: 71% vs. 80% accuracy.

3. Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.

In [9]:
train.head()

     survived  pclass        age  sibsp  parch      fare  alone  sex_male  embarked_Q  embarked_S
583         0       1       36.0      0      0    40.125      1         1           0           0
337         1       1       41.0      0      0     134.5      1         0           0           0
50          0       3        7.0      4      1   39.6875      0         1           0           1
218         1       1       32.0      0      0   76.2917      1         0           0           0
31          1       1  29.916875      1      0  146.5208      0         0           0           0


In [10]:
X_train3 = train[['age', 'fare', 'pclass', 'sex_male']]
y_train = train.survived

logit3 = LogisticRegression()
logit3 = logit3.fit(X_train3, y_train)

print(logit3.coef_)
print(X_train3.columns)
print(logit3.score(X_train3, y_train))

[[-0.01431495  0.00310367 -0.82578862 -2.29993317]]
Index(['age', 'fare', 'pclass', 'sex_male'], dtype='object')
0.7867203219315896




In [11]:
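# SIDE NOTE
# demonstrating how coefficients (weights) combine with feature values
# a weight of about .003 on a $500 fare adds about 1.5 to the log-odds, which is
# comparable in magnitude to a weight of about -2.3 on sex_male == 1

1 * 2.3     # size of the sex_male contribution
500 * .003  # size of the fare contribution (last expression, so it is displayed)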

1.5
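
To make the side note a bit more concrete: each feature adds coefficient × value to the log-odds of survival, and exponentiating that amount gives the factor by which the odds are multiplied. The sketch below uses the rounded coefficients printed for model 3 above; the helper names are made up for illustration, and no intercept is needed for this comparison.

```python
import math

# Rounded coefficients from model 3 (logit3.coef_ above)
coef_fare = 0.003
coef_sex_male = -2.3


def contribution(coef, value):
    """Amount this feature adds to the log-odds of survival."""
    return coef * value


def odds_multiplier(log_odds_change):
    """Factor by which the odds of survival are multiplied."""
    return math.exp(log_odds_change)


fare_500 = contribution(coef_fare, 500)  # a $500 fare
male = contribution(coef_sex_male, 1)    # sex_male == 1

print(round(fare_500, 2), round(odds_multiplier(fare_500), 2))
print(round(male, 2), round(odds_multiplier(male), 2))
```

So a small coefficient is not necessarily an unimportant one: it depends on the scale of the feature it multiplies.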

4. Try out other combinations of features and models.

In [15]:
X_train4 = train[['sex_male']]
y_train = train.survived

def my_logit(X_train):
    # fit on the given features; y_train is the target Series defined above
    model = LogisticRegression()
    model = model.fit(X_train, y_train)
    return model, model.coef_, model.score(X_train, y_train)

In [17]:
logit4, coefs, accuracy = my_logit(X_train4)
print(coefs, accuracy)

[[-2.34811852]] 0.7847082494969819




In [19]:
train.columns

Index(['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare', 'alone',
       'sex_male', 'embarked_Q', 'embarked_S'],
      dtype='object')

In [21]:
X_train5 = train[['sibsp', 'parch', 'alone', 'embarked_Q', 'embarked_S']]

logit5, coefs, accuracy = my_logit(X_train5)
print(coefs, accuracy)

[[-0.5552899  -0.06432213 -1.74410376  0.14466641 -0.32118676]] 0.6941649899396378




In [22]:
X_train6 = train[['alone', 'sex_male', 'pclass']]

logit6, coefs, accuracy = my_logit(X_train6)
print(coefs, accuracy)

[[-0.26318011 -2.27623838 -0.83010265]] 0.7847082494969819




5. Use your best 3 models to predict and evaluate on your validate sample.

We will use models 1 (all vars), 4 (sex_male), & 6 (alone, pclass, sex_male)

In [24]:
X_validate1 = validate.drop(columns=['survived'])
X_validate4 = validate[['sex_male']]
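# columns must be in the same order used to fit logit6 (X_train6);
# the results recorded below came from ['alone', 'pclass', 'sex_male'], so model 6's numbers may change on a rerun
X_validate6 = validate[['alone', 'sex_male', 'pclass']]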

y_validate = validate.survived

acc1 = logit1.score(X_validate1, y_validate)
acc4 = logit4.score(X_validate4, y_validate)
acc6 = logit6.score(X_validate6, y_validate)

print(acc1, acc4, acc6)

0.780373831775701 0.7663551401869159 0.7429906542056075
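
A caution that applies whenever you score a fitted model on a new split: the feature columns need to be in the same order they had during `fit`. Depending on the scikit-learn version, a mismatch in DataFrame column order may raise an error, or the coefficients may simply be applied by position. The plain-Python sketch below mimics the by-position case using model 6's rounded coefficients; the intercept and the example passenger are assumptions made up for illustration.

```python
# Rounded coefficients from logit6, in the order used for fitting
fit_columns = ['alone', 'sex_male', 'pclass']
coefs = [-0.26, -2.28, -0.83]
intercept = 3.0  # hypothetical value; the notebook never prints the intercept


def log_odds(row, columns):
    """Apply the coefficients by position to the values of `row` in `columns` order."""
    values = [row[col] for col in columns]
    return intercept + sum(c * v for c, v in zip(coefs, values))


# Hypothetical passenger: female, travelling with family, third class
passenger = {'alone': 0, 'sex_male': 0, 'pclass': 3}

right_order = log_odds(passenger, fit_columns)
wrong_order = log_odds(passenger, ['alone', 'pclass', 'sex_male'])

print(round(right_order, 2))  # pclass is paired with its own coefficient
print(round(wrong_order, 2))  # pclass is paired with sex_male's coefficient
```

A simple guard is to keep one list of column names per model and reuse it for every split, rather than retyping the list.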


In [26]:
y1_pred = logit1.predict(X_validate1)
y4_pred = logit4.predict(X_validate4)
y6_pred = logit6.predict(X_validate6)

print("y1: All Vars:\n", classification_report(y_validate, y1_pred))

print("y4: Sex_male only:\n", classification_report(y_validate, y4_pred))

print("y6: Alone, Pclass, Sex_male:\n", classification_report(y_validate, y6_pred))

y1: All Vars:
               precision    recall  f1-score   support

           0       0.80      0.86      0.83       132
           1       0.75      0.65      0.69        82

   micro avg       0.78      0.78      0.78       214
   macro avg       0.77      0.75      0.76       214
weighted avg       0.78      0.78      0.78       214

y4: Sex_male only:
               precision    recall  f1-score   support

           0       0.80      0.83      0.81       132
           1       0.71      0.67      0.69        82

   micro avg       0.77      0.77      0.77       214
   macro avg       0.75      0.75      0.75       214
weighted avg       0.76      0.77      0.77       214

y6: Alone, Pclass, Sex_male:
               precision    recall  f1-score   support

           0       0.71      1.00      0.83       132
           1       1.00      0.33      0.50        82

   micro avg       0.74      0.74      0.74       214
   macro avg       0.85      0.66      0.66       214
weighted av
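
The numbers in a classification report all come from four counts. As a worked example, the counts below are inferred from the rounded model 6 report above (132 actual non-survivors, 82 actual survivors, precision 1.00 and recall 0.33 for class 1). They are an assumption reconstructed from those figures, not output copied from the notebook.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall, and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts treating "survived" (1) as the positive class
tp, fp, fn, tn = 27, 0, 55, 132

accuracy = (tp + tn) / (tp + fp + fn + tn)
survived = class_metrics(tp, fp, fn)
# For class 0 the roles flip: true negatives become its "true positives"
died = class_metrics(tp=tn, fp=fn, fn=fp)

print(round(accuracy, 2))
print([round(m, 2) for m in survived])
print([round(m, 2) for m in died])
```

Read this way, model 6's 74% accuracy hides the fact that it finds only about a third of the actual survivors on validate.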

6. Choose your best model based on validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? To train?

We will use the first model, the one with all X variables, since it had the highest accuracy on validate (78%).

In [27]:
X_test1 = test.drop(columns=['survived'])
y_test = test.survived

test_acc = logit1.score(X_test1, y_test)
y1_pred = logit1.predict(X_test1)

print(test_acc)
print("test report:\n", classification_report(y_test, y1_pred))

0.8202247191011236
test report:
               precision    recall  f1-score   support

           0       0.84      0.87      0.86       110
           1       0.78      0.74      0.76        68

   micro avg       0.82      0.82      0.82       178
   macro avg       0.81      0.80      0.81       178
weighted avg       0.82      0.82      0.82       178

Model 1 scored about 82% accuracy on test, compared with 78% on validate and 80% on train. The three are within a few points of each other, so the model does not appear to be overfit to the training data.



Bonus1: How do different strategies for handling the missing values in the age column affect model performance?

Bonus2: How do different strategies for encoding sex affect model performance?

Bonus3: scikit-learn's LogisticRegression classifier is actually applying a regularization penalty to the coefficients by default. This penalty causes the magnitude of the coefficients in the resulting model to be smaller than they otherwise would be. This value can be modified with the C hyperparameter. Small values of C correspond to a larger penalty, and large values of C correspond to a smaller penalty.
Try out the following values for C and note how the coefficients and the model's performance on both the dataset it was trained on and on the validate split are affected.
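
To see the effect of C without the Titanic data, here is a rough sketch of L2-penalized logistic regression on one feature, fit with plain gradient descent. It is not scikit-learn's solver, just an illustration of a penalty that scales with 1/C. The toy data, the function name `fit_l2_logistic`, and the C values (0.01, 1, 100) are all assumptions chosen for the example, since the list of values to try is not included above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_l2_logistic(xs, ys, C, n_iter=5000):
    """Minimize  w**2 / (2 * C) + sum of log-loss  for a single feature.

    Plain gradient descent; the intercept b is left unpenalized.
    """
    w, b = 0.0, 0.0
    # Step size from an upper bound on the curvature, so the updates stay stable
    step = 1.0 / (0.25 * sum(x * x + 1.0 for x in xs) + 1.0 / C)
    for _ in range(n_iter):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) + w / C
        grad_b = sum(errors)
        w -= step * grad_w
        b -= step * grad_b
    return w, b


# Toy, non-separable data: larger x loosely goes with y == 1
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

coefs_by_C = {C: fit_l2_logistic(xs, ys, C)[0] for C in (0.01, 1, 100)}
for C, w in coefs_by_C.items():
    print(C, round(w, 3))
```

The coefficient grows as C grows, which is the pattern to look for in the real exercise: fit `LogisticRegression(C=c)` for each value, then compare `.coef_` and `.score(...)` on train and validate.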