In [1]:
import pandas as pd
import prepare
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from acquire import get_titanic_data

### In this exercise, we'll continue working with the titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

For all of the models you create, choose a threshold that optimizes for accuracy.

Do your work for these exercises in either a notebook or a Python script named model within your classification-exercises repository. Add, commit, and push your work.
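
The cells in this notebook call `.predict` and `.score`, which classify at the default 0.5 probability cutoff. As a minimal sketch of how a threshold could instead be chosen for accuracy, the block below scans a grid of cutoffs over `predict_proba` output. The data is synthetic (an assumption, not the titanic set), and `accuracy_at` and `best_threshold` are hypothetical names used only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the titanic data (assumption: not the real dataset).
X, y = make_classification(n_samples=600, n_features=5, random_state=0)
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)

# Probability of the positive class for each validate row.
proba = model.predict_proba(X_validate)[:, 1]


def accuracy_at(threshold):
    """Accuracy when predicting 1 for probabilities at or above `threshold`."""
    predictions = (proba >= threshold).astype(int)
    return float(np.mean(predictions == y_validate))


# Evaluate a grid of candidate thresholds and keep the most accurate one.
thresholds = np.linspace(0.05, 0.95, 19)
accuracies = [accuracy_at(t) for t in thresholds]
best_threshold = float(thresholds[int(np.argmax(accuracies))])

print(best_threshold, max(accuracies))
```

Whichever split is used to pick the threshold, the test set should stay out of that choice so that the final evaluation remains unbiased.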

1. Start by defining your baseline model.

In [2]:
train, validate, test = prepare.prep_titanic()
train

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,alone,embarked_Q,embarked_S,sex_male
583,0,1,36.000000,0,0,40.1250,1,0,0,1
337,1,1,41.000000,0,0,134.5000,1,0,0,0
50,0,3,7.000000,4,1,39.6875,0,0,1,1
218,1,1,32.000000,0,0,76.2917,1,0,0,0
31,1,1,29.916875,1,0,146.5208,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
313,0,3,28.000000,0,0,7.8958,1,0,1,1
636,0,3,32.000000,0,0,7.9250,1,0,1,1
222,0,3,51.000000,0,0,8.0500,1,0,1,1
485,0,3,29.916875,3,1,25.4667,0,0,1,0


In [3]:
train.survived.value_counts(normalize=True)
# Majority-class reference: predicting that everyone died gives ~62% accuracy on train

0    0.617706
1    0.382294
Name: survived, dtype: float64
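
The majority-class accuracy read off `value_counts` above can also be computed explicitly. The following is a small sketch on a made-up series (an assumption standing in for `train.survived`); `baseline_class`, `baseline_pred`, and `baseline_accuracy` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for train.survived (assumption: not the real data).
survived = pd.Series([0, 0, 0, 1, 1, 0, 1, 0], name="survived")

# The majority-class predictor always outputs the most frequent label.
baseline_class = survived.mode()[0]
baseline_pred = pd.Series(baseline_class, index=survived.index)

# Its accuracy is the share of rows that match the majority label.
baseline_accuracy = (baseline_pred == survived).mean()

print(baseline_class, baseline_accuracy)
```

Any fitted model should beat this number to be worth keeping.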

In [31]:
logit1 = LogisticRegression()

X_train1 = train.drop(columns=['survived'])
y_train = train.survived

logit1 = logit1.fit(X_train1, y_train)
print(logit1.coef_)
print(X_train1.columns)

[[-1.07859679e+00 -3.10510561e-02 -5.17601592e-01 -2.04025452e-01
   1.66729023e-03 -9.16177527e-01  8.99227745e-01  2.28830240e-01
  -2.42572095e+00]]
Index(['pclass', 'age', 'sibsp', 'parch', 'fare', 'alone', 'embarked_Q',
       'embarked_S', 'sex_male'],
      dtype='object')
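
The raw coefficient array is easier to read when paired with its column names. The sketch below copies the values printed above into a small table and adds `np.exp(coef)`, which for logistic regression is the multiplicative change in the odds of survival per one-unit increase in the feature. The `summary` name is an illustrative choice, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Coefficients and column names copied from the output above.
coefs = np.array([-1.07859679e+00, -3.10510561e-02, -5.17601592e-01,
                  -2.04025452e-01, 1.66729023e-03, -9.16177527e-01,
                  8.99227745e-01, 2.28830240e-01, -2.42572095e+00])
columns = ['pclass', 'age', 'sibsp', 'parch', 'fare', 'alone',
           'embarked_Q', 'embarked_S', 'sex_male']

# One row per feature: log-odds coefficient and the matching odds ratio.
summary = pd.DataFrame(
    {'coef': coefs, 'odds_ratio': np.exp(coefs)},
    index=columns,
).sort_values('coef')

print(summary)
```

Because the features are on different scales and `LogisticRegression` applies L2 regularization by default, coefficient magnitudes here are a rough guide rather than a strict ranking of importance.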


In [32]:
y_pred1 = logit1.predict(X_train1)

In [33]:
logit1.score(X_train1, y_train)

0.8048289738430584

This baseline model, a logistic regression on all features, has ~80% accuracy on train, which is well above the 62% from simply assuming everyone died.

2. Create another model that includes age in addition to fare and pclass. Does this model perform better than your baseline?

In [36]:
X_train2 = train[['age', 'fare', 'pclass']]

logit2 = LogisticRegression()
logit2 = logit2.fit(X_train2, y_train)

print(logit2.coef_)
print(X_train2.columns)

[[-0.03051881  0.00266519 -0.97983178]]
Index(['age', 'fare', 'pclass'], dtype='object')


In [37]:
logit2.score(X_train2, y_train)

0.716297786720322

This model has only 72% accuracy on train, which is worse than the baseline.

3. Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.

In [12]:
X_train3 = train[['age', 'fare', 'pclass', 'sex_male']]

logit3 = LogisticRegression()
logit3 = logit3.fit(X_train3, y_train)

print(logit3.coef_)
print(X_train3.columns)

[[-2.66594879e-02  9.02716903e-04 -1.11402368e+00 -2.45878213e+00]]
Index(['age', 'fare', 'pclass', 'sex_male'], dtype='object')


In [13]:
logit3.score(X_train3, y_train)

0.7987927565392354

This model has an accuracy of ~80% on train, which is roughly on par with the baseline.

4. Try out other combinations of features and models.

In [17]:
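def my_logit(X_train):
    # Fit a default LogisticRegression on X_train against the global y_train;
    # return the fitted model, its coefficients, and its accuracy on train.
    logreg = LogisticRegression()
    logreg = logreg.fit(X_train, y_train)
    return logreg, logreg.coef_, logreg.score(X_train, y_train)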

In [18]:
X_train4 = train[['pclass', 'sex_male', 'alone']]

logit4, coefs, acc = my_logit(X_train4)

print(coefs, acc)

[[-0.95701015 -2.40744024 -0.30828946]] 0.7847082494969819


The accuracy of this model is only 78%, which is still worse than the baseline model.

In [20]:
X_train5 = train[['pclass', 'sex_male', 'alone', 'age']]

logit5, coefs, acc = my_logit(X_train5)

print(coefs, acc)

[[-1.12720398 -2.41479961 -0.17176794 -0.02570129]] 0.7967806841046278


The accuracy of this model is also ~80%, which is roughly on par with the baseline model.

In [22]:
X_train6 = train[['sex_male']]

logit6, coefs, acc = my_logit(X_train6)

print(coefs, acc)

[[-2.37681345]] 0.7847082494969819


The accuracy of this model is only 78%, which matches the pclass, sex_male, alone model and falls slightly short of the baseline.
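
Trying feature combinations by hand, as in the cells above, can also be automated. The sketch below loops over every two- and three-feature subset and ranks them by train accuracy. It runs on synthetic data with placeholder column names `f1` to `f4` (assumptions, not the titanic columns), and `fit_and_score` is a hypothetical helper.

```python
from itertools import combinations

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the prepared titanic frame (assumption: not the real data).
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=1, random_state=1)
train = pd.DataFrame(X, columns=['f1', 'f2', 'f3', 'f4'])
train['target'] = y


def fit_and_score(features):
    """Fit a logistic regression on `features` and return its train accuracy."""
    model = LogisticRegression().fit(train[features], train['target'])
    return model.score(train[features], train['target'])


# Score every two- and three-feature combination.
candidates = ['f1', 'f2', 'f3', 'f4']
results = {
    combo: fit_and_score(list(combo))
    for size in (2, 3)
    for combo in combinations(candidates, size)
}

# Rank feature sets from most to least accurate on train.
ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
for combo, acc in ranked:
    print(combo, round(acc, 3))
```

Ranking on train alone favors larger feature sets, so the top candidates still need to be checked on validate before one is chosen.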

5. Use your best 3 models to predict and evaluate on your validate sample.

My three best models are model 1 (all vars), model 3 (pclass, sex_male, fare, age), and model 5 (pclass, sex_male, alone, age).

In [45]:
X_validate1 = validate.drop(columns='survived')
X_validate3 = validate[['age', 'fare', 'pclass', 'sex_male']]
X_validate5 = validate[['pclass', 'sex_male', 'alone', 'age']]
y_validate = validate.survived

acc1 = logit1.score(X_validate1, y_validate)
acc3 = logit3.score(X_validate3, y_validate)
acc5 = logit5.score(X_validate5, y_validate)

print(acc1, acc3, acc5)

0.7990654205607477 0.780373831775701 0.7850467289719626
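
To compare these validate scores against train, the accuracies reported so far can be placed side by side. The sketch below copies the numbers from the outputs in this notebook into a table and computes the train-minus-validate gap; `scores` and `gap` are illustrative names.

```python
import pandas as pd

# Accuracies copied from the outputs earlier in this notebook.
scores = pd.DataFrame(
    {
        'train': [0.8048289738430584, 0.7987927565392354, 0.7967806841046278],
        'validate': [0.7990654205607477, 0.780373831775701, 0.7850467289719626],
    },
    index=['model 1', 'model 3', 'model 5'],
)

# A large positive gap (train minus validate) would suggest overfitting.
scores['gap'] = scores['train'] - scores['validate']

print(scores.round(3))
```

All three gaps are under two percentage points, and model 1 has both the smallest gap and the highest validate accuracy.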


In [46]:
y1_pred = logit1.predict(X_validate1)
y3_pred = logit3.predict(X_validate3)
y5_pred = logit5.predict(X_validate5)

print('y1: All Vars:\n', classification_report(y_validate, y1_pred))
print('y3: pclass, sex_male, fare, age:\n', classification_report(y_validate, y3_pred))
print('y5: pclass, sex_male, alone, age:\n', classification_report(y_validate, y5_pred))

y1: All Vars:
               precision    recall  f1-score   support

           0       0.82      0.87      0.84       132
           1       0.77      0.68      0.72        82

    accuracy                           0.80       214
   macro avg       0.79      0.78      0.78       214
weighted avg       0.80      0.80      0.80       214

y3: pclass, sex_male, fare, age:
               precision    recall  f1-score   support

           0       0.81      0.83      0.82       132
           1       0.72      0.70      0.71        82

    accuracy                           0.78       214
   macro avg       0.77      0.76      0.77       214
weighted avg       0.78      0.78      0.78       214

y5: pclass, sex_male, alone, age:
               precision    recall  f1-score   support

           0       0.82      0.83      0.83       132
           1       0.72      0.71      0.72        82

    accuracy                           0.79       214
   macro avg       0.77      0.77      0.77 

6. Choose your best model from the validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? to train?

The best model is model 1, which uses all X variables: it has the highest validate accuracy (0.799, versus 0.780 for model 3 and 0.785 for model 5).

In [50]:
X_test = test.drop(columns=['survived'])
y_test = test.survived

test_acc = logit1.score(X_test, y_test)
y1_pred = logit1.predict(X_test)

print(test_acc)
print('test report:\n', classification_report(y_test, y1_pred))

0.797752808988764
test report:
               precision    recall  f1-score   support

           0       0.84      0.84      0.84       110
           1       0.74      0.74      0.74        68

    accuracy                           0.80       178
   macro avg       0.79      0.79      0.79       178
weighted avg       0.80      0.80      0.80       178

Test accuracy (0.798) is nearly identical to validate (0.799) and only slightly below train (0.805), so model 1 does not appear to be overfitting.

