# Model Exercises

## Curriculum Model

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


import matplotlib.pyplot as plt

import seaborn as sns
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

from acquire import get_titanic_data
from prepare import prep_titanic

df = get_titanic_data()
df.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


In [2]:
# Drop every row that has a missing value in any column (not only `age`).
df.dropna(inplace=True)
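
Note that `dropna()` with no arguments drops a row if any column is missing, which is why only 127 training rows remain after the split below. A minimal sketch on a made-up three-row frame (the values and the name `toy` are invented for illustration, not Titanic data) contrasts this with restricting the check through `subset`:

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration: `deck` and `age` are each missing once.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "deck": [np.nan, "C", "C"],
    "survived": [0, 1, 1],
})

# No arguments: a row is dropped if ANY column is missing.
dropped_any = toy.dropna()

# subset=["age"]: a row is dropped only if `age` is missing.
dropped_age_only = toy.dropna(subset=["age"])

print(len(dropped_any), len(dropped_age_only))
```

If the goal is only to handle missing ages, `subset=["age"]` (or imputation) would keep far more rows.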

In [3]:
X = df[['pclass','age','fare','sibsp','parch']]
y = df[['survived']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 123)

X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 127 entries, 123 to 540
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   pclass  127 non-null    int64  
 1   age     127 non-null    float64
 2   fare    127 non-null    float64
 3   sibsp   127 non-null    int64  
 4   parch   127 non-null    int64  
dtypes: float64(2), int64(3)
memory usage: 6.0 KB


In [4]:
# from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(C=1, class_weight={1:2}, random_state = 123, solver='saga')

In [5]:
logit.fit(X_train, y_train)

LogisticRegression(C=1, class_weight={1: 2}, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=123, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)

In [6]:
# Print the coefficients and intercept of the model
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[1.30411374e-02 8.72240193e-05 1.53779647e-02 5.48610411e-03
  1.65371660e-03]]
Intercept: 
 [0.00655794]


In [7]:
# Estimate whether or not a passenger would survive, using the training data
y_pred = logit.predict(X_train)

In [8]:
# Estimate the probability of a passenger surviving, using the training data
y_pred_proba = logit.predict_proba(X_train)

In [9]:
# Compute the accuracy
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X_train, y_train)))

Accuracy of Logistic Regression classifier on training set: 0.64


In [10]:
# Create a confusion matrix
print(confusion_matrix(y_train, y_pred))

[[ 0 46]
 [ 0 81]]


In [11]:
# Compute Precision, Recall, F1-score, and Support
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        46
           1       0.64      1.00      0.78        81

    accuracy                           0.64       127
   macro avg       0.32      0.50      0.39       127
weighted avg       0.41      0.64      0.50       127



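Curriculum model = 64% accuracy. The confusion matrix shows it predicts `survived = 1` for every passenger, so this figure is just the share of survivors among these 127 rows.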

### My Baseline calculation

In [12]:
# split df
tdf = get_titanic_data()
tdf.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


In [13]:
train, validate, test = prep_titanic(tdf)

In [14]:
print(train.shape, validate.shape, test.shape)

(497, 11) (214, 11) (178, 11)


In [15]:
train.survived.mean()

0.3822937625754527

In [16]:
train.survived.value_counts()

0    307
1    190
Name: survived, dtype: int64

In [17]:
# died is the majority response - requires human intervention, but gives same result as Ryan's
# positive case = died
my_baseline_accuracy = 307/(307+190)
my_baseline_accuracy

0.6177062374245473

In [18]:
# Ryan's method - can be automated in a function
train['baseline_prediction'] = 0
pd.crosstab(train.baseline_prediction, train.survived)

survived               0    1
baseline_prediction
0                    307  190


In [19]:
baseline_accuracy = (train.baseline_prediction == train.survived).mean()
baseline_accuracy

0.6177062374245473

Baseline accuracy = 62%
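
The baseline above hard-codes `0` as the majority class. A minimal sketch of automating it, assuming the target is a pandas Series (`baseline_accuracy` is a hypothetical helper name, not part of the original notebook):

```python
import pandas as pd

def baseline_accuracy(target: pd.Series) -> float:
    """Accuracy of always predicting the most frequent class in `target`."""
    majority_class = target.mode()[0]
    return float((target == majority_class).mean())

# Rebuild the class counts shown above (307 died, 190 survived).
survived = pd.Series([0] * 307 + [1] * 190, name="survived")
acc = baseline_accuracy(survived)
print(round(acc, 4))
```

Calling `baseline_accuracy(train.survived)` should reproduce the 0.6177 computed by hand.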

#### 1. Create another model that includes age in addition to fare and pclass. Does this model perform better than your previous one?

In [20]:
# understand the question to mean: create a model that has age, fare, and pclass as only features
logit = LogisticRegression()

In [21]:
train.head()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,embark_town,alone,sex_male,embark_town_Queenstown,embark_town_Southampton,baseline_prediction
583,0,1,36.0,0,0,40.125,Cherbourg,1,1,0,0,0
337,1,1,41.0,0,0,134.5,Cherbourg,1,0,0,0,0
50,0,3,7.0,4,1,39.6875,Southampton,0,1,0,1,0
218,1,1,32.0,0,0,76.2917,Cherbourg,1,0,0,0,0
31,1,1,28.0,1,0,146.5208,Cherbourg,0,0,0,0,0


In [22]:
X_train_afp = train.drop(columns=['baseline_prediction', 'survived', 'sex_male', 'alone', 'sibsp', 'parch', 'embark_town', 'embark_town_Queenstown', 'embark_town_Southampton'])
y_train_afp = train.survived

X_validate_afp = validate.drop(columns=['survived', 'sex_male', 'alone', 'sibsp', 'parch', 'embark_town', 'embark_town_Queenstown', 'embark_town_Southampton'])
y_validate_afp = validate.survived

X_test_afp = test.drop(columns=['survived', 'sex_male', 'alone', 'sibsp', 'parch', 'embark_town', 'embark_town_Queenstown', 'embark_town_Southampton'])
y_test_afp = test.survived

In [23]:
X_train_afp.head()

Unnamed: 0,pclass,age,fare
583,1,36.0,40.125
337,1,41.0,134.5
50,3,7.0,39.6875
218,1,32.0,76.2917
31,1,28.0,146.5208


In [24]:
y_train_afp.head()

583    0
337    1
50     0
218    1
31     1
Name: survived, dtype: int64

In [25]:
# Now fit to X_train, y_train for the attributes age, fare, pclass only
logit_afp = logit.fit(X_train_afp, y_train_afp)

In [26]:
print(logit_afp.coef_)


print(logit_afp.intercept_)

[[-0.98002535 -0.03012701  0.00269178]]
[2.50720857]
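
To see what these numbers mean, the predicted probability can be rebuilt by hand as the sigmoid of the intercept plus the weighted features. This is a sketch using the rounded values printed above, so it may differ slightly from `predict_proba`; `survival_probability` is a name made up for this example.

```python
import math

# Coefficients and intercept copied from the output above (order: pclass, age, fare).
coef = [-0.98002535, -0.03012701, 0.00269178]
intercept = 2.50720857

def survival_probability(pclass: float, age: float, fare: float) -> float:
    """Sigmoid of intercept + coef . features."""
    z = intercept + coef[0] * pclass + coef[1] * age + coef[2] * fare
    return 1.0 / (1.0 + math.exp(-z))

# First training row shown earlier: pclass=1, age=36.0, fare=40.125.
p = survival_probability(1, 36.0, 40.125)
print(round(p, 3))
```

The probability for this row comes out above 0.5, so the model predicts survival even though the row's label is 0.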


In [27]:
X_train_afp.columns

Index(['pclass', 'age', 'fare'], dtype='object')

In [28]:
# Predict values on X_train.
y_pred_afp = logit_afp.predict(X_train_afp)
y_pred_proba_afp = logit_afp.predict_proba(X_train_afp)

In [29]:
# model age, fare, pclass accuracy
logit_afp.score(X_train_afp, y_train_afp)

0.7142857142857143

In [30]:
# confusion matrix
print(confusion_matrix(y_train_afp, y_pred_afp))

[[265  42]
 [100  90]]


In [31]:
# classification report for Model afp
print(classification_report(y_train_afp, y_pred_afp))

              precision    recall  f1-score   support

           0       0.73      0.86      0.79       307
           1       0.68      0.47      0.56       190

    accuracy                           0.71       497
   macro avg       0.70      0.67      0.67       497
weighted avg       0.71      0.71      0.70       497
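
The class-1 row of the report can be recomputed directly from the confusion matrix, which makes the definitions explicit. A short sketch using the four counts printed above (rows are actual, columns are predicted):

```python
# Confusion matrix from the output above: rows are actual, columns are predicted.
tn, fp = 265, 42
fn, tp = 100, 90

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted survivors, how many actually survived
recall = tp / (tp + fn)      # of actual survivors, how many the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The low recall shows that the model misses more than half of the actual survivors.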



This model, using only age, fare, and pclass, has 71% accuracy on the training data.  
Missing ages were filled with imputed values.  

Accuracy:   
This model performs better than the 62% baseline.

#### 2. Include sex in your model as well. Note that you'll need to encode this feature before including it in a model.


In [32]:
# understand the question to mean: create a model that has sex, age, fare, and pclass as features
logit = LogisticRegression()

In [33]:
train.head()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,embark_town,alone,sex_male,embark_town_Queenstown,embark_town_Southampton,baseline_prediction
583,0,1,36.0,0,0,40.125,Cherbourg,1,1,0,0,0
337,1,1,41.0,0,0,134.5,Cherbourg,1,0,0,0,0
50,0,3,7.0,4,1,39.6875,Southampton,0,1,0,1,0
218,1,1,32.0,0,0,76.2917,Cherbourg,1,0,0,0,0
31,1,1,28.0,1,0,146.5208,Cherbourg,0,0,0,0,0


In [79]:
X_train_safp = train.drop(columns=['baseline_prediction', 'survived', 'alone', 'sibsp', 'parch', 'embark_town', 'embark_town_Queenstown', 'embark_town_Southampton'])
y_train_safp = train.survived

X_validate_safp = validate.drop(columns=['survived', 'alone', 'sibsp', 'parch', 'embark_town', 'embark_town_Queenstown', 'embark_town_Southampton'])
y_validate_safp = validate.survived

X_test_safp = test.drop(columns=['survived', 'alone', 'sibsp', 'parch', 'embark_town', 'embark_town_Queenstown', 'embark_town_Southampton'])
y_test_safp = test.survived

In [35]:
X_train_safp.head()

Unnamed: 0,pclass,age,fare,sex_male
583,1,36.0,40.125,1
337,1,41.0,134.5,0
50,3,7.0,39.6875,1
218,1,32.0,76.2917,0
31,1,28.0,146.5208,0


In [80]:
# Now fit to X_train_safp, y_train_safp for the attributes sex, age, fare, pclass
logit_safp = logit.fit(X_train_safp, y_train_safp)

In [81]:
print(logit_safp.coef_)


print(logit_safp.intercept_)

[[-1.11442057e+00 -2.63670192e-02  9.26636404e-04 -2.45962126e+00]]
[4.28890524]
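
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. A sketch using the values printed above, paired with the column order of `X_train_safp` (the name `odds_ratios` is made up for this example):

```python
import math

# Coefficients copied from the output above, in X_train_safp column order.
features = ["pclass", "age", "fare", "sex_male"]
coefs = [-1.11442057e+00, -2.63670192e-02, 9.26636404e-04, -2.45962126e+00]

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = {name: math.exp(c) for name, c in zip(features, coefs)}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
```

An odds ratio well below 1 for `sex_male` means the model assigns male passengers much lower odds of survival than female passengers with the same class, age, and fare.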


In [82]:
X_train_safp.columns

Index(['pclass', 'age', 'fare', 'sex_male'], dtype='object')

In [83]:
# Predict values on X_train.
y_pred_safp = logit_safp.predict(X_train_safp)
y_pred_proba_safp = logit_safp.predict_proba(X_train_safp)

In [84]:
# model sex, age, fare, pclass accuracy
logit_safp.score(X_train_safp, y_train_safp)

0.7927565392354124

This model, using only sex, age, fare, and pclass, has 79% accuracy on the training data.  
Missing ages were filled with imputed values.  

Accuracy:   
This model performs better than the 62% baseline and better than the 71% of the model without sex.

#### 3. Try out other combinations of features and models.

In [41]:
logit = LogisticRegression()

In [42]:
train.head()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,embark_town,alone,sex_male,embark_town_Queenstown,embark_town_Southampton,baseline_prediction
583,0,1,36.0,0,0,40.125,Cherbourg,1,1,0,0,0
337,1,1,41.0,0,0,134.5,Cherbourg,1,0,0,0,0
50,0,3,7.0,4,1,39.6875,Southampton,0,1,0,1,0
218,1,1,32.0,0,0,76.2917,Cherbourg,1,0,0,0,0
31,1,1,28.0,1,0,146.5208,Cherbourg,0,0,0,0,0


In [43]:
# Model pclass as only attribute
X_train_p = train.drop(columns=['baseline_prediction', 'survived', 'age', 'fare', 'sex_male', 'alone', 'sibsp', 'parch', 'embark_town', 'embark_town_Queenstown', 'embark_town_Southampton'])
y_train_p = train.survived

X_validate_p = validate.drop(columns=['survived', 'age', 'fare', 'sex_male',  'alone', 'sibsp', 'parch', 'embark_town', 'embark_town_Queenstown', 'embark_town_Southampton'])
y_validate_p = validate.survived

X_test_p = test.drop(columns=['survived', 'age', 'fare', 'sex_male',  'alone', 'sibsp', 'parch', 'embark_town', 'embark_town_Queenstown', 'embark_town_Southampton'])
y_test_p = test.survived

In [44]:
# verify pclass is only attribute
X_train_p.head()

Unnamed: 0,pclass
583,1
337,1
50,3
218,1
31,1


In [45]:
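# Fit a fresh estimator to X_train_p, y_train_p for the attribute pclass only.
# fit() returns the estimator itself, so reusing `logit` would make every later logit_* name point to the same object.
logit_p = LogisticRegression().fit(X_train_p, y_train_p)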

In [46]:
print(logit_p.coef_)
print(logit_p.intercept_)

[[-0.87487095]]
[1.47257252]


In [47]:
# Predict values on X_train.
y_pred_p = logit_p.predict(X_train_p)
y_pred_proba_p = logit_p.predict_proba(X_train_p)

In [48]:
# model pclass accuracy
logit_p.score(X_train_p, y_train_p)

0.682092555331992

This model, using pclass only, has 68% accuracy on the training data.  
Accuracy:   
Baseline = 62%  
Age, Fare, pclass = 71%  
Sex, Age, Fare, pclass = 79%  
pclass = 68%  

In [49]:
# Model age as only attribute
X_train_a = train.drop(columns=['baseline_prediction', 'survived', 'pclass', 'fare', 'sex_male', 'alone', 'sibsp', 'parch', 'embark_town', 'embark_town_Queenstown', 'embark_town_Southampton'])
y_train_a = train.survived

X_validate_a = validate.drop(columns=['survived', 'pclass', 'fare', 'sex_male',  'alone', 'sibsp', 'parch', 'embark_town', 'embark_town_Queenstown', 'embark_town_Southampton'])
y_validate_a = validate.survived

X_test_a = test.drop(columns=['survived', 'pclass', 'fare', 'sex_male',  'alone', 'sibsp', 'parch', 'embark_town', 'embark_town_Queenstown', 'embark_town_Southampton'])
y_test_a = test.survived

In [50]:
# verify age is only attribute
X_train_a.head()

Unnamed: 0,age
583,36.0
337,41.0
50,7.0
218,32.0
31,28.0


In [51]:
# Fit a fresh estimator to X_train_a, y_train_a for the attribute age only
logit_a = LogisticRegression().fit(X_train_a, y_train_a)

In [52]:
print(logit_a.coef_)
print(logit_a.intercept_)

[[-0.00517596]]
[-0.32747912]


In [53]:
# model age accuracy
logit_a.score(X_train_a, y_train_a)

0.6177062374245473

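This model, using age only, has 62% accuracy on the training data, which matches the baseline. Both the intercept and the age coefficient are negative, so the predicted probability of survival stays below 0.5 at every age and the model predicts 0 for everyone.  
Missing ages were filled with imputed values.  

Accuracy:   
Baseline = 62%  
Age, Fare, pclass = 71%  
Sex, Age, Fare, pclass = 79%  
pclass = 68%    
Age = 62%  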
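
A quick check of why the age-only model lands exactly on the baseline: plugging the printed coefficient and intercept into the sigmoid for a range of ages. This is a sketch with the rounded values from the output above; `p_survived` is a name made up for the example.

```python
import math

# Coefficient and intercept copied from the age-only output above.
coef_age = -0.00517596
intercept = -0.32747912

def p_survived(age: float) -> float:
    """Predicted probability of survival for a given age."""
    return 1.0 / (1.0 + math.exp(-(intercept + coef_age * age)))

# Evaluate ages 0 through 100; the largest probability occurs at age 0.
probabilities = [p_survived(age) for age in range(0, 101)]
print(round(max(probabilities), 3))
```

Because the largest probability is still under 0.5, `predict` returns 0 for every passenger, which is the same as the majority-class baseline.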

In [73]:
# Model sex as only attribute
X_train_s = train.drop(columns=['baseline_prediction', 'survived', 'pclass', 'fare', 'age', 'alone', 'sibsp', 'parch', 'embark_town', 'embark_town_Queenstown', 'embark_town_Southampton'])
y_train_s = train.survived

X_validate_s = validate.drop(columns=['survived', 'pclass', 'fare', 'age',  'alone', 'sibsp', 'parch', 'embark_town', 'embark_town_Queenstown', 'embark_town_Southampton'])
y_validate_s = validate.survived

X_test_s = test.drop(columns=['survived', 'pclass', 'fare', 'age',  'alone', 'sibsp', 'parch', 'embark_town', 'embark_town_Queenstown', 'embark_town_Southampton'])
y_test_s = test.survived

In [74]:
# verify sex_male is only attribute
X_train_s.head()

Unnamed: 0,sex_male
583,1
337,0
50,1
218,0
31,0


In [75]:
# Fit a fresh estimator to X_train_s, y_train_s for the attribute sex_male only
logit_s = LogisticRegression().fit(X_train_s, y_train_s)

In [76]:
print(logit_s.coef_)
print(logit_s.intercept_)

[[-2.37681345]]
[1.01638592]


In [58]:
# model sex_male accuracy
logit_s.score(X_train_s, y_train_s)

0.7847082494969819

This model, using sex_male only, has 78% accuracy on the training data.  
Accuracy:   
Baseline = 62%  
Age, Fare, pclass = 71%  
Sex, Age, Fare, pclass = 79%  
pclass = 68%    
Age = 62%  
sex_male = 78%  

In [59]:
# Model alone as only attribute
X_train_al = train.drop(columns=['baseline_prediction', 'survived', 'pclass', 'fare', 'age', 'sex_male', 'sibsp', 'parch', 'embark_town', 'embark_town_Queenstown', 'embark_town_Southampton'])
y_train_al = train.survived

X_validate_al = validate.drop(columns=['survived', 'pclass', 'fare', 'age',  'sex_male', 'sibsp', 'parch', 'embark_town', 'embark_town_Queenstown', 'embark_town_Southampton'])
y_validate_al = validate.survived

X_test_al = test.drop(columns=['survived', 'pclass', 'fare', 'age',  'sex_male', 'sibsp', 'parch', 'embark_town', 'embark_town_Queenstown', 'embark_town_Southampton'])
y_test_al = test.survived

In [60]:
# verify alone is only attribute
X_train_al.head()

Unnamed: 0,alone
583,1
337,1
50,0
218,1
31,0


In [61]:
# Fit a fresh estimator to X_train_al, y_train_al for the attribute alone only
logit_al = LogisticRegression().fit(X_train_al, y_train_al)

In [62]:
print(logit_al.coef_)
print(logit_al.intercept_)

[[-1.0525267]]
[0.13244919]


In [63]:
# model alone accuracy
logit_al.score(X_train_al, y_train_al)

0.647887323943662

This model, using alone only, has 65% accuracy on the training data.  
Accuracy:   
Baseline = 62%  
Age, Fare, pclass = 71%  
Sex, Age, Fare, pclass = 79%  
pclass = 68%    
Age = 62%  
sex_male = 78%  
alone = 65%  
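
The single-feature cells above repeat the same drop, fit, and score pattern. One way to shorten it, sketched here on a synthetic stand-in for `train` (random values, not the Titanic data; `demo`, `feature_sets`, and `scores` are names made up for this example), is to select columns by name and loop over the feature sets:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for `train` (random values, NOT the Titanic data).
rng = np.random.default_rng(123)
n = 200
demo = pd.DataFrame({
    "pclass": rng.integers(1, 4, n),
    "age": rng.uniform(1, 70, n),
    "fare": rng.uniform(5, 150, n),
    "sex_male": rng.integers(0, 2, n),
})
# Made-up target: mostly "female survives", with about 20% of labels flipped.
demo["survived"] = ((demo.sex_male == 0) ^ (rng.random(n) < 0.2)).astype(int)

feature_sets = {
    "pclass": ["pclass"],
    "sex_male": ["sex_male"],
    "sex, age, fare, pclass": ["sex_male", "age", "fare", "pclass"],
}

scores = {}
for label, cols in feature_sets.items():
    # A fresh estimator per feature set keeps each fitted model independent.
    model = LogisticRegression(max_iter=1000).fit(demo[cols], demo["survived"])
    scores[label] = model.score(demo[cols], demo["survived"])

for label, score in scores.items():
    print(f"{label}: {score:.2f}")
```

Selecting `train[cols]` instead of dropping everything else also means a new column added by `prep_titanic` cannot slip into a model by accident.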

#### 4. Choose your best model and evaluate it on the test dataset. Is it overfit?

In [64]:
# Added a validate step to this question: validate the 2 best models (sex_male only; sex, age, fare, pclass)

In [71]:
# model sex, age, fare, pclass validate data
print("model_safp\n", logit_safp.score(X_validate_safp, y_validate_safp))

model_safp
 0.780373831775701


In [77]:
# model sex_male validate accuracy
logit_s.score(X_validate_s, y_validate_s)

0.7663551401869159

Based on performance on the validate data (78% versus 77% for sex_male only), the model with sex, age, fare, and pclass performs best.  
Run that model on the test data.

In [85]:
# model sex, age, fare, pclass test data
print("model_safp\n", logit_safp.score(X_test_safp, y_test_safp))

model_safp
 0.8033707865168539


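The accuracy for this model is 80% on the test data. That is in line with the 79% on train and 78% on validate, so the model does not appear to be overfit.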
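
One simple way to judge overfitting is to compare the training accuracy with the accuracy on unseen splits. A sketch using the three scores printed above for the sex, age, fare, pclass model:

```python
# Accuracies copied from the outputs above for the sex, age, fare, pclass model.
accuracy = {
    "train": 0.7927565392354124,
    "validate": 0.780373831775701,
    "test": 0.8033707865168539,
}

# A large drop from train to unseen data would suggest overfitting.
for split in ("validate", "test"):
    gap = accuracy["train"] - accuracy[split]
    print(f"train - {split}: {gap:+.3f}")
```

Both gaps are within about one percentage point, and the test score is slightly higher than the training score.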

#### 5. Bonus: How do different strategies for handling the missing values in the age column affect model performance?

#### 6. Bonus: How do different strategies for encoding sex affect model performance?

#### 7. Bonus: scikit-learn's LogisticRegression classifier is actually applying a regularization penalty to the coefficients by default. This penalty causes the magnitude of the coefficients in the resulting model to be smaller than they otherwise would be. This value can be modified with the C hyperparameter. Small values of C correspond to a larger penalty, and large values of C correspond to a smaller penalty.

Try out the following values for C and note how the coefficients and the model's performance on both the dataset it was trained on and on the validate split are affected.

C = .01, .1, 1, 10, 100, 1000

#### Bonus Bonus: how does scaling the data interact with your choice of C?
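
The penalty acts on coefficient magnitudes, and those depend on each feature's units, so the same C can regularize a raw feature and a standardized one very differently. A hedged sketch on made-up data (one large-scale column next to a 0/1 column, loosely mimicking fare next to sex_male; none of this is Titanic data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: column 0 carries the signal on a large scale, column 1 is a 0/1 flag.
rng = np.random.default_rng(123)
n = 500
signal = rng.normal(size=n)
X = np.column_stack([signal * 100, rng.integers(0, 2, n)])
y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)

C = 0.01
raw = LogisticRegression(C=C, max_iter=5000).fit(X, y)
scaled = make_pipeline(StandardScaler(),
                       LogisticRegression(C=C, max_iter=5000)).fit(X, y)

raw_coef = raw.coef_[0]
scaled_coef = scaled.named_steps["logisticregression"].coef_[0]
print("raw:   ", raw_coef.round(4))
print("scaled:", scaled_coef.round(4))
```

On data like this the raw large-scale column gets a much smaller coefficient than its standardized counterpart, so a given C barely constrains it. Standardizing first puts all features on a comparable footing before the penalty is applied.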