# Modeling Exercises
## Logistic Regression

In this exercise, we'll continue working with the titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

For all of the models you create, choose a threshold that optimizes for accuracy.

Do your work for these exercises in either a notebook or a Python script named model within your classification-exercises repository. Add, commit, and push your work.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

import warnings
warnings.filterwarnings("ignore")

import acquire
from prepare import titanic_split, prep_titanic

from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression

In [2]:
# First I will import titanic.csv into a dataframe
# In the same step, I will tidy the data for the first time

df = prep_titanic()
df.head()

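|   | survived | pclass | age | sibsp | parch | fare | alone | sex_male | embarked_Q | embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 | 7.25 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 | 0 |
| 2 | 1 | 3 | 26.0 | 0 | 0 | 7.925 | 1 | 0 | 0 | 1 |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 0 | 0 | 0 | 1 |
| 4 | 0 | 3 | 35.0 | 0 | 0 | 8.05 | 1 | 1 | 0 | 1 |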


In [3]:
# Now to check for nulls

df.isnull().sum(axis=0)

# There are no nulls, so the imputer in my prepare.py has sufficiently tidied the data for me.

survived      0
pclass        0
age           0
sibsp         0
parch         0
fare          0
alone         0
sex_male      0
embarked_Q    0
embarked_S    0
dtype: int64

In [4]:
# Now, before I do anything else with this data, I will split it into train, validate, and test.

X1 = df[['pclass','fare']]
y1 = df[['survived']]

X1_train_validate, X1_test, y1_train_validate, y1_test = train_test_split(X1, y1, test_size = .20, random_state = 666)

X1_train, X1_validate, y1_train, y1_validate = train_test_split(X1_train_validate, y1_train_validate, test_size = .30, random_state = 666)

print("train: ", X1_train.shape, ", validate: ", X1_validate.shape, ", test: ", X1_test.shape)
print("train: ", y1_train.shape, ", validate: ", y1_validate.shape, ", test: ", y1_test.shape)

train:  (497, 2) , validate:  (214, 2) , test:  (178, 2)
train:  (497, 1) , validate:  (214, 1) , test:  (178, 1)
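
The two-step split above is repeated for every feature set later in this notebook, so it could be wrapped in a helper. The following is a minimal sketch, not the code used here: the function name `split_train_validate_test` is hypothetical, the data is a synthetic stand-in, and the `stratify` option is an addition that the original cells do not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_train_validate_test(X, y, seed=666, stratify=False):
    """Hypothetical helper: hold out 20% as test, then 30% of the rest as validate."""
    strat = y if stratify else None
    X_tv, X_test, y_tv, y_test = train_test_split(
        X, y, test_size=0.20, random_state=seed, stratify=strat
    )
    strat_tv = y_tv if stratify else None
    X_train, X_val, y_train, y_val = train_test_split(
        X_tv, y_tv, test_size=0.30, random_state=seed, stratify=strat_tv
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in data (not the Titanic dataset): 100 rows, 2 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = rng.integers(0, 2, size=100)

splits = split_train_validate_test(X_demo, y_demo, stratify=True)
X_train, X_val, X_test = splits[:3]
print(X_train.shape, X_val.shape, X_test.shape)
```

Stratifying on the target keeps the share of survivors roughly equal across the three sets, which can make accuracy comparisons between them a little steadier.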


### 1. Start by defining your baseline model.

In [5]:
# For my own reference:

# accuracy = (tp + tn) / (tp + tn + fp + fn)
# recall = tp / (tp + fn)
# precision = tp / (tp + fp)
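
The three formulas above can be kept in one small function so they are not retyped for each model. This is a sketch under my own naming (`classification_metrics` is hypothetical), and the counts in the example are illustrative, not Titanic results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, recall, and precision from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when the positive class is absent or never predicted.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision}


# Illustrative counts only (not taken from the Titanic data).
metrics = classification_metrics(tp=40, tn=30, fp=10, fn=20)
print(metrics)
```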

In [6]:
# Find the baseline and its accuracy

y1_train.survived.value_counts()

0    299
1    198
Name: survived, dtype: int64

In [7]:
# The baseline prediction (and our positive case) will be that a passenger did not survive.

baseline_model = pd.DataFrame(y1_train)
baseline_model.head(3)

|   | survived |
|---|---|
| 88 | 1 |
| 386 | 0 |
| 459 | 0 |


In [8]:
baseline_model["baseline"] = baseline_model.survived.value_counts().index[0]
baseline_model = baseline_model.rename(columns={'survived': 'actual'})
baseline_model.head(3)

|   | actual | baseline |
|---|---|---|
| 88 | 1 | 0 |
| 386 | 0 | 0 |
| 459 | 0 | 0 |


In [12]:
pd.crosstab(baseline_model.actual, baseline_model.baseline)

| actual | baseline = 0 |
|---|---|
| 0 | 299 |
| 1 | 198 |


In [17]:
# Positive is Not Survived

tp = 299
tn = 0
fp = 198
fn = 0

print("True Positives:", tp)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Negatives:", tn)
print("-------------")

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print("Accuracy of baseline model is", round(accuracy, 3))
#print("Recall is", round(recall, 3))
#print("Precision is", round(precision, 3))

True Positives: 299
False Positives: 198
False Negatives: 0
True Negatives: 0
-------------
Accuracy of baseline model is 0.602


> Now I know that in order to beat my baseline model's accuracy,
> I must build a model with over 60% accuracy in prediction.
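
The same baseline figure can be computed directly from the labels instead of typing the counts by hand. A minimal sketch, assuming only the value counts printed earlier (299 did not survive, 198 survived); the variable names are my own.

```python
from collections import Counter

# Rebuild the training labels from the value counts shown above:
# 299 passengers who did not survive (0) and 198 who survived (1).
y_train_labels = [0] * 299 + [1] * 198

# The baseline always predicts the most common class.
majority_class, majority_count = Counter(y_train_labels).most_common(1)[0]
baseline_accuracy = majority_count / len(y_train_labels)

print("Baseline prediction:", majority_class)
print("Baseline accuracy:", round(baseline_accuracy, 3))
```

This prints a baseline prediction of 0 and an accuracy of 0.602, matching the hand calculation.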

### 2. Create another model that includes age in addition to fare and pclass. Does this model perform better than your baseline?

In [18]:
# I will create a new model, adding age to X.

X2 = df[['pclass','fare', 'age']]
y2 = df[['survived']]

X2_train_validate, X2_test, y2_train_validate, y2_test = train_test_split(X2, y2, test_size = .20, random_state = 666)

X2_train, X2_validate, y2_train, y2_validate = train_test_split(X2_train_validate, y2_train_validate, test_size = .30, random_state = 666)

print("train: ", X2_train.shape, ", validate: ", X2_validate.shape, ", test: ", X2_test.shape)
print("train: ", y2_train.shape, ", validate: ", y2_validate.shape, ", test: ", y2_test.shape)

train:  (497, 3) , validate:  (214, 3) , test:  (178, 3)
train:  (497, 1) , validate:  (214, 1) , test:  (178, 1)


In [19]:
# Create, Fit, & Predict

# Create the logistic regression object
logit = LogisticRegression(C=1, random_state=666)

In [20]:
# Fit the model to the training data

logit.fit(X2_train, y2_train)

LogisticRegression(C=1, random_state=666)

In [21]:
# Print the coefficients and intercept of the model

print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-0.88326135  0.00685142 -0.03761418]]
Intercept: 
 [2.42863895]
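
One way to read these coefficients is to exponentiate them into odds ratios. The sketch below copies the printed values and assumes they are in the same order as the columns of `X2` (pclass, fare, age). Note that scikit-learn's `LogisticRegression` applies L2 regularization by default, so the values are somewhat shrunk compared with an unregularized fit.

```python
import math

# Coefficients copied from the printed output above, in feature order.
feature_names = ["pclass", "fare", "age"]
coefficients = [-0.88326135, 0.00685142, -0.03761418]

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in the feature, holding the others fixed.
odds_ratios = {name: math.exp(c) for name, c in zip(feature_names, coefficients)}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

For example, the pclass odds ratio comes out at roughly 0.41, meaning each step down in class (1st to 2nd, 2nd to 3rd) multiplies the modeled odds of survival by about 0.41.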


In [22]:
# Estimate whether or not a passenger would survive, using the training data

y2_pred = logit.predict(X2_train)
#y2_pred
#above commented out unless you want a bunch of zeros and ones on your screen

In [24]:
# Estimate the probability of a passenger surviving, using the training data

y2_pred_proba = logit.predict_proba(X2_train)
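
The exercise asks for a threshold that optimizes accuracy, while `predict` uses a fixed 0.5 cutoff. The probabilities from `predict_proba` could be swept over a grid of thresholds instead. The sketch below uses made-up labels and probabilities rather than the Titanic output; with the real model, `p_survived` would be the second column, `predict_proba(...)[:, 1]`, and the threshold should be chosen on train or validate, never on test.

```python
import numpy as np

# Illustrative values only (not Titanic output): true labels and the
# predicted probability of class 1, i.e. predict_proba(...)[:, 1].
y_true = np.array([0, 0, 1, 1, 0, 1])
p_survived = np.array([0.12, 0.41, 0.33, 0.83, 0.58, 0.72])

thresholds = np.linspace(0.05, 0.95, 19)  # 0.05, 0.10, ..., 0.95

# Accuracy at each threshold: predict 1 when the probability meets the cutoff.
accuracies = np.array(
    [np.mean((p_survived >= t).astype(int) == y_true) for t in thresholds]
)

best_index = int(np.argmax(accuracies))  # first threshold with the top accuracy
best_threshold = thresholds[best_index]
best_accuracy = accuracies[best_index]

print(f"best threshold = {best_threshold:.2f}, accuracy = {best_accuracy:.3f}")
```

On this toy data the best cutoff is 0.60 with an accuracy of 0.833, compared with 0.667 at the default 0.5.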

In [26]:
# Evaluate Model on Train

# Compute the accuracy
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X2_train, y2_train)))

Accuracy of Logistic Regression classifier on training set: 0.69


> 69% accuracy on the training set is better than the baseline accuracy of 60%, which used neither age nor a logistic regression model.

### 3. Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.

In [30]:
# My data already has sex encoded as sex_male (1 meaning the passenger is male), so I will leave that alone for now.
# I will create a new model, adding sex_male to X.

X3 = df[['pclass','fare', 'age', 'sex_male']]
y3 = df[['survived']]

X3_train_validate, X3_test, y3_train_validate, y3_test = train_test_split(X3, y3, test_size = .20, random_state = 666)

X3_train, X3_validate, y3_train, y3_validate = train_test_split(X3_train_validate, y3_train_validate, test_size = .30, random_state = 666)

print("train: ", X3_train.shape, ", validate: ", X3_validate.shape, ", test: ", X3_test.shape)
print("train: ", y3_train.shape, ", validate: ", y3_validate.shape, ", test: ", y3_test.shape)

train:  (497, 4) , validate:  (214, 4) , test:  (178, 4)
train:  (497, 1) , validate:  (214, 1) , test:  (178, 1)


In [31]:
# Create, Fit, & Predict

# Create the logistic regression object
logit2 = LogisticRegression(C=1, random_state=666)

In [32]:
# Fit the model to the training data

logit2.fit(X3_train, y3_train)

LogisticRegression(C=1, random_state=666)

In [33]:
# Print the coefficients and intercept of the model

print('Coefficient: \n', logit2.coef_)
print('Intercept: \n', logit2.intercept_)

Coefficient: 
 [[-1.11869479  0.00322321 -0.03718728 -2.67597258]]
Intercept: 
 [4.5650524]
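
To see how these numbers turn into a prediction, the intercept and coefficients can be applied by hand to one passenger. This sketch copies the printed values and uses the first row shown by `df.head()` earlier; it assumes the coefficient order matches the columns of `X3` (pclass, fare, age, sex_male).

```python
import math

# Intercept and coefficients copied from the printed output above.
intercept = 4.5650524
coefficients = {
    "pclass": -1.11869479,
    "fare": 0.00322321,
    "age": -0.03718728,
    "sex_male": -2.67597258,
}

# First row of df.head(): a 22-year-old man in third class paying 7.25.
passenger = {"pclass": 3, "fare": 7.25, "age": 22.0, "sex_male": 1}

# Linear predictor (log-odds), then the logistic (sigmoid) function.
log_odds = intercept + sum(coefficients[k] * passenger[k] for k in coefficients)
probability = 1 / (1 + math.exp(-log_odds))

print(f"log-odds = {log_odds:.3f}, P(survived) = {probability:.3f}")
```

The probability comes out at roughly 0.09, well under 0.5, so the model predicts that this passenger did not survive, which agrees with his `survived` value of 0.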


In [35]:
# Estimate whether or not a passenger would survive, using the training data

y3_pred = logit2.predict(X3_train)

In [37]:
# Estimate the probability of a passenger surviving, using the training data

y3_pred_proba = logit2.predict_proba(X3_train)

In [38]:
# Evaluate Model on Train

# Compute the accuracy
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit2.score(X3_train, y3_train)))

Accuracy of Logistic Regression classifier on training set: 0.81


> Training accuracy of this model is 81%, making it the best model so far.

### 4. Try out other combinations of features and models.

In [39]:
# For this next model I will take out fare and add in alone as a variable.

X4 = df[['pclass', 'age', 'sex_male', 'alone']]
y4 = df[['survived']]

X4_train_validate, X4_test, y4_train_validate, y4_test = train_test_split(X4, y4, test_size = .20, random_state = 666)

X4_train, X4_validate, y4_train, y4_validate = train_test_split(X4_train_validate, y4_train_validate, test_size = .30, random_state = 666)

print("train: ", X4_train.shape, ", validate: ", X4_validate.shape, ", test: ", X4_test.shape)
print("train: ", y4_train.shape, ", validate: ", y4_validate.shape, ", test: ", y4_test.shape)

train:  (497, 4) , validate:  (214, 4) , test:  (178, 4)
train:  (497, 1) , validate:  (214, 1) , test:  (178, 1)


In [40]:
# Create, Fit, & Predict

# Create the logistic regression object
logit3 = LogisticRegression(C=1, random_state=666)

In [41]:
# Fit the model to the training data

logit3.fit(X4_train, y4_train)

LogisticRegression(C=1, random_state=666)

In [42]:
# Print the coefficients and intercept of the model

print('Coefficient: \n', logit3.coef_)
print('Intercept: \n', logit3.intercept_)

Coefficient: 
 [[-1.20732769 -0.03708502 -2.66698316 -0.14040183]]
Intercept: 
 [4.93553074]


In [43]:
# Estimate whether or not a passenger would survive, using the training data

y4_pred = logit3.predict(X4_train)

In [44]:
# Estimate the probability of a passenger surviving, using the training data

y4_pred_proba = logit3.predict_proba(X4_train)

In [47]:
# Evaluate Model on Train

# Compute the accuracy
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit3.score(X4_train, y4_train)))

Accuracy of Logistic Regression classifier on training set: 0.81


> This model has 81% accuracy on the training set, the same as the last model.

In [48]:
# For the next model I'll just try adding all of the features I think are most relevant.
# This might be a little overkill on the features, but I am curious to see if it improves accuracy.

X5 = df[['pclass', 'fare', 'age', 'sex_male', 'alone', 'sibsp', 'parch']]
y5 = df[['survived']]

X5_train_validate, X5_test, y5_train_validate, y5_test = train_test_split(X5, y5, test_size = .20, random_state = 666)

X5_train, X5_validate, y5_train, y5_validate = train_test_split(X5_train_validate, y5_train_validate, test_size = .30, random_state = 666)

print("train: ", X5_train.shape, ", validate: ", X5_validate.shape, ", test: ", X5_test.shape)
print("train: ", y5_train.shape, ", validate: ", y5_validate.shape, ", test: ", y5_test.shape)

train:  (497, 7) , validate:  (214, 7) , test:  (178, 7)
train:  (497, 1) , validate:  (214, 1) , test:  (178, 1)


In [49]:
# Create, Fit, & Predict

# Create the logistic regression object
logit4 = LogisticRegression(C=1, random_state=666)

In [50]:
# Fit the model to the training data

logit4.fit(X5_train, y5_train)

LogisticRegression(C=1, random_state=666)

In [51]:
# Print the coefficients and intercept of the model

print('Coefficient: \n', logit4.coef_)
print('Intercept: \n', logit4.intercept_)

Coefficient: 
 [[-1.0395119   0.00347852 -0.04213193 -2.67400127 -0.76392875 -0.50163724
  -0.16848647]]
Intercept: 
 [5.28808292]


In [52]:
# Estimate whether or not a passenger would survive, using the training data

y5_pred = logit4.predict(X5_train)

In [53]:
# Estimate the probability of a passenger surviving, using the training data

y5_pred_proba = logit4.predict_proba(X5_train)

In [54]:
# Evaluate Model on Train

# Compute the accuracy
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit4.score(X5_train, y5_train)))

Accuracy of Logistic Regression classifier on training set: 0.82


> This model's training accuracy is 82%, so it is the best model so far, and the one with the most features.
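
The create, fit, and score steps were repeated by hand for each feature set. A loop over named feature sets would do the same work and report train and validate accuracy side by side. This is a sketch on synthetic data from `make_classification`, not the Titanic data, and the feature-set names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Titanic dataset): 500 rows, 5 features.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=666)

X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=0.20, random_state=666)
X_train, X_val, y_train, y_val = train_test_split(X_tv, y_tv, test_size=0.30, random_state=666)

# Hypothetical feature sets, given as column indices into X.
feature_sets = {
    "first_two": [0, 1],
    "first_three": [0, 1, 2],
    "all_five": [0, 1, 2, 3, 4],
}

results = {}
for name, cols in feature_sets.items():
    model = LogisticRegression(C=1, random_state=666)
    model.fit(X_train[:, cols], y_train)
    results[name] = {
        "train": model.score(X_train[:, cols], y_train),
        "validate": model.score(X_val[:, cols], y_val),
    }

for name, scores in results.items():
    print(f"{name}: train={scores['train']:.2f}, validate={scores['validate']:.2f}")
```

Comparing the two columns for each feature set shows at a glance whether a gain on train carries over to validate.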

### 5. Use your best 3 models to predict and evaluate on your validate sample.

In [58]:
y_pred1 = logit2.predict(X3_validate)
y_pred2 = logit3.predict(X4_validate)
y_pred3 = logit4.predict(X5_validate)

print('Model 1 will be the model with features pclass, fare, age, and sex_male.')
print('Accuracy of Logistic Regression classifier on validation set: {:.2f}'
     .format(logit2.score(X3_validate, y3_validate)))
print("Confusion matrix:\n", confusion_matrix(y3_validate, y_pred1))
print("Classification report:\n", classification_report(y3_validate, y_pred1))

print("\n------------------------------------------------------------------\n")

print('Model 2 will be the model with features pclass, age, sex_male, and alone.')
print('Accuracy of Logistic Regression classifier on validation set: {:.2f}'
     .format(logit3.score(X4_validate, y4_validate)))
print("Confusion matrix:\n", confusion_matrix(y4_validate, y_pred2))
print("Classification report:\n", classification_report(y4_validate, y_pred2))

print("\n------------------------------------------------------------------\n")

print('Model 3 will be the model with features pclass, fare, age, sex_male, alone, sibsp, and parch.')
print('Accuracy of Logistic Regression classifier on validation set: {:.2f}'
     .format(logit4.score(X5_validate, y5_validate)))
print("Confusion matrix:\n", confusion_matrix(y5_validate, y_pred3))
print("Classification report:\n", classification_report(y5_validate, y_pred3))

Model 1 will be the model with features pclass, fare, age, and sex_male.
Accuracy of Logistic Regression classifier on validation set: 0.79
Confusion matrix:
 [[126  15]
 [ 31  42]]
Classification report:
               precision    recall  f1-score   support

           0       0.80      0.89      0.85       141
           1       0.74      0.58      0.65        73

    accuracy                           0.79       214
   macro avg       0.77      0.73      0.75       214
weighted avg       0.78      0.79      0.78       214


------------------------------------------------------------------

Model 2 will be the model with features pclass, age, sex_male, and alone.
Accuracy of Logistic Regression classifier on validation set: 0.79
Confusion matrix:
 [[126  15]
 [ 31  42]]
Classification report:
               precision    recall  f1-score   support

           0       0.80      0.89      0.85       141
           1       0.74      0.58      0.65        73

    accuracy               

### 6. Choose your best model from the validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? to train?

In [59]:
# Model 3 performed best so I will use that one on the test dataset.

y_pred4 = logit4.predict(X5_test)

print('Model 3 is the model with features pclass, fare, age, sex_male, alone, sibsp, and parch.')
print('Accuracy of Logistic Regression classifier on test set: {:.2f}'
     .format(logit4.score(X5_test, y5_test)))
print("Confusion matrix:\n", confusion_matrix(y5_test, y_pred4))
print("Classification report:\n", classification_report(y5_test, y_pred4))

Model 3 is the model with features pclass, fare, age, sex_male, alone, sibsp, and parch.
Accuracy of Logistic Regression classifier on test set: 0.76
Confusion matrix:
 [[90 19]
 [24 45]]
Classification report:
               precision    recall  f1-score   support

           0       0.79      0.83      0.81       109
           1       0.70      0.65      0.68        69

    accuracy                           0.76       178
   macro avg       0.75      0.74      0.74       178
weighted avg       0.76      0.76      0.76       178



> Model 3 reached 76% accuracy on the test dataset, lower than the 80% it achieved on the validate dataset and the 82% it achieved on the train dataset. The f1-score for class 0 (did not survive) on test is 0.81, compared to the 0.86 reported on validate.
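
As a check on the test-set figures, the accuracy and f1-scores can be recomputed from the confusion matrix printed above. This is a small sketch in plain Python; the helper name `f1_for_class` is my own.

```python
# Test-set confusion matrix from the output above.
# Rows are actual classes, columns are predicted classes (scikit-learn's layout).
cm = [[90, 19],
      [24, 45]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total


def f1_for_class(cm, k):
    """F1-score treating class k as the positive class."""
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm)) if i != k)  # predicted k, actually other
    fn = sum(cm[k][j] for j in range(len(cm)) if j != k)  # actually k, predicted other
    return 2 * tp / (2 * tp + fp + fn)


print("accuracy:", round(accuracy, 2))
print("f1 (class 0, did not survive):", round(f1_for_class(cm, 0), 2))
print("f1 (class 1, survived):", round(f1_for_class(cm, 1), 2))
```

This gives 0.76, 0.81, and 0.68, matching the classification report. The 0.81 is the f1-score for passengers who did not survive; the f1-score for survivors is noticeably lower.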