# Modeling Exercises
## Logistic Regression

In this exercise, we'll continue working with the Titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

For all of the models you create, choose a threshold that optimizes for accuracy.
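The cells in this notebook call `predict`, which for a binary scikit-learn `LogisticRegression` corresponds to a 0.5 cutoff on the positive-class probability. A minimal sketch of how a threshold could instead be chosen for accuracy follows. The helper names and the toy probabilities are invented for illustration. In practice the probabilities would come from something like `logit.predict_proba(X_train)[:, 1]`, with the threshold picked on train and then checked on validate.

```python
def accuracy_at(threshold, probs, labels):
    """Share of correct predictions when predicting 1 for prob >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(pred == y) for pred, y in zip(preds, labels)) / len(labels)


def best_threshold(probs, labels, candidates):
    """Return (threshold, accuracy) for the most accurate candidate.

    Ties go to the first candidate listed.
    """
    scored = ((t, accuracy_at(t, probs, labels)) for t in candidates)
    return max(scored, key=lambda pair: pair[1])


# Toy values, invented for illustration; not taken from the Titanic data.
toy_probs = [0.10, 0.35, 0.45, 0.55, 0.62, 0.90]
toy_labels = [0, 0, 1, 0, 1, 1]
candidates = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9

threshold, accuracy = best_threshold(toy_probs, toy_labels, candidates)
print("Best threshold:", threshold)
print("Accuracy at that threshold:", round(accuracy, 3))
```

On real data a finer grid of candidates may be worth trying, and several thresholds can tie; this sketch simply keeps the first one.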

Do your work for these exercises in either a notebook or a python script named model within your classification-exercises repository. Add, commit, and push your work.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
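import statsmodels.formula.api as smf

# scikit-learn tools used in the cells below
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix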

import warnings
warnings.filterwarnings("ignore")

import acquire
from prepare import titanic_split, prep_titanic

In [63]:
# First I will import titanic.csv into a DataFrame.
# In the same step, I will tidy the data for the first time.

df = prep_titanic()
df.head()

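| | survived | pclass | age | sibsp | parch | fare | alone | sex_male | embarked_Q | embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 | 7.25 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 | 0 |
| 2 | 1 | 3 | 26.0 | 0 | 0 | 7.925 | 1 | 0 | 0 | 1 |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 0 | 0 | 0 | 1 |
| 4 | 0 | 3 | 35.0 | 0 | 0 | 8.05 | 1 | 1 | 0 | 1 |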


In [3]:
# Now to check for nulls

df.isnull().sum(axis=0)

# There are no nulls; the imputer in my prepare.py has sufficiently tidied the data.

survived      0
pclass        0
age           0
sibsp         0
parch         0
fare          0
alone         0
sex_male      0
embarked_Q    0
embarked_S    0
dtype: int64

In [4]:
# Now, before I do anything else with this data, I will split it into train, validate, and test.

X1 = df[['pclass','fare']]
y1 = df[['survived']]

X1_train_validate, X1_test, y1_train_validate, y1_test = train_test_split(X1, y1, test_size = .20, random_state = 666)

X1_train, X1_validate, y1_train, y1_validate = train_test_split(X1_train_validate, y1_train_validate, test_size = .30, random_state = 666)

print("train: ", X1_train.shape, ", validate: ", X1_validate.shape, ", test: ", X1_test.shape)
print("train: ", y1_train.shape, ", validate: ", y1_validate.shape, ", test: ", y1_test.shape)

train:  (497, 2) , validate:  (214, 2) , test:  (178, 2)
train:  (497, 1) , validate:  (214, 1) , test:  (178, 1)
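The two-step split above can be checked with a little arithmetic. This is a sketch that assumes `train_test_split` rounds a fractional test size up to the next whole row and gives the remainder to the other side; the 889-row total is the sum of the three shapes printed above, and `split_sizes` is a hypothetical helper name.

```python
import math


def split_sizes(n_rows, test_size, validate_size):
    """Row counts for a test split followed by a validate split of the rest."""
    n_test = math.ceil(n_rows * test_size)
    n_train_validate = n_rows - n_test
    n_validate = math.ceil(n_train_validate * validate_size)
    n_train = n_train_validate - n_validate
    return n_train, n_validate, n_test


n_train, n_validate, n_test = split_sizes(889, 0.20, 0.30)
print("train:", n_train, "validate:", n_validate, "test:", n_test)

# Effective share of the full dataset that lands in each split.
total = n_train + n_validate + n_test
for name, n in [("train", n_train), ("validate", n_validate), ("test", n_test)]:
    print(f"{name}: {n / total:.0%}")
```

Because the 30% validate share is taken from the remaining 80%, the effective split is roughly 56% train, 24% validate, and 20% test.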


### 1. Start by defining your baseline model.

In [5]:
# For my own reference:

# accuracy = (tp + tn) / (tp + tn + fp + fn)
# recall = tp / (tp + fn)
# precision = tp / (tp + fp)

In [6]:
# Find the baseline and its accuracy

y1_train.survived.value_counts()

0    299
1    198
Name: survived, dtype: int64

In [7]:
# The baseline (and our positive case) will be that a passenger did not survive.

baseline_model = pd.DataFrame(y1_train)
baseline_model.head(3)

| | survived |
|---|---|
| 88 | 1 |
| 386 | 0 |
| 459 | 0 |


In [8]:
baseline_model["baseline"] = baseline_model.survived.value_counts().index[0]
baseline_model = baseline_model.rename(columns={'survived': 'actual'})
baseline_model.head(3)

| | actual | baseline |
|---|---|---|
| 88 | 1 | 0 |
| 386 | 0 | 0 |
| 459 | 0 | 0 |


In [12]:
pd.crosstab(baseline_model.actual, baseline_model.baseline)

| actual | baseline = 0 |
|---|---|
| 0 | 299 |
| 1 | 198 |


In [17]:
# Positive is Not Survived

tp = 299
tn = 0
fp = 198
fn = 0

print("True Positives:", tp)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Negatives:", tn)
print("-------------")

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print("Accuracy of baseline model is", round(accuracy, 3))
#print("Recall is", round(recall, 3))
#print("Precision is", round(precision, 3))

True Positives: 299
False Positives: 198
False Negatives: 0
True Negatives: 0
-------------
Accuracy of baseline model is 0.602


> Now I know that in order to beat my baseline model's accuracy,
> I must build a model with over 60% accuracy in prediction.
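The same baseline accuracy can be reached without naming a positive class: it is simply the share of the most common label. A small sketch using the class counts from the `value_counts()` output above (the variable names are only illustrative):

```python
from collections import Counter

# Rebuild the training labels from the counts shown above: 299 zeros, 198 ones.
train_labels = [0] * 299 + [1] * 198

counts = Counter(train_labels)
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(train_labels)

print("Majority class:", majority_class)
print("Baseline accuracy:", round(baseline_accuracy, 3))
```

Any model worth keeping should beat this number on both train and validate.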

### 2. Create another model that includes age in addition to fare and pclass. Does this model perform better than your baseline?

In [18]:
# I will create a new model adding age on the X.

X2 = df[['pclass','fare', 'age']]
y2 = df[['survived']]

X2_train_validate, X2_test, y2_train_validate, y2_test = train_test_split(X2, y2, test_size = .20, random_state = 666)

X2_train, X2_validate, y2_train, y2_validate = train_test_split(X2_train_validate, y2_train_validate, test_size = .30, random_state = 666)

print("train: ", X2_train.shape, ", validate: ", X2_validate.shape, ", test: ", X2_test.shape)
print("train: ", y2_train.shape, ", validate: ", y2_validate.shape, ", test: ", y2_test.shape)

train:  (497, 3) , validate:  (214, 3) , test:  (178, 3)
train:  (497, 1) , validate:  (214, 1) , test:  (178, 1)


In [19]:
# Create, Fit, & Predict

# Create the logistic regression object
logit = LogisticRegression(C=1, random_state=666)

In [20]:
# Fit the model to the training data

logit.fit(X2_train, y2_train)

LogisticRegression(C=1, random_state=666)

In [21]:
# Print the coefficients and intercept of the model

print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-0.88326135  0.00685142 -0.03761418]]
Intercept: 
 [2.42863895]
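To see how these numbers turn into a prediction, here is a rough sketch that applies the logistic function by hand. The coefficients and intercept are copied from the output above, in the column order of `X2` (pclass, fare, age). The passenger values and the function name are illustrative only.

```python
import math

# Copied from the printed coefficients and intercept above.
coefficients = {"pclass": -0.88326135, "fare": 0.00685142, "age": -0.03761418}
intercept = 2.42863895


def survival_probability(passenger):
    """Logistic function applied to the linear combination of features."""
    z = intercept + sum(coefficients[name] * passenger[name] for name in coefficients)
    return 1 / (1 + math.exp(-z))


# Illustrative passenger: third class, fare of 7.25, age 22.
example = {"pclass": 3, "fare": 7.25, "age": 22.0}
probability = survival_probability(example)

print("Estimated survival probability:", round(probability, 3))
print("Predicted class at a 0.5 threshold:", int(probability >= 0.5))
```

The negative pclass and age coefficients pull the estimate down for older passengers and for those in lower classes (a higher pclass number), while fare nudges it up slightly.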


In [22]:
# Estimate whether or not a passenger would survive, using the training data

y2_pred = logit.predict(X2_train)
#y2_pred
#above commented out unless you want a bunch of zeros and ones on your screen

In [24]:
# Estimate the probability of a passenger surviving, using the training data

y2_pred_proba = logit.predict_proba(X2_train)

In [26]:
# Evaluate Model on Train

# Compute the accuracy
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X2_train, y2_train)))

Accuracy of Logistic Regression classifier on training set: 0.69


> 69% accuracy on train is better than our baseline accuracy of 60%. The baseline uses no features and no logistic regression model; it always predicts the most common class.

### 3. Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.

In [30]:
# My prepared data already has sex encoded as sex_male (1 meaning the passenger is male), so I will leave that alone for now.
# I will create a new model adding sex on the X.

X3 = df[['pclass','fare', 'age', 'sex_male']]
y3 = df[['survived']]

X3_train_validate, X3_test, y3_train_validate, y3_test = train_test_split(X3, y3, test_size = .20, random_state = 666)

X3_train, X3_validate, y3_train, y3_validate = train_test_split(X3_train_validate, y3_train_validate, test_size = .30, random_state = 666)

print("train: ", X3_train.shape, ", validate: ", X3_validate.shape, ", test: ", X3_test.shape)
print("train: ", y3_train.shape, ", validate: ", y3_validate.shape, ", test: ", y3_test.shape)

train:  (497, 4) , validate:  (214, 4) , test:  (178, 4)
train:  (497, 1) , validate:  (214, 1) , test:  (178, 1)


In [31]:
# Create, Fit, & Predict

# Create the logistic regression object
logit2 = LogisticRegression(C=1, random_state=666)

In [32]:
# Fit the model to the training data

logit2.fit(X3_train, y3_train)

LogisticRegression(C=1, random_state=666)

In [33]:
# Print the coefficients and intercept of the model

print('Coefficient: \n', logit2.coef_)
print('Intercept: \n', logit2.intercept_)

Coefficient: 
 [[-1.11869479  0.00322321 -0.03718728 -2.67597258]]
Intercept: 
 [4.5650524]
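One way to read these coefficients is as odds ratios, obtained by exponentiating each one. The sketch below uses the values printed above in the column order of `X3`. Keep in mind that scikit-learn's `LogisticRegression` applies L2 regularization by default, so these are shrunken estimates rather than plain maximum-likelihood ones.

```python
import math

# Copied from the printed coefficients above (pclass, fare, age, sex_male).
coefficients = {
    "pclass": -1.11869479,
    "fare": 0.00322321,
    "age": -0.03718728,
    "sex_male": -2.67597258,
}

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = {name: math.exp(value) for name, value in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.3f}")
```

Under this fit, being male multiplies the estimated odds of survival by roughly 0.07, which is consistent with the jump in accuracy once sex is included.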


In [35]:
# Estimate whether or not a passenger would survive, using the training data

y3_pred = logit2.predict(X3_train)

In [37]:
# Estimate the probability of a passenger surviving, using the training data

y3_pred_proba = logit2.predict_proba(X3_train)

In [38]:
# Evaluate Model on Train

# Compute the accuracy
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit2.score(X3_train, y3_train)))

Accuracy of Logistic Regression classifier on training set: 0.81


> Accuracy of this model is 81%, making it the best model so far.

### 4. Try out other combinations of features and models.

In [39]:
# For this next model I will take out fare and add in alone as a variable.

X4 = df[['pclass', 'age', 'sex_male', 'alone']]
y4 = df[['survived']]

X4_train_validate, X4_test, y4_train_validate, y4_test = train_test_split(X4, y4, test_size = .20, random_state = 666)

X4_train, X4_validate, y4_train, y4_validate = train_test_split(X4_train_validate, y4_train_validate, test_size = .30, random_state = 666)

print("train: ", X4_train.shape, ", validate: ", X4_validate.shape, ", test: ", X4_test.shape)
print("train: ", y4_train.shape, ", validate: ", y4_validate.shape, ", test: ", y4_test.shape)

train:  (497, 4) , validate:  (214, 4) , test:  (178, 4)
train:  (497, 1) , validate:  (214, 1) , test:  (178, 1)


In [40]:
# Create, Fit, & Predict

# Create the logistic regression object
logit3 = LogisticRegression(C=1, random_state=666)

In [41]:
# Fit the model to the training data

logit3.fit(X4_train, y4_train)

LogisticRegression(C=1, random_state=666)

In [42]:
# Print the coefficients and intercept of the model

print('Coefficient: \n', logit3.coef_)
print('Intercept: \n', logit3.intercept_)

Coefficient: 
 [[-1.20732769 -0.03708502 -2.66698316 -0.14040183]]
Intercept: 
 [4.93553074]


In [43]:
# Estimate whether or not a passenger would survive, using the training data

y4_pred = logit3.predict(X4_train)

In [44]:
# Estimate the probability of a passenger surviving, using the training data

y4_pred_proba = logit3.predict_proba(X4_train)

In [47]:
# Evaluate Model on Train

# Compute the accuracy
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit3.score(X4_train, y4_train)))

Accuracy of Logistic Regression classifier on training set: 0.81


> This model has 81% accuracy, same as the last model.

In [48]:
# For the next model I'll just try adding all of the features I think are most relevant.
# This might be a little overkill on the features, but I am curious to see if it improves accuracy.

X5 = df[['pclass', 'fare', 'age', 'sex_male', 'alone', 'sibsp', 'parch']]
y5 = df[['survived']]

X5_train_validate, X5_test, y5_train_validate, y5_test = train_test_split(X5, y5, test_size = .20, random_state = 666)

X5_train, X5_validate, y5_train, y5_validate = train_test_split(X5_train_validate, y5_train_validate, test_size = .30, random_state = 666)

print("train: ", X5_train.shape, ", validate: ", X5_validate.shape, ", test: ", X5_test.shape)
print("train: ", y5_train.shape, ", validate: ", y5_validate.shape, ", test: ", y5_test.shape)

train:  (497, 7) , validate:  (214, 7) , test:  (178, 7)
train:  (497, 1) , validate:  (214, 1) , test:  (178, 1)


In [49]:
# Create, Fit, & Predict

# Create the logistic regression object
logit4 = LogisticRegression(C=1, random_state=666)

In [50]:
# Fit the model to the training data

logit4.fit(X5_train, y5_train)

LogisticRegression(C=1, random_state=666)

In [51]:
# Print the coefficients and intercept of the model

print('Coefficient: \n', logit4.coef_)
print('Intercept: \n', logit4.intercept_)

Coefficient: 
 [[-1.0395119   0.00347852 -0.04213193 -2.67400127 -0.76392875 -0.50163724
  -0.16848647]]
Intercept: 
 [5.28808292]


In [52]:
# Estimate whether or not a passenger would survive, using the training data

y5_pred = logit4.predict(X5_train)

In [53]:
# Estimate the probability of a passenger surviving, using the training data

y5_pred_proba = logit4.predict_proba(X5_train)

In [54]:
# Evaluate Model on Train

# Compute the accuracy
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit4.score(X5_train, y5_train)))

Accuracy of Logistic Regression classifier on training set: 0.82


> This model's accuracy on train is 82%, so it is the best model so far, and the one with the most included features.

### 5. Use your best 3 models to predict and evaluate on your validate sample.

In [58]:
y_pred1 = logit2.predict(X3_validate)
y_pred2 = logit3.predict(X4_validate)
y_pred3 = logit4.predict(X5_validate)

print('Model 1 will be the model with features pclass, fare, age, and sex_male.')
print('Accuracy of Logistic Regression classifier on validation set: {:.2f}'
     .format(logit2.score(X3_validate, y3_validate)))
print("Confusion matrix:\n", confusion_matrix(y3_validate, y_pred1))
print("Classification report:\n", classification_report(y3_validate, y_pred1))

print("\n------------------------------------------------------------------\n")

print('Model 2 will be the model with features pclass, age, sex_male, and alone.')
print('Accuracy of Logistic Regression classifier on validation set: {:.2f}'
     .format(logit3.score(X4_validate, y4_validate)))
print("Confusion matrix:\n", confusion_matrix(y4_validate, y_pred2))
print("Classification report:\n", classification_report(y4_validate, y_pred2))

print("\n------------------------------------------------------------------\n")

print('Model 3 will be the model with features pclass, fare, age, sex_male, alone, sibsp, and parch.')
print('Accuracy of Logistic Regression classifier on validation set: {:.2f}'
     .format(logit4.score(X5_validate, y5_validate)))
print("Confusion matrix:\n", confusion_matrix(y5_validate, y_pred3))
print("Classification report:\n", classification_report(y5_validate, y_pred3))

Model 1 will be the model with features pclass, fare, age, and sex_male.
Accuracy of Logistic Regression classifier on validation set: 0.79
Confusion matrix:
 [[126  15]
 [ 31  42]]
Classification report:
               precision    recall  f1-score   support

           0       0.80      0.89      0.85       141
           1       0.74      0.58      0.65        73

    accuracy                           0.79       214
   macro avg       0.77      0.73      0.75       214
weighted avg       0.78      0.79      0.78       214


------------------------------------------------------------------

Model 2 will be the model with features pclass, age, sex_male, and alone.
Accuracy of Logistic Regression classifier on validation set: 0.79
Confusion matrix:
 [[126  15]
 [ 31  42]]
Classification report:
               precision    recall  f1-score   support

           0       0.80      0.89      0.85       141
           1       0.74      0.58      0.65        73

    accuracy               

### 6. Choose your best model from the validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? to train?

In [59]:
# Model 3 performed best so I will use that one on the test dataset.

y_pred4 = logit4.predict(X5_test)

print('Model 3 is the model with features pclass, fare, age, sex_male, alone, sibsp, and parch.')
print('Accuracy of Logistic Regression classifier on test set: {:.2f}'
     .format(logit4.score(X5_test, y5_test)))
print("Confusion matrix:\n", confusion_matrix(y5_test, y_pred4))
print("Classification report:\n", classification_report(y5_test, y_pred4))

Model 3 is the model with features pclass, fare, age, sex_male, alone, sibsp, and parch.
Accuracy of Logistic Regression classifier on test set: 0.76
Confusion matrix:
 [[90 19]
 [24 45]]
Classification report:
               precision    recall  f1-score   support

           0       0.79      0.83      0.81       109
           1       0.70      0.65      0.68        69

    accuracy                           0.76       178
   macro avg       0.75      0.74      0.74       178
weighted avg       0.76      0.76      0.76       178



> Model 3 reached 76% accuracy on the test dataset, lower than the 80% accuracy it achieved on the validate dataset. The f1-score on test is 81% for class 0 (did not survive), compared to 86% on validate. On the train dataset, Model 3's accuracy was 82%.
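The figures in the test classification report can be reproduced directly from the test confusion matrix, which makes it clear which class each number belongs to. A minimal sketch, with a hypothetical helper name, assuming rows are actual classes and columns are predicted classes (the layout `confusion_matrix` uses):

```python
def per_class_metrics(matrix):
    """Accuracy plus precision, recall, f1, and support for each class.

    matrix[i][j] is the number of rows with actual class i predicted as class j.
    Assumes every class is predicted at least once (no zero denominators).
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    report = {"accuracy": correct / total}
    for k in range(len(matrix)):
        predicted_k = sum(row[k] for row in matrix)  # column total
        actual_k = sum(matrix[k])                    # row total (support)
        precision = matrix[k][k] / predicted_k
        recall = matrix[k][k] / actual_k
        f1 = 2 * precision * recall / (precision + recall)
        report[k] = {"precision": precision, "recall": recall,
                     "f1": f1, "support": actual_k}
    return report


# Test confusion matrix copied from the output above.
test_matrix = [[90, 19], [24, 45]]
report = per_class_metrics(test_matrix)

print("Accuracy:", round(report["accuracy"], 2))
for k in (0, 1):
    rounded = {name: round(value, 2) for name, value in report[k].items()}
    print("Class", k, rounded)
```

The 81% f1-score belongs to class 0 (did not survive); for class 1 (survived) it is 68%.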

# Modeling Exercises Cont.
## Decision Trees

In this exercise, we'll continue working with the Titanic dataset and building decision tree models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

Continue working in your model file. Add, commit, and push your changes.

In [27]:
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

import numpy as np

from pydataset import data

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import graphviz
from graphviz import Graph

from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_fscore_support

In [105]:
df = prep_titanic()
df.head(5)

# Survived is 1, not survived is 0.

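| | survived | pclass | age | sibsp | parch | fare | alone | sex_male | embarked_Q | embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 | 7.25 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 | 0 |
| 2 | 1 | 3 | 26.0 | 0 | 0 | 7.925 | 1 | 0 | 0 | 1 |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 0 | 0 | 0 | 1 |
| 4 | 0 | 3 | 35.0 | 0 | 0 | 8.05 | 1 | 1 | 0 | 1 |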


### 1. Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample)

In [106]:
# Now, before I do anything else with this data, I will split it into train, validate, and test.

X1 = df.drop(['survived'],axis=1)
y1 = df[['survived']]

X1_train_validate, X1_test, y1_train_validate, y1_test = train_test_split(X1, y1, test_size = .20, random_state = 666)

X1_train, X1_validate, y1_train, y1_validate = train_test_split(X1_train_validate, y1_train_validate, test_size = .30, random_state = 666)

print("train: ", X1_train.shape, ", validate: ", X1_validate.shape, ", test: ", X1_test.shape)
print("train: ", y1_train.shape, ", validate: ", y1_validate.shape, ", test: ", y1_test.shape)

train:  (497, 9) , validate:  (214, 9) , test:  (178, 9)
train:  (497, 1) , validate:  (214, 1) , test:  (178, 1)


In [107]:
X1_train.head(3)

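| | pclass | age | sibsp | parch | fare | alone | sex_male | embarked_Q | embarked_S |
|---|---|---|---|---|---|---|---|---|---|
| 88 | 1 | 23.0 | 3 | 2 | 263.0 | 0 | 0 | 0 | 1 |
| 386 | 3 | 1.0 | 5 | 2 | 46.9 | 0 | 1 | 0 | 1 |
| 459 | 3 | 29.642093 | 0 | 0 | 7.75 | 1 | 1 | 1 | 0 |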


In [108]:
# Train Model

# Create the Decision Tree Object
# For classification, the split criterion can be gini or entropy (information gain). The default is gini.
clf = DecisionTreeClassifier(max_depth=3, random_state=666)

In [109]:
# Fit the model to the training data
clf.fit(X1_train, y1_train)

DecisionTreeClassifier(max_depth=3, random_state=666)

In [110]:
y1_pred = clf.predict(X1_train)
y1_pred

array([1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,

In [111]:
y1_pred_proba = clf.predict_proba(X1_train)
y1_pred_proba

array([[0.0462963 , 0.9537037 ],
       [0.3       , 0.7       ],
       [0.94117647, 0.05882353],
       [0.3       , 0.7       ],
       [0.0462963 , 0.9537037 ],
       [0.94117647, 0.05882353],
       [0.0462963 , 0.9537037 ],
       [0.94117647, 0.05882353],
       [0.94117647, 0.05882353],
       [0.41428571, 0.58571429],
       [0.94117647, 0.05882353],
       [0.86666667, 0.13333333],
       [0.0462963 , 0.9537037 ],
       [0.41428571, 0.58571429],
       [0.72727273, 0.27272727],
       [0.41428571, 0.58571429],
       [0.41428571, 0.58571429],
       [0.0462963 , 0.9537037 ],
       [0.72727273, 0.27272727],
       [0.0462963 , 0.9537037 ],
       [0.72727273, 0.27272727],
       [0.94117647, 0.05882353],
       [0.41428571, 0.58571429],
       [0.94117647, 0.05882353],
       [0.72727273, 0.27272727],
       [0.72727273, 0.27272727],
       [0.0462963 , 0.9537037 ],
       [0.94117647, 0.05882353],
       [0.94117647, 0.05882353],
       [0.41428571, 0.58571429],
       [0.

### 2. Evaluate your in-sample results using the model score, confusion matrix, and classification report.

In [112]:
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X1_train, y1_train)))
print("Confusion matrix:\n", confusion_matrix(y1_train, y1_pred))
print("Classification report:\n", classification_report(y1_train, y1_pred))

Accuracy of Decision Tree classifier on training set: 0.84
Confusion matrix:
 [[262  37]
 [ 45 153]]
Classification report:
               precision    recall  f1-score   support

           0       0.85      0.88      0.86       299
           1       0.81      0.77      0.79       198

    accuracy                           0.84       497
   macro avg       0.83      0.82      0.83       497
weighted avg       0.83      0.84      0.83       497



### 3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [113]:
y1_pred
y1_train.size

497

In [114]:
tn, fp, fn, tp = confusion_matrix(y1_train, y1_pred).ravel()

In [115]:
accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
specificity= (tn / (tn + fp))

print("True Positives:", tp)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Negatives:", tn)

print("-------------")

print("Accuracy is", round(accuracy, 3))
print("Recall is", round(recall, 3))
print("Precision is", round(precision, 3))
print("Specificity is", round(specificity, 3))
print("f1-score is", round(f1_score(y1_train, y1_pred), 3))
print("Support is", precision_recall_fscore_support(y1_train, y1_pred)[-1])

True Positives: 153
False Positives: 37
False Negatives: 45
True Negatives: 262
-------------
Accuracy is 0.835
Recall is 0.773
Precision is 0.805
Specificity is 0.876
f1-score is 0.789
Support is [299 198]
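The exercise also asks for the four rates by name. Recall is the true positive rate and specificity is the true negative rate; the other two are their complements. A short sketch using the counts printed above, with survived (1) as the positive class:

```python
# Counts copied from the output above (positive class = survived).
tp, fp, fn, tn = 153, 37, 45, 262

rates = {
    "true positive rate": tp / (tp + fn),   # same as recall
    "false positive rate": fp / (fp + tn),
    "true negative rate": tn / (tn + fp),   # same as specificity
    "false negative rate": fn / (fn + tp),
}

for name, value in rates.items():
    print(f"{name.capitalize()} is {value:.3f}")
```

The true positive and false negative rates sum to 1, as do the true negative and false positive rates, which is a quick sanity check on the arithmetic.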


### 4. Run through steps 2-4 using a different max_depth value.

In [116]:
clf2 = DecisionTreeClassifier(max_depth=7, random_state=666)

In [117]:
clf2.fit(X1_train, y1_train)

DecisionTreeClassifier(max_depth=7, random_state=666)

In [118]:
y1_pred = clf2.predict(X1_train)
y1_pred

array([1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1,
       1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,

In [119]:
y1_pred_proba = clf2.predict_proba(X1_train)
y1_pred_proba

array([[0.        , 1.        ],
       [1.        , 0.        ],
       [0.95121951, 0.04878049],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.8974359 , 0.1025641 ],
       [0.        , 1.        ],
       [0.8974359 , 0.1025641 ],
       [0.8974359 , 0.1025641 ],
       [1.        , 0.        ],
       [0.8974359 , 0.1025641 ],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.94594595, 0.05405405],
       [0.35      , 0.65      ],
       [0.73333333, 0.26666667],
       [0.        , 1.        ],
       [0.94594595, 0.05405405],
       [0.        , 1.        ],
       [1.        , 0.        ],
       [1.        , 0.        ],
       [0.73333333, 0.26666667],
       [1.        , 0.        ],
       [0.94594595, 0.05405405],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [1.        , 0.        ],
       [1.        , 0.        ],
       [0.73333333, 0.26666667],
       [0.

In [120]:
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf2.score(X1_train, y1_train)))
print("Confusion matrix:\n", confusion_matrix(y1_train, y1_pred))
print("Classification report:\n", classification_report(y1_train, y1_pred))

Accuracy of Decision Tree classifier on training set: 0.91
Confusion matrix:
 [[288  11]
 [ 36 162]]
Classification report:
               precision    recall  f1-score   support

           0       0.89      0.96      0.92       299
           1       0.94      0.82      0.87       198

    accuracy                           0.91       497
   macro avg       0.91      0.89      0.90       497
weighted avg       0.91      0.91      0.90       497



### 5. Which model performs better on your in-sample data?

> Model 2, with a max_depth of 7, outperformed Model 1 (max_depth of 3) on the in-sample (train) data: 91% accuracy versus 84%.

### 6. Which model performs best on your out-of-sample data, the validate set?

In [121]:
y1_pred = clf.predict(X1_validate)
y2_pred = clf2.predict(X1_validate)

print('Model 1 will be the model with max_depth of 3.')
print('Accuracy of Decision Tree classifier on validation set: {:.2f}'
     .format(clf.score(X1_validate, y1_validate)))
print("Confusion matrix:\n", confusion_matrix(y1_validate, y1_pred))
print("Classification report:\n", classification_report(y1_validate, y1_pred))

print("\n------------------------------------------------------------------\n")

print('Model 2 will be the model with max_depth of 7.')
print('Accuracy of Decision Tree classifier on validation set: {:.2f}'
     .format(clf2.score(X1_validate, y1_validate)))
print("Confusion matrix:\n", confusion_matrix(y1_validate, y2_pred))
print("Classification report:\n", classification_report(y1_validate, y2_pred))

Model 1 will be the model with max_depth of 3.
Accuracy of Decision Tree classifier on validation set: 0.81
Confusion matrix:
 [[128  13]
 [ 27  46]]
Classification report:
               precision    recall  f1-score   support

           0       0.83      0.91      0.86       141
           1       0.78      0.63      0.70        73

    accuracy                           0.81       214
   macro avg       0.80      0.77      0.78       214
weighted avg       0.81      0.81      0.81       214


------------------------------------------------------------------

Model 2 will be the model with max_depth of 7.
Accuracy of Decision Tree classifier on validation set: 0.78
Confusion matrix:
 [[127  14]
 [ 34  39]]
Classification report:
               precision    recall  f1-score   support

           0       0.79      0.90      0.84       141
           1       0.74      0.53      0.62        73

    accuracy                           0.78       214
   macro avg       0.76      0.72     

> Model 1 outperformed Model 2 on the validation dataset (81% accuracy versus 78%). My takeaway is that a max_depth of 7 overfit the model, as it performed much better on the train data (91%) than on the validate data.
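The overfitting pattern is easiest to see by lining up train and validate accuracy for each tree and looking at the gap. This sketch only reuses the accuracies printed above; the dictionary layout is an illustrative choice.

```python
# Accuracies copied from the train and validate outputs above.
results = {
    "max_depth=3": {"train": 0.84, "validate": 0.81},
    "max_depth=7": {"train": 0.91, "validate": 0.78},
}

gaps = {}
for name, scores in results.items():
    gaps[name] = scores["train"] - scores["validate"]
    print(f"{name}: train {scores['train']:.2f}, "
          f"validate {scores['validate']:.2f}, gap {gaps[name]:.2f}")

# Pick the model by out-of-sample (validate) accuracy, not train accuracy.
best = max(results, key=lambda name: results[name]["validate"])
print("Best on validate:", best)
```

A small gap with solid validate accuracy is the goal; the deeper tree gains on train but loses on validate.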

# Modeling Exercises Cont.
## Random Forest

Continue working in your model file. Be sure to add, commit, and push your changes.

After making a few models, which one has the best performance (or closest metrics) on both train and validate?

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

### 1. Fit the Random Forest classifier to your training sample and transform (i.e. make predictions on the training sample) setting the random_state accordingly and setting min_samples_leaf = 1 and max_depth = 20.

In [7]:
df = prep_titanic()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   survived    889 non-null    int64  
 1   pclass      889 non-null    int64  
 2   age         889 non-null    float64
 3   sibsp       889 non-null    int64  
 4   parch       889 non-null    int64  
 5   fare        889 non-null    float64
 6   alone       889 non-null    int64  
 7   sex_male    889 non-null    uint8  
 8   embarked_Q  889 non-null    uint8  
 9   embarked_S  889 non-null    uint8  
dtypes: float64(2), int64(5), uint8(3)
memory usage: 58.2 KB


In [8]:
df.isna().sum()

survived      0
pclass        0
age           0
sibsp         0
parch         0
fare          0
alone         0
sex_male      0
embarked_Q    0
embarked_S    0
dtype: int64

In [11]:
X = df[['pclass','age','fare','sibsp','parch']]
y = df[["survived"]]

X_train_validate, X_test, y_train_validate, y_test = train_test_split(X, y, test_size = .3, random_state = 666, stratify=y.survived)
X_train, X_validate, y_train, y_validate = train_test_split(X_train_validate, y_train_validate, test_size=0.3, random_state = 666, stratify=y_train_validate)

X_train.head()

| | pclass | age | fare | sibsp | parch |
|---|---|---|---|---|---|
| 192 | 3 | 19.0 | 7.8542 | 1 | 0 |
| 392 | 3 | 28.0 | 7.925 | 2 | 0 |
| 44 | 3 | 19.0 | 7.8792 | 0 | 0 |
| 458 | 2 | 50.0 | 10.5 | 0 | 0 |
| 755 | 2 | 0.67 | 14.5 | 1 | 1 |


In [12]:
# Train the model

rf = RandomForestClassifier(bootstrap=True, 
                            class_weight=None, 
                            criterion='gini',
                            min_samples_leaf=1,
                            n_estimators=100,
                            max_depth=20, 
                            random_state=666)

rf.fit(X_train, y_train)

RandomForestClassifier(max_depth=20, random_state=666)

In [13]:
# Print feature importances

print(rf.feature_importances_)

[0.07253392 0.37328878 0.42761791 0.06672805 0.05983134]
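The raw array is easier to read when each value is paired with its column name. A small sketch, assuming the importances follow the column order of `X` (pclass, age, fare, sibsp, parch) and using the values printed above:

```python
# Column order of X and the importances printed above.
features = ["pclass", "age", "fare", "sibsp", "parch"]
importances = [0.07253392, 0.37328878, 0.42761791, 0.06672805, 0.05983134]

# Sort from most to least important.
ranked = sorted(zip(features, importances), key=lambda pair: pair[1], reverse=True)

for name, value in ranked:
    print(f"{name:<8}{value:.3f}")

print("Total:", round(sum(importances), 3))
```

Fare and age account for about 80% of the total importance in this forest. Continuous features with many distinct values tend to score high under impurity-based importance, so this ranking is worth treating with some caution.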


In [14]:
# Estimate prediction of survival

y_pred = rf.predict(X_train)

# Estimate probability of survival

y_pred_proba = rf.predict_proba(X_train)

### 2. Evaluate your results using the model score, confusion matrix, and classification report.

In [22]:
print('Accuracy of random forest classifier on train set: {:.2f}'
     .format(rf.score(X_train, y_train)))

Accuracy of random forest classifier on train set: 0.98


In [19]:
print("Confusion matrix:\n", confusion_matrix(y_train, y_pred))

Confusion matrix:
 [[263   6]
 [  4 162]]


In [20]:
print("Classification report:\n", classification_report(y_train, y_pred))

Classification report:
               precision    recall  f1-score   support

           0       0.99      0.98      0.98       269
           1       0.96      0.98      0.97       166

    accuracy                           0.98       435
   macro avg       0.97      0.98      0.98       435
weighted avg       0.98      0.98      0.98       435



In [48]:
# Now to evaluate on Validate

print('Accuracy of random forest classifier on validate set: {:.2f}'
     .format(rf.score(X_validate, y_validate)))

Accuracy of random forest classifier on validate set: 0.65


> The model has been overfit to the train data: accuracy drops from 98% on train to 65% on validate.

### 3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [24]:
y_pred
y_train.size

435

In [25]:
tn, fp, fn, tp = confusion_matrix(y_train, y_pred).ravel()

In [28]:
accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
specificity= (tn / (tn + fp))

print("True Positives:", tp)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Negatives:", tn)

print("-------------")

print("Accuracy is", round(accuracy, 3))
print("Recall is", round(recall, 3))
print("Precision is", round(precision, 3))
print("Specificity is", round(specificity, 3))
print("f1-score is", round(f1_score(y_train, y_pred), 3))
print("Support is", precision_recall_fscore_support(y_train, y_pred)[-1])

True Positives: 162
False Positives: 6
False Negatives: 4
True Negatives: 263
-------------
Accuracy is 0.977
Recall is 0.976
Precision is 0.964
Specificity is 0.978
f1-score is 0.97
Support is [269 166]
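
The exercise also asks for the false positive rate and false negative rate, which the cell above does not print. Here is a small sketch that derives all four rates from the confusion-matrix counts. `rates_from_counts` is a hypothetical helper written for this illustration, and the counts are copied from the output above.

```python
def rates_from_counts(tp, fp, fn, tn):
    """Return the four confusion-matrix rates as a dict."""
    return {
        'true positive rate': tp / (tp + fn),   # same as recall
        'false positive rate': fp / (fp + tn),
        'true negative rate': tn / (tn + fp),   # same as specificity
        'false negative rate': fn / (fn + tp),
    }

# Counts copied from the Model 1 train output above.
rates = rates_from_counts(tp=162, fp=6, fn=4, tn=263)

for label, value in rates.items():
    print(f"{label.title()}: {value:.3f}")
```

The true positive rate matches the recall printed above and the true negative rate matches the specificity; the two false rates are their complements.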


### 4. Run through steps increasing your min_samples_leaf to 5 and decreasing your max_depth to 3.

In [36]:
# Train the model

rf2 = RandomForestClassifier(bootstrap=True, 
                            class_weight=None, 
                            criterion='gini',
                            min_samples_leaf=5,
                            n_estimators=100,
                            max_depth=3, 
                            random_state=666)

rf2.fit(X_train, y_train)

RandomForestClassifier(max_depth=3, min_samples_leaf=5, random_state=666)

In [37]:
# Print feature importances

print(rf2.feature_importances_)

[0.21912829 0.2098166  0.4227064  0.08356851 0.0647802 ]


In [38]:
# Estimate prediction of survival

y_pred2 = rf2.predict(X_train)

# Estimate probability of survival

y_pred_proba2 = rf2.predict_proba(X_train)

In [39]:
print('Accuracy of random forest classifier on train set: {:.2f}'
     .format(rf2.score(X_train, y_train)))

Accuracy of random forest classifier on train set: 0.74


In [43]:
print("Confusion matrix:\n", confusion_matrix(y_train, y_pred2))

Confusion matrix:
 [[234  35]
 [ 80  86]]


In [44]:
print("Classification report:\n", classification_report(y_train, y_pred2))

Classification report:
               precision    recall  f1-score   support

           0       0.75      0.87      0.80       269
           1       0.71      0.52      0.60       166

    accuracy                           0.74       435
   macro avg       0.73      0.69      0.70       435
weighted avg       0.73      0.74      0.73       435



In [49]:
# Now to evaluate on Validate

print('Accuracy of random forest classifier on validate set: {:.2f}'
     .format(rf2.score(X_validate, y_validate)))

Accuracy of random forest classifier on validate set: 0.73


In [46]:
tn, fp, fn, tp = confusion_matrix(y_train, y_pred2).ravel()

In [47]:
accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
specificity= (tn / (tn + fp))

print("True Positives:", tp)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Negatives:", tn)

print("-------------")

print("Accuracy is", round(accuracy, 3))
print("Recall is", round(recall, 3))
print("Precision is", round(precision, 3))
print("Specificity is", round(specificity, 3))
print("f1-score is", round(f1_score(y_train, y_pred2), 3))
print("Support is", precision_recall_fscore_support(y_train, y_pred2)[-1])

True Positives: 86
False Positives: 35
False Negatives: 80
True Negatives: 234
-------------
Accuracy is 0.736
Recall is 0.518
Precision is 0.711
Specificity is 0.87
f1-score is 0.599
Support is [269 166]


### 5. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

**Model 1 on train:**

True Positives: 162, False Positives: 6, False Negatives: 4, True Negatives: 263

Accuracy is 0.977

Recall is 0.976

Precision is 0.964

Specificity is 0.978

f1-score is 0.97

Support is [269 166]



**Model 2 on train:**

True Positives: 86, False Positives: 35, False Negatives: 80, True Negatives: 234

Accuracy is 0.736

Recall is 0.518

Precision is 0.711

Specificity is 0.87

f1-score is 0.599

Support is [269 166]

In [52]:
print("Model 1 on Validate:")

print('Accuracy: {:.2f}'
     .format(rf.score(X_validate, y_validate)))
    
print("Model 2 on Validate:")

print('Accuracy: {:.2f}'
     .format(rf2.score(X_validate, y_validate)))

Model 1 on Validate:
Accuracy: 0.65
Model 2 on Validate:
Accuracy: 0.73


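> Model 1 performed better on the in-sample (train) data, with 0.977 accuracy against 0.736 for Model 2, but it was overfit. With `min_samples_leaf=1` and `max_depth=20`, the trees can keep splitting until a leaf holds a single observation, so they memorize the training set. Its accuracy fell to 0.65 on validate.
> 
> Model 2 generalized better, scoring 0.74 on train and 0.73 on validate, which makes it the best model on validate.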
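
One way to make the overfitting visible is to put the train and validate accuracies side by side and look at the gap. Here is a minimal sketch using the accuracies reported in this notebook; the dictionary layout is just for illustration.

```python
# Accuracies copied from the outputs in this notebook.
results = {
    'Model 1 (max_depth=20, min_samples_leaf=1)': {'train': 0.977, 'validate': 0.65},
    'Model 2 (max_depth=3, min_samples_leaf=5)': {'train': 0.736, 'validate': 0.73},
}

for name, acc in results.items():
    # A large positive gap suggests the model memorized the training data.
    gap = acc['train'] - acc['validate']
    print(f"{name}: train={acc['train']:.2f} "
          f"validate={acc['validate']:.2f} gap={gap:.2f}")

# Pick the model with the highest out-of-sample accuracy.
best_on_validate = max(results, key=lambda name: results[name]['validate'])
print("Best on validate:", best_on_validate)
```

Model 1 has the larger gap between train and validate, while Model 2 scores almost the same on both.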

# Modeling Exercises Cont.
## KNN

Continue working in your model notebook or python script.

1. Fit a K-Nearest Neighbors classifier to your training sample and transform (i.e. make predictions on the training sample)
2. Evaluate your results using the model score, confusion matrix, and classification report.
3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.
4. Run through steps 2-4 setting k to 10
5. Run through steps 2-4 setting k to 20
6. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?
7. Which model performs best on our out-of-sample data from validate?

In [12]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_fscore_support

from acquire import get_titanic_data
from prepare import prep_titanic

df = prep_titanic()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   survived    889 non-null    int64  
 1   pclass      889 non-null    int64  
 2   age         889 non-null    float64
 3   sibsp       889 non-null    int64  
 4   parch       889 non-null    int64  
 5   fare        889 non-null    float64
 6   alone       889 non-null    int64  
 7   sex_male    889 non-null    uint8  
 8   embarked_Q  889 non-null    uint8  
 9   embarked_S  889 non-null    uint8  
dtypes: float64(2), int64(5), uint8(3)
memory usage: 58.2 KB


### 1. Fit a K-Nearest Neighbors classifier to your training sample and transform (i.e. make predictions on the training sample)

In [13]:
X = df[['pclass','age','fare','sibsp','parch']]
y = df[['survived']]

X_train_validate, X_test, y_train_validate, y_test = train_test_split(X, y, test_size = .30, random_state = 123, stratify=y.survived)
X_train, X_validate, y_train, y_validate = train_test_split(X_train_validate, y_train_validate, test_size = .30, random_state = 123, stratify=y_train_validate.survived)

X_train.head()

Unnamed: 0,pclass,age,fare,sibsp,parch
203,3,45.5,7.225,0,0
669,1,29.642093,52.0,1,0
27,1,19.0,263.0,3,2
187,1,45.0,26.55,0,0
383,1,35.0,52.0,1,0


In [14]:
# create knn object
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')

# fit model to train data
knn.fit(X_train, y_train)

KNeighborsClassifier()

In [15]:
# survival prediction
y_pred = knn.predict(X_train)

# survival probability prediction
y_pred_proba = knn.predict_proba(X_train)

### 2. Evaluate your results using the model score, confusion matrix, and classification report.

In [16]:
# evaluate
print('Accuracy of Model 1 on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))

print('Confusion matrix:\n', confusion_matrix(y_train, y_pred))

print('Classification report:\n', classification_report(y_train, y_pred))

Accuracy of Model 1 on training set: 0.80
Confusion matrix:
 [[224  45]
 [ 43 123]]
Classification report:
               precision    recall  f1-score   support

           0       0.84      0.83      0.84       269
           1       0.73      0.74      0.74       166

    accuracy                           0.80       435
   macro avg       0.79      0.79      0.79       435
weighted avg       0.80      0.80      0.80       435



### 3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [17]:
tn, fp, fn, tp = confusion_matrix(y_train, y_pred).ravel()

In [18]:
accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
specificity= (tn / (tn + fp))

print("True Positives:", tp)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Negatives:", tn)

print("-------------")

print("Accuracy is", round(accuracy, 3))
print("Recall is", round(recall, 3))
print("Precision is", round(precision, 3))
print("Specificity is", round(specificity, 3))
print("f1-score is", round(f1_score(y_train, y_pred), 3))
print("Support is", precision_recall_fscore_support(y_train, y_pred)[-1])

True Positives: 123
False Positives: 45
False Negatives: 43
True Negatives: 224
-------------
Accuracy is 0.798
Recall is 0.741
Precision is 0.732
Specificity is 0.833
f1-score is 0.737
Support is [269 166]
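
The f1-score and support printed above can also be recovered from the four confusion-matrix counts, which is a useful cross-check on the scikit-learn output. Here is a minimal sketch with the counts copied from the output above:

```python
# Counts copied from the KNN (k=5) train output above.
tp, fp, fn, tn = 123, 45, 43, 224

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Support is the number of actual observations per class: [negatives, positives].
support = [tn + fp, fn + tp]

print("f1-score is", round(f1, 3))
print("Support is", support)
```

Both values agree with what `f1_score` and `precision_recall_fscore_support` reported.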


### 4. Run through steps 2-4 setting k to 10

In [19]:
# create knn object
knn2 = KNeighborsClassifier(n_neighbors=10, weights='uniform')

# fit model to train data
knn2.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=10)

In [20]:
# survival prediction
y_pred2 = knn2.predict(X_train)

# survival probability prediction
y_pred_proba2 = knn2.predict_proba(X_train)

In [21]:
# evaluate
print('Accuracy of Model 2 on training set: {:.2f}'
     .format(knn2.score(X_train, y_train)))

print('Confusion matrix:\n', confusion_matrix(y_train, y_pred2))

print('Classification report:\n', classification_report(y_train, y_pred2))

Accuracy of Model 2 on training set: 0.76
Confusion matrix:
 [[237  32]
 [ 72  94]]
Classification report:
               precision    recall  f1-score   support

           0       0.77      0.88      0.82       269
           1       0.75      0.57      0.64       166

    accuracy                           0.76       435
   macro avg       0.76      0.72      0.73       435
weighted avg       0.76      0.76      0.75       435



In [23]:
tn, fp, fn, tp = confusion_matrix(y_train, y_pred2).ravel()

In [24]:
accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
specificity= (tn / (tn + fp))

print("True Positives:", tp)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Negatives:", tn)

print("-------------")

print("Accuracy is", round(accuracy, 3))
print("Recall is", round(recall, 3))
print("Precision is", round(precision, 3))
print("Specificity is", round(specificity, 3))
print("f1-score is", round(f1_score(y_train, y_pred2), 3))
print("Support is", precision_recall_fscore_support(y_train, y_pred2)[-1])

True Positives: 94
False Positives: 32
False Negatives: 72
True Negatives: 237
-------------
Accuracy is 0.761
Recall is 0.566
Precision is 0.746
Specificity is 0.881
f1-score is 0.644
Support is [269 166]


### 5. Run through steps 2-4 setting k to 20

In [25]:
# create knn object
knn3 = KNeighborsClassifier(n_neighbors=20, weights='uniform')

# fit model to train data
knn3.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=20)

In [26]:
# survival prediction
y_pred3 = knn3.predict(X_train)

# survival probability prediction
y_pred_proba3 = knn3.predict_proba(X_train)

In [27]:
# evaluate
print('Accuracy of Model 3 on training set: {:.2f}'
     .format(knn3.score(X_train, y_train)))

print('Confusion matrix:\n', confusion_matrix(y_train, y_pred3))

print('Classification report:\n', classification_report(y_train, y_pred3))

Accuracy of Model 3 on training set: 0.74
Confusion matrix:
 [[238  31]
 [ 84  82]]
Classification report:
               precision    recall  f1-score   support

           0       0.74      0.88      0.81       269
           1       0.73      0.49      0.59       166

    accuracy                           0.74       435
   macro avg       0.73      0.69      0.70       435
weighted avg       0.73      0.74      0.72       435



In [28]:
tn, fp, fn, tp = confusion_matrix(y_train, y_pred3).ravel()

In [29]:
accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
specificity= (tn / (tn + fp))

print("True Positives:", tp)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Negatives:", tn)

print("-------------")

print("Accuracy is", round(accuracy, 3))
print("Recall is", round(recall, 3))
print("Precision is", round(precision, 3))
print("Specificity is", round(specificity, 3))
print("f1-score is", round(f1_score(y_train, y_pred3), 3))
print("Support is", precision_recall_fscore_support(y_train, y_pred3)[-1])

True Positives: 82
False Positives: 31
False Negatives: 84
True Negatives: 238
-------------
Accuracy is 0.736
Recall is 0.494
Precision is 0.726
Specificity is 0.885
f1-score is 0.588
Support is [269 166]


### 6. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

> Accuracy of Model 1 on training set: 0.80
> 
> Accuracy of Model 2 on training set: 0.76
> 
> Accuracy of Model 3 on training set: 0.74
>
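> Model 1 (k=5) performed best on the in-sample data. With a lower k, each prediction depends on fewer neighbors, so the model follows the training data more closely; training accuracy falls as k rises from 5 to 10 to 20.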
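
To illustrate why a smaller k tends to score higher in-sample, here is a small pure-Python sketch on made-up one-dimensional data (not the titanic data). `knn_predict` and `train_accuracy` are hypothetical helpers written for this illustration, not scikit-learn functions. Each training point is its own nearest neighbor, so with k=1 every training label is reproduced exactly, while a larger k lets the surrounding points outvote it.

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Predict the majority label among the k training points closest to query."""
    # Sort training indices by distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def train_accuracy(train_x, train_y, k):
    """Share of training points whose own label is predicted correctly."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(train_x, train_y))
    return hits / len(train_x)

# Made-up data: mostly 0 on the left and 1 on the right, with a few noisy labels.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

for k in (1, 3, 5):
    print(f"k={k}: train accuracy = {train_accuracy(xs, ys, k):.2f}")
```

On this toy data the training accuracy falls as k grows, the same direction seen with the three KNN models above. High training accuracy at small k says nothing about out-of-sample performance.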

### 7. Which model performs best on our out-of-sample data from validate?

In [30]:
print('Accuracy of Model 1 on validation set: {:.2f}'
     .format(knn.score(X_validate, y_validate)))

print('Accuracy of Model 2 on validation set: {:.2f}'
     .format(knn2.score(X_validate, y_validate)))

print('Accuracy of Model 3 on validation set: {:.2f}'
     .format(knn3.score(X_validate, y_validate)))

Accuracy of Model 1 on validation set: 0.67
Accuracy of Model 2 on validation set: 0.71
Accuracy of Model 3 on validation set: 0.67


> Model 2 (k=10) performs best on the validation data, with 0.71 accuracy against 0.67 for both Model 1 and Model 3.