In [1]:
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import prepare
import acquire
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Decision Tree Exercises
## Using the titanic data, in your classification-exercises repository, create a notebook, model.ipynb where you will do the following:

In [2]:
df = acquire.get_titanic_data()
df.head()

Unnamed: 0.1,Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


In [3]:
titanic = prepare.prep_titanic(df)
titanic.head(2)

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,alone,sex_male,embark_town_Queenstown,embark_town_Southampton
0,0,3,22.0,1,0,7.25,0,1,0,1
1,1,1,38.0,1,0,71.2833,0,0,0,0


In [4]:
# Deleting the missing values
titanic.dropna(inplace=True)

### What is your baseline prediction? 

In [5]:
titanic.survived.value_counts()

0    424
1    288
Name: survived, dtype: int64

My baseline prediction is "did not survive" (0), because it is the most common class: 424 of the 712 passengers left after dropping missing values did not survive.

### What is your baseline accuracy?

In [6]:
baseline_accuracy = (1 - titanic.survived.mean())
baseline_accuracy

0.5955056179775281

### Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample)

In [7]:
X = titanic[['pclass','age','fare','sibsp','parch']]
y = titanic[['survived']]

X_train_validate, X_test, y_train_validate, y_test = train_test_split(X, y, test_size = .20, random_state = 719)

X_train, X_validate, y_train, y_validate = train_test_split(X_train_validate, y_train_validate, test_size = .30, random_state = 719)

print("X train: ", X_train.shape, ", X validate: ", X_validate.shape, ", X test: ", X_test.shape)
print("Y train: ", y_train.shape, ", Y validate: ", y_validate.shape, ", Y test: ", y_test.shape)

X train:  (398, 5) , X validate:  (171, 5) , X test:  (143, 5)
Y train:  (398, 1) , Y validate:  (171, 1) , Y test:  (143, 1)


In [8]:
X_train

Unnamed: 0,pclass,age,fare,sibsp,parch
689,1,15.0,211.3375,0,1
794,3,25.0,7.8958,0,0
467,1,56.0,26.5500,0,0
737,1,35.0,512.3292,0,0
741,1,36.0,78.8500,1,0
...,...,...,...,...,...
715,3,19.0,7.6500,0,0
699,3,42.0,7.6500,0,0
785,3,25.0,7.2500,0,0
203,3,45.5,7.2250,0,0


In [9]:
# Baseline accuracy on the training split: the share of passengers who did not survive
train_baseline_accuracy = (1 - y_train.survived.mean())
train_baseline_accuracy

0.5904522613065326

### Evaluate your in-sample results using the model score, confusion matrix, and classification report.

In [10]:
# Create the Decision Tree Object
clf = DecisionTreeClassifier(max_depth=3, random_state=719)
# Fit the model to the training data
clf = clf.fit(X_train, y_train)

In [11]:
# make predictions on the train observations
y_pred = clf.predict(X_train)
y_pred[0:5] 

array([1, 0, 0, 1, 1])

In [12]:
y_pred_prob = clf.predict_proba(X_train)
y_pred_prob[0:5]

array([[0.24107143, 0.75892857],
       [0.76923077, 0.23076923],
       [0.5952381 , 0.4047619 ],
       [0.24107143, 0.75892857],
       [0.24107143, 0.75892857]])

In [13]:
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
      .format(clf.score(X_train, y_train)))

Accuracy of Decision Tree classifier on training set: 0.76


In [14]:
# confusion matrix
confusion_matrix(y_train, y_pred)

array([[208,  27],
       [ 69,  94]])

In [15]:
# classification report
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.75      0.89      0.81       235
           1       0.78      0.58      0.66       163

    accuracy                           0.76       398
   macro avg       0.76      0.73      0.74       398
weighted avg       0.76      0.76      0.75       398



### Compute: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.
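One way to get these values is to derive them directly from the training confusion matrix printed above. The sketch below hard-codes that matrix so it runs on its own; it treats "survived" (1) as the positive class, which is an assumption about how the exercise wants the rates defined.

```python
import numpy as np

# Training confusion matrix from the output above
# (rows = actual class, columns = predicted class).
cm = np.array([[208, 27],
               [69, 94]])

# ravel() flattens row by row, giving TN, FP, FN, TP
# when class 1 (survived) is treated as the positive class.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
tpr = tp / (tp + fn)        # true positive rate, identical to recall
fpr = fp / (fp + tn)        # false positive rate
tnr = tn / (tn + fp)        # true negative rate
fnr = fn / (fn + tp)        # false negative rate
precision = tp / (tp + fp)
recall = tpr
f1 = 2 * precision * recall / (precision + recall)
support_neg = tn + fp       # number of actual class-0 rows
support_pos = tp + fn       # number of actual class-1 rows

rates = {
    "Accuracy": accuracy,
    "True positive rate": tpr,
    "False positive rate": fpr,
    "True negative rate": tnr,
    "False negative rate": fnr,
    "Precision": precision,
    "Recall": recall,
    "F1-score": f1,
}
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
print(f"Support (did not survive): {support_neg}")
print(f"Support (survived): {support_pos}")
```

The accuracy, precision, recall, and f1-score for class 1 should agree with the classification report above after rounding.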

### Run through steps 2-4 using a different max_depth value.

### Which model performs better on your in-sample data?

### Which model performs best on your out-of-sample data, the validate set?
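The depth comparison asked for in the three prompts above could be sketched as a loop that refits the tree at several `max_depth` values and scores each on train and validate. To keep the block runnable on its own, it uses synthetic stand-in data from `make_classification` rather than the Titanic frame; the `*_demo` names and the depth values are illustrative assumptions, and in the notebook you would pass `X_train`, `y_train`, `X_validate`, and `y_validate` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): 712 rows x 5 features, the same shape as X above.
X_demo, y_demo = make_classification(
    n_samples=712, n_features=5, n_informative=3, n_redundant=0, random_state=719
)

# Same two-step split as the notebook: 20% test, then 30% of the rest for validate.
X_tv, X_test_demo, y_tv, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=719
)
X_train_demo, X_validate_demo, y_train_demo, y_validate_demo = train_test_split(
    X_tv, y_tv, test_size=0.30, random_state=719
)

# Map each depth to its (train accuracy, validate accuracy) pair.
depth_scores = {}
for depth in (2, 3, 5, 10):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=719)
    tree.fit(X_train_demo, y_train_demo)
    train_acc = tree.score(X_train_demo, y_train_demo)
    validate_acc = tree.score(X_validate_demo, y_validate_demo)
    depth_scores[depth] = (train_acc, validate_acc)
    print(f"max_depth={depth:>2}  train={train_acc:.2f}  validate={validate_acc:.2f}")
```

A deeper tree will usually score higher in-sample; a widening gap between the train and validate columns is the usual sign that the extra depth is overfitting.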

# Random Forest Exercises

### Continue working in your model file with titanic data to do the following:

### Fit the Random Forest classifier to your training sample and transform (i.e. make predictions on the training sample) setting the random_state accordingly and setting min_samples_leaf = 1 and max_depth = 10.

In [16]:
train, validate, test = prepare.titanic_tvt(df)


### Evaluate your results using the model score, confusion matrix, and classification report.
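A minimal sketch of the fit-and-evaluate step, assuming the hyperparameters named above (`min_samples_leaf=1`, `max_depth=10`). Because the return format of `prepare.titanic_tvt` is not shown here, the block uses synthetic stand-in data and hypothetical `*_demo` names; substitute your own feature and target splits.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): replace with the Titanic train/validate splits.
X_demo, y_demo = make_classification(
    n_samples=712, n_features=5, n_informative=3, n_redundant=0, random_state=719
)
X_train_demo, X_validate_demo, y_train_demo, y_validate_demo = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=719
)

# Create and fit the forest with the settings from the exercise.
rf = RandomForestClassifier(min_samples_leaf=1, max_depth=10, random_state=719)
rf.fit(X_train_demo, y_train_demo)

# Predict on the training sample and evaluate in-sample.
rf_pred = rf.predict(X_train_demo)
rf_train_accuracy = rf.score(X_train_demo, y_train_demo)
rf_cm = confusion_matrix(y_train_demo, rf_pred)

print(f"Accuracy of Random Forest classifier on training set: {rf_train_accuracy:.2f}")
print(rf_cm)
print(classification_report(y_train_demo, rf_pred))
```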

### Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

### Run through steps increasing your min_samples_leaf and decreasing your max_depth.

### What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

### After making a few models, which one has the best performance (or closest metrics) on both train and validate?
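To compare several forests side by side, one option is to collect train and validate accuracy into a small table and look at the gap between them. The sketch below again runs on synthetic stand-in data; the specific `(min_samples_leaf, max_depth)` pairs are arbitrary choices for illustration, not values prescribed by the exercise.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): replace with the Titanic train/validate splits.
X_demo, y_demo = make_classification(
    n_samples=712, n_features=5, n_informative=3, n_redundant=0, random_state=719
)
X_train_demo, X_validate_demo, y_train_demo, y_validate_demo = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=719
)

rows = []
# Increase min_samples_leaf while decreasing max_depth (illustrative pairs).
for min_leaf, depth in [(1, 10), (2, 8), (4, 6), (8, 4), (16, 2)]:
    forest = RandomForestClassifier(
        min_samples_leaf=min_leaf, max_depth=depth, random_state=719
    )
    forest.fit(X_train_demo, y_train_demo)
    rows.append({
        "min_samples_leaf": min_leaf,
        "max_depth": depth,
        "train_accuracy": forest.score(X_train_demo, y_train_demo),
        "validate_accuracy": forest.score(X_validate_demo, y_validate_demo),
    })

rf_results = pd.DataFrame(rows)
# A small gap means train and validate performance are close.
rf_results["gap"] = rf_results["train_accuracy"] - rf_results["validate_accuracy"]
print(rf_results.round(3))
```

Sorting `rf_results` by `validate_accuracy` or by `gap` gives two different readings of "best", which is worth stating explicitly when you answer the question above.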

# K-Nearest Neighbor Exercises

Fit a K-Nearest Neighbors classifier to your training sample and transform (i.e. make predictions on the training sample)

Evaluate your results using the model score, confusion matrix, and classification report.

Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

Run through steps 2-4 setting k to 10

Run through steps 2-4 setting k to 20

What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

Which model performs best on our out-of-sample data from validate?
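The KNN steps above could be sketched as a loop over k. The first prompt does not name a k, so the block starts from 5, which is scikit-learn's default `n_neighbors`; the data is a synthetic stand-in and the `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): replace with the Titanic train/validate splits.
X_demo, y_demo = make_classification(
    n_samples=712, n_features=5, n_informative=3, n_redundant=0, random_state=719
)
X_train_demo, X_validate_demo, y_train_demo, y_validate_demo = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=719
)

# Map each k to its (train accuracy, validate accuracy) pair.
knn_scores = {}
for k in (5, 10, 20):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_demo, y_train_demo)
    train_acc = knn.score(X_train_demo, y_train_demo)
    validate_acc = knn.score(X_validate_demo, y_validate_demo)
    knn_scores[k] = (train_acc, validate_acc)
    print(f"k={k:>2}  train={train_acc:.2f}  validate={validate_acc:.2f}")
```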

# Logistic Regression Exercises

Create a model that includes age in addition to fare and pclass. Does this model perform better than your baseline?

Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.
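One common way to create the dummy variable is `pd.get_dummies`. The sketch below rebuilds the first five rows of the raw data shown earlier (relevant columns only) and encodes `sex` into a single `sex_male` column, the same column name that appears in the prepared frame above. Whether `prepare.prep_titanic` does it this way is an assumption.

```python
import pandas as pd

# First five rows of the raw Titanic data shown earlier (relevant columns only).
sample = pd.DataFrame({
    "pclass": [3, 1, 3, 1, 3],
    "sex": ["male", "female", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# drop_first=True keeps one column (sex_male) instead of two redundant ones;
# dtype=int yields 0/1 values rather than booleans.
encoded = pd.get_dummies(sample, columns=["sex"], drop_first=True, dtype=int)
print(encoded)
```

The resulting `sex_male` column can be added to the feature list alongside `age`, `fare`, and `pclass`.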

Try out other combinations of features and models.

Use your best 3 models to predict and evaluate on your validate sample.

Choose your best model based on validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? To train?

Bonus1: How do different strategies for handling the missing values in the age column affect model performance?

Bonus2: How do different strategies for encoding sex affect model performance?

Bonus3: scikit-learn's LogisticRegression classifier applies a regularization penalty to the coefficients by default. This penalty causes the magnitude of the coefficients in the resulting model to be smaller than they otherwise would be. The strength of the penalty can be modified with the C hyperparameter. Small values of C correspond to a larger penalty, and large values of C correspond to a smaller penalty.
Try out the following values for C and note how the coefficients and the model's performance on both the dataset it was trained on and on the validate split are affected.
C = .01, .1, 1, 10, 100, 1000
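A sketch of that experiment, assuming synthetic stand-in data in place of the Titanic splits: it refits `LogisticRegression` for each listed value of C and records the size of the coefficient vector next to train and validate accuracy. The `max_iter=1000` setting and the `*_demo` names are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): replace with the Titanic train/validate splits.
X_demo, y_demo = make_classification(
    n_samples=712, n_features=5, n_informative=3, n_redundant=0, random_state=719
)
X_train_demo, X_validate_demo, y_train_demo, y_validate_demo = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=719
)

# Map each C to (coefficient L2 norm, train accuracy, validate accuracy).
c_results = {}
for c in (0.01, 0.1, 1, 10, 100, 1000):
    logit = LogisticRegression(C=c, max_iter=1000, random_state=719)
    logit.fit(X_train_demo, y_train_demo)
    coef_norm = float(np.linalg.norm(logit.coef_))
    train_acc = logit.score(X_train_demo, y_train_demo)
    validate_acc = logit.score(X_validate_demo, y_validate_demo)
    c_results[c] = (coef_norm, train_acc, validate_acc)
    print(f"C={c:<7} |coef|={coef_norm:.3f}  train={train_acc:.2f}  validate={validate_acc:.2f}")
```

The coefficient norm should grow as C increases, since a larger C means a weaker penalty; whether accuracy changes along with it depends on the data.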

Bonus Bonus: how does scaling the data interact with your choice of C?