# Exercises

Using the Titanic data, in your classification-exercises repository, create a notebook, `model.ipynb`, where you will do the following:

1. What is your baseline prediction? What is your baseline accuracy? Remember: your baseline prediction for a classification problem is predicting the most prevalent class in the training dataset (the mode). When you make those predictions, what is your accuracy? This is your baseline accuracy.

2. Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample)

3. Evaluate your in-sample results using the model score, confusion matrix, and classification report.

4. Compute: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

5. Run through steps 2-4 using a different max_depth value.

6. Which model performs better on your in-sample data?

7. Which model performs best on your out-of-sample data, the validate set?

8. Work through these same exercises using the Telco dataset.

In [34]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import sklearn.metrics as metrics
from env import get_db_url
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree


import aquire
import prepare

In [2]:
df = pd.read_sql('SELECT * FROM passengers', get_db_url('titanic_db'))


In [3]:
titanic_train, titanic_validate, titanic_test = prepare.split_titanic_data(prepare.clean_titanic_data(aquire.get_titanic_data()))


In [4]:
titanic_train.head()

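| | survived | pclass | sex | age | sibsp | parch | fare | embark_town | alone | sex_male | embark_town_Queenstown | embark_town_Southampton |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 583 | 0 | 1 | male | 36.0 | 0 | 0 | 40.125 | Cherbourg | 1 | 1 | 0 | 0 |
| 165 | 1 | 3 | male | 9.0 | 0 | 2 | 20.525 | Southampton | 0 | 1 | 0 | 1 |
| 50 | 0 | 3 | male | 7.0 | 4 | 1 | 39.6875 | Southampton | 0 | 1 | 0 | 1 |
| 259 | 1 | 2 | female | 50.0 | 0 | 1 | 26.0 | Southampton | 0 | 0 | 0 | 1 |
| 306 | 1 | 1 | female | NaN | 0 | 0 | 110.8833 | Cherbourg | 1 | 0 | 0 | 0 |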


What is your baseline prediction?
- That no passenger survived (`survived == 0`), the most common class in the training data.

In [5]:
titanic_train.survived.value_counts()

0    307
1    191
Name: survived, dtype: int64

In [6]:
X = titanic_train.drop(columns="survived")
Y = titanic_train.survived

# Baseline accuracy: the share of training rows in the most common class (0).
baseline = 307 / (307 + 191)
print(baseline)

0.6164658634538153
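
The baseline above hard-codes the class counts. As a minimal sketch, the same figure can be computed from the target column itself. The `survived` Series below is a stand-in rebuilt from the `value_counts()` output (307 zeros, 191 ones); in the notebook, `titanic_train.survived` would presumably take its place.

```python
import pandas as pd

# Stand-in for titanic_train.survived, rebuilt from the class counts shown above.
survived = pd.Series([0] * 307 + [1] * 191, name="survived")

# Baseline prediction: the most frequent class (the mode), predicted for every row.
baseline_prediction = survived.mode()[0]

# Baseline accuracy: fraction of rows where that constant prediction is correct.
baseline_accuracy = (survived == baseline_prediction).mean()

print(f"baseline prediction: {baseline_prediction}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

Deriving the value from the column means the baseline stays correct if the split or the cleaning step changes.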


In [35]:
titanic_dt = DecisionTreeClassifier(max_depth=3, random_state=123)

In [36]:
titanic_dt.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=3, random_state=123)

In [38]:
plt.figure(figsize=(13, 7))
# plot_tree expects class names as strings, so convert the integer labels in classes_.
plot_tree(titanic_dt, feature_names=list(X_train.columns), class_names=titanic_dt.classes_.astype(str).tolist(), rounded=True)
plt.show()
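
Steps 2 through 7 of the decision tree exercises can be sketched end to end as follows. This is an illustration under stated assumptions, not the notebook's actual run: the data comes from `make_classification` as a synthetic stand-in for the prepared Titanic features, and the two `max_depth` values (3 and 10) are arbitrary choices for comparison.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and target (assumption, not Titanic data).
X, y = make_classification(n_samples=700, n_features=8, random_state=123)
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=123
)

results = {}
for depth in (3, 10):
    # Fit on the training sample only, then predict on that same sample.
    tree = DecisionTreeClassifier(max_depth=depth, random_state=123)
    tree.fit(X_train, y_train)
    y_pred = tree.predict(X_train)

    print(f"max_depth={depth}")
    print(confusion_matrix(y_train, y_pred))
    print(classification_report(y_train, y_pred))

    # Keep in-sample and out-of-sample accuracy side by side for comparison.
    results[depth] = {
        "train": tree.score(X_train, y_train),
        "validate": tree.score(X_validate, y_validate),
    }

for depth, scores in results.items():
    print(depth, scores)
```

A large drop from train to validate accuracy for the deeper tree would suggest overfitting, which is the comparison steps 6 and 7 ask for.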

# Random Forest Exercises

Continue working in your `model` file with titanic data to do the following: 

1. Fit the Random Forest classifier to your training sample and transform (i.e. make predictions on the training sample), setting the random_state accordingly and setting min_samples_leaf = 1 and max_depth = 10.

2. Evaluate your results using the model score, confusion matrix, and classification report.

3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

4. Repeat steps 1-3, increasing your min_samples_leaf and decreasing your max_depth.

5. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

After making a few models, which one has the best performance (or closest metrics) on both train and validate?

In [7]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns


from pydataset import data

In [8]:
titanic_df = aquire.get_titanic_data()

In [9]:
train, validate, test = prepare.prep_titanic_data(titanic_df)

In [10]:
train


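| | survived | pclass | sex | age | sibsp | parch | fare | embark_town | alone | sex_male | embark_town_Queenstown | embark_town_Southampton |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 583 | 0 | 1 | male | 36.000000 | 0 | 0 | 40.1250 | Cherbourg | 1 | 1 | 0 | 0 |
| 165 | 1 | 3 | male | 9.000000 | 0 | 2 | 20.5250 | Southampton | 0 | 1 | 0 | 1 |
| 50 | 0 | 3 | male | 7.000000 | 4 | 1 | 39.6875 | Southampton | 0 | 1 | 0 | 1 |
| 259 | 1 | 2 | female | 50.000000 | 0 | 1 | 26.0000 | Southampton | 0 | 0 | 0 | 1 |
| 306 | 1 | 1 | female | 29.678105 | 0 | 0 | 110.8833 | Cherbourg | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 313 | 0 | 3 | male | 28.000000 | 0 | 0 | 7.8958 | Southampton | 1 | 1 | 0 | 1 |
| 636 | 0 | 3 | male | 32.000000 | 0 | 0 | 7.9250 | Southampton | 1 | 1 | 0 | 1 |
| 222 | 0 | 3 | male | 51.000000 | 0 | 0 | 8.0500 | Southampton | 1 | 1 | 0 | 1 |
| 485 | 0 | 3 | female | 29.678105 | 3 | 1 | 25.4667 | Southampton | 0 | 0 | 0 | 1 |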


In [11]:

# Create X and y versions of each split: y is a Series holding only the target, and X holds the features (the string columns are dropped because their encoded versions are kept).

X_train = train.drop(columns=['survived', 'embark_town', 'sex'])
y_train = train.survived

X_validate = validate.drop(columns=['survived', 'embark_town', 'sex'])
y_validate = validate.survived

X_test = test.drop(columns=['survived', 'embark_town', 'sex'])
y_test = test.survived

In [12]:
rf = RandomForestClassifier(bootstrap=True, 
                            class_weight=None, 
                            criterion='gini',
                            min_samples_leaf=1,
                            n_estimators=100,
                            max_depth=10, 
                            random_state=123)

In [13]:
rf.fit(X_train, y_train)


RandomForestClassifier(max_depth=10, random_state=123)

In [14]:
y_pred = rf.predict(X_train)
y_pred

array([0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,

In [16]:
rf.classes_

array([0, 1])

In [17]:
print(rf.score(X_train, y_train))

0.9698795180722891


In [18]:
print(confusion_matrix(y_train, y_pred))


[[307   0]
 [ 15 176]]


In [19]:
print(classification_report(y_train, y_pred))


              precision    recall  f1-score   support

           0       0.95      1.00      0.98       307
           1       1.00      0.92      0.96       191

    accuracy                           0.97       498
   macro avg       0.98      0.96      0.97       498
weighted avg       0.97      0.97      0.97       498



#### 3. Print and clearly label: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [39]:
TN, FP, FN, TP = confusion_matrix(y_train, y_pred).ravel()
ALL = TP+TN+FP+FN

TP, TN, FP, FN, ALL

(132, 280, 27, 59, 498)
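
The cell above stops at the raw counts. A minimal sketch of how the requested metrics follow from them, using the four counts from the printed tuple and treating class 1 (survived) as the positive class; the `metrics` dictionary and its labels are illustrative names, not part of the notebook.

```python
# Counts copied from the tuple printed above, in the order (TP, TN, FP, FN).
TP, TN, FP, FN = 132, 280, 27, 59
ALL = TP + TN + FP + FN

precision = TP / (TP + FP)
recall = TP / (TP + FN)  # recall is the same quantity as the true positive rate

metrics = {
    "accuracy": (TP + TN) / ALL,
    "true positive rate": recall,
    "false positive rate": FP / (FP + TN),
    "true negative rate": TN / (TN + FP),
    "false negative rate": FN / (FN + TP),
    "precision": precision,
    "recall": recall,
    "f1-score": 2 * precision * recall / (precision + recall),
}

# Support is the number of actual observations in each class.
support = {"positive (1)": TP + FN, "negative (0)": TN + FP}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
for name, count in support.items():
    print(f"support {name}: {count}")
```

The true positive and false negative rates sum to 1, as do the true negative and false positive rates, which is a quick sanity check on the arithmetic.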

#### 4. Run through steps increasing your min_samples_leaf and decreasing your max_depth.

#### Second attempt

In [44]:
rf2 = RandomForestClassifier(bootstrap=True, 
                            class_weight=None, 
                            criterion='gini',
                            min_samples_leaf=3,
                            n_estimators=100,
                            max_depth=8, 
                            random_state=123)

In [45]:
rf2.fit(X_train, y_train)

RandomForestClassifier(max_depth=8, min_samples_leaf=3, random_state=123)

In [46]:
y_pred = rf2.predict(X_train)

In [47]:
print(classification_report(y_train, y_pred))



#### Third attempt

In [48]:
rf3 = RandomForestClassifier(bootstrap=True, 
                            class_weight=None, 
                            criterion='gini',
                            min_samples_leaf=5,
                            n_estimators=100,
                            max_depth=5, 
                            random_state=123)

In [49]:
rf3.fit(X_train, y_train)

RandomForestClassifier(max_depth=5, min_samples_leaf=5, random_state=123)

In [50]:
y_pred = rf3.predict(X_train)

In [51]:
print(classification_report(y_train, y_pred))



#### Fourth Attempt

In [52]:
rf4 = RandomForestClassifier(bootstrap=True, 
                            class_weight=None, 
                            criterion='gini',
                            min_samples_leaf=8,
                            n_estimators=100,
                            max_depth=3, 
                            random_state=123)

In [53]:
rf4.fit(X_train, y_train)

RandomForestClassifier(max_depth=3, min_samples_leaf=8, random_state=123)

In [54]:
y_pred = rf4.predict(X_train)

In [55]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.91      0.87       307
           1       0.83      0.69      0.75       191

    accuracy                           0.83       498
   macro avg       0.83      0.80      0.81       498
weighted avg       0.83      0.83      0.82       498



#### 5. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

In [43]:
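# Build the table this cell relies on: the actual labels plus one column of
# in-sample predictions per fitted model. Add further fitted models the same way.
model_prediction = pd.DataFrame({
    'survived': y_train,
    'rf': rf.predict(X_train),
})

actuals = model_prediction.survived
preds = model_prediction.drop(columns='survived')

for column in preds.columns:
    # Accuracy is the share of rows where the prediction matches the actual label.
    accuracy = (actuals == preds[column]).mean()
    print(f'{column} accuracy: {accuracy}')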
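
One way to answer the closing question (which model has the closest metrics on train and validate) is to fit each hyperparameter pair as its own estimator and tabulate both scores. The sketch below is a hedged illustration: it reuses the four `(min_samples_leaf, max_depth)` pairs from the attempts above, but runs on synthetic data from `make_classification` rather than the Titanic splits, and the `comparison` table and its column names are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target (assumption, not Titanic data).
X, y = make_classification(n_samples=700, n_features=8, random_state=123)
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=123
)

# (min_samples_leaf, max_depth) pairs taken from the four attempts above.
settings = [(1, 10), (3, 8), (5, 5), (8, 3)]

rows = []
for min_samples_leaf, max_depth in settings:
    # A fresh estimator per setting, so no fit overwrites another model.
    forest = RandomForestClassifier(
        min_samples_leaf=min_samples_leaf,
        max_depth=max_depth,
        n_estimators=100,
        random_state=123,
    )
    forest.fit(X_train, y_train)
    rows.append({
        "min_samples_leaf": min_samples_leaf,
        "max_depth": max_depth,
        "train_accuracy": forest.score(X_train, y_train),
        "validate_accuracy": forest.score(X_validate, y_validate),
    })

comparison = pd.DataFrame(rows)
# A small difference means the model generalizes about as well as it fits.
comparison["difference"] = comparison.train_accuracy - comparison.validate_accuracy
print(comparison)
```

Sorting `comparison` by `difference`, or by `validate_accuracy`, gives a direct ranking of the candidate models.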