# Decision Tree
## Titanic Data - Decision Tree
- Determine baseline and baseline accuracy.
- Split data, fit decision tree classifier to data.
- Make predictions.
- Get model score for training dataset.
- Print confusion matrix and classification report for predictions.
- Fit a new model and run the same analysis with a different max_depth.
- Determine which model performs better on in-sample data.
- Determine which model performs better on out-of-sample data.

In [55]:
### Imports ###
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

from sklearn.tree import DecisionTreeClassifier

from env import get_db_url
import explore

In [3]:
### Pull titanic_db data ###
url = get_db_url(db_name='titanic_db')
df = pd.read_sql('SELECT * FROM passengers', url)
df.head(3)

,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1


In [28]:
### Save a copy of dataframe before manipulations ###
df_original = df.copy()

In [6]:
### Split data ###
# 80/20 split into train_validate and test, then 70/30 into train and validate
train_validate, test = train_test_split(df, test_size=.2,
                                        random_state=123,
                                        stratify=df.survived)
train, validate = train_test_split(train_validate, test_size=.3,
                                   random_state=123,
                                   stratify=train_validate.survived)
train.shape, validate.shape, test.shape

((498, 13), (214, 13), (179, 13))
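
As a quick sanity check on the stratified split, the sketch below wraps the same two-step split in a helper and confirms that each subset keeps roughly the same class balance. The toy dataframe and the `split_three_ways` name are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the passengers table: 40% positive class.
toy_df = pd.DataFrame({
    "survived": [1] * 40 + [0] * 60,
    "fare": range(100),
})

def split_three_ways(df, target, seed=123):
    """Split df into train/validate/test, stratifying on `target` both times."""
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    train, validate = train_test_split(
        train_validate, test_size=0.3, random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test

train_toy, validate_toy, test_toy = split_three_ways(toy_df, "survived")

# Each subset should keep close to the overall 40% positive rate.
rates = {
    name: subset["survived"].mean()
    for name, subset in [("train", train_toy),
                         ("validate", validate_toy),
                         ("test", test_toy)]
}
print(rates)
```

If the rates drift far from the overall rate, the `stratify` argument is probably missing or pointing at the wrong column.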

In [11]:
### Baseline prediction and baseline accuracy ###
# Baseline prediction is the most common class: 0 (did not survive)
# train.survived.value_counts()
(train.survived == 0).mean()

0.6164658634538153
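
The baseline above hardcodes class 0. A minimal sketch of the same idea that derives the most common class from the data, using a small hypothetical series in place of train.survived:

```python
import pandas as pd

# Hypothetical target column: 3 survivors, 5 non-survivors.
y_toy = pd.Series([0, 0, 1, 0, 1, 0, 0, 1], name="survived")

baseline_class = y_toy.mode()[0]                       # most frequent class
baseline_accuracy = (y_toy == baseline_class).mean()   # share of rows it gets right

print(baseline_class, baseline_accuracy)
```

Any model worth keeping should beat this accuracy on the same data.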

Data is tidy, continuing...

In [16]:
### Bivariate Exploration setup ###
cat_vars = ['pclass', 'sex', 'deck', 'embark_town', 'alone']
quant_vars = ['age', 'sibsp', 'parch', 'fare']
target = 'survived'

In [43]:
### Bivariate Exploration ###
# explore.explore_bivariate(train, target, cat_vars, quant_vars)
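
The explore module is a local helper whose contents are not shown here. As a rough stand-in for one piece of a bivariate exploration, the sketch below compares the survival rate across the levels of each categorical variable; the toy data and the `survival_rate_by` name are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature of the train subset.
toy_train = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male", "female"],
    "pclass": [3, 1, 3, 3, 1, 1],
    "survived": [0, 1, 0, 1, 1, 1],
})

def survival_rate_by(df, cat_var, target="survived"):
    """Mean of the binary target within each level of cat_var."""
    return df.groupby(cat_var)[target].mean()

for var in ["sex", "pclass"]:
    print(survival_rate_by(toy_train, var), "\n")
```

A large difference in rates between levels suggests the variable may be a useful model feature.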

Candidates for model:
- Sex
- Fare
- Passenger Class
- Alone

Candidates contain no nulls, continuing...
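
The null check behind that statement is not shown in the notebook. One way it might be done, sketched on a hypothetical dataframe:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; deck has nulls but is not a candidate feature.
toy_passengers = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "fare": [7.25, 71.2833, 7.925],
    "pclass": [3, 1, 3],
    "alone": [0, 0, 1],
    "deck": [np.nan, "C", np.nan],
})

candidates = ["sex", "fare", "pclass", "alone"]

# Count missing values per candidate column.
null_counts = toy_passengers[candidates].isna().sum()
print(null_counts)
```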

In [41]:
### Keep copies of the original subsets so the following cells can be rerun without restarting the notebook ###
train_original = train.copy()
validate_original = validate.copy()
test_original = test.copy()

In [48]:
### Set features, prepare for model ###
train, validate, test = train_original.copy(), validate_original.copy(), test_original.copy()

cols = ['survived', 'sex', 'pclass', 'alone', 'fare']
# .copy() gives independent frames, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning
train = train[cols].copy()
validate = validate[cols].copy()
test = test[cols].copy()

# Encode sex numerically for the classifier
sex_map = {'male': 0, 'female': 1}
train['sex'] = train.sex.map(sex_map)
validate['sex'] = validate.sex.map(sex_map)
test['sex'] = test.sex.map(sex_map)

train.head(3)

,survived,sex,pclass,alone,fare
583,0,0,1,1,40.125
165,1,0,3,0,20.525
50,0,0,3,0,39.6875
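
Mapping with a dict turns any value missing from the dict into NaN. The sketch below, on a hypothetical series with one mis-cased value, shows one way to catch that before fitting:

```python
import pandas as pd

encoding = {"male": 0, "female": 1}

# Hypothetical raw column; "Female" is not a key in the mapping.
raw_sex = pd.Series(["male", "female", "Female", "male"])
encoded_sex = raw_sex.map(encoding)

# Values that failed to map show up as NaN in the encoded column.
unmapped = raw_sex[encoded_sex.isna()].unique()
print(unmapped)
```

An empty result means every value was encoded.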


In [50]:
### Separate features (X) and target (y) ###
X_train, y_train = train.drop(columns='survived'), train.survived
X_validate, y_validate = validate.drop(columns='survived'), validate.survived
X_test, y_test = test.drop(columns='survived'), test.survived

In [51]:
### Create Decision Tree Model ###
clf = DecisionTreeClassifier(max_depth=3, random_state=123)
clf = clf.fit(X_train, y_train)
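
To see which splits a fitted tree learned, scikit-learn's export_text prints the decision rules as plain text. A minimal sketch on toy data (the data and the `toy_clf` name are assumptions, chosen so that sex alone separates the classes):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: sex perfectly predicts the target here.
X_toy = pd.DataFrame({
    "sex": [0, 0, 0, 0, 1, 1, 1, 1],
    "pclass": [1, 2, 3, 3, 1, 2, 3, 3],
})
y_toy = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])

toy_clf = DecisionTreeClassifier(max_depth=3, random_state=123).fit(X_toy, y_toy)

# Text rendering of the learned splits, one line per node.
rules = export_text(toy_clf, feature_names=list(X_toy.columns))
print(rules)
```

The same call should work on the notebook's clf with list(X_train.columns) as the feature names.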

In [54]:
### Show score, make predictions ###
print("Score:", clf.score(X_train, y_train))
y_pred = clf.predict(X_train)

Score: 0.8232931726907631


In [58]:
### Print confusion matrix and classification report ###
print("Confusion Matrix")
print(confusion_matrix(y_train, y_pred), "\n")

print("Classification Report")
report = pd.DataFrame(classification_report(y_train, y_pred, output_dict=True))
print(report)

Confusion Matrix
[[276  31]
 [ 57 134]] 

Classification Report
                    0           1  accuracy   macro avg  weighted avg
precision    0.828829    0.812121  0.823293    0.820475      0.822421
recall       0.899023    0.701571  0.823293    0.800297      0.823293
f1-score     0.862500    0.752809  0.823293    0.807654      0.820430
support    307.000000  191.000000  0.823293  498.000000    498.000000
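
The report values for class 1 can be reproduced by hand from the confusion matrix above. In scikit-learn's layout, rows are actual classes and columns are predicted classes, so the matrix reads [[TN, FP], [FN, TP]]:

```python
# Confusion matrix for the max_depth=3 model on train.
tn, fp = 276, 31
fn, tp = 57, 134

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # of predicted survivors, share who survived
recall = tp / (tp + fn)                      # of actual survivors, share predicted
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.6f} precision={precision:.6f} "
      f"recall={recall:.6f} f1={f1:.6f}")
# accuracy=0.823293 precision=0.812121 recall=0.701571 f1=0.752809
```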


In [59]:
### Try a model with different max_depth ###
clf_alt = DecisionTreeClassifier(max_depth=1, random_state=123)
clf_alt = clf_alt.fit(X_train, y_train)

print("Score:", clf_alt.score(X_train, y_train))
y_pred_alt = clf_alt.predict(X_train)

print("Confusion Matrix")
print(confusion_matrix(y_train, y_pred_alt), "\n")

print("Classification Report")
report = pd.DataFrame(classification_report(y_train, y_pred_alt, output_dict=True))
print(report)

Score: 0.7991967871485943
Confusion Matrix
[[265  42]
 [ 58 133]] 

Classification Report
                    0           1  accuracy   macro avg  weighted avg
precision    0.820433    0.760000  0.799197    0.790217      0.797255
recall       0.863192    0.696335  0.799197    0.779764      0.799197
f1-score     0.841270    0.726776  0.799197    0.784023      0.797358
support    307.000000  191.000000  0.799197  498.000000    498.000000


In-sample (train) accuracy: Model 1 (max_depth=3) scores 82.3% and Model 2 (max_depth=1) scores 79.9%, so Model 1 performs better on in-sample data.

In [60]:
### Run models against out-of-sample (validate) data ###
print("Model 1 score:", clf.score(X_validate, y_validate))
print("Model 2 score:", clf_alt.score(X_validate, y_validate))

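Model 1 score: 0.7850467289719626
Model 2 score: 0.7616822429906542

Out-of-sample (validate) accuracy: Model 1 (max_depth=3) scores 78.5% and Model 2 (max_depth=1) scores 76.2%, so Model 1 also performs better on out-of-sample data.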
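
Comparing two depths by hand generalizes to a loop over several depths, recording train and validate accuracy and the gap between them. The sketch below uses synthetic data from make_classification rather than the Titanic features, so its numbers say nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: stands in for X_train / X_validate).
X_syn, y_syn = make_classification(n_samples=500, n_features=4, n_informative=2,
                                   n_redundant=0, random_state=123)
X_tr, X_val, y_tr, y_val = train_test_split(X_syn, y_syn, test_size=0.3,
                                            random_state=123, stratify=y_syn)

results = []
for depth in range(1, 6):
    model = DecisionTreeClassifier(max_depth=depth, random_state=123).fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    val_acc = model.score(X_val, y_val)
    # A growing gap between train and validate accuracy hints at overfitting.
    results.append((depth, train_acc, val_acc, train_acc - val_acc))

for depth, train_acc, val_acc, gap in results:
    print(f"max_depth={depth}: train={train_acc:.3f} "
          f"validate={val_acc:.3f} gap={gap:.3f}")
```

A reasonable choice is often the depth with the highest validate accuracy and a small gap, with the test subset held back for a final check.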


## Telco Data - Decision Tree