# Decision Tree Classifier

An example showing how [scikit-learn's decision trees](https://scikit-learn.org/stable/modules/tree.html) can be used to solve classification problems, here by predicting which passengers survived the sinking of the Titanic.


In [1]:
# %pip install --quiet --upgrade pip 
# %pip install numpy --quiet
# %pip install PyArrow --quiet
# %pip install Pandas --quiet
# %pip install scikit-learn --quiet

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import tree

In [3]:
titanic_data = pd.read_csv("Data/titanic_train.csv")
titanic_data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

## Women and children first

Our first hypothesis is that women and children were more likely to be given a place on the lifeboats and were therefore more likely to survive. Based on this hypothesis, we will use the Sex and Age columns as predictors of survival.

## Data Wrangling 

scikit-learn's [decision tree](https://scikit-learn.org/stable/modules/tree.html) implementation does not support categorical variables (see: [#5442](https://github.com/scikit-learn/scikit-learn/issues/5442)). We therefore need to one-hot encode any categorical variables we want to use as predictors, turning each category into its own 0/1 column.
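
As a small illustration of what one-hot encoding produces, the sketch below applies `pd.get_dummies` to a toy DataFrame. The three rows are made up for this example and are not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical toy data: three passengers with a categorical Sex column
toy = pd.DataFrame({"Sex": ["male", "female", "female"]})

# Each distinct value becomes its own column holding 1.0 or 0.0;
# the encoded column itself is removed from the result
encoded = pd.get_dummies(toy, prefix="Sex", columns=["Sex"], dtype=float)
print(encoded)
```

Note that `get_dummies` drops the column it encodes, which is why the helper function in this notebook encodes a temporary copy instead.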

In [4]:
def onehot_encode(df: pd.DataFrame, column_name: str) -> tuple[pd.DataFrame, list[str]]:
    # Names of the dummy columns get_dummies will create (it skips missing values by default)
    categories = [f"{column_name}_{value}" for value in df[column_name].dropna().unique()]
    # Remove dummy columns left over from a previous call so the function can be re-run
    df = df.drop(categories, axis=1, errors="ignore")
    # get_dummies removes the column it encodes, so encode a temporary copy and keep the original
    temp_column_name = f"{column_name}_Temp"
    df[temp_column_name] = df[column_name]
    df = pd.get_dummies(df, prefix=column_name, columns=[temp_column_name], dtype=float)
    return df, categories

In [5]:

titanic_data, gender_categories = onehot_encode(titanic_data, "Sex")
titanic_data[gender_categories].value_counts()

Sex_male  Sex_female
1.0       0.0           577
0.0       1.0           314
Name: count, dtype: int64

Now split the data into training and validation sets so we can evaluate the success of our model.

In [6]:
train, validate = train_test_split(titanic_data, test_size=0.2, random_state=42)
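
To illustrate what `test_size` and `random_state` do, here is a minimal sketch on a hypothetical ten-row DataFrame (not the Titanic data). It splits the same frame twice with the same seed and compares the results.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with ten rows
toy = pd.DataFrame({"row_id": range(10)})

# Two splits with identical arguments
train_a, validate_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, validate_b = train_test_split(toy, test_size=0.2, random_state=42)

# Sizes of the two parts, and whether both splits picked the same validation rows
print(len(train_a), len(validate_a))
print(validate_a.index.equals(validate_b.index))
```

Fixing `random_state` makes the split reproducible, so models trained in different cells can be evaluated on the same validation rows.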

## Creating the model

In [7]:
predictors = ["Age"] + gender_categories
prediction = "Survived"

x = train[predictors]
y = train[prediction].values  # 1-D array of survival labels

decision_tree = tree.DecisionTreeClassifier(max_depth=2, random_state=42)
decision_tree.fit(x, y)

print(tree.export_text(decision_tree, feature_names=predictors))

|--- Sex_male <= 0.50
|   |--- Age <= 21.50
|   |   |--- class: 1
|   |--- Age >  21.50
|   |   |--- class: 1
|--- Sex_male >  0.50
|   |--- Age <= 6.50
|   |   |--- class: 1
|   |--- Age >  6.50
|   |   |--- class: 0
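
The printed tree can be read as nested if/else rules. The sketch below hand-transcribes the depth-2 tree above into a plain Python function. The name `predict_survived` is hypothetical, the function ignores missing ages, and it only illustrates how the thresholds are applied; it is not how scikit-learn evaluates the tree internally.

```python
def predict_survived(sex_male: float, age: float) -> int:
    """Hand-written copy of the depth-2 tree printed above."""
    if sex_male <= 0.5:      # Sex_male is 0.0, i.e. the passenger is female
        if age <= 21.5:
            return 1         # younger women: predicted to survive
        return 1             # older women: also predicted to survive
    if age <= 6.5:
        return 1             # boys aged 6 or younger: predicted to survive
    return 0                 # all other males: predicted not to survive


print(predict_survived(sex_male=0.0, age=30))
print(predict_survived(sex_male=1.0, age=5))
print(predict_survived(sex_male=1.0, age=40))
```

Both female branches end in class 1, so the age split on that side does not change the prediction; only the age split for males does.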



## Evaluate the model

In [8]:
predictions = decision_tree.predict(validate[predictors])
actuals = validate[prediction].values

score = accuracy_score(actuals, predictions)
print(f'Simple "Women and children first" hypothesis has accuracy of: {score * 100:.2f}%')

Simple "Women and children first" hypothesis has accuracy of: 78.21%
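
Accuracy here is simply the fraction of validation rows whose predicted label matches the actual label. A minimal sketch with made-up labels (hypothetical values, not taken from the Titanic data) compares a manual calculation with `accuracy_score`:

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels for five passengers
toy_actuals = [1, 0, 1, 1, 0]
toy_predictions = [1, 0, 0, 1, 1]

# Count the positions where prediction and actual agree, then divide by the total
matches = sum(a == p for a, p in zip(toy_actuals, toy_predictions))
manual_accuracy = matches / len(toy_actuals)

# accuracy_score computes the same fraction
library_accuracy = accuracy_score(toy_actuals, toy_predictions)

print(manual_accuracy, library_accuracy)
```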


## Is class a factor?

Let's see whether we can improve the accuracy of our decision tree by adding the ticket class to the model.
Our hypothesis here is that 1st class passengers were closer to the lifeboats and could reach them more easily.

In [9]:
titanic_data, class_categories = onehot_encode(titanic_data, "Pclass")
titanic_data[class_categories].value_counts()

Pclass_3  Pclass_1  Pclass_2
1.0       0.0       0.0         491
0.0       1.0       0.0         216
          0.0       1.0         184
Name: count, dtype: int64

In [10]:
train, validate = train_test_split(titanic_data, test_size=0.2, random_state=42)
predictors = ["Age"] + gender_categories + class_categories
prediction = "Survived"

x = train[predictors]
y = train[prediction].values  # 1-D array of survival labels

decision_tree = tree.DecisionTreeClassifier(max_depth=3, random_state=42)
decision_tree.fit(x, y)

print(tree.export_text(decision_tree, feature_names=predictors))

|--- Sex_male <= 0.50
|   |--- Pclass_3 <= 0.50
|   |   |--- Age <= 2.50
|   |   |   |--- class: 0
|   |   |--- Age >  2.50
|   |   |   |--- class: 1
|   |--- Pclass_3 >  0.50
|   |   |--- Age <= 36.50
|   |   |   |--- class: 1
|   |   |--- Age >  36.50
|   |   |   |--- class: 0
|--- Sex_male >  0.50
|   |--- Age <= 6.50
|   |   |--- Pclass_3 <= 0.50
|   |   |   |--- class: 1
|   |   |--- Pclass_3 >  0.50
|   |   |   |--- class: 0
|   |--- Age >  6.50
|   |   |--- Pclass_1 <= 0.50
|   |   |   |--- class: 0
|   |   |--- Pclass_1 >  0.50
|   |   |   |--- class: 0



In [11]:
predictions = decision_tree.predict(validate[predictors])
actuals = validate[prediction].values

score = accuracy_score(actuals, predictions)
print(f'"Women and children first (as long as you are 1st class)" hypothesis has accuracy of: {score * 100:.2f}%')

"Women and children first (as long as you are 1st class)" hypothesis has accuracy of: 80.45%
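
To put the two scores in perspective, the percentages can be converted back into passenger counts. The sketch below assumes the 891 rows implied by the value counts earlier (577 + 314) and assumes that `train_test_split` rounds the validation size up, which is my understanding of its behaviour; treat the numbers as a back-of-the-envelope check rather than a verified result.

```python
import math

n_rows = 577 + 314                       # rows in the data set, from the Sex value counts
n_validate = math.ceil(0.2 * n_rows)     # assumed size of the validation set

# Convert the reported accuracies (78.21% and 80.45%) back to whole passengers
first_correct = round(0.7821 * n_validate)
second_correct = round(0.8045 * n_validate)

print(n_validate, first_correct, second_correct, second_correct - first_correct)
```

Under these assumptions the second model gets only a handful more validation passengers right than the first, so the improvement from a single split should be read with some caution.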
