# Decision Tree Classifier

An example showing how [scikit-learn's Decision Trees](https://scikit-learn.org/stable/modules/tree.html) can be used to solve classification problems, here by trying to predict which passengers survived the sinking of the Titanic.


In [None]:
# %pip install --quiet --upgrade pip 
# %pip install numpy --quiet
# %pip install PyArrow --quiet
# %pip install Pandas --quiet
# %pip install scikit-learn --quiet

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree

In [2]:
titanic_data = pd.read_csv("Data/titanic_train.csv")
titanic_data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

## Women and children first

Our first hypothesis is that women and children were more likely to be given a place on the lifeboats and were therefore more likely to survive. The Sex and Age columns will therefore be used as predictors.
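Before building a model, a hypothesis like this can be sanity-checked by comparing survival rates across groups. The sketch below shows the idea on a handful of made-up rows (they are not taken from the Titanic data), and the age cut-off of 16 for "child" is an assumption chosen only for illustration.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT real Titanic records.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Age": [29.0, 8.0, 40.0, 5.0, 35.0, 50.0],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# Assumed threshold: anyone younger than 16 is treated as a child.
toy["AgeGroup"] = toy["Age"].lt(16).map({True: "child", False: "adult"})

# The mean of a 0/1 column is the fraction of survivors in each group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_age_group = toy.groupby("AgeGroup")["Survived"].mean()

print(rate_by_sex)
print(rate_by_age_group)
```

Running the same two `groupby` calls on `titanic_data` would show whether the real data supports the hypothesis.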

## Data Wrangling 

scikit-learn's [Decision Tree](https://scikit-learn.org/stable/modules/tree.html) does not support categorical variables (see: [#5442](https://github.com/scikit-learn/scikit-learn/issues/5442)). We therefore need to one-hot encode any categorical variable we want to use as a predictor, turning each category into its own numeric 0/1 column.

In [4]:

gender_categories = ["Sex_male", "Sex_female"]
titanic_data = titanic_data.drop(gender_categories, axis=1, errors="ignore") # remove the one-hot columns (if we previously ran this cell)
titanic_data["SexTemp"] = titanic_data["Sex"] # get_dummies removes the column it encodes, so encode a copy and keep "Sex"
titanic_data = pd.get_dummies(titanic_data, prefix="Sex", columns=["SexTemp"], dtype=float)
titanic_data[gender_categories].value_counts()

Sex_male  Sex_female
1.0       0.0           577
0.0       1.0           314
Name: count, dtype: int64

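Now split the data into training and validation sets so that we can evaluate the model on rows it has not seen during training. Here 20% of the rows are held out for validation, and `random_state` is fixed so that the split is reproducible.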

In [5]:
train, validate = train_test_split(titanic_data, test_size=0.2, random_state=42)
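One way the two sets might be used is sketched below: fit a tree on the training rows and score it on the held-out rows. This is a minimal illustration and not necessarily the approach this notebook takes. The rows are made up (not real Titanic records), and the `predictors` list and `max_depth=3` are assumptions chosen for the example.

```python
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Toy rows for illustration only; these are NOT real Titanic records.
toy = pd.DataFrame({
    "Sex_male":   [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "Sex_female": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "Age":        [22.0, 38.0, 4.0, 35.0, 54.0, 2.0, 27.0, 14.0, 40.0, 58.0],
    "Survived":   [0, 1, 1, 1, 0, 1, 0, 1, 0, 1],
})
predictors = ["Sex_male", "Sex_female", "Age"]  # assumed predictor columns

# Same split settings as above: hold out 20% of rows, fixed seed.
train, validate = train_test_split(toy, test_size=0.2, random_state=42)

# max_depth=3 is an arbitrary choice to keep the tree small.
model = tree.DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(train[predictors], train["Survived"])

# score() returns the fraction of validation rows classified correctly.
accuracy = model.score(validate[predictors], validate["Survived"])
print(f"train rows: {len(train)}, validation rows: {len(validate)}")
print(f"validation accuracy: {accuracy:.2f}")
```

On the real data, the Age column may contain missing values, which some scikit-learn versions reject, so those rows might need to be filled in or dropped before fitting.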