# Showcasing C4.5 with the Titanic dataset
This notebook contains a C4.5 decision tree fitted on the Titanic dataset, using two categorical features (Pclass and Sex) and one continuous feature (Age).

Additional packages necessary to run this notebook:
 - Pandas

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import tree

from decision_mining.core import c45

# Loading data
Titanic dataset. We use the columns "Pclass", "Sex" and "Age" as input, and "Survived" as output.
- Pclass is the passenger class. This column contains the classes 1, 2 and 3.
- Sex is the gender listed for the passenger. This column contains the classes "male" and "female".
- Age is the age of the passenger. This column is continuous.
- Survived indicates whether the passenger survived the disaster. It contains the classes 1 (survived) and 0 (did not survive).

In [2]:
data = pd.read_csv(r"https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv", usecols=["Sex", "Pclass", "Survived", "Age"])
data = data[["Pclass", "Sex", "Age", "Survived"]]

In [3]:
data.head()

|   | Pclass | Sex | Age | Survived |
|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 0 |
| 1 | 1 | female | 38.0 | 1 |
| 2 | 3 | female | 26.0 | 1 |
| 3 | 1 | female | 35.0 | 1 |
| 4 | 3 | male | 35.0 | 0 |


In [4]:
data.describe(include="all")

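|   | Pclass | Sex | Age | Survived |
|---|---|---|---|---|
| count | 887.0 | 887 | 887.0 | 887.0 |
| unique | NaN | 2 | NaN | NaN |
| top | NaN | male | NaN | NaN |
| freq | NaN | 573 | NaN | NaN |
| mean | 2.305524 | NaN | 29.471443 | 0.385569 |
| std | 0.836662 | NaN | 14.121908 | 0.487004 |
| min | 1.0 | NaN | 0.42 | 0.0 |
| 25% | 2.0 | NaN | 20.25 | 0.0 |
| 50% | 3.0 | NaN | 28.0 | 0.0 |
| 75% | 3.0 | NaN | 38.0 | 1.0 |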


In [5]:
X = data.drop("Survived", axis=1).to_numpy()
y = data["Survived"].to_numpy()

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [7]:
X_train[:,1] = X_train[:,1] == "female"
X_test[:,1] = X_test[:,1] == "female"

In [8]:
X_train = X_train.astype(int)
X_test = X_test.astype(int)
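
The two cells above encode "Sex" as 0/1 and then cast every column to `int`. A minimal sketch of the same steps on three made-up rows (the values below are hypothetical, not taken from the dataset) shows one side effect worth knowing: the cast truncates fractional ages, so an age of 0.42 becomes 0.

```python
import numpy as np

# Hypothetical rows in the notebook's column order: Pclass, Sex, Age
X = np.array(
    [[3, "male", 22.0], [1, "female", 0.42], [2, "female", 28.5]],
    dtype=object,
)

# Encode Sex as a boolean: True for "female", False for "male"
X[:, 1] = X[:, 1] == "female"

# Cast everything to int: booleans become 0/1, ages lose their fraction
X_int = X.astype(int)
print(X_int.tolist())
```

Whether the truncation matters depends on how finely the tree needs to split on age; keeping the column as `float` would avoid it, assuming the classifier accepts float input.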

# Comparing Performance
We will be comparing the performance of C4.5 with scikit-learn's CART.

#### C4.5's speed

In [9]:
%%timeit
predictor = c45.C45Classifier(np.array([2]))
predictor.fit(X_train, y_train)

3.2 s ± 87.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### C4.5's accuracy

In [10]:
predictor = c45.C45Classifier(np.array([2]))
predictor.fit(X_train, y_train)
predictor.score(X_test, y_test)

0.7567567567567568

#### CART's speed

In [11]:
%%timeit
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

2.17 ms ± 223 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


#### CART's accuracy

In [12]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.7477477477477478

### Conclusion
#### Only categorical
scikit-learn's CART is about 2.5 times faster than our C4.5 algorithm. 2. milliseconds is, however, still fast enough for our use. Our implementation is also considerably faster than the C4.5 implementation from our predecessors, which took 350 milliseconds on average.

Accuracy-wise, CART and C4.5 appear to give exactly the same results.
#### Categorical and continuous
scikit-learn's CART is about 1500 times faster than our C4.5 algorithm when using both categorical and continuous values (roughly 3.2 s versus 2.17 ms in the timings above). Our implementation is fairly slow in this setting, but likely still fast enough for this use case.

The original handling of continuous values by Ross Quinlan (the version we implemented) evaluates the GainRatio equation `n-1` times per tree node, where `n` is the number of samples in the subset at that node. The implementation by our predecessors does this only once, instead of once for each tree node. This *could* mean that our predecessors' implementation is faster, but it is likely also less accurate.
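
To make the `n-1` evaluations concrete, here is a minimal sketch of a threshold search for one continuous attribute at one node. It is not the code from `decision_mining`; the function names are hypothetical, and it assumes the midpoint between adjacent sorted values is used as the candidate threshold.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Evaluate the gain ratio at each of the n - 1 positions between
    adjacent sorted samples and return (threshold, gain_ratio) of the best."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([label for _, label in pairs])
    best = (None, 0.0)
    for i in range(1, n):  # n - 1 candidate split positions
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # identical values cannot be separated by a threshold
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        p_left = i / n
        gain = base - p_left * entropy(left) - (1 - p_left) * entropy(right)
        # Split information: entropy of the left/right partition sizes
        split_info = entropy(["L"] * i + ["R"] * (n - i))
        ratio = gain / split_info
        if ratio > best[1]:
            best = ((low + high) / 2, ratio)
    return best


# Made-up ages and survival labels, for illustration only
ages = [3, 5, 10, 20, 35, 40]
survived = [1, 1, 1, 0, 0, 0]
threshold, ratio = best_threshold(ages, survived)
print(threshold, ratio)
```

Because this loop runs again for every node on that node's subset of samples, the cost grows with both the tree size and the number of distinct values, which is consistent with the slower fit times measured above.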

Accuracy-wise, our C4.5 implementation appears to yield slightly better results on this split (0.757 versus 0.748).
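
To put the two scores in perspective, the sketch below assumes that `score` returns plain accuracy (as scikit-learn's classifiers do; for `C45Classifier` this is an assumption) and that `train_test_split` used its default 25% test size, which gives 222 test rows out of 887. The `accuracy` helper is hypothetical and only illustrates the definition.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Assumed test-set size: ceil(0.25 * 887) = 222
n_test = 222

# Convert the reported scores back into counts of correct predictions
c45_correct = round(0.7567567567567568 * n_test)
cart_correct = round(0.7477477477477478 * n_test)
print(c45_correct, cart_correct, c45_correct - cart_correct)
```

Under these assumptions the gap between the two models comes down to a couple of test passengers, so a different `random_state` could plausibly change which model comes out ahead.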

### Who survives?
First of all, it appears that women, no matter the passenger class, are predicted to survive.

In [13]:
from INNO.core.dmn import rule_c45, rule, dmn_generation as dmn

In [14]:
cols = [["Pclass", "Sex", "Age", "Survived"]]
drd_objects = dmn.create_node_objects(cols)
decision_nodes = dmn.create_dependencies(cols, drd_objects)
rules = rule_c45.make_c45_rules([0, 1, 2], c45.traverse_c45(predictor))

decision_nodes[0].rules = rules
dmn.create_xml(drd_objects, decision_nodes)

<xml.etree.ElementTree.ElementTree at 0x1b38cecb6a0>

In [15]:
clf.predict([[1, 1, 20], [2, 1, 20], [3, 1, 5]])  # 1 == female

array([1, 1, 1], dtype=int64)

In [16]:
predictor.predict(np.array([[1, 1, 20], [2, 1, 20], [3, 1, 5]], dtype=int))  # 1 == female

array([1, 1, 1], dtype=int64)

Second of all, it appears that adult men, no matter the passenger class, do not survive. Male children do appear to survive.

In [17]:
predictor.predict(np.array([[1, 0, 20], [2, 0, 10], [3, 0, 3]], dtype=int))  # 0 == male

array([0, 1, 1], dtype=int64)

In [18]:
clf.predict([[1, 0, 20], [2, 0, 20], [3, 0, 20]])  # 0 == male

array([0, 0, 0], dtype=int64)