# Decision Trees
Adam Haile - 2/20/2025

Lastly, let's check out decision trees on a different dataset. For this one, we'll use another common toy dataset: the **Titanic dataset**.

In [1]:
# Install these if you don't have them.
# %pip install numpy matplotlib scikit-learn pandas

## Import packages + data

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

In [None]:
df = pd.read_csv('Titanic-Dataset.csv')
df.head()

Looking at our data, we can see a good number of features, along with our target variable. In this case, we want to use our features to determine **whether or not a passenger would survive**.

Let's extract the features that are relevant and numerical (or easy to make numerical) for our model to use.

In [None]:
# Extract features
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
df = df[features + ["Survived"]]

# Handle missing values (fill age with median)
df = df.assign(Age=df["Age"].fillna(df["Age"].median()))

# Encode categorical variables (Sex: male=1, female=0)
df["Sex"] = LabelEncoder().fit_transform(df["Sex"])

df.head()

Let's break down what we just did.
1. We extracted 6 features for us to use.
    * Pclass: The class of the ticket of the passenger (1st, 2nd, 3rd)
    * Sex: The gender of the passenger (male, female)
    * Age: The age of the passenger
    * SibSp: Number of siblings/spouses that were also aboard the Titanic
    * Parch: Number of parents/children that were also aboard the Titanic
    * Fare: The cost of their ticket.

    We also kept our target variable (Survived).

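2. We fill in any missing values in our Age column with the median age (a very naive approach, but good enough for our example).
3. We convert the categorical values in Sex to integer labels (female=0, male=1). Strictly speaking this is label encoding, not one-hot encoding, but for a column with only two categories a single 0/1 column carries the same information.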
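
To make steps 2 and 3 concrete, here is a minimal sketch on a tiny made-up table. The `toy` DataFrame and its values are assumptions for illustration only, not rows from the Titanic CSV.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data (not taken from the Titanic dataset).
toy = pd.DataFrame({
    "Age": [22.0, None, 38.0, 30.0],
    "Sex": ["male", "female", "female", "male"],
})

# The median of the non-missing ages (22, 30, 38) is 30, so the gap becomes 30.
toy = toy.assign(Age=toy["Age"].fillna(toy["Age"].median()))

# LabelEncoder sorts the class names, so "female" maps to 0 and "male" to 1.
encoder = LabelEncoder()
toy["Sex"] = encoder.fit_transform(toy["Sex"])

print(list(encoder.classes_))
print(toy)
```

Checking `encoder.classes_` is a quick way to confirm which label got which integer before interpreting a split on that column.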

Now we'll split our data and train a decision tree!

In [7]:
X = df[features]
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

Great! Let's see how well it does at predicting the survival of passengers.

In [None]:
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

f"{accuracy * 100:.2f}%"

Awesome! We can see that our decision tree reaches almost 80% accuracy at predicting survival! Let's visualize some of our results now and interpret how we got here. Let's start with seeing what our trained decision tree looks like.
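
An accuracy figure is easier to judge next to a baseline. The sketch below computes accuracy by hand and compares it with always guessing the most common class. The labels and predictions are made up for illustration and are not the output of the model above; `fraction_correct` is a hypothetical helper name.

```python
# Hypothetical labels and predictions (not produced by the notebook's model).
true_labels = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]
predicted_labels = [0, 1, 0, 1, 1, 0, 0, 0, 0, 1]


def fraction_correct(truth, predicted):
    """Accuracy: the share of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(truth, predicted))
    return matches / len(truth)


model_acc = fraction_correct(true_labels, predicted_labels)

# Baseline: always predict whichever class is most common in the labels.
majority_class = max(set(true_labels), key=true_labels.count)
baseline_acc = fraction_correct(true_labels, [majority_class] * len(true_labels))

print(f"model: {model_acc:.2f}, majority baseline: {baseline_acc:.2f}")
```

On the real data, the same comparison can be made by checking how often `y_test` equals its most common value.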

In [None]:
plt.figure(figsize=(12, 6))
# Pass a plain list: some scikit-learn versions reject a pandas Index here.
plot_tree(clf, feature_names=list(X.columns), class_names=["Not Survived", "Survived"], filled=True)
plt.show()

We can interpret our tree by reading the values on the nodes.

The first line of each non-leaf node tells us the feature the node tests and its threshold. Because all of our features are numerical, every decision has the form `feature <= threshold`, where the threshold is learned during training.
Below that we see the Gini index (Gini impurity) of the node. This tells us how mixed the classes are among the samples that reach the node. Remember that the lower the Gini index, the purer the node: 0 means every sample belongs to a single class, and 0.5 is the maximum for two classes (an even mix).
We can also see the number of samples that reached the node (`samples`) and how those samples are distributed across the two classes (`value`).
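
As a quick check on how the Gini number relates to the class breakdown of a node, here is a minimal sketch that computes it as 1 minus the sum of the squared class proportions. The function name `gini_impurity` and the example counts are assumptions for illustration.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node, given the number of samples in each class."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


print(gini_impurity([100, 0]))  # a pure node: one class only
print(gini_impurity([50, 50]))  # an even two-class mix: the binary maximum
print(gini_impurity([90, 10]))  # mostly one class: about 0.18
```

Plugging the class breakdown of any node in the plotted tree into this function should reproduce that node's `gini` line, up to rounding.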

Each node also shows its majority class. For the leaf nodes, this is the exact class the tree predicts for any input that reaches that point in the tree.
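
If the plot is hard to read, the same structure can be printed as text rules with scikit-learn's `export_text`. The sketch below uses a small made-up dataset with two columns loosely named after `Sex` and `Age`; the data and the `toy_clf` name are assumptions, not the notebook's model.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy rows: [Sex, Age], with Sex encoded as female=0, male=1.
X_toy = [[0, 30], [0, 25], [1, 40], [1, 35], [0, 50], [1, 20]]
y_toy = [1, 1, 0, 0, 1, 0]

toy_clf = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_clf.fit(X_toy, y_toy)

# Each indented line is a test on a feature; "class:" lines are leaf predictions.
rules = export_text(toy_clf, feature_names=["Sex", "Age"])
print(rules)
```

In this toy data the first column separates the classes perfectly, so the tree needs only one split before reaching its leaves.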

Let's check out another way we can visualize decision trees. We can look at the **decision boundaries** of the nodes in **feature space**.

In [None]:
# For this example, we'll only use two features (SibSp and Parch) so we can plot in 2D.

# Select columns 3 and 4 of the training features (SibSp and Parch)
X_vis = X_train.iloc[:, 3:5].values
y_vis = y_train.values

# Train a new decision tree using only these two features
clf_vis = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_vis.fit(X_vis, y_vis)

# Generate a mesh grid for plotting decision boundaries
x_min, x_max = X_vis[:, 0].min() - 1, X_vis[:, 0].max() + 1
y_min, y_max = X_vis[:, 1].min() - 1, X_vis[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))

# Predict class for each point in the grid
Z = clf_vis.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.figure(figsize=(12, 6))
# Pass a plain list: some scikit-learn versions reject a pandas Index here.
plot_tree(clf_vis, feature_names=list(X_train.columns[3:5]), class_names=["Not Survived", "Survived"], filled=True)
plt.show()

# Plot decision boundary
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.coolwarm)
scatter = plt.scatter(X_vis[:, 0], X_vis[:, 1], c=y_vis, edgecolors="k", cmap=plt.cm.coolwarm)
plt.xlabel(X.columns[3])
plt.ylabel(X.columns[4])
plt.title("Decision Tree Decision Boundaries (SibSp and Parch)")
plt.legend(handles=scatter.legend_elements()[0], labels=["Not Survived", "Survived"])
plt.show()

From this, we can see where the model chose to divide the data into the different classes based on the SibSp and Parch features. Each background color marks a region of feature space in which the model predicts the same class for every input.
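
Every edge between colored regions comes from one internal node of the tree: a split on the first feature is a vertical line, and a split on the second is a horizontal line. Here is a minimal sketch of reading those thresholds from a fitted tree's `tree_` attribute. The toy data and the names `X_toy`, `toy_clf`, and `splits` are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical 2-feature data, loosely labelled SibSp and Parch.
feature_labels = ["SibSp", "Parch"]
X_toy = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [3, 0], [4, 0], [3, 1], [4, 2]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

toy_clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)
tree = toy_clf.tree_

# Leaves have identical (sentinel) left and right children; internal nodes do not.
splits = []
for node in range(tree.node_count):
    if tree.children_left[node] != tree.children_right[node]:
        splits.append((feature_labels[tree.feature[node]], float(tree.threshold[node])))

for name, threshold in splits:
    print(f"boundary at {name} = {threshold}")
```

Running the same loop on `clf_vis` would list the thresholds behind the boundaries in the plot above.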

Let's check out the point at (3, 0). This point in feature space represents a passenger with 3 siblings/spouses on board and 0 parents/children on board. Our model has learned that, generally, anyone who had more than 2.5 siblings/spouses on board would survive.

One plausible explanation is that someone in this position wouldn't have children to worry about getting off first, and they would have other adults to help them get off.

At this point, I would highly encourage you to try swapping the plain decision tree model from scikit-learn for some others. Maybe try a [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) or an [XGBoost](https://xgboost.readthedocs.io/en/stable/) model, and see if you can get better results!
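
As a starting point, here is a minimal sketch of the swap. It uses synthetic data from `make_classification` rather than the Titanic CSV so that it runs on its own; the variable names and hyperparameters are assumptions, and the scores it prints say nothing about how the models compare on the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 6 features (not the Titanic dataset).
X_syn, y_syn = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=42
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Both models share the fit/predict interface, so the swap is one line.
models = {
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(Xs_train, ys_train)
    scores[name] = accuracy_score(ys_test, model.predict(Xs_test))
    print(f"{name}: {scores[name] * 100:.2f}%")
```

To try this on the notebook's data, replace the synthetic arrays with `X_train`, `X_test`, `y_train`, and `y_test` from the split above.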