# Decision Tree Model

Decision trees are supervised learning algorithms that can be used for both classification and regression. They recursively split the data into subsets based on the values of input features, producing a tree in which each internal node tests a feature, each branch corresponds to an outcome of that test, and each leaf holds the predicted class or value. Because a prediction can be traced as a short sequence of simple rules, decision trees are easy to interpret and visualize, which makes them a popular choice for many machine learning tasks.
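
To make the idea of "splitting on a feature" concrete, here is a minimal sketch of how a single split could be chosen by minimizing Gini impurity, which is the default criterion of scikit-learn's `DecisionTreeClassifier`. The helper names `gini` and `best_threshold` and the toy ages are hypothetical and only for illustration; the real implementation is considerably more elaborate.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`."""
    best = (None, gini(labels))  # baseline: impurity without any split
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Impurity of the split, weighted by the size of each side
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: passenger ages and whether each passenger survived
ages = [4, 8, 30, 35, 50, 62]
survived = [1, 1, 0, 1, 0, 0]

threshold, impurity = best_threshold(ages, survived)
print(threshold, impurity)
```

A full tree repeats this search over every feature and then recurses into the left and right subsets until a stopping rule such as `max_depth` is reached.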

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import joblib

## Loading the data

In [None]:
data_path = 'cleaned_titanic_dataset.csv'

# Load the dataset
df = pd.read_csv(data_path)
df.reset_index(drop=True, inplace=True)

In [None]:
# Get feature and target variables
y = df['survived']
X = df.drop('survived', axis=1)

In [None]:
# Get categorical and numerical columns
categorical_cols = ['pclass', 'sex', 'embarked']
numeric_cols = [col for col in X.columns if col not in categorical_cols]

## Numbers are better than words

scikit-learn's decision trees only accept numeric input, so categorical values such as text labels must be converted to numbers first. We use one-hot encoding (scikit-learn's `OneHotEncoder`) to turn each categorical column into numeric columns that the tree can split on.

One-hot encoding creates a new binary column for each category in the original column. For example, if we have a column "Color" with three categories, "Red", "Green", and "Blue", one-hot encoding creates three new columns: "Color_Red", "Color_Green", and "Color_Blue". Each row has a value of 1 in the column corresponding to its original category and 0 in the others.
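
The "Color" example can be sketched in plain Python as follows. This is only an illustration of the idea, not how `OneHotEncoder` is implemented; the `one_hot` helper is hypothetical.

```python
def one_hot(values, prefix):
    """Return one dict per row with a 0/1 entry for every category seen in `values`."""
    categories = sorted(set(values))  # fixed, sorted column order
    return [
        {f"{prefix}_{cat}": int(value == cat) for cat in categories}
        for value in values
    ]

colors = ["Red", "Green", "Blue", "Green"]
encoded = one_hot(colors, prefix="Color")
for row in encoded:
    print(row)
```

Every row contains exactly one 1, so the original category can always be recovered from the encoded columns.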

In [None]:
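# Preprocessing pipeline
# Note: trees do not require feature scaling; StandardScaler is harmless here.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        # handle_unknown='ignore' encodes categories unseen during fit as all zeros
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ],
    remainder='drop'
)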

In [None]:
# Create a pipeline with preprocessing and classifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(random_state=42, max_depth=3))
])

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Train the model
model = pipeline.fit(X_train, y_train)

# Metrics
## Some terms
- **True Positive (TP)**: The model correctly predicted the positive class.
- **True Negative (TN)**: The model correctly predicted the negative class.
- **False Positive (FP)**: The model incorrectly predicted the positive class (Type I error).
- **False Negative (FN)**: The model incorrectly predicted the negative class (Type II error).
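
As a small illustration of these four terms, the sketch below counts them for hypothetical true and predicted labels, treating `1` as the positive class. The function name `confusion_counts` and the label values are assumptions made for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative (Type I error)
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive (Type II error)
            else:
                tn += 1  # predicted negative, actually negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Hypothetical labels: 1 = survived, 0 = did not survive
y_true_example = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_example = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(y_true_example, y_pred_example)
print(counts)
```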

## Some metrics
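### Confusion Matrix
Rows are the actual classes and columns are the predicted classes, with the positive class listed first:
$$
\begin{bmatrix}
TP & FN \\
FP & TN
\end{bmatrix}
$$
Note that scikit-learn's `confusion_matrix` sorts the labels in ascending order, so for 0/1 labels it returns the layout $\begin{bmatrix} TN & FP \\ FN & TP \end{bmatrix}$, which is what the heatmap further down displays.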
### Accuracy
$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$
### Precision
$$
\text{Precision} = \frac{TP}{TP + FP}
$$
### Recall
$$
\text{Recall} = \frac{TP}{TP + FN}
$$
### F1 Score
$$
\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$
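
The formulas above can be checked with a short worked example. The counts passed in below are hypothetical, and `classification_metrics` is an illustrative helper; in practice `accuracy_score` and `classification_report` from scikit-learn report the same quantities.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for 100 test passengers
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision is higher than recall: the model is rarely wrong when it predicts the positive class, but it misses a fifth of the actual positives.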

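# More Metrics
The metrics below apply to regression trees (for example `DecisionTreeRegressor`), where the target is a continuous value. They are listed for reference and are not used for the survival classifier in this notebook.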
## Mean Squared Error (MSE)
$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$
Where:
- $n$ is the number of samples
- $y_i$ is the true value
- $\hat{y}_i$ is the predicted value

## Mean Absolute Error (MAE)
$$
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$
Where:
- $n$ is the number of samples
- $y_i$ is the true value
- $\hat{y}_i$ is the predicted value

## R-squared
$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$
Where:
- $n$ is the number of samples
- $y_i$ is the true value
- $\hat{y}_i$ is the predicted value
- $\bar{y}$ is the mean of the true values
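
As a quick numeric check of the three regression formulas, the sketch below evaluates them on a handful of hypothetical true and predicted values. The helper `regression_metrics` is illustrative only; scikit-learn provides `mean_squared_error`, `mean_absolute_error`, and `r2_score` for real use.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R^2) for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)                # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot  # undefined when all true values are identical
    return mse, mae, r2

# Hypothetical continuous targets and predictions
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 8.0, 9.5]

mse, mae, r2 = regression_metrics(actual, predicted)
print(f"MSE={mse:.3f}  MAE={mae:.3f}  R^2={r2:.3f}")
```

MSE penalizes the single error of 1.0 more heavily than MAE does, which is why the two values differ even on such a small sample.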


In [None]:
# Evaluate on the held-out test set
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Plot the confusion matrix (rows: true class, columns: predicted class)
cm = confusion_matrix(y_test, y_pred)
class_labels = ['Not Survived', 'Survived']
plt.figure(figsize=(3, 3))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_labels, yticklabels=class_labels)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

In [None]:
# Feature importances of the fitted tree, matched to the transformed column names
importances = model.named_steps['classifier'].feature_importances_
feature_names = model.named_steps['preprocessor'].get_feature_names_out()
# Sort features from most to least important
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(10, 6))
plt.title("Feature importances")
plt.bar(range(len(importances)), importances[indices], align="center")
plt.xticks(range(len(importances)), feature_names[indices], rotation=90)
plt.xlim([-1, len(importances)])
plt.tight_layout()
plt.show()

In [None]:
# Visualize the fitted tree (feature_names comes from the previous cell)
plt.figure(figsize=(12, 8))
plot_tree(model.named_steps['classifier'], feature_names=feature_names, filled=True, rounded=True)
plt.title("Decision Tree")
plt.show()

In [None]:
# Save the model
joblib.dump(model, 'titanic_model_DT.pkl')
# Load the model
loaded_model = joblib.load('titanic_model_DT.pkl')