# Decision Trees Machine Learning Project

In this project, we build and evaluate a Decision Tree classifier on real-world data. We use the Titanic dataset to predict passenger survival, covering exploratory data analysis, preprocessing, model training, and analysis of the resulting tree.

## Introduction

Decision Trees are a popular machine learning algorithm used for classification and regression tasks. They work by recursively splitting the dataset into subsets based on feature values. The simplicity and interpretability of decision trees make them a valuable tool for understanding data patterns and making predictions.

Objectives of this project:
- Understand the mechanism behind Decision Trees, including entropy, information gain, and Gini impurity.
- Perform exploratory data analysis on the dataset to uncover underlying patterns.
- Preprocess the data and prepare it for machine learning modeling.
- Train and evaluate a Decision Tree classifier using real-world data.
- Visualize the decision path and understand feature importances.

In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [None]:
# Ignore all warning messages that would normally be displayed
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Set plot style
sns.set(style='whitegrid')
# Display plots inline
%matplotlib inline

## Dataset Description

The dataset used in this project is the Titanic dataset bundled with the Seaborn library. It describes a subset of the passengers aboard the Titanic (891 rows). The features used in this project are:

- **pclass**: The passenger class (1st, 2nd, 3rd), representing the socio-economic status.
- **sex**: The gender of the passenger.
- **age**: The age of the passenger.
- **sibsp**: The number of siblings or spouses aboard the ship.
- **parch**: The number of parents or children aboard the ship.
- **fare**: The ticket fare paid by the passenger.
- **embarked**: The port where the passenger boarded (C = Cherbourg; Q = Queenstown; S = Southampton).

The target variable is **survived**, which indicates whether the passenger survived (1) or did not survive (0). Because the features capture demographics and travel conditions that plausibly influenced survival, the dataset is a good fit for a Decision Tree classifier.

In [None]:
# Load the Titanic dataset from seaborn
titanic = sns.load_dataset('titanic')

In [None]:
# Display the first few rows of the raw dataset
titanic.head()

In [None]:
# Basic information about the dataset
titanic.info()

In [None]:
# Check for missing values
titanic.isnull().sum()

In [None]:
# Visualizing the distribution of continuous features
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
sns.histplot(titanic['age'].dropna(), kde=True, bins=30)
plt.title('Age Distribution')

plt.subplot(1, 2, 2)
sns.histplot(titanic['fare'], kde=True, bins=30)
plt.title('Fare Distribution')

plt.tight_layout()
plt.show()

In [None]:
# Count plot for survival
plt.figure(figsize=(6,4))
sns.countplot(x='survived', data=titanic)
plt.title('Survival Count')
plt.show()

## Data Preprocessing

In this section, we preprocess the data to prepare it for modeling. We handle missing values, encode categorical variables, and split the data into training and testing sets. For this project, we'll focus on the following features:
- pclass
- sex
- age
- sibsp
- parch
- fare
- embarked

Our target variable will be 'survived'.
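
As a small illustration of the encoding step described above, the sketch below applies `pd.get_dummies` with `drop_first=True` to a few made-up rows. The `toy` frame and its values are hypothetical and only reuse the Titanic column names; the point is to show which indicator columns come out.

```python
import pandas as pd

# Hypothetical toy rows that reuse the Titanic column names.
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "Q", "S"],
})

# drop_first=True drops the first (alphabetically sorted) level of each
# encoded column, here "female" and "C", which become the implicit baseline.
toy_encoded = pd.get_dummies(toy, columns=["sex", "embarked"], drop_first=True)

print(list(toy_encoded.columns))
# ['age', 'sex_male', 'embarked_Q', 'embarked_S']
```

A passenger with `sex_male`, `embarked_Q`, and `embarked_S` all false is therefore a female passenger who boarded at Cherbourg.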

In [None]:
# Select a subset of columns for the model
columns_to_use = ['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']
data = titanic[columns_to_use].copy()

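# Handle missing values. Assign the result back instead of calling
# fillna(..., inplace=True) on a column selection, which is deprecated in recent pandas.
data['age'] = data['age'].fillna(data['age'].median())  # Fill missing ages with median
data['embarked'] = data['embarked'].fillna(data['embarked'].mode()[0])  # Fill missing embarked with mode (most frequent value)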

# Convert categorical features into dummy/indicator variables
data_encoded = pd.get_dummies(data, columns=['sex', 'embarked'], drop_first=True)

# Display the first few rows of the preprocessed data
data_encoded.head()

In [None]:
# Define feature matrix X and target vector y
X = data_encoded.drop('survived', axis=1)
y = data_encoded['survived']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print('Training set size:', X_train.shape)
print('Testing set size:', X_test.shape)

## Mathematical Explanation of Decision Trees

Decision Trees classify data by splitting it based on feature values, creating branches until a decision (or leaf node) is reached. Some key concepts include:

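- Entropy: A measure of the randomness or impurity in the data. With $p_i$ the proportion of samples in a node that belong to class $i$, it is defined as:
  $E = -\sum_{i} p_i \log_2 p_i$

- Information Gain: The reduction in entropy after a dataset is split on an attribute. It is the parent's entropy minus the size-weighted average entropy of the children, where child $k$ holds $n_k$ of the parent's $n$ samples:
  $IG = E(\text{parent}) - \sum_{k} \frac{n_k}{n} E(\text{child}_k)$

- Gini Impurity: Another measure of impurity, calculated as:
  $\text{Gini} = 1 - \sum_{i} p_i^2$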

At each node, the tree selects the feature and threshold whose split most reduces impurity, measured by entropy or Gini impurity. scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default.
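
To make the definitions above concrete, here is a minimal pure-Python sketch of entropy, Gini impurity, information gain, and a brute-force threshold search on a single numeric feature. The function names and the six-passenger toy data are assumptions made for illustration; scikit-learn's own split search is more elaborate and optimized.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a sequence of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


def best_threshold(values, labels):
    """Try the midpoint between each pair of consecutive distinct values and
    return the (threshold, gain) pair with the highest information gain."""
    best = (None, 0.0)
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        gain = information_gain(labels, [left, right])
        if gain > best[1]:
            best = (t, gain)
    return best


# Hypothetical toy data: ages and survival labels for six passengers.
ages = [4, 8, 30, 35, 50, 62]
survived = [1, 1, 1, 0, 0, 0]

print("Parent entropy:", entropy(survived))  # 1.0 (three of each class)
print("Parent Gini   :", gini(survived))     # 0.5

# A poor split (first passenger vs. the rest) leaves the larger child mixed.
poor_gain = information_gain(survived, [survived[:1], survived[1:]])
print("Gain of poor split:", round(poor_gain, 3))

# The exhaustive search finds the threshold that separates the classes.
threshold, gain = best_threshold(ages, survived)
print("Best threshold:", threshold, "gain:", gain)
```

A full tree repeats this search over every feature at every node and then recurses into the resulting child subsets until a stopping rule is met.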

In [None]:
# Initialize the Decision Tree classifier
dt_classifier = DecisionTreeClassifier(random_state=42)

# Train the model
dt_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dt_classifier.predict(X_test)

# Evaluate the model
print('Accuracy:', accuracy_score(y_test, y_pred))
print('\nClassification Report:\n', classification_report(y_test, y_pred))

In [None]:
# Display the confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

In [None]:
# Optional: Hyperparameter tuning using GridSearchCV
param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print('Best Parameters:', grid_search.best_params_)
print('Best Cross-Validation Accuracy:', grid_search.best_score_)

In [None]:
# Use the best estimator to predict on the test set
best_dt = grid_search.best_estimator_
y_pred_best = best_dt.predict(X_test)
print('\nAccuracy of Best Model:', accuracy_score(y_test, y_pred_best))
print('\nClassification Report (Best Model):\n', classification_report(y_test, y_pred_best))

In [None]:
# Display the confusion matrix for the tuned model
cm = confusion_matrix(y_test, y_pred_best)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix (Best Model)')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

In [None]:
# Visualize the Decision Tree
plt.figure(figsize=(20,10))
# Pass feature names as a plain list; some scikit-learn versions reject a pandas Index here
plot_tree(best_dt, feature_names=list(X.columns), class_names=['Not Survived', 'Survived'], filled=True, rounded=True, fontsize=10)
plt.title('Optimized Decision Tree')
plt.show()

In [None]:
# Plot feature importances
importances = best_dt.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(8,6))
sns.barplot(x=importances[indices], y=X.columns[indices], palette='viridis')
plt.title('Feature Importances')
plt.show()

## Discussion

The Decision Tree model provides a straightforward approach to classification, with the interpretability of its decision rules being one of its main advantages.

Key observations:
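- Accuracy alone is a coarse summary. The classification report adds precision, recall, and F1-score for each class, which shows how well survivors and non-survivors are each identified.
- Grid search with 5-fold cross-validation picks the combination of `max_depth`, `min_samples_split`, and `criterion` with the best cross-validated accuracy, constraining tree complexity to balance underfitting against overfitting.
- The feature importance plot ranks features by how much their splits reduced impurity in the fitted tree, indicating which features contributed most to the survival predictions.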
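
The importances plotted above are impurity-based: they are computed from the training data and can favor features with many distinct values. A common cross-check is permutation importance on held-out data. The sketch below is an illustration on synthetic stand-in data (an assumption, not the Titanic features), where only the first column carries signal; `X_demo`, `y_demo`, and the other names are hypothetical.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: column 0 determines the label,
# columns 1 and 2 are pure noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Impurity-based importances, derived from the training splits.
impurity_imp = tree.feature_importances_

# Permutation importance: mean drop in held-out accuracy when one
# column is shuffled, averaged over n_repeats shuffles.
perm = permutation_importance(tree, X_te, y_te, n_repeats=10, random_state=0)

print("Impurity-based:", impurity_imp)
print("Permutation   :", perm.importances_mean)
```

In the notebook itself, the same call could be made with `best_dt`, `X_test`, and `y_test` to see whether both rankings agree.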

Potential limitations:
- Decision Trees are prone to overfitting if not properly pruned or tuned.
- The model performance might be improved further with ensemble methods such as Random Forests or Gradient Boosted Trees.
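
One way to address the overfitting noted above is minimal cost-complexity pruning, which scikit-learn exposes through the `ccp_alpha` parameter and the `cost_complexity_pruning_path` method. The sketch below uses synthetic data with deliberately noisy labels (an assumption, not the Titanic data), and choosing `ccp_alpha` by 5-fold cross-validation is one reasonable option rather than the only one.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: one informative column, with 15% of labels flipped.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
flip = rng.random(600) < 0.15
y_demo[flip] = 1 - y_demo[flip]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# An unconstrained tree grows until every leaf is pure, memorizing the noise.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Candidate alphas from the pruning path; clip guards against tiny
# negative values caused by floating-point error.
path = full_tree.cost_complexity_pruning_path(X_tr, y_tr)
alphas = np.unique(np.clip(path.ccp_alphas, 0, None))

# Pick the alpha with the best mean cross-validated accuracy.
cv_scores = [
    cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=a), X_tr, y_tr, cv=5
    ).mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(cv_scores))]

pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_tr, y_tr)

print("Leaves (full / pruned):", full_tree.get_n_leaves(), pruned_tree.get_n_leaves())
print("Train accuracy (full / pruned):", full_tree.score(X_tr, y_tr), pruned_tree.score(X_tr, y_tr))
print("Test accuracy  (full / pruned):", full_tree.score(X_te, y_te), pruned_tree.score(X_te, y_te))
```

Comparing the training and test accuracy of the two trees shows how far the unpruned tree's fit to the training set carries over to unseen data.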

Future improvements could include more advanced feature engineering, testing additional algorithms, and employing ensemble techniques.

## Conclusion

In this project, we implemented a Decision Tree model to predict Titanic survival. We walked through data loading, exploration, preprocessing, model training, hyperparameter tuning, and evaluation. Visualizing the tree structure and the feature importances gave insight into how the model makes its predictions.

This project highlights the importance of data preprocessing, model tuning, and thorough analysis in building robust machine learning models.

## References

- Seaborn Titanic dataset: https://github.com/mwaskom/seaborn-data
- Scikit-learn documentation: https://scikit-learn.org/stable/
- Decision Trees concepts: https://en.wikipedia.org/wiki/Decision_tree_learning