# Titanic Survivors Prediction

## Data Engineering

### Data Exploration

In [None]:
import pandas as pd

df = pd.read_csv('data/titanic_dataset.csv')
print(f'Dataset shape: {df.shape}')

In [None]:
df.head(5)

### Data Cleaning

In [None]:
# check for null values
print(df.isnull().sum())

In [None]:
# fill nulls in "Embarked" with mode
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

In [None]:
# recheck null values in "Embarked"
print(df['Embarked'].isnull().sum())

Categorical features such as "Sex" are already numerically encoded in our dataset. If they were not, we would encode them ourselves: label encoding for ordinal features and one-hot encoding for non-ordinal ones.
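As a rough illustration of the two encodings mentioned above, here is a minimal sketch on a toy frame. The string values and the `male`/`female` mapping are assumptions made up for the example; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative, not from the dataset.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'Q', 'S'],
})

# Label encoding: map each category to an integer.
# The mapping below is an assumption chosen for the example.
toy['Sex_encoded'] = toy['Sex'].map({'male': 0, 'female': 1})

# One-hot encoding: one indicator column per category,
# suitable for features with no natural order.
embarked_dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked', dtype=int)

encoded = pd.concat([toy[['Sex_encoded']], embarked_dummies], axis=1)
print(encoded)
```

One-hot encoding avoids implying an order between ports of embarkation, at the cost of one extra column per category.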

## Feature Selection

Several columns in the dataset contain only zeros and therefore carry no information for prediction. We keep only the columns that have at least one non-zero value.

In [None]:
# get all columns that are not entirely zero
nonzero_cols = df.columns[~(df == 0).all()]

display(df[nonzero_cols].head())

In [None]:
# select important features for modeling
features_ = ['Age', 'Fare', 'Sex', 'sibsp', 'Parch', 'Pclass', 'Embarked']

## Train/Test Split

We split the data into training and test sets, keeping 80% for training and 20% for testing.

In [None]:
from sklearn.model_selection import train_test_split

X = df[features_]
y = df['2urvived']  # target column; the name is spelled this way in the dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f'X_train shape: {X_train.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'\nX_test shape: {X_test.shape}')
print(f'y_test shape: {y_test.shape}')

The remaining sections tune a Decision Tree Classifier on the Titanic data. We import the necessary `sklearn` classes, define a parameter grid for `DecisionTreeClassifier`, use `GridSearchCV` to find the best hyperparameters on the training data (`X_train`, `y_train`), train a `DecisionTreeClassifier` with those hyperparameters, and evaluate it on the test set (`X_test`, `y_test`). The final output includes the best hyperparameters found and the model's performance metrics.

## Modeling
### Import Libraries

Import `DecisionTreeClassifier` for the model and `GridSearchCV` for hyperparameter tuning from `sklearn`.


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

print('DecisionTreeClassifier and GridSearchCV imported successfully.')

### Define Parameter Grid

Create a dictionary specifying the hyperparameters and their potential values to be searched during tuning for the Decision Tree.


In [None]:
param_grid = {
    'max_depth': [None, 3, 5, 10, 15, 20],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

print('Parameter grid defined successfully.')
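To get a feel for the cost of an exhaustive search, the grid above can be expanded by hand. This is only a sketch using the standard library; it mirrors the idea of trying every combination and is not how `GridSearchCV` is implemented internally. The fold count of 5 matches the `cv=5` setting used later in this notebook.

```python
from itertools import product
from math import prod

param_grid = {
    'max_depth': [None, 3, 5, 10, 15, 20],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy'],
}

# An exhaustive grid search tries the Cartesian product of all value lists.
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_candidates = len(combinations)

# Sanity check: the count equals the product of the list lengths.
assert n_candidates == prod(len(v) for v in param_grid.values())

cv_folds = 5  # number of cross-validation folds per candidate
print(f'{n_candidates} candidates x {cv_folds} folds = {n_candidates * cv_folds} fits')
print(f'First candidate: {combinations[0]}')
```

Every value added to one list multiplies the total number of fits, which is worth keeping in mind before widening the grid.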

### Perform Hyperparameter Tuning

Utilize GridSearchCV to systematically search through the defined parameter grid and find the optimal hyperparameters for the Decision Tree Classifier using the training data.


In [None]:
dt_classifier = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(
    estimator=dt_classifier, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1
)

print('Fitting GridSearchCV to the training data...')
grid_search.fit(X_train, y_train)
print('GridSearchCV fitting complete.')

With the grid search fitted, retrieve and display the best hyperparameters and the corresponding mean cross-validated accuracy.



In [None]:
print(f'Best hyperparameters: {grid_search.best_params_}')
print(f'Best cross-validated accuracy: {grid_search.best_score_:.4f}')

## Train Final Model and Evaluate

Train a `DecisionTreeClassifier` using the best hyperparameters found by `GridSearchCV` and evaluate its performance on the test set (`X_test`, `y_test`).

In [None]:
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, classification_report)

# Get the best hyperparameters from GridSearchCV
best_params = grid_search.best_params_

# Initialize a new DecisionTreeClassifier with the best hyperparameters
final_dt_model = DecisionTreeClassifier(**best_params, random_state=42)

# Train the final model on the entire training dataset
print('Training final Decision Tree model...')
final_dt_model.fit(X_train, y_train)
print('Final model training complete.')

# Make predictions on the test set
y_pred = final_dt_model.predict(X_test)

# Evaluate the model
print('\nModel Evaluation on Test Set:')
print(f'Accuracy: {accuracy_score(y_test, y_pred):.4f}')
print(f'Precision: {precision_score(y_test, y_pred):.4f}')
print(f'Recall: {recall_score(y_test, y_pred):.4f}')
print(f'F1-Score: {f1_score(y_test, y_pred):.4f}')

print('\nClassification Report:\n', classification_report(y_test, y_pred))
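Retraining by hand as above is explicit, but `GridSearchCV` already refits the best candidate on the full training set when `refit=True` (the default) and exposes it as `best_estimator_`. The sketch below checks that the two routes agree. It uses synthetic data from `make_classification` and a deliberately small grid, both of which are assumptions made so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the Titanic dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={'max_depth': [3, 5], 'criterion': ['gini', 'entropy']},
    cv=5,
    scoring='accuracy',
)
search.fit(X_tr, y_tr)

# With refit=True (the default), the best candidate is retrained on all of X_tr.
refit_model = search.best_estimator_

# Retraining by hand with the same parameters and seed should give the same tree.
manual_model = DecisionTreeClassifier(**search.best_params_, random_state=42)
manual_model.fit(X_tr, y_tr)

same = (refit_model.predict(X_te) == manual_model.predict(X_te)).all()
print(f'Predictions identical: {same}')
print(f'Test accuracy via search.score: {search.score(X_te, y_te):.4f}')
```

Either route is fine; using `best_estimator_` simply saves one extra fit.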


## Confusion Matrix


In [None]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

cm = confusion_matrix(y_test, y_pred)
print(cm)

plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

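In the confusion matrix above, rows are actual classes and columns are predicted classes, both in label order. The metrics reported earlier treat label 1 as the positive class:
- 0: negative class (did not survive)
- 1: positive class (survived)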
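To make the four cells concrete, the sketch below counts them by hand on a small set of made-up labels and derives the usual metrics from them. The labels are hypothetical. For labels 0 and 1, `sklearn`'s `confusion_matrix` uses the same `[[TN, FP], [FN, TP]]` layout.

```python
# Hypothetical labels for illustration; 1 is treated as the positive class.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

pairs = list(zip(y_true, y_hat))
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives

# Rows: actual 0 then 1. Columns: predicted 0 then 1.
matrix = [[tn, fp], [fn, tp]]

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(matrix)
print(f'accuracy={accuracy:.2f} precision={precision:.2f} '
      f'recall={recall:.2f} f1={f1:.2f}')
```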

## Decision Tree

In [None]:
feature_imp = pd.Series(final_dt_model.feature_importances_, index=X.columns)
top_3_features = feature_imp.nlargest(3)

print('Top 3 most important features:')
print(top_3_features)

### How Decision Trees Split Data

Decision trees work by recursively partitioning the data into subsets based on the values of the input features. At each node of the tree, the algorithm selects the feature and the split point that best divides the data into distinct groups, aiming to maximize the 'purity' of the resulting subsets with respect to the target variable.

In our case, the `criterion` for splitting was 'entropy', meaning that at each node the tree picks the split that reduces the entropy (impurity) of the target variable the most, i.e. the split with the highest information gain. The tree iteratively asks 'if-else' questions about the features, such as 'Is Age <= 28.5?', leading to branches that eventually terminate in leaf nodes, each representing a predicted outcome (survived or not survived).
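The entropy criterion can be sketched in a few lines. The example below computes the entropy of a parent node and the information gain of the split 'Is Age <= 28.5?' on a handful of made-up passengers. The data and the helper names `entropy` and `information_gain` are assumptions for illustration; this is not the library's implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting parent into left and right."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted


# Hypothetical (age, survived) pairs; not taken from the dataset.
passengers = [(4, 1), (12, 1), (22, 0), (27, 1), (30, 0), (41, 0), (55, 0), (63, 0)]
threshold = 28.5  # the example question 'Is Age <= 28.5?'

parent = [s for _, s in passengers]
left = [s for age, s in passengers if age <= threshold]
right = [s for age, s in passengers if age > threshold]

gain = information_gain(parent, left, right)
print(f'Parent entropy: {entropy(parent):.3f}')
print(f'Information gain: {gain:.3f}')
```

A tree builder evaluates many candidate thresholds per feature in this way and keeps the one with the largest gain.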

### Top 3 Most Important Features

Based on the feature importances from our `final_dt_model`, the top 3 most influential features in determining survival on the Titanic dataset are:

1.  **Sex**: The gender of the passenger.
2.  **Fare**: The fare paid by the passenger.
3.  **Age**: The age of the passenger.

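A feature's importance is the total impurity reduction it contributes across all of its splits, normalized so that the importances sum to 1. Features with high importance tend to appear near the top of the tree, where each split affects the most samples, and they contribute most to the model's predictive power.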

In [None]:
from sklearn import tree

plt.figure(figsize=(20,20))
tree.plot_tree(
    final_dt_model,
    filled=True,
    feature_names=features_,
    # class_names must follow the model's class order, not order of appearance in y
    class_names=[str(c) for c in final_dt_model.classes_],
)
plt.show()