
# The Titanic Problem Part III

Now that we have preprocessed our dataset, we are ready to start developing models. To begin, we will build a base model on the preprocessed data.

## Loading the necessary Python packages and dataset

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import rcParams
import os
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report
import pickle


%matplotlib inline
rcParams["figure.figsize"] = 10,8
sns.set(style="whitegrid", palette="muted", rc={"figure.figsize": (15,10)})

In [None]:
train = pd.read_csv("processed_train.csv")
test = pd.read_csv("processed_test.csv")

In [None]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        display(df)

In [None]:
display_all(train)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Family_Size,Embarked_C,Embarked_Q,Embarked_S,Title_Dr,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Rev
0,0,3,1,22.0,1,0,7.25,1,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
1,1,1,0,38.0,1,0,71.2833,1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1,3,0,26.0,0,0,7.925,0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
3,1,1,0,35.0,1,0,53.1,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0,3,1,35.0,0,0,8.05,0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
5,0,3,1,29.0,0,0,8.4583,0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
6,0,1,1,54.0,0,0,51.8625,0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
7,0,3,1,2.0,3,1,21.075,4,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
8,1,3,0,27.0,0,2,11.1333,2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
9,1,2,0,14.0,1,0,30.0708,1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [None]:
display_all(train.describe(include="all").T)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Survived,891.0,0.383838,0.486592,0.0,0.0,0.0,1.0,1.0
Pclass,891.0,2.308642,0.836071,1.0,2.0,3.0,3.0,3.0
Sex,891.0,0.647587,0.47799,0.0,0.0,1.0,1.0,1.0
Age,891.0,29.358215,13.241715,0.42,22.0,29.0,36.0,80.0
SibSp,891.0,0.523008,1.102743,0.0,0.0,0.0,1.0,8.0
Parch,891.0,0.381594,0.806057,0.0,0.0,0.0,0.0,6.0
Fare,891.0,32.204208,49.693429,0.0,7.9104,14.4542,31.0,512.3292
Family_Size,891.0,0.904602,1.613459,0.0,0.0,0.0,1.0,10.0
Embarked_C,891.0,0.189675,0.392264,0.0,0.0,0.0,0.0,1.0
Embarked_Q,891.0,0.08642,0.281141,0.0,0.0,0.0,0.0,1.0


## Gradient Boosting Algorithm

Gradient Boosting Classifier is a powerful machine learning algorithm that builds an ensemble of decision trees sequentially, where each subsequent tree corrects the errors of the previous ones. We will be using this algorithm as our base model and build on that later.
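To make the "each tree corrects the previous ones" idea concrete, here is a minimal sketch on synthetic data (not the Titanic set). The names `X_demo`, `y_demo` and `demo_gbc` are made up for this illustration. It uses `staged_predict_proba`, which yields the ensemble's predicted probabilities after each added tree, to show the training loss shrinking as trees accumulate.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; an assumption for illustration only
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

demo_gbc = GradientBoostingClassifier(n_estimators=100, random_state=0)
demo_gbc.fit(Xd_train, yd_train)

# One entry per boosting stage: the training log loss of the ensemble so far
train_losses = [
    log_loss(yd_train, proba) for proba in demo_gbc.staged_predict_proba(Xd_train)
]

for n_trees in (1, 10, 50, 100):
    print(f"{n_trees:>3} trees: training log loss = {train_losses[n_trees - 1]:.4f}")
```

A falling training loss only shows that the ensemble fits the training rows better and better; it says nothing about overfitting, which is why the notebook evaluates on held-out data.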

### Create Predictor Variables and Response Variable
The provided test dataset has no target column and is used only for submission, so we cannot evaluate a model on it. Instead, we split the train dataset into training and held-out test sets so we can measure the model's performance. Before splitting, we separate the feature variables from the target variable.

In [None]:
# Create X and y variables
dep_var = "Survived"
X = train.drop(dep_var, axis=1)
y = train[dep_var]

# Split the train dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=79)

print(X_train.shape, X_test.shape, test.shape)

(712, 16) (179, 16) (418, 16)
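The split above is purely random. As an optional variation (not what this notebook does), `train_test_split` also accepts a `stratify` argument that keeps the class balance roughly equal in both parts. The sketch below uses toy labels with about the same survival rate as the training data; `X_toy` and `y_toy` are hypothetical stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 891 rows with roughly 38% positives, mimicking the Survived balance
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(891, 3))
y_toy = (rng.random(891) < 0.384).astype(int)

# Plain random split, as in the notebook
_, _, y_plain_train, y_plain_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=79
)

# Stratified split: class proportions are preserved in both parts
_, _, y_strat_train, y_strat_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=79, stratify=y_toy
)

print(f"overall positive rate:   {y_toy.mean():.3f}")
print(f"random split (test):     {y_plain_test.mean():.3f}")
print(f"stratified split (test): {y_strat_test.mean():.3f}")
```

With only 179 held-out rows, a random split can drift a few points away from the overall class balance, which in turn moves the precision and recall estimates; stratifying removes that source of noise.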


### Train the Gradient Boosting Classifier

In [None]:
# Instantiate GradientBoosting
gbc = GradientBoostingClassifier()

# Train the model
gbc.fit(X_train, y_train)

### Evaluate the performance of the model

In [None]:
# Predictions
y_hat = gbc.predict(X_test)

# Calculating the metrics
accuracy = accuracy_score(y_test, y_hat)
precision = precision_score(y_test, y_hat)
recall = recall_score(y_test, y_hat)
f1 = f1_score(y_test, y_hat)

# Create a dataframe for the metrics
metrics = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score'],
    'Score': [accuracy, precision, recall, f1]
})
print(metrics)

      Metric     Score
0   Accuracy  0.837989
1  Precision  0.836364
2     Recall  0.696970
3   F1 Score  0.760331


That is not bad for a start. The base model with default settings reaches an accuracy of 83.8%, a precision of 83.6%, a recall of 69.7%, and an F1 score of 76.0% on the held-out split. Recall is the weakest of the four: the model misses roughly 30% of the actual survivors. Let's see whether tuning can improve on this.
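To see where these four numbers come from, the sketch below recomputes them from confusion-matrix counts. The counts are not printed anywhere in the notebook; they are inferred here from the reported scores, assuming 179 held-out rows of which 66 are actual survivors (the support shown in the classification report later in this notebook).

```python
# Inferred counts (an assumption derived from the reported scores)
tp, fn = 46, 20    # survivors predicted correctly / survivors missed
fp, tn = 9, 104    # non-survivors flagged as survivors / predicted correctly

accuracy_check = (tp + tn) / (tp + tn + fp + fn)
precision_check = tp / (tp + fp)   # of the predicted survivors, how many survived
recall_check = tp / (tp + fn)      # of the actual survivors, how many were found
f1_check = 2 * precision_check * recall_check / (precision_check + recall_check)

print(f"accuracy  = {accuracy_check:.6f}")
print(f"precision = {precision_check:.6f}")
print(f"recall    = {recall_check:.6f}")
print(f"f1        = {f1_check:.6f}")
```

Under these counts, the gap between precision and recall comes down to 20 missed survivors against only 9 false alarms.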

## Save the Base Model

In [None]:
with open("base_model_titanic.pkl", "wb") as bm:
    pickle.dump(gbc, bm)
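Loading a saved model back is the mirror image of the cell above. The sketch below round-trips a small stand-in model through a temporary file, so it does not touch `base_model_titanic.pkl`; `demo_model` and `demo_model.pkl` are names invented for this illustration.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small stand-in model trained on synthetic data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_model = GradientBoostingClassifier(n_estimators=20, random_state=0)
demo_model.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")

    # Save the fitted model
    with open(model_path, "wb") as f:
        pickle.dump(demo_model, f)

    # Load it back; note "rb" (read binary) instead of "wb"
    with open(model_path, "rb") as f:
        restored_model = pickle.load(f)

# The restored model should predict exactly like the original
same_predictions = bool(
    (restored_model.predict(X_demo) == demo_model.predict(X_demo)).all()
)
print("Restored model reproduces predictions:", same_predictions)
```

Two caveats apply to pickled models in general: only unpickle files you trust, and load them with the same scikit-learn version that created them where possible.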

## Improving the Model

`GradientBoostingClassifier` has many hyperparameters, but we went with the default settings for our base model. To see every setting you can tune, call the model's `get_params()` method.

In [None]:
GradientBoostingClassifier().get_params()

{'ccp_alpha': 0.0,
 'criterion': 'friedman_mse',
 'init': None,
 'learning_rate': 0.1,
 'loss': 'log_loss',
 'max_depth': 3,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_iter_no_change': None,
 'random_state': None,
 'subsample': 1.0,
 'tol': 0.0001,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}

The above is a dictionary of all the hyperparameters available to `GradientBoostingClassifier`. I won't cover all of them, only the ones we will tune: `criterion`, which selects the function used to measure the quality of a split; `max_depth`, which limits the depth of each tree and so controls its complexity; `max_features`, which controls the number of features considered when looking for the best split at a node; `n_estimators`, which sets the number of boosting stages (trees added one after another, not a forest of independent trees); and `subsample`, which sets the fraction of training rows used to fit each tree. `validation_fraction` only matters when early stopping is enabled through `n_iter_no_change`, so we leave it at its default.
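As a side note on `validation_fraction`: it only takes effect when `n_iter_no_change` is set, in which case the classifier holds out that share of the training rows and stops adding trees once the validation score stalls. The sketch below shows this on synthetic data; it is an illustration of the option, not a step this notebook uses, and `early_gbc` is a made-up name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data; an assumption for illustration only
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4, random_state=0
)

early_gbc = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of boosting stages
    validation_fraction=0.2,  # share of training rows held out for early stopping
    n_iter_no_change=5,       # stop after 5 stages without validation improvement
    random_state=0,
)
early_gbc.fit(X_demo, y_demo)

# n_estimators_ holds the number of stages that were actually fitted
print("Stages requested:", early_gbc.n_estimators)
print("Stages fitted:   ", early_gbc.n_estimators_)
```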

### Grid Search Cross Validation

To use the selected hyperparameters, we need to find the best value for each of them, which means trying many combinations. Doing this manually would be very tedious, but scikit-learn's `GridSearchCV` can run the search for us and report the best combination.

In [None]:
# Standardize the dataset
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

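# Create dictionary parameters for selected params
params = {
    # Note: "mse" and "mae" are legacy names. Recent scikit-learn releases accept only
    # "friedman_mse" and "squared_error", so candidates using the legacy names fail to fit there.
    "criterion": ["friedman_mse", "mse", "mae"],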
    "max_depth": [1,2,3,5,7],
    "max_features": [0.5, "sqrt", "log2", None],
    "n_estimators": [100, 200, 300, 400, 500],
    "subsample": [0.1, 0.5, 0.7]
}

gbc = GradientBoostingClassifier()

# Create the grid search model
grid_search = GridSearchCV(gbc, params, scoring="accuracy", cv=5, verbose=1, n_jobs=-1)

# Train the grid search model
grid_search.fit(X_train_scaled, y_train)

Fitting 5 folds for each of 900 candidates, totalling 4500 fits
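The fit count reported above is just the size of the grid times the number of folds. A quick way to check the cost of a grid before launching it is `ParameterGrid`, which enumerates the combinations without training anything. The sketch copies the grid into a separate `demo_params` dictionary (a name used only here).

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook, copied so this cell is self-contained
demo_params = {
    "criterion": ["friedman_mse", "mse", "mae"],
    "max_depth": [1, 2, 3, 5, 7],
    "max_features": [0.5, "sqrt", "log2", None],
    "n_estimators": [100, 200, 300, 400, 500],
    "subsample": [0.1, 0.5, 0.7],
}

n_candidates = len(ParameterGrid(demo_params))  # 3 * 5 * 4 * 5 * 3
n_folds = 5
total_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {total_fits} fits")
```

Because the count is a product, adding even one more hyperparameter with a few values multiplies the runtime, so it pays to check this number first.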


### Inspecting the Best Parameters

In [None]:
grid_search.best_params_

{'criterion': 'friedman_mse',
 'max_depth': 2,
 'max_features': None,
 'n_estimators': 100,
 'subsample': 0.5}
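`best_params_` shows only the winner. To judge how close the runners-up were, the fitted search also exposes `cv_results_`, which loads cleanly into a DataFrame. The sketch below runs a deliberately tiny search on synthetic data so it finishes quickly; `small_grid` and `demo_search` are hypothetical names, and the same inspection should work on the notebook's `grid_search` object.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; an assumption for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

small_grid = {"max_depth": [1, 2, 3], "subsample": [0.5, 1.0]}
demo_search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=30, random_state=0),
    small_grid,
    scoring="accuracy",
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# One row per candidate, best-ranked first
results = pd.DataFrame(demo_search.cv_results_)
ranking = results[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(ranking.to_string(index=False))
```

If the top few candidates differ by less than their `std_test_score`, the choice between them is mostly noise, and preferring the simpler configuration is a reasonable tie-breaker.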

### Evaluating the GridSearch Best Model

In [None]:
# Predictions
X_test_scaled = scaler.transform(X_test)
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test_scaled)

# Calculating the metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Create a dataframe for the metrics
metrics = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score'],
    'Score': [accuracy, precision, recall, f1]
})
print(metrics)

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

      Metric     Score
0   Accuracy  0.837989
1  Precision  0.836364
2     Recall  0.696970
3   F1 Score  0.760331

Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.92      0.88       113
           1       0.84      0.70      0.76        66

    accuracy                           0.84       179
   macro avg       0.84      0.81      0.82       179
weighted avg       0.84      0.84      0.83       179



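Well, the tuned model did not improve on the held-out split: its scores are identical to the base model's. The selected configuration is simpler, though (trees of depth 2, each fitted on half of the training rows), which tends to reduce overfitting, so it may generalize better even though this split does not show a difference.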

## Combine the train and test sets

We now know how our tuned model performs, but it was trained on only 80% of the provided training dataset. We will recombine that 80% with the 20% held out for evaluation and train on the full training dataset.
From there, we can predict on the real test dataset, which has no target variable (`Survived`).

In [None]:
# Combine X_train and X_test sets
X_train = pd.concat([X_train, X_test])
y_train = pd.concat([y_train, y_test])

In [None]:
print(X_train.shape, test.shape)

(891, 16) (418, 16)


## Train on the full set

In [None]:
# Create pipeline
pipeline = Pipeline([("scaler", StandardScaler()),
                    ("gbc", GradientBoostingClassifier())
                    ])

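# Parameters for the model; the "gbc__" prefix routes each one to the "gbc" pipeline step
grid_params = {
    # Note: "mse" and "mae" are legacy names. Recent scikit-learn releases accept only
    # "friedman_mse" and "squared_error", so candidates using the legacy names fail to fit there.
    "gbc__criterion": ["friedman_mse", "mse", "mae"],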
    "gbc__max_depth": [1,2,3,5,7],
    "gbc__max_features": [0.5, "sqrt", "log2", None],
    "gbc__n_estimators": [100, 200, 300, 400, 500],
    "gbc__subsample": [0.1, 0.5, 0.7]
}

# Perform grid search
gridsearch = GridSearchCV(pipeline, grid_params, cv=5, verbose=2, n_jobs=-1)

### Run the Grid Search on the Full Training Set

In [None]:
# Fit the grid search on the full training set; by default it then refits the best pipeline on all rows
gridsearch.fit(X_train, y_train)


Fitting 5 folds for each of 900 candidates, totalling 4500 fits


## Evaluating the trained model

In [None]:
# Take the best pipeline found by GridSearchCV
best_pipeline = gridsearch.best_estimator_

# Perform cross-validation
cv_scores = cross_val_score(best_pipeline, X_train, y_train, cv=5, scoring='accuracy')

# Print the cross-validation scores
print("Cross-Validation Scores:", cv_scores)

# Calculate and print the mean and standard deviation of the cross-validation scores
print("Mean Accuracy:", cv_scores.mean())
print("Standard Deviation of Accuracy:", cv_scores.std())

Cross-Validation Scores: [0.82681564 0.83707865 0.86516854 0.80898876 0.84269663]
Mean Accuracy: 0.8361496453455528
Standard Deviation of Accuracy: 0.01849680697479674


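The mean cross-validated accuracy on the full training set is 83.6% (standard deviation of about 1.8 percentage points), which is in line with the 83.8% that both the base model and the tuned model scored on the held-out split. Keep in mind that the same rows were also used to choose the hyperparameters, so this estimate is likely somewhat optimistic.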
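One common way to get a less biased estimate, not used in this notebook, is nested cross-validation: the whole grid search is wrapped in an outer cross-validation loop, so each outer fold scores a model whose hyperparameters were chosen without seeing that fold. The sketch below shows the pattern on synthetic data with a tiny grid; `inner_search` and `nested_scores` are names invented here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data; an assumption for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Inner loop: hyperparameter search with its own 3-fold cross-validation
inner_search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=30, random_state=0),
    {"max_depth": [1, 2, 3]},
    scoring="accuracy",
    cv=3,
)

# Outer loop: each of the 5 folds reruns the whole search on its training part
nested_scores = cross_val_score(
    inner_search, X_demo, y_demo, cv=5, scoring="accuracy"
)

print("Nested CV scores:", nested_scores)
print("Mean accuracy: %.3f (std %.3f)" % (nested_scores.mean(), nested_scores.std()))
```

The price is runtime: the search runs once per outer fold, so with the 900-candidate grid above this would mean five times the already large number of fits.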

In [None]:
with open("best_model_titanic.pkl", "wb") as bp:
    pickle.dump(best_pipeline, bp)

## Predicting on the test set

In [None]:
test[dep_var] = best_pipeline.predict(test)

## Save Predictions for Submission

In [None]:
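# Copy the slice so the next cell modifies an independent DataFrame (avoids SettingWithCopyWarning)
submission = test[["PassengerId", dep_var]].copy()

In [None]:
# Cast the predictions to plain integers
submission[dep_var] = submission[dep_var].astype(int)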

In [None]:
submission.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1


In [None]:
submission.to_csv("submission.csv", index=False)

In [None]:
!mkdir -p ~/.kaggle

In [None]:
!cp kaggle.json ~/.kaggle/kaggle.json

In [None]:
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle competitions submit -c titanic -f submission.csv -m "Great to Learn"

100% 2.77k/2.77k [00:00<00:00, 9.34kB/s]
Successfully submitted to Titanic - Machine Learning from Disaster

That is us done. We will try neural networks next, so watch out!