Learning from Mistakes (Boosting)

Import Libraries & Load Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)


Data Cleaning & Feature Engineering

In [11]:
# Re-load the Titanic dataset to ensure a clean state for processing
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Fill missing values by assignment; calling fillna(inplace=True) on a
# selected column is deprecated and may not modify df in newer pandas versions
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Encode categorical columns as integers
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

# Family size = siblings/spouses + parents/children + the passenger
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
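The cell above fills missing ages with the mean of the whole dataset, before any train-test split. A stricter variant learns the fill value from the training rows only. The sketch below illustrates that idea on a small hypothetical frame (the `toy` data and all variable names are made up for illustration and are not part of this notebook's pipeline).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for a few Titanic-like columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0, np.nan, 2.0, 27.0, 14.0, np.nan, 40.0],
    "Fare": [7.25, 71.28, 8.05, 51.86, 8.46, 21.08, 11.13, 30.07, 16.70, 26.55],
    "Survived": [0, 1, 0, 0, 0, 1, 1, 1, 0, 1],
})

X_toy = toy[["Age", "Fare"]]
y_toy = toy["Survived"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

# Learn the fill value from the training rows only
age_mean_train = X_tr["Age"].mean()

# Apply the same value to both splits, assigning the result back
# instead of using inplace=True on a column slice
X_tr = X_tr.assign(Age=X_tr["Age"].fillna(age_mean_train))
X_te = X_te.assign(Age=X_te["Age"].fillna(age_mean_train))

print("Missing ages after imputation:",
      int(X_tr["Age"].isna().sum()), int(X_te["Age"].isna().sum()))
```

Because the test rows never contribute to `age_mean_train`, the evaluation does not benefit from information it would not have at prediction time.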


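No feature scaling is needed: tree-based models split on feature thresholds, so rescaling a feature does not change which splits are available.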
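To see why scaling can be skipped, the following sketch trains the same gradient boosting model twice on synthetic data, once on raw features and once on standardized features, and compares the predictions. The data, the variable names, and the choice of `StandardScaler` are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data; one feature is blown up to a very different scale
X_syn, y_syn = make_classification(
    n_samples=400, n_features=5, n_informative=3, random_state=0
)
X_syn[:, 0] *= 1000.0

X_fit, X_eval = X_syn[:300], X_syn[300:]
y_fit = y_syn[:300]

# Model trained on the raw features
raw_model = GradientBoostingClassifier(n_estimators=50, random_state=0)
raw_model.fit(X_fit, y_fit)

# Same model trained on standardized features
scaler = StandardScaler().fit(X_fit)
scaled_model = GradientBoostingClassifier(n_estimators=50, random_state=0)
scaled_model.fit(scaler.transform(X_fit), y_fit)

# Fraction of evaluation rows where both models predict the same class
agreement = np.mean(
    raw_model.predict(X_eval) == scaled_model.predict(scaler.transform(X_eval))
)
print(f"Prediction agreement with and without scaling: {agreement:.3f}")
```

The two models should agree on all or nearly all rows, since standardization keeps the ordering of values within each feature and the trees only care about that ordering.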

Select Features & Target


In [12]:
X = df[['Pclass', 'Sex', 'Age', 'Fare', 'FamilySize']]
y = df['Survived']


Train-Test Split

In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


Train Gradient Boosting Model

In [14]:
gb = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=3,
    random_state=42
)

# Impute NaN values in 'Sex' with the mode from the training set only, so that
# X_train and X_test are treated consistently and no test information leaks in.
# (With the mapping above, 'Sex' normally has no missing values, so this is a safeguard.)
# Assign the result back instead of using inplace=True on a column slice.
sex_mode = X_train['Sex'].mode()[0]
X_train = X_train.assign(Sex=X_train['Sex'].fillna(sex_mode))
X_test = X_test.assign(Sex=X_test['Sex'].fillna(sex_mode))

gb.fit(X_train, y_train)


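    n_estimators → number of trees (boosting stages), built one after another

    learning_rate → how strongly each tree's correction is applied; smaller values usually need more trees

    max_depth → maximum depth, and therefore complexity, of each individual tree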
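One way to see how these parameters interact is to track test accuracy as trees are added. The sketch below uses `staged_predict`, which yields the ensemble's predictions after each boosting stage. It runs on synthetic data (an assumption for illustration, with made-up variable names), so the numbers will differ from the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_gb = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.05, max_depth=3, random_state=42
)
demo_gb.fit(Xd_train, yd_train)

# Test accuracy of the ensemble after 1, 2, ..., 200 trees
staged_acc = [
    accuracy_score(yd_test, stage_pred)
    for stage_pred in demo_gb.staged_predict(Xd_test)
]

for n_trees in (1, 10, 50, 100, 200):
    print(f"{n_trees:>3} trees -> test accuracy {staged_acc[n_trees - 1]:.3f}")
```

If the accuracy flattens out well before the last stage, fewer trees (or a smaller learning rate with the same number of trees) may be worth trying.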

Make Predictions

In [15]:
y_pred_gb = gb.predict(X_test)


Evaluate Model

In [16]:
print("Accuracy:", accuracy_score(y_test, y_pred_gb))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_gb))
print("\nClassification Report:\n", classification_report(y_test, y_pred_gb))


Accuracy: 0.8156424581005587

Confusion Matrix:
 [[93 12]
 [21 53]]

Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.89      0.85       105
           1       0.82      0.72      0.76        74

    accuracy                           0.82       179
   macro avg       0.82      0.80      0.81       179
weighted avg       0.82      0.82      0.81       179
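The numbers in the report can be re-derived by hand from the confusion matrix, which is a useful sanity check. The sketch below does the arithmetic for class 1 (survived), using scikit-learn's layout where rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[93, 12],
               [21, 53]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # (53 + 93) / 179
precision_1 = tp / (tp + fp)         # 53 / 65
recall_1 = tp / (tp + fn)            # 53 / 74
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision_1:.2f}")
print(f"recall    = {recall_1:.2f}")
print(f"f1-score  = {f1_1:.2f}")
```

The recall of 0.72 for class 1 comes from the 21 survivors that the model predicted as non-survivors.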



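Gradient Boosting often beats Random Forest on tabular data like this, but not always: the outcome depends on hyperparameter tuning and on the particular train-test split.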

Feature Importance

In [17]:
importance_gb = pd.DataFrame({
    'Feature': X.columns,
    'Importance': gb.feature_importances_
}).sort_values(by='Importance', ascending=False)

importance_gb


      Feature  Importance
1         Sex    0.466828
3        Fare    0.183376
0      Pclass    0.150040
2         Age    0.134046
4  FamilySize    0.065710
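The values in `feature_importances_` are impurity-based and sum to 1. A common cross-check is permutation importance, which measures how much the score drops when one feature is shuffled. The sketch below compares the two on synthetic data where only the first feature carries signal; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: the label depends on feature 0 only, features 1 and 2 are noise
rng = np.random.default_rng(0)
X_imp = rng.normal(size=(400, 3))
y_imp = (X_imp[:, 0] + 0.1 * rng.normal(size=400) > 0).astype(int)

imp_model = GradientBoostingClassifier(n_estimators=50, random_state=0)
imp_model.fit(X_imp[:300], y_imp[:300])

# Impurity-based importances (normalized to sum to 1)
impurity_imp = imp_model.feature_importances_

# Permutation importances measured on held-out rows
perm = permutation_importance(
    imp_model, X_imp[300:], y_imp[300:], n_repeats=10, random_state=0
)

for i in range(X_imp.shape[1]):
    print(f"feature {i}: impurity={impurity_imp[i]:.3f}, "
          f"permutation={perm.importances_mean[i]:.3f}")
```

When both measures agree on the top features, as they should here, the ranking is more trustworthy than either measure alone.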


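Compare Models (Final View)

    Approximate ranking by test accuracy on this dataset (typical ranges; exact values depend on features, tuning, and the split):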
    1️⃣ Gradient Boosting → 82–88%
    2️⃣ Random Forest → 80–85%
    3️⃣ Logistic Regression
    4️⃣ KNN
    5️⃣ Decision Tree
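A ranking like this is best checked with cross-validation rather than a single split. The sketch below compares the five model families with 5-fold cross-validation on synthetic data; the data and the hyperparameters are assumptions for illustration, so the resulting order need not match the ranking above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for illustration
X_cmp, y_cmp = make_classification(n_samples=300, n_features=6, random_state=42)

models = {
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Mean accuracy over 5 folds for each model
cv_scores = {
    name: cross_val_score(model, X_cmp, y_cmp, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {score:.3f}")
```

Swapping in the Titanic `X` and `y` from earlier cells would give a ranking specific to this notebook's features.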

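Boosting in one line: "I need to fix mistakes sequentially." Each new tree is trained to correct the errors left by the trees before it.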
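The idea of fixing mistakes sequentially can be written out by hand in a few lines. The sketch below is a simplified regression version with squared error, where each small tree is fit to the current residuals. It is not how `GradientBoostingClassifier` is implemented internally (the classifier works with a classification loss), and the data and names are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a noisy sine curve
rng = np.random.default_rng(42)
X_reg = rng.uniform(-3, 3, size=(200, 1))
y_reg = np.sin(X_reg[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction (the mean of the targets)
prediction = np.full_like(y_reg, y_reg.mean())
mse_history = [np.mean((y_reg - prediction) ** 2)]

for _ in range(n_rounds):
    # The "mistakes" of the current ensemble
    residuals = y_reg - prediction
    # A shallow tree learns to predict those mistakes
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_reg, residuals)
    # Apply only a fraction of the correction
    prediction += learning_rate * tree.predict(X_reg)
    mse_history.append(np.mean((y_reg - prediction) ** 2))

print(f"training MSE: start={mse_history[0]:.4f}, end={mse_history[-1]:.4f}")
```

The training error shrinks round by round because every tree targets what the previous trees got wrong, while `learning_rate` keeps each individual correction small.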