# Ensemble Learning: From Random Forest to Gradient Boosting 🌳

**Ensemble learning** is a technique that combines multiple machine learning models to produce a more powerful and accurate model. Instead of relying on a single model, we leverage the "wisdom of the crowd" by aggregating the predictions of several base estimators.

This notebook will compare the performance of a single Decision Tree against two popular tree-based ensemble methods on the Titanic dataset:

1.  **Bagging (Random Forest):** Models are built independently, and can be trained in parallel, on bootstrap samples of the training data. Their predictions are then combined by voting or averaging. This mainly reduces variance, which helps limit overfitting.
2.  **Boosting (Gradient Boosting):** Models are built sequentially. Each new model focuses on correcting the errors made by the previous ones, gradually improving the overall prediction.
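
As a minimal sketch of how independent models can be combined by vote, the snippet below aggregates the predictions of three base models. The prediction arrays are made up for illustration and do not come from the Titanic data.

```python
import numpy as np

# Hypothetical 0/1 predictions from three base models for five samples.
predictions = np.array([
    [1, 0, 1, 1, 0],  # model A
    [1, 1, 0, 1, 0],  # model B
    [0, 0, 1, 1, 1],  # model C
])

# A sample is labelled 1 when more than half of the models vote 1.
votes_for_one = predictions.sum(axis=0)
majority = (votes_for_one > predictions.shape[0] / 2).astype(int)
print(majority)
```

Using an odd number of voters, as here, avoids ties in a two-class problem.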

---

## 1. Predicting Survival on the Titanic

Our goal is to predict whether a passenger survived the Titanic disaster based on features like their class, sex, and age.

First, we load and prepare the data. The `Name` column is dropped because the models cannot use free-text names directly, and the categorical `Sex` column is converted to numerical values (`male` = 1, `female` = 2).


In [7]:
import pandas as pd

df = pd.read_csv('titanic.csv')
df.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


In [8]:
# Preprocessing
df.drop("Name", axis='columns', inplace=True)
df['Sex'] = df['Sex'].map({'male':1, 'female':2})
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,1,22.0,1,0,7.25
1,1,1,2,38.0,1,0,71.2833
2,1,3,2,26.0,0,0,7.925
3,1,1,2,35.0,1,0,53.1
4,0,3,1,35.0,0,0,8.05


Finally, we split our data into training and testing sets.

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

X = df.drop('Survived', axis='columns')
y = df.Survived

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## 2. Baseline Model: A Single Decision Tree

To start, let's train a single `DecisionTreeClassifier` to establish a baseline performance.

In [10]:
from sklearn.tree import DecisionTreeClassifier

model_dt = DecisionTreeClassifier()
model_dt.fit(X_train, y_train)
y_pred_dt = model_dt.predict(X_test)

print("--- Decision Tree Report ---")
print(classification_report(y_test, y_pred_dt))

--- Decision Tree Report ---
              precision    recall  f1-score   support

           0       0.83      0.82      0.82       166
           1       0.71      0.72      0.72       101

    accuracy                           0.78       267
   macro avg       0.77      0.77      0.77       267
weighted avg       0.78      0.78      0.78       267



The single decision tree achieves an accuracy of **78%**.
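
To make the accuracy figure in a classification report easier to interpret, the sketch below shows how accuracy follows from a confusion matrix: correct predictions on the diagonal, divided by all predictions. The labels are toy values chosen for illustration, not the Titanic predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels for illustration only; these are not the Titanic predictions.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are the true class, columns are the predicted class.
cm = confusion_matrix(y_true, y_pred)

# Accuracy is the share of predictions that fall on the diagonal.
accuracy_from_cm = np.trace(cm) / cm.sum()

print(cm)
print(accuracy_from_cm, accuracy_score(y_true, y_pred))
```

The same call, `confusion_matrix(y_test, y_pred_dt)`, could be applied to the baseline model's predictions.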

## 3. Ensemble Method 1: Random Forest (Bagging)

A **Random Forest** builds many decision trees independently, each on a bootstrap sample of the training data (bagging) and with a random subset of features considered at each split. The final prediction aggregates the individual trees: scikit-learn averages their predicted class probabilities and picks the most probable class. Because the trees make partly uncorrelated errors, this aggregation reduces variance and improves generalization.
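
The aggregation step can be checked directly. The sketch below, which uses synthetic data from `make_classification` as a stand-in for the Titanic features, averages the per-tree class probabilities by hand and compares them with the forest's own output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the Titanic dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Average the class probabilities predicted by each individual tree.
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)

# The forest's probabilities should match the per-tree average, and its
# predicted class is the one with the highest average probability.
probas_match = np.allclose(mean_proba, forest.predict_proba(X_demo))
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]
preds_match = bool((manual_pred == forest.predict(X_demo)).all())
print(probas_match, preds_match)
```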

In [11]:
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(n_estimators=100)
model_rf.fit(X_train, y_train)
y_pred_rf = model_rf.predict(X_test)

print("--- Random Forest Report ---")
print(classification_report(y_test, y_pred_rf))

--- Random Forest Report ---
              precision    recall  f1-score   support

           0       0.80      0.85      0.82       166
           1       0.73      0.65      0.69       101

    accuracy                           0.78       267
   macro avg       0.76      0.75      0.76       267
weighted avg       0.77      0.78      0.77       267



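The Random Forest reaches an accuracy of **78%** on this split. Compared with the single tree's report, its recall is higher for class 0 (0.85) but lower for class 1 (0.65).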

## 4. Ensemble Method 2: Gradient Boosting (Boosting)

**Gradient Boosting** is another powerful ensemble technique, but it works sequentially. It starts by training a simple model (a "weak learner," usually a shallow decision tree). It then trains a second model on the errors of the first and adds it to the ensemble. Each subsequent tree is fitted to the errors that remain in the combined prediction of all previous trees, not only the most recent one. This step-by-step process of learning from errors allows Gradient Boosting to build a single, highly accurate final model.
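
The sequential idea can be sketched by hand. The example below is a simplified illustration, not scikit-learn's implementation: it uses a regression problem with squared error, where the "errors" are simply the residuals, and the synthetic data, `learning_rate`, and tree depth are all assumptions chosen for the demo. `GradientBoostingClassifier` applies the same principle to the gradient of a classification loss.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (an assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from a constant model
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    residuals = y_demo - prediction  # errors of the ensemble built so far
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)      # weak learner fitted to those errors
    prediction += learning_rate * tree.predict(X_demo)  # small corrective step
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

The small learning rate means each tree only partly corrects the remaining error, which is why many trees are needed.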

In [12]:
from sklearn.ensemble import GradientBoostingClassifier

model_gb = GradientBoostingClassifier(n_estimators=100)
model_gb.fit(X_train, y_train)
y_pred_gb = model_gb.predict(X_test)

print("--- Gradient Boosting Report ---")
print(classification_report(y_test, y_pred_gb))

--- Gradient Boosting Report ---
              precision    recall  f1-score   support

           0       0.82      0.91      0.86       166
           1       0.82      0.67      0.74       101

    accuracy                           0.82       267
   macro avg       0.82      0.79      0.80       267
weighted avg       0.82      0.82      0.82       267



The Gradient Boosting classifier achieves the highest accuracy of **82%**.

## 5. Conclusion

| Model | Accuracy |
|:--- |:--- |
| Single Decision Tree | 78% |
| Random Forest (Bagging) | 78% |
| **Gradient Boosting** | **82%** |

On this train/test split:
* **Random Forest** averages many trees to reduce variance. Here it matched the single Decision Tree's accuracy rather than beating it.
* **Gradient Boosting** builds trees sequentially, each one correcting the errors of the ensemble built so far, and reached the highest accuracy.

Both methods are powerful. Gradient Boosting often reaches higher accuracy through iterative error correction, but the ranking can vary with the dataset, the hyperparameters, and the random seed.
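
Because the accuracies above come from a single 70/30 split and the models are not seeded, gaps of a few points can shift between runs. One way to get a steadier comparison is cross-validation with fixed seeds. The sketch below uses synthetic stand-in data, and the `n_estimators=50` and `cv=5` settings are illustrative assumptions rather than the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; swap in the Titanic X and y to repeat the comparison.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "Single Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy: one score per held-out fold.
    cv_scores[name] = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    print(f"{name}: {cv_scores[name].mean():.3f} +/- {cv_scores[name].std():.3f}")
```

Reporting the mean and spread across folds makes it easier to judge whether a difference between models is larger than the noise of a single split.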