# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. To make the comparison fair, apply the same feature scaling and feature engineering as in the previous Lab.

In [3]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [20]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [23]:
#your code here

In [25]:
# Drop rows with missing values; .copy() avoids SettingWithCopyWarning on later column assignments
spaceship_cleaned = spaceship.dropna().copy()

In [27]:
# Reduce Cabin to its deck letter (the part before the first '/')
spaceship_cleaned.loc[:, 'Cabin'] = spaceship_cleaned['Cabin'].apply(lambda x: x.split('/')[0])

In [29]:
# columns
print(spaceship_cleaned.columns)

Index(['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age',
       'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Name', 'Transported'],
      dtype='object')


In [31]:
# Dropping PassengerId and Name
spaceship_cleaned = spaceship_cleaned.drop(columns=['PassengerId', 'Name'])


In [33]:
# categorical columns (excluding target 'Transported')
categorical_cols = spaceship_cleaned.select_dtypes(include=['object', 'bool']).columns.drop('Transported')
spaceship_encoded = pd.get_dummies(spaceship_cleaned, columns=categorical_cols, drop_first=True)
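
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a tiny hand-made frame. The values in `toy` are illustrative assumptions, not rows from the real dataset.

```python
import pandas as pd

# Toy frame: illustrative values only, not rows from the real dataset
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Mars", "Earth"],
    "Age": [24.0, 39.0, 31.0, 16.0],
})

# drop_first=True drops the first category in sorted order ("Earth"),
# so a row with every HomePlanet_* column False means HomePlanet == "Earth"
toy_encoded = pd.get_dummies(toy, columns=["HomePlanet"], drop_first=True)
print(toy_encoded.columns.tolist())
```

Dropping one level per categorical column removes a redundant column, since its value is implied by the others.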

In [35]:
#  Separate features and target
X = spaceship_encoded.drop('Transported', axis=1)
y = spaceship_encoded['Transported']

In [37]:
# Train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [39]:
# Scale the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
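
The scaler above is fitted on the training set only and then applied to the test set, which keeps test-set statistics out of training. One way to make that harder to get wrong is to bundle scaler and model in a scikit-learn pipeline. The sketch below uses synthetic data from `make_classification` as a stand-in for the lab data; the names `X_demo`, `y_demo`, and `pipe` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab data (assumption: binary target, numeric features)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# fit() learns the scaling parameters from X_tr only;
# score() reuses those parameters to transform X_te before predicting
pipe = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=42))
pipe.fit(X_tr, y_tr)
pipe_accuracy = pipe.score(X_te, y_te)
print(f"Pipeline accuracy: {pipe_accuracy:.2%}")
```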

**Perform Train Test Split**

In [10]:
#your code here

In [47]:
from sklearn.model_selection import train_test_split

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [49]:
# Checking the shapes
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (5284, 19)
X_test shape: (1322, 19)
y_train shape: (5284,)
y_test shape: (1322,)


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [11]:
#your code here

In [67]:
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize and fit the model
bagging_model = BaggingClassifier(random_state=42)
bagging_model.fit(X_train_scaled, y_train)


In [69]:
# Making predictions
y_pred_bagging = bagging_model.predict(X_test_scaled)


In [73]:
# Evaluate
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
print(f" Bagging Classifier Accuracy: {accuracy_bagging:.2%}")
print("\n Classification Report:")
print(classification_report(y_test, y_pred_bagging))
print("\n Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_bagging))


 Bagging Classifier Accuracy: 77.76%

 Classification Report:
              precision    recall  f1-score   support

       False       0.78      0.78      0.78       656
        True       0.78      0.78      0.78       666

    accuracy                           0.78      1322
   macro avg       0.78      0.78      0.78      1322
weighted avg       0.78      0.78      0.78      1322


 Confusion Matrix:
[[510 146]
 [148 518]]
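
`BaggingClassifier` samples with replacement by default (`bootstrap=True`), so the model above is bagging. Pasting is the same ensemble with sampling without replacement. The sketch below contrasts the two on synthetic data; the dataset, `n_estimators=50`, and `max_samples=0.7` are assumptions chosen for illustration, not settings from this lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab data (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Each tree sees 70% of the training rows; only the sampling scheme differs
common = dict(n_estimators=50, max_samples=0.7, random_state=42)
bagging = BaggingClassifier(DecisionTreeClassifier(random_state=42), bootstrap=True, **common)
pasting = BaggingClassifier(DecisionTreeClassifier(random_state=42), bootstrap=False, **common)

scores = {}
for name, model in {"bagging": bagging, "pasting": pasting}.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)
    print(f"{name}: {scores[name]:.2%}")
```

With `bootstrap=False`, `max_samples` must be below 1.0 for the trees to differ from one another; otherwise every tree would be trained on the same full training set.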


- Random Forests

In [None]:
#your code here

In [61]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize and fit the model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_scaled, y_train)

In [63]:
# Making predictions
y_pred_rf = rf_model.predict(X_test_scaled)

In [65]:
# Evaluating
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f" Random Forest Accuracy: {accuracy_rf:.2%}")
print("\n Classification Report:")
print(classification_report(y_test, y_pred_rf))
print("\n Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))


 Random Forest Accuracy: 77.99%

 Classification Report:
              precision    recall  f1-score   support

       False       0.78      0.78      0.78       656
        True       0.78      0.78      0.78       666

    accuracy                           0.78      1322
   macro avg       0.78      0.78      0.78      1322
weighted avg       0.78      0.78      0.78      1322


 Confusion Matrix:
[[511 145]
 [146 520]]
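
A fitted random forest also reports how much each feature contributed to its splits, which can help explain the result. The sketch below shows the idea on synthetic data; `X_demo`, `forest`, and the `feature_i` column names are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 6 features, 3 of them informative (assumption)
X_arr, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances, normalized so they sum to 1
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

For the model in this lab, the equivalent would be `pd.Series(rf_model.feature_importances_, index=X_train.columns)`, since `X_train_scaled` keeps the column order of `X_train`.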


- Gradient Boosting

In [None]:
#your code here

In [76]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize and fit the model
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train_scaled, y_train)

In [78]:
# Making predictions
y_pred_gb = gb_model.predict(X_test_scaled)

In [80]:
# Evaluate
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print(f" Gradient Boosting Accuracy: {accuracy_gb:.2%}")
print("\n Classification Report:")
print(classification_report(y_test, y_pred_gb))
print("\n Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_gb))


 Gradient Boosting Accuracy: 78.37%

 Classification Report:
              precision    recall  f1-score   support

       False       0.82      0.72      0.77       656
        True       0.75      0.85      0.80       666

    accuracy                           0.78      1322
   macro avg       0.79      0.78      0.78      1322
weighted avg       0.79      0.78      0.78      1322


 Confusion Matrix:
[[471 185]
 [101 565]]
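
Gradient boosting adds trees one at a time, each fitted to the errors left by the trees before it. `staged_predict` exposes the prediction after every stage, so the effect of adding trees can be inspected directly. This is a minimal sketch on synthetic data; the dataset and the name `gb_demo` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=12, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

gb_demo = GradientBoostingClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# One test accuracy per boosting stage: index k holds the ensemble of the first k+1 trees
staged_accuracy = [
    accuracy_score(y_te, y_stage) for y_stage in gb_demo.staged_predict(X_te)
]
for n_trees in (1, 10, 50, 100):
    print(f"{n_trees:>3} trees: {staged_accuracy[n_trees - 1]:.2%}")
```

If test accuracy stops improving well before the last stage, a smaller `n_estimators` may be enough.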


- Adaptive Boosting

In [None]:
#your code here

In [82]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize and fit the model
ada_model = AdaBoostClassifier(random_state=42)
ada_model.fit(X_train_scaled, y_train)



In [84]:
# Making predictions
y_pred_ada = ada_model.predict(X_test_scaled)

In [86]:
# Evaluating
accuracy_ada = accuracy_score(y_test, y_pred_ada)
print(f" AdaBoost Accuracy: {accuracy_ada:.2%}")
print("\n Classification Report:")
print(classification_report(y_test, y_pred_ada))
print("\n Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_ada))


 AdaBoost Accuracy: 78.74%

 Classification Report:
              precision    recall  f1-score   support

       False       0.81      0.74      0.78       656
        True       0.77      0.83      0.80       666

    accuracy                           0.79      1322
   macro avg       0.79      0.79      0.79      1322
weighted avg       0.79      0.79      0.79      1322


 Confusion Matrix:
[[488 168]
 [113 553]]
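
The headline metrics can be recomputed from the four confusion matrices printed above, which makes the models easier to compare side by side. The matrices below are copied from those outputs (rows are actual classes, columns are predicted classes, in the order False, True); the `summary` dictionary is just a convenience for this sketch.

```python
import numpy as np

# Confusion matrices copied from the outputs above
confusion_matrices = {
    "Bagging": np.array([[510, 146], [148, 518]]),
    "Random Forest": np.array([[511, 145], [146, 520]]),
    "Gradient Boosting": np.array([[471, 185], [101, 565]]),
    "AdaBoost": np.array([[488, 168], [113, 553]]),
}

summary = {}
for name, cm in confusion_matrices.items():
    tn, fp, fn, tp = cm.ravel()
    summary[name] = {
        "accuracy": (tp + tn) / cm.sum(),   # correct predictions / all predictions
        "precision_true": tp / (tp + fp),   # predicted True that were actually True
        "recall_true": tp / (tp + fn),      # actual True that were predicted True
    }
    m = summary[name]
    print(f"{name:<18} acc={m['accuracy']:.4f} "
          f"prec={m['precision_true']:.2f} rec={m['recall_true']:.2f}")
```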


Which model is the best and why?

In [None]:
#comment here

In [90]:
print(" AdaBoost: highest test accuracy (78.74%), with Gradient Boosting close behind (78.37%). Both boosting models learn from the errors of earlier estimators.")

 AdaBoost: highest test accuracy (78.74%), with Gradient Boosting close behind (78.37%). Both boosting models learn from the errors of earlier estimators.


# Model Evaluation Summary

In [95]:
print(" Random Forest: Good baseline model, generally robust and easy to use.")
print(" Bagging Classifier: Similar to Random Forest, but may underperform if base estimators are too simple.")

print(" Gradient Boosting: Second-highest accuracy (78.37%), with the highest recall for the True class (0.85) but the lowest for False (0.72).\n")

print(" Conclusion:")
print(" AdaBoost is the best performing model on this test split (78.74% accuracy), likely because boosting focuses on difficult cases and reduces errors iteratively.")
print(" The margin is small: all four models fall between 77.76% and 78.74%, and Bagging and Random Forest show the most even precision and recall (0.78 for both classes).")


 Random Forest: Good baseline model, generally robust and easy to use.
 Bagging Classifier: Similar to Random Forest, but may underperform if base estimators are too simple.
 Gradient Boosting: Second-highest accuracy (78.37%), with the highest recall for the True class (0.85) but the lowest for False (0.72).

 Conclusion:
 AdaBoost is the best performing model on this test split (78.74% accuracy), likely because boosting focuses on difficult cases and reduces errors iteratively.
 The margin is small: all four models fall between 77.76% and 78.74%, and Bagging and Random Forest show the most even precision and recall (0.78 for both classes).
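
The four test accuracies above differ by less than one percentage point, so a single train/test split may not be enough to rank the models reliably. Cross-validation gives a mean and a spread per model. The sketch below runs on synthetic data with reduced `n_estimators` to keep it fast; the dataset, the fold count, and the `cv_results` name are assumptions, and the numbers it prints say nothing about the Spaceship Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the lab data (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)

models = {
    "Bagging": BaggingClassifier(n_estimators=20, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=42),
}

# The same stratified folds are reused for every model so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_results[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name:<18} {fold_scores.mean():.2%} +/- {fold_scores.std():.2%}")
```

To apply this to the lab, `X_demo` and `y_demo` would be replaced by `X` and `y`. Tree ensembles are largely insensitive to feature scaling, so scaling could be skipped here or wrapped in a pipeline.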
