# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [3]:
spaceship.shape

(8693, 14)

In [4]:
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


In [5]:
spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

#### Handle Missing Data (by Imputing Missing Values):

Replace NaN with the median or mean (for numeric columns) or the mode (for categorical columns):

In [6]:
# Separate numerical and categorical columns
# =========================================

# Identify numerical columns (all number-based features)
numerical_cols = spaceship.select_dtypes(include=['int64', 'float64']).columns.tolist()
print("Numerical columns:", numerical_cols)

# Identify categorical columns (all object/boolean features)
categorical_cols = spaceship.select_dtypes(include=['object', 'bool']).columns.tolist()
# Remove the target variable 'Transported' (a bool column) from the categorical columns
if 'Transported' in categorical_cols:
    categorical_cols.remove('Transported')
print("Categorical columns:", categorical_cols)

# Create separate DataFrames for each type
numerical_data = spaceship[numerical_cols]
categorical_data = spaceship[categorical_cols]

# Impute missing values
# =====================
from sklearn.impute import SimpleImputer

# For numerical data: fill missing values with median
num_imputer = SimpleImputer(strategy='median')
numerical_data_imputed = pd.DataFrame(
    num_imputer.fit_transform(numerical_data),
    columns=numerical_data.columns
)

# For categorical data: fill missing values with most frequent category
cat_imputer = SimpleImputer(strategy='most_frequent')
categorical_data_imputed = pd.DataFrame(
    cat_imputer.fit_transform(categorical_data),
    columns=categorical_data.columns
)

# Recombine the data
# ==================
# Add back the target variable
clean_spaceship = pd.concat([
    numerical_data_imputed,
    categorical_data_imputed,
    spaceship['Transported']
], axis=1)

# Verify no missing values remain
print("\nMissing values after imputation:")
print(clean_spaceship.isnull().sum())

# Show the cleaned data
clean_spaceship.head()

Numerical columns: ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
Categorical columns: ['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP', 'Name']

Missing values after imputation:
Age             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
VIP             0
Name            0
Transported     0
dtype: int64


Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,VIP,Name,Transported
0,39.0,0.0,0.0,0.0,0.0,0.0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,False,Maham Ofracculy,False
1,24.0,109.0,9.0,25.0,549.0,44.0,0002_01,Earth,False,F/0/S,TRAPPIST-1e,False,Juanna Vines,True
2,58.0,43.0,3576.0,0.0,6715.0,49.0,0003_01,Europa,False,A/0/S,TRAPPIST-1e,True,Altark Susent,False
3,33.0,0.0,1283.0,371.0,3329.0,193.0,0003_02,Europa,False,A/0/S,TRAPPIST-1e,False,Solam Susent,False
4,16.0,303.0,70.0,151.0,565.0,2.0,0004_01,Earth,False,F/1/S,TRAPPIST-1e,False,Willy Santantines,True


In [7]:
# spaceship = spaceship.dropna()
clean_spaceship.isnull().sum()

Age             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
VIP             0
Name            0
Transported     0
dtype: int64

In [8]:
clean_spaceship.select_dtypes(include=['number']).info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Age           8693 non-null   float64
 1   RoomService   8693 non-null   float64
 2   FoodCourt     8693 non-null   float64
 3   ShoppingMall  8693 non-null   float64
 4   Spa           8693 non-null   float64
 5   VRDeck        8693 non-null   float64
dtypes: float64(6)
memory usage: 407.6 KB


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [9]:
#your code here
from sklearn.preprocessing import MinMaxScaler

# Feature Scaling
numeric_cols = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
scaler = MinMaxScaler()
clean_spaceship[numeric_cols] = scaler.fit_transform(clean_spaceship[numeric_cols])

# Feature Selection
clean_spaceship = clean_spaceship.drop(columns=['PassengerId', 'Name', 'Cabin'])

# Convert categorical columns to numeric (one-hot encoding)
clean_spaceship = pd.get_dummies(clean_spaceship, columns=['HomePlanet', 'Destination', 'VIP', 'CryoSleep'], drop_first = True)

In [10]:
clean_spaceship[numeric_cols].head()


Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
0,0.493671,0.0,0.0,0.0,0.0,0.0
1,0.303797,0.007608,0.000302,0.001064,0.0245,0.001823
2,0.734177,0.003001,0.119948,0.0,0.29967,0.00203
3,0.417722,0.0,0.043035,0.015793,0.148563,0.007997
4,0.202532,0.021149,0.002348,0.006428,0.025214,8.3e-05
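
The scaled values above follow the min-max formula `(x - min) / (max - min)`. The sketch below checks that formula against `MinMaxScaler` on a few toy ages. It assumes the `Age` column spans 0 to 79, which is consistent with the scaled values shown (39 / 79 ≈ 0.4937), but that range is an assumption and is not printed anywhere in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy ages; 0 and 79 are ASSUMED to be the column minimum and maximum
ages = np.array([[0.0], [24.0], [39.0], [79.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (ages - ages.min()) / (ages.max() - ages.min())

# The same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(ages)

print("manual :", np.round(manual.ravel(), 6))
print("sklearn:", np.round(scaled.ravel(), 6))
```

Because the spending columns have a few very large values, min-max scaling squeezes most passengers close to 0, which is why those columns show such small numbers.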


In [11]:
clean_spaceship.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Europa,HomePlanet_Mars,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_True,CryoSleep_True
0,0.493671,0.0,0.0,0.0,0.0,0.0,False,True,False,False,True,False,False
1,0.303797,0.007608,0.000302,0.001064,0.0245,0.001823,True,False,False,False,True,False,False
2,0.734177,0.003001,0.119948,0.0,0.29967,0.00203,False,True,False,False,True,True,False
3,0.417722,0.0,0.043035,0.015793,0.148563,0.007997,False,True,False,False,True,False,False
4,0.202532,0.021149,0.002348,0.006428,0.025214,8.3e-05,True,False,False,False,True,False,False
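
`HomePlanet_Earth` does not appear in the encoded columns because `drop_first=True` drops the first category of each encoded column. A minimal sketch on a toy frame (the three-row `toy` DataFrame is made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with the three home planets seen in the data
toy = pd.DataFrame({"HomePlanet": ["Earth", "Europa", "Mars"]})

# drop_first=True removes the first category in sorted order ("Earth"),
# so Earth is represented by all remaining dummy columns being False/0
encoded = pd.get_dummies(toy, columns=["HomePlanet"], drop_first=True)

print(encoded)
```

Dropping one level avoids a redundant column: a passenger who is neither from Europa nor from Mars must be from Earth.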


**Perform Train Test Split**

In [12]:
#your code here
X = clean_spaceship.drop('Transported', axis=1)
y = clean_spaceship.Transported

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [13]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Age                        8693 non-null   float64
 1   RoomService                8693 non-null   float64
 2   FoodCourt                  8693 non-null   float64
 3   ShoppingMall               8693 non-null   float64
 4   Spa                        8693 non-null   float64
 5   VRDeck                     8693 non-null   float64
 6   HomePlanet_Europa          8693 non-null   bool   
 7   HomePlanet_Mars            8693 non-null   bool   
 8   Destination_PSO J318.5-22  8693 non-null   bool   
 9   Destination_TRAPPIST-1e    8693 non-null   bool   
 10  VIP_True                   8693 non-null   bool   
 11  CryoSleep_True             8693 non-null   bool   
dtypes: bool(6), float64(6)
memory usage: 458.5 KB
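
In the cells above, the imputer and the scaler are fit on the full dataset before the train/test split, so statistics from the test rows influence the preprocessing. One possible alternative, sketched here on synthetic data (the `X_demo` frame and its toy target are made up, not the Spaceship Titanic data), is to split first and put the preprocessing inside a scikit-learn pipeline so that it is fit on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for two numeric columns (NOT the real data)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "Age": rng.integers(0, 80, size=200).astype(float),
    "Spa": rng.exponential(300.0, size=200),
})
y_demo = X_demo["Spa"] < 200                         # toy target
X_demo.loc[X_demo.index % 10 == 0, "Age"] = np.nan   # inject missing values

# Split first, so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Imputer and scaler are fit on X_tr only when pipe.fit is called
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    MinMaxScaler(),
    DecisionTreeClassifier(max_depth=3, random_state=42),
)
pipe.fit(X_tr, y_tr)
demo_score = pipe.score(X_te, y_te)

print(f"Pipeline accuracy on toy data: {demo_score:.4f}")
```

If the previous Lab fit its preprocessing on the full dataset, keeping the notebook's approach preserves the like-for-like comparison this Lab asks for; the pipeline is shown only as an option.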


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Simple Decision Tree as a Baseline

In [14]:
#### Simple Decision Tree as a Baseline

from sklearn.tree import DecisionTreeClassifier

# Train a simple tree
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

# Check accuracy
print(f"Tree Accuracy (without depth value): {tree.score(X_test, y_test):.4f}")


tree3 = DecisionTreeClassifier(max_depth=3, random_state=42)  # Stops at 3 levels
tree3.fit(X_train, y_train)
tree3_score = tree3.score(X_test, y_test)
print(f"Tree Accuracy (depth=3): {tree3_score:.4f}")

Tree Accuracy (without depth value): 0.7324
Tree Accuracy (depth=3): 0.7186


- Bagging and Pasting

In [15]:
#your code here
from sklearn.ensemble import BaggingClassifier

# Base model (usually a decision tree)
base_model = DecisionTreeClassifier(max_depth=3, random_state=42)

bagging_clas = BaggingClassifier(
    base_model, # depth 3 to force tree to be "weak"
    n_estimators=10, # 10 trees
    max_samples=100, # we limit each weaker tree to 100 datapoints
    random_state=42) # same fixing random state as before

# Train and evaluate
bagging_clas.fit(X_train, y_train)
bagging_clas_score = bagging_clas.score(X_test,y_test)

print(f"Bagging Tree Accuracy (depth=3): {bagging_clas_score:.4f}")


Bagging Tree Accuracy (depth=3): 0.7818


In [16]:
pasting = BaggingClassifier(
    base_model,
    n_estimators=100,
    max_samples=0.5,
    bootstrap=False,  # Pasting (no replacement)
    random_state=42
)
pasting.fit(X_train, y_train)
pasting_clas_score = pasting.score(X_test, y_test)
print(f"Pasting Accuracy: {pasting_clas_score:.4f}")

Pasting Accuracy: 0.7523
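
The only conceptual difference between bagging and pasting is how each base estimator's rows are drawn: with replacement (`bootstrap=True`, the default) or without (`bootstrap=False`). A rough illustration with NumPy, using a made-up training set of 10 rows:

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 10   # hypothetical training set size
n_draw = 5    # rows handed to each base estimator

# Bagging: sample WITH replacement, so a row index can repeat
bagging_idx = rng.choice(n_rows, size=n_draw, replace=True)

# Pasting: sample WITHOUT replacement, so every row index is unique
pasting_idx = rng.choice(n_rows, size=n_draw, replace=False)

print("Bagging (with replacement):   ", np.sort(bagging_idx))
print("Pasting (without replacement):", np.sort(pasting_idx))
```

Note that the two cells above also differ in `n_estimators` (10 vs. 100) and `max_samples` (100 rows vs. half of the training set), so their scores are not a like-for-like comparison of the sampling scheme alone.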


- Random Forests

In [17]:
#your code here

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=10, # same 10 trees
                               max_depth=3, # depth 3 to force tree to be "weak"
                               random_state=42) # fixed seed for reproducibility
forest.fit(X_train, y_train)
forest_clas_score = forest.score(X_test,y_test)

print(f"Forest Accuracy: {forest_clas_score:.4f}")


Forest Accuracy: 0.7538


- Gradient Boosting

In [18]:
#your code here

from sklearn.ensemble import GradientBoostingClassifier

gb_clas = GradientBoostingClassifier(max_depth=5, # builds its own trees internally, so no base estimator is passed
                                   n_estimators=50,
                                   random_state=42 # fixed seed for reproducibility
                                   )
gb_clas.fit(X_train, y_train)
gb_clas_score = gb_clas.score(X_test,y_test)

print(f"Gradient Boosting accuracy: {gb_clas_score:.4f}")

Gradient Boosting accuracy: 0.7849
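
Gradient boosting adds trees one at a time, so it can be useful to see how test accuracy changes as trees are added. The sketch below uses `staged_predict` on synthetic data from `make_classification` (the dataset and the `gb_demo` name are illustrative, not part of this Lab's data).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (NOT the Spaceship Titanic data)
X_demo, y_demo = make_classification(n_samples=500, n_features=12, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

gb_demo = GradientBoostingClassifier(max_depth=5, n_estimators=50, random_state=42)
gb_demo.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., n trees
staged_acc = [accuracy_score(y_te, pred) for pred in gb_demo.staged_predict(X_te)]
best_n = int(np.argmax(staged_acc)) + 1

print(f"Accuracy with all {len(staged_acc)} trees: {staged_acc[-1]:.4f}")
print(f"Best accuracy at {best_n} trees: {staged_acc[best_n - 1]:.4f}")
```

Choosing the number of trees this way uses the test set for tuning, so in practice a separate validation split would be the safer place to run it.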


- Adaptive Boosting

In [19]:
#your code here
from sklearn.ensemble import AdaBoostClassifier

ada_clas = AdaBoostClassifier(base_model, # base estimator: the depth-3 tree defined earlier (the default is a depth-1 stump)
                              n_estimators=50,
                              random_state=42 # fixed seed for reproducibility
                              )

ada_clas.fit(X_train, y_train)
ada_clas_score = ada_clas.score(X_test,y_test)

print(f"AdaBoost accuracy: {ada_clas_score:.4f}")

AdaBoost accuracy: 0.7826


Which model is the best and why?

In [20]:
#comment here

# Compare all model performances
print("\n" + "="*50)
print("FINAL MODEL COMPARISON")
print("="*50)

# Create a comparison dictionary
model_comparison = {
    "Decision Tree": tree3_score,
    "Bagging": bagging_clas_score,
    "Pasting": pasting_clas_score,
    "Random Forest": forest_clas_score,
    "Gradient Boosting": gb_clas_score,
    "AdaBoost": ada_clas_score
}

# Print results in descending order
print("\nMODEL ACCURACIES (Highest to Lowest):")
for model, score in sorted(model_comparison.items(), 
                          key=lambda x: x[1], 
                          reverse=True):
    print(f"- {model + ':':<20} {score:.4f} ({score*100:.1f}%)")

# Identify best model
best_model = max(model_comparison, key=model_comparison.get)
best_score = model_comparison[best_model]

# Justification analysis
print("\nCONCLUSION:")
print(f"The best model is {best_model} with {best_score*100:.1f}% accuracy because:")
print("- It outperformed other models by " + 
      ", ".join([f"{best_score - score:.4f} over {model}" 
                for model, score in model_comparison.items() 
                if model != best_model]))
print("\nKey observations:")
print("1. Ensemble methods consistently outperformed the single Decision Tree")
print(f"2. {best_model} had the highest accuracy on this single test split")
print("3. The accuracy gap between the top two models was " + 
      f"{max(model_comparison.values()) - sorted(model_comparison.values())[-2]:.4f}")


FINAL MODEL COMPARISON

MODEL ACCURACIES (Highest to Lowest):
- Gradient Boosting:   0.7849 (78.5%)
- AdaBoost:            0.7826 (78.3%)
- Bagging:             0.7818 (78.2%)
- Random Forest:       0.7538 (75.4%)
- Pasting:             0.7523 (75.2%)
- Decision Tree:       0.7186 (71.9%)

CONCLUSION:
The best model is Gradient Boosting with 78.5% accuracy because:
- It outperformed other models by 0.0663 over Decision Tree, 0.0031 over Bagging, 0.0326 over Pasting, 0.0311 over Random Forest, 0.0023 over AdaBoost

Key observations:
1. Ensemble methods consistently outperformed the single Decision Tree
2. Gradient Boosting had the highest accuracy on this single test split
3. The accuracy gap between the top two models was 0.0023
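
The top three models are within about 0.003 of each other on one test split, which is a small margin to rank them on. A possible follow-up, sketched here on synthetic data (the `candidates` dictionary and the `make_classification` dataset are illustrative assumptions), is to compare models by cross-validated mean and standard deviation instead of a single score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (NOT the Spaceship Titanic data)
X_demo, y_demo = make_classification(n_samples=500, n_features=12, random_state=42)

# Same hyperparameters as the corresponding cells in this notebook
candidates = {
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=10, max_depth=3, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(max_depth=5, n_estimators=50, random_state=42),
}

# 5-fold cross-validation: each model is scored on five different held-out folds
cv_results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"- {name + ':':<20} {scores.mean():.4f} +/- {scores.std():.4f}")
```

If the gap between two models is smaller than the fold-to-fold standard deviation, the ranking between them should be treated as inconclusive.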
