# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [76]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score



In [28]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.shape

(8693, 14)

In [29]:
# Drop rows with missing values and renumber the index
spaceship = spaceship.dropna().reset_index(drop=True)
spaceship.shape

(6606, 14)

In [30]:
# Keep only the first letter of Cabin

spaceship['Cabin'] = spaceship['Cabin'].str[0]

print(spaceship.Cabin.value_counts())

Cabin
F    2152
G    1973
E     683
B     628
C     587
D     374
A     207
T       2
Name: count, dtype: int64


In [31]:
# Drop PassengerId and Name
spaceship = spaceship.drop(['PassengerId', 'Name'], axis=1)

Now perform the same as before:
- Feature Scaling
- Feature Selection


In [33]:
# One-hot encode the non-numerical columns
# Get non-numerical columns
non_num_columns = spaceship.select_dtypes(include=['object']).columns

# Use get_dummies to transform these columns
# dtype=int produces 0/1 columns instead of booleans
spaceship_with_dummies = pd.get_dummies(spaceship, columns=non_num_columns, dtype=int)

print(spaceship_with_dummies)

       Age  RoomService  FoodCourt  ShoppingMall     Spa  VRDeck  Transported  \
0     39.0          0.0        0.0           0.0     0.0     0.0        False   
1     24.0        109.0        9.0          25.0   549.0    44.0         True   
2     58.0         43.0     3576.0           0.0  6715.0    49.0        False   
3     33.0          0.0     1283.0         371.0  3329.0   193.0        False   
4     16.0        303.0       70.0         151.0   565.0     2.0         True   
...    ...          ...        ...           ...     ...     ...          ...   
6601  41.0          0.0     6819.0           0.0  1643.0    74.0        False   
6602  18.0          0.0        0.0           0.0     0.0     0.0        False   
6603  26.0          0.0        0.0        1872.0     1.0     0.0         True   
6604  32.0          0.0     1049.0           0.0   353.0  3235.0        False   
6605  44.0        126.0     4688.0           0.0     0.0    12.0         True   

      HomePlanet_Earth  Hom

**Perform Train Test Split**

In [35]:
features = spaceship_with_dummies.drop('Transported', axis=1)
target = spaceship_with_dummies['Transported']

# train test split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)

**Scaling**

In [40]:
# Scale features

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)   # fit scaler on training data, transform training data
X_test_scaled = scaler.transform(X_test)         # only transform test data

**Model Selection** - now you will apply different ensemble methods to try to get a better model

- Bagging and Pasting

In [62]:
# Create the model
bagging = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=100,
                            max_samples=0.8,
                            max_features=1.0,
                            random_state=0)

In [64]:
# Fit the bagging model

bagging.fit(X_train_scaled, y_train)

# Predict and evaluate

y_pred = bagging.predict(X_test_scaled)
accuracy = bagging.score(X_test_scaled, y_test)
print("Bagging accuracy:", accuracy)



Bagging accuracy: 0.7881996974281392


In [65]:
# More evaluation 

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))


Accuracy: 0.7881996974281392
Precision: 0.7908396946564885
Recall: 0.783661119515885
F1 Score: 0.7872340425531915
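
The model above uses the default `bootstrap=True`, so it is bagging (sampling with replacement). A minimal sketch of the pasting variant is shown below. It assumes a synthetic dataset from `make_classification` as a stand-in so the cell runs on its own; the names `X_demo`, `Xd_train`, and `pasting` are illustrative and not part of the lab. In the lab itself you would fit on `X_train_scaled` and `y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); in the lab use X_train_scaled / y_train.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# bootstrap=False switches from bagging (sampling with replacement)
# to pasting (sampling without replacement).
pasting = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=0.8,
    bootstrap=False,
    random_state=0,
)
pasting.fit(Xd_train, yd_train)

pasting_accuracy = pasting.score(Xd_test, yd_test)
print("Pasting accuracy (synthetic data):", pasting_accuracy)
```

With pasting, each tree sees a distinct 80% subset of the training rows with no repeated rows, so `max_samples` must stay below 1.0 for the trees to differ from one another.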


- Random Forests

In [68]:
# Create the model
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Fit the model 
forest.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = forest.predict(X_test_scaled)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))


Random Forest Accuracy: 0.7912254160363086
Precision: 0.7975270479134466
Recall: 0.7806354009077155
F1 Score: 0.7889908256880734
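
A fitted random forest also exposes `feature_importances_`, which can help with the feature selection step mentioned earlier. The sketch below is one possible way to inspect it. It assumes synthetic data and placeholder column names (`feature_0`, `feature_1`, ...); in the lab you would use `forest` and `features.columns` instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest_demo = RandomForestClassifier(n_estimators=100, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Impurity-based importances, normalized to sum to 1, sorted high to low.
importances = pd.Series(
    forest_demo.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can overstate high-cardinality features, so treat the ranking as a starting point rather than a final selection.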


- Gradient Boosting

In [74]:
# Create the model
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)

# Fit the model
gb.fit(X_train_scaled, y_train)

# Predict
y_pred = gb.predict(X_test_scaled)

# Evaluate
print("Gradient Boosting Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))

Gradient Boosting Accuracy: 0.7859304084720121
Precision: 0.761049723756906
Recall: 0.8335854765506808
F1 Score: 0.7956678700361011
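
Gradient Boosting shows lower precision but higher recall than the previous models. The `confusion_matrix` function imported at the top makes that trade-off visible. The sketch below uses a small hand-made set of labels (an assumption, not the lab's predictions) to show how precision and recall are derived from the four cells.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made toy labels (assumption); in the lab pass y_test and y_pred.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 1, 0, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_demo = tp / (tp + fp)  # share of predicted positives that are correct
recall_demo = tp / (tp + fn)     # share of actual positives that are found

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("Precision:", precision_demo, "Recall:", recall_demo)
```

More false positives lower precision, while fewer false negatives raise recall, which is the pattern seen in the Gradient Boosting scores above.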


- Adaptive Boosting

In [78]:
# Create and fit the model
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X_train_scaled, y_train)

# Predict
y_pred = ada.predict(X_test_scaled)

# Evaluate
print("AdaBoost Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))



AdaBoost Accuracy: 0.7844175491679274
Precision: 0.7685714285714286
Recall: 0.8139183055975794
F1 Score: 0.7905951506245408


Which model is the best and why?

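In [None]:
# On accuracy, Random Forest is the best model (0.791), and it also has the highest precision (0.798).
# Gradient Boosting has the highest recall (0.834) and the highest F1 score (0.796), so it is the better choice if recall matters more.
# All four models are within about 0.7 percentage points of each other in accuracy on this single test split, so the ranking is not conclusive.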
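
Because the four accuracy scores are so close, a single train/test split may not be enough to rank the models. One option is k-fold cross-validation, sketched below. It assumes synthetic data and smaller `n_estimators` values to keep the cell fast; the `models` and `cv_results` names are illustrative. In the lab you would pass the scaled training features and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); in the lab use the real features and target.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

models = {
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=25, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=25, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=25, random_state=0),
}

# Mean and standard deviation of accuracy across 5 folds for each model.
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the gap between two models is smaller than their fold-to-fold standard deviation, the difference is probably not meaningful.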