# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [43]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import classification_report

In [3]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [4]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [5]:
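#your code here
# Drop rows with missing values
spaceship.dropna(inplace=True)
# Keep only the deck letter from Cabin (values look like deck/num/side, e.g. B/0/P)
spaceship['Cabin'] = spaceship['Cabin'].str.split('/').str[0]
# Drop identifier columns that are not used as features
spaceship = spaceship.drop(['PassengerId', 'Name'], axis = 1)
# Convert the boolean flags to 0/1
spaceship["CryoSleep"] = spaceship["CryoSleep"].astype(int)
spaceship["VIP"] = spaceship["VIP"].astype(int)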


In [6]:
df_space_transformed = pd.merge(left=spaceship,
                              right= pd.get_dummies(spaceship[['HomePlanet', 'Cabin', 'Destination']], dtype=int, drop_first=True),
                              left_index=True,
                              right_index=True)
df_space_transformed = df_space_transformed.drop(['HomePlanet', 'Cabin', 'Destination'], axis = 1)
df_space_transformed.dtypes

CryoSleep                      int32
Age                          float64
VIP                            int32
RoomService                  float64
FoodCourt                    float64
ShoppingMall                 float64
Spa                          float64
VRDeck                       float64
Transported                     bool
HomePlanet_Europa              int32
HomePlanet_Mars                int32
Cabin_B                        int32
Cabin_C                        int32
Cabin_D                        int32
Cabin_E                        int32
Cabin_F                        int32
Cabin_G                        int32
Cabin_T                        int32
Destination_PSO J318.5-22      int32
Destination_TRAPPIST-1e        int32
dtype: object
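
The merge-then-drop encoding above can also be written as a single `pd.get_dummies` call on the whole frame. The sketch below is a minimal illustration on a made-up three-row frame (`toy` and its values are assumptions, not rows from the dataset); passing `columns=` encodes those columns and removes the originals in one step.

```python
import pandas as pd

# Tiny stand-in frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Mars"],
    "Cabin": ["B", "F", "A"],
    "Age": [39.0, 24.0, 58.0],
})

# `columns=` one-hot encodes the listed columns and drops the originals;
# `drop_first=True` removes the first category of each to avoid redundancy.
toy_encoded = pd.get_dummies(
    toy,
    columns=["HomePlanet", "Cabin"],
    dtype=int,
    drop_first=True,
)

print(toy_encoded.columns.tolist())
```

Columns not listed in `columns=` (here `Age`) pass through unchanged.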

**Perform Train Test Split**

In [7]:
#your code here
features = df_space_transformed.drop(columns=["Transported"])
target = df_space_transformed["Transported"].astype(int)

In [8]:
#your code here
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)

In [9]:
normalizer = MinMaxScaler()
normalizer.fit(X_train)
X_train_norm = normalizer.transform(X_train)
X_test_norm = normalizer.transform(X_test)

In [10]:
X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns)
X_test_norm = pd.DataFrame(X_test_norm, columns = X_test.columns)


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

In [11]:
tree = DecisionTreeClassifier(max_depth=10)
tree.fit(X_train_norm, y_train)

In [27]:
pred = tree.predict(X_test_norm)

print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.80      0.72      0.76       661
           1       0.74      0.82      0.78       661

    accuracy                           0.77      1322
   macro avg       0.77      0.77      0.77      1322
weighted avg       0.77      0.77      0.77      1322



In [13]:
tree_importance = {feature : importance for feature, importance in zip(X_train_norm, tree.feature_importances_)}
tree_importance           

{'CryoSleep': 0.368381816071068,
 'Age': 0.06350805711465617,
 'VIP': 0.00016368015631942548,
 'RoomService': 0.08821030116872619,
 'FoodCourt': 0.09332006998240189,
 'ShoppingMall': 0.037370832911335077,
 'Spa': 0.09589272903123236,
 'VRDeck': 0.1259714294811913,
 'HomePlanet_Europa': 0.013570977889028566,
 'HomePlanet_Mars': 0.003795439107232034,
 'Cabin_B': 0.002762769930489119,
 'Cabin_C': 0.004591436638675179,
 'Cabin_D': 0.0065618674096864045,
 'Cabin_E': 0.018868313684785115,
 'Cabin_F': 0.002234490191396501,
 'Cabin_G': 0.06023970809134897,
 'Cabin_T': 0.0,
 'Destination_PSO J318.5-22': 0.003891829606271087,
 'Destination_TRAPPIST-1e': 0.01066425153415657}
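
A dictionary of importances is easier to read once it is sorted. The sketch below shows one way to do that with a `pandas.Series`; it uses synthetic data from `make_classification` as a stand-in, and the names `demo_tree`, `importances` and `top_features` are illustrative rather than part of the lab.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Spaceship Titanic features).
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(6)])

demo_tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

# Pair each importance with its column name and sort from most to least important.
importances = pd.Series(demo_tree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# Names of the three highest-ranked features, e.g. to build a reduced feature set.
top_features = importances.head(3).index.tolist()
print(top_features)
```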

Test with only a few columns: the five features that the decision tree above ranks as most important.

In [14]:
features_adjusted = df_space_transformed[['CryoSleep', 'VRDeck', 'RoomService', 'FoodCourt', 'Spa']]

In [15]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(features_adjusted, target, test_size=0.20, random_state=0)

In [16]:
normalizer1 = StandardScaler()

normalizer1.fit(X_train1)
X_train1_norm = normalizer1.transform(X_train1)
X_test1_norm = normalizer1.transform(X_test1)

In [36]:
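#full data
lr = LogisticRegression()
lr.fit(X_train_norm, y_train)
# Predict on the scaled test set, since the model was fit on scaled features.
# The report below was produced with the unscaled X_test; re-run this cell to refresh it.
pred_lr = lr.predict(X_test_norm)

print(classification_report(y_test, pred_lr))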

              precision    recall  f1-score   support

           0       0.54      0.86      0.66       661
           1       0.65      0.25      0.36       661

    accuracy                           0.56      1322
   macro avg       0.59      0.56      0.51      1322
weighted avg       0.59      0.56      0.51      1322
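
The logistic regression above is fit on scaled features, so it needs equally scaled inputs at prediction time. One way to make that automatic is to bundle the scaler and the model in a scikit-learn pipeline. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the lab's data), and the name `scaled_lr` is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data, blown up so the raw scale is far from [0, 1].
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X = X * 1000
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# The pipeline fits the scaler on the training data only and reapplies the
# same transformation inside predict() and score().
scaled_lr = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

# Raw test data goes in; scaling happens inside the pipeline.
test_accuracy = scaled_lr.score(X_te, y_te)
print(test_accuracy)
```

With this design there is no separate scaled copy of the test set to forget about.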



In [39]:
#selected features
lr = LogisticRegression()
lr.fit(X_train1_norm, y_train1)
pred_lr = lr.predict(X_test1_norm)

print(classification_report(y_test1, pred_lr))

              precision    recall  f1-score   support

           0       0.80      0.75      0.77       661
           1       0.76      0.81      0.79       661

    accuracy                           0.78      1322
   macro avg       0.78      0.78      0.78      1322
weighted avg       0.78      0.78      0.78      1322



- Bagging and Pasting

In [21]:
# Bagging: bootstrap=True (the default) draws each estimator's 1000 samples with replacement
bagging_cla = BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                               n_estimators=100,
                               max_samples = 1000)

In [22]:
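# Pasting: bootstrap=False draws each estimator's 1000 samples without replacement
# (despite the `_boot` suffix in the variable name)
bagging_cla_boot = BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                               n_estimators=100,
                               max_samples = 1000, bootstrap=False)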

In [37]:
#without pasting
bagging_cla.fit(X_train_norm, y_train)
pred = bagging_cla.predict(X_test_norm)

print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.79      0.79      0.79       661
           1       0.79      0.79      0.79       661

    accuracy                           0.79      1322
   macro avg       0.79      0.79      0.79      1322
weighted avg       0.79      0.79      0.79      1322



In [38]:
#with pasting
bagging_cla_boot.fit(X_train_norm, y_train)
pred_boot = bagging_cla_boot.predict(X_test_norm)

print(classification_report(y_test, pred_boot))

              precision    recall  f1-score   support

           0       0.79      0.79      0.79       661
           1       0.79      0.79      0.79       661

    accuracy                           0.79      1322
   macro avg       0.79      0.79      0.79      1322
weighted avg       0.79      0.79      0.79      1322
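
The bagging and pasting reports above match to two decimals. That is plausible: each estimator sees only 1000 of roughly 5,300 training rows, so sampling with replacement produces relatively few repeated rows and the two schemes draw similar subsets. The sketch below illustrates the sampling difference with NumPy alone; `row_ids` is a hypothetical stand-in for the training row indices.

```python
import numpy as np

rng = np.random.default_rng(0)
row_ids = np.arange(5000)  # hypothetical stand-in for the training row indices

# Bagging: sample with replacement, so some rows can appear more than once.
bagging_draw = rng.choice(row_ids, size=1000, replace=True)

# Pasting: sample without replacement, so every drawn row is distinct.
pasting_draw = rng.choice(row_ids, size=1000, replace=False)

n_unique_bagging = len(np.unique(bagging_draw))
n_unique_pasting = len(np.unique(pasting_draw))
print(n_unique_bagging, n_unique_pasting)
```

Raising `max_samples` toward the full training size would make the two schemes differ more, since bootstrap samples would then contain many more repeats.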



- Random Forests

In [40]:
#your code here
forest = RandomForestClassifier(n_estimators=100,
                             max_depth=20)
forest.fit(X_train_norm, y_train)
pred_forest = forest.predict(X_test_norm)

print(classification_report(y_test, pred_forest))

              precision    recall  f1-score   support

           0       0.77      0.80      0.79       661
           1       0.80      0.77      0.78       661

    accuracy                           0.79      1322
   macro avg       0.79      0.79      0.79      1322
weighted avg       0.79      0.79      0.79      1322
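
Because each tree in a random forest is trained on a bootstrap sample, the rows a tree did not see can serve as a built-in validation set. The sketch below shows this out-of-bag estimate via `oob_score=True`; it runs on synthetic data (an assumption), and `oob_forest` is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the Spaceship Titanic features).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# oob_score=True scores each row using only the trees that did not train on it.
oob_forest = RandomForestClassifier(
    n_estimators=100,
    max_depth=20,
    oob_score=True,
    random_state=0,
)
oob_forest.fit(X, y)

print(oob_forest.oob_score_)
```

The out-of-bag accuracy gives a second estimate of generalization without touching the test set.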



- Gradient Boosting

In [41]:
#your code here
gb_cla = GradientBoostingClassifier(max_depth=20,
                                   n_estimators=100)
gb_cla.fit(X_train_norm, y_train)
pred_gb = gb_cla.predict(X_test_norm)

print(classification_report(y_test, pred_gb))

              precision    recall  f1-score   support

           0       0.80      0.75      0.78       661
           1       0.77      0.81      0.79       661

    accuracy                           0.78      1322
   macro avg       0.78      0.78      0.78      1322
weighted avg       0.78      0.78      0.78      1322
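
Gradient boosting adds trees one at a time, so it can be useful to see how accuracy changes as trees accumulate. The sketch below uses `staged_predict` for that on synthetic data (an assumption). The shallow `max_depth=3` and the names `gb_demo`, `staged_acc` and `best_n` are illustrative choices, not the settings used in this lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Spaceship Titanic features).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

gb_demo = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
gb_demo.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 50 trees.
staged_acc = [
    accuracy_score(y_val, stage_pred)
    for stage_pred in gb_demo.staged_predict(X_val)
]

# Number of trees at which held-out accuracy first peaks.
best_n = staged_acc.index(max(staged_acc)) + 1
print(best_n, max(staged_acc))
```

Picking the number of trees this way should be done on a validation split rather than the final test set, otherwise the test score becomes optimistic.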



- Adaptive Boosting

In [42]:
#your code here
ada_cla = AdaBoostClassifier(DecisionTreeClassifier(max_depth=20),
                            n_estimators=100)
ada_cla.fit(X_train_norm, y_train)
pred_ada = ada_cla.predict(X_test_norm)

print(classification_report(y_test, pred_ada))

              precision    recall  f1-score   support

           0       0.78      0.75      0.76       661
           1       0.76      0.79      0.77       661

    accuracy                           0.77      1322
   macro avg       0.77      0.77      0.77      1322
weighted avg       0.77      0.77      0.77      1322
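
AdaBoost is usually paired with weak learners; scikit-learn's default base estimator is a depth-1 decision tree. It may therefore be worth comparing a few base-tree depths before settling on `max_depth=20`. The sketch below does that on synthetic data (an assumption), with `depth_scores` and `ada_demo` as illustrative names; it says nothing about which depth works best on the lab's data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Spaceship Titanic features).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit one AdaBoost model per base-tree depth and record held-out accuracy.
depth_scores = {}
for depth in (1, 3, 20):
    ada_demo = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=depth),
        n_estimators=50,
        random_state=0,
    )
    ada_demo.fit(X_tr, y_tr)
    depth_scores[depth] = ada_demo.score(X_te, y_te)

print(depth_scores)
```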



Which model is the best and why?

In [None]:
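# On this split, bagging, pasting and the random forest score highest (accuracy 0.79),
# followed by gradient boosting and logistic regression on the selected features (0.78),
# then the single decision tree and AdaBoost (0.77).
# The gap is only about two points on 1,322 test rows, so no model is clearly ahead.
# Averaging many trees gives a small edge over a single tree, most likely because it reduces variance.
# The full-feature logistic regression score (0.56) is not comparable: it was predicted on unscaled test data.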
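
A single train/test split can hide how close the models really are. One hedged way to compare them more robustly is k-fold cross-validation, which reports a mean and a spread per model. The sketch below runs on synthetic data (an assumption), with reduced `n_estimators` to keep it quick; `candidates` and `cv_summary` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Spaceship Titanic features).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

candidates = {
    "tree": DecisionTreeClassifier(max_depth=10, random_state=0),
    "bagging": BaggingClassifier(DecisionTreeClassifier(max_depth=20), n_estimators=30, random_state=0),
    "forest": RandomForestClassifier(n_estimators=30, max_depth=20, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=30, random_state=0),
}

# 5-fold cross-validated accuracy: mean and standard deviation per model.
cv_summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the spread across folds is larger than the gap between two models, the split alone does not justify preferring one over the other.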