<a href="https://www.kaggle.com/code/taimour/ensemble-7-models-stackingcvregressor-titanic?scriptVersionId=196022443" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<h2>Ensemble 7 Models + StackingCVRegressor - Spaceship Titanic</h2>

![](https://i.postimg.cc/43CDn4kd/pexels-pixabay-2166.jpg)

# Understanding Data, Files and Goals

**Data Fields**

**PassengerId:** A unique identifier for each passenger.

**HomePlanet:** The planet the passenger is from.

**CryoSleep:** Whether the passenger was put into suspended animation during the journey.

**Cabin:** The location of the passenger's cabin on the ship.

**Destination:** The planet the passenger is traveling to.

**Age:** The age of the passenger.

**VIP:** Indicates if the passenger has VIP status.

**RoomService:** Amount the passenger spent on room service.

**FoodCourt:** Amount the passenger spent in the food court.

**ShoppingMall:** Amount the passenger spent in the shopping mall.

**Spa:** Amount the passenger spent in the spa.

**VRDeck:** Amount the passenger spent on the VR deck.

**Name:** The passenger's name.

**Transported:** The target variable indicating whether the passenger was transported.


**Files**

The **train.csv** file contains data for training a machine learning model to predict transportation.

The **test.csv** file contains data for testing the model's performance on unseen data.

The **sample_submission.csv** file is a template for submitting predictions.



**Goals**

**train.csv** and **test.csv** contain passenger data from a spaceship journey. The goal is to predict whether a passenger was "transported" to another dimension based on various personal details.

The **Transported** column is the target variable to predict.

By analyzing these features and their relationships, we can build a model to accurately predict whether a passenger will be transported or not.

# Import Libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import ElasticNetCV,LassoCV,RidgeCV
from sklearn.svm import SVR
from mlxtend.regressor import StackingCVRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from datetime import datetime
from sklearn.model_selection import KFold
from category_encoders.target_encoder import TargetEncoder

import warnings
warnings.filterwarnings('ignore')

# Load and View Data

In [2]:
X_train=pd.read_csv("/kaggle/input/spaceship-titanic/train.csv", index_col='PassengerId')
test=pd.read_csv("/kaggle/input/spaceship-titanic/test.csv", index_col='PassengerId')

#Training data
X_train.head(5)

| PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [3]:
#Testing data
test.head(5)

| PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0013_01 | Earth | True | G/3/S | TRAPPIST-1e | 27.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Nelly Carsoning |
| 0018_01 | Earth | False | F/4/S | TRAPPIST-1e | 19.0 | False | 0.0 | 9.0 | 0.0 | 2823.0 | 0.0 | Lerome Peckers |
| 0019_01 | Europa | True | C/0/S | 55 Cancri e | 31.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Sabih Unhearfus |
| 0021_01 | Europa | False | C/1/S | TRAPPIST-1e | 38.0 | False | 0.0 | 6652.0 | 0.0 | 181.0 | 585.0 | Meratz Caltilter |
| 0023_01 | Earth | False | F/5/S | TRAPPIST-1e | 20.0 | False | 10.0 | 0.0 | 635.0 | 0.0 | 0.0 | Brence Harperez |


# Check Null Values in Data

First, check the target column for null values. Any row with a missing target cannot be used for training and would have to be dropped.

In [4]:
null_values = X_train["Transported"].isnull().sum()
null_values

0

Good news: no row has a null value in the target column.

Next, let's calculate the percentage of null values in each column of the train and test data.

In [5]:
def null_percent(df):
    per=((df.isnull().sum()/len(df))*100).round(2)
    return per
print("Nan Values in Train data")
print(null_percent(X_train))
print("Nan Values in Test data")
print(null_percent(test))

Nan Values in Train data
HomePlanet      2.31
CryoSleep       2.50
Cabin           2.29
Destination     2.09
Age             2.06
VIP             2.34
RoomService     2.08
FoodCourt       2.11
ShoppingMall    2.39
Spa             2.11
VRDeck          2.16
Name            2.30
Transported     0.00
dtype: float64
Nan Values in Test data
HomePlanet      2.03
CryoSleep       2.17
Cabin           2.34
Destination     2.15
Age             2.13
VIP             2.17
RoomService     1.92
FoodCourt       2.48
ShoppingMall    2.29
Spa             2.36
VRDeck          1.87
Name            2.20
dtype: float64


# Data Preprocessing

Our target column, Transported, is boolean (True or False). Let's convert it to 1 and 0 so the regression models can use it.

In [6]:
X_train['Transported'] = X_train['Transported'].astype(int)

In [7]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8693 entries, 0001_01 to 9280_02
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   HomePlanet    8492 non-null   object 
 1   CryoSleep     8476 non-null   object 
 2   Cabin         8494 non-null   object 
 3   Destination   8511 non-null   object 
 4   Age           8514 non-null   float64
 5   VIP           8490 non-null   object 
 6   RoomService   8512 non-null   float64
 7   FoodCourt     8510 non-null   float64
 8   ShoppingMall  8485 non-null   float64
 9   Spa           8510 non-null   float64
 10  VRDeck        8505 non-null   float64
 11  Name          8493 non-null   object 
 12  Transported   8693 non-null   int64  
dtypes: float64(6), int64(1), object(6)
memory usage: 950.8+ KB


In [8]:
#Name is not used as a feature in this notebook, so drop it
X_train.drop(['Name'],axis=1,inplace=True)
test.drop(['Name'],axis=1,inplace=True)

In [9]:
#Separate categorical and numerical columns to fill missing values
#For categorical columns we will use the mode
#For numerical columns we will use the median

categorical_col_train=[col for col in X_train.columns if X_train[col].dtype=='O']
numerical_col_train=[col for col in X_train.columns if X_train[col].dtype!='O']

categorical_col_test=[col for col in test.columns if test[col].dtype=='O']
numerical_col_test=[col for col in test.columns if test[col].dtype!='O']

#Assign the result back instead of calling fillna(inplace=True) on a column
#selection, which may act on a temporary copy in newer pandas versions
for col in categorical_col_train:
    X_train[col]=X_train[col].fillna(X_train[col].mode()[0])
for col in numerical_col_train:
    X_train[col]=X_train[col].fillna(X_train[col].median())

for col in categorical_col_test:
    test[col]=test[col].fillna(test[col].mode()[0])
for col in numerical_col_test:
    test[col]=test[col].fillna(test[col].median())

#Now check the Null percentage again
print("Nan Values in Train data")
print(null_percent(X_train))
print("Nan Values in Test data")
print(null_percent(test))

Nan Values in Train data
HomePlanet      0.0
CryoSleep       0.0
Cabin           0.0
Destination     0.0
Age             0.0
VIP             0.0
RoomService     0.0
FoodCourt       0.0
ShoppingMall    0.0
Spa             0.0
VRDeck          0.0
Transported     0.0
dtype: float64
Nan Values in Test data
HomePlanet      0.0
CryoSleep       0.0
Cabin           0.0
Destination     0.0
Age             0.0
VIP             0.0
RoomService     0.0
FoodCourt       0.0
ShoppingMall    0.0
Spa             0.0
VRDeck          0.0
dtype: float64
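
The cell above fills gaps in the test data with statistics computed from the test data itself. A common alternative is to compute the fill values on the training data only and reuse them for the test data, so both sets are imputed the same way. The sketch below shows this on a tiny made-up frame; `fit_fill_values` and `apply_fill_values` are hypothetical helper names, not part of the notebook.

```python
import pandas as pd

def fit_fill_values(df):
    """Compute one fill value per column: median for numeric columns, mode otherwise."""
    fill = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            fill[col] = df[col].median()
        else:
            fill[col] = df[col].mode()[0]
    return fill

def apply_fill_values(df, fill):
    """Return a copy of df with missing values replaced by precomputed fill values."""
    usable = {col: value for col, value in fill.items() if col in df.columns}
    return df.fillna(value=usable)

# Made-up data for illustration only
toy_train = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", "Europa", None],
    "Age": [20.0, 30.0, None, 40.0],
})
toy_test = pd.DataFrame({
    "HomePlanet": [None, "Mars"],
    "Age": [None, 55.0],
})

# Fill values come from the training frame and are applied to both frames
fill_values = fit_fill_values(toy_train)
toy_train_filled = apply_fill_values(toy_train, fill_values)
toy_test_filled = apply_fill_values(toy_test, fill_values)
print(toy_test_filled)
```

Whether this changes the leaderboard score for this dataset is not something the notebook measures; the point is only that the test set is never used to derive preprocessing parameters.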


# Data Encoding

In [10]:
encoder  = TargetEncoder()
for feature in categorical_col_test:
    X_train[feature] = encoder.fit_transform(X_train[feature], X_train['Transported'])
    test[feature] = encoder.transform(test[feature])
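
To make the encoding step concrete: target encoding replaces each category with a statistic of the target for that category, at its simplest the mean of `Transported` among the training rows with that value. The sketch below computes that plain per-category mean with pandas on made-up data. It is a simplification: `TargetEncoder` from `category_encoders` also smooths each category mean toward the overall mean, so its values will generally differ from this unsmoothed version. `mean_target_encode` is a hypothetical helper.

```python
import pandas as pd

def mean_target_encode(train_col, target, test_col):
    """Encode categories by the mean of the target in the training data.

    Categories that appear only in the test column fall back to the
    overall training mean of the target.
    """
    global_mean = target.mean()
    category_means = target.groupby(train_col).mean()
    train_encoded = train_col.map(category_means)
    test_encoded = test_col.map(category_means).fillna(global_mean)
    return train_encoded, test_encoded

# Made-up data for illustration only
toy_planet = pd.Series(["Europa", "Europa", "Earth", "Earth", "Earth", "Earth"])
toy_target = pd.Series([1, 1, 1, 0, 0, 0])
toy_test_planet = pd.Series(["Earth", "Mars"])

train_enc, test_enc = mean_target_encode(toy_planet, toy_target, toy_test_planet)
print(train_enc.tolist())
print(test_enc.tolist())
```

Note that the mapping is learned from the training column only and then applied to the test column, which mirrors the `fit_transform` / `transform` split in the cell above.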





View Data after encoding

In [11]:
X_train.head(3)

| PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0001_01 | 0.658846 | False | 0.438098 | 0.472199 | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 0002_01 | 0.427649 | False | 0.568206 | 0.472199 | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | 1 |
| 0003_01 | 0.658846 | False | 0.432184 | 0.472199 | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | 0 |


In [12]:
test.head(3)

| PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0013_01 | 0.427649 | True | 0.503624 | 0.472199 | 27.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0018_01 | 0.427649 | False | 0.503624 | 0.472199 | 19.0 | False | 0.0 | 9.0 | 0.0 | 2823.0 | 0.0 |
| 0019_01 | 0.658846 | True | 0.503624 | 0.61 | 31.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


# Separate Target Column

In [13]:
#Separate target column from data
y=X_train.Transported
X_train.drop(['Transported'],axis=1,inplace=True)

# Pipelines and Models

In [14]:
# Set a random seed for reproducibility
SEED = 40

# Define the number of cross-validation folds
K = 10

# Create a K-fold cross-validation object
kf = KFold(n_splits=K, shuffle=True, random_state=SEED)

# Create pipelines for Ridge regression, Lasso regression, Elastic Net regression, Support Vector Regression, Gradient Boosting Regression, LightGBM Regression, and XGBoost Regression

# Ridge regression with cross-validation
ridge = make_pipeline(RobustScaler(), RidgeCV(alphas=np.arange(14.5, 15.6, 0.1), cv=kf))

# Lasso regression with cross-validation
lasso = make_pipeline(RobustScaler(), LassoCV(alphas=np.arange(0.0001, 0.0009, 0.0001), random_state=SEED, cv=kf))

# Elastic Net regression with cross-validation
elasticnet = make_pipeline(RobustScaler(), ElasticNetCV(alphas=np.arange(0.0001, 0.0008, 0.0001), l1_ratio=np.arange(0.8, 1, 0.025), cv=kf))

# Support Vector Regression
svr = make_pipeline(RobustScaler(), SVR(C=20, epsilon=0.008, gamma=0.0003))

# Gradient Boosting Regression
gbr = GradientBoostingRegressor(n_estimators=700, 
                                learning_rate=0.01, 
                                max_depth=4, 
                                max_features='sqrt', 
                                min_samples_leaf=15, 
                                min_samples_split=10, 
                                loss='huber', 
                                random_state=SEED)

# LightGBM Regression
lgbmr = LGBMRegressor(objective='regression', 
                      num_leaves=4, 
                      learning_rate=0.01, 
                      n_estimators=750, 
                      max_bin=200, 
                      bagging_fraction=0.75, 
                      bagging_freq=5, 
                      bagging_seed=SEED, 
                      feature_fraction=0.2, 
                      feature_fraction_seed=SEED, 
                      verbose=0)

# XGBoost Regression
xgbr = XGBRegressor(learning_rate=0.01, 
                    n_estimators=700, 
                    max_depth=3, 
                    gamma=0.001, 
                    subsample=0.7, 
                    colsample_bytree=0.7, 
                    objective='reg:squarederror', 
                    nthread=-1, 
                    seed=SEED, 
                    reg_alpha=0.0001)

# StackingCVRegressor
stack = StackingCVRegressor(regressors=(ridge, lasso, elasticnet, svr, gbr, lgbmr), meta_regressor=xgbr, use_features_in_secondary=True)
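
Stacking with cross-validation trains the meta-model on out-of-fold predictions: each base model predicts a training row only while that row is held out of its fit, so the meta-model does not learn from predictions that have already seen the answer. The sketch below shows that idea with NumPy only, on synthetic data, with two least-squares models standing in for the notebook's regressors. It illustrates the general technique rather than the internals of `StackingCVRegressor`, and every name in it is hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Fit least squares with an intercept and return the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Predict with coefficients produced by fit_linear."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def rmse(a, b):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Synthetic regression data, for illustration only
rng = np.random.default_rng(40)
n_rows, n_folds = 200, 5
X = rng.normal(size=(n_rows, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=n_rows)

# Two deliberately weak base models, each seeing a different feature subset
feature_subsets = [[0, 1], [1, 2]]

# Shuffle the row indices once and split them into folds
folds = np.array_split(rng.permutation(n_rows), n_folds)

# oof[i, j] is base model j's prediction for row i, made while row i was held out
oof = np.zeros((n_rows, len(feature_subsets)))
for held_out in folds:
    train_idx = np.setdiff1d(np.arange(n_rows), held_out)
    for j, cols in enumerate(feature_subsets):
        coef = fit_linear(X[train_idx][:, cols], y[train_idx])
        oof[held_out, j] = predict_linear(coef, X[held_out][:, cols])

# The meta-model is trained on the out-of-fold predictions only
meta_coef = fit_linear(oof, y)
stacked = predict_linear(meta_coef, oof)

print("base RMSEs:", [round(rmse(oof[:, j], y), 3) for j in range(oof.shape[1])])
print("stacked RMSE:", round(rmse(stacked, y), 3))
```

In the notebook, `use_features_in_secondary=True` additionally passes the original features to the meta-regressor alongside the base-model predictions; the sketch leaves that out to keep the out-of-fold mechanics visible.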

# Create Dictionaries

In [15]:
# Create a dictionary to store the models
models = {'RidgeCV': ridge,
          'LassoCV': lasso, 
          'ElasticNetCV': elasticnet,
          'SupportVectorRegressor': svr, 
          'GradientBoostingRegressor': gbr, 
          'LightGBMRegressor': lgbmr, 
          'XGBoostRegressor': xgbr,
          'StackingCVRegressor': stack}

# Initialize dictionaries for predictions and scores
predictions = {}
scores = {}


In [16]:
# Train every model in the 'models' dictionary.
# Start and end times are printed for each model to track training duration.
for name, model in models.items():
    start = datetime.now()
    print('[{}] Running {}'.format(start, name))
    model.fit(X_train, y)
    end = datetime.now()
    print('[{}] Finished Running {} in {:.2f}s'.format(end, name, (end - start).total_seconds()))

def blend_predict(X):
    """
    Return a weighted blend of the fitted models' predictions for features X.

    Each model's prediction is multiplied by a fixed coefficient and the
    results are summed. The coefficients add up to 1.0, so the result is a
    weighted average of the eight models.
    """
    return ((0.1 * elasticnet.predict(X)) + 
            (0.05 * lasso.predict(X)) +
            (0.1 * ridge.predict(X)) +
            (0.1 * svr.predict(X)) +
            (0.1 * gbr.predict(X)) +
            (0.15 * xgbr.predict(X)) +
            (0.1 * lgbmr.predict(X)) +
            (0.3 * stack.predict(X)))

[2024-09-10 06:36:00.336038] Running RidgeCV
[2024-09-10 06:36:01.244587] Finished Running RidgeCV in 0.91s
[2024-09-10 06:36:01.245114] Running LassoCV
[2024-09-10 06:36:01.351895] Finished Running LassoCV in 0.11s
[2024-09-10 06:36:01.351988] Running ElasticNetCV
[2024-09-10 06:36:01.841840] Finished Running ElasticNetCV in 0.49s
[2024-09-10 06:36:01.842404] Running SupportVectorRegressor
[2024-09-10 06:36:13.586855] Finished Running SupportVectorRegressor in 11.74s
[2024-09-10 06:36:13.586998] Running GradientBoostingRegressor
[2024-09-10 06:36:19.654944] Finished Running GradientBoostingRegressor in 6.07s
[2024-09-10 06:36:19.655064] Running LightGBMRegressor
[2024-09-10 06:36:20.005884] Finished Running LightGBMRegressor in 0.35s
[2024-09-10 06:36:20.005925] Running XGBoostRegressor
[2024-09-10 06:36:20.646290] Finished Running XGBoostRegressor in 0.64s
[2024-09-10 06:36:20.646656] Running StackingCVRegressor
[2024-09-10 06:37:52.301260] Finished Running StackingCVRegressor in 91.65s
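
Because `blend_predict` hard-codes its coefficients, it is easy to change one and forget that they are meant to sum to 1. A minimal sketch of a safer layout keeps the weights in a dictionary keyed by model name and checks the sum before blending. `BLEND_WEIGHTS` and `weighted_blend` are hypothetical names, and the stand-in predictions are made-up numbers rather than model output.

```python
import math

# Same coefficients as blend_predict, keyed by the names used in the 'models' dictionary
BLEND_WEIGHTS = {
    'ElasticNetCV': 0.1,
    'LassoCV': 0.05,
    'RidgeCV': 0.1,
    'SupportVectorRegressor': 0.1,
    'GradientBoostingRegressor': 0.1,
    'XGBoostRegressor': 0.15,
    'LightGBMRegressor': 0.1,
    'StackingCVRegressor': 0.3,
}

def weighted_blend(per_model_preds, weights):
    """Blend per-model prediction lists using weights that must sum to 1."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("blend weights must sum to 1")
    n_rows = len(next(iter(per_model_preds.values())))
    blended = [0.0] * n_rows
    for name, weight in weights.items():
        for i, pred in enumerate(per_model_preds[name]):
            blended[i] += weight * pred
    return blended

# Stand-in predictions: every model outputs the same two scores,
# so a correctly normalized blend must reproduce them
fake_preds = {name: [0.2, 0.9] for name in BLEND_WEIGHTS}
blended = weighted_blend(fake_preds, BLEND_WEIGHTS)
print(blended)
```

With the fitted models, the per-model predictions could be collected as `{name: model.predict(X) for name, model in models.items()}`, since the keys match the `models` dictionary.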

# Predict and Save Results

In [17]:
# Make predictions on the test data using weighted blending
preds_test = blend_predict(test)

# Threshold the blended regression scores at 0.5 to get boolean labels
# (scores well above 1 also count as True, which rounding to 1 would miss)
preds_test_bool = [bool(x > 0.5) for x in preds_test]

# Save test predictions to file
output = pd.DataFrame({'PassengerId': test.index,'Transported': preds_test_bool})
output.to_csv('submission.csv', index=False)
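
The submission step turns continuous regression scores into True/False labels, but the notebook never measures how well that works before submitting. A minimal sketch of a local check is shown below: it scores thresholded predictions against known 0/1 labels. `accuracy_at_threshold` is a hypothetical helper and the numbers are made up; in practice the scores would come from models fitted on one part of the training data and applied to a held-out part.

```python
import numpy as np

def accuracy_at_threshold(scores, y_true, threshold=0.5):
    """Fraction of rows where (score > threshold) matches the 0/1 label."""
    predicted = np.asarray(scores) > threshold
    actual = np.asarray(y_true).astype(bool)
    return float(np.mean(predicted == actual))

# Made-up scores and labels for illustration only
toy_scores = [0.1, 0.7, 0.4, 1.2, 0.55]
toy_labels = [0, 1, 1, 1, 0]
toy_accuracy = accuracy_at_threshold(toy_scores, toy_labels)
print(toy_accuracy)
```

Classification accuracy is the natural metric here because the target is binary, even though all the models in this notebook are regressors.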

