<h3>Spaceship Titanic - Kaggle Competition</h3>


Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

<h3>File and Data Field Descriptions</h3>

- train.csv - Personal records for about two-thirds
  (~8700) of the passengers, to be used as training data.
<ul>
<li>PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.</li>
<li>HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.</li>
<li>CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.</li>
<li>Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.</li>
<li>Destination - The planet the passenger will be debarking to.</li>
<li>Age - The age of the passenger.</li>
<li>VIP - Whether the passenger has paid for special VIP service during the voyage.</li>
<li>RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.</li>
<li>Name - The first and last names of the passenger.</li>
<li>Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.</li>
</ul>
- test.csv - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.
- sample_submission.csv - A submission file in the correct format.
<ul>
<li>PassengerId - Id for each passenger in the test set.</li>
<li>Transported - The target. For each passenger, predict either True or False.</li>
</ul>

<h3>Notes</h3>

- PassengerId has the form gggg_pp. Passengers in a group are often family, but not always; a pp of 02 or higher means the passenger is travelling in a group of at least two.
- Cabin splits into deck/num/side.
- RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - the amount the passenger spent at each amenity.
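
The gggg_pp structure noted above can be turned into group features. The following is a minimal sketch on a toy frame; `Group`, `NumInGroup`, `GroupSize` and `InGroup` are hypothetical column names, not columns in the dataset, and the notebook itself does not use these features.

```python
import pandas as pd

# Toy frame using the gggg_pp format (IDs taken from the sample rows of train.csv).
toy = pd.DataFrame({"PassengerId": ["0001_01", "0003_01", "0003_02", "0004_01"]})

# Split "gggg_pp" into the group code and the passenger's number within the group.
parts = toy["PassengerId"].str.split("_", expand=True)
toy["Group"] = parts[0]
toy["NumInGroup"] = parts[1].astype(int)

# Count how many passengers share each group code.
toy["GroupSize"] = toy.groupby("Group")["PassengerId"].transform("count")
toy["InGroup"] = toy["GroupSize"] > 1

print(toy)
```

Group sizes computed this way only reflect the rows in the frame they are computed on, so a group split across train.csv and test.csv would be undercounted.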

MACHINE LEARNING TASK - Predict Transported (Binary Classification)

<h3>To Do</h3>

- Fill in missing values in a bunch of the columns
- Balance out the dataset if imbalanced
- Drop high cardinality columns
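
Before acting on the first two items, it may help to measure how much is missing and how balanced the target is. This is a small sketch on made-up rows; `summarize` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def summarize(frame: pd.DataFrame, target: str) -> tuple[pd.Series, pd.Series]:
    """Return (fraction of missing values per column, share of each target class)."""
    missing_share = frame.isna().mean().sort_values(ascending=False)
    class_share = frame[target].value_counts(normalize=True)
    return missing_share, class_share


# Made-up rows standing in for train.csv.
toy = pd.DataFrame({
    "Age": [39.0, None, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, None],
    "Transported": [False, True, False, True],
})

missing_share, class_share = summarize(toy, "Transported")
print(missing_share)
print(class_share)
```

If the class shares come out close to 50/50, rebalancing is probably unnecessary.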

In [1]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "joel-filipe-QwoNAhbmLLo-unsplash.jpg", width=400, height=400)

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

<h4>Set up the dependencies</h4>

In [2]:
#!pip install kaggle


In [3]:
!mkdir -p ~/.kaggle


In [4]:
!cp kaggle.json ~/.kaggle/kaggle.json

In [5]:
!chmod 600 ~/.kaggle/kaggle.json

In [6]:
!kaggle competitions download -c spaceship-titanic

spaceship-titanic.zip: Skipping, found more recently modified local copy (use --force to force download)


In [7]:
#!unzip spaceship-titanic.zip

<h4>Load in and do EDA</h4>

In [8]:
import pandas as pd

In [9]:
#conda install -c conda-forge sweetviz


In [10]:
#import sweetviz as sv
#df = pd.read_csv("train.csv")
#report = sv.analyze(df)
#report.show_html("sweetviz_report.html")  # Opens in the browser


In [17]:
df= pd.read_csv("train.csv")
df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [18]:
# check columns
df.columns

Index(['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age',
       'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Name', 'Transported'],
      dtype='object')

In [19]:
df["PassengerId"]

0       0001_01
1       0002_01
2       0003_01
3       0003_02
4       0004_01
         ...   
8688    9276_01
8689    9278_01
8690    9279_01
8691    9280_01
8692    9280_02
Name: PassengerId, Length: 8693, dtype: object

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


In [21]:
# summary statistics
df.describe()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
count,8514.0,8512.0,8510.0,8485.0,8510.0,8505.0
mean,28.82793,224.687617,458.077203,173.729169,311.138778,304.854791
std,14.489021,666.717663,1611.48924,604.696458,1136.705535,1145.717189
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,19.0,0.0,0.0,0.0,0.0,0.0
50%,27.0,0.0,0.0,0.0,0.0,0.0
75%,38.0,47.0,76.0,27.0,59.0,46.0
max,79.0,14327.0,29813.0,23492.0,22408.0,24133.0


In [22]:
# Inspect rows that have at least one missing value (wholly empty rows would need .all(axis=1))
df[df.isna().any(axis=1)].head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
7,0006_02,Earth,True,G/0/S,TRAPPIST-1e,28.0,False,0.0,0.0,0.0,0.0,,Candra Jacostaffey,True
10,0008_02,Europa,True,B/1/P,TRAPPIST-1e,34.0,False,0.0,0.0,,0.0,0.0,Altardr Flatic,True
15,0012_01,Earth,False,,TRAPPIST-1e,31.0,False,32.0,0.0,876.0,0.0,0.0,Justie Pooles,False
16,0014_01,Mars,False,F/3/P,55 Cancri e,27.0,False,1286.0,122.0,,0.0,0.0,Flats Eccle,False
23,0020_03,Earth,True,E/0/S,55 Cancri e,29.0,False,0.0,0.0,,0.0,0.0,Mollen Mcfaddennon,False


In [24]:

def split_cabin(x):
    """Split a 'deck/num/side' cabin string; return 'Missing' placeholders when it cannot be split."""
    parts = str(x).split('/')
    if len(parts) < 3:
        return ['Missing', 'Missing', 'Missing']
    return parts
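
The helper above is defined but not called later in the notebook. As a rough sketch of how it could be applied, the block below restates it compactly and expands a toy Cabin column into three columns; `Deck`, `Num` and `Side` are column names assumed for this example.

```python
import pandas as pd


def split_cabin(x):
    """Compact restatement of the helper: placeholders when fewer than 3 parts."""
    parts = str(x).split("/")
    return ["Missing", "Missing", "Missing"] if len(parts) < 3 else parts


cabins = pd.Series(["B/0/P", None, "F/1/S"], name="Cabin")

# Apply the helper row by row and spread the three parts into columns.
cabin_parts = pd.DataFrame(
    cabins.apply(split_cabin).tolist(),
    columns=["Deck", "Num", "Side"],
    index=cabins.index,
)
print(cabin_parts)
```

Unlike `str.split('/', expand=True)`, this fills missing cabins with an explicit 'Missing' label instead of leaving NaN.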

In [25]:
# Create a preprocessing function to transform dataset
def preprocess_data(df):
    """
    Preprocess the dataset by handling missing values, extracting features, and transforming columns.
    
    Parameters:
        df (pd.DataFrame): The input dataset.
        
    Returns:
        pd.DataFrame: The preprocessed dataset.
    """
    # Fill missing values with appropriate replacements
    fill_values = {
        'HomePlanet': 'Missing',
        'CryoSleep': 'Missing',
        'Destination': 'Missing',
        'VIP': 'Missing',
        'RoomService': 0,
        'FoodCourt': 0,
        'ShoppingMall': 0,
        'Spa': 0,
        'VRDeck': 0
    }
    df.fillna(value=fill_values, inplace=True)

    # Fill missing age values with the mean (plain assignment avoids the chained inplace fillna that pandas warns about)
    df['Age'] = df['Age'].fillna(df['Age'].mean())

    # Extract Deck and Side from Cabin
    df[['Deck', '_', 'Side']] = df['Cabin'].str.split('/', expand=True)
    df.drop(columns=['Cabin', '_'], inplace=True)  # Drop original Cabin column and temporary column

    # Drop high-cardinality column 'Name'
    df.drop(columns=['Name'], inplace=True)

    return df
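
One design point worth noting: the function computes the Age mean from whichever frame it receives, so test.csv would be filled with its own mean. A possible alternative, sketched here on toy data and not what the notebook does, is to learn the fill value from the training rows and reuse it.

```python
import pandas as pd

# Toy stand-ins for the train and test frames.
train_toy = pd.DataFrame({"Age": [20.0, 40.0, None]})
test_toy = pd.DataFrame({"Age": [None, 10.0]})

# Learn the fill value from the training rows only.
age_fill = train_toy["Age"].mean()

# assign() returns new frames, so the inputs are left untouched.
train_filled = train_toy.assign(Age=train_toy["Age"].fillna(age_fill))
test_filled = test_toy.assign(Age=test_toy["Age"].fillna(age_fill))

print(test_filled)
```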


In [26]:
abt = df.copy()

In [27]:
abt.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [28]:
preprocess_data(abt)



Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Deck,Side
0,0001_01,Europa,False,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,B,P
1,0002_01,Earth,False,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,F,S
2,0003_01,Europa,False,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,A,S
3,0003_02,Europa,False,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,A,S
4,0004_01,Earth,False,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,F,S
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,9276_01,Europa,False,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,False,A,P
8689,9278_01,Earth,True,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,False,G,S
8690,9279_01,Earth,False,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,True,G,S
8691,9280_01,Europa,False,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,False,E,S


In [33]:
abt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8693 non-null   object 
 2   CryoSleep     8693 non-null   object 
 3   Destination   8693 non-null   object 
 4   Age           8693 non-null   float64
 5   VIP           8693 non-null   object 
 6   RoomService   8693 non-null   float64
 7   FoodCourt     8693 non-null   float64
 8   ShoppingMall  8693 non-null   float64
 9   Spa           8693 non-null   float64
 10  VRDeck        8693 non-null   float64
 11  Transported   8693 non-null   bool   
 12  Deck          8494 non-null   object 
 13  Side          8494 non-null   object 
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


<h3>Modelling</h3>

- Feature and Target values - X, y
- One hot encode any categorical features
- Train, holdout split
- Train on a bunch of algorithms
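
To illustrate the one-hot encoding step, here is a small sketch of how `pd.get_dummies` treats a toy frame. Numeric columns pass through unchanged, each category becomes its own column, and a row whose category is NaN gets zeros in every dummy column for that feature.

```python
import pandas as pd

toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Missing"],
    "Deck": ["B", None, "F"],  # None mimics a passenger with no Cabin value
    "Age": [39.0, 24.0, 58.0],
})

encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())

# The second row has no Deck, so none of its Deck_* columns are set.
deck_flags_row1 = int(encoded.loc[1, ["Deck_B", "Deck_F"]].sum())
print(deck_flags_row1)
```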

In [34]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [38]:
# Create feature columns
# Drop identifier column
X = abt.drop(['Transported', 'PassengerId'], axis=1)
# One hot encode
X = pd.get_dummies(X)
# Create target columns
y = abt['Transported']
# Create train/holdout splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)


<h3>Setup ML Pipelines</h3>

In [40]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier


In [41]:

pipelines = {
    'rf': make_pipeline(StandardScaler(), RandomForestClassifier(random_state=1234)),
    'gb': make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=1234))
}

In [42]:
RandomForestClassifier().get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [43]:

grid = {
    'rf': {
        'randomforestclassifier__n_estimators':[100,200,300]
    },
    'gb':{
        'gradientboostingclassifier__n_estimators':[100,200,300]
    } 
}

In [44]:
pipelines.items()

dict_items([('rf', Pipeline(steps=[('standardscaler', StandardScaler()),
                ('randomforestclassifier',
                 RandomForestClassifier(random_state=1234))])), ('gb', Pipeline(steps=[('standardscaler', StandardScaler()),
                ('gradientboostingclassifier',
                 GradientBoostingClassifier(random_state=1234))]))])

In [47]:

# Dictionary to store trained models
fit_models = {}

# Loop through all algorithms and their corresponding pipelines
for algo, pipeline in pipelines.items():
    print(f'Training the {algo} model...')
    
    try:
        # Create Grid Search with cross-validation
        grid_search = GridSearchCV(
            estimator=pipeline, 
            param_grid=grid[algo], 
            n_jobs=-1, 
            cv=10, 
            verbose=1  # Added verbosity for tracking progress
        )

        # Train the model
        grid_search.fit(X_train, y_train)

        # Store the trained model in the dictionary
        fit_models[algo] = grid_search

        print(f'{algo} model training completed.\n')

    except Exception as e:
        print(f'Error training {algo}: {e}\n')



Training the rf model...
Fitting 10 folds for each of 3 candidates, totalling 30 fits
rf model training completed.

Training the gb model...
Fitting 10 folds for each of 3 candidates, totalling 30 fits
gb model training completed.
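
After fitting, a `GridSearchCV` object exposes the winning parameters and the mean cross-validated score. The sketch below uses synthetic data and a deliberately small grid so it runs quickly on its own; `X_demo` and `y_demo` are stand-ins, not the notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1234)

search = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=1234)),
    param_grid={"gradientboostingclassifier__n_estimators": [10, 20]},
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)            # parameter combination with the best mean CV score
print(round(search.best_score_, 4))   # that mean CV score (accuracy by default for classifiers)
best_pipeline = search.best_estimator_  # pipeline refit on all of X_demo with those parameters
```

In the notebook the same attributes would be read from `fit_models['rf']` and `fit_models['gb']`.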



<h3>Evaluate Performance on Test Partition</h3>

- Evaluate each tuned model on the 30% holdout split (X_test, y_test). test.csv has no Transported labels, so it is only used later to build the submission.

In [48]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Evaluate the performance of each trained model
print("\nModel Evaluation Results:\n")
for algo, model in fit_models.items():
    try:
        # Generate predictions
        y_pred = model.predict(X_test)

        # Calculate evaluation metrics
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, average='weighted')  # Averaged over both classes, weighted by support
        recall = recall_score(y_test, y_pred, average='weighted')  # Weighted recall equals accuracy

        # Print formatted results
        print(f"🔹 {algo} Model:")
        print(f"   - Accuracy:  {accuracy:.4f}")
        print(f"   - Precision: {precision:.4f}")
        print(f"   - Recall:    {recall:.4f}\n")

    except Exception as e:
        print(f"⚠️ Error evaluating {algo}: {e}\n")



Model Evaluation Results:

🔹 rf Model:
   - Accuracy:  0.7918
   - Precision: 0.7925
   - Recall:    0.7918

🔹 gb Model:
   - Accuracy:  0.8067
   - Precision: 0.8083
   - Recall:    0.8067
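
Weighted recall is mathematically the same as accuracy, which is why those two numbers match in the results above. For a binary target, a confusion matrix and positive-class metrics can be more informative. A minimal sketch on hand-made labels (not the notebook's predictions):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made labels: three transported passengers and three who were not.
y_true = [True, True, True, False, False, False]
y_hat = [True, True, False, False, False, True]

# Rows are actual classes, columns are predicted classes, ordered [False, True].
cm = confusion_matrix(y_true, y_hat, labels=[False, True])
tn, fp, fn, tp = cm.ravel()

# Metrics for the Transported=True class only.
precision_pos = precision_score(y_true, y_hat, pos_label=True)
recall_pos = recall_score(y_true, y_hat, pos_label=True)

print(cm)
print(precision_pos, recall_pos)
```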



<h3>Save Best Model</h3>

In [49]:
import pickle


In [50]:

with open('gradientboosted.pkl', 'wb') as f: 
  pickle.dump(fit_models['gb'], f)
     

In [52]:

with open('gradientboosted.pkl', 'rb') as f: 
  reloaded_model = pickle.load(f)
     

In [53]:
reloaded_model
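
A quick way to confirm that a pickled model survived the round trip is to compare predictions before and after. The sketch below does this in memory with a small synthetic model; all names in it are illustrative stand-ins.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small synthetic model standing in for fit_models['gb'].
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=1234)
model = GradientBoostingClassifier(n_estimators=10, random_state=1234).fit(X_demo, y_demo)

# Serialise to bytes and back, the in-memory equivalent of writing and reading the .pkl file.
restored = pickle.loads(pickle.dumps(model))

same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickled scikit-learn models are generally only safe to reload with the same library version that wrote them.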

<h3>Predict on Test Data</h3>


In [54]:
test_df = pd.read_csv("test.csv")
abt_test = test_df.copy()
preprocess_data(abt_test)
# One hot encode categorical variables
abt_test = pd.get_dummies(abt_test.drop('PassengerId', axis=1))
     



In [55]:
len(abt_test.columns)

30

In [56]:
len(X.columns)
     

30
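
Matching column counts do not guarantee matching column names or order. If a category appears in only one of the two files, `get_dummies` produces different columns. One common safeguard, sketched here on toy frames, is to reindex the encoded test frame to the training columns.

```python
import pandas as pd

# Toy training frame: deck "T" appears here but not in the toy test frame.
train_encoded = pd.get_dummies(pd.DataFrame({"Deck": ["A", "B", "T"], "Age": [1.0, 2.0, 3.0]}))
test_encoded = pd.get_dummies(pd.DataFrame({"Deck": ["B", "A"], "Age": [4.0, 5.0]}))

# Add any columns the test frame lacks (filled with 0), drop extras, and match the order.
aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(aligned.columns.tolist())
```

In the notebook this would correspond to something like `abt_test.reindex(columns=X.columns, fill_value=0)` before calling `predict`.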

In [57]:
yhat_test = fit_models['gb'].predict(abt_test)
     

In [58]:

# Column names follow sample_submission.csv: PassengerId, Transported
submission = pd.DataFrame({'PassengerId': test_df['PassengerId'], 'Transported': yhat_test})
     

In [59]:

submission.head()

Unnamed: 0,PassengerId,Transported
0,0013_01,True
1,0018_01,False
2,0019_01,True
3,0021_01,True
4,0023_01,True


<h3>Submit to Kaggle</h3>

In [60]:

submission.to_csv('kaggle_submission.csv', index=False)
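
Before uploading, it can be worth checking the frame against the format described for sample_submission.csv (a PassengerId column and a True/False Transported column). `check_submission` below is a hypothetical helper written for this sketch, not part of the Kaggle tooling.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, expected_ids: pd.Series) -> list[str]:
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = []
    if list(submission.columns) != ["PassengerId", "Transported"]:
        problems.append(f"unexpected columns: {list(submission.columns)}")
    elif submission["PassengerId"].tolist() != expected_ids.tolist():
        problems.append("PassengerId values differ from the test set")
    if "Transported" in submission and not submission["Transported"].isin([True, False]).all():
        problems.append("Transported contains values other than True/False")
    return problems


ids = pd.Series(["0013_01", "0018_01"])
good = pd.DataFrame({"PassengerId": ids, "Transported": [True, False]})
bad = good.rename(columns={"PassengerId": "PassengerID"})  # wrong capitalisation

print(check_submission(good, ids))
print(check_submission(bad, ids))
```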
     

In [61]:

!kaggle competitions submit -c spaceship-titanic -m "initial gb model" -f "kaggle_submission.csv"
     

100%|██████████████████████████████████████| 56.3k/56.3k [00:01<00:00, 55.7kB/s]
Successfully submitted to Spaceship Titanic