# Spaceship Titanic

Our goal is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly.


## File and Data Field Descriptions

* **train.csv** - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
    * PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
    * HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
    * CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
    * Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
    * Destination - The planet the passenger will be debarking to.
    * Age - The age of the passenger.
    * VIP - Whether the passenger has paid for special VIP service during the voyage.
    * RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
    * Name - The first and last names of the passenger.
    * Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
* **test.csv** - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.
* **sample_submission.csv** - A submission file in the correct format.
    * PassengerId - Id for each passenger in the test set.
    * Transported - The target. For each passenger, predict either True or False.
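
The two structured fields above (PassengerId as gggg_pp and Cabin as deck/num/side) can be split into their parts with pandas string methods. The following is a minimal sketch on made-up rows; the column names `Group`, `GroupNumber`, `Deck`, `CabinNum` and `Side` are illustrative choices, not names defined by the competition.

```python
import pandas as pd

# Toy rows that follow the documented formats (values are illustrative only)
sample = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02"],
    "Cabin": ["B/0/P", "F/12/S", None],
})

# PassengerId has the form gggg_pp: group identifier, then number within the group
sample[["Group", "GroupNumber"]] = sample["PassengerId"].str.split("_", expand=True)

# Cabin has the form deck/num/side; a missing Cabin yields missing parts
sample[["Deck", "CabinNum", "Side"]] = sample["Cabin"].str.split("/", expand=True)

print(sample)
```

Keeping the parts as strings preserves the leading zeros of the group identifier.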

## Library imports

In [31]:
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## 1) Dataset loading and preprocessing

First of all, we load the training and test sets.

In [32]:
# Load a dataset into a Pandas Dataframe
# Try to load the dataset from Kaggle, if not found, load from local directory
try:
    train_df = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
    test_df = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')
except FileNotFoundError:
    train_df = pd.read_csv('kaggle/input/spaceship-titanic/train.csv')
    test_df = pd.read_csv('kaggle/input/spaceship-titanic/test.csv')

print("Full train dataset shape is {}".format(train_df.shape))
print("Full test dataset shape is {}".format(test_df.shape))

Full train dataset shape is (8693, 14)
Full test dataset shape is (4277, 13)


In [33]:
# Split the dataset into features (X) and target (y)
train_x = train_df.drop(columns=['Transported'])
train_y = train_df['Transported'].astype(int)  # Convert boolean to int (0 or 1)

test_x = test_df

To evaluate the different models, I split the training set into training and validation sets, assigning 10% of the samples to validation. With 8693 rows, that percentage gives a validation set big enough to evaluate the models.

In [34]:
train_x, val_x, train_y, val_y = train_test_split(train_x, train_y, test_size=0.1, random_state=0)
train_x.head()

|  | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4132 | 4408_01 | Mars | True | F/906/P | TRAPPIST-1e | 75.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Pich Knike |
| 7217 | 7710_01 | Europa | False | B/253/P | TRAPPIST-1e | 27.0 | False | 118.0 | 1769.0 | 4127.0 | 118.0 | 619.0 | Chabih Eguing |
| 7216 | 7709_02 | Earth | True | G/1238/P | PSO J318.5-22 | 24.0 | False | 0.0 | NaN | 0.0 | 0.0 | 0.0 | Lerome Sweett |
| 7968 | 8512_01 | Earth | False | F/1637/S | 55 Cancri e | 48.0 | False | 0.0 | 717.0 | 0.0 | 0.0 | 10.0 | Verly Flyncharlan |
| 50 | 0052_01 | Earth | False | G/6/S | TRAPPIST-1e | NaN | False | 4.0 | 0.0 | 2.0 | 4683.0 | 0.0 | Elaney Hubbarton |
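
The size of each split follows from the row count printed earlier. The sketch below reproduces the arithmetic, assuming that scikit-learn rounds the validation count up when `test_size` is a fraction; the resulting training size is consistent with the percentages reported in the HomePlanet section (174 missing values being 2.22% of the training split).

```python
import math

n_samples = 8693      # rows in train.csv, as printed above
test_size = 0.1       # fraction passed to train_test_split

# Assumption: with a float test_size, scikit-learn rounds the validation count up
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)  # 7823 870
```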


After separating a validation set from the training set, the next step is to preprocess the dataset. To do that, I will examine each feature individually.

### PassengerId

This variable represents a unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.

At first glance this looks like a variable with no information for solving the problem, much like a name. However, as the description says, the Id encodes the group, so we can extract the number of members in each group, which may be relevant.
To count the members of a group we have to use the whole dataset: training + validation + test sets. In most problems this would be questionable, because we cannot be sure the dataset contains every existing sample. For this task, however, the description tells us that the training set holds about two-thirds (~8700) of the passengers and the test set holds the remaining one-third (~4300), so we have at least partial information about all the passengers.
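
To illustrate why the splits have to be combined, here is a toy sketch with made-up identifiers in which one group is spread across two splits. The helper `group_of` is a hypothetical name introduced only for this example.

```python
import pandas as pd

# Hypothetical toy splits: group 0002 has one member in train and one in test
train_ids = pd.Series(["0001_01", "0002_01"])
test_ids = pd.Series(["0002_02", "0003_01"])

def group_of(ids: pd.Series) -> pd.Series:
    """Return the gggg part of each gggg_pp identifier."""
    return ids.str.split("_").str[0]

# Sizes seen by the training split alone vs. by all splits together
train_only_sizes = group_of(train_ids).value_counts()
combined_sizes = group_of(pd.concat([train_ids, test_ids])).value_counts()

print(train_only_sizes["0002"])  # 1: the second member is invisible
print(combined_sizes["0002"])    # 2: the true group size
```

Counting within a single split would therefore undercount any group whose members were separated by the split.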

In [35]:
# Check for missing values in PassengerId
print("Missing values in PassengerId:")
print(f"train_x: {train_x['PassengerId'].isnull().sum()}")
print(f"test_x: {test_x['PassengerId'].isnull().sum()}")
print(f"val_x: {val_x['PassengerId'].isnull().sum()}")

# Extract group identifier (gggg) from PassengerId
train_x['Group'] = train_x['PassengerId'].str.split('_').str[0]
test_x['Group'] = test_x['PassengerId'].str.split('_').str[0]
val_x['Group'] = val_x['PassengerId'].str.split('_').str[0]

# Combine all dataframes to calculate group sizes across all datasets
combined_df = pd.concat([train_x, val_x, test_x])

# Calculate the total number of members in each group
group_sizes = combined_df['Group'].value_counts()

# Add the numMembers column to each dataframe
train_x['numMembers'] = train_x['Group'].map(group_sizes)
val_x['numMembers'] = val_x['Group'].map(group_sizes)
test_x['numMembers'] = test_x['Group'].map(group_sizes)

# Display the updated dataframes
print("Updated train_x:")
print(train_x[['PassengerId', 'Group', 'numMembers']].head())

print("\nUpdated val_x:")
print(val_x[['PassengerId', 'Group', 'numMembers']].head())

print("\nUpdated test_x:")
print(test_x[['PassengerId', 'Group', 'numMembers']].head())


Missing values in PassengerId:
train_x: 0
test_x: 0
val_x: 0
Updated train_x:
     PassengerId Group  numMembers
4132     4408_01  4408           1
7217     7710_01  7710           1
7216     7709_02  7709           2
7968     8512_01  8512           1
50       0052_01  0052           1

Updated val_x:
     PassengerId Group  numMembers
3601     3868_05  3868           7
6057     6405_02  6405           4
2797     3021_01  3021           2
7110     7578_01  7578           1
8579     9158_01  9158           1

Updated test_x:
  PassengerId Group  numMembers
0     0013_01  0013           1
1     0018_01  0018           1
2     0019_01  0019           1
3     0021_01  0021           1
4     0023_01  0023           1


Now we could remove the PassengerId and Group columns, which give us no further information. However, I will wait and remove all the unneeded columns after analyzing all the features.

### HomePlanet 
The planet the passenger departed from, typically their planet of permanent residence.

This is a categorical feature, so we will need to encode it before we can pass it to the models.

In [36]:
# First I check how many different values this variable (HomePlanet) has
print("HomePlanet unique values: ", train_x['HomePlanet'].unique())

# And also if there are any missing values and how many
print("Missing values in HomePlanet: ", train_x['HomePlanet'].isnull().sum(), f"({train_x['HomePlanet'].isnull().sum()/len(train_x)*100:.2f}%)")

HomePlanet unique values:  ['Mars' 'Europa' 'Earth' nan]
Missing values in HomePlanet:  174 (2.22%)


First, let's deal with the missing values. We can assume that members of the same group depart from the same planet, so the first approach is to copy the known HomePlanet of one group member to the other members of that group whose HomePlanet is null.

In [37]:
combined_df = pd.concat([train_x, val_x, test_x])

# Fill missing HomePlanet values based on the group
combined_df['HomePlanet'] = combined_df.groupby('Group')['HomePlanet'].transform(lambda x: x.ffill().bfill())

# Update train_x, val_x, and test_x with the filled values from combined_df based on PassengerId
filled_home_planet = combined_df.set_index('PassengerId')['HomePlanet']
train_x['HomePlanet'] = train_x['PassengerId'].map(filled_home_planet)
val_x['HomePlanet'] = val_x['PassengerId'].map(filled_home_planet)
test_x['HomePlanet'] = test_x['PassengerId'].map(filled_home_planet)

# Verify if there are still missing values in HomePlanet
print("Missing values in HomePlanet after filling:")
print(f"train_x: {train_x['HomePlanet'].isnull().sum()}")
print(f"val_x: {val_x['HomePlanet'].isnull().sum()}")
print(f"test_x: {test_x['HomePlanet'].isnull().sum()}")


Missing values in HomePlanet after filling:
train_x: 96
val_x: 15
test_x: 46
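
The values that remain missing belong to passengers whose group has no member with a known HomePlanet, which includes anyone travelling alone. The toy sketch below, on made-up rows, shows this effect and also checks the assumption that a group never lists more than one planet. It uses `transform("first")` rather than the `ffill().bfill()` lambda above; this is a swapped-in alternative that gives the same result whenever a group's known values all agree.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group 0002 has one known and one missing HomePlanet
toy = pd.DataFrame({
    "Group": ["0001", "0002", "0002", "0003"],
    "HomePlanet": ["Earth", "Mars", np.nan, np.nan],
})

# Check the assumption: no group should list more than one distinct planet
planets_per_group = toy.groupby("Group")["HomePlanet"].nunique()
print(planets_per_group.max())  # 1 means the known values never conflict

# GroupBy "first" returns the first non-null value in each group
group_planet = toy.groupby("Group")["HomePlanet"].transform("first")
toy["HomePlanet"] = toy["HomePlanet"].fillna(group_planet)

print(toy["HomePlanet"].isna().sum())  # 1: group 0003 has no known member
```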


In [38]:
# Check an example to confirm the fill worked: in this case, group 0064 comes from Mars
combined_df[combined_df['Name'] == 'Colatz Keen']

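|  | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Group | numMembers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 59 | 0064_02 | Mars | True | E/3/S | TRAPPIST-1e | 33.0 | False | 0.0 | 0.0 | NaN | 0.0 | 0.0 | Colatz Keen | 0064 | 2 |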


The group-based fill recovered about 45% of the missing values in the training split (174 down to 96), but a little over 1% of the entries are still empty. Next I check the class distribution: if one class is clearly the most popular, I can simply assign it to the remaining empty entries.

In [39]:
# Class distribution for the HomePlanet column
homeplanet_distribution = train_x['HomePlanet'].value_counts(dropna=False)
homeplanet_percentages = (train_x['HomePlanet'].value_counts(normalize=True, dropna=False) * 100).round(2)

print("HomePlanet distribution (counts):")
print(homeplanet_distribution)
print("\nHomePlanet distribution (percentages):")
print(homeplanet_percentages)

HomePlanet distribution (counts):
HomePlanet
Earth     4151
Europa    1962
Mars      1614
NaN         96
Name: count, dtype: int64

HomePlanet distribution (percentages):
HomePlanet
Earth     53.06
Europa    25.08
Mars      20.63
NaN        1.23
Name: proportion, dtype: float64


Given this distribution, assigning Earth to the empty values should be correct in roughly 53% of cases, assuming the missing values follow the same distribution as the known ones. Since only 1.23% of the values are empty, this approach should leave about 0.6% of all values incorrect. That is a very low percentage, so I go with this plan.
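
The estimate can be reproduced from the counts printed above. This is a back-of-the-envelope sketch: it assumes the missing rows follow the same planet distribution as the known rows, and it takes the Earth share over known rows only, so it comes out slightly above the 53% quoted in the text.

```python
# Counts printed above for the training split after the group-based fill
counts = {"Earth": 4151, "Europa": 1962, "Mars": 1614}
n_missing = 96

n_known = sum(counts.values())
n_total = n_known + n_missing

# Assumption: missing rows follow the same planet distribution as known rows
earth_share = counts["Earth"] / n_known
missing_share = n_missing / n_total

# Share of all rows expected to receive a wrong planet when filling with Earth
expected_wrong_share = missing_share * (1 - earth_share)

print(f"{earth_share:.1%} of known rows are Earth")
print(f"{expected_wrong_share:.2%} of all rows expected to be mislabeled")
```

If the assumption fails, for example if passengers from one planet are more likely to have a missing value, the real error could be higher or lower.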

In [40]:
# Fill NaN values in the HomePlanet column with 'Earth'
train_x['HomePlanet'] = train_x['HomePlanet'].fillna('Earth')
val_x['HomePlanet'] = val_x['HomePlanet'].fillna('Earth')
test_x['HomePlanet'] = test_x['HomePlanet'].fillna('Earth')

# Verify if there are still missing values in HomePlanet
print("Missing values in HomePlanet after filling with 'Earth':")
print(f"train_x: {train_x['HomePlanet'].isnull().sum()}")
print(f"val_x: {val_x['HomePlanet'].isnull().sum()}")
print(f"test_x: {test_x['HomePlanet'].isnull().sum()}")

Missing values in HomePlanet after filling with 'Earth':
train_x: 0
val_x: 0
test_x: 0


Finally, to train a model with this feature we need to encode it. I will use one-hot encoding with one category dropped, which replaces HomePlanet with two indicator columns and so adds one more variable to the problem.

The idea is to go from HomePlanet to two boolean variables, isHomeEarth and isHomeEuropa. If both are False, we still know that the HomePlanet is Mars without having to store it explicitly in a third variable. This approach assumes that the only possible home planets are Earth, Mars and Europa, which is reasonable given the class distribution in the training set.

In [42]:
# One-hot encoding for HomePlanet
train_x['isHomeEarth'] = (train_x['HomePlanet'] == 'Earth').astype(int)
train_x['isHomeEuropa'] = (train_x['HomePlanet'] == 'Europa').astype(int)

val_x['isHomeEarth'] = (val_x['HomePlanet'] == 'Earth').astype(int)
val_x['isHomeEuropa'] = (val_x['HomePlanet'] == 'Europa').astype(int)

test_x['isHomeEarth'] = (test_x['HomePlanet'] == 'Earth').astype(int)
test_x['isHomeEuropa'] = (test_x['HomePlanet'] == 'Europa').astype(int)

# Verify the new columns
train_x[['HomePlanet', 'isHomeEarth', 'isHomeEuropa']].head()

|  | HomePlanet | isHomeEarth | isHomeEuropa |
|---|---|---|---|
| 4132 | Mars | 0 | 0 |
| 7217 | Europa | 0 | 1 |
| 7216 | Earth | 1 | 0 |
| 7968 | Earth | 1 | 0 |
| 50 | Earth | 1 | 0 |
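
As a sanity check that two indicator columns lose no information, the sketch below encodes three made-up values the same way and then decodes them again. The `decoded` series is introduced only for this illustration.

```python
import pandas as pd

planets = pd.Series(["Mars", "Europa", "Earth"], name="HomePlanet")

# Same encoding as above: two indicators, Mars is the implicit reference
encoded = pd.DataFrame({
    "isHomeEarth": (planets == "Earth").astype(int),
    "isHomeEuropa": (planets == "Europa").astype(int),
})

# Decode: start from Mars, then overwrite wherever a flag is set
decoded = pd.Series("Mars", index=encoded.index, name="HomePlanet")
decoded = decoded.mask(encoded["isHomeEarth"] == 1, "Earth")
decoded = decoded.mask(encoded["isHomeEuropa"] == 1, "Europa")

print((decoded == planets).all())  # True
```

One caveat of this scheme: any value other than Earth or Europa, including an unexpected planet, is encoded as (0, 0) and therefore read as Mars.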


### CryoSleep 
Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

This is a boolean variable, so we only have to cast it to integer. First, however, I need to check whether there are any empty values.
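
The check matters because a plain `astype(int)` raises an error when the column contains missing values. The sketch below, on a made-up column, counts the gaps and shows one possible way to cast while keeping them, using pandas nullable dtypes. It illustrates the problem and is not necessarily the approach taken in the rest of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with the same kind of problem: booleans plus a gap
cryo = pd.Series([True, False, np.nan, True], dtype=object, name="CryoSleep")

# Count the gaps first: a plain astype(int) raises on missing values
print(cryo.isna().sum())  # 1

# A nullable integer dtype keeps the gap as <NA> instead of failing
cryo_int = cryo.astype("boolean").astype("Int64")
print(cryo_int.dropna().tolist())  # [1, 0, 1]
```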