# Spaceship Titanic

Our goal is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly.


## File and Data Field Descriptions

* **train.csv** - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
    * PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
    * HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
    * CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
    * Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
    * Destination - The planet the passenger will be debarking to.
    * Age - The age of the passenger.
    * VIP - Whether the passenger has paid for special VIP service during the voyage.
    * RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
    * Name - The first and last names of the passenger.
    * Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
* **test.csv** - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.
* **sample_submission.csv** - A submission file in the correct format.
    * PassengerId - Id for each passenger in the test set.
    * Transported - The target. For each passenger, predict either True or False.
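
The `gggg_pp` and `deck/num/side` formats described above lend themselves to plain string splitting. The following is a minimal sketch of that idea, assuming pandas; the first two rows reuse identifiers that appear later in this notebook, the third row is hypothetical, and the column names `Group`, `GroupNumber`, `Deck`, `CabinNum`, and `Side` are illustrative names that are not part of the dataset.

```python
import pandas as pd

# Small illustrative frame; the last row is a hypothetical passenger
# with a missing cabin, added only to show how missing values behave.
sample = pd.DataFrame({
    "PassengerId": ["4408_01", "7709_02", "9999_01"],
    "Cabin": ["F/906/P", "G/1238/P", None],
})

# PassengerId has the form gggg_pp: group id, then number within the group.
sample[["Group", "GroupNumber"]] = sample["PassengerId"].str.split("_", expand=True)

# Cabin has the form deck/num/side; a missing cabin stays missing in all three parts.
sample[["Deck", "CabinNum", "Side"]] = sample["Cabin"].str.split("/", expand=True)

print(sample)
```

The split parts are still strings, so `GroupNumber` and `CabinNum` would need an explicit numeric conversion before being used as numeric features.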

## Library imports

In [1]:
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## 1) Dataset loading and visualization

First of all, we load the training and test sets.

In [5]:
# Load the training and test sets into pandas DataFrames.
# Try the Kaggle input path first; if it is not found, fall back to the local directory.
try:
    train_df = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
    test_df = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')
except FileNotFoundError:
    train_df = pd.read_csv('kaggle/input/spaceship-titanic/train.csv')
    test_df = pd.read_csv('kaggle/input/spaceship-titanic/test.csv')

print("Full train dataset shape is {}".format(train_df.shape))
print("Full test dataset shape is {}".format(test_df.shape))

Full train dataset shape is (8693, 14)
Full test dataset shape is (4277, 13)


In [12]:
# Split the training set into features (X) and target (y)
train_x = train_df.drop(columns=['Transported'])
train_y = train_df['Transported'].astype(int)  # Convert boolean to int (0 or 1)

# The test set has no Transported column, so it is used as-is
test_x = test_df

To evaluate the different models, I split the training set into training and validation subsets, holding out 10% of the samples (about 870 rows) for validation. This keeps most of the data for training while leaving a validation set large enough to give a meaningful estimate of model performance.

In [13]:
train_x, val_x, train_y, val_y = train_test_split(train_x, train_y, test_size=0.1, random_state=0)
train_x.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4132 | 4408_01 | Mars | True | F/906/P | TRAPPIST-1e | 75.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Pich Knike |
| 7217 | 7710_01 | Europa | False | B/253/P | TRAPPIST-1e | 27.0 | False | 118.0 | 1769.0 | 4127.0 | 118.0 | 619.0 | Chabih Eguing |
| 7216 | 7709_02 | Earth | True | G/1238/P | PSO J318.5-22 | 24.0 | False | 0.0 | NaN | 0.0 | 0.0 | 0.0 | Lerome Sweett |
| 7968 | 8512_01 | Earth | False | F/1637/S | 55 Cancri e | 48.0 | False | 0.0 | 717.0 | 0.0 | 0.0 | 10.0 | Verly Flyncharlan |
| 50 | 0052_01 | Earth | False | G/6/S | TRAPPIST-1e | NaN | False | 4.0 | 0.0 | 2.0 | 4683.0 | 0.0 | Elaney Hubbarton |
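
As a quick sanity check on the 10% hold-out used above, the sizes of the two subsets can be reproduced without the real data. This is only a sketch: `dummy_x` and `dummy_y` are hypothetical stand-ins with the same number of rows as `train.csv` (8693), and the variable names are not part of the notebook.

```python
import math

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

n_samples = 8693  # number of rows reported for train.csv above
test_size = 0.1

# Stand-in data with the same number of rows; the real features are not needed here.
dummy_x = pd.DataFrame({"feature": np.arange(n_samples)})
dummy_y = pd.Series(np.arange(n_samples) % 2, name="Transported")

tr_x, va_x, tr_y, va_y = train_test_split(
    dummy_x, dummy_y, test_size=test_size, random_state=0
)

# With a fractional test_size, scikit-learn rounds the held-out size up
# and assigns the remaining rows to the training subset.
expected_val = math.ceil(test_size * n_samples)
print(len(tr_x), len(va_x), expected_val)
```

Because `train_test_split` shuffles by default, the row indices in each subset are no longer in their original order, which matches the non-sequential index shown in the `head()` output above.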
