# Data Exploring

https://www.kaggle.com/competitions/spaceship-titanic/data

Data description

`train.csv` - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.  
- `PassengerId` - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.  
- `HomePlanet` - The planet the passenger departed from, typically their planet of permanent residence.  
- `CryoSleep` - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.  
- `Cabin` - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.  
- `Destination` - The planet the passenger will be debarking to.  
- `Age` - The age of the passenger.  
- `VIP` - Whether the passenger has paid for special VIP service during the voyage.  
- `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck` - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.  
- `Name` - The first and last names of the passenger.  
- `Transported` - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.  

In [2]:
import pandas as pd
import ydata_profiling

Reading

In [6]:
path = './spaceshit-titanic/train.csv'
path_test = './spaceshit-titanic/test.csv'

df = pd.read_csv(path)
df_test = pd.read_csv(path_test)

df.head()

|  | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [14]:
# Calculate the correlation matrix
correlation_matrix = df.corr(numeric_only=True)

# Display the correlation matrix
correlation_matrix

|  | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported |
|---|---|---|---|---|---|---|---|
| Age | 1.0 | 0.068723 | 0.130421 | 0.033133 | 0.12397 | 0.101007 | -0.075026 |
| RoomService | 0.068723 | 1.0 | -0.015889 | 0.05448 | 0.01008 | -0.019581 | -0.244611 |
| FoodCourt | 0.130421 | -0.015889 | 1.0 | -0.014228 | 0.221891 | 0.227995 | 0.046566 |
| ShoppingMall | 0.033133 | 0.05448 | -0.014228 | 1.0 | 0.013879 | -0.007322 | 0.010141 |
| Spa | 0.12397 | 0.01008 | 0.221891 | 0.013879 | 1.0 | 0.153821 | -0.221131 |
| VRDeck | 0.101007 | -0.019581 | 0.227995 | -0.007322 | 0.153821 | 1.0 | -0.207075 |
| Transported | -0.075026 | -0.244611 | 0.046566 | 0.010141 | -0.221131 | -0.207075 | 1.0 |
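To read the matrix more easily, the `Transported` column can be pulled out and sorted by absolute value. The following is a minimal sketch: the helper name `rank_by_target` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def rank_by_target(frame: pd.DataFrame, target: str = "Transported") -> pd.Series:
    """Return each numeric column's Pearson correlation with `target`,
    ordered from strongest to weakest by absolute value."""
    # Boolean columns such as the target are kept by numeric_only=True.
    corr = frame.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]


# Toy data for illustration only (not taken from train.csv).
toy = pd.DataFrame(
    {
        "Spa": [0.0, 10.0, 20.0, 30.0],
        "Age": [30.0, 20.0, 40.0, 10.0],
        "Transported": [True, True, False, False],
    }
)
ranked = rank_by_target(toy)
print(ranked)
```

Applied to the matrix shown here, this ordering would put `RoomService` (-0.245), `Spa` (-0.221), and `VRDeck` (-0.207) first. All three are weak negative correlations with the target.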


Profiling

In [None]:
profiler = ydata_profiling.ProfileReport(df, title='Spaceship Titanic Profiling Report', explorative=True)

profiler.to_file("profile.html")

profiler_test = ydata_profiling.ProfileReport(df_test, title='Spaceship Titanic Profiling Report Test', explorative=True)

profiler_test.to_file("profile_test.html")

Notes

Missing values are spread throughout the training dataset, and they also appear in the test dataset.

A data pipeline is therefore needed to handle them consistently before training the model.
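One way to keep the treatment of missing values consistent is to learn the fill values from the training data only and reuse them on the test data. The sketch below assumes median filling for numeric columns and most-frequent-value filling for the rest. These strategies, the helper names `learn_fill_values` and `apply_fill_values`, and the toy frames are all assumptions made for illustration. Identifier-like columns such as `PassengerId` or `Name` are better left out of `columns`.

```python
import pandas as pd


def learn_fill_values(train: pd.DataFrame, columns: list[str]) -> dict:
    """Learn one fill value per column from the training data only:
    the median for numeric columns, the most frequent value otherwise."""
    fills = {}
    for col in columns:
        series = train[col]
        is_number = pd.api.types.is_numeric_dtype(series) and not pd.api.types.is_bool_dtype(series)
        if is_number:
            fills[col] = series.median()
        else:
            modes = series.mode(dropna=True)
            if not modes.empty:
                fills[col] = modes.iloc[0]
    return fills


def apply_fill_values(frame: pd.DataFrame, fills: dict) -> pd.DataFrame:
    """Return a copy of `frame` with missing values replaced by the learned values."""
    known = {col: value for col, value in fills.items() if col in frame.columns}
    return frame.fillna(known)


# Toy frames for illustration only (not taken from the competition files).
train_toy = pd.DataFrame({"Age": [20.0, None, 40.0], "HomePlanet": ["Earth", "Earth", None]})
test_toy = pd.DataFrame({"Age": [None, 33.0], "HomePlanet": [None, "Mars"]})

fills = learn_fill_values(train_toy, ["Age", "HomePlanet"])
train_filled = apply_fill_values(train_toy, fills)
test_filled = apply_fill_values(test_toy, fills)
print(test_filled)
```

Learning the values on the training frame and only applying them to the test frame avoids leaking test statistics into the model inputs.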

# Data Exploring

Useful information

- `Cabin` has the pattern deck/num/side.
- `PassengerId` has no missing values.  
  The pattern is *gggg_pp*, where *gggg* indicates the group the passenger is travelling with and *pp* is their number within the group.
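Both patterns can be split into separate columns with pandas string methods. The sketch below uses three rows copied from the `df.head()` output. The helper name `split_id_and_cabin` and the new column names (`Group`, `NumberInGroup`, `Deck`, `CabinNum`, `Side`, `GroupSize`) are assumptions chosen for illustration.

```python
import pandas as pd


def split_id_and_cabin(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with PassengerId and Cabin split into their parts."""
    out = frame.copy()
    # PassengerId looks like gggg_pp: group id, then number within the group.
    out[["Group", "NumberInGroup"]] = out["PassengerId"].str.split("_", expand=True)
    # Cabin looks like deck/num/side; a missing cabin stays missing in every part.
    out[["Deck", "CabinNum", "Side"]] = out["Cabin"].str.split("/", expand=True)
    # Number of passengers in the frame that share each group id.
    out["GroupSize"] = out.groupby("Group")["PassengerId"].transform("count")
    return out


# Three rows taken from the df.head() output.
sample = pd.DataFrame(
    {
        "PassengerId": ["0003_01", "0003_02", "0004_01"],
        "Cabin": ["A/0/S", "A/0/S", "F/1/S"],
    }
)
parts = split_id_and_cabin(sample)
print(parts)
```

The split parts are strings, so `CabinNum` and `NumberInGroup` would need an explicit conversion before being used as numbers.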

## Profiling

In [45]:
# Summarize missing values, dtype, and number of distinct values per column
pd.DataFrame(
    {
        'Number of missing values': df.isnull().sum(),
        'Type': df.dtypes,
        'Distinct values': df.nunique(),
    }
).sort_values(by='Type')


|  | Number of missing values | Type | Distinct values |
|---|---|---|---|
| Transported | 0 | bool | 2 |
| Age | 179 | float64 | 80 |
| RoomService | 181 | float64 | 1273 |
| FoodCourt | 183 | float64 | 1507 |
| ShoppingMall | 208 | float64 | 1115 |
| Spa | 183 | float64 | 1327 |
| VRDeck | 188 | float64 | 1306 |
| PassengerId | 0 | object | 8693 |
| HomePlanet | 201 | object | 3 |
| CryoSleep | 217 | object | 2 |


Some fields are categorical, and others are numerical.

Numerical fields can be used to train the model once their missing values are filled and they are normalized.
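A minimal sketch of that preparation for the numerical fields follows. It assumes median filling and z-score standardization, with all statistics learned from the training frame. The notebook does not say which fill or normalization method is intended, so both choices, the helper names, and the toy frames are assumptions.

```python
import pandas as pd


def fit_numeric_stats(train: pd.DataFrame) -> pd.DataFrame:
    """Learn per-column median, mean, and standard deviation from training data.
    Boolean and object columns are ignored by select_dtypes('number')."""
    numeric = train.select_dtypes(include="number")
    medians = numeric.median()
    filled = numeric.fillna(medians)
    return pd.DataFrame({"median": medians, "mean": filled.mean(), "std": filled.std(ddof=0)})


def transform_numeric(frame: pd.DataFrame, stats: pd.DataFrame) -> pd.DataFrame:
    """Fill missing numeric values with the training median, then standardize."""
    out = frame.copy()
    cols = [col for col in stats.index if col in out.columns]
    filled = out[cols].fillna(stats.loc[cols, "median"])
    # A column with zero standard deviation would need separate handling.
    out[cols] = (filled - stats.loc[cols, "mean"]) / stats.loc[cols, "std"]
    return out


# Toy frames for illustration only (not taken from the competition files).
train_toy = pd.DataFrame({"Age": [10.0, None, 30.0], "HomePlanet": ["Earth", "Mars", "Earth"]})
test_toy = pd.DataFrame({"Age": [None, 30.0], "HomePlanet": ["Europa", "Earth"]})

stats = fit_numeric_stats(train_toy)
train_scaled = transform_numeric(train_toy, stats)
test_scaled = transform_numeric(test_toy, stats)
print(train_scaled)
```

The categorical fields are left untouched here; they would need their own encoding step before training.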