# Spaceship Titanic

The dataset comes from the Kaggle Getting Started prediction competition [Spaceship Titanic](https://www.kaggle.com/competitions/spaceship-titanic/overview).
* PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
* HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
* CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
* Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
* Destination - The planet the passenger will be debarking to.
* Age - The age of the passenger.
* VIP - Whether the passenger has paid for special VIP service during the voyage.
* RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
* Name - The first and last names of the passenger.
* Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

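The `gggg_pp` form of `PassengerId` can be illustrated with a minimal sketch. The three IDs are taken from the first rows of the training data; the column names `group`, `number_in_group` and `group_size` are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# A few IDs in the gggg_pp form (copied from the first rows of train.csv)
ids = pd.Series(["0001_01", "0003_01", "0003_02"], name="PassengerId")

# Split on the underscore: left part is the group, right part the number within the group
parts = ids.str.split("_", expand=True)
parts.columns = ["group", "number_in_group"]  # hypothetical column names

# Group size = number of passengers sharing the same group code
parts["group_size"] = parts.groupby("group")["group"].transform("count")

print(parts)
```

Here passengers `0003_01` and `0003_02` share group `0003`, so they are travelling together.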
# Import main libraries

In [34]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go

In [35]:
# Load the training data into a dataframe
url="data/train.csv"
df = pd.read_csv(url)

# Create a copy of the dataframe
df_oginal = df.copy()

# Display the first 5 rows of the dataframe
df.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [38]:
# Check for duplicated rows

df.duplicated().sum()

0

In [39]:

# My function to get general info about the dataframe
def my_info(df: pd.DataFrame) -> pd.DataFrame:
    dict_info = {
        'n_unique': df.nunique(),
        'n_missing': df.isna().sum(),
        'non_null_count': df.count(),
        '%_of_null/total_lignes': (df.isna().sum() / df.shape[0]) * 100,
        'dtype': df.dtypes
    }
    return pd.DataFrame(dict_info)

display(my_info(df))

| | n_unique | n_missing | non_null_count | %_of_null/total_lignes | dtype |
|---|---|---|---|---|---|
| PassengerId | 8693 | 0 | 8693 | 0.0 | object |
| HomePlanet | 3 | 201 | 8492 | 2.312205 | object |
| CryoSleep | 2 | 217 | 8476 | 2.496261 | object |
| Cabin | 6560 | 199 | 8494 | 2.289198 | object |
| Destination | 3 | 182 | 8511 | 2.093639 | object |
| Age | 80 | 179 | 8514 | 2.059128 | float64 |
| VIP | 2 | 203 | 8490 | 2.335212 | object |
| RoomService | 1273 | 181 | 8512 | 2.082135 | float64 |
| FoodCourt | 1507 | 183 | 8510 | 2.105142 | float64 |
| ShoppingMall | 1115 | 208 | 8485 | 2.39273 | float64 |

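The percentage of missing values is computed in `my_info` as the number of missing rows divided by the total number of rows. As a small sketch on a toy frame with made-up values (the names `toy`, `pct_sum` and `pct_mean` are only illustrative), the same quantity can also be obtained from the mean of the boolean mask:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, only to illustrate the computation
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, None],
})

# Same formula as in my_info: missing rows / total rows * 100
pct_sum = toy.isna().sum() / toy.shape[0] * 100

# Equivalent shortcut: the mean of a boolean mask is the fraction of True values
pct_mean = toy.isna().mean() * 100

print(pct_mean)
```

Both expressions give the same result; the second one is simply shorter.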

## Statistical types of the variables
* PassengerId: categorical/nominal
* HomePlanet: categorical/nominal
* CryoSleep: categorical/boolean
* Cabin: categorical/nominal
* Destination: categorical/nominal
* Age: numerical/discrete
* VIP: categorical/boolean
* RoomService: numerical/discrete
* FoodCourt: numerical/discrete
* ShoppingMall: numerical/discrete
* Spa: numerical/discrete
* VRDeck: numerical/discrete
* Name: categorical/nominal
* Transported: categorical/boolean

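The info table reports `CryoSleep` and `VIP` with dtype `object` although they are boolean by nature, which is the usual pandas behaviour when a boolean column contains missing values. One possible way to make the dtype reflect the statistical type is the nullable `boolean` dtype. This is only a sketch on a toy series with made-up values, not a step this notebook performs:

```python
import numpy as np
import pandas as pd

# Toy column mimicking CryoSleep: booleans mixed with a missing value, stored as object
cryo = pd.Series([False, True, np.nan, False], dtype="object")
print(cryo.dtype)

# Hypothetical cleanup step: cast to the nullable boolean dtype, which keeps the missing value as <NA>
cryo_bool = cryo.astype("boolean")
print(cryo_bool)
```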
In [40]:
# Drop the name and ID columns, because they should be irrelevant for the prediction
try:
    df.drop(columns=['Name', 'PassengerId'], inplace=True)
except KeyError:
    # The columns were already dropped (e.g. the cell was run twice)
    print("Nothing to delete")


# Separate the cabin feature into the three features deck, num, side
df[['deck', 'num_cabin', 'side']] = df.Cabin.str.split('/', expand=True)

In [41]:
df.columns

Index(['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age', 'VIP',
       'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Transported', 'deck', 'num_cabin', 'side'],
      dtype='object')

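To see what `str.split('/', expand=True)` produces, including for a missing cabin, here is a minimal sketch on a toy frame. The two cabin codes are taken from the first rows of the data; the frame name `toy` is only illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame: two cabins in the deck/num/side form and one missing value
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/1/S", np.nan]})

# expand=True returns one column per piece; a missing cabin stays missing in all three
toy[["deck", "num_cabin", "side"]] = toy["Cabin"].str.split("/", expand=True)

print(toy)
```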
In [42]:
display(my_info(df))

| | n_unique | n_missing | non_null_count | %_of_null/total_lignes | dtype |
|---|---|---|---|---|---|
| HomePlanet | 3 | 201 | 8492 | 2.312205 | object |
| CryoSleep | 2 | 217 | 8476 | 2.496261 | object |
| Cabin | 6560 | 199 | 8494 | 2.289198 | object |
| Destination | 3 | 182 | 8511 | 2.093639 | object |
| Age | 80 | 179 | 8514 | 2.059128 | float64 |
| VIP | 2 | 203 | 8490 | 2.335212 | object |
| RoomService | 1273 | 181 | 8512 | 2.082135 | float64 |
| FoodCourt | 1507 | 183 | 8510 | 2.105142 | float64 |
| ShoppingMall | 1115 | 208 | 8485 | 2.39273 | float64 |
| Spa | 1327 | 183 | 8510 | 2.105142 | float64 |


## Separate the target from the numerical and categorical features

In [43]:
# Select the target
target = 'Transported'

# Categorical features without target
cat_cols = df.select_dtypes(include=['object', 'bool']).columns
cat_cols = cat_cols.drop(['Transported'])

# Numerical features
num_cols = df.select_dtypes(include='number').columns

print('Target: ', target)
print('Categorical features: ', cat_cols)
print('Numerical features: ', num_cols)


Target:  Transported
Categorical features:  Index(['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP', 'deck',
       'num_cabin', 'side'],
      dtype='object')
Numerical features:  Index(['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'], dtype='object')

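Note that `num_cabin` is listed among the categorical features: the pieces returned by the string split are strings, so `select_dtypes` sees the column as `object`. If a numeric cabin number were wanted, one possible approach is `pd.to_numeric`. This is an assumption for illustration on made-up values, not something this notebook does:

```python
import pandas as pd

# Toy cabin numbers as strings, with one missing value
num_cabin = pd.Series(["0", "1", None, "12"], dtype="object", name="num_cabin")

# Hypothetical conversion: strings become floats, missing or unparsable entries become NaN
num_cabin_numeric = pd.to_numeric(num_cabin, errors="coerce")

print(num_cabin_numeric)
```

After such a conversion the column would be picked up by `select_dtypes(include='number')` instead.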

# Univariate statistics: checking the distributions of the features



In [101]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Function to plot the histograms of the given columns in a two-column grid
def print_hitograms(columns, title='Histograms'):

    # Set the number of rows and columns of the subplot grid
    # (named n_rows/n_cols to avoid shadowing the global num_cols)
    n_cols = 2
    n_rows = int(np.ceil(len(columns) / n_cols))

    # Define the subplots, one titled cell per feature
    fig = make_subplots(rows=n_rows, cols=n_cols, subplot_titles=list(columns))

    # Create the histogram of each feature and place it in its grid cell
    for i, col in enumerate(columns):
        fig.add_trace(go.Histogram(x=df[col]), row=i // n_cols + 1, col=i % n_cols + 1)

    # Modify the layout of the figure
    fig.update_layout(
        title=title,
        title_font_size=34,
        height=500 * n_rows
    )
    fig.update_traces(showlegend=False)
    fig.show()

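The plotting function places the i-th histogram in a two-column grid using integer (floor) division for the row and the remainder for the column, both shifted by one because plotly subplot indices are 1-based. The arithmetic can be checked in isolation with a small sketch; `grid_position` is a hypothetical helper written only for this illustration.

```python
import math

def grid_position(i, n_cols=2):
    """Return the 1-based (row, col) grid cell for the i-th plot (i is 0-based)."""
    return i // n_cols + 1, i % n_cols + 1

columns = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Number of rows needed to fit all columns in a two-column grid
n_rows = math.ceil(len(columns) / 2)

# Grid cell assigned to each feature, filled row by row from left to right
positions = {col: grid_position(i) for i, col in enumerate(columns)}

print(n_rows)
print(positions)
```

With six features the grid needs three rows, and the last feature lands in the bottom-right cell.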
## Categorical variables



### Target

In [102]:
print_hitograms([target])

Roughly half of the passengers on the spaceship were transported to another dimension, so the two classes of the target are approximately balanced.

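The balance of the target can also be checked numerically rather than read off the histogram. A minimal sketch, using a toy series with made-up values in place of `df[target]`:

```python
import pandas as pd

# Toy target with made-up values, standing in for df['Transported']
transported = pd.Series([True, False, True, False, True, False], name="Transported")

# Relative frequency of each class
shares = transported.value_counts(normalize=True)

# For a boolean column, the mean is the fraction of True values
share_true = transported.mean()

print(shares)
print(share_true)
```

On the real data, the same two calls applied to `df[target]` would give the actual class proportions.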
### Categorical features

In [104]:
print_hitograms(cat_cols)

### Numerical features


In [90]:
num_cols

Index(['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'], dtype='object')

In [105]:
print_hitograms(num_cols)

In [120]:
# Let's check the amenities
amenities = num_cols.drop('Age')



print_hitograms(amenities)
# Zoom in on the Spa histogram by limiting the x-axis to [0, 1000]
fig = go.Figure()
fig.add_trace(go.Histogram(x=df['Spa']))
fig.update_layout(
    title='Spa',
    xaxis=dict(
        range=[0, 1000]  # Set the range (min, max) for the x-axis
    ),
)
