Data Exploration

In [2]:
import pandas as pd
import ydata_profiling

Reading

In [6]:
path = './spaceshit-titanic/train.csv'
path_test = './spaceshit-titanic/test.csv'

df = pd.read_csv(path)
df_test = pd.read_csv(path_test)

df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [14]:
# Calculate the correlation matrix
correlation_matrix = df.corr(numeric_only=True)

# Display the correlation matrix
correlation_matrix

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
Age,1.0,0.068723,0.130421,0.033133,0.12397,0.101007,-0.075026
RoomService,0.068723,1.0,-0.015889,0.05448,0.01008,-0.019581,-0.244611
FoodCourt,0.130421,-0.015889,1.0,-0.014228,0.221891,0.227995,0.046566
ShoppingMall,0.033133,0.05448,-0.014228,1.0,0.013879,-0.007322,0.010141
Spa,0.12397,0.01008,0.221891,0.013879,1.0,0.153821,-0.221131
VRDeck,0.101007,-0.019581,0.227995,-0.007322,0.153821,1.0,-0.207075
Transported,-0.075026,-0.244611,0.046566,0.010141,-0.221131,-0.207075,1.0


Profiling

In [3]:
profiler = ydata_profiling.ProfileReport(df, title='Spaceship Titanic Profiling Report', explorative=True)

profiler.to_file("profile.html")

profiler_test = ydata_profiling.ProfileReport(df_test, title='Spaceship Titanic Profiling Report Test', explorative=True)

profiler_test.to_file("profile_test.html")


Notes

Missing values are spread throughout the dataset, including the test dataset.

Because of this, a data pipeline that handles missing values is needed before the model can be trained.

# Data Exploration

Useful information

- `Cabin` has the pattern *deck/num/side*.
- `PassengerId` has no missing values.  
  The pattern is *gggg_pp*, where *gggg* identifies the group the passenger is travelling with and *pp* is their number within the group.

## Profiling

In [45]:
pd.DataFrame(
    {
        'Number of missing values': df.isnull().sum(),
        'Type': df.dtypes,
        'Distinct values': df.nunique()
    }
).sort_values(by='Type')


Unnamed: 0,Number of missing values,Type,Distinct values
Transported,0,bool,2
Age,179,float64,80
RoomService,181,float64,1273
FoodCourt,183,float64,1507
ShoppingMall,208,float64,1115
Spa,183,float64,1327
VRDeck,188,float64,1306
PassengerId,0,object,8693
HomePlanet,201,object,3
CryoSleep,217,object,2


Some fields are categorical, and others are numerical.

The numerical fields can be used to train the model once their missing values are filled and the values are normalized.
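
As a rough illustration of that preparation step, the sketch below fills missing values and then normalizes a few numerical columns. It assumes scikit-learn's `SimpleImputer` and `StandardScaler`; the notebook does not state which tools it uses, and `numeric_sample` is a made-up stand-in for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for a few numerical columns (values are illustrative only).
numeric_sample = pd.DataFrame(
    {
        "Age": [39.0, 24.0, np.nan, 33.0, 16.0],
        "RoomService": [0.0, 109.0, 43.0, np.nan, 303.0],
        "Spa": [0.0, 549.0, 6715.0, 3329.0, np.nan],
    }
)

# Assumed choice: median imputation followed by standardization.
numeric_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

# fit_transform returns a NumPy array, so the column names are restored here.
numeric_ready = pd.DataFrame(
    numeric_pipeline.fit_transform(numeric_sample),
    columns=numeric_sample.columns,
)
```

Imputing before scaling keeps the filled-in values on the same scale as the observed ones.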

## Dropping features

**Name** and **Cabin** consist mostly of unique values (high cardinality), as does **PassengerId**.

However, **Cabin** and **PassengerId** follow patterns that can be split to extract new information.

**Name**, on the other hand, cannot be used directly, so it will be left out when training the model.

## Feature Engineering

**PassengerId** has the pattern *gggg_pp* where *gggg* indicates a group the passenger is travelling with and *pp* is their number within the group.

The group can be used to fill missing values in the **Age** field, while the number can be discarded to keep the model simple.

**Cabin** has the pattern *deck/num/side*. Again, *num* can be discarded, leaving *deck* and *side* to be used as features.
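
The splitting described above could look roughly like the sketch below. The helper names `split_id_and_cabin` and `fill_age_by_group` are hypothetical, and filling **Age** with the group median (falling back to the overall median) is only one possible reading of "use the group to fill Age"; the notebook does not specify the rule.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def split_id_and_cabin(frame: pd.DataFrame) -> pd.DataFrame:
    """Add group, deck and side columns, then drop PassengerId and Cabin."""
    out = frame.copy()
    # "gggg_pp" -> keep only the group part.
    out["group"] = out["PassengerId"].str.split("_").str[0]
    # "deck/num/side" -> keep deck and side; missing cabins stay missing.
    cabin_parts = out["Cabin"].str.split("/")
    out["deck"] = cabin_parts.str[0]
    out["side"] = cabin_parts.str[2]
    return out.drop(columns=["PassengerId", "Cabin"])


def fill_age_by_group(frame: pd.DataFrame) -> pd.DataFrame:
    """Assumed rule: group median first, overall median as a fallback."""
    out = frame.copy()
    group_median = out.groupby("group")["Age"].transform("median")
    out["Age"] = out["Age"].fillna(group_median).fillna(out["Age"].median())
    return out


# Toy rows in the same format as the dataset (values are illustrative only).
sample = pd.DataFrame(
    {
        "PassengerId": ["0001_01", "0003_01", "0003_02", "0004_01"],
        "Cabin": ["B/0/P", "A/0/S", np.nan, "F/1/S"],
        "Age": [39.0, 58.0, np.nan, np.nan],
    }
)

# Wrapping the function lets it be used as a step in a scikit-learn pipeline.
splitter = FunctionTransformer(split_id_and_cabin)
features = splitter.fit_transform(sample)
filled = fill_age_by_group(features)
```
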

In [30]:
%%html
<style>
/* override the hard-coded white background VS Code applies to ipywidgets */
.cell-output-ipywidget-background {
   background-color: transparent !important;
}

/* make widget text color and font size follow the VS Code editor theme */
:root {
    --jp-widgets-color: var(--vscode-editor-foreground);
    --jp-widgets-font-size: var(--vscode-editor-font-size);
}
</style>


## Missing values strategy

In [None]:
import ipywidgets as widgets
from IPython.display import display

def show_df(slicer):
    return display(df.iloc[slicer:slicer+10])

slicer = widgets.IntSlider(min=0, max=len(df), step=5, value=0)
widgets.interactive(show_df, slicer=slicer)

In [37]:
import ipywidgets as widgets
from IPython.display import display

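# Browse the rows that contain at least one missing value, ten at a time
def show_df(slicer):
    # Boolean mask of rows with any missing value
    missing_values = df.isnull().any(axis=1)
    df_ = df[missing_values].iloc[slicer:slicer+10]
    return display(df_)

# The slider spans every row of df; positions past the last incomplete row show an empty frame
slider = widgets.IntSlider(value=100, min=0, max=df.shape[0], step=1, description='Slider:')

widgets.interactive(show_df, slicer=slider)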


In [40]:
df.CryoSleep.value_counts()

CryoSleep
False    5439
True     3037
Name: count, dtype: int64

### Numerical fields

- Missing values in numerical fields can be filled with one of the following strategies:
  - Fill with the median value
  - Fill with the mean value
  - Fill with the mode value
  - Fill with a constant value

The imputation strategy can be treated as a hyperparameter of the transformer and tuned to find the best option.
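
One way to tune the strategy is a grid search over the imputer's `strategy` parameter, sketched below. The synthetic data, the `LogisticRegression` model and the 5-fold setup are all assumptions made for illustration; the notebook does not name a model or a search method.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three numerical features and a binary target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["Age", "Spa", "VRDeck"])
y = (X["Spa"] + X["VRDeck"] < 0).to_numpy()
# Knock out roughly 10% of the cells to imitate missing values.
X = X.mask(rng.random(X.shape) < 0.1)

pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer()),
        ("scaler", StandardScaler()),
        ("model", LogisticRegression()),  # placeholder model
    ]
)

# The four strategies listed above; "constant" fills numeric data with 0 by default.
param_grid = {"imputer__strategy": ["mean", "median", "most_frequent", "constant"]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
best_strategy = search.best_params_["imputer__strategy"]
```

Because the imputer sits inside the pipeline, each cross-validation fold fits it on its own training split only, which avoids leaking information from the validation rows.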

### Categorical fields
- **Name** is a high-cardinality field, so it will be dropped.

- **VIP** and **CryoSleep**: both columns are True/False, so two approaches could be tried for the missing values:
  - Replace them with the most frequent value.
  - Replace them with False (treating the absence of information as a negative answer).  
    - Note: the choice between the two can be treated as a hyperparameter during tuning.

- **HomePlanet** and **Destination**: a classifier could be used to fill the missing values based on the other fields.
- **Cabin**: the value will be split to extract the *deck* and the *side* of the ship. After that, the missing values will be filled by a classifier, as with **HomePlanet**.
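
Classifier-based filling could be sketched as follows: train on the rows where the field is known and predict it for the rows where it is missing. The helper `impute_with_classifier` is hypothetical, `RandomForestClassifier` is an assumed choice, and the sketch expects the feature columns to be numeric and already complete.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def impute_with_classifier(frame: pd.DataFrame, target: str, features: list) -> pd.DataFrame:
    """Fill missing values of `target` with predictions made from `features`."""
    out = frame.copy()
    known = out[target].notna()
    # Nothing to do when no value is missing or no value is known.
    if known.all() or not known.any():
        return out
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(out.loc[known, features], out.loc[known, target])
    out.loc[~known, target] = model.predict(out.loc[~known, features])
    return out


# Toy rows (values are illustrative only); two HomePlanet entries are missing.
toy = pd.DataFrame(
    {
        "Age": [39.0, 24.0, 58.0, 33.0, 16.0, 44.0, 27.0, 51.0],
        "FoodCourt": [3576.0, 9.0, 1283.0, 2400.0, 70.0, 0.0, 15.0, 1900.0],
        "HomePlanet": ["Europa", "Earth", "Europa", np.nan, "Earth", "Earth", np.nan, "Europa"],
    }
)

filled = impute_with_classifier(toy, target="HomePlanet", features=["Age", "FoodCourt"])
```

The same helper could be applied to **Destination**, *deck* and *side* once the other categorical fields have been encoded as numbers.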

### Planning the pipeline

- Dropping
1. **Name**: drop the field.

- Feature Engineering
1. **PassengerId**: split the field into two new fields, *group* and *number*, and keep only the *group*. There will be no missing values in this field.
   `Create a function and wrap it in a FunctionTransformer`
2. **Cabin**: split the field to obtain two new fields: *deck* and *side*.  
   There will be missing values in these fields, so they will need to be filled by a classifier.  
   `Create a function and wrap it in a FunctionTransformer`
3. Encode the categorical fields: *deck*, *side*, *HomePlanet*, *Destination*, *group*, *VIP*, *CryoSleep*.  
   `Create a transformer based on OrdinalEncoder`

- Missing values
1. **Numeric fields**: fill with the median value. `Set a hyperparameter to choose the best strategy`
2. **VIP** and **CryoSleep**: fill with the most frequent value. `Set a hyperparameter to choose the best strategy`
3. **HomePlanet**: fill with a classifier. 
4. **Destination**: fill with a classifier.
5. **deck** and **side**: fill with a classifier.

- Feature Engineering
1. One-hot encode all the categorical fields.
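
A simplified sketch of how these steps could be assembled with scikit-learn's `ColumnTransformer` is shown below. It is not the full plan: the classifier-based filling is replaced here by most-frequent imputation as a stand-in, only a few columns are included, and the helper `booleans_to_float` and the sample rows are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

numeric_cols = ["Age", "RoomService"]
boolean_cols = ["VIP", "CryoSleep"]
categorical_cols = ["HomePlanet", "Destination"]


def booleans_to_float(frame: pd.DataFrame) -> pd.DataFrame:
    """Map True/False to 1.0/0.0 while keeping missing values as NaN."""
    return frame.astype(float)


numeric_steps = Pipeline(
    [("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
boolean_steps = Pipeline(
    [
        ("to_float", FunctionTransformer(booleans_to_float)),
        ("imputer", SimpleImputer(strategy="most_frequent")),
    ]
)
# Stand-in: most-frequent imputation instead of the planned classifier.
categorical_steps = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("one_hot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_steps, numeric_cols),
        ("boolean", boolean_steps, boolean_cols),
        ("categorical", categorical_steps, categorical_cols),
    ],
    remainder="drop",    # columns not listed, such as Name, are dropped
    sparse_threshold=0,  # always return a dense array
)

# Toy rows (values are illustrative only).
sample = pd.DataFrame(
    {
        "Name": ["A One", "B Two", "C Three", "D Four"],
        "Age": [39.0, 24.0, np.nan, 33.0],
        "RoomService": [0.0, 109.0, 43.0, np.nan],
        "VIP": [False, False, True, np.nan],
        "CryoSleep": [False, np.nan, False, True],
        "HomePlanet": ["Europa", "Earth", np.nan, "Europa"],
        "Destination": ["TRAPPIST-1e", "TRAPPIST-1e", "55 Cancri e", np.nan],
    }
)

transformed = preprocessor.fit_transform(sample)
```

The imputation strategies stay reachable for tuning through parameter names such as `numeric__imputer__strategy` and `boolean__imputer__strategy`.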
