# Data Processing

Feature engineering and missing-value imputation strategies for the project.

In [5]:
import pandas as pd

In [9]:
path = 'data/spaceshit-titanic/train.csv'

df = pd.read_csv(path)

df.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [11]:
pd.DataFrame(
    {
        'Number of missing values':df.isnull().sum(), 
        'Type': df.dtypes,
        'Distinct values': df.nunique()
    }
).sort_values(by='Type')

| Column | Number of missing values | Type | Distinct values |
|---|---|---|---|
| Transported | 0 | bool | 2 |
| Age | 179 | float64 | 80 |
| RoomService | 181 | float64 | 1273 |
| FoodCourt | 183 | float64 | 1507 |
| ShoppingMall | 208 | float64 | 1115 |
| Spa | 183 | float64 | 1327 |
| VRDeck | 188 | float64 | 1306 |
| PassengerId | 0 | object | 8693 |
| HomePlanet | 201 | object | 3 |
| CryoSleep | 217 | object | 2 |


## Dropping features

The **Name** and **Cabin** columns consist mostly of unique values (high cardinality), as does **PassengerId**.

However, **Cabin** and **PassengerId** follow patterns that can be split to extract new information.

**Name**, on the other hand, cannot be used directly, so it will be disregarded when training the model.
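
The cardinality claim can be checked by comparing the number of distinct values with the number of rows. The following is a minimal sketch on a toy frame built from the first four rows shown in `df.head()`; the `cardinality_ratio` name is only illustrative.

```python
import pandas as pd

# First four rows of the head() output above, reduced to three columns
toy = pd.DataFrame({
    'PassengerId': ['0001_01', '0002_01', '0003_01', '0003_02'],
    'Name': ['Maham Ofracculy', 'Juanna Vines', 'Altark Susent', 'Solam Susent'],
    'HomePlanet': ['Europa', 'Earth', 'Europa', 'Europa'],
})

# Share of distinct values per column: values close to 1.0 mean high cardinality
cardinality_ratio = toy.nunique() / len(toy)
print(cardinality_ratio.sort_values(ascending=False))
```

On the full dataset the same ratio could be computed with `df.nunique() / len(df)`.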

## Feature Engineering

**PassengerId** has the pattern *gggg_pp*, where *gggg* indicates the group the passenger is travelling with and *pp* is their number within the group. This yields two columns: **Group** and **Number**.

This information could be useful for predicting whether a passenger was transported, but its value needs to be tested after the model is trained.
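
As a rough illustration of the split, the sketch below uses `Series.str.split` on three identifiers taken from the head output. Casting to integers is an assumption made here for readability; it is not necessarily what the project code does.

```python
import pandas as pd

passenger_id = pd.Series(['0001_01', '0003_01', '0003_02'], name='PassengerId')

# Split on the underscore into the two parts of the gggg_pp pattern
parts = passenger_id.str.split('_', expand=True)
parts.columns = ['Group', 'Number']

# Cast to integers so both parts can be used as numeric features
parts = parts.astype(int)
print(parts)
```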

**Cabin** has the pattern *deck/num/side*. Again, this information could be useful and will be evaluated after the model is trained. This time, three columns are created: **Deck**, **Num** and **Side**.
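
The same idea applies to **Cabin**. The sketch below uses toy values and one missing entry to show that a missing cabin produces missing values in all three derived columns. Converting **Num** with `pd.to_numeric` is an assumption for illustration.

```python
import pandas as pd

cabin = pd.Series(['B/0/P', 'F/1/S', None], name='Cabin')

# Split deck/num/side into three columns
cabin_parts = cabin.str.split('/', expand=True)
cabin_parts.columns = ['Deck', 'Num', 'Side']

# Num is numeric; the missing cabin stays missing after conversion
cabin_parts['Num'] = pd.to_numeric(cabin_parts['Num'])
print(cabin_parts)
```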

## Encoding

The categorical columns must be encoded before they can be passed to the model.

The following columns will be encoded:
- **Deck** (from **Cabin**)
- **Side** (from **Cabin**)
- **VIP** 
- **CryoSleep**
- **HomePlanet**
- **Destination**

As stated above, the **Name** column will be dropped.

The **Group**, **Number** (from **PassengerId**) and **Num** (from **Cabin**) columns will be kept as integers, since they are already numerical.

### Two-step encoding

In the first step, the string classes will be turned into integers through a simple ordinal encoder.

After the missing values are replaced (read more below), the columns will be one-hot encoded, which creates a new boolean column for each class.


## Missing values strategy

### Numerical fields

- The following strategies might be applied to fill missing values in numerical fields:
  - Fill with the median value
  - Fill with the mean value
  - Fill with the mode value
  - Fill with a constant value
  - Drop the rows with missing values
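
To compare these options, the fill value that each strategy would use can be inspected through `SimpleImputer.statistics_`. The sketch below uses a toy **Age** column; the constant `0.0` is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

age = pd.DataFrame({'Age': [39.0, 24.0, np.nan, 33.0, 24.0]})

# Fitted fill value for each candidate strategy
fill_values = {}
for strategy in ['median', 'mean', 'most_frequent']:
    imputer = SimpleImputer(strategy=strategy).fit(age)
    fill_values[strategy] = imputer.statistics_[0]

# Constant fill: the value is chosen by hand (0.0 is arbitrary here)
constant = SimpleImputer(strategy='constant', fill_value=0.0).fit(age)
fill_values['constant'] = constant.statistics_[0]

# Alternative: drop the rows instead of filling them
dropped = age.dropna()
print(fill_values, len(dropped))
```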

### Categorical fields

- **PassengerId**: no missing values, so **Group** and **Number** will have none either.

- As a reminder, **Name** is a high cardinality field, so it will be dropped.

- **VIP** and **CryoSleep**: both columns are True/False fields, so the following approaches could be tried for the missing values:
  - Replace with the most frequent value.
  - Replace with False (treating the absence of information as a negative answer).
  - Drop the rows with missing values.

- **HomePlanet** and **Destination** have more than 2 classes each. Possible strategies:
  - Use a classifier to fill the missing values based on the other fields.
  - Replace with the most frequent value.
  - Replace with a constant value.
  - Drop the rows with missing values.

- **Deck**, **Num** and **Side** will have missing values wherever **Cabin** is missing, since they are derived from it. The following strategies could be applied:
  - Use a classifier to fill the missing values based on the other fields.
  - Replace with the most frequent value.
  - Replace with a constant value.
  - Drop the rows with missing values.
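
The categorical strategies can be sketched on a toy frame as follows. The decision tree, the use of **Age** as the only predictor, and the `'Unknown'` placeholder are all assumptions made for illustration, not choices taken from the project.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

toy = pd.DataFrame({
    'HomePlanet': ['Europa', 'Earth', 'Earth', None, 'Europa', 'Earth'],
    'Age': [39.0, 24.0, 22.0, 58.0, 33.0, 16.0],
})
known = toy['HomePlanet'].notna()

# Strategy: most frequent value
most_frequent = toy['HomePlanet'].fillna(toy['HomePlanet'].mode()[0])

# Strategy: constant value (hypothetical placeholder class)
constant = toy['HomePlanet'].fillna('Unknown')

# Strategy: classifier trained on the rows where the value is known
clf = DecisionTreeClassifier(random_state=0)
clf.fit(toy.loc[known, ['Age']], toy.loc[known, 'HomePlanet'])
predicted = toy['HomePlanet'].copy()
predicted.loc[~known] = clf.predict(toy.loc[~known, ['Age']])

# Strategy: drop the rows with missing values
dropped = toy.dropna(subset=['HomePlanet'])
```

The `KNNImputer` used later in this notebook plays a similar role to the classifier here: it estimates a missing entry from the most similar rows.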

## Scikit-learn transformers pipeline

For the data processing, a scikit-learn pipeline will be used to apply the strategies stated above.

A pipeline is a sequence of transformers followed by a final estimator. The transformers are applied to the data in the order they are added to the pipeline.

```python
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('step1', transformer1),
    ('step2', transformer2),
    # ... further (name, transformer) steps
    ('stepN', transformerN),
    ('estimator', estimator)
])
```

The intermediate transformers are applied to the feature matrix `X`, while the final estimator is fitted on the transformed `X` together with the target `y`.
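
For a concrete, runnable version of this structure, the sketch below chains an imputer, a scaler and a classifier on toy data. `LogisticRegression` and the toy arrays are placeholders chosen for illustration; the project's actual estimator is defined elsewhere.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with missing entries and a binary target
X_toy = np.array([
    [39.0, 0.0], [24.0, 109.0], [np.nan, 43.0],
    [33.0, 0.0], [16.0, 303.0], [58.0, np.nan],
])
y_toy = np.array([0, 1, 0, 0, 1, 0])

toy_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('estimator', LogisticRegression()),  # placeholder estimator
])

# fit() fits and applies each transformer in order,
# then fits the estimator on the transformed X together with y
toy_pipeline.fit(X_toy, y_toy)
predictions = toy_pipeline.predict(X_toy)
```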

### Different transformers for different columns types

Because the project dataset mixes column types, the ColumnTransformer will be used to apply different transformers to different columns. Rather than chaining steps in sequence, ColumnTransformer applies each transformer to its own subset of columns, specified by a list of column labels or indexes, and concatenates the results.

```python
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)
    ])
```

After defining a ColumnTransformer, it can be used in a pipeline as a transformer.

```python
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('estimator', estimator)
])
```

In this way, the pipeline applies each transformation to the right columns at each step.


### Custom transformers for feature engineering

Besides the built-in scikit-learn transformers, custom transformers can be created for feature engineering. FunctionTransformer wraps a custom function so that it can be used as a pipeline step. Combined with the ColumnTransformer, it applies feature engineering to specific columns.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def feature_engineering(X):
    # do something with X
    return X

# Wrap the function under a new name so the function itself is not shadowed
feature_engineering_transformer = FunctionTransformer(feature_engineering)

# feature_engineering_features and estimator are placeholders
preprocessor = ColumnTransformer(
    transformers=[
        ('feature_engineering', feature_engineering_transformer, feature_engineering_features)
    ]
)

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('estimator', estimator)
])
```

Putting together all of the information above, a pipeline will be created to process the data and train the model.

Below, the code for the project transformers pipeline is presented.

The complete pipeline, with the final step estimator, will be defined in the `Training` notebook.


    Note: this code may differ from the final code, which will change as the project develops.

In [17]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

X = df.drop('Transported', axis=1)

## 1. Columns separation
# Categorical columns
categorical_columns = [
    'Group', # it will be created by the passenger_id_spliter
    'Deck', # it will be created by the cabin_spliter
    'Num', # it will be created by the cabin_spliter
    'Side', # it will be created by the cabin_spliter
    'VIP',
    'CryoSleep',
    'HomePlanet',
    'Destination'
]
# Categorical columns to encode. They have low cardinality
columns_to_encode = [
    'Deck', # it will be created by the cabin_spliter
    'Side', # it will be created by the cabin_spliter
    'VIP',
    'CryoSleep',
    'HomePlanet',
    'Destination'
]

# Numeric columns
numeric_columns = X.select_dtypes(include=['int64', 'float64']).columns

## 2. FunctionTransformers. Custom transformers for columns
# Function
def passenger_id_spliter(passenger_id_col: pd.Series) -> pd.DataFrame:
    '''Function to split the passenger id and keep only the group part

    Args:
        passenger_id_col: pd.Series - The passenger id column

    Returns:
        pd.DataFrame - The dataframe with a single column (Group); Number is dropped
    '''
    df = (
        pd.DataFrame(
            passenger_id_col
            .str
            .split('_')
            .to_list(),
        columns=['Group', 'Number'],
        index=passenger_id_col.index
        )
        .drop('Number', axis=1)
    )
    return df

def cabin_spliter(cabin_col: pd.Series) -> pd.DataFrame:
    '''Function to split the cabin into three columns

    Args:
        cabin_col: pd.Series - The cabin column

    Returns:
        pd.DataFrame - The dataframe with three columns (Deck, Num, Side)
    '''
    df = (
        cabin_col
        .str
        .split('/', expand=True)
        .rename(columns={0: 'Deck', 1: 'Num', 2: 'Side'})
    )

    return df

## 3. Transformer instantiation

# Custom transformers
passenger_id_transformer = FunctionTransformer(func=passenger_id_spliter,
    feature_names_out=lambda _, __: np.array(['Group'])
)

cabin_transformer = FunctionTransformer(func=cabin_spliter,
    feature_names_out=lambda _, __: np.array(['Deck', 'Num', 'Side'])
)

# Encoders
encoder = OrdinalEncoder()
one_hot = OneHotEncoder()

# Scalers
scaler = StandardScaler()

# Imputer
knn = KNNImputer(n_neighbors=10, weights='uniform')
simple_imputer = SimpleImputer(strategy='median')

# Rounder for the categorical columns
rounder = FunctionTransformer(func=lambda x: np.round(x, 0),
    feature_names_out=lambda _, __: categorical_columns
)

## 4. Transformers pipeline

# Columns splitter function transformer
splitter_transformer = ColumnTransformer(
    transformers=[
        ('passenger_id', passenger_id_transformer, 'PassengerId'),
        ('cabin', cabin_transformer, 'Cabin')
    ],
    remainder='passthrough',
    force_int_remainder_cols=False, #type: ignore
    verbose_feature_names_out=False
)

# Encode the categorical columns
encoder_transfomer = ColumnTransformer(
    transformers=[
        ('encoder', encoder, columns_to_encode)
    ],
    remainder='passthrough',
    force_int_remainder_cols=False, #type: ignore
    verbose_feature_names_out=False
)

# Imputers: median for the numeric columns, KNN for the categorical columns
imputer_transformer = ColumnTransformer(
    transformers=[
        ('numeric_imputer', simple_imputer, numeric_columns),
        ('categorical_imputer', knn, categorical_columns)
    ],
    remainder='passthrough',
    force_int_remainder_cols=False, #type: ignore
    verbose_feature_names_out=False
)

# Rounder and scalers
rounder_scaler_transformer = ColumnTransformer(
    transformers=[
        ('rounder', rounder, categorical_columns),
        ('scaler', scaler, numeric_columns)
    ],
    remainder='passthrough',
    force_int_remainder_cols=False, #type: ignore
    verbose_feature_names_out=False
)

# Dropping columns
drop_columns_transformer = ColumnTransformer(
    transformers=[
        ('drop', 'drop', 'Name')
    ],
    remainder='passthrough',
    force_int_remainder_cols=False, #type: ignore
    verbose_feature_names_out=False
)

# Gather all transformers into a pipeline
transfomers = Pipeline([
    ('splitter', splitter_transformer),
    ('encoder', encoder_transfomer),
    ('imputer', imputer_transformer),
    ('rounder_scaler', rounder_scaler_transformer),
    ('drop_columns', drop_columns_transformer)
])

# Guarantee that the output of "transform" and "fit_transform" will be a pandas dataframe
# This is necessary because the later transformers select columns by label
transfomers.set_output(transform='pandas')

## 5. Fit and transform the data
transfomers.fit_transform(X)

| | Group | Deck | Num | Side | VIP | CryoSleep | HomePlanet | Destination | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.711945 | -0.333105 | -0.281027 | -0.283579 | -0.270626 | -0.263003 |
| 1 | 2.0 | 5.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | -0.334037 | -0.168073 | -0.275387 | -0.241771 | 0.217158 | -0.224205 |
| 2 | 3.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 2.0 | 2.036857 | -0.268001 | 1.959998 | -0.283579 | 5.695623 | -0.219796 |
| 3 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.293552 | -0.333105 | 0.523010 | 0.336851 | 2.687176 | -0.092818 |
| 4 | 4.0 | 5.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | -0.891895 | 0.125652 | -0.237159 | -0.031059 | 0.231374 | -0.261240 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8688 | 9276.0 | 0.0 | 98.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.851410 | -0.333105 | 3.992336 | -0.283579 | 1.189173 | -0.197751 |
| 8689 | 9278.0 | 6.0 | 1499.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | -0.752431 | -0.333105 | -0.281027 | -0.283579 | -0.270626 | -0.263003 |
| 8690 | 9279.0 | 6.0 | 1500.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | -0.194573 | -0.333105 | -0.281027 | 2.846999 | -0.269737 | -0.263003 |
| 8691 | 9280.0 | 4.0 | 608.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.223820 | -0.333105 | 0.376365 | -0.283579 | 0.043013 | 2.589576 |
