# Data Transformation

In this notebook, I'm going to replace, modify, reshape, and scale the data so that it is in a form the model can use, which should help its accuracy.

Import libraries

In [1]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import re
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, mean_squared_error, accuracy_score, precision_score, recall_score

## Data dictionary

- **PassengerId** - A unique Id for each passenger. Each Id takes the form ```gggg_pp``` where ```gggg``` indicates a group the passenger is travelling with and ```pp``` is their number within the group. People in a group are often family members, but not always.
- **HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.
- **CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- **Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- **Destination** - The planet the passenger will be debarking to.
- **Age** - The age of the passenger.
- **VIP** - Whether the passenger has paid for special VIP service during the voyage.
- **RoomService, FoodCourt, ShoppingMall, Spa, VRDeck** - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- **Name** - The first and last names of the passenger.
- **Transported** - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

## Check the dataframe

Let's load the data.

In [2]:
df = pd.read_csv('../data/stg/train_stg.csv')
# df = pd.read_csv('../data/train.csv', dtype_backend='pyarrow')

In [3]:
df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Deck,Num,Side,Luxury,Group,GroupSize,TravelingAlone
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,B,0,P,0.0,1,1,True
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,F,0,S,736.0,2,1,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,A,0,S,10383.0,3,2,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,A,0,S,5176.0,3,2,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,F,1,S,1091.0,4,1,True


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   PassengerId     8693 non-null   object 
 1   HomePlanet      8693 non-null   object 
 2   CryoSleep       8693 non-null   bool   
 3   Cabin           8693 non-null   object 
 4   Destination     8693 non-null   object 
 5   Age             8693 non-null   float64
 6   VIP             8693 non-null   bool   
 7   RoomService     8693 non-null   float64
 8   FoodCourt       8693 non-null   float64
 9   ShoppingMall    8693 non-null   float64
 10  Spa             8693 non-null   float64
 11  VRDeck          8693 non-null   float64
 12  Transported     8693 non-null   bool   
 13  Deck            8693 non-null   object 
 14  Num             8693 non-null   int64  
 15  Side            8693 non-null   object 
 16  Luxury          8693 non-null   float64
 17  Group           8693 non-null   i

In [5]:
df.describe()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Num,Luxury,Group,GroupSize
count,8693.0,8693.0,8693.0,8693.0,8693.0,8693.0,8693.0,8693.0,8693.0,8693.0
mean,28.790291,220.009318,448.434027,169.5723,304.588865,298.26182,586.624065,1440.866329,4633.389624,2.035546
std,14.341404,660.51905,1595.790627,598.007164,1125.562559,1134.126417,513.880084,2803.045694,2671.028856,1.596347
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
25%,20.0,0.0,0.0,0.0,0.0,0.0,152.0,0.0,2319.0,1.0
50%,27.0,0.0,0.0,0.0,0.0,0.0,407.0,716.0,4630.0,1.0
75%,37.0,41.0,61.0,22.0,53.0,40.0,983.0,1441.0,6883.0,3.0
max,79.0,14327.0,29813.0,23492.0,22408.0,24133.0,1894.0,35987.0,9280.0,8.0


In [6]:
df.corr(numeric_only=True)

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Num,Luxury,Group,GroupSize,TravelingAlone
CryoSleep,1.0,-0.071323,-0.078281,-0.244089,-0.205928,-0.207798,-0.198307,-0.192721,0.460132,-0.040133,-0.376692,-0.006883,0.079363,-0.091562
Age,-0.071323,1.0,0.091863,0.068629,0.12739,0.033148,0.120946,0.09959,-0.074233,-0.127788,0.184628,-0.009099,-0.176957,0.133804
VIP,-0.078281,0.091863,1.0,0.056566,0.125499,0.018412,0.060991,0.123061,-0.037261,-0.096811,0.162987,0.013608,0.002856,-0.034027
RoomService,-0.244089,0.068629,0.056566,1.0,-0.015126,0.052337,0.009244,-0.018624,-0.241124,-0.012673,0.234374,0.000375,-0.039734,0.019338
FoodCourt,-0.205928,0.12739,0.125499,-0.015126,1.0,-0.013717,0.221468,0.224572,0.045583,-0.177197,0.742608,-0.0092,0.032502,-0.066683
ShoppingMall,-0.207798,0.033148,0.018412,0.052337,-0.013717,1.0,0.014542,-0.007849,0.009391,0.00353,0.220529,0.017796,-0.038536,0.029095
Spa,-0.198307,0.120946,0.060991,0.009244,0.221468,0.014542,1.0,0.147658,-0.218545,-0.129222,0.592656,-0.005198,0.019218,-0.043639
VRDeck,-0.192721,0.09959,0.123061,-0.018624,0.224572,-0.007849,0.147658,1.0,-0.204874,-0.133074,0.585684,0.015945,0.00913,-0.044293
Transported,0.460132,-0.074233,-0.037261,-0.241124,0.045583,0.009391,-0.218545,-0.204874,1.0,-0.043832,-0.199514,0.021491,0.082644,-0.113792
Num,-0.040133,-0.127788,-0.096811,-0.012673,-0.177197,0.00353,-0.129222,-0.133074,-0.043832,1.0,-0.208844,0.665621,-0.051351,0.133426


## Data Transformation

First, drop the `PassengerId` column. I don't think the feature carries enough value to keep it in the dataframe.

In [7]:
df.drop('PassengerId', axis=1, inplace=True)

Next, drop the individual spending features. The `Luxury` feature we created earlier is their sum, so it already encapsulates that data.

In [8]:
df.drop(['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'], axis=1, inplace=True)

In the previous notebook, I split the `Cabin` feature into `Deck`, `Num`, and `Side`, so the original column is no longer needed.

In [9]:
df.drop('Cabin', axis=1, inplace=True)

In [10]:
df.columns

Index(['HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP', 'Transported',
       'Deck', 'Num', 'Side', 'Luxury', 'Group', 'GroupSize',
       'TravelingAlone'],
      dtype='object')

Next, handle categorical and numerical features

### Numerical Features
These features should be scaled using the standard scaler (z-scaling):

z = (x - u) / s

Here `x` is the current training value, `u` is the mean of the training samples (or zero if `with_mean=False`), and `s` is the standard deviation of the training samples (or one if `with_std=False`). (From the scikit-learn docs.)
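To make the formula concrete, here is a minimal sketch that applies it by hand and compares the result with `StandardScaler`. The values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values, made up for illustration (not taken from the dataset)
x = np.array([10.0, 20.0, 30.0, 40.0])

u = x.mean()  # mean of the training samples
s = x.std()   # population standard deviation (ddof=0), which StandardScaler also uses
z_manual = (x - u) / s

# StandardScaler expects a 2D array, hence the reshape to one column
z_sklearn = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(z_manual)
print(np.allclose(z_manual, z_sklearn))
```

Both routes should give the same values, with a mean of zero and a population standard deviation of one.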

I'm torn about the `Group` and `Num` features. In a way, `Group` is categorical because it is the ID of a group, but there are ~6000 different IDs, so I should probably rescale it. `Num` is similar. For now, I'm going to ignore them.

In [11]:
num_features = ['Age', 'Luxury', 'GroupSize']

In [12]:
df['Age'].shape

(8693,)

The scaler expects a 2D array of shape (n, 1), not a 1D array of shape (n,), so we reshape the pandas Series with `df['Age'].values.reshape(-1, 1)`.

In [13]:
scalers = {}
for feature in num_features:
    current_scaler = StandardScaler()
    df[feature] = current_scaler.fit_transform(df[feature].values.reshape(-1, 1))
    scalers[feature] = current_scaler
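As a small illustration of the shapes involved, the sketch below compares a Series, the reshaped array, and a double-bracket selection on a made-up toy frame. The double-bracket form is an alternative I'm assuming is acceptable here; it is not what the cell above uses.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({'Age': [39.0, 24.0, 58.0]})  # made-up toy frame

as_series = toy['Age']                        # 1D, shape (3,)
as_column = toy['Age'].values.reshape(-1, 1)  # 2D array with one column
as_frame = toy[['Age']]                       # 2D via double brackets

print(as_series.shape, as_column.shape, as_frame.shape)

# Both 2D forms are accepted by the scaler and give the same result
scaled_from_array = StandardScaler().fit_transform(as_column)
scaled_from_frame = StandardScaler().fit_transform(as_frame)
print(np.allclose(scaled_from_array, scaled_from_frame))
```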

In [14]:
df[num_features]

Unnamed: 0,Age,Luxury,GroupSize
0,0.711945,-0.514066,-0.648735
1,-0.334037,-0.251479,-0.648735
2,2.036857,3.190333,-0.022268
3,0.293552,1.332604,-0.022268
4,-0.891895,-0.124824,-0.648735
...,...,...,...
8688,0.851410,2.531369,-0.648735
8689,-0.752431,-0.514066,-0.648735
8690,-0.194573,0.154175,-0.648735
8691,0.223820,1.140302,-0.022268


In [15]:
df[num_features].describe()

Unnamed: 0,Age,Luxury,GroupSize
count,8693.0,8693.0,8693.0
mean,-2.125171e-17,1.409969e-17,-1.3077980000000001e-17
std,1.000058,1.000058,1.000058
min,-2.00761,-0.5140655,-0.6487347
25%,-0.6129662,-0.5140655,-0.6487347
50%,-0.1248409,-0.2586144,-0.6487347
75%,0.572481,4.769043e-05,0.6041982
max,3.501233,12.32521,3.73653
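The `std` row shows 1.000058 rather than exactly 1. A likely explanation is that `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). The sketch below checks this on synthetic stand-in data with the same number of rows; it does not use the real columns.

```python
import numpy as np
import pandas as pd

n = 8693  # number of rows reported by df.info()
rng = np.random.default_rng(0)
values = rng.normal(size=n)  # synthetic stand-in data, not a real column

# Scale with the population standard deviation (ddof=0), as StandardScaler does
scaled = (values - values.mean()) / values.std(ddof=0)

pandas_std = pd.Series(scaled).std()  # pandas defaults to ddof=1
expected = np.sqrt(n / (n - 1))       # ratio between the two estimators

print(round(pandas_std, 6), round(expected, 6))
```

The ratio depends only on the number of rows, which is why all three columns show the same value.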


Now let's confirm that the three scalers are different.

In [16]:
scalers['Age'].mean_

array([28.79029104])

In [17]:
for scaler in scalers: 
    print(f'{scaler}')
    print(f'Mean: {scalers[scaler].mean_}')
    print(f'Scale or standard deviation: {scalers[scaler].scale_}')
    print(f'Variance: {scalers[scaler].var_}')

Age
Mean: [28.79029104]
Scale or standard deviation: [14.34057929]
Variance: [205.65221449]
Luxury
Mean: [1440.86632923]
Scale or standard deviation: [2802.88446483]
Variance: [7856161.32321115]
GroupSize
Mean: [2.03554584]
Scale or standard deviation: [1.59625469]
Variance: [2.54802903]
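One reason to keep the fitted scalers around is to reuse their parameters later. The sketch below, on made-up ages, uses `transform` to scale new data with the training mean and scale, and `inverse_transform` to go back to the original units. The variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up ages standing in for a training split and some unseen data
train_age = np.array([[20.0], [30.0], [40.0]])
new_age = np.array([[30.0], [50.0]])

age_scaler = StandardScaler().fit(train_age)  # learns mean_ and scale_ once

# transform (not fit_transform) reuses the training mean and scale
new_scaled = age_scaler.transform(new_age)

# inverse_transform maps scaled values back to the original units
recovered = age_scaler.inverse_transform(new_scaled)

print(age_scaler.mean_, age_scaler.scale_)
print(new_scaled.ravel(), recovered.ravel())
```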


### Categorical Features
These features should be one-hot or label encoded, because they represent a characteristic of the training sample.

In [18]:
df.columns

Index(['HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP', 'Transported',
       'Deck', 'Num', 'Side', 'Luxury', 'Group', 'GroupSize',
       'TravelingAlone'],
      dtype='object')

In [19]:
cat_features_gen = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Deck', 'Side', 'TravelingAlone']

In [20]:
for cat_feature in cat_features_gen: 
    print(df[cat_feature].value_counts())

HomePlanet
Earth     4803
Europa    2131
Mars      1759
Name: count, dtype: int64
CryoSleep
False    5656
True     3037
Name: count, dtype: int64
Destination
TRAPPIST-1e      6097
55 Cancri e      1800
PSO J318.5-22     796
Name: count, dtype: int64
VIP
False    8494
True      199
Name: count, dtype: int64
Deck
F    2794
G    2559
E     876
B     779
C     747
D     478
A     256
0     199
T       5
Name: count, dtype: int64
Side
S    4288
P    4206
0     199
Name: count, dtype: int64
TravelingAlone
True     4805
False    3888
Name: count, dtype: int64


The `Deck` feature has many possible values, so I shouldn't use one-hot encoding for it. For the binary features, I'll use the label encoder for simplicity. For features with three or more possible values (up to a certain point), I'll use the one-hot encoder.

In [21]:
cat_features_label = ['Deck', 'Transported', 'CryoSleep', 'Side', 'VIP', 'TravelingAlone']
cat_features_one_hot = ['HomePlanet','Destination']

Label encoding: every unique value is replaced with a unique integer.

In [22]:
l_encoders = {}
for feature in cat_features_label:
    current_encoder = LabelEncoder()
    df[feature] = current_encoder.fit_transform(df[feature])
    l_encoders[feature] = current_encoder

Let's confirm by calling the original classes and transforming them back and forth

In [23]:
l_encoders['Deck'].classes_

array(['0', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'], dtype=object)

In [24]:
l_encoders['Deck'].transform(l_encoders['Deck'].classes_)

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [25]:
l_encoders['Deck'].inverse_transform([0, 1, 2, 3, 4, 5, 6])

array(['0', 'A', 'B', 'C', 'D', 'E', 'F'], dtype=object)

Create a dictionary of {original value: encoded value} pairs using `zip`.

In [26]:
deck_encoder_dict = {}
deck_encoder_dict = dict(zip(l_encoders['Deck'].classes_, l_encoders['Deck'].transform(l_encoders['Deck'].classes_)))
print(deck_encoder_dict)

{'0': 0, 'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7, 'T': 8}


In [27]:
transported_encoder_dict = {}
transported_encoder_dict = dict(zip(l_encoders['Transported'].classes_, l_encoders['Transported'].transform(l_encoders['Transported'].classes_)))
print(transported_encoder_dict)

{False: 0, True: 1}


One hot encoding: Create new columns, one for each unique value in the original columns

In [28]:
pd.get_dummies(df[cat_features_one_hot], dtype=int)

Unnamed: 0,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,0,1,0,0,0,1
1,1,0,0,0,0,1
2,0,1,0,0,0,1
3,0,1,0,0,0,1
4,1,0,0,0,0,1
...,...,...,...,...,...,...
8688,0,1,0,1,0,0
8689,1,0,0,0,1,0
8690,1,0,0,0,0,1
8691,0,1,0,1,0,0


In [29]:
df

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,Deck,Num,Side,Luxury,Group,GroupSize,TravelingAlone
0,Europa,0,TRAPPIST-1e,0.711945,0,0,2,0,1,-0.514066,1,-0.648735,1
1,Earth,0,TRAPPIST-1e,-0.334037,0,1,6,0,2,-0.251479,2,-0.648735,1
2,Europa,0,TRAPPIST-1e,2.036857,1,0,1,0,2,3.190333,3,-0.022268,0
3,Europa,0,TRAPPIST-1e,0.293552,0,0,1,0,2,1.332604,3,-0.022268,0
4,Earth,0,TRAPPIST-1e,-0.891895,0,1,6,1,2,-0.124824,4,-0.648735,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,Europa,0,55 Cancri e,0.851410,1,0,1,98,1,2.531369,9276,-0.648735,1
8689,Earth,1,PSO J318.5-22,-0.752431,0,0,7,1499,2,-0.514066,9278,-0.648735,1
8690,Earth,0,TRAPPIST-1e,-0.194573,0,1,7,1500,2,0.154175,9279,-0.648735,1
8691,Europa,0,55 Cancri e,0.223820,0,0,5,608,2,1.140302,9280,-0.022268,0


In [30]:
df.shape

(8693, 13)

Now let's join the one hot dataframe with the original dataframe

In [31]:
df_merged_a = df.merge(pd.get_dummies(df[cat_features_one_hot]), left_index=True, right_index=True)

In [32]:
df_merged_a.columns.shape

(19,)

In [33]:
sorted(df_merged_a.columns.to_list())

['Age',
 'CryoSleep',
 'Deck',
 'Destination',
 'Destination_55 Cancri e',
 'Destination_PSO J318.5-22',
 'Destination_TRAPPIST-1e',
 'Group',
 'GroupSize',
 'HomePlanet',
 'HomePlanet_Earth',
 'HomePlanet_Europa',
 'HomePlanet_Mars',
 'Luxury',
 'Num',
 'Side',
 'Transported',
 'TravelingAlone',
 'VIP']
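The merged frame still contains the original `HomePlanet` and `Destination` columns. As a possible shortcut, `pd.get_dummies` accepts a `columns` argument that encodes the listed columns and drops the originals in one call. This is a sketch on a made-up toy frame, not the notebook's data.

```python
import pandas as pd

# Made-up toy frame reusing a few column names from the notebook
toy = pd.DataFrame({
    'HomePlanet': ['Europa', 'Earth', 'Mars'],
    'Destination': ['TRAPPIST-1e', '55 Cancri e', 'TRAPPIST-1e'],
    'Age': [39.0, 24.0, 58.0],
})

# columns= encodes the listed columns and removes the originals
encoded = pd.get_dummies(toy, columns=['HomePlanet', 'Destination'], dtype=int)

print(encoded.columns.tolist())
```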

In [34]:
print(cat_features_one_hot)

['HomePlanet', 'Destination']


Now let's do it with scikit-learn's API.

In [35]:
t_encoder = OneHotEncoder(sparse_output=False)
t_encoder.fit(df['HomePlanet'].values.reshape(-1, 1))
print(t_encoder.categories_)
# print(t_encoder.transform(df['HomePlanet'].values.reshape(-1, 1)))
print(t_encoder.transform(df['HomePlanet'].values.reshape(-1, 1)).shape)

[array(['Earth', 'Europa', 'Mars'], dtype=object)]
(8693, 3)


`t_encoder.categories_` returns a list of arrays, `[array([...])]`, with one array per encoded column, instead of a single flat array, so we index it with `[0]`.

In [36]:
print(t_encoder.categories_)
print(t_encoder.categories_[0])

[array(['Earth', 'Europa', 'Mars'], dtype=object)]
['Earth' 'Europa' 'Mars']


In [37]:
t_df = pd.DataFrame(t_encoder.fit_transform(df['HomePlanet'].values.reshape(-1, 1)), columns=t_encoder.categories_[0])
t_df.head()

Unnamed: 0,Earth,Europa,Mars
0,0.0,1.0,0.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,1.0,0.0,0.0


In [38]:
t_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Earth   8693 non-null   float64
 1   Europa  8693 non-null   float64
 2   Mars    8693 non-null   float64
dtypes: float64(3)
memory usage: 203.9 KB


Let's try different ways to join dataframes.

In [39]:
pd.concat([df, t_df], axis=1)

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,Deck,Num,Side,Luxury,Group,GroupSize,TravelingAlone,Earth,Europa,Mars
0,Europa,0,TRAPPIST-1e,0.711945,0,0,2,0,1,-0.514066,1,-0.648735,1,0.0,1.0,0.0
1,Earth,0,TRAPPIST-1e,-0.334037,0,1,6,0,2,-0.251479,2,-0.648735,1,1.0,0.0,0.0
2,Europa,0,TRAPPIST-1e,2.036857,1,0,1,0,2,3.190333,3,-0.022268,0,0.0,1.0,0.0
3,Europa,0,TRAPPIST-1e,0.293552,0,0,1,0,2,1.332604,3,-0.022268,0,0.0,1.0,0.0
4,Earth,0,TRAPPIST-1e,-0.891895,0,1,6,1,2,-0.124824,4,-0.648735,1,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,Europa,0,55 Cancri e,0.851410,1,0,1,98,1,2.531369,9276,-0.648735,1,0.0,1.0,0.0
8689,Earth,1,PSO J318.5-22,-0.752431,0,0,7,1499,2,-0.514066,9278,-0.648735,1,1.0,0.0,0.0
8690,Earth,0,TRAPPIST-1e,-0.194573,0,1,7,1500,2,0.154175,9279,-0.648735,1,1.0,0.0,0.0
8691,Europa,0,55 Cancri e,0.223820,0,0,5,608,2,1.140302,9280,-0.022268,0,0.0,1.0,0.0


In [40]:
df.join(t_df, how='inner')

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,Deck,Num,Side,Luxury,Group,GroupSize,TravelingAlone,Earth,Europa,Mars
0,Europa,0,TRAPPIST-1e,0.711945,0,0,2,0,1,-0.514066,1,-0.648735,1,0.0,1.0,0.0
1,Earth,0,TRAPPIST-1e,-0.334037,0,1,6,0,2,-0.251479,2,-0.648735,1,1.0,0.0,0.0
2,Europa,0,TRAPPIST-1e,2.036857,1,0,1,0,2,3.190333,3,-0.022268,0,0.0,1.0,0.0
3,Europa,0,TRAPPIST-1e,0.293552,0,0,1,0,2,1.332604,3,-0.022268,0,0.0,1.0,0.0
4,Earth,0,TRAPPIST-1e,-0.891895,0,1,6,1,2,-0.124824,4,-0.648735,1,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,Europa,0,55 Cancri e,0.851410,1,0,1,98,1,2.531369,9276,-0.648735,1,0.0,1.0,0.0
8689,Earth,1,PSO J318.5-22,-0.752431,0,0,7,1499,2,-0.514066,9278,-0.648735,1,1.0,0.0,0.0
8690,Earth,0,TRAPPIST-1e,-0.194573,0,1,7,1500,2,0.154175,9279,-0.648735,1,1.0,0.0,0.0
8691,Europa,0,55 Cancri e,0.223820,0,0,5,608,2,1.140302,9280,-0.022268,0,0.0,1.0,0.0


In [41]:
df.merge(t_df, left_index=True, right_index=True, how='inner')

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,Deck,Num,Side,Luxury,Group,GroupSize,TravelingAlone,Earth,Europa,Mars
0,Europa,0,TRAPPIST-1e,0.711945,0,0,2,0,1,-0.514066,1,-0.648735,1,0.0,1.0,0.0
1,Earth,0,TRAPPIST-1e,-0.334037,0,1,6,0,2,-0.251479,2,-0.648735,1,1.0,0.0,0.0
2,Europa,0,TRAPPIST-1e,2.036857,1,0,1,0,2,3.190333,3,-0.022268,0,0.0,1.0,0.0
3,Europa,0,TRAPPIST-1e,0.293552,0,0,1,0,2,1.332604,3,-0.022268,0,0.0,1.0,0.0
4,Earth,0,TRAPPIST-1e,-0.891895,0,1,6,1,2,-0.124824,4,-0.648735,1,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,Europa,0,55 Cancri e,0.851410,1,0,1,98,1,2.531369,9276,-0.648735,1,0.0,1.0,0.0
8689,Earth,1,PSO J318.5-22,-0.752431,0,0,7,1499,2,-0.514066,9278,-0.648735,1,1.0,0.0,0.0
8690,Earth,0,TRAPPIST-1e,-0.194573,0,1,7,1500,2,0.154175,9279,-0.648735,1,1.0,0.0,0.0
8691,Europa,0,55 Cancri e,0.223820,0,0,5,608,2,1.140302,9280,-0.022268,0,0.0,1.0,0.0
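All three approaches align rows on the index. They agree here because both frames share the default `RangeIndex`. The sketch below, on made-up data, shows what can happen when the left frame has a different index (for example after filtering rows), and one way to avoid it by passing `index=` when building the second frame.

```python
import pandas as pd

# Made-up frame with a non-default index, e.g. after filtering rows
left = pd.DataFrame({'Age': [39.0, 24.0, 58.0]}, index=[10, 11, 12])

# A frame built from plain values gets a fresh RangeIndex (0, 1, 2)
right_default = pd.DataFrame({'Earth': [0.0, 1.0, 0.0]})
misaligned = pd.concat([left, right_default], axis=1)      # outer join on index
inner_join = left.join(right_default, how='inner')          # no shared labels

# Reusing the left index keeps the rows lined up
right_aligned = pd.DataFrame({'Earth': [0.0, 1.0, 0.0]}, index=left.index)
aligned = pd.concat([left, right_aligned], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(inner_join.shape)
print(aligned.shape, int(aligned.isna().sum().sum()))
```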


In [42]:
df.head()

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,Deck,Num,Side,Luxury,Group,GroupSize,TravelingAlone
0,Europa,0,TRAPPIST-1e,0.711945,0,0,2,0,1,-0.514066,1,-0.648735,1
1,Earth,0,TRAPPIST-1e,-0.334037,0,1,6,0,2,-0.251479,2,-0.648735,1
2,Europa,0,TRAPPIST-1e,2.036857,1,0,1,0,2,3.190333,3,-0.022268,0
3,Europa,0,TRAPPIST-1e,0.293552,0,0,1,0,2,1.332604,3,-0.022268,0
4,Earth,0,TRAPPIST-1e,-0.891895,0,1,6,1,2,-0.124824,4,-0.648735,1


In [43]:
t_encoder.get_feature_names_out(['HomePlanet'])

array(['HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars'],
      dtype=object)

In [44]:
cat_features_one_hot

['HomePlanet', 'Destination']

In [45]:
one_hot_encoders = {}
one_hot_df = pd.DataFrame()
for feature in cat_features_one_hot: 
    print(f'Currently working on {feature}')
    current_encoder = OneHotEncoder(sparse_output=False)
    # current_encoder.fit(df[feature].values.reshape(-1, 1)
    current_df = pd.DataFrame(current_encoder.fit_transform(df[feature].values.reshape(-1, 1)), columns=current_encoder.categories_[0])
    # df = pd.concat([df, current_df], axis=1)
    df = pd.merge(df, current_df, how='inner', left_index=True, right_index=True)
    df.drop(feature, axis=1, inplace=True)
    one_hot_encoders[feature] = current_encoder

Currently working on HomePlanet
Currently working on Destination


In [46]:
df.head()

Unnamed: 0,CryoSleep,Age,VIP,Transported,Deck,Num,Side,Luxury,Group,GroupSize,TravelingAlone,Earth,Europa,Mars,55 Cancri e,PSO J318.5-22,TRAPPIST-1e
0,0,0.711945,0,0,2,0,1,-0.514066,1,-0.648735,1,0.0,1.0,0.0,0.0,0.0,1.0
1,0,-0.334037,0,1,6,0,2,-0.251479,2,-0.648735,1,1.0,0.0,0.0,0.0,0.0,1.0
2,0,2.036857,1,0,1,0,2,3.190333,3,-0.022268,0,0.0,1.0,0.0,0.0,0.0,1.0
3,0,0.293552,0,0,1,0,2,1.332604,3,-0.022268,0,0.0,1.0,0.0,0.0,0.0,1.0
4,0,-0.891895,0,1,6,1,2,-0.124824,4,-0.648735,1,1.0,0.0,0.0,0.0,0.0,1.0


We can also do all of this in a single step:

In [47]:
# TODO: Make a ColumnTransformer in order to do all the process in one step
# TODO: Make a Pipeline that contains the ColumnTransformer

Now I'm going to rewrite what I've done in a more standardized way.

In [48]:
# Prepare ColumnTransformer
oh_encoder = OneHotEncoder(sparse_output=False)
l_encoder = LabelEncoder()
s_scaler = StandardScaler()
print(f'Numerical features = {num_features}')
print(f'One-hot categorical features = {cat_features_one_hot}')
print(f'Label categorical features ')

Numerical features = ['Age', 'Luxury', 'GroupSize']
One-hot categorical features = ['HomePlanet', 'Destination']
Label categorical features 


In [49]:
# ColumnTransformer takes a list of transformers, each a tuple of three values:
# the step name, the transformer to run, and the columns it applies to
main_transformer = ColumnTransformer([
    ('Scaler', s_scaler, num_features),
    ('LabelEncoder_step', l_encoder, 'Transported'),
    ('OneHotEncorder', oh_encoder, cat_features_one_hot)], 
    remainder='passthrough')

Reset the data 

In [50]:
df = pd.read_csv('../data/stg/train_stg.csv')
# df = pd.read_csv('../data/train.csv', dtype_backend='pyarrow')

In [51]:
# main_transformer.fit_transform(df)
# ERROR: TypeError: LabelEncoder.fit_transform() takes 2 positional arguments but 3 were given

After reading the docs: `LabelEncoder` is meant to encode only the target (label), not input features. I'm going to try the `OrdinalEncoder` instead, though this transformer implies an order among the categories.
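The error arises because `ColumnTransformer` calls `fit_transform(X, y)` on each step, while `LabelEncoder.fit_transform` accepts only a single 1D `y`. The sketch below contrasts the two encoders on a made-up toy frame: `OrdinalEncoder` for 2D feature columns, `LabelEncoder` for the target.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Made-up toy frame reusing a few column names from the notebook
toy = pd.DataFrame({
    'Deck': ['B', 'F', 'A', 'F'],
    'Side': ['P', 'S', 'S', 'P'],
    'Transported': [False, True, False, True],
})

# OrdinalEncoder works on 2D feature data, several columns at once
feature_encoder = OrdinalEncoder()
features = feature_encoder.fit_transform(toy[['Deck', 'Side']])

# LabelEncoder works on a single 1D target
target_encoder = LabelEncoder()
target = target_encoder.fit_transform(toy['Transported'])

print(features)
print(target)
```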

In [52]:
# Prepare ColumnTransformer
oh_encoder = OneHotEncoder(sparse_output=False)
o_encoder = OrdinalEncoder()
s_scaler = StandardScaler()
print(f'Numerical features = {num_features}')
print(f'One-hot categorical features = {cat_features_one_hot}')
print(f'Label categorical features ')

Numerical features = ['Age', 'Luxury', 'GroupSize']
One-hot categorical features = ['HomePlanet', 'Destination']
Label categorical features 


In [53]:
# ColumnTransformer takes a list of transformers, each a tuple of three values:
# the step name, the transformer to run, and the columns it applies to
# The remainder parameter controls what happens to columns not listed in any transformer
# Its default value is 'drop', which leaves those columns out of the output
# Output columns follow the order of the transformers list; passthrough columns come last
# The verbose parameter prints the time taken to fit each transformer
main_transformer = ColumnTransformer([
    ('Scaler', s_scaler, num_features),
    ('LabelEncoder_step', o_encoder, cat_features_label),
    ('OneHotEncorder', oh_encoder, cat_features_one_hot)], 
    remainder='passthrough', 
    # verbose=True, 
    verbose_feature_names_out=False)

In [54]:
# ColumnTransformer takes a list of transformers, each a tuple of three values:
# the step name, the transformer to run, and the columns it applies to
# The remainder parameter controls what happens to columns not listed in any transformer
# Its default value is 'drop', which leaves those columns out of the output
# Output columns follow the order of the transformers list; passthrough columns come last
# The verbose parameter prints the time taken to fit each transformer
drop_transformer = ColumnTransformer([
    ('Scaler', s_scaler, num_features),
    ('LabelEncoder_step', o_encoder, cat_features_label),
    ('OneHotEncorder', oh_encoder, cat_features_one_hot)], 
    remainder='drop',
    # verbose = True,
    verbose_feature_names_out=False)

In [55]:
main_transformer.fit_transform(df)

array([[0.7119453650967104, -0.5140655447299861, -0.6487347223401672,
        ..., 0.0, 0, 1],
       [-0.33403748485524115, -0.251478909699551, -0.6487347223401672,
        ..., 44.0, 0, 2],
       [2.0368569750358487, 3.1903325959235564, -0.022268276961021034,
        ..., 49.0, 0, 3],
       ...,
       [-0.19457310486164758, 0.15417462838415102, -0.6487347223401672,
        ..., 0.0, 1500, 9279],
       [0.22382003511913298, 1.1403016110256219, -0.022268276961021034,
        ..., 3235.0, 608, 9280],
       [1.0606063150806941, 1.2077321463799047, -0.022268276961021034,
        ..., 12.0, 608, 9280]], dtype=object)

In [56]:
drop_transformer.fit_transform(df)

array([[ 0.71194537, -0.51406554, -0.64873472, ...,  0.        ,
         0.        ,  1.        ],
       [-0.33403748, -0.25147891, -0.64873472, ...,  0.        ,
         0.        ,  1.        ],
       [ 2.03685698,  3.1903326 , -0.02226828, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [-0.1945731 ,  0.15417463, -0.64873472, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.22382004,  1.14030161, -0.02226828, ...,  1.        ,
         0.        ,  0.        ],
       [ 1.06060632,  1.20773215, -0.02226828, ...,  0.        ,
         0.        ,  1.        ]])

In [57]:
drop_transformer.get_feature_names_out(df.columns)

array(['Age', 'Luxury', 'GroupSize', 'Deck', 'Transported', 'CryoSleep',
       'Side', 'VIP', 'TravelingAlone', 'HomePlanet_Earth',
       'HomePlanet_Europa', 'HomePlanet_Mars', 'Destination_55 Cancri e',
       'Destination_PSO J318.5-22', 'Destination_TRAPPIST-1e'],
      dtype=object)

The `get_feature_names_out` method returns a NumPy array with the output column names of the ColumnTransformer.
Personally, I don't like using NumPy arrays for text data, so I created a list from the original array.

In [58]:
drop_features = drop_transformer.get_feature_names_out(df.columns)
print(type(drop_features))
print(drop_features)
drop_features_list = drop_features.tolist()
print(drop_features_list)

<class 'numpy.ndarray'>
['Age' 'Luxury' 'GroupSize' 'Deck' 'Transported' 'CryoSleep' 'Side' 'VIP'
 'TravelingAlone' 'HomePlanet_Earth' 'HomePlanet_Europa' 'HomePlanet_Mars'
 'Destination_55 Cancri e' 'Destination_PSO J318.5-22'
 'Destination_TRAPPIST-1e']
['Age', 'Luxury', 'GroupSize', 'Deck', 'Transported', 'CryoSleep', 'Side', 'VIP', 'TravelingAlone', 'HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars', 'Destination_55 Cancri e', 'Destination_PSO J318.5-22', 'Destination_TRAPPIST-1e']


In [59]:
drop_output_df = pd.DataFrame(drop_transformer.fit_transform(df), columns=drop_features)

In [60]:
drop_output_df.head()

Unnamed: 0,Age,Luxury,GroupSize,Deck,Transported,CryoSleep,Side,VIP,TravelingAlone,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,0.711945,-0.514066,-0.648735,2.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
1,-0.334037,-0.251479,-0.648735,6.0,1.0,0.0,2.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
2,2.036857,3.190333,-0.022268,1.0,0.0,0.0,2.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,0.293552,1.332604,-0.022268,1.0,0.0,0.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,-0.891895,-0.124824,-0.648735,6.0,1.0,0.0,2.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0


Main transformer 

In [61]:
main_features = main_transformer.get_feature_names_out(df.columns).tolist()
print(main_features)
main_transformer_data = main_transformer.fit_transform(df)
print(main_transformer_data)

['Age', 'Luxury', 'GroupSize', 'Deck', 'Transported', 'CryoSleep', 'Side', 'VIP', 'TravelingAlone', 'HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars', 'Destination_55 Cancri e', 'Destination_PSO J318.5-22', 'Destination_TRAPPIST-1e', 'PassengerId', 'Cabin', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Num', 'Group']
[[0.7119453650967104 -0.5140655447299861 -0.6487347223401672 ... 0.0 0 1]
 [-0.33403748485524115 -0.251478909699551 -0.6487347223401672 ... 44.0 0
  2]
 [2.0368569750358487 3.1903325959235564 -0.022268276961021034 ... 49.0 0
  3]
 ...
 [-0.19457310486164758 0.15417462838415102 -0.6487347223401672 ... 0.0
  1500 9279]
 [0.22382003511913298 1.1403016110256219 -0.022268276961021034 ... 3235.0
  608 9280]
 [1.0606063150806941 1.2077321463799047 -0.022268276961021034 ... 12.0
  608 9280]]


Create the main transformer's dataframe 

In [63]:
main_output_df = pd.DataFrame(main_transformer_data, columns=main_features)
main_output_df.head()

There are more columns because we didn't drop the remainder columns, and because we reset the `df` dataframe, which undid the drops made at the beginning of this notebook.