# Data Transformation

In this notebook, I'm going to replace, modify, reshape and scale the data, with the goal of improving the accuracy of the model.

Import libraries

In [1]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import re
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder, OrdinalEncoder, FunctionTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, mean_squared_error, accuracy_score, precision_score, recall_score

## Data dictionary

- **PassengerId** - A unique Id for each passenger. Each Id takes the form ```gggg_pp``` where ```gggg``` indicates a group the passenger is travelling with and ```pp``` is their number within the group. People in a group are often family members, but not always.
- **HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.
- **CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- **Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- **Destination** - The planet the passenger will be debarking to.
- **Age** - The age of the passenger.
- **VIP** - Whether the passenger has paid for special VIP service during the voyage.
- **RoomService, FoodCourt, ShoppingMall, Spa, VRDeck** - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- **Name** - The first and last names of the passenger.
- **Transported** - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

## Check the dataframe

Let's load the data.

In [2]:
df = pd.read_csv('../data/stg/train_stg.csv')
# df = pd.read_csv('../data/train.csv', dtype_backend='pyarrow')

In [3]:
df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,Deck,Num,Side,Luxury,Group,GroupSize,TravelingAlone
0,0001_01,Europa,False,TRAPPIST-1e,39.0,False,False,B,0.0,P,0.0,1,1,True
1,0002_01,Earth,False,TRAPPIST-1e,24.0,False,True,F,0.0,S,736.0,2,1,True
2,0003_01,Europa,False,TRAPPIST-1e,58.0,True,False,A,0.0,S,10383.0,3,2,False
3,0003_02,Europa,False,TRAPPIST-1e,33.0,False,False,A,0.0,S,5176.0,3,2,False
4,0004_01,Earth,False,TRAPPIST-1e,16.0,False,True,F,1.0,S,1091.0,4,1,True


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   PassengerId     8693 non-null   object 
 1   HomePlanet      8492 non-null   object 
 2   CryoSleep       8476 non-null   object 
 3   Destination     8511 non-null   object 
 4   Age             8514 non-null   float64
 5   VIP             8490 non-null   object 
 6   Transported     8693 non-null   bool   
 7   Deck            8494 non-null   object 
 8   Num             8494 non-null   float64
 9   Side            8494 non-null   object 
 10  Luxury          7785 non-null   float64
 11  Group           8693 non-null   int64  
 12  GroupSize       8693 non-null   int64  
 13  TravelingAlone  8693 non-null   bool   
dtypes: bool(2), float64(3), int64(2), object(7)
memory usage: 832.1+ KB


In [5]:
df.describe()

Unnamed: 0,Age,Num,Luxury,Group,GroupSize
count,8514.0,8494.0,7785.0,8693.0,8693.0
mean,28.82793,600.367671,1484.601541,4633.389624,2.035546
std,14.489021,511.867226,2845.288241,2671.028856,1.596347
min,0.0,0.0,0.0,1.0,1.0
25%,19.0,167.25,0.0,2319.0,1.0
50%,27.0,427.0,736.0,4630.0,1.0
75%,38.0,999.0,1486.0,6883.0,3.0
max,79.0,1894.0,35987.0,9280.0,8.0


In [6]:
df.corr(numeric_only=True)

Unnamed: 0,Age,Transported,Num,Luxury,Group,GroupSize,TravelingAlone
Age,1.0,-0.075026,-0.132255,0.189475,-0.009439,-0.179102,0.135174
Transported,-0.075026,1.0,-0.045097,-0.197671,0.021491,0.082644,-0.113792
Num,-0.132255,-0.045097,1.0,-0.21996,0.679723,-0.049381,0.134073
Luxury,0.189475,-0.197671,-0.21996,1.0,-0.001793,0.012971,-0.063655
Group,-0.009439,0.021491,0.679723,-0.001793,1.0,0.014753,-0.000266
GroupSize,-0.179102,0.082644,-0.049381,0.012971,0.014753,1.0,-0.721192
TravelingAlone,0.135174,-0.113792,0.134073,-0.063655,-0.000266,-0.721192,1.0


## Data Transformation

First thing: consider dropping the `PassengerId` column. I don't think there's enough value in that feature to keep it in the dataframe. The drop is commented out below, so the column is still present for now.

In [7]:
# df.drop('PassengerId', axis=1, inplace=True)

In [8]:
df.columns

Index(['PassengerId', 'HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP',
       'Transported', 'Deck', 'Num', 'Side', 'Luxury', 'Group', 'GroupSize',
       'TravelingAlone'],
      dtype='object')

Next, handle categorical and numerical features

### Numerical Features
These features should be scaled using the standard scaler (z-scaling):

z = (x - u) / s

Where `x` is the current training value, `u` is the mean of the training samples (or zero if `with_mean=False`), and `s` is the standard deviation of the training samples (or one if `with_std=False`). (From the scikit-learn docs.)
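As a quick sanity check of the formula, the sketch below computes `z` by hand and compares it with `StandardScaler`. The five ages are the `Age` values shown in `df.head()`; the variable names are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five Age values from df.head(), used here only as sample data
ages = np.array([39.0, 24.0, 58.0, 33.0, 16.0])

# Manual z-score: u is the mean, s is the population standard deviation (ddof=0)
u = ages.mean()
s = ages.std(ddof=0)
z_manual = (ages - u) / s

# StandardScaler expects a 2D array, hence the reshape to (n, 1)
z_sklearn = StandardScaler().fit_transform(ages.reshape(-1, 1)).ravel()

print(np.allclose(z_manual, z_sklearn))
print(z_sklearn.std(ddof=0), z_sklearn.std(ddof=1))
```

`StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `describe()` reports the sample standard deviation (`ddof=1`). That is why a scaled column can show a `std` slightly above 1 in `describe()`.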

I'm torn about the `Group` and `Num` features. In a way `Group` is a categorical feature, because it's the id of the group, but there are ~6000 different ids, so I should probably rescale it instead. The `Num` feature is in a similar position. For now I'm going to ignore both.

In [9]:
num_features = ['Age', 'Luxury', 'GroupSize']

In [10]:
df['Age'].shape

(8693,)

The scaler expects a 2D array of shape (n, 1), not a 1D array of shape (n,), so we're going to reshape the pandas Series with `df['Age'].values.reshape((-1, 1))`.
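To make the shape difference concrete, here is a small sketch on a made-up four-row frame (`demo` is a hypothetical name). It also shows that selecting with a list of columns, `demo[['Age']]`, already gives a 2D object, which is an alternative to reshaping.

```python
import pandas as pd

# Made-up sample data, only to inspect shapes
demo = pd.DataFrame({'Age': [39.0, 24.0, 58.0, 33.0]})

as_series = demo['Age']                         # 1D: shape (4,)
as_column = demo['Age'].values.reshape(-1, 1)   # 2D numpy array: shape (4, 1)
as_frame = demo[['Age']]                        # 2D DataFrame: shape (4, 1)

print(as_series.shape, as_column.shape, as_frame.shape)
```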

In [11]:
scalers = {}
for feature in num_features:
    current_scaler = StandardScaler()
    df[feature] = current_scaler.fit_transform(df[feature].values.reshape(-1, 1))
    scalers[feature] = current_scaler

In [12]:
df[num_features]

Unnamed: 0,Age,Luxury,GroupSize
0,0.702095,-0.521809,-0.648735
1,-0.333233,-0.263119,-0.648735
2,2.013510,3.127616,-0.022268
3,0.287964,1.297456,-0.022268
4,-0.885407,-0.138343,-0.648735
...,...,...,...
8688,0.840138,2.478431,-0.648735
8689,-0.747364,-0.521809,-0.648735
8690,-0.195189,0.136515,-0.648735
8691,0.218942,1.108008,-0.022268


In [13]:
df[num_features].describe()

Unnamed: 0,Age,Luxury,GroupSize
count,8514.0,7785.0,8693.0
mean,6.217458000000001e-17,2.6468520000000003e-17,-1.3077980000000001e-17
std,1.000059,1.000064,1.000058
min,-1.989756,-0.521809,-0.6487347
25%,-0.6783417,-0.521809,-0.6487347
50%,-0.1261671,-0.2631191,-0.6487347
75%,0.633073,0.0004915314,0.6041982
max,3.462968,12.12693,3.73653


Now let's confirm the 3 scalers are different

In [14]:
scalers['Age'].mean_

array([28.82793047])

In [15]:
for scaler in scalers: 
    print(f'{scaler}')
    print(f'Mean: {scalers[scaler].mean_}')
    print(f'Scale or standard deviation: {scalers[scaler].scale_}')
    print(f'Variance: {scalers[scaler].var_}')

Age
Mean: [28.82793047]
Scale or standard deviation: [14.48817051]
Variance: [209.90708458]
Luxury
Mean: [1484.60154143]
Scale or standard deviation: [2845.10549318]
Variance: [8094625.26730655]
GroupSize
Mean: [2.03554584]
Scale or standard deviation: [1.59625469]
Variance: [2.54802903]
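One reason to keep each fitted scaler around is that it can map scaled values back to the original units. The sketch below uses made-up ages (including one missing value) rather than the notebook's dataframe; as far as I know, `StandardScaler` ignores NaN when fitting and leaves it as NaN when transforming.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up ages with one missing value, shaped (n, 1) for the scaler
ages = np.array([39.0, 24.0, np.nan, 33.0, 16.0]).reshape(-1, 1)

scaler = StandardScaler()
scaled = scaler.fit_transform(ages)

# The mean is computed over the non-missing values only
print(scaler.mean_, scaler.scale_)

# The missing value is still missing after scaling
print(np.isnan(scaled).sum())

# inverse_transform recovers the original values from the scaled ones
restored = scaler.inverse_transform(scaled)
print(np.allclose(restored, ages, equal_nan=True))
```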


### Categorical Features
These features should be one-hot or label encoded, because they represent a characteristic of the training sample.

In [16]:
df.columns

Index(['PassengerId', 'HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP',
       'Transported', 'Deck', 'Num', 'Side', 'Luxury', 'Group', 'GroupSize',
       'TravelingAlone'],
      dtype='object')

In [17]:
cat_features_gen = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Deck', 'Side', 'TravelingAlone']

In [18]:
for cat_feature in cat_features_gen: 
    print(df[cat_feature].value_counts())

HomePlanet
Earth     4602
Europa    2131
Mars      1759
Name: count, dtype: int64
CryoSleep
False    5439
True     3037
Name: count, dtype: int64
Destination
TRAPPIST-1e      5915
55 Cancri e      1800
PSO J318.5-22     796
Name: count, dtype: int64
VIP
False    8291
True      199
Name: count, dtype: int64
Deck
F    2794
G    2559
E     876
B     779
C     747
D     478
A     256
T       5
Name: count, dtype: int64
Side
S    4288
P    4206
Name: count, dtype: int64
TravelingAlone
True     4805
False    3888
Name: count, dtype: int64


The `Deck` feature has many possible values, so I shouldn't use one-hot encoding with it; I'll label encode it instead. For the binary features I'll use the label encoder for simplicity, and for features with 3 or more possible values (up to a certain point) I'll use the one-hot encoder.

In [19]:
cat_features_label = ['Deck', 'Transported', 'CryoSleep', 'Side', 'VIP', 'TravelingAlone']
cat_features_one_hot = ['HomePlanet','Destination']

Label encoding: every unique value will be replaced with a unique integer.
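A tiny illustration of that mapping, on made-up deck letters rather than the real column: the classes are sorted first, so the integer assigned to a value does not depend on the order in which it appears.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up deck letters, deliberately not in alphabetical order
decks = ['F', 'B', 'A', 'F', 'G']

encoder = LabelEncoder()
codes = encoder.fit_transform(decks)

print(encoder.classes_)   # sorted unique values
print(codes)              # position of each value in classes_

# inverse_transform maps the integers back to the original values
decoded = encoder.inverse_transform(codes)
print(decoded)
```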

In [20]:
l_encoders = {}
for feature in cat_features_label:
    current_encoder = LabelEncoder()
    df[feature] = current_encoder.fit_transform(df[feature])
    l_encoders[feature] = current_encoder

Let's confirm by calling the original classes and transforming them back and forth

In [21]:
l_encoders['Deck'].classes_

array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'T', nan], dtype=object)

In [22]:
l_encoders['Deck'].transform(l_encoders['Deck'].classes_)

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [23]:
l_encoders['Deck'].inverse_transform([0, 1, 2, 3, 4, 5, 6])

array(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype=object)

Create a dictionary of {original value: encoded value} pairs using `zip`.

In [24]:
deck_encoder_dict = {}
deck_encoder_dict = dict(zip(l_encoders['Deck'].classes_, l_encoders['Deck'].transform(l_encoders['Deck'].classes_)))
print(deck_encoder_dict)

{'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'T': 7, nan: 8}


In [25]:
transported_encoder_dict = {}
transported_encoder_dict = dict(zip(l_encoders['Transported'].classes_, l_encoders['Transported'].transform(l_encoders['Transported'].classes_)))
print(transported_encoder_dict)

{False: 0, True: 1}
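Note that the `Deck` classes above include `nan`, so missing decks received their own code (8) instead of staying missing. If missing values should remain missing, one alternative (an assumption on my part, not what this notebook does) is pandas categorical codes, which mark missing values with -1 so they can be masked afterwards.

```python
import numpy as np
import pandas as pd

# Made-up deck values with one missing entry
decks = pd.Series(['B', np.nan, 'A', 'B'])

# Categorical codes: categories are sorted, missing values get -1
codes = decks.astype('category').cat.codes
print(codes.tolist())

# Turn the -1 sentinel back into a missing value
masked = codes.where(codes >= 0)
print(masked.isna().tolist())
```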


One hot encoding: Create new columns, one for each unique value in the original columns

In [26]:
pd.get_dummies(df[cat_features_one_hot], dtype=int)

Unnamed: 0,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,0,1,0,0,0,1
1,1,0,0,0,0,1
2,0,1,0,0,0,1
3,0,1,0,0,0,1
4,1,0,0,0,0,1
...,...,...,...,...,...,...
8688,0,1,0,1,0,0
8689,1,0,0,0,1,0
8690,1,0,0,0,0,1
8691,0,1,0,1,0,0
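A small sketch of how `pd.get_dummies` treats a missing value, using a made-up four-row frame: by default the row with the missing value gets zeros in every dummy column, and `dummy_na=True` adds an explicit indicator column instead.

```python
import pandas as pd

# Made-up sample with one missing planet
demo = pd.DataFrame({'HomePlanet': ['Europa', 'Earth', None, 'Mars']})

# Default behaviour: no column for the missing value
dummies = pd.get_dummies(demo, dtype=int)
print(dummies)

# dummy_na=True adds a HomePlanet_nan indicator column
with_na = pd.get_dummies(demo, dtype=int, dummy_na=True)
print(with_na.columns.tolist())
```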


In [27]:
df

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,Deck,Num,Side,Luxury,Group,GroupSize,TravelingAlone
0,0001_01,Europa,0,TRAPPIST-1e,0.702095,0,0,1,0.0,0,-0.521809,1,-0.648735,1
1,0002_01,Earth,0,TRAPPIST-1e,-0.333233,0,1,5,0.0,1,-0.263119,2,-0.648735,1
2,0003_01,Europa,0,TRAPPIST-1e,2.013510,1,0,0,0.0,1,3.127616,3,-0.022268,0
3,0003_02,Europa,0,TRAPPIST-1e,0.287964,0,0,0,0.0,1,1.297456,3,-0.022268,0
4,0004_01,Earth,0,TRAPPIST-1e,-0.885407,0,1,5,1.0,1,-0.138343,4,-0.648735,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,9276_01,Europa,0,55 Cancri e,0.840138,1,0,0,98.0,0,2.478431,9276,-0.648735,1
8689,9278_01,Earth,1,PSO J318.5-22,-0.747364,0,0,6,1499.0,1,-0.521809,9278,-0.648735,1
8690,9279_01,Earth,0,TRAPPIST-1e,-0.195189,0,1,6,1500.0,1,0.136515,9279,-0.648735,1
8691,9280_01,Europa,0,55 Cancri e,0.218942,0,0,4,608.0,1,1.108008,9280,-0.022268,0


In [28]:
df.shape

(8693, 14)

Now let's join the one hot dataframe with the original dataframe

In [29]:
df_merged_a = df.merge(pd.get_dummies(df[cat_features_one_hot]), left_index=True, right_index=True)

In [30]:
df_merged_a.columns.shape

(20,)

In [31]:
sorted(df_merged_a.columns.to_list())

['Age',
 'CryoSleep',
 'Deck',
 'Destination',
 'Destination_55 Cancri e',
 'Destination_PSO J318.5-22',
 'Destination_TRAPPIST-1e',
 'Group',
 'GroupSize',
 'HomePlanet',
 'HomePlanet_Earth',
 'HomePlanet_Europa',
 'HomePlanet_Mars',
 'Luxury',
 'Num',
 'PassengerId',
 'Side',
 'Transported',
 'TravelingAlone',
 'VIP']

In [32]:
print(cat_features_one_hot)

['HomePlanet', 'Destination']


Now let's do it with scikit-learn's API.

In [33]:
t_encoder = OneHotEncoder(sparse_output=False)
t_encoder.fit(df['HomePlanet'].values.reshape(-1, 1))
print(t_encoder.categories_)
# print(t_encoder.transform(df['HomePlanet'].values.reshape(-1, 1)))
print(t_encoder.transform(df['HomePlanet'].values.reshape(-1, 1)).shape)

[array(['Earth', 'Europa', 'Mars', nan], dtype=object)]
(8693, 4)


`t_encoder.categories_` is a list of arrays (one array per encoded column), `[array([...])]`, instead of a flat list `[...]`.

In [34]:
print(t_encoder.categories_)
print(t_encoder.categories_[0])

[array(['Earth', 'Europa', 'Mars', nan], dtype=object)]
['Earth' 'Europa' 'Mars' nan]
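The list-of-arrays shape makes more sense when the encoder is fitted on several columns at once. A minimal sketch on made-up rows follows; `OneHotEncoder` returns a sparse matrix by default, so `.toarray()` is used here to get a dense array (the notebook passes `sparse_output=False` for the same purpose).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up rows with two categorical columns
demo = pd.DataFrame({
    'HomePlanet': ['Europa', 'Earth', 'Mars'],
    'Destination': ['TRAPPIST-1e', '55 Cancri e', 'TRAPPIST-1e'],
})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(demo).toarray()

# One array of categories per input column
print(len(encoder.categories_))
print(encoder.categories_[0])
print(encoder.categories_[1])

# 3 planet columns + 2 destination columns
print(encoded.shape)
```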


In [35]:
t_df = pd.DataFrame(t_encoder.fit_transform(df['HomePlanet'].values.reshape(-1, 1)), columns=t_encoder.categories_[0])
t_df.head()

Unnamed: 0,Earth,Europa,Mars,NaN
0,0.0,1.0,0.0,0.0
1,1.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0
3,0.0,1.0,0.0,0.0
4,1.0,0.0,0.0,0.0


In [36]:
t_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Earth   8693 non-null   float64
 1   Europa  8693 non-null   float64
 2   Mars    8693 non-null   float64
 3   nan     8693 non-null   float64
dtypes: float64(4)
memory usage: 271.8 KB


I'm going to try different ways to join the dataframes.

In [37]:
pd.concat([df, t_df], axis=1)

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,Deck,Num,Side,Luxury,Group,GroupSize,TravelingAlone,Earth,Europa,Mars,NaN
0,0001_01,Europa,0,TRAPPIST-1e,0.702095,0,0,1,0.0,0,-0.521809,1,-0.648735,1,0.0,1.0,0.0,0.0
1,0002_01,Earth,0,TRAPPIST-1e,-0.333233,0,1,5,0.0,1,-0.263119,2,-0.648735,1,1.0,0.0,0.0,0.0
2,0003_01,Europa,0,TRAPPIST-1e,2.013510,1,0,0,0.0,1,3.127616,3,-0.022268,0,0.0,1.0,0.0,0.0
3,0003_02,Europa,0,TRAPPIST-1e,0.287964,0,0,0,0.0,1,1.297456,3,-0.022268,0,0.0,1.0,0.0,0.0
4,0004_01,Earth,0,TRAPPIST-1e,-0.885407,0,1,5,1.0,1,-0.138343,4,-0.648735,1,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,9276_01,Europa,0,55 Cancri e,0.840138,1,0,0,98.0,0,2.478431,9276,-0.648735,1,0.0,1.0,0.0,0.0
8689,9278_01,Earth,1,PSO J318.5-22,-0.747364,0,0,6,1499.0,1,-0.521809,9278,-0.648735,1,1.0,0.0,0.0,0.0
8690,9279_01,Earth,0,TRAPPIST-1e,-0.195189,0,1,6,1500.0,1,0.136515,9279,-0.648735,1,1.0,0.0,0.0,0.0
8691,9280_01,Europa,0,55 Cancri e,0.218942,0,0,4,608.0,1,1.108008,9280,-0.022268,0,0.0,1.0,0.0,0.0


In [38]:
df.join(t_df, how='inner')

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,Deck,Num,Side,Luxury,Group,GroupSize,TravelingAlone,Earth,Europa,Mars,NaN
0,0001_01,Europa,0,TRAPPIST-1e,0.702095,0,0,1,0.0,0,-0.521809,1,-0.648735,1,0.0,1.0,0.0,0.0
1,0002_01,Earth,0,TRAPPIST-1e,-0.333233,0,1,5,0.0,1,-0.263119,2,-0.648735,1,1.0,0.0,0.0,0.0
2,0003_01,Europa,0,TRAPPIST-1e,2.013510,1,0,0,0.0,1,3.127616,3,-0.022268,0,0.0,1.0,0.0,0.0
3,0003_02,Europa,0,TRAPPIST-1e,0.287964,0,0,0,0.0,1,1.297456,3,-0.022268,0,0.0,1.0,0.0,0.0
4,0004_01,Earth,0,TRAPPIST-1e,-0.885407,0,1,5,1.0,1,-0.138343,4,-0.648735,1,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,9276_01,Europa,0,55 Cancri e,0.840138,1,0,0,98.0,0,2.478431,9276,-0.648735,1,0.0,1.0,0.0,0.0
8689,9278_01,Earth,1,PSO J318.5-22,-0.747364,0,0,6,1499.0,1,-0.521809,9278,-0.648735,1,1.0,0.0,0.0,0.0
8690,9279_01,Earth,0,TRAPPIST-1e,-0.195189,0,1,6,1500.0,1,0.136515,9279,-0.648735,1,1.0,0.0,0.0,0.0
8691,9280_01,Europa,0,55 Cancri e,0.218942,0,0,4,608.0,1,1.108008,9280,-0.022268,0,0.0,1.0,0.0,0.0


In [39]:
df.merge(t_df, left_index=True, right_index=True, how='inner')

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,Deck,Num,Side,Luxury,Group,GroupSize,TravelingAlone,Earth,Europa,Mars,NaN
0,0001_01,Europa,0,TRAPPIST-1e,0.702095,0,0,1,0.0,0,-0.521809,1,-0.648735,1,0.0,1.0,0.0,0.0
1,0002_01,Earth,0,TRAPPIST-1e,-0.333233,0,1,5,0.0,1,-0.263119,2,-0.648735,1,1.0,0.0,0.0,0.0
2,0003_01,Europa,0,TRAPPIST-1e,2.013510,1,0,0,0.0,1,3.127616,3,-0.022268,0,0.0,1.0,0.0,0.0
3,0003_02,Europa,0,TRAPPIST-1e,0.287964,0,0,0,0.0,1,1.297456,3,-0.022268,0,0.0,1.0,0.0,0.0
4,0004_01,Earth,0,TRAPPIST-1e,-0.885407,0,1,5,1.0,1,-0.138343,4,-0.648735,1,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,9276_01,Europa,0,55 Cancri e,0.840138,1,0,0,98.0,0,2.478431,9276,-0.648735,1,0.0,1.0,0.0,0.0
8689,9278_01,Earth,1,PSO J318.5-22,-0.747364,0,0,6,1499.0,1,-0.521809,9278,-0.648735,1,1.0,0.0,0.0,0.0
8690,9279_01,Earth,0,TRAPPIST-1e,-0.195189,0,1,6,1500.0,1,0.136515,9279,-0.648735,1,1.0,0.0,0.0,0.0
8691,9280_01,Europa,0,55 Cancri e,0.218942,0,0,4,608.0,1,1.108008,9280,-0.022268,0,0.0,1.0,0.0,0.0
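The three approaches above give the same result here because both dataframes share the same index. The sketch below, on made-up frames, checks that equivalence and then shows how the results differ once the indexes no longer line up: `pd.concat` with `axis=1` keeps the union of the indexes by default, while an inner join keeps only the intersection.

```python
import pandas as pd

# Made-up frames sharing the default 0..2 index
left = pd.DataFrame({'Age': [39.0, 24.0, 58.0]})
right = pd.DataFrame({'Earth': [0.0, 1.0, 0.0], 'Europa': [1.0, 0.0, 1.0]})

via_concat = pd.concat([left, right], axis=1)
via_join = left.join(right, how='inner')
via_merge = left.merge(right, left_index=True, right_index=True, how='inner')
print(via_concat.equals(via_join) and via_join.equals(via_merge))

# Shift the index of the right frame so only labels 1 and 2 overlap
shifted = right.copy()
shifted.index = [1, 2, 3]

outer_shape = pd.concat([left, shifted], axis=1).shape   # union of indexes
inner_shape = left.join(shifted, how='inner').shape      # intersection of indexes
print(outer_shape, inner_shape)
```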


In [40]:
df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,Deck,Num,Side,Luxury,Group,GroupSize,TravelingAlone
0,0001_01,Europa,0,TRAPPIST-1e,0.702095,0,0,1,0.0,0,-0.521809,1,-0.648735,1
1,0002_01,Earth,0,TRAPPIST-1e,-0.333233,0,1,5,0.0,1,-0.263119,2,-0.648735,1
2,0003_01,Europa,0,TRAPPIST-1e,2.01351,1,0,0,0.0,1,3.127616,3,-0.022268,0
3,0003_02,Europa,0,TRAPPIST-1e,0.287964,0,0,0,0.0,1,1.297456,3,-0.022268,0
4,0004_01,Earth,0,TRAPPIST-1e,-0.885407,0,1,5,1.0,1,-0.138343,4,-0.648735,1


In [41]:
t_encoder.get_feature_names_out(['HomePlanet'])

array(['HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars',
       'HomePlanet_nan'], dtype=object)

In [42]:
cat_features_one_hot

['HomePlanet', 'Destination']

In [43]:
one_hot_encoders = {}
one_hot_df = pd.DataFrame()
for feature in cat_features_one_hot: 
    print(f'Currently working on {feature}')
    current_encoder = OneHotEncoder(sparse_output=False)
    # current_encoder.fit(df[feature].values.reshape(-1, 1)
    current_df = pd.DataFrame(current_encoder.fit_transform(df[feature].values.reshape(-1, 1)), columns=current_encoder.categories_[0])
    # df = pd.concat([df, current_df], axis=1)
    df = pd.merge(df, current_df, how='inner', left_index=True, right_index=True)
    df.drop(feature, axis=1, inplace=True)
    one_hot_encoders[feature] = current_encoder

Currently working on HomePlanet
Currently working on Destination


In [44]:
df.head()

Unnamed: 0,PassengerId,CryoSleep,Age,VIP,Transported,Deck,Num,Side,Luxury,Group,GroupSize,TravelingAlone,Earth,Europa,Mars,nan_x,55 Cancri e,PSO J318.5-22,TRAPPIST-1e,nan_y
0,0001_01,0,0.702095,0,0,1,0.0,0,-0.521809,1,-0.648735,1,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0002_01,0,-0.333233,0,1,5,0.0,1,-0.263119,2,-0.648735,1,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0003_01,0,2.01351,1,0,0,0.0,1,3.127616,3,-0.022268,0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0003_02,0,0.287964,0,0,0,0.0,1,1.297456,3,-0.022268,0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0004_01,0,-0.885407,0,1,5,1.0,1,-0.138343,4,-0.648735,1,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


We can also do all of this in one step:

In [45]:
# TODO: Make a ColumnTransformer in order to do all the process in one step
# TODO: Make a Pipeline that contains the ColumnTransformer

Now I'm going to rewrite what I've done in a more standardized way.

In [46]:
# Prepare ColumnTransformer
oh_encoder = OneHotEncoder(sparse_output=False)
l_encoder = LabelEncoder()
s_scaler = StandardScaler()
print(f'Numerical features = {num_features}')
print(f'One-hot categorical features = {cat_features_one_hot}')
print(f'Label categorical features {cat_features_label}')

Numerical features = ['Age', 'Luxury', 'GroupSize']
One-hot categorical features = ['HomePlanet', 'Destination']
Label categorical features ['Deck', 'Transported', 'CryoSleep', 'Side', 'VIP', 'TravelingAlone']


In [47]:
# The ColumnTransformer class needs a list of transformers. Each transformer is a tuple of 3 values:
# the name of the step, the transformer to run, and the columns affected by the step
main_transformer = ColumnTransformer([
    ('Scaler', s_scaler, num_features),
    ('LabelEncoder_step', l_encoder, 'Transported'),
    ('OneHotEncoder', oh_encoder, cat_features_one_hot)],
    remainder='passthrough')

Reset the data 

In [48]:
df = pd.read_csv('../data/stg/train_stg.csv')
# df = pd.read_csv('../data/train.csv', dtype_backend='pyarrow')

In [49]:
# main_transformer.fit_transform(df)
# ERROR: TypeError: LabelEncoder.fit_transform() takes 2 positional arguments but 3 were given

After reading the docs, it turns out the LabelEncoder is meant to encode only the label (target) feature, so I'm going to try the OrdinalEncoder instead, though this transformer implies an order in the categories.
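The error comes from the method signatures: `ColumnTransformer` calls `fit_transform(X, y)` on each step, while `LabelEncoder.fit_transform` accepts only the target `y`. The sketch below, on made-up rows, contrasts the two encoders: `OrdinalEncoder` works on a 2D feature matrix with several columns, and `LabelEncoder` works on a single 1D target.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Made-up features and target
X = pd.DataFrame({'Deck': ['B', 'F', 'A'], 'Side': ['P', 'S', 'S']})
y = pd.Series([False, True, False], name='Transported')

# OrdinalEncoder: 2D input, one integer code per column (returned as floats)
X_encoded = OrdinalEncoder().fit_transform(X)
print(X_encoded)

# LabelEncoder: 1D input, intended for the target
y_encoded = LabelEncoder().fit_transform(y)
print(y_encoded)
```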

In [50]:
# Prepare ColumnTransformer
oh_encoder = OneHotEncoder(sparse_output=False)
o_encoder = OrdinalEncoder()
s_scaler = StandardScaler()
print(f'Numerical features = {num_features}')
print(f'One-hot categorical features = {cat_features_one_hot}')
print(f'Label categorical features {cat_features_label}')

Numerical features = ['Age', 'Luxury', 'GroupSize']
One-hot categorical features = ['HomePlanet', 'Destination']
Label categorical features ['Deck', 'Transported', 'CryoSleep', 'Side', 'VIP', 'TravelingAlone']


In [51]:
# The ColumnTransformer class needs a list of transformers. Each transformer is a tuple of 3 values:
# the name of the step, the transformer to run, and the columns affected by the step
# The remainder parameter controls what happens to the columns not listed in any step
# Its default value is 'drop', which drops those columns from the output
# The output columns follow the order of the steps, with any passthrough columns at the end
# The verbose parameter prints the time elapsed while fitting each transformer
# With verbose_feature_names_out=True (the default), each output column is prefixed with the name of the step that generated it
main_transformer = ColumnTransformer([
    ('Scaler', s_scaler, num_features),
    ('OrdinalEncoder', o_encoder, cat_features_label),
    ('OneHotEncoder', oh_encoder, cat_features_one_hot)],
    remainder='passthrough',
    # verbose=True,
    verbose_feature_names_out=False)

In [52]:
# Same steps as main_transformer, but remainder='drop' discards every column not listed in a step
drop_transformer = ColumnTransformer([
    ('Scaler', s_scaler, num_features),
    ('OrdinalEncoder', o_encoder, cat_features_label),
    ('OneHotEncoder', oh_encoder, cat_features_one_hot)],
    remainder='drop',
    # verbose=True,
    verbose_feature_names_out=False)

In [53]:
main_transformer.fit_transform(df)

array([[0.7020948248098347, -0.5218089610336295, -0.6487347223401672,
        ..., '0001_01', 0.0, 1],
       [-0.3332325821119825, -0.2631190805476465, -0.6487347223401672,
        ..., '0002_01', 0.0, 2],
       [2.013509540244136, 3.127616350224472, -0.022268276961021034, ...,
        '0003_01', 0.0, 3],
       ...,
       [-0.1951889278557402, 0.1365146071052921, -0.6487347223401672,
        ..., '9279_01', 1500.0, 9279],
       [0.21894203491298664, 1.1080075821912394, -0.022268276961021034,
        ..., '9280_01', 608.0, 9280],
       [1.0472039604504404, 1.1744374563921236, -0.022268276961021034,
        ..., '9280_02', 608.0, 9280]], dtype=object)

In [54]:
drop_transformer.fit_transform(df)

array([[ 0.70209482, -0.52180896, -0.64873472, ...,  0.        ,
         1.        ,  0.        ],
       [-0.33323258, -0.26311908, -0.64873472, ...,  0.        ,
         1.        ,  0.        ],
       [ 2.01350954,  3.12761635, -0.02226828, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [-0.19518893,  0.13651461, -0.64873472, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.21894203,  1.10800758, -0.02226828, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.04720396,  1.17443746, -0.02226828, ...,  0.        ,
         1.        ,  0.        ]])

In [55]:
drop_transformer.get_feature_names_out(df.columns)

array(['Age', 'Luxury', 'GroupSize', 'Deck', 'Transported', 'CryoSleep',
       'Side', 'VIP', 'TravelingAlone', 'HomePlanet_Earth',
       'HomePlanet_Europa', 'HomePlanet_Mars', 'HomePlanet_nan',
       'Destination_55 Cancri e', 'Destination_PSO J318.5-22',
       'Destination_TRAPPIST-1e', 'Destination_nan'], dtype=object)

The method `get_feature_names_out` returns a numpy array with the names of the ColumnTransformer's output columns.
Personally I don't like to use numpy arrays for text data, so I created a list from the original array.

In [56]:
drop_features = drop_transformer.get_feature_names_out(df.columns)
print(type(drop_features))
print(drop_features)
drop_features_list = drop_features.tolist()
print(drop_features_list)

<class 'numpy.ndarray'>
['Age' 'Luxury' 'GroupSize' 'Deck' 'Transported' 'CryoSleep' 'Side' 'VIP'
 'TravelingAlone' 'HomePlanet_Earth' 'HomePlanet_Europa' 'HomePlanet_Mars'
 'HomePlanet_nan' 'Destination_55 Cancri e' 'Destination_PSO J318.5-22'
 'Destination_TRAPPIST-1e' 'Destination_nan']
['Age', 'Luxury', 'GroupSize', 'Deck', 'Transported', 'CryoSleep', 'Side', 'VIP', 'TravelingAlone', 'HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars', 'HomePlanet_nan', 'Destination_55 Cancri e', 'Destination_PSO J318.5-22', 'Destination_TRAPPIST-1e', 'Destination_nan']


In [57]:
drop_output_df = pd.DataFrame(drop_transformer.fit_transform(df), columns=drop_features)

In [58]:
drop_output_df.head()

Unnamed: 0,Age,Luxury,GroupSize,Deck,Transported,CryoSleep,Side,VIP,TravelingAlone,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,HomePlanet_nan,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,Destination_nan
0,0.702095,-0.521809,-0.648735,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,-0.333233,-0.263119,-0.648735,5.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,2.01351,3.127616,-0.022268,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.287964,1.297456,-0.022268,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,-0.885407,-0.138343,-0.648735,5.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


Main transformer 

In [59]:
main_features = main_transformer.get_feature_names_out(df.columns).tolist()
print(main_features)
main_transformer_data = main_transformer.fit_transform(df)
print(main_transformer_data)

['Age', 'Luxury', 'GroupSize', 'Deck', 'Transported', 'CryoSleep', 'Side', 'VIP', 'TravelingAlone', 'HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars', 'HomePlanet_nan', 'Destination_55 Cancri e', 'Destination_PSO J318.5-22', 'Destination_TRAPPIST-1e', 'Destination_nan', 'PassengerId', 'Num', 'Group']
[[0.7020948248098347 -0.5218089610336295 -0.6487347223401672 ...
  '0001_01' 0.0 1]
 [-0.3332325821119825 -0.2631190805476465 -0.6487347223401672 ...
  '0002_01' 0.0 2]
 [2.013509540244136 3.127616350224472 -0.022268276961021034 ... '0003_01'
  0.0 3]
 ...
 [-0.1951889278557402 0.1365146071052921 -0.6487347223401672 ...
  '9279_01' 1500.0 9279]
 [0.21894203491298664 1.1080075821912394 -0.022268276961021034 ...
  '9280_01' 608.0 9280]
 [1.0472039604504404 1.1744374563921236 -0.022268276961021034 ...
  '9280_02' 608.0 9280]]


Create the main transformer's dataframe 

In [60]:
main_output_df = pd.DataFrame(main_transformer_data, columns=main_features)
main_output_df.head()

Unnamed: 0,Age,Luxury,GroupSize,Deck,Transported,CryoSleep,Side,VIP,TravelingAlone,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,HomePlanet_nan,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,Destination_nan,PassengerId,Num,Group
0,0.702095,-0.521809,-0.648735,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0001_01,0.0,1
1,-0.333233,-0.263119,-0.648735,5.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0002_01,0.0,2
2,2.01351,3.127616,-0.022268,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0003_01,0.0,3
3,0.287964,1.297456,-0.022268,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0003_02,0.0,3
4,-0.885407,-0.138343,-0.648735,5.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0004_01,1.0,4


In [61]:
main_transformer.get_feature_names_out()

array(['Age', 'Luxury', 'GroupSize', 'Deck', 'Transported', 'CryoSleep',
       'Side', 'VIP', 'TravelingAlone', 'HomePlanet_Earth',
       'HomePlanet_Europa', 'HomePlanet_Mars', 'HomePlanet_nan',
       'Destination_55 Cancri e', 'Destination_PSO J318.5-22',
       'Destination_TRAPPIST-1e', 'Destination_nan', 'PassengerId', 'Num',
       'Group'], dtype=object)

Make the final ColumnTransformer

In [62]:
# Prepare ColumnTransformer
num_features = ['Age', 'Luxury', 'GroupSize']
cat_features_label = ['Deck', 'Transported', 'CryoSleep', 'Side', 'VIP', 'TravelingAlone']
cat_features_one_hot = ['HomePlanet','Destination']
columns_to_drop = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Cabin']
oh_encoder = OneHotEncoder(sparse_output=False)
o_encoder = OrdinalEncoder()
s_scaler = StandardScaler()
print(f'Numerical features = {num_features}')
print(f'One-hot categorical features = {cat_features_one_hot}')
print(f'Label categorical features {cat_features_label}')
print(f'Columns to drop: {columns_to_drop}')

Numerical features = ['Age', 'Luxury', 'GroupSize']
One-hot categorical features = ['HomePlanet', 'Destination']
Label categorical features ['Deck', 'Transported', 'CryoSleep', 'Side', 'VIP', 'TravelingAlone']
Columns to drop: ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Cabin']


In [63]:
# Final version: scale the numerical features, ordinal encode the label features and one-hot encode HomePlanet and Destination
# remainder='passthrough' keeps the untouched columns (PassengerId, Num, Group) at the end of the output
# verbose_feature_names_out=False keeps the output column names free of the step-name prefix
final_column_transformer = ColumnTransformer([
    ('Scaler', s_scaler, num_features),
    ('OrdinalEncoder', o_encoder, cat_features_label),
    ('OneHotEncoder', oh_encoder, cat_features_one_hot)],
    remainder='passthrough', 
    verbose_feature_names_out=False
    )

In [64]:
final_column_transformer.fit_transform(df)

array([[0.7020948248098347, -0.5218089610336295, -0.6487347223401672,
        ..., '0001_01', 0.0, 1],
       [-0.3332325821119825, -0.2631190805476465, -0.6487347223401672,
        ..., '0002_01', 0.0, 2],
       [2.013509540244136, 3.127616350224472, -0.022268276961021034, ...,
        '0003_01', 0.0, 3],
       ...,
       [-0.1951889278557402, 0.1365146071052921, -0.6487347223401672,
        ..., '9279_01', 1500.0, 9279],
       [0.21894203491298664, 1.1080075821912394, -0.022268276961021034,
        ..., '9280_01', 608.0, 9280],
       [1.0472039604504404, 1.1744374563921236, -0.022268276961021034,
        ..., '9280_02', 608.0, 9280]], dtype=object)

In [65]:
final_column_transformer.get_feature_names_out()

array(['Age', 'Luxury', 'GroupSize', 'Deck', 'Transported', 'CryoSleep',
       'Side', 'VIP', 'TravelingAlone', 'HomePlanet_Earth',
       'HomePlanet_Europa', 'HomePlanet_Mars', 'HomePlanet_nan',
       'Destination_55 Cancri e', 'Destination_PSO J318.5-22',
       'Destination_TRAPPIST-1e', 'Destination_nan', 'PassengerId', 'Num',
       'Group'], dtype=object)

Make the final dataframe

In [66]:
final_cols = final_column_transformer.get_feature_names_out().tolist()

In [67]:
final_df = pd.DataFrame(final_column_transformer.fit_transform(df), columns=final_cols)
final_df.head()

Unnamed: 0,Age,Luxury,GroupSize,Deck,Transported,CryoSleep,Side,VIP,TravelingAlone,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,HomePlanet_nan,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,Destination_nan,PassengerId,Num,Group
0,0.702095,-0.521809,-0.648735,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0001_01,0.0,1
1,-0.333233,-0.263119,-0.648735,5.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0002_01,0.0,2
2,2.01351,3.127616,-0.022268,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0003_01,0.0,3
3,0.287964,1.297456,-0.022268,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0003_02,0.0,3
4,-0.885407,-0.138343,-0.648735,5.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0004_01,1.0,4
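Because the transformed array above has `dtype=object` (the passed-through `PassengerId` strings force it), a dataframe built from it stores its numeric columns as `object` too. The following is a hedged sketch of one way to recover numeric dtypes with `DataFrame.infer_objects`; the small `mixed` array is made up for the example and this step is not in the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical object array mixing floats and strings, like the transformer output
mixed = np.array(
    [[0.70, 1.0, "0001_01"],
     [-0.33, 5.0, "0002_01"]],
    dtype=object,
)
frame = pd.DataFrame(mixed, columns=["Age", "Deck", "PassengerId"])

# The numeric columns are stored as generic Python objects at this point
age_before = str(frame["Age"].dtype)

# infer_objects re-inspects object columns and picks a better dtype where possible
frame = frame.infer_objects()
age_after = str(frame["Age"].dtype)
deck_after = str(frame["Deck"].dtype)

print(age_before, "->", age_after)
```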


Reorganize the dataframe so that the target column, Transported, comes last. Three equivalent ways to do it:

In [68]:
# Using iloc: columns 0 to 3, then columns 5 onward, then column 4 (Transported) last
final_df_re = pd.concat([final_df.iloc[:, 0:4], final_df.iloc[:, 5:], final_df.iloc[:, 4]], axis=1)
final_df_re.head()

Unnamed: 0,Age,Luxury,GroupSize,Deck,CryoSleep,Side,VIP,TravelingAlone,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,HomePlanet_nan,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,Destination_nan,PassengerId,Num,Group,Transported
0,0.702095,-0.521809,-0.648735,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0001_01,0.0,1,0.0
1,-0.333233,-0.263119,-0.648735,5.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0002_01,0.0,2,1.0
2,2.01351,3.127616,-0.022268,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0003_01,0.0,3,0.0
3,0.287964,1.297456,-0.022268,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0003_02,0.0,3,0.0
4,-0.885407,-0.138343,-0.648735,5.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0004_01,1.0,4,1.0


In [69]:
# Using the list of column names and its indexes to filter
# cols[4] is a single string, not a list, so we wrap it in a list in order to concatenate it with the slices
cols = list(final_df.columns.values)
final_df_re = final_df[cols[0:4] + cols[5:] + [cols[4]]]
final_df_re.head()

Unnamed: 0,Age,Luxury,GroupSize,Deck,CryoSleep,Side,VIP,TravelingAlone,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,HomePlanet_nan,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,Destination_nan,PassengerId,Num,Group,Transported
0,0.702095,-0.521809,-0.648735,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0001_01,0.0,1,0.0
1,-0.333233,-0.263119,-0.648735,5.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0002_01,0.0,2,1.0
2,2.01351,3.127616,-0.022268,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0003_01,0.0,3,0.0
3,0.287964,1.297456,-0.022268,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0003_02,0.0,3,0.0
4,-0.885407,-0.138343,-0.648735,5.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0004_01,1.0,4,1.0


In [70]:
# Using list comprehension to filter with the column names
columns_re = [col for col in final_cols if col != 'Transported']
print(columns_re)
final_df_re = final_df[columns_re + ['Transported']]
final_df_re.head()

['Age', 'Luxury', 'GroupSize', 'Deck', 'CryoSleep', 'Side', 'VIP', 'TravelingAlone', 'HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars', 'HomePlanet_nan', 'Destination_55 Cancri e', 'Destination_PSO J318.5-22', 'Destination_TRAPPIST-1e', 'Destination_nan', 'PassengerId', 'Num', 'Group']


Unnamed: 0,Age,Luxury,GroupSize,Deck,CryoSleep,Side,VIP,TravelingAlone,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,HomePlanet_nan,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,Destination_nan,PassengerId,Num,Group,Transported
0,0.702095,-0.521809,-0.648735,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0001_01,0.0,1,0.0
1,-0.333233,-0.263119,-0.648735,5.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0002_01,0.0,2,1.0
2,2.01351,3.127616,-0.022268,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0003_01,0.0,3,0.0
3,0.287964,1.297456,-0.022268,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0003_02,0.0,3,0.0
4,-0.885407,-0.138343,-0.648735,5.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0004_01,1.0,4,1.0
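As a quick sanity check, the three approaches above can be compared on a small frame. This is only a sketch: the toy data and the `target` and `pos` names are assumptions, and the position is looked up with `get_loc` instead of hard-coding index 4.

```python
import pandas as pd

# Toy frame (hypothetical) with the target column in the middle
toy = pd.DataFrame({
    "Age": [0.7, -0.3],
    "Deck": [1.0, 5.0],
    "Transported": [0.0, 1.0],
    "CryoSleep": [0.0, 0.0],
})
target = "Transported"
pos = toy.columns.get_loc(target)  # integer position of the target column

# 1) iloc slices around the target, target appended last
by_iloc = pd.concat(
    [toy.iloc[:, :pos], toy.iloc[:, pos + 1:], toy.iloc[:, pos]], axis=1
)

# 2) slicing the list of column names
cols = list(toy.columns)
by_index = toy[cols[:pos] + cols[pos + 1:] + [cols[pos]]]

# 3) list comprehension on the column names
by_name = toy[[c for c in toy.columns if c != target] + [target]]

# All three should hold the same columns, order and values
same = by_iloc.equals(by_index) and by_index.equals(by_name)
print(list(by_name.columns), same)
```

The name-based version is the least fragile of the three, since it keeps working if the position of the target column changes.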


Save the dataframe

In [71]:
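# By default to_csv also writes the row index as an unnamed first column; pass index=False to leave it out
final_df_re.to_csv('../data/processed/train.csv')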
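One detail worth knowing when the file is read back: `to_csv` writes the row index by default, and `read_csv` then loads it as a column called `Unnamed: 0`. A small sketch of the round trip, using an in-memory buffer and made-up data instead of the real file:

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the processed data
frame = pd.DataFrame({"Age": [0.7, -0.3], "Transported": [0.0, 1.0]})

# Default behaviour: the index is written as an unnamed first column
buf = io.StringIO()
frame.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# index=False leaves the index out of the file
buf_no_index = io.StringIO()
frame.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
without_index = pd.read_csv(buf_no_index)

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Alternatively, the file can be read with `pd.read_csv(path, index_col=0)` to restore the saved index instead of dropping it.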