# Data Transformation

In this notebook, I replace, modify, reshape, and scale the data to get it into a form the model can use, with the goal of improving the model's accuracy.

Import libraries

In [1]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import re
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, mean_squared_error, accuracy_score, precision_score, recall_score

## Data dictionary

- **PassengerId** - A unique Id for each passenger. Each Id takes the form ```gggg_pp``` where ```gggg``` indicates a group the passenger is travelling with and ```pp``` is their number within the group. People in a group are often family members, but not always.
- **HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.
- **CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- **Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- **Destination** - The planet the passenger will be debarking to.
- **Age** - The age of the passenger.
- **VIP** - Whether the passenger has paid for special VIP service during the voyage.
- **RoomService, FoodCourt, ShoppingMall, Spa, VRDeck** - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- **Name** - The first and last names of the passenger.
- **Transported** - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

## Check the dataframe

Let's load the data.

In [2]:
df = pd.read_csv('../data/stg/train_stg.csv')
# df = pd.read_csv('../data/train.csv', dtype_backend='pyarrow')

In [3]:
df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Deck,Num,Side,Luxury,Group,GroupSize,TravelingAlone
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,B,0,P,0.0,1,1,True
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,F,0,S,736.0,2,1,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,A,0,S,10383.0,3,2,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,A,0,S,5176.0,3,2,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,F,1,S,1091.0,4,1,True


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   PassengerId     8693 non-null   object 
 1   HomePlanet      8693 non-null   object 
 2   CryoSleep       8693 non-null   bool   
 3   Cabin           8693 non-null   object 
 4   Destination     8693 non-null   object 
 5   Age             8693 non-null   float64
 6   VIP             8693 non-null   bool   
 7   RoomService     8693 non-null   float64
 8   FoodCourt       8693 non-null   float64
 9   ShoppingMall    8693 non-null   float64
 10  Spa             8693 non-null   float64
 11  VRDeck          8693 non-null   float64
 12  Transported     8693 non-null   bool   
 13  Deck            8693 non-null   object 
 14  Num             8693 non-null   int64  
 15  Side            8693 non-null   object 
 16  Luxury          8693 non-null   float64
 17  Group           8693 non-null   i

In [5]:
df.describe()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Num,Luxury,Group,GroupSize
count,8693.0,8693.0,8693.0,8693.0,8693.0,8693.0,8693.0,8693.0,8693.0,8693.0
mean,28.790291,220.009318,448.434027,169.5723,304.588865,298.26182,586.624065,1440.866329,4633.389624,2.035546
std,14.341404,660.51905,1595.790627,598.007164,1125.562559,1134.126417,513.880084,2803.045694,2671.028856,1.596347
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
25%,20.0,0.0,0.0,0.0,0.0,0.0,152.0,0.0,2319.0,1.0
50%,27.0,0.0,0.0,0.0,0.0,0.0,407.0,716.0,4630.0,1.0
75%,37.0,41.0,61.0,22.0,53.0,40.0,983.0,1441.0,6883.0,3.0
max,79.0,14327.0,29813.0,23492.0,22408.0,24133.0,1894.0,35987.0,9280.0,8.0


In [6]:
df.corr(numeric_only=True)

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Num,Luxury,Group,GroupSize,TravelingAlone
CryoSleep,1.0,-0.071323,-0.078281,-0.244089,-0.205928,-0.207798,-0.198307,-0.192721,0.460132,-0.040133,-0.376692,-0.006883,0.079363,-0.091562
Age,-0.071323,1.0,0.091863,0.068629,0.12739,0.033148,0.120946,0.09959,-0.074233,-0.127788,0.184628,-0.009099,-0.176957,0.133804
VIP,-0.078281,0.091863,1.0,0.056566,0.125499,0.018412,0.060991,0.123061,-0.037261,-0.096811,0.162987,0.013608,0.002856,-0.034027
RoomService,-0.244089,0.068629,0.056566,1.0,-0.015126,0.052337,0.009244,-0.018624,-0.241124,-0.012673,0.234374,0.000375,-0.039734,0.019338
FoodCourt,-0.205928,0.12739,0.125499,-0.015126,1.0,-0.013717,0.221468,0.224572,0.045583,-0.177197,0.742608,-0.0092,0.032502,-0.066683
ShoppingMall,-0.207798,0.033148,0.018412,0.052337,-0.013717,1.0,0.014542,-0.007849,0.009391,0.00353,0.220529,0.017796,-0.038536,0.029095
Spa,-0.198307,0.120946,0.060991,0.009244,0.221468,0.014542,1.0,0.147658,-0.218545,-0.129222,0.592656,-0.005198,0.019218,-0.043639
VRDeck,-0.192721,0.09959,0.123061,-0.018624,0.224572,-0.007849,0.147658,1.0,-0.204874,-0.133074,0.585684,0.015945,0.00913,-0.044293
Transported,0.460132,-0.074233,-0.037261,-0.241124,0.045583,0.009391,-0.218545,-0.204874,1.0,-0.043832,-0.199514,0.021491,0.082644,-0.113792
Num,-0.040133,-0.127788,-0.096811,-0.012673,-0.177197,0.00353,-0.129222,-0.133074,-0.043832,1.0,-0.208844,0.665621,-0.051351,0.133426


## Data Transformation

First, drop the PassengerId column. I don't think there's enough value in that feature to keep it in the dataframe.

In [7]:
df.drop('PassengerId', axis=1, inplace=True)

Next, drop the individual "spending" features. The Luxury feature we created earlier already encapsulates that data.
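
Judging from the `df.head()` output above, Luxury looks like the row-wise sum of the five spending columns (for example, 109 + 9 + 25 + 549 + 44 = 736 in the second row). A minimal sketch of a sanity check that could be run before the drop, using a toy frame copied from the first three rows (the `toy` and `spend_cols` names are hypothetical, not part of the notebook):

```python
import pandas as pd

spend_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Toy frame copied from the first three rows of df.head() above
toy = pd.DataFrame({
    'RoomService': [0.0, 109.0, 43.0],
    'FoodCourt': [0.0, 9.0, 3576.0],
    'ShoppingMall': [0.0, 25.0, 0.0],
    'Spa': [0.0, 549.0, 6715.0],
    'VRDeck': [0.0, 44.0, 49.0],
    'Luxury': [0.0, 736.0, 10383.0],
})

# True when Luxury equals the row-wise sum of the spending columns
luxury_is_sum = bool((toy[spend_cols].sum(axis=1) == toy['Luxury']).all())
print(luxury_is_sum)
```

On the full dataframe, the same comparison would use `df` in place of `toy`.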

In [8]:
df.drop(['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'], axis=1, inplace=True)

In the previous notebook, I split the Cabin feature into Deck, Num, and Side, so the original column is no longer needed.

In [9]:
df.drop('Cabin', axis=1, inplace=True)

In [10]:
df.columns

Index(['HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP', 'Transported',
       'Deck', 'Num', 'Side', 'Luxury', 'Group', 'GroupSize',
       'TravelingAlone'],
      dtype='object')

Next, handle categorical and numerical features

### Numerical Features
These features should be scaled with the standard scaler (z-score scaling):

z = (x - u) / s

Where `x` is the current training value, `u` is the mean of the training samples (or zero if `with_mean=False`), and `s` is the standard deviation of the training samples (or one if `with_std=False`). (From the scikit-learn docs)
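
To make the formula concrete, here is a small sketch that computes the z-score by hand and compares it with `StandardScaler`. The input values are the five ages from `df.head()`; the variable names are only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Ages from the first five rows of df.head()
x = np.array([39.0, 24.0, 58.0, 33.0, 16.0])

# Manual z-score using the population mean and population std (ddof=0)
u = x.mean()
s = x.std()  # NumPy defaults to ddof=0, which is what StandardScaler uses
z_manual = (x - u) / s

# StandardScaler expects a 2D array of shape (n_samples, n_features)
z_sklearn = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(np.allclose(z_manual, z_sklearn))
```

Note that pandas `describe()` reports the sample standard deviation (ddof=1), so a column scaled this way shows a std slightly above 1, by a factor of sqrt(n / (n - 1)).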

I'm torn about the Group and Num features. In a way, Group is a categorical feature because it's the ID of the group, but there are ~6000 different IDs, so it should probably be rescaled instead. The Num feature is in a similar situation. For now, I'm going to ignore both.

In [11]:
num_features = ['Age', 'Luxury', 'GroupSize']

In [12]:
df['Age'].shape

(8693,)

We need to pass a 2D array of shape (n, 1) to the scaler, not a 1D array of shape (n,), so we're going to reshape the pandas Series with `df['Age'].values.reshape((-1, 1))`
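
A quick sketch of the shapes involved, on a toy frame with the ages from `df.head()` (the `toy` name is hypothetical). Selecting the column with double brackets is an alternative that also yields a 2D input:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({'Age': [39.0, 24.0, 58.0, 33.0, 16.0]})

as_series = toy['Age'].values                 # 1D, shape (5,): rejected by the scaler
as_column = toy['Age'].values.reshape(-1, 1)  # 2D, shape (5, 1): five samples, one feature
as_frame = toy[['Age']]                       # double brackets keep a 2D (5, 1) DataFrame

print(as_series.shape, as_column.shape, as_frame.shape)

# Both 2D forms are accepted; the result is a (5, 1) NumPy array
scaled = StandardScaler().fit_transform(as_frame)
print(scaled.shape)
```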

In [13]:
scalers = {}
for feature in num_features:
    current_scaler = StandardScaler()
    df[feature] = current_scaler.fit_transform(df[feature].values.reshape(-1, 1))
    scalers[feature] = current_scaler

In [14]:
df[num_features]

Unnamed: 0,Age,Luxury,GroupSize
0,0.711945,-0.514066,-0.648735
1,-0.334037,-0.251479,-0.648735
2,2.036857,3.190333,-0.022268
3,0.293552,1.332604,-0.022268
4,-0.891895,-0.124824,-0.648735
...,...,...,...
8688,0.851410,2.531369,-0.648735
8689,-0.752431,-0.514066,-0.648735
8690,-0.194573,0.154175,-0.648735
8691,0.223820,1.140302,-0.022268


In [15]:
df[num_features].describe()

Unnamed: 0,Age,Luxury,GroupSize
count,8693.0,8693.0,8693.0
mean,-2.125171e-17,1.409969e-17,-1.3077980000000001e-17
std,1.000058,1.000058,1.000058
min,-2.00761,-0.5140655,-0.6487347
25%,-0.6129662,-0.5140655,-0.6487347
50%,-0.1248409,-0.2586144,-0.6487347
75%,0.572481,4.769043e-05,0.6041982
max,3.501233,12.32521,3.73653


Now let's confirm that the three scalers are different.

In [16]:
scalers['Age'].mean_

array([28.79029104])

In [17]:
for scaler in scalers: 
    print(f'{scaler}')
    print(f'Mean: {scalers[scaler].mean_}')
    print(f'Scale or standard deviation: {scalers[scaler].scale_}')
    print(f'Variance: {scalers[scaler].var_}')

Age
Mean: [28.79029104]
Scale or standard deviation: [14.34057929]
Variance: [205.65221449]
Luxury
Mean: [1440.86632923]
Scale or standard deviation: [2802.88446483]
Variance: [7856161.32321115]
GroupSize
Mean: [2.03554584]
Scale or standard deviation: [1.59625469]
Variance: [2.54802903]
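
Keeping each fitted scaler in the `scalers` dictionary means the same mean and scale can later be applied to data the scaler has not seen. A minimal sketch of that idea, assuming a hypothetical batch of new ages (`new_age` is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit on "training" ages (values from df.head())
train_age = np.array([39.0, 24.0, 58.0, 33.0, 16.0]).reshape(-1, 1)
age_scaler = StandardScaler().fit(train_age)

# Hypothetical unseen ages: call transform, not fit_transform,
# so the training mean and scale are reused rather than re-estimated
new_age = np.array([30.0, 45.0]).reshape(-1, 1)
new_scaled = age_scaler.transform(new_age)

# inverse_transform maps scaled values back to the original units
restored = age_scaler.inverse_transform(new_scaled)
print(np.allclose(restored, new_age))
```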


### Categorical Features
These features should be one-hot or label encoded, because they represent a characteristic of the training sample.
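
The two encodings differ in what they produce. The sketch below contrasts them on a toy HomePlanet column (the `toy` frame is hypothetical); `pd.get_dummies` is used here as one possible way to one-hot encode, not necessarily the approach this notebook ends up with.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({'HomePlanet': ['Europa', 'Earth', 'Europa', 'Mars']})

# Label encoding: one integer per category, assigned in sorted order
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(toy['HomePlanet']).tolist()
classes = label_encoder.classes_.tolist()
print(classes)  # ['Earth', 'Europa', 'Mars']
print(labels)   # [1, 0, 1, 2]

# One-hot encoding: one 0/1 column per category, with no implied order
one_hot = pd.get_dummies(toy['HomePlanet'], prefix='HomePlanet', dtype=int)
print(one_hot.columns.tolist())
```

Label encoding implies an order between categories (Earth < Europa < Mars), which tree-based models tend to tolerate but which can mislead models that treat the codes as magnitudes. The scikit-learn docs describe `LabelEncoder` as intended for target values rather than input features.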

In [18]:
df.columns

Index(['HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP', 'Transported',
       'Deck', 'Num', 'Side', 'Luxury', 'Group', 'GroupSize',
       'TravelingAlone'],
      dtype='object')

In [21]:
cat_features_gen = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Transported', 'Deck', 'Side', 'TravelingAlone']

In [24]:
for cat_feature in cat_features_gen: 
    print(df[cat_feature].value_counts())

HomePlanet
Earth     4803
Europa    2131
Mars      1759
Name: count, dtype: int64
CryoSleep
False    5656
True     3037
Name: count, dtype: int64
Destination
TRAPPIST-1e      6097
55 Cancri e      1800
PSO J318.5-22     796
Name: count, dtype: int64
VIP
False    8494
True      199
Name: count, dtype: int64
Transported
True     4378
False    4315
Name: count, dtype: int64
Deck
F    2794
G    2559
E     876
B     779
C     747
D     478
A     256
0     199
T       5
Name: count, dtype: int64
Side
S    4288
P    4206
0     199
Name: count, dtype: int64
TravelingAlone
True     4805
False    3888
Name: count, dtype: int64


In [None]:
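# Features to label encode (integer codes); 'Transported' is the target
cat_features_label = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Transported', 'Deck', 'Side', 'TravelingAlone']
# Candidates for one-hot encoding (not used yet)
cat_features_one_hot = ['HomePlanet', 'CryoSleep', 'Destination']

# Keep one fitted encoder per feature so the mapping can be reused or inverted later
encoders = {}
for feature in cat_features_label:
    current_encoder = LabelEncoder()
    # LabelEncoder expects a 1D array, so pass the Series directly (no reshape)
    df[feature] = current_encoder.fit_transform(df[feature])
    encoders[feature] = current_encoder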
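
As with the scalers, storing the fitted encoders makes it possible to look up or undo the integer mapping later. A small sketch on a toy frame built from the Deck and Side values in `df.head()` (the `toy` and `deck_mapping` names are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    'Deck': ['B', 'F', 'A', 'A', 'F'],
    'Side': ['P', 'S', 'S', 'S', 'S'],
})

encoders = {}
for feature in ['Deck', 'Side']:
    encoder = LabelEncoder()
    toy[feature] = encoder.fit_transform(toy[feature])
    encoders[feature] = encoder

# classes_[i] is the original label that was encoded as integer i
deck_mapping = {label: code for code, label in enumerate(encoders['Deck'].classes_)}
print(deck_mapping)

# inverse_transform recovers the original labels from the codes
decks_back = encoders['Deck'].inverse_transform(toy['Deck']).tolist()
print(decks_back)
```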