# Data Preprocessing 1

## Contents

- [Imports](#imports)
- [Load Data](#load-data)
    - [Confirm Concatenated Shape](#confirm-concatenated-shape)
- [Data Prep](#data-preparation)
    - [Check Nulls](#check-null-values)
    - [Drop Columns](#drop-columns)
    - [Cabin Column](#cabin)
    - [Encode Categorical Features 1](#encode-categorical-features-1)
    - [Impute Numerical Features](#impute-numerical-columns)
    - [Impute Categorical Features](#impute-categorical-columns)
    - [Encode Categorical Features 2](#encode-categorical-features-2)

- [Feature Engineering](#feature-engineering)
    - [Create Bills Column](#create-bills-column)

- [Divide Dataset](#divide-into-train-and-test)

- [Save New DataFrames](#save-new-dfs)

**This notebook is the initial preprocessing of the data, used to establish a baseline score.**

**Further preprocessing will build from here.**




# Imports

In [37]:
import pandas as pd

from sklearn.impute import KNNImputer

# Load Data

- Load Train
- Load Test
- Combine Train and Test

In [38]:
train = pd.read_csv('../../data/original_data/train.csv')
test = pd.read_csv('../../data/original_data/test.csv')

len_train = len(train)
len_test = len(test)

train.shape, test.shape

((8693, 14), (4277, 13))

In [39]:
df = pd.concat([train, test])

## Confirm Concatenated Shape

In [40]:
df.shape[0] == test.shape[0] + train.shape[0]

True

# Data Preparation

## Check Null Values

In [41]:
df.isnull().sum()

PassengerId        0
HomePlanet       288
CryoSleep        310
Cabin            299
Destination      274
Age              270
VIP              296
RoomService      263
FoodCourt        289
ShoppingMall     306
Spa              284
VRDeck           268
Name             294
Transported     4277
dtype: int64

## Drop Columns

In [42]:
df.drop(columns=['Name'], inplace=True)

## Cabin

This is taken from the challenge page:

*Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.*

Therefore, Cabin will be separated into three columns (Deck, Num, Side)

In [43]:
df[['Deck', 'Num', 'Side']] = df['Cabin'].str.split('/', expand=True)
df.drop(columns=['Cabin'], inplace=True)
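As a small illustration of the split above, the sketch below runs the same `str.split('/', expand=True)` call on a toy frame. The cabin values are made up for the example; the point is that a missing `Cabin` produces missing values in all three new columns, and that `Num` comes back as a string.

```python
import pandas as pd

# Toy data (hypothetical values) in the deck/num/side format.
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/12/S", None]})

# expand=True returns one column per piece of the split string.
parts = toy["Cabin"].str.split("/", expand=True)
parts.columns = ["Deck", "Num", "Side"]

print(parts)
# A missing Cabin is missing in every derived column.
print(parts.isna().sum())
```

Because `Num` is still text at this point, it needs a numeric conversion before it can be used as a number.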

### Fill Unknown Deck and Side with 'U', and Unknown Num with -1

In [44]:
df['Deck'] = df['Deck'].fillna('U')
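# Num is still a string after the split; convert it before filling so the column has one numeric dtype
df['Num'] = pd.to_numeric(df['Num']).fillna(-1)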
df['Side'] = df['Side'].fillna('U')

## Encode Categorical Features 1

- Deck
- Side

In [45]:
df['Deck'] = df['Deck'].map({'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'T': 7, 'U': -1})
df['Side'] = df['Side'].map({'S': 0, 'P': 1, 'U': -1})
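One thing worth checking with `Series.map` and a dictionary: any label that is not a key in the dictionary silently becomes `NaN`. The sketch below shows this on a toy series, where `"Z"` is a made-up label that is not in the mapping, and lists the unmapped labels so they can be inspected.

```python
import pandas as pd

deck_codes = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4, "F": 5, "G": 6, "T": 7, "U": -1}

# "Z" is a hypothetical label that is absent from deck_codes.
toy = pd.Series(["A", "G", "U", "Z"])
encoded = toy.map(deck_codes)

# Labels that were present in the data but missing from the mapping.
unmapped = toy[encoded.isna()].unique().tolist()
print(encoded.tolist())
print(unmapped)
```

Running the same check on `Deck` and `Side` right after the mapping would confirm that no unexpected label slipped through.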

## Impute Numerical Columns

In [46]:
impute_list = ['CryoSleep', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Deck', 'Num', 'Side']
rest_columns = list(set(df.columns) - set(impute_list))

df_rest = df[rest_columns]

In [48]:
imp = KNNImputer()

df_imputed = imp.fit_transform(df[impute_list])
df_imputed = pd.DataFrame(df_imputed, columns=impute_list)
df = pd.concat([df_rest.reset_index(drop=True), df_imputed.reset_index(drop=True)], axis=1)
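To make the imputation step concrete, here is a minimal sketch of `KNNImputer` on a toy frame. The column names and values are invented, and `n_neighbors=2` is chosen only to keep the example small (the notebook uses the default of 5). The missing `Age` is filled with the mean `Age` of the two rows whose `Spend` is closest.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (hypothetical): one missing Age, to be filled from the nearest rows.
toy = pd.DataFrame({
    "Age": [20.0, 22.0, np.nan, 60.0],
    "Spend": [100.0, 120.0, 110.0, 5000.0],
})

imputer = KNNImputer(n_neighbors=2)

# fit_transform returns a NumPy array; passing index= keeps the original row labels.
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)
print(filled)
```

Distances are computed on the raw feature values, so columns with large ranges (such as the spending columns) carry more weight in choosing neighbours than small-range columns.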

## Impute Categorical Columns

In [49]:
df['HomePlanet'] = df['HomePlanet'].fillna('U')
df['Destination'] = df['Destination'].fillna('U')

df.isnull().sum()

HomePlanet         0
Transported     4277
PassengerId        0
Destination        0
CryoSleep          0
Age                0
VIP                0
RoomService        0
FoodCourt          0
ShoppingMall       0
Spa                0
VRDeck             0
Deck               0
Num                0
Side               0
dtype: int64

## Encode Categorical Features 2

- HomePlanet
- Destination

In [50]:
category_columns = ['HomePlanet', 'Destination']

for col in category_columns:
    df = pd.concat([df, pd.get_dummies(df[col], prefix=col)], axis=1)
    
df.drop(columns=category_columns, inplace=True)
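As a possible alternative to the loop above, `pd.get_dummies` can encode and drop the source columns in one call through its `columns` argument. The sketch below uses a toy frame with made-up rows; it is meant only to show the resulting column names, which follow the same `prefix_value` pattern.

```python
import pandas as pd

# Toy data (hypothetical rows); "U" stands for an unknown home planet.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Mars", "U"],
    "Age": [30.0, 41.0, 25.0],
})

# Encodes the listed columns and removes the originals in a single step.
encoded = pd.get_dummies(toy, columns=["HomePlanet"])
print(encoded.columns.tolist())
```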

In [51]:
df.head(3)

|   | Transported | PassengerId | CryoSleep | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | ... | Num | Side | HomePlanet_Earth | HomePlanet_Europa | HomePlanet_Mars | HomePlanet_U | Destination_55 Cancri e | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e | Destination_U |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | 0001_01 | 0.0 | 39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | False | True | False | False | False | False | True | False |
| 1 | True | 0002_01 | 0.0 | 24.0 | 0.0 | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | ... | 0.0 | 0.0 | True | False | False | False | False | False | True | False |
| 2 | False | 0003_01 | 0.0 | 58.0 | 1.0 | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | ... | 0.0 | 0.0 | False | True | False | False | False | False | True | False |


# Feature Engineering

## Create Bills Column

- RoomService, FoodCourt, ShoppingMall, Spa, VRDeck

In [52]:
bills_columns = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
df['AmountSpent'] = df[bills_columns].sum(axis=1)
df.drop(columns=bills_columns, inplace=True)
df.head(3)

|   | Transported | PassengerId | CryoSleep | Age | VIP | Deck | Num | Side | HomePlanet_Earth | HomePlanet_Europa | HomePlanet_Mars | HomePlanet_U | Destination_55 Cancri e | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e | Destination_U | AmountSpent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | 0001_01 | 0.0 | 39.0 | 0.0 | 1.0 | 0.0 | 1.0 | False | True | False | False | False | False | True | False | 0.0 |
| 1 | True | 0002_01 | 0.0 | 24.0 | 0.0 | 5.0 | 0.0 | 0.0 | True | False | False | False | False | False | True | False | 736.0 |
| 2 | False | 0003_01 | 0.0 | 58.0 | 1.0 | 0.0 | 0.0 | 0.0 | False | True | False | False | False | False | True | False | 10383.0 |
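The row-wise sum can be checked by hand. The sketch below rebuilds the five spending values of the first three passengers (taken from the earlier `df.head(3)` output) and confirms that `sum(axis=1)` gives the `AmountSpent` values shown.

```python
import pandas as pd

bills_columns = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Spending values of the first three rows, copied from the earlier head() output.
toy = pd.DataFrame(
    [
        [0.0, 0.0, 0.0, 0.0, 0.0],
        [109.0, 9.0, 25.0, 549.0, 44.0],
        [43.0, 3576.0, 0.0, 6715.0, 49.0],
    ],
    columns=bills_columns,
)

# axis=1 sums across the columns of each row.
toy["AmountSpent"] = toy[bills_columns].sum(axis=1)
print(toy["AmountSpent"].tolist())
```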


# Divide Into Train and Test

In [54]:
train_preprocessed = df[:len_train].copy()
test_preprocessed = df[len_train:].copy()

train_preprocessed.reset_index(drop=True, inplace=True)
test_preprocessed.reset_index(drop=True, inplace=True)

test_preprocessed.drop(columns=['Transported'], inplace=True)

# Only the last expression in a cell is displayed, so check the length with an assert
assert len(test) == len(test_preprocessed)
test_preprocessed.head(3)

|   | PassengerId | CryoSleep | Age | VIP | Deck | Num | Side | HomePlanet_Earth | HomePlanet_Europa | HomePlanet_Mars | HomePlanet_U | Destination_55 Cancri e | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e | Destination_U | AmountSpent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0013_01 | 1.0 | 27.0 | 0.0 | 6.0 | 3.0 | 0.0 | True | False | False | False | False | False | True | False | 0.0 |
| 1 | 0018_01 | 0.0 | 19.0 | 0.0 | 5.0 | 4.0 | 0.0 | True | False | False | False | False | False | True | False | 2832.0 |
| 2 | 0019_01 | 1.0 | 31.0 | 0.0 | 2.0 | 0.0 | 0.0 | False | True | False | False | True | False | False | False | 0.0 |
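The split relies on the combined frame keeping train rows first and test rows second. A minimal sketch of that round trip on toy frames is shown below; the IDs are invented, and `ignore_index=True` and `iloc` are choices made for this example rather than what the notebook does. Comparing `PassengerId` against the original frames is one way to confirm that no rows moved.

```python
import pandas as pd

# Toy frames (hypothetical IDs); only train has the target column.
train = pd.DataFrame({"PassengerId": ["a", "b", "c"], "Transported": [True, False, True]})
test = pd.DataFrame({"PassengerId": ["d", "e"]})
len_train = len(train)

# ignore_index=True gives the combined frame a clean 0..n-1 index.
combined = pd.concat([train, test], ignore_index=True)

# Positional slicing: the first len_train rows are train, the remainder is test.
train_part = combined.iloc[:len_train].copy()
test_part = (
    combined.iloc[len_train:]
    .reset_index(drop=True)
    .drop(columns=["Transported"])
)

# Row order must match the original frames.
assert train_part["PassengerId"].tolist() == train["PassengerId"].tolist()
assert test_part["PassengerId"].tolist() == test["PassengerId"].tolist()
```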


# Save New DFs

In [55]:
train_preprocessed.to_csv('../../data/preproc_data/train_1_0.csv', index=False)
test_preprocessed.to_csv('../../data/preproc_data/test_1_0.csv', index=False)