# Data Cleaning

This notebook cleans the Spaceship Titanic training data: it drops an unused column, fills missing values, and saves the result.

Import libraries

In [1]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import re
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, mean_squared_error, accuracy_score, precision_score, recall_score

### Data dictionary

- **PassengerId** - A unique Id for each passenger. Each Id takes the form ```gggg_pp``` where ```gggg``` indicates a group the passenger is travelling with and ```pp``` is their number within the group. People in a group are often family members, but not always.
- **HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.
- **CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- **Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- **Destination** - The planet the passenger will be debarking to.
- **Age** - The age of the passenger.
- **VIP** - Whether the passenger has paid for special VIP service during the voyage.
- **RoomService, FoodCourt, ShoppingMall, Spa, VRDeck** - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- **Name** - The first and last names of the passenger.
- **Transported** - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
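The `gggg_pp` and deck/num/side formats described above can be split into separate columns. The following is a minimal sketch on made-up rows; the new column names (`Group`, `NumInGroup`, `Deck`, `CabinNum`, `Side`) are assumptions for illustration and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy rows in the formats the data dictionary describes (values are made up).
sample = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02"],
    "Cabin": ["B/0/P", "A/0/S", None],
})

# PassengerId is gggg_pp: the group id, then the passenger's number within the group.
sample[["Group", "NumInGroup"]] = sample["PassengerId"].str.split("_", expand=True)

# Cabin is deck/num/side; a missing cabin stays missing in all three parts.
sample[["Deck", "CabinNum", "Side"]] = sample["Cabin"].str.split("/", expand=True)

print(sample)
```

Passengers sharing a `Group` value are travelling together, so this column could later be used to count group sizes.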

Let's load the data.

In [2]:
df = pd.read_csv('../data/raw/train.csv')
# df = pd.read_csv('../data/train.csv', dtype_backend='pyarrow')

## Data Cleaning

Drop the `Name` column, which is close to unique per passenger.

In [3]:
df = df.drop(['Name'], axis=1)
print(df.shape)

(8693, 13)


Check for missing values.

In [4]:
df.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Transported       0
dtype: int64
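Raw counts are easier to judge as a share of the rows. With 8693 rows, the counts above work out to roughly 2 to 2.5 percent missing per affected column. A small sketch of that calculation on a made-up frame (the `toy` data is illustrative only):

```python
import pandas as pd

# Made-up frame with a few gaps, standing in for the real data.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", None, "Mars", "Earth"],
    "Age": [39.0, 24.0, None, None],
    "Transported": [False, True, True, False],
})

# isnull() gives booleans; their mean is the fraction of missing rows per column.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```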

Fill the gaps, column by column.

HomePlanet

In [5]:
# print('Printing most repeated entry: ')
# print(df['HomePlanet'].value_counts().index[0])
print('Printing the mode')
print(df['HomePlanet'].mode())

Printing the mode
0    Earth
Name: HomePlanet, dtype: object


In [6]:
df['HomePlanet'].isnull().sum()

201

In [7]:
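# Assign back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to df
df['HomePlanet'] = df['HomePlanet'].fillna(df['HomePlanet'].mode()[0])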

In [8]:
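def fillmode(df_series: str) -> None:
    """ Fill the null values of a column of df with the column's mode

    Args:
        df_series (str): name of the column to fill
    """
    # Assign back rather than relying on inplace=True on a column selection
    df[df_series] = df[df_series].fillna(df[df_series].mode()[0])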

CryoSleep

In [9]:
fillmode('CryoSleep')
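The helper above reads and modifies the global `df`. A variant that takes the frame as an argument and returns a filled copy is easier to reuse and test. This is only a sketch; `fill_with_mode` is a hypothetical name and the toy values are made up.

```python
import pandas as pd

def fill_with_mode(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of frame with nulls in column replaced by the column's mode."""
    out = frame.copy()
    # mode() ignores nulls and returns a Series; [0] takes the first most frequent value.
    out[column] = out[column].fillna(out[column].mode()[0])
    return out

toy = pd.DataFrame({"HomePlanet": ["Earth", None, "Mars", "Earth"]})
filled = fill_with_mode(toy, "HomePlanet")
print(filled["HomePlanet"].tolist())
```

Because the function works on a copy, `toy` still contains its original gap after the call.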

Cabin

In [10]:
def fillzeros(df_series: str) -> None:
    """ Fill the null values of a column of df with 0

    Args:
        df_series (str): name of the column to fill
    """
    # Assign back rather than relying on inplace=True on a column selection
    df[df_series] = df[df_series].fillna(0)

In [11]:
# Note: Cabin holds deck/num/side strings, so this leaves an integer 0 among string values
fillzeros('Cabin')
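Filling `Cabin` with `0` mixes an integer into a column of deck/num/side strings. One possible alternative, shown here as a sketch and not what this notebook does, is a placeholder string in the same shape. The `PLACEHOLDER` value is a hypothetical choice and the toy rows are made up.

```python
import pandas as pd

toy = pd.DataFrame({"Cabin": ["B/0/P", None, "F/1/S"]})

# Hypothetical placeholder that keeps the deck/num/side shape (an assumption, not from the notebook).
PLACEHOLDER = "Unknown/-1/Unknown"
toy["Cabin"] = toy["Cabin"].fillna(PLACEHOLDER)

# Every value is now a string, so string operations apply to the whole column.
all_strings = toy["Cabin"].map(lambda value: isinstance(value, str)).all()
print(toy["Cabin"].tolist(), all_strings)
```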

Destination

In [12]:
fillmode('Destination')

Age

In [13]:
df['Age'].median()

27.0

In [14]:
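def fillmedian(df_series: str) -> None:
    """ Fill the null values of a column of df with the column's median

    Args:
        df_series (str): name of the column to fill
    """
    # Assign back rather than relying on inplace=True on a column selection
    df[df_series] = df[df_series].fillna(df[df_series].median())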

In [15]:
fillmedian('Age')

VIP

In [16]:
fillmode('VIP')

Amenities

In [17]:
fillzeros('FoodCourt')
fillzeros('ShoppingMall')
fillzeros('Spa')
fillzeros('VRDeck')
fillzeros('RoomService')
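The five amenity columns can also be filled in a single assignment by selecting them as a list. A minimal sketch on made-up values, assuming a missing bill means nothing was spent:

```python
import pandas as pd

amenities = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Made-up bills with a few gaps.
toy = pd.DataFrame({
    "RoomService": [0.0, None],
    "FoodCourt": [9.0, 3576.0],
    "ShoppingMall": [None, 0.0],
    "Spa": [549.0, None],
    "VRDeck": [44.0, 49.0],
})

# fillna on the column subset returns a new frame, which is assigned back in one step.
toy[amenities] = toy[amenities].fillna(0)
print(toy)
```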

Check again that no missing values remain.

In [18]:
df.isna().sum()

PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Transported     0
dtype: int64

In [19]:
df.describe()

| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|
| count | 8693.0 | 8693.0 | 8693.0 | 8693.0 | 8693.0 | 8693.0 |
| mean | 28.790291 | 220.009318 | 448.434027 | 169.5723 | 304.588865 | 298.26182 |
| std | 14.341404 | 660.51905 | 1595.790627 | 598.007164 | 1125.562559 | 1134.126417 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 27.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 37.0 | 41.0 | 61.0 | 22.0 | 53.0 | 40.0 |
| max | 79.0 | 14327.0 | 29813.0 | 23492.0 | 22408.0 | 24133.0 |


### Saving the CSV

In [20]:
df.to_csv('../data/stg/train_stg.csv', index=False)
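`to_csv` does not create missing folders, so the save above fails if `../data/stg/` does not exist yet. The sketch below creates the folder first and reads the file back as a quick check. It writes to a temporary directory with a made-up frame so it can run anywhere; the `stg/train_stg.csv` layout simply mirrors the path used above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up single-row frame standing in for the cleaned data.
toy = pd.DataFrame({"PassengerId": ["0001_01"], "Age": [39.0], "Transported": [False]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "stg" / "train_stg.csv"
    # Create the staging folder (and any parents) if it is not there yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)
    # Read the file back to confirm the columns and rows survived the round trip.
    reloaded = pd.read_csv(out_path)

print(reloaded)
```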