# **Titanic: Machine Learning from Disaster - Part 2** #

## **3. Data Wrangling** ##

Data wrangling is the process of cleaning missing or ill-formatted data and converting it into a form that lends itself more easily to analysis. First, let's take an overview of the features.

In [3]:
import os

import numpy as np
import pandas as pd

from helpers.settings import DATA_DIR, OUT_DIR, FIG_DIR


train = pd.read_csv(os.path.join(DATA_DIR, 'train.csv'))
test = pd.read_csv(os.path.join(DATA_DIR, 'test.csv'))

train.info()
print('_'*40)
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null

We can drop some columns right away: "PassengerId", "Name", "Ticket", and "Embarked". We treat them as unnecessary for the analysis and prediction that follow.

In [2]:
train.drop(['PassengerId','Name','Ticket', 'Embarked'], axis=1, inplace = True)
test.drop(['PassengerId','Name','Ticket', 'Embarked'], axis=1, inplace = True)
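As a non-mutating variant of the cell above, the same drop can be written as a small function that returns a new frame. This is only a sketch: `drop_unused` and `DROP_COLUMNS` are hypothetical names, and the toy frame is made up for illustration.

```python
import pandas as pd

# Columns this notebook treats as unnecessary (same list as the cell above).
DROP_COLUMNS = ["PassengerId", "Name", "Ticket", "Embarked"]


def drop_unused(df: pd.DataFrame, columns=DROP_COLUMNS) -> pd.DataFrame:
    """Return a copy of df without the listed columns (hypothetical helper)."""
    # errors="ignore" skips names that are not present in the frame.
    return df.drop(columns=columns, errors="ignore")


# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "PassengerId": [1, 2],
    "Name": ["A", "B"],
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
})

trimmed = drop_unused(toy)
print(list(trimmed.columns))
```

Because the function returns a new frame, the original stays available if a dropped column turns out to be needed later.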

Next, we check if there are any missing values in our train and test data.

In [3]:
train.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin       687
dtype: int64

In [4]:
test.isnull().sum()

Pclass      0
Sex         0
Age        86
SibSp       0
Parch       0
Fare        1
Cabin     327
dtype: int64

The "Cabin" feature is missing for most passengers in both sets (687 of 891 rows in train, about 77%, and 327 of 418 rows in test, about 78%), so we drop the feature entirely rather than try to impute it.
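The missing-value counts above are easier to compare across the two sets when expressed as fractions of the rows. A minimal sketch, assuming a hypothetical helper `missing_fraction` and a made-up toy frame in place of the real data:

```python
import numpy as np
import pandas as pd


def missing_fraction(df: pd.DataFrame) -> pd.Series:
    """Return the share of missing values per column, largest first."""
    # The mean of a boolean mask is the fraction of True (missing) entries.
    return df.isnull().mean().sort_values(ascending=False)


# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

fractions = missing_fraction(toy)
print(fractions)
```

Calling `missing_fraction(train)` and `missing_fraction(test)` on the real frames would put both sets on the same scale.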

In [5]:
train.drop("Cabin",axis=1,inplace=True)
test.drop("Cabin",axis=1,inplace=True)

For the numerical features, we fill the missing values with a summary statistic: the median for "Fare" and the mean for "Age".

In [6]:
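# Assign the result back rather than calling fillna(inplace=True) on a column
# selection, which newer pandas versions warn about and may not apply to the frame.
test["Fare"] = test["Fare"].fillna(test["Fare"].median())
train["Age"] = train["Age"].fillna(train["Age"].mean())
test["Age"] = test["Age"].fillna(test["Age"].mean())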
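The cell above fills each set from its own statistics. A common alternative, which is not what this notebook does, is to compute the fill values on the training set only and reuse them for the test set. A minimal sketch of that variant, assuming hypothetical helpers `fit_fill_values` and `apply_fill_values` and made-up toy data:

```python
import numpy as np
import pandas as pd


def fit_fill_values(train_df: pd.DataFrame) -> dict:
    """Compute fill values from the training frame only (hypothetical helper)."""
    return {
        "Age": train_df["Age"].mean(),
        "Fare": train_df["Fare"].median(),
    }


def apply_fill_values(df: pd.DataFrame, fill_values: dict) -> pd.DataFrame:
    """Return a copy of df with missing values replaced column by column."""
    return df.fillna(value=fill_values)


# Toy frames standing in for the real data (values are made up).
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0],
                          "Fare": [7.0, 8.0, 100.0, 9.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0],
                         "Fare": [np.nan, 12.0]})

fill_values = fit_fill_values(toy_train)
train_filled = apply_fill_values(toy_train, fill_values)
test_filled = apply_fill_values(toy_test, fill_values)
print(test_filled)
```

Both approaches leave no missing values; they differ only in which rows the statistic is computed from.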

After cleaning the data, we export it to new .csv files for further analysis.

In [7]:
train.to_csv(os.path.join(DATA_DIR, 'train_cleaned.csv'), index = False)
test.to_csv(os.path.join(DATA_DIR, 'test_cleaned.csv'), index = False)
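Before moving on, it can be worth reading an exported file back and confirming that no missing values remain. The sketch below illustrates the idea on a made-up frame written to a temporary directory, so it does not touch `DATA_DIR`; the column values are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned training data (values are made up).
cleaned = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_cleaned.csv")
    cleaned.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Total count of missing cells and a check that the columns survived the round trip.
remaining_missing = int(reloaded.isnull().sum().sum())
same_columns = list(reloaded.columns) == list(cleaned.columns)
print(remaining_missing, same_columns)
```

The same two checks could be run on the real `train_cleaned.csv` and `test_cleaned.csv` after the export.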

In the next step we will begin exploratory analysis.