# Data Preparation

Dataset: <a href="https://www.kaggle.com/c/titanic/data">Kaggle Titanic Data</a><br>
Filename: Titanic_train.csv


## Import Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline

In [2]:
# Set Options for display
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100
pd.options.display.float_format = '{:.2f}'.format

# Suppress warnings (note: this also hides deprecation notices)
import warnings
warnings.filterwarnings('ignore')

In [3]:
from scipy.stats import norm
from scipy import stats

________

# Part 1 - All Features

## Load the Dataset
* Specify the parameters (file path, index column)
* Check for date-time columns so that their dates can be parsed
* Check the encoding if the file does not load correctly

In [4]:
df = pd.read_csv("./train.csv")
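
The loading parameters listed above can be sketched as follows. This is only an illustration: the inline CSV text and the `Boarded` date column are hypothetical stand-ins so the snippet runs without the Kaggle file (the Titanic data itself has no date column), and `utf-8` is an assumed encoding.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
csv_bytes = (
    "PassengerId,Survived,Boarded\n"
    "1,0,1912-04-10\n"
    "2,1,1912-04-10\n"
    "3,1,1912-04-11\n"
).encode("utf-8")

sample = pd.read_csv(
    io.BytesIO(csv_bytes),     # in practice, the path to the CSV file
    index_col="PassengerId",   # use this column as the row index
    parse_dates=["Boarded"],   # parse this column as datetime
    encoding="utf-8",          # try another encoding if loading fails
)
print(sample.dtypes)
```

Passing `index_col` at load time avoids a separate `set_index` step later.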

View the Dataset

In [5]:
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.28,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.92,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Check the Shape

In [6]:
df.shape

(891, 12)

Set the correct index

In [7]:
# set_index returns a new DataFrame; df itself is unchanged unless the result is assigned
df.set_index('PassengerId')

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.00,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.00,1,0,PC 17599,71.28,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.00,0,0,STON/O2. 3101282,7.92,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.00,1,0,113803,53.10,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.00,0,0,373450,8.05,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.00,0,0,211536,13.00,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.00,0,0,112053,30.00,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.00,0,0,111369,30.00,C148,C


## Ensure Columns / Features have Proper Labels

In [10]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [11]:
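# Rename columns to descriptive labels
# (rename returns a new DataFrame; df is unchanged unless the result is assigned)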
proper_label = {'Pclass':'Passenger_class', 'SibSp':'Siblings_Spouse_Aboard', 'Parch':'Parents_Children_Aboard'}

df.rename(columns=proper_label)

,PassengerId,Survived,Passenger_class,Name,Sex,Age,Siblings_Spouse_Aboard,Parents_Children_Aboard,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.00,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.00,1,0,PC 17599,71.28,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.00,0,0,STON/O2. 3101282,7.92,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.00,1,0,113803,53.10,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.00,0,0,373450,8.05,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.00,0,0,211536,13.00,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.00,0,0,112053,30.00,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.00,0,0,111369,30.00,C148,C


## Ensure Correct Format of Values

In [12]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

## Remove Duplicates

Check if Index is duplicated

In [14]:
df.index.duplicated().sum()

0

Check if there are duplicated rows


In [15]:
df[df.duplicated()]

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked


Remove the duplicates if any

In [16]:
df.drop_duplicates(inplace=True)

## Handle Missing Data

Check for missing data

In [19]:
# Count the missing values in each column
total = df.isnull().sum().sort_values(ascending=False)


In [20]:
# Fraction of missing values in each column (0 to 1, not a percentage)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)

In [21]:
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

missing_data.head(20)

,Total,Percent
Cabin,687,0.77
Age,177,0.2
Embarked,2,0.0
Survived,0,0.0
Pclass,0,0.0
Name,0,0.0
Sex,0,0.0
SibSp,0,0.0
Parch,0,0.0
Ticket,0,0.0


Handle the columns with missing values
* For numerical values, fill in with either the mean or the median
* For categorical values, fill in with a placeholder that represents unknown values
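
The two rules above might look like this in practice. It is a minimal sketch on hypothetical toy rows (only the column names come from the Titanic data); the choice of the median over the mean and the label `"Unknown"` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; column names follow the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", None, "S"],
})

# Numerical column: fill with the median, which is less sensitive
# to outliers than the mean.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical columns: fill with an explicit "Unknown" label.
for col in ["Cabin", "Embarked"]:
    toy[col] = toy[col].fillna("Unknown")

print(toy.isnull().sum())
```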

## Check the Distribution of the Target Variable
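
One way to inspect the target is to compare absolute counts with class shares. The sketch below uses a hypothetical ten-value series; in the notebook the input would be `df["Survived"]`.

```python
import pandas as pd

# Hypothetical toy target standing in for df["Survived"].
survived = pd.Series([0, 1, 1, 1, 0, 0, 0, 1, 0, 0], name="Survived")

counts = survived.value_counts()                # absolute class counts
shares = survived.value_counts(normalize=True)  # class proportions

print(pd.concat([counts, shares], axis=1, keys=["Count", "Share"]))
```

A strongly unbalanced share would be a reason to look at metrics other than accuracy later on.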

## Split into Numerical, Categorical, and Target
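
A possible way to do the split is to set the target aside first and then separate the remaining columns by dtype. The toy frame is hypothetical, and the variable names `target`, `numerical`, and `categorical` are assumptions, not names used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the Titanic data.
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

target = toy["Survived"]                                 # what we want to predict
features = toy.drop(columns="Survived")                  # everything else
numerical = features.select_dtypes(include="number")     # int and float columns
categorical = features.select_dtypes(exclude="number")   # text columns

print(list(numerical.columns), list(categorical.columns))
```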

_______

# Part 2 - Numerical Features

Get the statistics for numerical data
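
For numerical columns, `describe()` returns the count, mean, standard deviation, minimum, quartiles, and maximum. A short sketch on four hypothetical rows:

```python
import pandas as pd

# Hypothetical toy rows; column names follow the Titanic data.
numerical = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

summary = numerical.describe()
print(summary)
```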

Analyze the following features:
1. Pclass
    1. Get Value Counts
2. Age
    1. Plot the distribution
    2. Create a Boxplot
3. Fare
    1. Plot the distribution
    2. Create a Boxplot

*For now, do not remove any outliers*
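
The analysis steps above could be sketched as follows. The toy rows are hypothetical, and plain matplotlib is used here as one option; seaborn (imported earlier) would work as well. Missing ages are dropped only for plotting, and no outliers are removed.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical toy rows; column names follow the Titanic data.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Age": [22.0, 38.0, 26.0, 35.0, np.nan, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 13.00],
})

# Pclass: value counts, ordered by class.
pclass_counts = toy["Pclass"].value_counts().sort_index()
print(pclass_counts)

# Age and Fare: one histogram and one boxplot each.
fig, axes = plt.subplots(2, 2, figsize=(10, 6))
for row, col in enumerate(["Age", "Fare"]):
    values = toy[col].dropna().to_numpy()  # plotting only; data is untouched
    axes[row, 0].hist(values, bins=10)
    axes[row, 0].set_title(f"{col} distribution")
    axes[row, 1].boxplot(values)
    axes[row, 1].set_title(f"{col} boxplot")
fig.tight_layout()
```

With `%matplotlib inline` the figure is rendered when the cell finishes.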

## Feature Scaling 

Check Scale of Features

Use the MinMax Scaler to scale the numerical features
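
A minimal sketch of both steps, assuming scikit-learn is installed and that "MinMax Scaler" refers to `sklearn.preprocessing.MinMaxScaler`. The toy values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy rows; column names follow the Titanic data.
numerical = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Check the scale: the columns span very different ranges.
print(numerical.agg(["min", "max"]))

# Rescale each column to [0, 1] via (x - min) / (max - min).
scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(numerical),
    columns=numerical.columns,
    index=numerical.index,
)
print(scaled.agg(["min", "max"]))
```

Wrapping the result back into a DataFrame keeps the column names and index, which makes recombining the pieces easier.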

______

# Part 3 - Categorical Features

Get the statistics for the categorical features
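
For text columns, `describe()` reports the count, the number of unique values, the most frequent value, and its frequency. A sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical toy rows; column names follow the Titanic data.
categorical = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S", "S"],
})

summary = categorical.describe()
print(summary)
```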

Remove any categorical features that do not add value to the model.<br>
In this case, these are the following features:
* Name = name of the passenger
* Ticket = ticket number
* Cabin = cabin number
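
Dropping the three columns listed above might look like this. The two sample rows reuse values shown earlier in this notebook; the variable name `reduced` is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Sex": ["male", "female"],
    "Ticket": ["A/5 21171", "STON/O2. 3101282"],
    "Cabin": [None, None],
    "Embarked": ["S", "S"],
})

# drop returns a new DataFrame, so the result is assigned.
reduced = toy.drop(columns=["Name", "Ticket", "Cabin"])
print(reduced.columns.tolist())
```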

Print categorical values for each feature
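
A short sketch for listing the distinct values per categorical column, on hypothetical rows. Missing values are skipped here; that is a choice made for this example.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the Titanic data.
categorical = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", None],
})

# Collect the distinct non-missing values of each column, in order of appearance.
unique_values = {
    col: categorical[col].dropna().unique().tolist()
    for col in categorical.columns
}

for col, values in unique_values.items():
    print(f"{col}: {values}")
```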

Convert Categories to Numbers
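
One common option is one-hot encoding with `pd.get_dummies`; the notebook does not say which encoding is intended, so treat this as an assumption. The toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the Titanic data.
categorical = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(categorical, columns=["Sex", "Embarked"], dtype=int)
print(encoded)
```

For a two-valued feature such as `Sex`, a single 0/1 column would carry the same information.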

_______

# Combine all the prepared dataframes
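
A sketch of the combination step, assuming the prepared pieces share the same index. The three small inputs and the name `final_df` are hypothetical.

```python
import pandas as pd

# Hypothetical prepared pieces that share the same index.
idx = [1, 2, 3]
numerical = pd.DataFrame({"Age": [0.0, 1.0, 0.25]}, index=idx)
categorical = pd.DataFrame({"Sex_male": [1, 0, 0]}, index=idx)
target = pd.Series([0, 1, 1], index=idx, name="Survived")

# axis=1 places the pieces side by side, aligning rows on the index.
final_df = pd.concat([numerical, categorical, target], axis=1)
print(final_df)
```

Because `concat` aligns on the index, pieces whose indexes differ would produce missing values rather than an error.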

## Save final DataFrame as a csv file

### Check if it loads correctly
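
Saving and reloading can be sketched as a round trip. The file name `titanic_prepared.csv`, the temporary directory, and the toy frame are hypothetical; in the notebook the frame would be the combined DataFrame.

```python
import os
import tempfile

import pandas as pd

# Hypothetical prepared frame with PassengerId as the index.
final_df = pd.DataFrame(
    {"Age": [0.0, 1.0, 0.25], "Sex_male": [1, 0, 0], "Survived": [0, 1, 1]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "titanic_prepared.csv")  # hypothetical name
    final_df.to_csv(out_path)  # the index is written as the first column
    reloaded = pd.read_csv(out_path, index_col="PassengerId")

# The reloaded frame should match the original in shape and columns.
print(reloaded.shape == final_df.shape)
print(list(reloaded.columns) == list(final_df.columns))
```

Reading back with the same `index_col` prevents the saved index from reappearing as an extra unnamed column.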