# Pipeline 1
This is the simplest pipeline. We fill missing numeric values with the column's mean, fill missing `Embarked` values with the most frequent port, and drop a few columns. These columns seem to be the least related to survival; that may not be true, and we will test it in another pipeline. The dropped columns are:
- PassengerId
- Name
- Ticket
- Cabin

## Data Dictionary

| Variable | Definition                     | Key                                            |
|----------|--------------------------------|------------------------------------------------|
| survival | Survival                       | 0 = No, 1 = Yes                                |
| pclass   | Ticket Class                   | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex      | Sex                            |                                                |
| Age      | Age in years                   |                                                |
| sibsp    | # of siblings / spouses aboard |                                                |
| parch    | # of parents / children aboard |                                                |
| ticket   | Ticket Number                  |                                                |
| fare     | Passenger Fare                 |                                                |
| cabin    | Cabin Number                   |                                                |
| embarked | Port of Embarkation            | C = Cherbourg, Q = Queenstown, S = Southampton |

## Variable Notes
**pclass:** A proxy for socio-economic status (SES)
- 1st = Upper
- 2nd = Middle
- 3rd = Lower

**age:** Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
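
The note on estimated ages can be turned into a simple check. The following is a minimal sketch on toy values; `flag_estimated_age` is a hypothetical helper, not part of this pipeline, and it assumes that any age of at least 1 ending in .5 is an estimate.

```python
import pandas as pd

def flag_estimated_age(age: pd.Series) -> pd.Series:
    """Return True where an age looks estimated (xx.5 form, at least 1 year)."""
    # Ages under 1 are legitimately fractional, so they are excluded.
    # Missing ages compare as False and are therefore not flagged.
    return (age >= 1) & (age % 1 == 0.5)

# Toy ages, not taken from the dataset
ages = pd.Series([22.0, 34.5, 0.42, 0.5, None])
estimated = flag_estimated_age(ages)
```

Only the second value (34.5) is flagged here; the infant ages and the missing value are not.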

**sibsp:** The dataset defines family relations in this way:
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)

**parch:** The dataset defines family relations in this way:
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.

## Import the Data

In [1]:
import numpy as np
import pandas as pd

trainData = pd.read_csv("../data/raw/train.csv")
trainData.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [2]:
testData = pd.read_csv("../data/raw/test.csv")
testData.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


## Prepare the data

### Drop the columns that we will not use

In [3]:
datasets = [trainData, testData]

# Columns assumed to be the least related to survival in this pipeline
columnsToDrop = ["PassengerId", "Name", "Ticket", "Cabin"]

for dataset in datasets:
    dataset.drop(columns=columnsToDrop, inplace=True)

In [4]:
trainData.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


### Handle missing values

In [5]:
missingData = []

for dataset in datasets:
    # Number of missing values per column, largest first
    totalMissing = dataset.isnull().sum().sort_values(ascending=False)
    # Share of missing values per column, as a percentage rounded to one decimal
    percent = dataset.isnull().sum() / dataset.isnull().count() * 100
    percent = round(percent, 1).sort_values(ascending=False)

    missingData.append(pd.concat([totalMissing, percent], axis=1, keys=["Total", "%"]))
    
pd.concat(missingData, axis=1, keys=["Training Data", "Testing Data"])

Unnamed: 0_level_0,Training Data,Training Data,Testing Data,Testing Data
Unnamed: 0_level_1,Total,%,Total,%
Age,177,19.9,86.0,20.6
Embarked,2,0.2,0.0,0.0
Survived,0,0.0,,
Pclass,0,0.0,0.0,0.0
Sex,0,0.0,0.0,0.0
SibSp,0,0.0,0.0,0.0
Parch,0,0.0,0.0,0.0
Fare,0,0.0,1.0,0.2
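
Because the same summary is computed again after imputation, it could be wrapped in a function. This is a minimal sketch under that assumption; `missing_summary` and the toy frames are hypothetical names used only for illustration.

```python
import pandas as pd

def missing_summary(frames: dict) -> pd.DataFrame:
    """Build a per-column table of missing counts and percentages.

    `frames` maps a label (e.g. "Training Data") to a DataFrame.
    """
    tables = []
    for frame in frames.values():
        total = frame.isnull().sum()
        # isnull().mean() is the fraction of missing rows per column
        percent = (frame.isnull().mean() * 100).round(1)
        tables.append(pd.concat([total, percent], axis=1, keys=["Total", "%"]))
    # Columns absent from one frame (e.g. Survived in the test set) become NaN
    return pd.concat(tables, axis=1, keys=list(frames.keys()))

# Toy frames standing in for trainData and testData
toy_train = pd.DataFrame({"Survived": [0, 1, 1, 0], "Age": [22.0, None, 26.0, None]})
toy_test = pd.DataFrame({"Age": [34.5, 47.0, None, 27.0]})

summary = missing_summary({"Training Data": toy_train, "Testing Data": toy_test})
```

Calling the function once before and once after filling missing values would avoid repeating the loop.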


In [6]:
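for dataset in datasets:
    meanAge = dataset["Age"].mean()
    # mode() can return several values on a tie; take the first one
    mostFrequentEmbarked = dataset["Embarked"].mode().iloc[0]
    meanFare = dataset["Fare"].mean()

    # Assign the result back instead of calling fillna(inplace=True) on a
    # column selection: recent pandas versions warn about that pattern, and
    # it does not update the DataFrame under Copy-on-Write
    dataset["Age"] = dataset["Age"].fillna(meanAge)
    dataset["Embarked"] = dataset["Embarked"].fillna(mostFrequentEmbarked)
    dataset["Fare"] = dataset["Fare"].fillna(meanFare)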
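
The cell above computes the fill values from each dataset separately, so the test set is filled with its own statistics. A common alternative is to learn the fill values from the training set only and reuse them on the test set. The sketch below shows that variant on toy data; `fit_fill_values` and `apply_fill_values` are hypothetical helpers, not what this pipeline runs.

```python
import pandas as pd

def fit_fill_values(train: pd.DataFrame) -> dict:
    """Learn one fill value per column from the training data only."""
    return {
        "Age": train["Age"].mean(),
        "Fare": train["Fare"].mean(),
        # mode() may return several values on a tie; take the first
        "Embarked": train["Embarked"].mode().iloc[0],
    }

def apply_fill_values(frame: pd.DataFrame, fill_values: dict) -> pd.DataFrame:
    """Return a copy of `frame` with missing values replaced column by column."""
    return frame.fillna(value=fill_values)

# Toy frames, not taken from the dataset
toy_train = pd.DataFrame({
    "Age": [20.0, 40.0, None],
    "Fare": [10.0, 30.0, 20.0],
    "Embarked": ["S", "S", None],
})
toy_test = pd.DataFrame({
    "Age": [None, 50.0],
    "Fare": [None, 8.0],
    "Embarked": ["Q", "C"],
})

fill_values = fit_fill_values(toy_train)
filled_train = apply_fill_values(toy_train, fill_values)
filled_test = apply_fill_values(toy_test, fill_values)
```

With this design the missing test age is filled with the training mean (30.0 in the toy data), so both sets are processed with the same constants.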

In [7]:
# Re-run the same summary to confirm that no missing values remain
missingData = []

for dataset in datasets:
    totalMissing = dataset.isnull().sum().sort_values(ascending=False)
    percent = dataset.isnull().sum() / dataset.isnull().count() * 100
    percent = round(percent, 1).sort_values(ascending=False)
    
    missingData.append(pd.concat([totalMissing, percent], axis=1, keys=["Total", "%"]))
    
pd.concat(missingData, axis=1, keys=["Training Data", "Testing Data"])

Unnamed: 0_level_0,Training Data,Training Data,Testing Data,Testing Data
Unnamed: 0_level_1,Total,%,Total,%
Survived,0,0.0,,
Pclass,0,0.0,0.0,0.0
Sex,0,0.0,0.0,0.0
Age,0,0.0,0.0,0.0
SibSp,0,0.0,0.0,0.0
Parch,0,0.0,0.0,0.0
Fare,0,0.0,0.0,0.0
Embarked,0,0.0,0.0,0.0


### Normalize values

In [8]:
# Divide each numeric feature by its maximum so that its largest value becomes 1.
# Each dataset is scaled by its own column maxima.
numericColumns = ["Pclass", "Age", "SibSp", "Parch", "Fare"]

for dataset in datasets:
    for column in numericColumns:
        dataset[column] = dataset[column] / dataset[column].max()

In [9]:
trainData.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,1.0,male,0.275,0.125,0.0,0.014151,S
1,1,0.333333,female,0.475,0.125,0.0,0.139136,C
2,1,1.0,female,0.325,0.0,0.0,0.015469,S
3,1,0.333333,female,0.4375,0.125,0.0,0.103644,S
4,0,1.0,male,0.4375,0.0,0.0,0.015713,S
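
The normalization step above divides each dataset by its own column maxima, so the same raw value can map to different scaled values in the training and test sets. One way to keep the two consistent is to take the maxima from the training set and apply them to both. The following is a sketch of that alternative on toy data; `fit_max` and `scale_by_max` are hypothetical names.

```python
import pandas as pd

def fit_max(train: pd.DataFrame, columns: list) -> pd.Series:
    """Return the per-column maximum of the training data."""
    return train[columns].max()

def scale_by_max(frame: pd.DataFrame, maxima: pd.Series) -> pd.DataFrame:
    """Return a copy of `frame` with the given columns divided by `maxima`."""
    columns = list(maxima.index)
    scaled = frame.copy()
    # Division aligns the Series index with the DataFrame column names
    scaled[columns] = frame[columns] / maxima
    return scaled

# Toy frames, not taken from the dataset
toy_train = pd.DataFrame({"Age": [20.0, 80.0], "Fare": [8.0, 512.0]})
toy_test = pd.DataFrame({"Age": [40.0], "Fare": [256.0]})

maxima = fit_max(toy_train, ["Age", "Fare"])
scaled_train = scale_by_max(toy_train, maxima)
scaled_test = scale_by_max(toy_test, maxima)
```

Note that a test value larger than the training maximum would scale to more than 1 under this scheme.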


### One-Hot Encode categorical features

In [10]:
trainData = pd.get_dummies(trainData, columns=["Embarked", "Sex"])
testData = pd.get_dummies(testData, columns=["Embarked", "Sex"])

In [11]:
trainData.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S,Sex_female,Sex_male
0,0,1.0,0.275,0.125,0.0,0.014151,False,False,True,False,True
1,1,0.333333,0.475,0.125,0.0,0.139136,True,False,False,True,False
2,1,1.0,0.325,0.0,0.0,0.015469,False,False,True,True,False
3,1,0.333333,0.4375,0.125,0.0,0.103644,False,False,True,True,False
4,0,1.0,0.4375,0.0,0.0,0.015713,False,False,True,False,True
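
Encoding the two datasets separately yields the same columns only if every category appears in both. If a category were missing from the test set, its dummy column would be absent. The sketch below, on toy data, shows one way to guard against that by reindexing the test frame to the training feature columns; the toy frames and variable names are assumptions for illustration.

```python
import pandas as pd

CATEGORICAL = ["Embarked", "Sex"]

# Toy frames: the test set contains no passengers from C or Q
toy_train = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Embarked": ["S", "C", "Q"],
    "Sex": ["male", "female", "male"],
})
toy_test = pd.DataFrame({
    "Embarked": ["S", "S"],
    "Sex": ["female", "male"],
})

train_encoded = pd.get_dummies(toy_train, columns=CATEGORICAL)
test_encoded = pd.get_dummies(toy_test, columns=CATEGORICAL)

# The target column exists only in the training data, so leave it out
feature_columns = train_encoded.columns.drop("Survived")

# Give the test frame the training feature columns; dummy columns for
# categories the test set never saw are added and filled with False
test_aligned = test_encoded.reindex(columns=feature_columns, fill_value=False)
```

After this step `test_aligned` has the same feature columns, in the same order, as the encoded training frame.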


## Save new Training and Testing Datasets to CSV

In [12]:
trainData.to_csv("../data/preprocessed/pipeline1/train.csv")
testData.to_csv("../data/preprocessed/pipeline1/test.csv")
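
By default `to_csv` also writes the row index as an unnamed first column, so reading these files back with `pd.read_csv` produces an extra `Unnamed: 0` column unless `index_col=0` is passed. The sketch below illustrates the difference with an in-memory buffer and a toy frame; whether to pass `index=False` here depends on how later notebooks load the files, which is not shown in this section.

```python
import io
import pandas as pd

# Toy frame, not taken from the dataset
frame = pd.DataFrame({"Survived": [0, 1], "Age": [0.275, 0.475]})

# Default: the row index is written as an unnamed first column
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded_default = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)
```

Here `reloaded_default` has the columns `Unnamed: 0`, `Survived`, `Age`, while `reloaded_clean` has only `Survived` and `Age`.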