## End Goal : Submit whether each passenger in a given set of PassengerIds survived or not
```
PassengerId, Survived
123,0
234,1
...
```
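
A minimal sketch of producing a file in that two-column layout with pandas. The helper name `build_submission` and the ids and predictions are hypothetical, used only for illustration; real predictions would come from a fitted model.

```python
import io
import pandas as pd

def build_submission(passenger_ids, predictions):
    """Return a two-column frame in the PassengerId,Survived layout."""
    return pd.DataFrame({
        "PassengerId": list(passenger_ids),
        "Survived": [int(p) for p in predictions],  # 0 = No, 1 = Yes
    })

# Hypothetical ids and predictions, for illustration only.
submission = build_submission([123, 234], [0, 1])

# Write to an in-memory buffer; pass a file path instead to write to disk.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing `index=False` keeps the default integer index out of the file, so only the two expected columns are written.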


### Data Dictionary

| **Variable** | **Definition**                             | **Key**                                        |
| ------------ | ------------------------------------------ | ---------------------------------------------- |
| survival     | Survival                                   | 0 = No, 1 = Yes                                |
| pclass       | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex          | Sex                                        |                                                |
| Age          | Age in years                               |                                                |
| sibsp        | # of siblings / spouses aboard the Titanic |                                                |
| parch        | # of parents / children aboard the Titanic |                                                |
| ticket       | Ticket number                              |                                                |
| fare         | Passenger fare                             |                                                |
| cabin        | Cabin number                               |                                                |
| embarked     | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

### Variable Notes

**pclass**: A proxy for socio-economic status (SES)
* 1st = Upper
* 2nd = Middle
* 3rd = Lower

**age**: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.

**sibsp**: The dataset defines family relations in this way:
* Sibling = brother, sister, stepbrother, stepsister
* Spouse = husband, wife (mistresses and fiancés were ignored)

**parch**: The dataset defines family relations in this way:
* Parent = mother, father
* Child = daughter, son, stepdaughter, stepson
* Some children travelled only with a nanny, therefore parch=0 for them.


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set()

In [None]:
sns.set_style('darkgrid')

In [None]:
df = pd.read_csv('data/train.csv', index_col='PassengerId')
df.head()

## Task 1 : Identify and Impute Missing Fields

In [None]:
df.info()
# Age has (891 - 714 = 177) missing values
# Cabin has (891 - 204 = 687) missing values

In [None]:
ser = df.isnull().sum()
ser = ser.loc[ser>0]
ser
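
The same count can be paired with the share of rows it represents, which makes it easier to judge whether a column is worth imputing. This is a sketch on a small made-up frame (`toy` and its values are illustrative, not taken from `train.csv`).

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the training frame.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count missing values per column and keep only columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0]

# Add the fraction of rows that are missing in each of those columns.
summary = pd.DataFrame({
    "missing": missing,
    "share": (missing / len(toy)).round(2),
})
print(summary)
```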

In [None]:
df.loc[df.Embarked.isnull()]  # Both passengers are female, hold 1st class tickets, paid a Fare of 80.0, and share Cabin B28.

In [None]:
# Are there other female 1st class passengers with Fare 80.0 whose cabin starts with B?
df.loc[(df.Sex=='female') & (df.Pclass==1) & (df.Fare==80.0)] # Returns the same 2 records, with Embarked as NaN.
df.loc[df.Fare==80.0] # Returns the same 2 records, with Embarked as NaN.
df.loc[df.Cabin=='B28'] # Returns the same 2 records, with Embarked as NaN.
df.loc[df.Ticket=='113572'] # Returns the same 2 records, with Embarked as NaN.

In [None]:
# Filling missing values with mode
mode = df.Embarked.mode().values[0]
# df[['Embarked']].fillna(mode, inplace=True) #Fails!
# df[['Embarked']] = df[['Embarked']].fillna(mode) #Works!
# df.loc[[62,830],'Embarked'] = [mode,mode] #Works!
df.fillna({'Embarked':mode}, inplace=True) # Works! This is nicer way. Use Dict {'coln_name': replacement_value}
df.isnull().sum()
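
A small sketch of why the first commented-out variant fails while the dict form works, run on a made-up frame (`toy` is illustrative). Selecting with a list of column names builds a new DataFrame, so filling that object does not touch the original.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S"]})
mode = toy["Embarked"].mode()[0]  # most frequent value

# toy[["Embarked"]] is a new DataFrame; filling it leaves `toy` unchanged.
subset = toy[["Embarked"]]
filled_subset = subset.fillna(mode)
still_missing = int(toy["Embarked"].isnull().sum())

# Filling through the parent frame with a {column: value} dict reaches the data.
toy = toy.fillna({"Embarked": mode})
print(still_missing, int(toy["Embarked"].isnull().sum()))
```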

In [None]:
# With the majority of Cabin values missing, this column can be dropped
df.drop(columns=['Cabin'], inplace=True)
df.isnull().sum()

In [None]:
# How to meaningfully impute Missing Ages???
df.Age.describe()

In [None]:
''' This function is extracted to a separate file and is imported after this snippet
def title(name):
    titles = ['Mr.','Mrs.','Miss.', 'Ms.', 'Master.', 'Major.', 'Mme.', 
              'Mlle.', 'Sir.', 'Lady.', 'Dr.', 'Rev.', 'Col.', 'Capt.', 'Don.', 'Dona.',
             'Countess.', 'Jonkheer.']
    for t in titles:
        if (t in name):
            return t
'''

from util.eda import title
# df.loc[df.Name.apply(title).isnull()].Name
df['Title'] = df.Name.apply(title)
df.Title.value_counts()
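
As an alternative sketch (not the `title` helper used above), the title can also be pulled out with a regular expression, which avoids maintaining a fixed list. It assumes the title is the first run of letters followed by a period, which holds for the sample names below but is an assumption about the data in general.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the first word that ends with a period, keeping the period
# so the values match the 'Mr.', 'Mrs.', ... style used by title().
titles = names.str.extract(r"([A-Za-z]+\.)", expand=False)
print(titles.tolist())
```

Names without any such pattern come back as NaN, which is comparable to `title` returning `None` when no listed title matches.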

In [None]:
# Select Age before aggregating so the mean is not attempted on text columns such as Name
age_map = df.groupby(by=['Title'])['Age'].mean().round()
# age_map.to_pickle('pickle/age_map.pkl')
age_map.to_csv('data/age_map.csv', index=True, header=True)
age_map

In [None]:
# age_map['Col.']
# df[['Title','Age']].fillna(??)

''' This function is extracted to a separate file
def guess_age(row):
#     print(type(row)) # <class 'pandas.core.series.Series'>
    t = row['Title']
    row['Age'] = age_map[t]
    return row
'''
from util.eda import guess_age # Importing the extracted function above

# df.loc[df.Age.isnull()].apply(guess_age, axis=1)
df.loc[df.Age.isnull()] = df.loc[df.Age.isnull()].apply(guess_age, axis=1)
df.isnull().sum()
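
The same per-title imputation can be sketched without a row-wise `apply`, using `groupby(...).transform`. This is an alternative on made-up data (`toy` is illustrative), not the `guess_age` helper itself.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Title": ["Mr.", "Mr.", "Mr.", "Miss.", "Miss.", "Miss."],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# Mean age of each Title group, aligned to the original rows.
title_mean = toy.groupby("Title")["Age"].transform("mean").round()

# Only the missing ages are replaced; known ages stay as they are.
toy["Age"] = toy["Age"].fillna(title_mean)
print(toy)
```

Because `transform` returns a Series aligned to the original index, `fillna` can match each missing row with the mean of its own group.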

In [None]:
# Data-Manipulation
df['GroupCount'] = df.SibSp + df.Parch  # relatives aboard: siblings/spouses + parents/children
df.head()

## Task 1 Completed Successfully!
* Added New Columns
  - 'Title'
  - 'GroupCount'
* Imputed missing values:
  - Cabin : dropped the column because too many values are missing
  - Age : guessed from the mean age of the Title group the person belongs to. See guess_age()
  - Embarked : filled the gap with the mode, 'S' in this case.

In [None]:
# Categorize Ages into groups: baby, child, teen, adult, aged
''' This method is extracted to a separate file
def age_group(age):
    if pd.isnull(age): return np.nan  # (age == np.nan) is always False, so test with pd.isnull
    elif (age<3): return 'baby'
    elif (age<12): return 'child'
    elif (age<20): return 'teen'
    elif (age<60): return 'adult'
    else: return 'aged'
'''
from util.eda import age_group
# bins = [0,3,17,59,120]
# df['AgeGroup'] = pd.cut(df['Age'], bins)
# df['AgeGroup'].value_counts()
df['AgeGroup'] = df.Age.map(age_group)
df['AgeGroup']
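
The binning can also be sketched with `pd.cut`. The bin edges below mirror the thresholds in `age_group` (3, 12, 20, 60) rather than the commented-out `bins` list, and the sample ages are made up.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.5, 3, 11, 12, 19, 20, 59, 60, np.nan])

# right=False makes each bin include its left edge and exclude its right edge,
# matching the strict "<" comparisons in age_group.
bins = [0, 3, 12, 20, 60, np.inf]
labels = ["baby", "child", "teen", "adult", "aged"]
groups = pd.cut(ages, bins=bins, labels=labels, right=False)
print(groups)
```

Missing ages stay missing in the result, so no separate NaN branch is needed.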

In [None]:
# ax = sns.regplot(x="Age", y="Fare", data=df)
# df.loc[df.Age.isnull()==False].groupby(by=['Embarked', 'Pclass', 'Fare']).mean()
# df[['RoundedFare']] = df['Fare'].rou #apply(lambda x: (x//10)*10).value_counts()
''' Extracted to a separate file
def rounded_fare(fare):
    #Rounds fare to nearest 10
    return round( (fare/10) ) * 10
'''
from util.eda import rounded_fare
df['RoundedFare'] = df.Fare.map(rounded_fare) # (df.Fare /10).round() * 10 #.value_counts()
df.RoundedFare.value_counts()
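# numeric_only=True skips text columns (Name, Sex, Ticket, Title) that cannot be averaged
tmp = df.loc[df.Age.notnull()].groupby(by=['Embarked', 'Pclass', 'AgeGroup']).mean(numeric_only=True)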
# tmp.loc[['S',3,61]]
# queryExpr = 'Pclass in [1,3]' # Works! 
queryExpr = '(Embarked=="S") & (Pclass==3) & (AgeGroup=="aged")'
tmp.query(queryExpr)
# tmp.loc[[('S',3,61)]] # Works!
# df.drop(columns=['RoundedFare'], inplace=True)
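
A sketch comparing the element-wise `map` with the vectorized form hinted at in the comment above, on made-up fares. Both Python's `round` and the pandas `round` send exact halves to the nearest even number, so a fare of 25.0 lands on 20 rather than 30.

```python
import pandas as pd

def rounded_fare(fare):
    # Rounds fare to the nearest 10 (exact halves go to the even multiple)
    return round(fare / 10) * 10

fares = pd.Series([7.25, 14.99, 25.0, 71.28])

mapped = fares.map(rounded_fare)       # one Python call per value
vectorized = (fares / 10).round() * 10  # whole-column arithmetic
print(vectorized.tolist())
```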

In [None]:
# How to display just Fare and RoundedFare columns
# Ref.: https://jeffdelaney.me/blog/useful-snippets-in-pandas/
# Ref.: http://pandas.pydata.org/pandas-docs/stable/advanced.html#cross-section
fare_pivot_table = tmp.loc[:,['Fare','RoundedFare']] # Works to show all indices
fare_pivot_table.to_pickle('pickle/fare_pivot_table.pkl')

In [None]:
tmp.loc[('S',1,['baby','aged']),:] # Show selective indices and all features/columns

In [None]:
tmp.loc[('S',1,['baby','aged']),['Fare','RoundedFare']]
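
A self-contained sketch of the two selection styles used above, `.loc` with a tuple over the index levels and `query` on the level names, on a made-up frame (`toy` and its values are illustrative).

```python
import pandas as pd

toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "C"],
    "Pclass": [1, 1, 3, 1],
    "AgeGroup": ["baby", "aged", "adult", "adult"],
    "Fare": [150.0, 30.0, 8.0, 80.0],
})

# Grouping by three keys yields a three-level MultiIndex.
table = toy.groupby(["Embarked", "Pclass", "AgeGroup"]).mean(numeric_only=True)

# One entry per index level; a list selects several labels on that level.
picked = table.loc[("S", 1, ["baby", "aged"]), ["Fare"]]

# query() can refer to index levels by name.
same = table.query('(Embarked == "S") & (Pclass == 1) & (AgeGroup in ["baby", "aged"])')
print(picked)
```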

## Task 2 : Identify Noise/Duplicate columns to be Dropped - Mark and Sweep
* Name
* Ticket
* Fare
* AgeGroup (Choosing Age feature for Modelling)
* Title

In [None]:
df.head()

In [None]:
df.Ticket.value_counts()

In [None]:
df.drop(columns=['Name','Ticket'], inplace=True) # Name and Ticket are just metadata, not features
df.head()

In [None]:
df.drop(columns=['Fare','AgeGroup','Title'],inplace=True)
df.head()

In [None]:
print(df.Sex.unique())
df.Sex.value_counts()

In [None]:
df['Embarked'].value_counts()

In [None]:
df.head()

## Task 3 : Feature Engineering (on second thought, not doing it here and now!)
* Category Columns
  - Sex
  - Embarked
  - AgeGroup

In [None]:
# pd.get_dummies(data=df, drop_first=True)
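
For reference, a sketch of what the commented-out call would do to the categorical columns, shown on a made-up frame (`toy` is illustrative). With `drop_first=True`, the first category of each encoded column is dropped to avoid a redundant indicator.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Age": [22.0, 38.0, 26.0],
})

# Encode only the listed columns; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)
print(encoded.columns.tolist())
```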

## Final Task : Persist processed train-dataset

In [None]:
df.to_csv('data/train_processed_1.csv')

In [None]:
# df.head()

## Ancillary EDA : Would the Cabin column have helped, had it not been dropped?

In [None]:
# df.loc[df['Cabin'] == 'B96 B98']

## Digging a bit on the domain understanding
* References : https://www.rmg.co.uk/discover/explore/rms-titanic-fact-sheet
    - Passengers could be 1st, 2nd, 3rd class or Crew. How is a Crew member represented in the dataset?
* Is there any relation between cabin layout and survival? See diagram below: <img src='images/titanic.jpg'/>
    - 3rd class cabins are at either end of the ship, bordered on the inside by 2nd class cabins. 1st class cabins are in the middle.
    - For a zoomable vector picture of the deck plan/layout see https://www.encyclopedia-titanica.org/titanic-deckplans/



In [None]:
# Split cabins as Cabin-Class, Cabin-Counts, Cabin-Start
'''
def split_cabins(df):
#     cdf = df['Cabin']
    for idx,row in df.iterrows():
        print(row['Cabin'])  # inspect a single row's value (needs the undropped Cabin column)
        break

split_cabins(df)        
'''
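
One possible sketch of the split described in the comment above. The column names `CabinClass`, `CabinCount` and `CabinStart`, and the reading of them as deck letter, number of cabins listed, and number of the first cabin, are assumptions made for illustration. It would need the original `Cabin` column, which was dropped earlier.

```python
import numpy as np
import pandas as pd

def split_cabin(cabin):
    """Split a raw Cabin value into deck letter, cabin count, first cabin number."""
    if pd.isnull(cabin):
        return {"CabinClass": np.nan, "CabinCount": 0, "CabinStart": np.nan}
    parts = cabin.split()   # 'B96 B98' -> ['B96', 'B98']
    digits = parts[0][1:]   # 'B96' -> '96'; a bare deck letter gives ''
    return {
        "CabinClass": parts[0][0],
        "CabinCount": len(parts),
        "CabinStart": int(digits) if digits.isdigit() else np.nan,
    }

# Sample values in the style of the Cabin column.
cabins = pd.Series(["B96 B98", "C85", np.nan, "F G73"], name="Cabin")
cabin_parts = pd.DataFrame([split_cabin(c) for c in cabins], index=cabins.index)
print(cabin_parts)
```

Entries such as `F G73`, where the first token has no number, are given a missing `CabinStart` here; another convention may suit the analysis better.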