In [None]:
# reference: https://www.kaggle.com/startupsci/titanic-data-science-solutions/notebook
'''
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
'''

In [None]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# acquire data
train_df = pd.read_csv('../common/dataset/titanic/train.csv')
test_df = pd.read_csv('../common/dataset/titanic/test.csv')
combine = [train_df, test_df]


In [None]:
train_df.head()

In [None]:
train_df.info()
# Cabin > Age > Embarked features contain a number of null values

In [None]:
test_df.head()
test_df.info()
# Cabin > Age > Fare contain null values in the test dataset

In [None]:
# print column names
print(train_df.columns) 

# preview the data
train_df.head()

In [None]:
train_df.tail()

In [None]:
# observe the numerical data
train_df.describe()

In [None]:
# observe the categorical data
train_df.describe(include=['O'])  # object (string) dtype columns

To confirm some of our observations and assumptions, we can quickly analyze feature correlations by pivoting features against each other. At this stage we can only do so for features that have no empty values. It also only makes sense for features that are categorical (Sex), ordinal (Pclass), or discrete (SibSp, Parch).

Pclass: We observe a significant correlation (>0.5) between Pclass=1 and Survived. We decide to include this feature in our model.

Sex: We confirm the observation made during problem definition that Sex=female had a very high survival rate of 74%.

SibSp and Parch: These features have zero correlation for certain values. It may be best to derive a feature or a set of features from these individual features.

In [None]:
train_df[["Pclass", "Survived"]].groupby(['Pclass'], as_index=False).mean().sort_values(by="Survived", ascending=False)

In [None]:
train_df[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by="Survived", ascending=False)

In [None]:
train_df[['SibSp', "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by="Survived", ascending=False)

In [None]:
train_df[['Parch', 'Survived']].groupby(['Parch'], as_index=False).mean().sort_values(by="Survived", ascending=False)

In [None]:
train_df.groupby(['Parch'], as_index=False).agg({'Survived':'mean', "Sex": 'count'}).sort_values(by = "Survived", ascending = False)

# train_df["Parch"] == 3
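
The pivots above report a mean survival rate per group, but a rate computed from only a handful of passengers can be misleading. A minimal sketch, using a small made-up table rather than the Titanic data, of reporting the group size next to the rate with named aggregation (the column names survival_rate and passengers are illustrative, not part of the dataset):

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic dataset
toy = pd.DataFrame({
    "Parch":    [0, 0, 0, 1, 1, 3],
    "Survived": [0, 1, 0, 1, 1, 1],
})

# One row per Parch value: mean of Survived plus the number of rows behind it
summary = (
    toy.groupby("Parch")
       .agg(survival_rate=("Survived", "mean"), passengers=("Survived", "size"))
       .reset_index()
       .sort_values(by="survival_rate", ascending=False)
)
print(summary)
```

Here the Parch=3 group has a perfect rate but rests on a single row, which is the kind of case the extra column is meant to expose.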

## Correlating numerical features
Let us start by understanding correlations between numerical features and our solution goal (Survived).

A histogram chart is useful for analyzing continuous numerical variables like Age, where banding or ranges help identify useful patterns. The histogram can indicate the distribution of samples using automatically defined bins or equally ranged bands. This helps us answer questions relating to specific bands (for example, did infants have a better survival rate?).
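
To make the idea of equally ranged bands concrete, here is a small sketch that computes the band edges and counts without drawing anything. The ages are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical ages, chosen only to illustrate the binning
ages = np.array([0.5, 2, 4, 18, 22, 25, 30, 35, 40, 80])

# Four equal-width bands spanning the minimum to the maximum age
counts, edges = np.histogram(ages, bins=4)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {count}")
```

This is the same banding that a histogram plot performs internally before it draws the bars.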

### Observations.

Infants (Age <=4) had high survival rate.

Oldest passengers (Age = 80) survived.

Large number of 15-25 year olds did not survive.

Most passengers are in 15-35 age range.

### Decisions.

This simple analysis confirms our assumptions as decisions for subsequent workflow stages.

We should consider Age in our model training.

Complete the Age feature for null values.

We should band age groups.

In [None]:

axes1 = plt.subplot(1, 2, 1)
axes1.hist(train_df.Age[train_df.Survived == 0].dropna(), bins=20)
axes1.set_title("survived = 0")

axes2 = plt.subplot(1, 2, 2)
axes2.set_ylim(0, 50)
axes2.set_title("survived = 1")
axes2.hist(train_df.Age[train_df.Survived == 1].dropna(), bins=20)

## Correlating numerical and ordinal features

We can combine multiple features for identifying correlations using a single plot. This can be done with numerical and categorical features which have numeric values.

### Observations.

Pclass=3 had the most passengers; however, most did not survive. Confirms our classifying assumption.

Infant passengers in Pclass=2 and Pclass=3 mostly survived. Further qualifies our classifying assumption.

Most passengers in Pclass=1 survived. Confirms our classifying assumption.

Pclass varies in terms of Age distribution of passengers.

### Decisions.

Consider Pclass for model training.

In [None]:
fig = plt.figure(figsize=(12, 6))
survived = [0, 1]
pclass = [1, 2, 3]
tuples = [(y, x) for x in pclass for y in survived]
print(tuples)

for i, value in enumerate(tuples):
    survived, pclass = value[0], value[1]
    axes = plt.subplot(3, 2, i+1)
    axes.hist(train_df.Age[np.logical_and(train_df.Survived == survived, train_df.Pclass == pclass)].dropna(), bins=20)
    axes.set_title(f"survived = {survived} |pclass = {pclass}")
    axes.set_ylim(0, 40)
plt.tight_layout()

## Correlating categorical features
Now we can correlate categorical features with our solution goal.

### Observations.

Female passengers had much better survival rate than males. Confirms classifying.
    
Exception in Embarked=C where males had higher survival rate. This could be a correlation between Pclass and Embarked and in turn Pclass and Survived, not necessarily direct correlation between Embarked and Survived.
    
Males had better survival rate in Pclass=3 when compared with Pclass=2 for C and Q ports. Completing.
    
Ports of embarkation have varying survival rates for Pclass=3 and among male passengers. Correlating.
    
### Decisions.

Add Sex feature to model training.
    
Complete and add Embarked feature to model training.

In [None]:
train_df.head()

In [None]:
fig = plt.figure(figsize=(6, 12))
embarked = ['S', 'C', 'Q']

for i, port in enumerate(embarked):
    axes = plt.subplot(3, 1, i+1)
    at_port = train_df[train_df["Embarked"] == port]
    # select Survived before averaging so that non-numeric columns are never aggregated
    at_port[at_port["Sex"] == "male"].groupby("Pclass")["Survived"].mean().plot()
    at_port[at_port["Sex"] == "female"].groupby("Pclass")["Survived"].mean().plot()
    axes.set_title(f"Embarked = {port}")
    axes.set_xticks([1, 2, 3])
    axes.legend(["male", "female"], fontsize=12, loc="upper right")
plt.tight_layout()

## Correlating categorical and numerical features
We may also want to correlate categorical features (with non-numeric values) and numeric features. We can consider correlating Embarked (Categorical non-numeric), Sex (Categorical non-numeric), Fare (Numeric continuous), with Survived (Categorical numeric).

### Observations.

Higher fare paying passengers had better survival. Confirms our assumption for creating fare ranges.

Port of embarkation correlates with survival rates. Confirms correlating and completing.
    
### Decisions.

Consider banding Fare feature.

In [None]:
fig = plt.figure(figsize=(12, 6))
survived = [0, 1]
embarked = ['S', 'C', 'Q']
tuples = [(y, x) for x in embarked for y in survived]
print(tuples)

for i, value in enumerate(tuples):
    survived, embarked = value[0], value[1]
    axes = plt.subplot(3, 2, i+1)
    train_df[np.logical_and(train_df.Survived == survived, train_df.Embarked == embarked)].groupby(["Sex"])['Fare'].mean().plot(kind="bar")
    axes.set_title(f"survived = {survived} |Embarked = {embarked}")
    axes.set_ylim(0, 80)
    plt.xticks(rotation='horizontal')
plt.tight_layout()

We want to drop the Cabin and Ticket features.

In [None]:
print("Before", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)

train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
combine = [train_df, test_df]

print("After", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)

## Creating new feature extracting from existing
We want to analyze if Name feature can be engineered to extract titles and test correlation between titles and survival, before dropping Name and PassengerId features.

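In the following code we extract the Title feature using regular expressions. The RegEx pattern ' ([A-Za-z]+)\.' matches the first run of letters within the Name feature that is preceded by a space and followed by a dot character. Because the pattern has a single capture group, the expand=False flag returns a Series rather than a DataFrame.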
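
A minimal sketch of the extraction on three illustrative names, written in the "Surname, Title. Given names" layout of the Name feature. It also prints the type of the result, so the effect of expand=False with a single capture group can be checked directly.

```python
import pandas as pd

# Illustrative names in the layout used by the Name feature
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
])

# A space, then one or more letters (captured), then a literal dot
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

print(type(titles).__name__)
print(titles.tolist())
```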

### Observations.

When we plot Title, Age, and Survived, we note the following observations.

Most titles band Age groups accurately. For example: Master title has Age mean of 5 years.

Survival among Title Age bands varies slightly.

Certain titles mostly survived (Mme, Lady, Sir) or did not (Don, Rev, Jonkheer).

### Decision.

We decide to retain the new Title feature for model training.

In [None]:
# Creating new feature extracting from existing

for dataset in combine:
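    # raw string so that the backslash in the pattern is not treated as a Python escape
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
pd.crosstab(train_df['Title'], train_df['Sex'])  # pattern: a space, one or more letters (captured), then a dot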

We can replace many titles with a more common name or classify them as Rare.

In [None]:
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

We can convert the categorical titles to ordinal.

In [None]:
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

train_df.head()
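
The map followed by fillna(0) above means that any title missing from title_mapping ends up as 0. A small sketch of that behaviour, where "Unknown" is a hypothetical label used only to trigger the fallback:

```python
import pandas as pd

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

# "Unknown" is a hypothetical label that is not in the mapping
titles = pd.Series(["Mr", "Rare", "Unknown", None])

# Unmapped and missing titles become NaN after map, then 0 after fillna
codes = titles.map(title_mapping).fillna(0).astype(int)
print(codes.tolist())
```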

We can now safely drop the Name feature from both datasets. We also do not need the PassengerId feature in the training dataset.

In [None]:
train_df = train_df.drop(['Name', 'PassengerId'], axis=1)
test_df = test_df.drop(['Name'], axis=1)
combine = [train_df, test_df]
train_df.shape, test_df.shape

### Converting a categorical feature

In [None]:
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map({'female': 1, 'male': 0}).astype(int)

train_df.head()

Let us start by preparing an empty array to contain guessed Age values for each Sex x Pclass combination. The array has shape (2, 3): one row per Sex value and one column per Pclass value.

In [None]:
guess_ages = np.zeros((2, 3))
guess_ages

Now we iterate over Sex (0 or 1) and Pclass (1, 2, 3) to calculate guessed values of Age for the six combinations.
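
The cell below rounds each guessed age to the nearest 0.5 with the expression int(age_guess / 0.5 + 0.5) * 0.5. A short worked sketch of that arithmetic, wrapped in a helper named round_to_half (a name introduced here for illustration only):

```python
def round_to_half(age):
    """Round a non-negative age to the nearest multiple of 0.5."""
    # Dividing by 0.5 counts half-years; adding 0.5 and truncating rounds that count
    return int(age / 0.5 + 0.5) * 0.5

for age in (21.2, 21.3, 28.0):
    print(age, "->", round_to_half(age))
```

Because int() truncates toward zero, the expression only rounds correctly for non-negative inputs, which is always the case for ages.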

In [None]:
for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) & (dataset['Pclass'] == j+1)]['Age'].dropna()
            age_guess = guess_df.median()
            # round the median age to the nearest 0.5
            guess_ages[i, j] = int(age_guess / 0.5 + 0.5) * 0.5
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[(dataset.Age.isnull() & (dataset.Sex == i)) & (dataset.Pclass == j+1), 'Age'] = guess_ages[i, j]
    dataset['Age'] = dataset['Age'].astype(int)

train_df.head()
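
As a point of comparison, the same per-group median fill can be sketched without explicit loops by using groupby with transform. This is an alternative formulation, not the approach taken in the cell above, and it skips the rounding to the nearest 0.5. The table is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Age in each Sex x Pclass group
toy = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [40.0, 50.0, np.nan, 20.0, 23.0, np.nan],
})

# Median Age of each row's own Sex x Pclass group, aligned to the original index
group_median = toy.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Fill only the missing ages with their group's median
toy["Age"] = toy["Age"].fillna(group_median)
print(toy)
```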

Let us create Age bands and determine correlations with Survived.

In [None]:
train_df['AgeBand'] = pd.cut(train_df['Age'], 5)  # cut into 5 equal-width bands
train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)

Let us replace Age with ordinals based on these bands.

In [None]:
for dataset in combine:
    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[dataset['Age'] > 64, 'Age'] = 4
train_df.head()
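
The band edges above are typed in by hand from the AgeBand output. One possible way to avoid copying the numbers, sketched here on made-up ages, is to let pd.cut return the edges and reuse them. The variable names train_age and test_age are stand-ins and not columns of the notebook's DataFrames.

```python
import pandas as pd

# Hypothetical ages standing in for the training and test Age columns
train_age = pd.Series([0, 10, 20, 40, 60, 80])
test_age = pd.Series([5, 33, 70])

# Learn five equal-width bands on the training ages and keep the edges
train_codes, edges = pd.cut(train_age, 5, labels=False, retbins=True)

# Reuse the same edges on the test ages; values outside the edges would become NaN
test_codes = pd.cut(test_age, bins=edges, labels=False)

print(train_codes.tolist())
print(test_codes.tolist())
```

With labels=False the result is already the ordinal band index, so no separate replacement step is needed.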

We can now remove the AgeBand feature.

In [None]:
train_df = train_df.drop(['AgeBand'], axis=1)
combine = [train_df, test_df]
train_df.head()

We can create a new feature for FamilySize which combines Parch and SibSp. This will enable us to drop Parch and SibSp from our datasets.

In [None]:
for dataset in combine:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)

We can create another feature called IsAlone.

In [None]:
for dataset in combine:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()
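
The two-step assignment of IsAlone can also be written as a single boolean comparison. A minimal sketch on made-up SibSp and Parch values:

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic dataset
toy = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

# The passenger counts as one member of their own family
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# True where the passenger travels alone, converted to 1/0
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)
print(toy)
```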

Let us drop Parch, SibSp, and FamilySize features in favor of IsAlone.

In [None]:
train_df = train_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
test_df = test_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
combine = [train_df, test_df]

train_df.head()

We can also create an artificial feature combining Pclass and Age.

In [None]:
for dataset in combine:
    dataset['Age*Class'] = dataset.Age * dataset.Pclass

train_df.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)

### Completing a categorical feature
The Embarked feature takes S, Q, C values based on port of embarkation. Our training dataset has two missing values. We simply fill these with the most common occurrence.

In [None]:
freq_port = train_df.Embarked.dropna().mode()[0]
freq_port

In [None]:
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)

train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)
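
A small sketch of the same fill on a made-up Series, showing that mode() skips missing values on its own:

```python
import pandas as pd

# Hypothetical port values with two gaps
embarked = pd.Series(["S", "C", None, "S", "Q", None])

# mode() ignores missing values by default and returns a Series of the most frequent values
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)

print(most_common)
print(filled.tolist())
```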

### Converting categorical feature to numeric
We can now convert the Embarked feature to numeric values in place.

In [None]:
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

train_df.head()

### Quick completing and converting a numeric feature
We can now complete the Fare feature for the single missing value in the test dataset using the median of this feature. We do this in a single line of code.

Note that we are not creating an intermediate new feature or doing any further analysis for correlation to guess missing feature as we are replacing only a single value. The completion goal achieves desired requirement for model algorithm to operate on non-null values.

We may also want to round off the fare to two decimals as it represents currency.



In [None]:
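# assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())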
test_df.head()
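
The rounding to two decimals mentioned above is not applied in this cell. If it is wanted, one way to combine it with the median fill is sketched below on made-up fares.

```python
import numpy as np
import pandas as pd

# Hypothetical fares with one missing value
fare = pd.Series([7.25, 71.2833, 8.05, np.nan])

# median() skips the missing value; round(2) keeps two decimals
fare_filled = fare.fillna(fare.median()).round(2)
print(fare_filled.tolist())
```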

We can now create FareBand.

In [None]:
train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)
train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
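
FareBand uses pd.qcut where AgeBand used pd.cut. The difference matters for a skewed feature: pd.cut makes bands of equal width, while pd.qcut makes bands holding roughly equal numbers of samples. A sketch on made-up fares with one very expensive ticket:

```python
import pandas as pd

# Hypothetical skewed fares: many cheap tickets and one very expensive one
fares = pd.Series([5, 6, 7, 8, 9, 10, 11, 100])

# Equal-width bands: almost every fare lands in the first band
equal_width = pd.cut(fares, 4, labels=False)

# Quantile bands: each band receives the same number of fares
equal_count = pd.qcut(fares, 4, labels=False)

print(equal_width.tolist())
print(equal_count.tolist())
```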

Convert the Fare feature to ordinal values based on the FareBand.


In [None]:
for dataset in combine:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

train_df = train_df.drop(['FareBand'], axis=1)
combine = [train_df, test_df]

train_df.head(10)

In [None]:
train_df.head(10)

In [None]:
test_df.head(10)

## ... to be continued