# Titanic Top 4% with ensemble modeling
#### Yassine Ghouzam, PhD
13/07/2017
- 1 Introduction
- 2 Load and check data
    - 2.1 load data
    - 2.2 Outlier detection
    - 2.3 joining train and test set
    - 2.4 check for null and missing values
- 3 Feature analysis
    - 3.1 Numerical values
    - 3.2 Categorical values
- 4 Filling missing Values
    - 4.1 Age
- 5 Feature engineering
    - 5.1 Name/Title
    - 5.2 Family Size
    - 5.3 Cabin
    - 5.4 Ticket
- 6 Modeling
    - 6.1 Simple modeling
        - 6.1.1 Cross validate models
        - 6.1.2 Hyperparameter tuning for best models
        - 6.1.3 Plot learning curves
        - 6.1.4 Feature importance of the tree based classifiers
    - 6.2 Ensemble modeling
        - 6.2.1 Combining models
    - 6.3 Prediction
        - 6.3.1 Predict and Submit results

## 1. Introduction
This is my first kernel on Kaggle. I chose the Titanic competition, which is a good way to introduce feature engineering and ensemble modeling. First, I will present some feature analyses, then I will focus on feature engineering. The last part covers modeling and predicting survival on the Titanic using a voting procedure.

This script has three main parts:
- Feature analysis
- Feature engineering
- Modeling

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from collections import Counter

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve

sns.set(style='white', context='notebook', palette='deep')

## 2. Load and check data
### 2.1 Load data

In [2]:
# Load the train and test sets

train = pd.read_csv("../input/titanic/train.csv")
test = pd.read_csv("../input/titanic/test.csv")
IDtest = test["PassengerId"]

### 2.2 Outlier detection

In [3]:
# Outlier detection

def detect_outliers(df, n, features):
    """
    Takes a dataframe df of features and returns a list of the indices
    corresponding to the observations containing more than n outliers according
    to the Tukey method.
    """
    outlier_indices = []
    
    # iterate over features(columns)
    for col in features:
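        # 1st quartile (25%)
        # note: np.percentile returns NaN when the column has missing values
        # (e.g. Age), in which case no row is flagged for that column
        Q1 = np.percentile(df[col], 25)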
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col], 75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        
        # outlier step
        outlier_step = 1.5 * IQR
        
        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step)].index
        
        # append the found outlier indices for col to the list of outlier indices
        outlier_indices.extend(outlier_list_col)
        
    # select observations flagged as outliers in more than n features
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list( k for k, v in outlier_indices.items() if v > n)
        
    return multiple_outliers

# detect outliers from Age, SibSp, Parch and Fare
Outliers_to_drop = detect_outliers(train, 2, ["Age", "SibSp", "Parch", "Fare"])


Since outliers can have a dramatic effect on the prediction (especially for regression problems), I chose to handle them.

I used the Tukey method (Tukey JW., 1977) to detect outliers. It computes the interquartile range (IQR), the distance between the 1st and 3rd quartiles of a feature's values, and flags a value as an outlier when it lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR (1.5 × IQR being the outlier step).

I decided to detect outliers in the numerical features (Age, SibSp, Parch and Fare). Then I considered as outliers the rows that have more than two outlying numerical values (n = 2 in the call above).
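To make the Tukey fences concrete, here is a minimal sketch on a single made-up column. The values are hypothetical and not taken from the Titanic data; only the 1.5 × IQR rule matches the function above.

```python
import numpy as np

# Hypothetical toy values, not taken from the Titanic data
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Tukey fences: anything outside [q1 - step, q3 + step] is flagged
step = 1.5 * iqr
lower_fence = q1 - step
upper_fence = q3 + step

outlier_mask = (values < lower_fence) | (values > upper_fence)
outliers = values[outlier_mask]
print(lower_fence, upper_fence, outliers)
```

Only the value 40 falls outside the fences here. `detect_outliers` repeats this test for each column and then counts how many columns flagged each row.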

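Python lists offer two methods for adding new elements, append(x) and extend(iterable), which differ as follows:
- list.append(x) adds x to the end of the list as a single element, as is.
- list.extend(x) adds every item of the outermost iterable x to the end of the list.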
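The choice between the two methods matters in `detect_outliers`, because `Counter` needs a flat list of row indices. A small sketch with hypothetical indices:

```python
from collections import Counter

# Hypothetical row indices flagged for two different columns
flagged_for_sibsp = [27, 88, 159]
flagged_for_fare = [27, 88]

# append keeps each list as one nested element
with_append = []
with_append.append(flagged_for_sibsp)
with_append.append(flagged_for_fare)

# extend flattens the indices into a single list
with_extend = []
with_extend.extend(flagged_for_sibsp)
with_extend.extend(flagged_for_fare)

# Counting works on the flat list: each index maps to the number of columns that flagged it
counts = Counter(with_extend)
print(with_append)
print(with_extend)
print(counts)
```

Counting the nested `with_append` result would fail, since lists are not hashable.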

In [4]:
train.loc[Outliers_to_drop]  # Show the outlier rows

We detect 10 outliers. Passengers 28, 89 and 342 have a high ticket fare; the other 7 have very high SibSp values.

In [5]:
# Drop outliers
train = train.drop(Outliers_to_drop, axis=0).reset_index(drop=True)

### 2.3 Joining train and test set

In [6]:
## Join train and test datasets in order to obtain the same number of features during categorical conversion
train_len = len(train)
dataset = pd.concat(objs=[train, test], axis=0).reset_index(drop=True)

I join the train and test datasets so that both end up with the same number of features during categorical conversion (see Feature engineering).
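The sketch below illustrates why the join helps. It uses two hypothetical miniature frames with a made-up `Deck` column, and assumes the categorical conversion is done with `pd.get_dummies`: when a category appears in only one set, encoding the sets separately yields different columns.

```python
import pandas as pd

# Hypothetical miniature frames; category "T" appears only in the train part
toy_train = pd.DataFrame({"Deck": ["A", "B", "T"]})
toy_test = pd.DataFrame({"Deck": ["A", "B", "B"]})

# Encoding separately gives mismatched columns
separate_train = pd.get_dummies(toy_train, columns=["Deck"])
separate_test = pd.get_dummies(toy_test, columns=["Deck"])

# Encoding after the join gives identical columns for both parts
toy_all = pd.concat([toy_train, toy_test], axis=0).reset_index(drop=True)
joint = pd.get_dummies(toy_all, columns=["Deck"])
joint_train = joint[:len(toy_train)]
joint_test = joint[len(toy_train):]

print(list(separate_train.columns), list(separate_test.columns))
print(list(joint_train.columns), list(joint_test.columns))
```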

### 2.4 Check for null and missing values

In [7]:
# Fill empty and NaNs values with NaN
dataset = dataset.fillna(np.nan)

# Check for Null values
dataset.isnull().sum()

The Age and Cabin features have a large share of missing values.

**The missing Survived values correspond to the joined test set (the Survived column doesn't exist in the test set and was filled with NaN values when the train and test sets were concatenated).**
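A short sketch of where those NaN values come from, using hypothetical two-row and one-row frames. The last two lines show one possible way to split the joined frame back apart using the stored train length; that split is an assumption for illustration.

```python
import pandas as pd

# Hypothetical miniature frames; the test part has no Survived column
toy_train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Fare": [7.25, 71.28]})
toy_test = pd.DataFrame({"PassengerId": [3], "Fare": [8.05]})

toy_train_len = len(toy_train)
toy_all = pd.concat(objs=[toy_train, toy_test], axis=0).reset_index(drop=True)

# Survived is NaN exactly for the rows that came from the test part
missing_per_column = toy_all.isnull().sum()
print(missing_per_column)

# One way to recover the two parts later (an assumption for illustration)
train_part = toy_all[:toy_train_len]
test_part = toy_all[toy_train_len:].drop(labels=["Survived"], axis=1)
```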

In [8]:
# Infos
train.info()
train.isnull().sum()

In [9]:
train.head()

In [10]:
train.dtypes

In [11]:
# Summary statistics
train.describe()

## 3. Feature analysis
### 3.1 Numerical values

In [12]:
# Correlation matrix between numerical values (SibSp Parch Age and Fare values) and Survived
g = sns.heatmap(train[["Survived", "SibSp", "Parch", "Age", "Fare"]].corr(), annot=True, fmt=".2f", cmap="coolwarm")

Only the Fare feature seems to have a significant correlation with the survival probability.

This doesn't mean that the other features are useless. Subpopulations within these features can still be correlated with survival. To find out, we need to explore these features in detail.
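As a hedged illustration of that point, the made-up data below has a Pearson correlation of exactly zero between `x` and `outcome`, yet the outcome rate clearly differs between the groups of `x`. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data: the outcome rate is high for x = 0 and x = 2, low for x = 1
toy = pd.DataFrame({
    "x":       [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    "outcome": [1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0],
})

# Linear (Pearson) correlation misses the pattern
pearson = toy["x"].corr(toy["outcome"])

# Group-wise rates reveal it
rate_by_group = toy.groupby("x")["outcome"].mean()

print(pearson)
print(rate_by_group)
```

This is why the sections that follow look at survival rates per category rather than relying on the correlation matrix alone.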

#### SibSp

In [13]:
# Explore SibSp feature vs Survived
g = sns.catplot(x="SibSp", y="Survived", data=train, kind="bar", height=6, palette="muted")
g.despine(left=True)  # remove the left axis spine
g = g.set_ylabels("survival probability")

It seems that passengers with many siblings/spouses have a lower chance of survival.

Passengers travelling alone (SibSp 0) or with one or two siblings/spouses (SibSp 1 or 2) have a higher chance of survival.

This observation is quite interesting: we can consider a new feature describing these categories (see Feature engineering).

#### Parch

In [14]:
# Explore Parch feature vs Survived
g = sns.catplot(x="Parch", y="Survived", data=train, kind="bar", height=6, palette="muted")
g.despine(left=True)
g = g.set_ylabels("survival probability")

Small families have a higher chance of survival than single passengers (Parch 0), medium families (Parch 3, 4) and large families (Parch 5, 6).

Be careful: there is a large standard deviation in the survival of passengers with 3 parents/children.

#### Age

In [15]:
# Explore Age vs Survived
# FacetGrid(data, row, col, hue): a grid class that builds a multi-plot grid to show
# relationships across subsets of the data, much like dividing a canvas into several axes
g = sns.FacetGrid(train, col="Survived")
g = g.map(sns.distplot, "Age")

The Age distribution seems to be a tailed distribution, possibly Gaussian.

We notice that the age distributions are not the same in the survived and not survived subpopulations. Indeed, there is a peak corresponding to young passengers who survived. We also see that passengers between 60 and 80 survived less often.

So, even if "Age" is not correlated with "Survived", we can see that there are age categories of passengers with a higher or lower chance of survival.

It seems that very young passengers have a higher chance of survival.
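One way to look at such age categories numerically is to bin the ages and compare survival rates per bin. This is a sketch on made-up rows; the bin edges and labels are assumptions chosen for illustration, not values used elsewhere in this kernel.

```python
import pandas as pd

# Hypothetical rows, not taken from the Titanic data
toy = pd.DataFrame({
    "Age":      [2, 4, 20, 25, 30, 40, 45, 65, 70, 72],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0, 0, 1],
})

# Assumed bin edges: (0, 12], (12, 60], (60, 80]
bins = [0, 12, 60, 80]
labels = ["child", "adult", "senior"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Mean of a 0/1 column is the survival rate within each group
rate_by_age_group = toy.groupby("AgeGroup", observed=True)["Survived"].mean()
print(rate_by_age_group)
```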

In [16]:
# Explore Age distribution
g = sns.kdeplot(train["Age"][(train["Survived"] == 0) & (train["Age"].notnull())], color="Red", shade=True)
g = sns.kdeplot(train["Age"][(train["Survived"] == 1) & (train["Age"].notnull())], color="Blue", shade=True)
g.set_xlabel("Age")
g.set_ylabel("Frequency")
g = g.legend(["Not Survived", "Survived"])

When we superimpose the two densities, we clearly see a peak (between 0 and 5) corresponding to babies and very young children.

#### Fare

In [17]:
dataset["Fare"].isnull().sum()

In [18]:
# Fill Fare missing values with the median value
dataset["Fare"] = dataset["Fare"].fillna(dataset["Fare"].median())

Since we have only one missing value, I decided to fill it with the median, which will not have a significant effect on the prediction.
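A small sketch of median imputation on made-up fare-like values. The numbers are hypothetical; the point is that the median is barely affected by a single very large value, whereas the mean is pulled upward.

```python
import numpy as np
import pandas as pd

# Hypothetical fare-like values with one missing entry and one very large value
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 512.33, np.nan])

median_fare = fares.median()  # NaN entries are skipped by default
mean_fare = fares.mean()      # pulled up by the large value

filled = fares.fillna(median_fare)
print(median_fare, mean_fare)
print(filled)
```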

In [19]:
# Explore Fare distribution
g = sns.distplot(dataset["Fare"], color="m", label="Skewness : %.2f"%(dataset["Fare"].skew()))
g = g.legend(loc="best")

As we can see, the Fare distribution is very skewed. This can lead the model to give too much weight to very high values, even if the feature is scaled.

In this case, it is better to apply a log transformation to reduce the skew.

In [20]:
# Apply log to Fare to reduce the skewness of its distribution
dataset["Fare"] = dataset["Fare"].map(lambda i: np.log(i) if i > 0 else 0)

In [21]:
g = sns.distplot(dataset["Fare"], color="b", label="Skewness : %.2f"%(dataset["Fare"].skew()))
g = g.legend(loc="best")

Skewness is clearly reduced after the log transformation.  
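The same effect can be reproduced on synthetic data. The sketch below draws hypothetical right-skewed values from a log-normal distribution (an assumption, not the real fares) and applies the same guarded log mapping as above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed, fare-like values (log-normal), not the real fares
raw = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

# Same mapping as in the cell above: log for positive values, 0 otherwise
logged = raw.map(lambda v: np.log(v) if v > 0 else 0)

skew_before = raw.skew()
skew_after = logged.skew()
print("Skewness before: %.2f, after: %.2f" % (skew_before, skew_after))
```

Because the log of a log-normal variable is normally distributed, the skewness after the transformation should be close to zero.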

### 3.2 Categorical values
#### Sex

In [22]:
g = sns.barplot(x="Sex", y="Survived", data=train)
g = g.set_ylabel("Survival Probability")

In [23]:
train[["Sex", "Survived"]].groupby("Sex").mean()

It is clear that male passengers have a lower chance of survival than female passengers.

So Sex might play an important role in predicting survival.

For those who have seen the Titanic movie (1997), I am sure we all remember this sentence during the evacuation: "Women and children first".

#### Pclass

In [24]:
# Explore Pclass vs Survived
g = sns.catplot(x="Pclass", y="Survived", data=train, kind="bar", height=6, palette="muted")
g.despine(left=True)
g = g.set_ylabels("survival probability")

In [25]:
# Explore Pclass vs Survived by Sex
g = sns.catplot(x="Pclass", y="Survived", hue="Sex", data=train, height=6, kind="bar", palette="muted")
g.despine(left=True)
g = g.set_ylabels("survival probability")

Passenger survival is not the same across the 3 classes. First class passengers have a higher chance of survival than second and third class passengers.

This trend holds when we look at male and female passengers separately.

#### Embarked

In [26]:
dataset["Embarked"].isnull().sum()

In [27]:
# Fill Embarked NaN values in dataset with 'S', the most frequent value
dataset["Embarked"] = dataset["Embarked"].fillna("S")

Since we have two missing values, I decided to fill them with the most frequent value of "Embarked" (S).
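Rather than hard-coding the value, the most frequent category can also be looked up with `Series.mode()`. A sketch on a hypothetical column of port codes:

```python
import numpy as np
import pandas as pd

# Hypothetical port codes with two missing entries
ports = pd.Series(["S", "C", "S", "Q", np.nan, "S", np.nan])

# mode() ignores NaN by default and returns the most frequent value(s)
most_frequent = ports.mode()[0]
filled_ports = ports.fillna(most_frequent)

print(most_frequent)
print(filled_ports.value_counts())
```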

In [28]:
# Explore Embarked vs Survived
g = sns.catplot(x="Embarked", y="Survived", data=train, height=6, kind="bar", palette="muted")
g.despine(left=True)
g = g.set_ylabels("survival probability")

It seems that passengers who embarked at Cherbourg (C) have a higher chance of survival.

My hypothesis is that the proportion of first class passengers is higher among those who embarked at Cherbourg than among those from Queenstown (Q) or Southampton (S).

Let's look at the Pclass distribution by Embarked.

In [29]:
# Explore Pclass vs Embarked
g = sns.catplot(x="Pclass", col="Embarked", data=train, height=6, kind="count", palette="muted")
g.despine(left=True)
g = g.set_ylabels("Count")

Indeed, third class is the most frequent for passengers who embarked at Southampton (S) and Queenstown (Q), whereas Cherbourg passengers are mostly in first class, which has the highest survival rate.
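The same comparison can be expressed as a table of proportions with `pd.crosstab`. The rows below are made up to mimic the pattern, so the numbers are hypothetical. On the real data the call would take `train["Embarked"]` and `train["Pclass"]`.

```python
import pandas as pd

# Hypothetical rows mimicking the pattern, not the real passenger list
toy = pd.DataFrame({
    "Embarked": ["C", "C", "C", "Q", "Q", "S", "S", "S", "S"],
    "Pclass":   [1,   1,   3,   3,   3,   1,   2,   3,   3],
})

# normalize="index" turns counts into per-port shares (each row sums to 1)
class_share = pd.crosstab(toy["Embarked"], toy["Pclass"], normalize="index")
print(class_share)
```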

At this point, I can't explain why first class has a higher survival rate. My hypothesis is that first class passengers were prioritised during the evacuation because of their influence.