This kernel builds on what I learned from other kernels and, of course, on my own intuitions and methods too. It should be helpful for beginners.

This is my first kernel, so any suggestions or appreciation are heartily welcome.

In [None]:
import os
print(os.listdir("../input/"))

### Import libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import re

import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from xgboost import XGBClassifier

In [None]:
train = pd.read_csv("../input/train.csv")
train['label'] = 'train'
test = pd.read_csv("../input/test.csv")
test['label'] = 'test'
test_passengerId = test.PassengerId  #Save test passengerId. It will be required at the end
df = pd.concat([train, test], ignore_index = True)  #DataFrame.append was removed in pandas 2.0
df.sample(2)

## EDA

In [None]:
df.info()

In [None]:
df.isnull().sum()

So we have to handle missing values in Age, Cabin, Embarked and Fare. The missing values in Survived belong to the test set, which has no labels.

In [None]:
df.describe(include = 'all')

### Handling Missing Values

#### Embarked

In [None]:
#Fill missing value
df['Embarked'] = df['Embarked'].fillna('S')    #top value with freq 914

#### Fare

In [None]:
df[df.Fare.isnull()]

In [None]:
df.select_dtypes('number').corr().Fare    #correlation is only defined for numeric columns

Looks like Pclass can help to fill the missing value.

In [None]:
print(df[df.Pclass == 1].Fare.quantile([0.25, 0.50, 0.75]))
print(df[df.Pclass == 2].Fare.quantile([0.25, 0.50, 0.75]))
print(df[df.Pclass == 3].Fare.quantile([0.25, 0.50, 0.75]))
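
As a side note, the three prints above can be collapsed into one grouped call. Here is a minimal sketch on made-up toy data (the `toy` frame and its values are my assumptions, not the Titanic data):

```python
import pandas as pd

# Toy data standing in for the Titanic frame (values are made up).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Fare":   [80.0, 60.0, 100.0, 13.0, 26.0, 20.0, 7.0, 8.0, 9.0],
})

# One row per class, one column per requested quantile.
fare_quartiles = (
    toy.groupby("Pclass")["Fare"]
       .quantile([0.25, 0.50, 0.75])
       .unstack()
)
print(fare_quartiles)
```

Having all classes in one table makes it easier to compare the quartiles side by side.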

Yup! Fares differ substantially by Pclass. Let's look at this visually.

In [None]:
sns.catplot(x = 'Pclass', y = 'Fare', data = df, kind = 'point')    #factorplot was renamed to catplot

In [None]:
df['Fare'] = df['Fare'].fillna(df[df.Pclass == 3].Fare.median())   #Fare is dependent on Pclass, so use the Pclass 3 median

#### Age

In [None]:
print("Age column has", df.Age.isnull().sum(), "missing values out of", len(df), ". Missing value percentage =", df.Age.isnull().sum()/len(df)*100)

That's high! A single statistic (mean or median) computed from the Age column alone could therefore mislead the classifier. We will fill the missing values based on relationships with other variables instead.

In [None]:
df.select_dtypes('number').corr().Age    #correlation is only defined for numeric columns

In [None]:
df.pivot_table(values = 'Age', index = 'Pclass').Age.plot.bar()

In [None]:
df.pivot_table(values = 'Age', index = ['Pclass', 'SibSp'], aggfunc = 'median').Age.plot.bar()

A basic trend is visible in the graph, so we are on the right path!

In [None]:
df.pivot_table(values = 'Age', index = ['Pclass', 'SibSp', 'Parch'], aggfunc = 'median')

We will fill missing values based on Pclass and SibSp.

In [None]:
df.Age.isnull().sum()

In [None]:
age_null = df.Age.isnull()
#Median age of each (Pclass, SibSp) group, aligned row by row with df
group_med_age = df.groupby(['Pclass', 'SibSp']).Age.transform('median')
#Assign plain values so that the fill does not depend on index alignment
df.loc[age_null, 'Age'] = group_med_age[age_null].values

In [None]:
df.Age.isnull().sum()
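
One caveat with a group-wise median fill: a (Pclass, SibSp) group in which no age is known has no median to offer. The sketch below shows one possible fallback, first the fine-grained group and then the Pclass median. The `toy` frame, its values and the `Age_filled` column are hypothetical and only illustrate the idea:

```python
import numpy as np
import pandas as pd

# Toy data: the (Pclass=3, SibSp=8) group has no known age at all.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "SibSp":  [0, 0, 0, 1, 8, 8],
    "Age":    [40.0, 50.0, np.nan, 20.0, np.nan, np.nan],
})

# Median per (Pclass, SibSp) group, broadcast back to every row.
by_class_sibsp = toy.groupby(["Pclass", "SibSp"])["Age"].transform("median")
# Coarser fallback: median per Pclass only.
by_class = toy.groupby("Pclass")["Age"].transform("median")

# Try the fine-grained median first, then fall back to the coarser one.
toy["Age_filled"] = toy["Age"].fillna(by_class_sibsp).fillna(by_class)
print(toy)
```

Whether such a fallback is needed depends on the data, so it is worth checking `isnull().sum()` after the fill, as done above.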

#### Cabin

In [None]:
print("Cabin has", df.Cabin.isnull().sum(), "missing values out of", len(df))

So instead of filling those values, we will put them in a group of their own. We will assume that those people didn't have a cabin.

In [None]:
df['Cabin'] = df.Cabin.str[0]
df.Cabin.unique()

In [None]:
df['Cabin'] = df.Cabin.fillna('O')    #'O' marks passengers without a recorded cabin
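
To make the two Cabin steps concrete, here is a small sketch on a few made-up cabin strings (the sample values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# A few example cabin entries; NaN means no cabin was recorded.
cabin = pd.Series(["C85", np.nan, "E46", "B57 B59 B63", np.nan])

# Keep only the first character (the deck letter), then mark
# the missing entries with the placeholder letter 'O'.
deck = cabin.str[0].fillna("O")
print(deck.tolist())
```

Note that `.str[0]` leaves missing entries as NaN, which is why the fill can happen afterwards.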

In [None]:
df.isnull().sum()

So, we are done with the data cleaning part. The remaining missing Survived values come from the test set.

In [None]:
df.sample(2)

### Sex

In [None]:
sns.catplot(data = df, x = 'Sex', hue = 'Survived', kind = 'count')

In [None]:
df.pivot_table(values = 'Survived', index = 'Sex').Survived.plot.bar()
plt.ylabel('Survival Probability')

Females tend to survive more than males.

#### Age

In [None]:
q = sns.kdeplot(df.Age[df.Survived == 1], fill = True, color = 'red')
q = sns.kdeplot(df.Age[df.Survived == 0], fill = True, color = 'blue')
q.set_xlabel("Age")
q.set_ylabel("Density")    #a KDE shows density, not frequency
q = q.legend(['Survived', 'Not Survived'])

In [None]:
q = sns.FacetGrid(df, col = 'Survived')
q.map(sns.histplot, 'Age', kde = True, stat = 'density')    #distplot is deprecated in favour of histplot

#### Embarked

In [None]:
sns.catplot(data = df, x = 'Embarked', hue = 'Survived', kind = 'count')

In [None]:
df.pivot_table(values = 'Survived', index = 'Embarked').Survived.plot.bar()
plt.ylabel('Survival Probability')

Passengers who embarked at Cherbourg were safer than those from the other ports. Let's look into this further.

In [None]:
df.pivot_table(values = 'Survived', index = ['Sex','Embarked']).Survived.plot.bar()
plt.ylabel('Survival Probability')

We found something interesting! Cherbourg is very safe for females, while Queenstown and Southampton are very dangerous for males.

In [None]:
fig, ax =plt.subplots(1,2)
sns.countplot(data = df[df.Sex == 'female'], x = 'Embarked', hue = 'Survived', ax = ax[0])
sns.countplot(data = df[df.Sex == 'male'], x = 'Embarked', hue = 'Survived', ax = ax[1])
fig.show()

Surprising! The same port of embarkation can be very safe for females but much more dangerous for males, and this contrast between the sexes shows up at the other two ports as well.

#### Parch

In [None]:
sns.catplot(data = df, x = 'Parch', hue = 'Survived', kind = 'count')

In [None]:
df.pivot_table(values = 'Survived', index = 'Parch').Survived.plot.bar()
plt.ylabel('Survival Probability')

#### Pclass

In [None]:
sns.catplot(data = df, x = 'Pclass', hue = 'Survived', kind = 'count')

In [None]:
df.pivot_table(values = 'Survived', index = 'Pclass').Survived.plot.bar()
plt.ylabel('Survival Probability')

In [None]:
df.pivot_table(values = 'Survived', index = ['Sex', 'Pclass']).Survived.plot.bar()
plt.ylabel('Survival Probability')

A higher ticket class goes with a higher chance of survival!

#### Cabin

In [None]:
sns.catplot(data = df, x = 'Cabin', hue = 'Survived', kind = 'count')

In [None]:
df.pivot_table(values = 'Survived', index = 'Cabin').Survived.plot.bar()
plt.ylabel('Survival Probability')

#### Fare

From the dataset description, it looked like the Fare values are skewed. Let's visualize it.

In [None]:
plt.boxplot(train.Fare, showmeans = True)
plt.title('Fare Boxplot')
plt.ylabel('Fares')

In [None]:
sns.histplot(df.Fare.reset_index(drop = True), kde = True)    #distplot is deprecated in favour of histplot

Highly right skewed!

In [None]:
df.Fare.skew()    #Measure of skewness level

We will take a log transform. This might help the classifier in its predictions. It will also help us find correlations between variables.

In [None]:
df['Fare_log'] = df.Fare.map(lambda i: np.log(i) if i > 0 else 0)

In [None]:
sns.histplot(df.Fare_log.reset_index(drop = True), kde = True)    #distplot is deprecated in favour of histplot

In [None]:
df.Fare_log.skew()
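
To see the effect of the log transform in isolation, here is a small sketch on a handful of made-up fares. It compares the conditional log used above with `np.log1p`, which is a common alternative for data containing zeros (using it here is my suggestion, not part of the original pipeline):

```python
import numpy as np
import pandas as pd

# Made-up fares with one zero and one very large value.
fare = pd.Series([0.0, 7.25, 8.05, 13.0, 26.0, 71.28, 512.33])

# Conditional log as used in this notebook: zero fares stay at 0.
fare_log = fare.map(lambda v: np.log(v) if v > 0 else 0)

# Alternative: log(1 + x) is defined at 0 and needs no special case.
fare_log1p = np.log1p(fare)

print("raw skew  :", fare.skew())
print("log skew  :", fare_log.skew())
print("log1p skew:", fare_log1p.skew())
```

Both versions pull in the long right tail. Which one suits the model better is something to check rather than assume.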

### Feature Engineering

In [None]:
df['Family_size'] = 1 + df.Parch + df.SibSp
df['Alone'] = np.where(df.Family_size == 1, 1, 0)

In [None]:
print(df.Family_size.value_counts())
print(df.Alone.value_counts())

In [None]:
sns.catplot(data = df, x = 'Family_size', hue = 'Survived', kind = 'count')

In [None]:
df.pivot_table(values = 'Survived', index = 'Family_size').Survived.plot.bar()
plt.ylabel('Survival Probability')

Families of 2 to 4 members are more likely to survive, so we will bin family size into these groups.

In [None]:
df.loc[df['Family_size'] == 1, 'Family_size_bin'] = 0
df.loc[(df['Family_size'] >= 2) & (df['Family_size'] <= 4), 'Family_size_bin'] = 1
df.loc[df['Family_size'] >=5, 'Family_size_bin'] = 2
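
The same three bins can also be written with `pd.cut`. This is just an alternative sketch on made-up family sizes, and the variable names are hypothetical:

```python
import numpy as np
import pandas as pd

# Made-up family sizes covering all three groups.
family_size = pd.Series([1, 2, 4, 5, 11])

# Intervals are right-inclusive: (0, 1] -> 0, (1, 4] -> 1, (4, inf] -> 2.
family_size_bin = pd.cut(
    family_size, bins=[0, 1, 4, np.inf], labels=[0, 1, 2]
).astype(int)
print(family_size_bin.tolist())
```

Keeping the bin edges in one list makes it easy to try different groupings later.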

In [None]:
sns.catplot(data = df, x = 'Alone', hue = 'Survived', kind = 'count')

In [None]:
df.pivot_table(values = 'Survived', index = 'Alone').Survived.plot.bar()
plt.ylabel('Survival Probability')

People travelling alone are less likely to survive.

In [None]:
df['Title'] = df.Name.str.split(", ", expand = True)[1].str.split(".", expand = True)[0]
df.Title.value_counts()

In [None]:
minor_titles = df.Title.value_counts() <= 4
df['Title'] = df.Title.apply(lambda x: 'Others' if minor_titles.loc[x] == True else x)
df.Title.value_counts()
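
For readers who want to see the title logic step by step, here is a sketch on a few example names. It uses a regular expression instead of the double split, and a threshold of 2 instead of 4 so that the effect shows on such a tiny sample. Both choices are assumptions for illustration only:

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "Allen, Mr. William Henry",
])

# The title sits between the first comma and the first full stop.
title = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Replace titles that occur fewer than min_count times with 'Others'.
min_count = 2  # hypothetical threshold for this toy sample
counts = title.map(title.value_counts())
title_grouped = title.where(counts >= min_count, "Others")

print(title.tolist())
print(title_grouped.tolist())
```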

In [None]:
sns.catplot(data = df, x = 'Title', hue = 'Survived', kind = 'count')

In [None]:
df.pivot_table(values = 'Survived', index = 'Title').Survived.plot.bar()
plt.ylabel('Survival Probability')

Let's also make bins for age and fare and visualize them.

In [None]:
df['Fare_bin'] = pd.qcut(df.Fare, 4, labels = [0,1,2,3]).astype(int)
df['Age_bin'] = pd.cut(df.Age.astype(int), 5, labels = [0,1,2,3,4]).astype(int)
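
Note that the two lines above bin differently: `pd.qcut` makes bins with roughly equal counts, while `pd.cut` with an integer makes bins of equal width. A small sketch on made-up values with one outlier shows the difference:

```python
import pandas as pd

# Made-up values with a single large outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Equal-width bins: the outlier stretches the range, so almost
# everything lands in the first bin.
equal_width = pd.cut(values, 4, labels=[0, 1, 2, 3]).astype(int)

# Equal-count (quantile) bins: each bin gets about the same number of rows.
equal_count = pd.qcut(values, 4, labels=[0, 1, 2, 3]).astype(int)

print(equal_width.tolist())
print(equal_count.tolist())
```

This is presumably why quantile bins are the safer choice for a skewed variable like Fare, while equal-width bins are fine for Age.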

In [None]:
sns.catplot(data = df, x = 'Age_bin', hue = 'Survived', kind = 'count')

Youngsters are more likely to survive.

In [None]:
sns.catplot(data = df, x = 'Fare_bin', hue = 'Survived', kind = 'count')

The more you pay, the better your chances of survival!

In [None]:
fig, axs = plt.subplots(1, 3,figsize=(15,5))

sns.pointplot(x = 'Fare_bin', y = 'Survived',  data=df, ax = axs[0])
sns.pointplot(x = 'Age_bin', y = 'Survived',  data=df, ax = axs[1])
sns.pointplot(x = 'Family_size', y = 'Survived', data=df, ax = axs[2])

### Handling categorical variables

In [None]:
label = LabelEncoder()
df['Title'] = label.fit_transform(df.Title)
df['Sex'] = label.fit_transform(df.Sex)
df['Embarked'] = label.fit_transform(df.Embarked)
df['Cabin'] = label.fit_transform(df.Cabin)

In [None]:
df.sample(2)

In [None]:
#We will look at correlations between variables, so before working on the Ticket column, save all the variables we have worked on so far.
#This is because we will use get_dummies on Ticket instead of label encoding.
corr_columns = list(df.drop(['Name', 'PassengerId', 'Ticket', 'label'], axis = 1).columns)

#### Ticket

In [None]:
df['Ticket'] = df.Ticket.map(lambda x: re.sub(r'\W+', '', x))   #Remove special characters

In [None]:
#If the ticket is purely numeric, replace it with the character X; otherwise keep its first two characters
Ticket = []
for i in list(df.Ticket):
    if not i.isdigit():
        Ticket.append(i[:2])
    else:
        Ticket.append("X")
df['Ticket'] = Ticket
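
Here is how the two Ticket steps behave on a few sample ticket strings. The helper name `ticket_prefix` is hypothetical and the sample strings are only for illustration:

```python
import re

def ticket_prefix(ticket: str) -> str:
    """Return 'X' for purely numeric tickets, else the first two characters."""
    # Remove every run of non-word characters (spaces, slashes, dots).
    cleaned = re.sub(r"\W+", "", ticket)
    return "X" if cleaned.isdigit() else cleaned[:2]

examples = ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "LINE"]
prefixes = [ticket_prefix(t) for t in examples]
print(prefixes)
```

With the logic in one function, the whole column could be transformed with `df.Ticket.map(ticket_prefix)`.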

In [None]:
df.Ticket.unique()

In [None]:
df = pd.get_dummies(df, columns = ['Ticket'], prefix = 'T')

Now we will select features from the dataset for modelling purpose.

### Feature Selection

In [None]:
cat_variables = [x for x in df.columns if df.dtypes[x] == 'object']
cat_variables

In [None]:
df.drop(['Name', 'PassengerId'], axis = 1, inplace = True)

In [None]:
df.sample(2)

In [None]:
train = df.loc[df.label == 'train'].drop('label', axis = 1)
test = df.loc[df.label == 'test'].drop(['label', 'Survived'], axis = 1)

Let's look at the correlation between variables.

###### Pearson's

In [None]:
plt.figure(figsize = [14,10])
sns.heatmap(train[corr_columns].corr(), cmap = 'RdBu', annot = True)

1. Pclass and Cabin are related to some extent.
2. Of course, Family_size is related to Parch and SibSp, as it is derived from these two.

Let's look at Spearman's correlation as well. Note that I'm doing this mainly for the Fare variable, because it is skewed.

###### Spearman's

In [None]:
plt.figure(figsize = [14,10])
sns.heatmap(train[corr_columns].corr(method = 'spearman'), cmap = 'RdBu', annot = True)

As expected! The correlations are somewhat stronger than Pearson's (there are more blue blocks in the Spearman heatmap). Also, Fare_log and Fare have a correlation of 1 because one is just the log transformation of the other, so their ranks are the same.
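
The point about ranks can be checked on a few made-up fares: the log is a monotonic transformation, so it changes Pearson's correlation but leaves the ranks, and hence Spearman's correlation, untouched. The values below are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Made-up positive fares and their log transform.
fare = pd.Series([7.25, 8.05, 13.0, 26.0, 71.28, 512.33])
toy = pd.DataFrame({"Fare": fare, "Fare_log": np.log(fare)})

pearson = toy.corr(method="pearson").loc["Fare", "Fare_log"]
spearman = toy.corr(method="spearman").loc["Fare", "Fare_log"]

print("Pearson :", pearson)
print("Spearman:", spearman)
```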

## Modeling

Split the data into train and test sets.

In [None]:
X_train = train.drop(['Survived'], axis = 1)
y_train = train['Survived'].astype(int)
X_test = test

### Evaluation

In [None]:
X_train1, X_val, y_train1, y_val = train_test_split(X_train, y_train, test_size = 0.30, random_state = 2)

In [None]:
xgb = XGBClassifier(random_state = 0).fit(X_train1, y_train1)
y_pred = xgb.predict(X_val)
print("Training accuracy", xgb.score(X_train1, y_train1))
print("Evaluation accuracy", xgb.score(X_val, y_val))

#### Cross-Validation

In [None]:
cross_val_score(xgb, X_train, y_train, cv = 5).mean()

#### Hyper-parameter tuning with GridSearchCV

In [None]:
param_grid = {'max_depth' : [3,4,5,6], 'n_estimators' : [100,200,300], 'learning_rate' : [0.001,0.01,0.1,0.5],
              'booster' : ['gbtree', 'dart', 'gblinear']}
grid_search = GridSearchCV(XGBClassifier(random_state = 0), param_grid, scoring = 'roc_auc', cv = 3, n_jobs = -1)

In [None]:
grid_search.fit(X_train, y_train)

In [None]:
grid_search.best_params_
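
The best parameters do not have to be copied by hand. They can be unpacked straight into a new model. Below is a minimal sketch on synthetic data, where scikit-learn's `DecisionTreeClassifier` stands in for `XGBClassifier` to keep it light. The data, the grid and the variable names are all assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the Titanic features.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

param_grid = {"max_depth": [2, 3, 4]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid, scoring="roc_auc", cv=3,
)
search.fit(X, y)

# Unpack the winning settings into a fresh model and fit it on all the data.
final_model = DecisionTreeClassifier(random_state=0, **search.best_params_)
final_model.fit(X, y)
print(search.best_params_)
```

With the default `refit = True`, `search.best_estimator_` is already such a refitted model, so it can be used directly as well.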

In [None]:
xgb = XGBClassifier(max_depth = 6, learning_rate = 0.01, random_state = 0).fit(X_train, y_train)
y_pred = xgb.predict(X_test)
print("Training accuracy =", xgb.score(X_train, y_train))

In [None]:
feature_imp = xgb.feature_importances_
feature_imp

#### Plot feature importances

In [None]:
feature = pd.Series(feature_imp, X_train.columns).sort_values(ascending = False)
plt.figure(figsize = (15,6))
feature.plot(kind = 'bar', title = 'Feature Importance')

In [None]:
XGB = pd.DataFrame(test_passengerId, columns = ['PassengerId']).assign(Survived = pd.Series(y_pred))
#XGB.to_csv('../output/XGB.csv', index = None)

### Further work
I've only used XGBoost here. You can also try other models that generalize well on this dataset.
A combination of two or more models may also improve performance.
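
As a starting point for combining models, here is a minimal sketch using scikit-learn's `VotingClassifier` on synthetic data. The two base models and all settings are arbitrary choices for illustration, not something I have tuned for this dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the Titanic features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Soft voting averages the predicted class probabilities of both models.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ],
    voting="soft",
)

scores = cross_val_score(ensemble, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())
```

Whether the ensemble beats the single models should be checked with the same cross-validation used above.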

This was my first kernel, so any suggestions or appreciation are heartily welcome. Please also UPVOTE it if you liked my work and found it helpful.

Thank you!