# Titanic (End To End ML Workflow)

# Introduction

**Titanic: Machine Learning from Disaster** is one of the best Kaggle competitions for improving data science skills, especially feature engineering, which makes the Titanic dataset a good choice for beginners.

In this project, I will work through the end-to-end data science cycle. The project includes the sections below.

### Table of Contents:

* 1. Preprocessing the data
    * 1.1 Variable Explanations
    * 1.2 Importing Libraries
    * 1.3 Getting Data
    * 1.4 Overview of The Data
    
* 2. Exploratory Data Analysis
    * 2.1 Missing Values
        * 2.1.1 Age 
        * 2.1.2 Embarked
        * 2.1.3 Cabin
        * 2.1.4 Fare
    * 2.2 Outliers    
    * 2.3 Analyzing Target Variable
    * 2.4 Analyzing Features
        * 2.4.1 Categorical Features
        * 2.4.2 Continuous Features
    * 2.5 Exploring Correlations
        
* 3. Feature Engineering
    * 3.1 Binning Continuous Features
    * 3.2 Creating New Features
    * 3.3 Feature Selecting
    * 3.4 Feature Scaling(Continuous Variables)
    * 3.5 Feature Transformation (Categoric Variables)
    
* 4. Model Selecting And Model Tuning
    * 4.1 Model Training
    * 4.2 Model Tuning

* 5. Making a Submission

# 1. Data Preprocessing

## 1.1 Variable Explanations

First of all, we need some information about the variables in the data.
 


 * **PassengerId:**  Unique Id for each passenger (it doesn't have any effect on target)
 
    
 * **Survived(categorical):** Survival (0 : No, 1 : Yes) (*)
 
 
 * **Pclass(categorical-ordinal) :**	Passenger class (1 : 1st, 2 : 2nd, 3 : 3rd)
 
 
 * **Name:** Passenger name
 
 
 * **Sex(categorical) :** Passenger sex
 
 
 * **Age:** Passenger age
 
 
 * **SibSp:** Sibling - Spouse (**)	
 
 
 * **Parch:** Parent - Child (***)
 
 
 * **Ticket:** Ticket number
 
 
 * **Fare:** Passenger fare
 
 
 * **Cabin:** Cabin number
 
 
 * **Embarked(categorical):** Port of Embarkation (C : Cherbourg, Q : Queenstown, S : Southampton)
 

(*) 'Survived' is the target variable we are trying to predict, so the test data doesn't have a 'Survived' column.


(**) sibsp: The dataset defines family relations in this way...

           Sibling = brother, sister, stepbrother, stepsister

           Spouse = husband, wife (mistresses and fiancés were ignored)

(***) parch: The dataset defines family relations in this way...

            Parent = mother, father

            Child = daughter, son, stepdaughter, stepson

            ! Some children travelled only with a nanny, therefore parch=0 for them.

## 1.2 Importing Libraries

In [None]:
# data processing
import numpy as np
import pandas as pd 

# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB

# Model Selection
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import metrics
from sklearn.preprocessing import StandardScaler,minmax_scale

import warnings
warnings.filterwarnings("ignore")

## 1.3 Getting Data

In [None]:
# reading data
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')

In [None]:
train_df = train.copy()
test_df = test.copy()
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same combined frame
df = pd.concat([train_df, test_df], sort=False)

## 1.4 Overview of The Data

In [None]:
train_df.sample(5)

In [None]:
test_df.sample(5)

In [None]:
train_df.info()

In [None]:
test_df.info()

In [None]:
print(f'train dataset has {train_df.shape[0]} observations and {train_df.shape[1]} variables.')
print(f'test dataset has {test_df.shape[0]} observations and {test_df.shape[1]} variables.')

In [None]:
train_df.dtypes

In [None]:
train_df.describe().T

In [None]:
test_df.describe().T

When I look at the statistical summary of the data, I notice a few things:
* Approximately 38% of the passengers in the training set survived.
* The Fare variable appears to contain outliers.
* The majority of passengers travelled alone.
* The majority of passengers were less than 40 years old.

These are my first impressions of the data. I will review it in more detail later.

# 2. Exploratory Data Analysis

## 2.1 Missing Values

In [None]:
train_df.isnull().sum().sort_values(ascending=False)

In [None]:
def explore_missing_values(df) :
    total = df.isnull().sum().sort_values(ascending=False)
    percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
    missing_data = pd.concat([total,percent], axis=1, keys=['Total','Percent'])
    sns.barplot(x=missing_data.index,y=missing_data['Percent'])
    plt.xlabel('Features', fontsize=15)
    plt.ylabel('Percent of Missing Values')
    plt.title('PERCENT MISSING DATA BY FEATURE')
    plt.xticks(rotation=75)  # rotation takes a number of degrees, not a string
    plt.show()
    print(missing_data.head(20))

In [None]:
explore_missing_values(train_df)

In [None]:
explore_missing_values(test_df)

The number of missing values in the Age, Embarked and Fare columns is relatively low compared to the total number of observations, so the missing values in those columns can simply be filled with descriptive statistics such as the median. That is not the right approach for the 'Cabin' column, where approximately 80% of the values are missing.

#### 2.1.1 Age

I will create a Title feature to help impute the Age column, but I won't use that feature in the model.

In [None]:
def create_Title(df):
    titles = {
        "Mr" :         "Mr",
        "Mme":         "Mrs",
        "Ms":          "Mrs",
        "Mrs" :        "Mrs",
        "Master" :     "Master",
        "Mlle":        "Miss",
        "Miss" :       "Miss",
        "Capt":        "Rare",
        "Col":         "Rare",
        "Major":       "Rare",
        "Dr":          "Rare",
        "Rev":         "Rare",
        "Jonkheer":    "Rare",
        "Don":         "Rare",
        "Sir" :        "Rare",
        "Countess":    "Rare",
        "Dona":        "Rare",
        "Lady" :       "Rare"
    }
    # Raw string so that \. is passed to the regex engine unchanged
    extracted_titles = df["Name"].str.extract(r' ([A-Za-z]+)\.',expand=False)
    df["Title"] = extracted_titles.map(titles)

In [None]:
create_Title(train_df)
create_Title(test_df)
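
To make the extraction step concrete, here is a minimal sketch that applies the same regular expression to three names in the dataset's "Last, Title. First" layout. The shortened mapping is only for illustration and covers just the titles that appear in this small sample.

```python
import pandas as pd

# Three names in the "Last, Title. First" layout used by the Name column
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the run of letters that sits between a space and a period
extracted = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Collapse uncommon titles into "Rare" (illustrative subset of the mapping)
grouped = extracted.map({"Mr": "Mr", "Miss": "Miss", "Countess": "Rare"})
print(grouped.tolist())
```

Any title that is missing from the mapping would become NaN after `map`, so it is worth checking `Title.isnull().sum()` after creating the feature.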

In [None]:
train_df.groupby('Title')['Age'].median()

In [None]:
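# Restrict to numeric columns; recent pandas versions no longer drop text columns silently
df.select_dtypes(include='number').corr()['Age'].abs().sort_values(ascending=False)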

In [None]:
train_df.groupby(['Title','Pclass'])['Age'].median()

Pclass is correlated with Age, so I decided to group the data by Title and Pclass and fill the missing Age values with the median of each group.

In [None]:
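# Imputing the 'Age' feature with the median of each (Title, Pclass) group.
# transform('median') returns one value per row, aligned with the original index.
train_df["Age"] = train_df["Age"].fillna(train_df.groupby(['Title','Pclass'])["Age"].transform('median'))
test_df["Age"] = test_df["Age"].fillna(test_df.groupby(['Title','Pclass'])["Age"].transform('median'))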
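
The group-wise imputation can be checked on a small example. The sketch below uses a made-up six-row frame (the values are invented for illustration) and fills each missing age with the median of its (Title, Pclass) group.

```python
import numpy as np
import pandas as pd

# Invented toy data: one missing age in each of two groups
toy = pd.DataFrame({
    "Title":  ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss"],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [40.0, 50.0, np.nan, 4.0, np.nan, 10.0],
})

# One median per row, aligned with the original index
group_median = toy.groupby(["Title", "Pclass"])["Age"].transform("median")

# Only the missing entries are replaced; observed ages stay untouched
toy["Age"] = toy["Age"].fillna(group_median)
print(toy)
```

If a group contains no observed ages at all, its median is NaN and the gap remains, so a second fallback (for example the overall median) may still be needed.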

#### 2.1.2 Embarked

In [None]:
train_df[train_df['Embarked'].isnull()]

When I googled those names, I learned that they boarded the Titanic at Southampton. I will fill the missing values in 'Embarked' with 'S', representing Southampton.

https://www.encyclopedia-titanica.org/titanic-survivor/martha-evelyn-stone.html

In [None]:
#Imputing 'Embarked' features
train_df["Embarked"] = train_df["Embarked"].fillna("S")

#### 2.1.3 Cabin

In [None]:
train_df[train_df['Cabin'].isnull()].head()

Although the 'Cabin' feature has too many missing values, I won't drop the column for now. I will fill the missing values with 'U0' to mark them as unknown, and then try to extract useful information from the 'Cabin' column.

In [None]:
# Imputing the 'Cabin' feature: 'U0' marks an unknown cabin
train_df["Cabin"] = train_df["Cabin"].fillna("U0")
test_df["Cabin"] = test_df["Cabin"].fillna("U0")
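
One way to extract information from 'Cabin' is to keep only its first letter, which denotes the deck. This is a sketch of that idea and not part of the pipeline in this notebook; the `deck` variable is a hypothetical name, and it assumes missing cabins have been filled with 'U0' as described above.

```python
import pandas as pd

# A few cabin values in the dataset's format; None stands for a missing cabin
cabins = pd.Series(["C85", None, "E46", "B57 B59 B63 B66", None])

# Mark unknown cabins with 'U0', then keep the leading deck letter
filled = cabins.fillna("U0")
deck = filled.str[0]
print(deck.tolist())
```

With this encoding, 'U' becomes its own category, so a model can distinguish passengers with a recorded cabin from those without one.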

#### 2.1.4 Fare

In [None]:
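# Restrict to numeric columns; recent pandas versions no longer drop text columns silently
df.select_dtypes(include='number').corr()['Fare'].abs().sort_values(ascending = False)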

In [None]:
test_df[test_df['Fare'].isnull()]

In [None]:
test_df[test_df['Ticket']=='3701']

In [None]:
# Imputing the 'Fare' feature with the median fare of the passenger's class
test_df["Fare"] = test_df["Fare"].fillna(test_df.groupby('Pclass')["Fare"].transform('median'))

In [None]:
explore_missing_values(train_df)

In [None]:
explore_missing_values(test_df)

## 2.2 Outliers

### Fare

In [None]:
sns.boxplot(x=train_df['Fare']);

In [None]:
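# Note: despite the names, these are the 5th and 95th percentiles,
# a wider band than the classic quartiles (0.25 and 0.75)
Q1 = train_df['Fare'].quantile(0.05)
Q3 = train_df['Fare'].quantile(0.95)
IQR = Q3-Q1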

In [None]:
top_border_fare = Q3+1.5*IQR
top_border_fare

In [None]:
train_df.loc[train_df['Fare'] > top_border_fare,'Fare'] = top_border_fare
test_df.loc[test_df['Fare'] > top_border_fare,'Fare'] = top_border_fare
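
For comparison, here is a small sketch of the same capping idea on invented fares. Note that it uses the classic quartiles (0.25 and 0.75) instead of the 5th and 95th percentiles used above, so the threshold is stricter; `Series.clip` does the capping in one call.

```python
import pandas as pd

# Invented fares with one extreme value
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 31.0, 512.33])

# Upper fence from the classic interquartile range
q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)

# Values above the fence are pulled down to it; the rest are unchanged
capped = fares.clip(upper=upper)
print(upper)
print(capped.tolist())
```

As in the cells above, the fence should be computed on the training data and then reused for the test data.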

In [None]:
sns.boxplot(x=train_df['Fare']);


## 2.3 Exploring Target Variable

In [None]:
train_df['Survived'].value_counts()

In [None]:
train_df['Survived'].describe().T

In [None]:
sns.countplot(x='Survived',data=train)

## 2.4 Analyzing Features 

In [None]:
train_df.head()

In [None]:
categorical_features = ['Pclass','Sex','SibSp','Parch','Embarked']
continuous_features =['Age','Fare']

### 2.4.1 Categorical Features

In [None]:
def visualize_categorical_columns(df,col_list,hue='Survived'):
    for col in col_list:
        # hue='Survived'
        sns.countplot(x=col,data=df,hue=hue)
        plt.show()
    return

In [None]:
visualize_categorical_columns(train_df, categorical_features)

###  2.4.2 Continuous Features 

In [None]:
def visualize_continuous_columns(df,col_list):
    for col in col_list:
        # histplot with a KDE overlay replaces the deprecated distplot
        sns.histplot(df[col], kde=True)
        plt.show()
    return

In [None]:
visualize_continuous_columns(train_df, continuous_features)

## 2.5 Exploring Correlations

In [None]:
# Correlations are only defined for numeric columns
corr = train_df.select_dtypes(include='number').corr()
corr

In [None]:
sns.heatmap(corr, annot = True, fmt='.1g')

In [None]:
corr['Survived'].abs().sort_values(ascending = False)

# 3. Feature Engineering

## 3.1 Binning Continuous Features

In [None]:
# Binning 'Age' column
train_df['AgeBinCode'] = LabelEncoder().fit_transform(pd.qcut(train_df["Age"],4))
test_df['AgeBinCode'] = LabelEncoder().fit_transform(pd.qcut(test_df["Age"],4))

train_df['AgeBinCode'].value_counts()
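
Because the cell above bins the train and test sets separately, the same code can refer to different age ranges in each set. A possible alternative, sketched here on invented ages, is to learn the bin edges on the training data with `retbins=True` and reuse them for the test data with `pd.cut`. This is a suggestion under those assumptions, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Invented ages for illustration
train_age = pd.Series([2, 18, 22, 25, 30, 35, 48, 60], dtype=float)
test_age = pd.Series([1, 20, 33, 70], dtype=float)

# Learn quartile edges on the training ages only
train_code, edges = pd.qcut(train_age, 4, labels=False, retbins=True)

# Widen the outer edges so unseen test ages still fall into a bin
edges[0], edges[-1] = -np.inf, np.inf

# Apply the training edges to the test ages
test_code = pd.cut(test_age, bins=edges, labels=False)
print(train_code.tolist(), test_code.tolist())
```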

In [None]:
sns.countplot(x=train_df['AgeBinCode'], hue='Survived', data=train_df)
plt.xticks(rotation=75)

In [None]:
# Binning 'Fare' column
train_df['FareBinCode'] = LabelEncoder().fit_transform(pd.qcut(train_df["Fare"],5))
test_df['FareBinCode'] = LabelEncoder().fit_transform(pd.qcut(test_df["Fare"],5))

train_df['FareBinCode'].value_counts()

In [None]:
sns.countplot(x=train_df['FareBinCode'], hue='Survived', data=train_df)
plt.xticks(rotation=75)

## 3.2 Creating New Features

In [None]:
# Creating FamilySize features
train_df['FamilySize'] = train_df['Parch'] + train_df['SibSp']
test_df['FamilySize'] = test_df['Parch'] + test_df['SibSp']

In [None]:
sns.countplot(x=train_df['FamilySize'], hue='Survived', data=train_df)
plt.xticks(rotation=75)

In [None]:
# Creating LastName features
train_df['LastName'] = train_df['Name'].apply(lambda x: str.split(x, ",")[0])
test_df['LastName'] =test_df['Name'].apply(lambda x: str.split(x, ",")[0]) 
df['LastName'] = df['Name'].apply(lambda x: str.split(x, ",")[0])

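This feature comes from Konstantin's kernel. The FamilySurvival variable groups together families and people travelling on the same ticket, and encodes what is known about the survival of the other group members: 1 if any of them survived, 0 if none of them did, and 0.5 if nothing is known.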
https://www.kaggle.com/konstantinmasich/titanic-0-82-0-83

In [None]:
#Creating FamilySurvival features

DEFAULT_SURVIVAL_VALUE = 0.5

df['FamilySurvival'] = DEFAULT_SURVIVAL_VALUE

for grp, grp_df in df[['Survived', 'Name', 'LastName', 'Fare', 'Ticket', 'PassengerId',
                            'SibSp', 'Parch', 'Age', 'Cabin']].groupby(['LastName', 'Fare']):
    

    if (len(grp_df) != 1):
        # A Family group is found.
        for ind, row in grp_df.iterrows():
            smax = grp_df.drop(ind)['Survived'].max()
            smin = grp_df.drop(ind)['Survived'].min()
            passID = row['PassengerId']
            # Write to 'FamilySurvival' (the column created above), not a new 'Family_Survival' column
            if (smax == 1.0):
                df.loc[df['PassengerId'] == passID, 'FamilySurvival'] = 1
            elif (smin == 0.0):
                df.loc[df['PassengerId'] == passID, 'FamilySurvival'] = 0

print("Number of passengers with family survival information:",
      df.loc[df['FamilySurvival'] != 0.5].shape[0])



for _, grp_df in df.groupby('Ticket'):
    if (len(grp_df) != 1):
        for ind, row in grp_df.iterrows():
            if (row['FamilySurvival'] == 0) | (row['FamilySurvival'] == 0.5):
                smax = grp_df.drop(ind)['Survived'].max()
                smin = grp_df.drop(ind)['Survived'].min()
                passID = row['PassengerId']
                if (smax == 1.0):
                    df.loc[df['PassengerId'] == passID, 'FamilySurvival'] = 1
                elif (smin == 0.0):
                    df.loc[df['PassengerId'] == passID, 'FamilySurvival'] = 0

print("Number of passenger with family/group survival information: "
      + str(df[df['FamilySurvival'] != 0.5].shape[0]))



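# Split positionally: the first len(train_df) rows of df are the training rows
train_df['FamilySurvival'] = df['FamilySurvival'].iloc[:len(train_df)]
test_df['FamilySurvival'] = df['FamilySurvival'].iloc[len(train_df):]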
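
The rule behind FamilySurvival is easier to follow on a tiny example. The sketch below uses an invented five-row frame and only the (LastName, Fare) grouping; the ticket-based pass works the same way. NaN in 'Survived' stands for a test-set row whose outcome is unknown.

```python
import numpy as np
import pandas as pd

# Invented passengers; NaN in Survived marks a test-set row
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "LastName": ["Smith", "Smith", "Jones", "Jones", "Brown"],
    "Fare": [20.0, 20.0, 8.0, 8.0, 7.0],
    "Survived": [1.0, np.nan, 0.0, 0.0, 1.0],
})
toy["FamilySurvival"] = 0.5  # default: nothing known

for _, grp in toy.groupby(["LastName", "Fare"]):
    if len(grp) == 1:
        continue  # nobody else in the group to learn from
    for idx in grp.index:
        # Look only at the other members, never at the passenger's own label
        others = grp.drop(idx)["Survived"]
        if others.max() == 1.0:
            toy.loc[idx, "FamilySurvival"] = 1.0
        elif others.min() == 0.0:
            toy.loc[idx, "FamilySurvival"] = 0.0

print(toy[["PassengerId", "FamilySurvival"]])
```

Passenger 1 keeps the default 0.5 because the only other Smith has an unknown outcome, while passenger 2 gets 1 because passenger 1 survived.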


##  3.3 Feature Selecting

In [None]:
sns.heatmap(train_df.select_dtypes(include='number').corr())

In [None]:
train_df.select_dtypes(include='number').corr()['Survived'].abs().sort_values(ascending=False)

In [None]:
features = ['Pclass', 'Sex', 'AgeBinCode', 'FareBinCode', 'FamilySurvival','FamilySize']
target = ['Survived']


In [None]:
# X_train, X_test, y_train

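# Copy the selections so later column assignments don't act on a view of train_df / test_df
X_train = train_df[features].copy()
y_train = train_df[target]
X_test = test_df[features].copy()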


## 3.5 Feature Transformation (Categoric Variables)

In [None]:
X_train.head()

In [None]:
def create_dummies(df,categorical_features):
    for column_name in categorical_features:
        dummies = pd.get_dummies(df[column_name], prefix=column_name, drop_first=True)
        df = pd.concat([df,dummies],axis=1)
    return df

In [None]:
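# Fit the encoder on the training data and reuse it so both sets share the same mapping
sex_encoder = LabelEncoder().fit(X_train['Sex'])
X_train['Sex'] = sex_encoder.transform(X_train['Sex'])
X_test['Sex'] = sex_encoder.transform(X_test['Sex'])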

In [None]:
X_test.head()

## 3.4 Feature Scaling(Continuous Variables)

In [None]:
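# Fit the scaler on the training data only, then apply the same scaling to the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
y_train = train_df[target].values.ravel()  # 1-D target, as scikit-learn expects
X_test = scaler.transform(X_test)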

# 4. Modelling

## 4.1 Model Training

In [None]:
models = [
    ('KNN',KNeighborsClassifier()),
    ('DT', DecisionTreeClassifier()),
    ('NB', GaussianNB()),
    ('SVM',SVC()),
    ('RF', RandomForestClassifier()),
]

results = []
names = []
for name, model in models:
    # random_state only takes effect (and is only accepted) together with shuffle=True
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
    cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring="accuracy")
    results.append(cv_results)
    names.append(name)
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))
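
One refinement worth considering: when the scaler is fitted on the whole training set before cross-validation, each validation fold has already influenced the scaling. Wrapping the scaler and the model in a pipeline refits the scaler inside every fold. The sketch below shows the pattern on synthetic data (the features and labels are randomly generated, not Titanic data), and the choice of `StratifiedKFold` with 5 splits is an assumption for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales; the label depends on the first one
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * [1, 10, 100, 1000]
y = (X[:, 0] > 0).astype(int)

# The scaler is refitted on the training part of each fold
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print("%f (%f)" % (scores.mean(), scores.std()))
```

The same pipeline object can be passed to `GridSearchCV`, with parameter names prefixed by the step name (for example `kneighborsclassifier__n_neighbors`).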


## 4.2 Model Tuning

In [None]:
def tuning_model(model,hyperparams_dict):
    grid = GridSearchCV(model,
                        param_grid=hyperparams_dict,
                        cv=10,
                        n_jobs=-1,
                        verbose=2)
    grid.fit(X_train, y_train)
    best_params = grid.best_params_
    best_score = grid.best_score_
  
    print("Best Score: {}".format(best_score))
    print("Best Parameters: {}\n".format(best_params))

    return grid.best_estimator_

#### KNN

In [None]:
knn_hyperparams = { "n_neighbors" : list(range(1,30,1)),
"algorithm" : ['auto'],
"weights" : ['uniform', 'distance'],
"leaf_size" : list(range(1,50,5))
}

knn_tuned = tuning_model(KNeighborsClassifier(),knn_hyperparams)

#### RANDOM FOREST

In [None]:
rf_hyperparams = {"n_estimators": [40, 60, 90],
"criterion": ["entropy", "gini"],
"max_depth": [2, 5, 10],
"max_features": ["log2", "sqrt"],
"min_samples_leaf": [1, 5, 8],
"min_samples_split": [2, 3, 5] }

rf_tuned = tuning_model(RandomForestClassifier(), rf_hyperparams)

# 5. Making A Submission

In [None]:
def save_submission_file(model,X_test,filename="submission.csv"):
    submission_df = {"PassengerId": test['PassengerId'],
                     "Survived": model.predict(X_test)}
    submission = pd.DataFrame(submission_df)
    submission.to_csv(filename,index=False)

In [None]:
save_submission_file(knn_tuned,X_test,filename='knn_submission.csv')

In [None]:
save_submission_file(rf_tuned,X_test,filename='rf_submission.csv')