# Tabular Playground Series - Apr 2021
In this notebook, we analyse the synthetic `Titanic Dataset`, which was generated using CTGAN. The goal is to build a machine learning model that predicts the `Survived` field from the 11 other variables. Submissions are evaluated on the `accuracy` of the model.
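Accuracy is simply the share of predictions that match the true label. As a minimal sketch (the two arrays below are made-up values, not competition data):

```python
import numpy as np

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 1, 1, 0])

# Accuracy = number of correct predictions / number of predictions.
accuracy = (y_true == y_pred).mean()
print(accuracy)
```

Because accuracy weights every row equally, a model that only predicts the majority class already scores the majority-class share, which is a useful baseline to keep in mind.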

# Data Dictionary
| Variable | Definition | Key |
| -------- | ---------- | --- |
| survival | Survival  |0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

# Variable Notes
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

# Load the Dataset 
In this section, we import all the useful libraries and load the dataset into the notebook.

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import boxcox

import matplotlib.pyplot as plt
import seaborn as sns

import missingno

plt.style.use('dark_background')

from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier, RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

import warnings
warnings.filterwarnings('ignore')

In [None]:
train_df = pd.read_csv('../input/tabular-playground-series-apr-2021/train.csv')
train_df.head()

In [None]:
test_df = pd.read_csv('../input/tabular-playground-series-apr-2021/test.csv')
test_df.head()

# Perform statistics operation
In this section, we compute basic summary statistics such as the mean, standard deviation, min, and max.

In [None]:
train_df.describe()

One odd observation is in the `Age` column: its minimum is 0.080, which looks implausible at first sight. According to the variable notes, ages below 1 are fractional, so this would correspond to an infant of about one month. We still need to check such values and decide whether to replace them.
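To see how many rows carry an age below one year before deciding what to do with them, a quick count can help. This is a minimal sketch on a made-up frame (`toy` and its values are assumptions; in the notebook the same filter would be applied to `train_df`):

```python
import pandas as pd

# Hypothetical sample; None stands for a missing age.
toy = pd.DataFrame({"Age": [0.08, 0.92, 4.0, 22.0, 35.5, None]})

# Rows whose age is recorded as a fraction of a year.
infants = toy[toy["Age"] < 1]

# Share of such rows among the rows that have an age at all.
infant_share = len(infants) / toy["Age"].notna().sum()
print(len(infants), infant_share)
```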

In [None]:
test_df.describe()

In [None]:
missingno.bar(train_df, color='orangered');

Since the `Cabin` column has a large share of missing values, filling it with arbitrary values is not a good idea. Instead, we will drop the raw column later (after keeping only its first letter, with a placeholder for missing entries) and fill the remaining missing values with the help of the EDA.
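The bar chart can be backed up with numbers. The sketch below computes the share of missing values per column and lists the columns above a cut-off; the frame `toy` and the 50% threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in Cabin and Age.
toy = pd.DataFrame({
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})

# isna().mean() gives the fraction of missing values in each column.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Columns that are missing in more than half of the rows (assumed threshold).
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
print(missing_share)
print(mostly_missing)
```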

# Exploratory Data Analysis
In this section, we perform Exploratory Data Analysis (EDA) to understand the dataset and find useful patterns between the different variables.

## Univariate

In [None]:
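# Take the labels from the counts themselves so they cannot get out of order
sex_counts = train_df.Sex.value_counts()
plt.pie(sex_counts, labels=sex_counts.index.str.capitalize(), colors=['orangered', 'lightsalmon'], autopct="%1.2f%%")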
plt.title('Sex Distribution Graph', fontweight='bold', fontsize=18);

In [None]:
def univariate_graph(title, xlabel, x, y, ylabel='Frequency'):
    plt.bar(x, y, color='orangered')
    plt.title(title, fontweight='bold', fontsize=14)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.show();

In [None]:
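# value_counts() sorts by frequency, so label each bar from its own index value
survived_counts = train_df.Survived.value_counts().rename({0: 'Not Survived', 1: 'Survived'})
univariate_graph(x=survived_counts.index,
                y=survived_counts,
                title='Survived Distribution',
                xlabel='Survived')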

In [None]:
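# Map the class codes to names instead of relying on the frequency order
pclass_counts = train_df.Pclass.value_counts().rename({1: 'Upper', 2: 'Middle', 3: 'Lower'})
univariate_graph(x=pclass_counts.index,
                y=pclass_counts,
                title='Pclass Distribution',
                xlabel='Pclass')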

In [None]:
plt.hist(train_df.Age, bins=10, color='orangered')
plt.title('Age Distribution', fontweight='bold', fontsize=14)
plt.xlabel('Age')
plt.ylabel('Frequency');

So most of the passengers are between 20 and 40 years old. Again, it seems unlikely that passengers with an age close to 0 (or below 5) were travelling on the ship, so we need to handle such cases before fitting the model.

In [None]:
univariate_graph(x=train_df.SibSp.value_counts().index,
                y=train_df.SibSp.value_counts(),
                title='Sibling/Spouse Distribution',
                xlabel='SibSp')

So, most of the passengers on the Titanic came alone. Some of them came as a couple or with a sibling, while others came with their family.

In [None]:
univariate_graph(x=train_df.Parch.value_counts().index,
                y=train_df.Parch.value_counts(),
                title='Parch Distribution',
                xlabel='Parch')

In [None]:
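# Map the port codes to names instead of relying on the frequency order
embarked_counts = train_df.Embarked.value_counts().rename({'S': 'Southampton', 'C': 'Cherbourg', 'Q': 'Queenstown'})
univariate_graph(x=embarked_counts.index,
                y=embarked_counts,
                title='Embarked Distribution',
                xlabel='Embarked')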

So most of the passengers embarked at `Southampton`.

In [None]:
plt.hist(train_df.Fare, bins=5, color='orangered')
plt.title('Fare Distribution', fontweight='bold', fontsize=14)
plt.xlabel('Fare')
plt.ylabel('Frequency');

## Bivariate

In [None]:
sample_col = [col for col in train_df.columns if pd.api.types.is_numeric_dtype(train_df[col])]
plt.style.use('dark_background')
data = train_df.dropna()
plt.boxplot(data[sample_col[1:]], patch_artist=True, labels=sample_col[1:])
plt.title('Outlier Chart', fontsize=24, fontweight='bold');

So, `Fare` clearly has outliers. `SibSp` and `Parch` need a closer look, because from this chart alone they do not appear to have any.

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))
ax1.boxplot(train_df['SibSp'], patch_artist=True, labels=['SibSp'])
ax1.set_title('SibSp Outlier Chart', fontsize=18, fontweight='bold')
ax2.boxplot(train_df['Parch'], patch_artist=True, labels=['Parch'])
ax2.set_title('Parch Outlier Chart', fontsize=18, fontweight='bold')
data = train_df.dropna()
ax3.boxplot(data['Age'], patch_artist=True, labels=['Age'])
ax3.set_title('Age Outlier Chart', fontsize=18, fontweight='bold');

We do find outliers in `SibSp`, `Parch` and `Age` when we check these columns more closely.
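By default, a box plot draws as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. The same rule can be applied numerically to count them. The helper name `iqr_outlier_mask` and the sample values are hypothetical:

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [q1 - k*IQR, q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up SibSp-like counts: mostly small, one large family.
sibsp = pd.Series([0, 0, 0, 1, 1, 0, 2, 8])
mask = iqr_outlier_mask(sibsp)
print(sibsp[mask].tolist())
```

Counting with `mask.sum()` per column gives a rough idea of how much data an outlier treatment would touch.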

In [None]:
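# Correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(train_df.select_dtypes('number').corr(), annot=True, cmap="YlOrBr");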

## TODO:
   * Handle the outliers in columns: (Fare, Age, SibSp, Parch)

In [None]:
# Reindex both counts to the same order so every bar is stacked on the matching category
sexes = ['female', 'male']
survived = train_df.loc[train_df['Survived'] == 1, 'Sex'].value_counts().reindex(sexes, fill_value=0)
not_survived = train_df.loc[train_df['Survived'] == 0, 'Sex'].value_counts().reindex(sexes, fill_value=0)
plt.bar(sexes, survived, width=0.3, color='orangered')
plt.bar(sexes, not_survived, bottom=survived, width=0.3, color='lightsalmon')
plt.legend(['Survived', 'NotSurvived'])
plt.title('Sex Survived Relationship', fontsize=18, fontweight='bold')
plt.show();

In [None]:
# Reindex by class code (1, 2, 3) so the counts line up with the Upper/Middle/Lower labels
classes = [1, 2, 3]
labels = ['Upper', 'Middle', 'Lower']
survived = train_df.loc[train_df['Survived'] == 1, 'Pclass'].value_counts().reindex(classes, fill_value=0)
not_survived = train_df.loc[train_df['Survived'] == 0, 'Pclass'].value_counts().reindex(classes, fill_value=0)
plt.bar(labels, survived, color='orangered')
plt.bar(labels, not_survived, color='lightsalmon', bottom=survived)
plt.title('Pclass Survived Relationship', fontsize=18, fontweight='bold')
plt.legend(['Survived', 'NotSurvived'])
plt.show();

In [None]:
# Use the observed SibSp values as x positions and align both counts to them
sibsp_values = sorted(train_df['SibSp'].unique())
survived = train_df.loc[train_df['Survived'] == 1, 'SibSp'].value_counts().reindex(sibsp_values, fill_value=0)
not_survived = train_df.loc[train_df['Survived'] == 0, 'SibSp'].value_counts().reindex(sibsp_values, fill_value=0)
plt.bar(sibsp_values, survived, color='orangered')
plt.bar(sibsp_values, not_survived, color='lightsalmon', bottom=survived)
plt.title('SibSp Survived Relationship', fontsize=18, fontweight='bold')
plt.legend(['Survived', 'NotSurvived'])
plt.show();

# Handle Missing Value
In this section, we handle the missing values present in the dataset. In some cases we drop the column; in others we fill the gaps using the median or the mode.

In [None]:
train_df['train_test'] = 1
test_df['train_test'] = 0
train_copy = train_df.drop('Survived', axis=1)
combine_df = pd.concat([train_copy, test_df])
combine_df.head()

In [None]:
# Fraction of missing values per column, printed only for columns that have any
missing_share = combine_df.isna().mean()
missing_share = missing_share[missing_share > 0]
if missing_share.empty:
    print('No missing value found!!')
else:
    for col, share in missing_share.items():
        print(f"{col}: {share}")

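As noted above, the `Cabin` column is mostly missing. Before dropping it, we fill the missing entries with a placeholder `X` and keep only the first letter of each value as `Update_Cabin`.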

In [None]:
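# Assign the result back; an in-place fillna on a column accessor may not update the frame
combine_df['Cabin'] = combine_df['Cabin'].fillna('X')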
combine_df.head()

In [None]:
# Keep only the first character of the cabin (the deck letter, or 'X' for missing)
combine_df['Update_Cabin'] = combine_df['Cabin'].str[0]
combine_df.head()

In [None]:
combine_df.drop('Cabin', axis=1, inplace=True)

In [None]:
combine_df['Age'] = combine_df['Age'].fillna(combine_df['Age'].median())
combine_df['Embarked'] = combine_df['Embarked'].fillna(combine_df['Embarked'].mode()[0])
combine_df['Fare'] = combine_df['Fare'].fillna(combine_df['Fare'].median())
combine_df['Ticket'] = combine_df['Ticket'].fillna(combine_df['Ticket'].mode()[0])

In [None]:
# Re-check: fraction of missing values per column after imputation
missing_share = combine_df.isna().mean()
missing_share = missing_share[missing_share > 0]
if missing_share.empty:
    print('No missing value found!!')
else:
    for col, share in missing_share.items():
        print(f"{col}: {share}")

We have dealt with all the missing values present in our dataset. Now it is time to perform more EDA to check how the features are distributed and whether they are linearly related. This will help us choose the estimators for training the ML model.

# Handle Quantile in Dataset
In this section, we apply a Box-Cox transform to `Age` and `Fare` and then handle the extreme values using quantiles (the interquartile range), to make the dataset more consistent.

In [None]:
combine_df['Update_Age'] = boxcox(combine_df.Age)[0]
combine_df['Update_Fare'] = boxcox(combine_df.Fare)[0]
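When no lambda is passed, `boxcox` returns the transformed values together with the fitted lambda, which is why the cell above indexes `[0]`. It also expects strictly positive input. The sketch below, on made-up fares, keeps the lambda so the transform can be undone with `scipy.special.inv_boxcox`:

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox

# Hypothetical, strictly positive fares (Box-Cox is undefined for values <= 0).
fare = np.array([7.25, 8.05, 13.0, 26.55, 71.28, 512.33])

# boxcox returns (transformed values, fitted lambda) when lmbda is not given.
transformed, lam = boxcox(fare)

# Keeping lambda allows mapping transformed values back to the original scale.
restored = inv_boxcox(transformed, lam)
print(lam)
```

Storing the fitted lambda is worth considering whenever the same transform must later be applied to new data or inverted.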

In [None]:
combine_df.head()

In [None]:
quantile_col = ['Parch', 'SibSp', 'Update_Fare']
for col in quantile_col:
    q1 = combine_df[col].quantile(0.25)
    q3 = combine_df[col].quantile(0.75)
    IQR = q3 - q1
    # Cap only the values beyond the 1.5 * IQR fences; values inside the fences stay unchanged
    combine_df[col] = combine_df[col].clip(lower=q1 - (1.5 * IQR), upper=q3 + (1.5 * IQR))

In [None]:
new_combine_df = combine_df.drop(['Age', 'Fare'], axis=1)

In [None]:
sample_col = [col for col in new_combine_df.columns if pd.api.types.is_numeric_dtype(new_combine_df[col])]
plt.style.use('dark_background')
plt.boxplot(new_combine_df[sample_col[1:]], patch_artist=True, labels=sample_col[1:])
plt.title('Outlier Chart', fontsize=24, fontweight='bold');

In [None]:
fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, figsize=(20, 5))
ax1.boxplot(new_combine_df['SibSp'], patch_artist=True, labels=['SibSp'])
ax1.set_title('SibSp Outlier Chart', fontsize=18, fontweight='bold')
ax2.boxplot(new_combine_df['Parch'], patch_artist=True, labels=['Parch'])
ax2.set_title('Parch Outlier Chart', fontsize=18, fontweight='bold')
ax3.boxplot(new_combine_df['Update_Age'], patch_artist=True, labels=['Age'])
ax3.set_title('Age Outlier Chart', fontsize=18, fontweight='bold');
ax4.boxplot(new_combine_df['Update_Fare'], patch_artist=True, labels=['Fare'])
ax4.set_title('Fare Outlier Chart', fontsize=18, fontweight='bold');

# More EDA
In this section, we perform more EDA to look at the distribution of the individual columns and the linear relationships between the different features in the dataset.

## TODO:
* Handle the Age column and perform normalization.
* Handle the Fare column and perform normalization.

In [None]:
new_combine_df.head()

In [None]:
normal_col = ['Update_Age', 'Update_Fare', 'SibSp', 'Parch', 'Pclass']
for col in normal_col:
    sns.histplot(new_combine_df[col], kde=True, color='orangered')  # distplot is deprecated in seaborn
    plt.show();

In [None]:
new_combine_df['Family'] = new_combine_df['SibSp'] + new_combine_df['Parch']
new_combine_df.head()

In [None]:
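# Correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(new_combine_df.select_dtypes('number').corr(), annot=True, cmap='YlOrBr');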

# Feature Engineering
In this section, we create some new columns from the existing ones.

In [None]:
new_combine_df.head()

In [None]:
# Frequency encoding: map each category to the number of rows in which it appears
count_cols = {'Pclass': 'Pclass_Count',
              'Embarked': 'Embarked_Count',
              'Sex': 'Sex_Count',
              'Update_Cabin': 'Cabin_Count'}
for source, target in count_cols.items():
    new_combine_df[target] = new_combine_df[source].map(new_combine_df[source].value_counts())

value_list = []
for value in new_combine_df.Ticket:
    if (len(value.split(' ')) > 1):
        value_list.append(value.split(' ')[0])
    else:
        value_list.append('X')
new_combine_df['Ticket_Category'] = value_list


In [None]:
new_combine_df.head()

# Handling Categorical Datatypes
In this section, we transform the categorical data into numerical data.
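`LabelEncoder` assigns integer codes in the sorted order of the distinct values and stores that order in `classes_`. A small sketch on made-up values:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical column values, for illustration only.
sex = ["male", "female", "female", "male"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sex)

# classes_ holds the sorted categories; a code is the position in that array.
print(list(encoder.classes_), list(codes))

# The fitted encoder can map codes back to the original labels.
decoded = encoder.inverse_transform([0, 1])
```

Note that calling `fit_transform` again on another column refits the same object, so `inverse_transform` then only applies to the most recently fitted column. Keeping one encoder per column is an option if the mapping needs to be recovered later.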

In [None]:
sample_ds = []
for value in new_combine_df.Ticket:
    if(len(value.split(' ')) > 1):
        if(value.split(' ')[1] == ''):
            sample_ds.append(np.nan)
        else:
            sample_ds.append(value.split(' ')[1])
    else:
        sample_ds.append(value)
new_combine_df['Update_Ticket'] = sample_ds

In [None]:
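# Fill only the ticket-number column, not every column of the frame
new_combine_df['Update_Ticket'] = new_combine_df['Update_Ticket'].fillna(new_combine_df['Update_Ticket'].mode()[0])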

In [None]:
new_combine_df.Update_Ticket = new_combine_df.Update_Ticket.astype('int32')

In [None]:
min_value = new_combine_df.Update_Ticket.min()
max_value = new_combine_df.Update_Ticket.max()
value_ds = []
for value in new_combine_df.Update_Ticket:
    value_ds.append((value - min_value)/(max_value - min_value))

In [None]:
new_combine_df['Update_Ticket'] = value_ds
new_combine_df.head()

In [None]:
quantile_col = ['Update_Ticket']
for col in quantile_col:
    q1 = new_combine_df[col].quantile(0.25)
    q3 = new_combine_df[col].quantile(0.75)
    IQR = q3 - q1
    # Cap only the values beyond the 1.5 * IQR fences; values inside the fences stay unchanged
    new_combine_df[col] = new_combine_df[col].clip(lower=q1 - (1.5 * IQR), upper=q3 + (1.5 * IQR))

In [None]:
encoder = LabelEncoder()
value = encoder.fit_transform(new_combine_df.Sex)
new_combine_df.Sex = value

In [None]:
value = encoder.fit_transform(new_combine_df.Embarked)
new_combine_df.Embarked = value

In [None]:
value = encoder.fit_transform(new_combine_df.Update_Cabin)
new_combine_df.Update_Cabin = value

In [None]:
value = encoder.fit_transform(new_combine_df.Ticket_Category)
new_combine_df.Ticket_Category = value

In [None]:
new_combine_df.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace=True)

In [None]:
new_combine_df.head()

In [None]:
new_combine_df['PSib'] = new_combine_df['Pclass']
new_combine_df['PSib'] = new_combine_df['PSib'].map(new_combine_df.groupby('Pclass')['SibSp'].mean())

In [None]:
new_combine_df.head()

In [None]:
new_combine_df['ASex'] = new_combine_df['Ticket_Category']
new_combine_df['ASex'] = new_combine_df['ASex'].map(new_combine_df.groupby('Ticket_Category')['Update_Cabin'].mean())

In [None]:
new_combine_df['PCabin'] = new_combine_df['Pclass']
new_combine_df['PCabin'] = new_combine_df['PCabin'].map(new_combine_df.groupby('Pclass')['Update_Cabin'].mean())
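The `PSib`, `ASex` and `PCabin` columns are group-mean encodings: each row receives the mean of one column within its group. As a sketch on a made-up frame (`toy` and its values are assumptions), `groupby(...).transform('mean')` yields the same values as the `map` pattern used above:

```python
import pandas as pd

# Hypothetical sample: class 1 -> mean 0.5, class 2 -> 1.0, class 3 -> 2.0.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "SibSp": [0, 1, 1, 0, 1, 5],
})

# transform('mean') broadcasts each group's mean back onto its rows.
toy["PSib"] = toy.groupby("Pclass")["SibSp"].transform("mean")

# Equivalent result via the map pattern from the notebook.
via_map = toy["Pclass"].map(toy.groupby("Pclass")["SibSp"].mean())
print(toy)
```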

# Prepare Training and Testing Data
In this section, we separate the dataset back into the training and testing parts, which we had combined earlier for preprocessing and transformation.
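The combine-then-split pattern can be sketched end to end on two tiny made-up frames (`train`, `test` and their values are assumptions). The flag column marks where each row came from, so the split restores exactly the original rows:

```python
import pandas as pd

# Hypothetical miniature train and test sets.
train = pd.DataFrame({"Pclass": [1, 3], "Survived": [1, 0]})
test = pd.DataFrame({"Pclass": [2, 3, 1]})

# Flag the origin of each row, then stack the features of both sets.
train["train_test"] = 1
test["train_test"] = 0
combined = pd.concat([train.drop(columns="Survived"), test], ignore_index=True)

# ... shared preprocessing would happen here ...

# Split back by the flag and drop it so it does not become a feature.
train_part = combined[combined["train_test"] == 1].drop(columns="train_test")
test_part = combined[combined["train_test"] == 0].drop(columns="train_test")
print(len(train_part), len(test_part))
```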

In [None]:
# drop() returns new frames, which avoids modifying a slice of new_combine_df in place
df_train = new_combine_df[new_combine_df['train_test'] == 1].drop(columns=['train_test'])
df_test = new_combine_df[new_combine_df['train_test'] == 0].drop(columns=['train_test'])

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
plt.figure(figsize=(10, 10))
sns.heatmap(df_train.corr(), annot=True);

In [None]:
del_col = ['Pclass', 'SibSp', 'Embarked', 'Sex', 'Cabin_Count', 'Parch']
df_train.drop(del_col, axis=1, inplace=True)
df_test.drop(del_col, axis=1, inplace=True)

# Model Training
In this section, we train the classification model.

In [None]:
df_train.head()

In [None]:
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(df_train, train_df.Survived, test_size=0.2)
len(X_train), len(y_train)

In [None]:
from sklearn.tree import DecisionTreeClassifier
decision_model = DecisionTreeClassifier()
decision_model.fit(X_train, y_train)
plt.barh(X_train.columns, decision_model.feature_importances_)
plt.title('Feature Importance');

In [None]:
col = ['Update_Ticket', 'Sex_Count', 'Update_Fare', 'Update_Age']
train = df_train[col]
X0, X1, y0, y1 = train_test_split(train, train_df.Survived, test_size=0.2)

In [None]:
params = {'ccp_alpha': 0.0,
 'criterion': 'friedman_mse',
 'init': None,
 'learning_rate': 0.1058986682719916,
 'loss': 'exponential',
 'max_depth': 5,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 # 'min_impurity_split' was removed in scikit-learn 1.0, so it is not passed here
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_iter_no_change': None,
 'random_state': None,
 'subsample': 1.0,
 'tol': 0.0001,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}

gradient_model = GradientBoostingClassifier(**params)
gradient_model.fit(X_train, y_train)
gradient_model.score(X_test, y_test)

In [None]:
gradient_model = GradientBoostingClassifier(learning_rate=0.1)
gradient_model.fit(X_train, y_train)

In [None]:
gradient_model.score(X_test, y_test)

In [None]:
lgbm_model = LGBMClassifier(n_estimators=40)
lgbm_model.fit(X_train, y_train)

In [None]:
lgbm_model.score(X_test, y_test)

In [None]:
from xgboost import XGBRFClassifier

xg_model = XGBRFClassifier()
xg_model.fit(X_train, y_train)
xg_model.score(X_test, y_test)

In [None]:
voting_model = VotingClassifier(estimators=[('gm', gradient_model), 
                                            ('xg', xg_model),
                                            ('lgbm', lgbm_model)
                                           ], voting='hard', verbose=True)
voting_model.fit(X_train, y_train)

In [None]:
voting_model.score(X_test, y_test)
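With `voting='hard'`, each fitted estimator casts one vote per sample and the majority label wins. For binary labels this reduces to the small computation sketched below; the prediction matrix is made up and does not come from the models above.

```python
import numpy as np

# Hypothetical predictions: one row per model, one column per sample.
preds = np.array([
    [1, 0, 1, 1],  # model A
    [0, 0, 1, 1],  # model B
    [1, 1, 0, 1],  # model C
])

# Count the votes for class 1 per sample.
votes_for_one = preds.sum(axis=0)

# Class 1 wins when it has more than half of the votes.
majority = (votes_for_one * 2 > preds.shape[0]).astype(int)
print(majority)
```

Using an odd number of models, as in the ensemble above, avoids ties in the binary case.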

# Submission

In [None]:
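def submission_file(model, filename='submission.csv'):
    # Predict on the test features and index the result by PassengerId
    y_preds = model.predict(df_test)
    submission = pd.DataFrame(y_preds, columns=['Survived'])
    submission.index = test_df.PassengerId
    submission.to_csv(filename)
    # Return the frame so the caller can inspect what was written
    return submission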

In [None]:
submission = submission_file(voting_model, filename='submission3.csv')
submission