<p style="text-align: center; font-weight: 500; font-size: 2em;"> 
Titanic Problem
</p>
<p style="text-align: center; font-weight: 500; font-size: 1.2em;"> 
February, 2022
</p>

# Introduction
In this notebook, I walk through the steps of a typical classification problem and compare several scikit-learn classifiers to **predict which passengers survived the Titanic shipwreck**.   

### Data
 
|No |Column       |Description                                                         |Dtype  |
|---|:------------|:-------------------------------------------------------------------|:-----:|
|0  |PassengerId  |Unique ID                                                           |int64  |
|1  |Survived     |Survival: 0 = No, 1 = Yes                                           |int64  |
|2  |Pclass       |Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd                             |int64  |
|3  |Name         |Passenger Name                                                      |object |
|4  |Sex          |Sex: male or female                                                 |object |
|5  |Age          |Age in years                                                        |float64|
|6  |SibSp        |number of siblings / spouses aboard the Titanic                     |int64  |
|7  |Parch        |number of parents / children aboard the Titanic                     |int64  |
|8  |Ticket       |Ticket number                                                       |object |
|9  |Fare         |Passenger fare                                                      |float64|
|10 |Cabin        |Cabin number                                                        |object |
|11 |Embarked     |Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton |object |

### Plan
1. Data cleaning (duplicates, data types, NAs)
2. Data exploration (distributions, pivot tables, correlations)
3. Feature Engineering
4. Data preprocessing for ML (fixing skewness, scaling, encoding, balancing)
5. Building and tuning models
    * Logistic Regression
    * Support Vector Machines
    * K-Nearest Neighbors
    * Random Forest
    * Extra Trees
    * Gradient Boosting
    * AdaBoost
    * Voting Classifier   
    

**My best result: 0.78468 accuracy (Top 16%)** for Voting Classifier
   
#### References
1. [**Titanic Project Example**](https://www.kaggle.com/kenjee/titanic-project-example) by Ken Jee
2. [**Titanic Data Science Solutions**](https://www.kaggle.com/startupsci/titanic-data-science-solutions) by Manav Sehgal
3. [**How to Choose a Feature Selection Method For Machine Learning**](https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/) by Jason Brownlee
4. [**Imbalanced Classes: Part 2**](https://towardsdatascience.com/imbalanced-class-sizes-and-classification-models-a-cautionary-tale-part-2-cf371500d1b3) by Becca R

In [None]:
%pylab inline
%config InlineBackend.figure_formats = ['retina']

import numpy as np # linear algebra
import pandas as pd # data processing
import seaborn as sns 

sns.set_style('whitegrid')
import matplotlib.pyplot as plt

from numpy.random import choice
from collections import Counter
import itertools

from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import FunctionTransformer
#from sklearn.preprocessing import PolynomialFeatures
from sklearn.decomposition import PCA, KernelPCA
from imblearn.over_sampling import ADASYN
#from sklearn.feature_selection import SelectKBest
#from sklearn.feature_selection import f_classif, chi2, mutual_info_classif

from sklearn.metrics import accuracy_score, classification_report, f1_score, roc_auc_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import roc_curve, precision_recall_curve, confusion_matrix

from imblearn.pipeline import Pipeline
from sklearn.cluster import AgglomerativeClustering
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.ensemble import VotingClassifier

In [None]:
# function to plot confusion matrix
def vis_conf_matrix(conf_matrix, model_name):
    group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
    group_counts = ["{0:0.0f}".format(value) for value in
                    conf_matrix.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in
                         conf_matrix.flatten()/np.sum(conf_matrix)]
    labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
              zip(group_names,group_counts,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)

    _, ax = plt.subplots(figsize = (6,6))
    ax = sns.heatmap(conf_matrix, annot=labels, fmt = '', 
                     annot_kws = {"size": 20, "weight": "bold"}, cmap = 'Blues')  
    labels = ['False', 'True']
    ax.set_title('Confusion Matrix for {}'.format(model_name), fontsize = 15)
    ax.set_xticklabels(labels, fontsize = 10)
    ax.set_yticklabels(labels, fontsize = 10)
    ax.set_xlabel('Prediction', fontsize = 15)
    ax.set_ylabel('Ground Truth', fontsize = 15)

In [None]:
# define function that fits clustering model and returns data + clustering labels column
def agg_cluster(data, n_clusters, linkage = 'ward'):
    if n_clusters <= 0:
        return data
    else:
        agg = AgglomerativeClustering(n_clusters = n_clusters, linkage = linkage)
        new_col = agg.fit_predict(data)
        new_col = new_col.reshape(len(new_col), 1)
        data = np.append(data, new_col, axis=1)
        return data

# Loading data

In [None]:
# Load data
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
print("Train data size: ", train_data.shape)
print("Train data: ", train_data.columns.tolist())
print("-" * 40)
print("Test data size: ", test_data.shape)
print("Test data: ", test_data.columns.tolist())
print("-" * 40)
train_data.head()

In [None]:
# check data types
train_data.info()

In [None]:
test_data.info()

Columns with NA values:
* in train data: Age, Cabin, Embarked
* in test data: Age, Fare, Cabin

# Cleaning data
Check for:
1. Duplicates
2. Data formats, typos
3. Missing values

In [None]:
# check for duplicate rows on Name
duplicate_train = train_data[train_data.duplicated(['Name'])]
duplicate_test = test_data[test_data.duplicated(['Name'])]

print("Duplicate rows in Name column (train data): ", len(duplicate_train))
print(f"PassengerId column in train data: {train_data.PassengerId.nunique()} unique values out of {len(train_data.PassengerId)}")
print("Duplicate rows in Name column (test data): ", len(duplicate_test))
print(f"PassengerId column in test data: {test_data.PassengerId.nunique()} unique values out of {len(test_data.PassengerId)}")

No duplicates in the Name column, all Passenger IDs are unique.

In [None]:
# calculate how many missing values
missing_values_count_train = train_data.isnull().sum()
missing_values_count_test = test_data.isnull().sum()
print("Train data:\n", missing_values_count_train)
print('-'*40)
print("Test data:\n", missing_values_count_test)

**Plan for data cleaning and feature engineering**:
1. Impute NAs in:
    * Age column - with median age within the group (possible groups: sex, pclass)
    * Embarked column in the train data - with top value 
    * Fare column in the test data - with median value within the pclass
2. Look closer into Cabin values:
    * How many unique values?
    * Can we split cabins into categories?
    * Can we use missing values as their own category?
3. Look closer into Ticket values:
    * How many unique values?
    * Can we split tickets into categories?
    * What could we learn from it?
4. PassengerId and Name columns carry little useful information for the model to learn. 
    * Can we extract titles from Name and build categories based on them?
5. What can we do with SibSp and Parch columns?
    * SibSp (integer)
        > number of siblings or spouses aboard the Titanic  
        > Sibling = brother, sister, stepbrother, stepsister  
        > Spouse = husband, wife (mistresses and fiancés were ignored)
    * Parch (integer)
        > number of parents or children aboard the Titanic  
        > Parent = mother, father  
        > Child = daughter, son, stepdaughter, stepson  
        > Some children travelled only with a nanny, therefore parch=0 for them.
    * Can we create FamilySize column? (FamilySize = SibSp + Parch + 1)
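
Item 1 of the plan (imputing Age with the median of a group) can be sketched on a tiny made-up table. This is only an illustration under my own assumptions: the `toy` frame and the `Age_filled` column are hypothetical, and the sketch computes the medians on the same table it fills, whereas the notebook takes them from the train data.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration, not taken from the Titanic files)
toy = pd.DataFrame({
    "Sex":    ["male", "male", "male", "female", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age":    [20.0, 30.0, np.nan, 35.0, np.nan, 45.0],
})

# Median age of each (Sex, Pclass) group, broadcast back to every row
group_median = toy.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Fill only the missing ages; observed ages stay untouched
toy["Age_filled"] = toy["Age"].fillna(group_median)
print(toy)
```

The missing male 3rd class age becomes 25 (median of 20 and 30) and the missing female 1st class age becomes 40 (median of 35 and 45).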

## Imputing NAs
### Age column

In [None]:
# Plot age distribution according to groups by sex and Pclass
sns.boxplot(y = 'Age', x = 'Pclass',
           data = train_data,
           palette = 'colorblind',
           hue = 'Sex')

In [None]:
# Calculate guess_age matrix for train data to later impute into both train and test datasets
guess_age = np.zeros((2,3))
sex = ['male', 'female']

for i in range(0, 2):
    for j in range(0, 3):
        guess_age[i,j] = int(train_data[(train_data['Sex'] == sex[i]) 
                                        & (train_data['Pclass'] == j + 1)]['Age']
                            .dropna()
                            .median())
guess_age

In [None]:
# Copy train and test data into new data frames
train1 = train_data.copy()
test1 = test_data.copy()
datasets = [train1, test1]

In [None]:
# Impute NAs in Age with guess_age
for dataset in datasets:
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[(dataset.Age.isnull()) & (dataset.Sex == sex[i]) 
                        & (dataset.Pclass == j + 1), 'Age'] = guess_age[i,j]
            
print("Train data Age NAs: ", train1.Age.isnull().sum())
print("Test data Age NAs: ", test1.Age.isnull().sum())

### Embarked and Fare columns

In [None]:
# Impute NAs in Embarked column of the train dataset with the most frequent value (mode value)
train1['Embarked'] = train1['Embarked'].fillna(train1['Embarked'].mode()[0])
print("Train data Embarked NAs: ", train1.Embarked.isnull().sum())

In [None]:
# Plot Fare distribution according to groups Pclass
sns.boxplot(y = 'Fare', x = 'Pclass',
           data = train1)

In [None]:
# Find a Pclass value for the NA in the test data
NA_pclass = int(test1.loc[test1.Fare.isnull(), 'Pclass'].iloc[0])
print("Pclass of NA value in the test data: ", NA_pclass)

# Calculate median value for the Fare within that pclass
fare_to_impute = train1[(train1['Pclass'] == NA_pclass)]['Fare'].median()
print("Median value to impute: ", fare_to_impute)

In [None]:
# Impute NAs in Fare column of the test dataset with the median value within the pclass
test1['Fare'] = test1['Fare'].fillna(fare_to_impute)
print("Test data Fare NAs: ", test1.Fare.isnull().sum())

In [None]:
# Check again for NAs
print("Train data:\n", train1.isnull().sum())
print('-'*40)
print("Test data:\n", test1.isnull().sum())

## Closer look at Cabin values

In [None]:
print("Unique cabin numbers for train data:\n", train1.Cabin.unique())
print("Total: ", train1.Cabin.nunique())
print("-"*40)
print("Unique cabin numbers for test data:\n", test1.Cabin.unique())
print("Total: ", test1.Cabin.nunique())

Information I used to understand Cabin numbers:
1. https://www.encyclopedia-titanica.org/cabins.html
2. https://www.dummies.com/education/history/titanic-facts-the-layout-of-the-ship/
3. https://www.encyclopedia-titanica.org/titanic-deckplans 
4. https://titanic.fandom.com/wiki/Category:Locations_on_board
5. https://www.encyclopedia-titanica.org/passenger-accommodation.html
  
There were 10 decks on the Titanic: Boat (T), Promenade (A), Bridge (B), Shelter (C), Saloon (D), Upper (E), Middle (F), Lower (G), Orlop, and Tank Top. 
* Decks Boat, A, B, and C had only first class cabins; 
* Deck D had cabins of all 3 classes (depending on the location); 
* Deck E had mostly 3rd and 2nd class cabins, but there were some cabins that could be alternative 1st or 2nd class;
* Deck F had 2nd and 3rd class cabins;
* Deck G had 3rd class cabins;
* Orlop and Tank Top didn't have cabins.  
  
Cabins with even numbers were located on the left side; cabins with odd numbers were on the right side.
  
1st and 2nd class cabins were named with a letter for the deck and a number for the room (like D37, B73, etc.). Some passengers seem to have had more than 1 cabin (cabins separated by a space).  
3rd class cabins initially didn't have a letter before the number on the deck plans, but they were divided into sections. Therefore, a 3rd class cabin located on deck F in section G would be called 'F G64'.  
  
**Plan**:
1. Create 'Deck' column by separating first letter of the Cabin value.
2. Count how many cabins the passenger had. To do that, remove the 'F ' deck prefix from 3rd class cabins and then count the number of parts separated by spaces.
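
To make the two steps of this plan concrete, here is a small sketch applied to a few Cabin strings in the dataset's format. The helper name `deck_and_cabin_count` is my own (hypothetical); the notebook itself does the same thing with two lambdas.

```python
import numpy as np
import pandas as pd

def deck_and_cabin_count(cabin):
    """Return (deck letter, number of cabins) for one raw Cabin value."""
    if pd.isna(cabin):
        # 'n' mirrors the first character of str(nan); the count stays missing
        return "n", np.nan
    deck = cabin[0]
    # Drop the 'F ' prefix of 3rd class cabins so 'F G64' counts as one cabin
    n_cabins = len(cabin.replace("F ", "").split(" "))
    return deck, n_cabins

for value in ["C85", "C23 C25 C27", "F G64", np.nan]:
    print(repr(value), "->", deck_and_cabin_count(value))
```

Without removing the 'F ' prefix, 'F G64' would be split into two parts and wrongly counted as two cabins.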

In [None]:
# create 2 new columns from Cabin
for dataset in datasets:
    dataset['Deck'] = dataset.Cabin.apply(lambda x: str(x)[0])
    dataset['multi_cabin'] = dataset.Cabin.apply(lambda x: 
                                                x if pd.isna(x) else len(x.replace('F ','').split(' ')))

In [None]:
for dataset in datasets:
    print(dataset.Deck.value_counts())
    print(dataset.multi_cabin.value_counts())
    print("NaN for multi_cabin: ", dataset.multi_cabin.isna().sum())
    print('-'*40)

Can we impute Deck values based on passenger classes?  
I looked at how many passengers of each class could be accommodated on each deck and calculated the probability of each deck for a passenger of a given class.
  
#### First class
* A-Deck: 42 passengers, p=0.06 
  > probability of a first class passenger being accommodated on Deck A = (42 first class passengers on Deck A) / (689 first class passengers total)
* B-Deck: 123 passengers, p=0.18
* C-Deck: 310 passengers, p=0.45
* D-Deck: 117 passengers, p=0.17
* E-Deck: 97 passengers, p=0.14  
Total: 689 

population_first = ['A', 'B', 'C', 'D', 'E']  
weights_first = [0.06, 0.18, 0.45, 0.17, 0.14]  
  
#### Second class
* D-Deck: 118 passengers, p=0.17
* E-Deck: 226 passengers, p=0.34
* F-Deck: 218 passengers, p=0.32
* G-Deck: 112 passengers, p=0.17  
Total: 674  

population_second = ['D', 'E', 'F', 'G']   
weights_second = [0.17, 0.34, 0.32, 0.17]  
  
#### Third class
* D-Deck: 50 passengers, p=0.05
* E-Deck: 260 passengers, p=0.25
* F-Deck: 466 passengers, p=0.45
* G-Deck: 86 + 164 passengers, p=0.25  
Total: 1026  

population_third = ['D', 'E', 'F', 'G']  
weights_third = [0.05, 0.25, 0.45, 0.25]  
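
As a quick check of the arithmetic, the sketch below recomputes the first class weights from the capacities listed above and draws a weighted sample. It assumes NumPy's `Generator.choice`, where the probabilities are passed through the `p` keyword; the names `capacity_first`, `rng` and `sample` and the seed are my own choices for illustration.

```python
import numpy as np

# First class capacity per deck, as listed above
capacity_first = {"A": 42, "B": 123, "C": 310, "D": 117, "E": 97}

decks = list(capacity_first)
counts = np.array(list(capacity_first.values()), dtype=float)
weights = counts / counts.sum()   # exact probabilities, sum to 1
print(dict(zip(decks, weights.round(2))))

# Weighted sampling: the probabilities must go in the `p` keyword
rng = np.random.default_rng(0)    # seed chosen arbitrarily for reproducibility
sample = rng.choice(decks, size=10_000, p=weights)

# Empirical frequencies should be close to the weights
values, freq = np.unique(sample, return_counts=True)
print(dict(zip(values, (freq / freq.sum()).round(2))))
```

With a large sample the observed share of each deck should approach its weight, which is an easy way to confirm that the weights are actually being used.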

In [None]:
# first class
population_first = ['A', 'B', 'C', 'D', 'E']
weights_first = [0.06, 0.18, 0.45, 0.17, 0.14]

# second class
population_second = ['D', 'E', 'F', 'G']
weights_second = [0.17, 0.34, 0.32, 0.17]

# third class
population_third = ['D', 'E', 'F', 'G']
weights_third = [0.05, 0.25, 0.45, 0.25]

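for dataset in datasets:
    # keep known decks; for missing ones ('n') draw a deck with class-specific probabilities
    # (the weights must be passed as `p`: the third positional argument of choice is `replace`)
    dataset['Deck_new'] = dataset.apply(lambda row: row['Deck'] if row['Deck'] != 'n'
                                        else (choice(population_first, 1, p = weights_first).tolist()[0] if int(row['Pclass']) == 1
                                              else (choice(population_second, 1, p = weights_second).tolist()[0] if int(row['Pclass']) == 2
                                                   else choice(population_third, 1, p = weights_third).tolist()[0])), axis = 1)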

In [None]:
for dataset in datasets:
    print(dataset.Deck_new.value_counts())
    print(dataset.Deck_new.isnull().sum())
    print('-'*40)

In [None]:
# Look at the distribution for 'Deck' and 'Deck_new'
fig, axes = plt.subplots(2, 2)
axes = axes.flatten()

ax = sns.barplot(x=train1[train1.Survived == 1]['Deck'].value_counts().index, 
                 y=train1[train1.Survived == 1]['Deck'].value_counts(), 
                 order=list(train1['Deck'].value_counts().index.sort_values()), ax=axes[0])
ax.title.set_text("Survived = 1")
ax = sns.barplot(x=train1[train1.Survived == 0]['Deck'].value_counts().index, 
                 y=train1[train1.Survived == 0]['Deck'].value_counts(), 
                 order=list(train1['Deck'].value_counts().index.sort_values()), ax=axes[1])
ax.title.set_text("Survived = 0")
ax = sns.barplot(x=train1[train1.Survived == 1]['Deck_new'].value_counts().index, 
                 y=train1[train1.Survived == 1]['Deck_new'].value_counts(), 
                 order=list(train1['Deck'].value_counts().index.sort_values()), ax=axes[2])
ax = sns.barplot(x=train1[train1.Survived == 0]['Deck_new'].value_counts().index, 
                 y=train1[train1.Survived == 0]['Deck_new'].value_counts(), 
                 order=list(train1['Deck'].value_counts().index.sort_values()), ax=axes[3])

fig.tight_layout()

* Replace Deck T with Deck A in train set
* Create Deck_num to capture information on passengers who didn't have their cabin listed
* Create Deck_new_num to capture relationship between Deck (imputed based on class) and survival rate

In [None]:
# Replace Deck T with Deck A in train set
deck = ['Deck', 'Deck_new']
to_replace = {'T': 'A'}
for i in deck:
    train1[i] = train1[i].replace(to_replace = to_replace)  

In [None]:
# Create Deck_num and Deck_new_num
to_replace = {'n': 0,
              'A': 1,
              'B': 2,
              'C': 3,
              'D': 4,
              'E': 5,
              'F': 6,
              'G': 7}
for dataset in datasets:
    for i in deck:
        dataset['{}_num'.format(i)] = dataset[i].replace(to_replace = to_replace)

In [None]:
# check correlation between new columns for Deck and target column 'Survived'
deck_num = ['{}_num'.format(i) for i in deck]
sns.set_context('notebook', font_scale=1)
fig, ax = plt.subplots(figsize=(6,6))
sns.heatmap(ax=ax, data=train1[deck_num + ['Survived']].corr(), annot=True, fmt= '.2f', cmap='coolwarm')

Both deck columns carry some useful information.

In [None]:
# Can we impute NaNs in multi_cabin based on grouping by Fare? More cabins per person would be more expensive.
sns.stripplot(x="multi_cabin", y="Fare", hue="Pclass",
              data=train1, jitter=True,
              palette="Set2", dodge=True, linewidth=1, edgecolor='gray')

sns.boxplot(x="multi_cabin", y="Fare", hue="Pclass",
            data=train1, palette="Set2", fliersize=0)

Price bands:
* more than 250 - 1
* 100 to 250 - 2
* 50 to 100 - 3
* less than 50 - 4
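
The same bands can be expressed with `pd.cut`, which makes the interval edges explicit. This is a sketch on made-up fares, not the code used below; I assume right-closed intervals so that a fare of exactly 50, 100 or 250 falls into the cheaper band.

```python
import numpy as np
import pandas as pd

# Made-up fares that sit on and around the band edges
fares = pd.Series([7.25, 50.0, 50.01, 100.0, 250.0, 300.0])

# Right-closed intervals: (-inf, 50], (50, 100], (100, 250], (250, inf)
price_band = pd.cut(
    fares,
    bins=[-np.inf, 50, 100, 250, np.inf],
    labels=[4, 3, 2, 1],
).astype(int)

print(price_band.tolist())
```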

In [None]:
# define price_band
for dataset in datasets:
    dataset['Price_band'] = dataset.Fare.apply(lambda x: 1 if x > 250 else 
                                               (2 if x <= 250 and x > 100 else
                                               (3 if x <= 100 and x > 50 else 4)))

In [None]:
# Calculate median_multi_cabin matrix for train data to later impute into both train and test datasets
median_multi_cabin = np.zeros((4,3))

for i in range(0, 4):
    for j in range(0, 3):
        median_multi_cabin[i,j] = train1[(train1['Price_band'] == i + 1) 
                                           & (train1['Pclass'] == j + 1)]['multi_cabin'].dropna().median()
        if pd.isna(median_multi_cabin[i,j]):
            median_multi_cabin[i,j] = 1
median_multi_cabin

In [None]:
# Impute NAs in multi_cabin with median_multi_cabin
for dataset in datasets:
    for i in range(0, 4):
        for j in range(0, 3):
            dataset.loc[(dataset.multi_cabin.isnull()) & (dataset.Price_band == i + 1) 
                        & (dataset.Pclass == j + 1), 'multi_cabin'] = median_multi_cabin[i,j]
            
print("Train data multi_cabin NAs: ", train1.multi_cabin.isnull().sum())
print("Train data multi_cabin values:\n", train1.multi_cabin.value_counts())
print('-'*40)
print("Test data multi_cabin NAs: ", test1.multi_cabin.isnull().sum())
print("Test data multi_cabin values:\n", test1.multi_cabin.value_counts())

In [None]:
sns.stripplot(x="multi_cabin", y="Fare", hue="Pclass",
              data=train1, jitter=True,
              palette="Set2", dodge=True, linewidth=1, edgecolor='gray')

sns.boxplot(x="multi_cabin", y="Fare", hue="Pclass",
            data=train1, palette="Set2", fliersize=0)

In [None]:
# merge passengers with more than 1 cabin into one group
train1['multi_cabin'] = train1['multi_cabin'].apply(lambda x: 1 if x > 1 else 0)
test1['multi_cabin'] = test1['multi_cabin'].apply(lambda x: 1 if x > 1 else 0)

In [None]:
print("Train data multi_cabin NAs: ", train1.multi_cabin.isnull().sum())
print("Train data multi_cabin values:\n", train1.multi_cabin.value_counts())
print('-'*40)
print("Test data multi_cabin NAs: ", test1.multi_cabin.isnull().sum())
print("Test data multi_cabin values:\n", test1.multi_cabin.value_counts())

In [None]:
for dataset in datasets:
    print(dataset.info())
    print('-'*40)

## Closer look at Ticket column

Sources:
1. https://www.kaggle.com/c/titanic/discussion/11127
    > All Tickets have:
    > * an optional string prefix TktPre and
    > * a number TktNum number (except for the special case Ticket=='LINE', for which we can assign some arbitrary TktNum e.g. -1). Should not treat TktNum directly as an integer; it is seriously non-contiguous.  
    
    > Both of these are predictive: TktNum can be compared for equality (tells you who was sharing a cabin, or traveling together on joint ticket) or compared for closeness (might allow us to fill in missing Cabin/Deck values, also using Pclass/ individual Fare). TktPre seems to tell you who the issuing ticket office and/or embarkation point were.
2. https://www.kaggle.com/c/titanic/discussion/14919#136279

In [None]:
print("Train data Ticket values: ", train1.Ticket.unique())
print(train1.Ticket.nunique())
print("-"*40)
print("Test data Ticket values: ", test1.Ticket.unique())
print(test1.Ticket.nunique())

* Some passengers share the same Ticket value: joint tickets for groups traveling together?
* Symbols like '.' and '/' can be removed
* If the first symbol is a letter, the Ticket has a prefix, separated from the number by white space (special case: 'LINE' values, which have no number)
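
These rules can be tried out on a few ticket strings in the dataset's format. The helper name `split_ticket` is hypothetical; the placeholders '-' (no prefix) and '-1' (no number) follow the conventions used in this notebook.

```python
def split_ticket(ticket):
    """Return (prefix, number) for one raw Ticket value."""
    cleaned = ticket.replace("/", "").replace(".", "").upper()
    # Purely numeric tickets have no prefix; '-' marks that case
    prefix = "-" if cleaned.isdigit() else cleaned.split(" ")[0]
    # 'LINE' tickets carry no number, so they get the placeholder '-1'
    number = "-1" if cleaned == "LINE" else cleaned.split(" ")[-1]
    return prefix, number

for raw in ["113803", "A/5 21171", "STON/O2. 3101282", "LINE"]:
    print(raw, "->", split_ticket(raw))
```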

In [None]:
for dataset in datasets:
    # Remove symbols like '.' and '/'
    dataset['Ticket_new'] = dataset.Ticket.apply(lambda x: x.replace('/','').replace('.','').upper())
    # Extract prefix
    dataset['PreTkt'] = dataset.Ticket_new.apply(lambda x: '-' if x.isdigit() else x.split(' ')[0])
    # Extract number
    dataset['NumTkt'] = dataset.Ticket_new.apply(lambda x: '-1' if x == 'LINE' else x.split(' ')[-1])

In [None]:
# Look at prefix values
print("Train data")
print(train1.PreTkt.value_counts())
print("Total unique: ", train1.PreTkt.nunique())
print('-'*40)
print("Test data")
print(test1.PreTkt.value_counts())
print("Total unique: ", test1.PreTkt.nunique())

In [None]:
# Look at ticket numbers
print(f"Train data: {train1.NumTkt.nunique()} unique out of {len(train1.NumTkt)}")
print(f"Test data: {test1.NumTkt.nunique()} unique out of {len(test1.NumTkt)}")

New variable for travel group sizes: count how many people had the same ticket number.  
It would allow us to capture information on the group sizes of passengers traveling together regardless of their family relations.

In [None]:
# count how many times ticket number appears in both train and test datasets
ticket_count = Counter(train1['NumTkt'].tolist() + test1['NumTkt'].tolist())
ticket_count.most_common(5)

In [None]:
# create new column for travel group size
for dataset in datasets:
    dataset['Group_size'] = dataset.NumTkt.apply(lambda x: ticket_count[x])

In [None]:
print("Train data")
print(train1.Group_size.value_counts())
print("Total unique: ", train1.Group_size.nunique())
print("Train data NAs: ", train1.Group_size.isnull().sum())
print('-'*40)
print("Test data")
print(test1.Group_size.value_counts())
print("Total unique: ", test1.Group_size.nunique())
print("Test data NAs: ", test1.Group_size.isnull().sum())

In [None]:
for dataset in datasets:
    print(dataset.info())
    print('-'*40)

## Closer look at Name column
We will extract Titles from names to create new categories.
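
Names follow the pattern 'Surname, Title. Given names', so the title is the text between the first comma and the next period. A minimal sketch on a few names in that format (the helper name `extract_title` is my own):

```python
def extract_title(name):
    """Return the text between the first comma and the following period."""
    return name.split(",")[1].split(".")[0].strip()

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]
for name in names:
    print(name, "->", extract_title(name))
```

Multi-word titles such as 'the Countess' come out intact because only the surrounding whitespace is stripped.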

In [None]:
# extract titles
for dataset in datasets:
    dataset['Title'] = dataset.Name.apply(lambda x: x.split(',')[1].split('.')[0].strip())
    print(dataset.Title.value_counts())

In [None]:
# Put rare titles in the category Rare and unify the rest
for dataset in datasets:
    dataset['Title'] = dataset['Title'].replace(['Dr', 'Rev', 'Col', 'Major', 'Sir', 'the Countess',
                                                'Don', 'Capt', 'Lady', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    print(dataset.Title.value_counts())

## Family size column
Family_size (the number of family members on board, including the passenger) = SibSp + Parch + 1

In [None]:
# create Family_size column
for dataset in datasets:
    dataset['Family_size'] = dataset['SibSp'] + dataset['Parch'] + 1
    print(dataset.Family_size.value_counts())

# Visualizing data

Let's drop columns that we won't use anymore (except PassengerId, I will leave it for now): 
* Name
* Ticket
* PreTkt
* NumTkt
* Cabin
* Deck
* Deck_new
* Ticket_new

In [None]:
# columns to drop definitely
to_drop = ['Name', 'Ticket', 'PreTkt', 'NumTkt', 'Cabin', 'Deck', 'Deck_new', 'Ticket_new']
train2 = train1.copy().drop(to_drop, axis=1)
test2 = test1.copy().drop(to_drop, axis=1)
datasets2 = [train2, test2]

for dataset in datasets2:
    print(dataset.info())
    print('-'*40)

In [None]:
# fix data type for PassengerId
for dataset in datasets2:
    dataset['PassengerId'] = dataset['PassengerId'].astype('object')

# split columns into numerical and categorical (exclude the PassengerId)
float_cols = train2.drop('PassengerId', axis=1).dtypes[train2.dtypes == 'float64'].index.tolist()
int_cols = train2.drop('PassengerId', axis=1).dtypes[train2.dtypes == 'int64'].index.tolist()
cat_cols = train2.drop('PassengerId', axis=1).dtypes[train2.dtypes == 'object'].index.tolist()
num_cols = float_cols + int_cols
binary = [x for x in num_cols if len(train2[x].unique()) == 2]
num_not_binary = [x for x in num_cols if x not in binary]

print("Numerical variables: ", num_cols)
print("Numerical continuous variables: ", float_cols)
print("Numerical discrete variables: ", int_cols)
print("Numerical and not binary variables: ", num_not_binary)
print("Binary variables: ", binary)
print("Categorical variables: ", cat_cols)

## Numeric variables

In [None]:
# build histograms for numeric nonbinary variables
fig, axList = plt.subplots(2, 5, sharex=False, sharey=False)
axList = axList.flatten()
fig.set_size_inches(12, 4)

for ax in axList[len(num_not_binary):]:
    ax.axis('off')

for i,ax in enumerate(axList[0:len(num_not_binary)]):
    train2.hist(column = num_not_binary[i], bins = 10, ax=ax)
            
fig.tight_layout()
fig.show()

In [None]:
# look at age distribution vs survival
g = sns.FacetGrid(train2, col='Survived')
g.map(plt.hist, 'Age', bins=20)

In [None]:
ax = sns.violinplot(x="Sex", y="Age", hue="Survived",
                    data=train1, palette="Set2", split=True,
                    scale="count")
ax.title.set_text('Survival rate vs Age and Sex')

*For males*:
 - Age from 0 to about 12-15: more survived
 - Age from 15: more didn't survive   
   
*For females*: 
- more survived at any age

In [None]:
ax = sns.violinplot(x="Pclass", y="Age", hue="Survived",
                    data=train1, palette="Set2", split=True,
                    scale="count")
ax.title.set_text('Survival rate vs Age and Passenger class')

*For 1st class*:
- Age from 0 to about 60: more survived
- Age from 60: more didn't survive   
   
*For 2nd class*:
- Age from 0 to about 16-18: more survived
- Age from 18-20: similar survival rate  
  
*For 3rd class*:
- Age from 0 to about 12-15: similar survival rate
- Age from 18-20: more didn't survive   
   
Let's create age bands to capture these observations.

In [None]:
print("Maximum age in train set: ", max(train2['Age']))
print("Maximum age in test set: ", max(test2['Age']))

Age categories:
* 0-2: Toddler/baby - 0
* 2-12: Child - 1
* 12-18: Teen - 2
* 18-25: Young adult - 3
* 25-60: Adult - 4
* 60+: Senior - 5
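
To see how these categories map onto bin edges, here is a sketch with `pd.cut` on made-up ages. I assume right-closed bins, so an age of exactly 2 is still a toddler and exactly 60 is still an adult; the upper edge of 99 is an arbitrary cap above the maximum age.

```python
import pandas as pd

# Made-up ages that sit on and around the bin edges
ages = pd.Series([0.42, 2, 2.5, 12, 18, 22, 40, 60, 61, 80])

# labels=False returns the integer index of the bin each age falls into
age_band = pd.cut(ages, bins=[0, 2, 12, 18, 25, 60, 99], labels=False)

print(pd.DataFrame({"Age": ages, "Age_band": age_band}))
```

When the same bins are applied to several data frames, each frame should be cut on its own Age column so that the bands line up with the right passengers.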

In [None]:
# create age categories
for dataset in datasets2:
    dataset['Age_band'] = pd.cut(dataset.Age, bins = [0, 2, 12, 18, 25, 60, 99], labels = False)
    dataset.hist(column = 'Age_band', bins = 6)

In [None]:
df = train2.groupby(['Age_band'])['Survived'].mean()
ax = df.plot(kind='bar')
ax.title.set_text('Average survival rate vs Age band')

In [None]:
pd.pivot_table(train2, index = 'Sex', columns = 'Age_band', values = 'Survived' ,aggfunc ='mean')

Let's try to combine 'Age_band' and 'Sex' columns.

In [None]:
# Encode Sex column
to_replace = {'male': 0,
              'female': 1}
for dataset in datasets2:
    dataset['Sex'] = dataset['Sex'].replace(to_replace = to_replace)

In [None]:
for dataset in datasets2:
    dataset['Age_Sex'] = dataset['Sex'] * dataset['Age_band']

In [None]:
df = train2.groupby(['Age_Sex'])['Survived'].mean()
ax = df.plot(kind='bar')
ax.title.set_text('Average survival rate vs Age band multiplied by Sex')

In [None]:
pd.pivot_table(train2, index = 'Pclass', columns = 'Age_band', values = 'Survived' ,aggfunc ='mean')

In [None]:
# combine 'Age_band' and 'Pclass'
for dataset in datasets2:
    dataset['Age_Pclass'] = dataset['Pclass'] * dataset['Age_band']

In [None]:
df = train2.groupby(['Age_Pclass'])['Survived'].mean()
ax = df.plot(kind='bar')
ax.title.set_text('Average survival rate vs Age band multiplied by Passenger class')

In [None]:
# look at Fare distribution vs survival
f = sns.FacetGrid(train2, col='Survived')
f.map(plt.hist, 'Fare', bins=20)

In [None]:
ax = sns.violinplot(x="Sex", y="Fare", hue="Survived",
                    data=train1, palette="Set2", split=True,
                    scale="count")
ax.title.set_text('Survival rate vs Fare and Sex')

In [None]:
ax = sns.violinplot(x="Pclass", y="Fare", hue="Survived",
                    data=train1, palette="Set2", split=True,
                    scale="count")
ax.title.set_text('Survival rate vs Fare and Passenger class')

In [None]:
print("Minimum Fare in train set: ", min(train2['Fare']))
print("Maximum Fare in train set: ", max(train2['Fare']))
print("Minimum Fare in test set: ", min(test2['Fare']))
print("Maximum Fare in test set: ", max(test2['Fare']))

Redefine Price bands:
* 0-20: 0
* 20-50: 1
* 50-100: 2
* 100-200: 3
* 200-600: 4

In [None]:
for dataset in datasets2:
    dataset['Price_band'] = pd.cut(dataset.Fare, bins = [-1, 20, 50, 100, 200, 600], labels = False)
    dataset.hist(column = 'Price_band', bins = 5)

In [None]:
df = train2.groupby(['Price_band'])['Survived'].mean()
ax = df.plot(kind='bar')
ax.title.set_text('Average survival rate vs Price band')

In [None]:
pd.pivot_table(train2, index = 'Sex', columns = 'Price_band', values = 'Survived' ,aggfunc ='mean')

In [None]:
for dataset in datasets2:
    dataset['Price_sex'] = dataset['Price_band'] * (dataset['Sex'] + 1)

In [None]:
df = train2.groupby(['Price_sex'])['Survived'].mean()
ax = df.plot(kind='bar')
ax.title.set_text('Average survival rate vs Price band multiplied by Sex')

In [None]:
# Heatmap (correlation)
sns.set_context('notebook', font_scale=0.8)
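fig, ax = plt.subplots(figsize=(10,10))
# restrict to numeric columns so that corr() does not fail on object columns
sns.heatmap(ax=ax, data=train2.select_dtypes(include='number').corr(), annot=True, fmt= '.2f', cmap='coolwarm')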

We have some multicollinearity with new features.  
I will apply Dimensionality Reduction methods to fix it while keeping as much information and variance as possible.
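
Before reducing dimensions, it can help to list which feature pairs are actually collinear instead of reading them off the heatmap. The sketch below is one possible way to do that; the function name `correlated_pairs`, the 0.8 threshold and the toy data are assumptions of mine, not values used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy frame (made-up values): 'b' is a linear function of 'a',
# 'c' is only weakly correlated with both
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12],
    "c": [5, 1, 4, 2, 6, 3],
})
print(correlated_pairs(toy))
```

On the real data this could be called as `correlated_pairs(train2)`, for example to see which engineered columns overlap most with the columns they were derived from.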

In [None]:
# update lists of columns (numerical and categorical)
float_cols = train2.drop('PassengerId', axis=1).dtypes[train2.dtypes == 'float64'].index.tolist()
int_cols = train2.drop('PassengerId', axis=1).dtypes[train2.dtypes == 'int64'].index.tolist()
cat_cols = train2.drop('PassengerId', axis=1).dtypes[train2.dtypes == 'object'].index.tolist()
num_cols = float_cols + int_cols
binary = [x for x in num_cols if len(train2[x].unique()) == 2]
num_not_binary = [x for x in num_cols if x not in binary]

print("Numerical variables: ", num_cols)
print("Numerical continuous variables: ", float_cols)
print("Numerical discrete variables: ", int_cols)
print("Numerical and not binary variables: ", num_not_binary)
print("Binary variables: ", binary)
print("Categorical variables: ", cat_cols)

In [None]:
# calculate skewness scores for continuous (float) variables
skewness = train2[float_cols].skew(axis=0, numeric_only = True).to_dict()

# define skewness threshold
skewness_threshold = 0.5

# create lists of columns that require normalizing 
# positively and negatively skewed variables to be processed by different transformations
pos_skewed_cols = []
neg_skewed_cols = []
for i in skewness:
    if abs(skewness[i]) > skewness_threshold:
        if skewness[i] > 0:
            pos_skewed_cols.append(i)
        else:
            neg_skewed_cols.append(i)

# print results
for i in pos_skewed_cols:
    print(f"Column {i} is positively skewed: score {round(skewness[i], 2)}")

print('-'*40)

for i in neg_skewed_cols:
    print(f"Column {i} is negatively skewed: score {round(skewness[i], 2)}")

In [None]:
# create a data frame with skewness coefficients before and after different transformations to choose the ones to use

# skewness before transformations
pos_no_transf_dict = train2[pos_skewed_cols].skew(axis=0, numeric_only = True).to_dict()
pos_skew = pd.DataFrame(list(pos_no_transf_dict.items()), columns = ['vars','skew_no_transform'])

# skewness after Logarithmic Transformation
pos_log_transf_dict = train2[pos_skewed_cols].apply(np.log1p).skew(axis=0, numeric_only = True).to_dict()
pos_skew = pd.merge(pos_skew, 
                    pd.DataFrame(list(pos_log_transf_dict.items()), columns = ['vars','skew_log_transform']), 
                    on = ['vars'])

pos_skew
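
To illustrate why `np.log1p` is a candidate here, the following sketch compares skewness before and after the transform on a synthetic right-skewed sample (log-normal draws, an assumption standing in for a column such as `Fare`).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, strictly positive, right-skewed values
sample = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

skew_before = sample.skew()
skew_after = sample.apply(np.log1p).skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

`log1p` computes log(1 + x), so it stays defined for zero values, which matters for columns that can contain zeros.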

In [None]:
to_log_transform = ["Fare"]

In [None]:
for dataset in datasets2:
    for i in to_log_transform:
        dataset[i] = dataset[i].apply(np.log1p)
    
    print(dataset[float_cols].skew(axis=0, numeric_only = True).sort_values())
    print('-' * 40)

In [None]:
# build histograms for non-binary numeric variables
fig, axList = plt.subplots(3, 5, sharex=False, sharey=False)
axList = axList.flatten()
fig.set_size_inches(12, 6)

for ax in axList[len(num_not_binary):]:
    ax.axis('off')

for i,ax in enumerate(axList[0:len(num_not_binary)]):
    train2.hist(column = num_not_binary[i], bins = 10, ax=ax)
            
fig.tight_layout()
fig.show()

## Categorical variables

In [None]:
# Bar charts for categorical and binary variables
for i in cat_cols + binary:
    sns.barplot(x=train2[i].value_counts().index, y=train2[i].value_counts())
    plt.title(i)
    plt.show()

In [None]:
for i in ['Embarked', 'Title', 'Sex', 'multi_cabin']:
    print(pd.pivot_table(train2, columns = i, values = 'Survived', aggfunc ='mean'))
    print("-" * 40)

In [None]:
# Encode 'Embarked' as an ordinal variable
to_replace = {'S': 0,
              'Q': 1,
              'C': 2}
for dataset in datasets2:
    dataset['Embarked'] = dataset['Embarked'].replace(to_replace = to_replace)
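
The mapping above is hard-coded. If the intent is to order ports by survival rate (an assumption; the notebook does not state how the order was chosen), such a mapping could be derived from the data along these lines, shown here on a small made-up frame.

```python
import pandas as pd

# Made-up rows; the real notebook would use train2 instead
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'Q', 'Q', 'C', 'C', 'C'],
    'Survived': [0, 0, 1, 0, 1, 1, 1, 0],
})

# Mean target per category, sorted ascending, then ranked 0..k-1
rates = toy.groupby('Embarked')['Survived'].mean().sort_values()
ordinal_map = {port: rank for rank, port in enumerate(rates.index)}

print(ordinal_map)
```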

# Preprocessing Data

In [None]:
# update lists of columns (numerical and categorical)
float_cols = train2.drop('PassengerId', axis=1).dtypes[train2.dtypes == 'float64'].index.tolist()
int_cols = train2.drop('PassengerId', axis=1).dtypes[train2.dtypes == 'int64'].index.tolist()
cat_cols = train2.drop('PassengerId', axis=1).dtypes[train2.dtypes == 'object'].index.tolist()
num_cols = float_cols + int_cols
binary = [x for x in num_cols if len(train2[x].unique()) == 2]
num_not_binary = [x for x in num_cols if x not in binary]

print("Numerical variables: ", num_cols)
print("Numerical continuous variables: ", float_cols)
print("Numerical discrete variables: ", int_cols)
print("Numerical and not binary variables: ", num_not_binary)
print("Binary variables: ", binary)
print("Categorical variables: ", cat_cols)

In [None]:
target = ["Survived"]
features = [x for x in num_cols + cat_cols if x not in target]
cat_features = [x for x in features if x in cat_cols]
num_features = [x for x in features if x in num_cols]
print("Features: ", features)
print("Categorical features: ", cat_features)
print("Numerical features: ", num_features)

In [None]:
# get training data features and target
X = train2[features].copy()
y = train2[target].copy()
X.head()

In [None]:
y.value_counts(normalize=True)

The classes are imbalanced, so I will add an oversampling step (ADASYN) to each estimator pipeline.
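
The notebook's pipelines use ADASYN for this. As a simpler stand-in that shows what oversampling does to the class counts, the sketch below duplicates minority-class rows at random with NumPy only; it is plain random oversampling, not ADASYN, and the toy arrays are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced data: 6 negatives, 3 positives
X_toy = np.arange(18).reshape(9, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])

# Find the minority class and how many extra rows it needs
counts = np.bincount(y_toy)
minority = int(np.argmin(counts))
n_extra = counts.max() - counts.min()
minority_idx = np.flatnonzero(y_toy == minority)

# Sample minority rows with replacement and append them
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
X_bal = np.vstack([X_toy, X_toy[extra_idx]])
y_bal = np.concatenate([y_toy, y_toy[extra_idx]])

print(np.bincount(y_bal))
```

Placing the sampler inside the pipeline means it is applied only when fitting, so validation folds keep their original class balance.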

In [None]:
X = pd.get_dummies(X, columns = cat_features, drop_first=True)
X.columns

In [None]:
# split training data into train and test sets
# Get the split indexes
strat_shuf_split = StratifiedShuffleSplit(n_splits=1, 
                                          test_size=0.4, 
                                          random_state=42)

train_idx, test_idx = next(strat_shuf_split.split(X, y))

# Create the dataframes for train and test
# (split() returns positional indices, so use .iloc rather than .loc)
X_train = X.iloc[train_idx].copy()
y_train = y.iloc[train_idx].copy()

X_test  = X.iloc[test_idx].copy()
y_test  = y.iloc[test_idx].copy()
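
As a quick check of what stratification buys, this sketch splits a toy label vector with a 60/40 class ratio and compares the positive-class share in both parts. The toy arrays and the `toy_` variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels with a 60/40 class split
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
toy_train_idx, toy_test_idx = next(sss.split(X_toy, y_toy))

# Positive-class share in each part
print(y_toy[toy_train_idx].mean(), y_toy[toy_test_idx].mean())
```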

In [None]:
# scale numeric columns
mm = MinMaxScaler()

for column in num_features:
    X_train[[column]] = mm.fit_transform(X_train[[column]])
    X_test[[column]] = mm.transform(X_test[[column]])

round(X_train.describe(), 3)
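
The scaler is fitted on the training split only and then reused for the test split, so test values outside the training range map outside [0, 1]. A small sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_col = np.array([[1.0], [3.0], [5.0]])   # training range is [1, 5]
test_col = np.array([[2.0], [7.0]])           # 7 lies above the training max

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_col)  # learns min=1, max=5
test_scaled = scaler.transform(test_col)        # reuses the training min/max

print(train_scaled.ravel(), test_scaled.ravel())
```

This is why the `describe()` output for the test split can show minima below 0 or maxima above 1.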

In [None]:
round(X_test.describe(), 3)

In [None]:
# make sure that we have the same columns in train and test sets
print(X_train.columns)
print(X_test.columns)

# ML Classifiers
## Logistic Regression

In [None]:
# Logistic Regression
estimator_lr = Pipeline([#("feature_selector", KernelPCA(random_state = 42, kernel = 'rbf')),
                         #("clustering", FunctionTransformer(agg_cluster)),
                         ("sampling", ADASYN(sampling_strategy = 'minority', random_state = 42)),
                         ("clasifier", LogisticRegression(class_weight = 'balanced', solver='liblinear'))])

params_lr = {
    #'feature_selector__n_components': list(range(17, 22)),
    #'clustering__kw_args': [{'n_clusters': i} for i in range(3,8)],
    'sampling__n_neighbors': [4, 5, 6],
    'clasifier__penalty': ['l2'],
    'clasifier__C': np.geomspace(0.001, 40, 20)
}

In [None]:
# KFold for Grid Search
skf = StratifiedKFold(n_splits = 6)

# do grid search
grid_lr = GridSearchCV(estimator_lr, params_lr, 
                       scoring = 'f1', 
                       cv = skf, 
                       n_jobs = -1, verbose = 1)
grid_lr.fit(X_train, y_train.values.ravel())

In [None]:
grid_lr.best_score_, grid_lr.best_params_
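
For readers unfamiliar with the `step__parameter` naming used in `params_lr`, here is a reduced sketch on synthetic data. It uses scikit-learn's own `Pipeline` with a `StandardScaler` step instead of the sampling step, and the step names `scale` and `clf` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=42)

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(solver="liblinear"))])

# "<step name>__<parameter>" routes each value to the matching pipeline step
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

grid = GridSearchCV(pipe, param_grid, scoring="f1",
                    cv=StratifiedKFold(n_splits=3))
grid.fit(X_toy, y_toy)

print(grid.best_params_, round(grid.best_score_, 3))
```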

In [None]:
# predict target values
y_pred_lr = grid_lr.predict(X_test)

In [None]:
# Plot confusion matrix
c_matrix_lr = confusion_matrix(y_test, y_pred_lr)
vis_conf_matrix(c_matrix_lr, "Logistic Regression")

In [None]:
print(classification_report(y_test, y_pred_lr))
print('Accuracy score: ', round(accuracy_score(y_test, y_pred_lr), 2))
print('F1 Score: ', round(f1_score(y_test, y_pred_lr), 2))

In [None]:
scores = pd.DataFrame(data = {'model': ['logistic regression'], 
                              #'features': [grid_lr.best_params_['feature_selector__n_components']],  
                              'f1': [f1_score(y_test, y_pred_lr)], 
                              'accuracy': [accuracy_score(y_test, y_pred_lr)]})

scores.loc[scores.model == 'logistic regression', 
           ['tn', 'fp', 'fn', 'tp']] = np.around(c_matrix_lr.ravel()/np.sum(c_matrix_lr)*100, 
                                                 decimals=2)
scores
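
The `tn`, `fp`, `fn`, `tp` columns hold each confusion-matrix cell as a percentage of all test rows. A small sketch with made-up labels shows the order in which `ravel()` returns the cells:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_hat  = [0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_hat)

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = np.around(cm.ravel() / cm.sum() * 100, decimals=2)

print(tn, fp, fn, tp)
```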

## Support Vector Machines

In [None]:
# Support Vector Machines
estimator_svc = Pipeline([#("clustering", FunctionTransformer(agg_cluster)),
                          ("feature_selector", KernelPCA(random_state = 42, kernel = 'rbf')),
                          ("sampling", ADASYN(sampling_strategy = 'minority', random_state = 42)),
                          ("clasifier", SVC(kernel = 'rbf', probability = True))])

params_svc = {#'clustering__kw_args': [{'n_clusters': i} for i in range(3,8)],
              'feature_selector__n_components': [16, 17],
              'sampling__n_neighbors': [5],
              'clasifier__gamma': np.geomspace(0.1, 4, 5), 
              'clasifier__C': np.geomspace(1, 50, 10)}

grid_svc = GridSearchCV(estimator_svc, params_svc, 
                       scoring = 'f1', 
                       cv=skf, 
                       n_jobs = -1, verbose = 1)
grid_svc.fit(X_train, y_train.values.ravel())

In [None]:
grid_svc.best_score_, grid_svc.best_params_

In [None]:
# predict target values
y_pred_svc = grid_svc.predict(X_test)

In [None]:
# Plot confusion matrix
c_matrix_svc = confusion_matrix(y_test, y_pred_svc)
vis_conf_matrix(c_matrix_svc, "SVC")

In [None]:
print(classification_report(y_test, y_pred_svc))
print('Accuracy score: ', round(accuracy_score(y_test, y_pred_svc), 2))
print('F1 Score: ', round(f1_score(y_test, y_pred_svc), 2))

In [None]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
scores = pd.concat([scores, 
                    pd.DataFrame([{'model': 'SVC', 
                                   #'features': grid_svc.best_params_['feature_selector__n_components'], 
                                   'f1': f1_score(y_test, y_pred_svc),
                                   'accuracy': accuracy_score(y_test, y_pred_svc)}])], 
                   ignore_index=True)

scores.loc[scores.model == 'SVC', 
           ['tn', 'fp', 'fn', 'tp']] = np.around(c_matrix_svc.ravel()/np.sum(c_matrix_svc)*100, 
                                                 decimals=2)
scores

## K-Nearest Neighbors

In [None]:
# K-Nearest Neighbors
estimator_knn = Pipeline([("clustering", FunctionTransformer(agg_cluster)),
                          ("sampling", ADASYN(sampling_strategy = 'minority', random_state = 42)),
                          ("clasifier", KNeighborsClassifier())])

params_knn = {
    'clustering__kw_args': [{'n_clusters': i} for i in range(2,5)],
    'sampling__n_neighbors': [3, 4, 5],
    'clasifier__n_neighbors': [i for i in range(5, 10)], 
    'clasifier__weights': ['uniform'], 
    'clasifier__algorithm': ['auto'],
    'clasifier__p': [1, 2]
}

grid_knn = GridSearchCV(estimator_knn, params_knn, 
                       scoring = 'f1', 
                       cv=skf, 
                       n_jobs = -1, verbose = 1)
grid_knn.fit(X_train, y_train.values.ravel())

In [None]:
grid_knn.best_score_, grid_knn.best_params_

In [None]:
# predict target values
y_pred_knn = grid_knn.predict(X_test)

In [None]:
# Plot confusion matrix
c_matrix_knn = confusion_matrix(y_test, y_pred_knn)
vis_conf_matrix(c_matrix_knn, "KNN")

In [None]:
print(classification_report(y_test, y_pred_knn))
print('Accuracy score: ', round(accuracy_score(y_test, y_pred_knn), 2))
print('F1 Score: ', round(f1_score(y_test, y_pred_knn), 2))

In [None]:
scores = pd.concat([scores, 
                    pd.DataFrame([{'model': 'KNN', 
                                   #'features': grid_knn.best_params_['feature_selector__n_components'],  
                                   'f1': f1_score(y_test, y_pred_knn),
                                   'accuracy': accuracy_score(y_test, y_pred_knn)}])], 
                   ignore_index=True)

scores.loc[scores.model == 'KNN', 
           ['tn', 'fp', 'fn', 'tp']] = np.around(c_matrix_knn.ravel()/np.sum(c_matrix_knn)*100, 
                                                 decimals=2)
scores

## Random Forest

In [None]:
# Random Forest Classifier
estimator_rf = Pipeline([#("feature_selector", KernelPCA(random_state = 42, kernel = 'rbf')),
                         #("clustering", FunctionTransformer(agg_cluster)),
                         ("sampling", ADASYN(sampling_strategy = 'minority', random_state = 42)),
                         ("clasifier", RandomForestClassifier(random_state = 42))])

params_rf = {#'feature_selector__n_components': [18, 19, 20],
             #'clustering__kw_args': [{'n_clusters': i} for i in range(2,8)],
             'sampling__n_neighbors': [3],
             'clasifier__n_estimators': [1000], 
             'clasifier__criterion': ['gini'], 
             'clasifier__bootstrap': [False],
             'clasifier__max_depth': [3],
             'clasifier__max_features': ['sqrt'],  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
             'clasifier__min_samples_leaf': [2, 3],
             'clasifier__min_samples_split': [3]}

# do grid search
grid_rf = GridSearchCV(estimator_rf, params_rf, 
                       scoring = 'f1', 
                       cv=skf, 
                       n_jobs = -1, verbose = 2)
grid_rf.fit(X_train, y_train.values.ravel())

In [None]:
grid_rf.best_score_, grid_rf.best_params_

In [None]:
# predict target values
y_pred_rf = grid_rf.predict(X_test)

In [None]:
# Plot confusion matrix
c_matrix_rf = confusion_matrix(y_test, y_pred_rf)
vis_conf_matrix(c_matrix_rf, "Random Forest")

In [None]:
print(classification_report(y_test, y_pred_rf))
print('Accuracy score: ', round(accuracy_score(y_test, y_pred_rf), 2))
print('F1 Score: ', round(f1_score(y_test, y_pred_rf), 2))

In [None]:
scores = pd.concat([scores, 
                    pd.DataFrame([{'model': 'random forest', 
                                   #'features': grid_rf.best_params_['feature_selector__n_components'],   
                                   'f1': f1_score(y_test, y_pred_rf),
                                   'accuracy': accuracy_score(y_test, y_pred_rf)}])], 
                   ignore_index=True)

scores.loc[scores.model == 'random forest', 
           ['tn', 'fp', 'fn', 'tp']] = np.around(c_matrix_rf.ravel()/np.sum(c_matrix_rf)*100, 
                                                 decimals=2)
scores

## Extra Trees

In [None]:
# Extra Trees Classifier
estimator_et = Pipeline([("feature_selector", KernelPCA(random_state = 42, kernel = 'rbf')),
                         #("clustering", FunctionTransformer(agg_cluster)),
                         ("sampling", ADASYN(sampling_strategy = 'minority', random_state = 42)),
                         ("clasifier", ExtraTreesClassifier(random_state = 42))])

params_et = {#'clustering__kw_args': [{'n_clusters': i} for i in range(2,8)],
             'feature_selector__n_components': [19, 20],
             'sampling__n_neighbors': [3],
             'clasifier__n_estimators': [1000], 
             'clasifier__criterion': ['gini'], 
             'clasifier__bootstrap': [False],
             'clasifier__max_depth': [7],
             'clasifier__max_features': ['sqrt'],  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
             'clasifier__min_samples_leaf': [2, 3],
             'clasifier__min_samples_split': [3]}

# do grid search
grid_et = GridSearchCV(estimator_et, params_et, 
                       scoring = 'f1', 
                       cv=skf, 
                       n_jobs = -1, verbose = 2)
grid_et.fit(X_train, y_train.values.ravel())

In [None]:
grid_et.best_score_, grid_et.best_params_

In [None]:
# predict target values
y_pred_et = grid_et.predict(X_test)

In [None]:
# Plot confusion matrix
c_matrix_et = confusion_matrix(y_test, y_pred_et)
vis_conf_matrix(c_matrix_et, "Extra Trees")

In [None]:
print(classification_report(y_test, y_pred_et))
print('Accuracy score: ', round(accuracy_score(y_test, y_pred_et), 2))
print('F1 Score: ', round(f1_score(y_test, y_pred_et), 2))

In [None]:
scores = pd.concat([scores, 
                    pd.DataFrame([{'model': 'extra trees', 
                                   #'features': grid_et.best_params_['feature_selector__n_components'],   
                                   'f1': f1_score(y_test, y_pred_et),
                                   'accuracy': accuracy_score(y_test, y_pred_et)}])], 
                   ignore_index=True)

scores.loc[scores.model == 'extra trees', 
           ['tn', 'fp', 'fn', 'tp']] = np.around(c_matrix_et.ravel()/np.sum(c_matrix_et)*100, 
                                                 decimals=2)
scores

## Gradient Boosting

In [None]:
# Gradient Boosting
estimator_gbc = Pipeline([("feature_selector", KernelPCA(random_state = 42, kernel = 'rbf')),
                          ("sampling", ADASYN(sampling_strategy = 'minority', random_state = 42)),
                          ("clasifier", GradientBoostingClassifier(random_state = 42))])

params_gbc = {'feature_selector__n_components': [19, 20],
             'sampling__n_neighbors': [5],
             'clasifier__n_estimators': [1000],
             'clasifier__max_depth': [2],
             'clasifier__learning_rate': [0.01],
             'clasifier__loss': ['exponential'],
             'clasifier__subsample': [0.5],
             'clasifier__min_samples_split': [2, 3],
             'clasifier__min_samples_leaf': [2],
             'clasifier__max_features': ['sqrt']}

# do grid search
grid_gbc = GridSearchCV(estimator_gbc, params_gbc, 
                       scoring = 'f1', 
                       cv=skf, 
                       n_jobs = -1, verbose = 2)
grid_gbc.fit(X_train, y_train.values.ravel())

In [None]:
grid_gbc.best_score_, grid_gbc.best_params_

In [None]:
# predict target values
y_pred_gbc = grid_gbc.predict(X_test)

In [None]:
# Plot confusion matrix
c_matrix_gbc = confusion_matrix(y_test, y_pred_gbc)
vis_conf_matrix(c_matrix_gbc, "Gradient Boosting")

In [None]:
print(classification_report(y_test, y_pred_gbc))
print('Accuracy score: ', round(accuracy_score(y_test, y_pred_gbc), 2))
print('F1 Score: ', round(f1_score(y_test, y_pred_gbc), 2))

In [None]:
scores = pd.concat([scores, 
                    pd.DataFrame([{'model': 'gradient boosting', 
                                   #'features': grid_gbc.best_params_['feature_selector__n_components'], 
                                   'f1': f1_score(y_test, y_pred_gbc),
                                   'accuracy': accuracy_score(y_test, y_pred_gbc)}])], 
                   ignore_index=True)

scores.loc[scores.model == 'gradient boosting', 
           ['tn', 'fp', 'fn', 'tp']] = np.around(c_matrix_gbc.ravel()/np.sum(c_matrix_gbc)*100, 
                                                 decimals=2)
scores

## AdaBoost

In [None]:
# AdaBoost
estimator_ada = Pipeline([("feature_selector", KernelPCA(random_state = 42, kernel = 'rbf')),
                          ("sampling", ADASYN(sampling_strategy = 'minority', random_state = 42)),
                          ("clasifier", AdaBoostClassifier(random_state = 42))])

params_ada = {'feature_selector__n_components': [20],
             'sampling__n_neighbors': [5],
             'clasifier__n_estimators': [1500, 2000],
             'clasifier__algorithm': ['SAMME.R'],
             'clasifier__learning_rate': [0.01]}

# do grid search
grid_ada = GridSearchCV(estimator_ada, params_ada, 
                       scoring = 'f1', 
                       cv=skf, 
                       n_jobs = -1, verbose = 2)
grid_ada.fit(X_train, y_train.values.ravel())

In [None]:
grid_ada.best_score_, grid_ada.best_params_

In [None]:
# predict target values
y_pred_ada = grid_ada.predict(X_test)

In [None]:
# Plot confusion matrix
c_matrix_ada = confusion_matrix(y_test, y_pred_ada)
vis_conf_matrix(c_matrix_ada, "AdaBoost Classifier")

In [None]:
print(classification_report(y_test, y_pred_ada))
print('Accuracy score: ', round(accuracy_score(y_test, y_pred_ada), 2))
print('F1 Score: ', round(f1_score(y_test, y_pred_ada), 2))

In [None]:
scores = pd.concat([scores, 
                    pd.DataFrame([{'model': 'adaboost', 
                                   #'features': grid_ada.best_params_['feature_selector__n_components'], 
                                   'f1': f1_score(y_test, y_pred_ada),
                                   'accuracy': accuracy_score(y_test, y_pred_ada)}])], 
                   ignore_index=True)

scores.loc[scores.model == 'adaboost', 
           ['tn', 'fp', 'fn', 'tp']] = np.around(c_matrix_ada.ravel()/np.sum(c_matrix_ada)*100, 
                                                 decimals=2)
scores

Models that performed well:
* Logistic Regression
* Support Vector Machines
* Random Forest
* Extra Trees
* AdaBoost  
   
I will combine them in a Voting Classifier.
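
Hard and soft voting can disagree when one model is much more confident than the others. The sketch below works through a single made-up case with NumPy; the three probabilities are hypothetical and the 0.5 threshold is the usual binary default.

```python
import numpy as np

# Made-up P(survived) from three classifiers for a single passenger
p_survived = np.array([0.45, 0.45, 0.90])

# Hard voting: each model casts one vote for its predicted class
votes = (p_survived >= 0.5).astype(int)          # -> [0, 0, 1]
hard_pred = int(votes.sum() > len(votes) / 2)    # majority says 0

# Soft voting: average the probabilities, then threshold
soft_pred = int(p_survived.mean() >= 0.5)        # mean is 0.60 -> 1

print(hard_pred, soft_pred)
```

Soft voting needs probability estimates from every member, which is why the SVC above is created with `probability = True`.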

## Voting Classifier

In [None]:
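# save best classifiers
# (only the tuned classifier step is kept; the KernelPCA and ADASYN steps of each pipeline are dropped,
# so VotingClassifier refits these classifiers on the unreduced, un-resampled features)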
best_lr = grid_lr.best_estimator_.named_steps['clasifier']
best_svc = grid_svc.best_estimator_.named_steps['clasifier']
best_knn = grid_knn.best_estimator_.named_steps['clasifier']
best_rf = grid_rf.best_estimator_.named_steps['clasifier']
best_et = grid_et.best_estimator_.named_steps['clasifier']
best_gbc = grid_gbc.best_estimator_.named_steps['clasifier']
best_ada = grid_ada.best_estimator_.named_steps['clasifier']

# set Voting Classifiers with 'hard' and 'soft' voting
vc_hard = VotingClassifier(estimators = [('lr', best_lr),('svc', best_svc),('rf',best_rf),('ada',best_ada),('et',best_et)], voting = 'hard') 
vc_soft = VotingClassifier(estimators = [('lr', best_lr),('svc', best_svc),('rf',best_rf),('ada',best_ada),('et',best_et)], voting = 'soft') 

vc_hard = vc_hard.fit(X_train, y_train.values.ravel())
vc_soft = vc_soft.fit(X_train, y_train.values.ravel())

y_pred_vc_hard = vc_hard.predict(X_test)
y_pred_vc_soft = vc_soft.predict(X_test)

In [None]:
# Plot confusion matrix
c_matrix_vc_hard = confusion_matrix(y_test, y_pred_vc_hard)
vis_conf_matrix(c_matrix_vc_hard, "Voting Classifier (hard)")

In [None]:
# Plot confusion matrix
c_matrix_vc_soft = confusion_matrix(y_test, y_pred_vc_soft)
vis_conf_matrix(c_matrix_vc_soft, "Voting Classifier (soft)")

In [None]:
# add scores to the dataframe
scores = pd.concat([scores, 
                    pd.DataFrame([{'model': 'voting hard', 
                                   #'features': np.nan, 
                                   'f1': f1_score(y_test, y_pred_vc_hard),
                                   'accuracy': accuracy_score(y_test, y_pred_vc_hard)},
                                  {'model': 'voting soft', 
                                   #'features': np.nan, 
                                   'f1': f1_score(y_test, y_pred_vc_soft),
                                   'accuracy': accuracy_score(y_test, y_pred_vc_soft)}])], 
                   ignore_index=True)

scores.loc[scores.model == 'voting hard', 
           ['tn', 'fp', 'fn', 'tp']] = np.around(c_matrix_vc_hard.ravel()/np.sum(c_matrix_vc_hard)*100, 
                                                 decimals=2)

scores.loc[scores.model == 'voting soft', 
           ['tn', 'fp', 'fn', 'tp']] = np.around(c_matrix_vc_soft.ravel()/np.sum(c_matrix_vc_soft)*100, 
                                                 decimals=2)

scores

In [None]:
# generate all possible weight combinations
x = [1, 2]
params = {'weights' : [p for p in itertools.product(x, repeat=5)]}

# do grid search to find best weights
grid_vc_weight = GridSearchCV(vc_soft, param_grid = params, cv = skf, scoring = 'f1', 
                              verbose = True, n_jobs = -1)
grid_vc_weight.fit(X_train, y_train.values.ravel())
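
The grid above contains 2^5 = 32 weight tuples. Because a weighted average does not change when every weight is multiplied by the same constant, some tuples describe the same ensemble. The following sketch, using only the standard library, counts how many distinct ensembles the grid actually covers.

```python
import itertools
from fractions import Fraction

candidates = list(itertools.product([1, 2], repeat=5))

# Voting depends only on relative weights, so normalise each tuple to sum to 1
normalised = {tuple(Fraction(w, sum(ws)) for w in ws) for ws in candidates}

print(len(candidates), len(normalised))
```

Only the all-ones and all-twos tuples coincide here, so the redundancy is small for this grid.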

In [None]:
# look at the best parameters
best_vc_w = grid_vc_weight.best_estimator_
grid_vc_weight.best_score_, grid_vc_weight.best_params_

In [None]:
# predict target values
y_pred_vc_w = grid_vc_weight.predict(X_test)

In [None]:
# Plot confusion matrix
c_matrix_vc_w = confusion_matrix(y_test, y_pred_vc_w)
vis_conf_matrix(c_matrix_vc_w, "Voting Classifier (weighted)")

In [None]:
# save scores
scores = pd.concat([scores, 
                    pd.DataFrame([{'model': 'voting weighted', 
                                   #'features': np.nan, 
                                   'f1': f1_score(y_test, y_pred_vc_w),
                                   'accuracy': accuracy_score(y_test, y_pred_vc_w)}])], 
                   ignore_index=True)

scores.loc[scores.model == 'voting weighted', 
           ['tn', 'fp', 'fn', 'tp']] = np.around(c_matrix_vc_w.ravel()/np.sum(c_matrix_vc_w)*100, 
                                                 decimals=2)
scores

Best models:
* Voting Classifier with soft voting
* Voting Classifier with hard voting
* Voting Classifier with weighted voting

## Make final predictions

In [None]:
# prepare full train set and submission test set
X_fin_train = train2[features].copy()
y_fin_train = train2[target].copy()

X_fin_test = test2[features].copy()

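X_fin_train = pd.get_dummies(X_fin_train, columns = cat_features, drop_first=True)
X_fin_test = pd.get_dummies(X_fin_test, columns = cat_features, drop_first=True)
# align test columns with train: add missing dummy columns as 0, drop extras
X_fin_test = X_fin_test.reindex(columns = X_fin_train.columns, fill_value = 0)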

mm = MinMaxScaler()

for column in num_features:
    X_fin_train[[column]] = mm.fit_transform(X_fin_train[[column]])
    X_fin_test[[column]] = mm.transform(X_fin_test[[column]])

print("Train X: ", X_fin_train.shape)
print("Train y: ", y_fin_train.shape)
print("Test X: ", X_fin_test.shape)

In [None]:
# make sure that we have the same columns in train and test sets
X_fin_train.columns

In [None]:
X_fin_test.columns
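
A sketch of how `reindex` can force a test frame onto the training columns when a category is missing from the test data. The `Title` values and the `_toy` frames are made up for illustration.

```python
import pandas as pd

train_toy = pd.DataFrame({'Title': ['Mr', 'Mrs', 'Miss', 'Rare']})
test_toy = pd.DataFrame({'Title': ['Mr', 'Mrs', 'Miss']})   # 'Rare' never occurs

train_d = pd.get_dummies(train_toy, columns=['Title'], drop_first=True)
test_d = pd.get_dummies(test_toy, columns=['Title'], drop_first=True)

# Give the test frame exactly the training columns; absent dummies become 0
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(train_d.columns))
print(list(test_aligned.columns))
```

Without this step, a model fitted on the training columns would fail or misread features whenever the two frames differ.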

In [None]:
# fit voting classifier models and predict final results for submission
best_vc_w.fit(X_fin_train, y_fin_train.values.ravel())
vc_hard.fit(X_fin_train, y_fin_train.values.ravel())
vc_soft.fit(X_fin_train, y_fin_train.values.ravel())

y_fin_vc_w = best_vc_w.predict(X_fin_test).astype(int)
y_fin_vc_hard = vc_hard.predict(X_fin_test).astype(int)
y_fin_vc_soft = vc_soft.predict(X_fin_test).astype(int)

In [None]:
# combine predictions and Passenger Ids into dataframes 
final_data = {'PassengerId': test2.PassengerId, 'Survived': y_fin_vc_w}
submission = pd.DataFrame(data=final_data)

final_data_2 = {'PassengerId': test2.PassengerId, 'Survived': y_fin_vc_hard}
submission_2 = pd.DataFrame(data=final_data_2)

final_data_3 = {'PassengerId': test2.PassengerId, 'Survived': y_fin_vc_soft}
submission_3 = pd.DataFrame(data=final_data_3)

In [None]:
# save submission files 
submission.to_csv('submission_vc_weighted.csv', index = False)
submission_2.to_csv('submission_vc_hard.csv', index = False)
submission_3.to_csv('submission_vc_soft.csv', index = False)

<p style="text-align: center; font-weight: 700;"> 
Thank you for reading!
</p>