<html>
    <body>
        <h1 class="alert alert-info" style="text-align: center;">Surviving the Titanic: A Machine Learning Approach to Predicting Passenger Survival</h1>
        <h2 id="contents">Table of Contents</h2>
        <ol>
            <a href="#section1"><li>Importing libraries and loading the dataset</li></a>
            <a href="#section2"><li>Exploring the dataset</li></a>
            <a href="#section3"><li>Data cleaning</li></a>
            <a href="#section4"><li>Exploratory data analysis</li></a>
            <ol>
                <a href="#sub_section1_1"><li type="i">Univariate analysis</li></a>
                <a href="#sub_section1_2"><li type="i">Bivariate analysis</li></a>
            </ol>        
            <a href="#section5"><li>Data Preprocessing</li></a>
            <a href="#section6"><li>Model Building</li></a>
            <a href="#section7"><li>Model Generation and Evaluation</li></a>
            <ol>
                <a href="#sub_section2_1"><li type="i">KNN Classifier</li></a>
                <a href="#sub_section2_2"><li type="i">Logistic Regression</li></a>
                <a href="#sub_section2_3"><li type="i">Decision Tree Classifier</li></a>
                <a href="#sub_section2_4"><li type="i">MLP Classifier</li></a>
                <a href="#sub_section2_5"><li type="i">Support Vector Machine</li></a>
            </ol> 
        </ol>
    </body>
</html>

<div class="col-md-8">
    <h2 id="section1">1. Importing libraries and loading the dataset</h2>
    <p>Let's start by importing the necessary libraries and loading the dataset.</p>
</div>
<div class="col-md-4">
    <a href="#contents">Back to top</a>
</div>

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from tabulate import tabulate
import warnings
import re

color = sns.color_palette()

%matplotlib inline

pd.options.mode.chained_assignment = None
pd.options.display.max_columns = 999

warnings.filterwarnings(action = 'ignore')

In [None]:
# Read the data
df_train = pd.read_csv('../input/titanic/train.csv')
df_test = pd.read_csv('../input/titanic/test.csv')
df_submission  = pd.read_csv('../input/titanic/gender_submission.csv')

<div class="col-md-8">
    <h2 id="section2">2. Exploring the dataset</h2>
    <p>Let's explore the datasets:</p>
</div>
<div class="col-md-4">
    <a href="#contents">Back to top</a>
</div>

In [None]:
# Shape of the data
print('Shape of the train data:', df_train.shape)
print('Shape of the test data:', df_test.shape)
print('Shape of the submission data:', df_submission.shape)

In [None]:
# Sample train data
df_train.head()

In [None]:
# Sample test data
df_test.head()

In [None]:
# Sample submission data
df_submission.head()

<p>Let's explore the train dataset to get a better understanding of its structure and content:</p>

In [None]:
# Data types
df_train.dtypes

In [None]:
# Summary statistics
df_train.describe()

In [None]:
# Unique values in each column
df_train.nunique()

<div class="col-md-8">
    <h3 id="section3">3. Data cleaning</h3>
    <p>Nice! We have a dataset with <b>891</b> rows and <b>12</b> columns. Let's clean the dataset by handling missing values, duplicates, irrelevant columns, and converting data types.</p>
</div>
<div class="col-md-4">
    <a href="#contents">Back to top</a>
</div>

In [None]:
# Let's create a copy of the train and test data to perform data cleaning
df_train_copy = df_train.copy()
df_test_copy = df_test.copy()

In [None]:
# Missing values in train data
df_train_copy.isna().sum()

In [None]:
# Missing values in test data
df_test_copy.isna().sum()

<p>The train data has a very high number of missing values in <b>Cabin</b>, followed by <b>Age</b>, and just 2 in the <b>Embarked</b> column.</p>
<p>Let's impute the missing values in the <b>Embarked</b> column with the mode, and check whether the other columns show any patterns.</p>

In [None]:
# Impute missing values in Age column with median
# df_train_copy['Age'] = df_train_copy['Age'].fillna(df_train_copy['Age'].median())
# df_test_copy['Age'] = df_test_copy['Age'].fillna(df_test_copy['Age'].median())

# Impute missing values in Embarked column with mode
df_train_copy['Embarked'] = df_train_copy['Embarked'].fillna(df_train_copy['Embarked'].mode()[0])
df_test_copy['Embarked'] = df_test_copy['Embarked'].fillna(df_test_copy['Embarked'].mode()[0])


<p>Empty values in the <b>Cabin</b> column may indicate that the passenger didn't have a cabin, so let's check whether empty Cabin values are related to Survived.</p>

In [None]:
# Check if there is any relations between the missing values in Cabin column and Survived column
df_train_copy[df_train_copy['Cabin'].isna()]['Survived'].value_counts()
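
<p>The counts above cover only passengers with a missing Cabin. To compare them with passengers who have a recorded cabin, one option is to group by whether Cabin is missing and take the mean of Survived. The sketch below uses a small hypothetical table (<code>toy</code>), not the real data:</p>

```python
import pandas as pd

# Hypothetical toy data standing in for the Cabin and Survived columns
toy = pd.DataFrame({
    'Cabin': ['C85', None, None, 'E46', None, None],
    'Survived': [1, 0, 0, 1, 1, 0],
})

# Label each row by whether its Cabin value is missing
cabin_status = toy['Cabin'].isna().map({True: 'missing', False: 'recorded'})

# The mean of a 0/1 column is the survival rate within each group
survival_rate = toy.groupby(cabin_status)['Survived'].mean()
print(survival_rate)
```

<p>On the notebook's data, the same pattern would presumably apply to <code>df_train_copy</code> before the Cabin column is imputed.</p>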

<p>We can see that most passengers with an empty <b>Cabin</b> value did not survive.</p>
<p>A missing Cabin appears to be related to survival, so let's create a new category called "Missing" for the missing cabins.</p>

In [None]:
# Impute missing values in Cabin column with 'Missing'
df_train_copy['Cabin'] = df_train_copy['Cabin'].fillna('Missing')
df_test_copy['Cabin'] = df_test_copy['Cabin'].fillna('Missing')

<p>Now let's check which variables can tell us something about Age. The Name column contains a title (Mr, Mrs, Miss, and so on) that carries some information about Age, so let's extract it.</p>

In [None]:
# Let's extract the name title from the Name column (raw strings avoid invalid-escape warnings for \.)
df_train_copy['Name_Title'] = df_train_copy['Name'].apply(lambda x: re.search(r' ([A-Z][a-z]+)\.', x).group(1))
df_test_copy['Name_Title'] = df_test_copy['Name'].apply(lambda x: re.search(r' ([A-Z][a-z]+)\.', x).group(1))

df_train_copy['Name_Title'].value_counts()
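
<p>To make the pattern concrete, here is a minimal sketch that applies the same regular expression to three sample names written in the "Surname, Title. Given names" layout. The list <code>names</code> and the variable <code>title_pattern</code> are introduced here only for illustration:</p>

```python
import re

# Sample names in the "Surname, Title. Given names" layout of the Name column
names = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
]

# A space, then a capitalised word, then a literal dot; group 1 is the title
title_pattern = re.compile(r' ([A-Z][a-z]+)\.')
titles = [title_pattern.search(name).group(1) for name in names]
print(titles)
```

<p>Note that <code>search</code> returns <code>None</code> when a name has no such title, in which case <code>.group(1)</code> would raise an error.</p>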

In [None]:
# Let's combine the similar titles: Mlle and Ms go with Miss; Mme, Countess, Lady and Dona go with Mrs;
# Rev, Dr, Col, Major, Don, Sir, Capt and Jonkheer (a male honorific) go with Rare
for df in (df_train_copy, df_test_copy):
    df['Name_Title'] = df['Name_Title'].replace(['Mlle', 'Ms'], 'Miss')
    df['Name_Title'] = df['Name_Title'].replace(['Mme', 'Countess', 'Lady', 'Dona'], 'Mrs')
    df['Name_Title'] = df['Name_Title'].replace(['Rev', 'Dr', 'Col', 'Major', 'Don', 'Sir', 'Capt', 'Jonkheer'], 'Rare')

df_train_copy['Name_Title'].value_counts()

In [None]:
# Now let's see the Age of the passengers with respect to their Name_Title
df_train_copy.groupby('Name_Title')['Age'].median()

In [None]:
# Great, now let's impute the missing values in Age column with the median of the respective Name_Title
# (each dataset uses its own per-title medians)
for df in (df_train_copy, df_test_copy):
    title_median_age = df.groupby('Name_Title')['Age'].transform('median')
    df['Age'] = df['Age'].fillna(title_median_age)


In [None]:
# Let's see if we still have any missing values in the train data
df_train_copy.isna().sum()

In [None]:
# Imputing missing values in Fare column with median in test data
df_test_copy['Fare'] = df_test_copy['Fare'].fillna(df_test_copy['Fare'].median())

# Let's see if we still have any missing values in test data
df_test_copy.isna().sum()

In [None]:
# Check for duplicates in train data
df_train_copy.duplicated().sum()

In [None]:
# Check for duplicates in test data
df_test_copy.duplicated().sum()

<p>No duplicates in the dataset! Let's move on to the next step.</p>

<p>Let's convert Survived, Pclass, Sex, SibSp, Parch, Embarked, Ticket, Cabin, and Name to categorical variables.</p>

In [None]:
# Convert Name, Survived, Pclass, Sex, SibSp, Parch, Embarked, Ticket, Cabin to categorical variables in train data
df_train_copy['Survived'] = df_train_copy['Survived'].astype('category')
df_train_copy['Pclass'] = df_train_copy['Pclass'].astype('category')
df_train_copy['Sex'] = df_train_copy['Sex'].astype('category')
df_train_copy['SibSp'] = df_train_copy['SibSp'].astype('category')
df_train_copy['Parch'] = df_train_copy['Parch'].astype('category')
df_train_copy['Embarked'] = df_train_copy['Embarked'].astype('category')
df_train_copy['Ticket'] = df_train_copy['Ticket'].astype('category')
df_train_copy['Cabin'] = df_train_copy['Cabin'].astype('category')
df_train_copy['Name'] = df_train_copy['Name'].astype('category')


In [None]:
# Convert Name, Pclass, Sex, SibSp, Parch, Embarked, Ticket, Cabin to categorical variables in test data
df_test_copy['Pclass'] = df_test_copy['Pclass'].astype('category')
df_test_copy['Sex'] = df_test_copy['Sex'].astype('category')
df_test_copy['SibSp'] = df_test_copy['SibSp'].astype('category')
df_test_copy['Parch'] = df_test_copy['Parch'].astype('category')
df_test_copy['Embarked'] = df_test_copy['Embarked'].astype('category')
df_test_copy['Ticket'] = df_test_copy['Ticket'].astype('category')
df_test_copy['Cabin'] = df_test_copy['Cabin'].astype('category')
df_test_copy['Name'] = df_test_copy['Name'].astype('category')

<div class="col-md-8">
    <h3 id="section4">4. Exploratory data analysis</h3>
    <p>Let's perform exploratory data analysis to extract insights from the Titanic dataset:</p>
    <h4 id="sub_section1_1" >i. Univariate analysis</h4>
    <p>We will start by exploring the distribution of the numerical and categorical variables in the dataset:</p>
</div>
<div class="col-md-4">
    <a href="#contents">Back to top</a>
</div>

In [None]:
# Function for calculating descriptives of numeric variable and plotting the distribution
def plot_dist(df, col, x_label, y_label, plot_title):
    _min = df[col].min()
    _max = df[col].max()
    ran = df[col].max()-df[col].min()
    mean = df[col].mean()
    median = df[col].median()
    st_dev = df[col].std()
    skew = df[col].skew()
    kurt = df[col].kurtosis()

    # calculating points of standard deviation
    points = mean-st_dev, mean+st_dev
    sns.set_style('darkgrid')
    plt.figure(figsize=(12,8))
    sns.histplot(data=df, x=col, bins=30, kde=True, color='dodgerblue')
    sns.lineplot(x=points, y=[0,0], color = 'black', label = "std_dev")
    sns.scatterplot(x=[_min,_max], y=[0,0], color = 'orange', label = "min/max")
    sns.scatterplot(x=[mean], y=[0], color = 'red', label = "mean")
    sns.scatterplot(x=[median], y=[0], color = 'blue', label = "median")
    plt.title(plot_title, fontsize=14)
    plt.xlabel(x_label, fontsize=12)
    plt.ylabel(y_label, fontsize=12)

    # Creating a DataFrame for the descriptive statistics
    variable_stats = pd.DataFrame({'Statistics': ['Minimum Value', 'Maximum Value', 'Range', 'Mean', 
                                                    'Median', 'Standard Deviation', 'Skewness', 'Kurtosis'], 
                                        'Value': [_min, _max, ran, mean, median, st_dev, skew, kurt]})
    
    plt.show()

    display(tabulate(variable_stats, headers='keys', showindex=False, tablefmt='html'))


In [None]:
# Function for plotting the distribution of categorical variables
def plot_cat(df, col, x_label, y_label, plot_title):
    sns.set_style('darkgrid')
    plt.figure(figsize=(12,8))
    sns.countplot(data=df, x=col, color='dodgerblue')
    plt.title(plot_title, fontsize=14)
    plt.xlabel(x_label, fontsize=12)
    plt.ylabel(y_label, fontsize=12)
    plt.show()

<p>The Survived column is our target variable. Let's explore its distribution.</p>

In [None]:
# Plot distribution of Survived column
plot_cat(df_train_copy, 'Survived', 'Survived', 'Count', 'Distribution of Survived column')

In [None]:
# Plotting distribution of Pclass column
plot_cat(df_train_copy, 'Pclass', 'Pclass', 'Count', 'Distribution of Pclass column')

In [None]:
# Plotting distribution of Sex column
plot_cat(df_train_copy, 'Sex', 'Sex', 'Count', 'Distribution of Sex column')

In [None]:
# Plotting distribution of SibSp column
plot_cat(df_train_copy, 'SibSp', 'SibSp', 'Count', 'Distribution of SibSp column')

In [None]:
# Plotting distribution of Parch column
plot_cat(df_train_copy, 'Parch', 'Parch', 'Count', 'Distribution of Parch column')

In [None]:
# Plotting distribution of Embarked column
plot_cat(df_train_copy, 'Embarked', 'Embarked', 'Count', 'Distribution of Embarked column')

<p>Let's explore the distribution of the numerical variables.</p>

In [None]:
# Plotting distribution of Age column
plot_dist(df_train, 'Age', 'Age', 'Count', 'Distribution of Age column')

<ul>
    <li>Median Age is 28.</li>
    <li>The mean and median Age are almost the same, which suggests the distribution is roughly symmetric. This alone does not establish that Age is normally distributed.</li>
</ul>
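
<p>As a hedged illustration of why a close mean and median indicate symmetry rather than normality, the sketch below draws a synthetic uniform sample (not Titanic data). Its mean and median nearly coincide and its skewness is close to zero, yet its excess kurtosis is far from the value of 0 expected for a normal distribution:</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic uniform sample: symmetric, but clearly not bell-shaped
uniform_sample = pd.Series(rng.uniform(0, 80, size=10_000))

mean_median_gap = abs(uniform_sample.mean() - uniform_sample.median())
skewness = uniform_sample.skew()
excess_kurtosis = uniform_sample.kurtosis()  # pandas reports excess kurtosis (normal = 0)
print(mean_median_gap, skewness, excess_kurtosis)
```

<p>The skewness and kurtosis values printed by <code>plot_dist</code> can be read in the same way.</p>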

In [None]:
# Let's see how Fare column is distributed
plot_dist(df_train, 'Fare', 'Fare', 'Count', 'Distribution of Fare column')

<p>Both the skewness and the kurtosis of Fare are very high, so Fare is strongly right-skewed with a heavy tail.</p>
<p>This is expected, since some classes have higher fares than others and only a limited number of seats.</p>

In [None]:
# Let's create a new variable log_Fare by taking log of Fare column
df_train_copy['log_Fare'] = np.log(df_train_copy['Fare']+1)
df_test_copy['log_Fare'] = np.log(df_test_copy['Fare']+1)

# Plotting distribution of log_Fare column
plot_dist(df_train_copy, 'log_Fare', 'log_Fare', 'Count', 'Distribution of log_Fare column')
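
<p>The effect of the log transform on skewness can be checked numerically. The following is a minimal sketch on synthetic right-skewed values (the series <code>fare_like</code> is an assumption standing in for a fare-like column, not the real Fare data):</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for a fare-like column
fare_like = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1_000))

skew_raw = fare_like.skew()
skew_log = np.log1p(fare_like).skew()  # log1p(x) computes log(1 + x)
print(skew_raw, skew_log)
```

<p><code>np.log1p(x)</code> is equivalent to <code>np.log(x + 1)</code> and stays defined when a value is 0.</p>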

<div class="col-md-8">
    <h4 id="sub_section1_2">ii. Bivariate analysis</h4>
    <p>Let's explore the relationship between the target variable Survived and the other variables in the dataset:</p>
</div>
<div class="col-md-4">
    <a href="#contents">Back to top</a>
</div>

In [None]:
# Function for plotting the distribution of numeric variables against the target variable
# Here target variable is assumed to be categorical
def plot_num_vs_target(df, col, target, x_label, y_label, plot_title):
    sns.set_style('darkgrid')
    plt.figure(figsize=(12,8))
    sns.boxplot(data=df, x=target, y=col, color='dodgerblue')
    plt.title(plot_title, fontsize=14)
    plt.xlabel(x_label, fontsize=12)
    plt.ylabel(y_label, fontsize=12)
    plt.show()

In [None]:
# Relationship between Survived and Age
plot_num_vs_target(df_train_copy, 'Age', 'Survived', 'Survived', 'Age', 'Relationship between Survived and Age')

In [None]:
# Relationship between Survived and Fare
plot_num_vs_target(df_train_copy, 'Fare', 'Survived', 'Survived', 'Fare', 'Relationship between Survived and Fare')

In [None]:
# Function for plotting the distribution of categorical variables against the target variable
# Here target variable and categorical variable are assumed to be categorical
def plot_cat_vs_target(df, col, target, x_label, y_label, plot_title):
    sns.set_style('darkgrid')
    plt.figure(figsize=(12,8))
    sns.countplot(data=df, x=col, hue=target, palette='Set1')
    plt.title(plot_title, fontsize=14)
    plt.xlabel(x_label, fontsize=12)
    plt.ylabel(y_label, fontsize=12)
    plt.show()

In [None]:
# Relationship between Survived and Pclass
plot_cat_vs_target(df_train_copy, 'Pclass', 'Survived', 'Pclass', 'Count', 'Relationship between Survived and Pclass')

In [None]:
# Relationship between Survived and Sex
plot_cat_vs_target(df_train_copy, 'Sex', 'Survived', 'Sex', 'Count', 'Relationship between Survived and Sex')

In [None]:
# Relationship between Survived and SibSp
plot_cat_vs_target(df_train_copy, 'SibSp', 'Survived', 'SibSp', 'Count', 'Relationship between Survived and SibSp')

In [None]:
# Relationship between Survived and Parch
plot_cat_vs_target(df_train_copy, 'Parch', 'Survived', 'Parch', 'Count', 'Relationship between Survived and Parch')

In [None]:
# Relationship between Survived and Embarked
plot_cat_vs_target(df_train_copy, 'Embarked', 'Survived', 'Embarked', 'Count', 'Relationship between Survived and Embarked')

In [None]:
# Relationship between Survived and Name_Title
plot_cat_vs_target(df_train_copy, 'Name_Title', 'Survived', 'Name_Title', 'Count', 'Relationship between Survived and Name_Title')

<div class="col-md-8">
    <h3 id="section5">5. Data Preprocessing</h3>
    <p>Before we use variables in our model, we need to preprocess them. We will perform the following steps:</p>
    <ul>
        <li>One-hot encode categorical variables</li>
        <li>Label encode categorical variables</li>
    </ul>
</div>
<div class="col-md-4">
    <a href="#contents">Back to top</a>
</div>

In [None]:
# Function to encode categorical variables, we will use scikit-learn's LabelEncoder for label encoding and pandas get_dummies for one-hot encoding
from sklearn.preprocessing import LabelEncoder
def encode_cat(df, col, encoding_type):
    if encoding_type == 'label':
        label_encoder = LabelEncoder()
        df[col] = label_encoder.fit_transform(df[col])
    elif encoding_type == 'onehot':
        df = pd.get_dummies(df, columns=[col], prefix=[col])
    return df
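
<p>One caveat with label encoding: a <code>LabelEncoder</code> fitted separately on two datasets assigns integers independently, so the same integer can stand for different categories in each. A possible alternative, sketched below on hypothetical toy columns (<code>train_col</code> and <code>test_col</code> are illustrative names), is to fit one encoder on the union of values and reuse it for both:</p>

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy columns; the two splits contain different sets of values
train_col = pd.Series(['C85', 'Missing', 'E46'])
test_col = pd.Series(['Missing', 'B45', 'C85'])

# Fit once on the union of values so both splits share a single mapping
encoder = LabelEncoder()
encoder.fit(pd.concat([train_col, test_col]))

train_codes = encoder.transform(train_col)
test_codes = encoder.transform(test_col)

# Show the shared category-to-integer mapping
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
```

<p>With a shared mapping, a value such as <code>'C85'</code> receives the same code in both columns.</p>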

In [None]:
# Let's check train data before encoding
df_train_copy.Pclass.value_counts(), df_train_copy.SibSp.value_counts(), df_train_copy.Parch.value_counts(), df_train_copy.Embarked.value_counts(), df_train_copy.Cabin.value_counts(), df_train_copy.Name_Title.value_counts()

In [None]:
# Let's check test data before encoding
df_test_copy.Pclass.value_counts(), df_test_copy.SibSp.value_counts(), df_test_copy.Parch.value_counts(), df_test_copy.Embarked.value_counts(), df_test_copy.Cabin.value_counts(), df_test_copy.Name_Title.value_counts()

In [None]:
# Encoding variables in the training dataset and create a new dataframe called df_train_encoded
df_train_encoded = df_train_copy.copy()
df_train_encoded = encode_cat(df_train_encoded, 'Survived', 'label')
df_train_encoded = encode_cat(df_train_encoded, 'Cabin', 'label')
df_train_encoded = encode_cat(df_train_encoded, 'Pclass', 'label')
df_train_encoded = encode_cat(df_train_encoded, 'Sex', 'onehot')
df_train_encoded = encode_cat(df_train_encoded, 'SibSp', 'label')
df_train_encoded = encode_cat(df_train_encoded, 'Parch', 'label')
df_train_encoded = encode_cat(df_train_encoded, 'Embarked', 'label')
df_train_encoded = encode_cat(df_train_encoded, 'Name_Title', 'label')

# Encoding variables in the test dataset and create a new dataframe called df_test_encoded
df_test_encoded = df_test_copy.copy()
df_test_encoded = encode_cat(df_test_encoded, 'Cabin', 'label')
df_test_encoded = encode_cat(df_test_encoded, 'Pclass', 'label')
df_test_encoded = encode_cat(df_test_encoded, 'Sex', 'onehot')
df_test_encoded = encode_cat(df_test_encoded, 'SibSp', 'label')
df_test_encoded = encode_cat(df_test_encoded, 'Parch', 'label')
df_test_encoded = encode_cat(df_test_encoded, 'Embarked', 'label')
df_test_encoded = encode_cat(df_test_encoded, 'Name_Title', 'label')

In [None]:
# Let's check train data after encoding
df_train_encoded.Pclass.value_counts(), df_train_encoded.SibSp.value_counts(), df_train_encoded.Parch.value_counts(), df_train_encoded.Embarked.value_counts(), df_train_encoded.Cabin.value_counts(), df_train_encoded.Name_Title.value_counts()

In [None]:
# Let's check test data after encoding
df_test_encoded.Pclass.value_counts(), df_test_encoded.SibSp.value_counts(), df_test_encoded.Parch.value_counts(), df_test_encoded.Embarked.value_counts(), df_test_encoded.Cabin.value_counts(), df_test_encoded.Name_Title.value_counts()

<p>Now that we have preprocessed the variables, let's check the correlation between them:</p>

In [None]:
# Function to plot correlation between variables
def plot_corr(df, size=10):
    corr = df.corr()
#     print(corr)
    fig, ax = plt.subplots(figsize=(size, size))
    sns.heatmap(corr, annot=True, linewidths=.5, ax=ax, cmap='crest')
    plt.show() 

In [None]:
# Correlation between variables in the training set
plot_corr(df_train_encoded.drop(['PassengerId', 'Name', 'Ticket'], axis=1))

<p>Let's check the correlation between the variables and the target variable:</p>

In [None]:
# Function to plot correlation of variables with the target variable as a barplot
def plot_corr_target(df, target, size=10):
    corr = df.corr()
    corr_target = corr[target]
    corr_target = corr_target.sort_values(ascending=False)
    corr_target = corr_target.drop(target)
    plt.figure(figsize=(size, size))
    corr_target.plot.barh()
    plt.show()

In [None]:
# Check correlation of variables with the target variable
plot_corr_target(df_train_encoded.drop(['Name', 'Ticket', 'PassengerId'], axis=1), 'Survived')

<div class="col-md-8">
    <h3 id="section6">6. Model Building</h3>
    <p>Let's build a model to predict the Survival of passengers on the Titanic:</p>
</div>
<div class="col-md-4">
    <a href="#contents">Back to top</a>
</div>

In [None]:
# We will first separate the target variable from the features
y = df_train_encoded['Survived']
x = df_train_encoded.drop(['Survived', 'Name', 'Ticket', 'PassengerId'], axis=1)
x.shape, y.shape

<p>Let's scale the features using scikit-learn's MinMax scaler:</p>

In [None]:
# Importing the MinMax scaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x)

In [None]:
x = pd.DataFrame(x_scaled, columns = x.columns)

In [None]:
# Check data after scaling
x.head()

<p>Now, let's split the dataset into training and test sets:</p>

In [None]:
# Importing the train test split function
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(x,y, random_state = 50 , stratify=y)
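
<p>The <code>stratify</code> argument keeps the class proportions the same in both parts of the split. A minimal sketch with hypothetical labels (60 zeros and 40 ones, not the Titanic target) shows the effect:</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 zeros and 40 ones
labels = np.array([0] * 60 + [1] * 40)
features = np.arange(100).reshape(-1, 1)

# With no test_size given, 25% of the rows go to the test part;
# stratify keeps the class ratio the same in both parts
x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, random_state=50, stratify=labels
)
print(len(y_tr), len(y_te), y_tr.mean(), y_te.mean())
```

<p>Both parts keep the 40% share of ones found in the full label array.</p>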


<div class="col-md-8">
    <h3 id="section7">7. Model Generation and Evaluation</h3>
    <p>We will use different classification algorithms to build models and evaluate them using accuracy:</p>
    <h4 id="sub_section2_1">i. KNN Classifier</h4>
</div>
<div class="col-md-4">
    <a href="#contents">Back to top</a>
</div>

In [None]:
# Import KNN classifier and the accuracy metric
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import accuracy_score

<p>Let's use KNN classifier to build a model and check consistency using cross validation:</p>

In [None]:
from sklearn.model_selection import cross_val_score
# Function to run cross validation for different values of k

def cross_val_knn(n_neighbors):
    '''Takes an iterable of k values and returns the mean and standard deviation of the accuracy for 10-fold cross validation'''
    average = []
    std = []
    for i in n_neighbors:
        knn = KNN(n_neighbors=i)
        scores = cross_val_score(knn, train_x, train_y, cv=10, scoring='accuracy')
        average.append(scores.mean())
        std.append(scores.std())
    return average, std

In [None]:
# Let's check the scores for a range of k values
n_neighbors = range(1,50)
mean, std = cross_val_knn(n_neighbors)

In [None]:
# Let's plot the average accuracy for k = 11 to 20
plt.plot(n_neighbors[10:20], mean[10:20], color = 'green', label = 'mean' )
plt.xlabel('n_neighbors')
plt.ylabel('Mean Score')
plt.title('Mean Validation score')

In [None]:
# Let's plot the standard deviation of the accuracy for k = 11 to 20
plt.plot(n_neighbors[10:20], std[10:20], color = 'red', label = 'std' )
plt.xlabel('n_neighbors')
plt.ylabel('Standard Deviation')
plt.title('Standard Deviation of Validation score')
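
<p>Besides reading the plots, the best k can be picked programmatically from the list of mean scores. The sketch below uses placeholder numbers (<code>mean_scores</code> is hypothetical and does not reproduce this notebook's results):</p>

```python
import numpy as np

# Hypothetical mean CV scores, one per candidate k (placeholder numbers, not results)
candidate_k = list(range(1, 6))
mean_scores = [0.71, 0.74, 0.78, 0.77, 0.75]

# argmax returns the position of the highest mean score
best_position = int(np.argmax(mean_scores))
best_k = candidate_k[best_position]
print(best_k)
```

<p>In this notebook, the equivalent lookup would be <code>n_neighbors[int(np.argmax(mean))]</code>, assuming <code>mean</code> still holds the output of <code>cross_val_knn</code>.</p>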

In [None]:
# Fit the model on the training set
knn = KNN(n_neighbors=14)
knn.fit(train_x, train_y)

# Accuracy on the train set
score1 = knn.score(train_x, train_y)

# Accuracy on the held-out test set
score2 = knn.score(test_x, test_y)

print('Train score: ', score1)
print('Test score: ', score2)
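
<p>For classifiers, <code>score</code> and <code>accuracy_score</code> both report accuracy. As a hedged sketch of how accuracy can differ from the F1 score, here are both metrics on a small set of hypothetical labels:</p>

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 2 true positives, 1 false negative, 1 false positive, 4 true negatives
y_true = [1, 0, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)  # (TP + TN) / all = 6 / 8
f1 = f1_score(y_true, y_pred)              # 2 * TP / (2 * TP + FP + FN) = 4 / 6
print(accuracy, f1)
```

<p>F1 ignores true negatives, so it can be lower than accuracy when the positive class is the minority.</p>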


In [None]:
test_x.head(2)

In [None]:
df_test_encoded_dropped = df_test_encoded.drop(['Name', 'Ticket', 'PassengerId'], axis=1)
df_test_encoded_dropped.head(2)

In [None]:
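# Reuse the scaler fitted on the training features; refitting here would rescale the submission data differently
test_scaled = scaler.transform(df_test_encoded_dropped)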
df_test_scaled = pd.DataFrame(test_scaled, columns= df_test_encoded_dropped.columns)
submission_predictions = knn.predict(df_test_scaled)

# train_x.shape, df_test_encoded.drop(['Name', 'Ticket', 'PassengerId'], axis=1).shape
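
<p>Calling <code>fit_transform</code> on new data re-learns the minimum and maximum from that data, while <code>transform</code> reuses the values learned earlier. A minimal sketch on a hypothetical single feature (the arrays and <code>scaler_demo</code> are illustrative names):</p>

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature; the second array covers a different range
train_values = np.array([[0.0], [10.0], [20.0]])
new_values = np.array([[10.0], [15.0]])

scaler_demo = MinMaxScaler().fit(train_values)  # learns min=0, max=20

reused = scaler_demo.transform(new_values)        # uses the training range
refit = MinMaxScaler().fit_transform(new_values)  # relearns min=10, max=15
print(reused.ravel(), refit.ravel())
```

<p>A model trained on features scaled with the training range expects new data on that same scale, which is what <code>transform</code> provides.</p>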

In [None]:
train_x.shape, test_x.shape, df_test_encoded_dropped.shape

In [None]:
submission_predictions

In [None]:
df_submission['Survived'] = submission_predictions

In [None]:
df_submission.head()

In [None]:
df_submission.to_csv('submission.csv', index=False)

<div class="col-md-8">
    <h4 id="sub_section2_2">ii. Logistic Regression</h4>
</div>
<div class="col-md-4">
    <a href="#contents">Back to top</a>
</div>

In [None]:
# Importing Logistic Regression
from sklearn.linear_model import LogisticRegression

In [None]:
# Creating instance of Logistic Regression
log_reg = LogisticRegression()

# Fitting the model
log_reg.fit(train_x, train_y)

# Predicting over the test set and calculating accuracy
test_predict_log = log_reg.predict(test_x)
k_log = accuracy_score(test_y, test_predict_log)

print('Accuracy Score    ', k_log )

In [None]:
submission_predictions_log = log_reg.predict(df_test_scaled)

In [None]:
submission_predictions_log

In [None]:
# Combine predictions with df_submission and save to csv
df_submission['Survived'] = submission_predictions_log
df_submission.to_csv('submission_log.csv', index=False)

<div class="col-md-8">
    <h4 id="sub_section2_3">iii. Decision Tree Classifier</h4>
</div>
<div class="col-md-4">
    <a href="#contents">Back to top</a>
</div>

In [None]:
# Importing Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Creating instance of Decision Tree Classifier
clf = DecisionTreeClassifier(max_depth=4)

# Fitting the model
clf.fit(train_x, train_y)

# Predict on the train set
score1 = clf.score(train_x, train_y)

# Predict on the test set
score2 = clf.score(test_x, test_y)

# Predicting over the test set and calculating accuracy
test_predict_dt = clf.predict(test_x)
k_dt = accuracy_score(test_y, test_predict_dt)

print('Accuracy Score 1   ', score1 )
print('Accuracy Score 2   ', score2)
print('Accuracy Score    ', k_dt )

In [None]:
df_test_scaled.head(2)

In [None]:
submission_predictions_dt = clf.predict(df_test_scaled)


In [None]:
# Combine predictions with df_submission and save to csv
df_submission['Survived'] = submission_predictions_dt
df_submission.to_csv('submission_dt.csv', index=False)

<div class="col-md-8">
    <h4 id="sub_section2_4">iv. MLP Classifier</h4>
</div>
<div class="col-md-4">
    <a href="#contents">Back to top</a>
</div>

In [None]:
# Importing MLP classifier
from sklearn.neural_network import MLPClassifier

In [None]:
# Creating instance of MLP classifier
clf = MLPClassifier()

# Fitting the model
clf.fit(train_x, train_y)

# Predicting over the test set and calculating accuracy
test_predict_mlpc = clf.predict(test_x)
k_mlpc = accuracy_score(test_y, test_predict_mlpc)

print('Accuracy Score    ', k_mlpc )

In [None]:
submission_predictions_mlpc = clf.predict(df_test_scaled)

In [None]:
# Combine predictions with df_submission and save to csv
df_submission['Survived'] = submission_predictions_mlpc
df_submission.to_csv('submission_mlpc.csv', index=False)

<div class="col-md-8">
    <h4 id="sub_section2_5">v. Support Vector Machine</h4>
</div>
<div class="col-md-4">
    <a href="#contents">Back to top</a>
</div>

In [None]:
# Importing Support Vector Classifier
from sklearn.svm import SVC

In [None]:
# Creating instance of Support Vector Classifier
clf = SVC()

# Fitting the model
clf.fit(train_x, train_y)

# Predicting over the test set and calculating accuracy
test_predict_svc = clf.predict(test_x)
k_svc = accuracy_score(test_y, test_predict_svc)

print('Accuracy Score    ', k_svc )