**Created by: Dalton R. Burton**

**Project: Kaggle Competition**

**Date: 23/09/2023**

# <center> Titanic - Machine Learning from Disaster </center>

#### <p><center> by: Dalton R. Burton </center></p>
    
***

#### The goal of this notebook is to use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

## The Challenge
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

***

## The Question

Our target question is: 
***
### “What sorts of people were more likely to survive?”
***
To answer this, we must analyze the passenger data and identify the features that best predict survival.

Let's get started!

***

# Library Imports
***

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt # data visualization library
import plotly.express as px
%matplotlib inline
import seaborn as sns
from IPython.display import display_html
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import warnings
warnings.filterwarnings('ignore')
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

***
# Set styles
Let's set the style of the notebook.
***

In [None]:
# Adjusting plot style
rc = {
    "axes.facecolor": "#F0F5F6",
    "figure.facecolor": "#F0F5F6",
    "axes.edgecolor": "#000000",
    "grid.color": "#EBEBE7" + "30",
    "font.family": "sans-serif",
    "axes.labelcolor": "#000000",
    "xtick.color": "#000000",
    "ytick.color": "#000000",
    "grid.alpha": 0.4,
    "ytick.labelsize": 8,
    "xtick.labelsize": 8,
    "legend.title_fontsize": 8,
    "legend.fontsize": 7
}

sns.set(rc=rc)
palette = ['#302c36', '#037d97', '#E4591E', '#C09741',
           '#EC5B6D', '#90A6B1', '#6ca957', '#D8E3E2']

from colorama import Style, Fore
blk = Style.BRIGHT + Fore.BLACK
mgt = Style.BRIGHT + Fore.MAGENTA
red = Style.BRIGHT + Fore.RED
blu = Style.BRIGHT + Fore.BLUE
res = Style.RESET_ALL

def set_alternate_colors(df):
    return [
        'background-color: #ABDBD5' if i % 2 == 0 else 'background-color: #f2f2f2'
        for i in range(len(df))
    ]

def dstyle(df):
    styled_df = df.style.apply(set_alternate_colors)
    display_html(styled_df._repr_html_(), raw=True)

***
# Functions
Define the helper functions used for analysis and machine learning.
***

In [None]:
def confirm(a):
    name = a.__name__
    print(name+ " function has been initialized!")
    

# --------------------------------------------------------------------------------
# Returns a summary of data
def summary(df):
    summ = pd.DataFrame(df.dtypes, columns=['dtypes'])
    summ['missing#'] = df.isna().sum()
    summ['missing%'] = df.isna().mean() * 100  # percentage of missing values per column
    summ['uniques'] = df.nunique().values
    summ['count'] = df.count().values
    confirm(summary)
    return summ


# --------------------------------------------------------------------------------
# Returns a series of countplots to compare categorical data
def cat_count_compare(data):
    fig, axes = plt.subplots(1,data.shape[1],figsize = (16, 3),sharey=False)
    for i in range(len(data.columns)):
        ax = axes[i]
        ax.grid(visible=True, which = 'both', linestyle = '--', color='lightgrey', linewidth = 0.75)
        sns.countplot(x=data.columns[i],data = data, width = .4,ax = ax)
        ax.set(xlabel = '', ylabel = '')
        ax.set_title(data.columns[i],fontsize = 8)
    confirm(cat_count_compare)
    plt.show()
    

# --------------------------------------------------------------------------------
# Returns a series of KDEplots
def kde_plot(data1, data2):
    # Accept NumPy arrays as well as DataFrames; convert once, then plot once
    if isinstance(data1, np.ndarray):
        data1 = pd.DataFrame(data1)
    if isinstance(data2, np.ndarray):
        data2 = pd.DataFrame(data2)
    kde_plot_execute(data1, data2)


# --------------------------------------------------------------------------------
def kde_plot_execute(data1, data2):
    width_ratios = [1] * data2.shape[1]  # Initialize with equal width
    total_width = sum(width_ratios)
    width_ratios = [width / total_width for width in width_ratios]

    # squeeze=False keeps axes two-dimensional, so axes[i, j] works for any grid shape
    fig, axes = plt.subplots(data1.shape[1],data2.shape[1],figsize = (16, 8),sharey=False,squeeze=False,\
                gridspec_kw = {'hspace': 0.35, 'wspace': 0.3,'width_ratios': width_ratios})
    for i,col in enumerate(data1):
        for j,col2 in enumerate(data2):
            ax = axes[i,j]
            ax.grid(visible=True, which = 'both', linestyle = '--', color='lightgrey', linewidth = 0.75)
            # fill=True is the current name for the deprecated shade=True
            sns.kdeplot(data = data1, x= data1[col],hue = data2[col2],palette=palette[0:8], fill=True,ax = ax)
            ax.set_title(col + ' by '+ col2 +' Comparison',fontsize = 8)
            ax.set(xlabel = '', ylabel = '')
    confirm(kde_plot)
    # Show the plot
    plt.show()
    

# --------------------------------------------------------------------------------
# Gets the summary for a Cabin
def cabin_summary(a):
    deck_x = cabin_data.loc[cabin_data['New_Cabin_data'] == a]
    return [deck_x.describe(),deck_x]


# --------------------------------------------------------------------------------
def val_summary(data):
    summ = data.describe().T\
    .style.bar(subset=['mean'], color=px.colors.qualitative.G10[1])\
    .background_gradient(subset=['std'], cmap='Reds')\
    .background_gradient(subset=['50%'], cmap='Reds')
    return summ


# --------------------------------------------------------------------------------
# define functions to fill missing values with mode or mean
def nan_to_mode(df_col):
    mode = df_col.mode()[0]
    df_col.fillna(mode, inplace = True)
    return df_col


# --------------------------------------------------------------------------------
def nan_to_mean(df_col):
    mean = df_col.mean()
    df_col.fillna(mean, inplace = True)
    return df_col


# --------------------------------------------------------------------------------
# defines distribution plot function
def DistributionPlot(RedFunction, BlueFunction, RedName, BlueName, Title):
    width = 3
    height = 3
    plt.figure(figsize=(width, height))
    
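    # Colors now match the parameter names: first series in red, second in blue
    ax1 = sns.kdeplot(RedFunction, color='r', label=RedName)
    ax2 = sns.kdeplot(BlueFunction, color='b', label=BlueName, ax=ax1)

    plt.title(Title)
    plt.legend()  # show the labels passed to kdeplot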
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.show()
    plt.close()


# --------------------------------------------------------------------------------
# Reports, for each model, the polynomial degree with the best test score
def best_score(model_list,order_list,xtrain,xtest,ytrain,ytest):
    for i in range(len(model_list)):
        Rsqu_test = []
        for n in order_list:
            pr = PolynomialFeatures(degree=n)
            X_train_pr = pr.fit_transform(xtrain)
            # Reuse the transformer fitted on the training data
            X_test_pr = pr.transform(xtest)
            model_list[i].fit(X_train_pr, ytrain)

            #train_score = model_list[i].score(X_train_pr, ytrain)
            Rsqu_test.append(model_list[i].score(X_test_pr, ytest))
            #Rsqu_test.append(model_list[i].score(X_train_pr, ytrain))
        print("Max Score: ",model_list[i].named_steps['model'], max(Rsqu_test),"Max Order: ",order_list[Rsqu_test.index(max(Rsqu_test))]) 
        plt.plot(order_list, Rsqu_test)
        plt.xlabel('order')
        plt.ylabel('R^2')
        plt.title('R^2 Using Test Data')
        
        
# --------------------------------------------------------------------------------        
# Let's take a quick look at the data
def side_by_side(dtr,dte):
    a = summary(dtr)
    b = summary(dte)
    test_col = 'Test_'

    new_names = [test_col + col for col in b.columns]
    b.columns = new_names

    new_df = pd.concat([a,b],axis = 1)
    return new_df

***
# Import Machine Learning Models
We will load the libraries for model training and evaluation.
***

In [None]:
# Machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# data processing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import GridSearchCV

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

from sklearn.pipeline import Pipeline

***
# Data Wrangling
We will load the train and test data.
***

In [None]:
# Use pandas read_csv to load csv into notebook
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")

# Use pandas read_csv to load csv into notebook
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")

# prints the first 5 rows of the data
dstyle(train_data.head())

In [None]:
# prints the first 5 rows of the test data
dstyle(test_data.head())

In [None]:
train_data.drop(['Name','PassengerId','Ticket'],axis=1,inplace=True)
test_data.drop(['Name','PassengerId','Ticket'],axis=1,inplace=True)

# prints the first 5 rows of the data
dstyle(train_data.head())

In [None]:
# prints the first 5 rows of the test data
dstyle(test_data.head())

In [None]:
summary_1 = side_by_side(train_data,test_data)
summary_1.style.background_gradient(cmap='Reds')

### Observations:

1. At a glance, we see that `Cabin` and `Age` are missing a lot of data.
2. `Survived`, `Pclass`, `Sex`, and `Embarked` stand out as categorical data.
3. We drop `PassengerId` and `Name` (along with `Ticket`, as done above).
4. `SibSp`, `Parch`, `Age`, and `Fare` stand out as quantitative data.
***
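
The first observation can be checked numerically. Here is a minimal sketch, assuming a small made-up frame called `toy` in place of the Titanic data (the values are illustrative, not real passenger records):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few Titanic columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isna().sum() counts missing values; isna().mean() gives the missing fraction
missing = pd.DataFrame({
    "missing#": toy.isna().sum(),
    "missing%": toy.isna().mean() * 100,
}).sort_values("missing%", ascending=False)
print(missing)
```

Sorting by `missing%` puts the columns that need the most attention at the top.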

### Replace missing values

In [None]:
# Let's replace missing values with the mode or average values in each set
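# Assign the results back so the fill does not rely on in-place edits of a column
train_data['Embarked'] = nan_to_mode(train_data['Embarked'])

test_data['Fare'] = nan_to_mean(test_data['Fare'])

train_data['Age'] = nan_to_mean(train_data['Age'])
test_data['Age'] = nan_to_mean(test_data['Age'])

# Missing cabins become 0 so the column can be turned into a has-cabin flag later
train_data['Cabin'] = train_data['Cabin'].fillna(0)
test_data['Cabin'] = test_data['Cabin'].fillna(0)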

print("Missing values replaced successfully!")
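
The cell above fills each set with its own statistics. A common variant, sketched here on hypothetical toy frames (`toy_train` and `toy_test` are made up for illustration), computes the fill values from the training set only and applies them to both sets. This is an alternative, not what the notebook does:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames, not the real Titanic data
toy_train = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Embarked": ["S", None, "S"]})
toy_test = pd.DataFrame({"Age": [np.nan, 30.0], "Embarked": ["C", None]})

# Compute the fill values once, from the training frame only
fill_values = {
    "Age": toy_train["Age"].mean(),
    "Embarked": toy_train["Embarked"].mode()[0],
}

# fillna with a dict fills column by column and returns a new frame
toy_train_filled = toy_train.fillna(fill_values)
toy_test_filled = toy_test.fillna(fill_values)
print(toy_test_filled)
```

Because `fillna` returns new frames here, the original frames stay untouched and the step is safe to re-run.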

### Map categorical data to integers

In [None]:
sex_mapping = {'male': 0, 'female': 1}
embarked_mapping = {'S': 0, 'C': 1, 'Q': 2}
Pclass_mapping = {1: 2, 2: 1, 3: 0}

test_data['Pclass'] = test_data['Pclass'].replace(Pclass_mapping)
train_data['Pclass'] = train_data['Pclass'].replace(Pclass_mapping)

train_data['Sex'] = train_data['Sex'].replace(sex_mapping)
test_data['Sex'] = test_data['Sex'].replace(sex_mapping)

train_data['Embarked'] = train_data['Embarked'].replace(embarked_mapping)
test_data['Embarked'] = test_data['Embarked'].replace(embarked_mapping)

# Cabin was filled with 0 where missing; flag every remaining value as 1
train_data['Cabin'] = (train_data['Cabin'] != 0).astype(int)
test_data['Cabin'] = (test_data['Cabin'] != 0).astype(int)
print("Values replaced successfully!")
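
The same encoding can be sketched in one step with `Series.map` and a missing-value flag. The frame `toy` below is hypothetical and only illustrates the idea:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real Titanic records
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Cabin": [np.nan, "C85", np.nan],
})

sex_mapping = {"male": 0, "female": 1}
embarked_mapping = {"S": 0, "C": 1, "Q": 2}

# assign returns a new frame; notna() marks rows that have a cabin value
encoded = toy.assign(
    Sex=toy["Sex"].map(sex_mapping),
    Embarked=toy["Embarked"].map(embarked_mapping),
    Cabin=toy["Cabin"].notna().astype(int),
)
print(encoded)
```

One difference worth knowing: `map` turns any value missing from the dictionary into `NaN`, while `replace` leaves such values unchanged, so `map` makes unexpected categories easier to spot.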

In [None]:
summary2 = side_by_side(train_data,test_data)
summary2.style.background_gradient(cmap='Reds')

***
# Exploratory Data Analysis
In this section we will take a look at the data and its structure, and try to identify patterns or irregularities.
***

### Numerical Data
Let's try to analyze the behavior of the numerical data.
***

In [None]:
num_data = train_data[['Fare','Age','SibSp','Parch']]
# Let's summarize the numerical columns
val_summary(num_data)

In [None]:
plt.figure(figsize=(12, len(num_data.columns) * 2.5))

for idx, column in enumerate(num_data):
    count, bins = np.histogram(train_data[column])
    plt.subplot(len(num_data.columns), 2, idx*2+1)
    sns.histplot(x=column, hue="Survived", data=train_data, bins=bins, kde=True)
    plt.title(f"{column} Distribution for Population",fontsize=10)
    plt.ylim(0, train_data[column].value_counts().max() + 10)

plt.tight_layout()
plt.show()

In [None]:
# Let's compare the Age and Fare distributions across the categorical columns
# The categorical columns are what remains after dropping the numerical ones
cat_data = train_data.drop(num_data.columns,axis=1)
kde_plot(num_data[['Age','Fare']], cat_data)

In [None]:
train_data.corr()

In [None]:
corr_matrix = train_data.corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

plt.figure(figsize=(5, 5))
sns.heatmap(corr_matrix, mask=mask, annot=False, cmap='YlOrBr', fmt='.2f', linewidths=4, square=True, annot_kws={"size": 8} )
plt.title('Correlation Matrix', fontsize=10)
plt.show()

In [None]:
# Correlation of each column with the target, sorted from lowest to highest
corr_matrix = train_data.corr()['Survived'].sort_values().to_frame()

plt.figure(figsize=(5, 5))
# No triangular mask here: on a single column it would only hide the first value
sns.heatmap(corr_matrix, annot=True, cmap='YlOrBr', fmt='.2f', linewidths=4, square=True, annot_kws={"size": 8} )
plt.title('Correlation with Survived', fontsize=10)
plt.show()

## Training the model

In [None]:
m_features = ["Pclass", "Sex", "SibSp", "Parch","Embarked","Cabin","Fare"]

X_dta = train_data[m_features]
y_dta = train_data['Survived']

X_train, X_test, y_train, y_test = train_test_split(X_dta, y_dta, test_size=.35, random_state=15)

### Determine Hyperparameters

We will use three models: `RandomForestClassifier`, `LogisticRegression`, and `KNeighborsClassifier`.

In [None]:
# The liblinear solver supports both the 'l1' and 'l2' penalties searched below
m_models = [RandomForestClassifier(n_estimators=100), LogisticRegression(solver='liblinear', random_state = 15), KNeighborsClassifier(n_neighbors=2)]

param_grid_R = {'n_estimators': [100,200,300,400], 'max_depth': [None, 4, 8,12,14]}
param_grid_L = {'C': [0.001, 0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2']}
param_grid_k = {'n_neighbors': [3, 5, 7,9],'weights': ['uniform', 'distance'],'p': [1, 2], 'leaf_size':[2,4,6,8]}
param = [param_grid_R,param_grid_L,param_grid_k]

In [None]:
grid_list = []
for i in range(len(m_models)):
    grid_search = GridSearchCV(estimator=m_models[i], param_grid=param[i], scoring='accuracy', cv=5)
    grid_search.fit(X_train, y_train)
    grid_list.append(grid_search)
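
The search above tunes the bare estimators and adds scaling later. As a hedged alternative, the scaler can be placed inside the searched pipeline so that every fold is scaled using only its own training part. The sketch below uses synthetic data from `make_classification` as a stand-in for the Titanic features, and the names `pipe`, `demo_grid`, and `search` are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the Titanic features (an assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=15)

pipe = Pipeline([("scale", StandardScaler()), ("model", KNeighborsClassifier())])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
demo_grid = {
    "model__n_neighbors": [3, 5, 7],
    "model__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid=demo_grid, scoring="accuracy", cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

With this layout, `search.best_estimator_` is already a fitted scaler-plus-model pipeline, so no separate wrapping step is needed afterwards.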

In [None]:
best_models=[]
for i in range(len(grid_list)):
    best_params = grid_list[i].best_params_
    best_models.append(grid_list[i].best_estimator_)

In [None]:
for i in range(len(best_models)):
    scores = cross_val_score(best_models[i], X_train, y_train, cv=5, scoring='accuracy')
    print("Cross-validated Accuracy:", scores.mean())

In [None]:
rf_input=[('scale',StandardScaler()), ('model',best_models[0])]
lre_input=[('scale',StandardScaler()), ('model',best_models[1])]
kn_input=[('scale',StandardScaler()), ('model',best_models[2])]

lre = Pipeline(lre_input)
rf = Pipeline(rf_input)
kn = Pipeline(kn_input)

models = [lre, rf, kn]

In [None]:
def pred_plot(Xtr, Ytr, i):
    P = models[i].predict(Xtr)
    accuracy = models[i].score(Xtr, Ytr)
    title = 'Distribution plot'
    DistributionPlot(Ytr,P,'Actual','Predicted',title)
    print("Accuracy:", accuracy,'i val: ',i) 

def pred_plot2(Xtr, Ytr, ml):
    P = ml.predict(Xtr)
    accuracy = ml.score(Xtr, Ytr)
    title = 'Distribution plot'
    DistributionPlot(Ytr,P,'Actual','Predicted',title)
    print("Accuracy:", accuracy) 

In [None]:
#order = [2,6,10,14,18]
# Function takes a list of models, a list of order = models,orders, xtrain,xtest,ytrain,ytest
#best_score(models,order,X_train,X_test,y_train,y_test)

In [None]:
# Normalizing the values of the training and test set
#X = (X - X.mean())/X.std()
#X_test = (X_test - X_test.mean())/X_test.std()

#model = RandomForestClassifier(n_estimators=100, max_depth = 4, random_state = 0)
#model.fit(X, y)
#predictions = model.predict(X_test)

#output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
#output.to_csv('submission.csv', index = False)
#print("Your submission was successfully saved!")

In [None]:
Rlist = []
index1 = []
for i in range (2,10):
    Rcross = cross_val_score(lre, X_test, y_test, cv=i)
    Rlist.append(Rcross.mean())
    index1.append(i)
print("Best mean cross-validated score:", max(Rlist))
print("Best number of folds:", index1[Rlist.index(max(Rlist))])
best_val = index1[Rlist.index(max(Rlist))]

In [None]:
pr = PolynomialFeatures(degree=2)
X_train_pr = pr.fit_transform(X_train)
X_test_pr = pr.transform(X_test)  # reuse the transformer fitted on the training split
lre.fit(X_train, y_train)

pred_plot2(X_train, y_train, lre)
pred_plot2(X_test, y_test, lre)

lre.fit(X_train_pr, y_train)
pred_plot2(X_train_pr, y_train, lre)
pred_plot2(X_test_pr, y_test, lre)
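
As an alternative to transforming the splits by hand, the polynomial expansion can live inside the pipeline, so it is fitted on the training split and reused on the test split automatically. This is a minimal sketch on synthetic data; `poly_model` and the `make_classification` inputs are assumptions for illustration, not part of the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the Titanic features (an assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=15
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.35, random_state=15)

poly_model = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
poly_model.fit(X_tr, y_tr)

# 4 inputs at degree 2 expand to 1 bias + 4 linear + 10 quadratic terms
n_poly = poly_model.named_steps["poly"].n_output_features_
print("Expanded features:", n_poly)
print("Test accuracy:", poly_model.score(X_te, y_te))
```

Keeping the expansion in the pipeline also means the same object can be passed to `cross_val_score` or `GridSearchCV` without preparing separate transformed arrays.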

# Train Model on Full Data

In [None]:
X_dta_tr = train_data[m_features]
X_dta_te = test_data[m_features]
y_dta_tr = train_data['Survived']

In [None]:
pr = PolynomialFeatures(degree=2)
X_train_pr1 = pr.fit_transform(X_dta_tr)
X_test_pr1 = pr.transform(X_dta_te)  # reuse the transformer fitted on the training data

lre.fit(X_dta_tr, y_dta_tr)
pred_plot2(X_dta_tr, y_dta_tr, lre)
#pred_plot2(X_dta_te, y_dta_te, rf)

lre.fit(X_train_pr1, y_dta_tr)
pred_plot2(X_train_pr1, y_dta_tr, lre)
#pred_plot2(X_test_pr1, y_dta_te, rf)

In [None]:
predictions = lre.predict(X_test_pr1)
test_data2 = pd.read_csv("/kaggle/input/titanic/test.csv")
output = pd.DataFrame({'PassengerId': test_data2.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index = False)
print("Your submission was successfully saved!")

In [None]:
test_data2 = pd.concat([test_data2,output['Survived']],join='outer',axis=1)
test_data2['Age'] = test_data['Age']
test_data2.head()
