### Titanic Kaggle Competition ###

-----
*Data description:*

|Column|Meaning|
|-|-|
|PassengerId|Id|
|Survived|0 = No, 1 = Yes|
|Pclass|1 = 1st, 2 = 2nd, 3 = 3rd|
|Name|Name|
|Sex|Sex|
|Age|Age in years|
|SibSp|# of siblings / spouses aboard the Titanic|
|Parch|# of parents / children aboard the Titanic|
|Ticket|Ticket number|
|Fare|Passenger fare|
|Cabin|Cabin number|
|Embarked|Port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton|


Training data:

https://raw.githubusercontent.com/ThomasJewson/datasets/master/Titanic%20dataset/train.csv

Test data:

https://raw.githubusercontent.com/ThomasJewson/datasets/master/Titanic%20dataset/test.csv

Gender submission (example submission file):

https://raw.githubusercontent.com/ThomasJewson/datasets/master/Titanic%20dataset/gender_submission.csv


**Imports**

In [136]:
import plotly.express as px # interactive graphs
import pandas as pd
import numpy as np
import warnings

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

**Data Import and initialisation**

In [122]:
traindf = pd.read_csv("https://raw.githubusercontent.com/ThomasJewson/datasets/master/Titanic%20dataset/train.csv")
testdf = pd.read_csv("https://raw.githubusercontent.com/ThomasJewson/datasets/master/Titanic%20dataset/test.csv")

# Separate the features from the target column
xtrain = traindf.drop(columns="Survived")
ytrain = traindf["Survived"]

**Initial Analysis**

In [123]:
# 2D scatter plot of Age against Sex, coloured by Survived
fig = px.scatter(
    traindf,
    x="Sex",
    y="Age",
    color="Survived"
)
fig.show()

In [124]:
# Percentage of missing data for each column on the traindf
pd.isnull(traindf).mean() * 100

PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            19.865320
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Cabin          77.104377
Embarked        0.224467
dtype: float64

In [125]:
# Percentage of missing data for each column on the testdf
pd.isnull(testdf).mean() * 100

PassengerId     0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            20.574163
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.239234
Cabin          78.229665
Embarked        0.000000
dtype: float64
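
The two tables above can also be put side by side, which makes it easier to spot columns that are missing in only one of the sets. A minimal sketch, assuming a hypothetical helper `missing_report` and two small made-up frames standing in for `traindf` and `testdf`:

```python
import numpy as np
import pandas as pd


def missing_report(train, test):
    """Return the percentage of missing values per column for both frames.

    `missing_report` is a hypothetical helper, not part of the notebook.
    """
    report = pd.DataFrame({
        "train_%": train.isnull().mean() * 100,
        "test_%": test.isnull().mean() * 100,
    })
    # Columns that exist in only one frame (e.g. Survived) show up as NaN.
    return report.sort_values("train_%", ascending=False)


# Small made-up frames; in the notebook these would be traindf and testdf.
demo_train = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0],
                           "Fare": [7.25, 71.28, 7.92, 53.1],
                           "Survived": [0, 1, 1, 1]})
demo_test = pd.DataFrame({"Age": [34.5, 47.0, np.nan, np.nan],
                          "Fare": [7.83, np.nan, 9.69, 8.66]})

report = missing_report(demo_train, demo_test)
print(report)
```

Calling `missing_report(traindf, testdf)` in the notebook would give the same two columns of percentages in one table.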

**Cleaning data**

In [126]:
def cleandf(x):
    """Clean the data: encode Sex as a number (male = 1, female = 0) and drop the columns that are not used."""
    x = x.assign(Sex=x["Sex"].map({"male": 1, "female": 0}))
    y = x.drop(columns=["PassengerId", "Name", "Ticket", "Cabin", "Embarked"])
    return y


# Returning clean data
cxtrain = cleandf(xtrain) 
cxtest = cleandf(testdf)

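# Replacing missing values (NaN) with the column median
cxtrain["Age"] = cxtrain["Age"].fillna(cxtrain["Age"].median())
cxtest["Age"] = cxtest["Age"].fillna(cxtest["Age"].median())
# The test set also has a missing Fare (see the testdf table above); a NaN left here
# would make the Perceptron raise an error when predicting on cxtest
cxtest["Fare"] = cxtest["Fare"].fillna(cxtest["Fare"].median())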


# Splitting the training data into 2 sets to allow evaluation of the model
cxtrtr, cxtrte, ytrtr, ytrte = train_test_split(cxtrain, ytrain, test_size=0.25, random_state=0)

# KEY
#
# cxtrtr -> c = clean, x = data, tr = from training set, tr = to train the model on
# ytrte -> y = result, tr = from training set, te = to evaluate the model on



print('There are {} samples in the training set and {} samples in the test set'.format(
cxtrtr.shape[0], cxtrte.shape[0]))

There are 668 samples in the training set and 223 samples in the test set
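
The split above is purely random, so the share of survivors can differ between the two parts. One possible variant, not what the notebook does, is to pass `stratify` so that both parts keep roughly the same class ratio. The sketch below uses synthetic data with 891 rows (668 + 223) and an arbitrary share of positives as stand-ins for `cxtrain` and `ytrain`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for cxtrain / ytrain; the 38% positive rate is illustrative only.
X_demo = rng.normal(size=(891, 6))
y_demo = (rng.random(891) < 0.38).astype(int)

# stratify=y_demo keeps the class proportions (almost) equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print("train size:", X_tr.shape[0], "held-out size:", X_te.shape[0])
print("positive share: train %.3f, held-out %.3f" % (y_tr.mean(), y_te.mean()))
```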


**Preprocessing data / Standardisation**

In [127]:
# Standardising the data
sc = StandardScaler()
sc.fit(cxtrtr)

# s = standardised
scxtrtr = sc.transform(cxtrtr)
scxtrte = sc.transform(cxtrte)
scxtrain = sc.transform(cxtrain)
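
Fitting the scaler on `cxtrtr` only, as above, keeps information from the evaluation split out of the preprocessing. One way to keep that property during cross-validation is to wrap the steps in a scikit-learn `Pipeline`, so that the preprocessing is refitted inside each fold. A minimal sketch on synthetic data (the values are made up; only the column names follow the notebook), which also moves the median fill into the pipeline through `SimpleImputer`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for cxtrain / ytrain.
X_demo = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.integers(0, 2, size=n),
    "Age": rng.uniform(1, 70, size=n),
    "Fare": rng.uniform(5, 100, size=n),
})
X_demo.loc[rng.random(n) < 0.2, "Age"] = np.nan  # mimic the missing ages
y_demo = (X_demo["Sex"] == 0).astype(int)         # toy label tied to Sex

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # median fill, learnt per training fold
    ("scale", StandardScaler()),                   # standardise after imputing
    ("clf", Perceptron(max_iter=40, eta0=0.1, random_state=0)),
])

# Each fold fits the imputer and scaler on its own training part only.
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print("CV accuracy per fold:", scores.round(2))
```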

**Perceptron ML model** 

In [130]:
tit_clf = Perceptron(
    max_iter=40,
    eta0=0.1,
    random_state=0
)

tit_clf.fit(scxtrtr,ytrtr)
y_pred = tit_clf.predict(scxtrte)

print("Misclassified samples: %d" % (ytrte != y_pred).sum())
print("Accuracy: %.2f" % accuracy_score(ytrte,y_pred))

Misclassified samples: 87
Accuracy: 0.61
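
An accuracy figure is easier to judge next to a trivial baseline and a per-class breakdown; `classification_report` is imported above but not yet used. The sketch below runs on synthetic data, and `DummyClassifier` is an addition that does not appear in the notebook: it always predicts the most frequent class of the training part.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for the standardised features and the Survived labels.
X_demo = rng.normal(size=(400, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=400) > 0.3).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Baseline: always predict the majority class of the training part.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))

clf = Perceptron(max_iter=40, eta0=0.1, random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
model_acc = accuracy_score(y_te, y_pred)

print("Baseline accuracy: %.2f" % baseline_acc)
print("Perceptron accuracy: %.2f" % model_acc)
# Precision, recall and F1 per class on the held-out part.
print(classification_report(y_te, y_pred, zero_division=0))
```

If the model's accuracy is close to the baseline, the classifier adds little over always predicting the majority class.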


*Hyperparameter tuning*

In [138]:
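# Silence all warnings; this also hides any ConvergenceWarning from the small max_iter values
warnings.filterwarnings('ignore')
# Note: in sklearn's Perceptron, alpha only takes effect when a penalty is set (penalty defaults to None)
tuned_parameters = [{'alpha': [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3],
                     'max_iter': [1, 10, 20, 40, 80, 160, 320, 640, 1280, 12800],
                     'eta0':[0.1,0.2,0.4]}]

# Cross-validated grid search over every combination of the parameters above
tit_clf_grid = GridSearchCV(Perceptron(), tuned_parameters)
tit_clf_grid.fit(scxtrtr, ytrtr)

print("Best result found for", tit_clf_grid.best_params_)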

Best result found for {'alpha': 0.0001, 'eta0': 0.1, 'max_iter': 1}
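
In scikit-learn's `Perceptron`, `alpha` only multiplies the regularisation term, and `penalty` defaults to `None`, so the `alpha` values in the grid above would not be expected to change the fit. A hedged sketch of a grid that also searches over `penalty`, is scored by cross-validation, and is then checked once on a held-out split; the data is synthetic and the grid values are illustrative assumptions, not the notebook's:

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for the standardised features and the Survived labels.
X_demo = rng.normal(size=(400, 5))
y_demo = (X_demo[:, 0] - X_demo[:, 1] + 0.5 * rng.normal(size=400) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

param_grid = [
    # Without a penalty, alpha has no effect, so it is left out of this sub-grid.
    {"penalty": [None], "max_iter": [10, 100, 1000]},
    # With a penalty, alpha sets the regularisation strength.
    {"penalty": ["l2", "l1"], "alpha": [0.0001, 0.001, 0.01], "max_iter": [10, 100, 1000]},
]

grid = GridSearchCV(Perceptron(random_state=0), param_grid, cv=5, scoring="accuracy")
grid.fit(X_tr, y_tr)

print("Best parameters:", grid.best_params_)
print("Best CV accuracy: %.2f" % grid.best_score_)
# best_estimator_ is refitted on all of X_tr; score it once on the held-out part.
held_out_acc = accuracy_score(y_te, grid.best_estimator_.predict(X_te))
print("Held-out accuracy: %.2f" % held_out_acc)
```

Scoring the refitted `best_estimator_` on the held-out part gives an estimate that was not used to choose the parameters.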
