## Titanic survival prediction



In this part we will attempt to predict who survived the Titanic. The data set has the following variables (features/attributes):


|  Variable   |          Definition          |              Key/Values                 |
|:-----------:|:----------------------------:|:---------------------------------------:|
| PassengerId | Index                        |integer                                  |
| Pclass      | Ticket class                 |1=1st, 2=2nd, 3=3rd                      |
| Name        | Name of passenger            |string                                   |
| Sex         | Sex                          |male, female                             |
| Age         | Age in years                 |integer                                  |
| SibSp       | # of siblings/spouses aboard |integer                                  |
| Parch       | # of parents/children aboard |integer                                  |
| Ticket      | Ticket number                |string                                   |
| Fare        | Ticket fare                  |float                                    |
| Cabin       | Cabin number                 |a code                                   |
| Embarked    | Port of Embarkation          |C=Cherbourg, Q=Queenstown, S=Southampton |
| **Survived**| Predicted variable           |0=No, 1=Yes                              |


**Goal:**

This part of the project is a competitive one. The goal is to produce the best prediction you can.
<br><br>


**Methodology**

So far you know only a few methods for classification: KNN, logistic regression, and SVM. You can use any of them. You can also use linear regression, but then you need to convert the output to 0s and 1s (this is not a straightforward use of linear regression, but a possible one). You may want to choose which features to use, since some features may not be useful. Further, some features have missing values, and you will need to decide how to handle them (for instance: drop rows with missing values, fill in an average value, or use some other method of your choosing). Another matter to consider is handling non-numeric values. For example, sex is non-numeric. You may choose to drop non-numeric features, or you could convert them to numeric values (if such a conversion makes sense).
You will also need to consider splitting the data into a training set and a test set, so as to avoid tailoring the solution to the data you have (overfitting).
<br><br>
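
As a minimal sketch of the "convert the output to 0s and 1s" idea mentioned above: a linear regression returns continuous values, which can be cut at a threshold to get class labels. The toy data and the 0.5 cut-off are assumptions made for illustration, not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (made up for illustration): one feature, binary target.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

lin = LinearRegression().fit(X_toy, y_toy)

# Raw predictions are continuous and are not restricted to the [0, 1] range.
raw = lin.predict(X_toy)

# Assumed cut-off of 0.5: values at or above it are labelled 1, the rest 0.
labels = (raw >= 0.5).astype(int)
print(raw.round(3), labels)
```

The threshold is itself a tunable choice; 0.5 is only a natural starting point when the two classes are coded as 0 and 1.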


**Model Output**

Your model needs to produce a prediction for each data sample (row): 0 (did not survive) or 1 (survived).
<br><br>

**Scoring your model**

Your model's performance will be assessed on test data that is not available to you. As mentioned above, this part of the project is a competition, where the goal is to achieve the highest model accuracy.
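
For reference, accuracy is the fraction of rows whose predicted label matches the true label. The short sketch below uses made-up labels to show that `accuracy_score` and the manual formula agree.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# accuracy = (number of correct predictions) / (number of predictions)
acc = accuracy_score(y_true, y_pred)
manual = (y_true == y_pred).mean()
print(acc, manual)
```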

<br><br>
**Final Note**

This part of the project is open-ended, in that you are not given small and specific tasks. However, you are already familiar with all the components needed to succeed: reading data into a pandas DataFrame, dropping columns, dropping rows, changing feature values, splitting data into train and test subsets, fitting a model with the sklearn library, and using cross-validation. So don't panic...


GOOD LUCK

## Part 1: Loading Libraries, Reading Data

In [1]:
# Importing Initial Libraries
import numpy as np
import pandas as pd 
import re 
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.impute import SimpleImputer # used for handling missing data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder # used for encoding categorical data
from sklearn.model_selection import train_test_split # used for splitting training and testing data
from sklearn.preprocessing import StandardScaler # used for feature scaling
from sklearn.linear_model import LogisticRegression # logistic regression model
from sklearn.svm import SVC # svm model
from sklearn.neighbors import KNeighborsClassifier # KNN model
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import KFold # used for k-fold cross validation
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

In [2]:
# Importing Data
titanic = pd.read_csv('titanic_train.csv')

## Part 2: Feature Engineering

##### a) Embarked - One Hot Encoding
We split Embarked into three binary features.

In [3]:
Embarked = pd.get_dummies(titanic.Embarked, prefix='Embarked_')
titanic['Embarked_C'] = Embarked.iloc[:,0]
titanic['Embarked_Q'] = Embarked.iloc[:,1]
titanic['Embarked_S'] = Embarked.iloc[:,2]

# Removing rows where Embarked is NA:
# There were only two, so we decided it is better to remove them.
titanic = titanic[titanic['Embarked'].notna()]
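
A possible alternative to copying the dummy columns by position is to let pandas name them and join them back by index. This is only a sketch on a made-up frame; `df` is a stand-in for the real data, and `dtype=int` is an assumption chosen so the new columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Toy frame (made up); the notebook itself reads titanic_train.csv.
df = pd.DataFrame({"Embarked": ["S", "C", None, "Q", "S"]})

# Drop rows with a missing port first, so the frame and its dummies stay aligned.
df = df[df["Embarked"].notna()]

# prefix="Embarked" with the default separator "_" gives
# Embarked_C, Embarked_Q, Embarked_S (sorted alphabetically).
dummies = pd.get_dummies(df["Embarked"], prefix="Embarked", dtype=int)

# join() matches rows on the index, so no positional lookups are needed.
df = df.join(dummies)
print(df)
```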

##### b) Family Size - SibSp + Parch
Adding a feature representing each passenger's family size on board.

In [4]:
# Family_Size = SibSp + Parch + 1 (the +1 counts the passenger themselves)
fam_size = titanic['SibSp'] + titanic['Parch'] + 1
titanic.insert(7, "Family_Size", fam_size, True)

##### c) Title - One Hot Encoding
A new feature holding each passenger's title; we believe it adds information about each passenger's life.

In [5]:
# We start by extracting the title from the name feature

title = titanic['Name'].str.extract(r" ([A-Za-z]+)\. ")
titanic['Title'] = title

# Minimizing titles
## We wanted to make a few specific groups so that we can better evaluate each group (given the amount of data)

titanic['Title'] = titanic['Title'].replace(['Lady', 'Capt', 'Col',
    'Don', 'Dr', 'Major', 'Rev', 'Jonkheer', 'Dona'], 'Rare')
# Note: 'Lady' was already mapped to 'Rare' above, so only 'Countess' and 'Sir' become 'Royal' here
titanic['Title'] = titanic['Title'].replace(['Countess', 'Lady', 'Sir'], 'Royal')
titanic['Title'] = titanic['Title'].replace('Mlle', 'Miss')
titanic['Title'] = titanic['Title'].replace('Ms', 'Miss')
titanic['Title'] = titanic['Title'].replace('Mme', 'Mrs')

# Using One Hot Encoding we transformed each title into its own binary feature
Title = pd.get_dummies(titanic.Title, prefix='Title_')
titanic['Title_Master'] = Title.iloc[:,0]
titanic['Title_Miss'] = Title.iloc[:,1]
titanic['Title_Mr'] = Title.iloc[:,2]
titanic['Title_Mrs'] = Title.iloc[:,3]
titanic['Title_Rare'] = Title.iloc[:,4]
titanic['Title_Royal'] = Title.iloc[:,5]
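
The positional lookups above rely on all six title groups being present in the data. A hedged sketch of one way to guard against a missing group is to reindex the dummy columns against an explicit list. The names below are invented for illustration, and `expected` is a hypothetical helper list, not something from the original notebook.

```python
import pandas as pd

# Invented names for illustration, in the same "Surname, Title. Given" layout.
names = pd.Series([
    "Smith, Mr. John",
    "Doe, Miss. Jane",
    "Roe, Mrs. Ann",
    "Poe, Master. Tim",
    "Lee, Dr. Sam",
])

# Same regular expression as in the notebook; expand=False returns a Series.
titles = names.str.extract(r" ([A-Za-z]+)\. ", expand=False)
titles = titles.replace({"Dr": "Rare"})

# Hypothetical list of the title groups we expect to end up with.
expected = ["Master", "Miss", "Mr", "Mrs", "Rare", "Royal"]

# reindex() adds any group that is absent from the data as an all-zero column.
dummies = (
    pd.get_dummies(titles, prefix="Title", dtype=int)
    .reindex(columns=[f"Title_{t}" for t in expected], fill_value=0)
)
print(dummies)
```

Here no name carries a "Royal" title, yet `Title_Royal` still exists, so downstream code that expects six columns keeps working.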

##### d) Sex - Converting this column to numeric (0/1)

In [6]:
cleanup_vals = {"Sex":     {"male": 1, "female": 0}}
titanic = titanic.replace(cleanup_vals)

##### e) Age - Addressing NAs in the Age feature
We chose to fill in each missing age with the median age of passengers with the same Sex and Pclass.

In [7]:
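# A list of all the index labels where Age is NaN
index_NaN_age = list(titanic["Age"][titanic["Age"].isnull()].index)
for i in index_NaN_age:
    # Look up the row by label (.loc): after dropping the Embarked NA rows,
    # index labels no longer match row positions, so .iloc could read the wrong row
    age_pred = np.nanmedian(titanic["Age"][(titanic['Sex'] == titanic.loc[i, "Sex"]) & 
                                           (titanic['Pclass'] == titanic.loc[i, "Pclass"])])
    titanic.loc[i, 'Age'] = age_pred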
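
The per-group median fill described above can also be written without an explicit loop, using `groupby(...).transform`. This is a sketch on a made-up frame (`df` and its values are illustrative), not the code that produced the results in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up data: two (Sex, Pclass) groups, each with one missing age.
df = pd.DataFrame({
    "Sex":    [1, 1, 1, 0, 0, 0],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age":    [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# For every row, the median Age of its (Sex, Pclass) group; NaNs are ignored.
group_median = df.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Only the missing entries are replaced; known ages are left untouched.
df["Age"] = df["Age"].fillna(group_median)
print(df)
```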

##### f) Pclass - One Hot Encoding
We wanted to change Pclass from one categorical feature into three binary ones.

In [9]:
# Turning "Pclass" into three separate binary columns.
Pclass = pd.get_dummies(titanic.Pclass, prefix='Pclass_')
titanic['Pclass_1'] = Pclass.iloc[:,0]
titanic['Pclass_2'] = Pclass.iloc[:,1]
titanic['Pclass_3'] = Pclass.iloc[:,2]

#### We also tried adding polynomial terms of the continuous features, but they lowered the performance of the model.

## Part 3 : Data Cleaning, Scaling and Splitting into Train & Test

##### Reordering Our Features 

In [10]:
titanic = titanic[['PassengerId',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Family_Size',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked',
 'Embarked_C',
 'Embarked_Q',
 'Embarked_S',
 'Title',
 'Title_Master',
 'Title_Miss',
 'Title_Mr',
 'Title_Mrs',
 'Title_Rare',
 'Title_Royal',
 'Pclass_1',
 'Pclass_2',
 'Pclass_3','Survived']]

##### Cleaning the Data

In [11]:
# STEP 1:
# Removing non-numeric features and the original columns that were one-hot encoded.
tbf = titanic.drop(['Name','Ticket','Cabin','Embarked','Pclass', 'Title'] , axis =1)
tbf_clean = tbf.dropna()

# STEP 2:
# Normalizing our data; we are using min/max scaling.

scaler = preprocessing.MinMaxScaler()
names = tbf_clean.columns
d = scaler.fit_transform(tbf_clean)
scaled_tbf = pd.DataFrame(d, columns=names)
scaled_tbf.head(10)

|   | PassengerId | Sex | Age | SibSp | Parch | Family_Size | Fare | Embarked_C | Embarked_Q | Embarked_S | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Title_Rare | Title_Royal | Pclass_1 | Pclass_2 | Pclass_3 | Survived |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 0 | 0.352081 | 1.0 | 0.533123 | 0.125 | 0.2 | 0.2 | 0.051237 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.366704 | 0.0 | 0.444795 | 0.0 | 0.0 | 0.0 | 0.025374 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 2 | 0.986502 | 1.0 | 0.305994 | 0.0 | 0.0 | 0.0 | 0.015412 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.32171 | 1.0 | 0.268139 | 0.0 | 0.0 | 0.0 | 0.015412 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.312711 | 0.0 | 0.432177 | 0.125 | 0.2 | 0.2 | 0.039525 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 5 | 0.975253 | 1.0 | 0.305994 | 0.0 | 0.0 | 0.0 | 0.018543 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 6 | 0.119235 | 1.0 | 0.305994 | 0.0 | 0.0 | 0.0 | 0.015176 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 7 | 0.910011 | 1.0 | 0.318612 | 0.0 | 0.0 | 0.0 | 0.015395 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 8 | 0.440945 | 0.0 | 0.280757 | 0.125 | 0.0 | 0.1 | 0.221098 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 9 | 0.92126 | 0.0 | 0.646688 | 0.125 | 0.2 | 0.2 | 0.1825 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
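
To make the scaling step concrete: min/max scaling maps each column to the range [0, 1] via `(x - min) / (max - min)`. The sketch below, on made-up values, checks that `MinMaxScaler` matches the manual formula.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-column data, standing in for a feature such as Fare.
values = np.array([[0.0], [10.0], [25.0], [100.0]])

# Library version: learns the column min and max, then rescales.
scaled = MinMaxScaler().fit_transform(values)

# Manual version of the same formula, computed per column.
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

print(scaled.ravel())
print(manual.ravel())
```

Because the scaler learns its min and max from whatever data it is fitted on, the choice of which rows to fit it on affects the scaled values.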


##### Splitting Data to Training and Test

In [12]:
X = scaled_tbf.iloc[:,0:19]
y = scaled_tbf.iloc[:,19] 
X_train, X_test,y_train, y_test = train_test_split(X,y, test_size = 0.2)
# We chose to split 80/20 into training/test
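
An 80/20 split like the one above draws a different random partition on every run unless a seed is fixed. The sketch below, on made-up data, shows the optional `random_state` and `stratify` arguments of `train_test_split`; whether to use them here is a design choice, not something the original notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features, balanced binary target.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

# random_state fixes the shuffle so the split is reproducible;
# stratify keeps the class proportions the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)
print(X_tr.shape, X_te.shape, y_te)
```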

## Part 4: Modeling 

#### Logistic Regression

In [34]:
####################################################################################
# OPTIMAL LOGISTIC REGRESSION MODEL #
####################################################################################

# Implementing cross validation
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=1)
LogReg = LogisticRegression(solver= 'lbfgs')
 
acc_score = []
 
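# Note: this loop reuses the names X_train, X_test, y_train and y_test, overwriting the
# 80/20 split made earlier; the cells below therefore train and score on the last fold.
for train_index , test_index in kf.split(X):
    X_train , X_test = X.iloc[train_index,:],X.iloc[test_index,:]
    y_train , y_test = y.iloc[train_index] , y.iloc[test_index]

    # Modeling:
    LogReg.fit(X_train,y_train)
    pred_values = LogReg.predict(X_test)
    acc = accuracy_score(y_test, pred_values)
    acc_score.append(acc)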
     
avg_acc_score = sum(acc_score)/k
 
print('accuracy of each fold - {}'.format(acc_score))
print('Avg accuracy Logistic Regression: {}'.format(avg_acc_score))

accuracy of each fold - [0.8380281690140845, 0.7887323943661971, 0.8309859154929577, 0.8309859154929577, 0.8098591549295775]
Avg accuracy Logistic Regression: 0.8197183098591548
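
The manual fold loop above can also be expressed with `cross_val_score`, which fits a fresh copy of the model on each fold and returns the per-fold scores. This sketch runs on synthetic data from `make_classification` (an assumption made so the block is self-contained), so its numbers are unrelated to the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data; the notebook uses the scaled Titanic features instead.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Same fold settings as in the notebook: 5 shuffled folds with a fixed seed.
kf = KFold(n_splits=5, shuffle=True, random_state=1)

# One accuracy value per fold; the model is cloned and refitted for each fold.
scores = cross_val_score(
    LogisticRegression(solver="lbfgs"), X_toy, y_toy, cv=kf, scoring="accuracy"
)
print("accuracy of each fold - {}".format(scores))
print("Avg accuracy: {}".format(scores.mean()))
```

Unlike the explicit loop, this version does not touch any train/test variables defined elsewhere.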


#### K Nearest Neighbors

In [40]:
####################################################################################
# OPTIMAL KNN MODEL #
####################################################################################

param = [
    {'n_neighbors': range(2, 20, 1)}
]

knn = KNeighborsClassifier()

# For this model we're using GridSearchCV to optimize the model parameters.
gs_knn = GridSearchCV(knn, param, cv = 3 , n_jobs = -1)
gs_knn.fit(X_train, y_train)

knn_best = gs_knn.best_estimator_
gs_knn.best_estimator_, gs_knn.score(X_test, y_test)

(KNeighborsClassifier(n_neighbors=19), 0.823943661971831)
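
After a grid search, it can help to look at the cross-validated score of the winning setting alongside the held-out score. The sketch below uses synthetic data (an assumption for self-containment) and the standard `best_params_` and `best_score_` attributes of `GridSearchCV`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the Titanic features.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# Same search range as in the notebook: k from 2 to 19.
param = {"n_neighbors": list(range(2, 20))}
gs = GridSearchCV(KNeighborsClassifier(), param, cv=3)
gs.fit(X_tr, y_tr)

print(gs.best_params_)        # the chosen number of neighbours
print(gs.best_score_)         # mean cross-validated accuracy of that setting
print(gs.score(X_te, y_te))   # accuracy on the held-out rows
```

If the best value sits at the edge of the search range, widening the range may be worth trying.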

#### SVM

In [45]:
####################################################################################
# OPTIMAL SVM MODEL #
####################################################################################

param = [
    {
        'kernel': ['rbf'], 'C': [0.1, 0.3, 1, 2, 3, 4], 
        'gamma': [0.3, 1, 3, 10, 12, 15, 25, 28]
    }, 
]

svc = SVC(probability = True)

# For this model we're using GridSearchCV to optimize the model parameters.
gs_svc = GridSearchCV(svc, param, cv = 5, n_jobs = -1, verbose = 1)
gs_svc.fit(X_train, y_train)
svc_best = gs_svc.best_estimator_
gs_svc.best_estimator_, gs_svc.score(X_test, y_test)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


(SVC(C=3, gamma=0.3, probability=True), 0.8661971830985915)
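
For intuition about the `gamma` values searched above: the RBF kernel is `exp(-gamma * ||x - x'||^2)`, so a larger `gamma` makes the similarity between two points fall off faster with distance. The sketch below evaluates it for two made-up points using `rbf_kernel` from scikit-learn.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two made-up points whose squared Euclidean distance is 2.
a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 1.0]])

# Kernel value for a few of the gamma values from the grid above.
kernel_values = {}
for gamma in (0.3, 3.0, 28.0):
    kernel_values[gamma] = rbf_kernel(a, b, gamma=gamma)[0, 0]
    print(gamma, kernel_values[gamma])

# The same quantity computed directly from the formula, for gamma = 0.3.
manual = np.exp(-0.3 * np.sum((a - b) ** 2))
```

With features scaled to [0, 1], distances are small, which is one reason a fairly small `gamma` can end up being selected.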

#### Model Comparison

In [47]:
# Test on the held-out data:
print("Titanic:")
print("________________________________________")
print("L.G Score:")
print(LogReg.score(X_test, y_test))
print("KNN Score:")
print(gs_knn.score(X_test, y_test))
print("SVM Score:")
print(gs_svc.score(X_test, y_test))
print("________________________________________")

Titanic:
________________________________________
L.G Score:
0.8098591549295775
KNN Score:
0.823943661971831
SVM Score:
0.8661971830985915
________________________________________


## Part 5: Final Choice
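After all of our optimization attempts, we concluded that our best model is the SVM, which scored highest on the held-out data (about 0.866 accuracy, versus 0.824 for KNN and 0.810 for logistic regression).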