## Machine Learning - Titanic

The titanic dataset is a popular dummy dataset. 

In notebook, we will explore the basic concepts of Machine Learning using the Python library [SciKit Learn](https://scikit-learn.org/stable/index.html) and the Titanic dataset. You will find many tutorials online that use this dataset and library to explore Machine Learning concepts.



In [1]:
import pandas as pd
import numpy as np 
from plotnine import *

import warnings
warnings.filterwarnings('ignore')


### Load the data

Read in the `titanic.csv` data set again.

In [2]:
# Load titanic.csv
df = pd.read_csv('titanic.csv')
df

Unnamed: 0,pclass,survived,name,age,embarked,home.dest,room,ticket,boat,gender
0,1st,1,"Allen, Miss Elisabeth Walton",29.0000,Southampton,"St Louis, MO",B-5,24160 L221,2,female
1,1st,0,"Allison, Miss Helen Loraine",2.0000,Southampton,"Montreal, PQ / Chesterville, ON",C26,,,female
2,1st,0,"Allison, Mr Hudson Joshua Creighton",30.0000,Southampton,"Montreal, PQ / Chesterville, ON",C26,,-135,male
3,1st,0,"Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)",25.0000,Southampton,"Montreal, PQ / Chesterville, ON",C26,,,female
4,1st,1,"Allison, Master Hudson Trevor",0.9167,Southampton,"Montreal, PQ / Chesterville, ON",C22,,11,male
...,...,...,...,...,...,...,...,...,...,...
1308,3rd,0,"Zakarian, Mr Artun",,,,,,,male
1309,3rd,0,"Zakarian, Mr Maprieder",,,,,,,male
1310,3rd,0,"Zenn, Mr Philip",,,,,,,male
1311,3rd,0,"Zievens, Rene",,,,,,,female


The first thing we need to do is code the pclass and gender variables numerically. Let's use the following scheme:
- pclass: 1,2,3
- gender: 0=male, 1=female, and let's call the column called "female" to remind us which is which

In [3]:
# recode the pclass and gender variables so they are numeric
df['pclass'] = df.pclass.replace({'1st': 1, '2nd': 2, '3rd': 3})
df['female'] = df.gender.replace({'male': 0, 'female': 1})
df.head(3)

Unnamed: 0,pclass,survived,name,age,embarked,home.dest,room,ticket,boat,gender,female
0,1,1,"Allen, Miss Elisabeth Walton",29.0,Southampton,"St Louis, MO",B-5,24160 L221,2.0,female,1
1,1,0,"Allison, Miss Helen Loraine",2.0,Southampton,"Montreal, PQ / Chesterville, ON",C26,,,female,1
2,1,0,"Allison, Mr Hudson Joshua Creighton",30.0,Southampton,"Montreal, PQ / Chesterville, ON",C26,,-135.0,male,0


## 1. Logistic Regression with Scikit-Learn

Let's look at the documentation and use various functions from there!
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Also helpful if the sklearn documentation seems overwhelming, check out investigate.ai
- https://investigate.ai/classification/intro-to-classification/#Logistic-Classifier


In [88]:
# Import the classifier from scikit-learn
from sklearn.linear_model import LogisticRegression

# Create a new classifier (in this case it is just a logisitic regression)
clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)

In [5]:
# randomly shuffle the dataframe (will explain this later)
df = df.sample(len(df), random_state=1) 

# Fit the data to the model
X = df[['pclass', 'female']]
y = df['survived']

clf.fit(X, y)

LogisticRegression(C=1000000000.0, max_iter=4000)

In [6]:
# predictions of who survived
clf.predict(X)

array([0, 1, 1, ..., 0, 0, 0])

In [7]:
# predictions as probabilities
clf.predict_proba(X)

array([[0.60260982, 0.39739018],
       [0.11793495, 0.88206505],
       [0.11793495, 0.88206505],
       ...,
       [0.92350585, 0.07649415],
       [0.60260982, 0.39739018],
       [0.92350585, 0.07649415]])

In [8]:
# probabilities of what? (survival)
# this helps interpret the results above
clf.classes_

array([0, 1])

In [9]:
# coefficients (logs of odds ratios)
clf.coef_

array([[-1.03730579,  2.42848395]])

In [10]:
# coefficients of what?
# ...coefficients of features
clf.feature_names_in_

array(['pclass', 'female'], dtype=object)

In [11]:
# accruacy...
# how is this calculated?
clf.score(X,y)

0.814927646610815

### 2. Metrics for what makes a good model

In [12]:
from sklearn.metrics import confusion_matrix, recall_score, precision_score, \
                            accuracy_score, f1_score

In [13]:
# calculate from below
matrix = confusion_matrix(y, clf.predict(X))
pd.DataFrame(matrix, 
             columns = ['pred_died', 'pred_survived'], 
             index=['died', 'survived'])\
    .assign(Total=lambda x: x.sum(axis=1))\
    .sort_values(by='Total')[['pred_survived', 'pred_died', 'Total']]\
    .T.assign(Total = lambda x: x.sum(axis=1)).T

# from sklearn.metrics import ConfusionMatrixDisplay
# ConfusionMatrixDisplay(matrix).plot(colorbar=False, cmap='binary')

Unnamed: 0,pred_survived,pred_died,Total
survived,228,221,449
died,22,842,864
Total,250,1063,1313


In [14]:
# (tp + fn) / (tp + fp + tn + fn)
# (228 + 842) / (228 + 22 + 842 + 221)
# (228 + 842) / 1313
accuracy_score(clf.predict(X), y)

0.814927646610815

In [15]:
# (tp) / (tp + fp)
# 228 / (228 + 22)
# 228 / 250
precision_score(y, clf.predict(X))

0.912

In [16]:
# tp / (tp + fn)
# 228 / (228 + 221)
# 228 / 449
recall_score(y, clf.predict(X))

0.5077951002227171

In [17]:
# 2 * (prescision * recall) / (precision + recall)
f1_score(clf.predict(X), y)

0.6523605150214592

In [18]:
import numpy as np
from sklearn.model_selection import train_test_split

X = df[['pclass', 'female']]
y = df['survived']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2)

# define model
clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000) 

# fit model on training data
clf.fit(X_train, y_train)

# scores
print("accuracy_score", accuracy_score(y_test, clf.predict(X_test)).round(2))
print("precision_score", precision_score(y_test, clf.predict(X_test)).round(2))
print("recall_score", recall_score(y_test, clf.predict(X_test)).round(2))
print("f1_score", f1_score(y_test, clf.predict(X_test)).round(2))


accuracy_score 0.75
precision_score 0.63
recall_score 0.65
f1_score 0.64


### 4. Cross Validation

In [19]:
from sklearn.model_selection import cross_val_score

In [20]:
scores = cross_val_score(clf, X, y, cv=10)

# Cross validation on accuracy scores
scores

array([0.87878788, 0.84090909, 0.86363636, 0.80916031, 0.77862595,
       0.81679389, 0.81679389, 0.75572519, 0.81679389, 0.77099237])

In [21]:
print(f"{scores.mean().round(2)} accuracy with a standard deviation of {scores.std().round(2)}")

0.81 accuracy with a standard deviation of 0.04


#### Cross Validation with other scores

In [22]:
from sklearn.model_selection import cross_validate

In [23]:
# 5-fold cross-validation
scoring = ['accuracy', 'precision', 'recall', 'f1']
scores = cross_validate(clf, X, y, scoring=scoring, cv=10)
pd.DataFrame(scores).round(2)

Unnamed: 0,fit_time,score_time,test_accuracy,test_precision,test_recall,test_f1
0,0.0,0.0,0.88,0.94,0.69,0.79
1,0.0,0.0,0.84,0.9,0.6,0.72
2,0.0,0.0,0.86,0.97,0.62,0.76
3,0.0,0.0,0.81,0.88,0.5,0.64
4,0.0,0.0,0.78,0.94,0.38,0.54
5,0.0,0.0,0.82,0.92,0.51,0.66
6,0.0,0.0,0.82,0.92,0.51,0.66
7,0.0,0.0,0.76,0.81,0.38,0.52
8,0.0,0.0,0.82,0.92,0.51,0.66
9,0.0,0.0,0.77,0.89,0.38,0.53


In [24]:
# 5-fold cross-validation summary
pd.DataFrame(scores).describe().round(2)[1:3]

Unnamed: 0,fit_time,score_time,test_accuracy,test_precision,test_recall,test_f1
mean,0.0,0.0,0.81,0.91,0.51,0.65
std,0.0,0.0,0.04,0.04,0.11,0.1


## Comparing models (Part I)

In [25]:
print("Remember, this is X")
display(X.head())
print("And this is y")
display(y)

Remember, this is X


Unnamed: 0,pclass,female
201,1,0
115,1,1
255,1,1
1040,3,1
195,1,0


And this is y


201     0
115     1
255     1
1040    1
195     0
       ..
715     0
905     0
1096    0
235     0
1061    0
Name: survived, Length: 1313, dtype: int64

In [26]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)
clf.fit(X,y)
scores = cross_validate(clf, X, y, scoring=scoring, cv=10)
pd.DataFrame(scores).describe().round(2)[1:3]

Unnamed: 0,fit_time,score_time,test_accuracy,test_precision,test_recall,test_f1
mean,0.0,0.0,0.81,0.91,0.51,0.65
std,0.0,0.0,0.04,0.04,0.11,0.1


In [27]:
# Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X,y)
scores = cross_validate(clf, X, y, scoring=scoring, cv=10)
pd.DataFrame(scores).describe().round(2)[1:3]

Unnamed: 0,fit_time,score_time,test_accuracy,test_precision,test_recall,test_f1
mean,0.0,0.0,0.79,0.8,0.57,0.65
std,0.0,0.0,0.05,0.16,0.12,0.07


In [28]:
# Multi-layer perceptron (a type of Neural Network ¯\_(ツ)_/¯)
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier()
clf.fit(X,y)
scores = cross_validate(clf, X, y, scoring=scoring, cv=10)
pd.DataFrame(scores).describe().round(2)[1:3]

Unnamed: 0,fit_time,score_time,test_accuracy,test_precision,test_recall,test_f1
mean,0.22,0.0,0.81,0.91,0.51,0.65
std,0.02,0.0,0.04,0.04,0.11,0.1


## Comparing models (part II) 🤖

Here i switch back to StatsModels for a quick second. 

Let's compare the various logistic regressions we ran using our new metric.
- survived ~ pclass + female
- survived ~ pclass + female + age
- survived ~ C(pclass) + female + age
- survived ~ C(pclass) + female + np.log(age)
- survived ~ C(pclass) + female + age<18


In [127]:
from patsy import dmatrices

def get_logistic_regression_predictions(regression_str, df):
    """
    use Sklearn (for cross validation) and StatsModels for patsy (y~X1+X2+X3) syntax
    """
    y, X = dmatrices(regression_str, df, return_type = 'dataframe')
    model = LogisticRegression(fit_intercept = False)
    mdl = model.fit(X, y)
    scores = cross_validate(clf, X, y, scoring=scoring, cv=5)
    logit = sm.Logit(y, X)
    print(regression_str)
    print("prsquared=", logit.fit(disp=0).prsquared.round(2))
    return pd.DataFrame(scores).describe()[1:3].round(2)[['test_accuracy','test_precision','test_recall','test_f1']]



In [141]:
get_logistic_regression_predictions("survived ~ female", df)

survived ~ female
prsquared= 0.19


Unnamed: 0,test_accuracy,test_precision,test_recall,test_f1
mean,0.77,0.66,0.68,0.67
std,0.03,0.04,0.07,0.05


In [140]:
get_logistic_regression_predictions("survived ~ pclass", df)

survived ~ pclass
prsquared= 0.1


Unnamed: 0,test_accuracy,test_precision,test_recall,test_f1
mean,0.71,0.6,0.43,0.5
std,0.02,0.04,0.04,0.03


In [136]:
get_logistic_regression_predictions("survived ~ female + pclass", df)

survived ~ female + pclass
prsquared= 0.29


Unnamed: 0,test_accuracy,test_precision,test_recall,test_f1
mean,0.8,0.86,0.53,0.64
std,0.02,0.12,0.14,0.06


In [137]:
get_logistic_regression_predictions("survived ~ female + C(pclass) + np.log(age)", df)

survived ~ female + C(pclass) + np.log(age)
prsquared= 0.32


Unnamed: 0,test_accuracy,test_precision,test_recall,test_f1
mean,0.82,0.86,0.58,0.69
std,0.03,0.06,0.08,0.06


In [138]:
get_logistic_regression_predictions("survived ~ C(pclass) + female + age<18", df)

survived ~ C(pclass) + female + age<18
prsquared= 0.31


Unnamed: 0,test_accuracy,test_precision,test_recall,test_f1
mean,0.83,0.89,0.57,0.69
std,0.03,0.04,0.08,0.06
