

# Machine Learning: Logistic Regression for the Titanic Dataset

Let's go through the machine learning workflow using a familiar dataset.



In [69]:
import pandas as pd
import numpy as np

## 1. Define (Business) Goal

A goal should be measurable.


**Titanic:**
> Predict who survived and who died<br>
> Arbitrary target: we want a model accuracy higher than 0.77

**Accuracy:** The ratio of correct predictions to all cases, i.e. the share of correctly classified cases.

**Loss:** A measure of the difference between the true label `y` and the prediction `y_hat`.
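
To make the two definitions concrete, here is a minimal sketch on made-up numbers (the arrays below are illustrative only, not Titanic data). It assumes log loss as the loss function, which is one common choice for logistic regression; the notes above do not fix a specific loss.

```python
import numpy as np

# Made-up example values, for illustration only
y = np.array([1, 0, 1, 1, 0])                 # true labels
y_hat = np.array([1, 0, 0, 1, 0])             # hard 0/1 predictions
p_hat = np.array([0.9, 0.2, 0.4, 0.7, 0.1])   # predicted probability of class 1

# Accuracy: share of predictions that match the true label (here 4 of 5)
accuracy = (y == y_hat).mean()

# Log loss (assumed choice of loss): penalises confident wrong probabilities
log_loss = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

print(f'accuracy: {accuracy}, log loss: {log_loss:.3f}')
```

Accuracy only looks at the final 0/1 decision, while the loss also reacts to how far the predicted probability is from the true label.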


## 2. Get Data

For the penguins data and for the Titanic data, we just have to load a .csv file.


Potential data sources:
- Databases
- Create your own data (simulation) / run a survey
- Sensors / Devices that measure data
- Web scraping/API (Application Programming Interface)

In [70]:
df_train = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')
df_train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [71]:
df_train.shape

(891, 12)

In [72]:
df_test.shape

(418, 11)

## 3. Train-Validation-Test-Split

- We want to split our data set into training, validation and test data:
    - Train data: the samples of data used to fit the model.
    - Validation data: the samples of data used to evaluate the model while fine-tuning the model hyperparameters.
    - Test data: the samples of data used for the final evaluation of the model.
- The model should not see the test data until the end, when we use it to evaluate the performance of the model.

What is the purpose of splitting the data? We want to be able to detect whether our model is overfitting.
Splitting the data does not prevent overfitting, but it helps to detect it.
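
A three-way split like the one described above can be sketched with two calls to `train_test_split`. This is only an illustration on synthetic arrays; the 60/20/20 proportions and the names `X_tr`, `X_va`, `X_te` are assumptions, not part of this notebook (where the Titanic test set already comes as a separate file).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 2 features, balanced binary label
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# First split off the test set (20 % of all samples) ...
X_rest, X_te, y_rest, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# ... then split the remainder into train and validation (25 % of 80 = 20 samples)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(X_tr.shape, X_va.shape, X_te.shape)
```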

**Overfitting:**

The algorithm is, to some extent, memorizing the correct answers for the training data. This means that it will not work well on data it has not been trained on. The model does not **generalize** well.
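
The following sketch shows how a held-out set exposes memorization. It uses synthetic data with purely random labels and a hand-written 1-nearest-neighbour predictor, chosen here only because it memorizes the training set by construction; neither is part of the Titanic workflow.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)   # random labels: there is nothing to learn

# Simple hold-out split: first 160 rows for training, last 40 for validation
X_tr, y_tr = X[:160], y[:160]
X_va, y_va = X[160:], y[160:]

def predict_1nn(X_ref, y_ref, X_new):
    """Return the label of the closest reference point for each new point."""
    dist = ((X_new[:, None, :] - X_ref[None, :, :]) ** 2).sum(axis=2)
    return y_ref[dist.argmin(axis=1)]

train_acc = (predict_1nn(X_tr, y_tr, X_tr) == y_tr).mean()
val_acc = (predict_1nn(X_tr, y_tr, X_va) == y_va).mean()

print(f'train accuracy: {train_acc:.2f}, validation accuracy: {val_acc:.2f}')
```

A perfect training score combined with a much lower validation score is the typical signature of overfitting.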

### 3.1 Separate features and label (target)
- `X` := the array of features used to predict. It's a multidimensional array (or a matrix, or a DataFrame in pandas).
- `y` := the array of labels to be predicted. It's an array with a single dimension (or a vector, or a Series in pandas).

In [73]:
df_train.sample(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
690,691,1,1,"Dick, Mr. Albert Adrian",male,31.0,1,0,17474,57.0,B20,S
470,471,0,3,"Keefe, Mr. Arthur",male,,0,0,323592,7.25,,S
727,728,1,3,"Mannion, Miss. Margareth",female,,0,0,36866,7.7375,,Q
117,118,0,2,"Turpin, Mr. William John Robert",male,29.0,1,0,11668,21.0,,S
167,168,0,3,"Skoog, Mrs. William (Anna Bernhardina Karlsson)",female,45.0,1,4,347088,27.9,,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S
373,374,0,1,"Ringhini, Mr. Sante",male,22.0,0,0,PC 17760,135.6333,,C
183,184,1,2,"Becker, Master. Richard F",male,1.0,2,1,230136,39.0,F4,S
704,705,0,3,"Hansen, Mr. Henrik Juul",male,26.0,1,0,350025,7.8542,,S
195,196,1,1,"Lurette, Miss. Elise",female,58.0,0,0,PC 17569,146.5208,B80,C


In [74]:
X_train = df_train.loc[:, df_train.columns != 'Survived']
y_train = df_train['Survived'].to_frame()

In the Titanic dataset, `y` is the column **Survived**.
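
As a small sketch of the same separation on a toy frame (the three rows and the names `toy`, `X_toy`, `y_toy` are made up for illustration), `drop(columns=...)` is an alternative way to select every column except the label:

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    'Pclass': [3, 1, 2],
    'Sex': ['male', 'female', 'female'],
    'Survived': [0, 1, 1],
})

X_toy = toy.drop(columns='Survived')   # features: 2-D DataFrame
y_toy = toy['Survived']                # label: 1-D Series

print(X_toy.shape, y_toy.shape)
```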

### 3.2 Train-Validation split

In [75]:
#!conda install -c conda-forge scikit-learn
#!pip install scikit-learn

In [76]:
from sklearn.model_selection import train_test_split

In [77]:
# split the labelled data into a training and a validation set,
# stratified so that both keep the same share of survivors
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
                                   random_state=104,
                                   stratify=y_train,
                                   test_size=0.2,
                                   shuffle=True)


In [78]:
# Always check the shape of your train and validation arrays to verify that the split worked as intended
X_train.shape, X_val.shape, y_train.shape, y_val.shape

((712, 11), (179, 11), (712, 1), (179, 1))

In [79]:
type(y_train)

pandas.core.frame.DataFrame

### Logistic Regression in scikit-learn

In [80]:
from sklearn.linear_model import LogisticRegression

In [81]:
# input feature: [[...]] selects a DataFrame (2-D, a matrix)
X_train_pclass = X_train[['Pclass']]
X_val_pclass = X_val[['Pclass']]

# target (dependent) variable: [...] selects a Series (1-D, an array)
y_train = y_train['Survived']
y_val = y_val['Survived']

In [82]:
# check the shape
X_train_pclass.shape, y_train.shape, X_val_pclass.shape, y_val.shape

((712, 1), (712,), (179, 1), (179,))

In [83]:
# instantiate (build) the model
m_lgr = LogisticRegression()

# fit (train) the model on the training data
m_lgr.fit(X_train_pclass, y_train)

In [84]:
# Get the intercept w_0 and the coefficient w_1 of the feature Pclass
w_1 = m_lgr.coef_
w_0 = m_lgr.intercept_

print(f'Model feature coefficient :{w_1}\nModel intercept/bias: {w_0}')

Model feature coefficient :[[-0.81980665]]
Model intercept/bias: [1.38281499]
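
With a single feature, logistic regression models the survival probability as `1 / (1 + exp(-(w_0 + w_1 * Pclass)))`. The sketch below plugs the coefficient and intercept printed above into that formula for the three passenger classes; the values are copied by hand from the output, so treat the result as approximate.

```python
import numpy as np

# Values copied from the printed model output above
w_0 = 1.38281499    # intercept
w_1 = -0.81980665   # coefficient for Pclass

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

pclass = np.array([1, 2, 3])
p_survived = sigmoid(w_0 + w_1 * pclass)

for c, p in zip(pclass, p_survived):
    print(f'Pclass {c}: P(Survived=1) = {p:.3f}')
```

Because `w_1` is negative, the estimated survival probability drops as the class number increases, and only first class ends up above 0.5.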


In [85]:
# classes
m_lgr.classes_

array([0, 1])

In [86]:
# Get the estimated probabilities
estim_prob = m_lgr.predict_proba(X_val_pclass)
estim_prob.round(3)

array([[0.564, 0.436],
       [0.564, 0.436],
       [0.564, 0.436],
       [0.746, 0.254],
       [0.746, 0.254],
       [0.564, 0.436],
       [0.564, 0.436],
       [0.363, 0.637],
       [0.746, 0.254],
       [0.363, 0.637],
       [0.746, 0.254],
       [0.746, 0.254],
       [0.564, 0.436],
       [0.363, 0.637],
       [0.746, 0.254],
       [0.746, 0.254],
       [0.746, 0.254],
       [0.564, 0.436],
       [0.746, 0.254],
       [0.363, 0.637],
       [0.363, 0.637],
       [0.746, 0.254],
       [0.363, 0.637],
       [0.746, 0.254],
       [0.746, 0.254],
       [0.746, 0.254],
       [0.746, 0.254],
       [0.746, 0.254],
       [0.363, 0.637],
       [0.746, 0.254],
       [0.564, 0.436],
       [0.363, 0.637],
       [0.746, 0.254],
       [0.746, 0.254],
       [0.363, 0.637],
       [0.746, 0.254],
       [0.363, 0.637],
       [0.564, 0.436],
       [0.363, 0.637],
       [0.746, 0.254],
       [0.746, 0.254],
       [0.746, 0.254],
       [0.746, 0.254],
       ...])

In [104]:
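# Class balance of the validation labels: this is the baseline for judging accuracy.
# Always predicting 0 (did not survive) would already be correct in about 61 % of the cases.
y_val.value_counts(normalize=True)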

0    0.614525
1    0.385475
Name: Survived, dtype: float64
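
The class shares above are the accuracy of a "model" that always predicts the majority class. A minimal sketch, assuming label counts of 110 and 69 (these counts are inferred from the 179 validation rows and the proportions shown, not read from the data):

```python
import pandas as pd

# Assumed counts: 110 non-survivors and 69 survivors in the validation set
y_assumed = pd.Series([0] * 110 + [1] * 69, name='Survived')

majority_class = y_assumed.mode()[0]                 # most frequent label
baseline_acc = (y_assumed == majority_class).mean()  # accuracy of always predicting it

print(majority_class, round(baseline_acc, 6))
```

Any model accuracy should be compared against this baseline rather than against 0.5.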

In [87]:
# Let's transform estim_prob into a DataFrame with the class labels as column headers
estim_prob_df = pd.DataFrame(data=estim_prob, columns=m_lgr.classes_)
estim_prob_df

Unnamed: 0,0,1
0,0.563849,0.436151
1,0.563849,0.436151
2,0.563849,0.436151
3,0.745851,0.254149
4,0.745851,0.254149
...,...,...
174,0.745851,0.254149
175,0.563849,0.436151
176,0.362852,0.637148
177,0.745851,0.254149


In [88]:
threshold = 0.5
pred = []
for item in estim_prob_df[1]:
    if item >= threshold:
        pred.append(1)
    else:
        pred.append(0)
        
estim_prob_df['prediction'] = pred    
estim_prob_df


Unnamed: 0,0,1,prediction
0,0.563849,0.436151,0
1,0.563849,0.436151,0
2,0.563849,0.436151,0
3,0.745851,0.254149,0
4,0.745851,0.254149,0
...,...,...,...
174,0.745851,0.254149,0
175,0.563849,0.436151,0
176,0.362852,0.637148,1
177,0.745851,0.254149,0
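
The loop above can also be written as a single vectorized comparison. A minimal sketch on three rows copied from the table (the frame name `proba` is made up for this example):

```python
import pandas as pd

# Three example rows taken from the probability table above
proba = pd.DataFrame({0: [0.563849, 0.745851, 0.362852],
                      1: [0.436151, 0.254149, 0.637148]})

threshold = 0.5
# Compare the whole column at once, then turn True/False into 1/0
proba['prediction'] = (proba[1] >= threshold).astype(int)

print(proba)
```

This gives the same predictions as the explicit loop and avoids building the list by hand.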


In [94]:
# convert the validation labels to a DataFrame (the original row index is kept)
y_val_n = y_val.to_frame()
y_val_n

Unnamed: 0,Survived
134,0
183,1
608,1
85,1
287,0
...,...
593,0
149,0
194,1
836,0


In [95]:
# accuracy at threshold 0.5: reset the index so that labels and predictions align row by row
((y_val_n.reset_index(drop=True)['Survived'] == estim_prob_df["prediction"]).sum())/y_val.shape[0]

0.6815642458100558
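
The `reset_index(drop=True)` in the cell above is needed because the validation labels keep their original row index, while the predictions are numbered from 0. A sketch with made-up labels and indices shows an alternative that sidesteps the index by comparing the underlying NumPy arrays:

```python
import pandas as pd

# Made-up labels with a shuffled index, as it looks after a train/validation split
y_true = pd.Series([0, 1, 1, 0], index=[10, 42, 7, 99], name='Survived')
# Made-up predictions with a fresh 0..n-1 index
y_pred = pd.Series([0, 0, 1, 0], name='prediction')

# Compare position by position, ignoring both indexes
accuracy = (y_true.to_numpy() == y_pred.to_numpy()).mean()

print(accuracy)
```

`sklearn.metrics.accuracy_score(y_true, y_pred)` computes the same ratio and also compares position by position.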

In [96]:
threshold = 0.9
pred = []
for item in estim_prob_df[1]:
    if item >= threshold:
        pred.append(1)
    else:
        pred.append(0)
        
estim_prob_df['prediction'] = pred    
estim_prob_df


Unnamed: 0,0,1,prediction
0,0.563849,0.436151,0
1,0.563849,0.436151,0
2,0.563849,0.436151,0
3,0.745851,0.254149,0
4,0.745851,0.254149,0
...,...,...,...
174,0.745851,0.254149,0
175,0.563849,0.436151,0
176,0.362852,0.637148,0
177,0.745851,0.254149,0


In [97]:
estim_prob_df['prediction'].value_counts()  # with 0.9 threshold all the predictions are 0 now

0    179
Name: prediction, dtype: int64

In [98]:
y_val_n = y_val.to_frame()

In [99]:
# accuracy at threshold 0.9
((y_val_n.reset_index(drop=True)['Survived'] == estim_prob_df["prediction"]).sum())/y_val.shape[0]

0.6145251396648045

In [100]:
threshold = 0.1
pred = []
for item in estim_prob_df[1]:
    if item >= threshold:
        pred.append(1)
    else:
        pred.append(0)
        
estim_prob_df['prediction'] = pred    
estim_prob_df


Unnamed: 0,0,1,prediction
0,0.563849,0.436151,1
1,0.563849,0.436151,1
2,0.563849,0.436151,1
3,0.745851,0.254149,1
4,0.745851,0.254149,1
...,...,...,...
174,0.745851,0.254149,1
175,0.563849,0.436151,1
176,0.362852,0.637148,1
177,0.745851,0.254149,1


In [101]:
y_val_n = y_val.to_frame()

In [102]:
# with a 0.1 threshold every passenger is predicted to survive
estim_prob_df["prediction"].sum()

179

In [103]:
# accuracy at threshold 0.1
((y_val_n.reset_index(drop=True)['Survived'] == estim_prob_df["prediction"]).sum())/y_val.shape[0]

0.3854748603351955
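
The three experiments above repeat the same steps with a different threshold each time. They can be folded into one small helper, sketched here on made-up labels and probabilities (the function name `accuracy_at_threshold` and the eight example rows are hypothetical):

```python
import numpy as np

def accuracy_at_threshold(y_true, p_pos, threshold):
    """Accuracy when class 1 is predicted for probabilities >= threshold."""
    y_pred = (p_pos >= threshold).astype(int)
    return (y_pred == y_true).mean()

# Made-up validation labels and estimated probabilities of class 1
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0])
p_pos = np.array([0.254, 0.254, 0.436, 0.436, 0.637, 0.637, 0.254, 0.637])

results = {t: accuracy_at_threshold(y_true, p_pos, t) for t in (0.1, 0.5, 0.9)}
for t, acc in results.items():
    print(f'threshold {t}: accuracy {acc}')
```

At a very low threshold everything is classified as 1, so the accuracy equals the share of class 1; at a very high threshold everything is classified as 0, so it equals the share of class 0. This is the same pattern as in the results above.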