<a href="https://colab.research.google.com/github/GuillaumeArp/Wild_Notebooks/blob/main/Quest_Logistic_regression_Titanic_Guillaume_Arp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Execute the code below
You will get a passenger list of the Titanic.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
link = "https://raw.githubusercontent.com/murpi/wilddata/master/quests/titanic.csv"
df_titanic = pd.read_csv(link)
df_titanic['Survived'] = df_titanic['Survived'].apply(lambda x: "Survived" if x == 1 else "Dead")
df_titanic.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,Dead,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,Survived,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,Survived,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,Survived,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,Dead,3,Mr. William Henry Allen,male,35.0,0,0,8.05


# Data preparation

What is the type of each column? Are there non-numeric columns?

In [None]:
# What is the type of each column?
df_titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Survived                 887 non-null    object 
 1   Pclass                   887 non-null    int64  
 2   Name                     887 non-null    object 
 3   Sex                      887 non-null    object 
 4   Age                      887 non-null    float64
 5   Siblings/Spouses Aboard  887 non-null    int64  
 6   Parents/Children Aboard  887 non-null    int64  
 7   Fare                     887 non-null    float64
dtypes: float64(2), int64(3), object(3)
memory usage: 55.6+ KB


There are 3 non-numeric columns: Survived, Name and Sex.
Survived was numeric before being converted to a string right after the import, so we'll start by reverting that.
Name is irrelevant for the regression, and Sex will be factorized.

In [None]:
# Import the dataset again without changing the Survived column

df_titanic = pd.read_csv(link)
df_titanic['Survived'].value_counts()


0    545
1    342
Name: Survived, dtype: int64

In [None]:
# Factorize the Sex column (male is 0, female is 1)

df_titanic['Sex_nb'] = df_titanic['Sex'].factorize()[0]
df_titanic.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Sex_nb
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25,0
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833,1
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925,1
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1,1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05,0
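The `male is 0, female is 1` mapping above follows from how `factorize()` numbers values in order of first appearance. A minimal sketch on a toy Series (the five values mirror the first rows shown above; the names `sex`, `mapping` and `explicit` are illustrative only):

```python
import pandas as pd

# Toy column mirroring the first five rows of the passenger list
sex = pd.Series(["male", "female", "female", "female", "male"])

# factorize() returns the integer codes and the unique values,
# numbered in order of first appearance
codes, uniques = sex.factorize()
mapping = dict(enumerate(uniques))

print(codes)    # integer code for each row
print(mapping)  # code -> original label

# An explicit mapping does not depend on which value appears first
explicit = sex.map({"male": 0, "female": 1})
print(explicit.tolist())
```

Because the numbering depends on row order, an explicit `map()` may be the safer choice if the data could be reordered.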


Then, make a first exploration of the dataset (pair plots, correlation heatmaps, etc.).

In [None]:
# Pairplot using numeric columns only:

fig = px.scatter_matrix(df_titanic.iloc[:, [0, 1, 4, 5, 6, 7, 8]])

fig.update_layout(width=1800, height=1800, title='Dataset Pairplot')
fig.show()


As expected, the pairplot doesn't give much insight: no obvious correlation stands out. The correlation heatmap should be more informative.

In [None]:
# Correlation heatmap:

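# numeric_only=True skips the text columns (Name, Sex), which recent pandas versions no longer drop silently
corr = df_titanic.corr(numeric_only=True)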

fig = go.Figure()

fig.add_trace(go.Heatmap(
    z = corr,
    x = corr.columns.values,
    y = corr.columns.values,
    colorscale=px.colors.diverging.RdBu,
    zmid=0
))

fig.update_layout(width=1000, height=750, title='Correlation Heatmap')
fig.show()

We can see a clear positive correlation between `Sex_nb` and survival: since female is encoded as 1, women were more likely to survive. Passenger class correlates negatively with survival (a higher class number, i.e. a cheaper ticket, goes with lower survival), while fare correlates positively. Finally, having parents or children aboard also shows a weaker correlation with survival.
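To rank features by their correlation with the target instead of reading values off a heatmap, one option is to take the `Survived` column of the correlation matrix and sort it. A minimal sketch on made-up toy values (not the real dataset; `toy` and `corr_with_target` are illustrative names):

```python
import pandas as pd

# Toy data (made-up values) with one text column that must be excluded
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass":   [3, 1, 3, 1, 3, 3],
    "Sex_nb":   [0, 1, 1, 1, 0, 0],
    "Name":     ["a", "b", "c", "d", "e", "f"],
})

# Keep numeric columns only, then read the target's column of the matrix
corr_with_target = (
    toy.select_dtypes("number")
       .corr()["Survived"]
       .drop("Survived")
       .sort_values()
)
print(corr_with_target)
```

Sorting puts the most negative correlations first and the most positive last, which makes the strongest relationships easy to spot.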

You are looking for Jack. How many people named Jack were on board?

In [None]:
# How many people named Jack on board?

df_jack = df_titanic[df_titanic['Name'].str.contains('jack', case=False)]
df_jack

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Sex_nb
762,0,1,Dr. Arthur Jackson Brewe,male,46.0,0,0,39.6,0


Oddly enough, nobody actually named Jack was aboard (so much for the love story): the only match is a passenger whose middle name is Jackson. It may be worth noting that Jack is traditionally a diminutive of John (for reasons going back to the Middle Ages), so people going by Jack would likely have purchased their ticket under their given name, John. Let's have a look.

In [None]:
df_john = df_titanic[(df_titanic['Name'].str.contains('John ', case=False)) & (df_titanic['Sex'] == 'male')]
df_john

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Sex_nb
44,0,3,Mr. William John Rogers,male,30.0,0,0,8.05,0
111,0,3,Mr. David John Barton,male,22.0,0,0,8.05,0
116,0,2,Mr. William John Robert Turpin,male,29.0,1,0,21.0,0
159,0,3,Mr. John Hatfield Cribb,male,44.0,0,1,16.1,0
161,0,3,Mr. John Viktor Bengtsson,male,26.0,0,0,7.775,0
164,1,3,Master. Frank John William Goldsmith,male,9.0,0,2,20.525,0
167,0,1,Mr. John D Baumann,male,60.0,0,0,25.925,0
187,0,3,Mr. John Bourke,male,40.0,1,1,15.5,0
211,0,3,Mr. John Henry Perkin,male,22.0,0,0,7.25,0
225,1,2,Mr. William John Mellors,male,19.0,0,0,10.5,0


There are more Johns, which is consistent with the idea that people usually called Jack would have purchased their ticket under their given name. Still no Jack (or John) Dawson anyway 😞
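The substring search for `jack` also matched "Jackson". One way to match the name only as a whole word is a regular expression with word boundaries. A minimal sketch on a toy Series (the third name is hypothetical and not in the dataset):

```python
import pandas as pd

names = pd.Series([
    "Dr. Arthur Jackson Brewe",   # from the passenger list
    "Mr. John Bourke",            # from the passenger list
    "Mr. Jack Example",           # hypothetical, for illustration only
])

# \b marks a word boundary, so "Jackson" no longer matches
whole_word_jack = names.str.contains(r"\bjack\b", case=False, regex=True)
print(whole_word_jack.tolist())
```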

Now that we have done a preliminary analysis, let's change the Survived column to a string type again.

In [None]:
df_titanic['Survived'] = df_titanic['Survived'].apply(lambda x: "Survived" if x == 1 else "Dead")

# Logistic regression

Today, in this quest, you have an extraordinary power: you can travel in time to try to save some passengers. 
You obviously wanted to save Jack. But you didn't find his name on the list. He probably travels under a false name...
Too bad. Still, thanks to this trip, you can try to save as many people as possible. To do this, you have to identify the people who are most likely to die.

- Select features (X) with only numeric values, and without "Survived" column
- Select "Survived" column as target (y)
- Split your data with **random_state = 36**
- Train a logistic regression
- Print the accuracy score on the train set and on the test set. Is there overfitting?
- Print the Confusion Matrix on the test set
- How many iterations were needed to train this model?

In [None]:
# It's up to you:

X = df_titanic[['Pclass', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare', 'Sex_nb']]
y = df_titanic['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=36, train_size=0.75)

model = LogisticRegression().fit(X_train, y_train)

print(f"Accuracy score on the train dataset: {model.score(X_train, y_train)}")
print(f"Accuracy score on the test dataset: {model.score(X_test, y_test)}")

Accuracy score on the train dataset: 0.8165413533834587
Accuracy score on the test dataset: 0.7882882882882883


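The score is reasonably good (about 0.82 on the train set and 0.79 on the test set), and the small gap between the two suggests no significant overfitting. Let's check the confusion matrix.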

In [None]:
pd.DataFrame(data = confusion_matrix(y_true = y_test, y_pred = model.predict(X_test)),
             index = model.classes_ + " ACTUAL",
             columns = model.classes_ + " PREDICTED")

Unnamed: 0,Dead PREDICTED,Survived PREDICTED
Dead ACTUAL,110,19
Survived ACTUAL,28,65


In [None]:
print(classification_report(y_test, model.predict(X_test)))

              precision    recall  f1-score   support

        Dead       0.80      0.85      0.82       129
    Survived       0.77      0.70      0.73        93

    accuracy                           0.79       222
   macro avg       0.79      0.78      0.78       222
weighted avg       0.79      0.79      0.79       222
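The figures in the classification report can be recomputed by hand from the confusion matrix shown earlier (110, 19, 28, 65). A short worked sketch, with rows as actual classes and columns as predicted classes, in the order Dead, Survived:

```python
import numpy as np

# Rows: actual (Dead, Survived); columns: predicted (Dead, Survived)
cm = np.array([[110, 19],
               [28, 65]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # per actual class (row totals)
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class (column totals)

print(round(accuracy, 4))
print(np.round(recall, 2))
print(np.round(precision, 2))
```

Recall for `Dead` is the share of passengers who actually died that the model flagged, which is the quantity that matters most for the rescue scenario.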



In [None]:
# Number of iterations to train the model:

model.n_iter_

array([54], dtype=int32)
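`n_iter_` is most useful when compared with `max_iter` (100 by default in `LogisticRegression`): if the two are equal, the solver stopped because it ran out of iterations rather than because it converged. A minimal sketch on synthetic data (the dataset and the names `X_demo`, `y_demo`, `clf` are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, not the Titanic passengers
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

clf = LogisticRegression(max_iter=100).fit(X_demo, y_demo)

iterations = int(clf.n_iter_[0])          # one entry for a binary problem
converged = iterations < clf.max_iter     # False means the iteration cap was hit
print(iterations, converged)
```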

# Model improvement

You can save all the people that the model will predict as dead. Change the weight of the classes to save all the people at risk.
- Change the weight of the classes
- Fit the model on train set
- Print the accuracy score on the train set and on the test set
- Print the Confusion Matrix on the test set, you must have no deaths that have been predicted as "Survived".

In [None]:
# It's up to you to save everybody:

model = LogisticRegression(class_weight={
    'Dead':6, 'Survived':1
}).fit(X_train, y_train)

print(f"Accuracy score on the train dataset: {model.score(X_train, y_train)}")
print(f"Accuracy score on the test dataset: {model.score(X_test, y_test)}")

pd.DataFrame(data = confusion_matrix(y_true = y_test, y_pred = model.predict(X_test)),
             index = model.classes_ + " ACTUAL",
             columns = model.classes_ + " PREDICTED")


Accuracy score on the train dataset: 0.7578947368421053
Accuracy score on the test dataset: 0.6756756756756757


Unnamed: 0,Dead PREDICTED,Survived PREDICTED
Dead ACTUAL,129,0
Survived ACTUAL,72,21
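Rather than trying class weights one by one, the smallest weight that leaves no dead passenger predicted as "Survived" can be searched for in a loop. A minimal sketch on synthetic stand-in data (the data, the helper `smallest_safe_weight` and its `max_weight` cap are hypothetical, not part of the quest):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic features (assumption, not the real data)
X_demo, y_num = make_classification(n_samples=600, n_features=6, random_state=0)
y_demo = np.where(y_num == 1, "Survived", "Dead")
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=36)

def smallest_safe_weight(max_weight=50):
    """Return (weight, confusion matrix) for the smallest 'Dead' weight with
    no dead passenger predicted as 'Survived', or (None, None) if none is found."""
    for w in range(1, max_weight + 1):
        clf = LogisticRegression(class_weight={"Dead": w, "Survived": 1}).fit(Xtr, ytr)
        cm = confusion_matrix(yte, clf.predict(Xte), labels=["Dead", "Survived"])
        if cm[0, 1] == 0:  # row 0 = actual Dead, column 1 = predicted Survived
            return w, cm
    return None, None

best_weight, best_cm = smallest_safe_weight()
print(best_weight)
print(best_cm)
```

Note that tuning the weight against the test set, as done here and in the cell above, gives an optimistic picture; a separate validation split would be the more careful choice.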


# People most at risk

You are looking for people most at risk.
- Compute the prediction probabilities **on your test set**
- Which column is about "survived" probability?
- Among the previous prediction probability array, select only the column corresponding to the "Survived" probability
- Display the passengers sorted with the most likely to survive first (`sort_values()` method?)

In [None]:
# It's up to you:

model.predict_proba(X_test.iloc[:10, :])

array([[0.97204054, 0.02795946],
       [0.98872895, 0.01127105],
       [0.73713466, 0.26286534],
       [0.91899913, 0.08100087],
       [0.98979794, 0.01020206],
       [0.97789173, 0.02210827],
       [0.98311137, 0.01688863],
       [0.80126324, 0.19873676],
       [0.12814243, 0.87185757],
       [0.98067934, 0.01932066]])

In [None]:
model.classes_

array(['Dead', 'Survived'], dtype=object)

The first column of the array is `Dead` and the second is `Survived`.
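Instead of hard-coding the column position, the index of the "Survived" column can be looked up from `classes_`. A minimal sketch on a tiny toy model (the data and the names `clf`, `survived_col` are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny toy problem with the same string labels as the notebook
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array(["Dead", "Dead", "Survived", "Survived"])
clf = LogisticRegression().fit(X_demo, y_demo)

# predict_proba columns follow the order of clf.classes_
survived_col = list(clf.classes_).index("Survived")
proba_survived = clf.predict_proba(X_demo)[:, survived_col]
print(survived_col, proba_survived)
```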

In [None]:
proba = model.predict_proba(X_test)
proba = np.around(proba[:,1], 4)
proba

array([2.800e-02, 1.130e-02, 2.629e-01, 8.100e-02, 1.020e-02, 2.210e-02,
       1.690e-02, 1.987e-01, 8.719e-01, 1.930e-02, 1.850e-02, 1.225e-01,
       8.760e-02, 3.200e-02, 2.020e-02, 3.489e-01, 1.501e-01, 7.986e-01,
       1.410e-02, 2.996e-01, 1.480e-02, 6.244e-01, 5.270e-02, 1.532e-01,
       5.260e-02, 2.859e-01, 1.342e-01, 1.206e-01, 1.510e-02, 4.377e-01,
       5.970e-02, 2.670e-02, 1.327e-01, 1.080e-02, 3.599e-01, 1.230e-02,
       1.040e-02, 1.265e-01, 2.621e-01, 4.863e-01, 4.177e-01, 2.110e-02,
       1.410e-02, 8.334e-01, 7.800e-03, 5.742e-01, 1.770e-02, 3.887e-01,
       1.098e-01, 6.800e-03, 3.920e-02, 2.990e-01, 1.410e-02, 1.627e-01,
       3.186e-01, 2.020e-02, 7.200e-03, 7.100e-02, 1.350e-02, 6.800e-02,
       1.690e-02, 9.000e-03, 1.690e-02, 2.110e-02, 1.540e-02, 2.712e-01,
       2.033e-01, 2.418e-01, 9.350e-02, 1.283e-01, 5.400e-03, 5.428e-01,
       1.350e-02, 7.748e-01, 2.586e-01, 6.683e-01, 4.372e-01, 4.500e-03,
       1.982e-01, 1.350e-02, 4.457e-01, 9.400e-03, 

In [None]:
test_df = X_test.copy()

In [None]:
# Add the survival probability as a percentage to make it easier to read

test_df['proba_survival'] = proba * 100
test_df

Unnamed: 0,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Sex_nb,proba_survival
346,3,3.0,1,1,15.9000,0,2.80
458,3,34.0,0,0,8.0500,0,1.13
878,3,22.0,0,0,10.5167,1,26.29
618,1,42.0,1,0,52.5542,0,8.10
178,3,36.0,0,0,0.0000,0,1.02
...,...,...,...,...,...,...,...
345,3,34.0,1,0,16.1000,1,11.89
693,3,44.0,0,0,8.0500,0,0.72
423,2,28.0,1,0,26.0000,1,42.00
19,3,22.0,0,0,7.2250,1,26.21


In [None]:
# test_df.index holds row labels, so select with .loc rather than positional .iloc
test_df['Name'] = df_titanic.loc[test_df.index, 'Name']
test_df

Unnamed: 0,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Sex_nb,proba_survival,Name
346,3,3.0,1,1,15.9000,0,2.80,Master. William Loch Coutts
458,3,34.0,0,0,8.0500,0,1.13,Mr. William Morley
878,3,22.0,0,0,10.5167,1,26.29,Miss. Gerda Ulrika Dahlberg
618,1,42.0,1,0,52.5542,0,8.10,Mr. Edwin Nelson Jr Kimball
178,3,36.0,0,0,0.0000,0,1.02,Mr. Lionel Leonard
...,...,...,...,...,...,...,...,...
345,3,34.0,1,0,16.1000,1,11.89,Mrs. Thomas Henry (Mary E Finck) Davison
693,3,44.0,0,0,8.0500,0,0.72,Mr. James Kelly
423,2,28.0,1,0,26.0000,1,42.00,Mrs. Charles V (Ada Maria Winfield) Clarke
19,3,22.0,0,0,7.2250,1,26.21,Mrs. Fatima Masselmani


In [None]:
test_df.sort_values('proba_survival', ascending=True)


Unnamed: 0,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Sex_nb,proba_survival,Name
322,3,20.0,8,2,69.5500,0,0.06,Mr. George John Jr Sage
535,3,69.0,0,0,14.5000,0,0.23,Mr. Samuel Beard Risien
508,3,66.0,0,0,8.0500,0,0.26,Mr. James Webber
13,3,39.0,1,5,31.2750,0,0.43,Mr. Anders Johan Andersson
264,3,16.0,4,1,39.6875,0,0.45,Mr. Ernesti Arvid Panula
...,...,...,...,...,...,...,...,...
755,1,33.0,0,0,86.5000,1,79.31,the Countess. of (Lucy Noel Martha Dyer-Edward...
217,1,32.0,0,0,76.2917,1,79.86,Miss. Albina Bazzani
534,1,30.0,0,0,106.4250,1,81.82,Miss. Bertha LeRoy
777,1,17.0,1,0,57.0000,1,83.34,Mrs. Albert Adrian (Vera Gillespie) Dick
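The table above is sorted in ascending order, so the passengers least likely to survive come first. To list the most likely survivors first, as the instructions ask, `ascending=False` can be used. A minimal sketch on three rows taken from the outputs above (`toy` and `most_likely_first` are illustrative names):

```python
import pandas as pd

# Three rows (index, proba_survival) copied from the outputs above
toy = pd.DataFrame(
    {"proba_survival": [2.80, 83.34, 26.29]},
    index=[346, 777, 878],
)

# Highest survival probability first
most_likely_first = toy.sort_values("proba_survival", ascending=False)
print(most_likely_first)
```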


# Bonus - Model improvement, under constraint

Your time travel boss tells you that there's a budget cut. You now can only save 120 people max. Not one more.

If your model predicts as "dead" someone who would have survived in reality, you "save" a person who would have survived even without your time-traveling help, and that person takes the place of someone who could have been saved. That's not optimal.

Select the 120 people with the highest probability of dying. Of these, how many actually survived?

In [None]:
# Select by row label (.loc), since test_df.index holds labels, not positions
test_df['Survived'] = df_titanic.loc[test_df.index, 'Survived']
test_df

Unnamed: 0,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Sex_nb,proba_survival,Name,Survived
346,3,3.0,1,1,15.9000,0,2.80,Master. William Loch Coutts,Survived
458,3,34.0,0,0,8.0500,0,1.13,Mr. William Morley,Dead
878,3,22.0,0,0,10.5167,1,26.29,Miss. Gerda Ulrika Dahlberg,Dead
618,1,42.0,1,0,52.5542,0,8.10,Mr. Edwin Nelson Jr Kimball,Survived
178,3,36.0,0,0,0.0000,0,1.02,Mr. Lionel Leonard,Dead
...,...,...,...,...,...,...,...,...,...
345,3,34.0,1,0,16.1000,1,11.89,Mrs. Thomas Henry (Mary E Finck) Davison,Survived
693,3,44.0,0,0,8.0500,0,0.72,Mr. James Kelly,Dead
423,2,28.0,1,0,26.0000,1,42.00,Mrs. Charles V (Ada Maria Winfield) Clarke,Survived
19,3,22.0,0,0,7.2250,1,26.21,Mrs. Fatima Masselmani,Survived


In [None]:
# It's up to you:
sorted_asc_df_120 = test_df.sort_values('proba_survival').iloc[:120]
sorted_asc_df_120

Unnamed: 0,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Sex_nb,proba_survival,Name,Survived
322,3,20.0,8,2,69.5500,0,0.06,Mr. George John Jr Sage,Dead
535,3,69.0,0,0,14.5000,0,0.23,Mr. Samuel Beard Risien,Dead
508,3,66.0,0,0,8.0500,0,0.26,Mr. James Webber,Dead
13,3,39.0,1,5,31.2750,0,0.43,Mr. Anders Johan Andersson,Dead
264,3,16.0,4,1,39.6875,0,0.45,Mr. Ernesti Arvid Panula,Dead
...,...,...,...,...,...,...,...,...,...
61,1,45.0,1,0,83.4750,0,7.38,Mr. Henry Birkhardt Harris,Dead
446,1,52.0,0,0,30.5000,0,7.72,Major. Arthur Godfrey Peuchen,Survived
618,1,42.0,1,0,52.5542,0,8.10,Mr. Edwin Nelson Jr Kimball,Survived
671,2,19.0,0,0,0.0000,0,8.29,Mr. Ennis Hastings Watson,Dead


In [None]:
sorted_asc_df_120['Survived'].value_counts()

Dead        100
Survived     20
Name: Survived, dtype: int64
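The counts above correspond to 100 / 120 ≈ 83% of the selected passengers actually dying. The same share can be computed for any budget `k` with a small helper. A minimal sketch on toy rows (the helper `share_dead_in_bottom_k` and the toy values are hypothetical):

```python
import pandas as pd

def share_dead_in_bottom_k(df, k, proba_col="proba_survival", label_col="Survived"):
    """Share of actual deaths among the k rows with the lowest survival probability."""
    bottom = df.nsmallest(k, proba_col)
    return (bottom[label_col] == "Dead").mean()

# Toy rows for illustration only
toy = pd.DataFrame({
    "proba_survival": [0.06, 0.23, 7.72, 8.10, 83.34],
    "Survived": ["Dead", "Dead", "Survived", "Survived", "Survived"],
})

print(share_dead_in_bottom_k(toy, 2))
print(share_dead_in_bottom_k(toy, 4))
print(100 / 120)  # the share observed in the output above
```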

Among the 120 passengers of the test subset with the highest probability of dying, 20 actually survived. Let's do some filtering to make sure we actually save 120 people.

In [None]:
test_df_only_dead = test_df[test_df['Survived'] == 'Dead']
test_df_only_dead

Unnamed: 0,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Sex_nb,proba_survival,Name,Survived
458,3,34.0,0,0,8.0500,0,1.13,Mr. William Morley,Dead
878,3,22.0,0,0,10.5167,1,26.29,Miss. Gerda Ulrika Dahlberg,Dead
178,3,36.0,0,0,0.0000,0,1.02,Mr. Lionel Leonard,Dead
377,3,19.0,0,0,7.7750,0,2.21,Mr. Karl Gideon Gustafsson,Dead
781,3,25.0,0,0,7.2500,0,1.69,Mr. Abraham (David Lishin) Harmer,Dead
...,...,...,...,...,...,...,...,...,...
658,3,40.0,0,0,7.2250,0,0.86,Mr. Mohamed Badt,Dead
166,3,45.0,1,4,27.9000,1,5.90,Mrs. William (Anna Bernhardina Karlsson) Skoog,Dead
771,3,18.0,0,0,7.7500,0,2.31,Mr. Pehr Fabian Oliver Malkolm Myhrman,Dead
693,3,44.0,0,0,8.0500,0,0.72,Mr. James Kelly,Dead


In [None]:
sorted_dead_df = test_df_only_dead.sort_values('proba_survival')
sorted_dead_df = sorted_dead_df.iloc[:120]
sorted_dead_df

Unnamed: 0,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Sex_nb,proba_survival,Name,Survived
322,3,20.0,8,2,69.5500,0,0.06,Mr. George John Jr Sage,Dead
535,3,69.0,0,0,14.5000,0,0.23,Mr. Samuel Beard Risien,Dead
508,3,66.0,0,0,8.0500,0,0.26,Mr. James Webber,Dead
13,3,39.0,1,5,31.2750,0,0.43,Mr. Anders Johan Andersson,Dead
264,3,16.0,4,1,39.6875,0,0.45,Mr. Ernesti Arvid Panula,Dead
...,...,...,...,...,...,...,...,...,...
531,3,30.0,0,0,8.6625,1,19.80,Miss. Marija Cacic,Dead
82,1,28.0,0,0,47.1000,0,20.33,Mr. Francisco M Carrau,Dead
677,3,28.0,0,0,8.1375,1,21.28,Miss. Katie Peters,Dead
375,1,27.0,0,2,211.5000,0,21.98,Mr. Harry Elkins Widener,Dead


In [None]:
sorted_dead_df['Survived'].value_counts()

Dead    120
Name: Survived, dtype: int64

The `sorted_dead_df` DataFrame now contains exactly 120 passengers from the test subset who have the highest predicted probability of dying and who actually died, so we can save exactly 120 people 😀

# Bonus - More predictions

Does the Reverend "Rev. Juozas Montvila" have a better chance of survival than "Mrs. William (Margaret Norton) Rice"?
- Filter the initial DataFrame to get only the 2 rows with the 2 persons above, and only columns present in your variables (X)
- Make a prediction with probability for these 2 people
- Which one has a better chance to survive?


In [None]:
df_final_predict = df_titanic.copy()
df_final_predict = df_final_predict[(df_final_predict['Name'] == 'Rev. Juozas Montvila') | (df_final_predict['Name'] == 'Mrs. William (Margaret Norton) Rice')]
df_final_predict = df_final_predict[['Pclass', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare', 'Sex_nb']]
df_final_predict

Unnamed: 0,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Sex_nb
881,3,39.0,0,5,29.125,1
882,2,27.0,0,0,13.0,0


In [None]:
proba_2 = np.around(model.predict_proba(df_final_predict)[:, 1], 4)
proba_2

array([0.106 , 0.0598])

In [None]:
df_final_predict.index

Int64Index([881, 882], dtype='int64')

In [None]:
df_final = df_final_predict.copy()
df_final['proba_survival'] = proba_2 * 100
df_final

Unnamed: 0,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Sex_nb,proba_survival
881,3,39.0,0,5,29.125,1,10.6
882,2,27.0,0,0,13.0,0,5.98


In [None]:
# Select by row label (.loc), since df_final.index holds labels, not positions
df_final['Name'] = df_titanic.loc[df_final.index, 'Name']
df_final

Unnamed: 0,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Sex_nb,proba_survival,Name
881,3,39.0,0,5,29.125,1,10.6,Mrs. William (Margaret Norton) Rice
882,2,27.0,0,0,13.0,0,5.98,Rev. Juozas Montvila


Both had a low chance of survival, but Mrs. William (Margaret Norton) Rice still had the better odds, with a 10.6% chance compared to nearly 6% for the Reverend Juozas Montvila.
Finally, let's see if either of them survived in 1912.

In [None]:
# Select by row label (.loc), since df_final.index holds labels, not positions
df_final['Survived'] = df_titanic.loc[df_final.index, 'Survived']
df_final

Unnamed: 0,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Sex_nb,proba_survival,Name,Survived
881,3,39.0,0,5,29.125,1,10.6,Mrs. William (Margaret Norton) Rice,Dead
882,2,27.0,0,0,13.0,0,5.98,Rev. Juozas Montvila,Dead


Nope, both dead. RIP 💀