<a href="https://colab.research.google.com/github/BerengerQueune/wild_notebooks/blob/main/2_2_ML_Classifications_Logistic_Regression_Titanic_B%C3%A9renger.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Execute the code below
You will get a passenger list of the Titanic.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
link = "https://raw.githubusercontent.com/murpi/wilddata/master/quests/titanic.csv"
df_titanic = pd.read_csv(link)
df_titanic['Survived'] = df_titanic['Survived'].apply(lambda x: "Survived" if x == 1 else "Dead")
df_titanic.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,Dead,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,Survived,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,Survived,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,Survived,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,Dead,3,Mr. William Henry Allen,male,35.0,0,0,8.05


# Data preparation

What is the type of each column? Are there non-numeric columns?

In [None]:
# What is the type of each column?
df_titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Survived                 887 non-null    object 
 1   Pclass                   887 non-null    int64  
 2   Name                     887 non-null    object 
 3   Sex                      887 non-null    object 
 4   Age                      887 non-null    float64
 5   Siblings/Spouses Aboard  887 non-null    int64  
 6   Parents/Children Aboard  887 non-null    int64  
 7   Fare                     887 non-null    float64
dtypes: float64(2), int64(3), object(3)
memory usage: 55.6+ KB


In [None]:
# The columns have 3 dtypes: float64 and int64 (numeric values) and object (non-numeric values: Survived, Name, Sex)

Then, make a first exploration of the dataset (pair plots, correlation heatmaps, etc.).

In [None]:
# First exploration:
fig = px.scatter_matrix(df_titanic, width=1500, height=1500)
fig.show()

In [None]:
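# Correlate numeric columns only; recent pandas versions raise an error on text columns such as Name
corr = df_titanic.select_dtypes(include="number").corr()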

fig = go.Figure()

fig.add_trace(go.Heatmap(
    z = corr,
    x = corr.columns.values,
    y = corr.columns.values,
    colorscale = px.colors.diverging.RdBu,
    zmid=0
))

fig.update_layout(width=1000, height=900)
fig.show()

In [None]:
fig = make_subplots(rows=1, cols=2, subplot_titles=("Fare versus Pclass", "Fare versus Age"))

fig.add_trace(go.Box(x = df_titanic["Pclass"], y=df_titanic["Fare"]),
              row=1, col=1)

fig.add_trace(go.Box(x = df_titanic["Age"], y=df_titanic["Fare"]),
              row=1, col=2)

fig.update_layout(autosize=False, template='plotly_dark', width = 1500, height = 700, showlegend=False)

fig.update_xaxes(title_text="Pclass", row=1, col=1)
fig.update_yaxes(title_text="Fare", row=1, col=1)

fig.update_xaxes(title_text="Age", row=1, col=2)
fig.update_yaxes(title_text="Fare", row=1, col=2)

fig.show()

You are looking for Jack. How many people named Jack are on board?

In [None]:
# How many people named Jack are on board?
df_titanic[df_titanic['Name'].str.contains('Jack')]

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
762,Dead,1,Dr. Arthur Jackson Brewe,male,46.0,0,0,39.6


In [None]:
# No one is named Jack: the only match is "Jackson", which contains "Jack" as a substring
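
`str.contains('Jack')` does a substring search, which is why "Jackson" shows up. A minimal sketch of a whole-word search with a regex word boundary follows. The first name comes from the output above; the second is a hypothetical entry added only to show a positive match.

```python
import pandas as pd

# First name: the real match from the output above.
# Second name: hypothetical, added for illustration only.
names = pd.Series(["Dr. Arthur Jackson Brewe", "Mr. Jack Dawson"])

# Plain substring search also matches "Jackson"
substring_hits = names.str.contains("Jack")

# \b marks a word boundary, so only the standalone word "Jack" matches
word_hits = names.str.contains(r"\bJack\b", regex=True)

print(substring_hits.tolist())  # [True, True]
print(word_hits.tolist())       # [False, True]
```

On `df_titanic`, the same pattern could be passed to `df_titanic['Name'].str.contains(...)`.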

# Logistic regression

Today, in this quest, you have an extraordinary power: you can travel in time to try to save some passengers. 
You obviously wanted to save Jack, but you didn't find his name on the list. He is probably travelling under a false name...
Never mind: thanks to this trip, you can still try to save as many people as possible. To do this, you have to identify the people who are most likely to die.

- Select features (X) with only numeric values, and without "Survived" column
- Select "Survived" column as target (y)
- Split your data with **random_state = 36**
- Train a logistic regression
- Print the accuracy score on the train set and on the test set. Is there overfitting?
- Print the Confusion Matrix on the test set
- How many iterations were needed to train this model?

In [None]:
# It's up to you:

df_titanic['Sex'] = df_titanic['Sex'].factorize()[0]

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = df_titanic[["Pclass", 'Sex', 'Age', "Siblings/Spouses Aboard",'Parents/Children Aboard', 'Fare']]
y = df_titanic['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 36, train_size = 0.75)

model = LogisticRegression().fit(X_train,y_train)

print("accuracy score on train set:",model.score(X_train, y_train))
print("accuracy score on test set:",model.score(X_test, y_test))

accuracy score on train set: 0.8165413533834587
accuracy score on test set: 0.7882882882882883


In [None]:
# There is no significant overfitting: train accuracy (0.817) is only slightly higher than test accuracy (0.788)

In [None]:
from sklearn.metrics import confusion_matrix

pd.DataFrame(data = confusion_matrix(y_true = y_test, y_pred = model.predict(X_test)),
             index = model.classes_ + " actual",
             columns = model.classes_ + " predicted")

Unnamed: 0,Dead predicted,Survived predicted
Dead actual,110,19
Survived actual,28,65


In [None]:
print (f" The number of iterations was {model.n_iter_}.")

 The number of iterations was [54].
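
`n_iter_` is returned as an array (hence the `[54]` above), with a single entry for a binary problem. Here is a minimal sketch, on synthetic data rather than the Titanic set, of reading it as a plain integer and comparing it with `max_iter` (100 by default in scikit-learn). The names `X_demo`, `y_demo`, and `demo_model` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (NOT the Titanic set)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([0, 0, 1, 0, 1, 1])

demo_model = LogisticRegression(max_iter=100).fit(X_demo, y_demo)

# n_iter_ is an array; take its single entry for a binary problem
n_iter = int(demo_model.n_iter_[0])

# If the solver used fewer iterations than the cap, it stopped on convergence
stopped_before_cap = n_iter < demo_model.max_iter
print(f"The number of iterations was {n_iter} (cap: {demo_model.max_iter}).")
```

In the run above, 54 iterations is below the default cap of 100, which suggests the solver stopped because it converged rather than because it ran out of iterations.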


# Model improvement

You can save all the people that the model will predict as dead. Change the weight of the classes to save all the people at risk.
- Change the weight of the classes
- Fit the model on train set
- Print the accuracy score on the train set and on the test set
- Print the Confusion Matrix on the test set, you must have no deaths that have been predicted as "Survived".

In [None]:
df_titanic.tail()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
882,Dead,2,Rev. Juozas Montvila,0,27.0,0,0,13.0
883,Survived,1,Miss. Margaret Edith Graham,1,19.0,0,0,30.0
884,Dead,3,Miss. Catherine Helen Johnston,1,7.0,1,2,23.45
885,Survived,1,Mr. Karl Howell Behr,0,26.0,0,0,30.0
886,Dead,3,Mr. Patrick Dooley,0,32.0,0,0,7.75


In [None]:
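# It's up to you to save everybody:
# NOTE: no class_weight is passed here, so this is the same default model as before;
# the scores and confusion matrix below are unchanged (19 deaths are still predicted as "Survived").
model = LogisticRegression()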
model.fit(X_train,y_train)

print("accuracy score on train set:",model.score(X_train, y_train))
print("accuracy score on test set:",model.score(X_test, y_test))

pd.DataFrame(data = confusion_matrix(y_true = y_test, y_pred = model.predict(X_test)),
             index = model.classes_ + " actual",
             columns = model.classes_ + " predicted")

accuracy score on train set: 0.8165413533834587
accuracy score on test set: 0.7882882882882883


Unnamed: 0,Dead predicted,Survived predicted
Dead actual,110,19
Survived actual,28,65
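
The task asks for class weights, which `LogisticRegression` accepts through its `class_weight` parameter. The sketch below uses synthetic data, not the Titanic set, and an assumed weight of 100 for the "Dead" class. It is an illustration of the mechanism, not a tuned solution, and the right weight for the real data would have to be found by experiment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic stand-in data (NOT the Titanic set): one feature, two overlapping classes
rng = np.random.default_rng(0)
X_demo = np.concatenate([rng.normal(0, 1, 200), rng.normal(2, 1, 200)]).reshape(-1, 1)
y_demo = np.array(["Dead"] * 200 + ["Survived"] * 200)

labels = ["Dead", "Survived"]
default_model = LogisticRegression().fit(X_demo, y_demo)

# Assumed weights: an error on "Dead" costs 100 times more than an error on "Survived"
weighted_model = LogisticRegression(class_weight={"Dead": 100, "Survived": 1}).fit(X_demo, y_demo)

# For brevity, both models are evaluated on the data they were fitted on
cm_default = confusion_matrix(y_demo, default_model.predict(X_demo), labels=labels)
cm_weighted = confusion_matrix(y_demo, weighted_model.predict(X_demo), labels=labels)

# Row 0 = actual "Dead", column 1 = predicted "Survived"
print("Dead predicted as Survived (default): ", cm_default[0, 1])
print("Dead predicted as Survived (weighted):", cm_weighted[0, 1])
```

Raising the weight of "Dead" pushes the model to predict "Dead" more often, so fewer deaths are missed, at the price of more survivors being flagged as at risk and usually a lower accuracy score.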


# People most at risk

You are looking for people most at risk.
- Compute the prediction probabilities **on your test set**
- Which column holds the "Survived" probability?
- From the prediction probability array, select only the column corresponding to the "Survived" probability
- Display the passengers sorted with the most likely to survive first (`sort_values()` method?)

In [None]:
model.predict(X_test)

array(['Dead', 'Dead', 'Survived', 'Dead', 'Dead', 'Dead', 'Dead',
       'Survived', 'Survived', 'Dead', 'Dead', 'Dead', 'Dead', 'Dead',
       'Dead', 'Survived', 'Survived', 'Survived', 'Dead', 'Survived',
       'Dead', 'Survived', 'Dead', 'Dead', 'Dead', 'Survived', 'Dead',
       'Dead', 'Dead', 'Survived', 'Dead', 'Dead', 'Dead', 'Dead',
       'Survived', 'Dead', 'Dead', 'Dead', 'Survived', 'Survived',
       'Survived', 'Dead', 'Dead', 'Survived', 'Dead', 'Survived', 'Dead',
       'Survived', 'Dead', 'Dead', 'Dead', 'Survived', 'Dead', 'Survived',
       'Survived', 'Dead', 'Dead', 'Dead', 'Dead', 'Dead', 'Dead', 'Dead',
       'Dead', 'Dead', 'Dead', 'Survived', 'Survived', 'Survived', 'Dead',
       'Dead', 'Dead', 'Survived', 'Dead', 'Survived', 'Survived',
       'Survived', 'Survived', 'Dead', 'Survived', 'Dead', 'Survived',
       'Dead', 'Dead', 'Dead', 'Dead', 'Dead', 'Dead', 'Survived', 'Dead',
       'Dead', 'Survived', 'Dead', 'Dead', 'Survived', 'Dead', 'Survived'

In [None]:
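# It's up to you:
# NOTE: this refits the model on the test set, so the probabilities below are in-sample.
# Remove this line to score the test set with the model trained on X_train.
model.fit(X_test,y_test)

prediction = model.predict_proba(X_test)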
prediction[1]

array([0.85229328, 0.14770672])

In [None]:
model.classes_

array(['Dead', 'Survived'], dtype=object)

In [None]:
# Survived is the second column
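
The columns of `predict_proba` follow the order of `model.classes_`, so the column can be looked up by label instead of hard-coding its position. A minimal sketch on a tiny synthetic example, not the Titanic data; `X_demo`, `y_demo`, and `demo_model` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic example (NOT the Titanic data)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array(["Dead", "Dead", "Survived", "Survived"])
demo_model = LogisticRegression().fit(X_demo, y_demo)

# classes_ is sorted, and the predict_proba columns follow the same order
survived_col = list(demo_model.classes_).index("Survived")

proba = demo_model.predict_proba(X_demo)  # shape: (n_samples, n_classes)
p_survived = proba[:, survived_col]       # 1-D array of "Survived" probabilities

print(survived_col)  # 1
```

Looking the column up by label keeps the code correct even if the class labels change.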

In [None]:
prediction2 = prediction[:,1:]

In [None]:
X_test2 = X_test.copy()
X_test2["prediction"] = prediction2
X_test2

Unnamed: 0,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,prediction
346,3,0,3.0,1,1,15.9000,0.145435
458,3,0,34.0,0,0,8.0500,0.147707
878,3,1,22.0,0,0,10.5167,0.709508
618,1,0,42.0,1,0,52.5542,0.382961
178,3,0,36.0,0,0,0.0000,0.138885
...,...,...,...,...,...,...,...
345,3,1,34.0,1,0,16.1000,0.628743
693,3,0,44.0,0,0,8.0500,0.132665
423,2,1,28.0,1,0,26.0000,0.792821
19,3,1,22.0,0,0,7.2250,0.705539


In [None]:
isindex = X_test2.index
isindex

Int64Index([346, 458, 878, 618, 178, 377, 781,  78, 883, 318,
            ...
            371, 579, 658, 166, 771, 345, 693, 423,  19, 564],
           dtype='int64', length=222)

In [None]:
# Look up the names by index label (.loc) rather than by position (.iloc)
X_test2["Name"] = df_titanic.loc[isindex, "Name"]
X_test3 = X_test2.sort_values(by=['prediction'], ascending=False)
X_test3

Unnamed: 0,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,prediction,Name
297,1,1,50.0,0,1,247.5208,0.956357,Mrs. James (Helene DeLaudeniere Chaput) Baxter
309,1,1,18.0,2,2,262.3750,0.946516,Miss. Emily Borie Ryerson
534,1,1,30.0,0,0,106.4250,0.938024,Miss. Bertha LeRoy
304,1,1,42.0,0,0,110.8833,0.930419,Miss. Margaret Fleming
755,1,1,33.0,0,0,86.5000,0.928473,the Countess. of (Lucy Noel Martha Dyer-Edward...
...,...,...,...,...,...,...,...,...
264,3,0,16.0,4,1,39.6875,0.073027,Mr. Ernesti Arvid Panula
259,3,0,3.0,4,2,31.3875,0.067273,Master. Edvin Rojj Felix Asplund
384,3,0,1.0,5,2,46.9000,0.059367,Master. Sidney Leonard Goodwin
13,3,0,39.0,1,5,31.2750,0.050214,Mr. Anders Johan Andersson


# Bonus - Model improvement, under constraint

Your time travel boss tells you that there's a budget cut. You now can only save 120 people max. Not one more.

If your model predicts as "dead" someone who would actually have survived, you "save" a person who would have survived even without your time-traveling help, and that person takes the place of someone who could have been saved. That's not optimal.

Select the 120 people with the highest probability of dying. Of these, how many actually survived?

In [None]:
X_test_121 = X_test3.sort_values(by=['prediction'], ascending=True)
X_test_121

Unnamed: 0,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,prediction,Name
322,3,0,20.0,8,2,69.5500,0.026221,Mr. George John Jr Sage
13,3,0,39.0,1,5,31.2750,0.050214,Mr. Anders Johan Andersson
384,3,0,1.0,5,2,46.9000,0.059367,Master. Sidney Leonard Goodwin
259,3,0,3.0,4,2,31.3875,0.067273,Master. Edvin Rojj Felix Asplund
264,3,0,16.0,4,1,39.6875,0.073027,Mr. Ernesti Arvid Panula
...,...,...,...,...,...,...,...,...
755,1,1,33.0,0,0,86.5000,0.928473,the Countess. of (Lucy Noel Martha Dyer-Edward...
304,1,1,42.0,0,0,110.8833,0.930419,Miss. Margaret Fleming
534,1,1,30.0,0,0,106.4250,0.938024,Miss. Bertha LeRoy
309,1,1,18.0,2,2,262.3750,0.946516,Miss. Emily Borie Ryerson


In [None]:
# It's up to you:
# Take an explicit copy so that adding a column does not trigger SettingWithCopyWarning
X_test_122 = X_test_121[:120].copy()
X_test_122["Survived"] = df_titanic.loc[X_test_122.index, "Survived"]
X_test_122


Unnamed: 0,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,prediction,Name,Survived
322,3,0,20.0,8,2,69.5500,0.026221,Mr. George John Jr Sage,Dead
13,3,0,39.0,1,5,31.2750,0.050214,Mr. Anders Johan Andersson,Dead
384,3,0,1.0,5,2,46.9000,0.059367,Master. Sidney Leonard Goodwin,Dead
259,3,0,3.0,4,2,31.3875,0.067273,Master. Edvin Rojj Felix Asplund,Survived
264,3,0,16.0,4,1,39.6875,0.073027,Mr. Ernesti Arvid Panula,Dead
...,...,...,...,...,...,...,...,...,...
489,1,0,55.0,0,0,30.5000,0.373076,Mr. Harry Markland Molson,Dead
261,1,0,40.0,0,0,0.0000,0.375332,Mr. William Harrison,Dead
383,2,0,18.0,0,0,73.5000,0.380142,Mr. Charles Henry Davies,Dead
818,1,0,38.0,0,0,0.0000,0.381206,Jonkheer. John George Reuchlin,Dead


In [None]:
X_test_122["Survived"].value_counts()

Dead        102
Survived     18
Name: Survived, dtype: int64
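
The selection above can also be written with `nsmallest`, which picks the rows with the lowest survival probability without a full sort. A minimal sketch, using five rows copied from the table above and a hypothetical budget `k = 4` in place of 120:

```python
import pandas as pd

# Five rows copied from the table above: predicted survival probability and actual outcome
demo = pd.DataFrame(
    {
        "prediction": [0.073027, 0.026221, 0.067273, 0.050214, 0.059367],
        "Survived": ["Dead", "Dead", "Survived", "Dead", "Dead"],
    },
    index=[264, 322, 259, 13, 384],
)

k = 4  # hypothetical budget for this toy example (120 in the exercise)

# Lowest survival probability = highest probability of dying
most_at_risk = demo.nsmallest(k, "prediction")

# Slots spent on people who would have survived anyway
wasted_slots = int((most_at_risk["Survived"] == "Survived").sum())

print(most_at_risk.index.tolist())  # [322, 13, 384, 259]
print(wasted_slots)                 # 1
```

In the full run above, the same count is the `Survived     18` line of `value_counts()`: 18 of the 120 slots go to passengers who survived anyway.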

# Bonus - More predictions

Does the Reverend "Rev. Juozas Montvila" have a better chance of survival than "Mrs. William (Margaret Norton) Rice"?
- Filter the initial DataFrame to get only the 2 rows for the 2 people above, and only the columns present in your variables (X)
- Make a prediction with probability for these 2 people
- Which one has a better chance to survive?


In [None]:
new_df = df_titanic.loc[df_titanic["Name"] == "Rev. Juozas Montvila"]
new_df2 = df_titanic.loc[df_titanic["Name"] == "Mrs. William (Margaret Norton) Rice"]
frames = [new_df, new_df2]
result = pd.concat(frames)
result = result[["Pclass", 'Sex', 'Age', "Siblings/Spouses Aboard",'Parents/Children Aboard', 'Fare']]
result

Unnamed: 0,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
882,2,0,27.0,0,0,13.0
881,3,1,39.0,0,5,29.125
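
The two filters and the `concat` above can be collapsed into a single `isin` mask. A minimal sketch on a three-row toy frame built from values shown in this notebook's outputs; the reduced column list `features` is an assumption made to keep the example short.

```python
import pandas as pd

# Toy frame built from rows shown in this notebook's outputs
demo = pd.DataFrame(
    {
        "Name": [
            "Mrs. William (Margaret Norton) Rice",
            "Rev. Juozas Montvila",
            "Miss. Margaret Edith Graham",
        ],
        "Pclass": [3, 2, 1],
        "Age": [39.0, 27.0, 19.0],
    },
    index=[881, 882, 883],
)

wanted = ["Rev. Juozas Montvila", "Mrs. William (Margaret Norton) Rice"]
features = ["Pclass", "Age"]  # shortened feature list, for illustration only

# Boolean mask on the names, then keep only the feature columns
selected = demo.loc[demo["Name"].isin(wanted), features]
print(selected)
```

Note that `isin` keeps the rows in their original DataFrame order (881, then 882), whereas `concat` keeps the order in which the frames were listed.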


In [None]:
prediction3 = model.predict_proba(result)
prediction4 = prediction3[:, 1]
prediction4

array([0.27810826, 0.44467988])

In [None]:
result['prediction'] = prediction4
result

Unnamed: 0,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,prediction
882,2,0,27.0,0,0,13.0,0.278108
881,3,1,39.0,0,5,29.125,0.44468


In [None]:
df_titanic.iloc[-6:-4:]

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
881,Dead,3,Mrs. William (Margaret Norton) Rice,1,39.0,0,5,29.125
882,Dead,2,Rev. Juozas Montvila,0,27.0,0,0,13.0


In [None]:
# Based on the predicted probabilities (0.44 vs 0.28), Mrs. William (Margaret Norton) Rice had a better chance of survival than Rev. Juozas Montvila, but they both died