## Task: Train a logistic regression classifier to predict passenger survival in the Titanic dataset

You are provided with code to download and load the Titanic dataset as a CSV file.

In the dataset, each row holds information about one Titanic passenger, such as name, gender, and class (see the dataframe below for more info).

The target column is 'Survived', which tells us whether a particular passenger survived or not.

Use any or all of the other columns as input features (you can drop the columns you think are not worth keeping).

Your task is to train a logistic regression model that takes the input features (make sure not to accidentally feed the 'Survived' column to the model as input) and predicts whether a passenger with these features would survive or not.

Make sure to put emphasis on code quality and to include a way to judge how well your model performs on **unseen data (data the model was not trained on)**.

As a bonus, see if you can figure out which feature is most likely to affect the survivability of a passenger.

In [2]:
from IPython.display import clear_output

In [3]:
%pip install numpy
%pip install pandas
%pip install matplotlib
%pip install scikit-learn
%pip install gdown

clear_output()

In [4]:
!gdown 18YfCgT3Rk7uYWrUzgjb2UR3Nyo9Z68bK  # Download the csv file.

Downloading...
From: https://drive.google.com/uc?id=18YfCgT3Rk7uYWrUzgjb2UR3Nyo9Z68bK
To: /opt/notebooks/01_Week/Assignments/titanic.csv
100%|███████████████████████████████████████| 60.3k/60.3k [00:00<00:00, 468kB/s]


In [124]:
import pandas as pd
import matplotlib.pyplot as plt

In [125]:
titanic_data = pd.read_csv('titanic.csv')

In [126]:
titanic_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [136]:
data = titanic_data.copy()  # work on a copy so the loaded frame stays untouched

In [137]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# Solving it on my own

In [138]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score

In [139]:
# 1 - Understand the data
unique_values = set(data["Embarked"])
print("Unique values of embarked column: ", unique_values)

print("\nembarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)")
print("nan for not recorded")

print("\nSibSp – Number of siblings and spouses on board")
print("Parch – Number of parents and children on board")

print("\nDimensions of the features: ", data.shape)

Unique values of embarked column:  {nan, 'Q', 'S', 'C'}

embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
nan for not recorded

SibSp – Number of siblings and spouses on board
Parch – Number of parents and children on board

Dimensions of the features:  (891, 12)


In [140]:
# 2 – Drop columns by intuition
data = data.drop(columns=[
    "Name",         # free-text name; not used as a feature here
    "Ticket",       # ticket number is assumed not to affect the outcome
    "Embarked",     # port of embarkation is assumed irrelevant (cabins were most likely booked beforehand)
    "Fare",         # closely related to passenger class, though finer-grained (some first-class cabins cost more than others)
    "PassengerId",  # an identifier unrelated to survival (it simply tracks the dataframe index)
])

In [141]:
# 3 – Check for data completeness
for column in ["Pclass", "Age", "Sex", "SibSp", "Parch", "Cabin"]:
    nan_count = data[column].isnull().sum()
    print(f"Number of NaN values in {column}:", nan_count)

Number of NaN values in Pclass: 0
Number of NaN values in Age: 177
Number of NaN values in Sex: 0
Number of NaN values in SibSp: 0
Number of NaN values in Parch: 0
Number of NaN values in Cabin: 687


In [142]:
# 4 – Drop columns because of too many missing values

data = data.drop(columns=["Cabin"])
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch
0,0,3,male,22.0,1,0
1,1,1,female,38.0,1,0
2,1,3,female,26.0,0,0
3,1,1,female,35.0,1,0
4,0,3,male,35.0,0,0


In [143]:
# 5 - Remove entries with missing ages

print("Shape before:", data.shape)

# Remove entries where "age" is missing
data = data.dropna(subset=["Age"])

print("Shape after: ", data.shape)

Shape before: (891, 6)
Shape after:  (714, 6)
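
Dropping the rows without an age removes 177 of 891 passengers (about 20%). One possible alternative, sketched below on a small made-up frame rather than the real data, is to fill the missing ages with the median of the observed ones. The `toy` frame and its values are hypothetical and only serve to illustrate `SimpleImputer`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for the Titanic frame (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Pclass": [3, 1, 3, 1, 3],
})

# Replace missing ages with the median of the observed ages.
imputer = SimpleImputer(strategy="median")
toy["Age"] = imputer.fit_transform(toy[["Age"]]).ravel()

print(toy)
```

If this approach were used on the real data, the imputer would best be fitted on the training split only, so that no information from the test rows leaks into the fill value.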


In [144]:
# 6 – Convert categorical columns to numeric (One Hot Encoding)
data = pd.get_dummies(data, columns=['Sex'], drop_first=True)

data.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Sex_male
0,0,3,22.0,1,0,1
1,1,1,38.0,1,0,0
2,1,3,26.0,0,0,0
3,1,1,35.0,1,0,0
4,0,3,35.0,0,0,1
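
To make the effect of `drop_first=True` visible, here is a minimal sketch on a hypothetical four-row frame. With two categories, the second dummy column carries no extra information, so keeping only `Sex_male` is enough. The `dtype=int` argument is an addition here to get 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy column (not the real data).
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Without drop_first, two complementary columns are produced.
full = pd.get_dummies(toy, columns=["Sex"], dtype=int)

# With drop_first=True, the first category in sorted order ("female") is dropped.
reduced = pd.get_dummies(toy, columns=["Sex"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```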


In [145]:
# 7 – Split data into feature matrix (X) and target (y)
X = data.drop(columns=['Survived'])
y = data['Survived']

In [147]:
# 8 - Split into a training set and a separate test set held out for later evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7331)

print("Train dataset shape: ", X_train.shape)
print("Test dataset shape", X_test.shape)

Train dataset shape:  (571, 5)
Test dataset shape (143, 5)
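
The split above is purely random, so the share of survivors can differ between the training and test sets. A possible refinement is the `stratify` argument of `train_test_split`, sketched here on hypothetical labels (`X_toy` and `y_toy` are made up for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 non-survivors, 40 survivors.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=7331, stratify=y_toy
)

print("Survivor share in train:", y_tr.mean())
print("Survivor share in test: ", y_te.mean())
```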


In [148]:
# 9 – Create and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
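
A fitted logistic regression also offers a rough way to approach the bonus question: after standardising the features, the coefficient with the largest absolute value belongs to the feature with the strongest influence on the prediction. The sketch below uses synthetic data in which survival is deliberately tied to `Sex_male`. The frame `toy`, the 90% rule, and the name `toy_model` are all assumptions made for illustration, not results from the Titanic file.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data (not the Titanic file): survival depends mostly on sex.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Age": rng.uniform(1, 70, size=n),
    "Sex_male": rng.integers(0, 2, size=n),
})
# Women survive in about 90% of rows, men in about 10%; other columns are noise.
follows_rule = rng.random(n) < 0.9
survived = np.where(follows_rule, 1 - toy["Sex_male"], toy["Sex_male"])

# Standardise so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(toy)
toy_model = LogisticRegression(max_iter=1000).fit(X_scaled, survived)

# Rank features by absolute coefficient; the sign gives the direction of the effect.
coefficients = pd.Series(toy_model.coef_[0], index=toy.columns)
ranking = coefficients.reindex(coefficients.abs().sort_values(ascending=False).index)
print(ranking)
```

Applied to the real `X_train`, the same ranking would only be meaningful if the features were scaled first, since `Age` and `Sex_male` live on very different ranges.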

In [152]:
# 10 – Evaluate the trained model on the separate test data

from sklearn.metrics import precision_score, recall_score, f1_score

y_pred = model.predict(X_test)

# Print evaluation metrics
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("Precision Score:", precision_score(y_test, y_pred))
print("Recall Score:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))

print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy Score: 0.8251748251748252
Precision Score: 0.8837209302325582
Recall Score: 0.6551724137931034
F1 Score: 0.7524752475247525
Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.94      0.86        85
           1       0.88      0.66      0.75        58

    accuracy                           0.83       143
   macro avg       0.84      0.80      0.81       143
weighted avg       0.83      0.83      0.82       143
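
The scores above come from a single split with 143 test rows, so they can shift noticeably with a different `random_state`. A k-fold cross-validation is one way to get a steadier estimate of performance on unseen data. The following is a minimal sketch on synthetic data (`X_toy`, `y_toy`, and the pipeline step names are assumptions), not a rerun of the experiment above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for X and y.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold only.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Five stratified folds; every row is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7331)
scores = cross_val_score(pipeline, X_toy, y_toy, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean accuracy:  ", scores.mean())
```

Reporting the mean together with the spread of the fold scores gives a better sense of how much the accuracy depends on the particular split.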

