## Task: Train a logistic regression classifier to predict the survival of passengers in the Titanic dataset

You are provided with code to download and load the Titanic dataset as a CSV file.

In the dataset, each row holds information about one passenger of the Titanic, such as their name, gender, and class (see the dataframe below for more info).

The target column is 'Survived', which tells us whether that passenger survived or not.

Use any or all of the other columns as input features (you may drop columns that you consider not worth keeping).

Your task is to train a logistic regression model that takes the input features (make sure not to accidentally feed the 'Survived' column to the model as input) and predicts whether a passenger with these features would survive or not.

Make sure to put emphasis on code quality and to include a way to judge how well your model performs on **unseen data (data it was not trained on)**.

As a bonus, see if you can figure out which feature is most likely to affect the survivability of a passenger.
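
One possible way to judge performance on unseen data is k-fold cross-validation. The sketch below is only an illustration on synthetic data: `X_demo`, `y_demo`, and `clf` are hypothetical names, and the choice of `MinMaxScaler` with 5 folds is an assumption, not a requirement of the task.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a prepared feature matrix and target (hypothetical data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# The scaler sits inside the pipeline, so each fold fits it on its training part only
clf = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# Every fold is used once as held-out data the model has not seen during fitting
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
print("Accuracy per fold:", scores.round(3))
print("Mean accuracy:", round(scores.mean(), 3))
```

Averaging over several folds tends to give a more stable estimate than a single train/test split, at the cost of fitting the model several times.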

In [None]:
from IPython.display import clear_output

In [None]:
%pip install numpy
%pip install pandas
%pip install matplotlib
%pip install gdown
%pip install scikit-learn
%pip install plotly

clear_output()

In [None]:
!gdown 18YfCgT3Rk7uYWrUzgjb2UR3Nyo9Z68bK  # Download the csv file.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
titanic_data = pd.read_csv('titanic.csv')

In [None]:
titanic_data.head()

In [None]:
data = titanic_data.copy()  # work on a copy so the raw dataframe stays untouched

In [None]:
data.head()

# Solving it with SKLearn

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
import plotly.graph_objects as go

In [None]:
# 1 - Understand the data

unique_values = set(data["Embarked"])
print("Unique values of embarked column: ", unique_values)

print("\nEmbarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)")
print("NaN = not recorded")

print("\nSibSp – Number of siblings and spouses on board")
print("Parch – Number of parents and children on board")

print("\nDimensions of the features: ", data.shape)

In [None]:
# 2 – Drop columns by intuition

data = data.drop(columns=["Name",  # just the name, no strings attached to it
                          "Ticket",  # ticket number / name does not change the outcome
                          "Embarked",  # port of boarding should not matter (rooms were most likely booked beforehand)
                          "Fare",  # largely duplicates passenger class (though it would be more precise, since some first-class rooms were more expensive than others, for example)
                          "PassengerId",  # not related to survivability (it just follows the dataframe index)
                          ])

In [None]:
# 3 – Check for data completeness

nan_count = data["Pclass"].isnull().sum()
print("Number of NaN values in pclass:", nan_count)

nan_count = data["Age"].isnull().sum()
print("Number of NaN values in age:", nan_count)

nan_count = data["Sex"].isnull().sum()
print("Number of NaN values in sex:", nan_count)

nan_count = data["SibSp"].isnull().sum()
print("Number of NaN values in SibSp:", nan_count)

nan_count = data["Parch"].isnull().sum()
print("Number of NaN values in parch:", nan_count)

nan_count = data["Cabin"].isnull().sum()
print("Number of NaN values in cabin:", nan_count)

In [None]:
# 4 – Drop columns because of too many missing values

data = data.drop(columns=["Cabin"])
data.head()

In [None]:
# 5 - Remove entries with missing ages

print("Shape before:", data.shape)

# Remove entries where "age" is missing
data = data.dropna(subset=["Age"])

print("Shape after: ", data.shape)

In [None]:
# 6 – Split data into feature matrix (X) and target (y)

X = data.drop(columns=['Survived'])
y = data['Survived']

In [None]:
# 7 – Convert categorical columns to numeric (One Hot Encoding)

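# dtype=float keeps the dummy columns numeric; newer pandas versions default to bool,
# which does not support the subtraction used for normalization
X = pd.get_dummies(X, columns=['Sex'], drop_first=True, dtype=float)
X = pd.get_dummies(X, columns=['Pclass'], drop_first=True, dtype=float)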

X.head()

In [None]:
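# 8 – Normalize features (min-max scaling to [0, 1])
# Note: min and max are computed on the full dataset, before the train/test split
X = X.apply(lambda x: (x - x.min()) / (x.max() - x.min()))

# Sanity check that the data was normalized
print(X)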

In [None]:
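# 9 – Add intercept term (a constant column of ones)
# sklearn fits its own intercept by default; the column is needed for the own implementation further down

X["Intercept"] = 1.0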

In [None]:
# 10 - Create a training subset and set aside separate test data for later

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7331)

print("Train dataset shape: ", X_train.shape)
print("Test dataset shape: ", X_test.shape)

In [None]:
# 11 – Create and train the logistic regression model

model = LogisticRegression(verbose=1, max_iter=1000)
model.fit(X_train, y_train)

In [None]:
# 12 – Use the trained model on the separate test data

y_pred = model.predict(X_test)

In [None]:
# 13 – Print evaluation metrics

labels = ["Died", "Survived"]
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=labels))

In [None]:
# 14 – Visualize model coefs

feature_coef = pd.Series(model.coef_[0], index=X.columns).sort_values(ascending=True)

# Create a horizontal bar plot
plt.figure(figsize=(10, 6))

# Plot horizontal bar plot
plt.barh(feature_coef.index, feature_coef.values, color='skyblue')

# Add labels and title
plt.title('Logistic regression coefficients')
plt.xlabel('Coefficient value')
plt.ylabel('Feature')

# Add gridlines for better readability
plt.grid(axis='x', linestyle='--', alpha=0.7)

# Show plot
plt.tight_layout()
plt.show()

# Interpretation: Most important survivability feature
### The biggest influence on whether a passenger survives is gender. This shows in the coefficient visualization above, where being male (the Sex_male feature) has the worst odds.
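
To make the link between a coefficient and the "odds" more concrete, here is a small sketch on toy data. Everything in it (`toy`, `toy_model`, the hand-picked effect sizes) is hypothetical and only illustrates the idea: exponentiating a logistic regression coefficient gives the factor by which the odds of survival change when that feature increases by one unit.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: survival is deliberately made less likely for males
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "Sex_male": rng.integers(0, 2, size=n).astype(float),
    "Age": rng.uniform(0, 1, size=n),  # assumed to be scaled to [0, 1] already
})
logit = 1.0 - 2.5 * toy["Sex_male"] - 0.5 * toy["Age"]
survived = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

toy_model = LogisticRegression(max_iter=1000).fit(toy, survived)

# exp(coefficient) = multiplicative change in the odds of survival
# when the feature goes up by one unit (for Sex_male: female -> male)
odds_ratios = pd.Series(np.exp(toy_model.coef_[0]), index=toy.columns)
print(odds_ratios.sort_values())
```

An odds ratio below 1 means the feature lowers the odds of survival. Coefficients are only comparable across features when the features are on similar scales, which is one reason to normalize them first.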

In [None]:
# 15 - Predicted vs. Actual Value Visualization

# Create the Plotly scatter plot
fig = go.Figure()

# Scatter plot for actual values
fig.add_trace(go.Scatter(
    x= np.arange(len(y_test)),
    y=y_test,
    mode='markers',
    name='Actual Values',
    marker=dict(color='blue', opacity=0.5, size=12),
    hovertemplate='Index: %{x}<br>Actual: %{y}<extra></extra>',
    hoverinfo='text'
))

# Scatter plot for predicted values
fig.add_trace(go.Scatter(
    x=np.arange(len(y_test)),
    y=y_pred,
    mode='markers',
    name='Predicted Values',
    marker=dict(color='red', opacity=0.5, size=8),
    hovertemplate='Index: %{x}<br>Predicted: %{y}<extra></extra>',
    hoverinfo='text'
))

# Add labels and title
fig.update_layout(
    title='Actual vs Predicted Values',
    xaxis_title='Index',
    yaxis_title='Value',
    legend_title='Legend',

)

# Show the plot
fig.show()

In [None]:
# 16 - Confusion Matrix Visualization

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix
fig, ax = plt.subplots(figsize=(8, 6))
cax = ax.matshow(cm, cmap='Blues')

# Add color bar
plt.colorbar(cax)

# Add labels, title and axes ticks
ax.set_xlabel('Predicted Label')
ax.set_ylabel('True Label')
ax.set_title('Confusion Matrix')

# Add labels to each cell in the matrix
for (i, j), val in np.ndenumerate(cm):
    ax.text(j, i, f'{val}', ha='center', va='center') 

# Set the tick labels
ax.set_xticks([0, 1])
ax.set_yticks([0, 1])
ax.set_xticklabels(['Not Survived', 'Survived'])
ax.set_yticklabels(['Not Survived', 'Survived'])

plt.show()

In [None]:
# 17 – Sanity check: the number of correctly predicted survivors should match the bottom-right cell of the confusion matrix

matching_survival_count = np.sum((y_test == 1) & (y_pred == 1))
print(f"Number of correct survival predictions: {matching_survival_count}")

# Doing it on my own
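
For reference, a sketch of the math behind the implementation below, using the code's names ($\theta$ for `theta`, $\hat{y}$ for `yhat`). The symbols $n$ (number of training rows) and $\eta$ (learning rate) are introduced here for illustration. The model predicts with the sigmoid function, the loss is the mean binary cross-entropy, and its gradient has a simple closed form:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \hat{y} = \sigma(X\theta)
$$

$$
L(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\left[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\right]
$$

$$
\nabla_\theta L = \frac{1}{n}\, X^\top (\hat{y} - y), \qquad \theta \leftarrow \theta - \eta\, \nabla_\theta L
$$

The code below uses $X^\top(\hat{y} - y)$ without the factor $1/n$, which amounts to a learning rate that is $n$ times larger.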

In [None]:
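# 11 – Define needed functions for own implementation

def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def get_loss(yhat, y, eps=1e-12):
    # Binary cross-entropy; clipping avoids log(0)
    yhat = np.clip(yhat, eps, 1 - eps)
    return np.mean(-y * np.log(yhat) - (1 - y) * np.log(1 - yhat))


def log_reg_gradient_descent(X, y, learning_rate, n_iterations):
    feature_names = X.columns
    # Work on plain NumPy arrays; otherwise pandas aligns the DataFrame and
    # the Series on their labels and fills the results with NaN
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1, 1)

    # Randomize theta as a starting value
    theta = np.random.randn(X.shape[1], 1)
    # List of losses to keep track of the training progress
    ls = []

    # Loop which handles the iterations
    for i in range(n_iterations):
        # forward pass
        z = X @ theta
        yhat = sigmoid(z)
        l = get_loss(yhat, y)

        # backward pass (gradient summed over all samples)
        dtheta = X.T @ (yhat - y)

        # optimization
        theta = theta - learning_rate * dtheta

        ls.append(l)
    # Return all coefficients, labelled with their feature names
    return (pd.Series(theta.ravel(), index=feature_names), ls)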

In [None]:
# 12 – Create and train own model

num_epochs = 100
lr = 0.001

t, loss_history = log_reg_gradient_descent(X_train, y_train, lr, num_epochs)


plt.plot(loss_history)

In [None]:
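# 13 – Predict with new values

def predict(X_new, theta, threshold=0.5):
    # Probability of survival for each row
    proba = sigmoid(np.asarray(X_new, dtype=float) @ np.asarray(theta, dtype=float))
    # Convert probabilities to class labels (0 = died, 1 = survived)
    return (proba >= threshold).astype(int).ravel()

y_pred = predict(X_test, t)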

In [None]:
# 14 – Visualize model coefs

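feature_coef = pd.Series(np.ravel(t), index=X_train.columns).sort_values(ascending=True)

# Create a horizontal bar plot
plt.figure(figsize=(10, 6))

# Plot horizontal bar plot
plt.barh(feature_coef.index, feature_coef.values, color='skyblue')

# Add labels and title
plt.title('Coefficients of own implementation')
plt.xlabel('Coefficient value')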
plt.ylabel('Feature')

# Add gridlines for better readability
plt.grid(axis='x', linestyle='--', alpha=0.7)

# Show plot
plt.tight_layout()
plt.show()

In [None]:
# 15 - Predicted vs. Actual Value Visualization

# Create the Plotly scatter plot
fig = go.Figure()

# Scatter plot for actual values
fig.add_trace(go.Scatter(
    x= np.arange(len(y_test)),
    y=y_test,
    mode='markers',
    name='Actual Values',
    marker=dict(color='blue', opacity=0.5, size=12),
    hovertemplate='Index: %{x}<br>Actual: %{y}<extra></extra>',
    hoverinfo='text'
))

# Scatter plot for predicted values
fig.add_trace(go.Scatter(
    x=np.arange(len(y_pred)),
    y=y_pred,
    mode='markers',
    name='Predicted Values',
    marker=dict(color='red', opacity=0.5, size=8),
    hovertemplate='Index: %{x}<br>Predicted: %{y}<extra></extra>',
    hoverinfo='text'
))

# Add labels and title
fig.update_layout(
    title='Actual vs Predicted Values',
    xaxis_title='Index',
    yaxis_title='Value',
    legend_title='Legend',

)

# Show the plot
fig.show()