# Logistic regression model

This notebook is organised in two parts:

* the *first part* implements a **logistic regression model from scratch**, as we did for the linear regression model.

* the *second part* makes use of **high-level APIs** for a concise implementation of the same logistic regression model.

Part of this code is based on [this tutorial](https://towardsdatascience.com/a-logistic-regression-from-scratch-3824468b1f88).

In [None]:
# importing necessary modules
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from IPython import display

## Loading the dataset

We will be using a **modified version of the Titanic dataset** from [Kaggle](https://www.kaggle.com/azeembootwala/titanic). This version was adapted for logistic regression.

There are two files:

**`train_data.csv`**: a dataset of 792 instances and 16 features. The `survived` column is the target variable. The `parch` and `sibsp` columns from the original dataset were replaced by a single column called `Family size`.

All categorical data, such as `Embarked` and `pclass`, have been re-encoded using one-hot encoding. Additionally, four columns, `Title1` to `Title4` (Mr, Mrs, Master, Miss), were engineered from the `Name` column; they distinguish males and females depending on whether they were married or not. This allows an additional analysis of whether `Married` people were more likely to survive, and whether the trend is similar for both genders.

All missing values have been filled with the median of the column values. All real-valued data columns have been normalized.
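
As a rough illustration of the preprocessing described above, the sketch below fills missing values with the column median and rescales each column to the range 0 to 1. The `toy` frame and its values are made up for this example, and min-max scaling is an assumption about how the columns were normalized; this is not the code used to build the Kaggle files.

```python
import numpy as np
import pandas as pd

# made-up data: one missing age
toy = pd.DataFrame({'Age': [22.0, np.nan, 38.0, 26.0],
                    'Fare': [7.25, 71.28, 8.05, 53.10]})

# fill missing values with the median of each column
toy = toy.fillna(toy.median())

# min-max scaling: (x - min) / (max - min), so each column ends up in [0, 1]
toy_scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(toy_scaled)
```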

**`testdata.csv`**: a dataset of 100 instances and 16 features, prepared in the same way as the training dataset.

In [None]:
# loading the training dataset
# REMEMBER to upload this dataset into Colab
df1 = pd.read_csv('/content/titanic_train_data.csv')
df1.shape

In [None]:
# basic information about the dataset (columns, data types, missing data)
df1.info()

The target variable is `Survived`; all other columns are features.

A short description:

* `Sex`: 0 or 1 => male or female
* `Age`: value rescaled between 0 and 1
* `Fare`: ticket price rescaled between 0 and 1
* `Pclass_1` .. `Pclass_3`: passenger class, one-hot encoded
* `Family_size`: family size rescaled between 0 and 1
* `Title_1` .. `Title_4`: title (Mr, Mrs, Master, Miss), one-hot encoded
* `Emb_1` .. `Emb_3`: embarkation location, one-hot encoded

In total we will have 14 features.
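
To make the one-hot columns in the list above concrete, here is a minimal sketch using `pd.get_dummies` on a made-up `Pclass` column. The `toy` frame is hypothetical and only mirrors the naming pattern of the real columns.

```python
import pandas as pd

# made-up passenger classes
toy = pd.DataFrame({'Pclass': [1, 3, 2, 3]})

# one column per category; exactly one of them is 1 in each row
one_hot = pd.get_dummies(toy['Pclass'], prefix='Pclass').astype(int)
print(one_hot)
```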

---

### Data pre-processing

Let's first remove some unnecessary columns from the dataset, as they won't be used in our model.

In [None]:
# Removing unnecessary columns
df1 = df1.drop(['Unnamed: 0', 'PassengerId'], axis=1)

In [None]:
# Example instances
df1.sample(5)

## First approach: regression model from scratch

Let's split the training dataset into features (`X`) and the target variable (`Y`). We have a total of 792 examples, so `Y` has shape `(m,)` with `m = 792`. For `X` we expect `(m, 14)`, where the columns are the features.

In [None]:
# splitting the training dataset into features (X) and target (Y)
X_train = df1.iloc[:,1:].to_numpy()
Y_train = df1['Survived'].to_numpy()

In [None]:
X_train.shape, Y_train.shape

We need to **transpose** the input feature vector in order to perform the **dot product** necessary for the logistic regression model.
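
A quick shape check with made-up arrays shows why the transpose is needed: with one weight per feature, the dot product only works when the features run along the first axis of `X`. The sizes and the names `X_t` and `Z` are chosen for this sketch only.

```python
import numpy as np

m, n_features = 5, 14                      # made-up sizes for the check
X = np.random.rand(m, n_features)          # one row per example, as loaded
W = np.zeros(n_features)                   # one weight per feature

# after transposing, X has one column per example ...
X_t = X.T                                  # shape (14, 5)
# ... so the dot product yields one value per example
Z = np.dot(W.T, X_t)
print(X_t.shape, Z.shape)                  # (14, 5) (5,)
```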

In [None]:
# transposing the feature vector (X)
X_train = X_train.T
X_train.shape

### Model definition

Let's start by defining the **activation function**.

In [None]:
# custom sigmoid activation function
def sigmoid(Z):
    A = 1 / (1 + np.exp(-Z))
    return A
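
For very negative inputs, `np.exp(-Z)` can overflow and raise a runtime warning, even though the result still rounds to 0. One common workaround, sketched here as an optional alternative rather than what this notebook uses, is to choose the formula by the sign of the input. The name `stable_sigmoid` is introduced only for this example.

```python
import numpy as np

def stable_sigmoid(Z):
    Z = np.asarray(Z, dtype=float)
    out = np.empty_like(Z)
    pos = Z >= 0
    # for Z >= 0, exp(-Z) is at most 1, so it cannot overflow
    out[pos] = 1.0 / (1.0 + np.exp(-Z[pos]))
    # for Z < 0, use the equivalent form exp(Z) / (1 + exp(Z))
    exp_z = np.exp(Z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```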

The next step is the **forward function**, which implements the dot product and makes use of the activation function.

We can split this into two steps:

$$Z = W^\top X + b$$
$$A = \sigma(Z)$$


In [None]:
# custom forward pass function
def forward(X, W, b):
    Z = np.dot(W.T, X) + b
    A = sigmoid(Z)
    return A

The **loss function** should be a **binary cross entropy**, as we have only two target classes (`Survived = 1` or `0`).

$$loss = -\frac{1}{m}\sum_{i=1}^{m} \left[ y_i\log(A_i) + (1 - y_i)\log(1 - A_i) \right]$$
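
To get a feel for the formula above, the sketch below evaluates it by hand for three made-up examples. Confident correct predictions contribute small terms, while the hesitant prediction (0.6 for a positive label) contributes the largest one.

```python
import numpy as np

# made-up labels and predicted probabilities for three passengers
y = np.array([1, 0, 1])
A = np.array([0.9, 0.2, 0.6])

# per-example terms: -[y*log(A) + (1-y)*log(1-A)]
terms = -(y * np.log(A) + (1 - y) * np.log(1 - A))
# the loss is the mean of the per-example terms
bce = terms.mean()
print(terms.round(3), round(bce, 3))
```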


In [None]:
# custom loss function (binary cross entropy)
# A: predicted probabilities, Y: true labels (0 or 1)
# epsilon is a small value we add to avoid computing log(0)
def loss(A, Y, epsilon=1e-15):
    m = len(A)
    l = -1/m * np.sum(Y * np.log(A + epsilon) + (1 - Y) * np.log(1 - A + epsilon))
    return l

Next is the **backward pass**. For this, we need to differentiate the loss function with respect to `W` and `b`.

$$\frac{\partial loss}{\partial W} = \frac{1}{m} X(A - Y)^\top$$

$$\frac{\partial loss}{\partial b} = \frac{1}{m}\sum_{i=1}^{m} (A_i - y_i)$$
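
One way to gain confidence in the formulas above is to compare them against finite differences. The sketch below does this on small random data; the helper `bce`, the data, and the step size `h` are assumptions made for this check, not part of the notebook.

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def bce(W, b, X, Y):
    # mean binary cross entropy for features X (n, m) and labels Y (m,)
    A = sigmoid(np.dot(W.T, X) + b)
    return -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 20))               # 3 made-up features, 20 examples
Y = rng.integers(0, 2, size=20)
W, b = rng.normal(size=3), 0.1

# analytic gradients from the formulas above
A = sigmoid(np.dot(W.T, X) + b)
dW = np.dot(X, (A - Y).T) / X.shape[1]
db = np.mean(A - Y)

# central finite differences as an independent check
h = 1e-6
dW_num = np.array([(bce(W + h * e, b, X, Y) - bce(W - h * e, b, X, Y)) / (2 * h)
                   for e in np.eye(3)])
db_num = (bce(W, b + h, X, Y) - bce(W, b - h, X, Y)) / (2 * h)
print(np.allclose(dW, dW_num, atol=1e-6), np.isclose(db, db_num, atol=1e-6))
```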


In [None]:
# custom backward pass function
def backward(X, Y, A):
    # number of examples
    m = len(Y)
    dW = 1/m * np.dot(X, (A - Y).T)
    db = 1/m * np.sum(A - Y)
    return (dW, db)

This step implements the **update function**, which applies one gradient descent step to the weights and bias using the gradients from the backward pass.
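
Written out, the update rule looks as follows, where $\alpha$ denotes the learning rate (the symbol is introduced here; the code calls it `learning_rate`):

$$W \leftarrow W - \alpha \frac{\partial loss}{\partial W}$$

$$b \leftarrow b - \alpha \frac{\partial loss}{\partial b}$$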

In [None]:
# custom update function: one gradient descent step on weights and bias
def update(W, b, dW, db, learning_rate = 0.01):
    W = W - learning_rate * dW
    b = b - learning_rate * db
    return (W, b)

As the activation function returns a probability between 0 and 1, we need a custom function to round values <= 0.5 to 0 and values > 0.5 to 1.

In [None]:
# custom round function: 1 if A > 0.5, else 0
def roundValue(A):
    return np.uint8(A > 0.5)

The last step is the definition of our **accuracy metric**.

In [None]:
# custom accuracy function
def accuracy(yhat, Y):
    return round(np.sum(yhat==Y) / len(yhat) * 1000) / 10

### Model instantiation and training

In [None]:
# initializing model parameters
# random seed (for reproducibility)
np.random.seed(2022)
# we have 14 features in the dataset
W = 0.01 * np.random.randn(14)
# and a constant bias
b = 0

In [None]:
# hyperparameters for training
num_iterations = 500
lr = 0.01

# we will record loss and accuracy for plotting
losses, acces = [], []
# main training loop
for i in range(num_iterations):
    # forward pass
    A = forward(X_train, W, b)
    # loss calculation
    l = loss(A, Y_train)
    # round the predicted value
    yhat = roundValue(A)
    # accuracy calculation
    acc = accuracy(yhat, Y_train)
    # backpropagation pass - update weights and bias
    dW, db = backward(X_train, Y_train, A)
    W, b = update(W, b, dW, db, learning_rate=lr)
    # keep record of loss and accuracy
    losses.append(l)
    acces.append(acc)
    # checkpoint
    if i % 50 == 0:
        print(f'loss: {l}\taccuracy: {acc}%')

### Visualising training performance

In [None]:
with plt.xkcd():
  fig, ax = plt.subplots(1, 1, figsize=(8, 4))
  ax.plot(np.arange(len(losses)), losses, 'b-', label='loss')
  xlab, ylab = ax.set_xlabel('epoch'), ax.set_ylabel('loss')

In [None]:
with plt.xkcd():
  fig, ax = plt.subplots(1, 1, figsize=(8, 4))
  ax.plot(np.arange(len(acces)), acces, 'b-', label='accuracy')
  xlab, ylab = ax.set_xlabel('epoch'), ax.set_ylabel('accuracy')

### Testing the model on the test dataset

In [None]:
# loading the testing dataset
# REMEMBER to upload this dataset into Google Colab
df2 = pd.read_csv('/content/titanic_test_data.csv')
df2.shape

In [None]:
# data pre-processing step
df2 = df2.drop(['Unnamed: 0', 'PassengerId'], axis=1)
X_test = df2.iloc[:,1:].to_numpy()
Y_test = df2['Survived'].to_numpy()
X_test = X_test.T
X_test.shape

In [None]:
# testing loop
# NOTICE that we start from the weights and bias of the trained model;
# the update step in this loop keeps adjusting them on the test data
num_iterations = 500
lr = 0.01

losses, acces = [], []
# main loop
for i in range(num_iterations):
    A = forward(X_test, W, b)
    l = loss(A, Y_test)  # loss function
    yhat = roundValue(A)
    acc = accuracy(yhat, Y_test)
    dW, db = backward(X_test, Y_test, A)
    W, b = update(W, b, dW, db, learning_rate=lr)
    losses.append(l)
    acces.append(acc)
    if i % 50 == 0:
        print(f'loss: {l}\taccuracy: {acc}%')

In [None]:
# visualising model's performance over the testing data
with plt.xkcd():
    fig, ax = plt.subplots(1, 2, figsize=(14, 5))
    ax[0].plot(np.arange(len(losses)), losses, 'b-', label='loss')
    xlab, ylab = ax[0].set_xlabel('epoch'), ax[0].set_ylabel('loss')
    ax[1].plot(np.arange(len(acces)), acces, 'b-', label='accuracy')
    xlab, ylab = ax[1].set_xlabel('epoch'), ax[1].set_ylabel('accuracy')
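
A common way to measure performance on held-out data is a single forward pass with the trained parameters and no update step, so the test examples never influence `W` and `b`. The sketch below shows that pattern; `evaluate` and the `*_demo` arrays are hypothetical names with made-up values, not part of the notebook.

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def evaluate(X, Y, W, b, epsilon=1e-15):
    # forward pass only: the parameters are read, never updated
    A = sigmoid(np.dot(W.T, X) + b)
    bce = -np.mean(Y * np.log(A + epsilon) + (1 - Y) * np.log(1 - A + epsilon))
    # accuracy in percent, using the same 0.5 threshold as roundValue
    acc = np.mean((A > 0.5) == Y) * 100
    return bce, acc

# made-up stand-ins for trained parameters and a test set of 4 examples
W_demo, b_demo = np.array([2.0, -1.0]), 0.0
X_demo = np.array([[1.0, -1.0, 0.5, -2.0],
                   [0.0,  1.0, 0.5,  1.0]])   # shape (features, examples)
Y_demo = np.array([1, 0, 1, 1])

print(evaluate(X_demo, Y_demo, W_demo, b_demo))
```

With the notebook's own arrays, the equivalent call would be `evaluate(X_test, Y_test, W, b)`, made right after training and before any further updates.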

## Second approach: regression model using high-level APIs

We are going to read the input features and target variable again, as we need an unmodified (i.e., not transposed) version of the data.

In [None]:
# loading training features and target variable
X_train = df1.iloc[:,1:].to_numpy()
Y_train = df1['Survived'].to_numpy()

In [None]:
# instantiating the model with only one layer
model = tf.keras.Sequential([
    # dense layer with 14 input features, one output, and sigmoid activation function
    tf.keras.layers.Dense(units=1, input_shape=[14], activation='sigmoid'),
])

In [None]:
# model hyperparameters: optimizer, loss function, and performance metric
model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['acc'])

In [None]:
# training the model and keeping track of loss and accuracy
train_history = model.fit(X_train, Y_train, epochs=50)

In [None]:
# Extracting weights and bias from the trained model
W_tf, b_tf = [x.numpy() for x in model.weights]
W_tf, b_tf
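
The extracted arrays play the same role as `W` and `b` in the first part. As a sketch with made-up numbers (not the trained values), a kernel of shape `(14, 1)` and a bias of shape `(1,)` reproduce the layer's output as a sigmoid over `X @ W + b`. Because examples are stored one per row here, no transpose is needed.

```python
import numpy as np

# made-up values with the shapes a Dense(units=1) layer on 14 inputs uses:
# kernel (14, 1) and bias (1,)
rng = np.random.default_rng(0)
W_demo = rng.normal(size=(14, 1))
b_demo = np.zeros(1)
X_demo = rng.random(size=(5, 14))           # 5 examples, one per row

# same computation as the from-scratch forward pass, without transposing X
probs = 1 / (1 + np.exp(-(X_demo @ W_demo + b_demo)))
print(probs.shape)                          # (5, 1)
```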

In [None]:
# visualising model's performance
with plt.xkcd():
  fig, ax = plt.subplots(1, 1, figsize=(8, 4))
  ax.plot(np.arange(50), train_history.history['loss'], 'b-', label='loss')
  xlab, ylab = ax.set_xlabel('epoch'), ax.set_ylabel('loss')

In [None]:
with plt.xkcd():
  fig, ax = plt.subplots(1, 1, figsize=(8, 4))
  ax.plot(np.arange(50), train_history.history['acc'], 'b-', label='accuracy')
  xlab, ylab = ax.set_xlabel('epoch'), ax.set_ylabel('accuracy')

In [None]:
X_test = df2.iloc[:,1:].to_numpy()
Y_test = df2['Survived'].to_numpy()
X_test.shape

In [None]:
# using the model for predicting values
ynew = model.predict(X_test)
# show the first input feature and the predicted probability for each example
for i in range(len(X_test)):
    print("First input feature: %s, Predicted probability: %s" % (X_test[i][0], ynew[i][0]))
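
`model.predict` returns probabilities of shape `(n, 1)` for this single-unit model, while the labels have shape `(n,)`. Below is a minimal sketch, with made-up values, of turning those probabilities into class labels and an accuracy; the `*_demo` names are hypothetical.

```python
import numpy as np

# made-up probabilities in the (n, 1) shape returned by predict, and labels (n,)
ynew_demo = np.array([[0.8], [0.3], [0.6], [0.1]])
Y_demo = np.array([1, 0, 0, 0])

# flatten before comparing; (n, 1) == (n,) would broadcast to an (n, n) matrix
labels = (ynew_demo.ravel() > 0.5).astype(int)
acc = np.mean(labels == Y_demo)
print(labels, acc)
```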

In [None]:
# checking model's performance
test_loss, test_acc = model.evaluate(X_test, Y_test, verbose=2)

print('\nTest accuracy:', test_acc)