In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

#### x_W5/L4-mlt-dip-iitm

# Logistic regression Implementation

- Logistic regression is the workhorse of machine learning.

- Before the deep learning era, logistic regression was the default choice for solving real-life classification problems with hundreds of thousands of features.

- It works in binary, multi-class and multi-label classification setups.

## Import Libraries


In [2]:
from IPython.display import display, Math, Latex # Imported for proper rendering of latex in colab

import numpy as np

# Import for generating plots
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(1234)
%matplotlib inline

As a good practice, set the random seed so that the same results can be reproduced across different runs of this notebook.
> We can set the random seed to any other number of our choice.

In [3]:
np.random.seed(1234)

# Implementation



## Model

- As we know, the logistic regression classifier calculates the probability of a sample, represented by a feature vector $x$, belonging to class 1: $\Pr(y = 1|x)$.
- It has two steps:
    1. Linear combination of features and
    2. Sigmoid activation
    
> Let's apply these steps to a single example to calculate its probability of belonging to class 1:
    1. The first step performs a linear combination of the features to obtain $z = w^Tx$.
    2. The second step applies the sigmoid (logistic) activation to $z$ to obtain the probability:
    $\Pr(y = 1|x) = \text{sigmoid}(z) = \frac{1}{1 + e^{-z}}$

The vectorized form enables us to compute probabilities for several examples all at once, as follows:
    
1. Vectorizing the linear combination of features leads to efficient computation: $z_{n \times 1} = X_{n \times m} w_{m \times 1}$
    
     * The resulting linear combination $z$ is a vector with $n$ components.

In [4]:
# Let's implement linear combination in vectorized form
def linear_combination(X:np.ndarray, w:np.ndarray) ->np.ndarray:
    return X@w

2. Vectorizing the sigmoid (logistic) activation gives a vector of probabilities or activations:
    $\Pr(y=1|X)_{n \times 1} = \text{sigmoid}(z_{n \times 1})$
    * The sigmoid function is applied to the vector $z$ with $n$ components, and the result is a probability (activation) vector with $n$ components.

In [5]:
# Let's implement sigmoid function in a vectorized form
def sigmoid(z:np.ndarray) -> np.ndarray:
    '''Calculates sigmoid of linear combination of features z.
    Args:
        z: np.ndarray of linear combinations, shape (n,)
        
    Returns:
        np.ndarray of sigmoid activations, shape (n,)
    '''
    return 1/(1+np.exp(-z))
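
The two vectorized steps above can be sketched end to end on a small example. The feature matrix and weight vector below are hypothetical values chosen only so that the arithmetic is easy to follow; they are not part of the original notebook.

```python
import numpy as np

# Hypothetical toy data: 3 samples, 2 features (illustrative values only).
X = np.array([[1.0, 2.0],
              [0.0, 0.0],
              [-1.0, -2.0]])
w = np.array([0.5, 0.25])

# Step 1: linear combination, one score per sample -> shape (3,)
z = X @ w

# Step 2: element-wise sigmoid, one probability per sample -> shape (3,)
probs = 1 / (1 + np.exp(-z))

print("z =", z)          # scores are 1, 0 and -1
print("probs =", probs)  # approximately 0.731, 0.5 and 0.269
```

Note that a score of 0 maps to a probability of exactly 0.5, and that scores of equal magnitude and opposite sign map to probabilities that sum to 1.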

Next, we apply a vectorized prediction (inference) function to the activations to obtain class labels. Specifically, if the activation (probability) is greater than the threshold, we label the sample with class 1; otherwise, we label it with class 0.

In [6]:
def predict(X:np.ndarray, w:np.ndarray, threshold:float) -> np.ndarray:
    '''Predicts class label for samples.
    
    The samples are represented with a bunch of features and are 
    presented in form of a feature matrix X. The class label is predicted as follows:
    * If sigmoid(Xw) > threshold, the sample is labeled with class 1.
        * else class 0.
        
    Args:
        X: feature matrix of shape (n, m)
        w: weight vector of shape (m,)
        threshold: probability threshold for classification.
    Returns:
        A vector of class labels of shape (n,)
    
    '''
    return np.where(sigmoid(linear_combination(X, w)) > threshold, 1, 0)

Let's label a couple of samples using the code that we have written so far:
- Two samples, each with two features plus a dummy feature: `np.array([[1, 20, 2], [1, 2, 2]])`, where the first column is the dummy feature set to 1, corresponding to the bias.
- Weight vector: `np.array([-1, 0, 1])`

In [7]:
feature_matrix = np.array([[1, 20, 2], [1, 2, 2]])
weight_vector = np.array([-1, 0, 1])

print("Shape of feature  matrix: ", feature_matrix.shape)
print("Shape of weight  vector: ", weight_vector.shape)

class_labels = predict(feature_matrix, weight_vector, 0.5)

print("Shape of output is; ", class_labels.shape)
print("The class label vector is: ", class_labels)

Shape of feature  matrix:  (2, 3)
Shape of weight  vector:  (3,)
Shape of output is;  (2,)
The class label vector is:  [1 1]


Both samples are labelled with class 1.
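
To see why both labels come out as 1, it can help to look at the probabilities before thresholding. The sketch below recomputes them inline with the same feature matrix and weight vector; the stricter threshold of 0.8 is a hypothetical value added purely for comparison.

```python
import numpy as np

feature_matrix = np.array([[1, 20, 2], [1, 2, 2]])
weight_vector = np.array([-1, 0, 1])

# Both samples have z = -1 + 0 + 2 = 1, because the second feature has weight 0.
probabilities = 1 / (1 + np.exp(-(feature_matrix @ weight_vector)))
print(probabilities)  # both approximately 0.731

# Compare the default threshold with a stricter, hypothetical one.
labels_by_threshold = {}
for threshold in (0.5, 0.8):
    labels_by_threshold[threshold] = np.where(probabilities > threshold, 1, 0)
    print(threshold, labels_by_threshold[threshold])
```

With the threshold at 0.5 both samples are assigned class 1, while at 0.8 the same probabilities fall below the threshold and both are assigned class 0.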

## Loss Function

Next, we will implement the binary cross entropy (BCE) loss with regularization. The base loss without regularization can be obtained by setting the regularization rate $\lambda$ to 0.

The generic form of the loss is as follows:
 Loss = BCE on training examples + $\lambda \times$ regularization penalty
 
Note that the regularization rate $\lambda$ controls the amount of regularization penalty to be used.

We use $L_2$ and $L_1$ regularization in logistic regression.

In order to write the loss in vectorized form, we first calculate the term inside the summation in vectorized form (products and logarithms are applied element-wise):
    $e = y \log(\text{sigmoid}(Xw)) + (1-y) \log(1 - \text{sigmoid}(Xw))$
    
With this, the loss becomes: $J(w) = -\mathbf{1}^T_{1 \times n} e_{n \times 1}$

Exercise: verify this equality.

Adding the $L_2$ penalty, we get: $J(w) = -\mathbf{1}^T e + \lambda w^T w$

Adding the $L_1$ penalty, we get: $J(w) = -\mathbf{1}^T e + \lambda \mathbf{1}^T |w|$

The loss function implements the vectorized loss calculation with the actual label vector, activation vector, weight vector, and $L_1$ and $L_2$ regularization rates.


The loss function returns a scalar quantity that denotes the loss on all training examples for a particular choice of the weight vector.

In [8]:
def loss(y, sigmoid_vector, weight_vector, l1_reg_rate, l2_reg_rate):
    '''Returns the BCE loss plus L2 and L1 penalties as a scalar.'''
    bce = -1 * np.sum(y * np.log(sigmoid_vector) + (1-y) * np.log(1-sigmoid_vector))
    l2_penalty = l2_reg_rate * np.dot(np.transpose(weight_vector), weight_vector)
    l1_penalty = l1_reg_rate * np.sum(np.abs(weight_vector))
    return bce + l2_penalty + l1_penalty
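
As a rough check of the vectorized loss, the sketch below evaluates the per-sample term $e$ on hypothetical labels and probabilities and confirms that $-\mathbf{1}^T e$ equals the negated sum. The clipping step is an added assumption, not part of the original code: it keeps `np.log` away from exact 0 or 1, where it would return `-inf`.

```python
import numpy as np

# Hypothetical labels, predicted probabilities and weights (illustrative only).
y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.4])
w = np.array([0.5, -1.0, 2.0])

# Assumption: clip probabilities so that log(0) can never occur.
p_safe = np.clip(p, 1e-15, 1 - 1e-15)

# Per-sample term e, then two equivalent ways of reducing it to a scalar.
e = y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe)
bce_sum = -np.sum(e)
bce_dot = -np.ones(len(e)) @ e   # -1^T e

l2_penalty = w @ w               # w^T w
l1_penalty = np.sum(np.abs(w))   # 1^T |w|

print(np.isclose(bce_sum, bce_dot))
print(bce_sum + 0.1 * l2_penalty)  # loss with a hypothetical L2 rate of 0.1
```

Setting both regularization rates to 0 leaves only the BCE term, which matches the base loss described above.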

## Optimization

Next, we will implement optimization. For that, we will make use of iterative optimization techniques such as gradient descent (GD), mini-batch gradient descent (MBGD) or stochastic gradient descent (SGD).

We will reuse the GD implementation from linear regression.

We need to modify the gradient update rule so that it suits the logistic regression loss:
    * STEP 1: Calculate the gradient of the loss function and
    * STEP 2: Scale the gradient with the learning rate and use it for updating the weight vector.
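
As a minimal sketch of these two steps, the snippet below performs a single update on a hypothetical two-sample dataset. It uses the unregularized gradient $X^T(\text{sigmoid}(Xw) - y)$, which is discussed in the next subsection; the data and the learning rate are illustrative assumptions.

```python
import numpy as np

# Hypothetical tiny dataset: a dummy bias column plus one feature.
X = np.array([[1.0, 2.0],
              [1.0, -1.0]])
y = np.array([1.0, 0.0])
w = np.zeros(2)
lr = 0.1  # hypothetical learning rate

# STEP 1: gradient at w = 0, where every activation is sigmoid(0) = 0.5.
activations = 1 / (1 + np.exp(-(X @ w)))
grad = X.T @ (activations - y)

# STEP 2: scale by the learning rate and move against the gradient.
w_new = w - lr * grad

print("gradient:", grad)          # components 0 and -1.5
print("updated weights:", w_new)  # components 0 and 0.15
```

The bias weight stays at 0 because the two residuals cancel, while the feature weight increases, pushing the first sample towards class 1 and the second towards class 0.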
    
    

### Gradient of loss function

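The gradient of the loss function can be calculated (in vectorized form) as follows:
$\frac{\partial J(w)}{\partial w} = X^T(\text{sigmoid}(Xw)-y) + \lambda w$

Only the $L_2$ penalty is included here. Strictly, differentiating $\lambda w^T w$ gives $2\lambda w$; the constant factor 2 is dropped in this expression, which amounts to rescaling $\lambda$.

It is implemented with the `calculate_gradient` function, which takes the feature matrix $X$, label vector $y$, weight vector $w$ and regularization rate $\lambda$ as arguments and efficiently calculates the gradient of the loss function w.r.t. the weight vector in vectorized form.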

In [9]:
def calculate_gradient(X:np.ndarray, y:np.ndarray, w:np.ndarray, reg_rate:float) -> np.ndarray:
    '''Calculates gradient of loss function w.r.t. weight vector on training set.
    The gradient is calculated with the following vectorized operation:
        np.transpose(X) @ (sigmoid(Xw) - y) + reg_rate * w
    Args:
        X: Feature matrix for training data.
        y: label vector for training data.
        w: weight vector.
        reg_rate: regularization rate
    Returns:
        A vector of gradients
    '''
    return np.transpose(X)@(sigmoid(linear_combination(X, w)) - y) + reg_rate * w
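
One common way to gain confidence in an analytic gradient is to compare it against central finite differences. The sketch below does this on random data; the helper names are hypothetical, and the penalty is taken as $\frac{\lambda}{2} w^T w$ (an assumption of this sketch) so that its derivative is exactly $\lambda w$, matching the gradient expression above.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss_half_l2(X, y, w, reg_rate):
    # BCE plus (reg_rate / 2) * w^T w; the 1/2 factor is an assumption of
    # this sketch so that the penalty's derivative is exactly reg_rate * w.
    p = sigmoid(X @ w)
    bce = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return bce + 0.5 * reg_rate * (w @ w)

def analytic_gradient(X, y, w, reg_rate):
    return X.T @ (sigmoid(X @ w) - y) + reg_rate * w

def numerical_gradient(X, y, w, reg_rate, eps=1e-6):
    # Central difference along each coordinate of w.
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (loss_half_l2(X, y, w + step, reg_rate)
                   - loss_half_l2(X, y, w - step, reg_rate)) / (2 * eps)
    return grad

# Hypothetical random problem: 20 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

max_diff = np.max(np.abs(analytic_gradient(X, y, w, 0.1)
                         - numerical_gradient(X, y, w, 0.1)))
print("largest absolute difference:", max_diff)
```

A difference close to floating-point noise suggests that the vectorized gradient and the loss agree; a large difference usually points to a sign error or a mismatched penalty term.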

As part of the implementation, we store the loss and the weight vector from each GD step as class member variables.
- The step-wise loss is used for plotting a learning curve in order to ensure that the model is training as expected.

- The step-wise weight vector is useful in studying the trajectory of gradient descent in the loss landscape.

# Logistic regression class implementation

We combine these different components into a single Python class named `LogisticRegression`.

It has the following class member variables:
   1. Weight vector
   2. Loss and weight vector in each GD step

In [10]:
class LogisticRegression(object):
    """Logistic regression model.
    y = sigmoid(X @ w)
    """
    def set_weight_vector(self, w):
        self.w = w
        
    # Let's implement linear combination in vectorized form
    def linear_combination(self, X:np.ndarray) ->np.ndarray:
        return X@self.w
    
    def sigmoid(self, z:np.ndarray):
        """Return probability of input belonging class 1.
        Args:
            z: (n, ) np.ndarray
        Returns:
            sigmoid activation vector(n, ) np.ndarray
        
        """
        return 1/(1+np.exp(-z))
    
    def activation(self, X:np.ndarray) -> np.ndarray:
        '''Calculates sigmoid activation for logistic regression.
        The sigmoid activation is calculated with the following vectorized form:
            act = sigmoid(Xw)
            
        Args:
            X: feature matrix with shape (n, m)
        Returns:
            activation vector with the shape (n,)'''
        return self.sigmoid(self.linear_combination(X))
    
    def predict(self, X:np.ndarray, threshold: float = 0.5):
        """Classify input data.
        Args:
            X : (N, D) np.ndarray
            threshold : float, optional
                threshold of binary classification (default is 0.5)
        Returns:
            (N, ) np.ndarray 
        """
        return (self.activation(X) > threshold).astype(int)
    
    def loss(self, X:np.ndarray, y:np.ndarray, reg_rate:float) -> float:
        '''Calculates binary cross entropy loss on training set.
        
        '''
        predicted_prob = self.activation(X)
        return (-1 * (np.sum(y * np.log(predicted_prob) + 
                             (1-y) * np.log(1 - predicted_prob))) +
                            reg_rate * np.dot(np.transpose(self.w), self.w))
    
    def calculate_gradient(self, X:np.ndarray, y:np.ndarray,
                          reg_rate:float) -> np.ndarray:
        '''Calculates gradients of loss function w.r.t. weight vector on training set.
        
        Returns:
            A vector of gradients.'''
        return np.transpose(X)@(self.activation(X) -y) + reg_rate * self.w
    
    def update_weights(self, grad:np.ndarray, lr:float) -> np.ndarray:
        '''Updates the weights based on the gradient of loss function.
        
        Weight updates are carried out with the following formula:
            w_new := w_old -lr * grad
            
        Args:
            grad: gradient of loss w.r.t. w
            lr: learning rate
        Returns:
            Updated weight vector
            '''
        return (self.w - lr*grad)
    
    def gd(self, X:np.ndarray, y:np.ndarray, num_epochs:int, lr:float, reg_rate:float) -> np.ndarray:
        '''Estimates parameters of logistic regression model through gradient descent.
        Args:
            X: Feature matrix for training data.
            y: Label vector for training data.
            num_epochs: Number of training steps
            lr: learning rate
            reg_rate: regularization rate
        Returns:
            Weight vector: Final weight vector'''
        self.w = np.zeros(X.shape[1])
        self.w_all = []
        self.err_all = []
        for i in np.arange(0, num_epochs):
            dJdW = self.calculate_gradient(X, y, reg_rate)
            self.w_all.append(self.w)
            self.err_all.append(self.loss(X, y, reg_rate))
            self.w = self.update_weights(dJdW, lr)
        return self.w
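
To illustrate how the training loop behaves, the sketch below runs the same record, gradient, update sequence as `gd` on a small synthetic dataset. It is written as a standalone script rather than through the class so that it runs on its own; the data generator, learning rate and number of epochs are all hypothetical choices, and regularization is left out for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce_loss(X, y, w):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical synthetic data: two overlapping Gaussian blobs in 2-D.
rng = np.random.default_rng(42)
n_per_class = 50
class0 = rng.normal(loc=-1.0, scale=1.0, size=(n_per_class, 2))
class1 = rng.normal(loc=1.0, scale=1.0, size=(n_per_class, 2))
features = np.vstack([class0, class1])
X = np.column_stack([np.ones(2 * n_per_class), features])  # dummy bias feature
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])

# Same loop structure as gd: record the loss, compute the gradient, update.
w = np.zeros(X.shape[1])
lr, num_epochs = 0.005, 200  # hypothetical hyperparameters
losses = []
for _ in range(num_epochs):
    losses.append(bce_loss(X, y, w))
    grad = X.T @ (sigmoid(X @ w) - y)
    w = w - lr * grad

accuracy = np.mean((sigmoid(X @ w) > 0.5).astype(int) == y)
print("first loss:", losses[0])
print("last loss:", losses[-1])
print("training accuracy:", accuracy)
```

With zero initial weights, every activation starts at 0.5, so the first recorded loss is $n \log 2$. The recorded `losses` list plays the role of `err_all` and could be plotted as a learning curve.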

In this section, we implemented a binary logistic regression classifier from scratch. First we implemented all its components in vectorized form, and then we combined them into a single Python class.