# Week 7 - Online Learning (With Solutions)

Welcome to this tutorial on building a Logistic Regression model from scratch! In this notebook, you will work through the fundamental concepts and inner workings of online learning using Logistic Regression, a widely used algorithm for classification tasks in machine learning.

## Authors
- Hossein A. Rahmani (hossein.rahmani.22@ucl.ac.uk)
- Sahan Bulathwela (m.bulathwela@ucl.ac.uk)

## Learning Outcomes
- **Understanding Logistic Regression:** Delve into the theoretical underpinnings of Logistic Regression and gain clarity on how it's used for binary classification problems. Learn about the sigmoid function, which is at the heart of this algorithm, and discover its significance in converting linear outputs into probabilistic predictions.
- **Comparison with Libraries:** Gain insights into how your scratch-built Logistic Regression model compares with established machine learning libraries like scikit-learn. Understand the underlying similarities and differences, and appreciate the convenience that libraries bring to real-world projects.
- **Utilise Different Online Learning Algorithms:** Familiarise yourself with the online learning algorithms available in popular machine learning libraries.

## Task

Classification using Logistic Regression and other online learning algorithms.

## Logistic Regression from Scratch

Sources: 
- https://www.analyticsvidhya.com/blog/2022/02/implementing-logistic-regression-from-scratch-using-python/
- https://developer.ibm.com/articles/implementing-logistic-regression-from-scratch-in-python/
- https://github.com/AssemblyAI-Examples/Machine-Learning-From-Scratch

### Importing Libraries

We first import the libraries needed to implement a logistic regression model, import the dataset, and evaluate the model on that dataset.

In [1]:
import numpy as np 
from numpy import log, dot, e, shape

from tqdm import tqdm

# to split dataset into train and test parts
from sklearn.model_selection import train_test_split
# to load dataset
from sklearn import datasets

import matplotlib.pyplot as plt

# sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

### Logistic Regression: Step-by-Step

#### Initializing Parameters

Fitting a logistic regression model starts with initializing four variables. The first two are hyperparameters supplied as inputs:

1. **Learning Rate**: This critical factor governs the step magnitude taken in each optimization iteration. It significantly impacts the model's convergence speed and accuracy.

2. **Number of Iterations**: Defining this variable establishes how frequently the optimization algorithm refines the model through multiple passes over the training data. It plays a role in achieving convergence and fine-tuning the model.

Moreover, two essential model components must be set:

3. **Weights**: These coefficients quantify feature influence in the logistic regression model. They shape predictions by assigning importance to each feature.

4. **Bias**: The bias term (intercept) adds a constant offset to the linear output, shifting the decision boundary so that it does not have to pass through the origin.

In summary, the learning rate, number of iterations, weights, and bias constitute the foundation for launching logistic regression, facilitating accurate and effective modeling.

In [2]:
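# initialise the required variables for the logistic regression parameters
def init(learning_rate, number_iters):
    lr = learning_rate        # step size of each gradient descent update
    max_iter = number_iters   # number of gradient descent iterations
    weights = None            # set during fitting, once the number of features is known
    bias = None               # scalar offset, also set during fitting
    return lr, max_iter, weights, bias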

#### Sigmoid Function

In a linear regression model, the hypothesis function is a linear combination of the parameters, given as $\hat{y} = wx + b$ for data with a single feature. This lets us predict continuous values by finding the best-fitting line through the dataset. In logistic regression, however, the response variable is binary, either ‘yes’ or ‘no’, so an unbounded linear output is a poor fit: we want predictions between 0 and 1. The standard way to squash the output of a linear equation into the interval (0, 1) is the `sigmoid` or `logistic` function. In Logistic Regression, we model probabilities instead of a specific value, which makes it suitable for classification problems. To do this, we pass the linear output through a sigmoid function to obtain probabilities:

$s(x)=\frac{1}{1+e^{-x}}$

Finally, we have:

$\hat{y} = h_{\theta}(x) = \frac{1}{1+e^{-(wx+b)}}$

In [3]:
# here x is equal to wx+b
def sigmoid(x):
    sig = 1/(1+e**(-x))
    return sig
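
For inputs that are large and negative, `e**(-x)` can overflow and NumPy may emit a runtime warning. A minimal sketch of a numerically safer variant is shown below; the name `stable_sigmoid` is hypothetical and not part of the original notebook, and the function always returns an array, even for a scalar input.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that avoids overflow by never exponentiating a large positive number."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # for z >= 0, exp(-z) is at most 1, so the usual form is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # for z < 0, use the equivalent form exp(z) / (1 + exp(z)); exp(z) is below 1 here
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

probs = stable_sigmoid([-1000.0, 0.0, 1000.0])
print(probs)
```

If SciPy is available, `scipy.special.expit` computes the same function.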

#### Cost Function

The cost function, also known as the `loss function`, measures the disparity between the predicted and true values. In linear regression, the least squared error is used as the cost function. In logistic regression, however, the least squared error becomes non-convex once the sigmoid is applied, which makes gradient descent more likely to get trapped in local minima. To address this, the `log loss function` (binary cross-entropy), which is convex in the parameters, is used as the cost function instead. For $N$ training samples, the cost function for logistic regression is:

$J(w, b) = J(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\left[y^{i}\log(h_\theta(x^{i}))+(1-y^{i})\log(1-h_\theta(x^{i}))\right]$

And the gradients with respect to the weights and bias, writing $\hat{y}^{i} = h_\theta(x^{i})$, are:

$J^{'}(\theta)=\begin{bmatrix} \frac{dJ}{dw} \\ \frac{dJ}{db} \end{bmatrix}=\begin{bmatrix} ... \end{bmatrix}=\begin{bmatrix} \frac{1}{N}\sum_{i=1}^{N}x^{i}(\hat{y}^{i}-y^{i}) \\ \frac{1}{N}\sum_{i=1}^{N}(\hat{y}^{i}-y^{i}) \end{bmatrix}$

In [4]:
def fit(self, X, y):
    # loading variables
    n_samples, n_features = X.shape
    self.weights = np.zeros(n_features)
    self.bias = 0

    for _ in tqdm(range(self.n_iters)):
        # making prediction
        linear_pred = np.dot(X, self.weights) + self.bias
        predictions = sigmoid(linear_pred)

        # gradients of the cross-entropy loss with respect to the weights and bias
        dw = (1/n_samples) * np.dot(X.T,(predictions -y ))
        db = (1/n_samples) * np.sum(predictions-y)
        
        # updating weights and bias
        self.weights = self.weights - self.lr * dw
        self.bias = self.bias - self.lr * db
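
One way to sanity-check the gradient expressions used in `fit` is to compare them with finite differences of the log loss. The sketch below does this on a small synthetic problem. The helper names (`sigmoid_np`, `binary_log_loss`, `analytic_gradients`), the random data, and the step size `h` are illustrative assumptions rather than part of the original notebook.

```python
import numpy as np

def sigmoid_np(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_log_loss(w, b, X, y, eps=1e-12):
    # clip probabilities so that log(0) never occurs
    p = np.clip(sigmoid_np(X @ w + b), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_gradients(w, b, X, y):
    # same expressions as dw and db in fit
    err = sigmoid_np(X @ w + b) - y
    return X.T @ err / len(y), err.mean()

# small synthetic problem (assumed data, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (rng.random(50) < 0.5).astype(float)
w_demo, b_demo = rng.normal(size=3), 0.1

dw, db = analytic_gradients(w_demo, b_demo, X_demo, y_demo)

# central finite differences of the loss
h = 1e-6
num_dw = np.zeros_like(w_demo)
for j in range(len(w_demo)):
    step = np.zeros_like(w_demo)
    step[j] = h
    num_dw[j] = (binary_log_loss(w_demo + step, b_demo, X_demo, y_demo)
                 - binary_log_loss(w_demo - step, b_demo, X_demo, y_demo)) / (2 * h)
num_db = (binary_log_loss(w_demo, b_demo + h, X_demo, y_demo)
          - binary_log_loss(w_demo, b_demo - h, X_demo, y_demo)) / (2 * h)

print(np.max(np.abs(dw - num_dw)), abs(db - num_db))
```

If the analytic and numerical gradients agree to within a small tolerance, the update rule is consistent with the log loss.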

#### Prediction

Everything that we have done so far leads up to this step. We trained the model on a training dataset, and now we will use the learned parameters to make predictions on unseen data.

In [5]:
def predict(self, X):
    linear_pred = np.dot(X, self.weights) + self.bias
    y_pred = sigmoid(linear_pred)
    class_pred = [0 if y<=0.5 else 1 for y in y_pred]
    return class_pred

#### Putting Everything Together: Logistic Regression Class

In [6]:
class LogisticRegression():

    def __init__(self, lr=0.001, n_iters=1000):
        self.lr = lr
        self.n_iters = n_iters
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in tqdm(range(self.n_iters)):
            linear_pred = np.dot(X, self.weights) + self.bias
            predictions = sigmoid(linear_pred)

            # gradients of the cross-entropy loss with respect to the weights and bias
            dw = (1/n_samples) * np.dot(X.T,(predictions -y ))
            db = (1/n_samples) * np.sum(predictions-y)
            
            self.weights = self.weights - self.lr * dw
            self.bias = self.bias - self.lr * db

    def predict(self, X):
        linear_pred = np.dot(X, self.weights) + self.bias
        y_pred = sigmoid(linear_pred)
        class_pred = [0 if y<=0.5 else 1 for y in y_pred]
        return class_pred

### Loading Data

In [18]:
bc = datasets.load_breast_cancer()
X, y = bc.data, bc.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
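
The features in this dataset span very different ranges (for example, areas in the hundreds versus smoothness values below one), so the linear output `wx + b` can become large in magnitude during gradient descent. Standardising the features may help. The sketch below is an optional preprocessing step, not something the original notebook does; the variable names (`bc_demo`, `X_tr`, `X_te`, and so on) are hypothetical and chosen to avoid overwriting the notebook's own variables.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

bc_demo = datasets.load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    bc_demo.data, bc_demo.target, test_size=0.2, random_state=1234
)

# fit the scaler on the training split only, then reuse it for the test split
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(3)[:5])
print(X_tr_scaled.std(axis=0).round(3)[:5])
```

Fitting the scaler on the training split alone keeps information from the test split out of the preprocessing step.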

### Evaluation

In [8]:
# evaluation metric
def accuracy(y_pred, y_test):
    return np.sum(y_pred==y_test) / len(y_test)

In [9]:
# train the scratch model and evaluate it on the test set
clf = LogisticRegression(lr=0.0001)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy(y_pred, y_test)
print(acc)

100%|████████████████████████████████████| 1000/1000 [00:00<00:00, 11158.60it/s]

0.9298245614035088





### Logistic Regression using `scikit-learn`

Now, let’s see how our logistic regression fares in comparison to sklearn’s logistic regression.

In [10]:
# the scratch class defined above shadows the imported name,
# so re-import sklearn's model under an alias
from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression

model = SklearnLogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))

Note that sklearn's `LogisticRegression` does not expose a learning rate. Its solvers (`lbfgs` by default in recent versions, with alternatives such as `sag` and `saga`) determine the step size internally. The SAG (Stochastic Average Gradient) solver, for example, computes its step size automatically, so there is no way to change the learning rate unless you modify the source code. The discussion linked below covers this in more detail.

https://datascience.stackexchange.com/questions/16751/learning-rate-in-logistic-regression-with-sklearn
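
As a rough illustration of this point, the sketch below fits sklearn's logistic regression with two different solvers and checks which estimators accept a learning-rate setting. The synthetic dataset from `make_classification` and the variable names are assumptions made for the example; the import alias avoids clashing with the scratch `LogisticRegression` class defined earlier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression
from sklearn.linear_model import SGDClassifier

# small synthetic binary classification problem (assumed data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# the solver is selected by name; neither choice takes a learning rate
lbfgs_clf = SklearnLogisticRegression(solver="lbfgs", max_iter=1000).fit(X_demo, y_demo)
sag_clf = SklearnLogisticRegression(solver="sag", max_iter=5000, random_state=0).fit(X_demo, y_demo)

# LogisticRegression has no learning-rate parameter, SGDClassifier does
has_lr_param = "learning_rate" in lbfgs_clf.get_params()
sgd_params = SGDClassifier().get_params()

print(has_lr_param, "eta0" in sgd_params, "learning_rate" in sgd_params)
```

When explicit control over the step size is needed, `SGDClassifier` with its `learning_rate` and `eta0` parameters is the more natural choice.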

## Running Logistic Regression with Stochastic Gradient Descent

Now, let us use the stochastic gradient descent algorithm in the `scikit-learn` library to learn the parameters of the logistic regression model.

***Implement logistic regression using the [`SGDClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) class***

In [11]:
from sklearn.linear_model import SGDClassifier

clf_logit = SGDClassifier(loss='log_loss', alpha=0.001, max_iter=10000)
clf_logit.fit(X_train, y_train)
y_pred_logit = clf_logit.predict(X_test)
print(accuracy_score(y_test, y_pred_logit))


***Now, implement the perceptron classifier using the same [`SGDClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) class***

In [12]:
from sklearn.linear_model import SGDClassifier

# Create the Perceptron classifier
clf_perceptron = SGDClassifier(loss='perceptron', penalty=None, learning_rate='constant', eta0=0.1, max_iter=1000)
# Train the classifier
clf_perceptron.fit(X_train, y_train)
# Predict on test data
y_pred_perceptron = clf_perceptron.predict(X_test)

# Evaluate accuracy
print("Perceptron Accuracy:", accuracy_score(y_test, y_pred_perceptron))

Perceptron Accuracy: 0.9210526315789473
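
scikit-learn also provides a dedicated `Perceptron` class, which its documentation describes as equivalent to `SGDClassifier(loss="perceptron", eta0=1, learning_rate="constant", penalty=None)`. The sketch below compares the two on a synthetic dataset. The data, the variable names, and the fixed `random_state` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron, SGDClassifier

# small synthetic binary classification problem (assumed data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

perceptron = Perceptron(random_state=0).fit(X_demo, y_demo)
sgd_perceptron = SGDClassifier(
    loss="perceptron", penalty=None, learning_rate="constant",
    eta0=1.0, random_state=0,
).fit(X_demo, y_demo)

# fraction of training samples on which the two models predict the same class
agreement = np.mean(perceptron.predict(X_demo) == sgd_perceptron.predict(X_demo))
print(agreement)
```

Note that the cell above uses `eta0=0.1` rather than 1, so its results are not expected to match the `Perceptron` defaults exactly.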


### Doing more online learning

The whole point of doing online learning is to be able to learn incrementally with new data. Let us try this with the data that we already have. 

In order to have an online learning setting, we need an additional "train" dataset that the model hasn't seen yet. We can create one by splitting the test data in half.

***Let's split the test data into two splits, namely, 1) `(X_train_delta, y_train_delta)` and 2) `(X_test, y_test)`.***

In [19]:
X_train_delta, X_test, y_train_delta, y_test = train_test_split(X_test, y_test, train_size=0.5)


***Now let us use the previously trained logistic regression and perceptron models to evaluate accuracy once again on the new test set.***

In [21]:
print("The accuracy of Logistic Regression Model on new test set is: {}".format(accuracy_score(y_test, clf_logit.predict(X_test))))
print("The accuracy of Perceptron Model on new test set is: {}".format(accuracy_score(y_test, clf_perceptron.predict(X_test))))

The accuracy of Logistic Regression Model on new test set is: 0.9298245614035088
The accuracy of Perceptron Model on new test set is: 0.8947368421052632


In this step, we further train the models with the new training dataset `(X_train_delta, y_train_delta)`. 

***Let's further train the two models using the new training data.***

In [None]:
clf_logit.partial_fit(X_train_delta, y_train_delta)
clf_perceptron.partial_fit(X_train_delta, y_train_delta)

***Now, let us reevaluate the models with the same test dataset that we used before.*** 

In [None]:
print("The accuracy of Logistic Regression Model on new test set is: {}".format(accuracy_score(y_test, clf_logit.predict(X_test))))
print("The accuracy of Perceptron Model on new test set is: {}".format(accuracy_score(y_test, clf_perceptron.predict(X_test))))