# Week 2 - Classical ML Models - Part I

## 2. Logistic Regression and Classification

As mentioned in the previous week, in a classification problem we have a set of inputs that belong to two or more categories, and we have to train a model to assign new inputs to the corresponding categories.

Although there are many classification models, in this lecture we will focus on **logistic regression**.

### Logistic Regression

Even though its name includes the term "regression", logistic regression is a classification model rather than a regression model. It is used to estimate the probability of a binary event.

Before going in depth into the logistic regression model, we will first analyze its *'building blocks'*.

#### Sigmoid function

![sigmoid function](https://static.javatpoint.com/tutorial/machine-learning/images/logistic-regression-in-machine-learning.png)

The sigmoid function (shown above) maps any real value to the range between 0 and 1. When we train a logistic regression model, we are essentially learning where the decision boundary lies: inputs whose sigmoid output is above the threshold are classified as 1, while inputs whose output is below the threshold are classified as 0.

The sigmoid function itself can be expressed mathematically as
$g(z) = \frac{1}{1 + \exp(-z)}$. In Python code, this could be expressed as:

In [11]:
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))
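
As a quick sanity check, the sketch below evaluates the sigmoid at a few inputs. The function is redefined so the snippet runs on its own, and the sample inputs are arbitrary values chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary sample inputs: a negative value, zero, and a positive value
z = np.array([-5.0, 0.0, 5.0])
values = sigmoid(z)

print(values)  # approximately [0.0067, 0.5, 0.9933]
```

Note that `sigmoid(0)` is exactly 0.5 and that `sigmoid(-z) = 1 - sigmoid(z)`, which is why the first and last outputs add up to 1.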

#### Logistic regression hypothesis

So far we know how to express the sigmoid function in mathematical and code notation. We also know that in logistic regression the inputs are passed through the sigmoid to obtain a probability, which is then compared against the classification threshold.

However, how does this connection translate into mathematics?

As you might remember from the last notebook, linear regression had the hypothesis $\hat{y} = \theta^T x$. For logistic regression, this hypothesis becomes: $\hat{y} = \frac{1}{1 + \exp(-\theta^T x)}$.

The hypothesis can be written as:

In [13]:
def hypothesis(X, theta):
    z = np.dot(X, theta)
    y_hat = sigmoid(z)
    
    return y_hat
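
To make the shapes concrete, here is a small worked example of the hypothesis. The data and the `theta` values are made up for illustration (they are not fitted parameters), and the first column of `X` is assumed to be an intercept column of ones.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(X, theta):
    # One linear score per row of X, squashed into a probability
    return sigmoid(np.dot(X, theta))

# Hypothetical data: 3 samples, first column is the intercept term (all ones)
X = np.array([[1.0, 2.0],
              [1.0, 0.0],
              [1.0, -2.0]])
theta = np.array([0.0, 1.0])  # illustrative parameters, not fitted values

y_hat = hypothesis(X, theta)
print(y_hat)  # one probability per row of X
```

The linear scores here are 2, 0 and -2, so the first probability is above 0.5, the second is exactly 0.5, and the third is below 0.5.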

#### Cost function

At this point, we have a function that outputs the logistic regression probability. To actually train our model, however, we also have to define a cost function.

In the linear regression case, our cost function had the following form:
$\frac{1}{m} \sum_{i = 1}^m(\hat{y}_{i} - y_{i}) ^ 2$. Here $\hat{y}_{i}$ is the model output while $y_{i}$ is the actual label. In logistic regression, however, $\hat{y}$ is a sigmoid of $\theta^T x$, and plugging it into the squared error gives a cost that is non-convex in $\theta$, which makes it hard to find the optimum (the function can have many local minima). Therefore, the cost function for logistic regression is instead written as: $J(\theta) = -\frac{1}{m} \sum_{i = 1}^m (y_{i} \log{(\hat{y}_{i})} + (1-y_{i}) \log{(1 - \hat{y}_{i})})$.

In code this can be expressed as:

In [None]:
def cost(y_hat, y):
    # Binary cross-entropy; the leading minus sign matches the definition of J(theta)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
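
One practical caveat: if a predicted probability is exactly 0 or 1, `np.log` returns `-inf` and the cost is no longer finite. A minimal sketch of one common workaround is shown below. The function name `cost_clipped` and the `eps` value are assumptions made for this illustration, and the labels and probabilities are made up.

```python
import numpy as np

def cost_clipped(y_hat, y, eps=1e-12):
    # Keep probabilities away from exactly 0 and 1 so np.log stays finite
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1])                      # made-up labels
mostly_right = np.array([0.9, 0.1, 0.8])     # probabilities close to the labels
mostly_wrong = np.array([0.1, 0.9, 0.2])     # probabilities far from the labels

cost_right = cost_clipped(mostly_right, y)
cost_wrong = cost_clipped(mostly_wrong, y)

print(cost_right)  # small cost
print(cost_wrong)  # much larger cost
```

The cost grows quickly as the predicted probability moves away from the true label, which is exactly the behaviour we want to penalize during training.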

#### Gradient

The optimization process in ML models is usually based on the gradient descent algorithm. We will analyze it in much more depth in upcoming lectures; for the purpose of this tutorial, think of it as a way of finding the minimum of a function. As you might remember from math lessons, a minimum point can be found with the use of derivatives. Since we want to minimize the cost function with respect to $\theta$, we differentiate it with respect to $\theta$, which gives the gradient: $\frac{1}{m} X^T (\hat{y} - y)$. Here $X$ is the input matrix with one row per sample, and $\hat{y}$ and $y$ are the vectors of predictions and labels.
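
For readers who want to see where this expression comes from, the following is a short derivation sketch. The symbols $z_i$ and $\ell_i$ (the linear score and the loss of a single sample) are introduced here only for the derivation. With $z_i = \theta^T x_i$ and $\hat{y}_i = g(z_i)$, the sigmoid satisfies $g'(z) = g(z)(1 - g(z))$, so:

$$\ell_i = -\left(y_i \log{(\hat{y}_i)} + (1 - y_i) \log{(1 - \hat{y}_i)}\right)$$

$$\frac{\partial \ell_i}{\partial z_i} = -y_i (1 - \hat{y}_i) + (1 - y_i)\hat{y}_i = \hat{y}_i - y_i$$

$$\frac{\partial J}{\partial \theta} = \frac{1}{m} \sum_{i = 1}^m \frac{\partial \ell_i}{\partial z_i} \frac{\partial z_i}{\partial \theta} = \frac{1}{m} \sum_{i = 1}^m (\hat{y}_i - y_i)\, x_i = \frac{1}{m} X^T (\hat{y} - y)$$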

The gradient descent formula for updating the $\theta$ values, therefore, becomes:
$\theta = \theta - lr \cdot \frac{1}{m} X^T (\hat{y} - y)$, where $lr$ is the learning rate.

The code for finding the optimal $\theta$ becomes:

In [None]:
def gradient_descent(x_train, y_train, lr, epochs):
    intercept = np.ones((x_train.shape[0], 1))
    x_train = np.concatenate((intercept, x_train), axis=1)
    theta = np.zeros(x_train.shape[1])

    m = len(x_train)
    
    for i in range(epochs):
        y_hat = hypothesis(x_train, theta)
        gradient = np.dot(x_train.T, (y_hat - y_train)) / m
        
        theta -= lr * gradient
    
    return theta
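
A useful way to gain confidence in a hand-written gradient is to compare it against a finite-difference approximation of the cost. The sketch below does this on randomly generated toy data. The helper names, the data, the test point `theta`, and the step size `eps` are all assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(y_hat, y):
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def gradient(X, y, theta):
    # Analytic gradient of the cost: (1/m) * X^T (y_hat - y)
    return np.dot(X.T, sigmoid(np.dot(X, theta)) - y) / len(X)

# Hypothetical toy data: one feature plus an intercept column of ones
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))
X = np.concatenate((np.ones((50, 1)), x), axis=1)
y = (x[:, 0] > 0).astype(float)

theta = np.array([0.1, -0.2])  # arbitrary point at which to check the gradient
eps = 1e-6

# Central finite differences, one parameter at a time
numeric = np.zeros_like(theta)
for j in range(len(theta)):
    step = np.zeros_like(theta)
    step[j] = eps
    cost_plus = cost(sigmoid(np.dot(X, theta + step)), y)
    cost_minus = cost(sigmoid(np.dot(X, theta - step)), y)
    numeric[j] = (cost_plus - cost_minus) / (2 * eps)

analytic = gradient(X, y, theta)
max_diff = np.max(np.abs(numeric - analytic))
print(max_diff)  # should be very close to zero
```

If the two gradients disagree noticeably, the most likely culprits are a sign error in the cost or a missing division by `m`.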

#### Prediction

After finding the optimal $\theta$, the prediction process is quite straightforward: we simply evaluate the hypothesis on the new inputs. The output of this function ($\hat{y}$) is a probability in the range from $0$ to $1$; therefore, we map it to a class label using a $0.5$ threshold.

In [None]:
def predict(x_train, y_train, x_test, lr, epochs):
    
    # Fit theta on the training data, passing along the learning rate and number of epochs
    theta = gradient_descent(x_train, y_train, lr, epochs)
    intercept = np.ones((x_test.shape[0], 1))
    x_test = np.concatenate((intercept, x_test), axis=1)
    y_hat = hypothesis(x_test, theta)
    
    y_pred = []
    
    for i in range(len(y_hat)):
        if(y_hat[i] >= 0.5):
            y_pred.append(1)
        else:
            y_pred.append(0)
    
    return y_pred
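
The thresholding loop can also be written as a single vectorized NumPy expression. The sketch below is one possible alternative. The helper name `to_labels` and the sample probabilities are made up for illustration.

```python
import numpy as np

def to_labels(y_hat, threshold=0.5):
    # The comparison yields True/False per element; astype(int) turns that into 1/0
    return (np.asarray(y_hat) >= threshold).astype(int)

probs = np.array([0.2, 0.5, 0.73, 0.49])  # hypothetical model outputs
labels = to_labels(probs)

print(labels)  # [0 1 1 0]
```

Unlike the loop, this returns a NumPy array rather than a Python list. Both forms work with `accuracy_score`.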

### Exercise

Let's now put it all together into a simple logistic regression model. For this exercise, we will use one of the sklearn datasets (Iris).

In [142]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import math
from sklearn.metrics import classification_report,accuracy_score


iris = load_iris()

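# Features: keep only the first two columns (sepal length and sepal width)
X = iris.data[:, :2]
# Labels: 0 for class 0 (setosa), 1 for the other two classes, which gives a binary problem
y = (iris.target != 0) * 1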

In [None]:
############-------Tasks to do-------------------##########################

In [None]:
#Split the X, y dataset into x_train, x_test, y_train and y_test
x_train, x_test, y_train, y_test = train_test_split(X, y,random_state=0)

In [None]:
#---Building the model

#sigmoid function
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

#hypothesis function
def hypothesis(X, theta):
    z = np.dot(X, theta)
    y_hat = sigmoid(z)
    
    return y_hat

#cost function
def cost(y_hat, y):
    cost = np.mean(- y * np.log(y_hat) - (1 - y) * np.log(1-y_hat))
    return cost

#gradient descent function
def gradient_descent(x_train, y_train, lr, epochs):
    intercept = np.ones((x_train.shape[0], 1))
    x_train = np.concatenate((intercept, x_train), axis=1)
    theta = np.zeros(x_train.shape[1])
    # lr and epochs come from the function arguments (no hard-coded overrides here)

    m = len(x_train)
    
    for i in range(epochs):
        y_hat = hypothesis(x_train, theta)
        gradient = np.dot(x_train.T, (y_hat - y_train)) / m
        
        theta -= lr * gradient
    
    return theta

#prediction function
def predict(x_train, y_train, x_test, lr, epochs):
    
    # Fit theta on the training data, passing along the learning rate and number of epochs
    theta = gradient_descent(x_train, y_train, lr, epochs)
    intercept = np.ones((x_test.shape[0], 1))
    x_test = np.concatenate((intercept, x_test), axis=1)
    y_hat = hypothesis(x_test, theta)
    
    y_pred = []
    
    for i in range(len(y_hat)):
        if(y_hat[i] >= 0.5):
            y_pred.append(1)
        else:
            y_pred.append(0)
    
    return y_pred

In [None]:
#assigning learning rate and epochs values (can leave as it is or change it)
lr = 0.01
epochs = 100

In [None]:
#predicting values
y_pred = predict(x_train, y_train, x_test, lr, epochs) #train on the training split, predict on the test split
print('Accuracy on test set: ' + str(accuracy_score(y_test, y_pred)))
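
The accuracy reported above is simply the fraction of test samples whose predicted label equals the true label. A minimal NumPy sketch of the same computation, using made-up labels rather than the Iris results:

```python
import numpy as np

# Made-up labels for illustration only (not the Iris predictions)
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# Element-wise comparison gives True/False; the mean of that is the share of matches
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # 0.8
```

Four of the five predictions match, which gives an accuracy of 0.8.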