# Gradient descent based models

## Overview

In these models, we assign a **weight** to each feature, combine the weighted features linearly, and add a **bias** to reach our target value. But how do we choose proper weights and bias?<br>
For that, we have to define a **cost/loss function**. There are various cost functions in machine learning, but in the end all of them measure how far our predictions are from the true values, so by reducing the cost function we expect to reach more accurate predictions (we are talking about training-phase accuracy, not test-phase accuracy). So we need a mechanism to minimize our cost function, and for that we use the **gradient descent algorithm** (that's why we call these models gradient descent-based models).<br>
Before we talk about gradient descent, it is worth mentioning that we said we deal with linear combinations of features, and if we stick with them we can only solve problems with linear answers. So there are several ways to add some **non-linearity** to our predictor equations, for example adding polynomial terms of the features or using non-linear functions like **Sigmoid**, **ReLU**, etc. (more on them later). Now let's talk about gradient descent.
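
As a small illustration of the idea above, here is a minimal sketch of a weighted sum plus bias passed through two common non-linear functions. The weights, bias and sample values are made up for the example, and `sigmoid` and `relu` are helper names chosen here, not part of any library.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified linear unit: keeps positive values, zeroes out the rest."""
    return np.maximum(0.0, z)

# Hypothetical weights, bias and one sample with three features.
w = np.array([0.5, -1.0, 2.0])
b = 0.1
x = np.array([1.0, 2.0, 0.5])

z = np.dot(w, x) + b  # linear part: weighted sum of features plus bias
print("linear output:", z)
print("after sigmoid:", sigmoid(z))
print("after ReLU:", relu(z))
```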

## Gradient Descent

Gradient Descent (GD) is an iterative first-order optimization algorithm that is used to find a local optimum of a given function. Here our function is the cost/loss function. As we said before, the cost function represents our prediction error, so a loss as close to zero as possible is our goal. Because GD relies on derivatives, we want the cost function to have two properties. It should be:
1. Differentiable
2. Convex (for minimization) or concave (for maximization), so that the local optimum GD finds is also the global one<br>

As we mentioned before, GD is an iterative process. In each iteration it uses the derivatives to update our parameters, and it stops when we reach a certain number of iterations (maximum iteration) or when the change in the loss value between two iterations falls below a certain small value (tolerance).<br>
This method has a parameter called the **learning rate**, which is multiplied by the derivative before we update our parameters and usually has a value between zero and one. A small learning rate means small steps toward the optimum and a big learning rate means big steps: too small and convergence is slow, too big and we can overshoot the optimum. So we have to treat it as a hyperparameter in our problems.<br>
In general, the GD algorithm steps are:
1. initialize our parameters (starting point)
2. calculate the gradient at this point
3. make a scaled step (with learning rate) in the opposite direction to the gradient (if we want to minimize)
4. repeat steps 2 and 3 until we reach our maximum iteration or tolerance<br>
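
The four steps above can be sketched on a simple one-variable function. This is only an illustration under stated assumptions: the function $f(x) = (x - 3)^2$, the helper name `gradient_descent`, and the default values for the learning rate, maximum iteration and tolerance are all chosen for the example.

```python
def gradient_descent(f, grad, start, learning_rate=0.1, max_iter=1000, tol=1e-12):
    """Plain gradient descent for a function of one variable."""
    x = start                              # step 1: starting point
    prev_loss = f(x)
    for i in range(max_iter):              # step 4: repeat up to max_iter times
        x = x - learning_rate * grad(x)    # steps 2 and 3: scaled step against the gradient
        loss = f(x)
        if abs(prev_loss - loss) < tol:    # stop when the loss barely changes (tolerance)
            break
        prev_loss = loss
    return x, i + 1

# f(x) = (x - 3)^2 has derivative 2 * (x - 3) and its minimum at x = 3.
x_min, n_steps = gradient_descent(
    f=lambda x: (x - 3) ** 2,
    grad=lambda x: 2 * (x - 3),
    start=0.0,
)
print("minimum found at:", x_min, "after", n_steps, "iterations")
```

Changing `learning_rate` in this sketch is a quick way to see its effect: smaller values need more iterations, while values above 1.0 make this particular function diverge.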

There are various types of gradient descent, and we can classify them like this:
* From **Data Perspective**:
    * **Batch** gradient descent: process all the training samples for each iteration.
    * **Stochastic** gradient descent: process one sample at each iteration.
    * **Mini-batch** gradient descent: process b samples for each iteration (1 < b < number of samples).
* From **Algorithm Perspective**:
    * **Momentum**-based gradient
    * Root mean squared propagation or **RMSProp**
    * Adaptive moment estimation or **ADAM**<br>
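
From the data perspective, the three variants differ only in how many samples feed each parameter update. The sketch below assumes randomly generated data with samples stored as rows, and `iterate_minibatches` is a hypothetical helper written for this example.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs that together cover the data once."""
    indices = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], y[batch]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))      # 100 hypothetical samples with 3 features each
y = rng.integers(0, 2, size=100)   # hypothetical binary labels

# batch_size = len(X) gives batch GD, 1 gives stochastic GD, anything in between mini-batch GD
for name, batch_size in [("batch", len(X)), ("stochastic", 1), ("mini-batch", 32)]:
    n_updates = sum(1 for _ in iterate_minibatches(X, y, batch_size, rng))
    print(f"{name:>10}: {n_updates} parameter updates per pass over the data")
```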

Each of these algorithms adds hyperparameters to our problem, so we will have:
* batch size ${b}$ for mini-batch GD. A common default is 32, but we can choose other values too (usually powers of two, $2^{n}$).
* $\beta_1$ for momentum (default 0.9)
* $\beta_2$ and $\epsilon$ for RMSProp (defaults 0.99 and $10^{-8}$)
* $\beta_1$, $\beta_2$, $\epsilon$ for ADAM<br>
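
To show where these hyperparameters enter the update, here is a rough sketch of ADAM, which combines a momentum-style running mean of the gradient with an RMSProp-style running mean of the squared gradient. The function name `adam_minimize`, the test function, and the values `beta1=0.9`, `beta2=0.999`, `eps=1e-8` are assumptions for this example (they are the commonly cited ADAM defaults, not values taken from the text above).

```python
import numpy as np

def adam_minimize(grad, start, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_iter=1000):
    """Minimize a function with ADAM, given a function that returns its gradient."""
    x = np.asarray(start, dtype=float)
    m = np.zeros_like(x)  # running mean of gradients (momentum part)
    v = np.zeros_like(x)  # running mean of squared gradients (RMSProp part)
    for t in range(1, n_iter + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)  # eps avoids division by zero
    return x

# Same toy problem as plain GD: f(x) = (x - 3)^2 with gradient 2 * (x - 3).
x_min = adam_minimize(lambda x: 2 * (x - 3), start=[0.0])
print("minimum found at:", x_min)
```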

Several models use gradient descent but the common ones are:
* Linear Regression
* Logistic Regression
* Neural Network

Now that we know about all of this, let's code it. But instead of using libraries, we are going to code logistic regression from scratch first, and after that learn how to implement these models by using libraries.<br>
But why from scratch? Because this way we are going to understand the logic behind all the gradient descent-based models.<br>
Why logistic regression? Because it is the base of all the gradient descent-based models.<br>
Why not linear regression? Logistic regression and linear regression have a lot in common, but logistic regression has a non-linearity in it, so we also get to learn how to handle non-linear functions.

## Logistic Regression

Logistic regression is a gradient descent-based model that is used for classification problems, and it estimates the probability of each class in our problem. As we discussed before, each feature has its assigned weight, and we add the weighted features together in a linear fashion with a bias term to calculate our prediction. But in order to get a probability, we need some function to convert our prediction into a number between 0 and 1. For that we use the **logistic function** (also called the sigmoid function). The loss commonly used for logistic regression is known as the **logistic loss** or the **cross-entropy loss**; for binary labels these two names refer to the same function.<br>
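
To make the loss concrete, the following sketch computes the mean binary cross-entropy for a few made-up linear outputs and labels. The helper names `logistic` and `cross_entropy` and all the numbers are assumptions for illustration only.

```python
import numpy as np

def logistic(z):
    """Logistic function: converts a linear output into a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, a):
    """Mean binary cross-entropy between labels y (0 or 1) and probabilities a."""
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

z = np.array([-4.0, 0.0, 4.0])  # hypothetical linear outputs
a = logistic(z)                 # probabilities between 0 and 1
y = np.array([0, 1, 1])         # hypothetical true labels

print("probabilities:", a)
print("loss with matching labels:", cross_entropy(y, a))
print("loss with flipped labels:", cross_entropy(1 - y, a))
```

Confident predictions that agree with the labels give a small loss, while the same predictions against flipped labels are penalized heavily.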

### Implementing Logistic Regression From Scratch

Because I think pseudocode is very important in the coding process, I will first write pseudocode for logistic regression with some notes to help us, and then we will write our code.

#### Pseudocode for logistic regression

1. initialize weights (${w}$) and bias (${b}$).
2. calculate the linear combination. (${z} = w^{T}{X} + {b}$)
3. apply the logistic function to z. (${a} = logistic({z})$). this is our estimation (prediction).
4. calculate the loss using the proper formula (here we will be using the cross-entropy loss function). our loss is a function of y and a (${J} = f({y},{a})$).
5. calculate our derivatives (da, dz, dw, db). here we need some calculus, particularly the chain rule. in the end we need dw and db, but we have to calculate da and dz to get there.
    * $da = \frac{\partial J}{\partial a}$<br>
    
    * $dz = \frac{\partial J}{\partial z} = \frac{\partial J}{\partial a} \cdot \frac{\partial a}{\partial z}$
    
    * $dw = \frac{\partial J}{\partial w} = \frac{\partial J}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$
    
    * $db = \frac{\partial J}{\partial b} = \frac{\partial J}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial b}$<br>

    
    
6. update w, b using dw, db.
    * ${w} := {w} - \alpha{dw}$
    * ${b} := {b} - \alpha{db}$
7. repeat steps 2-6 until we reach our maximum iteration or our tolerance.

#### Some notes for implementing logistic regression

1. because we are dealing with matrices, the shape is important. every time we use `.T` or `.reshape`, we are trying to control our matrix shape for further calculation, so don't get distracted by them.
2. here we will be using basic gradient descent, not other variants.
3. $\alpha$ is our learning rate.
4. for initialization we can set our parameters to zero or to random numbers (preferably small ones, between zero and 1 or on a similar scale).
5. for cross-entropy loss function:<br>

    $\frac{\partial J}{\partial a} = - \frac{y}{a} + \frac{1 - y}{1 - a}$
6. for sigmoid function: <br>

    $\frac{\partial a}{\partial z} = a(1 - a)$<br>
7. so:
    
    $\frac{\partial J}{\partial z} = a - y$

8. as we said before, our output in logistic regression is a probability, so it will be between zero and one. in order to use it for classification we convert our output to 0 or 1, each representing one of our classes. if our estimation is greater than or equal to some **threshold** (default 0.5), we classify our answer as 1, and if it's lower than our threshold we classify it as 0. we can also treat the threshold as a hyperparameter.
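
The derivatives in notes 5 to 7 can be checked numerically. The sketch below multiplies the two partial derivatives with the chain rule and compares the result both to $a - y$ and to a finite-difference estimate. The sample values and the helper names `logistic` and `loss` are assumptions made for this check.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z, y):
    """Cross-entropy loss of one sample as a function of the linear output z."""
    a = logistic(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y, h = 0.7, 1.0, 1e-6  # hypothetical sample and finite-difference step
a = logistic(z)

da = -y / a + (1 - y) / (1 - a)  # dJ/da for cross-entropy (note 5)
da_dz = a * (1 - a)              # da/dz for the sigmoid (note 6)
analytic = da * da_dz            # chain rule, expected to equal a - y (note 7)
numeric = (loss(z + h, y) - loss(z - h, y)) / (2 * h)  # central difference

print("chain rule:", analytic)
print("a - y:     ", a - y)
print("numeric:   ", numeric)
```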

#### Code

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from scipy.special import expit #our logistic function
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
## importing our data
data, target = load_breast_cancer(return_X_y=True)
data.shape, target.shape

((569, 30), (569,))

In [3]:
# splitting our data into train and test
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42,stratify=target)

In [4]:
##scaling our data, remember we only fit on our train data
scale = StandardScaler()
scale.fit(X_train)
X = scale.transform(X_train)
X_test = scale.transform(X_test)

In [5]:
## make sure we have numpy arrays in our desired shape (features x samples)
X = np.array(X).T
Y = np.array(y_train)
n = X.shape[0] # number of features
m = Y.shape[0] # number of training samples
## reshaping our labels to the desired shape and initializing our weights and bias (step 1)
Y = Y.reshape((1,m))
b = 0.0 # the bias is a single scalar shared by all samples
w = np.zeros((n,1))

In [6]:
## hyperparameters
alpha = 0.025 #learning rate
iteration = 700 # number of iteration
threshold = 0.5

In [7]:
## our iterative process for calculating w, b
for i in range(iteration): # (step 7)
    Z = np.dot(w.T,X)+b # calculate linear combination (step 2)
    A = expit(Z) # (step 3) adding our non-linearity; for binary classification the proper one is the logistic function, that's why we call it logistic regression
    loss = -(Y*np.log(A)+(1-Y)*np.log(1-A)) # cross-entropy loss of each sample (step 4)
    J = np.sum(loss)/m # average loss over all samples (step 4)
    ## computing gradients (step 5)
    dZ = A-Y
    dw = (np.dot(X,dZ.T))/m
    db = (np.sum(dZ))/m
    ## updating (step 6)
    w = w-alpha*dw
    b = b-alpha*db
#     plt.scatter(i,J)
# plt.show()

In [8]:
## loss value
J

In [9]:
## training accuracy
train_predictions = (A>=threshold).astype(int) # probabilities -> class labels, A itself stays untouched
train_accuracy = np.sum(train_predictions==Y)/m
train_accuracy

0.9868131868131869

In [10]:
## preparing testing data to proper input format
X_test = np.array(X_test).T
Y_test = np.array(y_test)
Y_test = Y_test.reshape((1,len(Y_test)))

In [11]:
## test accuracy
bias = np.ravel(b)[0] # the bias is one scalar value shared by all samples
Z_predict = np.dot(w.T,X_test)+bias # same linear combination as in training, bias included
Y_predict = (expit(Z_predict)>=threshold).astype(int) # probabilities -> class labels
test_accuracy = np.sum(Y_predict==Y_test)/Y_test.shape[1]
test_accuracy

**NOTE:** if we want to implement linear regression, we just need to use the least squares function for the loss and drop the logistic function. The derivatives change accordingly and actually become simpler, which is why I chose to write logistic regression.
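
As a rough sketch of that note, here is the same loop adapted to linear regression on synthetic data. The data (a noisy line $y = 3x + 2$), the hyperparameter values, and the choice of half the squared error as the loss are assumptions for this example. With that loss the gradient with respect to the prediction is again `A - Y`, so only the logistic step disappears.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200                                                 # number of samples
X = rng.normal(size=(1, m))                             # one feature, samples as columns
Y = 3.0 * X + 2.0 + rng.normal(scale=0.1, size=(1, m))  # hypothetical target: y = 3x + 2 + noise

w = np.zeros((1, 1))
b = 0.0
alpha = 0.1  # learning rate

for _ in range(500):
    A = np.dot(w.T, X) + b    # prediction: linear combination only, no logistic function
    dZ = A - Y                # derivative of 0.5 * (A - Y)^2 with respect to A
    dw = np.dot(X, dZ.T) / m
    db = np.sum(dZ) / m
    w = w - alpha * dw
    b = b - alpha * db

print("w:", w.ravel(), "b:", b)
```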

### GDB Models in Scikit Learn

#### Logistic Regression

#### Linear Regression

## Neural Network

### Implementing neural network from scratch

### Implementing neural network with PyTorch (or something else)