# Loss Functions and Gradient Descent

The goal of this notebook is to go over loss functions and gradient descent as they relate to machine learning algorithms.
In supervised machine learning, we want to minimize the error on the training examples during the learning process. This error is measured by the loss function, and the minimization is done with optimization strategies such as gradient descent.

## Loss Functions

Machines "learn" by means of a loss function, which measures how well a specific algorithm models its given data. If the model's predictions deviate too much from the actual results, the loss (or error) function outputs a large number.

However, there is no one-size-fits-all loss function in machine learning. Several factors go into choosing a loss function for a specific problem, such as the type of machine learning algorithm chosen, the ease of calculating the derivatives, and even the number of potential outliers in the data set.

### 1) Regression Loss Functions
Regression fits a relationship between a dependent variable (Y) and one or more independent variables (X); in linear regression, we are trying to fit the best line through these variables. Regression loss functions measure how far the fitted predictions fall from the observed values.

These are best used when you have *numerical/continuous data*, for example when predicting housing prices. However, they can also be used for categorical variables when the classes are linearly separable.

#### a. Mean Squared Error Loss
One of the most commonly used regression loss functions, MSE measures the average squared difference between the actual values and the values predicted by the model. The output is a single number summarizing the error over a set of values.
Consider the slope-intercept linear equation, $\hat{y} = mx+b$.

We define MSE as:

$MSE = \frac{1}{N}\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}$

where $N$ is the total number of data points.

In [7]:
## PYTHON CODE FUNCTION FOR MSE
# b is the intercept
# m is the slope
# points is a pandas DataFrame with x in column 0 and y in column 1

def MSE(b, m, points):
    N = float(len(points))  # N is the total number of points
    error = 0
    for i in range(len(points)):
        x = points.iat[i, 0]  # x value; multiple linear regression would have several x columns
        y = points.iat[i, 1]  # y value
        error += (y - (m*x + b)) ** 2  # squared difference between actual y and predicted y_hat
    return error / N  # the mean
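
As a quick sanity check on the formula, the same quantity can be computed in vectorized form with NumPy. This is only a sketch: the helper name `mse_vectorized` and the four sample points are made up for illustration and are not part of the original notebook.

```python
import numpy as np

def mse_vectorized(b, m, x, y):
    """Mean squared error of the line y_hat = m*x + b (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - (m * x + b)      # y_i - y_hat_i for every point at once
    return np.mean(residuals ** 2)   # average of the squared residuals

# Made-up sample data: the line y_hat = 2x + 1 fits the first three points exactly
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 10.0]

# Residuals are 0, 0, 0 and 1, so the mean of their squares is 1/4
loss = mse_vectorized(b=1.0, m=2.0, x=x, y=y)
print(loss)  # 0.25
```

Because the residuals are squared before averaging, a single large miss contributes much more to the loss than several small ones.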

#### b. Mean Squared Logarithmic Error Loss
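MSLE measures the squared difference between the logarithms of the actual and predicted values, which amounts to measuring their ratio, since $\log a - \log b = \log(a/b)$. It therefore focuses on the relative (percentage) difference rather than the absolute difference, and the curvature of the logarithm penalizes under-predictions more than over-predictions of the same absolute size. It can be a good choice of loss function when the data is continuous and non-negative, for example when predicting house sale prices or bakery sales.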

We define MSLE as:

$MSLE = \frac{1}{N}\sum_{i=1}^{N}(\log(y_{i}+1)-\log(\hat{y}_{i}+1))^{2}$

In [8]:
## PYTHON CODE FUNCTION FOR MSLE
import math

def MSLE(b, m, points):
    N = float(len(points))  # N is the total number of points
    error = 0
    for i in range(len(points)):
        x = points.iat[i, 0]  # x value; multiple linear regression would have several x columns
        y = points.iat[i, 1]  # y value
        log_y = math.log(y + 1)
        log_yhat = math.log((m*x + b) + 1)  # requires the prediction m*x + b to be greater than -1
        error += (log_y - log_yhat) ** 2  # squared difference between the logs of actual y and predicted y_hat
    return error / N  # the mean
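
To see what measuring a ratio means in practice, the sketch below compares MSLE and MSE on made-up numbers. The helper names `msle_vectorized` and `mse_vectorized` are hypothetical, and the values are chosen only to illustrate the behavior.

```python
import numpy as np

def msle_vectorized(y_true, y_pred):
    """Mean squared logarithmic error (hypothetical helper; values must be > -1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # np.log1p(v) computes log(v + 1)
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

def mse_vectorized(y_true, y_pred):
    """Mean squared error, for comparison (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Predicting double the true value at two very different scales
msle_small = msle_vectorized([100.0], [200.0])
msle_large = msle_vectorized([100000.0], [200000.0])
mse_small = mse_vectorized([100.0], [200.0])
mse_large = mse_vectorized([100000.0], [200000.0])

# Missing by the same absolute amount (50), below versus above the true value
msle_under = msle_vectorized([100.0], [50.0])
msle_over = msle_vectorized([100.0], [150.0])

print(msle_small, msle_large)  # nearly equal: the ratio is the same
print(mse_small, mse_large)    # differ by a factor of a million
print(msle_under > msle_over)  # True: under-prediction costs more
```

The two MSLE values stay close because the predicted-to-actual ratio is the same in both cases, while MSE grows with the square of the scale.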

#### c. Other Regression Errors
Other common regression losses include Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Huber Loss (a combination of MSE and MAE).
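
A minimal sketch of these three losses is shown below, assuming NumPy arrays of actual and predicted values. The function names and the `delta` threshold of 1.0 are illustrative choices, and the Huber form used here is the common one that is quadratic for small errors and linear for large ones.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(diff))

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of MSE."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean(diff ** 2))

def huber(y_true, y_pred, delta=1.0):
    """Huber loss: squared error for |error| <= delta, linear beyond it."""
    r = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return np.mean(np.where(r <= delta, quadratic, linear))

# Made-up data: three exact predictions and one large miss of 10
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 14.0]

print(mae(y_true, y_pred))    # 2.5
print(rmse(y_true, y_pred))   # 5.0
print(huber(y_true, y_pred))  # 2.375
```

On this example the single large miss dominates RMSE, while MAE and Huber grow only linearly with it, which is why they are often preferred when the data contains outliers.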

### 2) Binary Classification Loss Functions
These loss functions measure the performance of a classification model in which each data point is assigned one of two labels, either 0 or 1.

#### a. Binary Cross-Entropy Loss
It is the default loss function for binary classification problems. Cross-entropy loss measures the performance of a classification model whose output is a probability value between 0 and 1, and it increases as the predicted probability deviates from the actual label.
It is also called Sigmoid Cross-Entropy loss, because it is a sigmoid activation followed by a cross-entropy loss.

##### i) Sigmoid
It squashes each value into the range (0, 1) and is applied independently to each element of a vector $z$. It is also called the logistic function.

Defined as:

$\sigma(z)= \frac{1}{1+e^{-z}}$

In [10]:
## PYTHON CODE FOR SIGMOID FUNCTION
## z is any single value to which the sigmoid is applied
### For example, z can be the output of the y_hat regression equation
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))
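
The function above handles one number at a time. To apply the sigmoid independently to each element of a vector, a NumPy version can be sketched as follows; the name `sigmoid_vectorized` and the sample inputs are illustrative assumptions.

```python
import numpy as np

def sigmoid_vectorized(z):
    """Apply the sigmoid to every element of z (hypothetical helper)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

probs = sigmoid_vectorized([-2.0, 0.0, 2.0])
print(probs)  # the middle value is exactly 0.5
```

Every output lies strictly between 0 and 1, an input of 0 maps to 0.5, and the outputs for `-2.0` and `2.0` sum to 1 because the function is symmetric around 0.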

##### ii) Cross-Entropy
We then apply that sigmoid inside the Cross-Entropy loss function to obtain *Binary Cross-Entropy*.

Defined as:

$CE = -x_{i}\log(\sigma(z_{i})) - (1 - x_{i})\log(1 - \sigma(z_{i}))$

where $x_{i}$ is the true label (0 or 1) and $\sigma(z_{i})$ is the sigmoid output for example $i$.

We then take the derivative of the Cross-Entropy function paired with sigmoid in order to minimize the error.
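
As a sketch of that step, the code below evaluates the loss for a single example and checks its derivative numerically. It assumes `x` is the true label, as in the formula above, and `z` is the value passed into the sigmoid. The function names are hypothetical. With these definitions, the derivative of the loss with respect to `z` simplifies to `sigmoid(z) - x`.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def binary_cross_entropy(x, z):
    """Cross-entropy for one example: x is the label (0 or 1), z the raw score."""
    p = sigmoid(z)
    return -x * math.log(p) - (1 - x) * math.log(1 - p)

def bce_gradient(x, z):
    """Analytic derivative of the loss with respect to z."""
    return sigmoid(z) - x

# Compare the analytic derivative with a central finite difference
x, z, h = 1, 0.5, 1e-6
numeric = (binary_cross_entropy(x, z + h) - binary_cross_entropy(x, z - h)) / (2 * h)
analytic = bce_gradient(x, z)
print(abs(numeric - analytic) < 1e-6)  # True
```

The simple form of this derivative is one reason the sigmoid and cross-entropy are usually paired.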

#### b. Other Binary Errors
Hinge Loss, Squared Hinge Loss
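
A minimal sketch of both losses follows. Note the assumption that labels are encoded as -1 and +1, which is the usual convention for hinge loss and differs from the 0/1 labels used for cross-entropy above. The function names and sample scores are illustrative.

```python
import numpy as np

def hinge(y_true, scores):
    """Hinge loss; y_true holds labels in {-1, +1}, scores are raw model outputs."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

def squared_hinge(y_true, scores):
    """Squared hinge loss: the per-example hinge term is squared before averaging."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores) ** 2)

# Made-up example: a confident correct score, a weak correct score, a wrong score
y_true = [1, -1, 1]
scores = [2.0, -0.5, -1.0]

# Per-example hinge terms are 0, 0.5 and 2
print(hinge(y_true, scores))          # 2.5 / 3
print(squared_hinge(y_true, scores))  # 4.25 / 3
```

A correct prediction with a margin of at least 1 contributes nothing, while the squared variant penalizes large violations more heavily.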

## Gradient Descent

When you venture into machine learning, one of the fundamental things to understand is gradient descent. Gradient descent is the backbone of many machine learning algorithms. Once you have a grasp of gradient descent, things become clearer and the different algorithms are easier to understand. The general procedure is:

- 1) Choose the proper error function based on the information listed above
- 2) Take the derivative of the error function
- 3) Give the function a starting point
- 4) The point will move in either a positive or negative direction, depending on the derivative of the chosen loss function at that point, with the size of each step set by the learning rate
- 5) A minimum is found after a number of iterations, once the derivative is very close to zero and the error stops decreasing
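
The steps above can be sketched on a one-parameter example. The quadratic `loss` below is a stand-in chosen for illustration rather than one of the error functions from this notebook, and the starting point, learning rate, and stopping tolerance are arbitrary assumptions.

```python
# Step 1: a hypothetical error function with its minimum at w = 3
def loss(w):
    return (w - 3.0) ** 2

# Step 2: its derivative
def loss_derivative(w):
    return 2.0 * (w - 3.0)

# Step 3: a starting point
w = 0.0
learning_rate = 0.1

# Steps 4 and 5: move against the derivative until it is close to zero
for step in range(100):
    grad = loss_derivative(w)
    if abs(grad) < 1e-8:
        break
    w = w - learning_rate * grad

print(round(w, 6))  # 3.0
```

Starting from `w = 0.0` the derivative is negative, so each update moves `w` in the positive direction; starting above 3 it would move in the negative direction.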


I will be using Python to show examples of gradient descent with several different loss functions.

### 1) Basic Gradient Descent
This is the simplest form of the gradient descent technique. Its main feature is that we take small steps in the direction of the minimum by using the gradient of the cost function.

The gradient contains one partial derivative for each parameter needed for the final model.

*PSEUDOCODE*
<blockquote>
    <p> update = learning_rate * gradient_of_parameters </p>
    <p> parameters = parameters - update </p>
</blockquote>

Here, we update the parameters by taking the gradient of the cost function with respect to the parameters and multiplying it by a learning rate, which is essentially a constant that sets how fast we move toward the minimum. The learning rate is a hyperparameter, and its value should be chosen with care.
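
To illustrate why the learning rate needs care, the sketch below runs the same update rule on the simple function $f(w) = w^{2}$ with two different learning rates. The function, the starting point, and both rates are assumptions chosen for illustration.

```python
def run_descent(learning_rate, steps=20, start=1.0):
    """Run the update rule on f(w) = w**2 and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w                 # derivative of f(w) = w**2
        update = learning_rate * gradient
        w = w - update
    return w

small_step = run_descent(0.1)  # each step multiplies w by 0.8
large_step = run_descent(1.1)  # each step multiplies w by -1.2

print(abs(small_step) < 0.02)  # True: w shrinks toward the minimum at 0
print(abs(large_step) > 1.0)   # True: w overshoots and moves away
```

With the smaller rate each step shrinks the distance to the minimum, while with the larger rate every step overshoots by more than it corrects, so the iterates diverge.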

#### a. Gradient Descent Using a Basic Function
In this example, we will minimize a function of two variables, $f(x_{0}, x_{1}) = 5x_{0}^{2} + x_{1}^{2} + 4x_{0}x_{1} - 6x_{0} - 4x_{1} + 15$.


In [23]:
### PYTHON CODE
import numpy as np

# Starting point (x[0], x[1])
x = np.array([0.0, 10.0])

# Original function; shown for reference, only its gradient is used in the descent
def original_function(x):
    return 5*x[0]**2 + x[1]**2 + 4*x[0]*x[1] - 6*x[0] - 4*x[1] + 15

# First partial derivative with respect to each parameter
def gradient(x):
    return np.array([10*x[0] + 4*x[1] - 6, 4*x[0] + 2*x[1] - 4])

# Choosing our step parameter (learning rate), often defined as alpha
alpha = .16

# Number of iterations
num = 500

def grad_decent(x, alpha, num):
    x_n = np.asarray(x, dtype=float)  # start at the initial point
    for _ in range(num):
        x_n = x_n - alpha * gradient(x_n)  # step against the gradient at the current point
    return x_n

grad_decent(x, alpha, num)

array([-1.,  4.])

#### b. Gradient Descent Applied to Cost Functions
As we explore several supervised algorithms, we will find that gradient descent is the backbone of the vast majority of them. When hand-coding a machine learning algorithm, the difficulty lies in correctly taking the derivative of the cost function, which can be quite complex, and vectorizing that process in code. The gradient descent method will be marked in the other notebooks throughout this Intro to Machine Learning repository.
Several Python packages have built-in gradient descent functions for their associated algorithms.
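
To make the link between a cost function and gradient descent concrete, the sketch below fits the line $\hat{y} = mx + b$ by descending the MSE loss. It is an illustration under stated assumptions, not the implementation used in the other notebooks: the function names, the learning rate, the iteration count, and the noise-free sample data are all made up. The partial derivatives follow from differentiating MSE with respect to each parameter: $\frac{\partial MSE}{\partial b} = -\frac{2}{N}\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})$ and $\frac{\partial MSE}{\partial m} = -\frac{2}{N}\sum_{i=1}^{N}x_{i}(y_{i}-\hat{y}_{i})$.

```python
import numpy as np

def mse_gradients(b, m, x, y):
    """Partial derivatives of MSE with respect to b and m for y_hat = m*x + b."""
    residuals = y - (m * x + b)
    grad_b = -2.0 * np.mean(residuals)
    grad_m = -2.0 * np.mean(x * residuals)
    return grad_b, grad_m

def fit_line(x, y, learning_rate=0.05, num_iterations=5000):
    """Vectorized gradient descent on MSE, starting from b = 0 and m = 0."""
    b, m = 0.0, 0.0
    for _ in range(num_iterations):
        grad_b, grad_m = mse_gradients(b, m, x, y)
        b -= learning_rate * grad_b
        m -= learning_rate * grad_m
    return b, m

# Made-up data lying exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

b, m = fit_line(x, y)
print(round(b, 4), round(m, 4))  # 1.0 2.0
```

Because the gradients are computed with array operations over all points at once, no explicit loop over the data set is needed inside each iteration.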