# Cost Function

In ML, cost functions are used to estimate how badly models are performing. 

Put simply, a cost function measures how wrong the model is in its ability to estimate the relationship between X and y. It is typically expressed as a difference or distance between the predicted value and the actual value. The cost function (also referred to as loss or error) can be estimated by iteratively running the model and comparing its predictions against the “ground truth”, the known values of y.

The objective of an ML model, therefore, is to find the parameters, weights or structure that minimise the cost function.

A cost function is a `function` that measures the performance of a model on any given data. It `quantifies the error` between predicted values and expected values and presents it as a single real number.
 $$ \text{Hypothesis: } h_{\theta}(x) = \theta_{0} + \theta_{1} x $$
 $$ \text{Parameters: } \theta_{0}, \theta_{1} $$
 $$ \text{Cost Function: } J(\theta_{0}, \theta_{1}) = \frac{1}{2m} \sum_{i=1}^{m} \left[ h_{\theta}(x^{(i)}) - y^{(i)} \right]^2 $$
 $$ \text{Goal: } \min_{\theta_{0}, \theta_{1}} J(\theta_{0}, \theta_{1}) $$

Here m is the number of training examples, and x^(i), y^(i) are the feature and target of the i-th example.
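
To make the cost formula above concrete, here is a minimal sketch that evaluates J on a tiny dataset in plain Python. The function names `hypothesis` and `cost` and the sample values are assumptions chosen for illustration; they do not come from the original material.

```python
def hypothesis(theta0, theta1, x):
    """Linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def cost(theta0, theta1, xs, ys):
    """Cost J = 1/(2m) * sum((h_theta(x_i) - y_i)^2) over m examples."""
    m = len(xs)
    squared_errors = [
        (hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)
    ]
    return sum(squared_errors) / (2 * m)

# Hypothetical data that follows y = 2x exactly
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

perfect_fit = cost(0.0, 2.0, xs, ys)  # parameters that match the data
poor_fit = cost(0.0, 1.0, xs, ys)     # slope is too small
print(perfect_fit, poor_fit)
```

Parameters that reproduce the data exactly give a cost of zero, and the cost grows as the predictions move away from the targets.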

**Intuition Behind Gradient Descent**
Let’s say you are playing a game where the players start at the top of a mountain and are asked to reach the lake at its lowest point. Additionally, they are blindfolded. What approach do you think would get you to the lake?

Take a moment to think about this before you read on.

The best way is to feel the ground around you and find where the land descends. From that position, take a step in the descending direction, and repeat this process until you reach the lowest point.

# Gradient Descent 

[Image Link](https://miro.medium.com/max/405/1*UUHvSixG7rX2EfNFTtqBDA.gif)

The goal of the gradient descent algorithm is to minimize the given function (say cost function). To achieve this goal, it performs two steps iteratively:

1. Compute the gradient (slope), i.e. the first-order derivative of the function at the current point.
2. Take a step in the direction opposite to the gradient (the direction in which the function decreases), moving from the current point by alpha times the gradient at that point.
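
The two steps above can be sketched on a one-parameter function. The example below is only an illustration under stated assumptions: the function f(theta) = (theta - 3)^2, the starting point, the value of alpha and the name `minimise` are all made up for this sketch.

```python
def gradient(theta):
    """Derivative of the assumed function f(theta) = (theta - 3)**2."""
    return 2.0 * (theta - 3.0)

def minimise(theta_start, alpha, steps):
    theta = theta_start
    for _ in range(steps):
        grad = gradient(theta)        # step 1: compute the gradient
        theta = theta - alpha * grad  # step 2: move against the gradient
    return theta

theta_min = minimise(theta_start=0.0, alpha=0.1, steps=100)
print(theta_min)  # close to 3, the minimum of the assumed function
```

Each iteration shrinks the distance to the minimum by a constant factor here, because the example function is a simple parabola.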




## Plotting Gradient Descent Algorithm

When we have a single parameter (theta), we can plot the dependent variable cost on the y-axis and theta on the x-axis. If there are two parameters, we can go with a 3-D plot, with cost on one axis and the two parameters (thetas) along the other two axes.
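
For the single-parameter case, the points of such a plot can be produced by sweeping theta over a range and recording the cost at each value. This is a rough sketch: the dataset, the one-parameter model h(x) = theta1 * x and the grid of theta values are assumptions made for illustration.

```python
def cost(theta1, xs, ys):
    """Cost J(theta1) = 1/(2m) * sum((theta1 * x_i - y_i)^2)."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Hypothetical data that follows y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

thetas = [i * 0.5 for i in range(9)]        # 0.0, 0.5, ..., 4.0 (x-axis)
costs = [cost(t, xs, ys) for t in thetas]   # cost at each theta (y-axis)

for t, c in zip(thetas, costs):
    print("theta {:.1f}  cost {:.3f}".format(t, c))

# The lowest point of the curve is the theta that gradient descent looks for
best_theta = thetas[costs.index(min(costs))]
print("lowest cost at theta =", best_theta)
```

The `thetas` and `costs` lists can be handed to a plotting library, for example matplotlib's `plt.plot(thetas, costs)`, to draw the bowl-shaped curve.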


![Gradient Descent](../images/gradient_descent_algo.gif "Gradient Descent")

Source: Miro Medium: https://miro.medium.com/max/405/1*UUHvSixG7rX2EfNFTtqBDA.gif

### Gradient Descent Visualization

![Gradient Descent](../images/gradient_descent.jpg "Gradient Descent")

Source: Youtuber, [codebasics](https://www.youtube.com/channel/UCh9nVJoWXmFb7sLApWGcLPQ)

![Overshoot](../images/overshoot.png "Overshoot")

Source: Coursera, [Andrew NG](https://www.coursera.org/learn/machine-learning)

![2D static animation of Gradient Descent](../images/grad_desc_2d.png "Gradient Descent in 2D")

Source: Youtuber, [codebasics](https://www.youtube.com/channel/UCh9nVJoWXmFb7sLApWGcLPQ)



## Algorithm in Python

* Gradient Descent Algorithm
$$\theta_{j} = \theta_{j} - \alpha [ \frac{\partial }{\partial \theta_{j}} J(\theta_{0},\theta_{1})]  $$

* `Alpha or Learning rate` – a tuning parameter in the optimization process. It decides the length of the steps.
* Hypothesis: 

$$ h_{\theta}(x) = \theta_{0} + \theta_{1} x $$

* Parameters:
 $$ \theta_{0}, \theta_{1} $$
 
* Cost Function: 
 $$ J(\theta_{0}, \theta_{1}) = \frac{1}{2m} \sum_{i=1}^{m} \left[ h_{\theta}(x^{(i)}) - y^{(i)} \right]^2 $$
 
* Goal: 
 $$ \min_{\theta_{0}, \theta_{1}} J(\theta_{0}, \theta_{1}) $$

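* Slope update (in the code below, `m` is the slope, `b` is the intercept and `n` is the number of samples):

 $$ m = m - \alpha \frac{\partial J}{\partial m} $$

where,
 
 $$ \frac{\partial J}{\partial m} = -\frac{2}{n} \sum_{i=1}^{n} x_i \left[ y_i - (m x_i + b) \right] $$
 
* Intercept update:

 $$ b = b - \alpha \frac{\partial J}{\partial b} $$
 
where,
 
 $$ \frac{\partial J}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} \left[ y_i - (m x_i + b) \right] $$

These are the derivatives of the mean squared error, which is what the code below computes as `cost`:

 $$ J(m, b) = \frac{1}{n} \sum_{i=1}^{n} \left[ y_i - (m x_i + b) \right]^2 $$

It differs from the cost function above only by a constant factor, so both are minimised by the same parameters. Note that `m` here is the slope, not the number of samples, and that the first listing names the intercept `c`.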
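
One way to gain confidence in derivative formulas like these is to compare them with a finite-difference approximation. The sketch below does that for the mean squared error of a line. The helper names (`mse`, `analytic_gradients`, `numeric_gradients`) and the step size `eps` are assumptions introduced for this illustration.

```python
def mse(m, b, xs, ys):
    """Mean squared error of the line y = m*x + b."""
    n = len(xs)
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / n

def analytic_gradients(m, b, xs, ys):
    """Partial derivatives of the MSE with respect to m and b."""
    n = len(xs)
    residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
    dm = -(2 / n) * sum(x * r for x, r in zip(xs, residuals))
    db = -(2 / n) * sum(residuals)
    return dm, db

def numeric_gradients(m, b, xs, ys, eps=1e-6):
    """Central finite differences, used only as a cross-check."""
    dm = (mse(m + eps, b, xs, ys) - mse(m - eps, b, xs, ys)) / (2 * eps)
    db = (mse(m, b + eps, xs, ys) - mse(m, b - eps, xs, ys)) / (2 * eps)
    return dm, db

xs = [1, 2, 3, 4, 5]
ys = [5, 7, 9, 11, 13]

analytic = analytic_gradients(0.0, 0.0, xs, ys)
numeric = numeric_gradients(0.0, 0.0, xs, ys)
print("analytic:", analytic)
print("numeric: ", numeric)
```

At m = 0 and b = 0 both methods give approximately -62 for the slope derivative and -18 for the intercept derivative. The negative signs mean that increasing either parameter lowers the cost at that point.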

```python
import numpy as np

# Gradient descent for simple linear regression
# m = slope, c = intercept

def gradient_descent(x, y):
    m_curr = c_curr = 0
    iterations = 10        # number of update steps
    n = len(x)             # number of samples
    learning_rate = 0.08   # step size (alpha)

    for i in range(iterations):
        y_predicted = m_curr * x + c_curr
        # Mean squared error for the current parameters
        cost = (1/n) * sum([val**2 for val in (y-y_predicted)])
        # Partial derivatives of the cost with respect to m and c
        md = -(2/n)*sum(x*(y-y_predicted))
        cd = -(2/n)*sum(y-y_predicted)
        # Step against the gradient
        m_curr = m_curr - learning_rate * md
        c_curr = c_curr - learning_rate * cd
        print("m {}, c {}, cost {} iteration {}".format(m_curr,c_curr,cost, i))

# x holds the known feature values and y the target values
# that we want to predict from historical data.
x = np.array([1,2,3,4,5])
y = np.array([5,7,9,11,13])

gradient_descent(x,y)
```

```
m 4.96, c 1.44, cost 89.0 iteration 0
m 0.4991999999999983, c 0.26879999999999993, cost 71.10560000000002 iteration 1
m 4.451584000000002, c 1.426176000000001, cost 56.8297702400001 iteration 2
m 0.892231679999997, c 0.5012275199999995, cost 45.43965675929613 iteration 3
m 4.041314713600002, c 1.432759910400001, cost 36.35088701894832 iteration 4
m 1.2008760606719973, c 0.7036872622079998, cost 29.097483330142282 iteration 5
m 3.7095643080294423, c 1.4546767911321612, cost 23.307872849944438 iteration 6
m 1.4424862661541864, c 0.881337636696883, cost 18.685758762535738 iteration 7
m 3.4406683721083144, c 1.4879302070713722, cost 14.994867596913156 iteration 8
m 1.6308855378034224, c 1.0383405553279617, cost 12.046787238456794 iteration 9
```


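**Note:** This is a trial-and-error method: we manually adjust the learning rate and the number of iterations until the cost reaches its minimum. With the values above, the cost is still falling after 10 iterations and the estimates have not yet reached the true values m = 2 and c = 3 (the data follow y = 2x + 3).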
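
To illustrate the effect of the iteration count, the sketch below runs the same update rule in plain Python with the learning rate and number of iterations exposed as arguments. Turning them into arguments and the specific iteration counts tried are assumptions made for this example. The data and the learning rate of 0.08 are the ones used above.

```python
def gradient_descent(xs, ys, learning_rate, iterations):
    """Fit y = m*x + c by gradient descent on the mean squared error."""
    m_curr = c_curr = 0.0
    n = len(xs)
    for _ in range(iterations):
        residuals = [y - (m_curr * x + c_curr) for x, y in zip(xs, ys)]
        md = -(2 / n) * sum(x * r for x, r in zip(xs, residuals))
        cd = -(2 / n) * sum(residuals)
        m_curr -= learning_rate * md
        c_curr -= learning_rate * cd
    return m_curr, c_curr

xs = [1, 2, 3, 4, 5]
ys = [5, 7, 9, 11, 13]   # generated by y = 2x + 3

# Try increasing iteration counts and watch the estimates settle
for iterations in (10, 100, 1000):
    m, c = gradient_descent(xs, ys, learning_rate=0.08, iterations=iterations)
    print("iterations {:>4}: m {:.6f}, c {:.6f}".format(iterations, m, c))
```

With enough iterations the estimates approach m = 2 and c = 3. A learning rate that is much larger than 0.08 would make the updates overshoot and diverge on this data, which is why both settings need tuning together.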

```python
import math

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def predict_using_sklearn():
    # Reference solution: fit the same data with scikit-learn
    df = pd.read_csv("../data/test_scores.csv")
    model = LinearRegression()
    model.fit(df[['math']],df.cs)
    return model.coef_, model.intercept_

def gradient_descent(x,y):
    m_curr = 0
    b_curr = 0
    iterations = 10
    n = len(x)
    learning_rate = 0.0002

    cost_previous = 0

    for i in range(iterations):
        y_predicted = m_curr * x + b_curr
        # Mean squared error for the current parameters
        cost = (1/n)*sum([value**2 for value in (y-y_predicted)])
        # Partial derivatives of the cost with respect to m and b
        md = -(2/n)*sum(x*(y-y_predicted))
        bd = -(2/n)*sum(y-y_predicted)
        m_curr = m_curr - learning_rate * md
        b_curr = b_curr - learning_rate * bd
        # Stop early once the cost no longer changes between iterations
        if math.isclose(cost, cost_previous, rel_tol=1e-20):
            break
        cost_previous = cost
        print("m {}, b {}, cost {}, iteration {}".format(m_curr,b_curr,cost, i))

    return m_curr, b_curr

if __name__ == "__main__":
    df = pd.read_csv("../data/test_scores.csv")
    x = np.array(df.math)
    y = np.array(df.cs)

    m, b = gradient_descent(x,y)
    print('')
    print("Using gradient descent function: Coef {} Intercept {}".format(m, b))

    m_sklearn, b_sklearn = predict_using_sklearn()
    print("Using sklearn: Coef {} Intercept {}".format(m_sklearn,b_sklearn))
```

```
m 1.9783600000000003, b 0.027960000000000002, cost 5199.1, iteration 0
m 0.20975041279999962, b 0.0030470367999999894, cost 4161.482445460163, iteration 1
m 1.7908456142986242, b 0.025401286955264, cost 3332.2237319269248, iteration 2
m 0.37738163667530467, b 0.005499731626422651, cost 2669.4843523161976, iteration 3
m 1.6409848166378898, b 0.023373894401807944, cost 2139.826383775145, iteration 4
m 0.5113514173939655, b 0.0074774305434828076, cost 1716.5264071567592, iteration 5
m 1.5212165764726306, b 0.021771129698498662, cost 1378.2272007804495, iteration 6
m 0.6184191426785134, b 0.009075514323270572, cost 1107.8601808918404, iteration 7
m 1.4254981563597626, b 0.020507724625171385, cost 891.7842215178443, iteration 8
m 0.7039868810749315, b 0.010370210797388455, cost 719.0974036421305, iteration 9

Using gradient descent function: Coef 0.7039868810749315 Intercept 0.010370210797388455
Using sklearn: Coef [1.01773624] Intercept 1.9152193111569176
```


## Logistic Regression
What is the difference between `Linear Regression` and `Logistic Regression`?

| Linear Regression | Logistic Regression |
| --- | --- |
| Requires well-labeled data meaning it needs `supervision`. | Requires well-labeled data meaning it needs `supervision`. |
| The prediction is usually a value that can range from negative infinity to positive infinity. | The prediction of logistic regression lies in the range from zero to one. This allows for easy classification with the help of a threshold value. |
| Linear regression requires `no activation function`. | Here we need an activation function. In this case, that function is the `sigmoid function`. |
| There is `no threshold value` in linear regression. | A `threshold value is needed` to determine the classes of each instance properly. |
 
* The goal of logistic regression is to split the data points so that the class of a given observation can be predicted accurately from the information present in its features. The `Decision Boundary` splits the data into two parts.
* Gradient descent uses the same update equation, but the hypothesis function is different: the linear combination of the features is passed through the sigmoid function.
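
The points above can be sketched for a single feature in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes the log-loss (cross-entropy) cost, which the text does not name, and the function names, the tiny dataset, the learning rate and the iteration count are all hypothetical.

```python
import math

def sigmoid(z):
    """Activation function that maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, iterations=1000):
    """Gradient descent for h(x) = sigmoid(w*x + b), assuming log-loss cost."""
    w = b = 0.0
    n = len(xs)
    for _ in range(iterations):
        # Prediction error for each sample under the sigmoid hypothesis
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        dw = sum(e * x for e, x in zip(errors, xs)) / n
        db = sum(errors) / n
        # Same update rule as before: step against the gradient
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b

def predict(w, b, x, threshold=0.5):
    """Turn the predicted probability into a class using a threshold."""
    return 1 if sigmoid(w * x + b) >= threshold else 0

# Hypothetical one-feature data: negative values are class 0, positive class 1
xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
predictions = [predict(w, b, x) for x in xs]
print("w {:.3f}, b {:.3f}, predictions {}".format(w, b, predictions))
```

The decision boundary is the value of x where `w * x + b` equals zero, which is where the predicted probability crosses the 0.5 threshold.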