# Cost Function and Gradient Descent

## Linear Regression

For an input vector $X^T = (X_1, X_2, \dots, X_p)$ and a real-valued output $y$, the linear regression model
has the form:

$$f(X) = \beta_0 + \sum_{j=1}^pX_j\beta_j$$

We consider a set of training data $(x_1 , y_1 ) \dots (x_N , y_N )$ from which to estimate the parameters $\beta$.

In [None]:
from IPython.display import Image  # needed for the Image(...) calls in this notebook
Image(filename='../images/lr.png', width=600)
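
As a small illustration of the model equation, the sketch below evaluates $f(X)$ for two made-up observations. The coefficient values and inputs are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical coefficients: beta_0 (intercept) followed by beta_1, beta_2.
beta = np.array([1.0, 2.0, -0.5])

# Two hypothetical observations with p = 2 features each.
X = np.array([[3.0, 4.0],
              [0.0, 2.0]])

# f(X) = beta_0 + sum_j X_j * beta_j, evaluated for every row at once.
f = beta[0] + X @ beta[1:]
print(f)
```

The matrix product handles the sum over $j$, so the same line works for any number of features.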

## Cost function

A cost function is a measure of how wrong the model is in terms of its ability to estimate the relationship between X (independent variables) and y (dependent variable).

It is built from the difference between the actual value $y_t$ and the predicted value $y_p$, written here as the error $e$:

$$
y_t = y_p + e
$$

In the case of linear regression with two features,
$$y_p = f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2$$

the error is

$$
e = y_t - y_p = y_t - (\beta_0 + \beta_1X_1 + \beta_2X_2)
$$

In [None]:
Image(filename='../images/error.png', width=600)

### Types of Cost Function (Just for Introduction)

### Sum of Squares
$$\sum_{i=1}^n e_i^2 = \sum_{i=1}^n \left(y_{t,i} - \beta_0 - \beta_1X_{i1} - \beta_2X_{i2}\right)^2$$

where $i$ indexes the $n$ observations.
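
A minimal sketch of the sum-of-squares computation, using hypothetical actual and predicted values (not taken from any dataset in this notebook):

```python
import numpy as np

y_t = np.array([3.0, 5.0, 7.0])   # hypothetical actual values
y_p = np.array([2.5, 5.0, 8.0])   # hypothetical predicted values

e = y_t - y_p                     # per-observation errors
sse = np.sum(e ** 2)              # sum of squared errors
print(sse)
```

Squaring makes positive and negative errors count equally, and the same quantity can be written as the dot product `e @ e`.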

# (Vanilla) Gradient descent

- Gradient Descent is an **optimisation algorithm** that attempts to find a **local or global minimum** of a function by using its **partial derivatives**. In ML it is used to __optimise the cost function__ (reduce our error).

- This algorithm is commonly used to optimise convex functions. The idea is very simple: to reach the global optimum of a convex function from any point, we move in the direction **opposite** to that of greatest increase of the function. Because the function is convex, this strategy takes us to the global optimum, provided the step size is small enough.

 - A gradient is an extension of partial derivatives: it takes the partial derivative of the function with respect to each variable and places them in a vector. The gradient is zero at a local maximum or local minimum (because there is no direction of increase there). When the gradient becomes approximately zero, the updates stop changing the parameters; this is referred to as **convergence**.

 - In mathematics, the **gradient is a multi-variable generalization of the derivative**. While a derivative can be defined on functions of a single variable, for functions of several variables, the gradient takes its place. The gradient is a vector-valued function, as opposed to a derivative, which is scalar-valued.

 - **Gradient** of a function gives the direction of the **steepest ascent**, i.e. the direction to move if you want to increase the function. 

Now, the _direction_ of **greatest increase of the function** is determined by taking the partial derivatives of the function with respect to every variable. For example, let us say we wish to optimise a convex function $f(x)$, where $x = (x_1, x_2, \ldots, x_d)^T$. Let us say, we start at a point $s \in \mathbb{R}^d$. Then, the direction of greatest increase of $f$ at $s$ is given by 

$$
\nabla f(s) = \left(\frac{\partial f(s)}{\partial x_1}, \frac{\partial f(s)}{\partial x_2}, \ldots ,\frac{\partial f(s)}{\partial x_d} \right)^T.
$$

 - Since we want to **decrease the function**, we take the **negative gradient**. The length of the gradient vector is an indication of how steep the slope is.
 
With this gradient, we design an update rule that moves us in the direction opposite to the direction of greatest increase. By repeatedly applying this rule, we hope to reach the global minimum. From $s$ we hence move to $t$, which is given by

$$
t = s - \gamma \nabla f(s),
$$
where $\gamma$ is a parameter called the *learning rate*.

In [None]:
Image(filename='../images/gd1.png', width=600)
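
The update rule $t = s - \gamma \nabla f(s)$ can be sketched on a one-dimensional convex function. The function $f(x) = (x - 3)^2$, the starting point, and the learning rate below are illustrative assumptions, not part of the regression problem that follows.

```python
def f(x):
    # Convex function with its minimum at x = 3.
    return (x - 3.0) ** 2

def grad_f(x):
    # Derivative of f with respect to x.
    return 2.0 * (x - 3.0)

gamma = 0.1   # learning rate (assumed value)
s = 0.0       # starting point (assumed value)

for step in range(100):
    # Move against the gradient, scaled by the learning rate.
    s = s - gamma * grad_f(s)

print(s, f(s))
```

Each update multiplies the distance to the minimum by $1 - 2\gamma$, so the iterates shrink towards $x = 3$ when $0 < \gamma < 1$ and grow without bound when $\gamma > 1$. This is why the learning rate has to be small enough.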

### Let's create a dataset with 50 observations and 2 features per observation

In [None]:
import numpy as np
import pandas as pd
feature_1=np.random.randint(low=1,high=20,size=(50,))
feature_2=np.random.randint(low=1,high=20,size=(50,))

In [None]:
feature_1.shape, feature_2.shape

In [None]:
# true relationship: intercept 3, coefficients 2 and -4, plus uniform noise in [0, 1)
y_true=3+2*feature_1-4*feature_2+np.random.random((50,))
print(y_true.shape)
y_true

In [None]:
X=pd.DataFrame({'intercept':np.ones_like(feature_1),'feature_1':feature_1,'feature_2':feature_2})
X.head()

In [None]:
W=np.zeros(X.shape[1])
W

In [None]:
def myprediction(features,weights):
    predictions=np.dot(features,weights)
    return(predictions)

y_pred = myprediction(X,W)
y_pred

In [None]:
def myerror(y_true,y_pred):
    error=y_true - y_pred
    return(error)

myerror(y_true,y_pred)

In [None]:
def mycost(y_true,y_pred):
    error=myerror(y_true,y_pred)
    cost=np.dot(error.T,error)
    return(cost)

mycost(y_true,y_pred)
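
The cost computed by `mycost` can be written in matrix form, which makes its gradient easy to derive. In this sketch, $X$ is the feature matrix (including the intercept column), $W$ the weight vector, and $y$ the vector of true values; the symbol $J$ is introduced here only for illustration.

$$
J(W) = (y - XW)^T(y - XW) = y^Ty - 2W^TX^Ty + W^TX^TXW
$$

$$
\nabla_W J(W) = -2X^Ty + 2X^TXW = -2X^T(y - XW)
$$

Setting this gradient to zero gives the normal equations $X^TXW = X^Ty$, whose solution is the least-squares estimate that gradient descent approaches iteratively.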

In [None]:
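def gradient(y_true,features,weights):
    
    y_pred = myprediction(features,weights)
    error=myerror(y_true,y_pred)    
    # computes -X^T(y-XW); the exact gradient of the sum-of-squares cost is -2X^T(y-XW),
    # so the constant factor 2 is left out here and effectively absorbed into the learning rate
    grad=-np.dot(features.T,error)
    
    return(grad)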

gradient(y_true,X,W)
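
One way to gain confidence in an analytic gradient is to compare it with a finite-difference approximation. The sketch below does this for the sum-of-squares cost on a small randomly generated problem. The data, the seed, and the helper name `cost` are assumptions made for this example, and the analytic expression keeps the factor 2 that the `gradient` function above leaves out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small hypothetical problem: 5 observations, intercept column plus 2 features.
X = np.column_stack([np.ones(5), rng.integers(1, 20, size=(5, 2))])
y = rng.random(5)
W = rng.random(3)

def cost(W):
    # Sum of squared errors for weights W.
    e = y - X @ W
    return e @ e

# Analytic gradient of the sum-of-squares cost.
analytic = -2 * X.T @ (y - X @ W)

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(W)
for j in range(W.size):
    step = np.zeros_like(W)
    step[j] = eps
    numeric[j] = (cost(W + step) - cost(W - step)) / (2 * eps)

print(np.allclose(analytic, numeric, rtol=1e-4))
```

If the two vectors disagree, the usual suspects are a sign error or a missing constant factor in the analytic expression.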

In [None]:
def lr_fit(y_true,features,learning_rate):
    
    weights=np.zeros(features.shape[1])
    
    for i in np.arange(30000):
        
        # update rule: step against the gradient, scaled by the learning rate
        weights = weights - learning_rate*gradient(y_true,features,weights)         
#        weights[0] = weights[0] - 10*learning_rate*gradient(y_true,features,weights)[0]
        
        # report the cost and current weights every 1000 iterations
        if i%1000==0:
            y_pred = myprediction(features,weights)
            print(mycost(y_true,y_pred),weights)
            
    return(weights)

lr_fit(y_true,X,0.0001)

### Let's implement the same model using scikit-learn

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr=LinearRegression()

In [None]:
X.shape

In [None]:
lr.fit(X.iloc[:,1:],y_true)

In [None]:
lr.coef_

In [None]:
lr.intercept_
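
As a further cross-check, the least-squares solution can also be computed directly with `np.linalg.lstsq`. This is a sketch on freshly generated data that follows the same recipe as above; the seed and variable names are assumptions, so the numbers will not match the earlier cells exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same recipe as the notebook's data: two integer features and uniform noise.
f1 = rng.integers(1, 20, size=50)
f2 = rng.integers(1, 20, size=50)
y = 3 + 2 * f1 - 4 * f2 + rng.random(50)

# Design matrix with an explicit intercept column.
X = np.column_stack([np.ones(50), f1, f2])

# Direct least-squares solution of X @ beta ~ y.
beta, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(beta)
```

Because the noise is uniform on $[0, 1)$ with mean 0.5, the estimated intercept is expected to land near 3.5 rather than 3, while the two slopes should be close to 2 and -4.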