# Linear Regression

- Introduction
- Cost Function
- Gradient Descent

## Linear Regression Introduction

Linear regression is a supervised learning method (one where you have the input data along with the correct output values) for finding a linear relationship between the independent (x) variables and the dependent (y) variable. It follows these basic steps:

1. Plot the dependent variable against the independent variable
2. Come up with a starting hypothesis function and measure how well it fits your data, similar to a line of best fit
3. Find the errors between the real data and your predicted points, and change the line's parameters to reduce the total error
4. Repeat step 3 until the line fits the data as well as it can


At its core, linear regression follows a very simple idea. We start off with a hypothesis function and proceed to measure how much our hypothesis differs from the actual points. Using calculus and derivatives, we try to decrease this error until we finally reach a low point. 

Linear regression can use one input variable or multiple input variables, but the methodology is very similar. For our purposes, we will focus on univariate linear regression, which uses one x variable and one y variable.
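
The text does not spell out the hypothesis function itself, so the sketch below assumes the usual straight-line form for the univariate case: an intercept θ<sub>0</sub> plus a slope θ<sub>1</sub> times x. The function name `hypothesis` and the parameter values are hypothetical, chosen only for illustration.

```python
def hypothesis(theta0, theta1, x):
    """Predict y for one input x using a straight line.

    theta0 is the intercept and theta1 is the slope (assumed form).
    """
    return theta0 + theta1 * x


# Hypothetical parameters: intercept 1.0, slope 2.0.
predictions = [hypothesis(1.0, 2.0, x) for x in [0.0, 1.0, 2.0]]
print(predictions)  # [1.0, 3.0, 5.0]
```

Changing `theta0` shifts the line up or down, and changing `theta1` tilts it; these are the two values that the rest of this notebook tries to tune.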


<img src="images/generalGraph.png">

NOTE: Linear regression is an algorithm applicable to many different types of machine learning programs. However, since it is an algorithm, this notebook will focus more on the mathematical approach than the coding aspect, though there is a short example project in the notebook LinearRegressionProject.ipynb. 

To use linear regression with machine learning, there are two main ideas: the cost function and gradient descent. The cost function is essentially a method to measure the accuracy of your hypothesis function by comparing its predicted values to the true data points. Gradient descent is a method to decrease the error margin. 

There are multiple alternative methods possible with linear regression, such as the normal equation. However, the cost function and gradient descent together are among the best to visualize and understand the workings behind Machine Learning, so we will focus on these. 

## Cost Function

The cost function is a method to measure the accuracy of your hypothesis function. It essentially averages the squared differences between the hypothesis's predictions for each input x and the actual outputs y.

Here is the cost function: 

![](https://drive.google.com/uc?export=download&id=1XafsTRf_j6o3ZN5_R1sLKt2950Dm-IXV)

Here is what you should actually focus on: 

<img src="images/focusingCostFunction.png">

### Dissecting the cost function

<b>m:</b> the number of entries in our dataset

<b>i:</b> which entry in the dataset we are on

<b>h<sub>θ</sub>(x<sub>i</sub>):</b> the y value that our hypothesis function predicts for entry i

<b>y<sub>i</sub>:</b> the true y value from the dataset 

<b>REAL COST FUNCTION:</b> J(θ<sub>0</sub>, θ<sub>1</sub>) = 1/2 * average of (squared difference between predicted output and real output)

Within the parentheses, we see h<sub>θ</sub>(x<sub>i</sub>), which represents the y value that our hypothesis function predicts, and y<sub>i</sub>, which represents the true y value. Subtracting the two tells us exactly how wrong our prediction was. We then square this value, which makes every error non-negative (so positive and negative errors cannot cancel each other out) and keeps the function smooth, which matters once we take derivatives. Using the sigma (Σ, the summation symbol), we perform this operation for every point in our dataset and add the results together. We then multiply this sum by 1/m, which gives the average squared error of our predictions. Finally, we multiply by 1/2. This is purely a convenience: it cancels the factor of 2 that appears when the squared term is differentiated during gradient descent (which we'll cover next).
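
The description above can be sketched in a few lines of Python. This is a minimal illustration rather than the notebook's official implementation: it assumes a straight-line hypothesis with intercept `theta0` and slope `theta1`, and the function name `cost` and the small dataset are made up for the example.

```python
def cost(theta0, theta1, xs, ys):
    """Half the average squared error of a straight-line hypothesis."""
    m = len(xs)  # number of entries in the dataset
    total = 0.0
    for x, y in zip(xs, ys):
        error = (theta0 + theta1 * x) - y  # predicted y minus true y
        total += error ** 2                # square so errors cannot cancel
    return total / (2 * m)                 # the 1/m average, times 1/2


# Hypothetical dataset that lies exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(cost(0.0, 2.0, xs, ys))  # perfect fit, so the cost is 0.0
print(cost(0.0, 1.0, xs, ys))  # worse fit, so the cost is larger
```

A lower cost means the line sits closer to the data, which is why the next section is about finding the parameters that make this number as small as possible.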

## Gradient Descent

At this point, we have a hypothesis function and can measure its fit to our given data. Our next goal is to decrease our error and home in on the best parameters for our hypothesis function. We can do this using gradient descent.

We first graph the cost function over the parameters θ<sub>0</sub> and θ<sub>1</sub> of our hypothesis function. It is important to note that we are not graphing x and y themselves, but the range of parameters for our hypothesis function and the cost that results from selecting a particular set of parameters. We put θ<sub>0</sub> on the x axis and θ<sub>1</sub> on the y axis, with the cost on the vertical z axis. Each point on the graph is the value of the cost function when the hypothesis uses those specific theta parameters. The graph below depicts such a setup.

![](https://drive.google.com/uc?export=download&id=1gK9su_V0ADoDiFvbtuTGM9QmP3Ujiw7y)
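
To make the idea of a cost surface concrete, the sketch below evaluates the cost at every combination of a few candidate θ<sub>0</sub> and θ<sub>1</sub> values and picks the lowest point. It assumes a straight-line hypothesis and the half-average-squared-error cost; the dataset, the grid range, and the names `cost`, `grid`, and `surface` are all hypothetical.

```python
def cost(theta0, theta1, xs, ys):
    """Half the average squared error of a straight-line hypothesis."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Hypothetical dataset that lies exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

# Candidate parameter values from -3.0 to 3.0 in steps of 0.5.
grid = [i * 0.5 for i in range(-6, 7)]

# One cost value per (theta0, theta1) pair: these are the heights of the surface.
surface = {(t0, t1): cost(t0, t1, xs, ys) for t0 in grid for t1 in grid}

# The lowest point on the grid.
best = min(surface, key=surface.get)
print(best, surface[best])  # (1.0, 2.0) 0.0
```

Checking every grid point only works for tiny problems. Gradient descent reaches the low point without evaluating the whole surface.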



Our end goal is to get to the very bottom of the pits in our graph, which represent the minimum possible error. The red arrows show the minimum points in the graph.

We do this by taking the derivative of our cost function. You don't need to know calculus for this course; for our purposes, the derivative tells you which direction to move in to decrease your error. We take steps down the cost function in the direction of steepest descent. The size of each step is controlled by the parameter α, which is called the learning rate.
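
As a rough sketch of a single step, the code below computes the derivative of the cost with respect to each parameter and then moves both parameters a small distance downhill. It assumes a straight-line hypothesis and the half-average-squared-error cost; the names `gradient` and `step`, the dataset, and the learning rate of 0.1 are hypothetical choices for illustration.

```python
def gradient(theta0, theta1, xs, ys):
    """Derivatives of the cost with respect to theta0 and theta1."""
    m = len(xs)
    errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
    d_theta0 = sum(errors) / m                             # average error
    d_theta1 = sum(e * x for e, x in zip(errors, xs)) / m  # average error times x
    return d_theta0, d_theta1


def step(theta0, theta1, xs, ys, alpha):
    """Take one gradient descent step of size alpha (the learning rate)."""
    d_theta0, d_theta1 = gradient(theta0, theta1, xs, ys)
    # Move against the derivative, because the derivative points uphill.
    return theta0 - alpha * d_theta0, theta1 - alpha * d_theta1


# Hypothetical dataset that lies exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(gradient(0.0, 0.0, xs, ys))         # both derivatives are negative here
print(step(0.0, 0.0, xs, ys, alpha=0.1))  # so both parameters increase
```

A larger `alpha` takes bigger steps and can overshoot the low point, while a smaller one is safer but needs more steps.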

With each run through, we test the hypothesis's error margin using the cost function. Using the derivative, we then determine how to decrease this error margin. We run through this process over and over again until we reach a low point. 

<img src="images/gradientGraph.png">

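One problem is that, depending on where one starts on the graph, one could end up at different points. The image above shows two different starting points that end up in two different places. You select your starting point on a case-by-case basis. If there are multiple local minima, you can visualize the cost surface or try many different starting points and keep the best result. For linear regression with the squared-error cost described above, the cost surface is a single bowl with only one minimum, so this concern mostly applies to more complex models. An alternative method such as the normal equation can also solve for the linear regression parameters directly, without needing a starting point.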

### Putting the cost function and gradient descent together

We can put gradient descent and the cost function together through a new equation, listed below. 

![](https://drive.google.com/uc?export=download&id=1SUEiwPX_Eh-AQdjl_ylv9Au_ypWiXZIj)

As you can see, the cost function makes up a big part of the new equation. The section obtained from the cost function tells us, on average, how far and in which direction the hypothesis misses the true points. We then multiply that by the learning rate and subtract the result from the old parameter, which gives you a new parameter with a lower error. By repeating this until convergence (reaching a low point), you are able to gradually increase the accuracy of your function.
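
Putting the pieces together, here is a minimal sketch of the whole loop. It is written under several assumptions: a straight-line hypothesis, the half-average-squared-error cost, a starting point of (0, 0), and a stopping rule that treats "convergence" as the parameters changing by less than a small tolerance. The function name `gradient_descent`, its default values, and the dataset are hypothetical.

```python
def gradient_descent(xs, ys, alpha=0.1, max_iters=10000, tol=1e-12):
    """Fit theta0 (intercept) and theta1 (slope) by repeated gradient steps."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # assumed starting point
    for _ in range(max_iters):
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        d_theta0 = sum(errors) / m
        d_theta1 = sum(e * x for e, x in zip(errors, xs)) / m

        # Compute both new values from the old ones before overwriting either.
        new_theta0 = theta0 - alpha * d_theta0
        new_theta1 = theta1 - alpha * d_theta1

        converged = (abs(new_theta0 - theta0) < tol
                     and abs(new_theta1 - theta1) < tol)
        theta0, theta1 = new_theta0, new_theta1
        if converged:
            break
    return theta0, theta1


# Hypothetical dataset that lies exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 3), round(theta1, 3))  # close to intercept 1 and slope 2
```

The learning rate that works depends on the data: with larger x values, 0.1 may be too big and the parameters can grow without bound instead of settling.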