# Linear Regression Using Gradient Descent

**Linear regression** is a simple yet powerful technique that helps us understand and predict the relationship between variables. It tries to draw a straight line through a set of data points, showing how the input affects the output. Once the line is created, we can use it to predict future outcomes based on new data.

### Example:
Imagine you're trying to predict **house prices** based on the **size of the house**. By looking at previous house sales, we can see how size affects price. For example:

| House Size (sq.ft) | House Price ($1000) |
| ------------------ | ------------------- |
| 950                | 210                 |
| 1600               | 300                 |
| 2200               | 320                 |
| 2900               | 360                 |
| 3200               | 450                 |

With this dataset, linear regression creates a **best-fit line** through the data points, allowing us to estimate outcomes for new inputs. For example, if we want to predict the price of a house that is 1800 sq.ft, the model uses the best-fit line to estimate the price based on the relationship it has learned.

The graph below shows the data points along with the best-fit line, which can be used to predict the prices of houses for sizes not included in the original data.

![Linear Regression](images/Linear_Regression_Example1.png)

This method is widely used in many fields like predicting house prices, forecasting stock market trends, estimating sales figures, and even predicting medical outcomes. It helps us make well-informed decisions using past data.

---
---

## The Equation of the Best-fit Line

The equation of the best-fit line can be written in several ways, depending on the notation used:

- **y = mx + c**
- **y = $β_0$ + $β_1$ x**
- **$h_θ$(x) = $θ_0$ + $θ_1$ x**

We will focus on the equation:  
> **$h_θ$(x) = $θ_0$ + $θ_1$ x**  

Here's what each term means:
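- **$h_θ(x)$** is the **hypothesis**, the predicted output (house price in this case) for a given input x.
- **$θ_0$** is the **intercept**, where the line crosses the y-axis.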
- **$θ_1$** is the **slope** of the line, which shows how much y (price) changes with a one-unit increase in x (size).
- **x** is the **input variable** (house size in this case).

In simple terms, this equation shows how changes in the input variable affect the output.
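
As a minimal sketch of the hypothesis, the equation can be written as a one-line function. The name `predict` and the parameter values below are assumptions chosen only for illustration; they are not fitted to the dataset.

```python
def predict(theta0, theta1, x):
    """Hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


# Hypothetical parameters (not fitted values): a base price of 100 ($1000)
# plus 0.1 ($1000) for every additional square foot.
theta0, theta1 = 100.0, 0.1

# Size is in sq.ft and the returned price is in $1000, as in the table above.
price = predict(theta0, theta1, 1800)
print(price)
```

With these made-up parameters, the 1800 sq.ft house is priced at about 280 (thousand dollars). Finding good values for the two parameters is what the rest of the document is about.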

---
---

## Aim

The goal of **linear regression** is to find the **best-fit line** that minimizes the difference between the actual data points and the predicted values (i.e., the distance between the points and the line). Instead of trying all possible lines, we start with an initial guess and adjust the line to reduce the error. 

We use a **cost function** to measure this error. By minimizing the cost function, we ensure that the line passes as close as possible to the data points.

---
---

## Cost Function

> $J(θ_0, θ_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_θ(x_i) - y_i \right)^2$

This is the **cost function** for linear regression.

### Breaking it down:

1. **J($θ_0$, $θ_1$)**     : This is the cost function that measures how well the line fits the data. The smaller the cost, the better the fit.
2. **$\frac{1}{2m}$**      : This averages the total squared error over the **m** data points. The extra factor of **2** is a convenience: it cancels the 2 that appears when the squared term is differentiated, and it does not change which line minimizes the cost.
3. **$\sum_{i=1}^{m}$**    : The summation symbol means we add up the errors for all **m** data points.
4. **$(h_θ(x_i) - y_i)^2$**: This is the **squared error** for each data point, where the error is the vertical distance between the data point and the line.
   - **$h_θ(x_i)$** is the model's prediction for the $i$th data point.
   - **$y_i$** is the actual value for the $i$th data point.
   - The difference **$(h_θ(x_i) - y_i)$** is the error. Squaring it makes every term non-negative, so negative errors cannot cancel out positive ones, and it penalizes larger errors more heavily.
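
The cost function can be sketched directly from the formula. The function name `compute_cost` and the two candidate lines compared below are assumptions made for illustration; the data comes from the house price table.

```python
def compute_cost(theta0, theta1, xs, ys):
    """Cost J(theta0, theta1) = 1/(2m) * sum of squared errors."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        error = (theta0 + theta1 * x) - y  # h_theta(x_i) - y_i
        total += error ** 2
    return total / (2 * m)


sizes = [950, 1600, 2200, 2900, 3200]   # sq.ft
prices = [210, 300, 320, 360, 450]      # $1000

# Two hypothetical candidate lines: a flat line at zero, and a rough guess.
cost_flat = compute_cost(0.0, 0.0, sizes, prices)
cost_guess = compute_cost(100.0, 0.1, sizes, prices)
print(cost_flat, cost_guess)
```

The rough guess has a far lower cost than the flat line, which matches the intuition that a smaller cost means a better fit.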

---
---

## Gradient Descent

Now that we understand our goal is to **minimize the cost function**, the next question is: **How do we find the values of $θ_0$ and $θ_1$ that yield the best-fit line?**

The answer lies in **Gradient Descent**, a powerful optimization algorithm used to minimize the cost function and determine the best-fit line.

### Understanding Gradient Descent
To grasp the concept, imagine standing on a hill and aiming to reach the lowest point (minimize the cost). Here’s how it works:

- **Direction of the Slope**: At each step, you move in the direction of steepest descent, which is the opposite of the gradient (the gradient itself points uphill).
- **Step Size**: If your steps are too large, you might overshoot the minimum point; if they’re too small, it could take too long to reach your goal.

Gradient Descent supplies the direction of each step from the gradient of the cost function, while the step size is controlled by a learning rate that we choose.

![Gradient Descent Visualization](images/Gradient_Descent_1.png)  
*The graph above illustrates how gradient descent navigates the landscape of the cost function.*

---

### Gradient Descent Formula

When we want to find the best-fit line in linear regression, we need to tweak our parameters, $θ_0$ and $θ_1$. The **Gradient Descent** algorithm helps us do just that by updating these parameters in a systematic way.

Here’s how it works:

1. **Updating $θ_0$** (the intercept):
   $$
   θ_0 := θ_0 - α \frac{\partial J(θ_0, θ_1)}{\partial θ_0}
   $$

2. **Updating $θ_1$** (the slope):
   $$
   θ_1 := θ_1 - α \frac{\partial J(θ_0, θ_1)}{\partial θ_1}
   $$

#### Breaking It Down:

- **Learning Rate (α)**: This is like the size of your steps when you're walking downhill. If your steps are too big, you might miss the lowest point and go too far. If they’re too small, you might take forever to get there. The learning rate controls how fast you move towards the best-fit line.

- **Cost Function $(J(θ_0, θ_1))$**: Think of this as a measure of how far off your predictions are from the actual data. Our goal is to minimize this cost function, which means we want our predictions to be as close to the actual values as possible.

- **Partial Derivatives $(\frac{\partial J(θ_0, θ_1)}{\partial θ_0}$ and $\frac{\partial J(θ_0, θ_1)}{\partial θ_1})$**: These represent the slopes or gradients of the cost function with respect to the parameters. In simple terms, they tell us how much the cost function changes as we adjust each parameter.
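
To see the effect of the learning rate in isolation, here is a small sketch on a toy one-parameter cost, $J(θ) = θ^2$, whose gradient is $2θ$ and whose minimum is at $θ = 0$. The toy cost, the function name `descend`, and the three learning rates are assumptions picked for illustration; they are not part of the regression problem itself.

```python
def descend(alpha, theta=10.0, steps=20):
    """Run gradient descent on the toy cost J(theta) = theta**2."""
    for _ in range(steps):
        gradient = 2 * theta              # dJ/dtheta for J = theta^2
        theta = theta - alpha * gradient  # the update rule from above
    return theta


theta_small = descend(alpha=0.001)  # tiny steps: barely moves away from 10
theta_good = descend(alpha=0.1)     # moderate steps: approaches the minimum at 0
theta_large = descend(alpha=1.1)    # oversized steps: overshoots and diverges

print(theta_small, theta_good, theta_large)
```

After the same number of steps, the small rate is still near the starting point, the moderate rate is close to zero, and the large rate has moved further from the minimum than where it started.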

---

### Deriving the Gradients of the Cost Function:

Differentiating $J(θ_0, θ_1)$ with respect to each parameter gives:

1. $$ \frac{\partial J(θ_0, θ_1)}{\partial θ_0} = \frac{1}{m} \sum_{i=1}^{m} \left( h_θ(x_i) - y_i \right) $$

2. $$ \frac{\partial J(θ_0, θ_1)}{\partial θ_1} = \frac{1}{m} \sum_{i=1}^{m} \left( h_θ(x_i) - y_i \right) x_i $$

#### Explanation:
- $h_θ(x_i)$ is the hypothesis function; here $h_θ(x_i) = θ_0 + θ_1 x_i$.
- Each sum, divided by $m$, is the average gradient over the $m$ examples. The factor $\frac{1}{2}$ in the cost function cancels the 2 produced by differentiating the squared term. The expression for $θ_1$ carries an extra factor of $x_i$ because, by the chain rule, $\frac{\partial h_θ(x_i)}{\partial θ_1} = x_i$, whereas $\frac{\partial h_θ(x_i)}{\partial θ_0} = 1$.
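
One way to gain confidence in these two formulas is to compare them with a numerical estimate of the slope of the cost function. The sketch below does that; the function names, the starting parameters, and the choice to express house size in thousands of sq.ft are all assumptions made for illustration.

```python
def compute_cost(theta0, theta1, xs, ys):
    """Cost J = 1/(2m) * sum of squared errors."""
    m = len(xs)
    return sum(((theta0 + theta1 * x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def compute_gradients(theta0, theta1, xs, ys):
    """Partial derivatives of J with respect to theta0 and theta1."""
    m = len(xs)
    errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m
    return grad0, grad1


sizes_k = [0.95, 1.6, 2.2, 2.9, 3.2]  # house size in thousands of sq.ft (assumed scaling)
prices = [210, 300, 320, 360, 450]    # $1000
theta0, theta1 = 100.0, 50.0          # arbitrary point at which to check the formulas

grad0, grad1 = compute_gradients(theta0, theta1, sizes_k, prices)

# Central finite differences: nudge one parameter at a time and measure the change in J.
eps = 1e-4
numeric0 = (compute_cost(theta0 + eps, theta1, sizes_k, prices)
            - compute_cost(theta0 - eps, theta1, sizes_k, prices)) / (2 * eps)
numeric1 = (compute_cost(theta0, theta1 + eps, sizes_k, prices)
            - compute_cost(theta0, theta1 - eps, sizes_k, prices)) / (2 * eps)

print(grad0, numeric0)
print(grad1, numeric1)
```

Both gradients are negative at this point, which tells gradient descent to increase $θ_0$ and $θ_1$ in order to lower the cost.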

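Substituting these partial derivatives into the update rules gives the final form, where both parameters are updated simultaneously using the current values of $θ_0$ and $θ_1$: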

1. $$ θ_0 := θ_0 - α \frac{1}{m} \sum_{i=1}^{m} \left( h_θ(x_i) - y_i \right) $$
2. $$ θ_1 := θ_1 - α \frac{1}{m} \sum_{i=1}^{m} \left( h_θ(x_i) - y_i \right) x_i $$
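
Putting the two update rules into a loop gives a complete, if minimal, training procedure. The sketch below is one possible implementation under stated assumptions: the function name `gradient_descent`, the learning rate, the iteration count, the zero starting point, and the rescaling of house size to thousands of sq.ft are all choices made here, not values from the text above.

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=5000):
    """Fit theta0 and theta1 by repeatedly applying the two update rules."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # initial guess
    for _ in range(iterations):
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old values.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Assumed scaling: sizes in thousands of sq.ft, so the inputs are of order 1.
sizes_k = [s / 1000 for s in [950, 1600, 2200, 2900, 3200]]
prices = [210, 300, 320, 360, 450]  # $1000

theta0, theta1 = gradient_descent(sizes_k, prices)
predicted = theta0 + theta1 * 1.8  # the 1800 sq.ft house from the example
print(theta0, theta1, predicted)
```

With these settings, the fit settles at roughly $θ_0 ≈ 131.4$ and $θ_1 ≈ 90.6$, which puts the 1800 sq.ft house at about 294 (thousand dollars). The rescaling matters: with raw square-foot values the gradients become very large, and a learning rate of 0.1 would make the updates diverge instead of converge.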

---
---

## Conclusion

Linear regression is a foundational technique in machine learning and statistics. By fitting a straight line through the data using **Gradient Descent**, we minimize the error and find the best relationship between the input and output variables. Here's a quick recap of the key steps:

1. **Define the hypothesis**: Use the equation $h_θ(x) = θ_0 + θ_1 x$ to predict outcomes.
2. **Use the cost function**: Minimize the squared error between predicted and actual values.
3. **Apply Gradient Descent**: Update the parameters ($θ_0$, $θ_1$) iteratively to reduce the error.
4. **Repeat**: Continue updating until the cost stops decreasing noticeably, which indicates that the parameters have converged to the minimum.
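
The "repeat" step needs a stopping rule in practice. One common option, sketched below, is to stop once the cost improves by less than a small tolerance between iterations. The function names, the tolerance `tol`, the iteration cap `max_iters`, and the rescaling of house size to thousands of sq.ft are assumptions made for this sketch.

```python
def cost(theta0, theta1, xs, ys):
    """Cost J = 1/(2m) * sum of squared errors."""
    m = len(xs)
    return sum(((theta0 + theta1 * x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def fit_until_converged(xs, ys, alpha=0.1, tol=1e-10, max_iters=100_000):
    """Run gradient descent until the cost improves by less than tol."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    previous = cost(theta0, theta1, xs, ys)
    for iteration in range(1, max_iters + 1):
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
        current = cost(theta0, theta1, xs, ys)
        if previous - current < tol:  # improvement too small to matter: stop
            return theta0, theta1, iteration
        previous = current
    return theta0, theta1, max_iters  # cap reached without meeting the tolerance


sizes_k = [s / 1000 for s in [950, 1600, 2200, 2900, 3200]]  # thousands of sq.ft
prices = [210, 300, 320, 360, 450]                            # $1000

theta0, theta1, steps_taken = fit_until_converged(sizes_k, prices)
print(theta0, theta1, steps_taken)
```

The iteration cap is a safeguard: if the learning rate is poorly chosen and the cost never settles, the loop still terminates.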

---
---