# Simple Linear Regression - Cost Function and Gradient Descent


## Summary

* The goal of simple linear regression is to find the **best fit line** by minimizing the error between predicted and actual values
* The **cost function** used is **Mean Squared Error (MSE)**, written as J(θ₀, θ₁) = (1/2m) Σ(h_θ(x^(i)) - y^(i))², that is, the mean of the squared errors with a conventional extra factor of 1/2
* **Theta zero (θ₀)** represents the **intercept** of the line, while **theta one (θ₁)** represents the **slope**
* The cost function calculates the squared difference between predicted points and true output values, then averages them
* Alternative cost functions exist, including **Mean Absolute Error** and **Root Mean Squared Error**
* Plotting the cost function against different parameter values produces a **U-shaped curve**; **Gradient Descent** is the procedure that moves down this curve toward its lowest point
* The **global minimum** is the point on the cost function curve where the cost is smallest
* The aim is to reach the global minimum by systematically changing θ₀ and θ₁ values
* A **convergence algorithm** is needed to efficiently find the optimal parameter values rather than randomly selecting them

## Cost Function in Simple Linear Regression

The primary objective in simple linear regression is to find the **best fit line** that minimizes the total error across all data points. This is achieved through the use of a **cost function**.

### Mean Squared Error (MSE)

The cost function used in this context is the **Mean Squared Error**, represented by the notation:

**J(θ₀, θ₁) = (1/2m) Σ(h_θ(x^(i)) - y^(i))²**

Where:
* **m** is the number of training examples
* **h_θ(x^(i))** is the predicted value for the i-th input
* **y^(i)** is the true output value for the i-th example
* The summation runs from i = 1 to m
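
The formula above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `cost` and the list-based inputs are assumptions, and the prediction uses the straight-line hypothesis h_θ(x) = θ₀ + θ₁ · x that this document works with.

```python
def cost(theta0, theta1, xs, ys):
    """Return J(theta0, theta1) = (1 / 2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)  # number of training examples
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = theta0 + theta1 * x  # h_theta(x^(i))
        total += (prediction - y) ** 2    # squared error for one example
    return total / (2 * m)


# Illustrative data: three points that lie exactly on the line y = x.
print(cost(0.0, 1.0, [1, 2, 3], [1, 2, 3]))  # 0.0, because every prediction is exact
```

Any other choice of parameters for this data gives a strictly positive cost, which is what makes J usable as a measure of fit.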

### Understanding the Components

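The term **h_θ(x^(i)) - y^(i)** is the **error value** for one example: the predicted value minus the true output. Squaring this difference makes every error non-negative, so positive and negative errors cannot cancel each other out. It also penalizes large errors more heavily than small ones and keeps the cost function smooth and differentiable, which is what distinguishes MSE from a cost function such as Mean Absolute Error.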

### Alternative Cost Functions

While MSE is commonly used, other cost functions are also available:
* **Mean Absolute Error (MAE)**
* **Root Mean Squared Error (RMSE)**

Each of these has its own trade-offs: MAE is less sensitive to outliers than MSE, while RMSE reports the error in the same units as y. Which one fits best depends on the scenario.
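
To make the difference between these cost functions concrete, here is a small sketch using their standard textbook definitions. The function names and the sample numbers are hypothetical, and the MSE here is the plain mean of squared errors without the extra 1/2 factor used in J.

```python
import math


def mean_squared_error(predictions, targets):
    """Average of the squared errors (no extra 1/2 factor)."""
    n = len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n


def mean_absolute_error(predictions, targets):
    """Average of the absolute errors."""
    n = len(targets)
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / n


def root_mean_squared_error(predictions, targets):
    """Square root of the MSE, expressed in the same units as the targets."""
    return math.sqrt(mean_squared_error(predictions, targets))


# Hypothetical predictions and targets, chosen only for illustration.
predictions = [0.5, 1.0, 1.5]
targets = [1.0, 2.0, 3.0]

print(mean_squared_error(predictions, targets))       # roughly 1.17
print(mean_absolute_error(predictions, targets))      # 1.0
print(root_mean_squared_error(predictions, targets))  # roughly 1.08
```

The largest error (1.5) contributes most of the MSE because it is squared, whereas in the MAE it counts only in proportion to its size.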

## The Optimization Objective

The fundamental aim is to **minimize the cost function J(θ₀, θ₁)** by adjusting the values of theta zero and theta one. The two parameters being adjusted are:

* **Theta zero (θ₀)**: Controls the intercept of the line
* **Theta one (θ₁)**: Controls the slope of the line

The optimization goal is expressed as:

**Minimize J(θ₀, θ₁) = (1/2m) Σ(h_θ(x^(i)) - y^(i))²**

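Since the cost function divides the summed squared errors by m (the number of examples), it is a **mean** of the squared errors, hence the name Mean Squared Error. The additional factor of 1/2 is a convention that cancels the 2 produced when the square is differentiated; it scales the cost but does not change which values of θ₀ and θ₁ minimize it.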

## The Line Equation

The equation of the straight line in simple linear regression is:

**h_θ(x) = θ₀ + θ₁ · x**

The algorithm searches for the values of θ₀ and θ₁ that make this line the best fit for the data.
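
As a quick sketch of how the two parameters act on the line, the equation can be written as a one-line Python function. The name `hypothesis` and the parameter values below are hypothetical, chosen only to show the effect of each parameter.

```python
def hypothesis(theta0, theta1, x):
    """Return h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


# Hypothetical parameter values, evaluated at x = 3.
print(hypothesis(0.0, 1.0, 3))  # 3.0: line through the origin with slope 1
print(hypothesis(2.0, 1.0, 3))  # 5.0: same slope, intercept raised by 2
print(hypothesis(0.0, 0.5, 3))  # 1.5: same intercept, slope halved
```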

## Simplified Example with θ₀ = 0

To visualize the cost function in a 2D diagram, consider the special case where **theta zero is fixed at zero**, so that J depends on θ₁ alone. This means:
* The line passes through the **origin**
* The **intercept is exactly zero**
* The equation simplifies to: **h_θ(x) = θ₁ · x**

### Example Dataset

Consider a simple dataset:

| x | y |
|---|---|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |

For this dataset, the data points are (1,1), (2,2), and (3,3).

### Testing Different θ₁ Values

#### Case 1: θ₁ = 1 (Optimal)

When **theta one equals 1**, the equation becomes:
* **h_θ(x) = 1 · x**
* For x = 1: h_θ(1) = 1
* For x = 2: h_θ(2) = 2
* For x = 3: h_θ(3) = 3

The predicted points (1,1), (2,2), (3,3) match the actual data points perfectly. The cost function calculation:

```
J(θ₁) = (1/2m) Σ(h_θ(x^(i)) - y^(i))²
J(θ₁) = (1/(2·3))[(1-1)² + (2-2)² + (3-3)²]
J(θ₁) = (1/6)[0 + 0 + 0]
J(θ₁) = 0
```

This represents the **best fit line** for this dataset, with **zero error**. The line passes exactly through all data points, so θ₁ = 1 is the **global minimum** of the cost function.

#### Case 2: θ₁ = 0.5

When **theta one equals 0.5**, the equation becomes:
* **h_θ(x) = 0.5 · x**
* For x = 1: h_θ(1) = 0.5
* For x = 2: h_θ(2) = 1
* For x = 3: h_θ(3) = 1.5

The cost function calculation:

```
J(θ₁) = (1/(2·3))[(0.5-1)² + (1-2)² + (1.5-3)²]
J(θ₁) = (1/6)[(-0.5)² + (-1)² + (-1.5)²]
J(θ₁) = (1/6)[0.25 + 1 + 2.25]
J(θ₁) = (1/6)[3.5]
J(θ₁) ≈ 0.58
```

This yields approximately **J(θ₁) ≈ 0.58**, indicating a higher error than the optimal case.

#### Case 3: θ₁ = 0

When **theta one equals zero**, the equation becomes:
* **h_θ(x) = 0**
* All predicted values are zero
* The predicted points are (1,0), (2,0), (3,0) instead of the actual (1,1), (2,2), (3,3)

The cost function calculation:

```
J(θ₁) = (1/(2·3))[(0-1)² + (0-2)² + (0-3)²]
J(θ₁) = (1/6)[1 + 4 + 9]
J(θ₁) = (1/6)[14]
J(θ₁) ≈ 2.3
```

This produces a significantly higher error of approximately **2.3**, confirming this is not the best fit line.
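
The three hand calculations above can be reproduced with a short Python sketch. The function name `cost_through_origin` is hypothetical; it implements the same (1/2m) formula for the simplified model h_θ(x) = θ₁ · x.

```python
def cost_through_origin(theta1, xs, ys):
    """J(theta1) for the simplified model h(x) = theta1 * x (theta0 fixed at 0)."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# The example dataset: (1,1), (2,2), (3,3).
xs = [1, 2, 3]
ys = [1, 2, 3]

for theta1 in (1.0, 0.5, 0.0):
    print(f"theta1 = {theta1}: J = {cost_through_origin(theta1, xs, ys):.2f}")
# theta1 = 1.0: J = 0.00
# theta1 = 0.5: J = 0.58
# theta1 = 0.0: J = 2.33
```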

### Visualization of the Cost Function Curve

When plotting the cost function **J(θ₁)** against different values of **θ₁**, we observe:

* At **θ₁ = 0**: J(θ₁) ≈ 2.3 (high error)
* At **θ₁ = 0.5**: J(θ₁) ≈ 0.58 (moderate error)
* At **θ₁ = 1**: J(θ₁) = 0 (zero error - **global minimum**)
* At **θ₁ = 1.5**: J(θ₁) ≈ 0.58 (error increases again)
* At **θ₁ = 2.0**: J(θ₁) ≈ 2.3 (error continues to increase)
* At **θ₁ = 2.5**: J(θ₁) = 5.25 (error is even higher)

This creates a **U-shaped curve** (parabola) where:
* The lowest point (bottom of the U) is at θ₁ = 1
* This point represents the **global minimum**
* Moving away from this point in either direction increases the error
* **Gradient descent** works by following the slope of this curve downhill toward that lowest point
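
The shape of the curve can be checked numerically. For this dataset the cost simplifies to J(θ₁) = (1/6)(θ₁ - 1)²(1² + 2² + 3²) = (7/3)(θ₁ - 1)², which is a parabola centered on θ₁ = 1. The sketch below sweeps the θ₁ values listed above and draws a crude text plot; the helper name `cost_through_origin` and the bar scaling are illustrative assumptions.

```python
def cost_through_origin(theta1, xs, ys):
    """J(theta1) for h(x) = theta1 * x, using the (1/2m) convention."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


xs = [1, 2, 3]
ys = [1, 2, 3]

# Candidate slopes from 0.0 to 2.5 in steps of 0.5.
candidates = [i * 0.5 for i in range(6)]
curve = [(theta1, cost_through_origin(theta1, xs, ys)) for theta1 in candidates]

for theta1, j in curve:
    bar = "#" * round(j * 4)  # crude text plot: a longer bar means a higher cost
    print(f"theta1 = {theta1:.1f}  J = {j:.2f}  {bar}")

# The candidate with the smallest cost sits at the bottom of the U.
best_theta1, best_cost = min(curve, key=lambda pair: pair[1])
print(best_theta1, best_cost)  # 1.0 0.0
```

The costs are symmetric around θ₁ = 1: the values at 0.5 and 1.5 match, as do those at 0 and 2.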

## Gradient Descent

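When plotting different **theta one values** against their corresponding **cost function values J(θ₁)**, a curve emerges. This is the cost function curve. **Gradient Descent** is the procedure that follows the slope of this curve downhill to find its lowest point.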

### Key Characteristics

* The curve shows how the cost function changes with different parameter values
* The lowest point on this curve represents the **global minimum**
* At the global minimum, the **error is minimized**
* This point corresponds to the **best fit line**

### The Global Minimum

The **global minimum** is the point where:
* The cost function reaches its minimum value
* The error is at its lowest
* The best fit line is achieved

In the example above, the global minimum occurs at **θ₁ = 1**, where **J(θ₁) = 0**.

### Importance in Machine Learning

**Gradient Descent** is crucial not only for simple linear regression but is also **central to deep learning techniques**. The fundamental concept remains the same: systematically move toward the point that minimizes the cost function.

## The Need for a Convergence Algorithm

While it's possible to manually test different values of theta one (and theta zero), this approach is not practical because:
* Randomly selecting different parameter values is inefficient
* There's no systematic way to improve the estimates
* Manual testing doesn't scale to larger datasets or more complex models

A **convergence algorithm** is required to:
* Start with an initial value of θ₁ (and θ₀)
* Systematically change these values
* Move toward the global minimum efficiently
* Find the optimal parameters that minimize the cost function

The convergence algorithm provides a **mechanism for changing parameter values** in a way that consistently reduces the cost function and approaches the best fit line.
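
The steps above can be sketched with one common choice of convergence algorithm, the standard gradient descent update θ_j := θ_j - α · ∂J/∂θ_j. The source does not specify the update rule at this point, so this is an assumption, as are the learning rate `alpha`, the iteration count, and the starting point (0, 0).

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=5000):
    """Iteratively reduce J(theta0, theta1) = (1/2m) * sum((h(x) - y)^2).

    alpha (learning rate) and iterations are hypothetical defaults.
    """
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # initial guess (assumed)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to theta0 and theta1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: step against the gradient.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# The example dataset: (1,1), (2,2), (3,3).
theta0, theta1 = gradient_descent([1, 2, 3], [1, 2, 3])
print(round(theta0, 4), round(theta1, 4))
```

On the example dataset the parameters approach θ₀ = 0 and θ₁ = 1, the same best fit line found by hand. A learning rate that is too large can make the updates overshoot and diverge, so `alpha` usually needs tuning for other data.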

## Moving Forward

The key takeaway is understanding that:
* The goal is to minimize the cost function **J(θ₀, θ₁)**
* This is achieved by moving down the **cost function curve** using gradient descent
* The target is to reach the **global minimum**
* A systematic **convergence algorithm** is needed to accomplish this efficiently

The convergence algorithm specifies how to update the theta zero and theta one values at each step in order to reach the optimal solution.
