# Gradient Descent and Convergence Algorithm for Simple Linear Regression


## Summary

* **Gradient descent** is an iterative optimization technique used to reach the **global minimum** of the cost function, the point of minimal error that defines the **best fit line**.
* When the **intercept** ($\theta_0$) is not fixed at zero, the **cost function** is visualized as a 3D bowl-shaped surface (an **inverted mountain**) rather than a 2D curve.
* The **convergence algorithm** iteratively updates both the **intercept** ($\theta_0$) and the **slope** ($\theta_1$) parameters simultaneously.
* **Partial derivatives** of the **cost function** are required to determine the specific update direction for each parameter.
* The **Learning Rate** ($\alpha$) acts as a multiplier to control how much the **coefficients** are adjusted in each iteration.

## Exam Notes

### 2D vs. 3D Gradient Descent Visualization

**Question**: Why is **gradient descent** sometimes represented in 3D instead of 2D?

**Answer**:  
A 2D diagram is sufficient when the **intercept** ($\theta_0$) is assumed to be zero, meaning the **best fit line** passes through the origin and only $\theta_1$ varies. However, when both $\theta_0$ and $\theta_1$ are free parameters, a 3D diagram (surface plot) is necessary to visualize the **cost function** $J(\theta_0, \theta_1)$.

### Parameter Update Logic

**Question**: What is the mathematical difference between the update rules for $\theta_0$ and $\theta_1$?

**Answer**:  
Due to the chain rule in **calculus**, the derivative for $\theta_1$ includes an additional multiplication by the input feature $x^{(i)}$. The update for $\theta_0$ only considers the difference between the **hypothesis** and the actual value.

---

## Visualization of Gradient Descent

The process of finding the most accurate **simple linear regression** model involves optimizing two **coefficients**: the **intercept** ($\theta_0$) and the **slope** ($\theta_1$).

In simplified scenarios where the **intercept** is fixed at zero ($\theta_0 = 0$), the **gradient descent** curve is a 2D plot of the **cost function** $J(\theta_1)$ against the **slope** $\theta_1$. In this 2D view, the goal is to reach the **global minimum** by moving along the curve in the direction opposite to its slope: where the curve's slope is positive, the update decreases $\theta_1$, and where it is negative, the update increases $\theta_1$.

When $\theta_0$ is also a variable, the visualization becomes a 3D surface plot. The surface resembles an **inverted mountain** (a bowl), and the objective is to descend from the starting point to the single lowest point at the bottom, known as the **global minimum**.
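
To make the 2D versus 3D distinction concrete, the sketch below evaluates the cost on a coarse grid, first with $\theta_0$ fixed at zero (a curve) and then with both parameters varying (a surface). It assumes the half mean squared error cost and the linear hypothesis defined in the following sections; the data set, the grid ranges, and the name `half_mse` are illustrative assumptions, not part of the original notes.

```python
def half_mse(theta0, theta1, xs, ys):
    """Cost J(theta0, theta1) = 1/(2m) * sum of squared errors (assumed form)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Hypothetical toy data generated from y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

# 2D case: theta0 is fixed at 0, so the cost is a curve over theta1 only
slope_grid = [i * 0.5 for i in range(0, 9)]          # 0.0, 0.5, ..., 4.0
curve = [(t1, half_mse(0.0, t1, xs, ys)) for t1 in slope_grid]

# 3D case: both parameters vary, so the cost is a surface over (theta0, theta1)
intercept_grid = [i * 0.5 for i in range(-4, 9)]     # -2.0, -1.5, ..., 4.0
surface = {
    (t0, t1): half_mse(t0, t1, xs, ys)
    for t0 in intercept_grid
    for t1 in slope_grid
}

# The lowest grid point of the surface is the bottom of the "bowl"
best = min(surface, key=surface.get)
print("lowest point on the grid:", best, "cost:", surface[best])
```

Plotting `curve` gives the 2D picture and plotting `surface` gives the 3D one. A grid search like this is only for visualization; gradient descent reaches the bottom without evaluating every point.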

## The Convergence Algorithm and Cost Function

The **convergence algorithm** is a repeated iterative process defined by the following update rule:

$$
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
$$

The **cost function** $J(\theta_0, \theta_1)$ used here is the **Mean Squared Error**, scaled by $\frac{1}{2}$ so that the factor of 2 produced by differentiation cancels:

$$
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2
$$

In this equation, $m$ represents the total number of **data points**, $h_\theta(x)$ is the **hypothesis**, and $y$ is the actual value.
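
As a rough illustration, the cost can be computed directly from its definition. This is a minimal sketch assuming the linear hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$ introduced later in these notes; the function names and the sample data are hypothetical.

```python
def hypothesis(theta0, theta1, x):
    """Linear hypothesis h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def compute_cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    squared_errors = [
        (hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)
    ]
    return sum(squared_errors) / (2 * m)

# Hypothetical data lying exactly on the line y = 1 + 2x
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

cost_perfect = compute_cost(1.0, 2.0, xs, ys)  # parameters that fit the data exactly
cost_zero = compute_cost(0.0, 0.0, xs, ys)     # a poor starting guess
print(cost_perfect, cost_zero)
```

Parameters that fit the data exactly give a cost of zero, and any other choice gives a strictly positive cost, which is the quantity gradient descent drives down.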

## Mathematical Derivation of Update Rules

To update the parameters, we calculate the **partial derivative** of the **cost function** with respect to both $\theta_0$ and $\theta_1$. This uses the power rule for **derivatives**, combined with the chain rule for the inner term:

$$
\frac{d}{dx} x^n = nx^{n-1}
$$

### Updating the Intercept ($\theta_0$)

When $j=0$, we differentiate the **cost function** with respect to $\theta_0$. Since the **hypothesis** is:

$$
h_\theta(x) = \theta_0 + \theta_1 x
$$

the derivative of the inner term with respect to $\theta_0$ is 1. The factor of 2 from the power rule cancels the $\frac{1}{2}$ in the cost function, which simplifies the derivative to:

$$
\frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})
$$
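
Written out in full, the chain-rule step that leads to the expression above is:

$$
\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)
= \frac{1}{2m} \sum_{i=1}^m 2 \, (h_\theta(x^{(i)}) - y^{(i)}) \cdot \frac{\partial}{\partial \theta_0} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)
= \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot 1
$$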

### Updating the Slope ($\theta_1$)

When $j=1$, we differentiate the **cost function** with respect to $\theta_1$. The derivative of the **hypothesis** with respect to $\theta_1$ is $x$, giving:

$$
\frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)}
$$
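
One way to sanity-check the two derivatives is to compare them against a numerical estimate. The sketch below implements both formulas and a central finite-difference approximation of the same cost; the function names, the sample data, and the step size `eps` are assumptions made for illustration.

```python
def cost(theta0, theta1, xs, ys):
    """Half mean squared error for the hypothesis theta0 + theta1 * x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def gradients(theta0, theta1, xs, ys):
    """Analytic partial derivatives of the cost with respect to theta0 and theta1."""
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    d_theta0 = sum(errors) / m                              # inner derivative is 1
    d_theta1 = sum(e * x for e, x in zip(errors, xs)) / m   # inner derivative is x
    return d_theta0, d_theta1

def numeric_gradients(theta0, theta1, xs, ys, eps=1e-6):
    """Central finite differences, used only to cross-check the formulas."""
    n0 = (cost(theta0 + eps, theta1, xs, ys) - cost(theta0 - eps, theta1, xs, ys)) / (2 * eps)
    n1 = (cost(theta0, theta1 + eps, xs, ys) - cost(theta0, theta1 - eps, xs, ys)) / (2 * eps)
    return n0, n1

# Hypothetical data lying on y = 1 + 2x, evaluated at the starting guess (0, 0)
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

analytic = gradients(0.0, 0.0, xs, ys)
numeric = numeric_gradients(0.0, 0.0, xs, ys)
print("analytic:", analytic)
print("numeric: ", numeric)
```

Both gradients are negative at this starting point, so the update rule (which subtracts $\alpha$ times the gradient) increases both $\theta_0$ and $\theta_1$.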

## Final Implementation Logic

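The algorithm must **repeat until convergence**, meaning the updates continue until the parameters stop changing appreciably. At that point they sit at (or very near) the **global minimum** and define the **best fit line**.

The final update equations are applied simultaneously: both gradients are computed from the current $\theta_0$ and $\theta_1$ before either parameter is overwritten.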

1. **Intercept Update**  
   $$
   \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})
   $$

2. **Slope Update**  
   $$
   \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)}
   $$
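
The two update equations can be combined into a short loop. This is a minimal sketch rather than a reference implementation: the starting values, the learning rate `alpha=0.05`, the stopping rule based on gradient size (`tolerance`), and the iteration cap are all assumptions chosen for the toy data shown.

```python
def gradient_descent(xs, ys, alpha=0.05, tolerance=1e-10, max_iterations=100_000):
    """Fit theta0 (intercept) and theta1 (slope) by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0   # assumed starting point
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        # Errors use the current parameters for BOTH gradients
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m

        # Simultaneous update: neither new value is used to compute the other
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

        # Assumed convergence test: stop once both gradients are negligible
        if abs(grad0) < tolerance and abs(grad1) < tolerance:
            break
    return theta0, theta1, iteration

# Hypothetical data lying exactly on y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

theta0, theta1, steps = gradient_descent(xs, ys)
print(f"intercept ~ {theta0:.6f}, slope ~ {theta1:.6f}, iterations: {steps}")
```

The tuple assignment is what makes the update simultaneous; updating `theta0` first and then reusing it to compute `grad1` would be a different algorithm. A learning rate that is too large can overshoot the minimum and diverge, while one that is too small converges slowly.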

Although the derivation above is involved, in practice the fitting step is usually delegated to libraries like **Scikit-Learn** (**sklearn**), so these update rules rarely need to be coded by hand.
