# 🧠 Lecture 8.5: Basic Algorithm for Unconstrained Optimization – Gradient Descent

---

## 🔧 Goal

To minimize an unconstrained objective function:

$$
\min_{x \in \mathbb{R}} f(x)
$$

We want a general-purpose algorithm that, given a differentiable function $f$, iteratively moves toward a point that minimizes it.

---

## ⚙️ Gradient Descent Algorithm (1D)

### **Algorithm Summary**

1. **Initialize**:
   Choose any starting point $x_0 \in \mathbb{R}$

2. **For** $t = 0, 1, 2, \dots$

   $$
   x_{t+1} = x_t - \eta_t f'(x_t)
   $$

   where the **step size** (learning rate) is chosen as:

   $$
   \eta_t = \frac{1}{t+1}
   $$

This is the **Gradient Descent Algorithm** for 1D unconstrained optimization.
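
The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `gradient_descent_1d`, the fixed iteration budget `num_steps`, and the test function $f(x) = (x - 3)^2$ are all assumptions made for the example (the notes let $t$ run without specifying a stopping rule).

```python
def gradient_descent_1d(f_prime, x0, num_steps=100):
    """Iterate x_{t+1} = x_t - eta_t * f'(x_t) with eta_t = 1 / (t + 1)."""
    x = x0
    for t in range(num_steps):
        eta = 1.0 / (t + 1)       # step size used in these notes
        x = x - eta * f_prime(x)  # move against the sign of the derivative
    return x


# Hypothetical test function: f(x) = (x - 3)^2, so f'(x) = 2 * (x - 3).
x_final = gradient_descent_1d(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_final)  # prints 3.0
```

Only the derivative `f_prime` is passed in; the function value itself is never evaluated, which is what makes this a first-order method.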

---

## 📛 Why is it Called "Gradient Descent"?

* The **gradient** is the generalization of the derivative to higher dimensions.
* The word **descent** refers to the fact that we move in the direction that decreases the function value: the **negative derivative**.

This is a **first-order optimization algorithm**, as it uses only first-order information (the gradient or derivative of $f$).

---

## 🌍 Generalization to Higher Dimensions

In higher dimensions ($x \in \mathbb{R}^d$):

* Replace the derivative $f'(x)$ with the **gradient** $\nabla f(x)$
* The update rule becomes:

  $$
  x_{t+1} = x_t - \eta_t \nabla f(x_t)
  $$

The algorithm works the same way for **vector inputs**, as long as the function is differentiable.
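
A sketch of the same loop for $x \in \mathbb{R}^d$, using plain Python lists so that it has no dependencies. The names `gradient_descent` and `grad_f`, the iteration budget, and the two-variable test function $f(x) = (x_1 - 1)^2 + 2(x_2 + 2)^2$ are assumptions chosen for illustration.

```python
def gradient_descent(grad_f, x0, num_steps=100):
    """Iterate x_{t+1} = x_t - eta_t * grad f(x_t) with eta_t = 1 / (t + 1)."""
    x = list(x0)
    for t in range(num_steps):
        eta = 1.0 / (t + 1)
        g = grad_f(x)
        # Apply the update to every coordinate.
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x


def grad_f(x):
    # Gradient of the hypothetical f(x) = (x[0] - 1)^2 + 2 * (x[1] + 2)^2.
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)]


x_final = gradient_descent(grad_f, [0.0, 0.0])
print(x_final)  # close to [1.0, -2.0], the minimizer of f
```

The only change from the 1D version is that the scalar derivative is replaced by a list of partial derivatives and the update is applied coordinate by coordinate.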

---

## 🔁 Nature of Gradient Descent

* **Iterative**: Improves the guess with each step $t$
* **Only requires the ability to compute the derivative** (or gradient)
* **Simple to implement** and widely used in practice

---

## ✅ Convergence Properties

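### **Key Property**: If $\eta_t = \frac{1}{t+1}$, then:

* The algorithm **converges** for well-behaved $f$ (for example, when $f$ is bounded below and its derivative does not change too abruptly): the shrinking steps damp the back-and-forth oscillation that a fixed, too-large step can cause
* The limit is a point where the derivative is zero, typically a **local minimum** (not necessarily the global one)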
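
To see what the shrinking step size buys, the sketch below runs the same update on $f(x) = x^2$ with two schedules. The constant step size of 1 is a hypothetical choice picked to provoke oscillation; the helper `run` and both variable names are likewise just for this example.

```python
def run(f_prime, x0, step_size, num_steps):
    """Return all iterates; step_size(t) supplies eta_t."""
    xs = [x0]
    for t in range(num_steps):
        xs.append(xs[-1] - step_size(t) * f_prime(xs[-1]))
    return xs


def f_prime(x):
    return 2.0 * x  # derivative of f(x) = x^2


constant = run(f_prime, 1.0, lambda t: 1.0, 6)             # eta_t = 1
shrinking = run(f_prime, 1.0, lambda t: 1.0 / (t + 1), 6)  # eta_t = 1/(t+1)

print(constant)   # [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0]
print(shrinking)  # [1.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

With the constant step the iterate jumps back and forth across the minimum forever; with $\eta_t = \frac{1}{t+1}$ the second step is half as long and lands on $x = 0$.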

---

## 📉 Local vs Global Minimum

### 🟥 Global Minimum:

* A point $x^\star$ is a **global minimum** if:

  $$
  f(x^\star) \leq f(x) \quad \forall x \in \mathbb{R}
  $$

### 🟦 Local Minimum:

* A point $\hat{x}$ is a **local minimum** if:

  $$
  \exists \epsilon > 0 \text{ such that } f(\hat{x}) \leq f(x), \forall x \in (\hat{x} - \epsilon, \hat{x} + \epsilon)
  $$
* The function value is the lowest **in a small neighborhood**, not necessarily everywhere.

### 📈 Local Maximum:

Defined similarly, but with the inequality sign reversed: $\hat{x}$ is a **local maximum** if there exists $\epsilon > 0$ such that

$$
f(\hat{x}) \geq f(x) \quad \forall x \in (\hat{x} - \epsilon, \hat{x} + \epsilon)
$$
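
The definitions can be spot-checked numerically. The sketch below uses a hypothetical quartic, $f(x) = \frac{x^4}{4} - \frac{x^3}{3} - x^2$, whose derivative factors as $(x+1)\,x\,(x-2)$, so it has minima at $x = -1$ and $x = 2$. The helper `is_min_on_grid` is an assumption for illustration; it only compares against finitely many grid points, so it is evidence rather than proof.

```python
def f(x):
    # Hypothetical example with f'(x) = (x + 1) * x * (x - 2).
    return x**4 / 4 - x**3 / 3 - x**2


def is_min_on_grid(f, x_hat, lo, hi, n=1000):
    """True if f(x_hat) <= f(x) at n + 1 evenly spaced points in [lo, hi]."""
    return all(f(x_hat) <= f(lo + (hi - lo) * i / n) for i in range(n + 1))


local_at_minus_1 = is_min_on_grid(f, -1.0, -1.5, -0.5)  # neighborhood, eps = 0.5
global_at_minus_1 = is_min_on_grid(f, -1.0, -3.0, 4.0)  # much wider range
global_at_2 = is_min_on_grid(f, 2.0, -3.0, 4.0)

print(local_at_minus_1, global_at_minus_1, global_at_2)  # True False True
```

So $x = -1$ passes the local test with $\epsilon = 0.5$ but fails once the range is widened, because $f(2) < f(-1)$.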

---

## 🧩 Why Gradient Descent Only Finds Local Minima

* If you start at a point $x_0$, GD will follow the **negative gradient** direction
* It will stop when the derivative (or gradient) becomes **zero**
* If this happens at a **local minimum**, the algorithm will **get stuck** there
* It **cannot climb out** of the valley to reach a better minimum beyond, because it only moves based on local slope

---

## 🔍 Illustrative Example

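Suppose $f(x)$ is a "wiggly" function with several valleys of different depths:

*[Figure: conceptual plot of a wiggly function, with local minima marked in blue and the global minimum marked in red]*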

* There are many **local minima** (marked in blue)
* Only one **global minimum** (marked in red)
* Depending on where you start ($x_0$), gradient descent may converge to any of the local minima
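
The dependence on the starting point can be reproduced with a small experiment. The function $f(x) = \frac{x^2}{20} + \cos x$ is a hypothetical stand-in for the wiggly function (it is not from the lecture), and the iteration budget of 20000 steps and the two starting points are likewise assumptions.

```python
import math


def f(x):
    # Hypothetical wiggly function: a shallow bowl plus a cosine ripple.
    return x * x / 20 + math.cos(x)


def f_prime(x):
    return x / 10 - math.sin(x)


def gradient_descent_1d(f_prime, x0, num_steps=20000):
    x = x0
    for t in range(num_steps):
        x = x - f_prime(x) / (t + 1)  # eta_t = 1 / (t + 1)
    return x


# Same algorithm, two different starting points.
ends = {x0: gradient_descent_1d(f_prime, x0) for x0 in (1.0, 7.5)}
for x0, x_end in ends.items():
    print(f"start {x0}: ends near x = {x_end:.2f} with f = {f(x_end):.2f}")
```

Starting from $x_0 = 1$ the iterates settle in the valley near $x \approx 2.85$, while starting from $x_0 = 7.5$ they settle in a shallower valley near $x \approx 8.4$, where the function value is higher. Neither run has any way of knowing that a different valley exists.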

---

## ❗ Important Observations

1. **Gradient Descent is Not Guaranteed to Find Global Minima**

   * Without assumptions on $f$, no algorithm can guarantee finding the global minimum

2. **But in Practice: Local Minima May Be Good Enough**

   * Especially in ML, **local minima are often as good as global ones**
   * For many problems, **every local minimum is also a global minimum**

---

## ✅ Example: $f(x) = (x - 5)^2$

* This is a **convex function**
* It has **one** local minimum, which is also the **global minimum**
* Gradient Descent converges to 5, no matter the starting point
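
As a quick check of this claim under the step size $\eta_t = \frac{1}{t+1}$ used in these notes, the first two updates can be worked out by hand, using $f'(x) = 2(x - 5)$, $\eta_0 = 1$ and $\eta_1 = \frac{1}{2}$:

$$
\begin{aligned}
x_1 &= x_0 - 1 \cdot 2(x_0 - 5) = 10 - x_0 \\
x_2 &= x_1 - \tfrac{1}{2} \cdot 2(x_1 - 5) = 5
\end{aligned}
$$

For this particular function the iterate lands exactly on the minimizer at $t = 2$, whatever $x_0$ was, and stays there because $f'(5) = 0$. That exact hit is a coincidence of this quadratic and this step size; in general the approach is gradual.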

---

## 🎯 Convex Functions

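* A function is **convex** if the line segment joining any two points on its graph never lies below the graph: $f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y)$ for all $x, y$ and all $\lambda \in [0, 1]$
* For a convex function, **every local minimum is also a global minimum**
* With a suitable step size, Gradient Descent on a smooth convex function converges to a global minimum (when one exists), so it cannot get stuck in a worse valley
* Many loss functions used in classical machine learning (for example, squared loss and logistic loss for linear models) are **convex**; losses for neural networks generally are not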
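
Convexity in the standard sense means that the chord between any two points on the graph lies on or above the graph; taking the midpoint of the chord gives the test $f\left(\frac{x+y}{2}\right) \leq \frac{f(x) + f(y)}{2}$. The sketch below searches a grid for a pair that breaks this inequality. The helper `violates_convexity`, the grid range, and the second test function are assumptions for illustration. Finding a violation proves a function is not convex, while finding none on a grid is only evidence.

```python
import math


def violates_convexity(f, lo, hi, n=40, tol=1e-9):
    """Search a grid for x, y where f at the midpoint exceeds the chord midpoint."""
    pts = [lo + (hi - lo) * i / n for i in range(n + 1)]
    for x in pts:
        for y in pts:
            if f((x + y) / 2) > (f(x) + f(y)) / 2 + tol:
                return True
    return False


bowl = violates_convexity(lambda x: (x - 5) ** 2, -10.0, 10.0)
wiggly = violates_convexity(lambda x: x * x / 20 + math.cos(x), -10.0, 10.0)
print(bowl, wiggly)  # False True
```

The quadratic passes every pair, while the hypothetical wiggly function fails, for instance at $x = -1$, $y = 1$, where the midpoint value $f(0) = 1$ is above both endpoints.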

---

## 🔮 Looking Ahead: Open Questions

1. **Why is $-f'(x)$** a good direction to move in?

   * We’ll analyze its **geometric** and **intuitive** justification
   * Leads us to more advanced optimization methods

2. **What if constraints are present?**

   * We'll extend this to **constrained optimization** later

---

## 📝 Summary

| Concept        | Description                                                |
| -------------- | ---------------------------------------------------------- |
| Objective      | Minimize $f(x)$ with no constraints                        |
| Algorithm      | Gradient Descent                                           |
| Update Rule    | $x_{t+1} = x_t - \eta_t f'(x_t)$, $\eta_t = \frac{1}{t+1}$ |
| Works for      | Any differentiable function $f$                            |
| Converges to   | A **local minimum** (not necessarily global)               |
| Limitations    | Can get stuck at local minima                              |
| Useful when    | Function is **convex** – local = global minimum            |
| Real-world use | Extremely effective in many ML applications                |