Notes for **L8.8: Taylor Series in Higher Dimensions** from the *Foundations of Machine Learning Theory* course.

---

## 🎯 Lecture Goal:

Understand the **Taylor Series** for multivariable functions and its **role in Gradient Descent** and **Constrained Optimization**.

---

## 🧠 1. **Recap: Taylor Series in 1D**

In 1D, the Taylor series for a scalar function $f(x)$ expanded around a point $x$ is:

$$
f(x + \eta d) = f(x) + \eta d f'(x) + \frac{1}{2} \eta^2 d^2 f''(x) + \dots
$$

* $\eta$: step size (scalar)
* $d$: direction (scalar in 1D)
* $f'(x)$: derivative at point $x$
* This motivated the use of $-f'(x)$ as a descent direction.
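As a quick numerical check of the 1D expansion, the sketch below compares the first-order and second-order truncations against the exact value. The test function $f(x) = \sin x$ and the point $x = 1$ are illustrative assumptions, not taken from the lecture.

```python
import math

# Hypothetical test function and its derivatives (chosen for illustration).
def f(x):
    return math.sin(x)

def f_prime(x):
    return math.cos(x)

def f_second(x):
    return -math.sin(x)

def taylor_1d(x, eta, d, order):
    """Taylor approximation of f(x + eta*d), truncated at order 1 or 2."""
    approx = f(x) + eta * d * f_prime(x)
    if order >= 2:
        approx += 0.5 * eta**2 * d**2 * f_second(x)
    return approx

x, d = 1.0, 1.0
errors_first = []
errors_second = []
for eta in (0.1, 0.01, 0.001):
    exact = f(x + eta * d)
    errors_first.append(abs(exact - taylor_1d(x, eta, d, 1)))
    errors_second.append(abs(exact - taylor_1d(x, eta, d, 2)))
    print(eta, errors_first[-1], errors_second[-1])
```

Both errors shrink as $\eta$ shrinks, and the second-order truncation is the more accurate of the two at every step size tried here.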

---

## 📈 2. **Taylor Series in Higher Dimensions**

### Goal:

Extend the Taylor Series to functions of a **vector input** $x \in \mathbb{R}^d$ (the output $f(x)$ is still a scalar).

$$
f(x + \eta d) = f(x) + \eta \, d^T \nabla f(x) + O(\eta^2)
$$

### Key Differences from 1D:

* $x, d \in \mathbb{R}^d$: vectors (the direction $d$ is a vector; the superscript $d$ in $\mathbb{R}^d$ is the dimension)
* $\nabla f(x)$: gradient of $f$ at point $x$
* $d^T \nabla f(x)$: **directional derivative**, a scalar

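💡 Dropping the $O(\eta^2)$ part gives a **first-order** approximation. The leading term it ignores is $\frac{1}{2} \eta^2 d^T \nabla^2 f(x) \, d$, which involves the Hessian $\nabla^2 f(x)$, so the approximation is accurate only for small $\eta$.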
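The following sketch checks the first-order approximation numerically. The quadratic test function, the point `x`, and the direction `d` are assumptions made for illustration. Because the neglected term scales with $\eta^2$, shrinking $\eta$ by a factor of 10 should shrink the error by a factor of about 100.

```python
# Hypothetical test function: f(x) = x1^2 + 3*x2^2 + x1*x2
def f(x):
    return x[0]**2 + 3.0 * x[1]**2 + x[0] * x[1]

def grad_f(x):
    # Gradient computed by hand for the function above.
    return [2.0 * x[0] + x[1], 6.0 * x[1] + x[0]]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def first_order(x, eta, d):
    """First-order Taylor approximation of f(x + eta*d)."""
    return f(x) + eta * dot(d, grad_f(x))

x = [1.0, -2.0]
d = [0.5, 1.0]
errors = []
for eta in (0.1, 0.01, 0.001):
    x_new = [xi + eta * di for xi, di in zip(x, d)]
    errors.append(abs(f(x_new) - first_order(x, eta, d)))
    print(eta, errors[-1])

# Ratio of errors for successive step sizes (expected to be close to 100).
ratio = errors[0] / errors[1]
```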

---

## 🧭 3. **Dot Product and Directional Derivative**

Let:

* $a = (a_1, \dots, a_d)$
* $b = (b_1, \dots, b_d)$

Then:

$$
a^T b = \sum_{i=1}^{d} a_i b_i
$$

This gives the **dot product**. Geometrically, $a^T b = \|a\| \, \|b\| \cos\theta$, where $\theta$ is the angle between the two vectors, so it measures the **projection** of one vector onto the other and **how aligned they are**.

So:

* $d^T \nabla f(x)$ is the rate of change of $f$ along direction $d$ (per unit of $\eta$)
* If $d^T \nabla f(x) < 0$, then $f$ is **decreasing** along $d$
* If $d^T \nabla f(x) > 0$, then $f$ is **increasing** along $d$
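To make the sign rule concrete, this sketch computes $d^T \nabla f(x)$ for two directions and compares one of them with a finite-difference slope. The bowl-shaped function and the helper `finite_difference` are hypothetical choices for illustration.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical bowl-shaped function with its minimum at the origin.
def f(x):
    return x[0]**2 + x[1]**2

def grad_f(x):
    return [2.0 * x[0], 2.0 * x[1]]

def finite_difference(x, d, h=1e-6):
    """Numerical slope of f along d: (f(x + h*d) - f(x)) / h."""
    x_plus = [xi + h * di for xi, di in zip(x, d)]
    return (f(x_plus) - f(x)) / h

x = [1.0, 2.0]
toward_origin = [-1.0, -2.0]
away_from_origin = [1.0, 2.0]

slope_in = dot(toward_origin, grad_f(x))      # negative: f decreases along this d
slope_out = dot(away_from_origin, grad_f(x))  # positive: f increases along this d
numeric_in = finite_difference(x, toward_origin)
print(slope_in, slope_out, numeric_in)
```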

---

## 🟢 4. **Choosing Descent Directions**

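We want the function value to decrease after a small step $\eta > 0$. By the first-order approximation:

$$
f(x + \eta d) - f(x) \approx \eta \, d^T \nabla f(x) < 0 \iff d^T \nabla f(x) < 0
$$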

So, any $d$ satisfying:

$$
d^T \nabla f(x) < 0
$$

is a **valid descent direction**.
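A minimal sketch of this test, assuming a hypothetical quadratic objective and a handful of hand-picked candidate directions. It flags each candidate using the sign of $d^T \nabla f(x)$ and then checks whether a small step along it really lowers $f$.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical objective with its minimum at (1, -1).
def f(x):
    return (x[0] - 1.0)**2 + 2.0 * (x[1] + 1.0)**2

def grad_f(x):
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 1.0)]

def is_descent_direction(d, grad):
    """True when d makes a negative dot product with the gradient."""
    return dot(d, grad) < 0.0

x = [3.0, 1.0]
g = grad_f(x)
eta = 1e-3  # small step, so the first-order term dominates
candidates = [[-1.0, 0.0], [1.0, -1.0], [1.0, 0.0], [2.0, -1.0], [-4.0, -8.0]]

flags, decreased = [], []
for d in candidates:
    x_new = [xi + eta * di for xi, di in zip(x, d)]
    flags.append(is_descent_direction(d, g))
    decreased.append(f(x_new) < f(x))
    print(d, flags[-1], decreased[-1])
```

The candidate `[2.0, -1.0]` is orthogonal to the gradient at this point, so the first-order term vanishes and it is not flagged as a descent direction.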

---

## 🔻 5. **Gradient Descent Direction**

Let:

$$
d = -\nabla f(x)
$$

Then, as long as $\nabla f(x) \neq 0$:

$$
d^T \nabla f(x) = -\|\nabla f(x)\|^2 < 0
$$

* ✅ A valid descent direction whenever $\nabla f(x) \neq 0$
* ✅ Guarantees a decrease in function value if $\eta > 0$ is small enough
* ✅ Points in the direction of **steepest descent** (with respect to the Euclidean norm)
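The update $x \leftarrow x - \eta \nabla f(x)$ can be sketched as a short loop. The objective, starting point, and step sizes below are illustrative assumptions. The second call shows why "small enough" matters: with too large a step, a single update can increase $f$.

```python
# Hypothetical objective with its minimum at (1, -1).
def f(x):
    return (x[0] - 1.0)**2 + 2.0 * (x[1] + 1.0)**2

def grad_f(x):
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 1.0)]

def gradient_descent(x0, eta, steps):
    """Repeat x <- x - eta * grad_f(x); return the final point and f-values."""
    x = list(x0)
    history = [f(x)]
    for _ in range(steps):
        g = grad_f(x)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        history.append(f(x))
    return x, history

# Small step size: f decreases at every iteration on this example.
x_final, history = gradient_descent([3.0, 1.0], eta=0.1, steps=100)
print(x_final, history[0], history[-1])

# Step size too large: one update overshoots and f goes up.
_, too_big = gradient_descent([3.0, 1.0], eta=1.0, steps=1)
print(too_big)
```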

---

## 🧭 6. **Geometry of Descent Directions**

In 2D, let:

* $W = \nabla f(x)$ (gradient vector)

We consider the set of vectors $d \in \mathbb{R}^2$ satisfying:

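* $d^T W = 0$: orthogonal to the gradient → a **line** through the origin
* $d^T W < 0$: descent directions → the open **half-space** (a half-plane in 2D) on the side of this line opposite to $W$
* $d^T W > 0$: ascent directions → the open **half-space** on the same side of this line as $W$

Thus, **descent directions** are exactly the vectors in the half-space opposite to the gradient.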
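The three regions can be seen by sweeping unit vectors around the circle and classifying each one by the sign of $d^T W$. The gradient `w = [1.0, 1.0]` and the tolerance `tol` are assumptions for illustration; the tolerance is there only because floating-point dot products of orthogonal vectors are rarely exactly zero.

```python
import math

def classify(d, w, tol=1e-9):
    """Label d as 'descent', 'ascent', or 'orthogonal' relative to gradient w."""
    s = d[0] * w[0] + d[1] * w[1]
    if abs(s) <= tol:
        return "orthogonal"
    return "descent" if s < 0 else "ascent"

w = [1.0, 1.0]  # hypothetical gradient vector in 2D
labels = []
for deg in range(0, 360, 45):
    theta = math.radians(deg)
    d = [math.cos(theta), math.sin(theta)]
    labels.append(classify(d, w))
    print(deg, labels[-1])
```

With this choice of `w`, the boundary line runs along the 135° and 315° directions, and the descent half-plane contains the directions between them on the side facing away from `w`.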

---

## 🔽 7. **Why Gradient Descent?**

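Although many directions $d$ satisfy $d^T \nabla f(x) < 0$, only one gives the **maximum first-order decrease** for a fixed step length:

> **The steepest descent direction is $-\nabla f(x)$**

If you're allowed to move by a **unit length** ($\|d\| = 1$) and $\nabla f(x) \neq 0$, the Cauchy-Schwarz inequality gives $d^T \nabla f(x) \geq -\|\nabla f(x)\|$, with equality exactly when $d$ points along $-\nabla f(x)$:

$$
\arg\min_{\|d\| = 1} d^T \nabla f(x) = -\frac{\nabla f(x)}{\|\nabla f(x)\|}
$$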

🔁 Thus, gradient descent is also called **steepest descent**.
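A brute-force check of the steepest-descent claim, assuming a hypothetical gradient `g = [3.0, 4.0]` in 2D: sample many unit directions, pick the one with the smallest $d^T g$, and compare it with the normalized negative gradient.

```python
import math

g = [3.0, 4.0]  # hypothetical gradient at the current point
g_norm = math.hypot(g[0], g[1])
steepest = [-g[0] / g_norm, -g[1] / g_norm]  # normalized negative gradient

best_d, best_value = None, float("inf")
n = 3600  # number of sampled unit directions (one every 0.1 degrees)
for k in range(n):
    theta = 2.0 * math.pi * k / n
    d = [math.cos(theta), math.sin(theta)]
    value = d[0] * g[0] + d[1] * g[1]  # directional derivative along d
    if value < best_value:
        best_d, best_value = d, value

print(best_d, steepest)
print(best_value, -g_norm)
```

The sampled minimizer agrees with the normalized negative gradient up to the angular resolution of the sweep, and the smallest directional derivative is close to $-\|g\|$.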

---

## ⚠️ 8. **Gradient Descent in Constrained Optimization**

So far, we've assumed an **unconstrained optimization** problem:

$$
\min_x f(x)
$$

But what if there are constraints? For example:

$$
\min_x f(x) \quad \text{subject to } g(x) \leq 0
$$

Now:

* Not all directions are valid
* If you move in $-\nabla f(x)$, you might **violate the constraint**

In the **cow and rope** example:

* Constraint: $g(x) = x_1^2 + x_2^2 - r^2 \leq 0$
* Only points inside the circle (rope radius) are allowed

❗In such cases, pure gradient descent **might fail**, because:

* It may suggest a direction that **leads outside the feasible region**
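The failure mode can be reproduced on the circle constraint. In this sketch the objective (a target at `(3, 0)` outside a rope of radius 1), the starting point, and the step size are all hypothetical. The last step, which projects the infeasible point back onto the disk, is only a preview of one possible fix (projected gradient descent) and is not presented here as the lecture's method.

```python
import math

r = 1.0  # rope radius (assumed value)

def g(x):
    """Constraint function: feasible when g(x) <= 0."""
    return x[0]**2 + x[1]**2 - r**2

# Hypothetical objective: squared distance to a target outside the circle.
def f(x):
    return (x[0] - 3.0)**2 + x[1]**2

def grad_f(x):
    return [2.0 * (x[0] - 3.0), 2.0 * x[1]]

def project_onto_disk(x):
    """Return the closest point to x inside the disk of radius r."""
    norm = math.hypot(x[0], x[1])
    if norm <= r:
        return list(x)
    return [xi * r / norm for xi in x]

x = [0.9, 0.0]  # feasible starting point
eta = 0.1
grad = grad_f(x)
x_step = [xi - eta * gi for xi, gi in zip(x, grad)]  # plain gradient step
x_proj = project_onto_disk(x_step)                   # pulled back to the disk

print(g(x), g(x_step), g(x_proj))
```

The plain gradient step lowers $f$ but lands outside the circle, while the projected point sits on the boundary of the feasible region.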

---

## ✅ 9. Summary of Key Concepts

| Concept                            | Explanation                                                    |
| ---------------------------------- | -------------------------------------------------------------- |
| **Taylor Series (1D)**             | Approximates a function near a point using its derivatives     |
| **Taylor Series (Multivariable)**  | $f(x + \eta d) \approx f(x) + \eta d^T \nabla f(x)$            |
| **Directional Derivative**         | $d^T \nabla f(x)$, measures rate of change along direction $d$ |
| **Descent Direction**              | Any $d$ such that $d^T \nabla f(x) < 0$                        |
| **Steepest Descent**               | Direction $-\nabla f(x)$, gives the largest first-order decrease per unit step |
| **Gradient Descent**               | Move along $-\nabla f(x)$ with step size $\eta$                |
| **Constraints Complicate Descent** | With $g(x) \leq 0$, $-\nabla f(x)$ may be infeasible           |
| **Half-space Geometry**            | Gradient divides space into ascent/descent half-spaces         |

---

## 📌 Next Steps

This lecture transitions into **constrained optimization**, where:

* Not all directions are allowed
* You must consider constraints when picking your descent direction

Upcoming topics include:

* How to adjust gradient descent for constraints
* Karush-Kuhn-Tucker (KKT) conditions
* Projected gradient descent
* Lagrange multipliers