# Week 10: Derivatives

**Course:** Mathematics for Data Science I (BSMA1001)  
**Week:** 10 of 12

## Learning Objectives
- Derivative definition
- Power rule, product rule, quotient rule
- Chain rule
- Critical points and extrema
- Applications in optimization


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import optimize, integrate
import sympy as sp

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')
sp.init_printing()
%matplotlib inline

print('✓ Libraries loaded')

## 📐 1. Derivative Definition

### Introduction to Derivatives

The **derivative** measures the instantaneous rate of change of a function. It's one of the most important concepts in calculus and is fundamental to optimization, machine learning, and data science.

**Geometric interpretation:** The derivative at a point is the slope of the tangent line to the curve at that point.

**Physical interpretation:** If $f(t)$ represents position at time $t$, then $f'(t)$ represents velocity (rate of change of position).

---

### 1.1 Formal Definition (Limit Definition)

The derivative of $f$ at $x = a$ is:

$$f'(a) = \lim_{h \to 0} \frac{f(a + h) - f(a)}{h}$$

**Alternative forms:**

$$f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}$$

$$f'(a) = \lim_{x \to a} \frac{f(x) - f(a)}{x - a}$$

**Key insight:** The derivative is a limit of average rates of change (slopes of secant lines) as the interval shrinks to zero.

---

### 1.2 Notation

Multiple notations for derivatives:

1. **Lagrange notation:** $f'(x)$, $f''(x)$, $f'''(x)$
2. **Leibniz notation:** $\frac{df}{dx}$, $\frac{d^2f}{dx^2}$
3. **Newton notation:** $\dot{f}$, $\ddot{f}$ (for time derivatives)
4. **Operator notation:** $D_x f$, $D^2_x f$

**In this course, we primarily use:** $f'(x)$ and $\frac{df}{dx}$

---

### 1.3 Differentiability and Continuity

**Theorem:** If $f$ is differentiable at $x = a$, then $f$ is continuous at $x = a$.

**Contrapositive:** If $f$ is not continuous at $a$, then $f$ is not differentiable at $a$.

**Important:** The converse is NOT true!
- A function can be continuous but not differentiable
- **Example:** $f(x) = |x|$ at $x = 0$ (continuous but sharp corner)

**Non-differentiable cases:**
1. **Discontinuity:** Function has a jump or break
2. **Corner/cusp:** Sharp point (e.g., $|x|$ at 0)
3. **Vertical tangent:** Slope is infinite (e.g., $\sqrt[3]{x}$ at 0)
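
The corner and vertical-tangent cases can be seen numerically. The following is a minimal sketch rather than part of the original material: `difference_quotient` is a helper name introduced here for illustration, and the step sizes are arbitrary choices.

```python
import numpy as np

def difference_quotient(func, a, h):
    """Slope of the secant line through (a, func(a)) and (a + h, func(a + h))."""
    return (func(a + h) - func(a)) / h

steps = (1e-1, 1e-3, 1e-5)  # arbitrary shrinking step sizes

# Corner: |x| at 0. The one-sided slopes settle at two different values.
abs_right = [difference_quotient(np.abs, 0.0, h) for h in steps]
abs_left = [difference_quotient(np.abs, 0.0, -h) for h in steps]

# Vertical tangent: cube root at 0. The slopes keep growing as h shrinks.
cbrt_slopes = [difference_quotient(np.cbrt, 0.0, h) for h in steps]

print("abs, from the right:", abs_right)
print("abs, from the left: ", abs_left)
print("cbrt, from the right:", cbrt_slopes)
```

In the first case the two one-sided limits exist but disagree; in the second the difference quotient has no finite limit at all.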

---

### 1.4 Computing Derivatives from Definition

**Example 1:** Find $f'(x)$ for $f(x) = x^2$

$$f'(x) = \lim_{h \to 0} \frac{(x+h)^2 - x^2}{h}$$

$$= \lim_{h \to 0} \frac{x^2 + 2xh + h^2 - x^2}{h}$$

$$= \lim_{h \to 0} \frac{2xh + h^2}{h} = \lim_{h \to 0} (2x + h) = 2x$$

**Result:** $(x^2)' = 2x$

**Example 2:** Find $f'(x)$ for $f(x) = \frac{1}{x}$

$$f'(x) = \lim_{h \to 0} \frac{\frac{1}{x+h} - \frac{1}{x}}{h} = \lim_{h \to 0} \frac{x - (x+h)}{h \cdot x(x+h)}$$

$$= \lim_{h \to 0} \frac{-h}{h \cdot x(x+h)} = -\frac{1}{x^2}$$
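
As a cross-check on both examples, the limit definition can be handed to SymPy directly. This is a sketch under assumptions: `derivative_by_limit`, `x_sym`, and `h_sym` are names introduced here for illustration, and `sp.limit` is asked to evaluate the difference quotient symbolically.

```python
import sympy as sp

x_sym, h_sym = sp.symbols('x h')

def derivative_by_limit(expr):
    """Apply the limit definition of the derivative to an expression in x_sym."""
    quotient = (expr.subs(x_sym, x_sym + h_sym) - expr) / h_sym
    return sp.simplify(sp.limit(quotient, h_sym, 0))

d_square = derivative_by_limit(x_sym**2)  # Example 1
d_recip = derivative_by_limit(1 / x_sym)  # Example 2

print("d/dx x^2 =", d_square)
print("d/dx 1/x =", d_recip)
```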

---

### 1.5 Data Science Applications

**1. Gradient Descent**

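The derivative (gradient) points in the direction of steepest ascent of the loss, so gradient descent steps in the opposite direction, scaled by a learning rate $\alpha$:

$$\theta_{t+1} = \theta_t - \alpha \cdot \frac{\partial L}{\partial \theta}$$

where $\frac{\partial L}{\partial \theta}$ is the derivative of the loss $L$ with respect to the parameter $\theta$.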
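
The update rule can be sketched in a few lines. The loss $L(\theta) = (\theta - 3)^2 + 1$, the starting point, and the learning rate below are illustrative assumptions, and `gradient_descent` is a helper name introduced here.

```python
def gradient_descent(grad, theta0, alpha, n_steps):
    """Repeat theta <- theta - alpha * grad(theta) and return every iterate."""
    thetas = [theta0]
    for _ in range(n_steps):
        thetas.append(thetas[-1] - alpha * grad(thetas[-1]))
    return thetas

# Assumed example loss and its derivative (hypothetical, for illustration only).
loss = lambda theta: (theta - 3) ** 2 + 1
loss_grad = lambda theta: 2 * (theta - 3)

path = gradient_descent(loss_grad, theta0=0.5, alpha=0.3, n_steps=10)
print([round(theta, 4) for theta in path])
```

For this quadratic, each step multiplies the distance to the minimizer $\theta = 3$ by $1 - 2\alpha = 0.4$, so the iterates approach 3 geometrically.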

**2. Backpropagation**

Neural networks use the chain rule to compute the gradient of the loss with respect to each weight.

**3. Sensitivity Analysis**

How much does output change when input changes? $\text{Sensitivity} = \frac{df}{dx}$

**4. Marginal Analysis**

- **Marginal cost:** $MC = \frac{dC}{dq}$
- **Marginal revenue:** $MR = \frac{dR}{dq}$

**5. Rate of Change**

- **Velocity:** $v(t) = \frac{ds}{dt}$
- **Acceleration:** $a(t) = \frac{dv}{dt} = \frac{d^2s}{dt^2}$
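
Velocity and acceleration follow from a position function by differentiating once and twice. A short SymPy sketch, assuming the position $s(t) = -\tfrac{1}{2}t^2 + 5t$ purely for illustration (`t_sym` and the other names are introduced here):

```python
import sympy as sp

t_sym = sp.Symbol('t')

# Assumed position function, chosen only for illustration.
position = -sp.Rational(1, 2) * t_sym**2 + 5 * t_sym

velocity = sp.diff(position, t_sym)         # v(t) = ds/dt
acceleration = sp.diff(position, t_sym, 2)  # a(t) = d^2 s / dt^2

# The object is momentarily at rest where the velocity is zero.
t_stop = sp.solve(sp.Eq(velocity, 0), t_sym)

print("v(t) =", velocity)
print("a(t) =", acceleration)
print("v(t) = 0 at t =", t_stop)
```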

---

### 1.6 Common Derivatives (Reference)

| Function | Derivative | Notes |
|----------|------------|-------|
| $x^n$ | $nx^{n-1}$ | Power rule |
| $e^x$ | $e^x$ | Exponential |
| $\ln x$ | $\frac{1}{x}$ | Natural log |
| $\sin x$ | $\cos x$ | Sine |
| $\cos x$ | $-\sin x$ | Cosine |
| $\tan x$ | $\sec^2 x$ | Tangent |
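
The table can be spot-checked with SymPy. The sketch below compares `sp.diff` against each claimed derivative at a few sample points; it is a numerical sanity check, not a proof, and the sample points and the exponent `n_value = 2.5` are arbitrary assumptions.

```python
import sympy as sp

x_sym, n_sym = sp.symbols('x n', positive=True)

# (function, derivative claimed in the table)
table = [
    (x_sym**n_sym, n_sym * x_sym**(n_sym - 1)),
    (sp.exp(x_sym), sp.exp(x_sym)),
    (sp.log(x_sym), 1 / x_sym),
    (sp.sin(x_sym), sp.cos(x_sym)),
    (sp.cos(x_sym), -sp.sin(x_sym)),
    (sp.tan(x_sym), sp.sec(x_sym)**2),
]

sample_points = [0.5, 1.3, 2.0]  # arbitrary points where every entry is defined
n_value = 2.5                    # arbitrary exponent for the power-rule row

residuals = []
for func, claimed in table:
    gap = sp.diff(func, x_sym) - claimed
    worst = max(abs(float(gap.subs({x_sym: p, n_sym: n_value})))
                for p in sample_points)
    residuals.append(worst)
    print(f"{str(func):>8}: largest gap = {worst:.1e}")
```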

---

In [None]:
"""
DERIVATIVE DEFINITION - SECTION HEADER
"""

print("="*80)
print("SECTION 1: DERIVATIVE DEFINITION")
print("="*80)

In [None]:
"""
1. NUMERICAL DERIVATIVE APPROXIMATION
"""

print("\n" + "="*80)
print("1. NUMERICAL DERIVATIVE FROM LIMIT DEFINITION")
print("="*80)

def numerical_derivative(f, x, h=1e-7):
    """
    Compute derivative using limit definition: f'(x) ≈ [f(x+h) - f(x)]/h
    """
    return (f(x + h) - f(x)) / h

def central_difference(f, x, h=1e-5):
    """
    More accurate: f'(x) ≈ [f(x+h) - f(x-h)]/(2h)
    """
    return (f(x + h) - f(x - h)) / (2 * h)

# Example: f(x) = x²
print("\nExample: f(x) = x² at x = 3")
f_square = lambda x: x**2
x_val = 3

print(f"  Analytical: f'(x) = 2x, so f'(3) = 6")

# Test different h values
h_values = [0.1, 0.01, 0.001, 0.0001, 0.00001]
print(f"\n  {'h':>10} | {'Forward diff':>15} | {'Error':>12}")
print("  " + "-"*42)

for h in h_values:
    approx = numerical_derivative(f_square, x_val, h)
    error = abs(approx - 6)
    print(f"  {h:10.5f} | {approx:15.10f} | {error:12.2e}")

print(f"\n  Central difference (h=0.00001): {central_difference(f_square, x_val):.10f}")
print("  ✓ As h → 0, approximation converges to exact derivative")

In [None]:
"""
2. VISUALIZING SECANT TO TANGENT
"""

print("\n" + "="*80)
print("2. SECANT LINES APPROACHING TANGENT LINE")
print("="*80)

def visualize_derivative_convergence(f, a, h_values, true_derivative):
    """
    Visualize how secant lines converge to tangent line as h → 0
    """
    fig, axes = plt.subplots(1, len(h_values), figsize=(16, 4))
    if len(h_values) == 1:
        axes = [axes]
    
    x = np.linspace(a - 2, a + 2, 500)
    y = f(x)
    
    for idx, h in enumerate(h_values):
        ax = axes[idx]
        
        # Plot function
        ax.plot(x, y, 'b-', linewidth=2, label='f(x)')
        
        # Points
        fa = f(a)
        fah = f(a + h)
        ax.plot(a, fa, 'ro', markersize=10, label=f'(a, f(a))', zorder=5)
        ax.plot(a + h, fah, 'go', markersize=8, label=f'(a+h, f(a+h))')
        
        # Secant line
        secant_slope = (fah - fa) / h
        x_secant = np.array([a - 1, a + h + 1])
        y_secant = fa + secant_slope * (x_secant - a)
        ax.plot(x_secant, y_secant, 'g--', linewidth=2, 
                label=f'Secant (m={secant_slope:.3f})')
        
        # True tangent line
        x_tangent = np.array([a - 1, a + 1])
        y_tangent = fa + true_derivative * (x_tangent - a)
        ax.plot(x_tangent, y_tangent, 'r--', linewidth=2, alpha=0.7,
                label=f"Tangent (m={true_derivative:.3f})")
        
        ax.set_xlabel('x', fontsize=11)
        ax.set_ylabel('f(x)', fontsize=11)
        ax.set_title(f'h = {h:.3f}\nSlope ≈ {secant_slope:.3f}', 
                     fontsize=11, fontweight='bold')
        ax.legend(fontsize=8, loc='best')
        ax.grid(True, alpha=0.3)
        ax.set_xlim([a-1.5, a+1.5])
    
    plt.tight_layout()
    plt.show()

# Example: f(x) = x² at x = 2
print("\nExample: f(x) = x² at x = 2")
print("  f'(2) = 4 (true derivative)")
print("\n  Visualizing convergence as h decreases:")

h_vals = [1.0, 0.5, 0.1, 0.01]
visualize_derivative_convergence(lambda x: x**2, 2, h_vals, 4)

In [None]:
"""
3. DERIVATIVES FROM FIRST PRINCIPLES - Example 1: x³
"""

print("\n" + "="*80)
print("3. COMPUTING DERIVATIVES FROM FIRST PRINCIPLES")
print("="*80)

# Example 1: x³
print("\nExample 1: f(x) = x³")
print("  Step 1: f'(x) = lim[h→0] [(x+h)³ - x³]/h")
print("  Step 2: Expand (x+h)³ = x³ + 3x²h + 3xh² + h³")
print("  Step 3: [(x³ + 3x²h + 3xh² + h³) - x³]/h")
print("  Step 4: [3x²h + 3xh² + h³]/h = 3x² + 3xh + h²")
print("  Step 5: lim[h→0] (3x² + 3xh + h²) = 3x²")
print("  ✓ Result: (x³)' = 3x²")

# Verify numerically
f_cube = lambda x: x**3
x_test = 2
analytical = 3 * x_test**2
numerical = central_difference(f_cube, x_test)
print(f"\n  Verification at x = {x_test}:")
print(f"    Analytical: 3({x_test})² = {analytical}")
print(f"    Numerical: {numerical:.10f}")
print(f"    Error: {abs(analytical - numerical):.2e}")

In [None]:
"""
3. DERIVATIVES FROM FIRST PRINCIPLES - Example 2: √x
"""

# Example 2: √x
print("\n" + "="*60)
print("Example 2: f(x) = √x")
print("  Step 1: f'(x) = lim[h→0] [√(x+h) - √x]/h")
print("  Step 2: Multiply by conjugate: [√(x+h) - √x][√(x+h) + √x]/[h(√(x+h) + √x)]")
print("  Step 3: [(x+h) - x]/[h(√(x+h) + √x)]")
print("  Step 4: h/[h(√(x+h) + √x)]")
print("  Step 5: 1/(√(x+h) + √x)")
print("  Step 6: lim[h→0] 1/(√(x+h) + √x) = 1/(2√x)")
print("  ✓ Result: (√x)' = 1/(2√x)")

# Verify
f_sqrt = lambda x: np.sqrt(x)
x_test = 4
analytical = 1 / (2 * np.sqrt(x_test))
numerical = central_difference(f_sqrt, x_test)
print(f"\n  Verification at x = {x_test}:")
print(f"    Analytical: 1/(2√{x_test}) = {analytical}")
print(f"    Numerical: {numerical:.10f}")
print(f"    Error: {abs(analytical - numerical):.2e}")

In [None]:
"""
3. DERIVATIVES FROM FIRST PRINCIPLES - Example 3: eˣ
"""

# Example 3: eˣ
print("\n" + "="*60)
print("Example 3: f(x) = eˣ")
print("  Step 1: f'(x) = lim[h→0] [e^(x+h) - eˣ]/h")
print("  Step 2: [eˣ·e^h - eˣ]/h = eˣ·[e^h - 1]/h")
print("  Step 3: eˣ · lim[h→0] [e^h - 1]/h")
print("  Step 4: lim[h→0] [e^h - 1]/h = 1 (special limit)")
print("  Step 5: eˣ · 1 = eˣ")
print("  ✓ Result: (eˣ)' = eˣ")

# Verify special limit
h_vals = [0.1, 0.01, 0.001, 0.0001, 0.00001]
print(f"\n  Verifying lim[h→0] (e^h - 1)/h = 1:")
print(f"  {'h':>10} | {'(e^h - 1)/h':>15}")
print("  " + "-"*28)
for h in h_vals:
    limit_val = (np.exp(h) - 1) / h
    print(f"  {h:10.5f} | {limit_val:15.10f}")

In [None]:
"""
4. DIFFERENTIABILITY VS CONTINUITY - |x| at x=0
"""

print("\n" + "="*80)
print("4. DIFFERENTIABILITY VS CONTINUITY")
print("="*80)

# Example: |x| at x = 0
print("\nExample: f(x) = |x| at x = 0")
print("  Is f continuous at 0? YES")
print("    lim[x→0] |x| = 0 = f(0) ✓")
print("\n  Is f differentiable at 0?")
print("    Check left and right derivatives:")

h_vals = [0.1, 0.01, 0.001, 0.0001, -0.0001, -0.001, -0.01, -0.1]
print(f"\n  {'h':>10} | {'[f(0+h) - f(0)]/h':>20} | {'Direction':>10}")
print("  " + "-"*45)

f_abs = lambda x: np.abs(x)
for h in h_vals:
    deriv_approx = (f_abs(h) - f_abs(0)) / h
    direction = "Right" if h > 0 else "Left"
    print(f"  {h:10.4f} | {deriv_approx:20.3f} | {direction:>10}")

print("\n  Left derivative: lim[h→0⁻] = -1")
print("  Right derivative: lim[h→0⁺] = +1")
print("  Since left ≠ right, |x| is NOT differentiable at 0 ✗")
print("\n  Key insight: Continuous does NOT imply differentiable!")

In [None]:
"""
5. TANGENT LINE COMPUTATION
"""

print("\n" + "="*80)
print("5. TANGENT LINE EQUATIONS")
print("="*80)

def compute_tangent_line(f, df, a):
    """
    Compute tangent line at x = a: y = f'(a)(x - a) + f(a)
    """
    fa = f(a)
    slope = df(a)
    intercept = fa - slope * a
    
    if intercept >= 0:
        eq = f"y = {slope:.3f}x + {intercept:.3f}"
    else:
        eq = f"y = {slope:.3f}x - {abs(intercept):.3f}"
    
    return slope, intercept, eq

# Example 1: f(x) = x² at x = 3
print("\nExample 1: f(x) = x² at x = 3")
f1 = lambda x: x**2
df1 = lambda x: 2*x
slope, intercept, eq = compute_tangent_line(f1, df1, 3)
print(f"  f(3) = 9")
print(f"  f'(3) = 6")
print(f"  Tangent line: {eq}")

# Example 2: f(x) = x³ - 2x² + 1 at x = 1
print("\nExample 2: f(x) = x³ - 2x² + 1 at x = 1")
f2 = lambda x: x**3 - 2*x**2 + 1
df2 = lambda x: 3*x**2 - 4*x
slope, intercept, eq = compute_tangent_line(f2, df2, 1)
print(f"  f(1) = {f2(1)}")
print(f"  f'(x) = 3x² - 4x")
print(f"  f'(1) = {df2(1)}")
print(f"  Tangent line: {eq}")

In [None]:
"""
6. COMPREHENSIVE VISUALIZATIONS (9 plots)
"""

print("\n" + "="*80)
print("6. COMPREHENSIVE VISUALIZATIONS")
print("="*80)

fig = plt.figure(figsize=(18, 12))
gs = fig.add_gridspec(3, 3, hspace=0.35, wspace=0.35)

# Plot 1: Derivative as slope
print("\n  Creating Plot 1: Derivative as slope...")
ax = fig.add_subplot(gs[0, 0])
x = np.linspace(-2, 4, 500)
f = lambda x: 0.5*x**2 - x + 1
df = lambda x: x - 1

x_points = [0, 1, 2, 3]
colors = ['red', 'green', 'blue', 'purple']

ax.plot(x, f(x), 'b-', linewidth=2.5, label='f(x) = 0.5x² - x + 1')

for xp, color in zip(x_points, colors):
    ax.plot(xp, f(xp), 'o', color=color, markersize=10, zorder=5)
    slope = df(xp)
    x_tan = np.array([xp - 1, xp + 1])
    y_tan = f(xp) + slope * (x_tan - xp)
    ax.plot(x_tan, y_tan, '--', color=color, linewidth=1.5, alpha=0.7,
            label=f"x={xp}, m={slope:.1f}")

ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('f(x)', fontsize=11)
ax.set_title('Derivative = Slope of Tangent Line', fontsize=11, fontweight='bold')
ax.legend(fontsize=8, loc='best')
ax.grid(True, alpha=0.3)

# Plot 2: Function and its derivative
print("  Creating Plot 2: Function and derivative...")
ax = fig.add_subplot(gs[0, 1])
x = np.linspace(-3, 3, 500)
f_x = x**3 - 3*x
df_x = 3*x**2 - 3

ax.plot(x, f_x, 'b-', linewidth=2.5, label="f(x) = x³ - 3x")
ax.plot(x, df_x, 'r-', linewidth=2.5, label="f'(x) = 3x² - 3")
ax.axhline(0, color='black', linewidth=0.8)
ax.axvline(0, color='black', linewidth=0.8)

critical_pts = [-1, 1]
for cp in critical_pts:
    idx = np.argmin(np.abs(x - cp))
    ax.plot(cp, f_x[idx], 'go', markersize=10, zorder=5)

ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('y', fontsize=11)
ax.set_title('Function and Derivative\nGreen dots: f\'=0 (critical points)', 
             fontsize=11, fontweight='bold')
ax.legend(fontsize=9, loc='best')
ax.grid(True, alpha=0.3)

# Plot 3: Convergence of difference quotient
print("  Creating Plot 3: Convergence to derivative...")
ax = fig.add_subplot(gs[0, 2])
f = lambda x: x**2
x0 = 2
true_deriv = 4

h_values = np.logspace(-8, 0, 100)
approx_derivs = [(f(x0 + h) - f(x0)) / h for h in h_values]

ax.semilogx(h_values, approx_derivs, 'b-', linewidth=2.5, label='Approximation')
ax.axhline(true_deriv, color='red', linestyle='--', linewidth=2.5, 
           label=f'True derivative = {true_deriv}')
ax.fill_between(h_values, true_deriv - 0.05, true_deriv + 0.05, 
                alpha=0.2, color='red')

ax.set_xlabel('h', fontsize=11)
ax.set_ylabel("[f(x+h) - f(x)]/h", fontsize=11)
ax.set_title('Convergence to Derivative\nf(x)=x² at x=2', fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3, which='both')
ax.set_ylim([3.5, 4.5])

# Plot 4: |x| - continuous but not differentiable
print("  Creating Plot 4: Continuous but not differentiable...")
ax = fig.add_subplot(gs[1, 0])
x = np.linspace(-2, 2, 500)
y = np.abs(x)

ax.plot(x, y, 'b-', linewidth=3, label='f(x) = |x|')
ax.plot(0, 0, 'ro', markersize=15, label='Corner at (0,0)', zorder=5)

x_left = np.array([-2, 0])
y_left = -x_left
ax.plot(x_left, y_left, 'g--', linewidth=2.5, alpha=0.8, label='Left: slope = -1')

x_right = np.array([0, 2])
y_right = x_right
ax.plot(x_right, y_right, 'r--', linewidth=2.5, alpha=0.8, label='Right: slope = +1')

ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('f(x)', fontsize=11)
ax.set_title('Continuous but NOT Differentiable\nCorner at x=0', 
             fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# Plot 5: Velocity as derivative
print("  Creating Plot 5: Velocity as derivative...")
ax = fig.add_subplot(gs[1, 1])
t = np.linspace(0, 10, 500)
s = lambda t: -0.5*t**2 + 5*t
v = lambda t: -t + 5

ax.plot(t, s(t), 'b-', linewidth=2.5, label='s(t) = -0.5t² + 5t (position)')
ax.plot(t, v(t), 'r-', linewidth=2.5, label="v(t) = s'(t) = -t + 5 (velocity)")
ax.axhline(0, color='black', linewidth=0.8)

t_stop = 5
ax.plot(t_stop, s(t_stop), 'go', markersize=15, label=f'v=0 at t={t_stop}s', zorder=5)
ax.axvline(t_stop, color='green', linestyle=':', alpha=0.5)

ax.set_xlabel('Time t (seconds)', fontsize=11)
ax.set_ylabel('Value', fontsize=11)
ax.set_title('Velocity as Derivative\nMaximum position when velocity = 0', 
             fontsize=11, fontweight='bold')
ax.legend(fontsize=9, loc='best')
ax.grid(True, alpha=0.3)

# Plot 6: Numerical vs analytical derivative
print("  Creating Plot 6: Numerical vs analytical...")
ax = fig.add_subplot(gs[1, 2])
x = np.linspace(0.1, 5, 200)
f = lambda x: np.sin(x)
df_true = lambda x: np.cos(x)

y_analytical = df_true(x)
h = 0.01
y_numerical = np.array([(f(xi + h) - f(xi)) / h for xi in x])
error = np.abs(y_analytical - y_numerical)

ax.plot(x, y_analytical, 'b-', linewidth=2.5, label='Analytical: cos(x)')
ax.plot(x, y_numerical, 'r--', linewidth=2, alpha=0.7, label=f'Numerical (h={h})')

ax.set_xlabel('x', fontsize=11)
ax.set_ylabel("f'(x)", fontsize=11)
ax.set_title(f'Numerical vs Analytical\nf(x)=sin(x), max error={error.max():.2e}', 
             fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# Plot 7: Second derivative and concavity
print("  Creating Plot 7: Second derivative and concavity...")
ax = fig.add_subplot(gs[2, 0])
x = np.linspace(-2, 2, 500)
f = lambda x: x**4 - 2*x**2
ddf = lambda x: 12*x**2 - 4

ax.plot(x, f(x), 'b-', linewidth=2.5, label="f(x) = x⁴ - 2x²")
ax.plot(x, ddf(x), 'r-', linewidth=2.5, label="f''(x) = 12x² - 4")
ax.axhline(0, color='black', linewidth=0.8)

inflection = [-np.sqrt(1/3), np.sqrt(1/3)]
for ip in inflection:
    idx = np.argmin(np.abs(x - ip))
    ax.plot(ip, f(x)[idx], 'go', markersize=12, zorder=5)

ax.fill_between(x, -5, 5, where=(ddf(x) > 0), alpha=0.15, color='yellow',
                label='Concave up (f">0)')
ax.fill_between(x, -5, 5, where=(ddf(x) < 0), alpha=0.15, color='cyan',
                label='Concave down (f"<0)')

ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('y', fontsize=11)
ax.set_title('Second Derivative & Concavity\nGreen: inflection points', 
             fontsize=11, fontweight='bold')
ax.legend(fontsize=8, loc='best')
ax.grid(True, alpha=0.3)
ax.set_ylim([-4, 4])

# Plot 8: Gradient descent visualization
print("  Creating Plot 8: Gradient descent...")
ax = fig.add_subplot(gs[2, 1])
f = lambda x: (x - 3)**2 + 1
df = lambda x: 2*(x - 3)

x = np.linspace(0, 6, 500)
ax.plot(x, f(x), 'b-', linewidth=2.5, label='Loss: L(θ) = (θ-3)²+1')

x_current = 0.5
learning_rate = 0.3
steps = []

for i in range(10):
    steps.append((x_current, f(x_current)))
    gradient = df(x_current)
    x_current = x_current - learning_rate * gradient

x_path = [s[0] for s in steps]
y_path = [s[1] for s in steps]
ax.plot(x_path, y_path, 'ro-', markersize=8, linewidth=2, label='GD path', zorder=4)

for i in range(len(steps)-1):
    ax.annotate('', xy=(x_path[i+1], y_path[i+1]), 
                xytext=(x_path[i], y_path[i]),
                arrowprops=dict(arrowstyle='->', color='red', lw=2))

ax.plot(3, 1, 'g*', markersize=25, label='Minimum', zorder=5)
ax.plot(x_path[0], y_path[0], 'ko', markersize=10, label='Start', zorder=5)

ax.set_xlabel('Parameter θ', fontsize=11)
ax.set_ylabel('Loss L(θ)', fontsize=11)
ax.set_title('Gradient Descent\nθ ← θ - α·dL/dθ (α=0.3)', fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# Plot 9: Rate of change application (population growth)
print("  Creating Plot 9: Population growth rate...")
ax = fig.add_subplot(gs[2, 2])
t = np.linspace(0, 5, 500)
P = lambda t: 100 * np.exp(0.3 * t)
dP = lambda t: 30 * np.exp(0.3 * t)

color = 'tab:blue'
ax.plot(t, P(t), color=color, linewidth=2.5, label='P(t) = 100e^(0.3t)')
ax.set_xlabel('Time (years)', fontsize=11)
ax.set_ylabel('Population', fontsize=11, color=color)
ax.tick_params(axis='y', labelcolor=color)

ax_twin = ax.twinx()
color = 'tab:red'
ax_twin.plot(t, dP(t), color=color, linewidth=2.5, label="P'(t) = 30e^(0.3t)")
ax_twin.set_ylabel('Growth Rate (per year)', fontsize=11, color=color)
ax_twin.tick_params(axis='y', labelcolor=color)

ax.set_title('Population Growth\nDerivative = Growth Rate', 
             fontsize=11, fontweight='bold')
ax.grid(True, alpha=0.3)

lines1, labels1 = ax.get_legend_handles_labels()
lines2, labels2 = ax_twin.get_legend_handles_labels()
ax.legend(lines1 + lines2, labels1 + labels2, fontsize=9, loc='upper left')

plt.tight_layout()
plt.show()

print("\n✓ All 9 visualizations complete")

print("\n" + "="*80)
print("SECTION 1 COMPLETE: Derivative Definition")
print("="*80)

## ⚙️ 2. Differentiation Rules: Power, Product, Quotient

### Introduction

Computing derivatives from first principles is tedious. **Differentiation rules** provide shortcuts to find derivatives quickly and efficiently.

---

### 2.1 Power Rule

**Theorem:** For any real number $n$ (and any $x$ at which $x^n$ is defined):

$$\frac{d}{dx}[x^n] = nx^{n-1}$$

**Examples:**

- $(x^5)' = 5x^4$
- $(x^{-2})' = -2x^{-3} = -\frac{2}{x^3}$
- $(\sqrt{x})' = (x^{1/2})' = \frac{1}{2}x^{-1/2} = \frac{1}{2\sqrt{x}}$

---

### 2.2 Product Rule

**Theorem:** If $u$ and $v$ are differentiable:

$$(uv)' = u'v + uv'$$

**Example:** $f(x) = x^2 \sin x$

Let $u = x^2$, $v = \sin x$
- $u' = 2x$
- $v' = \cos x$

$$f'(x) = 2x \cdot \sin x + x^2 \cdot \cos x$$

---

### 2.3 Quotient Rule

**Theorem:** If $u$ and $v$ are differentiable and $v \neq 0$:

$$\left(\frac{u}{v}\right)' = \frac{u'v - uv'}{v^2}$$

**Mnemonic:** "Low dee-high minus high dee-low, over low-low"

**Example:** $f(x) = \frac{x^2}{x+1}$

$$f'(x) = \frac{2x(x+1) - x^2 \cdot 1}{(x+1)^2} = \frac{x^2 + 2x}{(x+1)^2}$$

---

### 2.4 Common Derivatives

| Function | Derivative | Rule Used |
|----------|------------|-----------|
| $c$ | $0$ | Constant |
| $x^n$ | $nx^{n-1}$ | Power |
| $cf(x)$ | $cf'(x)$ | Constant multiple |
| $f \pm g$ | $f' \pm g'$ | Sum/Difference |
| $fg$ | $f'g + fg'$ | Product |
| $\frac{f}{g}$ | $\frac{f'g - fg'}{g^2}$ | Quotient |

---

In [None]:
"""
DIFFERENTIATION RULES - SECTION HEADER
"""

print("="*80)
print("SECTION 2: POWER, PRODUCT, QUOTIENT RULES")
print("="*80)

In [None]:
"""
1. POWER RULE DEMONSTRATION
"""

print("\n" + "="*80)
print("1. POWER RULE: d/dx[x^n] = nx^(n-1)")
print("="*80)

# Symbolic differentiation with SymPy
x = sp.Symbol('x')

powers = [2, 3, 5, -1, -2, sp.Rational(1,2), sp.Rational(1,3)]

print("\nPower Rule Examples:")
print(f"  {'Function':>15} | {'Derivative':>20}")
print("  " + "-"*38)

for n in powers:
    f = x**n
    df = sp.diff(f, x)
    f_str = f"x^{n}" if isinstance(n, int) else f"x^({n})"
    df_str = sp.latex(df)
    print(f"  {f_str:>15} | {df_str:>20}")

In [None]:
"""
1. POWER RULE - Numerical Verification
"""

# Numerical verification
print("\n" + "="*60)
print("Numerical Verification at x = 2:")

x_val = 2
h = 1e-7

print(f"\n  {'Function':>12} | {'Symbolic':>12} | {'Numerical':>12} | {'Error':>10}")
print("  " + "-"*52)

test_powers = [2, 3, -1, 0.5]
for n in test_powers:
    f_sym = lambda x: x**n
    df_symbolic = n * (x_val**(n-1))
    df_numerical = (f_sym(x_val + h) - f_sym(x_val)) / h
    error = abs(df_symbolic - df_numerical)
    
    print(f"  x^{n:>9} | {df_symbolic:12.6f} | {df_numerical:12.6f} | {error:10.2e}")

In [None]:
"""
2. PRODUCT RULE DEMONSTRATION
"""

print("\n" + "="*80)
print("2. PRODUCT RULE: (uv)' = u'v + uv'")
print("="*80)

# Example 1: x² sin(x)
print("\nExample 1: f(x) = x² sin(x)")

x = sp.Symbol('x')
u = x**2
v = sp.sin(x)
f = u * v

u_prime = sp.diff(u, x)
v_prime = sp.diff(v, x)
f_prime_formula = u_prime * v + u * v_prime
f_prime_direct = sp.diff(f, x)

print(f"  u = x², u' = {u_prime}")
print(f"  v = sin(x), v' = {v_prime}")
print(f"  Product rule: f' = u'v + uv'")
print(f"             = ({u_prime})(sin(x)) + (x²)({v_prime})")
print(f"             = {sp.simplify(f_prime_formula)}")
print(f"  Match: {sp.simplify(f_prime_formula - f_prime_direct) == 0} ✓")

In [None]:
"""
2. PRODUCT RULE - Numerical Verification
"""

# Numerical verification
x_val = 2.0
h = 1e-7

f_func = lambda x: np.exp(x) * np.log(x)
df_numerical = (f_func(x_val + h) - f_func(x_val)) / h
df_analytical = np.exp(x_val) * (np.log(x_val) + 1/x_val)

print(f"\nExample 2: f(x) = e^x ln(x) at x = {x_val}")
print(f"  Analytical: {df_analytical:.10f}")
print(f"  Numerical:  {df_numerical:.10f}")
print(f"  Error:      {abs(df_analytical - df_numerical):.2e}")

In [None]:
"""
3. QUOTIENT RULE DEMONSTRATION
"""

print("\n" + "="*80)
print("3. QUOTIENT RULE: (u/v)' = (u'v - uv')/v²")
print("="*80)

# Example: x²/(x+1)
print("\nExample: f(x) = x²/(x+1)")

x = sp.Symbol('x')
u = x**2
v = x + 1
f = u / v

u_prime = sp.diff(u, x)
v_prime = sp.diff(v, x)
f_prime_formula = (u_prime * v - u * v_prime) / v**2
f_prime_direct = sp.diff(f, x)

print(f"  u = x², u' = {u_prime}")
print(f"  v = x+1, v' = {v_prime}")
print(f"  Quotient rule: f' = (u'v - uv')/v²")
print(f"                   = (2x(x+1) - x²·1)/(x+1)²")
print(f"                   = (x² + 2x)/(x+1)²")
print(f"  Simplified: {sp.simplify(f_prime_direct)}")

In [None]:
"""
3. QUOTIENT RULE - Numerical Verification
"""

# Numerical verification
x_val = 1.5
h = 1e-7

f_func = lambda x: np.sin(x) / x
df_numerical = (f_func(x_val + h) - f_func(x_val)) / h
df_analytical = (x_val * np.cos(x_val) - np.sin(x_val)) / x_val**2

print(f"\nExample: f(x) = sin(x)/x at x = {x_val}")
print(f"  Analytical: {df_analytical:.10f}")
print(f"  Numerical:  {df_numerical:.10f}")
print(f"  Error:      {abs(df_analytical - df_numerical):.2e}")

In [None]:
"""
4. COMPREHENSIVE VISUALIZATIONS (6 plots)
"""

print("\n" + "="*80)
print("4. COMPREHENSIVE VISUALIZATIONS (6 plots)")
print("="*80)

fig = plt.figure(figsize=(18, 8))
gs = fig.add_gridspec(2, 3, hspace=0.35, wspace=0.35)

# Plot 1: Power rule
ax = fig.add_subplot(gs[0, 0])
x_vals = np.linspace(0.1, 3, 500)

powers_to_plot = [1, 2, 3, 0.5]
colors = ['red', 'blue', 'green', 'purple']

for n, color in zip(powers_to_plot, colors):
    dy = n * x_vals**(n-1)
    ax.plot(x_vals, dy, color=color, linewidth=2, label=f"(x^{n})' = {n}x^{n-1}")

ax.set_xlabel('x', fontsize=11)
ax.set_ylabel("f'(x)", fontsize=11)
ax.set_title('Power Rule', fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# Plot 2: Product rule
ax = fig.add_subplot(gs[0, 1])
x_vals = np.linspace(0, 2*np.pi, 500)

f = x_vals * np.sin(x_vals)
f_prime = np.sin(x_vals) + x_vals * np.cos(x_vals)

ax.plot(x_vals, f, 'b-', linewidth=2.5, label='f(x) = x·sin(x)')
ax.plot(x_vals, f_prime, 'r-', linewidth=2.5, label="f'(x) = sin(x) + x·cos(x)")
ax.axhline(0, color='black', linewidth=0.8)

ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('y', fontsize=11)
ax.set_title('Product Rule', fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# Plot 3: Quotient rule
ax = fig.add_subplot(gs[0, 2])
x_vals = np.linspace(0.1, 5, 500)

f = x_vals**2 / (x_vals + 1)
df = (2*x_vals*(x_vals + 1) - x_vals**2) / (x_vals + 1)**2

ax.plot(x_vals, f, 'b-', linewidth=2.5, label='f(x) = x²/(x+1)')
ax.plot(x_vals, df, 'r-', linewidth=2.5, label="f'(x)")
ax.axhline(0, color='black', linewidth=0.8)

ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('y', fontsize=11)
ax.set_title('Quotient Rule', fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# Plot 4: Polynomial derivative
ax = fig.add_subplot(gs[1, 0])
x_vals = np.linspace(-3, 3, 500)

f = lambda x: x**4 - 4*x**3 + 6*x**2
df = lambda x: 4*x**3 - 12*x**2 + 12*x

ax.plot(x_vals, f(x_vals), 'b-', linewidth=2.5, label='f(x) = x⁴ - 4x³ + 6x²')
ax.plot(x_vals, df(x_vals), 'r-', linewidth=2.5, label="f'(x) = 4x³ - 12x² + 12x")
ax.axhline(0, color='black', linewidth=0.8)

ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('y', fontsize=11)
ax.set_title('Polynomial Differentiation', fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# Plot 5: sin(x)/x
ax = fig.add_subplot(gs[1, 1])
x_vals = np.linspace(-10, 10, 500)
x_vals = x_vals[x_vals != 0]

f = np.sin(x_vals) / x_vals
df = (x_vals * np.cos(x_vals) - np.sin(x_vals)) / x_vals**2

ax.plot(x_vals, f, 'b-', linewidth=2.5, label='f(x) = sin(x)/x')
ax.plot(x_vals, df, 'r-', linewidth=2.5, label="f'(x)")
ax.axhline(0, color='black', linewidth=0.8)

ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('y', fontsize=11)
ax.set_title('Quotient Rule: sinc function', fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)
ax.set_ylim([-1, 1.5])

# Plot 6: Tangent lines comparison
ax = fig.add_subplot(gs[1, 2])

functions = [
    (lambda x: x**2, lambda x: 2*x, 'x²', 'blue'),
    (lambda x: x**3, lambda x: 3*x**2, 'x³', 'red'),
]

x_point = 1
x_range = np.linspace(-0.5, 2.5, 500)

for f, df, label, color in functions:
    ax.plot(x_range, f(x_range), color=color, linewidth=2, label=f'f={label}', alpha=0.7)
    
    slope = df(x_point)
    y_tan = f(x_point) + slope * (x_range - x_point)
    ax.plot(x_range, y_tan, '--', color=color, linewidth=1.5, alpha=0.5,
            label=f"m={slope:.1f}")
    
    ax.plot(x_point, f(x_point), 'o', color=color, markersize=8)

ax.axvline(x_point, color='black', linestyle=':', alpha=0.5)
ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('y', fontsize=11)
ax.set_title(f'Tangent Lines at x={x_point}', fontsize=11, fontweight='bold')
ax.legend(fontsize=8)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✓ All 6 visualizations complete")

print("\n" + "="*80)
print("SECTION 2 COMPLETE: Differentiation Rules")
print("="*80)

## 🔗 3. Chain Rule

### Introduction

The **chain rule** allows us to differentiate **composite functions** - functions within functions.

---

### 3.1 Chain Rule Formula

**Theorem:** If $y = f(g(x))$, then:

$$\frac{dy}{dx} = f'(g(x)) \cdot g'(x)$$

**Leibniz notation:**

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

where $u = g(x)$ and $y = f(u)$.

---

### 3.2 Examples

**Example 1:** $f(x) = (x^2 + 1)^{10}$

Let $u = x^2 + 1$, then $y = u^{10}$

$$\frac{dy}{du} = 10u^9, \quad \frac{du}{dx} = 2x$$

$$f'(x) = 10(x^2 + 1)^9 \cdot 2x = 20x(x^2 + 1)^9$$

---

### 3.3 Applications in ML

**Backpropagation** is essentially repeated application of the chain rule!

For a neural network written as a composition $y = f_3(f_2(f_1(x)))$:

$$\frac{dy}{dx} = \frac{dy}{df_2} \cdot \frac{df_2}{df_1} \cdot \frac{df_1}{dx}$$
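
A three-function composition makes the product of local derivatives concrete. The sketch below is illustrative only: the three "layers" (an affine map, `tanh`, and a square) and every name in it are assumptions introduced here, not a network taken from the course material.

```python
import numpy as np

# Hypothetical layers and their derivatives, chosen only for illustration.
layer1, d_layer1 = (lambda x: 3 * x + 1), (lambda x: 3.0)
layer2, d_layer2 = np.tanh, (lambda u: 1 - np.tanh(u) ** 2)
layer3, d_layer3 = (lambda v: v ** 2), (lambda v: 2 * v)

def forward_backward(x):
    """Forward pass through the three layers, then chain-rule derivative dy/dx."""
    u = layer1(x)  # f1(x)
    v = layer2(u)  # f2(f1(x))
    y = layer3(v)  # f3(f2(f1(x)))
    # Multiply local derivatives from the output back to the input.
    dy_dx = d_layer3(v) * d_layer2(u) * d_layer1(x)
    return y, dy_dx

x0 = 0.2
y0, chain_grad = forward_backward(x0)

# Independent check with a central difference on the composed function.
step = 1e-6
composed = lambda x: layer3(layer2(layer1(x)))
numeric_grad = (composed(x0 + step) - composed(x0 - step)) / (2 * step)

print(f"chain rule: {chain_grad:.8f}")
print(f"numerical:  {numeric_grad:.8f}")
```

Each local derivative is evaluated at the value that layer actually received during the forward pass, which is why backpropagation stores the intermediate values `u` and `v`.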

---

In [None]:
"""
1. CHAIN RULE EXAMPLES
"""

print("\n" + "="*80)
print("1. CHAIN RULE EXAMPLES")
print("="*80)

# Symbolic differentiation
x = sp.Symbol('x')

# Example 1: (x² + 1)^10
print("\nExample 1: f(x) = (x² + 1)^10")
f1 = (x**2 + 1)**10
df1 = sp.diff(f1, x)
print(f"  Inner function: u = x² + 1")
print(f"  Outer function: y = u^10")
print(f"  du/dx = 2x")
print(f"  dy/du = 10u^9")
print(f"  f'(x) = 10(x² + 1)^9 · 2x = 20x(x² + 1)^9")
print(f"  SymPy result: {sp.simplify(df1)}")

# Example 2: sin(3x²)
print("\nExample 2: f(x) = sin(3x²)")
f2 = sp.sin(3*x**2)
df2 = sp.diff(f2, x)
print(f"  Inner function: u = 3x²")
print(f"  Outer function: y = sin(u)")
print(f"  du/dx = 6x")
print(f"  dy/du = cos(u)")
print(f"  f'(x) = cos(3x²) · 6x = 6x·cos(3x²)")
print(f"  SymPy result: {df2}")

# Example 3: e^(x³)
print("\nExample 3: f(x) = e^(x³)")
f3 = sp.exp(x**3)
df3 = sp.diff(f3, x)
print(f"  Inner function: u = x³")
print(f"  Outer function: y = e^u")
print(f"  du/dx = 3x²")
print(f"  dy/du = e^u")
print(f"  f'(x) = e^(x³) · 3x² = 3x²e^(x³)")
print(f"  SymPy result: {df3}")

In [None]:
"""
1. CHAIN RULE - Numerical Verification
"""

# Numerical verification
print("\n" + "="*60)
print("Numerical Verification at x = 2:")

x_val = 2.0
h = 1e-7

funcs = [
    (lambda x: (x**2 + 1)**10, lambda x: 20*x*(x**2 + 1)**9, "(x²+1)^10"),
    (lambda x: np.sin(3*x**2), lambda x: 6*x*np.cos(3*x**2), "sin(3x²)"),
    (lambda x: np.exp(x**3), lambda x: 3*x**2*np.exp(x**3), "e^(x³)")
]

print(f"\n  {'Function':>12} | {'Analytical':>12} | {'Numerical':>12} | {'Error':>10}")
print("  " + "-"*52)

for f, df_analytical, name in funcs:
    df_numerical = (f(x_val + h) - f(x_val)) / h
    df_exact = df_analytical(x_val)
    error = abs(df_numerical - df_exact)
    print(f"  {name:>12} | {df_exact:12.4f} | {df_numerical:12.4f} | {error:10.2e}")

In [None]:
"""
2. BACKPROPAGATION SIMULATION
"""

print("\n" + "="*80)
print("2. BACKPROPAGATION (CHAIN RULE IN NEURAL NETWORKS)")
print("="*80)

print("\nSingle-neuron network: Loss = (ReLU(w·x + b))²")
print("  Layer 1: z₁ = w·x + b")
print("  Activation: a₁ = max(0, z₁)")
print("  Loss: L = a₁²")

# Forward pass
x_input = 2.0
w = 1.5
b = 0.5

z1 = w * x_input + b
a1 = max(0, z1)
loss = a1**2

print(f"\nForward pass:")
print(f"  Input: x = {x_input}")
print(f"  z₁ = {w}·{x_input} + {b} = {z1}")
print(f"  a₁ = ReLU({z1}) = {a1}")
print(f"  Loss = {a1}² = {loss}")

# Backward pass (chain rule)
print(f"\nBackward pass (computing dL/dw using chain rule):")

dL_da1 = 2 * a1
print(f"  ∂L/∂a₁ = 2a₁ = {dL_da1:.4f}")

da1_dz1 = 1 if z1 > 0 else 0
print(f"  ∂a₁/∂z₁ = {da1_dz1} (ReLU derivative: 1 if z>0, else 0)")

dz1_dw = x_input
print(f"  ∂z₁/∂w = x = {dz1_dw}")

dL_dw = dL_da1 * da1_dz1 * dz1_dw
print(f"\n  Chain rule: ∂L/∂w = (∂L/∂a₁) · (∂a₁/∂z₁) · (∂z₁/∂w)")
print(f"            = {dL_da1:.4f} · {da1_dz1} · {dz1_dw}")
print(f"            = {dL_dw:.4f}")
print("  ✓ Gradient computed! Can now update: w_new = w - α·∂L/∂w")

# Gradient descent update
alpha = 0.1
w_new = w - alpha * dL_dw
print(f"\n  With learning rate α = {alpha}:")
print(f"  w_new = {w} - {alpha}·{dL_dw:.4f} = {w_new:.4f}")

In [None]:
"""
3. NESTED CHAIN RULE
"""

print("\n" + "="*80)
print("3. NESTED CHAIN RULE (MULTIPLE COMPOSITIONS)")
print("="*80)

# Example: sin(e^(x²))
print("\nExample: f(x) = sin(e^(x²))")
print("  Let u = x², v = e^u, y = sin(v)")
print("  du/dx = 2x")
print("  dv/du = e^u = e^(x²)")
print("  dy/dv = cos(v) = cos(e^(x²))")
print("\n  Chain rule: dy/dx = (dy/dv)·(dv/du)·(du/dx)")
print("            = cos(e^(x²))·e^(x²)·2x")
print("            = 2x·e^(x²)·cos(e^(x²))")

# Verify with SymPy
x = sp.Symbol('x')
f_nested = sp.sin(sp.exp(x**2))
df_nested = sp.diff(f_nested, x)
print(f"\n  SymPy verification: {df_nested}")

In [None]:
"""
4. COMPREHENSIVE VISUALIZATIONS (6 plots)
"""

print("\n" + "="*80)
print("4. COMPREHENSIVE VISUALIZATIONS (6 plots)")
print("="*80)

fig = plt.figure(figsize=(18, 12))
gs = fig.add_gridspec(3, 2, hspace=0.35, wspace=0.35)

# Plot 1: (x²+1)³ and derivative
print("\n  Creating Plot 1: Polynomial composite...")
ax = fig.add_subplot(gs[0, 0])
x_vals = np.linspace(-2, 2, 500)

f = (x_vals**2 + 1)**3
df = 6*x_vals*(x_vals**2 + 1)**2

ax.plot(x_vals, f, 'b-', linewidth=2.5, label='f(x) = (x²+1)³')
ax.plot(x_vals, df, 'r-', linewidth=2.5, label="f'(x) = 6x(x²+1)²")
ax.axhline(0, color='black', linewidth=0.8)

ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('y', fontsize=11)
ax.set_title('Chain Rule: (x²+1)³', fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# Plot 2: sin(3x) and derivative
print("  Creating Plot 2: Trigonometric composite...")
ax = fig.add_subplot(gs[0, 1])
x_vals = np.linspace(0, 2*np.pi, 500)

f = np.sin(3*x_vals)
df = 3*np.cos(3*x_vals)

ax.plot(x_vals, f, 'b-', linewidth=2.5, label='f(x) = sin(3x)')
ax.plot(x_vals, df, 'r-', linewidth=2.5, label="f'(x) = 3cos(3x)")
ax.axhline(0, color='black', linewidth=0.8)

ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('y', fontsize=11)
ax.set_title('Chain Rule: sin(3x)', fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# Plot 3: e^(x²) and derivative
print("  Creating Plot 3: Exponential composite...")
ax = fig.add_subplot(gs[1, 0])
x_vals = np.linspace(-2, 2, 500)

f = np.exp(x_vals**2)
df = 2*x_vals*np.exp(x_vals**2)

ax.plot(x_vals, f, 'b-', linewidth=2.5, label='f(x) = e^(x²)')
ax.plot(x_vals, df, 'r-', linewidth=2.5, label="f'(x) = 2x·e^(x²)")

ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('y', fontsize=11)
ax.set_title('Chain Rule: e^(x²)', fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)
ax.set_ylim([0, 20])

# Plot 4: √(x²+1) and derivative
print("  Creating Plot 4: Square root composite...")
ax = fig.add_subplot(gs[1, 1])
x_vals = np.linspace(-3, 3, 500)

f = np.sqrt(x_vals**2 + 1)
df = x_vals / np.sqrt(x_vals**2 + 1)

ax.plot(x_vals, f, 'b-', linewidth=2.5, label='f(x) = √(x²+1)')
ax.plot(x_vals, df, 'r-', linewidth=2.5, label="f'(x) = x/√(x²+1)")
ax.axhline(0, color='black', linewidth=0.8)

ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('y', fontsize=11)
ax.set_title('Chain Rule: √(x²+1)', fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# Plot 5: Backpropagation flow diagram
print("  Creating Plot 5: Backpropagation diagram...")
ax = fig.add_subplot(gs[2, 0])
ax.text(0.5, 0.9, 'Neural Network: Backpropagation', 
        ha='center', fontsize=13, fontweight='bold')

ax.text(0.5, 0.75, 'Forward Pass ➜',
        ha='center', fontsize=11, color='blue', fontweight='bold')
ax.text(0.5, 0.65, 'x → z₁=wx+b → a₁=ReLU(z₁) → L=a₁²',
        ha='center', fontsize=10, family='monospace')

ax.text(0.5, 0.45, '⬅ Backward Pass (Gradients)',
        ha='center', fontsize=11, color='red', fontweight='bold')
# Gradients flow in reverse order of the forward pass: L, a₁, z₁, then w
ax.text(0.5, 0.35, '∂L/∂w ← ∂L/∂z₁ ← ∂L/∂a₁ ← ∂L/∂L',
        ha='center', fontsize=10, family='monospace', color='red')

ax.text(0.5, 0.15, 'Chain Rule: ∂L/∂w = (∂L/∂a₁)·(∂a₁/∂z₁)·(∂z₁/∂w)',
        ha='center', fontsize=10, bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.3))

ax.text(0.5, 0.02, '✓ Used to train neural networks!',
        ha='center', fontsize=10, style='italic', color='green')

ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
ax.axis('off')

# Plot 6: Comparison of chain rule applications
print("  Creating Plot 6: Multiple composites comparison...")
ax = fig.add_subplot(gs[2, 1])
x_vals = np.linspace(0.1, 3, 500)

# Each entry: (function, derivative, legend label for the derivative, color)
functions = [
    (lambda x: (x**2)**3, lambda x: 6*x**5, "d/dx (x²)³ = 6x⁵", 'blue'),
    (lambda x: x**(2**3), lambda x: 8*x**7, "d/dx x^(2³) = 8x⁷", 'red'),
]

for f, df, label, color in functions:
    ax.plot(x_vals, df(x_vals), color=color, linewidth=2.5, label=label, alpha=0.8)

ax.set_xlabel('x', fontsize=11)
ax.set_ylabel("f'(x)", fontsize=11)
ax.set_title('Chain Rule vs Power Rule\nNote: (x²)³ ≠ x^(2³)', fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)
ax.set_ylim([0, 100])

plt.tight_layout()
plt.show()

print("\n✓ All 6 visualizations complete")

print("\n" + "="*80)
print("SECTION 3 COMPLETE: Chain Rule")
print("="*80)

In [None]:
"""
CHAIN RULE - SECTION HEADER
"""

print("="*80)
print("SECTION 3: CHAIN RULE")
print("="*80)

## 🎯 4. Critical Points and Extrema

### Introduction

**Critical points** are points in the domain of $f$ where the derivative equals zero or is undefined. They are the candidates for local **maximum** and **minimum** values of the function.

---

### 4.1 Definitions

**Critical Point:** A point $x = c$ where:
- $f'(c) = 0$, OR
- $f'(c)$ is undefined

**Local Maximum:** $f(c) \geq f(x)$ for all $x$ near $c$

**Local Minimum:** $f(c) \leq f(x)$ for all $x$ near $c$

**Global Maximum:** $f(c) \geq f(x)$ for all $x$ in domain

**Global Minimum:** $f(c) \leq f(x)$ for all $x$ in domain

---

### 4.2 First Derivative Test

To classify critical points using $f'(x)$:

| $f'$ changes from | Type of critical point |
|-------------------|------------------------|
| + to - | Local maximum |
| - to + | Local minimum |
| No change | Neither (e.g., a stationary inflection point when $f'(c) = 0$) |

---

### 4.3 Second Derivative Test

At a critical point $x = c$ where $f'(c) = 0$:

- If $f''(c) > 0$: **Local minimum** (concave up ⌣)
- If $f''(c) < 0$: **Local maximum** (concave down ⌢)
- If $f''(c) = 0$: Test is **inconclusive** (use first derivative test)
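
The two tests can be combined in a short SymPy sketch. The helper name `classify_critical_point` and the probe offset `eps` are assumptions made for illustration; the helper applies the second derivative test first and falls back to the sign of $f'$ on either side of $c$ when $f''(c) = 0$. It assumes no other critical point lies within `eps` of $c$.

```python
import sympy as sp

x = sp.Symbol('x')

def classify_critical_point(f, c, eps=1e-3):
    """Classify a critical point c of f (hypothetical helper, illustration only)."""
    df = sp.diff(f, x)
    ddf = sp.diff(df, x)
    curvature = ddf.subs(x, c)
    # Second derivative test
    if curvature > 0:
        return "local minimum"
    if curvature < 0:
        return "local maximum"
    # Inconclusive: fall back to the sign of f' just left and right of c
    left = df.subs(x, c - eps)
    right = df.subs(x, c + eps)
    if left > 0 and right < 0:
        return "local maximum"
    if left < 0 and right > 0:
        return "local minimum"
    return "neither"

print(classify_critical_point(x**2, 0))   # second derivative test applies
print(classify_critical_point(x**4, 0))   # f''(0) = 0, f' changes from - to +
print(classify_critical_point(x**3, 0))   # f''(0) = 0, f' does not change sign
```

Both $x^4$ and $x^3$ have $f''(0) = 0$, yet only $x^4$ has an extremum at $0$, which is why the fallback is needed.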

---

### 4.4 Example

Find extrema of $f(x) = x^3 - 3x^2 - 9x + 5$

**Step 1:** Find $f'(x)$

$$f'(x) = 3x^2 - 6x - 9$$

**Step 2:** Set $f'(x) = 0$

$$3x^2 - 6x - 9 = 0$$
$$x^2 - 2x - 3 = 0$$
$$(x - 3)(x + 1) = 0$$
$$x = 3 \text{ or } x = -1$$

**Step 3:** Second derivative test

$$f''(x) = 6x - 6$$

- At $x = -1$: $f''(-1) = -12 < 0$ → **Local maximum**
- At $x = 3$: $f''(3) = 12 > 0$ → **Local minimum**
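
As a quick numerical cross-check of the three steps, the sketch below uses NumPy's `poly1d` (one possible tool; the notebook itself uses SymPy) to differentiate the polynomial, find the roots of $f'$, and evaluate $f$ and $f''$ there. The extreme values work out to $f(-1) = 10$ and $f(3) = -22$.

```python
import numpy as np

# f(x) = x^3 - 3x^2 - 9x + 5, coefficients from highest degree down
f = np.poly1d([1, -3, -9, 5])
df = f.deriv()      # 3x^2 - 6x - 9
ddf = f.deriv(2)    # 6x - 6

# Roots of f' are the critical points; the sign of f'' classifies them
for c in sorted(df.roots.real):
    kind = "local minimum" if ddf(c) > 0 else "local maximum"
    print(f"x = {c:.0f}: f(x) = {f(c):.0f}, f''(x) = {ddf(c):.0f} -> {kind}")
```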

---

### 4.5 Applications

**1. Profit Maximization**

Find production level where $\frac{d(\text{Profit})}{dq} = 0$

**2. Loss Minimization (ML)**

Find parameters where $\frac{\partial L}{\partial \theta} = 0$

**3. Data Analysis**

Find peaks and troughs in time series data
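
To make the profit case concrete, here is a minimal sketch with a hypothetical profit function $P(q) = -2q^2 + 40q - 50$. The coefficients are invented for illustration and do not come from the course material.

```python
import sympy as sp

q = sp.Symbol('q', positive=True)

# Hypothetical profit function (illustrative numbers only)
profit = -2*q**2 + 40*q - 50

marginal_profit = sp.diff(profit, q)        # dP/dq
q_star = sp.solve(marginal_profit, q)[0]    # production level where dP/dq = 0
curvature = sp.diff(profit, q, 2)           # negative value confirms a maximum

print(f"dP/dq = {marginal_profit}")
print(f"q* = {q_star}, P(q*) = {profit.subs(q, q_star)}, P''(q*) = {curvature}")
```

Because $P''(q) = -4 < 0$ everywhere, the critical point $q^* = 10$ is a maximum, with $P(10) = 150$.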

---

In [None]:
"""
CRITICAL POINTS AND EXTREMA - SECTION HEADER
"""

print("="*80)
print("SECTION 4: CRITICAL POINTS AND EXTREMA")
print("="*80)

In [None]:
"""
1. FINDING CRITICAL POINTS - Example 1
"""

print("\n" + "="*80)
print("1. FINDING CRITICAL POINTS")
print("="*80)

# Example 1: f(x) = x³ - 3x² - 9x + 5
print("\nExample 1: f(x) = x³ - 3x² - 9x + 5")

x = sp.Symbol('x')
f = x**3 - 3*x**2 - 9*x + 5
df = sp.diff(f, x)
ddf = sp.diff(df, x)

print(f"  f'(x) = {df}")
print(f"  f''(x) = {ddf}")

# Solve f'(x) = 0
critical_points = sp.solve(df, x)
print(f"\n  Critical points (f'=0): {critical_points}")

# Classify using second derivative test
print("\n  Classification using second derivative test:")
for cp in critical_points:
    cp_val = float(cp)
    f_val = float(f.subs(x, cp))
    ddf_val = float(ddf.subs(x, cp))
    
    if ddf_val > 0:
        classification = "Local MINIMUM"
    elif ddf_val < 0:
        classification = "Local MAXIMUM"
    else:
        classification = "Inconclusive"
    
    print(f"    x = {cp_val}: f({cp_val:.0f}) = {f_val:.2f}, f''({cp_val:.0f}) = {ddf_val:.0f} → {classification}")

In [None]:
"""
2. FIRST DERIVATIVE TEST
"""

print("\n" + "="*80)
print("2. FIRST DERIVATIVE TEST")
print("="*80)

print("\nFor f(x) = x³ - 3x² - 9x + 5 with critical points at x = -1, 3")

# Test intervals around critical points
test_points = {
    'x < -1': -2,
    '-1 < x < 3': 0,
    'x > 3': 4
}

f_func = lambda x: x**3 - 3*x**2 - 9*x + 5
df_func = lambda x: 3*x**2 - 6*x - 9

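# Keep the label outside the f-string: a backslash inside an f-string expression
# is a SyntaxError before Python 3.12
deriv_label = "f'(x)"
print(f"\n  {'Interval':>12} | {'Test point':>11} | {deriv_label:>8} | {'Sign':>6}")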
print("  " + "-"*45)

for interval, test_pt in test_points.items():
    df_val = df_func(test_pt)
    sign = "+" if df_val > 0 else "-"
    print(f"  {interval:>12} | {test_pt:>11} | {df_val:>8.1f} | {sign:>6}")

print("\n  Analysis:")
print("    At x = -1: f' changes from + to - → LOCAL MAXIMUM")
print("    At x = 3: f' changes from - to + → LOCAL MINIMUM")

In [None]:
"""
4. USING SCIPY FOR OPTIMIZATION
"""

print("\n" + "="*80)
print("4. USING SCIPY FOR FINDING EXTREMA")
print("="*80)

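# Find the local minimum by bracketing the critical point x = 3
def f_to_minimize(x):
    return x**3 - 3*x**2 - 9*x + 5

# A cubic is unbounded, so on a wide interval such as (-10, 10) its smallest and
# largest values sit at the endpoints. Each search interval therefore brackets
# exactly one critical point.
result_min = optimize.minimize_scalar(f_to_minimize, bounds=(0, 6), method='bounded')
print(f"\nMinimizing f(x) = x³ - 3x² - 9x + 5 on [0, 6]:")
print(f"  Local minimum at x = {result_min.x:.4f}")
print(f"  Minimum value = {result_min.fun:.4f}")

# Find the local maximum (minimize -f) by bracketing x = -1
result_max = optimize.minimize_scalar(lambda x: -f_to_minimize(x), bounds=(-4, 2), method='bounded')
print(f"\nMaximizing f(x) on [-4, 2] (minimize -f(x)):")
print(f"  Local maximum at x = {result_max.x:.4f}")
print(f"  Maximum value = {-result_max.fun:.4f}")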

In [None]:
"""
5. COMPREHENSIVE VISUALIZATIONS (6 plots) - Section 4 Complete
"""

print("\n" + "="*80)
print("5. COMPREHENSIVE VISUALIZATIONS (6 plots)")
print("="*80)

# Import signal for peak finding
from scipy.signal import argrelextrema

fig = plt.figure(figsize=(18, 12))
gs = fig.add_gridspec(3, 2, hspace=0.35, wspace=0.35)

# Plot 1: Critical points and classification
print("\n  Creating Plot 1: Critical points...")
ax = fig.add_subplot(gs[0, 0])
x_vals = np.linspace(-2, 4, 500)
f_vals = x_vals**3 - 3*x_vals**2 - 9*x_vals + 5
df_vals = 3*x_vals**2 - 6*x_vals - 9

ax.plot(x_vals, f_vals, 'b-', linewidth=2.5, label='f(x)')
ax.plot(x_vals, df_vals, 'r--', linewidth=2, label="f'(x)", alpha=0.7)
ax.axhline(0, color='black', linewidth=0.8)

# Mark critical points
critical_x = [-1, 3]
for cx in critical_x:
    cy = cx**3 - 3*cx**2 - 9*cx + 5
    ax.plot(cx, cy, 'go', markersize=12, zorder=5)
    ax.annotate(f'({cx}, {cy:.1f})', xy=(cx, cy), xytext=(cx+0.5, cy+3),
                fontsize=9, arrowprops=dict(arrowstyle='->', lw=1))

ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('y', fontsize=11)
ax.set_title('Critical Points\nGreen dots: f\'=0', fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# Plot 2: First derivative test
print("  Creating Plot 2: First derivative test...")
ax = fig.add_subplot(gs[0, 1])
x_vals = np.linspace(-3, 3, 500)
f_vals = x_vals**3 - 3*x_vals
df_vals = 3*x_vals**2 - 3

ax.plot(x_vals, df_vals, 'r-', linewidth=2.5, label="f'(x) = 3x² - 3")
ax.axhline(0, color='black', linewidth=0.8)
ax.axvline(-1, color='green', linestyle=':', alpha=0.5, label='Critical points')
ax.axvline(1, color='green', linestyle=':', alpha=0.5)

# Shade regions
ax.fill_between(x_vals, 0, df_vals, where=(df_vals > 0), alpha=0.2, color='green', label='f\' > 0 (increasing)')
ax.fill_between(x_vals, 0, df_vals, where=(df_vals < 0), alpha=0.2, color='red', label='f\' < 0 (decreasing)')

ax.plot([-1, 1], [0, 0], 'go', markersize=12, zorder=5)

ax.set_xlabel('x', fontsize=11)
ax.set_ylabel("f'(x)", fontsize=11)
ax.set_title('First Derivative Test\nSign changes indicate extrema', fontsize=11, fontweight='bold')
ax.legend(fontsize=8)
ax.grid(True, alpha=0.3)

# Plot 3: Second derivative and concavity
print("  Creating Plot 3: Second derivative test...")
ax = fig.add_subplot(gs[1, 0])
x_vals = np.linspace(-2, 4, 500)
f_vals = x_vals**3 - 3*x_vals**2 - 9*x_vals + 5
ddf_vals = 6*x_vals - 6

ax.plot(x_vals, f_vals, 'b-', linewidth=2.5, label='f(x)')
ax.axhline(0, color='black', linewidth=0.8)

# Mark critical points with second derivative info
ax.plot(-1, f_vals[np.argmin(np.abs(x_vals + 1))], 'ro', markersize=15, 
        label='Max (f"<0)', zorder=5)
ax.plot(3, f_vals[np.argmin(np.abs(x_vals - 3))], 'go', markersize=15,
        label='Min (f">0)', zorder=5)

# Shade concavity
ax.fill_between(x_vals, -30, 30, where=(ddf_vals > 0), alpha=0.1, color='yellow',
                label='Concave up (f">0)')
ax.fill_between(x_vals, -30, 30, where=(ddf_vals < 0), alpha=0.1, color='cyan',
                label='Concave down (f"<0)')

ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('f(x)', fontsize=11)
ax.set_title('Second Derivative Test\nConcavity determines extrema type', fontsize=11, fontweight='bold')
ax.legend(fontsize=8)
ax.grid(True, alpha=0.3)
ax.set_ylim([-30, 30])

# Plot 4: Global vs local extrema
print("  Creating Plot 4: Global vs local extrema...")
ax = fig.add_subplot(gs[1, 1])
x_vals = np.linspace(-3, 3, 500)
f_vals = np.sin(x_vals) + 0.1*x_vals**2

ax.plot(x_vals, f_vals, 'b-', linewidth=2.5, label='f(x)')

# Find local extrema (simplified)
local_max_idx = argrelextrema(f_vals, np.greater)[0]
local_min_idx = argrelextrema(f_vals, np.less)[0]

for idx in local_max_idx:
    ax.plot(x_vals[idx], f_vals[idx], 'yo', markersize=10, label='Local max' if idx == local_max_idx[0] else '')
for idx in local_min_idx:
    ax.plot(x_vals[idx], f_vals[idx], 'co', markersize=10, label='Local min' if idx == local_min_idx[0] else '')

# Global extrema
global_max_idx = np.argmax(f_vals)
global_min_idx = np.argmin(f_vals)
ax.plot(x_vals[global_max_idx], f_vals[global_max_idx], 'r*', markersize=20, label='Global max', zorder=5)
ax.plot(x_vals[global_min_idx], f_vals[global_min_idx], 'g*', markersize=20, label='Global min', zorder=5)

ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('f(x)', fontsize=11)
ax.set_title('Global vs Local Extrema', fontsize=11, fontweight='bold')
ax.legend(fontsize=8)
ax.grid(True, alpha=0.3)

# Plot 5: Optimization landscape (2D contour)
print("  Creating Plot 5: 2D optimization landscape...")
ax = fig.add_subplot(gs[2, 0])

x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = (1 - X)**2 + 100*(Y - X**2)**2  # Rosenbrock function

contour = ax.contour(X, Y, np.log(Z + 1), levels=20, cmap='viridis')
ax.clabel(contour, inline=True, fontsize=8)
ax.plot(1, 1, 'r*', markersize=20, label='Global minimum', zorder=5)

ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('y', fontsize=11)
ax.set_title('2D Optimization Landscape\nRosenbrock function (log scale)', fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# Plot 6: Critical points summary diagram
print("  Creating Plot 6: Critical points summary...")
ax = fig.add_subplot(gs[2, 1])

ax.text(0.5, 0.9, 'Finding & Classifying Critical Points', 
        ha='center', fontsize=13, fontweight='bold')

steps = [
    "Step 1: Find f'(x)",
    "Step 2: Solve f'(x) = 0 → critical points",
    "Step 3: Classify using tests:",
    "  • First Derivative Test:",
    "    - f' changes + to - → Local MAX",
    "    - f' changes - to + → Local MIN",
    "  • Second Derivative Test:",
    "    - f''(c) > 0 → Local MIN (⌣)",
    "    - f''(c) < 0 → Local MAX (⌢)",
    "    - f''(c) = 0 → Inconclusive",
    "Step 4: Check endpoints for global extrema"
]

y_pos = 0.75
for step in steps:
    if step.startswith('Step'):
        ax.text(0.05, y_pos, step, fontsize=10, fontweight='bold', family='monospace')
    else:
        ax.text(0.05, y_pos, step, fontsize=9, family='monospace')
    y_pos -= 0.065

ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
ax.axis('off')

plt.tight_layout()
plt.show()

print("\n✓ All 6 visualizations complete")

print("\n" + "="*80)
print("SECTION 4 COMPLETE: Critical Points and Extrema")
print("="*80)

In [None]:
"""
3. FINDING GLOBAL EXTREMA ON CLOSED INTERVAL
"""

print("\n" + "="*80)
print("3. FINDING GLOBAL EXTREMA ON CLOSED INTERVAL")
print("="*80)

print("\nFind global max/min of f(x) = x³ - 3x on [-2, 3]")

f3_func = lambda x: x**3 - 3*x
df3_func = lambda x: 3*x**2 - 3

# Critical points in interval
print("\n  Step 1: Find critical points in [-2, 3]")
print("    f'(x) = 3x² - 3 = 0")
print("    x² = 1 → x = ±1")
print("    Critical points in interval: x = -1, 1")

# Evaluate at critical points and endpoints
candidates = [-2, -1, 1, 3]
print(f"\n  Step 2: Evaluate f at critical points and endpoints:")
print(f"  {'x':>5} | {'f(x)':>8}")
print("  " + "-"*16)

values = {}
for x_val in candidates:
    f_val = f3_func(x_val)
    values[x_val] = f_val
    print(f"  {x_val:>5} | {f_val:>8.2f}")

global_max = max(values.items(), key=lambda item: item[1])
global_min = min(values.items(), key=lambda item: item[1])

print(f"\n  Global maximum: f({global_max[0]}) = {global_max[1]:.2f}")
print(f"  Global minimum: f({global_min[0]}) = {global_min[1]:.2f}")

In [None]:
"""
1. FINDING CRITICAL POINTS - Example 2
"""

# Example 2: f(x) = x⁴ - 4x³
print("\n" + "="*60)
print("Example 2: f(x) = x⁴ - 4x³")

# Re-declare the symbol: the plotting cells reuse the name x for NumPy arrays
x = sp.Symbol('x')
f2 = x**4 - 4*x**3
df2 = sp.diff(f2, x)
ddf2 = sp.diff(df2, x)

print(f"  f'(x) = {df2}")
print(f"  f''(x) = {ddf2}")

critical_points2 = sp.solve(df2, x)
print(f"\n  Critical points: {critical_points2}")

print("\n  Classification:")
for cp in critical_points2:
    cp_val = float(cp)
    f_val = float(f2.subs(x, cp))
    ddf_val = float(ddf2.subs(x, cp))
    
    if ddf_val > 0:
        classification = "Local MINIMUM"
    elif ddf_val < 0:
        classification = "Local MAXIMUM"
    else:
        classification = "Inconclusive (use 1st derivative test)"
    
    print(f"    x = {cp_val}: f({cp_val:.0f}) = {f_val:.2f}, f''({cp_val:.0f}) = {ddf_val:.0f} → {classification}")

## 5. Optimization Applications

### 5.1 Optimization Framework

**Optimization** is the process of finding the best solution from a set of alternatives, typically by maximizing or minimizing an objective function.

**General Optimization Problem:**
$$\text{Minimize (or Maximize)} \quad f(x)$$
$$\text{Subject to} \quad g_i(x) \leq 0, \quad i = 1, \ldots, m$$
$$\quad \quad \quad \quad h_j(x) = 0, \quad j = 1, \ldots, p$$

Where:
- $f(x)$: **Objective function** (what we want to optimize)
- $g_i(x) \leq 0$: **Inequality constraints**
- $h_j(x) = 0$: **Equality constraints**
- $x$: **Decision variables**

**Unconstrained Optimization:** Only objective function, no constraints.
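
As a sketch of how this framework maps onto code, the example below solves a small made-up problem with `scipy.optimize.minimize`: minimize the squared distance to the point $(1, 2)$ subject to $g(v) = v_0 + v_1 - 2 \leq 0$. The objective and the constraint are assumptions chosen for illustration. Note that SciPy's `'ineq'` constraints use the opposite sign convention (`fun(v) >= 0`), so $-g$ is passed.

```python
import numpy as np
from scipy import optimize

# Hypothetical objective: squared distance from the point (1, 2)
def objective(v):
    return (v[0] - 1)**2 + (v[1] - 2)**2

# Constraint g(v) = v0 + v1 - 2 <= 0.
# SciPy's 'ineq' convention is fun(v) >= 0, so we pass -g(v).
constraints = [{'type': 'ineq', 'fun': lambda v: 2 - v[0] - v[1]}]

unconstrained = optimize.minimize(objective, x0=np.zeros(2))
constrained = optimize.minimize(objective, x0=np.zeros(2),
                                method='SLSQP', constraints=constraints)

print("Unconstrained minimizer:", np.round(unconstrained.x, 4))
print("Constrained minimizer:  ", np.round(constrained.x, 4))
```

Without the constraint the minimizer is $(1, 2)$; with it, the solution moves to the closest feasible point on the line $v_0 + v_1 = 2$, which is $(0.5, 1.5)$.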

---

### 5.2 Gradient Descent

**Gradient Descent** is an iterative optimization algorithm for finding local minima of differentiable functions.

**Algorithm:**
1. Start with initial guess $x_0$
2. Update: $x_{k+1} = x_k - \alpha \nabla f(x_k)$
3. Repeat until convergence

Where:
- $\alpha$: **Learning rate** (step size)
- $\nabla f(x_k)$: **Gradient** (direction of steepest ascent)
- Negative gradient: direction of steepest descent

**Intuition:** Move in the direction opposite to the gradient to decrease function value.

**Learning Rate Selection:**
- Too small: slow convergence
- Too large: oscillation or divergence
- Adaptive methods: adjust $\alpha$ dynamically

**Variants:**
- **Batch Gradient Descent:** Use entire dataset
- **Stochastic Gradient Descent (SGD):** Use one sample at a time
- **Mini-batch GD:** Use small batches

**Example:** Minimize $f(x) = x^2 + 4x + 4 = (x+2)^2$
- $f'(x) = 2x + 4$
- Update: $x_{k+1} = x_k - \alpha (2x_k + 4)$
- Starting from $x_0 = 5$ with $\alpha = 0.1$:
  - $x_1 = 5 - 0.1(10 + 4) = 5 - 1.4 = 3.6$
  - $x_2 = 3.6 - 0.1(7.2 + 4) = 3.6 - 1.12 = 2.48$
  - $\vdots$
  - Converges to $x = -2$ (global minimum)
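
The hand computation can be reproduced in a few lines. For this particular function the update simplifies to $x_{k+1} + 2 = (1 - 2\alpha)(x_k + 2)$, so each step shrinks the distance to the minimizer by the factor $1 - 2\alpha = 0.8$, and the iteration converges for any $0 < \alpha < 1$. The sketch below is a plain-Python check of that identity; variable names are illustrative.

```python
alpha = 0.1
x = 5.0
iterates = [x]
for k in range(50):
    x = x - alpha * (2*x + 4)      # x_{k+1} = x_k - alpha * f'(x_k)
    iterates.append(x)

# Closed form: x_k = -2 + 7 * (1 - 2*alpha)**k, since x_0 + 2 = 7
closed_form = [-2 + 7 * (1 - 2*alpha)**k for k in range(51)]

print("x_1, x_2:", iterates[1], iterates[2])
print("x_50 (loop, closed form):", iterates[-1], closed_form[-1])
```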

---

### 5.3 Newton's Method for Optimization

**Newton's Method** uses second-order information (second derivatives) for faster convergence.

**Update Rule:**
$$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$$

**Matrix Form (multivariate):**
$$\mathbf{x}_{k+1} = \mathbf{x}_k - [\nabla^2 f(\mathbf{x}_k)]^{-1} \nabla f(\mathbf{x}_k)$$

Where $\nabla^2 f$ is the **Hessian matrix** (matrix of second derivatives).

**Advantages:**
- Quadratic convergence (very fast near minimum)
- Typically requires fewer iterations than gradient descent

**Disadvantages:**
- Requires computing and inverting Hessian (expensive for large $n$)
- May not converge if starting point is far from minimum

**Example:** Minimize $f(x) = x^2 - 4x + 4$
- $f'(x) = 2x - 4$
- $f''(x) = 2$
- Update: $x_{k+1} = x_k - \frac{2x_k - 4}{2} = x_k - (x_k - 2) = 2$
- Converges in **one step** from any starting point!
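
A minimal one-dimensional sketch of the update rule follows. The helper `newton_minimize` and the second test function $f(x) = e^x - 2x$ (minimized at $x = \ln 2$) are assumptions added for illustration. The iteration only looks for a point where $f'(x) = 0$, so it can also land on a maximum if $f''$ is negative there.

```python
import math

def newton_minimize(df, d2f, x0, tol=1e-10, max_iters=50):
    """1D Newton iteration on f'(x) = 0 (hypothetical helper for illustration)."""
    x = x0
    for k in range(max_iters):
        step = df(x) / d2f(x)
        x = x - step
        if abs(step) < tol:
            return x, k + 1
    return x, max_iters

# Quadratic from the example: f(x) = x^2 - 4x + 4
x_quad, n_quad = newton_minimize(lambda x: 2*x - 4, lambda x: 2.0, x0=10.0)

# Non-quadratic (assumed example): f(x) = e^x - 2x, minimized at x = ln 2
x_exp, n_exp = newton_minimize(lambda x: math.exp(x) - 2,
                               lambda x: math.exp(x), x0=0.0)

print(f"quadratic:   x = {x_quad}, iterations = {n_quad}")
print(f"exponential: x = {x_exp:.10f}, ln 2 = {math.log(2):.10f}, iterations = {n_exp}")
```

For the quadratic, the first step already lands on $x = 2$; the second step has zero length and only triggers the stopping rule.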

---

### 5.4 Convex vs Non-Convex Optimization

**Convex Function:** A function $f$ is convex if:
$$f(\lambda x_1 + (1-\lambda)x_2) \leq \lambda f(x_1) + (1-\lambda)f(x_2)$$
for all $x_1, x_2$ and $\lambda \in [0, 1]$.

**Graphically:** The line segment between any two points on the graph lies on or above the graph.

**Properties:**
- Any local minimum is a global minimum
- Easier to optimize
- Gradient descent with a suitable learning rate converges to a global minimum (when one exists)

**Examples:**
- Convex: $x^2$, $e^x$, $-\log(x)$, linear functions
- Non-convex: $\sin(x)$, $x^3$, neural network loss functions

**Non-Convex Optimization:**
- The function can have multiple local minima
- Gradient descent may get stuck in a local minimum
- Additional techniques (momentum, random restarts) are often required
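
The effect of the starting point can be seen on the non-convex function $f(x) = \sin(x) + 0.1x^2$, which this notebook also uses in its plots. The sketch below runs plain gradient descent from two starting points; the helper `run_gd`, the step size, and the starting points are illustrative choices.

```python
import numpy as np

f = lambda x: np.sin(x) + 0.1 * x**2     # non-convex
df = lambda x: np.cos(x) + 0.2 * x       # its derivative

def run_gd(x0, alpha=0.1, n_steps=500):
    """Plain gradient descent with a fixed step size (illustrative helper)."""
    x = x0
    for _ in range(n_steps):
        x = x - alpha * df(x)
    return x

for x0 in (-3.0, 5.0):
    x_end = run_gd(x0)
    print(f"start {x0:+.1f} -> x = {x_end:.4f}, f(x) = {f(x_end):.4f}")
```

The two runs settle in different local minima, and the run started at $x_0 = 5$ ends at the higher one. This is the situation that random restarts are meant to address.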

---

### 5.5 Applications in Machine Learning

#### Loss Minimization

**Goal:** Find parameters $\theta$ that minimize loss function $L(\theta)$.

**Linear Regression:**
$$L(\theta) = \frac{1}{2n}\sum_{i=1}^n (y_i - \theta^T x_i)^2$$
- Gradient: $\nabla L = -\frac{1}{n}\sum_{i=1}^n (y_i - \theta^T x_i)x_i$
- Update: $\theta_{k+1} = \theta_k + \alpha \cdot \frac{1}{n}\sum (y_i - \theta_k^T x_i)x_i$

**Logistic Regression:**
$$L(\theta) = -\frac{1}{n}\sum_{i=1}^n [y_i \log(\sigma(\theta^T x_i)) + (1-y_i)\log(1-\sigma(\theta^T x_i))]$$
where $\sigma(z) = \frac{1}{1+e^{-z}}$ is the sigmoid function.
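
For the logistic loss above, the gradient has the compact form $\nabla L = \frac{1}{n} X^T(\sigma(X\theta) - y)$. The sketch below implements the loss and this gradient in NumPy and compares the gradient with a central finite-difference estimate. The synthetic data, the generating coefficients, and the function names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(theta, X, y):
    """Average cross-entropy loss L(theta)."""
    p = sigmoid(X @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def logistic_gradient(theta, X, y):
    """Gradient of L: (1/n) * X^T (sigmoid(X theta) - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# Small synthetic data set (assumed for illustration)
rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.normal(size=50)]
y = (rng.random(50) < sigmoid(X @ np.array([-0.5, 2.0]))).astype(float)

# Central finite differences along each coordinate direction
theta = np.zeros(2)
h = 1e-6
numeric = np.array([
    (logistic_loss(theta + h * e, X, y) - logistic_loss(theta - h * e, X, y)) / (2 * h)
    for e in np.eye(2)
])
print("analytic:", logistic_gradient(theta, X, y))
print("numeric: ", numeric)
```

A gradient check of this kind is a common way to catch sign or indexing mistakes before running gradient descent.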

#### Neural Network Training

**Backpropagation = Chain Rule + Gradient Descent**
1. Forward pass: compute loss
2. Backward pass: compute gradients using chain rule
3. Update weights using gradient descent

**Training = Optimization:**
- Minimize loss over training data
- Navigate high-dimensional parameter space
- Balance convergence speed and stability

#### Hyperparameter Tuning

**Goal:** Find optimal hyperparameters (learning rate, regularization strength, architecture choices).

**Methods:**
- Grid search: exhaustive evaluation
- Random search: sample randomly
- Bayesian optimization: model-based approach

---

### 5.6 Practical Considerations

**Convergence Criteria:**
1. $|x_{k+1} - x_k| < \epsilon$ (small change in $x$)
2. $|f(x_{k+1}) - f(x_k)| < \epsilon$ (small change in $f$)
3. $\|\nabla f(x_k)\| < \epsilon$ (gradient near zero)

**Common Issues:**
- **Oscillation:** Learning rate too large → reduce $\alpha$
- **Slow convergence:** Learning rate too small → increase $\alpha$
- **Local minima:** Non-convex function → try different starting points
- **Saddle points:** Gradient is zero but not a minimum → use momentum

**Improvements:**
- **Momentum:** $v_{k+1} = \beta v_k + \nabla f(x_k)$, $x_{k+1} = x_k - \alpha v_{k+1}$
- **Adaptive learning rates:** Adam, RMSprop, Adagrad
- **Line search:** Optimize $\alpha$ at each step
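
The momentum update and the stopping criteria can be put together as follows. This is a minimal sketch, assuming the one-dimensional objective $f(x) = (x+2)^2$ and the hypothetical helper name `gd_momentum`; it stops only when both the step and the gradient are small, because with momentum the iterate can pass through the minimizer while still moving.

```python
def gd_momentum(df, x0, alpha=0.1, beta=0.9, tol=1e-6, max_iters=1000):
    """Gradient descent with momentum (hypothetical helper for illustration)."""
    x, v = x0, 0.0
    for k in range(max_iters):
        grad = df(x)
        v = beta * v + grad          # v_{k+1} = beta * v_k + grad f(x_k)
        x_new = x - alpha * v        # x_{k+1} = x_k - alpha * v_{k+1}
        # Stop only when both the step and the gradient are small
        if abs(x_new - x) < tol and abs(df(x_new)) < tol:
            return x_new, k + 1
        x = x_new
    return x, max_iters

x_min, n_iters = gd_momentum(lambda x: 2 * (x + 2), x0=5.0)
print(f"x = {x_min:.6f} after {n_iters} iterations")
```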

---

### 5.7 Summary

| Method | Update Rule | Convergence | Computational Cost | Best For |
|--------|-------------|-------------|-------------------|----------|
| Gradient Descent | $x_{k+1} = x_k - \alpha \nabla f(x_k)$ | Linear | Low | Large-scale problems |
| Newton's Method | $x_{k+1} = x_k - [f''(x_k)]^{-1}f'(x_k)$ | Quadratic | High | Small-scale, smooth functions |
| SGD | $x_{k+1} = x_k - \alpha \nabla f(x_k; \text{sample})$ | Noisy | Very Low | Large datasets, online learning |

**Key Takeaway:** Derivatives are the foundation of optimization, which powers modern machine learning!

In [None]:
"""
OPTIMIZATION APPLICATIONS - SECTION HEADER
"""

print("="*80)
print("SECTION 5: OPTIMIZATION APPLICATIONS")
print("="*80)

In [None]:
"""
1. GRADIENT DESCENT IMPLEMENTATION & EXAMPLES - Complete Demonstration
"""

print("\n" + "="*80)
print("1. GRADIENT DESCENT")
print("="*80)

def gradient_descent(f, df, x0, learning_rate=0.1, max_iters=100, tol=1e-6):
    """
    Gradient descent optimization algorithm.
    
    Parameters:
    -----------
    f : callable - objective function
    df : callable - derivative of objective function
    x0 : float - initial point
    learning_rate : float - step size
    max_iters : int - maximum iterations
    tol : float - convergence tolerance
    
    Returns:
    --------
    x_history : list - trajectory of x values
    f_history : list - trajectory of function values
    """
    x = x0
    x_history = [x]
    f_history = [f(x)]
    
    for i in range(max_iters):
        # Compute gradient
        grad = df(x)
        
        # Update rule
        x_new = x - learning_rate * grad
        
        # Check convergence
        if abs(x_new - x) < tol:
            print(f"  Converged in {i+1} iterations")
            break
        
        x = x_new
        x_history.append(x)
        f_history.append(f(x))
    
    return np.array(x_history), np.array(f_history)

# Example 1: Simple quadratic
print("\nExample 1: f(x) = (x + 2)² (minimum at x = -2)")

f1 = lambda x: (x + 2)**2
df1 = lambda x: 2*(x + 2)

x_hist, f_hist = gradient_descent(f1, df1, x0=5.0, learning_rate=0.1)
print(f"  Initial: x₀ = {x_hist[0]:.4f}, f(x₀) = {f_hist[0]:.4f}")
print(f"  Final: x = {x_hist[-1]:.4f}, f(x) = {f_hist[-1]:.4f}")
print(f"  True minimum: x = -2.0, f(-2) = 0.0")

# Example 2: Effect of learning rate
print("\n" + "="*60)
print("Example 2: Learning rate comparison")

learning_rates = [0.01, 0.1, 0.5, 0.9]
results = {}

for lr in learning_rates:
    x_hist_lr, f_hist_lr = gradient_descent(f1, df1, x0=5.0, learning_rate=lr, max_iters=50)
    results[lr] = (x_hist_lr, f_hist_lr)
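    # The history includes the starting point, so the number of updates is len - 1
    print(f"  α = {lr:.2f}: {len(x_hist_lr) - 1} updates, final x = {x_hist_lr[-1]:.4f}")

print("\n  Convergence speed depends on α (α = 0.5 reaches x = -2 in a single update)")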

In [None]:
"""
3. MACHINE LEARNING APPLICATION: Linear Regression with Gradient Descent
"""

print("\n" + "="*80)
print("3. ML APPLICATION: LINEAR REGRESSION WITH GRADIENT DESCENT")
print("="*80)

# Generate synthetic data
np.random.seed(42)
n_samples = 100
X_train = 2 * np.random.rand(n_samples, 1)
y_train = 4 + 3 * X_train + np.random.randn(n_samples, 1)

print(f"\n  Data: {n_samples} samples")
print(f"  True model: y = 4 + 3x + noise")

def compute_mse(X, y, theta):
    """Mean squared error"""
    predictions = X.dot(theta)
    errors = predictions - y
    return (1/(2*len(y))) * np.sum(errors**2)

def compute_mse_gradient(X, y, theta):
    """Gradient of MSE"""
    predictions = X.dot(theta)
    errors = predictions - y
    return (1/len(y)) * X.T.dot(errors)

# Add intercept term
X_b = np.c_[np.ones((n_samples, 1)), X_train]

# Initialize parameters
theta = np.random.randn(2, 1)

# Gradient descent for linear regression
learning_rate = 0.1
n_iterations = 1000
theta_history = [theta.copy()]
mse_history = [compute_mse(X_b, y_train, theta)]

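for iteration in range(n_iterations):
    gradients = compute_mse_gradient(X_b, y_train, theta)
    theta = theta - learning_rate * gradients
    
    # Record after every 100th update so the history lines up with
    # iterations 0, 100, ..., 1000 and the last entry is the final model
    if (iteration + 1) % 100 == 0:
        theta_history.append(theta.copy())
        mse_history.append(compute_mse(X_b, y_train, theta))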

print(f"\n  Initial parameters: θ₀ = {theta_history[0][0,0]:.4f}, θ₁ = {theta_history[0][1,0]:.4f}")
print(f"  Final parameters: θ₀ = {theta[0,0]:.4f}, θ₁ = {theta[1,0]:.4f}")
print(f"  True parameters: θ₀ = 4.0, θ₁ = 3.0")
print(f"  Initial MSE: {mse_history[0]:.4f}")
print(f"  Final MSE: {mse_history[-1]:.4f}")
print("\n  ✓ Successfully learned parameters close to true values!")

In [None]:
"""
4. COMPREHENSIVE VISUALIZATIONS (8 plots) - Section 5 Complete
"""

print("\n" + "="*80)
print("4. COMPREHENSIVE VISUALIZATIONS (8 plots)")
print("="*80)

fig = plt.figure(figsize=(20, 15))
gs = fig.add_gridspec(3, 3, hspace=0.35, wspace=0.35)

# Plot 1: Gradient descent trajectory
print("\n  Creating Plot 1: Gradient descent trajectory...")
ax = fig.add_subplot(gs[0, 0])
x_vals = np.linspace(-3, 6, 500)
ax.plot(x_vals, f1(x_vals), 'b-', linewidth=2, label='f(x) = (x+2)²')
ax.plot(x_hist, f_hist, 'ro-', markersize=6, linewidth=1.5, label='GD trajectory', alpha=0.7)
ax.plot(x_hist[0], f_hist[0], 'go', markersize=12, label='Start', zorder=5)
ax.plot(x_hist[-1], f_hist[-1], 'r*', markersize=15, label='Optimum', zorder=5)
ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('f(x)', fontsize=11)
ax.set_title('Gradient Descent Trajectory\nα=0.1, converges to x=-2', fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# Plot 2: Learning rate comparison
print("  Creating Plot 2: Learning rate effects...")
ax = fig.add_subplot(gs[0, 1])
for lr in learning_rates:
    x_hist_lr, f_hist_lr = results[lr]
    ax.plot(range(len(f_hist_lr)), f_hist_lr, '-o', markersize=3, label=f'α={lr}', linewidth=1.5)
ax.set_xlabel('Iteration', fontsize=11)
ax.set_ylabel('f(x)', fontsize=11)
ax.set_title('Learning Rate Comparison\nConvergence speed depends on α',
             fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)
ax.set_yscale('log')

# Plot 3: Newton vs Gradient Descent
print("  Creating Plot 3: Newton vs GD comparison...")
ax = fig.add_subplot(gs[0, 2])
ax.plot(range(len(f_gd)), f_gd, 'b-o', markersize=3, label='Gradient Descent', linewidth=1.5, alpha=0.7)
ax.plot(range(len(f_newton)), f_newton, 'r-s', markersize=5, label="Newton's Method", linewidth=2)
ax.set_xlabel('Iteration', fontsize=11)
ax.set_ylabel('f(x)', fontsize=11)
ax.set_title("Newton's Method vs Gradient Descent\nNewton converges much faster!", 
             fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)
ax.set_yscale('log')

# Plot 4: Linear regression fit
print("  Creating Plot 4: Linear regression...")
ax = fig.add_subplot(gs[1, 0])
ax.scatter(X_train, y_train, alpha=0.5, s=30, label='Training data')
x_plot = np.linspace(0, 2, 100).reshape(-1, 1)
x_plot_b = np.c_[np.ones((100, 1)), x_plot]
y_pred = x_plot_b.dot(theta)
ax.plot(x_plot, y_pred, 'r-', linewidth=2.5, label=f'Fitted: y={theta[0,0]:.2f}+{theta[1,0]:.2f}x')
ax.plot(x_plot, 4 + 3*x_plot, 'g--', linewidth=2, label='True: y=4+3x', alpha=0.7)
ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('y', fontsize=11)
ax.set_title('Linear Regression with Gradient Descent\nLearned parameters match true model', 
             fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# Plot 5: MSE convergence
print("  Creating Plot 5: MSE convergence...")
ax = fig.add_subplot(gs[1, 1])
ax.plot(np.arange(0, n_iterations+1, 100), mse_history, 'b-o', linewidth=2, markersize=6)
ax.set_xlabel('Iteration', fontsize=11)
ax.set_ylabel('Mean Squared Error', fontsize=11)
ax.set_title('Training Loss Convergence\nMSE decreases as model learns', fontsize=11, fontweight='bold')
ax.grid(True, alpha=0.3)

# Plot 6: Convex vs non-convex
print("  Creating Plot 6: Convex vs non-convex...")
ax = fig.add_subplot(gs[1, 2])
x_vals = np.linspace(-6, 6, 500)
f_convex = lambda x: x**2
f_nonconvex = lambda x: np.sin(x) + 0.1*x**2
ax.plot(x_vals, f_convex(x_vals), 'b-', linewidth=2, label='Convex: x²')
ax.plot(x_vals, f_nonconvex(x_vals), 'r-', linewidth=2, label='Non-convex: sin(x)+0.1x²')
ax.axhline(0, color='black', linewidth=0.8)
ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('f(x)', fontsize=11)
ax.set_title('Convex vs Non-Convex Functions', fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# Plot 7: Rosenbrock function contour with GD trajectory
print("  Creating Plot 7: Rosenbrock function...")
ax = fig.add_subplot(gs[2, 0])

def rosenbrock(x, y):
    return (1 - x)**2 + 100*(y - x**2)**2

def rosenbrock_gradient(x, y):
    dx = -2*(1 - x) - 400*x*(y - x**2)
    dy = 200*(y - x**2)
    return np.array([dx, dy])

# Optimize on Rosenbrock
point = np.array([-1.0, 1.0])
alpha_gd = 0.001
trajectory = [point.copy()]

for i in range(1000):
    grad = rosenbrock_gradient(point[0], point[1])
    point = point - alpha_gd * grad
    if i % 100 == 0:
        trajectory.append(point.copy())

trajectory = np.array(trajectory)

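# Named x_grid / y_grid so the SymPy symbol x used in the practice-problem cells is not overwritten
x_grid = np.linspace(-2, 2, 100)
y_grid = np.linspace(-1, 3, 100)
X_mesh, Y_mesh = np.meshgrid(x_grid, y_grid)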
Z = rosenbrock(X_mesh, Y_mesh)
contour = ax.contour(X_mesh, Y_mesh, np.log(Z + 1), levels=20, cmap='viridis')
ax.plot(trajectory[:, 0], trajectory[:, 1], 'r-o', markersize=4, linewidth=2, label='GD path')
ax.plot(1, 1, 'r*', markersize=20, label='Global minimum', zorder=5)
ax.set_xlabel('x', fontsize=11)
ax.set_ylabel('y', fontsize=11)
ax.set_title('Rosenbrock Function Optimization', fontsize=11, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# Plot 8: Gradient descent algorithm summary
print("  Creating Plot 8: Algorithm summary...")
ax = fig.add_subplot(gs[2, 1:])
ax.text(0.5, 0.95, 'Gradient Descent Algorithm', ha='center', fontsize=14, fontweight='bold')

algorithm_steps = [
    "1. Initialize: Choose starting point x₀ and learning rate α",
    "",
    "2. Repeat until convergence:",
    "   a) Compute gradient: g = ∇f(xₖ)",
    "   b) Update: xₖ₊₁ = xₖ - α·g",
    "   c) Check: |xₖ₊₁ - xₖ| < ε ?",
    "",
    "3. Return: xₖ₊₁ as optimal solution",
    "",
    "Key Parameters:",
    "  • α (learning rate): Controls step size",
    "    - Too small: slow convergence",
    "    - Too large: oscillation/divergence",
    "    - Typical: 0.001 to 0.1",
    "",
    "Applications:",
    "  ✓ Neural network training (backprop + GD)",
    "  ✓ Linear/logistic regression",
    "  ✓ Support vector machines",
    "  ✓ Any differentiable optimization problem"
]

y_pos = 0.85
for step in algorithm_steps:
    if step.startswith(('1.', '2.', '3.', 'Key', 'Applications')):
        ax.text(0.05, y_pos, step, fontsize=10, fontweight='bold', family='monospace')
    else:
        ax.text(0.05, y_pos, step, fontsize=9, family='monospace')
    y_pos -= 0.04

ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
ax.axis('off')

plt.tight_layout()
plt.show()

print("\n✓ All 8 visualizations complete")

print("\n" + "="*80)
print("SECTION 5 COMPLETE: Optimization Applications")
print("="*80)

In [None]:
"""
2. NEWTON'S METHOD vs GRADIENT DESCENT - Comparison
"""

print("\n" + "="*80)
print("2. NEWTON'S METHOD")
print("="*80)

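def newtons_method(f, df, ddf, x0, max_iters=20, tol=1e-6):
    """Newton's method for optimization: solves f'(x) = 0 via x <- x - f'(x)/f''(x)."""
    x = x0
    x_history = [x]
    f_history = [f(x)]
    
    for i in range(max_iters):
        curvature = ddf(x)
        if curvature == 0:
            # The Newton step divides by f''(x), so it is undefined here
            print(f"  Stopped at iteration {i+1}: f''(x) = 0, Newton step undefined")
            break
        
        # Newton's update
        x_new = x - df(x) / curvature
        step = abs(x_new - x)
        
        # Record the new iterate before testing convergence so the final point is kept
        x = x_new
        x_history.append(x)
        f_history.append(f(x))
        
        if step < tol:
            print(f"  Converged in {i+1} iterations")
            break
    
    return np.array(x_history), np.array(f_history)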

# Example: Compare Newton vs Gradient Descent
print("\nCompare Newton's method vs Gradient Descent")
print("  Function: f(x) = x⁴ - 3x³ + 2")

f2 = lambda x: x**4 - 3*x**3 + 2
df2 = lambda x: 4*x**3 - 9*x**2
ddf2 = lambda x: 12*x**2 - 18*x

print("\n  Gradient Descent (α=0.01):")
x_gd, f_gd = gradient_descent(f2, df2, x0=3.0, learning_rate=0.01, max_iters=1000)
print(f"    Final x = {x_gd[-1]:.6f}, f(x) = {f_gd[-1]:.6f}, iterations = {len(x_gd)}")

print("\n  Newton's Method:")
x_newton, f_newton = newtons_method(f2, df2, ddf2, x0=3.0)
print(f"    Final x = {x_newton[-1]:.6f}, f(x) = {f_newton[-1]:.6f}, iterations = {len(x_newton)}")

print("\n  ✓ Newton's method converges much faster!")

In [None]:
# PROBLEM 11 (BONUS): Neural Network Backpropagation

print("\n" + "="*80)
print("PROBLEM 11 (BONUS): Neural Network Backpropagation")
print("="*80)

x_nn = 3
w1 = 0.5
b1 = 1.0
alpha_nn = 0.1

print(f"\nGiven: x={x_nn}, w₁={w1}, b₁={b1}")
print("Network: L = (a₁)² where a₁ = ReLU(w₁x + b₁)")

print("\na) Forward pass:")
z1 = w1 * x_nn + b1
print(f"   z₁ = w₁·x + b₁ = {w1}·{x_nn} + {b1} = {z1}")
a1 = max(0, z1)
print(f"   a₁ = ReLU(z₁) = max(0, {z1}) = {a1}")
L = a1**2
print(f"   L = (a₁)² = {a1}² = {L}")

print("\nb) Backpropagation using chain rule:")
print("   ∂L/∂w₁ = (∂L/∂a₁)·(∂a₁/∂z₁)·(∂z₁/∂w₁)")
print(f"   ∂L/∂a₁ = 2a₁ = 2·{a1} = {2*a1}")
relu_grad = 1 if z1 > 0 else 0
print(f"   ∂a₁/∂z₁ = ReLU'(z₁) = {relu_grad} (since z₁={z1}>0)")
print(f"   ∂z₁/∂w₁ = x = {x_nn}")
grad_w1 = 2*a1 * relu_grad * x_nn
print(f"   ∂L/∂w₁ = {2*a1}·{relu_grad}·{x_nn} = {grad_w1}")

print("\nc) Update w₁:")
w1_new = w1 - alpha_nn * grad_w1
print(f"   w₁_new = w₁ - α·(∂L/∂w₁) = {w1} - {alpha_nn}·{grad_w1} = {w1_new}")

print("\nd) Numerical verification:")
h = 1e-7
L_plus = (max(0, (w1+h)*x_nn + b1))**2
L_original = (max(0, w1*x_nn + b1))**2
numerical_grad = (L_plus - L_original) / h
print(f"   Numerical gradient ≈ {numerical_grad:.4f}")
print(f"   Analytical gradient = {grad_w1:.4f}")
print(f"   ✓ Match (difference: {abs(numerical_grad - grad_w1):.10f})")

print("\n" + "="*80)
print("✓ ALL 11 PRACTICE PROBLEMS COMPLETE!")
print("="*80)

In [None]:
# PROBLEM 10: Newton's Method

print("\n" + "="*80)
print("PROBLEM 10: Newton's Method")
print("="*80)

print("\nFind root of f(x) = x³ - 2x - 5 starting from x₀=2")

f10 = lambda x: x**3 - 2*x - 5
df10 = lambda x: 3*x**2 - 2

print("\na) Update formula: x_{n+1} = x_n - f(x_n)/f'(x_n)")
print("   = x_n - (x_n³ - 2x_n - 5)/(3x_n² - 2)")

print("\nb) Iterations:")
x_newton = 2.0
for i in range(4):
    f_val = f10(x_newton)
    df_val = df10(x_newton)
    x_new = x_newton - f_val / df_val
    print(f"   Iteration {i+1}: x={x_newton:.6f}, f(x)={f_val:.6f}, x_new={x_new:.6f}")
    x_newton = x_new

print(f"\nc) ✓ Verification: f({x_newton:.6f}) = {f10(x_newton):.8f} ≈ 0")

In [None]:
# PROBLEM 9: Implicit Differentiation

print("\n" + "="*80)
print("PROBLEM 9: Implicit Differentiation")
print("="*80)

print("\nEquation: x² + y² = 25")
print("\nSolution:")
print("  Differentiate both sides: 2x + 2y(dy/dx) = 0")
print("  Solve for dy/dx: dy/dx = -x/y")
print("\n  At point (3, 4):")
print("  dy/dx = -3/4 = -0.75")
print("\n  ✓ Slope of tangent line at (3,4) is -3/4")

In [None]:
# PROBLEM 8: Machine Learning - Linear Regression with Gradient Descent

print("\n" + "="*80)
print("PROBLEM 8: Linear Regression with Gradient Descent")
print("="*80)

X = np.array([1, 2, 3])
y = np.array([3, 5, 7])
n = len(X)

print(f"\nData: {list(zip(X, y))}")

print("\na) Loss: L(θ) = (1/2n)Σ(yᵢ - θxᵢ)²")
print("   Gradient: ∂L/∂θ = -(1/n)Σ(yᵢ - θxᵢ)xᵢ")

print("\nb) Gradient descent with α=0.1, θ₀=0:")
theta = 0.0
for it in range(3):  # 'it' avoids shadowing the built-in iter()
    predictions = theta * X
    errors = y - predictions
    gradient = -(1/n) * np.sum(errors * X)
    theta_new = theta - 0.1 * gradient
    loss = (1/(2*n)) * np.sum(errors**2)
    print(f"   Iteration {it+1}: θ={theta:.4f}, gradient={gradient:.4f}, loss={loss:.4f}, θ_new={theta_new:.4f}")
    theta = theta_new

print(f"\n   After 3 iterations: θ = {theta:.4f}")

print("\nc) Optimal θ (analytical solution):")
theta_opt = np.sum(X * y) / np.sum(X * X)
print(f"   θ* = Σ(xᵢyᵢ) / Σ(xᵢ²) = {np.sum(X*y)} / {np.sum(X*X)} = {theta_opt:.4f}")

In [None]:
# PROBLEM 7: Related Rates - Sliding Ladder

print("\n" + "="*80)
print("PROBLEM 7: Related Rates - Sliding Ladder")
print("="*80)

print("\nLadder: 10m long, bottom sliding away at 1 m/s")
print("Find: How fast is top sliding down when bottom is 6m from wall?")

print("\nSolution:")
print("  Let x = distance from wall to bottom, y = height of top")
print("  Pythagorean theorem: x² + y² = 100")
print("  Differentiate: 2x(dx/dt) + 2y(dy/dt) = 0")
print("  Solve for dy/dt: dy/dt = -x(dx/dt) / y")
print("\n  Given: dx/dt = 1 m/s, x = 6m")
print("  Find y: 6² + y² = 100  →  y = 8m")
print("  dy/dt = -6(1) / 8 = -0.75 m/s")
print("\n  ✓ Answer: Top sliding down at 0.75 m/s (negative = downward)")

In [None]:
# PROBLEM 6: Gradient Descent on polynomial

print("\n" + "="*80)
print("PROBLEM 6: Gradient Descent on f(x) = x⁴ - 3x² + 2")
print("="*80)

f6_func = lambda x: x**4 - 3*x**2 + 2
df6_func = lambda x: 4*x**3 - 6*x

print("\na) f'(x) = 4x³ - 6x")

print("\nb-c) Implementing gradient descent:")
x_gd = 2.0
alpha = 0.1
iterations = 0
tolerance = 1e-6

gd_history = [x_gd]

while iterations < 1000:
    x_new = x_gd - alpha * df6_func(x_gd)
    if abs(x_new - x_gd) < tolerance:
        break
    x_gd = x_new
    gd_history.append(x_gd)
    iterations += 1

print(f"   Converged in {iterations} iterations")
print(f"   Final x = {x_gd:.6f}")
print(f"   f(x) = {f6_func(x_gd):.6f}")

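print("\nd) Analysis:")
print(f"   Starting from x₀=2 with α=0.1, the iteration stopped at x≈{x_gd:.4f}")
print("   f(x) has 3 critical points: x ≈ -1.22, 0, 1.22")
print("   Global minimum is at x ≈ ±1.22 with f(x) = -0.25; x = 0 is a local maximum")
print("   The first step is x₁ = 2 - 0.1·f'(2) = 2 - 0.1·20 = 0, and f'(0) = 0,")
print("   so the update stalls at the local maximum and never reaches a minimum")
print("   A smaller step (e.g. α=0.01) avoids the overshoot and converges to x≈1.22")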

In [None]:
# PROBLEM 5: Optimization - Rectangular Field

print("\n" + "="*80)
print("PROBLEM 5: Optimization - Rectangular Field")
print("="*80)

print("\nFarmer has 200m fencing, one side is river (no fence needed)")
print("\nLet x = width (perpendicular to river), y = length (parallel to river)")
print("Constraint: 2x + y = 200  →  y = 200 - 2x")
print("Area: A(x) = x·y = x(200 - 2x) = 200x - 2x²")

print("\na) A(x) = 200x - 2x²")

print("\nb) Find critical points:")
print("   A'(x) = 200 - 4x = 0")
print("   x = 50")

print("\nc) Second derivative test:")
print("   A''(x) = -4 < 0  →  Maximum at x=50")

print("\nd) Maximum area:")
print("   x = 50m, y = 200 - 2(50) = 100m")
print("   A_max = 50 × 100 = 5000 m²")

In [None]:
# PROBLEM 4: Critical Points and Classification

print("\n" + "="*80)
print("PROBLEM 4: Critical Points and Classification")
print("="*80)

print("\nf(x) = x⁴ - 4x³ + 10")
x = sp.Symbol('x')  # define the symbol in this cell so it runs on its own
f4 = x**4 - 4*x**3 + 10
df4 = sp.diff(f4, x)
ddf4 = sp.diff(df4, x)

print(f"\na) f'(x) = {df4}")
critical_pts = sp.solve(df4, x)
print(f"   Critical points: {critical_pts}")

print("\nb) Second derivative test:")
print(f"   f''(x) = {ddf4}")
for cp in critical_pts:
    ddf_val = ddf4.subs(x, cp)
    f_val = f4.subs(x, cp)
    if ddf_val > 0:
        classification = "Local MINIMUM"
    elif ddf_val < 0:
        classification = "Local MAXIMUM"
    else:
        classification = "Inconclusive"
    print(f"   x = {cp}: f''({cp}) = {ddf_val} → {classification}, f({cp}) = {f_val}")

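print("\nc) Global extrema on [-1, 4]:")
endpoints = [-1, 4]
candidates = list(critical_pts) + endpoints
print("   Evaluating at critical points and endpoints:")
for pt in candidates:
    val = f4.subs(x, pt)
    print(f"   f({pt}) = {val}")
# The largest and smallest candidate values are the global extrema on the closed interval
x_max = max(candidates, key=lambda pt: f4.subs(x, pt))
x_min = min(candidates, key=lambda pt: f4.subs(x, pt))
print(f"   Global maximum: f({x_max}) = {f4.subs(x, x_max)}")
print(f"   Global minimum: f({x_min}) = {f4.subs(x, x_min)}")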

In [None]:
# PROBLEM 3: Chain Rule Applications

print("\n" + "="*80)
print("PROBLEM 3: Chain Rule Applications")
print("="*80)

x = sp.Symbol('x')  # define the symbol in this cell so it runs on its own

# a)
print("\na) f(x) = (2x² + 3x + 1)⁵")
f3a = (2*x**2 + 3*x + 1)**5
df3a = sp.diff(f3a, x)
print(f"   f'(x) = {df3a}")
print("   Using chain rule: 5(2x²+3x+1)⁴ · (4x+3)")

# b)
print("\nb) g(x) = e^(x³+2x)")
f3b = sp.exp(x**3 + 2*x)
df3b = sp.diff(f3b, x)
print(f"   g'(x) = {df3b}")

# c)
print("\nc) h(x) = sin(x²+1)")
f3c = sp.sin(x**2 + 1)
df3c = sp.diff(f3c, x)
print(f"   h'(x) = {df3c}")

# d)
print("\nd) k(x) = ln(√(x²+1))")
f3d = sp.log(sp.sqrt(x**2 + 1))
df3d = sp.diff(f3d, x)
df3d_simplified = sp.simplify(df3d)
print(f"   k'(x) = {df3d_simplified}")

In [None]:
# PROBLEM 2: Differentiation Rules (Power, Product, Quotient)

print("\n" + "="*80)
print("PROBLEM 2: Differentiation Rules")
print("="*80)

x = sp.Symbol('x')  # define the symbol in this cell so it runs on its own

# a) Power rule
print("\na) f(x) = 3x⁵ - 2x³ + 7x - 4")
f2a = 3*x**5 - 2*x**3 + 7*x - 4
df2a = sp.diff(f2a, x)
print(f"   f'(x) = {df2a}")

# b) Product rule
print("\nb) g(x) = x²eˣ")
f2b = x**2 * sp.exp(x)
df2b = sp.diff(f2b, x)
print(f"   g'(x) = {df2b}")
print("   Using product rule: (x²)'(eˣ) + (x²)(eˣ)' = 2x·eˣ + x²·eˣ = eˣ(x²+2x)")

# c) Quotient rule
print("\nc) h(x) = (x²+1)/(x-1)")
f2c = (x**2 + 1) / (x - 1)
df2c = sp.diff(f2c, x)
df2c_simplified = sp.simplify(df2c)
print(f"   h'(x) = {df2c_simplified}")
print("   Using quotient rule: [(2x)(x-1) - (x²+1)(1)] / (x-1)²")

# d) Product + trig
print("\nd) k(x) = sin(x)cos(x)")
f2d = sp.sin(x) * sp.cos(x)
df2d = sp.diff(f2d, x)
df2d_simplified = sp.simplify(df2d)
print(f"   k'(x) = {df2d_simplified}")
print("   Note: sin(x)cos(x) = ½sin(2x), so derivative is cos(2x)")

In [None]:
# PROBLEM 1: First Principles

print("\n" + "="*80)
print("PROBLEM 1: Derivative from First Principles")
print("="*80)

print("\nFind derivative of f(x) = √(x+1) using limit definition:")
print("\nSolution:")
print("  f'(x) = lim[h→0] [√(x+h+1) - √(x+1)] / h")
print("  Multiply by conjugate: [√(x+h+1) + √(x+1)] / [√(x+h+1) + √(x+1)]")
print("  = lim[h→0] [(x+h+1) - (x+1)] / [h(√(x+h+1) + √(x+1))]")
print("  = lim[h→0] h / [h(√(x+h+1) + √(x+1))]")
print("  = lim[h→0] 1 / [√(x+h+1) + √(x+1)]")
print("  = 1 / [2√(x+1)]")

# Verify with SymPy
x = sp.Symbol('x')
f1 = sp.sqrt(x + 1)
df1 = sp.diff(f1, x)
print(f"\n  ✓ SymPy verification: f'(x) = {df1}")

In [None]:
"""
PRACTICE PROBLEMS - DETAILED SOLUTIONS
"""

print("="*80)
print("PRACTICE PROBLEMS - SOLUTIONS")
print("="*80)
print("\n✓ 11 comprehensive problems covering all derivative concepts")
print("  - First principles and limit definition")
print("  - Differentiation rules (power, product, quotient)")
print("  - Chain rule applications")
print("  - Critical points and optimization")
print("  - Gradient descent and Newton's method")
print("  - Related rates and implicit differentiation")
print("  - Machine learning applications (linear regression, backpropagation)")
print("\n" + "="*80)

## Practice Problems

Test your understanding of derivatives with these comprehensive problems covering all topics from this week.

---

### Problem 1: Computing Derivatives from First Principles

Compute the derivative of $f(x) = \sqrt{x+1}$ using the limit definition:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

**Hint:** Multiply by the conjugate to simplify.

---

### Problem 2: Differentiation Rules

Find the derivatives of the following functions:

a) $f(x) = 3x^5 - 2x^3 + 7x - 4$

b) $g(x) = x^2 e^x$

c) $h(x) = \frac{x^2 + 1}{x - 1}$

d) $k(x) = \sin(x) \cos(x)$

---

### Problem 3: Chain Rule Applications

Compute the derivatives using the chain rule:

a) $f(x) = (2x^2 + 3x + 1)^5$

b) $g(x) = e^{x^3 + 2x}$

c) $h(x) = \sin(x^2 + 1)$

d) $k(x) = \ln(\sqrt{x^2 + 1})$

---

### Problem 4: Critical Points and Classification

For $f(x) = x^4 - 4x^3 + 10$:

a) Find all critical points by solving $f'(x) = 0$

b) Classify each critical point using the second derivative test

c) Determine the global maximum and minimum on the interval $[-1, 4]$

d) Sketch the function showing all critical points

---

### Problem 5: Optimization Problem

A farmer has 200 meters of fencing to enclose a rectangular field adjacent to a river (so one side doesn't need fencing). What dimensions will maximize the area?

a) Express the area $A$ as a function of one variable

b) Find the critical points

c) Verify that your answer gives a maximum

d) What is the maximum area?

---

### Problem 6: Gradient Descent Implementation

Consider $f(x) = x^4 - 3x^2 + 2$:

a) Compute $f'(x)$ symbolically

b) Implement gradient descent starting from $x_0 = 2$ with $\alpha = 0.1$

c) How many iterations until convergence (tolerance $10^{-6}$)?

d) Does the algorithm find the global minimum? Why or why not?

---

### Problem 7: Related Rates

A ladder 10 meters long rests against a vertical wall. If the bottom slides away from the wall at 1 m/s, how fast is the top sliding down when the bottom is 6 meters from the wall?

**Hint:** Use the Pythagorean theorem and implicit differentiation.

---

### Problem 8: Machine Learning Application

Consider a simple linear regression problem with loss function:
$$L(\theta) = \frac{1}{2n}\sum_{i=1}^n (y_i - \theta x_i)^2$$

Given data: $(x_1, y_1) = (1, 3)$, $(x_2, y_2) = (2, 5)$, $(x_3, y_3) = (3, 7)$

a) Compute the gradient $\frac{\partial L}{\partial \theta}$ symbolically

b) Starting from $\theta_0 = 0$, perform 3 iterations of gradient descent with $\alpha = 0.1$

c) What is the optimal value of $\theta$ (you can solve analytically)?

---

### Problem 9: Implicit Differentiation

Find $\frac{dy}{dx}$ for the equation:
$$x^2 + y^2 = 25$$

Then find the slope of the tangent line at the point $(3, 4)$.

---

### Problem 10: Newton's Method

Use Newton's method to find the root of $f(x) = x^3 - 2x - 5$ starting from $x_0 = 2$.

a) Write the Newton's method update formula for this specific function

b) Perform 4 iterations manually (or with code)

c) Verify your answer by checking $f(x) \approx 0$

---

### Problem 11 (Bonus): Neural Network Backpropagation

Consider a simple neural network: $L = (a_1)^2$ where $a_1 = \text{ReLU}(w_1 x + b_1)$

Given: $x = 3$, $w_1 = 0.5$, $b_1 = 1$

a) Compute the forward pass to get $L$

b) Compute $\frac{\partial L}{\partial w_1}$ using the chain rule

c) Update $w_1$ using gradient descent with $\alpha = 0.1$

d) Verify numerically using $\frac{\partial L}{\partial w_1} \approx \frac{L(w_1 + h) - L(w_1)}{h}$

---

**Check your answers against the solution cells!**

## Summary and Key Takeaways

### 🎯 Core Concepts Mastered

#### 1. **Derivative Definition**
- **Limit definition:** $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$
- **Interpretation:** Instantaneous rate of change, slope of tangent line
- **Connection:** Limits → Derivatives → Foundation of calculus
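
The limit definition can be watched numerically. The sketch below is illustrative only: the helper name `difference_quotient`, the evaluation point `x0 = 3`, and the step sizes are assumptions chosen for this example, and the function is the one from Practice Problem 1. It prints secant slopes that approach the exact derivative as `h` shrinks.

```python
import math

def difference_quotient(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h

# f(x) = sqrt(x + 1), whose derivative is 1 / (2 * sqrt(x + 1))
f = lambda x: math.sqrt(x + 1)
x0 = 3.0                               # assumed evaluation point
exact = 1 / (2 * math.sqrt(x0 + 1))    # f'(3) = 1/4

for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    approx = difference_quotient(f, x0, h)
    print(f"h = {h:<7g} secant slope = {approx:.8f}  error = {abs(approx - exact):.2e}")
```

The error shrinks roughly in proportion to `h`, which is the expected behavior of a one-sided difference quotient.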

#### 2. **Differentiation Rules**

| Rule | Formula | When to Use |
|------|---------|-------------|
| Power Rule | $(x^n)' = nx^{n-1}$ | Polynomials, power functions |
| Product Rule | $(fg)' = f'g + fg'$ | Product of two functions |
| Quotient Rule | $\left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}$ | Ratio of functions |
| Chain Rule | $(f \circ g)' = f'(g(x)) \cdot g'(x)$ | Composite functions |

**Mnemonic for Quotient Rule:** "Low d-high minus high d-low, over low-low"
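
The rules in the table can be checked symbolically. In this sketch, `f` and `g` are undefined SymPy functions standing in for arbitrary differentiable functions (an assumption of the example, not notation from the course), and each rule is compared against what `sp.diff` returns.

```python
import sympy as sp

x = sp.Symbol('x')
f = sp.Function('f')(x)   # placeholder for any differentiable function
g = sp.Function('g')(x)

# Right-hand sides of the rules, written out by hand
product_rule = sp.diff(f, x) * g + f * sp.diff(g, x)
quotient_rule = (sp.diff(f, x) * g - f * sp.diff(g, x)) / g**2
chain_rule = sp.cos(g) * sp.diff(g, x)      # outer function sin, inner function g

# Left-hand sides computed by SymPy
product = sp.diff(f * g, x)
quotient = sp.diff(f / g, x)
chain = sp.diff(sp.sin(g), x)

print("product rule holds: ", sp.simplify(product - product_rule) == 0)
print("quotient rule holds:", sp.simplify(quotient - quotient_rule) == 0)
print("chain rule holds:   ", sp.simplify(chain - chain_rule) == 0)
```

Because `f` and `g` are left unspecified, the check covers the general rule rather than one particular pair of functions.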

#### 3. **Critical Points and Extrema**

**Finding Critical Points:**
1. Compute $f'(x)$
2. Solve $f'(x) = 0$
3. Check where $f'(x)$ is undefined

**Classification Methods:**

**First Derivative Test:**
- $f'$ changes + to − → **Local Maximum**
- $f'$ changes − to + → **Local Minimum**

**Second Derivative Test:**
- $f''(c) > 0$ → **Local Minimum** (concave up ⌣)
- $f''(c) < 0$ → **Local Maximum** (concave down ⌢)
- $f''(c) = 0$ → Inconclusive (use 1st derivative test)
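
The two tests can be combined in code. The following is a minimal sketch: the function name `classify_critical_points` and the offset `eps` are assumptions made for this example, and it only handles functions whose critical points SymPy can solve for exactly.

```python
import sympy as sp

x = sp.Symbol('x', real=True)

def classify_critical_points(f):
    """Classify each real solution of f'(x) = 0, using f'' first and sign changes of f' as a fallback."""
    df = sp.diff(f, x)
    ddf = sp.diff(df, x)
    results = {}
    for c in sp.solve(df, x):
        curvature = ddf.subs(x, c)
        if curvature > 0:
            results[c] = "local minimum"
        elif curvature < 0:
            results[c] = "local maximum"
        else:
            # Second derivative test is inconclusive: compare the sign of f' on either side.
            eps = sp.Rational(1, 100)   # assumed small enough that no other critical point lies in between
            left, right = df.subs(x, c - eps), df.subs(x, c + eps)
            if left > 0 and right < 0:
                results[c] = "local maximum"
            elif left < 0 and right > 0:
                results[c] = "local minimum"
            else:
                results[c] = "neither (no sign change)"
    return results

print(classify_critical_points(x**3 - 3*x))           # both points settled by f''
print(classify_critical_points(x**4 - 4*x**3 + 10))   # f''(0) = 0, so x = 0 needs the first derivative test
```

For the second function (the one from Practice Problem 4), $f'$ is negative on both sides of $x = 0$, so that critical point is neither a maximum nor a minimum.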

#### 4. **Optimization**

**Gradient Descent Algorithm:**
```
Initialize: x₀, α (learning rate)
Repeat:
  x_{k+1} = x_k - α·∇f(x_k)
Until convergence
```

**Key Parameters:**
- **Learning rate (α):** Too small = slow, too large = divergence
- **Convergence criterion:** $|x_{k+1} - x_k| < \epsilon$

**Newton's Method:** Uses the second derivative to scale each step, which typically gives much faster convergence near a minimum (requires $f''(x_k) \neq 0$)
- Update: $x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$
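
The effect of the learning rate, and the contrast with Newton's method, can be seen on a simple quadratic. This is a sketch under stated assumptions: the test function $f(x) = (x-3)^2 + 1$, the learning rates, the divergence guard, and the helper names `gd_minimize` and `newton_minimize` are all chosen for illustration and are separate from the functions defined earlier in the notebook.

```python
def gd_minimize(df, x0, alpha, max_iters=1000, tol=1e-6):
    """Iterate x <- x - alpha * f'(x); return (last x, number of updates)."""
    x = x0
    for k in range(1, max_iters + 1):
        x_new = x - alpha * df(x)
        # Stop on convergence, or when the iterates blow up (crude guard, an assumption of this sketch)
        if abs(x_new - x) < tol or abs(x_new) > 1e6:
            return x_new, k
        x = x_new
    return x, max_iters

def newton_minimize(df, ddf, x0, max_iters=50, tol=1e-6):
    """Iterate x <- x - f'(x) / f''(x); return (last x, number of updates)."""
    x = x0
    for k in range(1, max_iters + 1):
        x_new = x - df(x) / ddf(x)
        if abs(x_new - x) < tol:
            return x_new, k
        x = x_new
    return x, max_iters

# f(x) = (x - 3)^2 + 1 has its minimum at x = 3
df = lambda x: 2 * (x - 3)
ddf = lambda x: 2.0

for alpha in [0.01, 0.1, 0.5, 1.1]:
    x_final, steps = gd_minimize(df, x0=0.0, alpha=alpha)
    print(f"alpha = {alpha:<5} -> x = {x_final:.6g} after {steps} updates")

x_final, steps = newton_minimize(df, ddf, x0=0.0)
print(f"Newton        -> x = {x_final:.6g} after {steps} updates")
```

On this quadratic the small rate crawls toward $x = 3$, the rate 1.1 overshoots further on every step and diverges, and Newton's method lands on the minimum with its first update (the second update only confirms convergence).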

---

### 📐 Essential Formulas Reference

#### Common Derivatives

| Function | Derivative |
|----------|------------|
| $c$ (constant) | $0$ |
| $x^n$ | $nx^{n-1}$ |
| $e^x$ | $e^x$ |
| $\ln(x)$ | $\frac{1}{x}$ |
| $\sin(x)$ | $\cos(x)$ |
| $\cos(x)$ | $-\sin(x)$ |
| $\tan(x)$ | $\sec^2(x)$ |

#### Chain Rule Examples
- $(x^2 + 1)^5$ → $5(x^2+1)^4 \cdot 2x$
- $e^{x^2}$ → $e^{x^2} \cdot 2x$
- $\sin(3x)$ → $\cos(3x) \cdot 3$
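
These three results can be confirmed with SymPy. The list name `examples` is an assumption of this sketch; each entry pairs a function with the hand-computed derivative from the list above.

```python
import sympy as sp

x = sp.Symbol('x')

# (function, derivative worked out by hand with the chain rule)
examples = [
    ((x**2 + 1)**5, 5 * (x**2 + 1)**4 * 2 * x),
    (sp.exp(x**2),  sp.exp(x**2) * 2 * x),
    (sp.sin(3 * x), sp.cos(3 * x) * 3),
]

for func, by_hand in examples:
    computed = sp.diff(func, x)
    matches = sp.simplify(computed - by_hand) == 0
    print(f"d/dx {func} = {computed}   matches hand result: {matches}")
```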

---

### 🤖 Machine Learning Connections

#### 1. **Gradient Descent = Core of Training**
- Neural networks: Update weights using $w := w - \alpha \frac{\partial L}{\partial w}$
- Linear regression: Minimize MSE using gradients
- Most ML training relies on computing derivatives!

#### 2. **Backpropagation = Chain Rule**
- Forward pass: Compute predictions
- Backward pass: Apply chain rule to compute gradients
- Example: $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$
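
The three-factor product above can be traced on a single neuron. This sketch assumes a sigmoid activation and a squared-error loss $L = (a - y)^2$ with $z = wx + b$; those choices, the input values, and the function names are illustrative and differ from the ReLU network in Problem 11.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Forward pass: z = w*x + b, a = sigmoid(z), L = (a - y)^2."""
    return (sigmoid(w * x + b) - y) ** 2

def grad_w(w, b, x, y):
    """Backward pass: dL/dw = dL/da * da/dz * dz/dw."""
    z = w * x + b
    a = sigmoid(z)
    dL_da = 2 * (a - y)      # derivative of (a - y)^2 with respect to a
    da_dz = a * (1 - a)      # derivative of the sigmoid
    dz_dw = x                # derivative of w*x + b with respect to w
    return dL_da * da_dz * dz_dw

w, b, x, y = 0.5, 1.0, 3.0, 1.0   # assumed example values
analytic = grad_w(w, b, x, y)

# Central difference as an independent check
h = 1e-6
numeric = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)

print(f"chain rule gradient: {analytic:.8f}")
print(f"numerical gradient:  {numeric:.8f}")
```

The gradient is negative here, so a gradient descent step would increase $w$ and push the activation toward the target $y = 1$.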

#### 3. **Loss Minimization**
- **Objective:** Find parameters that minimize loss function
- **Method:** Gradient descent or variants (SGD, Adam)
- **Derivatives:** Tell us direction to move in parameter space

---

### 🔗 Connections to Other Topics

#### From Week 9 (Limits)
- Derivatives **defined using limits**: $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$
- Continuity is necessary (but not sufficient) for differentiability
- Limit laws enable computing derivatives

#### To Week 11 (Integration)
- Integration is the **inverse of differentiation**
- Fundamental Theorem of Calculus links them
- Applications: Area, accumulation, probability

#### To Statistics
- Probability density functions
- Maximum likelihood estimation
- Normal distribution derived using derivatives

---

### 💡 Problem-Solving Strategy

#### Differentiation Problems:
1. **Identify** which rule(s) to apply
2. **Apply** rules systematically (inside-out for chain rule)
3. **Simplify** the result
4. **Verify** using SymPy or numerical methods

#### Optimization Problems:
1. **Define** objective function $f(x)$
2. **Compute** $f'(x)$ and find critical points
3. **Classify** using second derivative test
4. **Check** endpoints if on closed interval
5. **Verify** answer makes sense in context

#### ML Applications:
1. **Define** loss function $L(\theta)$
2. **Compute** gradient $\nabla L(\theta)$
3. **Initialize** parameters
4. **Iterate** gradient descent: $\theta := \theta - \alpha \nabla L$
5. **Monitor** convergence
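
The five steps above can be sketched for a two-parameter linear regression. Everything specific here is an assumption for illustration: the synthetic data (generated from $y = 4 + 3x$ plus noise), the learning rate, the iteration count, and the variable names. The result is compared with NumPy's least-squares solver as a sanity check.

```python
import numpy as np

# Synthetic data (an assumption of this sketch): y = 4 + 3x plus Gaussian noise
rng = np.random.default_rng(0)
x_data = rng.uniform(0, 2, size=100)
y_data = 4 + 3 * x_data + rng.normal(0, 0.5, size=100)
X_design = np.column_stack([np.ones_like(x_data), x_data])  # column of ones for the intercept

# 1. Define the loss L(theta) = (1/2n) * sum((X theta - y)^2)
def loss(theta):
    residuals = X_design @ theta - y_data
    return residuals @ residuals / (2 * len(y_data))

# 2. Compute the gradient: (1/n) * X^T (X theta - y)
def gradient(theta):
    return X_design.T @ (X_design @ theta - y_data) / len(y_data)

# 3. Initialize parameters
theta = np.zeros(2)
alpha = 0.1
losses = [loss(theta)]

# 4. Iterate gradient descent: theta := theta - alpha * grad L
for step in range(2000):
    theta = theta - alpha * gradient(theta)
    losses.append(loss(theta))

# 5. Monitor convergence and compare with the closed-form least-squares fit
theta_ls = np.linalg.lstsq(X_design, y_data, rcond=None)[0]
print("gradient descent:", theta)
print("least squares:   ", theta_ls)
print(f"loss went from {losses[0]:.4f} to {losses[-1]:.4f}")
```

Recording the loss at every step is what makes step 5 possible: a curve that flattens out indicates convergence, while one that grows indicates a learning rate that is too large.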

---

### 🎓 Self-Assessment Checklist

Check off each item you can confidently do:

- [ ] Compute derivatives using limit definition
- [ ] Apply power rule to polynomials
- [ ] Use product rule for products of functions
- [ ] Apply quotient rule correctly
- [ ] Use chain rule for composite functions
- [ ] Find critical points by solving $f'(x) = 0$
- [ ] Classify critical points using first derivative test
- [ ] Classify critical points using second derivative test
- [ ] Find global extrema on closed intervals
- [ ] Set up and solve optimization word problems
- [ ] Implement gradient descent algorithm
- [ ] Understand connection to machine learning
- [ ] Apply derivatives to related rates problems
- [ ] Use implicit differentiation
- [ ] Understand backpropagation as chain rule

**Goal:** Check all boxes! If not, review relevant sections.

---

### 📚 Additional Resources

#### For Deeper Understanding:
- **3Blue1Brown:** "Essence of Calculus" YouTube series (visual intuition)
- **Khan Academy:** Calculus I and optimization
- **MIT OCW:** Single Variable Calculus (18.01)

#### For ML Applications:
- **Andrew Ng:** Machine Learning course (Coursera)
- **Deep Learning Book:** Chapter 4 (Numerical Computation)
- **Fast.ai:** Practical Deep Learning

#### Practice Problems:
- **Paul's Online Math Notes:** Calculus I
- **Stewart Calculus:** Classic textbook exercises
- **Kaggle:** ML competitions applying optimization

---

### 🚀 What's Next?

#### Week 11: Integration
- Antiderivatives and indefinite integrals
- Definite integrals and area under curves
- Fundamental Theorem of Calculus
- Applications to probability and statistics

#### Week 12: Advanced Integration
- Integration by substitution
- Integration by parts
- Numerical integration methods

#### Connection to Data Science:
- **Probability:** Integration for continuous distributions
- **Statistics:** Maximum likelihood estimation
- **Machine Learning:** Loss functions, regularization
- **Optimization:** Constrained optimization, Lagrange multipliers

---

### 🎉 Congratulations!

You've completed **Week 10: Derivatives**!

You now understand:
- ✅ How to compute derivatives using multiple methods
- ✅ How to find and classify critical points
- ✅ How optimization algorithms work
- ✅ The mathematical foundation of machine learning
- ✅ How calculus powers modern AI and data science

**Derivatives are everywhere in data science and machine learning. You've just learned one of the most powerful tools in mathematics!**

---

### 📝 Final Notes

**Key Insight:** Derivatives measure **rates of change**. In ML, we use derivatives to measure how loss changes with parameters, allowing us to optimize models.

**Practice Tip:** The best way to master derivatives is to solve many problems. Work through the practice problems, verify with code, and build intuition.

**Looking Ahead:** Integration (Week 11) completes the calculus foundation. Together, derivatives and integrals form the language of change and accumulation, which is essential for understanding probability, statistics, and advanced ML.

**Keep Learning!** 🚀📊🤖