# Module 06: Calculus Basics

**Difficulty**: ⭐⭐⭐ Advanced

**Estimated Time**: 90 minutes

**Prerequisites**: 
- Module 00: Setup and Introduction
- Basic understanding of functions and algebra

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand limits and continuity concepts
2. Calculate derivatives using various rules
3. Apply the chain rule and partial derivatives
4. Understand gradient descent optimization
5. Visualize optimization problems
6. Apply calculus concepts to neural network training

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import optimize
from matplotlib import animation
from IPython.display import HTML

# Configure visualization
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

# Set random seed
np.random.seed(42)

# Display options
np.set_printoptions(precision=4, suppress=True)
pd.set_option('display.precision', 4)

print("Setup complete!")

## 1. Limits and Continuity

**Limit**: The value a function approaches as the input approaches some value.

$$\lim_{x \to a} f(x) = L$$

**Continuity**: A function is continuous at $x=a$ if:
1. $f(a)$ exists
2. $\lim_{x \to a} f(x)$ exists
3. $\lim_{x \to a} f(x) = f(a)$
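
The three conditions can be probed numerically by evaluating a function very close to the point from each side. The sketch below is illustrative only: `one_sided_limits`, `looks_continuous`, `parabola`, and `step_like` are hypothetical names introduced here, and sampling near a point can suggest a limit but never prove one.

```python
def one_sided_limits(f, a, h=1e-6):
    """Estimate the left and right limits of f at a by sampling just beside a."""
    return f(a - h), f(a + h)


def looks_continuous(f, a, tol=1e-4):
    """Heuristic check of the three continuity conditions at x = a."""
    left, right = one_sided_limits(f, a)
    limit_exists = abs(left - right) < tol          # condition 2
    return limit_exists and abs(f(a) - left) < tol  # conditions 1 and 3


def parabola(x):
    return x**2


def step_like(x):
    # Hypothetical example: jumps from 0 to 1 at x = 0
    return 0.0 if x < 0 else 1.0


print(one_sided_limits(parabola, 2.0))   # both values are close to 4
print(looks_continuous(parabola, 2.0))   # True
print(looks_continuous(step_like, 0.0))  # False: the one-sided limits differ
```

The tolerance `tol` is an arbitrary choice; a function with a very small jump would slip past this check.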

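**Why important in ML:**
- Continuous loss functions give optimization landscapes without sudden jumps
- Convergence results for gradient descent typically assume a continuous, differentiable loss (continuity alone does not guarantee convergence)
- Most theoretical guarantees in ML start from these assumptions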

In [None]:
# Visualizing limits

def f(x):
    """Example function with a discontinuity"""
    return np.where(x < 0, x**2, 2*x + 1)

x = np.linspace(-3, 3, 1000)
y = f(x)

plt.figure(figsize=(12, 6))
plt.plot(x[x < 0], y[x < 0], 'b-', linewidth=2.5, label='f(x) = x² for x < 0')
plt.plot(x[x >= 0], y[x >= 0], 'r-', linewidth=2.5, label='f(x) = 2x + 1 for x ≥ 0')

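# Mark the point at x=0: open circle where the left branch ends (value not attained),
# filled circle at the actual value f(0) = 1
plt.scatter([0], [0], s=100, facecolors='white', edgecolors='blue', linewidths=2, zorder=5, marker='o')
plt.scatter([0], [1], s=100, c='red', zorder=5, marker='o')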

plt.axhline(y=0, color='k', linewidth=0.5)
plt.axvline(x=0, color='k', linewidth=0.5)
plt.xlabel('x', fontsize=12)
plt.ylabel('f(x)', fontsize=12)
plt.title('Piecewise Function with Discontinuity at x=0', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("The function jumps from 0 to 1 at x=0 (discontinuous).")
print("Left limit: lim(x→0⁻) f(x) = 0")
print("Right limit: lim(x→0⁺) f(x) = 1")

## 2. Derivatives

**Derivative**: The instantaneous rate of change of a function, equal to the slope of the tangent line at a point.

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

**Notation**: $f'(x)$, $\frac{df}{dx}$, $\frac{d}{dx}f(x)$
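
As a worked instance of the limit definition, take $f(x) = x^2$:

$$f'(x) = \lim_{h \to 0} \frac{(x+h)^2 - x^2}{h} = \lim_{h \to 0} \frac{2xh + h^2}{h} = \lim_{h \to 0} (2x + h) = 2x$$

The intermediate expression $2x + h$ also shows that, for this particular function, a finite-difference quotient with step $h$ overestimates the true derivative by exactly $h$.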

**Basic Rules:**
- Constant: $(c)' = 0$
- Power rule: $(x^n)' = nx^{n-1}$
- Sum: $(f+g)' = f' + g'$
- Product: $(fg)' = f'g + fg'$
- Quotient: $(\frac{f}{g})' = \frac{f'g - fg'}{g^2}$
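
The product and quotient rules can be sanity-checked against a finite-difference approximation. This is a minimal sketch, not a proof: the functions `f`, `g` and the helper `central_difference` are hypothetical examples chosen here, and their derivatives follow from the power, sum, and constant rules listed above.

```python
def central_difference(func, x, h=1e-6):
    """Approximate func'(x) with a symmetric finite difference."""
    return (func(x + h) - func(x - h)) / (2 * h)


# Hypothetical example functions and their derivatives
def f(x):
    return x**2          # f'(x) = 2x


def f_prime(x):
    return 2 * x


def g(x):
    return x**3 + 1      # g'(x) = 3x^2


def g_prime(x):
    return 3 * x**2


x = 1.2

# Product rule: (fg)' = f'g + fg'
product_rule = f_prime(x) * g(x) + f(x) * g_prime(x)
product_numeric = central_difference(lambda t: f(t) * g(t), x)

# Quotient rule: (f/g)' = (f'g - fg') / g^2
quotient_rule = (f_prime(x) * g(x) - f(x) * g_prime(x)) / g(x) ** 2
quotient_numeric = central_difference(lambda t: f(t) / g(t), x)

print(f"Product:  rule = {product_rule:.6f}, numeric = {product_numeric:.6f}")
print(f"Quotient: rule = {quotient_rule:.6f}, numeric = {quotient_numeric:.6f}")
```

In each pair the two numbers should agree to several decimal places; a large gap would point to a mistake in the hand-derived formula.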

**Why important:**
- Find minima/maxima of loss functions
- Gradient descent optimization
- Backpropagation in neural networks

In [None]:
# Numerical derivative approximation

def f(x):
    return x**2

def numerical_derivative(f, x, h=1e-5):
    """Approximate f'(x) with a forward finite difference (error shrinks in proportion to h)"""
    return (f(x + h) - f(x)) / h

# Test at x = 3
x_val = 3
analytical_deriv = 2 * x_val  # f'(x) = 2x for f(x) = x²
numerical_deriv = numerical_derivative(f, x_val)

print(f"Function: f(x) = x²")
print(f"Point: x = {x_val}")
print(f"\nAnalytical derivative f'({x_val}) = {analytical_deriv}")
print(f"Numerical derivative f'({x_val}) ≈ {numerical_deriv:.6f}")
print(f"Difference: {abs(analytical_deriv - numerical_deriv):.10f}")

In [None]:
# Visualize derivative as slope

x = np.linspace(-2, 4, 100)
y = x**2

# Point of interest
x0 = 1.5
y0 = x0**2
slope = 2 * x0  # Derivative at x0

# Tangent line: y - y0 = slope * (x - x0)
tangent_line = slope * (x - x0) + y0

plt.figure(figsize=(10, 7))
plt.plot(x, y, 'b-', linewidth=2.5, label='f(x) = x²')
plt.plot(x, tangent_line, 'r--', linewidth=2, label=f'Tangent line (slope = {slope:.2f})')
plt.scatter([x0], [y0], s=150, c='red', zorder=5, edgecolors='black', linewidths=2)
plt.text(x0 + 0.2, y0 + 0.3, f'({x0}, {y0:.2f})', fontsize=11, fontweight='bold')

plt.axhline(y=0, color='k', linewidth=0.5)
plt.axvline(x=0, color='k', linewidth=0.5)
plt.xlabel('x', fontsize=12)
plt.ylabel('f(x)', fontsize=12)
plt.title('Derivative as Slope of Tangent Line', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.xlim(-2, 4)
plt.ylim(-1, 10)
plt.tight_layout()
plt.show()

print(f"At x = {x0}, the derivative (slope) is {slope:.2f}")
print(f"This means the function is increasing at a rate of {slope:.2f} units per unit x.")

## 3. Chain Rule and Partial Derivatives

**Chain Rule**: For composite functions $f(g(x))$:

$$\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$$

**Example**: $h(x) = (2x + 1)^3$
- Let $u = 2x + 1$, so $h = u^3$
- $\frac{dh}{dx} = \frac{dh}{du} \cdot \frac{du}{dx} = 3u^2 \cdot 2 = 6(2x+1)^2$

**Partial Derivatives**: For functions of multiple variables $f(x, y)$, differentiate with respect to one variable while treating the others as constants:

$$\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h, y) - f(x, y)}{h}$$
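
Treating the other variables as constant translates directly into code: nudge one coordinate and leave the rest alone. The sketch below generalizes this to any number of variables. `numerical_gradient` and `example` are hypothetical names introduced for illustration, and the central-difference scheme is one reasonable choice among several.

```python
def numerical_gradient(f, point, h=1e-6):
    """Approximate every partial derivative of f at `point` with central differences."""
    gradient = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h    # nudge only coordinate i; the others stay fixed
        backward[i] -= h
        gradient.append((f(*forward) - f(*backward)) / (2 * h))
    return gradient


def example(x, y):
    # Hypothetical example: f(x, y) = x^2 * y, so df/dx = 2xy and df/dy = x^2
    return x**2 * y


grad = numerical_gradient(example, (3.0, 2.0))
print(grad)  # close to [12.0, 9.0]
```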

**Applications:**
- Backpropagation (chain rule through network layers)
- Gradient computation for optimization

In [None]:
# Chain rule example

# Function: h(x) = (2x + 1)³
def h(x):
    return (2*x + 1)**3

# Analytical derivative: h'(x) = 6(2x + 1)²
def h_prime(x):
    return 6 * (2*x + 1)**2

# Test
x_test = 2
analytical = h_prime(x_test)
numerical = numerical_derivative(h, x_test)

print("Function: h(x) = (2x + 1)³")
print(f"\nUsing chain rule:")
print(f"h'(x) = 3(2x + 1)² × 2 = 6(2x + 1)²")
print(f"\nAt x = {x_test}:")
print(f"Analytical: h'({x_test}) = {analytical}")
print(f"Numerical: h'({x_test}) ≈ {numerical:.6f}")
print(f"Match: {np.isclose(analytical, numerical)}")

In [None]:
# Partial derivatives

def f(x, y):
    """Example: f(x, y) = x² + 2xy + y²"""
    return x**2 + 2*x*y + y**2

# Partial derivatives
# ∂f/∂x = 2x + 2y
# ∂f/∂y = 2x + 2y

x0, y0 = 1, 2

# Numerical partial derivatives
# (the step is named eps so it does not overwrite the function h defined earlier)
eps = 1e-5
df_dx = (f(x0 + eps, y0) - f(x0, y0)) / eps
df_dy = (f(x0, y0 + eps) - f(x0, y0)) / eps

# Analytical
df_dx_analytical = 2*x0 + 2*y0
df_dy_analytical = 2*x0 + 2*y0

print(f"Function: f(x, y) = x² + 2xy + y²")
print(f"\nAt point ({x0}, {y0}):")
print(f"\n∂f/∂x = 2x + 2y")
print(f"Analytical: {df_dx_analytical}")
print(f"Numerical: {df_dx:.6f}")
print(f"\n∂f/∂y = 2x + 2y")
print(f"Analytical: {df_dy_analytical}")
print(f"Numerical: {df_dy:.6f}")
print(f"\nGradient vector: ∇f = [{df_dx_analytical}, {df_dy_analytical}]")

## 4. Gradient Descent Optimization

**Gradient**: Vector of partial derivatives

$$\nabla f = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right]$$

**Gradient Descent Algorithm**:
1. Start with initial guess $x_0$
2. Compute gradient $\nabla f(x_t)$
3. Update: $x_{t+1} = x_t - \alpha \nabla f(x_t)$
4. Repeat until convergence

where $\alpha$ is the learning rate.
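
The four steps above can be sketched for a function of several variables, including an explicit stopping test for "repeat until convergence". This is a sketch under stated assumptions: the gradient-norm tolerance is one common stopping criterion rather than the only one, and `gradient_descent`, `bowl_gradient`, and the objective itself are hypothetical examples.

```python
import math


def gradient_descent(grad, start, learning_rate=0.1, tol=1e-8, max_iters=10_000):
    """Start at x_0, compute the gradient, update, and repeat until the gradient is tiny."""
    x = list(start)
    for iteration in range(max_iters):
        g = grad(x)
        # Assumed convergence test: stop once the gradient norm falls below tol
        if math.sqrt(sum(gi**2 for gi in g)) < tol:
            return x, iteration
        x = [xi - learning_rate * gi for xi, gi in zip(x, g)]
    return x, max_iters


def bowl_gradient(p):
    # Hypothetical objective f(x, y) = (x - 1)^2 + 2(y + 2)^2
    x, y = p
    return [2 * (x - 1), 4 * (y + 2)]


minimum, iterations = gradient_descent(bowl_gradient, start=(0.0, 0.0))
print(minimum, iterations)  # minimum is close to [1.0, -2.0]
```

The `max_iters` cap is a safety net: with a learning rate that is too large the iterates can grow instead of shrink, and the tolerance test would never trigger.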

**Key insight**: The gradient points in the direction of steepest increase, so we step in the opposite direction to minimize.
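
The key insight can be checked numerically by measuring the rate of change along many directions and seeing which one wins. The sketch below uses $f(x, y) = x^2 + 2xy + y^2$ at the point $(1, 2)$, where the analytical gradient is $[6, 6]$. The names `surface` and `directional_derivative` are hypothetical, and scanning whole-degree angles is just a convenient discretization.

```python
import math


def surface(x, y):
    # f(x, y) = x^2 + 2xy + y^2, with gradient [2x + 2y, 2x + 2y]
    return x**2 + 2 * x * y + y**2


def directional_derivative(f, x, y, angle_deg, h=1e-6):
    """Rate of change of f at (x, y) along the unit vector at the given angle."""
    ux = math.cos(math.radians(angle_deg))
    uy = math.sin(math.radians(angle_deg))
    return (f(x + h * ux, y + h * uy) - f(x - h * ux, y - h * uy)) / (2 * h)


x0, y0 = 1.0, 2.0
rates = {angle: directional_derivative(surface, x0, y0, angle) for angle in range(360)}

ascent_angle = max(rates, key=rates.get)    # direction of fastest increase
descent_angle = min(rates, key=rates.get)   # direction of fastest decrease

# The gradient [6, 6] points along 45 degrees and has length sqrt(72)
gradient_angle = math.degrees(math.atan2(6.0, 6.0))
gradient_length = math.hypot(6.0, 6.0)

print(ascent_angle, descent_angle, gradient_angle)
print(rates[ascent_angle], gradient_length)
```

The fastest increase lines up with the gradient direction, its rate matches the gradient's length, and the fastest decrease lies in the exactly opposite direction, which is the direction gradient descent steps in.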

**Applications:**
- Training neural networks
- Optimizing loss functions
- Finding a local minimum of a differentiable function (a global minimum is guaranteed only in special cases, such as convex functions)

In [None]:
# Gradient descent on simple function

def f(x):
    """Objective function: f(x) = (x - 3)² + 5"""
    return (x - 3)**2 + 5

def df_dx(x):
    """Gradient: f'(x) = 2(x - 3)"""
    return 2 * (x - 3)

# Gradient descent
x = 0.0  # Starting point
learning_rate = 0.1
num_iterations = 20

history = [x]

print("=== Gradient Descent ===")
print(f"Function: f(x) = (x - 3)² + 5")
print(f"Starting point: x = {x}")
print(f"Learning rate: α = {learning_rate}")
print(f"\nIterations:")

for i in range(num_iterations):
    grad = df_dx(x)
    x = x - learning_rate * grad
    history.append(x)
    if i < 5 or i == num_iterations - 1:
        print(f"  {i+1}: x = {x:.6f}, f(x) = {f(x):.6f}, gradient = {grad:.6f}")

print(f"\nAfter {num_iterations} iterations: x = {x:.6f}")
print(f"Function value: f({x:.6f}) = {f(x):.6f}")
print(f"True minimum: x = 3.0, f(3) = 5.0")

In [None]:
# Visualize gradient descent

x_range = np.linspace(-1, 7, 200)
y_range = f(x_range)

plt.figure(figsize=(12, 7))
plt.plot(x_range, y_range, 'b-', linewidth=2.5, label='f(x) = (x-3)² + 5')

# Plot gradient descent path
history_array = np.array(history)
plt.plot(history_array, f(history_array), 'ro-', markersize=6, linewidth=1.5, 
        label='Gradient descent path', alpha=0.7)

# Mark start and end
plt.scatter([history[0]], [f(history[0])], s=200, c='green', marker='*', 
           zorder=5, edgecolors='black', linewidths=2, label='Start')
plt.scatter([history[-1]], [f(history[-1])], s=200, c='red', marker='*', 
           zorder=5, edgecolors='black', linewidths=2, label='End')

plt.xlabel('x', fontsize=12)
plt.ylabel('f(x)', fontsize=12)
plt.title('Gradient Descent Optimization', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Ran {len(history)-1} iterations; final x = {history[-1]:.6f} (true minimum at x = 3)")

## 5. Application to Neural Networks

In neural networks, calculus is used for:

**Forward pass**: Compute predictions
$$y = f(Wx + b)$$

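**Backward pass (Backpropagation)**: Compute gradients using the chain rule. Writing $z = Wx + b$ so that $y = f(z)$:
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial W}$$

where:
- $L$ is the loss function
- $W$ are the weights and $b$ is the bias
- $f$ is the activation function
- Each factor is one link in the chain; deeper networks add more links, one per layer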

**Optimization**: Update weights using gradient descent
$$W_{new} = W_{old} - \alpha \frac{\partial L}{\partial W}$$
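
A common way to build trust in hand-derived backpropagation formulas is a gradient check: compare the chain-rule gradient with a finite-difference estimate of the same quantity. The sketch below assumes a one-neuron model $y = \sigma(wx + b)$ with a mean squared error loss; the helper names, the toy data, and the step size `h` are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, xs, ys):
    """Mean squared error of the model y = sigmoid(w * x + b)."""
    return sum((sigmoid(w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def analytic_gradients(w, b, xs, ys):
    """Chain rule per sample: dL/dw = dL/dy * dy/dz * dz/dw, summed over the data."""
    dL_dw = dL_db = 0.0
    for x, y in zip(xs, ys):
        y_pred = sigmoid(w * x + b)
        dL_dy = 2 * (y_pred - y) / len(xs)
        dy_dz = y_pred * (1 - y_pred)   # derivative of the sigmoid
        dL_dw += dL_dy * dy_dz * x      # dz/dw = x
        dL_db += dL_dy * dy_dz          # dz/db = 1
    return dL_dw, dL_db


# Assumed toy data and parameters, chosen only for illustration
xs, ys = [0.5, 1.0, 1.5], [0.0, 1.0, 1.0]
w, b, h = 0.5, 0.0, 1e-6

dL_dw, dL_db = analytic_gradients(w, b, xs, ys)
numeric_dw = (loss(w + h, b, xs, ys) - loss(w - h, b, xs, ys)) / (2 * h)
numeric_db = (loss(w, b + h, xs, ys) - loss(w, b - h, xs, ys)) / (2 * h)

print(f"dL/dw: analytic = {dL_dw:.8f}, numeric = {numeric_dw:.8f}")
print(f"dL/db: analytic = {dL_db:.8f}, numeric = {numeric_db:.8f}")
```

If the analytic and numeric values disagree beyond a few decimal places, the usual suspect is a missing factor somewhere in the chain.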

In [None]:
# Simple neural network gradient example

# Simple model: y = sigmoid(wx + b)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

# Data
x = np.array([0.5, 1.0, 1.5])
y_true = np.array([0, 1, 1])

# Initial parameters
w = 0.5
b = 0.0
learning_rate = 0.5

print("=== Simple Neural Network Training ===")
print(f"Data: x = {x}, y_true = {y_true}")
print(f"\nInitial: w = {w}, b = {b}")
print(f"Learning rate: {learning_rate}\n")

# Training
for epoch in range(10):
    # Forward pass
    z = w * x + b
    y_pred = sigmoid(z)
    
    # Loss (MSE)
    loss = np.mean((y_pred - y_true)**2)
    
    # Backward pass (gradients)
    dL_dy = 2 * (y_pred - y_true) / len(x)
    dy_dz = sigmoid_derivative(z)
    dL_dz = dL_dy * dy_dz
    
    dL_dw = np.sum(dL_dz * x)
    dL_db = np.sum(dL_dz)
    
    # Update weights
    w = w - learning_rate * dL_dw
    b = b - learning_rate * dL_db
    
    if epoch < 3 or epoch == 9:
        print(f"Epoch {epoch+1}: Loss = {loss:.6f}, w = {w:.4f}, b = {b:.4f}")

print(f"\nFinal predictions: {sigmoid(w * x + b)}")
print(f"True values: {y_true}")

## 6. Practice Exercises

### Exercise 1: Compute Derivatives

Compute the derivatives of the following functions:

1. $f(x) = 3x^4 - 2x^2 + 5$
2. $g(x) = \sin(2x)$ (use chain rule)
3. $h(x, y) = x^2y + xy^2$ (partial derivatives)

In [None]:
# Exercise 1 Solution
print("=== Exercise 1 Solution ===\n")

# 1. f(x) = 3x⁴ - 2x² + 5
print("1. f(x) = 3x⁴ - 2x² + 5")
print("   Using power rule:")
print("   f'(x) = 12x³ - 4x")
print("   At x = 2:")
x = 2
f_prime = 12 * x**3 - 4 * x
print(f"   f'(2) = {f_prime}")

# 2. g(x) = sin(2x)
print("\n2. g(x) = sin(2x)")
print("   Using chain rule:")
print("   g'(x) = cos(2x) × 2 = 2cos(2x)")
print("   At x = π/4:")
x = np.pi / 4
g_prime = 2 * np.cos(2 * x)
print(f"   g'(π/4) = {g_prime:.6f}")

# 3. h(x, y) = x²y + xy²
print("\n3. h(x, y) = x²y + xy²")
print("   Partial derivatives:")
print("   ∂h/∂x = 2xy + y²")
print("   ∂h/∂y = x² + 2xy")
print("   At (x, y) = (1, 2):")
x, y = 1, 2
dh_dx = 2*x*y + y**2
dh_dy = x**2 + 2*x*y
print(f"   ∂h/∂x = {dh_dx}")
print(f"   ∂h/∂y = {dh_dy}")
print(f"   Gradient: ∇h = [{dh_dx}, {dh_dy}]")

### Exercise 2: Gradient Descent

Use gradient descent to minimize: $f(x) = x^4 - 4x^2 + 5$

Tasks:
1. Compute the derivative
2. Implement gradient descent
3. Find local minima starting from different points
4. Visualize the optimization paths

In [None]:
# Exercise 2 Solution
def f_ex2(x):
    return x**4 - 4*x**2 + 5

def df_ex2(x):
    return 4*x**3 - 8*x

def gradient_descent_ex2(start_x, lr=0.1, iterations=50):
    x = start_x
    path = [x]
    for _ in range(iterations):
        grad = df_ex2(x)
        x = x - lr * grad
        path.append(x)
    return np.array(path)

print("=== Exercise 2 Solution ===")
print("Function: f(x) = x⁴ - 4x² + 5")
print("Derivative: f'(x) = 4x³ - 8x\n")

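# Try different starting points
# A small learning rate keeps each path inside the basin it starts in; with lr=0.1
# the first step from x = ±2.5 overshoots the hill at x = 0 and lands in the opposite basin
starting_points = [-2.5, 0.5, 2.5]
paths = []

for start in starting_points:
    path = gradient_descent_ex2(start, lr=0.01, iterations=200)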
    paths.append(path)
    print(f"Starting from x = {start}:")
    print(f"  Converged to: x = {path[-1]:.6f}")
    print(f"  Minimum value: f(x) = {f_ex2(path[-1]):.6f}\n")

# Visualize
x_range = np.linspace(-3, 3, 300)
y_range = f_ex2(x_range)

plt.figure(figsize=(12, 7))
plt.plot(x_range, y_range, 'b-', linewidth=2.5, label='f(x) = x⁴ - 4x² + 5')

colors = ['red', 'green', 'orange']
for i, (start, path) in enumerate(zip(starting_points, paths)):
    plt.plot(path, f_ex2(path), 'o-', color=colors[i], markersize=4, linewidth=1.5,
            alpha=0.7, label=f'Path from x={start}')
    plt.scatter([path[-1]], [f_ex2(path[-1])], s=150, c=colors[i], 
               marker='*', zorder=5, edgecolors='black', linewidths=2)

plt.xlabel('x', fontsize=12)
plt.ylabel('f(x)', fontsize=12)
plt.title('Gradient Descent from Different Starting Points', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Notice: The function has two minima at x = ±√2 ≈ ±1.414, both with f(x) = 1")
print("Gradient descent finds different minima depending on starting point!")

## 7. Summary and Key Takeaways

In this module, you learned:

✅ **Limits and Continuity**
- Foundation of calculus
- Continuous loss functions avoid sudden jumps in the optimization landscape
- Underlie the assumptions behind convergence guarantees

✅ **Derivatives**
- Rate of change of functions
- Power rule, chain rule, product rule
- Geometric interpretation as slope

✅ **Partial Derivatives**
- Derivatives for multivariable functions
- Gradient vector points uphill
- Essential for multivariate optimization

✅ **Gradient Descent**
- Iterative optimization algorithm
- Update: $x_{t+1} = x_t - \alpha \nabla f(x_t)$
- Foundation of neural network training

✅ **Applications to ML**
- Backpropagation uses chain rule
- Weight updates use gradient descent
- All modern deep learning relies on these concepts

### What's Next?

In **Module 07: Final Project**, you'll:
- Combine all mathematical concepts learned
- Analyze a real dataset
- Implement ML algorithms from scratch
- Apply statistics, linear algebra, and calculus together

### Additional Resources

- [3Blue1Brown - Essence of Calculus](https://www.youtube.com/playlist?list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr)
- [Khan Academy - Calculus](https://www.khanacademy.org/math/calculus-1)
- [MIT OpenCourseWare - Single Variable Calculus](https://ocw.mit.edu/courses/18-01-single-variable-calculus-fall-2006/)
- [Gradient Descent Visualization](https://www.benfrederickson.com/numerical-optimization/)

---

**Phenomenal work!** You now understand the calculus behind machine learning optimization and neural network training.

**Next**: Proceed to `07_final_project.ipynb`