# Calculus and Optimization for Machine Learning

This notebook covers fundamental calculus concepts and optimization techniques essential for machine learning, including limits, continuity, derivatives, partial derivatives, gradients, and multivariate calculus with practical applications in ML.

## Introduction

Calculus forms the mathematical foundation for optimization in machine learning. Understanding derivatives, gradients, and optimization methods is crucial for training ML models effectively. This notebook explores:

- Limits and Continuity
- Derivatives and their applications
- Partial Derivatives and Gradients
- Multivariate Calculus
- Optimization techniques (Gradient Descent)
- Real-world ML applications

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import linprog

## Limits

**Definition:**
A limit describes the value a function approaches as the input approaches some point.

**Formally:**
$$\lim_{x \to a} f(x) = L$$

This means that as $x$ gets arbitrarily close to $a$, $f(x)$ gets arbitrarily close to $L$.
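
A limit can be probed numerically by evaluating a function at inputs that move closer and closer to the target point. The sketch below uses $\sin(x)/x$ as an illustrative function (it is an assumption for this example, not something used elsewhere in the notebook): it is undefined at $x = 0$, yet its values approach 1.

```python
import numpy as np

# Illustrative function (an assumption for this sketch): undefined at x = 0,
# but its values approach 1 as x approaches 0.
def g(x):
    return np.sin(x) / x

# Inputs that shrink toward 0 from the right
xs = np.array([1.0, 0.1, 0.01, 0.001])
values = g(xs)

for x, v in zip(xs, values):
    print(f"x = {x:<6} g(x) = {v:.8f}")
```

A table like this suggests the limit but does not prove it; the formal definition is still what guarantees the result.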

**Why in ML?**
- Helps in understanding optimization behavior (e.g., convergence in gradient descent)
- Essential for analyzing algorithm convergence rates
- Used to understand the behavior of loss functions at critical points

## Continuity

**Definition:**
A function $f(x)$ is continuous at a point $x = a$ if:

$$\lim_{x \to a} f(x) = f(a)$$

This means:
1. $f(a)$ is defined
2. $\lim_{x \to a} f(x)$ exists
3. The limit equals the function value

**Why in ML?**
- Ensures smoothness in loss functions for stable optimization
- Continuous functions are easier to optimize
- Discontinuities can cause gradient descent to fail or converge slowly
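
One rough way to see the difference between a continuous and a discontinuous function is to compare values just to the left and just to the right of a point. The sketch below uses a step function and softplus as hypothetical examples; neither appears elsewhere in this notebook.

```python
import numpy as np

def step(x):
    # Jumps from 0 to 1 at x = 0, so it is discontinuous there
    return float(np.where(x >= 0, 1.0, 0.0))

def softplus(x):
    # log(1 + e^x): smooth and continuous everywhere
    return float(np.log1p(np.exp(x)))

eps = 1e-6
gaps = {}
for name, fn in [("step", step), ("softplus", softplus)]:
    left, right = fn(-eps), fn(eps)
    gaps[name] = abs(right - left)
    print(f"{name:8s} left = {left:.6f}, right = {right:.6f}, gap = {gaps[name]:.6f}")
```

For the step function the gap stays at 1 however small `eps` becomes, while for softplus it shrinks with `eps`.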

## Derivatives

**Definition:**
The derivative measures the rate of change of a function with respect to its input:

$$f'(x) = \frac{df}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

**Interpretation:**
- The derivative represents the slope of the tangent line to the function at a point
- It tells us how the function changes when we make a small change to the input
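
The limit definition can be approximated directly by picking a small but nonzero $h$. The following minimal sketch assumes the example function $f(x) = x^3$ (chosen only for illustration) and compares the forward difference quotient with the exact derivative $3x^2$.

```python
def f(x):
    return x**3

def forward_difference(func, x, h):
    # Difference quotient from the definition, with a finite h
    return (func(x + h) - func(x)) / h

x0 = 2.0
exact = 3 * x0**2  # analytic derivative of x^3 at x0

step_sizes = (1e-1, 1e-3, 1e-5)
approximations = [forward_difference(f, x0, h) for h in step_sizes]

for h, approx in zip(step_sizes, approximations):
    print(f"h = {h:.0e}: approx = {approx:.6f}, error = {abs(approx - exact):.2e}")
```

The error shrinks as $h$ gets smaller, which is the limit in the definition at work.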

**Why in ML?**
- Used in gradient descent to minimize loss functions
- Essential for backpropagation in neural networks
- Helps identify optimal parameters for models

## Gradient Descent - Core Optimization Algorithm

**What is Gradient Descent?**

Gradient descent is an optimization algorithm used in machine learning and other fields to find the minimum of a function by iteratively moving in the direction of the steepest decrease.

**How it Works:**

**Problem:** Minimize a loss function $L(\theta)$

**Solution:** Update rule using derivatives:

$$\theta_{new} = \theta_{old} - \alpha \frac{dL}{d\theta}$$

where:
- $\theta$ represents the parameters we want to optimize
- $\alpha$ is the learning rate (step size)
- $\frac{dL}{d\theta}$ is the derivative of the loss with respect to $\theta$ (the gradient, when $\theta$ is a vector)

**Example:** Linear regression uses derivatives to find optimal weights.

**Key Concepts:**
- Gradient descent is a family of algorithms used to search for parameters that minimize the loss function
- It iteratively adjusts parameters in the direction opposite to the gradient
- The learning rate controls how big each step is
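
To make the role of the learning rate concrete, the sketch below applies the update rule to the illustrative loss $L(\theta) = \theta^2$ (an assumption for this example) with three different values of $\alpha$. The helper name `run_gradient_descent` is hypothetical.

```python
def run_gradient_descent(grad, theta0, alpha, steps):
    # Repeatedly apply: theta_new = theta_old - alpha * dL/dtheta
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

def grad(theta):
    # Derivative of L(theta) = theta^2
    return 2 * theta

results = {
    alpha: run_gradient_descent(grad, theta0=1.0, alpha=alpha, steps=20)
    for alpha in (0.01, 0.1, 1.1)
}

for alpha, theta in results.items():
    print(f"alpha = {alpha:<4}: theta after 20 steps = {theta:.4f}")
```

A very small rate moves slowly toward the minimum at 0, a moderate rate gets close within 20 steps, and a rate that is too large overshoots on every step and moves away from the minimum.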

### Mean Squared Error (MSE) / L2 Loss

**Definition:**

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where:
- $y_i$ is the actual target value
- $\hat{y}_i$ is the predicted value
- $n$ is the number of samples

**Properties:**
- Squaring the difference between predictions and actual target values results in a higher penalty assigned to larger deviations from the target value
- Taking the mean normalizes the total error against the number of samples in a dataset
- Commonly used in regression tasks
- Differentiable, making it suitable for gradient-based optimization
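
As a small worked example, MSE and its gradient with respect to the predictions can be computed in a few lines of NumPy. The target and prediction values below are made up for illustration.

```python
import numpy as np

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred
mse = np.mean(errors**2)

# dMSE/d(y_pred_i) = -(2/n) * (y_i - y_pred_i)
grad_wrt_pred = -2 * errors / len(y_true)

print(f"Squared errors: {errors**2}")
print(f"MSE: {mse}")
print(f"Gradient w.r.t. predictions: {grad_wrt_pred}")
```

The sample with the largest deviation (an error of 1.0) contributes 1.0 of the 1.5 total squared error, which is the extra penalty on large deviations described above.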

## Partial Derivatives

**Definition:**

For a function of multiple variables $f(x, y, z, ...)$, a partial derivative measures the rate of change with respect to one variable while keeping others constant.

**Notation:**

$$\frac{\partial f}{\partial x}, \quad \frac{\partial f}{\partial y}, \quad \frac{\partial f}{\partial z}$$

**Example:**

For $f(x, y) = x^2 + 3xy + y^2$:

$$\frac{\partial f}{\partial x} = 2x + 3y$$

$$\frac{\partial f}{\partial y} = 3x + 2y$$
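
The two partial derivatives can be checked numerically by nudging one variable while holding the other fixed. This is a minimal sketch using central differences; the evaluation point $(1, 2)$ and the helper names are arbitrary choices for the example.

```python
def f(x, y):
    return x**2 + 3*x*y + y**2

def partial_x(func, x, y, h=1e-6):
    # Vary x only; y is held constant
    return (func(x + h, y) - func(x - h, y)) / (2 * h)

def partial_y(func, x, y, h=1e-6):
    # Vary y only; x is held constant
    return (func(x, y + h) - func(x, y - h)) / (2 * h)

x0, y0 = 1.0, 2.0
numeric_dx = partial_x(f, x0, y0)
numeric_dy = partial_y(f, x0, y0)

print(f"df/dx: numeric = {numeric_dx:.6f}, analytic = {2*x0 + 3*y0:.6f}")
print(f"df/dy: numeric = {numeric_dy:.6f}, analytic = {3*x0 + 2*y0:.6f}")
```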

**Why in ML?**
- Training neural networks requires computing partial derivatives for each weight
- Each parameter is updated independently based on its partial derivative
- Essential for backpropagation algorithm

## Gradients

**Definition:**

The gradient is a vector of all partial derivatives of a multivariable function:

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

**Properties:**
- The gradient points in the direction of steepest ascent
- The negative gradient points in the direction of steepest descent
- The magnitude of the gradient indicates how steep the slope is
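
The steepest-ascent property can be checked by comparing the slope of the function along many unit directions. The sketch below assumes the function $f(x, y) = x^2 + 3xy + y^2$ at the point $(1, 2)$ and scans 360 evenly spaced directions; the setup is illustrative only.

```python
import numpy as np

def grad_f(p):
    # Gradient of f(x, y) = x^2 + 3xy + y^2
    x, y = p
    return np.array([2*x + 3*y, 3*x + 2*y])

p = np.array([1.0, 2.0])
g = grad_f(p)
g_unit = g / np.linalg.norm(g)

# Unit vectors at one-degree spacing around the circle
angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
directions = np.column_stack([np.cos(angles), np.sin(angles)])

# Directional derivative along each unit vector u is grad . u
slopes = directions @ g
best = directions[np.argmax(slopes)]
worst = directions[np.argmin(slopes)]

print(f"Gradient direction:        {g_unit}")
print(f"Steepest ascent direction: {best}")
print(f"Steepest descent direction:{worst}")
print(f"Largest slope = {slopes.max():.4f}, gradient norm = {np.linalg.norm(g):.4f}")
```

The direction with the largest slope lines up with the gradient, the direction with the most negative slope lines up with the negative gradient, and the largest slope is close to the gradient's magnitude.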

**Why in ML?**
- Gradient descent uses the negative gradient to minimize loss functions
- Backpropagation in deep learning computes gradients efficiently
- Essential for optimizing high-dimensional parameter spaces

## Real Example in ML: Neural Networks

**Problem:** Optimize weights $W$ in a neural network

**Solution:** Compute gradients of the loss with respect to each weight

**Backpropagation:**
- Uses the chain rule to compute gradients efficiently
- Propagates error backwards through the network
- Updates each weight based on its contribution to the total error

**Update Rule:**

$$W_{new} = W_{old} - \alpha \frac{\partial L}{\partial W}$$

where $L$ is the loss function and $\alpha$ is the learning rate.
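
The chain-rule computation can be written out by hand for the smallest possible case. The sketch below assumes a single sigmoid neuron with squared-error loss, $L = (\sigma(wx + b) - y)^2$, and made-up values for the input, target, and parameters. It is not a full backpropagation implementation, only one forward pass, one backward pass, and one update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    return (sigmoid(w * x + b) - y) ** 2

# Made-up input, target, parameters, and learning rate
x, y = 1.5, 1.0
w, b = 0.4, -0.2
alpha = 0.5

# Forward pass
z = w * x + b
a = sigmoid(z)

# Backward pass: chain rule, one factor per stage
dL_da = 2 * (a - y)      # loss w.r.t. activation
da_dz = a * (1 - a)      # sigmoid derivative
dL_dw = dL_da * da_dz * x
dL_db = dL_da * da_dz

# Numerical check of dL/dw with a central difference
h = 1e-6
numeric_dw = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)

# One gradient descent update
w_new = w - alpha * dL_dw
b_new = b - alpha * dL_db

print(f"dL/dw: chain rule = {dL_dw:.6f}, numeric = {numeric_dw:.6f}")
print(f"Loss before = {loss(w, b, x, y):.6f}, after = {loss(w_new, b_new, x, y):.6f}")
```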

## Multivariate Calculus

**Multivariate Functions:**

**Definition:**
- Functions with multiple inputs, e.g., $f(x, y, z)$

**Why in ML?**
- Most ML models have high-dimensional parameter spaces
- Image recognition models process pixel arrays (often thousands to millions of dimensions)
- Natural language models handle word embeddings in high-dimensional spaces
- Requires understanding of partial derivatives and gradients

## Optimization Basics

**Goal:** Find the minimum (or maximum) of a function

**Key Concepts:**

1. **Local Minimum:** A point where the function value is no larger than at any nearby point
2. **Global Minimum:** The smallest value of the function over its entire domain
3. **Critical Points:** Points where the gradient is zero ($\nabla f = 0$)
4. **Saddle Points:** Critical points that are neither minima nor maxima
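
One common way to tell these cases apart at a critical point is the second-derivative test: inspect the eigenvalues of the Hessian (the matrix of second partial derivatives). The sketch below uses a hypothetical helper, `classify_critical_point`, and applies it to the Hessians of three simple functions at the origin.

```python
import numpy as np

def classify_critical_point(hessian):
    # Eigenvalues of a symmetric matrix, in ascending order
    eigenvalues = np.linalg.eigvalsh(hessian)
    if np.all(eigenvalues > 0):
        return "local minimum"
    if np.all(eigenvalues < 0):
        return "local maximum"
    if np.any(eigenvalues > 0) and np.any(eigenvalues < 0):
        return "saddle point"
    return "inconclusive"  # some eigenvalue is zero

# Hessians at (0, 0), where each function has zero gradient
hessians = {
    "x^2 + y^2":  np.array([[2.0, 0.0], [0.0, 2.0]]),
    "-x^2 - y^2": np.array([[-2.0, 0.0], [0.0, -2.0]]),
    "x^2 - y^2":  np.array([[2.0, 0.0], [0.0, -2.0]]),
}

for name, H in hessians.items():
    print(f"f(x, y) = {name:11s} -> {classify_critical_point(H)}")
```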

**Optimization Methods:**
- Gradient Descent: First-order method using gradients
- Newton's Method: Second-order method using Hessian matrix
- Constrained Optimization: Using Lagrange multipliers
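
To contrast the first two methods, the sketch below takes a single step of each on the quadratic $f(x) = x^2 + 3x + 2$, starting from $x = 0$. In one dimension the Hessian reduces to the second derivative $f''(x)$. The starting point and learning rate are arbitrary choices for the example.

```python
def f_prime(x):
    # First derivative of f(x) = x^2 + 3x + 2
    return 2 * x + 3

def f_second(x):
    # Second derivative (the 1-D Hessian); constant for a quadratic
    return 2.0

x0 = 0.0
alpha = 0.1

# Gradient descent: step size set by the learning rate
x_gd = x0 - alpha * f_prime(x0)

# Newton's method: step scaled by the inverse second derivative
x_newton = x0 - f_prime(x0) / f_second(x0)

print(f"After one gradient descent step: x = {x_gd:.4f}")
print(f"After one Newton step:           x = {x_newton:.4f}")
```

On a quadratic, Newton's method reaches the minimizer $x = -1.5$ in one step because the curvature is constant. On general functions it takes more steps and needs second derivatives, which can be costly in high dimensions.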

## Real Example in ML: Support Vector Machines (SVM)

**Problem:** Maximize the margin between classes

**Solution:** Use Lagrange multipliers (multivariate optimization)

**Formulation:**
- Find the hyperplane that maximally separates two classes
- Subject to constraints that all points are correctly classified
- Involves solving a constrained optimization problem

**Mathematical Setup:**
$$\min_{w,b} \frac{1}{2} \|w\|^2$$

Subject to: $y_i(w^T x_i + b) \geq 1$ for all $i$
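
The sketch below does not solve the SVM problem. It only evaluates the objective and the constraints for two hand-picked candidate hyperplanes on a made-up toy dataset, to show what "feasible" and "smaller $\|w\|$" mean here. The helper `evaluate` and all data values are assumptions for illustration.

```python
import numpy as np

# Made-up, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

def evaluate(w, b):
    # Constraint values y_i * (w^T x_i + b); each must be >= 1
    margins = y * (X @ w + b)
    return {
        "feasible": bool(np.all(margins >= 1 - 1e-12)),
        "objective": 0.5 * float(w @ w),       # (1/2) * ||w||^2
        "width": 2.0 / np.linalg.norm(w),      # distance between margin lines
    }

candidate_a = evaluate(np.array([0.5, 0.5]), -1.0)
candidate_b = evaluate(np.array([1.0, 1.0]), -2.0)

print("Candidate A:", candidate_a)
print("Candidate B:", candidate_b)
```

Both candidates satisfy every constraint, but candidate A has the smaller objective and therefore the wider margin, which is why the formulation minimizes $\frac{1}{2}\|w\|^2$.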

## Summary of Key Concepts

**Core Principles:**
- **Limits & Derivatives:** Foundation for optimization
- **Partial Derivatives & Gradients:** Key for training ML models
- **Multivariate Calculus:** Essential for high-dimensional optimization

**Applications in ML:**
- Gradient descent for minimizing loss functions
- Backpropagation for training neural networks
- Constrained optimization for SVMs
- Convergence analysis for optimization algorithms

## Comprehensive Concept Reference Table

| Concept | Definition | Use in ML & Optimization | Real-Life Example |
|---------|-----------|-------------------------|-------------------|
| **Limits** | The value a function approaches as input approaches a point | Analyzing model convergence (e.g., gradient descent) | Predicting stock prices as time intervals shrink (instantaneous rate of change) |
| **Continuity** | A function is continuous if small input changes cause small output changes | Ensuring smooth loss functions for stable training | Temperature sensor readings over time (no sudden jumps) |
| **Derivatives** | Measures the rate of change of a function w.r.t. its input | Gradient descent for minimizing loss (e.g., linear regression) | Optimizing delivery routes by adjusting speed to minimize fuel costs |
| **Partial Derivatives** | Derivative of a multivariable function w.r.t. one variable (others fixed) | Training neural networks (updating individual weights) | Adjusting thermostat settings for energy efficiency (keeping other factors constant) |
| **Gradient** | Vector of partial derivatives of a multivariable function | Direction of steepest ascent/descent in optimization (e.g., backpropagation in deep learning) | Robot path planning to avoid obstacles (following the steepest safety gradient) |
| **Multivariate Calculus** | Calculus applied to functions with multiple variables | Handling high-dimensional data (e.g., image recognition with pixel arrays) | Weather forecasting using multiple variables (temperature, humidity, pressure) |
| **Optimization** | Finding minima/maxima of functions | Training all ML models (e.g., gradient descent, Adam optimizer) | Warehouse inventory management (minimizing costs while meeting demand) |

# Practical Exercises and Solutions

This section contains hands-on exercises to practice calculus and optimization concepts in machine learning contexts.

## Exercise 1: Gradient Descent for a Simple Quadratic Function

**Problem:** Minimize the function $f(x) = x^2 + 3x + 2$ using gradient descent.

**Steps:**
1. Define the function $f(x) = x^2 + 3x + 2$
2. Compute the gradient (derivative): $f'(x) = 2x + 3$
3. Initialize $x$ with a starting value
4. Iteratively update: $x_{new} = x_{old} - \alpha \cdot f'(x_{old})$
5. Repeat until convergence

**Expected Output:** The value of $x$ converges to $-1.5$ (the minimum of the quadratic function).
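
The expected minimizer, and how quickly the iterates approach it, can be worked out by hand. Setting the derivative to zero gives the minimum:

$$f'(x) = 2x + 3 = 0 \;\Rightarrow\; x^* = -\tfrac{3}{2}, \qquad f(x^*) = \tfrac{9}{4} - \tfrac{9}{2} + 2 = -\tfrac{1}{4}$$

Substituting the update rule shows that the distance to $x^*$ shrinks by a constant factor at every step:

$$x_{k+1} - x^* = x_k - \alpha(2x_k + 3) + \tfrac{3}{2} = (1 - 2\alpha)(x_k - x^*)$$

With $\alpha = 0.1$ and $x_0 = 0$ (the values used in the cell below), the factor is $0.8$, so after 20 steps $x_{20} - x^* = 0.8^{20} \cdot 1.5 \approx 0.0173$ and $x_{20} \approx -1.4827$. The iterates approach $-1.5$ but do not reach it exactly in a finite number of steps.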

In [None]:
import numpy as np

# Function to minimize: f(x) = x² + 3x + 2
def f(x):
    return x**2 + 3*x + 2

# Gradient of f(x): df/dx = 2x + 3
def gradient(x):
    return 2*x + 3

# Gradient Descent
x = 0  # Initial guess
learning_rate = 0.1
steps = 20

print("Gradient Descent for f(x) = x² + 3x + 2")
print("="*50)

for i in range(steps):
    grad = gradient(x)
    x = x - learning_rate * grad
    print(f"Step {i+1}: x = {x:.4f}, f(x) = {f(x):.4f}")

print("\n" + "="*50)
print(f"Optimal x: {x:.4f}")
print(f"Minimum value: {f(x):.4f}")
print(f"\nAnalytical minimum: x = -1.5, f(-1.5) = {f(-1.5):.4f}")

## Exercise 2: Gradient Descent for Linear Regression (Single Variable)

**Problem:** Fit a line $y = mx + b$ to some data using gradient descent.

**Data:**
```python
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
```

**Steps:**
1. Initialize parameters $m$ (slope) and $b$ (intercept) to 0
2. Define the prediction: $\hat{y} = mx + b$
3. Use MSE loss: $L = \frac{1}{n} \sum (\hat{y}_i - y_i)^2$
4. Compute gradients:
   - $\frac{\partial L}{\partial m} = \frac{2}{n} \sum (\hat{y}_i - y_i) \cdot x_i$
   - $\frac{\partial L}{\partial b} = \frac{2}{n} \sum (\hat{y}_i - y_i)$
5. Update parameters using gradient descent

**Expected Output:** The line should converge to $y = 2x + 0$ ($m \approx 2.0$ and $b \approx 0.0$).
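
As a cross-check on the expected output, the same line can be obtained from a closed-form least-squares fit. This sketch uses `np.polyfit` with degree 1, which returns the coefficients from highest degree to lowest (slope first, then intercept). It is a sanity check, not part of the gradient descent exercise.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# Degree-1 least-squares fit: returns [slope, intercept]
slope, intercept = np.polyfit(X, y, 1)

print(f"Least-squares slope:     {slope:.4f}")
print(f"Least-squares intercept: {intercept:.4f}")
```

Gradient descent should move toward these values; how close it gets depends on the learning rate and the number of steps.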

In [None]:
import numpy as np

# Sample data
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# Initialize parameters (slope m and intercept b)
m, b = 0.0, 0.0
learning_rate = 0.01
# With this learning rate the intercept converges slowly:
# 100 steps stop near m ≈ 1.90, b ≈ 0.37, so more steps are used here.
steps = 2000

print("Linear Regression using Gradient Descent")
print("="*60)
print(f"Data: X = {X}")
print(f"      y = {y}")
print("\nTraining Progress:")
print("="*60)

# Gradient Descent
for step in range(steps):
    # Predictions: y_pred = m*X + b
    y_pred = m * X + b

    # Gradients (MSE loss)
    grad_m = np.mean(2 * (y_pred - y) * X)  # dL/dm
    grad_b = np.mean(2 * (y_pred - y))      # dL/db

    # Update m and b
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b

    if step % 200 == 0:
        # Report the loss at the updated parameters
        loss = np.mean((m * X + b - y)**2)
        print(f"Step {step:4d}: m = {m:.4f}, b = {b:.4f}, Loss = {loss:.4f}")

print("\n" + "="*60)
print(f"\nFinal line: y = {round(m, 2)} * x + {round(b, 2)}")
print(f"\nPredictions vs Actual:")
y_final = m * X + b
for i in range(len(X)):
    print(f"x={X[i]}: predicted={y_final[i]:.2f}, actual={y[i]}")

## Exercise 3: Visualizing Gradient Descent

**Problem:** Plot how the loss decreases over time during gradient descent.

**Function:** Use the simple function $f(x) = x^2$

**Visualization:**
- Track the value of $x$ and $f(x)$ at each iteration
- Plot the convergence of the loss over iterations
- Show how the algorithm finds the minimum

**Expected Output:** A plot showing the loss decreasing and converging to the minimum.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Function: f(x) = x²
def f(x):
    return x**2

# Gradient: df/dx = 2x
def gradient(x):
    return 2*x

# Gradient Descent
x = 3  # Start at x=3
learning_rate = 0.1
steps = 20

x_history = []
loss_history = []

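for step in range(steps):
    # Record the state before each update
    x_history.append(x)
    loss_history.append(f(x))

    grad = gradient(x)
    x = x - learning_rate * grad

# Record the state after the last update so the history ends at the final x
x_history.append(x)
loss_history.append(f(x))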

# Plotting
plt.plot(loss_history, 'bo-')
plt.xlabel("Step")
plt.ylabel("Loss (f(x))")
plt.title("Gradient Descent Convergence")
plt.show()

print(f"Starting position: x = {x_history[0]:.4f}, f(x) = {loss_history[0]:.4f}")
print(f"Final position: x = {x_history[-1]:.4f}, f(x) = {loss_history[-1]:.4f}")
print(f"\nConverged to minimum at x ≈ 0")

## Exercise 4: Linear Programming Optimization

**Problem:** Solve a constrained optimization problem using linear programming.

**Objective:** Maximize $f(x, y) = x + 2y$

**Subject to constraints:**
- $2x + y \leq 20$
- $-4x + 5y \leq 10$
- $x - 2y \leq 2$
- $x \geq 0, y \geq 0$

**Method:** Use scipy's linear programming solver (linprog)

**Note:** linprog minimizes by default, so we minimize $-x - 2y$ to maximize $x + 2y$
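
For a small problem like this one, the answer from the solver can be cross-checked by brute force. When the feasible region is bounded, as it is here, a linear objective attains its maximum at a vertex, and each vertex is the intersection of two constraint boundaries. The sketch below enumerates those intersections; it is an illustrative check under that boundedness assumption, not how `linprog` works internally.

```python
import itertools
import numpy as np

# All constraints as rows of A @ [x, y] <= b,
# with x >= 0 and y >= 0 rewritten as -x <= 0 and -y <= 0
A = np.array([[2, 1], [-4, 5], [1, -2], [-1, 0], [0, -1]], dtype=float)
b = np.array([20, 10, 2, 0, 0], dtype=float)
c = np.array([1.0, 2.0])  # objective to maximize: x + 2y

best_value, best_point = -np.inf, None
for i, j in itertools.combinations(range(len(A)), 2):
    M = A[[i, j]]
    if abs(np.linalg.det(M)) < 1e-12:
        continue  # parallel boundaries never intersect
    point = np.linalg.solve(M, b[[i, j]])
    # Keep the intersection only if it satisfies every constraint
    if np.all(A @ point <= b + 1e-9):
        value = c @ point
        if value > best_value:
            best_value, best_point = value, point

print(f"Best vertex: x = {best_point[0]:.4f}, y = {best_point[1]:.4f}")
print(f"Objective value: {best_value:.4f}")
```

The best vertex is where $2x + y = 20$ meets $-4x + 5y = 10$, that is $x = 45/7$ and $y = 50/7$, giving an objective value of $145/7 \approx 20.7143$.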

In [None]:
from scipy.optimize import linprog
import numpy as np

# Objective function coefficients (minimize -x - 2y to maximize x + 2y)
c = [-1, -2]

# Inequality constraints: A_ub @ x <= b_ub
A = [[2, 1],    # 2x + y <= 20
     [-4, 5],   # -4x + 5y <= 10
     [1, -2]]   # x - 2y <= 2

b = [20, 10, 2]

# Bounds for variables (x >= 0, y >= 0)
x_bounds = (0, None)
y_bounds = (0, None)

print("Linear Programming Optimization")
print("="*60)
print("Objective: Maximize f(x, y) = x + 2y")
print("\nConstraints:")
print("  2x + y ≤ 20")
print("  -4x + 5y ≤ 10")
print("  x - 2y ≤ 2")
print("  x ≥ 0, y ≥ 0")
print("="*60)

# Solve the linear program
result = linprog(c, A_ub=A, b_ub=b, bounds=[x_bounds, y_bounds], method='highs')

if result.success:
    print("\nOptimization successful!")
    print(f"\nOptimal solution:")
    print(f"  x = {result.x[0]:.4f}")
    print(f"  y = {result.x[1]:.4f}")
    print(f"\nMaximum value: f(x, y) = {-result.fun:.4f}")

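    # Verify constraints (small tolerance absorbs floating-point round-off)
    print("\nConstraint verification:")
    x_opt, y_opt = result.x
    tol = 1e-9
    checks = [
        ("2x + y", 2*x_opt + y_opt, 20),
        ("-4x + 5y", -4*x_opt + 5*y_opt, 10),
        ("x - 2y", x_opt - 2*y_opt, 2),
    ]
    for label, value, bound in checks:
        mark = "✓" if value <= bound + tol else "✗"
        print(f"  {label} = {value:.4f} ≤ {bound} {mark}")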
else:
    print("\nOptimization failed!")
    print(f"Reason: {result.message}")

## Additional Practice: Multivariate Gradient Descent

**Problem:** Minimize a multivariate function using gradient descent.

**Function:** $f(x, y) = x^2 + y^2 + 2x - 4y + 5$

**Gradients:**
- $\frac{\partial f}{\partial x} = 2x + 2$
- $\frac{\partial f}{\partial y} = 2y - 4$

**Expected minimum:** $x = -1, y = 2$
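
The expected minimum can be confirmed by completing the square:

$$f(x, y) = (x^2 + 2x + 1) + (y^2 - 4y + 4) = (x + 1)^2 + (y - 2)^2$$

Both squared terms are non-negative and vanish only at $x = -1$, $y = 2$, so the minimum value is $f(-1, 2) = 0$. This agrees with setting both partial derivatives to zero: $2x + 2 = 0$ and $2y - 4 = 0$.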

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Function: f(x, y) = x² + y² + 2x - 4y + 5
def f(x, y):
    return x**2 + y**2 + 2*x - 4*y + 5

# Gradient
def gradient(x, y):
    df_dx = 2*x + 2
    df_dy = 2*y - 4
    return np.array([df_dx, df_dy])

# Gradient Descent
position = np.array([5.0, 5.0])  # Starting point
learning_rate = 0.1
steps = 30

history = [position.copy()]

print("Multivariate Gradient Descent")
print("="*60)
print(f"Function: f(x, y) = x² + y² + 2x - 4y + 5")
print(f"Starting point: ({position[0]:.2f}, {position[1]:.2f})")
print("\nOptimization progress:")

for i in range(steps):
    grad = gradient(position[0], position[1])
    position = position - learning_rate * grad
    history.append(position.copy())

    if i % 5 == 0:
        print(f"Step {i:2d}: x={position[0]:6.3f}, y={position[1]:6.3f}, f(x,y)={f(position[0], position[1]):6.3f}")

print("\n" + "="*60)
print(f"\nFinal position: x = {position[0]:.4f}, y = {position[1]:.4f}")
print(f"Minimum value: f(x, y) = {f(position[0], position[1]):.4f}")
print(f"\nAnalytical minimum: x = -1, y = 2, f(-1, 2) = {f(-1, 2):.4f}")

# Visualization
history = np.array(history)

# Create meshgrid for contour plot
x_range = np.linspace(-3, 6, 100)
y_range = np.linspace(-2, 6, 100)
X, Y = np.meshgrid(x_range, y_range)
Z = f(X, Y)

plt.figure(figsize=(12, 5))

# Contour plot
plt.subplot(1, 2, 1)
contour = plt.contour(X, Y, Z, levels=20, cmap='viridis')
plt.colorbar(contour)
plt.plot(history[:, 0], history[:, 1], 'ro-', markersize=4, linewidth=1.5, label='Gradient Descent Path')
plt.scatter([history[0, 0]], [history[0, 1]], color='green', s=100, zorder=5, label='Start')
plt.scatter([history[-1, 0]], [history[-1, 1]], color='red', s=100, zorder=5, label='End')
plt.scatter([-1], [2], color='yellow', s=150, marker='*', zorder=6, label='True Minimum')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Contour Plot with Gradient Descent Path')
plt.legend()
plt.grid(True, alpha=0.3)

# 3D surface plot
ax = plt.subplot(1, 2, 2, projection='3d')
ax.plot_surface(X, Y, Z, cmap='viridis', alpha=0.6)
ax.plot(history[:, 0], history[:, 1], [f(x, y) for x, y in history], 'ro-', markersize=4, linewidth=2)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('f(x, y)')
ax.set_title('3D Surface with Optimization Path')

plt.tight_layout()
plt.show()

## Summary

In this notebook, we covered:

1. **Fundamental Concepts:**
   - Limits and their role in understanding convergence
   - Continuity for stable optimization
   - Derivatives as the foundation of gradient-based methods

2. **Gradient Descent:**
   - Core optimization algorithm for machine learning
   - Update rule using derivatives
   - Applications to simple functions and linear regression

3. **Multivariate Calculus:**
   - Partial derivatives for functions of multiple variables
   - Gradients as vectors pointing in the direction of steepest ascent
   - Applications in neural networks and high-dimensional optimization

4. **Practical Applications:**
   - Gradient descent for quadratic functions
   - Linear regression using gradient descent
   - Visualization of convergence
   - Constrained optimization using linear programming
   - Multivariate optimization

5. **Real-World ML Examples:**
   - Neural network training via backpropagation
   - Support Vector Machines with Lagrange multipliers
   - Various optimization scenarios

These concepts form the mathematical backbone of modern machine learning, enabling us to train models effectively and understand their behavior.