# Complete Guide to Derivatives and Partial Derivatives

## Learning Objectives
By the end of this notebook, you will understand:
1. What derivatives are and why they matter in ML
2. Computing derivatives by hand and with code
3. Derivative rules and techniques
4. Partial derivatives for multivariable functions
5. Gradients and their role in optimization
6. Applications in machine learning

---

## 1. What is a Derivative?

### Intuitive Understanding

A **derivative** tells us how fast something is changing.

**Real-world examples:**
- **Velocity** is the derivative of position (how fast position changes)
- **Acceleration** is the derivative of velocity (how fast velocity changes)
- **Slope** of a curve at a point is the derivative of the function at that point

### Mathematical Definition

For a function $f(x)$, the derivative $f'(x)$ is:

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
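
To see the limit at work, the sketch below evaluates the difference quotient for $f(x) = x^2$ at $x = 1$ with a shrinking step $h$. The helper names `difference_quotient` and `square` are illustrative assumptions, not part of any library. Algebraically the quotient equals $2 + h$ here, so it should approach $f'(1) = 2$.

```python
def difference_quotient(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h


def square(x):
    return x ** 2


# As h shrinks, the secant slope approaches the true derivative f'(1) = 2.
steps = [1.0, 0.1, 0.01, 0.001]
quotients = [difference_quotient(square, 1.0, h) for h in steps]

for h, q in zip(steps, quotients):
    print(f"h = {h:<6} difference quotient = {q:.4f}")
```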

**In ML:** Derivatives tell us how to adjust our model parameters to reduce error!

### Why Derivatives Matter in ML

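1. **Optimization**: Finding the minimum of a loss function
2. **Backpropagation**: Computing the gradients used to train neural networks
3. **Feature importance**: Measuring how sensitive a model's output is to each input
4. **Convergence**: Judging whether training is settling down (gradients shrink toward zero near a minimum)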

In [1]:
import numpy as np
import matplotlib.pyplot as plt
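import seaborn as sns


def derivative(f, x, dx=1e-6):
    """Central-difference estimate of f'(x).

    Local stand-in for scipy.misc.derivative, which is no longer
    available in recent SciPy releases.
    """
    return (f(x + dx) - f(x - dx)) / (2 * dx)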

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Visualize the concept of derivative
def f(x):
    """Example function: x^2"""
    return x**2

# Create data
x = np.linspace(-3, 3, 100)
y = f(x)

# Pick a point
x0 = 1
y0 = f(x0)

# Calculate derivative at x0 (for x^2, derivative is 2x)
slope = 2 * x0  # f'(x) = 2x

# Tangent line: y = mx + b, where m is slope
tangent_x = np.linspace(x0 - 1, x0 + 1, 50)
tangent_y = slope * (tangent_x - x0) + y0

# Plot
plt.figure(figsize=(12, 5))

# Left plot: Function and tangent
plt.subplot(1, 2, 1)
plt.plot(x, y, 'b-', linewidth=2, label='f(x) = x²')
plt.plot(tangent_x, tangent_y, 'r--', linewidth=2, label=f'Tangent (slope={slope})')
plt.scatter([x0], [y0], color='red', s=100, zorder=5)
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Derivative = Slope of Tangent Line')
plt.legend()
plt.grid(True, alpha=0.3)

# Right plot: Derivative function
plt.subplot(1, 2, 2)
derivative_y = 2 * x  # f'(x) = 2x
plt.plot(x, derivative_y, 'g-', linewidth=2, label="f'(x) = 2x")
plt.scatter([x0], [slope], color='red', s=100, zorder=5)
plt.xlabel('x')
plt.ylabel("f'(x)")
plt.title('Derivative Function')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='k', linewidth=0.5)

plt.tight_layout()
plt.show()

print(f"At x = {x0}:")
print(f"  Function value: f({x0}) = {y0}")
print(f"  Derivative (slope): f'({x0}) = {slope}")
print(f"  Interpretation: Function increasing at rate of {slope} units per unit change in x")


---
## 2. Basic Derivative Rules

### Power Rule
$$\frac{d}{dx}[x^n] = nx^{n-1}$$

### Constant Rule
$$\frac{d}{dx}[c] = 0$$

### Constant Multiple Rule
$$\frac{d}{dx}[cf(x)] = c \cdot f'(x)$$

### Sum Rule
$$\frac{d}{dx}[f(x) + g(x)] = f'(x) + g'(x)$$
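
Taken together, these four rules are enough to differentiate any polynomial. As a minimal sketch, the code below stores a polynomial as a list of coefficients (lowest power first) and applies the rules term by term. The function names `poly_derivative` and `poly_eval` and the coefficient-list representation are assumptions made for illustration.

```python
def poly_derivative(coeffs):
    """Differentiate c0 + c1*x + c2*x^2 + ... given as [c0, c1, c2, ...].

    Sum rule: each term is handled separately.
    Constant multiple + power rule: c*x^n becomes (c*n)*x^(n-1).
    Constant rule: the c0 term vanishes.
    """
    return [n * c for n, c in enumerate(coeffs)][1:]


def poly_eval(coeffs, x):
    """Evaluate the polynomial at x."""
    return sum(c * x ** n for n, c in enumerate(coeffs))


# f(x) = 4x^2 - 2x + 7  ->  f'(x) = 8x - 2
f_coeffs = [7, -2, 4]
df_coeffs = poly_derivative(f_coeffs)

print(df_coeffs)                # coefficients of f'(x), lowest power first
print(poly_eval(df_coeffs, 2))  # value of f'(2)
```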

In [None]:
# Demonstrate basic rules

print("=== POWER RULE ===")
print("f(x) = x³  →  f'(x) = 3x²")
print("f(x) = x⁵  →  f'(x) = 5x⁴")
print("f(x) = x   →  f'(x) = 1")
print("f(x) = x⁰  →  f'(x) = 0 (constant)")

print("\n=== CONSTANT RULE ===")
print("f(x) = 5   →  f'(x) = 0")
print("f(x) = 100 →  f'(x) = 0")

print("\n=== CONSTANT MULTIPLE RULE ===")
print("f(x) = 3x² →  f'(x) = 3·(2x) = 6x")
print("f(x) = 5x³ →  f'(x) = 5·(3x²) = 15x²")

print("\n=== SUM RULE ===")
print("f(x) = x² + x³  →  f'(x) = 2x + 3x²")
print("f(x) = 4x² - 2x + 7  →  f'(x) = 8x - 2")

# Verify with code
def verify_derivative(f, f_prime, x_vals, label):
    """Verify derivative numerically"""
    print(f"\n{label}")
    for x in x_vals:
        analytical = f_prime(x)
        numerical = derivative(f, x, dx=1e-6)  # a step of 1e-6 keeps both truncation and round-off error small
        print(f"  x={x}: Analytical={analytical:.4f}, Numerical={numerical:.4f}, Match={np.isclose(analytical, numerical)}")

# Test examples
verify_derivative(
    f=lambda x: x**3,
    f_prime=lambda x: 3*x**2,
    x_vals=[0, 1, 2],
    label="Verify: f(x)=x³, f'(x)=3x²"
)

verify_derivative(
    f=lambda x: 4*x**2 - 2*x + 7,
    f_prime=lambda x: 8*x - 2,
    x_vals=[0, 1, 2],
    label="Verify: f(x)=4x²-2x+7, f'(x)=8x-2"
)

### More Advanced Rules
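
Before the rules are listed, here is a minimal sketch of how one of them, the product rule, can be spot-checked numerically. It uses the notebook's own example $f(x) = x^2 \sin(x)$. The helper `central_difference` and the evaluation point `x0 = 1.5` are illustrative assumptions.

```python
import math


def central_difference(f, x, h=1e-6):
    """Numerical estimate of f'(x) from two nearby function values."""
    return (f(x + h) - f(x - h)) / (2 * h)


def product(x):
    return x ** 2 * math.sin(x)


def product_derivative(x):
    # Product rule with f(x) = x^2 and g(x) = sin(x): f'g + fg'
    return 2 * x * math.sin(x) + x ** 2 * math.cos(x)


x0 = 1.5
numerical = central_difference(product, x0)
analytical = product_derivative(x0)

print(f"numerical  = {numerical:.6f}")
print(f"analytical = {analytical:.6f}")
```

The same pattern works for the quotient rule or any other closed-form derivative: code the formula, then compare it with the finite-difference estimate.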

In [None]:
print("=== PRODUCT RULE ===")
print("d/dx[f(x)·g(x)] = f'(x)·g(x) + f(x)·g'(x)")
print("\nExample: f(x) = x²·sin(x)")
print("f'(x) = 2x·sin(x) + x²·cos(x)")

print("\n=== QUOTIENT RULE ===")
print("d/dx[f(x)/g(x)] = [f'(x)·g(x) - f(x)·g'(x)] / [g(x)]²")
print("\nExample: f(x) = x²/x³ = 1/x")
print("f'(x) = -1/x²")

print("\n=== EXPONENTIAL ===")
print("d/dx[eˣ] = eˣ")
print("d/dx[aˣ] = aˣ·ln(a)")

print("\n=== LOGARITHM ===")
print("d/dx[ln(x)] = 1/x")
print("d/dx[log_a(x)] = 1/(x·ln(a))")

print("\n=== TRIGONOMETRIC ===")
print("d/dx[sin(x)] = cos(x)")
print("d/dx[cos(x)] = -sin(x)")
print("d/dx[tan(x)] = sec²(x) = 1/cos²(x)")

# Visualize common derivatives
x = np.linspace(-2, 2, 200)

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
fig.suptitle('Common Functions and Their Derivatives', fontsize=16)

# x^2
axes[0, 0].plot(x, x**2, 'b-', label='f(x)=x²', linewidth=2)
axes[0, 0].plot(x, 2*x, 'r--', label="f'(x)=2x", linewidth=2)
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].set_title('Polynomial')

# e^x
axes[0, 1].plot(x, np.exp(x), 'b-', label='f(x)=eˣ', linewidth=2)
axes[0, 1].plot(x, np.exp(x), 'r--', label="f'(x)=eˣ", linewidth=2)
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].set_title('Exponential')
axes[0, 1].set_ylim(-1, 8)

# ln(x)
x_pos = x[x > 0.1]
axes[0, 2].plot(x_pos, np.log(x_pos), 'b-', label='f(x)=ln(x)', linewidth=2)
axes[0, 2].plot(x_pos, 1/x_pos, 'r--', label="f'(x)=1/x", linewidth=2)
axes[0, 2].legend()
axes[0, 2].grid(True, alpha=0.3)
axes[0, 2].set_title('Logarithm')
axes[0, 2].set_ylim(-2, 2)

# sin(x)
axes[1, 0].plot(x, np.sin(x), 'b-', label='f(x)=sin(x)', linewidth=2)
axes[1, 0].plot(x, np.cos(x), 'r--', label="f'(x)=cos(x)", linewidth=2)
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].set_title('Sine')

# cos(x)
axes[1, 1].plot(x, np.cos(x), 'b-', label='f(x)=cos(x)', linewidth=2)
axes[1, 1].plot(x, -np.sin(x), 'r--', label="f'(x)=-sin(x)", linewidth=2)
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].set_title('Cosine')

# 1/x
x_nonzero = x[np.abs(x) > 0.1]
axes[1, 2].plot(x_nonzero, 1/x_nonzero, 'b-', label='f(x)=1/x', linewidth=2)
axes[1, 2].plot(x_nonzero, -1/x_nonzero**2, 'r--', label="f'(x)=-1/x²", linewidth=2)
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)
axes[1, 2].set_title('Reciprocal')
axes[1, 2].set_ylim(-5, 5)

plt.tight_layout()
plt.show()

---
## 3. Partial Derivatives

### What are Partial Derivatives?

For functions with multiple variables $f(x, y, z, ...)$, a **partial derivative** measures how the function changes when we vary ONE variable while keeping others constant.

**Notation:**
- $\frac{\partial f}{\partial x}$ = partial derivative with respect to $x$
- $f_x$ = shorthand notation

### Example

For $f(x, y) = x^2 + xy + y^2$:

$$\frac{\partial f}{\partial x} = 2x + y \quad \text{(treat y as constant)}$$

$$\frac{\partial f}{\partial y} = x + 2y \quad \text{(treat x as constant)}$$
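
"Treat the other variable as constant" translates directly into code: nudge one argument and leave the other untouched. The sketch below checks both formulas at the point $(1, 2)$. The helper names `partial_x` and `partial_y` and the step size `h` are assumptions chosen for illustration.

```python
def f(x, y):
    return x ** 2 + x * y + y ** 2


def partial_x(f, x, y, h=1e-6):
    """Vary x only; y stays fixed at the given value."""
    return (f(x + h, y) - f(x - h, y)) / (2 * h)


def partial_y(f, x, y, h=1e-6):
    """Vary y only; x stays fixed at the given value."""
    return (f(x, y + h) - f(x, y - h)) / (2 * h)


x0, y0 = 1.0, 2.0
print(f"df/dx at (1, 2): numerical {partial_x(f, x0, y0):.6f}, formula {2 * x0 + y0}")
print(f"df/dy at (1, 2): numerical {partial_y(f, x0, y0):.6f}, formula {x0 + 2 * y0}")
```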

In [None]:
# Visualize partial derivatives
def f(x, y):
    """Function of two variables"""
    return x**2 + x*y + y**2

# Create mesh
x = np.linspace(-3, 3, 50)
y = np.linspace(-3, 3, 50)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

# Partial derivatives
def df_dx(x, y):
    return 2*x + y

def df_dy(x, y):
    return x + 2*y

# Plot
fig = plt.figure(figsize=(16, 5))

# 3D surface
ax1 = fig.add_subplot(131, projection='3d')
surf = ax1.plot_surface(X, Y, Z, cmap='viridis', alpha=0.8)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_zlabel('f(x,y)')
ax1.set_title('f(x,y) = x² + xy + y²')
plt.colorbar(surf, ax=ax1, shrink=0.5)

# Partial derivative with respect to x
ax2 = fig.add_subplot(132, projection='3d')
dZ_dx = df_dx(X, Y)
surf2 = ax2.plot_surface(X, Y, dZ_dx, cmap='Reds', alpha=0.8)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_zlabel('∂f/∂x')
ax2.set_title('Partial Derivative: ∂f/∂x = 2x + y')
plt.colorbar(surf2, ax=ax2, shrink=0.5)

# Partial derivative with respect to y
ax3 = fig.add_subplot(133, projection='3d')
dZ_dy = df_dy(X, Y)
surf3 = ax3.plot_surface(X, Y, dZ_dy, cmap='Blues', alpha=0.8)
ax3.set_xlabel('x')
ax3.set_ylabel('y')
ax3.set_zlabel('∂f/∂y')
ax3.set_title('Partial Derivative: ∂f/∂y = x + 2y')
plt.colorbar(surf3, ax=ax3, shrink=0.5)

plt.tight_layout()
plt.show()

# Numerical example
x0, y0 = 1, 2
print(f"\nAt point ({x0}, {y0}):")
print(f"  f({x0}, {y0}) = {f(x0, y0)}")
print(f"  ∂f/∂x = {df_dx(x0, y0)} (rate of change in x-direction)")
print(f"  ∂f/∂y = {df_dy(x0, y0)} (rate of change in y-direction)")

### Computing Partial Derivatives

In [None]:
print("=== COMPUTING PARTIAL DERIVATIVES ===")
print("\nExample 1: f(x, y) = x²y + y³")
print("  ∂f/∂x = 2xy (treat y as constant)")
print("  ∂f/∂y = x² + 3y² (treat x as constant)")

print("\nExample 2: f(x, y, z) = x²yz + sin(xy) + z³")
print("  ∂f/∂x = 2xyz + y·cos(xy)")
print("  ∂f/∂y = x²z + x·cos(xy)")
print("  ∂f/∂z = x²y + 3z²")

print("\nExample 3: f(x, y) = e^(xy)")
print("  ∂f/∂x = y·e^(xy)")
print("  ∂f/∂y = x·e^(xy)")

# Verify with numerical differentiation
def numerical_partial(f, x, y, var='x', h=1e-6):
    """Central-difference estimate of a partial derivative; the other variable is held fixed."""
    if var == 'x':
        return (f(x + h, y) - f(x - h, y)) / (2 * h)
    else:
        return (f(x, y + h) - f(x, y - h)) / (2 * h)

# Test function
def test_f(x, y):
    return x**2 * y + y**3

x0, y0 = 2, 3

# Analytical
analytical_x = 2 * x0 * y0
analytical_y = x0**2 + 3 * y0**2

# Numerical
numerical_x = numerical_partial(test_f, x0, y0, var='x')
numerical_y = numerical_partial(test_f, x0, y0, var='y')

print(f"\n=== VERIFICATION ===")
print(f"At ({x0}, {y0}):")
print(f"  ∂f/∂x: Analytical={analytical_x}, Numerical={numerical_x:.6f}")
print(f"  ∂f/∂y: Analytical={analytical_y}, Numerical={numerical_y:.6f}")

---
## 4. The Gradient Vector

### Definition

The **gradient** $\nabla f$ is a vector of all partial derivatives:

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

### Key Properties:
1. **Points in direction of steepest ascent**
2. **Magnitude = rate of steepest increase**
3. **Perpendicular to level curves/surfaces**
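
These three properties can be checked numerically. The sketch below reuses $f(x, y) = x^2 + xy + y^2$ from Section 3, scans 360 directions around the point $(1, 2)$, and compares the best one with the gradient direction. The helper `directional_derivative`, the one-degree scan, and the chosen point are assumptions made for illustration.

```python
import math


def f(x, y):
    return x ** 2 + x * y + y ** 2


def directional_derivative(f, x, y, angle, h=1e-6):
    """Rate of change of f at (x, y) when moving along the unit vector at `angle`."""
    dx, dy = math.cos(angle), math.sin(angle)
    return (f(x + h * dx, y + h * dy) - f(x - h * dx, y - h * dy)) / (2 * h)


x0, y0 = 1.0, 2.0
grad = (2 * x0 + y0, x0 + 2 * y0)  # analytical gradient (2x + y, x + 2y)
grad_angle = math.atan2(grad[1], grad[0])
grad_norm = math.hypot(grad[0], grad[1])

# Property 1: scan 360 directions and keep the one with the largest rate of increase.
angles = [math.radians(d) for d in range(360)]
best_angle = max(angles, key=lambda a: directional_derivative(f, x0, y0, a))

# Property 2: the rate along the gradient equals the gradient's magnitude.
rate_along = directional_derivative(f, x0, y0, grad_angle)

# Property 3: moving perpendicular to the gradient (along a level curve) changes nothing.
rate_across = directional_derivative(f, x0, y0, grad_angle + math.pi / 2)

print(f"gradient direction : {math.degrees(grad_angle):.1f} degrees")
print(f"best scanned angle : {math.degrees(best_angle):.1f} degrees")
print(f"rate along gradient: {rate_along:.4f} (gradient magnitude {grad_norm:.4f})")
print(f"rate across        : {abs(rate_across):.4f}")
```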

### In Machine Learning:
The negative gradient $-\nabla f$ points in the direction of steepest descent. Repeatedly stepping in that direction is the foundation of gradient descent!
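
As a preview of how this is used, here is a minimal sketch of a few descent steps on the bowl $f(x, y) = x^2 + y^2$. The starting point, the step size `learning_rate = 0.1`, and the step count are arbitrary assumptions picked for illustration, not recommended settings.

```python
def gradient(x, y):
    """Gradient of the bowl f(x, y) = x^2 + y^2."""
    return 2 * x, 2 * y


x, y = 2.0, -1.5      # arbitrary starting point
learning_rate = 0.1   # illustrative step size

for step in range(25):
    gx, gy = gradient(x, y)
    # Move against the gradient: the direction in which f decreases fastest.
    x -= learning_rate * gx
    y -= learning_rate * gy

print(f"after 25 steps: x = {x:.4f}, y = {y:.4f}, f = {x ** 2 + y ** 2:.6f}")
```

Each step here multiplies both coordinates by $1 - 2 \cdot 0.1 = 0.8$, so the point slides steadily toward the minimum at the origin.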

In [None]:
# Visualize gradient vectors
def f(x, y):
    return x**2 + y**2  # Simple bowl shape

def gradient(x, y):
    """Compute gradient vector"""
    df_dx = 2 * x
    df_dy = 2 * y
    return np.array([df_dx, df_dy])

# Create grid
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

# Sample points for gradient vectors
x_sample = np.linspace(-2, 2, 8)
y_sample = np.linspace(-2, 2, 8)
X_sample, Y_sample = np.meshgrid(x_sample, y_sample)

# Compute gradients at sample points
U = 2 * X_sample  # ∂f/∂x
V = 2 * Y_sample  # ∂f/∂y

# Plot
fig = plt.figure(figsize=(14, 6))
ax1 = fig.add_subplot(121)  # 2D axes only; the 3D axes are created further down

# Contour plot with gradient vectors
contour = ax1.contour(X, Y, Z, levels=15, cmap='viridis')
ax1.clabel(contour, inline=True, fontsize=8)
ax1.quiver(X_sample, Y_sample, U, V, color='red', alpha=0.7, scale=20)
ax1.scatter([0], [0], color='yellow', s=200, marker='*', 
           edgecolors='black', linewidths=2, label='Minimum', zorder=5)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Gradient Vectors Point Uphill\n(Perpendicular to Contours)')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_aspect('equal')

# 3D surface with gradient
ax2 = fig.add_subplot(122, projection='3d')
ax2.plot_surface(X, Y, Z, cmap='viridis', alpha=0.6)
# Add a few gradient vectors in 3D
for i in range(0, len(x_sample), 2):
    for j in range(0, len(y_sample), 2):
        x0, y0 = x_sample[i], y_sample[j]
        z0 = f(x0, y0)
        grad = gradient(x0, y0)
        ax2.quiver(x0, y0, z0, grad[0], grad[1], 0, 
                  color='red', arrow_length_ratio=0.3, linewidth=2)

ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_zlabel('f(x,y)')
ax2.set_title('3D View: Gradients Point Uphill')

plt.tight_layout()
plt.show()

print("\nGradient Properties:")
print("✓ Gradient points in direction of steepest ascent")
print("✓ Negative gradient (-∇f) points in direction of steepest descent")
print("✓ Gradient is perpendicular to contour lines")
print("✓ Magnitude of gradient = rate of steepest increase")

### Computing Gradients in Code

In [None]:
def compute_gradient(f, x, epsilon=1e-8):
    """
    Numerically compute gradient using finite differences
    
    Args:
        f: function to differentiate
        x: point at which to evaluate gradient (numpy array)
        epsilon: small perturbation for numerical differentiation
    
    Returns:
        gradient vector
    """
    grad = np.zeros_like(x)
    
    for i in range(len(x)):
        x_plus = x.copy()
        x_plus[i] += epsilon
        
        x_minus = x.copy()
        x_minus[i] -= epsilon
        
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * epsilon)
    
    return grad

# Example 1: Simple quadratic
def f1(x):
    return x[0]**2 + x[1]**2

x1 = np.array([2.0, 3.0])
grad1_numerical = compute_gradient(f1, x1)
grad1_analytical = 2 * x1  # [2x, 2y]

print("=== Example 1: f(x,y) = x² + y² ===")
print(f"At point {x1}:")
print(f"  Numerical gradient: {grad1_numerical}")
print(f"  Analytical gradient: {grad1_analytical}")
print(f"  Match: {np.allclose(grad1_numerical, grad1_analytical)}")

# Example 2: More complex function
def f2(x):
    return x[0]**2 * x[1] + np.sin(x[0]) + x[1]**3

def f2_gradient_analytical(x):
    df_dx = 2*x[0]*x[1] + np.cos(x[0])
    df_dy = x[0]**2 + 3*x[1]**2
    return np.array([df_dx, df_dy])

x2 = np.array([1.0, 2.0])
grad2_numerical = compute_gradient(f2, x2)
grad2_analytical = f2_gradient_analytical(x2)

print("\n=== Example 2: f(x,y) = x²y + sin(x) + y³ ===")
print(f"At point {x2}:")
print(f"  Numerical gradient: {grad2_numerical}")
print(f"  Analytical gradient: {grad2_analytical}")
print(f"  Match: {np.allclose(grad2_numerical, grad2_analytical)}")

# Example 3: Higher dimensions
def f3(x):
    """Sum of squares in n dimensions"""
    return np.sum(x**2)

x3 = np.array([1.0, 2.0, 3.0, 4.0])
grad3_numerical = compute_gradient(f3, x3)
grad3_analytical = 2 * x3

print("\n=== Example 3: f(x) = Σxᵢ² (4D) ===")
print(f"At point {x3}:")
print(f"  Numerical gradient: {grad3_numerical}")
print(f"  Analytical gradient: {grad3_analytical}")
print(f"  Match: {np.allclose(grad3_numerical, grad3_analytical)}")

---
## 5. Applications in Machine Learning

### 5.1 Linear Regression - Computing Gradients
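
For the model $y = wx + b$ with loss $L = \frac{1}{n}\sum_i (wx_i + b - y_i)^2$, the chain rule gives $\frac{\partial L}{\partial w} = \frac{2}{n}\sum_i (wx_i + b - y_i)\,x_i$ and $\frac{\partial L}{\partial b} = \frac{2}{n}\sum_i (wx_i + b - y_i)$. The sketch below checks these formulas against finite differences on a tiny hand-made dataset. The data values and the names `mse` and `mse_gradient` are assumptions for illustration and are separate from the NumPy implementation in the next cell.

```python
def mse(xs, ys, w, b):
    """Mean squared error of the line y = w*x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(xs, ys, w, b):
    """Analytical gradient: dL/dw = (2/n)*sum(error*x), dL/db = (2/n)*sum(error)."""
    n = len(xs)
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    dL_dw = 2 / n * sum(e * x for e, x in zip(errors, xs))
    dL_db = 2 / n * sum(errors)
    return dL_dw, dL_db


# Tiny made-up dataset lying near y = 2x + 1 (illustration only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
w, b, h = 0.5, 0.0, 1e-6

dL_dw, dL_db = mse_gradient(xs, ys, w, b)
num_dw = (mse(xs, ys, w + h, b) - mse(xs, ys, w - h, b)) / (2 * h)
num_db = (mse(xs, ys, w, b + h) - mse(xs, ys, w, b - h)) / (2 * h)

print(f"dL/dw: analytical {dL_dw:.6f}, numerical {num_dw:.6f}")
print(f"dL/db: analytical {dL_db:.6f}, numerical {num_db:.6f}")
```

Both gradients come out negative at this starting point, which says that increasing `w` and `b` would lower the loss.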

In [None]:
# Linear regression example
# Model: y = wx + b
# Loss: L = (1/n)Σ(y_pred - y_true)²

# Generate synthetic data
np.random.seed(42)
X_data = np.linspace(0, 10, 50)
y_data = 2 * X_data + 1 + np.random.randn(50) * 2  # True: y = 2x + 1 + noise

# Initial parameters
w = 0.0  # weight
b = 0.0  # bias

def predict(X, w, b):
    """Make predictions"""
    return w * X + b

def mse_loss(X, y, w, b):
    """Mean squared error loss"""
    y_pred = predict(X, w, b)
    return np.mean((y_pred - y)**2)

def compute_gradients(X, y, w, b):
    """
    Compute gradients of MSE loss with respect to w and b
    
    L = (1/n)Σ(wx + b - y)²
    ∂L/∂w = (2/n)Σ(wx + b - y)·x
    ∂L/∂b = (2/n)Σ(wx + b - y)
    """
    n = len(X)
    y_pred = predict(X, w, b)
    error = y_pred - y
    
    dL_dw = (2 / n) * np.sum(error * X)
    dL_db = (2 / n) * np.sum(error)
    
    return dL_dw, dL_db

# Compute gradients at initial parameters
dL_dw, dL_db = compute_gradients(X_data, y_data, w, b)

print("=== LINEAR REGRESSION GRADIENTS ===")
print(f"Initial parameters: w={w:.4f}, b={b:.4f}")
print(f"Initial loss: {mse_loss(X_data, y_data, w, b):.4f}")
print(f"\nGradients:")
print(f"  ∂L/∂w = {dL_dw:.4f}")
print(f"  ∂L/∂b = {dL_db:.4f}")
print(f"\nInterpretation:")
if dL_dw > 0:
    print(f"  • Decreasing w will reduce loss (gradient is positive)")
else:
    print(f"  • Increasing w will reduce loss (gradient is negative)")
    
if dL_db > 0:
    print(f"  • Decreasing b will reduce loss (gradient is positive)")
else:
    print(f"  • Increasing b will reduce loss (gradient is negative)")

# Visualize
plt.figure(figsize=(12, 5))

# Data and initial model
plt.subplot(1, 2, 1)
plt.scatter(X_data, y_data, alpha=0.5, label='Data')
plt.plot(X_data, predict(X_data, w, b), 'r-', linewidth=2, label=f'Initial: y={w:.2f}x+{b:.2f}')
plt.plot(X_data, predict(X_data, 2, 1), 'g--', linewidth=2, label='True: y=2x+1')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.title('Linear Regression')
plt.grid(True, alpha=0.3)

# Loss surface and gradient
plt.subplot(1, 2, 2)
w_range = np.linspace(-1, 4, 50)
b_range = np.linspace(-2, 3, 50)
W, B = np.meshgrid(w_range, b_range)
L = np.zeros_like(W)

for i in range(len(w_range)):
    for j in range(len(b_range)):
        L[j, i] = mse_loss(X_data, y_data, W[j, i], B[j, i])

contour = plt.contour(W, B, L, levels=20, cmap='viridis')
plt.clabel(contour, inline=True, fontsize=8)
plt.scatter([w], [b], color='red', s=200, marker='o', label='Current position', zorder=5)
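grad_norm = np.hypot(dL_dw, dL_db)  # rescale to unit length so the arrow stays inside the plotted range
plt.arrow(w, b, -dL_dw / grad_norm, -dL_db / grad_norm, head_width=0.1, head_length=0.1,
         fc='red', ec='red', linewidth=2, label='Negative gradient (direction only)')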
plt.scatter([2], [1], color='yellow', s=200, marker='*', 
           edgecolors='black', linewidths=2, label='True optimum', zorder=5)
plt.xlabel('w (weight)')
plt.ylabel('b (bias)')
plt.title('Loss Surface')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 5.2 Logistic Regression Gradients
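
The logistic-regression gradient is compact because two derivatives combine neatly: the sigmoid satisfies $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, and with binary cross-entropy the chain rule collapses to $\partial L / \partial z = p - y$ for a single example. Since $z = wx + b$, multiplying by $\partial z / \partial w = x$ gives the $(p - y)\,x$ form used below. The sketch checks both identities numerically; the logit `z = 0.7`, the label, and the helper name `bce` are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def bce(z, y):
    """Binary cross-entropy for one example with logit z and label y."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


z, y, h = 0.7, 1.0, 1e-6
p = sigmoid(z)

# Sigmoid derivative: sigma'(z) = sigma(z) * (1 - sigma(z))
num_sigmoid_grad = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
# Chain rule through the loss collapses to dL/dz = p - y
num_loss_grad = (bce(z + h, y) - bce(z - h, y)) / (2 * h)

print(f"sigma'(z): numerical {num_sigmoid_grad:.6f}, formula {p * (1 - p):.6f}")
print(f"dL/dz    : numerical {num_loss_grad:.6f}, formula {p - y:.6f}")
```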

In [None]:
# Logistic regression for binary classification
# Model: p = sigmoid(wx + b)
# Loss: L = -(1/n)Σ[y·log(p) + (1-y)·log(1-p)]  (Binary Cross-Entropy)

def sigmoid(z):
    """Sigmoid activation function"""
    return 1 / (1 + np.exp(-z))

# Generate synthetic data
np.random.seed(42)
X_pos = np.random.randn(50, 2) + np.array([2, 2])
X_neg = np.random.randn(50, 2) + np.array([-2, -2])
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(50), np.zeros(50)])

def predict_proba(X, w, b):
    """Predict probabilities"""
    z = X @ w + b
    return sigmoid(z)

def binary_cross_entropy(X, y, w, b):
    """Binary cross-entropy loss"""
    p = predict_proba(X, w, b)
    epsilon = 1e-15  # Avoid log(0)
    p = np.clip(p, epsilon, 1 - epsilon)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def compute_gradients_logistic(X, y, w, b):
    """
    Compute gradients for logistic regression
    
    ∂L/∂w = (1/n)Σ(p - y)·x
    ∂L/∂b = (1/n)Σ(p - y)
    """
    n = len(X)
    p = predict_proba(X, w, b)
    error = p - y
    
    dL_dw = (1 / n) * (X.T @ error)
    dL_db = (1 / n) * np.sum(error)
    
    return dL_dw, dL_db

# Initialize parameters
w = np.zeros(2)
b = 0.0

# Compute gradients
dL_dw, dL_db = compute_gradients_logistic(X, y, w, b)

print("=== LOGISTIC REGRESSION GRADIENTS ===")
print(f"Initial parameters: w={w}, b={b:.4f}")
print(f"Initial loss: {binary_cross_entropy(X, y, w, b):.4f}")
print(f"\nGradients:")
print(f"  ∂L/∂w = {dL_dw}")
print(f"  ∂L/∂b = {dL_db:.4f}")

# Visualize
plt.figure(figsize=(10, 5))

# Data
plt.subplot(1, 2, 1)
plt.scatter(X_pos[:, 0], X_pos[:, 1], c='blue', label='Class 1', alpha=0.6)
plt.scatter(X_neg[:, 0], X_neg[:, 1], c='red', label='Class 0', alpha=0.6)

# Decision boundary (wx + b = 0)
if w[1] != 0:
    x_boundary = np.linspace(-5, 5, 100)
    y_boundary = -(w[0] * x_boundary + b) / w[1]
    plt.plot(x_boundary, y_boundary, 'k-', linewidth=2, label='Decision boundary')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Binary Classification Data')
plt.legend()
plt.grid(True, alpha=0.3)

# Gradient visualization
plt.subplot(1, 2, 2)
plt.scatter(X_pos[:, 0], X_pos[:, 1], c='blue', alpha=0.3)
plt.scatter(X_neg[:, 0], X_neg[:, 1], c='red', alpha=0.3)
plt.arrow(0, 0, -2*dL_dw[0], -2*dL_dw[1], head_width=0.3, head_length=0.3,
         fc='green', ec='green', linewidth=3, label='Negative gradient')
plt.xlabel('w1 direction')
plt.ylabel('w2 direction')
plt.title('Gradient Direction')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 5.3 Neural Network: Single Hidden Layer
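
Backpropagation is the chain rule applied one layer at a time. To make each factor visible, the sketch below shrinks the network to scalars (one input, one hidden unit, one output) and compares the chain-rule gradients with finite differences. All parameter values and the single training example are hypothetical, chosen only for illustration; the matrix version follows in the next cell.

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def loss(w1, b1, w2, b2, x, y):
    """Forward pass of a 1-1-1 network followed by binary cross-entropy."""
    h = sigmoid(w1 * x + b1)
    y_pred = sigmoid(w2 * h + b2)
    return -(y * math.log(y_pred) + (1 - y) * math.log(1 - y_pred))


# Hypothetical scalar parameters and one training example.
w1, b1, w2, b2 = 0.4, -0.1, 0.8, 0.2
x, y = 1.5, 1.0

# Backward pass, one chain-rule factor at a time.
h = sigmoid(w1 * x + b1)
y_pred = sigmoid(w2 * h + b2)
dz2 = y_pred - y          # dL/dz2 for sigmoid + cross-entropy
dw2 = dz2 * h             # dL/dw2 = dL/dz2 * dz2/dw2
dh = dz2 * w2             # dL/dh  = dL/dz2 * dz2/dh
dz1 = dh * h * (1 - h)    # dL/dz1 = dL/dh * sigmoid'(z1)
dw1 = dz1 * x             # dL/dw1 = dL/dz1 * dz1/dw1

# Finite-difference check of dL/dw1 and dL/dw2.
eps = 1e-6
num_dw1 = (loss(w1 + eps, b1, w2, b2, x, y) - loss(w1 - eps, b1, w2, b2, x, y)) / (2 * eps)
num_dw2 = (loss(w1, b1, w2 + eps, b2, x, y) - loss(w1, b1, w2 - eps, b2, x, y)) / (2 * eps)

print(f"dL/dw1: backprop {dw1:.6f}, numerical {num_dw1:.6f}")
print(f"dL/dw2: backprop {dw2:.6f}, numerical {num_dw2:.6f}")
```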

In [None]:
# Simple neural network with one hidden layer
# Forward: h = sigmoid(W1·x + b1), y = sigmoid(W2·h + b2)

def initialize_network(input_size, hidden_size, output_size):
    """Initialize network parameters"""
    np.random.seed(42)
    params = {
        'W1': np.random.randn(hidden_size, input_size) * 0.01,
        'b1': np.zeros(hidden_size),
        'W2': np.random.randn(output_size, hidden_size) * 0.01,
        'b2': np.zeros(output_size)
    }
    return params

def forward_pass(X, params):
    """Forward propagation"""
    # Layer 1
    z1 = params['W1'] @ X + params['b1'].reshape(-1, 1)
    h = sigmoid(z1)
    
    # Layer 2
    z2 = params['W2'] @ h + params['b2'].reshape(-1, 1)
    y_pred = sigmoid(z2)
    
    cache = {'X': X, 'z1': z1, 'h': h, 'z2': z2, 'y_pred': y_pred}
    return y_pred, cache

def compute_loss(y_pred, y_true):
    """Binary cross-entropy loss"""
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def backward_pass(y_true, cache, params):
    """
    Backward propagation - compute gradients
    
    Uses chain rule to compute gradients layer by layer
    """
    m = y_true.shape[1]  # number of examples
    
    # Output layer gradients
    dz2 = cache['y_pred'] - y_true  # ∂L/∂z2
    dW2 = (1/m) * (dz2 @ cache['h'].T)  # ∂L/∂W2
    db2 = (1/m) * np.sum(dz2, axis=1, keepdims=True)  # ∂L/∂b2
    
    # Hidden layer gradients (using chain rule)
    dh = params['W2'].T @ dz2  # ∂L/∂h
    dz1 = dh * cache['h'] * (1 - cache['h'])  # ∂L/∂z1 (sigmoid derivative)
    dW1 = (1/m) * (dz1 @ cache['X'].T)  # ∂L/∂W1
    db1 = (1/m) * np.sum(dz1, axis=1, keepdims=True)  # ∂L/∂b1
    
    gradients = {
        'dW1': dW1,
        'db1': db1.flatten(),
        'dW2': dW2,
        'db2': db2.flatten()
    }
    
    return gradients

# Example with XOR problem (classic non-linear problem)
X_train = np.array([[0, 0, 1, 1],
                    [0, 1, 0, 1]])
y_train = np.array([[0, 1, 1, 0]])  # XOR outputs

# Initialize network
params = initialize_network(input_size=2, hidden_size=4, output_size=1)

# Forward pass
y_pred, cache = forward_pass(X_train, params)
loss = compute_loss(y_pred, y_train)

# Backward pass
grads = backward_pass(y_train, cache, params)

print("=== NEURAL NETWORK GRADIENTS (XOR Problem) ===")
print(f"\nInput shape: {X_train.shape}")
print(f"Hidden size: 4")
print(f"Output shape: {y_train.shape}")
print(f"\nInitial loss: {loss:.4f}")
print(f"\nPredictions: {y_pred.flatten()}")
print(f"True labels:  {y_train.flatten()}")
print(f"\nGradient shapes:")
print(f"  dW1: {grads['dW1'].shape} (hidden × input)")
print(f"  db1: {grads['db1'].shape} (hidden,)")
print(f"  dW2: {grads['dW2'].shape} (output × hidden)")
print(f"  db2: {grads['db2'].shape} (output,)")
print(f"\nSample gradient values:")
print(f"  dW1[0,0] = {grads['dW1'][0,0]:.6f}")
print(f"  dW2[0,0] = {grads['dW2'][0,0]:.6f}")
print(f"\n✓ These gradients tell us how to adjust each weight to reduce loss!")

---
## 6. Practice Exercises

### Exercise 1: Compute Derivatives
Find the derivatives of the following functions:
1. $f(x) = 3x^4 - 2x^2 + 5$
2. $f(x) = e^{2x}$
3. $f(x) = \frac{1}{x^2}$

In [None]:
# Exercise 1 - Your code here

# Define functions and their derivatives
# Verify numerically


### Exercise 2: Partial Derivatives
For $f(x, y) = x^2y + e^{xy}$, compute:
1. $\frac{\partial f}{\partial x}$
2. $\frac{\partial f}{\partial y}$
3. Evaluate both at point $(1, 2)$

In [None]:
# Exercise 2 - Your code here


### Exercise 3: Gradient Computation
Compute the gradient of $f(x, y, z) = x^2 + 2y^2 + 3z^2$ at point $(1, 1, 1)$

In [None]:
# Exercise 3 - Your code here


### Exercise 4: MSE Gradient
Derive and implement the gradient of MSE loss for linear regression:
$L(w, b) = \frac{1}{n}\sum_{i=1}^{n}(wx_i + b - y_i)^2$

In [None]:
# Exercise 4 - Your code here


---
## 7. Solutions

In [None]:
print("=== SOLUTIONS ===")

# Exercise 1
print("\nExercise 1:")
print("1. f(x) = 3x⁴ - 2x² + 5")
print("   f'(x) = 12x³ - 4x")
print("\n2. f(x) = e^(2x)")
print("   f'(x) = 2e^(2x)")
print("\n3. f(x) = 1/x² = x^(-2)")
print("   f'(x) = -2x^(-3) = -2/x³")

# Exercise 2
print("\nExercise 2: f(x,y) = x²y + e^(xy)")
print("∂f/∂x = 2xy + ye^(xy)")
print("∂f/∂y = x² + xe^(xy)")
x, y = 1, 2
df_dx = 2*x*y + y*np.exp(x*y)
df_dy = x**2 + x*np.exp(x*y)
print(f"At (1,2): ∂f/∂x = {df_dx:.4f}, ∂f/∂y = {df_dy:.4f}")

# Exercise 3
print("\nExercise 3: f(x,y,z) = x² + 2y² + 3z²")
print("∇f = [2x, 4y, 6z]")
grad = np.array([2*1, 4*1, 6*1])
print(f"At (1,1,1): ∇f = {grad}")

# Exercise 4
print("\nExercise 4: MSE Loss Gradient")
print("∂L/∂w = (2/n)Σ(wx+b-y)·x")
print("∂L/∂b = (2/n)Σ(wx+b-y)")
print("(See implementation in Section 5.1)")

---
## 8. Key Takeaways

### Core Concepts:
1. ✅ **Derivative** = rate of change = slope of tangent line
2. ✅ **Partial derivative** = derivative with respect to one variable (others constant)
3. ✅ **Gradient** = vector of all partial derivatives
4. ✅ **Chain rule** = foundation of backpropagation (next notebook!)

### Essential Rules:
- Power rule: $\frac{d}{dx}[x^n] = nx^{n-1}$
- Product rule: $(f \cdot g)' = f'g + fg'$
- Chain rule: $(f(g(x)))' = f'(g(x)) \cdot g'(x)$

### ML Applications:
1. **Gradient descent**: Follow negative gradient to minimize loss
2. **Backpropagation**: Compute gradients efficiently using chain rule
3. **Optimization**: Adjust parameters to reduce error
4. **Feature importance**: Gradient magnitude shows sensitivity

### Computational Tips:
- Use numerical differentiation for verification
- Analytical gradients are faster and more accurate
- Modern frameworks (PyTorch, TensorFlow) compute gradients automatically
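
The first two tips come down to the choice of step size. The sketch below compares forward and central differences for $f(x) = \sin(x)$ at $x = 1$, where the exact derivative is $\cos(1)$. The helper names and the list of step sizes are illustrative assumptions. Shrinking $h$ reduces the approximation error only up to a point, after which floating-point round-off takes over.

```python
import math


def forward_difference(f, x, h):
    return (f(x + h) - f(x)) / h


def central_difference(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)


x0 = 1.0
exact = math.cos(x0)  # derivative of sin(x) at x0

for h in (1e-2, 1e-4, 1e-6, 1e-8, 1e-10):
    forward_error = abs(forward_difference(math.sin, x0, h) - exact)
    central_error = abs(central_difference(math.sin, x0, h) - exact)
    print(f"h = {h:.0e}: forward error = {forward_error:.2e}, central error = {central_error:.2e}")
```

For moderate step sizes the central difference is noticeably more accurate, which is why it is the usual choice for gradient checks.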

---

**Congratulations! You understand derivatives and gradients! 🎉**

**Next: Gradient Descent and Optimization - where we USE these gradients!**