# Linear Regression from Scratch
**Objective:** Implement Linear Regression using only NumPy (Batch Gradient Descent) and compare with the analytical Closed-Form solution.

## Setup

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Set seed for reproducibility
np.random.seed(42)


## Problem Setup
**Linear Regression** models the relationship between input features $X$ and a target $y$ by fitting a linear equation to observed data.

**Model:**
$$\hat{y} = Xw + b$$

**Cost Function (Mean Squared Error):**
$$J(w, b) = MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}^{(i)} - y^{(i)})^2$$

**Gradients (for Batch Gradient Descent):**
To minimize the cost $J$ (the MSE), we take its gradient with respect to the weights ($w$) and the bias ($b$):
*   $\frac{\partial J}{\partial w} = \frac{2}{n} X^T (\hat{y} - y)$
*   $\frac{\partial J}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}^{(i)} - y^{(i)})$
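
A quick way to gain confidence in the gradient formulas above is to compare them with finite differences. The following is a minimal sketch, not part of the original notebook: the helper names (`mse_loss`, `analytic_grads`, `numeric_grads`) and the random test data are assumptions made for illustration.

```python
import numpy as np

def mse_loss(X, y, w, b):
    """J(w, b): mean squared error of the linear model X @ w + b."""
    return np.mean((X @ w + b - y) ** 2)

def analytic_grads(X, y, w, b):
    """Gradients of J computed from the closed-form expressions above."""
    n = len(y)
    error = X @ w + b - y
    dw = (2 / n) * X.T @ error
    db = (2 / n) * np.sum(error)
    return dw, db

def numeric_grads(X, y, w, b, eps=1e-6):
    """Central finite differences, perturbing one parameter at a time."""
    dw = np.zeros_like(w)
    for j in range(w.shape[0]):
        step = np.zeros_like(w)
        step[j, 0] = eps
        dw[j, 0] = (mse_loss(X, y, w + step, b) - mse_loss(X, y, w - step, b)) / (2 * eps)
    db = (mse_loss(X, y, w, b + eps) - mse_loss(X, y, w, b - eps)) / (2 * eps)
    return dw, db

# Hypothetical test problem: 50 samples, 3 features
rng = np.random.default_rng(0)
X_chk = rng.normal(size=(50, 3))
y_chk = rng.normal(size=(50, 1))
w_chk = rng.normal(size=(3, 1))
b_chk = 0.5

dw_a, db_a = analytic_grads(X_chk, y_chk, w_chk, b_chk)
dw_n, db_n = numeric_grads(X_chk, y_chk, w_chk, b_chk)
grad_check_ok = np.allclose(dw_a, dw_n, atol=1e-5) and np.isclose(db_a, db_n, atol=1e-5)
print("Analytic and numeric gradients agree:", grad_check_ok)
```

Because $J$ is quadratic in $w$ and $b$, central differences are accurate up to floating-point rounding, so a tight tolerance is reasonable here.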

## Data

In [None]:
# Synthetic Data Generation (1D)
n_samples = 100
X = 2 * np.random.rand(n_samples, 1)
true_w = 3
true_b = 4
# Equation: y = 3x + 4 + noise
y = true_w * X + true_b + np.random.randn(n_samples, 1)

# Visualization
plt.figure(figsize=(8, 5))
plt.scatter(X, y, alpha=0.6, label='Data points')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Synthetic Data (y = 3x + 4 + noise)')
plt.legend()
plt.show()

## Implementation (NumPy)

In [None]:
def predict(X, w, b):
    """Computes the prediction y_hat = Xw + b."""
    return X.dot(w) + b

def mse(y_true, y_pred):
    """Computes Mean Squared Error."""
    return np.mean((y_true - y_pred)**2)

def r2_score(y_true, y_pred):
    """Computes R^2 Score."""
    ss_res = np.sum((y_true - y_pred)**2)
    ss_tot = np.sum((y_true - np.mean(y_true))**2)
    return 1 - (ss_res / ss_tot)

def fit_gd(X, y, lr=0.01, epochs=1000, normalize=False):
    """
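    Fits Linear Regression using Batch Gradient Descent.
    Returns: w, b, history, (mean, std), where history holds the per-epoch MSE
    and (mean, std) are the normalization statistics ((0, 1) if normalize=False).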
    """
    m = len(y)
    history = []
    
    # Handle Normalization
    X_train = np.copy(X)
    mean, std = 0, 1
    if normalize:
        mean = np.mean(X, axis=0)
        std = np.std(X, axis=0)
        std = np.where(std == 0, 1.0, std)  # guard against constant features
        X_train = (X - mean) / std
    
    # Random initialization
    w = np.random.randn(X.shape[1], 1)
    b = np.random.randn(1)
    
    for epoch in range(epochs):
        # Prediction
        y_hat = predict(X_train, w, b)
        
        # Error
        error = y_hat - y
        
        # Gradients
        dw = (2/m) * X_train.T.dot(error)
        db = (2/m) * np.sum(error)
        
        # Update
        w -= lr * dw
        b -= lr * db
        
        # Record the MSE (computed from the predictions made before this update)
        loss = mse(y, y_hat)
        history.append(loss)
        
        # Safety Check for Divergence
        if np.isnan(loss) or loss > 1e10:
            print(f"Warning: Divergence detected at epoch {epoch}. Stopping.")
            break

    # If normalized, we need to adjust weights back to original scale for interpretation
    # OR we just return the raw params and remember to normalize inputs during inference.
    # For simplicity here, we return the params learned on the (potentially) normalized data
    # and return the stats so the user can normalize future inputs.
    return w, b, history, (mean, std)
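
The comments at the end of `fit_gd` mention the alternative of mapping the learned parameters back to the original feature scale. A minimal sketch of that conversion follows; `to_original_scale` is a hypothetical helper, not something the notebook defines, and the demo values are made up. It relies on the identity $\frac{x - \mu}{\sigma} w + b = x \frac{w}{\sigma} + \left(b - \frac{\mu w}{\sigma}\right)$.

```python
import numpy as np

def to_original_scale(w_norm, b_norm, mean, std):
    """Map parameters learned on (X - mean) / std back to raw-X units."""
    std = np.asarray(std, dtype=float).reshape(-1, 1)
    mean = np.asarray(mean, dtype=float).reshape(-1, 1)
    w_raw = w_norm / std                   # each weight is divided by its feature's std
    b_raw = b_norm - np.sum(mean * w_raw)  # the bias absorbs the mean shift
    return w_raw, b_raw

# Hypothetical parameters and data, only to check that both forms predict the same values
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 2, size=(20, 2))
mean_demo, std_demo = X_demo.mean(axis=0), X_demo.std(axis=0)
w_norm = np.array([[1.5], [-0.7]])
b_norm = np.array([4.0])

w_raw, b_raw = to_original_scale(w_norm, b_norm, mean_demo, std_demo)
pred_norm = ((X_demo - mean_demo) / std_demo) @ w_norm + b_norm
pred_raw = X_demo @ w_raw + b_raw
print("Predictions match:", np.allclose(pred_norm, pred_raw))
```

With the converted parameters, raw inputs can be used directly at inference time, and the coefficients become comparable to ones fitted on unnormalized data.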

## Closed-Form Solution (Validation)
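We validate the gradient-descent implementation against the Normal Equation, $\theta = (X^T X)^{-1} X^T y$. Here $X$ is augmented with a leading column of ones, so $\theta$ stacks the bias $b$ on top of the weights $w$.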

In [None]:
# Add bias term to X for closed-form solution (X_b = [1, X])
X_b = np.c_[np.ones((len(X), 1)), X]

# Calculate using pseudo-inverse (stable alternative to inv)
theta_best = np.linalg.pinv(X_b).dot(y)

b_closed = theta_best[0][0]
w_closed = theta_best[1:]

print(f"Closed-form solution:\n Bias (b): {b_closed:.4f}\n Weights (w): {w_closed.flatten()}")

# Validation Metrics
y_closed_pred = X.dot(w_closed) + b_closed
print(f" MSE: {mse(y, y_closed_pred):.4f}")
print(f" R2:  {r2_score(y, y_closed_pred):.4f}")

# When to use which
print("\nUse Case: Closed-form is exact and fast when the number of features is small. GD scales better to many features, very large datasets, or online learning.")
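
The cell above uses the pseudo-inverse. As a cross-check, the same least-squares problem can be solved in two other ways that NumPy provides: solving the Normal Equation as a linear system with `np.linalg.solve`, and calling `np.linalg.lstsq`. This is a sketch on freshly generated data of the same form as the notebook's; the `*_demo` names are assumptions so that the block runs on its own.

```python
import numpy as np

# Synthetic data of the same form as above: y = 3x + 4 + noise
rng = np.random.default_rng(0)
X_demo = 2 * rng.random((100, 1))
y_demo = 3 * X_demo + 4 + rng.normal(size=(100, 1))

# Augment with a column of ones for the bias
X_b_demo = np.c_[np.ones((len(X_demo), 1)), X_demo]

# 1) Normal Equation, solved as a linear system instead of forming an explicit inverse
theta_normal = np.linalg.solve(X_b_demo.T @ X_b_demo, X_b_demo.T @ y_demo)

# 2) Pseudo-inverse, as in the cell above
theta_pinv = np.linalg.pinv(X_b_demo) @ y_demo

# 3) Dedicated least-squares solver
theta_lstsq, _, rank, _ = np.linalg.lstsq(X_b_demo, y_demo, rcond=None)

print("solve vs pinv agree: ", np.allclose(theta_normal, theta_pinv))
print("pinv vs lstsq agree: ", np.allclose(theta_pinv, theta_lstsq))
print("rank of design matrix:", rank)
```

When the design matrix has full column rank, all three routes give the same coefficients; `pinv` and `lstsq` additionally return a minimum-norm solution when it does not.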

## Experiments

In [None]:
configs = [
    {"name": "Small LR, No Norm", "lr": 0.01, "norm": False, "epochs": 100},
    {"name": "Med LR, Norm", "lr": 0.1, "norm": True, "epochs": 100},
    {"name": "Huge LR (Divergence)", "lr": 1.1, "norm": False, "epochs": 20} # Will likely explode
]

plt.figure(figsize=(15, 5))

for i, config in enumerate(configs):
    print(f"Running: {config['name']}")
    w, b, hist, stats = fit_gd(X, y, lr=config['lr'], epochs=config['epochs'], normalize=config['norm'])
    
    # Metrics on final model
    # Note: For prediction, if we normalized, we must normalize X using the same stats
    mean, std = stats
    X_eval = (X - mean) / std if config['norm'] else X
    y_pred = predict(X_eval, w, b)
    
    final_mse = mse(y, y_pred)
    
    # Plotting
    plt.subplot(1, 3, i+1)
    plt.plot(hist)
    plt.title(f"{config['name']} (MSE: {final_mse:.2f})")
    plt.xlabel('Epochs')
    plt.ylabel('MSE Loss')
    if config['lr'] > 1.0:
        plt.yscale('log') # Log scale for divergence
    plt.grid(True)

plt.tight_layout()
plt.show()
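
Why a learning rate of 1.1 diverges can be estimated before running anything. For the MSE cost, the Hessian with respect to $(b, w)$ is $\frac{2}{n} X_b^T X_b$, and batch gradient descent is stable only when the learning rate is below $2 / \lambda_{max}$ of that Hessian. The sketch below computes this bound; `max_stable_lr` is a hypothetical helper, and the data is regenerated with a different seed than the notebook's, so the exact numbers will differ slightly.

```python
import numpy as np

def max_stable_lr(X):
    """Stability bound for batch GD on MSE: 2 / (largest Hessian eigenvalue)."""
    n = X.shape[0]
    X_b = np.c_[np.ones((n, 1)), X]            # bias column plus features
    hessian = (2 / n) * X_b.T @ X_b            # Hessian of the MSE w.r.t. (b, w)
    lam_max = np.linalg.eigvalsh(hessian)[-1]  # eigvalsh returns eigenvalues in ascending order
    return 2 / lam_max

rng = np.random.default_rng(0)
X_demo = 2 * rng.random((100, 1))              # same distribution as the synthetic X
X_demo_norm = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

lr_raw = max_stable_lr(X_demo)
lr_norm = max_stable_lr(X_demo_norm)
print(f"raw features:        lr must stay below about {lr_raw:.3f}")
print(f"normalized features: lr must stay below about {lr_norm:.3f}")
```

For a single standardized feature the Hessian reduces to $2I$, so the bound is 1.0; on the raw features the bound is lower. Under these assumptions, `lr=1.1` exceeds the bound in both settings, which is consistent with the divergence seen in the third experiment.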

## Results & Takeaways
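*   **Normalization**: Standardizing the features improves the conditioning of the problem, so gradient descent can reach a low MSE in fewer epochs. Note that the second experiment changes both normalization and LR (0.1 vs. 0.01) relative to the first, so the plots show their combined effect rather than the effect of normalization alone.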
*   **Learning Rate (LR)**: 
    *   Too low: Convergence is slow.
    *   Too high: The loss oscillates or diverges (explodes) as seen in the third experiment.
    *   Just right: Smooth, rapid decrease in loss.
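*   **GD vs Closed-Form**: The GD solution approaches the Closed-Form coefficients given enough epochs and an appropriate LR. When `normalize=True`, the returned `w` and `b` are expressed in normalized-feature units, so they must be converted to the original scale before being compared with `w_closed` and `b_closed`.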
*   **Linear Limitation**: This model works well here because the data is linear ($y \approx 3x + 4$). It would fail to capture non-linear patterns (like parabolas) without feature engineering.
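
To illustrate the last point, here is a minimal sketch of feature engineering on a hypothetical parabola (the data, the coefficients, and the helpers `fit_closed_form` and `r2` are all assumptions for illustration). The model stays linear in its parameters; only the feature matrix gains an $x^2$ column.

```python
import numpy as np

# Hypothetical non-linear data: y = 1.5x^2 - x + 2 + small noise
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(200, 1))
y_quad = 1.5 * x**2 - x + 2 + 0.1 * rng.normal(size=(200, 1))

def fit_closed_form(features, target):
    """Least-squares fit with a bias column; returns the fitted predictions."""
    A = np.c_[np.ones((len(features), 1)), features]
    theta, *_ = np.linalg.lstsq(A, target, rcond=None)
    return A @ theta

def r2(target, pred):
    """Coefficient of determination."""
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1 - ss_res / ss_tot

r2_linear = r2(y_quad, fit_closed_form(x, y_quad))                # features: [x]
r2_poly = r2(y_quad, fit_closed_form(np.c_[x, x**2], y_quad))     # features: [x, x^2]
print(f"R2 with [x]: {r2_linear:.3f}   R2 with [x, x^2]: {r2_poly:.3f}")
```

The same `fit_gd` routine could be applied to the expanded feature matrix, although normalization matters more there because $x$ and $x^2$ live on different scales.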

## Next Steps
*   Implement **Logistic Regression** for binary classification.
*   [Go to Logistic Regression Notebook](./logistic-regression-implementation.ipynb)