# Linear Regression and Gradient Descent

## Linear Regression (Analytical Solution)
Linear regression aims to find the best-fit line by minimizing the error between the predicted and actual values. The analytical approach involves directly calculating the parameters (weights) using matrix algebra.

### 1. Problem Formulation
- Model: $ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p + \epsilon $
  - $ Y $: Dependent variable (target)
  - $ X_i $: Independent variables (features)
  - $ \beta_0, \beta_1, ..., \beta_p $: Coefficients/parameters
  - $ \epsilon $: Error term

### 2. Objective: Minimize the Cost Function
The cost function for linear regression is the Mean Squared Error (MSE), conventionally scaled by $ \frac{1}{2} $ so that the factor of 2 cancels when differentiating:

$ J(\beta) = \frac{1}{2m} \sum_{i=1}^{m} (Y^{(i)} - h_\beta(X^{(i)}))^2 $

- $ m $: Number of data points
- $ h_\beta(X^{(i)}) = \beta_0 + \sum_{j=1}^{p} \beta_j X_j^{(i)} $: Prediction for the $ i $-th data point
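As a minimal sketch, the cost can be evaluated in NumPy as below. The function name `cost` and the three-point toy data set are illustrative assumptions, and the feature matrix is assumed to already carry a leading column of 1s so that $ \beta_0 $ is handled like any other coefficient.

```python
import numpy as np

def cost(X, y, beta):
    """Half mean squared error: J(beta) = 1/(2m) * sum((y - X @ beta)^2)."""
    m = len(y)
    residuals = y - X @ beta
    return residuals @ residuals / (2 * m)

# Illustrative toy data: y = 2x exactly, first column of X is the intercept term.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

J_zero = cost(X, y, np.array([0.0, 0.0]))  # every prediction is 0
J_true = cost(X, y, np.array([0.0, 2.0]))  # perfect fit, so the cost is 0
print(J_zero, J_true)
```

The cost is large when the parameters are far from the data-generating line and drops to zero at a perfect fit, which is the quantity both solution methods try to minimize.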

### 3. Analytical Solution Using the Normal Equation
We solve the regression equation analytically using the Normal Equation:

$ \beta = (X^T X)^{-1} X^T Y $

- $ X $: Feature matrix (with a column of 1s for the intercept)
- $ Y $: Target vector
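- $ X^T X $: Gram matrix of the features (proportional to the covariance matrix only when the feature columns are mean-centered)
- $ (X^T X)^{-1} $: Inverse of the Gram matrix (exists only when the columns of $ X $ are linearly independent)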

**Steps**:
1. Compute $ X^T X $
2. Compute the inverse of $ X^T X $: $ (X^T X)^{-1} $
3. Compute $ X^T Y $
4. Multiply these matrices to get $ \beta $
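The four steps above can be sketched in NumPy as follows. The toy data (exactly $ y = 2x $) and the variable names are illustrative assumptions. The last line shows `np.linalg.solve` as a swapped-in alternative that solves $ X^T X \beta = X^T Y $ without forming the inverse; it is not part of the steps as written, but is usually the safer choice numerically.

```python
import numpy as np

# Illustrative toy data: y = 2x exactly.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Feature matrix with a column of 1s for the intercept.
X = np.column_stack([np.ones_like(x), x])

XtX = X.T @ X                    # Step 1: X^T X
XtX_inv = np.linalg.inv(XtX)     # Step 2: (X^T X)^{-1}
Xty = X.T @ y                    # Step 3: X^T Y
beta_inv = XtX_inv @ Xty         # Step 4: beta = (X^T X)^{-1} X^T Y

# Alternative (assumption, not in the steps above): solve the linear system directly.
beta_solve = np.linalg.solve(XtX, Xty)

print(beta_inv, beta_solve)
```

Both routes should agree up to rounding; for this data the intercept is 0 and the slope is 2.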

### Why Gradient Descent Instead of the Analytical Solution?
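The analytical solution works well when the number of features is small, but:
- Forming $ X^T X $ costs $ O(m p^2) $ and inverting it costs $ O(p^3) $, which becomes expensive as the number of features $ p $ grows
- Gradient Descent provides an iterative approach whose cost per iteration is only $ O(mp) $, so it scales to large datasets while still working for small ones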

## Gradient Descent for Linear Regression

### 1. Initialize Parameters
Set initial values for all coefficients $ \beta_j $ (e.g., 0 or small random values):

$ \beta_0, \beta_1, ..., \beta_p $

Choose:
- Learning rate $ \alpha $: A small value (e.g., 0.01 or 0.001) controlling step size
- Number of iterations: A fixed number of steps or until convergence

### 2. Compute Predictions
For each data point $ i $, calculate the prediction $ h_\beta(X^{(i)}) $:

$ h_\beta(X^{(i)}) = \beta_0 + \sum_{j=1}^{p} \beta_j X_j^{(i)} $

### 3. Calculate the Cost Function
Evaluate the cost function $ J(\beta) $:

$ J(\beta) = \frac{1}{2m} \sum_{i=1}^{m} (Y^{(i)} - h_\beta(X^{(i)}))^2 $

### 4. Compute the Gradient
The gradient collects the partial derivatives of the cost function with respect to each parameter $ \beta_j $ (with $ X_0^{(i)} = 1 $ for the intercept, $ j = 0 $):

$ \frac{\partial J}{\partial \beta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_\beta(X^{(i)}) - Y^{(i)}) X_j^{(i)} $
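One way to sanity-check this formula is to compare it against a finite-difference approximation of $ J $. The sketch below assumes a feature matrix with a leading column of 1s; the helper names (`cost`, `gradient`, `numerical_gradient`), the toy data, and the test point `beta` are all hypothetical.

```python
import numpy as np

def cost(X, y, beta):
    """J(beta) = 1/(2m) * sum((X @ beta - y)^2)."""
    r = X @ beta - y
    return r @ r / (2 * len(y))

def gradient(X, y, beta):
    """Analytic gradient: (1/m) * X^T (X @ beta - y), one entry per beta_j."""
    return X.T @ (X @ beta - y) / len(y)

def numerical_gradient(X, y, beta, eps=1e-6):
    """Central finite differences, perturbing one parameter at a time."""
    grad = np.zeros_like(beta)
    for j in range(len(beta)):
        step = np.zeros_like(beta)
        step[j] = eps
        grad[j] = (cost(X, y, beta + step) - cost(X, y, beta - step)) / (2 * eps)
    return grad

# Illustrative toy data and an arbitrary point at which to compare gradients.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
beta = np.array([0.5, -1.0])

analytic = gradient(X, y, beta)
numeric = numerical_gradient(X, y, beta)
print(analytic, numeric)
```

If the two vectors agree to several decimal places, the sign and the $ \frac{1}{m} $ scaling of the analytic gradient are consistent with the cost function.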

### 5. Update Parameters
Update each parameter $ \beta_j $ using the gradient descent rule:

$ \beta_j = \beta_j - \alpha \cdot \frac{\partial J}{\partial \beta_j} $

Apply this to every $ j $ (from 0 to $ p $), computing all gradients from the current predictions before changing any parameter (a simultaneous update):

$ \beta_0 = \beta_0 - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} (h_\beta(X^{(i)}) - Y^{(i)}) $

$ \beta_1 = \beta_1 - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} (h_\beta(X^{(i)}) - Y^{(i)}) X_1^{(i)} $

$ \beta_2 = \beta_2 - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} (h_\beta(X^{(i)}) - Y^{(i)}) X_2^{(i)} $

### 6. Repeat Until Convergence
Repeat steps 2–5 for a fixed number of iterations or until the parameters stabilize. Check:
- The cost function $ J(\beta) $ decreases consistently
- The gradients $ \frac{\partial J}{\partial \beta_j} $ approach zero
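Both checks can be monitored by recording the cost and the gradient norm at every iteration. The following is a minimal sketch on illustrative toy data ($ y = 2x $); the list names `costs` and `grad_norms` and the settings `alpha = 0.01` with 1000 iterations are assumptions chosen for illustration.

```python
import numpy as np

# Illustrative toy data with an intercept column.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
X = np.column_stack([np.ones_like(x), x])
m = len(y)

beta = np.zeros(X.shape[1])
alpha = 0.01
costs, grad_norms = [], []

for _ in range(1000):
    residuals = X @ beta - y
    grad = X.T @ residuals / m
    costs.append(residuals @ residuals / (2 * m))   # J(beta) before the update
    grad_norms.append(np.linalg.norm(grad))         # size of the gradient
    beta -= alpha * grad

# Check 1: the cost never increases from one iteration to the next.
cost_decreased = all(b <= a for a, b in zip(costs, costs[1:]))
# Check 2: the gradient is much smaller at the end than at the start.
print(cost_decreased, grad_norms[0], grad_norms[-1])
```

If the cost ever rises between iterations, the learning rate is usually too large; if the gradient norm is still far from zero at the end, more iterations (or a larger stable learning rate) are needed.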

## Python Example: Linear Regression Using Gradient Descent

```python
import numpy as np

# Generate example data (simple linear regression)
X = np.array([1, 2, 3, 4, 5])  # Features
y = np.array([2, 4, 6, 8, 10])  # Labels

# Add intercept term (column of 1s)
X = np.c_[np.ones(X.shape[0]), X]  # X becomes [[1, x1], [1, x2], ...]

# Initialize parameters
theta = np.zeros(X.shape[1])  # [theta_0, theta_1]
alpha = 0.01  # Learning rate
iterations = 1000
m = len(y)  # Number of data points

# Gradient Descent
for _ in range(iterations):
    # Compute predictions
    predictions = X.dot(theta)
    
    # Compute gradient
    gradients = (1 / m) * X.T.dot(predictions - y)
    
    # Update parameters
    theta -= alpha * gradients

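# Final parameters (these approach [0, 2] for this data; more iterations
# or a somewhat larger learning rate bring them closer)
print(f"Final parameters: {theta}")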

# Predictions
y_pred = X.dot(theta)
print(f"Predictions: {y_pred}")
```

## Summary: Steps in Gradient Descent
1. Initialize $ \beta_j $ and set $ \alpha $
2. Compute predictions $ h_\beta(X) $
3. Calculate the cost function $ J(\beta) $
4. Compute gradients for $ \beta_j $
5. Update $ \beta_j $ using the gradient descent rule
6. Repeat until convergence

# Claude answer 

# Linear Regression: Step-by-Step Methods

## Method 1: Direct Solution (Normal Equation)

### Step 1: Set Up the Equation
First, we write our linear regression equation:
$ y = X\beta + \epsilon $

Where:
- $ y $ is the target variable (n×1 vector)
- $ X $ is the feature matrix (n×p matrix, including a column of 1s for intercept)
- $ \beta $ is the coefficient vector (p×1 vector)
- $ \epsilon $ is the error term

### Step 2: Derive the Normal Equation
To find $ \beta $ that minimizes the sum of squared errors:

1. Write the sum of squared errors and expand it:
   $ SSE = (y - X\beta)^T(y - X\beta) = y^T y - 2\beta^T X^T y + \beta^T X^T X\beta $

2. Take the derivative with respect to $ \beta $ and set it to zero:
   $ \frac{\partial SSE}{\partial \beta} = -2X^T y + 2X^T X\beta = 0 $

3. Solve for $ \beta $ (assuming $ X^T X $ is invertible):
   $ X^T X\beta = X^T y $
   $ \beta = (X^T X)^{-1}X^T y $

### Step 3: Solve Step by Step
1. Calculate $ X^T X $:
   - Multiply X transpose by X to get a p×p matrix

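2. Find $ (X^T X)^{-1} $ (by hand, practical only for small matrices):
   - Calculate the determinant of $ X^T X $ (it must be nonzero for the inverse to exist)
   - Find the adjugate matrix
   - Divide the adjugate by the determinant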

3. Calculate $ X^T y $:
   - Multiply X transpose by y vector

4. Final multiplication:
   - Multiply $ (X^T X)^{-1} $ by $ X^T y $

### Example with 2D Data
For simple linear regression ($ y = \beta_0 + \beta_1x $):

$ X = \begin{bmatrix}
1 & x_1 \\
1 & x_2 \\
\vdots & \vdots \\
1 & x_n
\end{bmatrix} $

$ X^T X = \begin{bmatrix}
n & \sum x_i \\
\sum x_i & \sum x_i^2
\end{bmatrix} $

$ X^T y = \begin{bmatrix}
\sum y_i \\
\sum x_i y_i
\end{bmatrix} $
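Solving this 2×2 system by hand gives the familiar closed form $ \beta_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - (\sum x_i)^2} $ and $ \beta_0 = \frac{\sum y_i - \beta_1 \sum x_i}{n} $. The sketch below evaluates these sums on illustrative toy data ($ y = 2x $) and cross-checks the result against a generic linear solve; the variable names are assumptions.

```python
import numpy as np

# Illustrative toy data: y = 2x exactly.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

n = len(x)
Sx, Sy = x.sum(), y.sum()
Sxx, Sxy = (x * x).sum(), (x * y).sum()

# The two matrices from the text, built directly from the sums.
XtX = np.array([[n, Sx], [Sx, Sxx]])
Xty = np.array([Sy, Sxy])

# Closed-form solution of the 2x2 system.
det = n * Sxx - Sx ** 2            # determinant of X^T X
beta1 = (n * Sxy - Sx * Sy) / det  # slope
beta0 = (Sy - beta1 * Sx) / n      # intercept

# Cross-check against a generic solver.
beta_check = np.linalg.solve(XtX, Xty)
print(beta0, beta1, beta_check)
```

The determinant $ n\sum x_i^2 - (\sum x_i)^2 $ is zero only when all $ x_i $ are identical, which is exactly the case where no unique best-fit line exists.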

## Method 2: Gradient Descent

### Step 1: Initialize Parameters
1. Choose initial values for $ \beta $ (usually zeros)
2. Set learning rate $ \alpha $ (e.g., 0.01)
3. Set maximum iterations and convergence threshold

### Step 2: Forward Pass
1. Calculate predictions:
   $ \hat{y} = X\beta $

2. Calculate error:
   $ e = y - \hat{y} $

3. Calculate cost (MSE, scaled by $ \frac{1}{2} $ to simplify the gradient):
   $ J = \frac{1}{2n}\sum_{i=1}^n (y_i - \hat{y}_i)^2 $

### Step 3: Backward Pass
1. Calculate gradients for each parameter:
   $ \frac{\partial J}{\partial \beta} = -\frac{1}{n}X^T(y - X\beta) $

2. Update parameters:
   $ \beta_{new} = \beta_{old} - \alpha \frac{\partial J}{\partial \beta} $

### Step 4: Iterate and Check Convergence
1. Repeat Steps 2-3
2. Stop if either:
   - Maximum iterations reached
   - $ |J_{new} - J_{old}| < \text{threshold} $
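The two stopping rules can be combined in a single loop. This is a minimal sketch rather than a definitive implementation: the function name `gradient_descent`, the defaults `alpha=0.05`, `max_iter=10_000`, `tol=1e-12`, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, max_iter=10_000, tol=1e-12):
    """Gradient descent that stops at max_iter or when |J_new - J_old| < tol."""
    n = len(y)
    beta = np.zeros(X.shape[1])
    J_old = np.inf
    for it in range(1, max_iter + 1):
        residuals = y - X @ beta                 # e = y - y_hat
        J_new = residuals @ residuals / (2 * n)  # cost at the current beta
        if abs(J_old - J_new) < tol:             # convergence threshold reached
            break
        grad = -(X.T @ residuals) / n            # dJ/dbeta = -(1/n) X^T (y - X beta)
        beta = beta - alpha * grad
        J_old = J_new
    return beta, it, J_new

# Illustrative toy data with an intercept column (y = 2x).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
X = np.column_stack([np.ones_like(x), x])

beta, n_iter, J_final = gradient_descent(X, y)
print(beta, n_iter, J_final)
```

A looser threshold stops earlier with a rougher estimate, while a tighter one costs more iterations; the iteration cap guards against a learning rate that is too small to ever meet the threshold.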

## Python Implementation

```python
import numpy as np

class LinearRegression:
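    def __init__(self, method='normal'):
        # `method` is stored as a label only; call fit_normal or
        # fit_gradient_descent explicitly to choose the solver.
        self.method = method
        self.coef_ = None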
        
    def fit_normal(self, X, y):
        # Add column of ones for intercept
        X = np.c_[np.ones(X.shape[0]), X]
        
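        # Solve the normal equation (X^T X) beta = X^T y.
        # np.linalg.solve avoids forming the explicit inverse, which is
        # cheaper and numerically more stable than np.linalg.inv.
        XTX = X.T.dot(X)
        XTy = X.T.dot(y)
        self.coef_ = np.linalg.solve(XTX, XTy)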
        
    def fit_gradient_descent(self, X, y, learning_rate=0.01, max_iter=1000):
        # Add column of ones for intercept
        X = np.c_[np.ones(X.shape[0]), X]
        n_samples = X.shape[0]
        
        # Initialize parameters
        self.coef_ = np.zeros(X.shape[1])
        
        for _ in range(max_iter):
            # Forward pass
            y_pred = X.dot(self.coef_)
            
            # Calculate gradients
            gradients = -(1/n_samples) * X.T.dot(y - y_pred)
            
            # Update parameters
            self.coef_ -= learning_rate * gradients
            
    def predict(self, X):
        # Add column of ones for intercept
        X = np.c_[np.ones(X.shape[0]), X]
        return X.dot(self.coef_)

# Example usage:
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

# Using normal equation
lr_normal = LinearRegression(method='normal')
lr_normal.fit_normal(X, y)
print("Normal Equation Coefficients:", lr_normal.coef_)

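# Using gradient descent (with the default learning rate and iteration count,
# the result is close to, but not exactly equal to, the normal-equation result)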
lr_gd = LinearRegression(method='gradient_descent')
lr_gd.fit_gradient_descent(X, y)
print("Gradient Descent Coefficients:", lr_gd.coef_)
```

## Key Differences Between Methods

### Normal Equation
Advantages:
- Direct solution (no iterations)
- Exact solution
- No learning rate to tune

Disadvantages:
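- Computationally expensive when there are many features: $ O(np^2) $ to form $ X^T X $ and $ O(p^3) $ to invert it, for $ n $ samples and $ p $ features
- Can be numerically unstable when $ X^T X $ is ill-conditioned (e.g., nearly collinear features)
- Requires $ X^T X $ to be invertible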

### Gradient Descent
Advantages:
- Works well with large datasets
- Memory efficient
- Can be parallelized

Disadvantages:
- Requires learning rate tuning
- May need many iterations
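- May converge to a local minimum in general (not an issue for linear regression, whose cost function is convex)
- Returns an approximation whose accuracy depends on the learning rate and the number of iterations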