> **NOTE** Linear regression **updates weights and bias using Gradient Descent**, which utilizes the gradients of the loss function with respect to parameters to iteratively adjust parameters, reducing prediction error (loss).  
> 
> Assume your linear regression model is:  
> $$
> \hat{y} = X W + b
> $$
> * $X$ is the input feature matrix ($N \times D_{in}$)
> * $W$ is the weight matrix ($D_{in} \times D_{out}$)
> * $b$ is the bias vector ($D_{out}$)
> * $\hat{y}$ is the predicted output  
>  
> The loss function uses Mean Squared Error (MSE):
> $$
> L = \frac{1}{N} \sum_{i=1}^N (\hat{y}_i - y_i)^2
> $$  
> ---
> **Update Formula (Gradient Descent)**
> 1. Calculate weight gradient:
> $$
> \frac{\partial L}{\partial W} = \frac{2}{N} X^\top (XW + b - y)
> $$
> 2. Calculate bias gradient:
> $$
> \frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^N (\hat{y}_i - y_i)
> $$
> 3. Update parameters using learning rate $\eta$:
> $$
> W := W - \eta \frac{\partial L}{\partial W}
> $$
> $$
> b := b - \eta \frac{\partial L}{\partial b}
> $$
> ---
> **Implementation Principle in PyTorch**
> * **Forward pass**: Compute prediction $\hat{y}$
> * **Calculate loss**: MSE loss
> * **Backward pass**: Call `loss.backward()` to automatically compute gradients of $W$ and $b$ (i.e., the partial derivatives above)
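> * **Parameter update**: Update parameters based on gradients, either with an optimizer (e.g., `torch.optim.SGD`) or manually. Manual in-place updates MUST be wrapped in `with torch.no_grad()` so that autograd does not track them.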
> ---
> 
> **Summary**
> * Linear regression weight and bias updates are gradient descent steps: each step moves the parameters a small distance along the negative gradient direction.
> * The goal is to make model predictions closer to true labels, reducing loss.
> * PyTorch automatically handles differentiation; you just need to call `.backward()` and then update the parameters manually (or automatically, with an optimizer).
> * During **forward pass and loss computation**, we need gradients → **autograd on**.
> * During **parameter updates** (like SGD step), we don’t want to track those operations → **autograd off**.
> * ✅ Always **zero out** gradients before the next `.backward()` call, because PyTorch accumulates gradients in `.grad` instead of overwriting them.
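
The update formulas in the note above can be checked against autograd. The following is a minimal sketch, assuming small arbitrary shapes and random data chosen only for illustration: it computes the MSE gradients once with `loss.backward()` and once with the closed-form expressions, then compares the two.

```python
import torch

torch.manual_seed(0)
N, D_in, D_out = 8, 3, 1  # small, arbitrary sizes (assumption for illustration)
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)
W = torch.randn(D_in, D_out, requires_grad=True)
b = torch.zeros(D_out, requires_grad=True)

# Gradients computed by autograd for the MSE loss
loss = ((X @ W + b - y) ** 2).mean()
loss.backward()

# Gradients computed from the analytic formulas
with torch.no_grad():
    residual = X @ W + b - y                   # shape (N, D_out)
    grad_W = (2.0 / N) * (X.T @ residual)      # shape (D_in, D_out)
    grad_b = (2.0 / N) * residual.sum(dim=0)   # shape (D_out,)

w_match = torch.allclose(W.grad, grad_W, atol=1e-5)
b_match = torch.allclose(b.grad, grad_b, atol=1e-5)
print("dL/dW matches:", w_match)
print("dL/db matches:", b_match)
```

With `D_out = 1` the mean runs over exactly `N` terms, which is where the factor `2 / N` comes from.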




In [None]:
import torch

class LinearRegression():
    def __init__(self, input_dim, output_dim, learning_rate=0.01):
        # Initialize W and b, set requires_grad=True for autograd
        self.weights = torch.randn(input_dim, output_dim, requires_grad=True)
        self.bias = torch.zeros(output_dim, requires_grad=True)
        self.lr = learning_rate
    
    def forward(self, X):
        return X @ self.weights + self.bias  # Be careful with the order of matrix multiplication @
    
    def mse_loss(self, predictions, targets):
        return (predictions - targets).pow(2).mean()  # Mean Squared Error loss
    
    def step(self):
        # SGD step to update weights and bias
        with torch.no_grad():
            self.weights -= self.lr * self.weights.grad
            self.bias -= self.lr * self.bias.grad

            # Zero the gradients after updating
            self.weights.grad.zero_()
            self.bias.grad.zero_()
    
    def train(self, X, y, epochs=100):
        for epoch in range(epochs):
            # Forward pass to get predictions
            y_pred = self.forward(X)

            # Compute loss
            loss = self.mse_loss(y_pred, y)
            
            # Backward pass
            loss.backward()

            # Update weights and bias
            self.step()

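            if (epoch + 1) % 10 == 0 or epoch == 0:
                print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item():.4f}")

    def predict(self, X):
        # Inference only, so gradient tracking is not needed
        with torch.no_grad():
            return self.forward(X)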
            


In [None]:
# Synthetic data: y = 2 * x + 3
torch.manual_seed(0)
x_train = torch.randn(100, 1)
y_train = 2 * x_train + 3 + 0.1 * torch.randn(100, 1)  # Adding some noise

model = LinearRegression(input_dim=1, output_dim=1, learning_rate=0.1)
model.train(x_train, y_train, epochs=100)

# test the model
x_test = torch.tensor([[4.0]])
y_pred = model.predict(x_test)
print(f"Prediction for x=4.0: {y_pred.item():.4f}")


Epoch 1/100, Loss: 19.7387
Epoch 10/100, Loss: 0.2763
Epoch 20/100, Loss: 0.0107
Epoch 30/100, Loss: 0.0084
Epoch 40/100, Loss: 0.0084
Epoch 50/100, Loss: 0.0084
Epoch 60/100, Loss: 0.0084
Epoch 70/100, Loss: 0.0084
Epoch 80/100, Loss: 0.0084
Epoch 90/100, Loss: 0.0084
Epoch 100/100, Loss: 0.0084
Prediction for x=4.0: 10.9684
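
The same training loop can also be written with PyTorch's built-in pieces, so that the optimizer performs the parameter update. This is a sketch rather than part of the original class: the choice of `nn.Linear`, `nn.MSELoss`, and `torch.optim.SGD` is an assumption made here for illustration, and the data mirrors the synthetic example above.

```python
import torch
from torch import nn

torch.manual_seed(0)
x_train = torch.randn(100, 1)
y_train = 2 * x_train + 3 + 0.1 * torch.randn(100, 1)

model = nn.Linear(in_features=1, out_features=1)  # holds W and b as parameters
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()                     # clear gradients from the previous step
    loss = loss_fn(model(x_train), y_train)   # forward pass and MSE loss
    loss.backward()                           # autograd fills .grad for W and b
    optimizer.step()                          # gradient descent update

learned_w = model.weight.item()
learned_b = model.bias.item()
print(f"w = {learned_w:.3f}, b = {learned_b:.3f}")
```

Here `optimizer.zero_grad()` and `optimizer.step()` take over the roles of the manual zeroing and the `torch.no_grad()` update in `step()`.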


> **TIP** Or you can use `LinearRegression()` from `sklearn`.

In [1]:
from sklearn.linear_model import LinearRegression
import numpy as np

# Generate dummy data
np.random.seed(0)
X = np.random.randn(100, 3)
true_w = np.array([2.0, -3.0, 1.0])
true_b = 0.5
y = X @ true_w + true_b + 0.1 * np.random.randn(100)

# Define and fit model
model = LinearRegression()
model.fit(X, y)

# Predictions
y_pred = model.predict(X)

# Parameters
print("Learned weights:", model.coef_)
print("Learned bias:", model.intercept_)


Learned weights: [ 1.99579429 -3.00505965  1.00414443]
Learned bias: 0.48179536115176136
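
Unlike the gradient descent loop, scikit-learn's `LinearRegression` solves the least-squares problem directly. As a rough sketch of that idea (not scikit-learn's actual code path), the same fit can be reproduced with `np.linalg.lstsq` by appending a column of ones for the bias. The names `X_aug`, `w_hat`, and `b_hat` are introduced here for illustration only.

```python
import numpy as np

# Same dummy data as in the scikit-learn example
np.random.seed(0)
X = np.random.randn(100, 3)
true_w = np.array([2.0, -3.0, 1.0])
true_b = 0.5
y = X @ true_w + true_b + 0.1 * np.random.randn(100)

# Append a column of ones so the bias is learned as one extra weight
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
solution, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = solution[:3], solution[3]

print("Least-squares weights:", w_hat)
print("Least-squares bias:", b_hat)
```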


### Regularization: L1 vs L2

Regularization limits the complexity of a model and helps prevent overfitting by adding a penalty on the weights to the loss.

#### L1 (Lasso)

**Penalty term:**

$$
\| \boldsymbol{w} \|_1 = \sum_{j=1}^{p} |w_j|
$$

**Objective function for linear regression** ($n$ samples, $p$ features, regularization strength $\lambda \ge 0$):

$$
\mathcal{L}_{\text{Lasso}} = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \boldsymbol{w}^\top \boldsymbol{x}_i)^2 + \lambda \sum_{j=1}^{p} |w_j|
$$


#### L2 (Ridge)

**Penalty term:**

$$
\| \boldsymbol{w} \|_2^2 = \sum_{j=1}^{p} w_j^2
$$

**Objective function for linear regression:**

$$
\mathcal{L}_{\text{Ridge}} = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \boldsymbol{w}^\top \boldsymbol{x}_i)^2 + \lambda \sum_{j=1}^{p} w_j^2
$$
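
To make the two objectives concrete, here is a small NumPy sketch that evaluates them for a given weight vector. The function names and the tiny dataset are made up for illustration; the bias term is omitted, as in the formulas above.

```python
import numpy as np

def squared_error_term(w, X, y):
    # (1 / 2n) * sum of squared residuals, shared by both objectives
    n = X.shape[0]
    return np.sum((y - X @ w) ** 2) / (2 * n)

def lasso_objective(w, X, y, lam):
    # Squared error plus lambda times the L1 norm of w
    return squared_error_term(w, X, y) + lam * np.sum(np.abs(w))

def ridge_objective(w, X, y, lam):
    # Squared error plus lambda times the squared L2 norm of w
    return squared_error_term(w, X, y) + lam * np.sum(w ** 2)

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0, -2.0])

lasso_val = lasso_objective(w, X, y, lam=0.1)
ridge_val = ridge_objective(w, X, y, lam=0.1)
print(f"Lasso objective: {lasso_val:.2f}")  # 4.00 + 0.1 * 3 = 4.30
print(f"Ridge objective: {ridge_val:.2f}")  # 4.00 + 0.1 * 5 = 4.50
```

The residuals are `[0, 4]`, so the shared error term is `16 / 4 = 4`. The two objectives differ only in the penalty: `|1| + |-2| = 3` for L1 and `1 + 4 = 5` for L2.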


#### 🔍 Comparison
| Property                | L1 (Lasso)                             | L2 (Ridge)                            |
|-------------------------|----------------------------------------|----------------------------------------|
| Objective               | Feature selection (sparsity)           | Feature shrinkage (stability)          |
| Sparse solution         | ✅ (some weights become 0)             | ❌ (weights are usually non-zero)      |
| Handling correlated features | Tends to keep one feature from a correlated group, somewhat arbitrarily | Shrinks correlated features toward similar weights |
| Use cases               | High-dimensional sparse data, interpretability | Multicollinearity, generalization     |
| Convex optimization     | ✅                                      | ✅                                      |
| Penalty smoothness      | Not differentiable at 0 (solved with coordinate descent or subgradients) | Smooth everywhere (closed-form solution exists) |

---
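
The sparsity row of the comparison can be observed on synthetic data. The sketch below assumes a made-up setup of 10 features, of which only the first three affect the target, and counts how many learned weights are exactly zero under each penalty. The `alpha` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[:3] = [2.0, -3.0, 1.0]  # only the first three features matter
y = X @ true_w + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count weights that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
print("Zero weights with Lasso:", n_zero_lasso)
print("Zero weights with Ridge:", n_zero_ridge)
```

Lasso is expected to set most of the irrelevant weights to exactly zero, while Ridge only shrinks them toward zero.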

#### Use `scikit-learn`

```python
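from sklearn.linear_model import Lasso, Ridge

# Assumption: reuse X and y (NumPy arrays) from the LinearRegression example above
X_train, y_train = X, y

# alpha sets the regularization strength; larger values shrink the weights more
lasso = Lasso(alpha=0.1)  # L1 penalty
ridge = Ridge(alpha=1.0)  # L2 penalty

lasso.fit(X_train, y_train)
ridge.fit(X_train, y_train)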
