### Mean Squared Error (MSE) Loss Function
 
The **Mean Squared Error (MSE)** measures the average squared difference between the actual target values $y_i$ and the predicted values $\hat{y}_i$ over $n$ samples:
 
$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 
$$
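
As a quick numeric check of the formula, the sketch below computes the MSE twice: once as an explicit sum and once with `np.mean`. The values in `y_true` and `y_pred` are assumptions chosen purely for illustration.

```python
import numpy as np

# Illustrative values only (assumed for this example)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

n = len(y_true)

# Direct translation of the formula: (1/n) * sum_i (y_i - y_hat_i)^2
mse_sum = sum((y_true[i] - y_pred[i]) ** 2 for i in range(n)) / n

# Vectorized equivalent
mse_vec = np.mean((y_true - y_pred) ** 2)

# Squared errors are 0.25, 0, 4, 1, so both values are 5.25 / 4 = 1.3125
print(mse_sum, mse_vec)
```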
 
### Deriving the Gradient of MSE with Respect to $ \mathbf{w} $ 
1. **Expressing MSE in Matrix Form:**
 
$$
\text{MSE} = \frac{1}{n} (\mathbf{y} - \mathbf{\hat{y}})^\top (\mathbf{y} - \mathbf{\hat{y}})
$$

Substituting $\mathbf{\hat{y}} = \mathbf{Xw}$, where $\mathbf{X}$ is the feature matrix and $\mathbf{w}$ is the weight vector:
 
$$
\text{MSE} = \frac{1}{n} (\mathbf{y} - \mathbf{Xw})^\top (\mathbf{y} - \mathbf{Xw})
$$
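
To see that the matrix form agrees with the element-wise definition, the following sketch evaluates both for one weight vector. The data `X`, `y` and the weights `w` are assumed values for illustration, not a fitted solution.

```python
import numpy as np

# Assumed toy data: 4 samples, 2 features (the second column acts as a bias term)
X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]])
y = np.array([2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, 0.25])  # arbitrary weights, chosen only for the check

n = len(y)
residual = y - X @ w

# Matrix form: (1/n) (y - Xw)^T (y - Xw)
mse_matrix = (residual @ residual) / n

# Element-wise form: (1/n) sum_i (y_i - y_hat_i)^2
mse_elementwise = np.mean((y - X @ w) ** 2)

print(np.isclose(mse_matrix, mse_elementwise))
```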

2. **Differentiation Step-by-Step:**

- **Expand the Expression:**
 
$$
\text{MSE} = \frac{1}{n} \left[\mathbf{y}^\top \mathbf{y} - \mathbf{y}^\top \mathbf{Xw} - \mathbf{w}^\top \mathbf{X}^\top \mathbf{y} + \mathbf{w}^\top \mathbf{X}^\top \mathbf{Xw}\right]
$$

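- **Differentiating Each Term:**

The two cross terms are scalars and transposes of each other, so $\mathbf{y}^\top \mathbf{Xw} = \mathbf{w}^\top \mathbf{X}^\top \mathbf{y}$ and together they equal $-2\mathbf{w}^\top \mathbf{X}^\top \mathbf{y}$.

- $\frac{\partial}{\partial \mathbf{w}}(\mathbf{y}^\top \mathbf{y}) = 0$
- $\frac{\partial}{\partial \mathbf{w}}(-2\mathbf{w}^\top \mathbf{X}^\top \mathbf{y}) = -2\mathbf{X}^\top \mathbf{y}$
- $\frac{\partial}{\partial \mathbf{w}}(\mathbf{w}^\top \mathbf{X}^\top \mathbf{Xw}) = 2\mathbf{X}^\top \mathbf{Xw}$ (because $\mathbf{X}^\top \mathbf{X}$ is symmetric)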

- **Combine the Derivatives:**

$$
\nabla_{\mathbf{w}}\text{MSE} = \frac{1}{n} \left[0 - 2\mathbf{X}^\top \mathbf{y} + 2\mathbf{X}^\top \mathbf{Xw}\right]
$$
 
Factoring out $2\mathbf{X}^\top$, this simplifies to:

$$
\nabla_{\mathbf{w}}\text{MSE} = \frac{2}{n} \mathbf{X}^\top (\mathbf{Xw} - \mathbf{y})
$$
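
One way to gain confidence in this result is to compare it with a finite-difference approximation. The sketch below does that on a small random problem; the helper names `mse`, `analytic_gradient`, and `numerical_gradient`, the step size `eps`, and the random data are all assumptions made for this check.

```python
import numpy as np

def mse(X, y, w):
    """MSE of the linear model X @ w against targets y."""
    r = X @ w - y
    return (r @ r) / len(y)

def analytic_gradient(X, y, w):
    """Closed-form gradient (2/n) X^T (Xw - y)."""
    return (2.0 / len(y)) * (X.T @ (X @ w - y))

def numerical_gradient(X, y, w, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (mse(X, y, w + step) - mse(X, y, w - step)) / (2 * eps)
    return grad

# Assumed random test problem (seeded so the check is reproducible)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)

g_analytic = analytic_gradient(X, y, w)
g_numeric = numerical_gradient(X, y, w)

# The two gradients should agree up to floating-point rounding
print(np.allclose(g_analytic, g_numeric, atol=1e-6))
```

Because the MSE is quadratic in $\mathbf{w}$, central differences carry no truncation error here, so any mismatch beyond rounding would point to a mistake in the derivation or the code.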

### Implementation in Code

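```python
import numpy as np

def compute_gradients(X, y_true, y_pred):
    # Residuals (Xw - y), matching the sign in the gradient formula
    errors = y_pred - y_true
    # (2/n) * X^T (Xw - y)
    gradients = np.dot(X.T, errors) * (2 / len(y_true))
    return gradients
```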

The gradient of the MSE loss with respect to the weight vector $\mathbf{w}$ points in the direction in which the loss increases fastest. To minimize the loss, $\mathbf{w}$ is therefore adjusted in the opposite direction, with a step proportional to the gradient's magnitude.
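
The role of the gradient can be illustrated with a single update step: moving the weights against the gradient should lower the loss, provided the learning rate is small enough. The data, the zero starting weights, and the `learning_rate` value below are assumptions chosen for illustration.

```python
import numpy as np

# Assumed toy data generated by y = 1*x + 1 (the second column is a bias feature)
X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]])
y = np.array([2.0, 3.0, 4.0, 5.0])

w = np.zeros(2)        # starting weights
learning_rate = 0.01   # assumed step size

def mse(weights):
    """MSE of the linear model X @ weights against y."""
    return np.mean((X @ weights - y) ** 2)

loss_before = mse(w)

# Gradient of the MSE at the current weights: (2/n) X^T (Xw - y)
gradient = (2.0 / len(y)) * (X.T @ (X @ w - y))

# Step in the direction opposite to the gradient
w_new = w - learning_rate * gradient
loss_after = mse(w_new)

print(loss_before, loss_after, loss_after < loss_before)
```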

import numpy as np

def compute_mse_loss(y_true, y_pred):
    """
    Compute the Mean Squared Error loss.

    Parameters:
    - y_true: Actual target values.
    - y_pred: Predicted target values.

    Returns:
    - mse: Mean Squared Error.
    """
    mse = np.mean((y_true - y_pred) ** 2)
    return mse

def compute_gradients(X, y_true, y_pred):
    """
    Compute the gradients of the MSE loss with respect to the weights.

    Parameters:
    - X: Feature matrix.
    - y_true: Actual target values.
    - y_pred: Predicted target values.

    Returns:
    - gradients: Gradient of the loss with respect to weights.
    """
    errors = y_pred - y_true
    gradients = np.dot(X.T, errors) * (2 / len(y_true))
    return gradients

def gradient_descent(X, y, weights, learning_rate=0.01, n_iterations=1000, method='batch', batch_size=32):
    """
    Perform Gradient Descent optimization with different variants.

    Parameters:
    - X: Feature matrix.
    - y: Target vector.
    - weights: Initial weights.
    - learning_rate: Learning rate for weight updates.
    - n_iterations: Number of iterations.
    - method: Type of Gradient Descent ('stochastic', 'batch', 'mini_batch').
    - batch_size: Size of mini-batches for 'mini_batch' variant.

    Returns:
    - weights: Optimized weights after training.
    - history: Loss history over iterations.
    """
    n_samples, n_features = X.shape
    history = []

    for iteration in range(n_iterations):
        if method == 'batch':
            # Batch Gradient Descent: Update weights using the entire dataset
            y_pred = np.dot(X, weights)
            loss = compute_mse_loss(y, y_pred)
            gradients = compute_gradients(X, y, y_pred)
            weights -= learning_rate * gradients
            history.append(loss)

        elif method == 'stochastic':
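            # Stochastic Gradient Descent: update weights after every single sample.
            # Samples are visited in a fixed order (no shuffling); the recorded loss is
            # the average of per-sample squared errors measured before each update.
            loss = 0.0  # Initialize as float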
            for i in range(n_samples):
                xi = X[i].reshape(1, -1)
                yi = y[i]
                y_pred = np.dot(xi, weights)[0]  # Extract scalar
                loss += (yi - y_pred) ** 2
                gradient = compute_gradients(xi, np.array([yi]), y_pred)
                weights -= learning_rate * gradient.flatten()
            mse = loss / n_samples
            history.append(mse)

        elif method == 'mini_batch':
            # Mini-Batch Gradient Descent: Update weights using mini-batches
            permutation = np.random.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]
            loss = 0.0  # Initialize as float

            for i in range(0, n_samples, batch_size):
                X_batch = X_shuffled[i:i + batch_size]
                y_batch = y_shuffled[i:i + batch_size]
                y_pred = np.dot(X_batch, weights)
                loss += compute_mse_loss(y_batch, y_pred)
                gradients = compute_gradients(X_batch, y_batch, y_pred)
                weights -= learning_rate * gradients
            # Average over the actual number of batches (the last one may be smaller)
            mse = loss / int(np.ceil(n_samples / batch_size))
            history.append(mse)

        else:
            raise ValueError("Method must be 'stochastic', 'batch', or 'mini_batch'.")

        # (Optional) Print loss every 100 iterations
        if (iteration + 1) % 100 == 0:
            print(f"Iteration {iteration + 1}/{n_iterations}, Loss: {history[-1]:.4f}")

    return weights, history

# Sample data
X = np.array([[1, 1], [2, 1], [3, 1], [4, 1]])
y = np.array([2, 3, 4, 5])

# Parameters
learning_rate = 0.01
n_iterations = 1000
batch_size = 2

# Initialize weights
initial_weights = np.zeros(X.shape[1])

# Test Batch Gradient Descent
print("Batch Gradient Descent:")
final_weights_batch, loss_history_batch = gradient_descent(
    X, y, initial_weights.copy(),
    learning_rate=learning_rate,
    n_iterations=n_iterations,
    method='batch'
)
print("Final Weights:", final_weights_batch)
print("Final MSE Loss:", loss_history_batch[-1])
print("-" * 50)

# Test Stochastic Gradient Descent
print("Stochastic Gradient Descent:")
final_weights_sgd, loss_history_sgd = gradient_descent(
    X, y, initial_weights.copy(),
    learning_rate=learning_rate,
    n_iterations=n_iterations,
    method='stochastic'
)
print("Final Weights:", final_weights_sgd)
print("Final MSE Loss:", loss_history_sgd[-1])
print("-" * 50)

# Test Mini-Batch Gradient Descent
print("Mini-Batch Gradient Descent:")
final_weights_mbgd, loss_history_mbgd = gradient_descent(
    X, y, initial_weights.copy(),
    learning_rate=learning_rate,
    n_iterations=n_iterations,
    method='mini_batch',
    batch_size=batch_size
)
print("Final Weights:", final_weights_mbgd)
print("Final MSE Loss:", loss_history_mbgd[-1])

Batch Gradient Descent:
Iteration 100/1000, Loss: 0.0323
Iteration 200/1000, Loss: 0.0177
Iteration 300/1000, Loss: 0.0097
Iteration 400/1000, Loss: 0.0053
Iteration 500/1000, Loss: 0.0029
Iteration 600/1000, Loss: 0.0016
Iteration 700/1000, Loss: 0.0009
Iteration 800/1000, Loss: 0.0005
Iteration 900/1000, Loss: 0.0003
Iteration 1000/1000, Loss: 0.0001
Final Weights: [1.01003164 0.97050576]
Final MSE Loss: 0.00014615956788017066
--------------------------------------------------
Stochastic Gradient Descent:
Iteration 100/1000, Loss: 0.0049
Iteration 200/1000, Loss: 0.0004
Iteration 300/1000, Loss: 0.0000
Iteration 400/1000, Loss: 0.0000
Iteration 500/1000, Loss: 0.0000
Iteration 600/1000, Loss: 0.0000
Iteration 700/1000, Loss: 0.0000
Iteration 800/1000, Loss: 0.0000
Iteration 900/1000, Loss: 0.0000
Iteration 1000/1000, Loss: 0.0000
Final Weights: [1.00000058 0.99999813]
Final MSE Loss: 6.501074837144217e-13
--------------------------------------------------
Mini-Batch Gradient Descent: