# Batch Gradient Descent — Quick Overview

This notebook uses **batch gradient descent** to fit a linear model. Keep this short reference handy.

## What is Batch Gradient Descent?
- We update the model parameters (`theta`) by computing the gradient of the loss over the *entire training set* each iteration.
- Update rule (mean squared error loss):

  theta := theta - eta * grad
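
As a minimal sketch of one update step, assume a toy dataset with made-up values. The names `X_demo`, `y_demo`, and `theta_demo` are illustrative and do not appear in the notebook's code; the gradient expression is the MSE gradient described in the next part of this note.

```python
import numpy as np

# Hypothetical toy data: 2 samples, 2 parameters (bias column + one feature).
X_demo = np.array([[1.0, 1.0],
                   [1.0, 2.0]])
y_demo = np.array([[2.0],
                   [3.0]])
theta_demo = np.zeros((2, 1))
eta = 0.1

# Gradient of the MSE loss over the whole (toy) training set.
m = len(y_demo)
grad = (2 / m) * X_demo.T.dot(X_demo.dot(theta_demo) - y_demo)

# One batch gradient descent step: theta := theta - eta * grad
theta_next = theta_demo - eta * grad
print(theta_next)  # theta_next is [[0.5], [0.8]]
```

Starting from zeros, the step moves both parameters toward values that reduce the error on every sample at once, which is what "batch" refers to here.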

## How the gradient is calculated (intuitively and mathematically)
- Predictions: p = X · theta  
  (X shape = (m, n), theta shape = (n, 1))
- Error vector: e = p − y  
  (y shape = (m, 1), so e is (m, 1))
- Gradient of MSE w.r.t. theta:  
  grad = (2/m) · X^T · e  
  - X^T has shape (n, m), X^T·e → (n, 1) → one gradient value per parameter
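
The factor 2/m comes from differentiating the mean of the squared errors, J = (1/m) · Σ eᵢ². One way to gain confidence in the formula is to compare it with a finite-difference estimate. The sketch below does that on random data; the helper names (`mse`, `analytic_grad`, `numeric_grad`) and the data are assumptions made for this check, not part of the notebook.

```python
import numpy as np

def mse(X, theta, y):
    """Mean squared error of the linear model X @ theta against y."""
    e = X.dot(theta) - y
    return float(np.mean(e ** 2))

def analytic_grad(X, theta, y):
    """grad = (2/m) * X^T (X theta - y), shape (n, 1)."""
    m = len(y)
    return (2 / m) * X.T.dot(X.dot(theta) - y)

def numeric_grad(X, theta, y, h=1e-6):
    """Central finite differences, perturbing one parameter at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.shape[0]):
        step = np.zeros_like(theta)
        step[j, 0] = h
        grad[j, 0] = (mse(X, theta + step, y) - mse(X, theta - step, y)) / (2 * h)
    return grad

rng = np.random.default_rng(0)      # illustrative random data
X_chk = rng.normal(size=(5, 3))     # m = 5 samples, n = 3 parameters
y_chk = rng.normal(size=(5, 1))
theta_chk = rng.normal(size=(3, 1))

g_analytic = analytic_grad(X_chk, theta_chk, y_chk)
g_numeric = numeric_grad(X_chk, theta_chk, y_chk)
print(g_analytic.shape)             # (3, 1)
print(np.allclose(g_analytic, g_numeric, atol=1e-5))
```

If the two gradients disagree, the usual suspects are a missing 2/m factor or a shape mismatch between `theta` and `y`.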

## Why the shapes matter
- Keep `theta` as (n, 1) and `y` as (m, 1) to avoid unintended broadcasting and to produce gradients shaped (n, 1).
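
A short demonstration of the broadcasting pitfall, using placeholder arrays (`X_b`, `theta_b`, `y_col`, `y_flat` are illustrative names): when `y` is a 1-D array of shape (m,), subtracting it from an (m, 1) prediction silently produces an (m, m) matrix.

```python
import numpy as np

m, n = 4, 3
X_b = np.ones((m, n))
theta_b = np.ones((n, 1))

y_col = np.zeros((m, 1))    # column vector, as recommended above
y_flat = np.zeros(m)        # 1-D array, shape (m,)

e_col = X_b.dot(theta_b) - y_col     # (m, 1) - (m, 1) stays (m, 1)
e_flat = X_b.dot(theta_b) - y_flat   # (m, 1) - (m,) broadcasts to (m, m)

print(e_col.shape)                   # (4, 1)
print(e_flat.shape)                  # (4, 4)
print(X_b.T.dot(e_flat).shape)       # (3, 4): no longer one value per parameter
```

NumPy raises no error in the second case, so reshaping `y` with `y.reshape(-1, 1)` before training is a cheap safeguard.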

## Practical tips
- `eta` (learning rate) controls the step size: if it is too large the updates diverge, and if it is too small convergence is slow.
- Feature scaling helps convergence (standardize inputs before training).
- Initialize `theta` randomly or with zeros; set a random seed for reproducibility.
- Monitor training: print the loss or parameter values every few iterations.
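
The scaling and monitoring tips can be sketched together. The helper names `standardize` and `mse_loss`, the synthetic data, and `eta = 0.1` are assumptions chosen for illustration; the notebook's own loop prints the residual vector rather than a scalar loss.

```python
import numpy as np

def standardize(features):
    """Scale each column to zero mean and unit variance.

    Apply this before adding the bias column, whose standard deviation is zero.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / std

def mse_loss(X, theta, y):
    """Scalar mean squared error, convenient for monitoring."""
    e = X.dot(theta) - y
    return float(np.mean(e ** 2))

rng = np.random.default_rng(42)                 # seeded for reproducibility
raw = rng.uniform(0, 100, size=(50, 2))         # unscaled synthetic features
scaled = standardize(raw)
X_s = np.c_[np.ones((50, 1)), scaled]           # add the bias column after scaling
y_s = X_s.dot(np.array([[3.0], [9.0], [5.0]]))  # noise-free synthetic targets

theta_s = np.zeros((3, 1))                      # zero initialization
eta = 0.1
losses = []
for i in range(200):
    grad = (2 / len(y_s)) * X_s.T.dot(X_s.dot(theta_s) - y_s)
    theta_s = theta_s - eta * grad
    if i % 50 == 0:
        losses.append(mse_loss(X_s, theta_s, y_s))
        print(f"Iteration {i}: loss={losses[-1]:.6f}")
```

A single loss number per checkpoint is easier to scan than a full residual vector, and a loss that grows between checkpoints is an early sign that `eta` is too large.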

This note explains the code below: the `error` and `calc_gradient` functions, and the loop in `compute_weights` that updates `theta`.

In [None]:
# Common imports
import numpy as np
import os
# to make this notebook's output stable across runs
np.random.seed(42)

In [44]:
def error(X: np.ndarray, theta, y_actual):
    # Residuals of the linear model: predictions minus targets, shape (m, 1)
    return X.dot(theta) - y_actual

In [45]:
def calc_gradient(X: np.ndarray, theta, y_actual):
    # Gradient of the MSE loss: (2/m) * X^T (X theta - y), shape (n, 1)
    err = error(X, theta, y_actual)
    m = len(y_actual)
    grad = 2 / m * X.T.dot(err)
    return grad

In [53]:
# Batch gradient descent
def compute_weights(X, y_actual):
    eta = 0.01  # learning rate
    iterations = 5000
    # Random initial weights, one per column of X (instead of a hard-coded 3)
    theta = np.random.randn(X.shape[1], 1)
    for i in range(iterations):
        grads = calc_gradient(X, theta, y_actual)
        theta = theta - eta * grads
        if i % 50 == 0:
            # "error" here is the residual vector X.dot(theta) - y, not a scalar loss
            print(f"Iteration {i}: theta={theta}, error={error(X, theta, y_actual)}")
    return theta
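
One way to sanity-check a descent loop like `compute_weights` is to compare its result with a direct least-squares solve from `np.linalg.lstsq`. The sketch below is self-contained and rests on assumptions: `gradient_descent` is a hypothetical quiet variant of the loop, and the 100 random samples and `eta = 0.1` are chosen for illustration. It does not reproduce this notebook's run.

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, iterations=5000):
    """Quiet batch gradient descent loop (hypothetical helper for this check)."""
    theta = np.zeros((X.shape[1], 1))
    m = len(y)
    for _ in range(iterations):
        theta = theta - eta * (2 / m) * X.T.dot(X.dot(theta) - y)
    return theta

rng = np.random.default_rng(0)
feats = rng.random((100, 2))                    # illustrative data, not the notebook's
X_ls = np.c_[np.ones((100, 1)), feats]          # add x0 = 1 for the bias
y_ls = X_ls.dot(np.array([[3.0], [9.0], [5.0]]))

theta_gd = gradient_descent(X_ls, y_ls)
theta_ls, *_ = np.linalg.lstsq(X_ls, y_ls, rcond=None)
print(np.allclose(theta_gd, theta_ls, atol=1e-6))
```

When the two results differ noticeably, the descent has usually not run long enough for the chosen learning rate, or the features are poorly scaled.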
    

In [56]:
np.random.seed(42)
# Suppose we have 3 samples and 2 features
features = np.random.rand(3,2)
#add ones for bias
features_with_bias = np.c_[np.ones((3,1)),features] #add x0=1 for each instance
# make weights a column vector so y_actual is (3,1)
weights_actual = np.array([3, 9, 5]).reshape(-1, 1)  # bias 3, theta_1=9, theta_2=5
# compute y_actual as a (3,1) column vector
y_actual = features_with_bias.dot(weights_actual)
print("Feature-matrix:", features_with_bias)
print("y_Actual:", y_actual)
theta_computed = compute_weights(features_with_bias, y_actual)
print("Theta computed using batch gradient descent:", theta_computed)

Feature-matrix: [[1.         0.37454012 0.95071431]
 [1.         0.73199394 0.59865848]
 [1.         0.15601864 0.15599452]]
y_Actual: [[11.1244326 ]
 [12.5812379 ]
 [ 5.18414037]]
Iteration 0: theta=[[ 1.73910529]
 [ 0.84768611]
 [-0.3614787 ]], error=[[ -9.41149783]
 [-10.4380338 ]
 [ -3.36916893]]
Iteration 50: theta=[[5.58038103]
 [2.97515301]
 [2.44096351]], error=[[-2.10907848]
 [-3.36175938]
 [ 1.24119692]]
Iteration 100: theta=[[6.18425716]
 [3.61956586]
 [3.19866108]], error=[[-0.54348997]
 [-1.83257486]
 [ 2.06381014]]
Iteration 150: theta=[[6.12706445]
 [3.94404316]
 [3.51389262]], error=[[-0.17945777]
 [-1.46353612]
 [ 2.10641633]]
Iteration 200: theta=[[5.94533037]
 [4.19192146]
 [3.7223351 ]], error=[[-0.07018223]
 [-1.33903892]
 [ 1.99587177]]
Iteration 250: theta=[[5.75000425]
 [4.41455808]
 [3.89524747]], error=[[-0.01773175]
 [-1.26788093]
 [ 1.86225449]]
Iteration 300: theta=[[5.56330417]
 [4.62321052]
 [4.04861288]], error=[[ 0.01952357]
 [-1.21003518]
 [ 1.73203225