---
title: "Parameter Estimation"
---

## Introduction

Parameter estimation is the foundation of machine learning and deep learning. In this notebook, we'll learn how neural networks learn by working through a simple example: converting temperatures from an unknown scale to Celsius.

**What you'll learn:**
- How to define a parametric model
- What a loss function is and why we need it
- How gradient descent optimizes model parameters
- Common pitfalls (exploding gradients) and their solutions
- The importance of data normalization

This hands-on example will give you intuition for the training process that powers all deep learning models.

In [1]:
%matplotlib inline
import numpy as np
import torch
torch.set_printoptions(edgeitems=2, linewidth=75)

## Setup

The cell above imports the libraries we need and configures PyTorch's print options for more compact output.

## The Problem: Temperature Conversion

We have 11 temperature measurements in an unknown unit (`t_u`) and their corresponding values in Celsius (`t_c`). Our goal is to learn the relationship between these two scales.

**The underlying relationship:** These measurements follow a linear relationship: `celsius = w × unknown + b`

Our task is to **estimate** the parameters `w` (weight/slope) and `b` (bias/intercept) from the data. This is exactly what machine learning does - it finds the best parameters that fit the observed data.

In [2]:
t_c = [0.5,  14.0, 15.0, 28.0, 11.0,  8.0,  3.0, -4.0,  6.0, 13.0, 21.0]
t_u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4]
t_c = torch.tensor(t_c)
t_u = torch.tensor(t_u)

## The Model

A **model** is a function that takes inputs and parameters, and produces predictions. Here we use the simplest possible model: a linear function.

```
prediction = w × input + b
```

This is the same as the equation of a line: `y = mx + b`

- `w` (weight): Controls the slope - how much the output changes for each unit of input
- `b` (bias): Controls the intercept - the output when input is zero

Our model has only 2 parameters to learn, making it perfect for understanding the training process.

In [None]:
def model(t_u, w, b):
    return w * t_u + b

## The Loss Function

The **loss function** measures how wrong our predictions are. It's a single number that quantifies the difference between our model's predictions and the true values.

Here we use **Mean Squared Error (MSE)**:
1. Calculate the difference between prediction and truth: `(prediction - truth)`
2. Square each difference so that no error is negative: `(prediction - truth)²`
3. Take the average: `mean((prediction - truth)²)`

**Why square the errors?**
- Penalizes large errors more heavily (which is usually desirable)
- Makes the math convenient for computing gradients
- Always produces a non-negative value

In [None]:
def loss_fn(t_p, t_c):
    squared_diffs = (t_p - t_c)**2
    return squared_diffs.mean()
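
To make the reasons for squaring concrete, here is a small sketch with made-up numbers (the tensors `truth`, `pred`, and the helper `mse` are illustrative and not part of the dataset). It shows that signed errors can cancel, while squared errors cannot, and that one large error costs more than two small ones of the same total size.

```python
import torch

def mse(t_p, t_c):
    # Same computation as loss_fn: mean of the squared differences
    return ((t_p - t_c) ** 2).mean()

truth = torch.tensor([10.0, 10.0])
pred = torch.tensor([8.0, 12.0])          # one prediction 2 too low, one 2 too high

signed_mean = (pred - truth).mean()       # errors of -2 and +2 cancel out
squared_mean = mse(pred, truth)           # squaring keeps both errors visible

# Same total absolute error (4), but concentrated in a single prediction
concentrated = mse(torch.tensor([10.0, 14.0]), truth)

print(signed_mean.item(), squared_mean.item(), concentrated.item())  # 0.0 4.0 8.0
```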

## Initial Prediction

Let's make our first prediction with naive initial parameters:
- `w = 1.0` (assume a 1:1 relationship)
- `b = 0.0` (assume no offset)

With these initial values, our model simply returns the input unchanged.

In [None]:
w = torch.ones(())
b = torch.zeros(())

t_p = model(t_u, w, b)
t_p

tensor([35.7000, 55.9000, 58.2000, 81.9000, 56.3000, 48.9000, 33.9000,
        21.8000, 48.4000, 60.4000, 68.4000])

## Initial Loss

A loss of `1763.88` tells us our initial parameters fit the data poorly. The loss itself only measures the error; the gradient of the loss, introduced below, tells us in which direction to adjust each parameter.

**Goal of training:** Adjust `w` and `b` to minimize this loss.

In [None]:
loss = loss_fn(t_p, t_c)
loss

tensor(1763.8848)

## Understanding Broadcasting

Before we continue, let's understand an important PyTorch feature: **broadcasting**.

When we operate on tensors of different shapes, PyTorch automatically expands them to compatible shapes. This allows our model to process multiple inputs at once (vectorization).

**Key rules:**
1. Scalar tensors can broadcast to any shape
2. Dimensions of size 1 can expand to match other dimensions
3. Operations work element-wise after broadcasting

In our model, `w` and `b` are scalars that broadcast across all 11 temperature measurements simultaneously.

In [None]:
x = torch.ones(())
y = torch.ones(3,1)
z = torch.ones(1,3)
a = torch.ones(2, 1, 1)
print(f"shapes: x: {x.shape}, y: {y.shape}")
print(f"        z: {z.shape}, a: {a.shape}")
print("x * y:", (x * y).shape)
print("y * z:", (y * z).shape)
print("y * z * a:", (y * z * a).shape)

shapes: x: torch.Size([]), y: torch.Size([3, 1])
        z: torch.Size([1, 3]), a: torch.Size([2, 1, 1])
x * y: torch.Size([3, 1])
y * z: torch.Size([3, 3])
y * z * a: torch.Size([2, 3, 3])
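
As a complement to the shapes printed above, the following sketch checks broadcast results without building the tensors, and shows what happens when shapes cannot be reconciled. It assumes a PyTorch version that provides `torch.broadcast_shapes`; the variable `compatible` is just an illustrative flag.

```python
import torch

# torch.broadcast_shapes reports the resulting shape without allocating tensors
print(torch.broadcast_shapes((3, 1), (1, 3)))             # torch.Size([3, 3])
print(torch.broadcast_shapes((3, 1), (1, 3), (2, 1, 1)))  # torch.Size([2, 3, 3])

# A scalar parameter broadcasts across all 11 measurements
print((torch.ones(()) * torch.ones(11)).shape)            # torch.Size([11])

# Sizes are compared from the last dimension backwards; 3 vs 4 cannot be matched
try:
    torch.ones(3) * torch.ones(4)
    compatible = True
except RuntimeError:
    compatible = False
print(compatible)  # False
```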


## Computing Gradients: Numerical Approximation

To improve our parameters, we need to know **which direction** to adjust them. This is determined by the **gradient** - the rate of change of the loss with respect to each parameter.

**Numerical gradient approximation** estimates the derivative with a central difference:
```
∂loss/∂w ≈ [loss(w + δ) - loss(w - δ)] / (2δ)
```

We slightly perturb the parameter (`w + δ` and `w - δ`), measure how the loss changes, and estimate the gradient.

**Interpretation:**
- If gradient is positive: loss increases when we increase `w` → should decrease `w`
- If gradient is negative: loss decreases when we increase `w` → should increase `w`

In [None]:
delta = 0.1

loss_rate_of_change_w = \
    (loss_fn(model(t_u, w + delta, b), t_c) - 
     loss_fn(model(t_u, w - delta, b), t_c)) / (2.0 * delta)

## Gradient Descent Step

Now we update the parameter using **gradient descent**:

```
w_new = w_old - learning_rate × gradient
```

**Why the negative sign?** We move in the **opposite direction** of the gradient to go downhill toward lower loss.

**Learning rate** (`1e-2 = 0.01`): Controls how big of a step we take. Too large → unstable, too small → slow convergence.
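
The effect of the learning rate is easiest to see on a toy problem. The sketch below is not the temperature model: it minimizes the hypothetical one-parameter function `f(x) = x**2`, whose gradient is `2*x` and whose minimum is at `x = 0`, using the same update rule.

```python
def descend(learning_rate, steps=20, x=1.0):
    """Run gradient descent on f(x) = x**2, whose gradient is 2*x."""
    for _ in range(steps):
        grad = 2.0 * x
        x = x - learning_rate * grad   # step against the gradient
    return x

for lr in (0.01, 0.1, 1.1):
    print(f"lr={lr}: x after 20 steps = {descend(lr):.4f}")
```

Each step multiplies `x` by `1 - 2 * learning_rate`. With `0.01` the value shrinks slowly, with `0.1` it shrinks quickly, and with `1.1` the factor is `-1.2`, so `x` flips sign and grows at every step.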

In [None]:
learning_rate = 1e-2

w = w - learning_rate * loss_rate_of_change_w

In [None]:
loss_rate_of_change_b = \
    (loss_fn(model(t_u, w, b + delta), t_c) - 
     loss_fn(model(t_u, w, b - delta), t_c)) / (2.0 * delta)

b = b - learning_rate * loss_rate_of_change_b

We do the same for the bias parameter `b`. Both parameters need to be updated to minimize the loss.

## Analytical Gradients

Numerical gradients work but are slow (require 2 forward passes per parameter). We can do better!

Using **calculus**, we can derive exact gradient formulas. For MSE loss and linear model:

**Loss gradient:**
```
∂loss/∂prediction = 2(prediction - truth) / n
```

This gives us the gradient of the loss with respect to each prediction.

**Model gradient with respect to w:**

Since `prediction = w × input + b`, the derivative is:
```
∂prediction/∂w = input
```

**Model gradient with respect to b:**

```
∂prediction/∂b = 1
```

Changing the bias by some amount changes every prediction by exactly that amount, so the derivative is the constant 1.

**Combining gradients with the chain rule:**

To get the gradient of loss with respect to parameters, we use the **chain rule**:

```
∂loss/∂w = ∂loss/∂prediction × ∂prediction/∂w
∂loss/∂b = ∂loss/∂prediction × ∂prediction/∂b
```

We compute these for each data point, then sum them up to get the total gradient. This is the gradient we'll use to update our parameters!
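
Written out in full, the chain rule and the sum over data points give the expressions below. The symbols are notation introduced here for convenience: $x_i$ is the $i$-th input (`t_u`), $y_i$ the $i$-th target (`t_c`), $n$ the number of measurements, and $L$ the MSE loss.

```latex
\begin{aligned}
L(w, b) &= \frac{1}{n}\sum_{i=1}^{n} \left(w x_i + b - y_i\right)^2 \\
\frac{\partial L}{\partial w} &= \frac{2}{n}\sum_{i=1}^{n} \left(w x_i + b - y_i\right) x_i \\
\frac{\partial L}{\partial b} &= \frac{2}{n}\sum_{i=1}^{n} \left(w x_i + b - y_i\right)
\end{aligned}
```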

In [None]:
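def dloss_fn(t_p, t_c):
    # Derivative of the mean squared error with respect to each prediction;
    # the division by the number of samples comes from the mean
    dsq_diffs = 2 * (t_p - t_c) / t_p.size(0)
    return dsq_diffs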

In [None]:
def dmodel_dw(t_u, w, b):
    return t_u

In [None]:
def dmodel_db(t_u, w, b):
    return 1.0

In [None]:
def grad_fn(t_u, t_c, t_p, w, b):
    dloss_dtp = dloss_fn(t_p, t_c)
    dloss_dw = dloss_dtp * dmodel_dw(t_u, w, b)
    dloss_db = dloss_dtp * dmodel_db(t_u, w, b)
    # Sum the per-sample contributions to get one gradient value per parameter
    return torch.stack([dloss_dw.sum(), dloss_db.sum()])
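
A common sanity check for hand-derived gradients is to compare them with the numerical approximation. The sketch below is self-contained, so it redefines the data in double precision, and the helper names `loss_at`, `analytic_grad`, and `numerical_grad` are illustrative rather than part of the notebook.

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0],
                   dtype=torch.float64)
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4],
                   dtype=torch.float64)

def loss_at(w, b):
    # MSE of the linear model, written out so this cell runs on its own
    return ((w * t_u + b - t_c) ** 2).mean()

def analytic_grad(w, b):
    # Chain rule: dloss/dprediction times dprediction/dparameter, summed over the data
    dloss_dtp = 2 * (w * t_u + b - t_c) / t_u.size(0)
    return torch.stack([(dloss_dtp * t_u).sum(), dloss_dtp.sum()])

def numerical_grad(w, b, delta=1e-3):
    # Central differences, perturbing one parameter at a time
    dw = (loss_at(w + delta, b) - loss_at(w - delta, b)) / (2 * delta)
    db = (loss_at(w, b + delta) - loss_at(w, b - delta)) / (2 * delta)
    return torch.stack([dw, db])

w, b = 1.0, 0.0
print(analytic_grad(w, b))
print(numerical_grad(w, b))
print(torch.allclose(analytic_grad(w, b), numerical_grad(w, b), rtol=1e-6))  # True
```

Because the loss is quadratic in `w` and `b`, the central difference agrees with the analytic gradient up to floating-point rounding.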

## The Training Loop

Now we put it all together! A **training loop** repeats these steps:

1. **Forward pass**: Compute predictions with current parameters
2. **Compute loss**: Measure how wrong the predictions are
3. **Backward pass**: Compute gradients
4. **Update parameters**: Take a step in the direction that reduces loss

We repeat this for `n_epochs` iterations. Each epoch is one pass through the entire dataset.

In [None]:
def training_loop(n_epochs, learning_rate, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        w, b = params

        t_p = model(t_u, w, b)  # forward pass
        loss = loss_fn(t_p, t_c)
        grad = grad_fn(t_u, t_c, t_p, w, b)  # backward pass

        params = params - learning_rate * grad

        print('Epoch %d, Loss %f' % (epoch, float(loss)))  # log the loss every epoch

    return params

### Enhanced Training Loop

This version adds:
- **Selective printing**: Only print at interesting epochs to reduce clutter
- **Gradient monitoring**: See how gradients change over time
- **Stability check**: Stop if loss becomes infinite (training diverged)

In [None]:
def training_loop(n_epochs, learning_rate, params, t_u, t_c,
                  print_params=True):
    for epoch in range(1, n_epochs + 1):
        w, b = params

        t_p = model(t_u, w, b)  # forward pass
        loss = loss_fn(t_p, t_c)
        grad = grad_fn(t_u, t_c, t_p, w, b)  # backward pass

        params = params - learning_rate * grad

        if epoch in {1, 2, 3, 10, 11, 99, 100, 4000, 5000}:  # print only selected epochs
            print('Epoch %d, Loss %f' % (epoch, float(loss)))
            if print_params:
                print('    Params:', params)
                print('    Grad:  ', grad)
        if epoch in {4, 12, 101}:
            print('...')

        if not torch.isfinite(loss).all():
            break  # stop once the loss is no longer finite (training diverged)

    return params

In [None]:
training_loop(
    n_epochs = 100, 
    learning_rate = 1e-2, 
    params = torch.tensor([1.0, 0.0]), 
    t_u = t_u, 
    t_c = t_c)

Epoch 1, Loss 1763.884766
    Params: tensor([-44.1730,  -0.8260])
    Grad:   tensor([4517.2964,   82.6000])
Epoch 2, Loss 5802484.500000
    Params: tensor([2568.4011,   45.1637])
    Grad:   tensor([-261257.4062,   -4598.9702])
Epoch 3, Loss 19408029696.000000
    Params: tensor([-148527.7344,   -2616.3931])
    Grad:   tensor([15109614.0000,   266155.6875])
...
Epoch 10, Loss 90901105189019073810297959556841472.000000
    Params: tensor([3.2144e+17, 5.6621e+15])
    Grad:   tensor([-3.2700e+19, -5.7600e+17])
Epoch 11, Loss inf
    Params: tensor([-1.8590e+19, -3.2746e+17])
    Grad:   tensor([1.8912e+21, 3.3313e+19])


tensor([-1.8590e+19, -3.2746e+17])

## Problem: Exploding Gradients

🚨 **Watch what happens when we train with learning rate `1e-2`:**

The loss **explodes** to infinity! Why?

1. Initial gradients are very large (4517 for w!)
2. Large learning rate × large gradient = huge parameter update
3. Parameters overshoot and move away from the optimum
4. Loss increases, making gradients even larger
5. Parameters grow exponentially → **divergence**

**Key lesson:** Learning rate must be chosen carefully. Too large → unstable, too small → slow.

In [None]:
training_loop(
    n_epochs = 100, 
    learning_rate = 1e-4, 
    params = torch.tensor([1.0, 0.0]), 
    t_u = t_u, 
    t_c = t_c)

Epoch 1, Loss 1763.884766
    Params: tensor([ 0.5483, -0.0083])
    Grad:   tensor([4517.2964,   82.6000])
Epoch 2, Loss 323.090515
    Params: tensor([ 0.3623, -0.0118])
    Grad:   tensor([1859.5493,   35.7843])
Epoch 3, Loss 78.929634
    Params: tensor([ 0.2858, -0.0135])
    Grad:   tensor([765.4666,  16.5122])
...
Epoch 10, Loss 29.105247
    Params: tensor([ 0.2324, -0.0166])
    Grad:   tensor([1.4803, 3.0544])
Epoch 11, Loss 29.104168
    Params: tensor([ 0.2323, -0.0169])
    Grad:   tensor([0.5781, 3.0384])
...
Epoch 99, Loss 29.023582
    Params: tensor([ 0.2327, -0.0435])
    Grad:   tensor([-0.0533,  3.0226])
Epoch 100, Loss 29.022667
    Params: tensor([ 0.2327, -0.0438])
    Grad:   tensor([-0.0532,  3.0226])


tensor([ 0.2327, -0.0438])

## Solution: Smaller Learning Rate

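By reducing the learning rate to `1e-4` (100× smaller), training becomes stable!

**Observations:**
- Loss drops from 1763 to about 29 within the first 10 epochs, then barely changes
- `w` settles near 0.23, but `b` has hardly moved: its gradient is still about 3.0 at epoch 100
- The early gradients are still large, but the smaller steps prevent divergence

**However**, there's still a problem: the gradient for `w` starts out roughly 50 times larger than the gradient for `b` (4517 vs 82.6), so a learning rate small enough to keep `w` stable is too small to make progress on `b`. The root cause is the scale of our input data.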

In [None]:
t_un = 0.1 * t_u

In [None]:
training_loop(
    n_epochs = 100, 
    learning_rate = 1e-2, 
    params = torch.tensor([1.0, 0.0]), 
    t_u = t_un,  # train on the scaled inputs
    t_c = t_c)

Epoch 1, Loss 80.364342
    Params: tensor([1.7761, 0.1064])
    Grad:   tensor([-77.6140, -10.6400])
Epoch 2, Loss 37.574913
    Params: tensor([2.0848, 0.1303])
    Grad:   tensor([-30.8623,  -2.3864])
Epoch 3, Loss 30.871077
    Params: tensor([2.2094, 0.1217])
    Grad:   tensor([-12.4631,   0.8587])
...
Epoch 10, Loss 29.030489
    Params: tensor([ 2.3232, -0.0710])
    Grad:   tensor([-0.5355,  2.9295])
Epoch 11, Loss 28.941877
    Params: tensor([ 2.3284, -0.1003])
    Grad:   tensor([-0.5240,  2.9264])
...
Epoch 99, Loss 22.214186
    Params: tensor([ 2.7508, -2.4910])
    Grad:   tensor([-0.4453,  2.5208])
Epoch 100, Loss 22.148710
    Params: tensor([ 2.7553, -2.5162])
    Grad:   tensor([-0.4446,  2.5165])


tensor([ 2.7553, -2.5162])

## Better Solution: Data Normalization

Instead of just reducing the learning rate, let's fix the root cause: **input scale**.

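Our input values (`t_u`) range from 21.8 to 81.9, so they are much larger than the constant 1 that multiplies `b`. Because the gradient with respect to `w` is the prediction error weighted by the input, it comes out far larger than the gradient with respect to `b`.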

**Normalization**: Scale inputs to a smaller range by multiplying by `0.1`:
- Original: [21.8, 33.9, 35.7, ..., 81.9]
- Normalized: [2.18, 3.39, 3.57, ..., 8.19]

This brings the gradients for `w` and `b` closer in magnitude, so a single learning rate can work for both parameters.
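
One way to quantify the effect of rescaling is to look at the curvature of the loss. For MSE with a linear model, the Hessian with respect to `(w, b)` is the constant matrix `(2/n) XᵀX`, where each row of `X` is `[input, 1]`, and plain gradient descent on such a quadratic is stable only when the learning rate is below `2 / largest eigenvalue`. The sketch below applies this standard result to our inputs; the helper `curvature_summary` is a name chosen here for illustration.

```python
import torch

t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4],
                   dtype=torch.float64)

def curvature_summary(inputs):
    # Hessian of the MSE loss for the model w * input + b: (2/n) * X^T X
    X = torch.stack([inputs, torch.ones_like(inputs)], dim=1)
    hessian = 2.0 * X.T @ X / inputs.size(0)
    eigenvalues = torch.linalg.eigvalsh(hessian)      # returned in ascending order
    lr_limit = (2.0 / eigenvalues[-1]).item()         # stability bound for gradient descent
    condition_number = (eigenvalues[-1] / eigenvalues[0]).item()
    return lr_limit, condition_number

for name, inputs in (("raw", t_u), ("scaled", 0.1 * t_u)):
    lr_limit, condition_number = curvature_summary(inputs)
    print(f"{name}: lr limit ~ {lr_limit:.2e}, condition number ~ {condition_number:.0f}")
```

On the raw inputs the bound falls between `1e-4` and `1e-2`, which is consistent with the two training runs shown earlier; on the scaled inputs it rises above `1e-2`, and the condition number drops, meaning the two parameters are closer to equally easy to optimize.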

## Training with Normalized Data

Now with normalized inputs, we can use the larger learning rate (`1e-2`) successfully!

**Key improvements:**
- Initial loss is much smaller (80 vs 1763)
- Gradients are of a reasonable size (~77 vs 4517)
- Training is stable
- Loss decreases smoothly

**Note:** The parameters differ from the previous run because we changed the input scale: `w` now multiplies `0.1 * t_u`, so it is 10 times larger than the equivalent weight on the raw inputs.
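
To use parameters learned on the scaled inputs with raw measurements, only the weight needs converting. The sketch below uses the parameter values printed at epoch 100 above and a few raw inputs from the dataset; `w_raw` and `b_raw` are illustrative names.

```python
import torch

def model(t_u, w, b):
    return w * t_u + b

t_u = torch.tensor([35.7, 55.9, 58.2])   # a few raw inputs from the dataset
t_un = 0.1 * t_u                         # the scaled inputs used for training

w_n, b_n = 2.7553, -2.5162               # parameters after 100 epochs on t_un

# w_n * (0.1 * t_u) + b_n == (0.1 * w_n) * t_u + b_n, so only the weight is rescaled
w_raw, b_raw = 0.1 * w_n, b_n

print(torch.allclose(model(t_un, w_n, b_n), model(t_u, w_raw, b_raw)))  # True
```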

## Longer Training for Better Convergence

Let's train for 5000 epochs to see the full convergence behavior:

**Observations:**
- Loss continues to decrease: 80 → 22 → 2.93
- Final parameters: `w ≈ 5.37`, `b ≈ -17.30`
- Gradients become very small (0.0006, 0.0033) indicating convergence

We've successfully found the parameters that best fit our data! 

**Fun fact:** If the unknown unit is Fahrenheit, the true conversion is `C = (F - 32) / 1.8` or `C = 0.556F - 17.78`, which is very close to our learned values when accounting for the 0.1 scaling factor!
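
As a cross-check on the trained values, a linear model with MSE loss also has a closed-form least-squares solution. The sketch below computes it with `torch.linalg.lstsq` on the raw inputs and prints it next to the exact Fahrenheit coefficients; it is a separate verification, not a step of the training procedure above.

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0],
                   dtype=torch.float64)
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4],
                   dtype=torch.float64)

# Least-squares fit of t_c ~ w * t_u + b on the raw (unscaled) inputs
A = torch.stack([t_u, torch.ones_like(t_u)], dim=1)
solution = torch.linalg.lstsq(A, t_c.unsqueeze(1)).solution.squeeze(1)
w_fit, b_fit = solution.tolist()

w_f, b_f = 1 / 1.8, -32 / 1.8            # exact Fahrenheit-to-Celsius coefficients

print(f"least squares:       w = {w_fit:.4f}, b = {b_fit:.2f}")
print(f"Fahrenheit formula:  w = {w_f:.4f}, b = {b_f:.2f}")
print(f"trained w on the raw scale: {0.1 * 5.37:.3f}")
```

The small gap between the fitted and the exact coefficients is expected, since the measurements are noisy.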

In [None]:
params = training_loop(
    n_epochs = 5000, 
    learning_rate = 1e-2, 
    params = torch.tensor([1.0, 0.0]), 
    t_u = t_un, 
    t_c = t_c,
    print_params = True)

params

## Visualizing the Results

Let's see how well our trained model fits the data:

- **Blue line**: Our model's predictions
- **Orange dots**: Actual measurements

The model captures the linear trend well! This visualization confirms that our learned parameters successfully model the temperature conversion relationship.

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt

t_p = model(t_un, *params)  # predict from the scaled inputs the model was trained on

fig = plt.figure(dpi=600)
plt.xlabel("Temperature (°Fahrenheit)")
plt.ylabel("Temperature (°Celsius)")
plt.plot(t_u.numpy(), t_p.detach().numpy())  # plot against the raw, unscaled inputs
plt.plot(t_u.numpy(), t_c.numpy(), 'o')
plt.savefig("temp_unknown_plot.png", format="png")

## Original Data Scatter Plot

Here's a look at our raw data before fitting the model. The clear linear pattern suggests that a linear model is appropriate for this problem.

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt

fig = plt.figure(dpi=600)
plt.xlabel("Measurement")
plt.ylabel("Temperature (°Celsius)")
plt.plot(t_u.numpy(), t_c.numpy(), 'o')

plt.savefig("temp_data_plot.png", format="png")

## Summary: Key Takeaways

Congratulations! You've just learned the fundamentals of training neural networks:

### Core Concepts
1. **Model**: A parameterized function that makes predictions (`prediction = w × input + b`)
2. **Loss Function**: Measures prediction error (MSE in this case)
3. **Gradients**: Direction and magnitude to adjust parameters
4. **Gradient Descent**: Iterative algorithm to minimize loss

### Practical Lessons
5. **Learning Rate Matters**: Too large → divergence, too small → slow convergence
6. **Normalization is Crucial**: Scale inputs to reasonable ranges for stable training
7. **Monitor Training**: Watch loss and gradients to diagnose problems
8. **Patience Pays Off**: More training epochs → better convergence

### What's Next?
This same process scales to deep neural networks with millions of parameters! The concepts you've learned here - forward pass, loss computation, backward pass (gradients), and parameter updates - are the foundation of all deep learning.

**Coming up**: We'll learn how PyTorch automates gradient computation with `autograd`, making it easy to train complex models without manually deriving gradient formulas!
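
As a brief preview, and only a minimal sketch of what that will look like, the snippet below asks `autograd` for the gradient at the initial parameters `w = 1.0`, `b = 0.0`. The result can be compared with the first `Grad` line printed by `training_loop`.

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

# requires_grad=True asks PyTorch to record the operations applied to params
params = torch.tensor([1.0, 0.0], requires_grad=True)

w, b = params
loss = ((w * t_u + b - t_c) ** 2).mean()   # same model and MSE loss as above
loss.backward()                            # fills params.grad with d(loss)/d(params)

print(params.grad)
```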