# Important Stuff

## Tensors

In [1]:
import torch 

# Create a tensor
x = torch.tensor([[1., 2.], [3., 4.]])

# Basic operations
print(x + 2)
print(x * 3)
print(x.mean())
print(x.T)  # transpose

tensor([[3., 4.],
        [5., 6.]])
tensor([[ 3.,  6.],
        [ 9., 12.]])
tensor(2.5000)
tensor([[1., 3.],
        [2., 4.]])


In [3]:
print(torch.zeros(3,3))
print(torch.ones(2,4))
print(torch.randn(3,3))

tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])
tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.]])
tensor([[-0.1989,  0.3315, -0.3665],
        [-1.4019, -0.6886, -1.0394],
        [ 0.8571, -0.3226,  0.4708]])


In [4]:
# “Track all operations on this tensor so we can compute derivatives later.”
x = torch.tensor([[2., 3.]], requires_grad=True)

In [11]:
import torch

# Step 1: Create tensor with grad tracking
x = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)

# Step 2: Do some operations
y = x ** 2 + 3
z = y.mean()  # scalar output required for backward()

# Step 3: Backpropagate
z.backward()

print("x:", x)
print("Gradient of x:", x.grad)

x: tensor([[1., 2.],
        [3., 4.]], requires_grad=True)
Gradient of x: tensor([[0.5000, 1.0000],
        [1.5000, 2.0000]])
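
The gradient printed above can be checked by hand. Because `z` is the mean of `x ** 2 + 3` over four elements, each partial derivative is `2 * x_ij / 4 = x_ij / 2`. A minimal sketch of that check (the name `expected` is introduced here only for illustration):

```python
import torch

x = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
z = (x ** 2 + 3).mean()
z.backward()

# Hand-derived gradient: d/dx_ij of (1/4) * sum(x_ij^2 + 3) = x_ij / 2
expected = x.detach() / 2
print(torch.allclose(x.grad, expected))
```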


In [None]:
# If you’re just doing inference or want to freeze weights:

with torch.no_grad():
    y = x * 2  # no gradient tracking
    
x.requires_grad_(False)
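
A small sketch of what `torch.no_grad()` changes: the same operation produces a tracked result outside the context and an untracked one inside it. The names `y_tracked` and `y_untracked` are made up for this example.

```python
import torch

x = torch.tensor([[2., 3.]], requires_grad=True)

y_tracked = x * 2              # recorded in the autograd graph
with torch.no_grad():
    y_untracked = x * 2        # computed without recording anything

print(y_tracked.requires_grad)    # True
print(y_untracked.requires_grad)  # False
```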

In [7]:
# Run on GPU if available:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = x.to(device)
print(x.device)

cuda:0


In [14]:
x.grad

tensor([[0.5000, 1.0000],
        [1.5000, 2.0000]])

# Torch NN module

In [16]:
import torch 
import torch.nn as nn


# pretend we have 4 samples, each with 3 input features
x = torch.tensor([[1.0, 2.0, 3.0],
                  [0.5, 1.0, 1.5],
                  [2.0, 0.5, 1.0],
                  [1.5, 1.5, 0.5]])

# pretend the true outputs we want to learn are single numbers (regression)
y_true = torch.tensor([[10.0],
                       [ 5.0],
                       [ 7.0],
                       [ 8.0]])



layer = nn.Linear(in_features=3, out_features=1)

print("Weight:", layer.weight)
print("Bias:  ", layer.bias)

Weight: Parameter containing:
tensor([[ 0.4852,  0.5576, -0.1539]], requires_grad=True)
Bias:   Parameter containing:
tensor([-0.0402], requires_grad=True)


In [17]:
output = layer(x)

In [18]:
output

tensor([[1.0985],
        [0.5292],
        [1.0550],
        [1.4470]], grad_fn=<AddmmBackward0>)

In [19]:
# Two layers
x = torch.tensor([[1.0, 2.0, 3.0],
                  [0.5, 1.0, 1.5],
                  [2.0, 0.5, 1.0],
                  [1.5, 1.5, 0.5]])

layer1 = nn.Linear(3, 1)
out1 = layer1(x)
print("Output of first layer:\n", out1)

layer2 = nn.Linear(1, 1)     # in_features=1 because layer1 outputs 1 value
out2 = layer2(out1)
print("Output of second layer:\n", out2)
# If you apply two linear transformations back-to-back, the result is just another single linear transformation,
# so we need to add a nonlinearity between them.

Output of first layer:
 tensor([[-0.7622],
        [-0.6308],
        [-1.5764],
        [-0.7531]], grad_fn=<AddmmBackward0>)
Output of second layer:
 tensor([[0.3735],
        [0.2970],
        [0.8474],
        [0.3682]], grad_fn=<AddmmBackward0>)
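
To see why stacking linear layers adds nothing, the two layers can be folded into one by hand: `W = W2 @ W1` and `b = W2 @ b1 + b2`. The sketch below assumes freshly initialized layers (not the exact weights printed above) and introduces a hypothetical third layer, `combined`, to hold the folded parameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.tensor([[1.0, 2.0, 3.0],
                  [0.5, 1.0, 1.5]])

layer1 = nn.Linear(3, 1)
layer2 = nn.Linear(1, 1)

with torch.no_grad():
    stacked = layer2(layer1(x))
    # Equivalent single layer: W = W2 @ W1, b = W2 @ b1 + b2
    combined = nn.Linear(3, 1)
    combined.weight.copy_(layer2.weight @ layer1.weight)
    combined.bias.copy_(layer2.weight @ layer1.bias + layer2.bias)
    collapsed = combined(x)

print(torch.allclose(stacked, collapsed, atol=1e-6))
```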


In [20]:
relu = nn.ReLU()

h = layer1(x)       # linear transform
h = relu(h)         # non-linear activation
out = layer2(h)     # second linear transform

print("Output after ReLU + second layer:\n", out)

Output after ReLU + second layer:
 tensor([[-0.0702],
        [-0.0702],
        [-0.0702],
        [-0.0702]], grad_fn=<AddmmBackward0>)
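
All four rows are identical because the first layer produced a negative value for every sample (see the previous cell), so ReLU mapped each of them to zero and the second layer returned only its bias. The sketch below reproduces that effect with hand-set, hypothetical weights rather than the ones from the cell above.

```python
import torch
import torch.nn as nn

x = torch.tensor([[1.0, 2.0, 3.0],
                  [0.5, 1.0, 1.5]])

layer1 = nn.Linear(3, 1)
layer2 = nn.Linear(1, 1)
with torch.no_grad():
    # Hypothetical weights chosen so every pre-activation is negative
    layer1.weight.fill_(-1.0)
    layer1.bias.fill_(0.0)

h = torch.relu(layer1(x))   # every entry is clipped to zero
out = layer2(h)             # every row therefore equals layer2.bias

print(h)
print(out, layer2.bias)
```

With a single hidden unit this "dead ReLU" situation is easy to hit; wider hidden layers make it less likely that every unit is inactive at once.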


# Backpropagation

In [21]:
import torch
import torch.nn as nn

torch.manual_seed(0)

# input: 3 features -> hidden: 2 neurons -> output: 1 value
layer1 = nn.Linear(3, 2)
relu = nn.ReLU()
layer2 = nn.Linear(2, 1)

x = torch.tensor([[1.0, 2.0, 3.0]])
y_true = torch.tensor([[10.0]])


In [22]:
h = layer1(x)        # linear transform
h_relu = relu(h)     # activation
y_pred = layer2(h_relu)  # final output
print("Predicted:", y_pred)

#x → layer1(weight,bias) → h → ReLU → h_relu → layer2(weight,bias) → y_pred
# every arrow is an operation PyTorch can differentiate.

Predicted: tensor([[-0.2039]], grad_fn=<AddmmBackward0>)


In [23]:
# compute the loss (how wrong we were)
loss_fn = nn.MSELoss()
loss = loss_fn(y_pred, y_true)
print("Loss:", loss.item())
#this is a single scalar (mean squared error).

Loss: 104.1186294555664


In [25]:
# backward pass
layer1.zero_grad()
layer2.zero_grad()

loss.backward()
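
After `loss.backward()`, every parameter carries a `.grad` tensor with the same shape as the parameter itself. The sketch below rebuilds the same 3 -> 2 -> 1 network; wrapping it in `nn.Sequential` is a convenience assumed here so that the parameters can be listed in one loop.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(3, 2), nn.ReLU(), nn.Linear(2, 1))

x = torch.tensor([[1.0, 2.0, 3.0]])
y_true = torch.tensor([[10.0]])

loss = nn.MSELoss()(model(x), y_true)
model.zero_grad()
loss.backward()

# Each parameter now has a gradient of matching shape
for name, param in model.named_parameters():
    print(name, tuple(param.shape), tuple(param.grad.shape))
```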


### 1. What happens in training

Training a neural network follows this cycle:

1. **Forward pass** – Pass the input `x` through the network to get predictions.
2. **Loss computation** – Compare the predictions with the true labels using a loss function (for example, MSELoss).
3. **Backward pass** – Call `loss.backward()` to compute gradients for every weight and bias.
4. **Weight update** – Call `optimizer.step()` to adjust weights using those gradients.
5. **Reset gradients** – Call `optimizer.zero_grad()` to clear gradients before the next backward pass.
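
The five steps above can be sketched as a single training iteration. The optimizer choice (`torch.optim.SGD`), the learning rate, and the two-sample data are assumptions made for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # lr is an arbitrary choice
loss_fn = nn.MSELoss()

x = torch.tensor([[1.0, 2.0, 3.0], [0.5, 1.0, 1.5]])
y_true = torch.tensor([[10.0], [5.0]])

weight_before = model.weight.detach().clone()

y_pred = model(x)               # 1. forward pass
loss = loss_fn(y_pred, y_true)  # 2. loss computation
loss.backward()                 # 3. backward pass
optimizer.step()                # 4. weight update
optimizer.zero_grad()           # 5. reset gradients

print(torch.equal(weight_before, model.weight))  # False: the weights moved
```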

---

### 2. What `nn.Linear` does

When you create a layer like this:

```python
layer = nn.Linear(in_features=3, out_features=2)
```

PyTorch automatically:

* Creates **weights** of shape `[out_features, in_features]`
* Creates **biases** of shape `[out_features]`
* Initializes them **randomly** (by default, small values drawn uniformly from the range ±1/sqrt(`in_features`))
* Sets `requires_grad=True` so PyTorch can track how each parameter affects the loss
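
These properties can be inspected directly on the layer. A quick sketch:

```python
import torch.nn as nn

layer = nn.Linear(in_features=3, out_features=2)

print(layer.weight.shape)          # torch.Size([2, 3])
print(layer.bias.shape)            # torch.Size([2])
print(layer.weight.requires_grad)  # True
print(layer.bias.requires_grad)    # True
```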

---

### 3. What happens in the forward pass

When you do:

```python
y_pred = layer(x)
```

PyTorch computes:
$$
y = xW^T + b
$$

This gives you predictions, but the weights do not change yet.
The forward pass only produces the predictions that the loss function then scores.
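
As a sanity check, the layer's output can be compared against the formula written out by hand. This sketch uses a freshly initialized layer and a single made-up input row.

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)
x = torch.tensor([[1.0, 2.0, 3.0]])

# Same computation as layer(x), spelled out: x @ W^T + b
manual = x @ layer.weight.T + layer.bias
print(torch.allclose(layer(x), manual, atol=1e-6))
```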

---

### 4. What the loss function does

The loss function compares predicted outputs with the true outputs.

Example:

```python
criterion = nn.MSELoss()
loss = criterion(y_pred, y_true)
```

The result is a single scalar value showing how wrong the model is.
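
With its default settings, `nn.MSELoss` averages the squared errors over all elements. The sketch below compares it against the same computation written by hand; the prediction and target values are made up.

```python
import torch
import torch.nn as nn

y_pred = torch.tensor([[9.0], [4.0], [7.5]])
y_true = torch.tensor([[10.0], [5.0], [7.0]])

criterion = nn.MSELoss()
loss = criterion(y_pred, y_true)

# Errors are -1, -1, 0.5 -> squares 1, 1, 0.25 -> mean 0.75
manual = ((y_pred - y_true) ** 2).mean()
print(loss.item(), manual.item())
```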

---

### 5. Why we use `zero_grad()`

Each parameter (weights and biases) has a `.grad` attribute that stores the gradient (slope) of the loss with respect to that parameter after backpropagation.

PyTorch **adds** gradients to `.grad` every time you call `.backward()`.
This means gradients accumulate over multiple backward passes unless you clear them.

To reset gradients before a new backward pass:

```python
optimizer.zero_grad()
```

This ensures gradients only reflect the current forward pass.

If you skip this step, gradients from previous iterations remain, and the optimizer will make updates that are too large or completely wrong.
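
The accumulation is easy to observe on a single tensor. In this sketch, `w` stands in for a model parameter, and `w.grad.zero_()` plays the role that `optimizer.zero_grad()` plays for a whole model.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()        # 2 * w = [2., 4.]

(w ** 2).sum().backward()     # no zeroing in between
second = w.grad.clone()       # accumulated: [4., 8.]

w.grad.zero_()                # clear the stored gradient
(w ** 2).sum().backward()
third = w.grad.clone()        # back to [2., 4.]

print(first, second, third)
```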

---

### 6. What happens in the backward pass

When you call:

```python
loss.backward()
```

PyTorch applies the chain rule to compute the partial derivative of the loss with respect to every parameter in the model.

* For each weight: `parameter.grad = ∂Loss/∂Weight`
* For each bias: `parameter.grad = ∂Loss/∂Bias`

These gradients tell the optimizer how much to change each parameter to reduce the loss.
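
For a single linear layer with MSE loss, the chain rule is short enough to write out and compare against autograd. The sketch assumes one output feature, so the mean in the loss runs over the `n` samples; the variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(3, 1)
x = torch.tensor([[1.0, 2.0, 3.0], [0.5, 1.0, 1.5]])
y_true = torch.tensor([[10.0], [5.0]])

y_pred = layer(x)
loss = nn.MSELoss()(y_pred, y_true)
loss.backward()

# Chain rule by hand: dL/dy_pred = 2 * (y_pred - y_true) / n
n = x.shape[0]
dloss_dpred = 2 * (y_pred.detach() - y_true) / n
manual_weight_grad = dloss_dpred.T @ x        # dL/dW, shape [1, 3]
manual_bias_grad = dloss_dpred.sum(dim=0)     # dL/db, shape [1]

print(torch.allclose(layer.weight.grad, manual_weight_grad, atol=1e-5))
print(torch.allclose(layer.bias.grad, manual_bias_grad, atol=1e-5))
```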

---

### 7. Updating weights

When you call:

```python
optimizer.step()
```

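PyTorch updates every parameter using its gradient. For plain stochastic gradient descent (`torch.optim.SGD` without momentum), the rule is:

$$
\text{new\_weight} = \text{old\_weight} - \text{learning\_rate} \times \text{gradient}
$$

Other optimizers, such as Adam, also start from the gradient but scale the step differently.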

Weights and biases change slightly to reduce the error next time.
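
The update rule can be checked against `torch.optim.SGD` directly. This is a sketch under the assumption of plain SGD (no momentum, no weight decay); the learning rate is an arbitrary value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(3, 1)
learning_rate = 0.1  # arbitrary value for illustration
optimizer = torch.optim.SGD(layer.parameters(), lr=learning_rate)

x = torch.tensor([[1.0, 2.0, 3.0]])
y_true = torch.tensor([[10.0]])

loss = nn.MSELoss()(layer(x), y_true)
optimizer.zero_grad()
loss.backward()

# What plain SGD should produce: old_weight - learning_rate * gradient
expected_weight = layer.weight.detach() - learning_rate * layer.weight.grad

optimizer.step()
print(torch.allclose(layer.weight, expected_weight))
```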

---

### 8. Summary

| Step | Command                            | What it does                               |
| ---- | ---------------------------------- | ------------------------------------------ |
| 1    | `nn.Linear()`                      | Creates and initializes weights and biases |
| 2    | `y_pred = layer(x)`                | Forward pass, computes predictions         |
| 3    | `loss = criterion(y_pred, y_true)` | Computes how wrong the prediction is       |
| 4    | `optimizer.zero_grad()`            | Clears old gradients                       |
| 5    | `loss.backward()`                  | Computes new gradients for each parameter  |
| 6    | `optimizer.step()`                 | Updates the weights and biases             |

---

### 9. Key points to remember

* **Weights** and **biases** are persistent: they keep their values between iterations.
* **Gradients** are only meaningful for the current iteration: they must be cleared before each new backward pass.
* Without `zero_grad()`, gradients accumulate and the optimizer “over-corrects.”
* The backward pass only computes gradients; the optimizer applies them.


### 1. What an **epoch** means

An **epoch** = one complete pass of your entire training data through the network.
If you train for 10 epochs, the model will see every training example 10 times.
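
With mini-batches, one epoch consists of several iterations: the number of samples divided by the batch size, rounded up. A small sketch with 10 made-up samples and a batch size of 4:

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.randn(10, 3)    # 10 hypothetical samples
targets = torch.randn(10, 1)
dataset = TensorDataset(features, targets)
dataloader = DataLoader(dataset, batch_size=4)

# One epoch = every batch once: ceil(10 / 4) = 3 iterations
print(len(dataloader))

# Looping over the dataloader once touches every sample exactly once
seen = sum(batch_x.shape[0] for batch_x, _ in dataloader)
print(seen)
```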

---

### 2. What happens inside each epoch

Within each epoch, we repeat the same sequence of steps for every batch:

1. **Forward pass**
   Pass the current batch of inputs through the model to get predictions.

2. **Compute loss**
   Compare predictions to true outputs using a loss function.

3. **Reset gradients**
   Call `optimizer.zero_grad()` to clear old gradients.

4. **Backward pass**
   Call `loss.backward()` to compute gradients for all weights and biases.

5. **Update weights**
   Call `optimizer.step()` to adjust the parameters based on gradients.

Then move on to the next batch of data (if you’re using mini-batches).

---

### 3. Code outline

```python
for epoch in range(num_epochs):
    for batch_x, batch_y in dataloader:
        y_pred = model(batch_x)                 # 1. forward pass
        loss = criterion(y_pred, batch_y)       # 2. compute loss
        optimizer.zero_grad()                   # 3. clear old gradients
        loss.backward()                         # 4. backpropagation
        optimizer.step()                        # 5. update weights

    # Note: this reports the loss of the last batch only, not the epoch average
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")
```

If you’re not using mini-batches, just drop the inner loop and run those same five lines on the whole dataset each epoch.
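
The outline above leaves `model`, `dataloader`, `criterion`, and `optimizer` undefined. One possible way to fill them in is sketched below; the synthetic data, batch size, learning rate, and epoch count are all assumptions. It also tracks the average loss per epoch instead of only the last batch's loss.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic regression data: y = 2*x0 - x1 + 0.5*x2 (made up for this sketch)
features = torch.randn(64, 3)
targets = features @ torch.tensor([[2.0], [-1.0], [0.5]])

dataloader = DataLoader(TensorDataset(features, targets), batch_size=16, shuffle=True)
model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(20):
    running = 0.0
    for batch_x, batch_y in dataloader:
        y_pred = model(batch_x)                 # forward pass
        loss = criterion(y_pred, batch_y)       # compute loss
        optimizer.zero_grad()                   # clear old gradients
        loss.backward()                         # backpropagation
        optimizer.step()                        # update weights
        running += loss.item() * batch_x.shape[0]
    epoch_losses.append(running / len(features))   # average loss over the epoch

print(f"first epoch: {epoch_losses[0]:.4f}, last epoch: {epoch_losses[-1]:.4f}")
```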

---

### 4. Why we repeat it

Every time the model goes through the data:

* The gradients indicate the direction in which each weight should move to reduce the loss.
* The weights move slightly toward values that reduce the loss.
* The loss usually decreases with each epoch until the model converges or stops improving.
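
The downward trend can be observed on the four-sample regression data used earlier in these notes. The sketch trains a single `nn.Linear` layer on the full batch; the learning rate and the number of epochs are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.tensor([[1.0, 2.0, 3.0],
                  [0.5, 1.0, 1.5],
                  [2.0, 0.5, 1.0],
                  [1.5, 1.5, 0.5]])
y_true = torch.tensor([[10.0], [5.0], [7.0], [8.0]])

model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # lr chosen for illustration

losses = []
for epoch in range(200):
    loss = criterion(model(x), y_true)   # full-batch forward pass and loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"epoch 1: {losses[0]:.4f}, epoch 200: {losses[-1]:.4f}")
```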

---

### 5. Summary table

| Step               | Description              | Repeated every iteration? |
| ------------------ | ------------------------ | ------------------------- |
| Initialize weights | Random, once at creation | No                        |
| Forward pass       | Compute predictions      | Yes                       |
| Compute loss       | Compare to true outputs  | Yes                       |
| Zero gradients     | Reset `.grad`            | Yes                       |
| Backward pass      | Compute gradients        | Yes                       |
| Optimizer step     | Update weights           | Yes                       |

