## 🧠 Case Study: Are We Going to Canada's Wonderland?
We will now implement a simple artificial neural network using **PyTorch** to predict whether we'll go to Canada's Wonderland based on three inputs:

- **X₁** (weather): 1 if sunny, 0 if rainy
- **X₂** (ticket queue): 1 if short, 0 if long
- **X₃** (health): 1 if healthy, 0 if unwell

Each input has an associated **weight**, and the network uses a **bias** and a **sigmoid activation function** to generate a confidence value.

### 🔢 Step 1: Prediction Function
We compute the neuron output $\hat{y}$ as:

$$
\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}, \quad \text{where} \quad z = w_1 x_1 + w_2 x_2 + w_3 x_3 + b
$$

In our case:

- $x = [1, 0, 1]$
- $w = [5, 2, 4]$
- $b = -3$
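
As a quick check by hand, substituting these values into the formula gives:

$$
z = 5 \cdot 1 + 2 \cdot 0 + 4 \cdot 1 + (-3) = 6, \qquad \hat{y} = \sigma(6) = \frac{1}{1 + e^{-6}} \approx 0.9975
$$

So the neuron is about 99.75% confident that we go.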

```python
import torch
import torch.nn as nn

# Inputs and weights
x = torch.tensor([1.0, 0.0, 1.0])
weights = torch.tensor([5.0, 2.0, 4.0])
bias = -3.0

# Linear combination
z = torch.dot(weights, x) + bias

# Sigmoid activation
sigmoid = torch.sigmoid(z)
print(f"z = {z.item():.2f}")
print(f"Predicted output (sigmoid) = {sigmoid.item():.4f}")
```

### 💬 Talking Points
- We treat inputs like "sunny", "short queue", and "healthy" as binary features.
- The weighted sum (`z`) captures how favorable conditions are.
- Applying `sigmoid(z)` maps the decision to a probability.
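
To make these points concrete, the sketch below lists the confidence for every combination of the three binary inputs, using the weights and bias from Step 1. It is written in plain Python so the arithmetic stays visible; the helper name `confidence` is our own choice, not part of the original notebook.

```python
import math
from itertools import product

WEIGHTS = [5.0, 2.0, 4.0]  # same weights as Step 1
BIAS = -3.0                # same bias as Step 1

def confidence(inputs):
    """Sigmoid of the weighted sum for one binary input vector."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, inputs)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))

# Enumerate all 2^3 combinations of (sunny, short queue, healthy)
for inputs in product([0, 1], repeat=3):
    print(inputs, f"confidence = {confidence(inputs):.4f}")
```

With these weights, a short queue alone (`(0, 1, 0)`) is not enough to push the confidence above 0.5, while sunny weather alone is.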

### ❌ Step 2: Error Calculation
We compare the predicted value with the actual label:

$$
E = y - \hat{y}
$$

If the actual value is 1 (we should go), but our prediction is too low, we compute the error and update the weights.

```python
# Target label: 1 means we should go
y = torch.tensor(1.0)
error = y - sigmoid
print(f"Error = {error.item():.4f}")
```

### 🔁 Step 3: Weight Update with Gradient Descent
Using a learning rate $\eta = 0.1$, we update each weight:

$$
w_j = w_j - \eta \cdot \frac{\partial L}{\partial w_j}
$$

We use PyTorch autograd to perform backpropagation.

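```python
# Only the weights are trainable, so only they need gradients;
# the input x is fixed data
weights = weights.clone().detach().requires_grad_(True)

# Forward pass
z = torch.dot(weights, x) + bias
y_hat = torch.sigmoid(z)

# Squared-error loss; backward() fills weights.grad via autograd
loss = (y_hat - y) ** 2
loss.backward()

# Gradient descent step (inside no_grad so the update is not tracked)
lr = 0.1
with torch.no_grad():
    weights -= lr * weights.grad
print(f"Updated weights: {weights}")
```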
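
To see what `loss.backward()` computes here, the same gradient can be derived by hand with the chain rule: for $L = (\hat{y} - y)^2$ and $\hat{y} = \sigma(z)$, we get $\partial L / \partial w_j = 2(\hat{y} - y)\,\hat{y}(1 - \hat{y})\,x_j$. The sketch below evaluates this in plain Python for the values used above; the helper name `manual_gradient` is ours, not a PyTorch function.

```python
import math

def manual_gradient(weights, inputs, bias, target):
    """Gradient of L = (y_hat - target)^2 w.r.t. each weight of a sigmoid neuron."""
    z = sum(w * xi for w, xi in zip(weights, inputs)) + bias
    y_hat = 1.0 / (1.0 + math.exp(-z))
    # Chain rule: dL/dy_hat * dy_hat/dz, shared by all weights
    common = 2.0 * (y_hat - target) * y_hat * (1.0 - y_hat)
    # dz/dw_j = x_j
    return [common * xi for xi in inputs]

grads = manual_gradient([5.0, 2.0, 4.0], [1.0, 0.0, 1.0], -3.0, 1.0)
print(grads)
```

The gradients are tiny because the sigmoid is almost saturated at $z = 6$, so a single step with $\eta = 0.1$ barely changes the weights. The entry for $w_2$ is zero because $x_2 = 0$.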

> ### 🏠 Homework Challenge
>
> Study how to update the weights using **Gradient Descent**.
>
> 📌 **Question**: How is this update rule related to **Backpropagation** in multi-layer neural networks?
>
> (You may remove the solution cell below when presenting.)

```python
print("Gradient Descent computes how to change each weight to reduce the loss.")
print("Backpropagation extends this to multiple layers using the chain rule.")
```
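
As one possible illustration of this answer, the sketch below applies the chain rule by hand to a tiny two-layer chain, $x \to h = \sigma(w_1 x) \to \hat{y} = \sigma(w_2 h)$, with loss $L = \frac{1}{2}(\hat{y} - y)^2$. This network, its values, and the function names are assumptions made for the example; they are not part of the case study.

```python
import math

def sigma(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    """Two-layer chain: x -> h = sigma(w1*x) -> y_hat = sigma(w2*h)."""
    h = sigma(w1 * x)
    y_hat = sigma(w2 * h)
    return h, y_hat

def backward(w1, w2, x, y):
    """Gradients of L = 0.5*(y_hat - y)^2 via the chain rule, output layer first."""
    h, y_hat = forward(w1, w2, x)
    # Output layer: dL/dz2, where z2 = w2*h
    delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)
    grad_w2 = delta2 * h
    # Hidden layer: pass delta2 back through w2 and the hidden sigmoid
    delta1 = delta2 * w2 * h * (1.0 - h)
    grad_w1 = delta1 * x
    return grad_w1, grad_w2

# Illustrative values only
g1, g2 = backward(w1=0.5, w2=-1.0, x=1.0, y=1.0)
print(f"dL/dw1 = {g1:.6f}, dL/dw2 = {g2:.6f}")
```

Each gradient then feeds the same update rule, $w_j = w_j - \eta \cdot \partial L / \partial w_j$. Backpropagation only determines how the gradients for the earlier layers are obtained.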

### 🧮 Step 3: Weight Update with Gradient Descent (A Closer Look and Solution to the Challenge)

Once we compute the **loss** (difference between the predicted output and actual target), the next step is to **update the weights** so that the network improves its predictions. This is done using **gradient descent**.

### 📉 Gradient Descent: The Core Idea

We want to **minimize the loss function** $ L $ by adjusting the weights $ w_j $. Gradient descent does this by computing how much the loss changes with respect to each weight, which is the **partial derivative**:

$$
\frac{\partial L}{\partial w_j}
$$

This derivative tells us:

> "If I increase weight $ w_j $ just a little, will the loss go up or down? And by how much?"
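
That question can also be answered numerically: nudge one weight by a small amount and measure how the loss changes. The sketch below does this for the sigmoid neuron from Step 1 with the squared-error loss; the helper `loss_for` and the nudge size `eps` are our own choices for illustration.

```python
import math

def loss_for(weights, inputs=(1.0, 0.0, 1.0), bias=-3.0, target=1.0):
    """Squared-error loss of the sigmoid neuron from Step 1."""
    z = sum(w * xi for w, xi in zip(weights, inputs)) + bias
    y_hat = 1.0 / (1.0 + math.exp(-z))
    return (y_hat - target) ** 2

w0 = [5.0, 2.0, 4.0]
eps = 1e-4  # the "little" increase applied to one weight at a time

slopes = []
for j in range(3):
    nudged = list(w0)
    nudged[j] += eps
    # Forward-difference estimate of dL/dw_j
    slopes.append((loss_for(nudged) - loss_for(w0)) / eps)
    print(f"dL/dw_{j + 1} is approximately {slopes[-1]:.3e}")
```

A negative slope means that increasing the weight lowers the loss; a slope of zero (here for $w_2$, because $x_2 = 0$) means the weight has no effect on this example.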

### 🧪 The Update Rule

To reduce the loss, we **move each weight** in the opposite direction of the gradient. That's where the gradient descent update rule comes in:

$$
w_j = w_j - \eta \cdot \frac{\partial L}{\partial w_j}
$$

Where:

* $ w_j $ is the weight for feature $ j $
* $ \eta $ is the **learning rate**, a small number like `0.1` that controls how big each update step is
* $ \frac{\partial L}{\partial w_j} $ is the gradient (i.e., how sensitive the loss is to changes in $ w_j $)
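
The rule itself is a single line of arithmetic per weight. A minimal sketch, where the gradient values are hypothetical and chosen only to show the sign behaviour:

```python
def gradient_step(weights, grads, lr=0.1):
    """Return new weights after one gradient descent step."""
    return [w - lr * g for w, g in zip(weights, grads)]

old_weights = [5.0, 2.0, 4.0]
example_grads = [0.5, 0.0, -0.2]  # hypothetical gradients, not computed from data

print(gradient_step(old_weights, example_grads))
```

A positive gradient lowers the weight, a negative gradient raises it, and a zero gradient leaves it unchanged.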

### 🔍 Why Subtract the Gradient?

* The **gradient** points in the direction in which the loss **increases** fastest.
* We want to **minimize** the loss.
* So, we go in the **opposite** direction, which is why we subtract.

### 🧠 Learning Rate $ \eta $: The Step Size

* If $ \eta $ is **too large**, we might overshoot the minimum, and training may oscillate or fail to converge.
* If $ \eta $ is **too small**, learning will be very slow.
* A value like $ \eta = 0.1 $ is a reasonable starting point for simple problems.

Think of gradient descent as walking downhill:

* The gradient tells you **which way is down**.
* The learning rate tells you **how big a step to take**.
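
The effect of the step size is easy to see on a toy problem. The sketch below runs gradient descent on $L(w) = w^2$ (minimum at $w = 0$, gradient $2w$) with three learning rates; the function and the specific rates are assumptions chosen for illustration.

```python
def descend(lr, w=1.0, steps=50):
    """Run gradient descent on L(w) = w^2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"lr = {lr}: w after 50 steps = {descend(lr):.4g}")
```

With the smallest rate, $w$ has barely moved after 50 steps. With 0.1 it is very close to 0. With 1.1 every step overshoots the minimum by more than the previous one, so $|w|$ grows instead of shrinking.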

### 🔁 PyTorch Makes It Easy

When using PyTorch:

1. We compute the loss.
2. We call `loss.backward()`, which uses **autograd** to compute all the gradients.
3. An optimizer (for example `torch.optim.SGD`) then updates the weights when we call:

```python
optimizer.step()
```

For plain gradient descent (SGD without momentum or weight decay), this step applies:

$$
w_j = w_j - \eta \cdot \frac{\partial L}{\partial w_j}
$$

This is how the model **learns** from examples and improves predictions.
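
A minimal sketch of these three steps with `torch.optim.SGD`, reusing the values from Step 1. It also computes the update by hand, to check that plain SGD (no momentum, no weight decay) applies the rule above. The variable names `w` and `w_manual` are our own.

```python
import torch

x = torch.tensor([1.0, 0.0, 1.0])
y = torch.tensor(1.0)
w = torch.tensor([5.0, 2.0, 4.0], requires_grad=True)
bias = -3.0
lr = 0.1

optimizer = torch.optim.SGD([w], lr=lr)

# 1. Forward pass and loss
y_hat = torch.sigmoid(torch.dot(w, x) + bias)
loss = (y_hat - y) ** 2

# 2. Backpropagation fills w.grad (cleared first so gradients do not accumulate)
optimizer.zero_grad()
loss.backward()

# Manual update, computed before the optimizer modifies w
w_manual = w.detach() - lr * w.grad

# 3. The optimizer applies its update in place
optimizer.step()

print(torch.allclose(w.detach(), w_manual))
```

In a training loop, `optimizer.zero_grad()` is called once per iteration because PyTorch accumulates gradients across `backward()` calls.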

✅ **Key Insight**: Gradient descent is what allows neural networks to learn from data. It systematically tweaks the weights so the network gets better at solving its task.


### 🔁 Let's revisit backpropagation:

![Slide 5](./images/ANN_5.png)

### 🔁 Step 3 (continued): How Gradient Descent Updates Weights (Slide 5 Breakdown)

In **Slide 5**, we see that the network made a prediction:

$$
\hat{y} = 6
$$

But the **actual target** was:

$$
y = 1
$$

So the **error** is:

$$
E = y - \hat{y} = 1 - 6 = -5
$$

Even though the sigmoid activation produces a value close to 1 (as seen in earlier steps), this diagram continues the exercise **without** applying sigmoid. For teaching purposes, it uses the raw output $ \hat{y} = z = 6 $ directly.

### ⚙️ Applying the Gradient Descent Update Rule

The formula used is:

$$
w_j^{\text{new}} = w_j^{\text{old}} - \eta \cdot \frac{\partial L}{\partial w_j}
$$

Let's assume:

* Learning rate $ \eta = 0.1 $
* We use the squared error loss:

$$
L = \frac{1}{2}(y - \hat{y})^2
$$

Its gradient w.r.t. $ w_j $ follows from the chain rule. Because the raw output is linear, $ \hat{y} = w_1 x_1 + w_2 x_2 + w_3 x_3 + b $, we have $ \partial \hat{y} / \partial w_j = x_j $, so:

$$
\frac{\partial L}{\partial w_j} = -(y - \hat{y}) \cdot \frac{\partial \hat{y}}{\partial w_j} = (\hat{y} - y) \cdot x_j
$$

### 🧮 Step-by-Step Weight Update Calculations

We know:

* $ y = 1 $
* $ \hat{y} = 6 $
* Error: $ y - \hat{y} = -5 $, so $ \hat{y} - y = 5 $
* Inputs: $ x = [1, 0, 1] $
* Initial weights: $ w_1 = 5, w_2 = 2, w_3 = 4 $

#### 🔹 Weight 1 Update:

$$
\frac{\partial L}{\partial w_1} = 5 \cdot 1 = 5
$$
$$
w_1^{\text{new}} = 5 - 0.1 \cdot 5 = 5 - 0.5 = 4.5
$$

#### 🔹 Weight 2 Update:

$$
\frac{\partial L}{\partial w_2} = 5 \cdot 0 = 0
$$
$$
w_2^{\text{new}} = 2 - 0.1 \cdot 0 = 2
$$

#### 🔹 Weight 3 Update:

$$
\frac{\partial L}{\partial w_3} = 5 \cdot 1 = 5
$$
$$
w_3^{\text{new}} = 4 - 0.1 \cdot 5 = 4 - 0.5 = 3.5
$$

### ✅ Updated Weights:

$$
\begin{aligned}
w_1 &= 4.5 \\
w_2 &= 2.0 \\
w_3 &= 3.5
\end{aligned}
$$

With the bias unchanged, these new weights give $ \hat{y} = 4.5 + 3.5 - 3 = 5 $ in the next forward pass, which is smaller than the previous 6. The network moves **closer to the correct target**, and the error shrinks.
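
A short numerical check of one update step, assuming the linear output $\hat{y} = z$ (no sigmoid) and the loss $L = \frac{1}{2}(y - \hat{y})^2$ from this section. The bias is left unchanged, as in the hand calculation; the variable names are ours.

```python
x_in = [1.0, 0.0, 1.0]
w_old = [5.0, 2.0, 4.0]
b, target, lr = -3.0, 1.0, 0.1

def predict(w):
    """Raw (linear) output of the neuron, without sigmoid."""
    return sum(wj * xj for wj, xj in zip(w, x_in)) + b

y_hat = predict(w_old)

# For L = 0.5*(target - y_hat)^2, dL/dw_j = (y_hat - target) * x_j
grads = [(y_hat - target) * xj for xj in x_in]
w_new = [wj - lr * g for wj, g in zip(w_old, grads)]

print("old prediction:", y_hat)
print("new weights:", w_new)
print("new prediction:", predict(w_new))
```

The prediction drops from 6 to 5 after one step, so the error moves toward zero.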

### 🧠 Intuition from Slide 7

Slide 7 (below) illustrates this process visually. We use the **gradient** of the loss function to adjust the weights in a way that reduces prediction error. This is how a neural network learns: it adjusts its parameters to better match real outcomes.

### 🔁 Let's revisit backpropagation:

![Slide 7](./images/ANN_7.png)

### 🔁 **We'll continue doing this in a loop until the error is acceptably small.**
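
One way such a loop might look for the linear neuron from the slide, in plain Python so each step is visible. The stopping threshold `tolerance` and the cap `max_steps` are assumptions; the source does not say what counts as "acceptably small". In PyTorch, the loop body would instead use `loss.backward()` and `optimizer.step()`.

```python
x_in = [1.0, 0.0, 1.0]
w = [5.0, 2.0, 4.0]
b, target, lr = -3.0, 1.0, 0.1
tolerance = 0.01   # assumed threshold for "acceptably small"
max_steps = 1000   # safety cap so the loop always terminates

for step in range(1, max_steps + 1):
    # Forward pass with the raw (linear) output
    y_hat = sum(wj * xj for wj, xj in zip(w, x_in)) + b
    error = target - y_hat
    if abs(error) < tolerance:
        break
    # Gradient of 0.5*(target - y_hat)^2 w.r.t. w_j is (y_hat - target) * x_j
    w = [wj - lr * (y_hat - target) * xj for wj, xj in zip(w, x_in)]

print(f"stopped at step {step}: weights = {w}, error = {error:.4f}")
```

Only $w_1$ and $w_3$ change, because $x_2 = 0$ gives $w_2$ a zero gradient on this single example.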

