# Lesson 14: Introduction to PyTorch - Tensors and Gradients

Welcome to the third and final part of our course. We've built our foundation in Python and Data Analysis. Now, we enter the world of Deep Learning.

Our tool for this journey is **PyTorch**, one of the world's leading deep learning frameworks. This lesson is fundamental. We will learn about the core building block of PyTorch: the **Tensor**.

We will cover:
1.  What PyTorch is and why we use it.
2.  What a **Tensor** is (Scalars, Vectors, Matrices, etc.).
3.  The deep relationship between **PyTorch Tensors and NumPy Arrays**.
4.  A common reshaping operation: **Flattening**.
5.  Why we need Tensors: **The Simple Perceptron & Matrix Multiplication**.
6.  The first piece of PyTorch magic: **Autograd (Automatic Gradients)**.

---

## 1. What is PyTorch?

PyTorch is an open-source machine learning library originally developed by Facebook's AI Research lab (FAIR, now part of Meta AI) and today governed by the PyTorch Foundation. At its core, it provides two fundamental features:

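1.  **A GPU-accelerated N-dimensional Tensor library**, similar to NumPy but with the ability to run on Graphics Processing Units (GPUs). For large, parallel workloads such as matrix multiplication this can be dramatically faster than a CPU, often by an order of magnitude or more, depending on the hardware and the size of the problem.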
2.  **A deep learning framework** built around an automatic differentiation system called `autograd`. This system allows us to automatically and efficiently calculate gradients, which are the backbone of how neural networks learn.
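
As a rough illustration of the first feature, the sketch below picks a GPU when one is available and falls back to the CPU otherwise. The variable names (`device`, `a`, `b`, `c`) and the matrix size are ours, chosen only for illustration; the same code should run unchanged on either device.

```python
import torch

# Use the GPU if PyTorch can see one; otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors are created directly on the chosen device.
a = torch.rand(1000, 1000, device=device)
b = torch.rand(1000, 1000, device=device)

# The matrix multiplication runs on whichever device holds the tensors.
c = a @ b
print(f"Computed a {tuple(c.shape)} product on: {c.device}")
```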

**Official Documentation Links (Your Best Friend!):**
* **PyTorch Homepage:** [https://pytorch.org/](https://pytorch.org/)
* **Main `torch.Tensor` Docs:** [https://pytorch.org/docs/stable/tensors.html](https://pytorch.org/docs/stable/tensors.html)
* **Autograd Concept Page:** [https://pytorch.org/docs/stable/notes/autograd.html](https://pytorch.org/docs/stable/notes/autograd.html)

In [None]:
import torch
import numpy as np
print(f"PyTorch version: {torch.__version__}")

## 2. What is a Tensor?

A tensor is a generalization of scalars, vectors, and matrices to an arbitrary number of dimensions. It is the fundamental data structure in PyTorch. Don't be intimidated by the word; it's just a name for a multi-dimensional array.

* A **scalar** is a 0-dimensional tensor (a single number).
* A **vector** is a 1-dimensional tensor (a 1D array).
* A **matrix** is a 2-dimensional tensor (a 2D array).
* A 3D tensor can represent an RGB image (height, width, channels). Note that PyTorch's own image layers expect the channels-first order (channels, height, width).
* A 4D tensor can represent a *batch* of RGB images (batch_size, height, width, channels).
* A 5D tensor can represent a *batch* of videos (batch_size, sequence_length, height, width, channels).
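
The number of dimensions of any tensor can be read from its `.ndim` attribute. The short sketch below builds one tensor for each of the first few cases in the list; the sizes are arbitrary and the variable names are only illustrative.

```python
import torch

scalar = torch.tensor(7.0)              # 0D: a single number
vector = torch.tensor([1.0, 2.0, 3.0])  # 1D: a list of numbers
matrix = torch.zeros(2, 3)              # 2D: rows and columns
image = torch.zeros(3, 32, 32)          # 3D: e.g. one small RGB image
batch = torch.zeros(8, 3, 32, 32)       # 4D: e.g. a batch of 8 such images

for name, t in [("scalar", scalar), ("vector", vector), ("matrix", matrix),
                ("image", image), ("batch", batch)]:
    print(f"{name:>6}: ndim={t.ndim}, shape={tuple(t.shape)}")
```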

### Creating Tensors

You can create tensors in many ways, most of which closely mirror their NumPy counterparts.

In [None]:
# 1. From Python lists
t_list = [[1, 2, 3], [4, 5, 6]]
t_from_list = torch.tensor(t_list)
print(f"Tensor from list:\n{t_from_list}")
print(f"Shape: {t_from_list.shape}") # .shape is just like NumPy
print(f"Data type: {t_from_list.dtype}") # .dtype is just like NumPy

# 2. Placeholder tensors (like np.zeros, np.ones)
t_zeros = torch.zeros((2, 3))
print(f"\nZeros tensor:\n{t_zeros}")

t_ones = torch.ones((3, 2), dtype=torch.float32) # We can specify the data type
print(f"\nOnes tensor (float32):\n{t_ones}")

# 3. Tensors with random values (like np.random.rand)
t_rand = torch.rand((2, 2)) # Uniform distribution between [0, 1)
print(f"\nRandom tensor:\n{t_rand}")

# 4. From ranges (like np.arange)
t_range = torch.arange(0, 10, 2) # (start, end, step)
print(f"\nRange tensor: {t_range}")

## 3. The Bridge: NumPy vs. PyTorch

PyTorch tensors and NumPy arrays are very similar. They share most of their API and can be converted back and forth very efficiently.

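**This is the most critical concept:** When you convert a NumPy array to a PyTorch tensor with `torch.from_numpy()`, or a **CPU** tensor to a NumPy array with `.numpy()`, the two objects **share the same underlying memory location**. This means that changing one will change the other! (By contrast, `torch.tensor(np_arr)` always makes a copy.)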

This is done for performance, to avoid making unnecessary copies of large arrays.

In [None]:
# 1. NumPy to PyTorch
np_arr = np.array([1, 2, 3, 4])
torch_from_np = torch.from_numpy(np_arr)
print(f"NumPy array: {np_arr}")
print(f"PyTorch tensor: {torch_from_np}")

# 2. PyTorch to NumPy
torch_ten = torch.tensor([5, 6, 7, 8])
np_from_torch = torch_ten.numpy()
print(f"\nPyTorch tensor: {torch_ten}")
print(f"NumPy array: {np_from_torch}")

# 3. Let's prove they share memory!
print("\n--- Memory Sharing Demonstration ---")
np_test = np.array([10, 20, 30])
torch_test = torch.from_numpy(np_test)
print(f"Original NumPy: {np_test}")
print(f"Original PyTorch: {torch_test}")

print("\nModifying the NumPy array...")
np_test[0] = 99

print(f"NEW NumPy: {np_test}")
print(f"NEW PyTorch (changed automatically!): {torch_test}")
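
Memory sharing depends on *how* the conversion is done. As a small contrast to the demonstration above, this sketch converts the same array twice: once with `torch.from_numpy`, which shares memory, and once with `torch.tensor`, which copies the data. The variable names are illustrative.

```python
import numpy as np
import torch

source = np.array([10, 20, 30])

shared = torch.from_numpy(source)  # shares memory with `source`
copied = torch.tensor(source)      # makes an independent copy

source[0] = 99

print(f"shared: {shared}")  # follows the change to `source`
print(f"copied: {copied}")  # keeps the original value
```

If you need a tensor that is safe from later changes to the array, the copying route is the one to reach for.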

### API Similarity

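If you know NumPy, you already know most of PyTorch's everyday tensor operations. The main naming difference to remember is that NumPy's `axis` argument is called `dim` in PyTorch.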

In [None]:
np_a = np.random.rand(3, 3)
torch_a = torch.from_numpy(np_a)

print(f"NumPy mean: {np_a.mean()}")
print(f"PyTorch mean: {torch_a.mean()}")

print(f"\nNumPy sum (axis 0): \n{np_a.sum(axis=0)}")
print(f"PyTorch sum (dim 0): \n{torch_a.sum(dim=0)}")

# Matrix Multiplication
np_b = np.random.rand(3, 2)
torch_b = torch.from_numpy(np_b)

print(f"\nNumPy matmul:\n{np_a @ np_b}")
print(f"PyTorch matmul:\n{torch_a @ torch_b}")
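
One practical difference is worth a quick check: NumPy creates 64-bit floats by default, whereas PyTorch creates 32-bit floats, and `torch.from_numpy` keeps the NumPy dtype. The sketch below, with illustrative variable names and assuming PyTorch's default dtype has not been changed, inspects the dtypes and converts with `.float()` so that both operands of a matrix multiplication agree.

```python
import numpy as np
import torch

from_np = torch.from_numpy(np.random.rand(3, 3))  # inherits NumPy's float64
native = torch.rand(3, 2)                         # PyTorch default: float32

print(f"from_np dtype: {from_np.dtype}")
print(f"native dtype:  {native.dtype}")

# Convert to a common dtype before combining the two in a matmul.
product = from_np.float() @ native
print(f"product dtype: {product.dtype}, shape: {tuple(product.shape)}")
```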

## 4. Tensor Reshaping: Flattening

A very common operation in deep learning is **flattening** a tensor. Neural networks (especially the first linear layer) often expect a 1D vector of features as input.

For example, an image is 2D (or 3D with color channels). To feed it to a simple network, we must "unroll" it into a 1D vector.

Let's say we have a batch of 10 images, each 28x28 pixels.

In [None]:
# A fake batch of 10 images, 28x28 pixels
batch_of_images = torch.rand(10, 28, 28)
print(f"Original shape: {batch_of_images.shape}") # (batch_size, height, width)

# Flatten the entire tensor (not very useful)
flat_all = batch_of_images.flatten()
print(f"\nTotally flat shape: {flat_all.shape}") # 10 * 28 * 28 = 7840

# What we usually want is to flatten *each item in the batch*.
# We want to keep the batch dimension (10) and unroll the rest (28x28 = 784).
# The final shape should be (10, 784).
flat_correct = batch_of_images.flatten(start_dim=1)
print(f"\nFlattened per batch item (start_dim=1): {flat_correct.shape}")

# An older, but very common way to do this is with .view()
# We tell PyTorch to keep the first dimension and infer the rest with -1
flat_view = batch_of_images.view(10, -1)
print(f"Flattened with .view(10, -1): {flat_view.shape}")
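
One caveat with `.view()`: it only works when the tensor's memory layout allows the new shape without copying data. The sketch below (variable names are illustrative) uses a transposed tensor, which is not contiguous in memory, to show `.view()` failing and `.reshape()` succeeding by copying when it has to.

```python
import torch

t = torch.arange(6).reshape(2, 3)
t_T = t.T  # transposing changes strides, not the underlying data

print(f"Contiguous after transpose? {t_T.is_contiguous()}")

try:
    t_T.view(-1)
    view_failed = False
except RuntimeError as err:
    view_failed = True
    print(f".view() raised RuntimeError: {err}")

# .reshape() returns a view when possible and copies otherwise.
flat = t_T.reshape(-1)
print(f".reshape() result: {flat}")
```

For freshly created tensors such as `batch_of_images`, both approaches behave the same; the difference only shows up after operations like transposing.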

## 5. Why Do We Need Tensors? The Simple Perceptron

This is the core question. We need tensors and matrices for one primary reason: **to perform massive, parallel computations (like matrix multiplication) very, very fast.**

The most basic building block of a neural network is a **Linear Layer** (or a perceptron).

The math is simple: **`y = Wx + b`**

* **`x`**: The input vector (e.g., our flattened 28x28 image, which has 784 features).
* **`W`**: The **weights** matrix. This is what the model *learns*. It represents the *strength of the connections* between each input feature and each output neuron.
* **`b`**: The **bias** vector. An extra set of learnable parameters to shift the output.
* **`y`**: The output vector (e.g., a vector of 10 scores, one for each digit from 0 to 9).

### Let's model this with tensors:

1.  **Input `x`**: A single flattened image. Shape: `(784,)`
2.  **Weights `W`**: We want 10 output neurons, and we have 784 input features. Shape: `(10, 784)`
3.  **Bias `b`**: One bias for each of the 10 output neurons. Shape: `(10,)`

The operation `Wx` is a **matrix-vector product**.

In [None]:
# Create random tensors to simulate the shapes
x = torch.rand(784)
W = torch.rand(10, 784)
b = torch.rand(10)

# Perform the linear layer operation
y = torch.matmul(W, x) + b
# A common shorthand for matmul is '@'
# y = W @ x + b

print(f"Shape of W: {W.shape}")
print(f"Shape of x: {x.shape}")
print(f"Shape of b: {b.shape}")
print(f"Shape of output y: {y.shape}")
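
To make the matrix-vector product less abstract, the following sketch recomputes a single output score by hand: multiply each input feature by the matching weight in row 0 of `W`, sum the results, and add the bias. The seed and the name `y0_manual` are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
x = torch.rand(784)
W = torch.rand(10, 784)
b = torch.rand(10)

y = W @ x + b

# Score of output neuron 0, computed by hand:
# multiply each input feature by its weight, sum, then add the bias.
y0_manual = (W[0] * x).sum() + b[0]

print(f"y[0] from matmul: {y[0].item():.4f}")
print(f"y[0] by hand:     {y0_manual.item():.4f}")
```

Each of the 10 rows of `W` produces one score in exactly this way, so `W @ x` is just 10 such weighted sums computed together.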

### The REAL Power: Batch Processing

Doing this one image at a time is slow. The *real* reason we use tensors is to process a **batch** of inputs (e.g., 64 images) all at once in a single, parallel operation.

1.  **Input `X` (Batch)**: A *matrix* of inputs. Shape: `(64, 784)` (64 images, 784 features each)
2.  **Weights `W`**: Same as before. Shape: `(10, 784)`
3.  **Bias `b`**: Same as before. Shape: `(10,)`

Now `W @ X`, i.e. `(10, 784) @ (64, 784)`, won't work: matrix multiplication needs the inner dimensions to match, and here they are 784 and 64.
We need to change our operation to: **`Y = X @ W.T + b`** (where `W.T` is the **transpose** of W).

* **`X`**: `(64, 784)`
* **`W.T`**: `(784, 10)`

The **matrix-matrix product** `(64, 784) @ (784, 10)` results in a new matrix of shape `(64, 10)`.

This is perfect! It's a batch of 64 output vectors, one for each input image. GPUs are *designed* to do this exact operation (matrix multiplication) extremely fast.
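
The shape reasoning above can be checked directly. This sketch, using illustrative names, first attempts the mismatched product to show the error PyTorch raises, then confirms that the bias of shape `(10,)` is broadcast across all 64 rows, so each row of the batch result matches the single-image computation.

```python
import torch

X = torch.rand(64, 784)
W = torch.rand(10, 784)
b = torch.rand(10)

try:
    W @ X  # (10, 784) @ (64, 784): inner dimensions 784 and 64 differ
    mismatch_raised = False
except RuntimeError as err:
    mismatch_raised = True
    print(f"RuntimeError: {err}")

Y = X @ W.T + b  # (64, 784) @ (784, 10) -> (64, 10)

# `b` has shape (10,) and is broadcast across all 64 rows:
# every row of Y gets the same bias vector added.
row_5 = X[5] @ W.T + b
print(f"Y shape: {tuple(Y.shape)}")
print(f"Row 5 matches single-image result: {torch.allclose(Y[5], row_5, rtol=1e-4)}")
```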

In [None]:
# 1. Create a batch of 64 images
X_batch = torch.rand(64, 784)
W = torch.rand(10, 784) # Weights are the same
b = torch.rand(10)       # Bias is the same

# 2. Perform the batch operation
Y_batch = X_batch @ W.T + b

print(f"Shape of X_batch: {X_batch.shape}")
print(f"Shape of W.T: {W.T.shape}")
print(f"Shape of b: {b.shape}")
print(f"\nShape of output Y_batch: {Y_batch.shape}")
print("This is a batch of 64 output vectors, each with 10 scores!")
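
PyTorch packages this computation as `torch.nn.Linear`, which stores a weight of shape `(out_features, in_features)` and a bias of shape `(out_features,)`. The following sketch, with an illustrative variable `layer`, checks that the layer's output agrees with the manual formula `X @ W.T + b`.

```python
import torch

layer = torch.nn.Linear(in_features=784, out_features=10)
X_batch = torch.rand(64, 784)

Y_layer = layer(X_batch)
Y_manual = X_batch @ layer.weight.T + layer.bias

print(f"weight shape: {tuple(layer.weight.shape)}")
print(f"bias shape:   {tuple(layer.bias.shape)}")
print(f"Outputs match: {torch.allclose(Y_layer, Y_manual, atol=1e-5)}")
```

This is why the weights were given the shape `(10, 784)` earlier: it matches the layout PyTorch itself uses.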

## 6. The Magic: `autograd` (Automatic Gradients)

This is the second, and more magical, reason PyTorch exists. Autograd automatically calculates the gradients (derivatives) of our computations, and those gradients are what neural networks use to learn.

### A Simple Real-World Analogy

Imagine you are on a hill in the fog and you want to get to the valley (the lowest point). You can't see the valley, but you can feel the **slope** of the ground beneath your feet.

* The **hill** represents our error (which we'll call **Loss** in the next lesson). We want to minimize it.
* Your **position** on the hill is your model's **Weight/Parameter** (e.g., a number `w`).
* The **slope** you feel is the **Gradient**. It's a vector that points in the direction of *steepest ascent* (the fastest way *uphill*).

**To get to the valley (minimum error), you just take a small step in the exact opposite direction of the gradient.**

**Autograd is the tool that automatically tells us the slope (the gradient) for every single parameter (like `W` and `b`) in our model.**
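
The analogy can be turned into a few lines of code. The sketch below is a minimal illustration, not a real training loop: the "hill" is a made-up function `(w - 3)²` with its lowest point at `w = 3`, and `step_size` is an arbitrary choice. It relies on `.backward()`, `.grad`, and `.grad.zero_()`, which are explained in the rest of this section.

```python
import torch

# Hypothetical "hill": f(w) = (w - 3)^2, whose lowest point is at w = 3.
w = torch.tensor(0.0, requires_grad=True)
step_size = 0.1

for _ in range(50):
    height = (w - 3) ** 2
    height.backward()              # autograd measures the slope at w
    with torch.no_grad():          # the update itself should not be tracked
        w -= step_size * w.grad    # step in the opposite direction of the slope
    w.grad.zero_()                 # reset the slope before the next measurement

print(f"w after 50 steps: {w.item():.4f}")
```

Starting from `w = 0`, each step moves `w` a little closer to the bottom of the valley at `w = 3`.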

### How it Works: `requires_grad=True`

To tell PyTorch that we want to track the computations for a tensor, we set its `requires_grad` flag to `True`. Parameters of a neural network (instances of `torch.nn.Parameter`) have this set by default.

PyTorch then builds a **computation graph** behind the scenes.
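
The graph is not directly visible, but each tensor produced by a tracked operation carries a `grad_fn` attribute pointing at the operation that created it. The sketch below, with illustrative variable names, inspects those attributes and shows that `torch.no_grad()` switches the tracking off.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x * 3        # recorded in the graph because x requires gradients
z = y + 1

print(f"x.is_leaf: {x.is_leaf}, x.grad_fn: {x.grad_fn}")
print(f"y.grad_fn: {y.grad_fn}")
print(f"z.grad_fn: {z.grad_fn}")

# Inside torch.no_grad(), operations are not recorded.
with torch.no_grad():
    untracked = x * 3
print(f"untracked.requires_grad: {untracked.requires_grad}")
```

Tensors we create ourselves, like `x`, are "leaves" of the graph and have no `grad_fn`; results such as `y` and `z` remember how they were computed.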

### Example 1: A Simple Function

Let's take a simple function `y = 3x²`. 
From calculus, we know the derivative (the gradient) of `y` with respect to `x` is `dy/dx = 6x`.

Let's see if PyTorch agrees. We'll test it at `x = 4`. The gradient should be `6 * 4 = 24`.

In [None]:
# 1. Create our input tensor 'x' and tell PyTorch to track it
x = torch.tensor(4.0, requires_grad=True) # Must be a float to have a gradient

# 2. Define our function 'y'
y = 3 * x**2

print(f"x = {x}")
print(f"y = {y}")

# 3. Calculate the gradients
# This tells PyTorch to go backward through the computation graph and
# calculate the gradient of y with respect to all its dependencies (i.e., x)
y.backward()

# 4. Check the gradient stored in x.grad
print(f"\nThe gradient dy/dx at x=4 is: {x.grad}")

assert x.grad == 24

### Example 2: The Computation Graph

Let's do a more complex example. This is what happens inside a neural network.

`a` and `b` are our parameters (weights). We want to find the gradient of the final scalar output `s` with respect to both `a` and `b`.

Computation:
1. `c = a * 2`
2. `d = b + c`
3. `s = d.mean()` (We compute a single summary scalar `s` from the vector `d`)

In [None]:
a = torch.tensor([1.0, 2.0], requires_grad=True)
b = torch.tensor([3.0, 4.0], requires_grad=True)

c = a * 2
d = b + c
s = d.mean() # 's' is our final scalar value

print(f"a: {a}")
print(f"b: {b}")
print(f"c: {c}")
print(f"d: {d}")
print(f"s: {s}")

# Calculate gradients
s.backward()

# Check the gradients
# The gradient of 's' with respect to 'a' is d(s)/da
# The gradient of 's' with respect to 'b' is d(s)/db
print(f"\nGradient for a: {a.grad}")
print(f"Gradient for b: {b.grad}")
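
To see where those numbers come from, the gradients can be worked out by hand. Since `d = b + 2a` and `s` is the mean of the two elements of `d`, each element of `b` contributes `1/2` to `s`, and each element of `a` contributes `2 * 1/2 = 1`. The sketch below repeats the computation and compares the result against these hand-derived values (`expected_a` and `expected_b` are names introduced here for illustration).

```python
import torch

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = torch.tensor([3.0, 4.0], requires_grad=True)

s = (b + a * 2).mean()
s.backward()

expected_a = torch.tensor([1.0, 1.0])  # ds/da_i = 2 * (1/2)
expected_b = torch.tensor([0.5, 0.5])  # ds/db_i = 1 * (1/2)

print(f"a.grad = {a.grad}, expected {expected_a}")
print(f"b.grad = {b.grad}, expected {expected_b}")
```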

### CRITICAL NUANCE 1: Gradient Accumulation

By default, PyTorch **accumulates** gradients every time you call `.backward()`. It **adds** the new gradients to the existing ones (it does `x.grad += new_grad`).

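This is a design choice: accumulation is useful when a parameter contributes to the result more than once (as in RNNs) or when gradients are deliberately summed over several backward passes. In a standard training loop, however, we must **manually reset the gradients to zero** between learning steps. If we don't, each step mixes stale gradients with new ones and the model will not learn properly.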

In [None]:
x = torch.tensor(3.0, requires_grad=True)

# --- First pass ---
y1 = x**2  # y = 9, grad = 2x = 6
y1.backward()
print(f"After first pass, x.grad = {x.grad}")

# --- Second pass (WITHOUT clearing) ---
y2 = x**3  # y = 27, grad = 3x^2 = 27
y2.backward()
print(f"After second pass, x.grad = {x.grad}") # 6 + 27 = 33

# This is wrong! The gradient for y2 is 27, not 33. It accumulated!
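
The accumulated value is not arbitrary: it is the gradient of the *sum* of the two functions. Here `d/dx (x² + x³) = 2x + 3x²`, which is `6 + 27 = 33` at `x = 3`. A short check, using illustrative variable names:

```python
import torch

# Two separate backward passes, gradients accumulate in x_acc.grad.
x_acc = torch.tensor(3.0, requires_grad=True)
(x_acc ** 2).backward()
(x_acc ** 3).backward()

# One backward pass through the sum of both functions.
x_sum = torch.tensor(3.0, requires_grad=True)
(x_sum ** 2 + x_sum ** 3).backward()

print(f"Accumulated over two passes: {x_acc.grad}")
print(f"Gradient of the sum:         {x_sum.grad}")
```

So accumulation is only "wrong" when we intended the two passes to be independent, which is the usual situation in a training loop.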

### CRITICAL NUANCE 2: Clearing Gradients with `.zero_()`

To fix this, we must call `.grad.zero_()` on all our parameters *before* we calculate the new gradients (i.e., before the next `.backward()` call).

This is why you will see `optimizer.zero_grad()` in virtually every training loop, usually at the start of each iteration: it clears the gradients of every parameter the optimizer manages.

In [None]:
x = torch.tensor(3.0, requires_grad=True)

# --- First pass ---
y1 = x**2  # y = 9, grad = 2x = 6
y1.backward()
print(f"After first pass, x.grad = {x.grad}")

# --- CLEAR THE GRADIENT ---
print("\n--- Clearing gradients ---")
x.grad.zero_() # This is the in-place version of x.grad = torch.zeros_like(x.grad)

# --- Second pass (Correct) ---
y2 = x**3  # y = 27, grad = 3x^2 = 27
y2.backward()
print(f"After second pass, x.grad = {x.grad}") # Now it's 27, which is correct!
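
In a training loop, an optimizer typically handles both the clearing and the update. The sketch below uses `torch.optim.SGD` on a single hypothetical parameter `w` with a made-up loss `(w - 3)²`; it is only meant to show where `zero_grad()` sits relative to `backward()` and `step()`, not to be a realistic model.

```python
import torch

w = torch.tensor(0.0, requires_grad=True)   # hypothetical single parameter
optimizer = torch.optim.SGD([w], lr=0.1)

for _ in range(50):
    optimizer.zero_grad()      # clear gradients left over from the previous step
    loss = (w - 3) ** 2        # made-up loss with its minimum at w = 3
    loss.backward()            # compute fresh gradients
    optimizer.step()           # update w using w.grad

print(f"w after training: {w.item():.4f}")
```

Removing the `zero_grad()` call would make each step use the sum of all previous gradients, which is the accumulation problem shown earlier.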

## 7. Practice Exercises

1.  Create a NumPy array `[10, 20, 30]`. Convert it to a PyTorch tensor. Modify the original NumPy array by changing `10` to `100`. Print the PyTorch tensor. What happened?
2.  Create a 2D PyTorch tensor of shape `(4, 5)` with random integers from 0 to 9 (hint: the upper bound of `torch.randint` is exclusive).
3.  Flatten the tensor from #2 so that its shape changes from `(4, 5)` to `(20,)`.
4.  Calculate the gradient for the function `y = 5x³ + 2x` at the point `x = 2`.

In [None]:
# --- Exercise 1 ---
print("\n--- Exercise 1 ---")
ex1_np = np.array([10, 20, 30])
ex1_torch = torch.from_numpy(ex1_np)
ex1_np[0] = 100
print(f"The tensor also changed: {ex1_torch}") # It shares memory!

# --- Exercise 2 ---
print("\n--- Exercise 2 ---")
ex2 = torch.randint(0, 10, (4, 5))
print(f"Random 4x5 tensor:\n{ex2}")

# --- Exercise 3 ---
print("\n--- Exercise 3 ---")
ex3_flat_all = ex2.flatten()
print(f"Total flatten shape: {ex3_flat_all.shape}")

# --- Exercise 4 ---
print("\n--- Exercise 4 ---")
# y = 5x^3 + 2x
# dy/dx = 15x^2 + 2
# At x=2, grad = 15*(2^2) + 2 = 15*4 + 2 = 60 + 2 = 62
x = torch.tensor(2.0, requires_grad=True)
y = 5*x**3 + 2*x
y.backward()
print(f"The gradient at x=2 is: {x.grad}")
assert x.grad == 62