Let's start with the absolute foundation of PyTorch: **Tensors**.

---

**PyTorch Tutorial Part 1: Tensors in Depth**

At its core, PyTorch is a library for numerical computation on n-dimensional arrays called **tensors**. Tensors are similar to NumPy's `ndarray`s, but they can also run on GPUs for accelerated computation and can record operations for automatic differentiation (which we'll cover in Part 2).

**What we'll cover in this part:**

1.  **Tensor Creation:** Various ways to create tensors.
2.  **Tensor Attributes:** Understanding `shape`, `dtype`, and `device`.
3.  **Tensor Operations:**
    * Arithmetic Operations
    * Matrix Operations
    * Reshaping Operations
    * Indexing, Slicing, Joining, and Mutating
4.  **Broadcasting:** How PyTorch handles operations on tensors of different shapes.
5.  **GPU Acceleration:** Moving tensors between CPU and GPU.
6.  **NumPy Bridge:** Interoperability with NumPy arrays.

Let's get started with examples!



```python
import torch
import numpy as np # For comparison and interoperability

print(f"PyTorch Version: {torch.__version__}")
```

**1. Tensor Creation**

PyTorch offers many ways to create tensors:

* **From Python lists or sequences:**
    ```python
    # From a Python list
    data_list = [[1, 2], [3, 4]]
    tensor_from_list = torch.tensor(data_list)
    print("Tensor from list:\n", tensor_from_list)

    # You can specify the data type (dtype)
    tensor_float = torch.tensor([[1.0, 2.0], [3.0, 4.0]], dtype=torch.float32)
    print("\nFloat tensor from list:\n", tensor_float)

    tensor_long = torch.tensor([1, 2, 3], dtype=torch.long) # int64
    print("\nLong tensor:\n", tensor_long)
    ```

* **From NumPy arrays:**
    ```python
    numpy_array = np.array([[1, 2, 3], [4, 5, 6]])
    tensor_from_numpy = torch.from_numpy(numpy_array) # Shares memory with numpy_array
    print("\nTensor from NumPy array:\n", tensor_from_numpy)

    # Modifying the NumPy array will affect the tensor (if they share memory)
    numpy_array[0, 0] = 99
    print("Modified tensor (after modifying NumPy array):\n", tensor_from_numpy)

    # To create a copy that doesn't share memory:
    tensor_copy_from_numpy = torch.tensor(numpy_array) # This creates a copy
    numpy_array[0, 0] = 100
    print("Original NumPy array now:\n", numpy_array)
    print("Tensor copy (unaffected by later NumPy mod):\n", tensor_copy_from_numpy)
    ```

* **Creating tensors with specific values (zeros, ones, random):**
    ```python
    # Tensor of zeros
    zeros_tensor = torch.zeros(2, 3) # Shape (2 rows, 3 columns)
    print("\nZeros tensor:\n", zeros_tensor)

    # Tensor of ones
    ones_tensor = torch.ones(3, 2, dtype=torch.double) # Specify dtype
    print("\nOnes tensor (double precision):\n", ones_tensor)

    # Tensor with random values (uniformly distributed between 0 and 1)
    rand_tensor = torch.rand(2, 2)
    print("\nRandom tensor (uniform distribution):\n", rand_tensor)

    # Tensor with random values (normally distributed with mean 0, variance 1)
    randn_tensor = torch.randn(3, 3)
    print("\nRandom tensor (normal distribution):\n", randn_tensor)

    # Tensor with integer random values within a range
    randint_tensor = torch.randint(low=0, high=10, size=(2, 4)) # Integers from [0, 10)
    print("\nRandom integer tensor:\n", randint_tensor)
    ```

* **Creating tensors like other tensors (`*_like` functions):**
    These functions create a new tensor with the same `shape`, `dtype`, and `device` as an existing tensor, unless you override them.
    ```python
    x = torch.tensor([[1,2,3],[4,5,6]], dtype=torch.float16)
    zeros_like_x = torch.zeros_like(x)
    print("\nZeros tensor like x:\n", zeros_like_x)
    print("dtype of zeros_like_x:", zeros_like_x.dtype)

    rand_like_x = torch.rand_like(x, dtype=torch.float32) # Override dtype
    print("\nRandom tensor like x (but with float32 dtype):\n", rand_like_x)
    print("dtype of rand_like_x:", rand_like_x.dtype)
    ```

* **Creating tensors with ranges:**
    ```python
    # Like Python's range()
    range_tensor = torch.arange(0, 10, 2) # Start, end (exclusive), step
    print("\nRange tensor (arange):\n", range_tensor)

    # Linearly spaced points
    linspace_tensor = torch.linspace(0, 1, 5) # Start, end (inclusive), number of steps
    print("\nLinspace tensor:\n", linspace_tensor)
    ```
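One detail worth checking after creating a tensor is the `dtype` PyTorch inferred for it. The sketch below assumes the default floating-point dtype has not been changed (for example with `torch.set_default_dtype`); the variable names are illustrative only.

```python
import numpy as np
import torch

# Python ints are inferred as int64; Python floats use the default float dtype (float32).
ints = torch.tensor([1, 2, 3])
floats = torch.tensor([1.0, 2.0, 3.0])

# NumPy floats default to float64, and from_numpy() preserves the array's dtype.
from_np = torch.from_numpy(np.array([1.0, 2.0, 3.0]))

print(ints.dtype)     # torch.int64
print(floats.dtype)   # torch.float32
print(from_np.dtype)  # torch.float64

# Convert explicitly when a float32 tensor is needed (this creates a copy).
from_np32 = from_np.to(torch.float32)
print(from_np32.dtype)  # torch.float32
```

This matters in practice because mixing `float64` data that came from NumPy with `float32` model weights is a common source of dtype mismatch errors.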

**2. Tensor Attributes**

Every tensor has attributes that describe its characteristics:

* **`shape` (or `size()`):** The dimensions of the tensor.
* **`dtype`:** The data type of the elements in the tensor.
* **`device`:** The device (CPU or GPU) where the tensor's data is stored.

```python
tensor_example = torch.randn(3, 4, 5) # A 3x4x5 tensor

print("\n--- Tensor Attributes ---")
print("Tensor example:\n", tensor_example)
print("Shape of tensor:", tensor_example.shape)
print("Size of tensor (alternative way):", tensor_example.size())
print("Data type (dtype) of tensor:", tensor_example.dtype)
print("Device tensor is stored on:", tensor_example.device)
print("Number of dimensions:", tensor_example.ndim)
print("Number of elements:", tensor_example.numel())
```

**3. Tensor Operations**

PyTorch provides a rich set of operations for manipulating tensors.

* **Arithmetic Operations:**
    These are typically element-wise.
    ```python
    a = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
    b = torch.tensor([[5, 6], [7, 8]], dtype=torch.float32)

    print("\n--- Arithmetic Operations ---")
    print("Tensor a:\n", a)
    print("Tensor b:\n", b)

    # Addition
    sum_ab = a + b
    # sum_ab_alt = torch.add(a, b)
    print("a + b:\n", sum_ab)

    # Subtraction
    diff_ab = a - b
    print("a - b:\n", diff_ab)

    # Element-wise Multiplication
    mul_ab = a * b
    # mul_ab_alt = torch.mul(a, b)
    print("a * b (element-wise):\n", mul_ab)

    # Division
    div_ab = a / b
    print("a / b:\n", div_ab)

    # Exponentiation
    pow_a = a ** 2
    # pow_a_alt = torch.pow(a, 2)
    print("a squared:\n", pow_a)

    # In-place operations (modify the tensor directly, often denoted by a trailing underscore `_`)
    c = torch.ones(2, 2)
    print("Tensor c before add_:\n", c)
    c.add_(a) # c = c + a
    print("Tensor c after c.add_(a):\n", c)
    ```

* **Matrix Operations:**
    ```python
    mat1 = torch.tensor([[1., 2.], [3., 4.]])
    mat2 = torch.tensor([[5., 6.], [7., 8.]])
    vec1 = torch.tensor([1., 2., 3.])
    vec2 = torch.tensor([4., 5., 6.])

    print("\n--- Matrix Operations ---")
    print("Matrix 1:\n", mat1)
    print("Matrix 2:\n", mat2)

    # Matrix Multiplication
    mat_mul = torch.matmul(mat1, mat2)
    # mat_mul_alt = mat1 @ mat2
    print("Matrix multiplication (mat1 @ mat2):\n", mat_mul)

    # Element-wise product (already covered, but important to distinguish)
    # elem_prod = mat1 * mat2

    # Dot product of two 1D tensors
    dot_prod = torch.dot(vec1, vec2)
    print(f"Dot product of {vec1} and {vec2}:", dot_prod)

    # Transpose
    mat1_t = mat1.T
    # mat1_t_alt = torch.transpose(mat1, 0, 1)
    print("Transpose of Matrix 1:\n", mat1_t)
    ```

* **Reshaping Operations:**
    Changing the shape of a tensor without changing its data.
    ```python
    x_orig = torch.arange(12) # 0 to 11
    print("\n--- Reshaping Operations ---")
    print("Original tensor x_orig (1D, 12 elements):\n", x_orig)

    # view(): Returns a tensor with a different shape that shares storage with the original (no copy).
    # The new shape must have the same total number of elements.
    # view() also requires a compatible memory layout, e.g. a contiguous tensor like x_orig.
    x_view_3_4 = x_orig.view(3, 4)
    print("x_orig.view(3, 4):\n", x_view_3_4)
    # x_view_3_neg1 = x_orig.view(3, -1) # -1 infers the dimension
    # print("x_orig.view(3, -1):\n", x_view_3_neg1)

    # reshape(): Similar to view(), but can operate on non-contiguous tensors
    # (by creating a copy if necessary). Generally safer if unsure about contiguity.
    x_reshaped_4_3 = x_orig.reshape(4, 3)
    print("x_orig.reshape(4, 3):\n", x_reshaped_4_3)

    # squeeze(): Removes dimensions of size 1.
    x_unsqueeze_example = torch.randn(1, 3, 1, 5)
    print("x_unsqueeze_example shape:", x_unsqueeze_example.shape)
    x_squeezed = x_unsqueeze_example.squeeze() # Removes all dimensions of size 1
    print("x_squeezed shape (all ones removed):", x_squeezed.shape)
    x_squeezed_dim0 = x_unsqueeze_example.squeeze(0) # Removes dim 0 if it's size 1
    print("x_squeezed_dim0 shape:", x_squeezed_dim0.shape)

    # unsqueeze(): Adds a dimension of size 1 at a specified position.
    x_unsqueezed_dim0 = x_squeezed.unsqueeze(0) # Add dim at position 0
    print("x_unsqueezed_dim0 shape (from squeezed):", x_unsqueezed_dim0.shape)
    x_unsqueezed_dim2 = x_squeezed.unsqueeze(2) # Add dim at position 2
    print("x_unsqueezed_dim2 shape (from squeezed):", x_unsqueezed_dim2.shape)

    # permute(): Rearranges dimensions.
    x_permute_example = torch.randn(2, 3, 4)
    print("x_permute_example shape:", x_permute_example.shape)
    x_permuted = x_permute_example.permute(2, 0, 1) # old_dim2, old_dim0, old_dim1
    print("x_permuted shape (4, 2, 3):", x_permuted.shape)

    # flatten(): Flattens a tensor from a start_dim to an end_dim.
    x_flatten_example = torch.randn(2, 3, 4)
    print("x_flatten_example shape:", x_flatten_example.shape)
    x_flattened_all = torch.flatten(x_flatten_example) # Flattens completely
    print("x_flattened_all shape:", x_flattened_all.shape)
    x_flattened_from_dim1 = torch.flatten(x_flatten_example, start_dim=1) # Flattens from dim 1 onwards
    print("x_flattened_from_dim1 shape:", x_flattened_from_dim1.shape)
    ```

* **Indexing, Slicing, Joining, and Mutating:**
    Similar to NumPy, PyTorch offers rich indexing and slicing capabilities.
    ```python
    tensor_idx_slice = torch.arange(10).reshape(2, 5) # 0-9 in a 2x5 tensor
    print("\n--- Indexing and Slicing ---")
    print("Original tensor for indexing/slicing:\n", tensor_idx_slice)

    # Get a specific element
    print("Element at [0, 0]:", tensor_idx_slice[0, 0])

    # Get a row
    print("First row (tensor_idx_slice[0]):\n", tensor_idx_slice[0])
    print("First row (tensor_idx_slice[0, :]):\n", tensor_idx_slice[0, :])

    # Get a column
    print("First column (tensor_idx_slice[:, 0]):\n", tensor_idx_slice[:, 0])

    # Get a sub-tensor (slicing)
    print("Sub-tensor (rows 0-1, cols 1-3):\n", tensor_idx_slice[0:2, 1:4]) # or [:, 1:4] for all rows

    # Boolean indexing
    bool_idx_tensor = torch.tensor([[True, False, True], [False, True, True]])
    data_for_bool = torch.randn(2,3)
    print("Data for boolean indexing:\n", data_for_bool)
    print("Boolean mask:\n", bool_idx_tensor)
    print("Elements selected by boolean mask:\n", data_for_bool[bool_idx_tensor]) # Returns a 1D tensor

    # Joining tensors
    t1 = torch.zeros(2, 3)
    t2 = torch.ones(2, 3)
    t3 = torch.full((2, 3), 2.0) # Float fill value, so t3 has the same dtype as t1 and t2

    # Concatenate along an existing dimension (dim=0 appends rows, dim=1 appends columns)
    cat_dim0 = torch.cat((t1, t2, t3), dim=0)
    print("\nConcatenated along dim 0 (rows):\n", cat_dim0)
    print("Shape of cat_dim0:", cat_dim0.shape)

    cat_dim1 = torch.cat((t1, t2, t3), dim=1)
    print("\nConcatenated along dim 1 (columns):\n", cat_dim1)
    print("Shape of cat_dim1:", cat_dim1.shape)

    # Stacking (creates a new dimension)
    stack_dim0 = torch.stack((t1, t2, t3), dim=0)
    print("\nStacked along new dim 0:\n", stack_dim0)
    print("Shape of stack_dim0:", stack_dim0.shape) # (3, 2, 3)

    # Mutating (modifying) tensors using indexing
    print("\nTensor before mutation:\n", tensor_idx_slice)
    tensor_idx_slice[0, 0] = 100
    tensor_idx_slice[1, :] = 55 # Set whole second row to 55
    print("Tensor after mutation:\n", tensor_idx_slice)
    ```
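Returning to `view()` versus `reshape()` from the reshaping examples: the difference only shows up on non-contiguous tensors. The following is a minimal sketch using a transposed tensor as the non-contiguous case; the variable names are illustrative.

```python
import torch

base = torch.arange(6).reshape(2, 3)
transposed = base.T  # Shape (3, 2); same storage as base, but no longer contiguous

print(transposed.is_contiguous())  # False

# view() cannot reinterpret this memory layout without copying, so it raises an error.
try:
    transposed.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True
print("view() failed:", view_failed)  # True

# reshape() falls back to making a copy when a view is not possible.
flat = transposed.reshape(6)
print(flat)  # tensor([0, 3, 1, 4, 2, 5])

# Equivalent explicit form: make a contiguous copy first, then view it.
flat_alt = transposed.contiguous().view(6)

# A successful view shares memory with its source: writing through it changes base.
shared = base.view(3, 2)
shared[0, 0] = 100
print(base[0, 0])  # tensor(100)
```

If you need a guaranteed independent copy regardless of layout, `clone()` makes that intent explicit.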

**4. Broadcasting**

Broadcasting allows PyTorch to perform operations on tensors of different (but compatible) shapes without explicitly making copies of the data. The rules are similar to NumPy:

1.  If the tensors have different numbers of dimensions, prepend 1s to the shape of the tensor with fewer dimensions until both have the same number of dimensions.
2.  Two tensors are compatible in a dimension if:
    * They are equal in that dimension, OR
    * One of them has size 1 in that dimension.
3.  The size of the resulting tensor in each dimension is the maximum of the sizes of the input tensors in that dimension.

```python
x_broadcast = torch.arange(3).view(3, 1) # Shape (3, 1) -> [[0], [1], [2]]
y_broadcast = torch.arange(2).view(1, 2) # Shape (1, 2) -> [[0, 1]]

print("\n--- Broadcasting ---")
print("x_broadcast (shape 3x1):\n", x_broadcast)
print("y_broadcast (shape 1x2):\n", y_broadcast)

# x_broadcast is expanded to (3, 2) by repeating its single column
# y_broadcast is expanded to (3, 2) by repeating its single row
result_broadcast = x_broadcast + y_broadcast
print("x_broadcast + y_broadcast (result shape 3x2):\n", result_broadcast)

# Another example
a_bc = torch.ones(3, 4) # Shape (3, 4)
b_bc = torch.rand(4)    # Shape (4) -> effectively (1, 4) for broadcasting
c_bc = a_bc + b_bc      # b_bc is broadcast across the rows of a_bc
print("\na_bc (3,4) + b_bc (4) result shape:", c_bc.shape)
```
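To check the broadcasting rules without allocating any data, `torch.broadcast_shapes` can be applied to shapes directly. The sketch below also shows what happens when two shapes are not compatible; the specific shapes are chosen only for illustration.

```python
import torch

# torch.broadcast_shapes applies the broadcasting rules to shapes alone.
print(torch.broadcast_shapes((3, 1), (1, 2)))     # torch.Size([3, 2])
print(torch.broadcast_shapes((3, 4), (4,)))       # torch.Size([3, 4])
print(torch.broadcast_shapes((5, 1, 4), (3, 1)))  # torch.Size([5, 3, 4])

# Trailing dimensions 4 and 3 differ and neither is 1, so these shapes are incompatible.
try:
    torch.ones(3, 4) + torch.ones(3)
    incompatible = False
except RuntimeError as e:
    incompatible = True
    print("Broadcast error:", e)
```

A common fix for the failing case is to add an explicit dimension, for example `torch.ones(3).unsqueeze(1)`, so that the shapes become `(3, 4)` and `(3, 1)`.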

**5. GPU Acceleration**

One of PyTorch's key strengths is its seamless GPU support (primarily NVIDIA GPUs via CUDA).

```python
print("\n--- GPU Acceleration ---")
# Check if CUDA (GPU support) is available
if torch.cuda.is_available():
    print("CUDA is available! Running on GPU.")
    device = torch.device("cuda") # Default CUDA device

    # Create a tensor and move it to GPU
    tensor_cpu = torch.randn(3, 3)
    print("Tensor on CPU:\n", tensor_cpu)
    print("Device of tensor_cpu:", tensor_cpu.device)

    tensor_gpu = tensor_cpu.to(device) # or tensor_cpu.cuda()
    print("\nTensor on GPU:\n", tensor_gpu)
    print("Device of tensor_gpu:", tensor_gpu.device)

    # Operations can be performed directly on GPU tensors
    result_gpu = tensor_gpu * tensor_gpu + 2
    print("\nResult of GPU operation (still on GPU):\n", result_gpu)
    print("Device of result_gpu:", result_gpu.device)

    # To move a tensor back to CPU
    result_cpu = result_gpu.to(torch.device("cpu")) # or result_gpu.cpu()
    print("\nResult moved back to CPU:\n", result_cpu)
    print("Device of result_cpu:", result_cpu.device)

    # Note: All tensors involved in an operation must be on the same device.
    # Trying to operate on a CPU tensor and a GPU tensor directly will raise an error.
    try:
        error_prone = tensor_cpu + tensor_gpu
    except RuntimeError as e:
        print("\nError when adding CPU and GPU tensors directly:", e)

else:
    print("CUDA is not available. Examples will run on CPU.")
    device = torch.device("cpu")
```
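A common way to avoid branching on `torch.cuda.is_available()` throughout your code is to pick a device once and pass it everywhere. The following is a minimal sketch of that device-agnostic pattern; it runs unchanged on a CPU-only machine.

```python
import torch

# Choose the device once; fall back to CPU when CUDA is not available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Allocate directly on the target device instead of creating on CPU and copying.
x = torch.randn(3, 3, device=device)

# Tensors created elsewhere can be moved with .to(device).
y = torch.ones(3, 3).to(device)

z = x @ y  # Both operands are on the same device, so this is valid
print("Result device:", z.device)

# .cpu() returns the tensor itself if it is already on the CPU.
z_cpu = z.cpu()
print("After .cpu():", z_cpu.device)
```

Creating tensors with the `device=` argument avoids an extra host-to-device copy compared with creating them on the CPU first.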

**6. NumPy Bridge**

PyTorch offers excellent interoperability with NumPy.

* **Tensor to NumPy array:**
    ```python
    print("\n--- NumPy Bridge ---")
    pytorch_tensor_np = torch.ones(5)
    print("PyTorch tensor:", pytorch_tensor_np)

    # .numpy() works only on CPU tensors
    # If tensor is on GPU, first move it to CPU: pytorch_tensor_np.cpu().numpy()
    numpy_array_from_tensor = pytorch_tensor_np.numpy()
    print("Converted NumPy array:", numpy_array_from_tensor)

    # IMPORTANT: If the tensor is on the CPU, the PyTorch tensor and NumPy array
    # will share their underlying memory locations. Modifying one will affect the other.
    pytorch_tensor_np.add_(1) # In-place addition
    print("PyTorch tensor after in-place add:", pytorch_tensor_np)
    print("NumPy array reflects the change:", numpy_array_from_tensor)
    ```

* **NumPy array to Tensor:**
    ```python
    numpy_array_to_convert = np.arange(5, dtype=np.float32)
    print("\nOriginal NumPy array:", numpy_array_to_convert)

    pytorch_tensor_from_np = torch.from_numpy(numpy_array_to_convert)
    print("Converted PyTorch tensor:", pytorch_tensor_from_np)

    # Again, they share memory
    np.add(numpy_array_to_convert, 1, out=numpy_array_to_convert) # In-place add for NumPy
    print("NumPy array after in-place add:", numpy_array_to_convert)
    print("PyTorch tensor reflects the change:", pytorch_tensor_from_np)
    ```
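When the shared memory described above is not what you want, you can break the link with an explicit copy on either side. This is a small sketch using NumPy's `copy()` and PyTorch's `clone()`; the variable names are illustrative.

```python
import numpy as np
import torch

t = torch.ones(3)
arr_shared = t.numpy()       # Shares memory with t
arr_copy = t.numpy().copy()  # Independent NumPy copy

a = np.zeros(3, dtype=np.float32)
t_shared = torch.from_numpy(a)        # Shares memory with a
t_copy = torch.from_numpy(a).clone()  # Independent tensor copy

t.add_(1)  # In-place change to the tensor
a += 5     # In-place change to the array

print(arr_shared)  # [2. 2. 2.]  (follows t)
print(arr_copy)    # [1. 1. 1.]  (unaffected)
print(t_shared)    # tensor([5., 5., 5.])  (follows a)
print(t_copy)      # tensor([0., 0., 0.])  (unaffected)
```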

---

**PyTorch Tutorial Part 2: Automatic Differentiation with `torch.autograd`**

One of the most powerful features of PyTorch is `torch.autograd`, its automatic differentiation engine. This is what enables PyTorch to automatically calculate the gradients of your computation (typically a loss function) with respect to its parameters (typically the weights of a neural network). These gradients are essential for training neural networks using optimization algorithms like gradient descent.

**What we'll cover in this part:**

1.  **What is `autograd`?**
2.  **Computation Graphs:** How PyTorch tracks operations.
3.  **`requires_grad`:** Tracking operations for gradient computation.
4.  **`grad_fn`:** The backward function reference.
5.  **`.backward()`:** Computing gradients.
6.  **`.grad`:** Accessing computed gradients.
7.  **Gradient Accumulation:** How gradients are summed up.
8.  **Disabling Gradient Tracking:** `torch.no_grad()` and `detach()`.
9.  **More on `.backward()` (Vector-Jacobian Product).**

Let's dive in with examples.

```python
import torch

print(f"PyTorch Version: {torch.__version__}")
```

**1. What is `autograd`?**

At its heart, `autograd` allows you to request the gradient of some output (usually a scalar loss value) with respect to some input(s) (usually model parameters). If you have a function $y = f(x)$, where $x$ is an input tensor and $y$ is an output tensor, `autograd` helps you compute $\frac{\partial y}{\partial x}$. For neural networks, $x$ would be the parameters, and $y$ would be the loss.

**2. Computation Graphs**

PyTorch uses a "define-by-run" philosophy, meaning the computation graph is built dynamically as operations are performed. When tensors that are "tracked" (we'll see this with `requires_grad`) are used in operations, PyTorch records these operations, forming a directed acyclic graph (DAG).

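* **Leaves** of this graph are the input tensors, and the **root** is the output tensor.
* The **nodes** in between are the recorded functions (operations); each result tensor keeps a reference to the function that produced it.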

When you call `.backward()` from an output node (e.g., loss), `autograd` traverses this graph backward from that node to compute gradients for all leaf nodes that require gradients.

**3. `requires_grad` Attribute**

For `autograd` to track operations on a tensor and compute gradients for it, the tensor's `requires_grad` attribute must be set to `True`.

* By default, tensors you create manually have `requires_grad=False`.
* Parameters of `torch.nn.Module` (which we'll see in Part 3) typically have `requires_grad=True` by default.

```python
# Example of requires_grad
x = torch.tensor(2.0, requires_grad=False) # Default
y = torch.tensor(3.0, requires_grad=True)  # We want gradients for y
z = torch.tensor(4.0, requires_grad=True)  # And for z

print(f"x: {x}, requires_grad: {x.requires_grad}")
print(f"y: {y}, requires_grad: {y.requires_grad}")
print(f"z: {z}, requires_grad: {z.requires_grad}")

# Operations involving tensors with requires_grad=True will produce outputs
# that also require gradients and are part of the computation graph.
f = y * z + x # x is not tracked
print(f"\nf = y*z + x = {f}")
print(f"f.requires_grad: {f.requires_grad}") # True because y and z require_grad

# You can change requires_grad in-place for an existing tensor
a = torch.randn(2, 2)
print(f"\na before: {a.requires_grad}")
a.requires_grad_(True) # In-place operation
print(f"a after: {a.requires_grad}")
```

**4. `grad_fn` Attribute**

If a tensor is the result of an operation involving tensors that require gradients, it will have a `grad_fn` attribute. This attribute references the function (like `AddBackward0`, `MulBackward0`) that created this tensor and is used by `autograd` during the backward pass to compute gradients.

* User-created tensors (leaf nodes) with `requires_grad=True` will have `grad_fn=None`.

```python
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
b = torch.tensor([0.5], requires_grad=True)
inputs = torch.tensor([10.0, 11.0, 12.0])

# A simple linear operation
output = (w * inputs).sum() + b # output = w_0*i_0 + w_1*i_1 + w_2*i_2 + b

print(f"\nw: {w}, grad_fn: {w.grad_fn}") # Leaf node, so None
print(f"b: {b}, grad_fn: {b.grad_fn}") # Leaf node, so None
print(f"inputs: {inputs}, grad_fn: {inputs.grad_fn}") # Not requiring grad, so None

print(f"\noutput: {output}")
print(f"output.requires_grad: {output.requires_grad}")
print(f"output.grad_fn: {output.grad_fn}") # Will show something like <AddBackward0 object at ...>
```
The `grad_fn` shows the last operation that created `output`. `autograd` uses this chain of `grad_fn`s to go backward.
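To see that chain, you can follow each node's `next_functions` attribute, which links a backward function to the backward functions of its inputs. The sketch below uses a hypothetical helper named `walk` (not part of PyTorch) to print the graph for a small expression.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = torch.tensor(1.0, requires_grad=True)
y = a * b + c

def walk(fn, depth=0):
    """Print the backward graph reachable from a grad_fn node (illustrative helper)."""
    if fn is None:  # Inputs that do not require gradients appear as None
        return
    print("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1)

walk(y.grad_fn)
```

The output lists the addition node first, then the multiplication node nested under it, with `AccumulateGrad` entries marking the leaf tensors whose `.grad` will be filled in by the backward pass.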

**5. `.backward()` - Computing Gradients**

Once you have a scalar output (like a loss value) that results from a chain of operations on tensors requiring gradients, you can call `.backward()` on this scalar. This triggers `autograd` to compute the gradients of this scalar with respect to all leaf tensors in the graph that have `requires_grad=True`.

```python
x1 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(5.0, requires_grad=True)
y_out = x1**2 * x2 + x1 # y_out = x1^2 * x2 + x1

print(f"\nx1 = {x1.item()}, x2 = {x2.item()}")
print(f"y_out = x1^2 * x2 + x1 = {y_out.item()}")

# Now, let's compute the gradients dy_out/dx1 and dy_out/dx2
# Analytically:
# dy_out/dx1 = 2*x1*x2 + 1 = 2*2*5 + 1 = 21
# dy_out/dx2 = x1^2       = 2^2       = 4

y_out.backward() # This populates .grad attribute of x1 and x2

# Gradients are now stored in x1.grad and x2.grad
print(f"\ndy_out/dx1 (x1.grad): {x1.grad}")
print(f"dy_out/dx2 (x2.grad): {x2.grad}")
```
The `.grad` attribute will now hold the computed gradients.

**6. `.grad` - Accessing Computed Gradients**

As seen above, after calling `.backward()`, the gradients are accumulated in the `.grad` attribute of the leaf tensors (those with `requires_grad=True` for which gradients were computed).

* If `requires_grad` is `False` for a tensor, its `.grad` attribute will remain `None`.
* Only leaf nodes in the graph that have `requires_grad=True` will have their `.grad` populated. Gradients for intermediate tensors are not typically stored to save memory, but can be accessed using "hooks" (an advanced topic).
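The following sketch illustrates these rules on a tiny graph. Besides hooks, an intermediate tensor's gradient can also be kept by calling `retain_grad()` on it before the backward pass, which is what the example does; the variable names are illustrative.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)  # Leaf that requires gradients
k = torch.tensor(2.0)                      # Leaf that does not require gradients
h = x * k                                  # Intermediate (non-leaf) tensor, h = 6
h.retain_grad()                            # Ask autograd to keep h.grad
y = h ** 2                                 # y = 36

y.backward()

print(x.grad)  # tensor(24.)  since dy/dx = 2 * h * k = 2 * 6 * 2
print(h.grad)  # tensor(12.)  since dy/dh = 2 * h
print(k.grad)  # None, because k.requires_grad is False
print(x.is_leaf, h.is_leaf)  # True False
```

Without the `retain_grad()` call, `h.grad` would stay `None` after `backward()`.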

**7. Gradient Accumulation**

Crucially, PyTorch **accumulates** gradients. This means if you call `.backward()` multiple times, the newly computed gradients are *added* to the existing values in the `.grad` attribute.

This behavior is useful for scenarios like training with gradient accumulation over multiple mini-batches or for custom gradient manipulations. However, in a standard training loop, you usually want to compute fresh gradients for each batch. Therefore, you **must zero out the gradients** before each new backward pass.

```python
p = torch.tensor(3.0, requires_grad=True)
q_val = p * p # q_val = p^2
print(f"\np = {p.item()}, q_val = {q_val.item()}")

# First backward pass
q_val.backward() # dq/dp = 2*p = 6
print(f"After first backward pass, p.grad = {p.grad}") # Should be 6.0

# Second backward pass (without zeroing gradients).
# By default the graph is freed after backward(), so calling q_val.backward()
# a second time would raise an error. To demonstrate accumulation on the *same*
# graph, build a fresh graph and keep it alive with retain_graph=True on the first call:
p_new = torch.tensor(3.0, requires_grad=True)
q_new = p_new * p_new
q_new.backward(retain_graph=True) # dq/dp = 6
print(f"\nFor p_new = {p_new.item()}:")
print(f"After first backward, p_new.grad = {p_new.grad}") # 6.0

q_new.backward() # Call backward again on the same q_new
print(f"After second backward (gradients accumulated), p_new.grad = {p_new.grad}") # Should be 6.0 + 6.0 = 12.0

# In a typical training loop, you would do this for an optimizer:
# optimizer.zero_grad() # Clears p.grad
# loss = ...
# loss.backward()
# optimizer.step()

# Let's demonstrate zeroing explicitly for a tensor
p.grad.zero_() # In-place zeroing of the gradient
print(f"After p.grad.zero_(), p.grad = {p.grad}") # Should be 0.0

q_val_fresh = p * p # Recompute q_val using the same p
q_val_fresh.backward()
print(f"After zeroing and fresh backward, p.grad = {p.grad}") # Should be 6.0 again
```
In practice, `optimizer.zero_grad()` (which we'll see in Part 4) handles zeroing the gradients for all model parameters managed by the optimizer.

**8. Disabling Gradient Tracking**

Sometimes, you don't want PyTorch to track operations, for example:
* During inference/evaluation (speeds up computation and saves memory).
* When you are modifying model parameters manually (e.g., in some reinforcement learning scenarios) and don't want these modifications to be part of the gradient history.
* When a part of your model should be "frozen."

PyTorch provides two main ways to do this:

* **`torch.no_grad()` context manager:**
    Any operations performed within this block will not be tracked, even if their inputs have `requires_grad=True`.
    ```python
    weights = torch.randn(3, 3, requires_grad=True)
    inputs_data = torch.randn(3, 1)
    print(f"\nweights.requires_grad: {weights.requires_grad}")

    with torch.no_grad():
        output_no_grad = weights @ inputs_data
        print(f"Inside torch.no_grad(): output_no_grad.requires_grad = {output_no_grad.requires_grad}") # False
        print(f"output_no_grad.grad_fn = {output_no_grad.grad_fn}") # None

    # Operations outside the block are tracked again if inputs require grad
    output_with_grad = weights @ inputs_data
    print(f"Outside torch.no_grad(): output_with_grad.requires_grad = {output_with_grad.requires_grad}") # True
    print(f"output_with_grad.grad_fn = {output_with_grad.grad_fn}") # Not None
    ```

* **`.detach()` method:**
    This method creates a new tensor that shares the same data as the original tensor but is detached from the computation graph. It will have `requires_grad=False`, and no operations involving it will be tracked for gradient computation *through this detached tensor*. The original tensor remains unchanged.

    ```python
    original_tensor = torch.randn(2, 2, requires_grad=True)
    detached_tensor = original_tensor.detach()

    print(f"\noriginal_tensor.requires_grad: {original_tensor.requires_grad}") # True
    print(f"detached_tensor.requires_grad: {detached_tensor.requires_grad}")   # False

    # Modifying the detached tensor's data will affect the original tensor's data
    # because they share the same underlying storage.
    detached_tensor[0,0] = 100.0
    print(f"original_tensor after modifying detached_tensor:\n{original_tensor}")

    # Operations on detached_tensor won't affect gradients of original_tensor
    # if original_tensor is used elsewhere in the graph.
    # If you perform an operation with the original_tensor, it will still build a graph.
    some_output = original_tensor * 2
    print(f"some_output.grad_fn: {some_output.grad_fn}") # Has a grad_fn
    ```
    `detach()` is useful when you want to use a tensor's value without it affecting gradient computations, or if you want to prevent gradients from flowing back through a certain part of your network during a backward pass.
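As an illustration of the "modifying parameters manually" case, here is a minimal hand-written gradient descent loop. It is a sketch under simple assumptions (a single scalar parameter `w`, a quadratic loss with its minimum at 3, and a learning rate of 0.1), not a replacement for a real optimizer.

```python
import torch

w = torch.tensor(0.0, requires_grad=True)
lr = 0.1  # Learning rate (illustrative value)

for step in range(50):
    loss = (w - 3.0) ** 2  # Minimum at w = 3
    loss.backward()        # Populates w.grad

    # Update the parameter without recording the update in the graph.
    with torch.no_grad():
        w -= lr * w.grad

    # Reset the accumulated gradient before the next iteration.
    w.grad.zero_()

print(w.item())  # Close to 3.0
```

The `torch.no_grad()` block is what allows the in-place update on a leaf tensor that requires gradients; without it, PyTorch raises an error.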

**9. More on `.backward()` (Vector-Jacobian Product)**

If `.backward()` is called on a tensor that is **not a scalar** (i.e., it has more than one element), you must provide a `gradient` argument to `.backward()`. This `gradient` argument should be a tensor of the same shape as the tensor you're calling `.backward()` on. It represents the gradient of some final scalar loss with respect to the elements of this non-scalar tensor.

Essentially, `autograd` computes a **Vector-Jacobian Product (VJP)**. If $y = f(x)$ and $l$ is the final scalar loss, and $v = \frac{\partial l}{\partial y}$ is a vector, then `y.backward(v)` computes $v^T \cdot J$, where $J$ is the Jacobian matrix $\frac{\partial y}{\partial x}$. The result is accumulated in `x.grad`.

For neural network training, the loss function typically outputs a scalar, so you don't need to provide the `gradient` argument to `loss.backward()`. PyTorch implicitly uses a gradient of `torch.tensor(1.0)`.

```python
# Example of backward() on a non-scalar tensor
inp = torch.randn(3, requires_grad=True)
outp = inp * 2 # outp is not a scalar

# To compute gradients for inp, we need to provide 'gradients' to backward()
# This typically represents the gradient of a final scalar loss w.r.t. 'outp'
grad_of_loss_wrt_outp = torch.tensor([0.1, 1.0, 0.01])
outp.backward(gradient=grad_of_loss_wrt_outp)

print(f"\nInput tensor inp: {inp}")
print(f"Output tensor outp: {outp}")
print(f"Gradient of loss w.r.t. outp (provided): {grad_of_loss_wrt_outp}")
print(f"inp.grad (dL/dinp = dL/doutp * doutp/dinp = grad_of_loss_wrt_outp * 2):\n{inp.grad}")
# Expected inp.grad = [0.1*2, 1.0*2, 0.01*2] = [0.2, 2.0, 0.02]
```
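To make the vector-Jacobian product concrete, the sketch below compares `x.grad` after `y.backward(v)` with an explicit $v^T J$, where the full Jacobian comes from `torch.autograd.functional.jacobian`. The function `f` and the values of `x` and `v` are arbitrary choices for illustration.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # A small vector-valued function with a non-diagonal Jacobian.
    return torch.stack([x[0] * x[1], x[1] ** 2, x.sum()])

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 0.5, 2.0])  # Plays the role of dl/dy

# backward() with a gradient argument accumulates v^T J into x.grad.
y = f(x)
y.backward(gradient=v)

# Build the full Jacobian explicitly: J[i, j] = dy_i / dx_j.
J = jacobian(f, x.detach())
print(J)
# Rows at x = [1, 2, 3]: [2, 1, 0], [0, 4, 0], [1, 1, 1]

print(v @ J)   # tensor([4., 5., 2.])
print(x.grad)  # tensor([4., 5., 2.])
```

Computing the full Jacobian is only practical for small problems; `backward()` never materializes it, which is why the VJP formulation scales to large networks.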

---


**PyTorch Tutorial Part 3: Building Neural Networks with `torch.nn`**

So far, we've learned about Tensors (the data containers) and `autograd` (the engine for computing gradients). Now, we'll explore `torch.nn`, PyTorch's module specifically designed for building and training neural networks. It provides a rich collection of building blocks (layers, loss functions, etc.) and a powerful base class (`nn.Module`) for creating custom network architectures.

**What we'll cover in this part:**

1.  **Introduction to `torch.nn`:** Its purpose and structure.
2.  **`nn.Module`:** The cornerstone for all network models.
    * Defining custom models by subclassing `nn.Module`.
    * The `__init__` method: Where to define layers.
    * The `forward` method: Where to define the data flow.
3.  **Common Layers:** Focusing on fully connected layers and activations.
4.  **`nn.Sequential`:** A simpler way to build linear stacks of layers.
5.  **Inspecting Model Parameters.**
6.  **Performing a Forward Pass.**
7.  **Example: Building a Simple Multi-Layer Perceptron (MLP).**

Let's get started!

```python
import torch
import torch.nn as nn # Neural network module
import torch.nn.functional as F # Functional API for layers/activations

print(f"PyTorch Version: {torch.__version__}")
```

**1. Introduction to `torch.nn`**

The `torch.nn` namespace provides all the building blocks you need to create neural networks. These blocks are often referred to as "modules" or "layers." A key concept is that these modules are themselves subclasses of `nn.Module`, allowing them to be nested within each other to create complex architectures. Each `nn.Module` can track its own learnable parameters (weights and biases).

**2. `nn.Module`: The Base Class for Models**

All neural network models in PyTorch should be subclasses of `nn.Module`. This base class provides a lot of useful functionality:

* It can contain other `nn.Module` instances (layers or other custom modules).
* It registers its parameters (`nn.Parameter` tensors, which have `requires_grad=True` by default). You can access all parameters of a module (and its submodules) using `model.parameters()`.
* It provides helper methods like `model.to(device)` to move all parameters to a specific device (CPU/GPU), `model.train()` to set the model to training mode, and `model.eval()` to set it to evaluation mode (important for layers like Dropout and BatchNorm).
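
To make these points concrete, here is a minimal sketch that nests one module inside another. The class names `Block` and `TinyNet` and all layer sizes are made up for this illustration. It checks that submodule parameters are registered under dotted names and that `eval()`/`train()` toggle the `training` flag that layers like Dropout consult.

```python
import torch
import torch.nn as nn

class Block(nn.Module):  # hypothetical helper module
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x):
        return self.dropout(torch.relu(self.linear(x)))

class TinyNet(nn.Module):  # hypothetical container module
    def __init__(self):
        super().__init__()
        self.block = Block(4)        # a nested nn.Module
        self.head = nn.Linear(4, 2)

    def forward(self, x):
        return self.head(self.block(x))

net = TinyNet()

# Parameters of nested modules are registered under dotted names
param_names = [name for name, _ in net.named_parameters()]
print(param_names)

# Move every parameter to the chosen device in one call
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net.to(device)

# In eval mode Dropout is a no-op, so two passes over the same input agree
net.eval()
x = torch.randn(3, 4, device=device)
with torch.no_grad():
    eval_is_deterministic = torch.equal(net(x), net(x))
print(f"eval mode deterministic: {eval_is_deterministic}")

# train() switches the flag back on for the module and all submodules
net.train()
print(net.training, net.block.dropout.training)
```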

To create your own neural network model, you typically override two methods:

* **`__init__(self, ...)`:**
    * This is the constructor. You **must** call `super().__init__()` first.
    * This is where you define and instantiate the layers (which are also `nn.Module` instances) that your network will use. These layers become attributes of your model.
    ```python
    # class MyModel(nn.Module):
    #     def __init__(self):
    #         super().__init__()
    #         # Define layers here, e.g.:
    #         self.layer1 = nn.Linear(in_features, out_features)
    #         self.activation = nn.ReLU()
    ```

* **`forward(self, x, ...)`:**
    * This method defines how input data `x` flows through the network to produce an output.
    * You use the layers defined in `__init__` here.
    * The computation graph is built dynamically each time `forward` is called.
    ```python
    # class MyModel(nn.Module):
    #     # ... __init__ as above
    #     def forward(self, x):
    #         x = self.layer1(x)
    #         x = self.activation(x)
    #         return x
    ```
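
Putting the two fragments together gives a small runnable sketch. Passing `in_features` and `out_features` as constructor arguments, and the sizes 3 and 2, are assumptions made here so the example can run on its own.

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()  # must come before any layers are assigned
        self.layer1 = nn.Linear(in_features, out_features)
        self.activation = nn.ReLU()

    def forward(self, x):
        x = self.layer1(x)
        x = self.activation(x)
        return x

model = MyModel(in_features=3, out_features=2)  # illustrative sizes
x = torch.randn(5, 3)   # batch of 5 samples, 3 features each
y = model(x)            # calling the model runs forward()
print(y.shape)          # torch.Size([5, 2])
```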

**3. Common Layers**

`torch.nn` provides a wide variety of pre-built layers. For this part, we'll focus on the most basic ones for building a simple feed-forward network.

* **`nn.Linear(in_features, out_features, bias=True)`:**
    * Applies a linear transformation to the incoming data: $y = x W^T + b$.
    * `in_features`: Size of each input sample (number of features).
    * `out_features`: Size of each output sample.
    * `bias`: If `True` (default), the layer learns an additive bias.
    * The weights ($W$) and biases ($b$) are learnable parameters, automatically initialized and tracked by the layer.

* **Activation Functions (e.g., `nn.ReLU`, `nn.Sigmoid`, `nn.Tanh`):**
    * These introduce non-linearity into the model, allowing it to learn more complex patterns.
    * They are also `nn.Module` instances.
    * Examples:
        * `nn.ReLU()`: Rectified Linear Unit, $\max(0, x)$.
        * `nn.Sigmoid()`: Sigmoid function, $\frac{1}{1 + e^{-x}}$, squashes values between 0 and 1.
        * `nn.Tanh()`: Hyperbolic Tangent, squashes values between -1 and 1.
    * Many activation functions also have functional counterparts in `torch.nn.functional` (often imported as `F`). For example, `F.relu(x)`, `F.sigmoid(x)`. Using the functional versions can be slightly more concise if the activation doesn't have learnable parameters (which most common ones don't).

        ```python
        # Using nn.Module activation
        relu_layer = nn.ReLU()
        # output = relu_layer(input_tensor)

        # Using functional activation
        # output = F.relu(input_tensor)
        ```
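
As a quick sanity check of the formula $y = x W^T + b$, the sketch below recomputes the output of an `nn.Linear` layer by hand and compares the module and functional forms of two activations. The layer sizes and the random seed are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
linear = nn.Linear(in_features=3, out_features=2)
x = torch.randn(4, 3)  # 4 samples, 3 features

# weight is stored with shape (out_features, in_features), hence the transpose
print(linear.weight.shape, linear.bias.shape)
manual = x @ linear.weight.T + linear.bias
linear_matches = torch.allclose(linear(x), manual, atol=1e-6)

# Module and functional activations compute the same values
relu_matches = torch.equal(nn.ReLU()(x), F.relu(x))
sigmoid_matches = torch.allclose(nn.Sigmoid()(x), torch.sigmoid(x))

print(linear_matches, relu_matches, sigmoid_matches)
```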

**4. `nn.Sequential`: A Simpler Container**

For networks where data flows sequentially through a series of layers, `nn.Sequential` provides a convenient way to define them without explicitly writing the `forward` method.

```python
input_size = 10
hidden_size = 20
output_size = 5

# Define a simple sequential model
model_sequential = nn.Sequential(
    nn.Linear(input_size, hidden_size), # First layer
    nn.ReLU(),                          # Activation
    nn.Linear(hidden_size, output_size) # Second layer
)

print("--- Sequential Model ---")
print(model_sequential)

# You can also pass an OrderedDict to name the layers
from collections import OrderedDict
model_sequential_named = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(input_size, hidden_size)),
    ('relu1', nn.ReLU()),
    ('fc2', nn.Linear(hidden_size, output_size))
]))
print("\n--- Sequential Model with Named Layers ---")
print(model_sequential_named)
```
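
One more thing worth knowing: layers inside an `nn.Sequential` can be reached by position, and by name when an `OrderedDict` was used. The sketch below rebuilds the named model with the same sizes as above and also shows that calling the container is equivalent to applying its layers one after another.

```python
from collections import OrderedDict

import torch
import torch.nn as nn

model = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(10, 20)),
    ('relu1', nn.ReLU()),
    ('fc2', nn.Linear(20, 5)),
]))

# Access by index or by the name given in the OrderedDict
same_layer = model[0] is model.fc1
print(f"model[0] is model.fc1: {same_layer}")

# Calling the container applies each layer in order
x = torch.randn(4, 10)
manual = x
for layer in model:
    manual = layer(manual)
same_output = torch.equal(model(x), manual)
print(f"Same output as manual chaining: {same_output}")
```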

**5. Inspecting Model Parameters**

Once you define a model (either by subclassing `nn.Module` or using `nn.Sequential`), PyTorch automatically tracks its learnable parameters.

* `model.parameters()`: Returns an iterator over all parameters of the model (including those of submodules).
* `model.named_parameters()`: Returns an iterator over all parameters, yielding both the name of the parameter (e.g., `'fc1.weight'`) and the parameter tensor itself.

```python
print("\n--- Inspecting Parameters of model_sequential_named ---")
for name, param in model_sequential_named.named_parameters():
    if param.requires_grad:
        print(f"Layer: {name}, Size: {param.size()}, Requires Grad: {param.requires_grad}")
        # print(param.data) # To see the actual values (can be large)
```
Notice that each `nn.Linear` layer has a `weight` and a `bias` parameter.
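
A common follow-up is counting the learnable parameters. Here is a minimal sketch, using the same layer sizes as the sequential model above, that sums `numel()` over the parameters and compares the total with the count worked out by hand.

```python
import torch.nn as nn

input_size, hidden_size, output_size = 10, 20, 5
model = nn.Sequential(
    nn.Linear(input_size, hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, output_size),
)

# Each nn.Linear holds in*out weights plus out biases
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
expected = (input_size * hidden_size + hidden_size) + (hidden_size * output_size + output_size)
print(total_params, expected)  # 325 325
```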

**6. Performing a Forward Pass**

To get an output from your model, you simply call it like a function, passing the input data. This internally calls the `forward` method.

```python
# Create some dummy input data
batch_size = 4
dummy_input = torch.randn(batch_size, input_size) # (batch_size, number_of_features)
print(f"\n--- Forward Pass with model_sequential_named ---")
print(f"Dummy input shape: {dummy_input.shape}")

# Perform a forward pass
# If the model or input is on GPU, ensure they are both on the same device
output = model_sequential_named(dummy_input)

print(f"Output shape: {output.shape}") # Should be (batch_size, output_size)
print(f"Output tensor (first sample):\n{output[0]}")
print(f"Output requires_grad: {output.requires_grad}") # True, as it depends on learnable params
print(f"Output grad_fn: {output.grad_fn}") # Will show the last operation
```
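
Why call `model(x)` rather than `model.forward(x)`? Calling the module goes through `nn.Module.__call__`, which also runs any registered hooks. The sketch below illustrates the difference with a forward hook; the hook function `record_shape` and the tiny model are made up for this example. It also shows a pass under `torch.no_grad()`, where the output is not tracked by autograd.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 5))
calls = []

def record_shape(module, inputs, output):  # hypothetical hook for illustration
    calls.append(tuple(output.shape))

handle = model.register_forward_hook(record_shape)

x = torch.randn(4, 10)
model(x)          # goes through __call__, so the hook runs
model.forward(x)  # bypasses __call__, so the hook does not run
handle.remove()
print(calls)      # [(4, 5)]

# For pure inference, no_grad() skips building the computation graph
with torch.no_grad():
    out_inference = model(x)
print(out_inference.requires_grad)  # False
```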

**7. Example: Building a Simple Multi-Layer Perceptron (MLP)**

Let's build a slightly more structured MLP by subclassing `nn.Module`.
Suppose we want a network for a binary classification task with:
* Input features: 784 (e.g., a flattened 28x28 image)
* Hidden layer 1: 128 neurons, ReLU activation
* Hidden layer 2: 64 neurons, ReLU activation
* Output layer: 1 neuron (for binary classification, typically followed by a sigmoid if not included in loss)

```python
class SimpleMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, output_dim):
        super().__init__() # Call the parent class constructor
        
        # Define layers
        self.fc1 = nn.Linear(input_dim, hidden_dim1)
        self.relu1 = nn.ReLU() # Or use F.relu in forward
        
        self.fc2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.relu2 = nn.ReLU()
        
        self.fc3 = nn.Linear(hidden_dim2, output_dim)
        # For binary classification, a sigmoid is often applied to the output.
        # This can be done here, or often it's combined with the loss function
        # (e.g., nn.BCEWithLogitsLoss expects raw logits).
        # For simplicity, we'll output raw logits here.
        # self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Define the data flow
        x = self.fc1(x)
        x = self.relu1(x) # Or x = F.relu(x)
        
        x = self.fc2(x)
        x = self.relu2(x)
        
        x = self.fc3(x)
        # if you want sigmoid output directly from model:
        # x = self.sigmoid(x) 
        return x

# --- Instantiate and Test SimpleMLP ---
print("\n--- Custom MLP Model (SimpleMLP) ---")
# Define dimensions
input_features = 784
h1_features = 128
h2_features = 64
output_features = 1 # For binary classification (raw logit)

# Create an instance of the model
mlp_model = SimpleMLP(input_features, h1_features, h2_features, output_features)
print(mlp_model)

# Print its parameters
print("\nParameters of mlp_model:")
for name, param in mlp_model.named_parameters():
    print(f"Name: {name}, Size: {param.size()}, Requires Grad: {param.requires_grad}")

# Create dummy input for the MLP
mlp_batch_size = 10
mlp_dummy_input = torch.randn(mlp_batch_size, input_features)

# Forward pass
mlp_output = mlp_model(mlp_dummy_input)
print(f"\nMLP Dummy Input shape: {mlp_dummy_input.shape}")
print(f"MLP Output shape: {mlp_output.shape}") # (mlp_batch_size, output_features)
print(f"MLP Output (first sample): {mlp_output[0].item()}") # .item() if it's a single value tensor
print(f"MLP Output requires_grad: {mlp_output.requires_grad}")
```
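
The input above is already flat, but real image batches arrive as `(batch, channels, height, width)`. The sketch below uses a compact `nn.Sequential` stand-in with the same layer sizes as `SimpleMLP` (an assumption made to keep the example self-contained), flattens a batch of fake 28x28 images with `nn.Flatten`, and turns the raw logits into probabilities and hard 0/1 predictions. The 0.5 threshold is a conventional choice, not something fixed by PyTorch.

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(          # stand-in with SimpleMLP's layer sizes
    nn.Flatten(),             # (N, 1, 28, 28) -> (N, 784)
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1),         # raw logit per sample
)

images = torch.rand(10, 1, 28, 28)   # fake image batch
logits = mlp(images)
probs = torch.sigmoid(logits)        # probabilities in [0, 1]
predicted = (probs >= 0.5).long()    # hard class labels

print(logits.shape, probs.min().item(), probs.max().item())
```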

**Key Takeaways from `nn.Module` Usage:**

* **Modularity:** You can define complex models by combining simpler modules.
* **Parameter Tracking:** `nn.Module` automatically handles the registration and tracking of all learnable parameters within it and its submodules.
* **Clear Separation:** `__init__` is for defining *what* layers exist, and `forward` is for defining *how* data flows through them.

---

**PyTorch Tutorial Part 4: Loss Functions and Optimizers**

In this part, we'll cover two critical components for training any neural network:

1.  **Loss Functions:** These functions quantify how "wrong" our model's predictions are compared to the actual target values. The goal of training is to minimize this loss.
2.  **Optimizers:** These algorithms use the gradients (computed by `autograd` based on the loss) to update the model's parameters (weights and biases) in a way that (hopefully) reduces the loss.

**What we'll cover in this part:**

1.  **Understanding Loss Functions (`torch.nn`)**
    * Purpose and common types.
    * Examples: `nn.MSELoss`, `nn.CrossEntropyLoss`, `nn.BCELoss`, `nn.BCEWithLogitsLoss`.
2.  **Understanding Optimizers (`torch.optim`)**
    * Purpose and common types.
    * Key methods: `zero_grad()`, `step()`.
    * Examples: `optim.SGD`, `optim.Adam`.
3.  **How They Fit Together (Conceptual).**

Let's begin!

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F # For functional versions if needed

print(f"PyTorch Version: {torch.__version__}")

# Let's use the SimpleMLP from Part 3 as a reference for some examples
class SimpleMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim1)
        self.fc2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.fc3 = nn.Linear(hidden_dim2, output_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x) # Output raw logits
        return x
```

**1. Understanding Loss Functions (`torch.nn`)**

A loss function (or criterion) takes the model's output and the true target values as input and computes a scalar value representing the "loss" or "error." PyTorch provides many common loss functions in the `torch.nn` module. These are typically subclasses of `nn.Module`.

* **`nn.MSELoss` (Mean Squared Error Loss):**
    * Commonly used for **regression tasks** where the goal is to predict continuous values.
    * Calculates the mean of the squared differences between each element in the input $x$ (predictions) and target $y$.
    * Loss = $\frac{1}{N} \sum (x_i - y_i)^2$

    ```python
    print("--- nn.MSELoss Example ---")
    mse_loss_fn = nn.MSELoss()

    # Dummy predictions and targets for a regression problem
    predictions_reg = torch.randn(5, 1) # 5 samples, 1 predicted value each
    targets_reg = torch.randn(5, 1)
    print("Predictions (regression):\n", predictions_reg)
    print("Targets (regression):\n", targets_reg)

    loss_mse = mse_loss_fn(predictions_reg, targets_reg)
    print(f"MSE Loss: {loss_mse.item()}") # .item() to get the scalar value
    ```

* **`nn.CrossEntropyLoss`:**
    * Very commonly used for **multi-class classification tasks**.
    * This criterion combines `nn.LogSoftmax` and `nn.NLLLoss` in one single class.
    * **Input (Predictions):** Raw, unnormalized scores (logits) for each class. Shape: `(batch_size, num_classes)`.
    * **Target:** Class indices (long integers) for each sample. Shape: `(batch_size,)`, where each value is between `0` and `num_classes-1`.

    ```python
    print("\n--- nn.CrossEntropyLoss Example ---")
    ce_loss_fn = nn.CrossEntropyLoss()

    # Dummy predictions (logits) and targets for a 3-class classification problem
    # Batch size of 4 samples
    predictions_multi_class = torch.randn(4, 3) # 4 samples, 3 classes (raw logits)
    targets_multi_class = torch.tensor([0, 2, 1, 0]) # True class indices for each sample

    print("Predictions (multi-class logits):\n", predictions_multi_class)
    print("Targets (multi-class indices):\n", targets_multi_class)

    loss_ce = ce_loss_fn(predictions_multi_class, targets_multi_class)
    print(f"CrossEntropy Loss: {loss_ce.item()}")
    ```

* **`nn.BCELoss` (Binary Cross Entropy Loss):**
    * Used for **binary classification tasks** (two classes, e.g., 0 or 1).
    * **Input (Predictions):** Probabilities for the positive class, typically after a Sigmoid activation. Values should be between 0 and 1. Shape: `(batch_size, 1)` or `(batch_size,)`.
    * **Target:** Probabilities or binary labels (0 or 1). Should have the same shape as input.

    ```python
    print("\n--- nn.BCELoss Example ---")
    bce_loss_fn = nn.BCELoss()
    sigmoid = nn.Sigmoid() # To get probabilities

    # Dummy predictions (logits) and targets for a binary classification problem
    predictions_binary_logits = torch.randn(4, 1) # Raw logits
    predictions_binary_probs = sigmoid(predictions_binary_logits) # Probabilities (0 to 1)
    targets_binary = torch.tensor([[0.], [1.], [1.], [0.]]) # True binary labels

    print("Predictions (binary probabilities after sigmoid):\n", predictions_binary_probs)
    print("Targets (binary):\n", targets_binary)

    loss_bce = bce_loss_fn(predictions_binary_probs, targets_binary)
    print(f"BCE Loss: {loss_bce.item()}")
    ```

* **`nn.BCEWithLogitsLoss`:**
    * Also used for **binary classification tasks**.
    * This loss combines a Sigmoid layer and the BCELoss in one single class. It is numerically more stable than using a plain Sigmoid followed by a BCELoss.
    * **Input (Predictions):** Raw, unnormalized scores (logits). Shape: `(batch_size, 1)` or `(batch_size,)`.
    * **Target:** Probabilities or binary labels (0 or 1). Should have the same shape as input.

    ```python
    print("\n--- nn.BCEWithLogitsLoss Example ---")
    bce_with_logits_loss_fn = nn.BCEWithLogitsLoss()

    # Use the same raw logits and targets as before
    # predictions_binary_logits = torch.randn(4, 1) # from above
    # targets_binary = torch.tensor([[0.], [1.], [1.], [0.]]) # from above

    print("Predictions (binary raw logits):\n", predictions_binary_logits)
    print("Targets (binary):\n", targets_binary)

    loss_bce_logits = bce_with_logits_loss_fn(predictions_binary_logits, targets_binary)
    print(f"BCEWithLogits Loss: {loss_bce_logits.item()}")
    # This value should be very similar to the loss_bce calculated above
    ```
    **Recommendation:** Prefer `BCEWithLogitsLoss` over `Sigmoid + BCELoss` for binary classification due to better numerical stability.
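
To check these descriptions against the actual behavior, the sketch below recomputes each loss by hand. The tensor sizes, the seed, and the extreme logit value of 200 are arbitrary choices for illustration. The final lines show why the logits version is preferred: with a very large logit, the sigmoid saturates to exactly 1.0 in float32 and the plain `BCELoss` can no longer report the true loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# MSE: mean of squared differences
pred, target = torch.randn(5, 1), torch.randn(5, 1)
mse_ok = torch.allclose(nn.MSELoss()(pred, target), ((pred - target) ** 2).mean())

# Cross entropy: log-softmax, then the negative log-probability of the true class
logits = torch.randn(4, 3)
classes = torch.tensor([0, 2, 1, 0])
log_probs = F.log_softmax(logits, dim=1)
ce_manual = -log_probs[torch.arange(4), classes].mean()
ce_ok = torch.allclose(nn.CrossEntropyLoss()(logits, classes), ce_manual)

# BCEWithLogitsLoss agrees with Sigmoid + BCELoss for moderate logits
bin_logits = torch.randn(4, 1)
bin_targets = torch.tensor([[0.], [1.], [1.], [0.]])
bce = nn.BCELoss()(torch.sigmoid(bin_logits), bin_targets)
bce_logits = nn.BCEWithLogitsLoss()(bin_logits, bin_targets)
bce_ok = torch.allclose(bce, bce_logits, atol=1e-6)

# Extreme logit: the true loss is about 200, but sigmoid(200) rounds to 1.0
big_logit = torch.tensor([[200.0]])
zero_target = torch.tensor([[0.0]])
stable = nn.BCEWithLogitsLoss()(big_logit, zero_target)
saturated = nn.BCELoss()(torch.sigmoid(big_logit), zero_target)

print(mse_ok, ce_ok, bce_ok)
print(f"with logits: {stable.item():.1f}, sigmoid + BCELoss: {saturated.item():.1f}")
```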

**2. Understanding Optimizers (`torch.optim`)**

Optimizers are algorithms that adjust the model's parameters (weights and biases) to minimize the loss function. They use the gradients computed by `autograd`.

* **Initialization:** You create an optimizer instance by passing it the model's parameters (which you want to optimize) and a learning rate (`lr`).
    * `model.parameters()`: This method (from `nn.Module`) returns an iterator over all learnable parameters of the model.

* **Key Optimizer Methods in a Training Loop:**
    1.  **`optimizer.zero_grad()`:**
        * This method sets the gradients of all model parameters being optimized to zero.
        * It's crucial to call this *before* calling `loss.backward()` in each iteration. Why? Because, as discussed in Part 2, PyTorch accumulates gradients by default. If you don't zero them out, gradients from previous batches/iterations will interfere with the current update.
    2.  **`loss.backward()`:** (Covered in Part 2)
        * This computes the gradients of the loss with respect to the model parameters (and any other tensors with `requires_grad=True` that contributed to the loss).
    3.  **`optimizer.step()`:**
        * This method updates the values of the model parameters using the computed gradients and the specific optimization algorithm (e.g., SGD, Adam). This is where the actual "learning" happens.

* **Common Optimizers:**

    * **`optim.SGD` (Stochastic Gradient Descent):**
        * A fundamental optimization algorithm.
        * Can include `momentum` (helps accelerate SGD in the relevant direction and dampens oscillations) and `weight_decay` (L2 penalty to prevent overfitting).
        ```python
        # model = SimpleMLP(...) # Assume model is defined
        # learning_rate_sgd = 0.01
        # optimizer_sgd = optim.SGD(model.parameters(), lr=learning_rate_sgd, momentum=0.9)
        ```

    * **`optim.Adam` (Adaptive Moment Estimation):**
        * A very popular and often effective adaptive learning rate optimization algorithm. It computes adaptive learning rates for each parameter.
        * Often works well with default hyperparameters.
        * `weight_decay` can also be used.
        ```python
        # model = SimpleMLP(...)
        # learning_rate_adam = 0.001
        # optimizer_adam = optim.Adam(model.parameters(), lr=learning_rate_adam)
        ```
    * **`optim.AdamW` (Adam with Decoupled Weight Decay):**
        * A modification of Adam that often improves generalization by changing how weight decay is applied. Generally preferred over Adam when using weight decay.
        ```python
        # model = SimpleMLP(...)
        # optimizer_adamw = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
        ```

* **Learning Rate (`lr`):**
    * This is a crucial hyperparameter. It determines the step size taken at each iteration while moving toward a minimum of the loss function.
    * Too small: Training will be very slow.
    * Too large: Training might oscillate or diverge, never finding the minimum.
    * Finding a good learning rate often requires experimentation. Learning rate schedules (dynamically changing the learning rate during training) are also common.
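
Two of the claims above are easy to verify on a tiny model: gradients accumulate across `backward()` calls unless they are zeroed, and for plain SGD (no momentum, no weight decay) `optimizer.step()` amounts to `param = param - lr * grad`. The sketch below checks both; the model size, data, and learning rate are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(3, 1)
lr = 0.1
optimizer = optim.SGD(model.parameters(), lr=lr)  # plain SGD
loss_fn = nn.MSELoss()
x, y = torch.randn(8, 3), torch.randn(8, 1)

# Without zero_grad(), a second backward() adds to the stored gradient
loss_fn(model(x), y).backward()
grad_once = model.weight.grad.clone()
loss_fn(model(x), y).backward()
grad_twice = model.weight.grad.clone()
accumulated = torch.allclose(grad_twice, 2 * grad_once)
print(f"Gradients accumulated: {accumulated}")

# A clean step: zero, backward, step, then compare with the manual update
optimizer.zero_grad()
loss_fn(model(x), y).backward()
weight_before = model.weight.detach().clone()
grad = model.weight.grad.clone()
optimizer.step()
step_matches = torch.allclose(model.weight.detach(), weight_before - lr * grad)
print(f"step() matches w - lr * grad: {step_matches}")
```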

**3. How They Fit Together (Conceptual Example)**

Let's imagine a single training step with our `SimpleMLP`:

```python
print("\n--- Conceptual Training Step ---")
# 1. Define Model, Loss, Optimizer
input_dim_ex = 10
hidden1_ex = 20
hidden2_ex = 15
output_dim_ex = 1 # For binary classification (logits)

model = SimpleMLP(input_dim_ex, hidden1_ex, hidden2_ex, output_dim_ex)
criterion = nn.BCEWithLogitsLoss() # Use this for raw logit output
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 2. Dummy Input and Target
dummy_inputs = torch.randn(8, input_dim_ex) # Batch of 8 samples
dummy_targets = torch.randint(0, 2, (8, 1)).float() # Binary targets (0 or 1)

print(f"Model: {model.__class__.__name__}")
print(f"Criterion: {criterion.__class__.__name__}")
print(f"Optimizer: {optimizer.__class__.__name__}")
print(f"Dummy Inputs shape: {dummy_inputs.shape}")
print(f"Dummy Targets shape: {dummy_targets.shape}")


# --- A Single Training Step ---
# a. Zero gradients
optimizer.zero_grad()

# b. Forward pass: Get model predictions
outputs = model(dummy_inputs)
print(f"Model outputs shape: {outputs.shape}")

# c. Calculate loss
loss = criterion(outputs, dummy_targets)
print(f"Calculated Loss: {loss.item()}")

# d. Backward pass: Compute gradients of the loss w.r.t. model parameters
loss.backward()
# Now model parameters (e.g., model.fc1.weight.grad) will have gradients

# e. Optimizer step: Update model parameters based on gradients
optimizer.step()

# After optimizer.step(), the model's weights have been updated!
# We could check, for example, model.fc1.weight.data to see it changed slightly.
# For a real effect, this loop would repeat many times.

print("Conceptual training step completed (parameters would have been updated).")
```
This small snippet shows the core interaction: `optimizer.zero_grad()`, forward pass, loss calculation, `loss.backward()`, and `optimizer.step()`. This cycle is the heart of most training loops in PyTorch.
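
To see the cycle actually reduce the loss, it has to be repeated. Here is a minimal sketch that runs the same five steps 200 times on one fixed batch. It uses a small `nn.Sequential` stand-in for `SimpleMLP` and a learning rate of 0.01; both are assumptions chosen so the effect shows up quickly, and fitting a single batch like this only demonstrates the mechanics, not real training.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))  # stand-in model
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

inputs = torch.randn(8, 10)
targets = torch.randint(0, 2, (8, 1)).float()

weight_before = model[0].weight.detach().clone()
losses = []
for step in range(200):
    optimizer.zero_grad()                     # a. clear old gradients
    loss = criterion(model(inputs), targets)  # b + c. forward pass and loss
    loss.backward()                           # d. compute gradients
    optimizer.step()                          # e. update parameters
    losses.append(loss.item())

weights_changed = not torch.equal(weight_before, model[0].weight.detach())
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
print(f"weights changed: {weights_changed}")
```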

---

**PyTorch Tutorial Part 5: Data Handling with `Dataset` and `DataLoader`**

Training deep learning models often involves large datasets. Loading, preprocessing, and batching this data efficiently is crucial for fast and effective training. PyTorch provides two core utilities in `torch.utils.data` to help with this:

1.  **`Dataset`:** An abstract class representing your dataset. It provides a way to access individual data samples and their corresponding labels.
2.  **`DataLoader`:** Wraps a `Dataset` and provides an iterable over it, enabling easy batching, shuffling, and parallel data loading.

**What we'll cover in this part:**

1.  **The `torch.utils.data.Dataset` Class:**
    * Understanding its role.
    * How to create a custom `Dataset` by implementing `__len__` and `__getitem__`.
2.  **The `torch.utils.data.DataLoader` Class:**
    * Its purpose: batching, shuffling, parallel loading.
    * Key parameters: `batch_size`, `shuffle`, `num_workers`.
3.  **Example: Creating and Using a Custom `Dataset` with `DataLoader`.**
4.  **Using Pre-built Datasets (e.g., from `torchvision.datasets`) and Transforms.**

Let's get started!

```python
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd # For a slightly more realistic custom dataset example

# For torchvision example
import torchvision
import torchvision.transforms as transforms

print(f"PyTorch Version: {torch.__version__}")
```

**1. The `torch.utils.data.Dataset` Class**

A `Dataset` object in PyTorch is anything that has a `__getitem__` method and a `__len__` method.
* `__len__(self)` should return the total number of samples in your dataset.
* `__getitem__(self, index)` should return the data sample (e.g., features and label) at the given `index`.

You typically create your own custom dataset class by inheriting from `torch.utils.data.Dataset` and implementing these two methods.

* **`__init__(self, ...)`:** In the constructor, you'll usually load your data from files, or if it's small, store it directly in memory (e.g., as lists or tensors). You might also perform initial one-time preprocessing here.

**Example: A Simple Custom Dataset**

Let's create a very basic dataset with some dummy features and labels.

```python
class MySimpleCustomDataset(Dataset):
    def __init__(self, num_samples=100, num_features=5):
        """
        Args:
            num_samples (int): Number of samples in the dataset.
            num_features (int): Number of features per sample.
        """
        print(f"Initializing MySimpleCustomDataset with {num_samples} samples.")
        # Generate some random features and labels
        self.features = torch.randn(num_samples, num_features)
        self.labels = torch.randint(0, 2, (num_samples,)) # Binary labels (0 or 1)
        print("Data generation complete.")

    def __len__(self):
        # Returns the total number of samples in the dataset
        return len(self.features)

    def __getitem__(self, idx):
        # Returns the sample (features and label) at the given index
        # If idx arrives as a tensor (e.g., from a custom sampler), convert it
        # to a plain Python int or list before indexing.
        if torch.is_tensor(idx):
            idx = idx.tolist()
            
        sample_features = self.features[idx]
        sample_label = self.labels[idx]
        
        # You can return them as a tuple, dictionary, or any structure you prefer
        return sample_features, sample_label

print("--- MySimpleCustomDataset ---")
simple_dataset = MySimpleCustomDataset(num_samples=10, num_features=3)

# Test __len__
print(f"Length of simple_dataset: {len(simple_dataset)}")

# Test __getitem__
first_sample_features, first_sample_label = simple_dataset[0]
print(f"First sample features: {first_sample_features}")
print(f"First sample label: {first_sample_label}")

fifth_sample_features, fifth_sample_label = simple_dataset[4]
print(f"\nFifth sample features: {fifth_sample_features}")
print(f"Fifth sample label: {fifth_sample_label}")
```
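
The comment in `__getitem__` mentions that a sample can also be a dictionary. As a rough sketch of what that looks like, the hypothetical `DictDataset` below returns a dict per sample; when it is wrapped in a `DataLoader`, the default collation stacks the values key by key, so each batch is a dict of tensors.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class DictDataset(Dataset):  # hypothetical variant of the dataset above
    def __init__(self, num_samples=10, num_features=3):
        self.features = torch.randn(num_samples, num_features)
        self.labels = torch.randint(0, 2, (num_samples,))

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return {"features": self.features[idx], "label": self.labels[idx]}

dataset = DictDataset()
sample = dataset[0]
print(sample["features"].shape, sample["label"])

# Default collation batches each dictionary entry separately
loader = DataLoader(dataset, batch_size=4)
batch = next(iter(loader))
print(batch["features"].shape, batch["label"].shape)
```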

**Example: Custom Dataset from a Pandas DataFrame (More Realistic)**

Often, your data might be in a CSV file, which you can load into a Pandas DataFrame.

```python
class PandasDataset(Dataset):
    def __init__(self, dataframe, feature_columns, target_column):
        """
        Args:
            dataframe (pd.DataFrame): The pandas DataFrame containing the data.
            feature_columns (list of str): List of column names for features.
            target_column (str): Column name for the target label.
        """
        print("Initializing PandasDataset...")
        self.features = torch.tensor(dataframe[feature_columns].values, dtype=torch.float32)
        self.labels = torch.tensor(dataframe[target_column].values, dtype=torch.long) # Assuming classification
        print("Data converted to tensors.")

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

print("\n--- PandasDataset ---")
# Create a dummy Pandas DataFrame
data = {
    'feature1': np.random.rand(20),
    'feature2': np.random.rand(20) * 10,
    'feature3': np.random.rand(20) - 0.5,
    'target': np.random.randint(0, 3, 20) # 3 classes (0, 1, 2)
}
df = pd.DataFrame(data)
print("Sample DataFrame head:\n", df.head())

feature_cols = ['feature1', 'feature2', 'feature3']
target_col = 'target'

pandas_dataset = PandasDataset(df, feature_columns=feature_cols, target_column=target_col)
print(f"Length of pandas_dataset: {len(pandas_dataset)}")
features_pd_0, label_pd_0 = pandas_dataset[0]
print(f"First sample from pandas_dataset - Features: {features_pd_0}, Label: {label_pd_0}")
```

**2. The `torch.utils.data.DataLoader` Class**

Once you have a `Dataset`, you'll usually wrap it with a `DataLoader`. The `DataLoader` takes care of:

* **Batching:** Grouping multiple samples into a "batch" for processing. Training on batches is more computationally efficient and can lead to more stable gradient estimates.
* **Shuffling:** Randomly shuffling the data at the beginning of each epoch. This is crucial for preventing the model from learning the order of the data and helps in generalization.
* **Parallel Data Loading (`num_workers`):** Using multiple subprocesses to load data in the background while the GPU is busy with model computations. This can significantly speed up training if data loading/preprocessing is a bottleneck.
* **Custom Collation (`collate_fn`):** (Advanced) Allows you to define how individual samples (returned by `Dataset.__getitem__`) are combined into a batch. Useful for tasks like padding sequences of varying lengths.
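
As an illustration of the last point, here is a minimal sketch of a `collate_fn` that pads variable-length sequences. The function name `pad_collate` and the toy sequences are made up for this example; a plain Python list serves as the dataset because it already supports `__len__` and `__getitem__`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Four 1-D sequences of different lengths
sequences = [torch.ones(n) for n in (2, 5, 3, 4)]

def pad_collate(batch):  # hypothetical collate function
    # batch is a list of the samples returned by __getitem__
    lengths = torch.tensor([len(seq) for seq in batch])
    padded = pad_sequence(batch, batch_first=True, padding_value=0.0)
    return padded, lengths

loader = DataLoader(sequences, batch_size=2, collate_fn=pad_collate)
batches = list(loader)
for padded, lengths in batches:
    print(padded.shape, lengths.tolist())
```

Each batch is padded only to the length of its own longest sequence, and the original lengths are returned alongside so later code can ignore the padding.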

**Key `DataLoader` Parameters:**

* `dataset`: The `Dataset` object to load data from.
* `batch_size (int, optional)`: How many samples per batch to load (default: `1`).
* `shuffle (bool, optional)`: Set to `True` to have the data reshuffled at every epoch (default: `False`).
* `num_workers (int, optional)`: How many subprocesses to use for data loading. `0` means that the data will be loaded in the main process (default: `0`).
    * **Note on `num_workers` on Windows:** If using `num_workers > 0` on Windows (or sometimes macOS with certain Python versions), you often need to wrap your main training script logic in an `if __name__ == '__main__':` block to avoid issues with multiprocessing.
* `pin_memory (bool, optional)`: If `True`, the `DataLoader` will copy Tensors into CUDA pinned memory before returning them. This can speed up data transfer to the GPU (default: `False`).
* `drop_last (bool, optional)`: Set to `True` to drop the last incomplete batch, if the dataset size is not divisible by the batch size (default: `False`).
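
The effect of `batch_size` and `drop_last` is easy to see by counting batches. The sketch below uses 60 random samples wrapped in the built-in `TensorDataset` (used here only to keep the example self-contained) with a batch size of 16, so the last batch holds the 12 leftover samples unless it is dropped.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(60, 4), torch.randint(0, 2, (60,)))

keep_last = DataLoader(dataset, batch_size=16, drop_last=False)
drop_last = DataLoader(dataset, batch_size=16, drop_last=True)

sizes_keep = [len(labels) for _, labels in keep_last]
sizes_drop = [len(labels) for _, labels in drop_last]

print(len(keep_last), sizes_keep)  # 4 [16, 16, 16, 12]
print(len(drop_last), sizes_drop)  # 3 [16, 16, 16]
```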

**3. Example: Using `DataLoader` with a Custom `Dataset`**

Let's use our `MySimpleCustomDataset` with a `DataLoader`.

```python
print("\n--- DataLoader with MySimpleCustomDataset ---")
simple_dataset_for_loader = MySimpleCustomDataset(num_samples=60, num_features=4)

# Create a DataLoader instance
batch_size = 16
# On Windows, if num_workers > 0, you might need if __name__ == '__main__':
# For this interactive example, num_workers=0 is safest for broad compatibility.
data_loader = DataLoader(dataset=simple_dataset_for_loader,
                         batch_size=batch_size,
                         shuffle=True,       # Shuffle data at each epoch
                         num_workers=0)      # Use 0 for this example

# Iterate through the DataLoader (e.g., in a training epoch)
print(f"\nIterating through DataLoader (batch_size={batch_size}):")
# In a real training loop, this would be inside an epoch loop
for epoch in range(1): # Simulate one epoch
    print(f"\nEpoch {epoch+1}")
    for i, (batch_features, batch_labels) in enumerate(data_loader):
        print(f"  Batch {i+1}:")
        print(f"    Batch features shape: {batch_features.shape}") # (batch_size, num_features)
        print(f"    Batch labels shape: {batch_labels.shape}")     # (batch_size,)
        # In a real loop, you would:
        # 1. Move data to device (e.g., batch_features.to(device))
        # 2. Perform forward pass
        # 3. Calculate loss
        # 4. Backward pass
        # 5. Optimizer step
        if i == 2: # Print first 3 batches for brevity
            break 
```
You'll notice that if `shuffle=True`, the order of samples will be different each time you iterate through the `data_loader` in a new "epoch."
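
If you need the shuffled order to be reproducible, one option is to pass a seeded `torch.Generator` to the `DataLoader`. The sketch below uses a toy dataset in which every sample is simply its own index, so the batch directly reveals the order. Each epoch is a permutation of the same samples, and rebuilding the loader with the same seed reproduces the first epoch. The seed value 0 is arbitrary.

```python
import torch
from torch.utils.data import DataLoader

indices = list(range(10))  # toy dataset: sample i is the integer i

def make_loader(seed):
    gen = torch.Generator().manual_seed(seed)
    return DataLoader(indices, batch_size=10, shuffle=True, generator=gen)

loader = make_loader(0)
epoch1 = next(iter(loader)).tolist()  # order in the first epoch
epoch2 = next(iter(loader)).tolist()  # reshuffled for the second epoch
print(epoch1)
print(epoch2)

# Same seed, fresh loader: the first epoch comes out in the same order
epoch1_again = next(iter(make_loader(0))).tolist()
print(epoch1_again == epoch1)
```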

**4. Using Pre-built Datasets (e.g., from `torchvision.datasets`) and Transforms**

PyTorch's domain-specific libraries like `torchvision` and `torchaudio` provide many pre-built datasets (`torchtext` does too, but it is no longer actively developed). `torchvision` is particularly popular for computer vision.

These datasets often integrate with `torchvision.transforms`, which are common image transformations (e.g., converting to tensor, normalizing, resizing, cropping, data augmentation).

**Example: MNIST Dataset from `torchvision`**

```python
print("\n--- torchvision.datasets.MNIST Example ---")

# Define transformations
# 1. Convert PIL Image to PyTorch Tensor
# 2. Normalize the tensor (mean and std dev for MNIST are approx 0.1307 and 0.3081)
mnist_transforms = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)) # (mean,), (std,) for grayscale
])

# Download and load the training data (if not already downloaded)
# Set download=True for the first time
try:
    train_mnist_dataset = torchvision.datasets.MNIST(
        root='./data_mnist',  # Directory to store the data
        train=True,          # Get the training set
        transform=mnist_transforms, # Apply defined transformations
        download=True        # Download if not present
    )
    print("MNIST training dataset loaded/downloaded successfully.")

    # Create a DataLoader for the MNIST training set
    mnist_train_loader = DataLoader(
        dataset=train_mnist_dataset,
        batch_size=64,
        shuffle=True,
        num_workers=0 # For simplicity in this example
    )

    # Get one batch of images and labels
    print("\nIterating through MNIST DataLoader (one batch):")
    for images, labels in mnist_train_loader:
        print(f"  Images batch shape: {images.shape}") # (batch_size, channels, height, width) -> (64, 1, 28, 28)
        print(f"  Labels batch shape: {labels.shape}")   # (batch_size,) -> (64,)
        print(f"  Sample labels from batch: {labels[:5]}")
        break # Just show one batch

except Exception as e:
    print(f"Could not load/download MNIST. Error: {e}")
    print("Please ensure you have an internet connection if downloading for the first time,")
    print("or check write permissions for the './data_mnist' directory.")

```
`transforms.Compose` chains multiple transformations together. `ToTensor()` converts a PIL Image or NumPy `ndarray` in the range [0, 255] to a `torch.FloatTensor` of shape (C x H x W) in the range [0.0, 1.0]. `Normalize()` normalizes the tensor with a given mean and standard deviation.
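
To make the arithmetic concrete, here is a minimal sketch that reproduces by hand what `ToTensor()` followed by `Normalize((0.1307,), (0.3081,))` does to a grayscale image, using only `torch` and NumPy. The 28x28 array `fake_image` is made-up data, and this hand-written version illustrates the math rather than torchvision's actual implementation.

```python
import numpy as np
import torch

# Hypothetical 28x28 grayscale image with uint8 pixel values in [0, 255]
fake_image = np.zeros((28, 28), dtype=np.uint8)
fake_image[10:18, 10:18] = 255  # a white square on a black background

# ToTensor(), by hand: scale to [0.0, 1.0] and add a leading channel dimension
scaled = torch.from_numpy(fake_image).float().div(255.0).unsqueeze(0)  # (1, 28, 28)

# Normalize(mean, std), by hand: (x - mean) / std for the single channel
mean, std = 0.1307, 0.3081
normalized = (scaled - mean) / std

print(scaled.shape, scaled.min().item(), scaled.max().item())
print(normalized.min().item(), normalized.max().item())
```

With these constants, black pixels end up at roughly -0.42 and white pixels at roughly 2.82, so the normalized values are no longer confined to [0, 1].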

---

This part has equipped you with the knowledge to handle data efficiently in PyTorch using `Dataset` and `DataLoader`. You can create custom datasets for your specific data and use DataLoaders to prepare batches for your model, along with leveraging pre-built datasets and transforms from libraries like `torchvision`.

---

**PyTorch Tutorial Part 6: A Complete Training Loop**

In this part, we will write the full Python script to:
1.  Load a dataset (we'll use MNIST).
2.  Define a simple neural network.
3.  Define a loss function and an optimizer.
4.  Train the network for a few epochs, printing progress.
5.  Evaluate the network's performance on a test set.

**What we'll cover in this part:**

1.  **Setting up the Environment and Data.**
2.  **Defining the Neural Network Model.**
3.  **Defining Loss Function and Optimizer.**
4.  **The Training Loop:**
    * Iterating through epochs and batches.
    * Moving data to the device (CPU/GPU).
    * Zeroing gradients.
    * Forward pass.
    * Loss calculation.
    * Backward pass.
    * Optimizer step.
5.  **The Evaluation Loop:**
    * Using `model.eval()` and `torch.no_grad()`.
    * Calculating accuracy.
6.  **Running the Complete Script.**

Let's put it all together!

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
import torchvision
import torchvision.transforms as transforms # For image transformations

print(f"PyTorch Version: {torch.__version__}")
print(f"Torchvision Version: {torchvision.__version__}")

# --- 1. Configuration and Device Setup ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Hyperparameters
input_size = 28 * 28  # MNIST images are 28x28 pixels
hidden_size1 = 128
hidden_size2 = 64
num_classes = 10      # Digits 0-9
learning_rate = 0.001
batch_size = 64
num_epochs = 5        # Train for a few epochs for demonstration

# --- 2. Load MNIST Dataset ---
# Transformations to apply to the data
# ToTensor() converts a PIL Image or numpy.ndarray (H x W x C) in the range [0, 255]
# to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0].
# Normalize() normalizes a tensor image with mean and standard deviation.
# For MNIST, the mean and std are approximately 0.1307 and 0.3081 respectively.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Download and load training dataset
train_dataset = torchvision.datasets.MNIST(root='./data_mnist',
                                           train=True,
                                           transform=transform,
                                           download=True)

# Download and load test dataset
test_dataset = torchvision.datasets.MNIST(root='./data_mnist',
                                          train=False,
                                          transform=transform,
                                          download=True)

# Create DataLoaders
# num_workers=0 means data loading will happen in the main process.
# For faster loading, you can increase num_workers, but be mindful of
# the `if __name__ == '__main__':` guard on Windows.
train_loader = DataLoader(dataset=train_dataset,
                          batch_size=batch_size,
                          shuffle=True, # Shuffle training data
                          num_workers=0)

test_loader = DataLoader(dataset=test_dataset,
                         batch_size=batch_size,
                         shuffle=False, # No need to shuffle test data
                         num_workers=0)

print(f"Number of training samples: {len(train_dataset)}")
print(f"Number of test samples: {len(test_dataset)}")
print(f"Number of training batches: {len(train_loader)}")
print(f"Number of test batches: {len(test_loader)}")


# --- 3. Define the Neural Network Model ---
class SimpleMLP(nn.Module):
    def __init__(self, input_s, hidden1_s, hidden2_s, output_s):
        super(SimpleMLP, self).__init__()
        self.fc1 = nn.Linear(input_s, hidden1_s)
        self.fc2 = nn.Linear(hidden1_s, hidden2_s)
        self.fc3 = nn.Linear(hidden2_s, output_s)
        # No Softmax here because nn.CrossEntropyLoss will apply it internally

    def forward(self, x):
        # Flatten the image (28x28) into a 1D vector (784)
        x = x.view(-1, 28*28) # -1 infers the batch size
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x) # Output raw logits
        return x

model = SimpleMLP(input_size, hidden_size1, hidden_size2, num_classes).to(device)
print("\nModel Architecture:")
print(model)

# --- 4. Define Loss Function and Optimizer ---
criterion = nn.CrossEntropyLoss() # Combines LogSoftmax and NLLLoss
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# --- 5. Training Loop ---
print("\n--- Starting Training ---")
for epoch in range(num_epochs):
    model.train() # Set the model to training mode (enables dropout, batchnorm updates etc.)
    running_loss = 0.0
    correct_train = 0
    total_train = 0

    for i, (images, labels) in enumerate(train_loader):
        # Move tensors to the configured device
        images = images.to(device)
        labels = labels.to(device)

        # 1. Zero the parameter gradients
        optimizer.zero_grad()

        # 2. Forward pass
        outputs = model(images) # flattening to (batch_size, 784) happens inside the model's forward method

        # 3. Calculate loss
        loss = criterion(outputs, labels)

        # 4. Backward pass (compute gradients)
        loss.backward()

        # 5. Optimizer step (update parameters)
        optimizer.step()

        running_loss += loss.item() * images.size(0) # loss.item() is avg loss for batch

        # Calculate training accuracy for this batch
        _, predicted_train = torch.max(outputs, 1)  # index of the largest logit per sample
        total_train += labels.size(0)
        correct_train += (predicted_train == labels).sum().item()

        if (i + 1) % 100 == 0: # Print every 100 mini-batches
            avg_batch_loss = loss.item() # Average loss for current batch
            current_batch_accuracy = 100 * (predicted_train == labels).sum().item() / labels.size(0)
            print(f"Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{len(train_loader)}], Batch Loss: {avg_batch_loss:.4f}, Batch Accuracy: {current_batch_accuracy:.2f}%")

    epoch_loss = running_loss / len(train_dataset)
    epoch_accuracy_train = 100 * correct_train / total_train
    print(f"--- Epoch {epoch+1} Finished ---")
    print(f"Training Loss: {epoch_loss:.4f}, Training Accuracy: {epoch_accuracy_train:.2f}%")

    # --- 6. Evaluation Loop (on Test Set after each epoch) ---
    model.eval() # Set the model to evaluation mode (disables dropout, uses learned batchnorm stats)
    correct_test = 0
    total_test = 0
    test_loss = 0.0
    with torch.no_grad(): # Disable gradient calculations for evaluation
        for images_test, labels_test in test_loader:
            images_test = images_test.to(device)
            labels_test = labels_test.to(device)

            outputs_test = model(images_test)
            loss_test_batch = criterion(outputs_test, labels_test)
            test_loss += loss_test_batch.item() * images_test.size(0)

            _, predicted_test = torch.max(outputs_test, 1)  # index of the largest logit per sample
            total_test += labels_test.size(0)
            correct_test += (predicted_test == labels_test).sum().item()

    avg_test_loss = test_loss / len(test_dataset)
    accuracy_test = 100 * correct_test / total_test
    print(f"Test Loss: {avg_test_loss:.4f}, Test Accuracy: {accuracy_test:.2f}%")
    print("-" * 30)

print("--- Training Finished ---")

# --- 7. (Optional) Save the trained model ---
# torch.save(model.state_dict(), 'mnist_mlp_model.pth')
# print("Trained model state_dict saved to mnist_mlp_model.pth")

# To load:
# loaded_model = SimpleMLP(input_size, hidden_size1, hidden_size2, num_classes).to(device)
# loaded_model.load_state_dict(torch.load('mnist_mlp_model.pth', map_location=device))
# loaded_model.eval() # Don't forget for inference
```

**Explanation of the Code:**

1.  **Configuration and Device:** Sets up hyperparameters and determines if a GPU is available.
2.  **Load MNIST Dataset:**
    * Uses `torchvision.datasets.MNIST` to download/load the dataset.
    * `transforms.Compose` is used to chain `transforms.ToTensor()` (converts image to tensor and scales to [0,1]) and `transforms.Normalize()` (normalizes tensor values).
    * `DataLoader` instances are created for both training and test sets. `shuffle=True` is important for the training loader.
3.  **Define Model (`SimpleMLP`):**
    * A simple Multi-Layer Perceptron with two hidden layers and ReLU activations.
    * The `forward` method includes `x.view(-1, 28*28)` to flatten the 2D image (1x28x28) into a 1D vector (784 features) before passing it to the linear layers. Note that `images` from `DataLoader` will have shape `(batch_size, 1, 28, 28)`.
4.  **Loss and Optimizer:**
    * `nn.CrossEntropyLoss` is chosen because MNIST is a multi-class classification problem (digits 0-9). This loss function expects raw logits as model output (which our `SimpleMLP` provides).
    * `optim.Adam` is used as the optimizer.
5.  **Training Loop (`for epoch in range(num_epochs):`)**
    * `model.train()`: Sets the model to training mode. This is important because some layers like Dropout and BatchNorm behave differently during training and evaluation.
    * The inner loop iterates through batches provided by `train_loader`.
    * **Device Transfer:** `images.to(device)` and `labels.to(device)` move the data for the current batch to the GPU if available.
    * **`optimizer.zero_grad()`:** Clears previously accumulated gradients. PyTorch adds new gradients to the existing ones on every `backward()` call, so this must be done before the backward pass of each batch.
    * **Forward Pass:** `outputs = model(images)` gets the predictions from the model.
    * **Loss Calculation:** `loss = criterion(outputs, labels)` computes the loss.
    * **Backward Pass:** `loss.backward()` computes the gradients of the loss with respect to all model parameters that have `requires_grad=True`.
    * **Optimizer Step:** `optimizer.step()` updates the model parameters based on the computed gradients.
    * **Logging:** Prints loss and accuracy periodically.
6.  **Evaluation Loop (`model.eval()`, `with torch.no_grad():`)**
    * `model.eval()`: Sets the model to evaluation mode. This turns off dropout and makes batch normalization use its learned running statistics instead of batch statistics.
    * `with torch.no_grad():`: Disables gradient computation within this block. This is crucial for evaluation as it reduces memory consumption and speeds up computations since we don't need gradients here.
    * The loop iterates through the `test_loader`.
    * It calculates the average loss and accuracy on the test set.
7.  **Saving the Model (Optional):**
    * `torch.save(model.state_dict(), 'filename.pth')` is the recommended way to save model parameters.
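
Two of the claims in the explanation above can be checked in isolation: that `nn.CrossEntropyLoss` on raw logits matches `log_softmax` followed by `nll_loss`, and that gradients accumulate across `backward()` calls until they are zeroed. The sketch below uses made-up tensors and a small `nn.Linear` layer rather than the MNIST model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# --- Check 1: CrossEntropyLoss on raw logits vs. log_softmax + nll_loss ---
logits = torch.randn(4, 10)            # made-up batch: 4 samples, 10 classes
targets = torch.tensor([3, 0, 9, 1])   # made-up class indices
ce = nn.CrossEntropyLoss()(logits, targets)
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print("losses match:", torch.allclose(ce, manual))

# --- Check 2: gradients accumulate until zero_grad() is called ---
layer = nn.Linear(3, 1)
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)
x = torch.ones(2, 3)

layer(x).sum().backward()
grad_once = layer.weight.grad.clone()

layer(x).sum().backward()              # no zero_grad() in between
grad_twice = layer.weight.grad.clone()
print("gradient doubled:", torch.allclose(grad_twice, 2 * grad_once))

optimizer.zero_grad()                  # clears the accumulated gradients
```

Skipping `zero_grad()` in the training loop would therefore make each update use the sum of all gradients seen so far instead of the current batch's gradient.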

**To Run This Code:**
1.  Make sure you have PyTorch and Torchvision installed (`pip install torch torchvision`).
2.  Save the code as a Python file (e.g., `pytorch_training.py`).
3.  Run it from your terminal: `python pytorch_training.py`.

You should see the training progress printed, with loss decreasing and accuracy (hopefully) increasing over the epochs for both training and test sets.

---

**PyTorch Tutorial Part 7: Saving and Loading Models in Detail**

Once you've spent time training a model, you'll want to save it for several reasons:
* **Persistence:** To use it later for inference without retraining.
* **Sharing:** To share your trained model with others.
* **Resuming Training:** To continue training from where you left off, especially for long training jobs.
* **Deployment:** To deploy the model into a production environment.
* **Fine-tuning:** To use it as a base for transfer learning on a new task.

PyTorch provides flexible ways to save and load models. We'll explore the most common and recommended approaches.

**What we'll cover in this part:**

1.  **What to Save?**
2.  **Saving and Loading `state_dict` (Recommended for Inference & Sharing)**
3.  **Saving and Loading Entire Models (Less Flexible)**
4.  **Saving and Loading Checkpoints (For Resuming Training)**
5.  **Handling Devices (CPU/GPU) During Save/Load.**
6.  **Best Practices.**

Let's use a slightly simplified version of our `SimpleMLP` from Part 6 for these examples.

```python
import torch
import torch.nn as nn
import torch.optim as optim
import os # For file operations

print(f"PyTorch Version: {torch.__version__}")

# Define a device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Define our SimpleMLP model (from Part 6, slightly simplified for clarity here)
class SimpleMLP(nn.Module):
    def __init__(self, input_s, hidden1_s, hidden2_s, output_s):
        super(SimpleMLP, self).__init__()
        self.fc1 = nn.Linear(input_s, hidden1_s)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(hidden1_s, hidden2_s)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(hidden2_s, output_s)

    def forward(self, x):
        x = x.view(-1, 28*28) # Flatten (N, 1, 28, 28) images into (N, 784)
        x = self.relu1(self.fc1(x))
        x = self.relu2(self.fc2(x))
        x = self.fc3(x)
        return x

# Model dimensions (example for MNIST)
input_size = 28 * 28
hidden_size1 = 128
hidden_size2 = 64
num_classes = 10

# Create a dummy model instance for demonstration
model_to_save = SimpleMLP(input_size, hidden_size1, hidden_size2, num_classes).to(device)
# In a real scenario, this model would be trained.
# For demonstration, we'll use its initial random weights.
print("Model instantiated for saving/loading examples.")
```

**1. What to Save?**

PyTorch allows you to save different aspects of your model:

* **`model.state_dict()`:** This is a Python dictionary object that maps each parameter name (for example `fc1.weight`) to its tensor. It contains the learnable parameters (weights and biases) and any registered buffers such as BatchNorm running statistics, but not the model architecture itself. This is the **recommended approach** for saving models for inference, sharing, or transfer learning because it's more flexible and portable.
* **Entire Model:** You can save the entire model object using `torch.save(model, PATH)`. This uses Python's `pickle` module to serialize the model object, including its architecture. While convenient, it can be brittle if the code defining the model changes or if you try to load it in a different project.
* **Checkpoint:** A dictionary containing various pieces of information needed to resume training, such as the model's `state_dict`, the optimizer's `state_dict`, the current epoch, the latest loss, etc.
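
To see what a `state_dict` actually holds, the following sketch prints the entries of a small toy model. The model is hypothetical and includes a `BatchNorm1d` layer on purpose, to show that registered buffers (running statistics) are stored alongside the learnable weights and biases.

```python
import torch.nn as nn

# Hypothetical toy model; the BatchNorm layer contributes buffers to the state_dict
toy = nn.Sequential(
    nn.Linear(4, 3),
    nn.BatchNorm1d(3),
    nn.Linear(3, 2),
)

# Each entry maps a dotted name to a tensor
for name, tensor in toy.state_dict().items():
    print(f"{name:25s} {tuple(tensor.shape)}")

# Names of learnable parameters vs. non-learnable buffers
param_names = {name for name, _ in toy.named_parameters()}
buffer_names = {name for name, _ in toy.named_buffers()}
print("buffers:", sorted(buffer_names))
```

Because only names and tensors are stored, the code that defines the architecture has to be available separately when the `state_dict` is loaded.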

**2. Saving and Loading `state_dict` (Recommended)**

This method saves only the model's tensors: its learnable parameters (weights and biases) and any registered buffers.

* **Saving `state_dict`:**
    ```python
    print("\n--- Saving and Loading state_dict ---")
    STATE_DICT_PATH = "simple_mlp_statedict.pth"

    # 1. Get the state_dict
    model_state_dict = model_to_save.state_dict()
    # print("Model state_dict keys:", model_state_dict.keys()) # See what's inside

    # 2. Save it
    torch.save(model_state_dict, STATE_DICT_PATH)
    print(f"Model state_dict saved to: {STATE_DICT_PATH}")
    ```

* **Loading `state_dict`:**
    To load the `state_dict`, you first need to create an instance of your model class (so PyTorch knows the architecture).
    ```python
    # 1. Instantiate the model architecture
    # This MUST be the same architecture as the one whose state_dict was saved
    loaded_model_from_statedict = SimpleMLP(input_size, hidden_size1, hidden_size2, num_classes).to(device)
    print(f"\nNew model instance created for loading state_dict.")

    # 2. Load the state_dict
    # Ensure the model is on the same device as the saved parameters, or use map_location (see later)
    state_dict_loaded = torch.load(STATE_DICT_PATH, map_location=device)
    loaded_model_from_statedict.load_state_dict(state_dict_loaded)
    print(f"Model state_dict loaded from: {STATE_DICT_PATH}")

    # 3. Set the model to evaluation mode (IMPORTANT for inference)
    # This turns off layers like Dropout and sets BatchNorm to use running statistics.
    loaded_model_from_statedict.eval()
    print("Loaded model set to evaluation mode.")

    # Now you can use loaded_model_from_statedict for inference
    # Example:
    dummy_input = torch.randn(1, 1, 28, 28).to(device) # Batch of 1, 1 channel, 28x28
    with torch.no_grad(): # Disable gradient calculation for inference
        output = loaded_model_from_statedict(dummy_input)
    print(f"Output from loaded model (state_dict): {output.argmax(dim=1).item()}")
    ```
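
The requirement that the architecture must match can be demonstrated directly. In this sketch, `TwoLayer` is a hypothetical model invented for the illustration, and an in-memory `io.BytesIO` buffer stands in for a `.pth` file. Loading into an identical architecture reproduces the outputs, while loading into a model with a different hidden size raises a `RuntimeError`.

```python
import io
import torch
import torch.nn as nn

# Hypothetical two-layer model used only for this illustration
class TwoLayer(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.fc1 = nn.Linear(8, hidden)
        self.fc2 = nn.Linear(hidden, 2)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

source = TwoLayer(hidden=16)
buffer = io.BytesIO()               # in-memory stand-in for a .pth file
torch.save(source.state_dict(), buffer)

# Same architecture: every key and shape matches, so loading succeeds
buffer.seek(0)
same = TwoLayer(hidden=16)
same.load_state_dict(torch.load(buffer, weights_only=True))
x = torch.randn(3, 8)
print("outputs identical:", torch.equal(source(x), same(x)))

# Different hidden size: tensor shapes disagree, so load_state_dict raises
buffer.seek(0)
different = TwoLayer(hidden=32)
mismatch_error = None
try:
    different.load_state_dict(torch.load(buffer, weights_only=True))
except RuntimeError as err:
    mismatch_error = err
    print("Loading failed with:", type(err).__name__)
```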

**3. Saving and Loading Entire Models**

This method saves the entire Python object using `pickle`.

* **Saving Entire Model:**
    ```python
    print("\n--- Saving and Loading Entire Model ---")
    ENTIRE_MODEL_PATH = "simple_mlp_entire_model.pth"

    torch.save(model_to_save, ENTIRE_MODEL_PATH)
    print(f"Entire model saved to: {ENTIRE_MODEL_PATH}")
    ```

* **Loading Entire Model:**
    ```python
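    # No need to instantiate the model class first, but the SimpleMLP class
    # definition must still be importable when the file is unpickled.
    # PyTorch 2.6+ defaults to weights_only=True, which rejects pickled model objects,
    # so pass weights_only=False here. Only do this for files you trust.
    loaded_entire_model = torch.load(ENTIRE_MODEL_PATH, map_location=device, weights_only=False)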
    print(f"\nEntire model loaded from: {ENTIRE_MODEL_PATH}")

    # Remember to set to evaluation mode
    loaded_entire_model.eval()
    print("Loaded entire model set to evaluation mode.")

    # Example inference
    with torch.no_grad():
        output_entire = loaded_entire_model(dummy_input)
    print(f"Output from loaded entire model: {output_entire.argmax(dim=1).item()}")
    ```
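    **Caution:** While simpler, this method is less portable. The pickle file stores a reference to the `SimpleMLP` class rather than its code, so loading breaks if the class has been renamed, moved, or changed incompatibly, or if it is not importable in the loading environment. In addition, `torch.load` defaults to `weights_only=True` from PyTorch 2.6 onward, so loading a whole pickled model requires passing `weights_only=False`, which should only be done for files from a trusted source. The `state_dict` approach is generally more robust.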
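
As a minimal sketch of a whole-model round trip, the example below pickles an `nn.Sequential` built only from `torch.nn` classes (so unpickling does not depend on any user-defined class) into an in-memory buffer and restores it. It assumes a PyTorch version in which `torch.load` accepts the `weights_only` argument.

```python
import io
import torch
import torch.nn as nn

# Built from torch.nn classes only, so unpickling needs no user-defined code
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

buffer = io.BytesIO()     # in-memory stand-in for a file on disk
torch.save(net, buffer)   # pickles the whole module object
buffer.seek(0)

# Full unpickling can execute arbitrary code, so only do this for trusted files
restored = torch.load(buffer, weights_only=False)
restored.eval()

x = torch.randn(5, 4)
with torch.no_grad():
    outputs_match = torch.equal(net(x), restored(x))
print("outputs identical:", outputs_match)
```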

**4. Saving and Loading Checkpoints (For Resuming Training)**

When training large models, you often want to save checkpoints periodically. A checkpoint typically includes more than just the model's parameters; it might include the optimizer's state, the current epoch, the latest loss, etc. This allows you to resume training exactly where you left off if it gets interrupted.

```python
print("\n--- Saving and Loading Checkpoints ---")
CHECKPOINT_PATH = "simple_mlp_checkpoint.pth"

# Assume we are in a training loop
current_epoch = 3 # Example epoch
current_loss = 0.567 # Example loss
optimizer_for_checkpoint = optim.Adam(model_to_save.parameters(), lr=0.001)
# Simulate a few optimizer steps to have a state
for _ in range(5):
    optimizer_for_checkpoint.zero_grad()
    # dummy loss and backward
    d_loss = model_to_save(torch.randn(1,1,28,28).to(device)).sum()
    d_loss.backward()
    optimizer_for_checkpoint.step()


# Saving a checkpoint
print(f"Saving checkpoint at epoch {current_epoch}...")
checkpoint = {
    'epoch': current_epoch,
    'model_state_dict': model_to_save.state_dict(),
    'optimizer_state_dict': optimizer_for_checkpoint.state_dict(),
    'loss': current_loss,
    # You can add any other information you need
    'hyperparameters': {'input_size': input_size, 'hidden1': hidden_size1, 'hidden2': hidden_size2, 'classes': num_classes}
}
torch.save(checkpoint, CHECKPOINT_PATH)
print(f"Checkpoint saved to: {CHECKPOINT_PATH}")


# Loading a checkpoint to resume training
print("\nLoading checkpoint to resume training...")
# 1. Instantiate model and optimizer (architecture must match)
model_to_resume = SimpleMLP(input_size, hidden_size1, hidden_size2, num_classes).to(device)
optimizer_to_resume = optim.Adam(model_to_resume.parameters(), lr=0.0005) # This lr is replaced by the saved value (0.001) when the optimizer state_dict is loaded below

# 2. Load the checkpoint dictionary
loaded_checkpoint = torch.load(CHECKPOINT_PATH, map_location=device)

# 3. Load states into model and optimizer
model_to_resume.load_state_dict(loaded_checkpoint['model_state_dict'])
optimizer_to_resume.load_state_dict(loaded_checkpoint['optimizer_state_dict'])

# 4. Load other information
start_epoch = loaded_checkpoint['epoch'] + 1 # Start from the next epoch
last_loss = loaded_checkpoint['loss']
hyperparams = loaded_checkpoint['hyperparameters']

print(f"Resuming training from epoch {start_epoch}")
print(f"Last recorded loss: {last_loss}")
print(f"Loaded hyperparameters: {hyperparams}")

# 5. Set model to training mode if you are resuming training
model_to_resume.train()
print("Model set to train mode for resuming.")

# Now you can continue your training loop with model_to_resume, optimizer_to_resume, from start_epoch
# for epoch in range(start_epoch, num_total_epochs):
#     # ... your training code ...
```
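
One detail of resuming is easy to miss: `optimizer.load_state_dict()` restores the saved hyperparameters along with the per-parameter state, so the learning rate passed to the new optimizer's constructor is replaced. The sketch below shows this with two small, made-up `nn.Linear` models, and then shows one way to set a different learning rate after loading.

```python
import torch
import torch.nn as nn
import torch.optim as optim

model_a = nn.Linear(4, 2)
opt_a = optim.Adam(model_a.parameters(), lr=0.001)

# One real update so Adam has per-parameter state (step count, moment estimates)
model_a(torch.randn(3, 4)).sum().backward()
opt_a.step()
saved = opt_a.state_dict()

# A fresh optimizer created with a different learning rate
model_b = nn.Linear(4, 2)
opt_b = optim.Adam(model_b.parameters(), lr=0.0005)
lr_before_load = opt_b.param_groups[0]["lr"]

opt_b.load_state_dict(saved)
lr_after_load = opt_b.param_groups[0]["lr"]
print("lr before load:", lr_before_load)
print("lr after load: ", lr_after_load)

# To resume with a different learning rate, set it after loading
for group in opt_b.param_groups:
    group["lr"] = 0.0005
```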

**5. Handling Devices (CPU/GPU) During Save/Load**

It's common to train a model on a GPU and then deploy it or load it for inference on a CPU (or a different GPU setup).

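* **Saving:** When you save `model.state_dict()`, each tensor is stored together with the device it lived on. By default, `torch.load` tries to restore every tensor to that same device, which fails on a machine without it.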
* **Loading with `map_location`:** The `torch.load()` function has a `map_location` argument that is crucial for device handling:
    * `torch.load(PATH, map_location=torch.device('cpu'))`: Loads all tensors onto the CPU, regardless of where they were when saved.
    * `torch.load(PATH, map_location=torch.device('cuda:0'))`: Loads all tensors onto GPU 0.
    * `torch.load(PATH, map_location=device)`: Loads onto the `device` you've defined (e.g., chosen GPU or CPU).

```python
print("\n--- Handling Devices During Save/Load ---")
# Assume model_to_save was trained and is currently on 'device' (e.g., GPU if available)
# model_to_save is already on 'device' from its instantiation

STATE_DICT_GPU_PATH = "model_gpu_statedict.pth"
torch.save(model_to_save.state_dict(), STATE_DICT_GPU_PATH)
print(f"Model (potentially on GPU) state_dict saved to {STATE_DICT_GPU_PATH}")

# Example 1: Load a GPU-trained model onto CPU
print("\nLoading model onto CPU explicitly:")
model_on_cpu = SimpleMLP(input_size, hidden_size1, hidden_size2, num_classes) # Instantiate on CPU by default
# Load state_dict and map it to CPU
cpu_state_dict = torch.load(STATE_DICT_GPU_PATH, map_location=torch.device('cpu'))
model_on_cpu.load_state_dict(cpu_state_dict)
model_on_cpu.eval()
print(f"Model loaded on CPU. First parameter device: {next(model_on_cpu.parameters()).device}")

# Example 2: If you have a GPU and want to ensure loading onto the current 'device'
if torch.cuda.is_available():
    print("\nLoading model onto current 'device' (which could be GPU):")
    model_on_current_device = SimpleMLP(input_size, hidden_size1, hidden_size2, num_classes)
    # The model is still on the CPU here; load_state_dict copies values into the existing parameters, whatever device the loaded tensors are on
    state_dict_for_current_device = torch.load(STATE_DICT_GPU_PATH, map_location=device)
    model_on_current_device.load_state_dict(state_dict_for_current_device)
    model_on_current_device.to(device) # Move the model's parameters and buffers to the target device
    model_on_current_device.eval()
    print(f"Model loaded on {device}. First parameter device: {next(model_on_current_device.parameters()).device}")
```
**General workflow for loading a model that might have been saved from a GPU:**
1.  Load the `state_dict` using `map_location` to specify the target device.
2.  Instantiate your model architecture.
3.  Load the `state_dict` into your model instance.
4.  If your model instance was not already on the target device, move it using `model.to(target_device)`.
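
The four steps above can be wrapped in a small helper. `load_for_inference` and its parameters are hypothetical names chosen for this sketch, not part of PyTorch, and an in-memory buffer again stands in for a file saved on another machine.

```python
import io
import torch
import torch.nn as nn

def load_for_inference(source, build_model, target_device):
    """Hypothetical helper following the four steps listed above."""
    # Step 1: load the state_dict, remapping tensors to the target device
    state_dict = torch.load(source, map_location=target_device, weights_only=True)
    # Step 2: instantiate the architecture
    model = build_model()
    # Step 3: copy the loaded tensors into the model
    model.load_state_dict(state_dict)
    # Step 4: make sure the model itself lives on the target device
    model.to(target_device)
    model.eval()
    return model

# Usage with an in-memory buffer in place of a real file
original = nn.Linear(6, 3)
buffer = io.BytesIO()
torch.save(original.state_dict(), buffer)
buffer.seek(0)

cpu = torch.device("cpu")
restored = load_for_inference(buffer, lambda: nn.Linear(6, 3), cpu)
print(next(restored.parameters()).device)
```

Passing a factory function such as `build_model` keeps the helper independent of any particular model class.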

**6. Best Practices**

* **Prefer `state_dict`:** For sharing models, inference, and transfer learning, saving and loading the `state_dict` is more robust and flexible.
* **Define Model Architecture Separately:** Always have your model class definition available when loading a `state_dict`.
* **`model.eval()`:** Always call `model.eval()` after loading a trained model and before performing inference. This ensures layers like dropout and batch normalization are in evaluation mode.
* **`model.train()`:** If you load a checkpoint to resume training, remember to call `model.train()` to set these layers back to training mode.
* **Use `map_location`:** Be mindful of devices. Use `map_location` in `torch.load` when loading models that might have been trained on a different device setup.
* **Clean Up:** Delete saved files if no longer needed using `os.remove(PATH)`.

```python
# --- Clean up dummy files ---
if os.path.exists(STATE_DICT_PATH): os.remove(STATE_DICT_PATH)
if os.path.exists(ENTIRE_MODEL_PATH): os.remove(ENTIRE_MODEL_PATH)
if os.path.exists(CHECKPOINT_PATH): os.remove(CHECKPOINT_PATH)
if os.path.exists(STATE_DICT_GPU_PATH): os.remove(STATE_DICT_GPU_PATH)
print("\nDummy saved files cleaned up.")
```
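
The advice about `model.eval()` and `model.train()` matters most for layers that behave differently in the two modes. The `SimpleMLP` above has no such layers, so this sketch uses a standalone `nn.Dropout` on a made-up tensor of ones to show the difference.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 1000)

dropout.train()
train_out = dropout(x)   # some entries zeroed, survivors scaled by 1 / (1 - p)

dropout.eval()
eval_out = dropout(x)    # dropout is the identity in evaluation mode

print("zeros in train mode:", (train_out == 0).sum().item())
print("eval output unchanged:", torch.equal(eval_out, x))
```

Forgetting `model.eval()` before inference on a model with dropout would make predictions vary randomly from call to call.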

---