# Topic 1: PyTorch Fundamentals - Tensors & Operations

## Learning Objectives

By the end of this notebook, you will:
- Understand what tensors are and why they're fundamental to deep learning
- Create and manipulate tensors using PyTorch
- Perform mathematical operations on tensors
- Understand tensor shapes, broadcasting, and reshaping
- Know when to use PyTorch vs NumPy
- Leverage GPU acceleration for computations

---

## 1. The Big Picture: What Are Tensors?

### Why Do We Need Tensors?

Imagine you're building a model that recognizes images of cats and dogs. Each image is made up of thousands of pixels, and each pixel has RGB color values. How do you represent this data in code?

- A **scalar** is a single number: `5` (0-dimensional)
- A **vector** is a 1D array: `[1, 2, 3]` (1-dimensional)
- A **matrix** is a 2D array: `[[1, 2], [3, 4]]` (2-dimensional)
- A **tensor** is an n-dimensional array: can be 3D, 4D, 5D, etc.

For our cat/dog image:
- **3D tensor**: (channels, height, width) → a single image (PyTorch's convention; many image libraries store (height, width, channels) instead)
- **4D tensor**: (batch_size, channels, height, width) → a batch of images

**Tensors are the universal data structure for deep learning** because images, text, audio, and tabular data can all be encoded as arrays of numbers, and those arrays can be processed efficiently on GPUs.

### Why PyTorch Tensors vs NumPy Arrays?

NumPy is great, but PyTorch tensors offer:
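1. **GPU acceleration**: often 10-100x faster for large, parallel workloads such as matrix multiplication (the exact gain depends on hardware and problem size)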
2. **Automatic differentiation**: Calculate gradients automatically (essential for training)
3. **Deep learning ecosystem**: Built-in neural network layers, optimizers, etc.

Think of PyTorch as "NumPy with superpowers for deep learning."

In [None]:
# Setup
import torch
import numpy as np
import matplotlib.pyplot as plt

# Check PyTorch version
print(f"PyTorch version: {torch.__version__}")

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

---

## 2. Creating Tensors

Let's explore different ways to create tensors. Each method has its use case.

In [None]:
# Method 1: From Python lists
# Use case: Small, manually specified data
scalar = torch.tensor(42)
vector = torch.tensor([1, 2, 3, 4])
matrix = torch.tensor([[1, 2, 3],
                       [4, 5, 6]])

print("Scalar (0D):", scalar)
print("Shape:", scalar.shape)  # torch.Size([]) - no dimensions
print()

print("Vector (1D):", vector)
print("Shape:", vector.shape)  # torch.Size([4])
print()

print("Matrix (2D):", matrix)
print("Shape:", matrix.shape)  # torch.Size([2, 3]) - 2 rows, 3 columns

In [None]:
# Method 2: Special initialization functions
# Use case: Initialize model parameters, create placeholders

# All zeros (useful for initializing biases)
zeros = torch.zeros(2, 3)
print("Zeros:\n", zeros)
print()

# All ones
ones = torch.ones(2, 3)
print("Ones:\n", ones)
print()

# Random values from uniform distribution [0, 1)
# Use case: Weight initialization (with proper scaling)
rand_uniform = torch.rand(2, 3)
print("Random uniform [0, 1):\n", rand_uniform)
print()

# Random values from standard normal distribution (mean=0, std=1)
# Use case: Weight initialization (the normal variants of Xavier and He initialization scale samples like these)
rand_normal = torch.randn(2, 3)
print("Random normal:\n", rand_normal)
print()

# Identity matrix (diagonal = 1, rest = 0)
# Use case: Linear algebra, identity-style initializations
identity = torch.eye(3)
print("Identity matrix:\n", identity)

In [None]:
# Method 3: From NumPy arrays (common when working with data)
numpy_array = np.array([[1, 2, 3], [4, 5, 6]])
tensor_from_numpy = torch.from_numpy(numpy_array)
print("From NumPy:\n", tensor_from_numpy)
print("Type:", tensor_from_numpy.dtype)  # Inherits NumPy's dtype (int64 on most platforms)

# Converting back to NumPy (shares memory - be careful!)
back_to_numpy = tensor_from_numpy.numpy()
print("\nBack to NumPy:\n", back_to_numpy)

# They share memory! Changing one affects the other
tensor_from_numpy[0, 0] = 999
print("\nAfter modifying tensor:")
print("Tensor:", tensor_from_numpy)
print("NumPy:", back_to_numpy)  # Also changed!
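
If you need an independent copy instead of shared memory, `torch.tensor(...)` always copies its input, while `torch.from_numpy` shares it. The sketch below contrasts the two; the variable names are illustrative.

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])

shared = torch.from_numpy(arr)  # shares memory with arr
copied = torch.tensor(arr)      # makes an independent copy

arr[0] = 100.0                  # modify the NumPy array in place

print(shared)  # reflects the change (first element is 100)
print(copied)  # unaffected (first element is still 1)
```

Both tensors keep NumPy's `float64` dtype; call `.float()` if you want PyTorch's usual `float32`.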

### Why Does Shape Matter?

**Shape is the most important concept in tensor programming.** A large share of bugs in deep learning code come from shape mismatches.

Think of shape as the "type system" of tensors:
- You can't add a `(2, 3)` tensor to a `(3, 2)` tensor directly
- Neural network layers expect specific input shapes
- Understanding shapes helps you design architectures

In [None]:
# Understanding tensor properties
x = torch.randn(2, 3, 4)  # 3D tensor

print(f"Tensor:\n{x}")
print(f"\nShape: {x.shape}")        # torch.Size([2, 3, 4])
print(f"Size: {x.size()}")          # Same as shape
print(f"Number of dimensions: {x.ndim}")  # 3
print(f"Total number of elements: {x.numel()}")  # 2 * 3 * 4 = 24
print(f"Data type: {x.dtype}")     # torch.float32 (default for randn)
print(f"Device: {x.device}")       # cpu or cuda

---

## 3. Data Types (dtype)

### Why Do Data Types Matter?

- **Memory**: `float32` uses 4 bytes, `float16` uses 2 bytes
- **Speed**: Many GPUs, especially those with Tensor Cores, process `float16` much faster than `float32`
- **Precision**: Training typically uses `float32` (or mixed precision that combines both); inference can often use `float16`
- **Compatibility**: Some operations require specific types

In [None]:
# Common data types
float32 = torch.tensor([1.0, 2.0], dtype=torch.float32)  # Default, most common
float16 = torch.tensor([1.0, 2.0], dtype=torch.float16)  # Half precision (half the memory; faster on supported GPUs)
int64 = torch.tensor([1, 2], dtype=torch.int64)          # Default for integers
bool_tensor = torch.tensor([True, False], dtype=torch.bool)  # Boolean

print(f"float32: {float32.dtype}, size: {float32.element_size()} bytes per element")
print(f"float16: {float16.dtype}, size: {float16.element_size()} bytes per element")
print(f"int64: {int64.dtype}, size: {int64.element_size()} bytes per element")
print(f"bool: {bool_tensor.dtype}, size: {bool_tensor.element_size()} bytes per element")

# Converting between types
x = torch.tensor([1.5, 2.5, 3.5])
print(f"\nOriginal: {x}, dtype: {x.dtype}")
print(f"To int: {x.int()}, dtype: {x.int().dtype}")  # Truncates!
print(f"To long (int64): {x.long()}, dtype: {x.long().dtype}")
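
Besides shorthands like `.int()` and `.long()`, the general `.to(dtype)` method converts to any dtype, and mixing dtypes in a single operation triggers PyTorch's type promotion. A short sketch:

```python
import torch

ints = torch.tensor([1, 2, 3])          # int64
floats = torch.tensor([0.5, 0.5, 0.5])  # float32

# .to() is the general-purpose conversion method
halves = floats.to(torch.float16)
print(halves.dtype)  # torch.float16

# Mixing int64 and float32 promotes the result to float32
mixed = ints + floats
print(mixed, mixed.dtype)

# torch.result_type reports the promoted dtype without computing anything
print(torch.result_type(ints, floats))  # torch.float32

# Bytes per element: float16 vs float32
print(halves.element_size(), floats.element_size())  # 2 4
```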

---

## 4. Tensor Operations

### Mathematical Operations

PyTorch provides three ways to do operations:
1. **Functions**: `torch.add(a, b)`
2. **Methods**: `a.add(b)`
3. **Operators**: `a + b`

They're all equivalent! Use what's most readable.

In [None]:
# Element-wise operations (operate on each element individually)
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[5.0, 6.0], [7.0, 8.0]])

# Addition
print("Addition (a + b):")
print(a + b)  # [[6, 8], [10, 12]]
print()

# Element-wise multiplication (NOT matrix multiplication!)
print("Element-wise multiplication (a * b):")
print(a * b)  # [[5, 12], [21, 32]]
print()

# Exponentiation
print("Power (a ** 2):")
print(a ** 2)  # [[1, 4], [9, 16]]
print()

# Common functions
print("Square root (torch.sqrt(a)):")
print(torch.sqrt(a))
print()

print("Exponential (torch.exp(a)):")
print(torch.exp(a))
print()

print("Natural log (torch.log(a)):")
print(torch.log(a))
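
The cell above uses operators and `torch.*` functions; as a quick check that the function, method, and operator forms really are interchangeable, here is a minimal sketch:

```python
import torch

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[5.0, 6.0], [7.0, 8.0]])

via_function = torch.add(a, b)  # function form
via_method = a.add(b)           # method form
via_operator = a + b            # operator form

print(torch.equal(via_function, via_method) and torch.equal(via_method, via_operator))  # True

# The same pattern holds for other element-wise ops, e.g. multiplication
print(torch.equal(torch.mul(a, b), a * b))  # True
```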

### In-place Operations

Operations ending with `_` modify the tensor in-place (save memory but lose the original).

In [None]:
x = torch.tensor([1.0, 2.0, 3.0])
print(f"Original x: {x}")

# Regular operation (creates new tensor)
y = x.add(10)
print(f"After y = x.add(10):")
print(f"  x: {x}")  # Unchanged
print(f"  y: {y}")  # New tensor
print()

# In-place operation (modifies existing tensor)
x.add_(10)  # Note the underscore!
print(f"After x.add_(10):")
print(f"  x: {x}")  # Modified!

# WARNING: In-place operations can break autograd (gradient computation)
# Avoid them when training neural networks unless you know what you're doing
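
A concrete case of the warning above: autograd refuses in-place updates on a leaf tensor that requires gradients. The sketch below previews `requires_grad` (covered in the next topic), catches the resulting error, and shows the usual workaround of wrapping the update in `torch.no_grad()`, the pattern optimizers use when updating parameters.

```python
import torch

w = torch.ones(3, requires_grad=True)  # leaf tensor tracked by autograd

try:
    w.add_(1.0)  # in-place update on a leaf that requires grad
except RuntimeError as err:
    print(f"RuntimeError: {err}")

# Disabling gradient tracking makes the in-place update allowed
with torch.no_grad():
    w.add_(1.0)

print(w)  # values are now 2.0 (the failed update above was not applied)
```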

---

## 5. Matrix Operations

### Why Matrix Multiplication Is Everywhere

Matrix multiplication is the **core operation** in neural networks. Every fully connected layer, every attention mechanism - they all use matrix multiplication.

**Remember**: For matrix multiplication `A @ B`:
- `A` has shape `(m, n)`
- `B` has shape `(n, p)`
- Result has shape `(m, p)`
- The inner dimensions `(n)` must match!
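
To see the inner-dimension rule in action, this sketch multiplies compatible shapes, triggers the error for mismatched ones, and fixes it with a transpose. The shapes are arbitrary.

```python
import torch

A = torch.randn(3, 2)  # (m, n) = (3, 2)
B = torch.randn(2, 4)  # (n, p) = (2, 4)
print((A @ B).shape)   # torch.Size([3, 4])

# Inner dimensions 2 and 3 do not match, so this raises an error
try:
    A @ torch.randn(3, 4)
except RuntimeError as err:
    print(f"RuntimeError: {err}")

# Transposing A to (2, 3) makes the inner dimensions match again
print((A.T @ torch.randn(3, 4)).shape)  # torch.Size([2, 4])
```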

In [None]:
# Matrix multiplication
A = torch.tensor([[1, 2],
                  [3, 4],
                  [5, 6]])  # Shape: (3, 2)

B = torch.tensor([[7, 8, 9],
                  [10, 11, 12]])  # Shape: (2, 3)

# Three equivalent ways to multiply
C1 = torch.mm(A, B)      # Function
C2 = A.mm(B)             # Method
C3 = A @ B               # Operator (most readable!)

print(f"A shape: {A.shape}")
print(f"B shape: {B.shape}")
print(f"C shape: {C1.shape}")  # (3, 3)
print(f"\nResult (A @ B):\n{C1}")

# All three are identical
print(f"\nAre they equal? {torch.equal(C1, C2) and torch.equal(C2, C3)}")

In [None]:
# Batch matrix multiplication (common in transformers!)
# Scenario: You have multiple matrices and want to multiply them pairwise

# Batch of 32 matrices, each 10x20
batch_A = torch.randn(32, 10, 20)
# Batch of 32 matrices, each 20x30
batch_B = torch.randn(32, 20, 30)

# Multiply each corresponding pair
batch_C = torch.bmm(batch_A, batch_B)  # Shape: (32, 10, 30)
# Or use @ operator (works for batched too!)
batch_C2 = batch_A @ batch_B

print(f"Batch A shape: {batch_A.shape}")
print(f"Batch B shape: {batch_B.shape}")
print(f"Batch C shape: {batch_C.shape}")
print(f"Are they equal? {torch.allclose(batch_C, batch_C2)}")  # allclose tolerates tiny float rounding differences
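
`torch.bmm` requires two 3D tensors with the same batch size, whereas `@` (`torch.matmul`) also broadcasts batch dimensions. That is handy when one shared matrix, such as a weight matrix, multiplies every item in a batch. The sizes below are arbitrary.

```python
import torch

batch = torch.randn(32, 10, 20)  # 32 matrices of shape (10, 20)
shared = torch.randn(20, 30)     # one matrix shared by the whole batch

# @ treats `shared` as if it were repeated for each of the 32 batch items
out = batch @ shared
print(out.shape)  # torch.Size([32, 10, 30])

# Same result as multiplying each batch item separately
manual = torch.stack([m @ shared for m in batch])
print(torch.allclose(out, manual, atol=1e-4))

# torch.bmm does not broadcast: both inputs must be 3D
try:
    torch.bmm(batch, shared)
except RuntimeError as err:
    print(f"bmm error: {err}")
```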

---

## 6. Reshaping Tensors

### Why Reshape?

Different neural network layers expect different shapes:
- CNN layers: `(batch, channels, height, width)`
- Fully connected layers: `(batch, features)`
- Transformers: `(batch, sequence_length, embedding_dim)`

You'll constantly need to reshape data to fit these expectations.
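
One layout change that `reshape` cannot do is reordering dimensions. Image libraries commonly return arrays laid out as `(height, width, channels)`, while PyTorch's convolution layers expect `(channels, height, width)`. `permute` reorders the axes; reshaping to the same target shape would silently scramble the pixels. A small sketch with a tiny 2x2 RGB "image":

```python
import torch

# A tiny 2x2 "image" with 3 channels, stored as (height, width, channels)
hwc = torch.arange(12).reshape(2, 2, 3)

# permute reorders the axes: new order is (channels, height, width)
chw = hwc.permute(2, 0, 1)
print(chw.shape)  # torch.Size([3, 2, 2])

# Each value stays attached to its (row, column, channel) position
print(chw[1, 0, 1].item() == hwc[0, 1, 1].item())  # True

# reshape only reinterprets the flat order, so its "channel 0" is wrong
wrong = hwc.reshape(3, 2, 2)
print(chw[0])    # true channel 0 values: [[0, 3], [6, 9]]
print(wrong[0])  # not channel 0: [[0, 1], [2, 3]]
```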

In [None]:
# Original tensor
x = torch.arange(12)  # [0, 1, 2, ..., 11]
print(f"Original shape: {x.shape}")  # torch.Size([12])
print(f"Original:\n{x}")
print()

# Reshape to 2D
x_2d = x.reshape(3, 4)  # 3 rows, 4 columns
print(f"Reshaped to (3, 4):\n{x_2d}")
print()

# Reshape to 3D
x_3d = x.reshape(2, 2, 3)  # 2 blocks of 2x3
print(f"Reshaped to (2, 2, 3):\n{x_3d}")
print()

# Use -1 to infer dimension (handy!)
x_auto = x.reshape(3, -1)  # 3 rows, infer columns
print(f"Reshaped to (3, -1) -> actual shape: {x_auto.shape}")
print(x_auto)

In [None]:
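# view() vs reshape() - what's the difference?
# view() never copies: it only works when the new shape is compatible with the
# tensor's memory layout (e.g. contiguous tensors) and raises an error otherwise.
# reshape() returns a view when possible and silently copies when it isn't.
# Prefer reshape() by default; use view() when you need a guarantee that no copy is made.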

x = torch.arange(12)

# Both work for contiguous tensors
view_x = x.view(3, 4)
reshape_x = x.reshape(3, 4)
print(f"Are they equal? {torch.equal(view_x, reshape_x)}")

# Flatten: Convert any shape to 1D
flattened = x.reshape(2, 2, 3).flatten()  # (2, 2, 3) -> (12,)
print(f"\nFlattened shape: {flattened.shape}")  # torch.Size([12])
print(f"Flattened: {flattened}")
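
Transposing is the classic way to end up with a non-contiguous tensor. This sketch shows `view()` failing on one while `reshape()` quietly copies:

```python
import torch

# Transposing swaps strides without moving data, so t is not contiguous
t = torch.arange(6).reshape(2, 3).T
print(t.is_contiguous())  # False

# view() cannot express this layout as a flat tensor and raises an error
try:
    t.view(6)
except RuntimeError as err:
    print(f"view failed: {err}")

# reshape() falls back to copying when a view is impossible
flat = t.reshape(6)
print(flat)  # tensor([0, 3, 1, 4, 2, 5])

# Making the tensor contiguous first (which copies) lets view() work
print(torch.equal(t.contiguous().view(6), flat))  # True
```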

In [None]:
# squeeze() and unsqueeze() - add/remove dimensions of size 1
# Very useful for matching shapes in operations!

x = torch.tensor([1, 2, 3, 4])
print(f"Original shape: {x.shape}")  # torch.Size([4])

# Add dimension at position 0
x_unsqueezed = x.unsqueeze(0)
print(f"After unsqueeze(0): {x_unsqueezed.shape}")  # torch.Size([1, 4])

# Add dimension at position 1
x_unsqueezed2 = x.unsqueeze(1)
print(f"After unsqueeze(1): {x_unsqueezed2.shape}")  # torch.Size([4, 1])

# Remove dimensions of size 1
x_squeezed = x_unsqueezed.squeeze()
print(f"After squeeze(): {x_squeezed.shape}")  # torch.Size([4])
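
A common pitfall: `squeeze()` with no argument removes *every* size-1 dimension, including a batch dimension that happens to be 1. Passing a dimension removes only that one. Indexing with `None` is an equivalent shorthand for `unsqueeze`. A minimal sketch:

```python
import torch

# A "batch" containing a single sample: shape (batch=1, features=3, 1)
x = torch.zeros(1, 3, 1)

print(x.squeeze().shape)    # torch.Size([3]): the batch dimension is gone too
print(x.squeeze(-1).shape)  # torch.Size([1, 3]): only the last dimension removed

# None inserts a size-1 dimension, like unsqueeze
v = torch.tensor([1, 2, 3, 4])
print(torch.equal(v[None, :], v.unsqueeze(0)))  # True, shape (1, 4)
print(torch.equal(v[:, None], v.unsqueeze(1)))  # True, shape (4, 1)
```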

---

## 7. Broadcasting

### What Is Broadcasting?

Broadcasting is PyTorch's way of making tensors with different shapes compatible for operations.

**Why is this powerful?** You can add a bias vector to every row of a matrix without explicit loops!

**Rules**:
1. If tensors have different number of dimensions, pad the smaller one with dimensions of size 1 on the left
2. Dimensions are compatible if they're equal or one of them is 1
3. The result has the maximum size along each dimension
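
The three rules are mechanical enough to write as a small function. In the sketch below, `broadcast_shape` is a hypothetical helper written only for illustration; it applies the rules step by step and is checked against PyTorch's built-in `torch.broadcast_shapes`.

```python
import torch

def broadcast_shape(shape_a, shape_b):
    """Apply the broadcasting rules to two shapes (illustrative helper)."""
    # Rule 1: left-pad the shorter shape with 1s
    ndim = max(len(shape_a), len(shape_b))
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    result = []
    for dim_a, dim_b in zip(a, b):
        # Rule 2: sizes must be equal or one of them must be 1
        if dim_a != dim_b and 1 not in (dim_a, dim_b):
            raise ValueError(f"incompatible sizes {dim_a} and {dim_b}")
        # Rule 3: take the size that isn't 1 (the maximum, for nonzero sizes)
        result.append(dim_a if dim_b == 1 else dim_b)
    return tuple(result)

print(broadcast_shape((3, 3), (3,)))                 # (3, 3)
print(broadcast_shape((3, 1, 5), (1, 4, 5)))         # (3, 4, 5)
print(torch.broadcast_shapes((3, 1, 5), (1, 4, 5)))  # torch.Size([3, 4, 5])
```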

In [None]:
# Example 1: Add a scalar to a tensor
x = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])
scalar = 10

result = x + scalar  # scalar is broadcast to match x's shape
print(f"x + 10:\n{result}")
print()

# Example 2: Add a vector to each row of a matrix
matrix = torch.tensor([[1, 2, 3],
                       [4, 5, 6],
                       [7, 8, 9]])
vector = torch.tensor([10, 20, 30])

print(f"Matrix shape: {matrix.shape}")  # (3, 3)
print(f"Vector shape: {vector.shape}")  # (3,)

result = matrix + vector  # vector is broadcast to (3, 3)
print(f"\nMatrix + vector:\n{result}")
print()

# What happened? 
# vector shape (3,) became (1, 3), then broadcast to (3, 3)
# Each row of matrix gets the vector added to it
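
Because missing dimensions are padded on the left, a `(3,)` vector always lines up with the *last* dimension. To add one value per row instead (the same value across all columns of that row), give the vector an explicit trailing dimension of size 1:

```python
import torch

matrix = torch.tensor([[1, 2, 3],
                       [4, 5, 6],
                       [7, 8, 9]])
per_row = torch.tensor([100, 200, 300])

# (3,) -> (3, 1): one value per row, broadcast across the columns
result = matrix + per_row.unsqueeze(1)
print(result)
# Row i has per_row[i] added to every element:
# [[101, 102, 103],
#  [204, 205, 206],
#  [307, 308, 309]]
```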

In [None]:
# Example 3: More complex broadcasting
a = torch.ones(3, 1, 5)  # Shape: (3, 1, 5)
b = torch.ones(1, 4, 5)  # Shape: (1, 4, 5)

c = a + b  # Broadcasts to (3, 4, 5)
print(f"a shape: {a.shape}")
print(f"b shape: {b.shape}")
print(f"c shape: {c.shape}")  # (3, 4, 5)
print()

# How? 
# Dimension 0: 3 and 1 -> broadcast to 3
# Dimension 1: 1 and 4 -> broadcast to 4  
# Dimension 2: 5 and 5 -> stay 5

# Example 4: Broadcasting error
try:
    x = torch.ones(3, 4)
    y = torch.ones(2, 4)
    z = x + y  # This will fail!
except RuntimeError as e:
    print(f"Error: {e}")
    print("Why? Dimension 0: 3 and 2 are incompatible (neither is 1)")

---

## 8. Indexing and Slicing

PyTorch indexing works just like NumPy!

In [None]:
x = torch.arange(12).reshape(3, 4)
print(f"Original tensor:\n{x}")
print()

# Get single element
print(f"Element at [1, 2]: {x[1, 2]}")  # 6
print()

# Get entire row
print(f"First row: {x[0]}")  # [0, 1, 2, 3]
print(f"Last row: {x[-1]}")  # [8, 9, 10, 11]
print()

# Get entire column
print(f"Second column: {x[:, 1]}")  # [1, 5, 9]
print()

# Slicing (start:end:step)
print(f"First two rows:\n{x[:2]}")  # Rows 0 and 1
print()
print(f"Last two columns:\n{x[:, -2:]}")  # Last 2 columns of all rows
print()

# Advanced indexing with boolean mask
mask = x > 5
print(f"Mask (x > 5):\n{mask}")
print(f"Elements > 5: {x[mask]}")  # Returns 1D tensor
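
As in NumPy, basic slicing returns a *view* that shares memory with the original, while boolean-mask and integer-list indexing return *copies*. This sketch shows the difference:

```python
import torch

x = torch.arange(12).reshape(3, 4)

# Basic slicing returns a view: writing to it changes x
row = x[0]
row[0] = 100
print(x[0, 0].item())  # 100

# Boolean masks return copies: writing to the result leaves x alone
selected = x[x > 5]
selected[0] = -1
print((x == -1).any().item())  # False

# Integer-list indexing picks whole rows (here rows 0 and 2), also as a copy
print(x[[0, 2]])
```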

---

## 9. GPU Acceleration

### Why GPUs?

GPUs can perform thousands of operations in parallel. For the large matrix operations that neural networks are full of, this often means a **10-100x speedup**, depending on the hardware and problem size.

**Key principle**: Keep your tensors on the same device. Operations that mix CPU and GPU tensors generally raise an error!

In [None]:
# Check device availability
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Current device: {torch.cuda.current_device()}")
    print(f"Device name: {torch.cuda.get_device_name(0)}")
    
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"\nUsing device: {device}")

In [None]:
# Moving tensors to GPU
cpu_tensor = torch.randn(3, 3)
print(f"CPU tensor device: {cpu_tensor.device}")

# Method 1: Using .to()
gpu_tensor = cpu_tensor.to(device)
print(f"GPU tensor device: {gpu_tensor.device}")

# Method 2: Using .cuda() (only if you're sure GPU exists)
if torch.cuda.is_available():
    gpu_tensor2 = cpu_tensor.cuda()
    print(f"GPU tensor2 device: {gpu_tensor2.device}")

# Creating tensors directly on GPU
gpu_tensor3 = torch.randn(3, 3, device=device)
print(f"GPU tensor3 device: {gpu_tensor3.device}")

In [None]:
# Performance comparison (only meaningful with GPU)
import time

size = 5000

# CPU operation
cpu_a = torch.randn(size, size)
cpu_b = torch.randn(size, size)

start = time.perf_counter()  # perf_counter is a high-resolution clock meant for timing intervals
cpu_c = cpu_a @ cpu_b
cpu_time = time.perf_counter() - start
print(f"CPU time: {cpu_time:.4f} seconds")

# GPU operation (if available)
if torch.cuda.is_available():
    gpu_a = cpu_a.to(device)
    gpu_b = cpu_b.to(device)
    
    # Warm up GPU
    _ = gpu_a @ gpu_b
    torch.cuda.synchronize()  # Wait for GPU to finish
    
    start = time.perf_counter()
    gpu_c = gpu_a @ gpu_b
    torch.cuda.synchronize()  # GPU kernels run asynchronously; wait before stopping the clock
    gpu_time = time.perf_counter() - start
    
    print(f"GPU time: {gpu_time:.4f} seconds")
    print(f"Speedup: {cpu_time / gpu_time:.2f}x")
else:
    print("GPU not available for comparison")

---

## 10. Aggregation Operations

These operations reduce tensors along dimensions - crucial for loss functions, metrics, and attention mechanisms.

In [None]:
x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0],
                  [7.0, 8.0, 9.0]])

print(f"Original tensor:\n{x}\n")

# Sum
print(f"Sum of all elements: {x.sum()}")  # Scalar
print(f"Sum over dim=0 (collapses rows, one value per column): {x.sum(dim=0)}")
print(f"Sum over dim=1 (collapses columns, one value per row): {x.sum(dim=1)}")
print()

# Mean
print(f"Mean: {x.mean()}")
print(f"Mean over dim=0 (per column): {x.mean(dim=0)}")
print(f"Mean over dim=1 (per row): {x.mean(dim=1)}")
print()

# Max and min
print(f"Max value: {x.max()}, min value: {x.min()}")
print(f"Max over dim=0: {x.max(dim=0)}")  # Returns a named tuple (values, indices)
print(f"Max value in each column: {x.max(dim=0).values}")
print(f"Row index of each column's max: {x.max(dim=0).indices}")
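
Reducing over `dim=d` removes that dimension from the shape; passing `keepdim=True` keeps it as size 1 instead, so the result broadcasts back against the original tensor. That is the pattern behind normalizing rows, as in this sketch:

```python
import torch

x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0],
                  [7.0, 8.0, 9.0]])

print(x.sum(dim=1).shape)                # torch.Size([3]): dim 1 removed
print(x.sum(dim=1, keepdim=True).shape)  # torch.Size([3, 1]): dim 1 kept as size 1

# (3, 3) / (3, 1) broadcasts, so each row is divided by its own sum
row_normalized = x / x.sum(dim=1, keepdim=True)
print(row_normalized.sum(dim=1))  # each row now sums to 1

# argmax returns only the indices (here: the column of each row's max)
print(x.argmax(dim=1))  # tensor([2, 2, 2])
```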

---

## 11. Practical Example: Image Batch

Let's work with a realistic scenario: a batch of images for a neural network.

In [None]:
# Simulate a batch of RGB images
# Shape: (batch_size, channels, height, width)
batch_size = 32
channels = 3  # RGB
height = 224
width = 224

# Create random images (normally you'd load real images)
images = torch.randn(batch_size, channels, height, width)
print(f"Image batch shape: {images.shape}")
print(f"Total values (pixels x channels): {images.numel():,}")
print(f"Memory (float32): {images.numel() * images.element_size() / 1e6:.2f} MB")
print()

# Common operations
# 1. Normalize images (subtract mean, divide by std)
mean = images.mean(dim=(2, 3), keepdim=True)  # Mean per image per channel
std = images.std(dim=(2, 3), keepdim=True)
normalized = (images - mean) / (std + 1e-7)  # Add epsilon to avoid division by zero

print(f"Original mean: {images.mean():.4f}, std: {images.std():.4f}")
print(f"Normalized mean: {normalized.mean():.4f}, std: {normalized.std():.4f}")
print()

# 2. Get a single image from batch
single_image = images[0]  # Shape: (3, 224, 224)
print(f"Single image shape: {single_image.shape}")
print()

# 3. Convert to grayscale (simple, unweighted average across channels)
grayscale = images.mean(dim=1, keepdim=True)  # Shape: (32, 1, 224, 224)
print(f"Grayscale shape: {grayscale.shape}")
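
Averaging treats red, green, and blue equally. A common alternative weights the channels by perceived brightness; the sketch below uses the ITU-R BT.601 luma coefficients (0.299, 0.587, 0.114) together with broadcasting. The small batch and image size are only to keep it quick.

```python
import torch

images = torch.rand(4, 3, 8, 8)  # small batch: (batch, channels, height, width)

# One weight per channel, shaped (1, 3, 1, 1) so it broadcasts over
# the batch, height, and width dimensions
weights = torch.tensor([0.299, 0.587, 0.114]).view(1, 3, 1, 1)

gray = (images * weights).sum(dim=1, keepdim=True)
print(gray.shape)  # torch.Size([4, 1, 8, 8])

# The weights sum to 1, so a pure white image stays at (about) 1.0
white = torch.ones(1, 3, 8, 8)
print(torch.allclose((white * weights).sum(dim=1), torch.ones(1, 8, 8)))  # True
```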

---

## Mini Exercises

Try these exercises to test your understanding!

### Exercise 1: Create and Manipulate Tensors

Create a tensor of shape `(4, 5)` filled with random values from a normal distribution. Then:
1. Calculate the mean and standard deviation
2. Replace all values less than 0 with 0 (ReLU activation!)
3. Calculate the sum of each row

In [None]:
# Your code here


In [None]:
# Solution
x = torch.randn(4, 5)
print(f"Original tensor:\n{x}\n")

# 1. Mean and std
print(f"Mean: {x.mean():.4f}")
print(f"Std: {x.std():.4f}\n")

# 2. ReLU (replace negative values with 0)
x_relu = torch.where(x < 0, torch.tensor(0.0), x)
# Or simpler: x_relu = torch.clamp(x, min=0)
# Or even simpler: x_relu = torch.relu(x)
print(f"After ReLU:\n{x_relu}\n")

# 3. Sum of each row
row_sums = x_relu.sum(dim=1)
print(f"Row sums: {row_sums}")

### Exercise 2: Matrix Operations

Given matrices A `(3, 4)` and B `(4, 2)`:
1. Compute C = A @ B
2. Verify the shape is correct
3. Compute the transpose of C

In [None]:
# Your code here


In [None]:
# Solution
A = torch.randn(3, 4)
B = torch.randn(4, 2)

# 1. Matrix multiplication
C = A @ B

# 2. Verify shape
print(f"A shape: {A.shape}")
print(f"B shape: {B.shape}")
print(f"C shape: {C.shape}")  # Should be (3, 2)
assert C.shape == (3, 2), "Shape mismatch!"
print()

# 3. Transpose
C_T = C.T  # Or C.transpose(0, 1)
print(f"C:\n{C}\n")
print(f"C transposed:\n{C_T}")
print(f"C_T shape: {C_T.shape}")  # (2, 3)

### Exercise 3: Broadcasting Challenge

You have:
- A batch of 16 vectors, each of size 128: shape `(16, 128)`
- A bias vector of size 128: shape `(128,)`

Add the bias to each vector in the batch using broadcasting.

In [None]:
# Your code here


In [None]:
# Solution
batch = torch.randn(16, 128)
bias = torch.randn(128)

# Simply add! Broadcasting handles it
result = batch + bias

print(f"Batch shape: {batch.shape}")
print(f"Bias shape: {bias.shape}")
print(f"Result shape: {result.shape}")  # (16, 128)

# Verify it worked: first row should be original + bias
print(f"\nVerification:")
print(f"First 3 elements of first row: {batch[0, :3]}")
print(f"First 3 elements of bias: {bias[:3]}")
print(f"First 3 elements of result: {result[0, :3]}")
print(f"Are they equal? {torch.allclose(batch[0, :3] + bias[:3], result[0, :3])}")

---

## Comprehensive Exercise: Simulated Linear Layer

Implement a simple linear transformation (the core of fully connected layers):

**Formula**: `y = xW + b`

Where:
- `x`: input batch of shape `(batch_size, input_features)`
- `W`: weight matrix of shape `(input_features, output_features)`
- `b`: bias vector of shape `(output_features,)`
- `y`: output of shape `(batch_size, output_features)`

**Tasks**:
1. Create random tensors for x, W, and b with appropriate shapes
2. Compute y
3. Verify the output shape is correct
4. Move everything to GPU (if available) and repeat

Use: `batch_size=32`, `input_features=128`, `output_features=64`

In [None]:
# Your code here


In [None]:
# Solution

# Parameters
batch_size = 32
input_features = 128
output_features = 64

# 1. Create tensors
x = torch.randn(batch_size, input_features)
W = torch.randn(input_features, output_features)  # (in, out); note nn.Linear stores its weight as (out, in)
b = torch.randn(output_features)

print(f"x shape: {x.shape}")  # (32, 128)
print(f"W shape: {W.shape}")  # (128, 64)
print(f"b shape: {b.shape}")  # (64,)
print()

# 2. Compute y = xW + b
y = x @ W + b  # Broadcasting adds b to each row

# 3. Verify shape
print(f"y shape: {y.shape}")  # (32, 64)
assert y.shape == (batch_size, output_features), "Shape mismatch!"
print("Shape is correct!\n")

# 4. GPU version
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Move to GPU
x_gpu = x.to(device)
W_gpu = W.to(device)
b_gpu = b.to(device)

# Compute on GPU
y_gpu = x_gpu @ W_gpu + b_gpu

print(f"y_gpu shape: {y_gpu.shape}")
print(f"y_gpu device: {y_gpu.device}")

# Verify results are the same (within floating point precision)
y_cpu_from_gpu = y_gpu.cpu()
print(f"\nResults match: {torch.allclose(y, y_cpu_from_gpu, atol=1e-4)}")  # GPU and CPU kernels may round slightly differently
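
PyTorch's built-in `torch.nn.Linear` computes the same transformation, but stores its weight with shape `(output_features, input_features)` and computes `x @ weight.T + bias`. The sketch below copies `W` and `b` into an `nn.Linear` layer to confirm the two agree; it redefines the tensors so it runs on its own.

```python
import torch
import torch.nn as nn

batch_size, input_features, output_features = 32, 128, 64
x = torch.randn(batch_size, input_features)
W = torch.randn(input_features, output_features)
b = torch.randn(output_features)

layer = nn.Linear(input_features, output_features)
print(layer.weight.shape)  # torch.Size([64, 128]): (out, in)

# Copy our parameters into the layer; no_grad keeps autograd out of the copy
with torch.no_grad():
    layer.weight.copy_(W.T)
    layer.bias.copy_(b)

manual = x @ W + b
builtin = layer(x)
print(torch.allclose(manual, builtin, atol=1e-4))
```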

---

## Key Takeaways

1. **Tensors are everywhere**: They're the fundamental data structure in PyTorch
2. **Shape is king**: Many bugs come from shape mismatches. Always print shapes!
3. **Broadcasting is powerful**: Lets you write concise code without explicit loops
4. **GPU acceleration is easy**: Just move tensors with `.to(device)`
5. **Matrix multiplication is the core operation**: `@` operator is your friend
6. **PyTorch = NumPy + Deep Learning**: Similar API, but with gradients and GPUs

## Next Steps

Now that you understand tensors, you're ready to learn about **automatic differentiation** - the magic that makes neural network training possible!

Continue to: [Topic 2: Automatic Differentiation & Backpropagation](02_autograd_backprop.ipynb)

---

## Further Reading

- [PyTorch Tensor Documentation](https://pytorch.org/docs/stable/tensors.html)
- [Broadcasting Semantics](https://pytorch.org/docs/stable/notes/broadcasting.html)
- [CUDA Semantics](https://pytorch.org/docs/stable/notes/cuda.html)