1. Introduction to PyTorch
PyTorch is a deep learning framework primarily used for building machine learning models. It's a flexible and efficient library that offers strong support for dynamic computation graphs (eager execution) and deep integration with Python.

Key Features:

Dynamic Computation Graph: PyTorch builds the computation graph on the fly as operations execute, so you can use ordinary Python control flow and debugging tools. This makes complex models more intuitive to write and debug.
Automatic Differentiation: PyTorch's autograd engine (torch.autograd) automatically computes gradients for backpropagation, simplifying the training process.
GPU Acceleration: PyTorch integrates with CUDA (NVIDIA GPUs) and MPS (Apple-silicon GPUs), enabling you to run models on GPUs for faster computation.
Deep Learning Modules: PyTorch comes with a high-level neural network module, torch.nn, for building and training deep learning models.
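
To make the first two features concrete, here is a minimal sketch. The function name `branchy` and the input values are made up for illustration. An ordinary Python `if` decides which operations are recorded, and autograd differentiates whichever branch actually ran.

```python
import torch

def branchy(x: torch.Tensor) -> torch.Tensor:
    # Ordinary Python control flow decides which operations are recorded.
    if x.sum() > 0:
        return (x * 2).sum()
    return (x * -3).sum()

pos = torch.tensor([1.0, 2.0], requires_grad=True)
branchy(pos).backward()   # first branch ran: the gradient of sum(2x) is 2 per element
print(pos.grad)

neg = torch.tensor([-1.0, -2.0], requires_grad=True)
branchy(neg).backward()   # second branch ran: the gradient of sum(-3x) is -3 per element
print(neg.grad)
```

The graph is rebuilt on every call, so the two inputs are differentiated through different operations.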

In [None]:
!pip3 install torch torchvision torchaudio -f https://download.pytorch.org/whl/metal.html # latest stable

In [None]:
import torch

print(torch.backends.mps.is_available())  # True on an Apple-silicon Mac with an MPS-enabled PyTorch build

device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
print(f"Using device: {device}") 
# To use the device, call x.to(device), where x is a tensor or a model.
# Creating the device object alone does not move anything: tensor.to(device) returns a new tensor (reassign it), while model.to(device) moves the module's parameters in place.
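
The difference between moving a tensor and moving a module can be sketched as follows. `pick_device` is a hypothetical helper written for this example, not a PyTorch function; it assumes a PyTorch build that exposes `torch.backends.mps`, as the cell above does.

```python
import torch
import torch.nn as nn

def pick_device() -> torch.device:
    # Hypothetical helper: prefer CUDA, then MPS, then fall back to the CPU.
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()

t = torch.ones(3)        # created on the CPU
t_dev = t.to(device)     # returns a tensor on `device`; `t` itself stays on the CPU
layer = nn.Linear(3, 1)
layer.to(device)         # moves the module's parameters in place

print(t.device, t_dev.device, layer.weight.device)
```

Because `tensor.to(...)` is not in-place, forgetting to reassign the result is a common source of device-mismatch errors.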

In [None]:
# PyTorch import
import torch

""" 
In PyTorch, a tensor is a multi-dimensional array used to store data, much like a TensorFlow tensor.
- NOTE: PyTorch has no special tensor class for the GPU. The same torch.Tensor class is used on every device (CPU, CUDA, MPS), and a tensor's .device attribute records where its data lives.

We can perform operations on tensors, such as addition and multiplication. As in NumPy, these arithmetic operators are element-wise:
each element is combined with the element at the same position in the other tensor.
EX: 
a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])
c = a + b # c = [5, 7, 9] (1+4, 2+5, 3+6)

We can also reshape a tensor and change its data type (int32, int64, float32, float64, ...).
Reshaping changes the number of rows and columns, but the total number of elements must stay the same,
e.g. 2x3 -> 3x2.
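
A small sketch of reshaping and dtype conversion. The variable names are illustrative only, and the values are chosen so the element order is easy to follow.

```python
import torch

m = torch.arange(6)              # tensor([0, 1, 2, 3, 4, 5]), dtype int64
m_2x3 = m.reshape(2, 3)          # [[0, 1, 2], [3, 4, 5]]
m_3x2 = m_2x3.reshape(3, 2)      # [[0, 1], [2, 3], [4, 5]]: same 6 elements, row-major order
m_f = m_3x2.to(torch.float32)    # dtype conversion returns a new tensor

print(m_3x2.shape, m_f.dtype)
```

Note that reshaping 2x3 to 3x2 is not a transpose: the elements keep their row-major order.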
"""
# Creating a tensor from a list
x = torch.tensor([1.0, 2.0, 3.0])

# Creating a tensor with zeros, ones, or random values
zeros_tensor = torch.zeros(2, 3)  # 2x3 matrix of zeros
ones_tensor = torch.ones(2, 3)  # 2x3 matrix of ones
random_tensor = torch.rand(2, 3)  # 2x3 matrix with random values

# Creating a tensor with a specific device (CPU or GPU)
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
tensor_on_device = torch.tensor([1.0, 2.0, 3.0]).to(device)  # move tensor to GPU if available else CPU

# Basic operations
a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])
c = a + b  # Element-wise addition
d = a * b  # Element-wise multiplication

# Matrix multiplication
x = torch.rand(2, 3)
y = torch.rand(3, 2)
z = torch.mm(x, y)  # Matrix multiplication

# Reshaping (the total number of elements must stay the same)
x = torch.rand(2, 3)
x_reshaped = x.view(3, 2)  # Same 6 elements viewed as a 3x2 matrix

""" 
PyTorch Autograd (Automatic Differentiation)

- Backpropagation: the algorithm for computing the gradients of the loss function with respect to the model parameters.
- A backward pass is the step where backpropagation is run to compute those gradients.
For example, in a neural network we need the gradients of the loss with respect to the weights and biases.
These gradients let us update the weights and biases in the direction that reduces the loss,
which is how the network learns from the data and improves its performance.
In the example below there is no network yet: we backpropagate from out to x, so x.grad tells us how much out changes with respect to each element of x.
In a real model the same mechanism produces the gradients for the weights and biases.

- Autograd is PyTorch's automatic differentiation engine. It powers backpropagation in neural networks by automatically calculating the gradients of tensor operations.
- Creating tensors with requires_grad: tensors that need gradients (for backpropagation) are marked by setting requires_grad=True.
When we backpropagate, we calculate the gradient of the output with respect to those tensors.

The backward() function: computes the gradients of the tensor it is called on with respect to every leaf tensor in the graph that has requires_grad=True, and accumulates them into the .grad attributes of those tensors.
For scalar outputs, out.backward() is enough. For non-scalar outputs, you need to pass a gradient argument to backward().
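
A minimal sketch of the non-scalar case, with made-up values. The gradient argument has the same shape as the output and weights each output element. Passing all ones is equivalent to calling out.sum().backward().

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = w * w                      # non-scalar output, shape (3,)

# out.backward() alone would raise an error here, because out is not a scalar.
# A vector of ones weights every output element by 1.
out.backward(gradient=torch.ones_like(out))
print(w.grad)                    # d(w_i^2)/dw_i = 2 * w_i
```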

NOTE: after backpropagation, the gradients are accumulated in the .grad attribute of the tensor, so x.grad contains the gradient of the output with respect to x.
The gradient is stored on x because x is a leaf tensor, i.e. the starting point of the computation graph (see TF notes). By default only leaf tensors created with requires_grad=True keep a .grad; intermediate tensors such as y and z do not, unless you call retain_grad() on them.
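
"Accumulated" means each backward() call adds to the existing .grad instead of overwriting it. The sketch below (illustrative names and values) shows why a training loop resets gradients before each step.

```python
import torch

p = torch.tensor([1.0, 1.0], requires_grad=True)

(p ** 2).sum().backward()
first = p.grad.clone()           # gradient after one backward pass

(p ** 2).sum().backward()        # second pass ADDS to the existing p.grad
second = p.grad.clone()

p.grad.zero_()                   # manual reset; optimizer.zero_grad() plays this role for model parameters
print(first, second, p.grad)
```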

The second example uses concrete numbers:
v = [1, 1]
out = v ** 2 = [1, 1]
out.sum() = 1 + 1 = 2
out.sum().backward() computes the gradient of the sum with respect to v and stores it in v.grad.
v.grad = [2, 2], because d/dv_i (v_1^2 + v_2^2) = 2 * v_i and every v_i is 1.
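
The first example can be checked by hand in the same way. There, out is the mean of 3 * (x + 2)^2 over 4 elements, so d(out)/dx_i = 6 * (x_i + 2) / 4 = 1.5 * (x_i + 2). A short sketch comparing autograd with this hand-derived formula:

```python
import torch

x = torch.randn(2, 2, requires_grad=True)
out = (3 * (x + 2) ** 2).mean()
out.backward()

# Hand-derived gradient: d(out)/dx_i = 6 * (x_i + 2) / 4
expected = 1.5 * (x.detach() + 2)
print(torch.allclose(x.grad, expected))
```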
"""
x = torch.randn(2, 2, requires_grad=True)  # 2x2 random leaf tensor; x.grad is None until backward() runs
y = x + 2  # Adds 2 to every element of x (values are random, so unknown in advance)
z = y * y * 3  # Element-wise: z = 3 * y^2
out = z.mean()  # Scalar: the mean of the four elements of z

# Backpropagate from out to x. The gradient tells us how much out changes per unit change in each element of x.
out.backward()  # Computes d(out)/dx and stores it in x.grad
print(x.grad)  # 2x2 matrix, same shape as x

# EX2
# Here the output is the sum of the squares of the elements of v instead of the mean.
v = torch.tensor([1.0, 1.0], requires_grad=True)  # Leaf tensor that tracks gradients
out = v ** 2  # Element-wise square
out.sum().backward()  # Sum to a scalar, then compute d(sum)/dv and store it in v.grad
print(v.grad)  # tensor([2., 2.])

""" 
Neural Networks with torch.nn
The torch.nn module provides a set of tools to build and train neural networks.

Building a Model (Class-based Approach)

You define a neural network by subclassing torch.nn.Module.

What is a FULLY CONNECTED LAYER? A layer in which each neuron is connected to every neuron in the previous layer.
In TensorFlow this is the Dense layer; in PyTorch it is nn.Linear.

Forward pass: the process of passing the input data through the layers of the network and computing the output.
- NOTE: the backward pass is a separate step that computes the gradients of the loss function with respect to the model parameters.

ReLU is a non-linear activation function used to introduce non-linearity into the network. See TF notes for more details.

EX:
Our example has 2 layers: the first has 784 input features and 128 output neurons, and the second has 128 inputs and 10 outputs.
In the forward pass we pass the input through the first layer, apply ReLU (which zeroes the negative activations),
then pass the result through the second layer and return that as the output of the network.

ReLU (Rectified Linear Unit) sets all negative values to zero and keeps positive values unchanged.
In a simple NN, it adds non-linearity, helping the model learn complex patterns instead of just straight lines. Another common activation is sigmoid (see TF notes).
It basically tells the NN which values to keep and which to ignore, i.e. what is important and what is not.
Mathematically:
ReLU(x) = max(0, x)
✅ Keeps positives → same
✅ Turns negatives → 0
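
A quick numerical sketch of the formula, with input values picked for illustration:

```python
import torch

vals = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
activated = torch.relu(vals)                         # negatives become 0, positives pass through
same = torch.maximum(vals, torch.zeros_like(vals))   # max(0, x) written out explicitly

print(activated)
```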
"""
import torch.nn as nn # for neural networks
import torch.optim as optim # for optimizers

class SimpleNN(nn.Module): # inherit from nn.Module
    def __init__(self): 
        super(SimpleNN, self).__init__() # call the parent class constructor to initialize the module
        self.fc1 = nn.Linear(784, 128)  # Fully connected layer (input 784, output 128)
        self.fc2 = nn.Linear(128, 10)  # Fully connected layer (input 128, output 10)

    def forward(self, x): # forward pass
        x = torch.relu(self.fc1(x))  # ReLU activation function 
        x = self.fc2(x) # output layer takes input from previous layer
        return x # return the output of the network

# Model instantiation
model = SimpleNN()

""" 
Training a Model

Training a model involves the following steps:

Define the loss function (e.g., nn.CrossEntropyLoss). This one is used for multi-class classification, e.g. predicting digits from 0 to 9. The loss measures the difference between the predicted output and the target output, i.e. how well the model is performing.
Define an optimizer (e.g., optim.SGD or optim.Adam). The optimizer updates the model parameters (weights and biases) in the direction that reduces the loss, using the gradients computed in the backward pass.
Perform the forward pass and compute the loss for the current batch.
Backpropagate to compute the gradients of the loss with respect to the model parameters.
Update the weights by calling the optimizer's step() method, which applies the update using those gradients.
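
One detail worth seeing in code: nn.CrossEntropyLoss expects raw scores (logits) and integer class indices, and applies log-softmax internally. That is why SimpleNN returns the fc2 output without a softmax. A minimal sketch with made-up logits and targets:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)              # raw model outputs for 4 samples, 10 classes
targets = torch.tensor([3, 0, 9, 1])     # class indices (int64), one per sample

loss = nn.CrossEntropyLoss()(logits, targets)

# The same value computed in two explicit steps
log_probs = F.log_softmax(logits, dim=1)
manual = F.nll_loss(log_probs, targets)

print(loss.item(), manual.item())
```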

Backwards pass vs Backpropagation: 
Backpropagation = the method (the algorithm for calculating gradients).
Backward pass = the action (the model doing the gradient calculation step during training).

Simple Timeline in Training:
Forward pass → compute output.
Compute loss.
Backward pass → run backpropagation → compute gradients.
Optimizer step → update weights.

# NOTE: an epoch is one complete pass through the entire training dataset, so with 10 epochs we pass through the entire training dataset (X, y) 10 times.
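
To show what the optimizer step does, the sketch below compares optim.SGD with the update written out by hand, p <- p - lr * p.grad. It uses plain SGD without momentum on a toy layer with made-up data. Adam, used in the training loop here, applies a more elaborate rule, so this is an illustration of the idea and not of Adam itself.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
lr = 0.1
layer = nn.Linear(3, 1)
twin = copy.deepcopy(layer)              # identical copy, updated by hand below
inputs, target = torch.randn(8, 3), torch.randn(8, 1)

# Optimizer-driven update
opt = optim.SGD(layer.parameters(), lr=lr)
opt.zero_grad()
nn.functional.mse_loss(layer(inputs), target).backward()
opt.step()

# The same update written out: p <- p - lr * p.grad
nn.functional.mse_loss(twin(inputs), target).backward()
with torch.no_grad():
    for p in twin.parameters():
        p -= lr * p.grad

print(torch.allclose(layer.weight, twin.weight))
```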
"""
# NOTE: the random data matches the model's shapes (784 inputs, 10 classes). Any data with these shapes would work; a real-world example would use real data.
# Create 1000 samples, each with 784 features. A feature is a column of the input data that represents one characteristic of a sample (e.g. height, weight, age).
X = torch.randn(1000, 784)  # inputs (features)
y = torch.randint(0, 10, (1000,))  # outputs (labels): class indices 0-9 for 10-class classification
# Wrap into a dataset
dataset = torch.utils.data.TensorDataset(X, y)
# Create a trainloader
trainloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
epochs = 10
for epoch in range(epochs): # for each epoch
    for data, target in trainloader:
        optimizer.zero_grad()  # Zero the gradients
        output = model(data)  # Forward pass using the model above
        loss = criterion(output, target)  # Compute loss
        loss.backward()  # Backpropagate
        optimizer.step()  # Update weights
    print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item()}")  # Loss of the last batch in this epoch
# The loss should decrease as training proceeds. The labels here are random, so the model is only memorizing the training set.

""" 
Datasets and DataLoader
PyTorch provides utilities for loading and batching data using torch.utils.data.Dataset and DataLoader.

Creating a Custom Dataset
- A dataset is a class that inherits from torch.utils.data.Dataset and implements the __len__ and __getitem__ methods.
- In PyTorch, a dataset represents a collection of data samples. It is used to load and preprocess the data for training and testing the model.

What we achieve here: the DataLoader serves the data in batches of 64 samples and reshuffles it every epoch, so the model does not see the samples in the same order each time.

- You can also use predefined datasets (e.g., MNIST, CIFAR), which are available through torchvision.datasets.
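
A short sketch of what batching looks like in practice, using TensorDataset and random data with the same shapes as this section. 1000 samples at batch_size=64 give 16 batches, and the last one holds the remaining 40 samples.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.randn(1000, 784)
labels = torch.randint(0, 10, (1000,))
loader = DataLoader(TensorDataset(features, labels), batch_size=64, shuffle=True)

# Collect the number of samples in each batch
sizes = [batch_x.shape[0] for batch_x, _ in loader]
print(len(loader), sizes[0], sizes[-1])   # 16 batches: 15 of size 64, then one of size 40
```

Passing drop_last=True to the DataLoader would discard the smaller final batch if a fixed batch size is needed.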
"""
from torch.utils.data import Dataset, DataLoader

train_data = torch.randn(1000, 784)  # Random training data with 1000 samples and 784 features
train_labels = torch.randint(0, 10, (1000,))  # Random training labels numbers from 0 to 9
class CustomDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

# Example usage
train_dataset = CustomDataset(train_data, train_labels) # this is the dataset we created above
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True) # this is the dataloader that loads the dataset in batches of 64 samples




LLM example with MPS (Mac GPU acceleration)

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import time

# Check if MPS (Metal Performance Shaders) is available
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu") # for cpu do device = torch.device("cpu")
print(f"Using device: {device}")

# Define a simple transformer-based language model
class SimpleTransformerLM(nn.Module):
    def __init__(self, vocab_size=10000, embedding_dim=512, nhead=8, 
                 num_layers=6, dim_feedforward=2048):
        super(SimpleTransformerLM, self).__init__()
        
        # Word embeddings
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
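        # Stand-in for positional encoding (simplified for this example).
        # NOTE: a per-token Linear + GELU does not encode token position; a real model would add positional embeddings.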
        self.pos_encoder = nn.Sequential(
            nn.Linear(embedding_dim, embedding_dim),
            nn.GELU()
        )
        
        # Transformer encoder layers
        encoder_layers = nn.TransformerEncoderLayer(
            d_model=embedding_dim, 
            nhead=nhead,
            dim_feedforward=dim_feedforward,
            batch_first=True
        )
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, num_layers)
        
        # Output layer
        self.output = nn.Linear(embedding_dim, vocab_size)
        
    def forward(self, x):
        # x shape: [batch_size, seq_len]
        x = self.embedding(x)  # [batch_size, seq_len, embedding_dim]
        x = self.pos_encoder(x)
        x = self.transformer_encoder(x)
        x = self.output(x)  # [batch_size, seq_len, vocab_size]
        return x

# Create model and move to MPS device
vocab_size = 10000  # Vocabulary size
seq_length = 128    # Sequence length
batch_size = 16     # Batch size

model = SimpleTransformerLM(vocab_size=vocab_size)
model.to(device)  # Move model to the selected device (MPS if available, else CPU)
print(f"Model moved to {device}")

# Generate some dummy data
input_data = torch.randint(0, vocab_size, (batch_size, seq_length)).to(device)
target_data = torch.randint(0, vocab_size, (batch_size, seq_length)).to(device)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train model for a few steps to demonstrate GPU usage
def train_step(model, inputs, targets):
    model.train()
    optimizer.zero_grad()
    
    # Forward pass
    outputs = model(inputs)
    # Reshape for cross entropy
    outputs = outputs.view(-1, vocab_size)
    targets = targets.view(-1)
    
    # Calculate loss
    loss = criterion(outputs, targets)
    
    # Backward pass
    loss.backward()
    optimizer.step()
    
    return loss.item()

# Test training performance
print("\nTraining for 20 iterations to test performance:")
start_time = time.time()
for i in range(20):
    loss = train_step(model, input_data, target_data)
    print(f"Iteration {i+1}, Loss: {loss:.4f}")
end_time = time.time()
print(f"Training time: {end_time - start_time:.2f} seconds") # 2.80 sec on GPU, 13.35 sec on CPU (MacBook Pro M2 16GB)

# Generate predictions with the model
def generate_text(model, start_tokens, max_length=20):
    model.eval()
    current_tokens = start_tokens.clone()
    
    for _ in range(max_length):
        with torch.no_grad():
            # Get model predictions for the next token
            logits = model(current_tokens)
            next_token_logits = logits[:, -1, :]
            
            # Sample from the distribution
            probs = torch.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, 1)
            
            # Append new token to sequence
            current_tokens = torch.cat([current_tokens, next_token], dim=1)
    
    return current_tokens

# Try generating some "text" (just token IDs in this example)
start_seq = torch.randint(0, vocab_size, (1, 5)).to(device)
generated = generate_text(model, start_seq)
print("\nGenerated token sequence:")
print(generated.cpu().numpy())
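
As written, SimpleTransformerLM calls the encoder without an attention mask, so every position can attend to later positions. An autoregressive language model usually blocks this with a causal mask. The sketch below is one possible way to do it, not part of the model above: it uses a tiny encoder with made-up sizes, builds the mask by hand with torch.triu, and checks that changing the last token leaves the earlier outputs untouched.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d_model = 6, 16   # small, illustrative sizes

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=2, dim_feedforward=32, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=1)
encoder.eval()  # disable dropout so outputs are deterministic

# Additive attention mask: -inf above the diagonal blocks attention to future positions.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

x = torch.randn(1, seq_len, d_model)
x_changed = x.clone()
x_changed[:, -1, :] += 1.0        # perturb only the last position

out_a = encoder(x, mask=causal_mask)
out_b = encoder(x_changed, mask=causal_mask)

# With the mask, earlier positions are unaffected by the change at the last position.
print(torch.allclose(out_a[:, :-1], out_b[:, :-1], atol=1e-5))
```

In the model above, the equivalent change would be to build a mask of size seq_len inside forward and pass it as mask= to self.transformer_encoder.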