# Week 2 Practice: From Transformers to Alignment
### Learning Objectives:
* Understand attention mechanisms through NumPy code
* Build a simple transformer block
* Predict next token using a pretrained LLM
* Analyze hallucinations
* Explore supervised fine-tuning logic
* Understand how DPO works via preference modeling

#### Package Introduction

In this notebook, we will use several important Python libraries:

- **[NumPy](https://education.launchcode.org/data-analysis-curriculum/eda-with-pandas/reading/numpy-intro/index.html)**: The fundamental package for scientific computing with Python. We use it for matrix operations and to demonstrate the attention mechanism.
- **[PyTorch](https://www.geeksforgeeks.org/start-learning-pytorch-for-beginners/)**: A popular deep learning framework. We use it to build and train neural network models, including transformer blocks.
- **[Hugging Face Transformers](https://github.com/huggingface/transformers)**: Provides state-of-the-art pre-trained models and tools for natural language processing. We use it to load and interact with large language models (LLMs).
- **[huggingface-cli](https://huggingface.co/docs/huggingface_hub/main/en/guides/cli)**: A command-line tool for managing Hugging Face models and datasets. Useful for downloading models or checking your authentication.

Make sure you have these packages installed. You can install them using pip if needed:


In [1]:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
# Optional: If you're still having issues, you can also try setting max_split_size_mb
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True,max_split_size_mb:32"

In [3]:
! pip install numpy torch transformers huggingface_hub

# `huggingface-cli` is installed together with `huggingface_hub` (a dependency of `transformers`). You can check your installation with:

! huggingface-cli --help


usage: huggingface-cli <command> [<args>]

positional arguments:
  {download,upload,repo-files,env,login,whoami,logout,auth,repo,lfs-enable-largefiles,lfs-multipart-upload,scan-cache,delete-cache,tag,version,upload-large-folder}
                        huggingface-cli command helpers
    download            Download files from the Hub
    upload              Upload a file or a folder to a repo on the Hub
    repo-files          Manage files in a repo on the Hub
    env                 Print information about the environment.
    login               Log in using a token from
                        huggingface.co/settings/tokens
    whoami              Find out which huggingface.co account you are logged
                        in as.
    logout              Log out
    auth                Other authentication related commands
    repo                {create} Commands to interact with your huggingface.co
                        repos.
    lfs-enable-largefiles
                        Co

### Part 1: Attention Mechanism (Self-Attention)

In [None]:
import numpy as np

# Random Q, K, V matrices
def generate_random_qkv(seq_len=4, d_model=8):
    return [np.random.rand(seq_len, d_model) for _ in range(3)]

# Scaled dot-product attention
def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = np.dot(Q, K.T) / np.sqrt(d_k)
    weights = softmax(scores)
    output = np.dot(weights, V)
    return output, weights

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

Q, K, V = generate_random_qkv()
out, attn_weights = self_attention(Q, K, V)
print("Attention Output:\n", out)
print("Attention Weights:\n", attn_weights)


### 1. import numpy as np:

This line imports the numpy library, which is essential for numerical operations, especially array and matrix manipulations, that are heavily used in this code.

### 2. generate_random_qkv(seq_len=4, d_model=8) function:

__Purpose:__ This function creates three random matrices: Query (Q), Key (K), and Value (V). These matrices are conceptual representations of the input sequence.

__Parameters:__
* seq_len: Represents the "sequence length," which is the number of tokens or elements in the input sequence. Here, it's defaulted to 4.
* d_model: Represents the "dimension of the model" or the embedding dimension of each token. Here, it's defaulted to 8.

__Return:__ It returns a list containing three NumPy arrays, each of shape (seq_len, d_model). These arrays are filled with random floating-point numbers. In a real model, Q, K, and V are computed from the input embeddings through learned projection matrices.
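
A minimal sketch of how Q, K, and V would be derived in a real model rather than sampled at random. The embedding matrix `X` and the projection matrices `W_q`, `W_k`, `W_v` are hypothetical names introduced only for this illustration; here they are random, whereas in practice the projections are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8

# X stands in for the token embeddings of a 4-token input (hypothetical data).
X = rng.random((seq_len, d_model))

# Projection matrices: random here, learned during training in a real model.
W_q = rng.random((d_model, d_model))
W_k = rng.random((d_model, d_model))
W_v = rng.random((d_model, d_model))

# Q, K, and V are three different linear views of the same input.
Q, K, V = X @ W_q, X @ W_k, X @ W_v
print(Q.shape, K.shape, V.shape)
```

The resulting matrices have the same (seq_len, d_model) shape as the random ones, so they can be passed to `self_attention` unchanged.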

### 3. softmax(x) function:

__Purpose:__ This function implements the softmax activation function. Softmax is crucial for converting a vector of arbitrary real values into a probability distribution, where all elements are non-negative and sum to 1.

__Parameters:__
* x: A NumPy array (typically a vector or a matrix).

__Mechanism:__

* exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True)): To ensure numerical stability and prevent potential overflow issues when computing exp(x) for large values, the maximum value in x along the last axis is subtracted from x before applying the exponential function. keepdims=True ensures the subtracted maximum maintains its dimension, allowing for proper broadcasting.

* return exp_x / np.sum(exp_x, axis=-1, keepdims=True): Each element of exp_x is then divided by the sum of all elements along the last axis. This normalizes the values into a probability distribution.
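
The effect of subtracting the maximum can be seen by comparing against a version without it. This is a small sketch; `naive_softmax` and `stable_softmax` are names chosen for this illustration, and the input values are deliberately extreme.

```python
import numpy as np

def naive_softmax(x):
    # No max subtraction: np.exp overflows for large inputs.
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def stable_softmax(x):
    # Same computation as softmax() in the notebook.
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

x = np.array([1000.0, 1001.0, 1002.0])

# Silence the overflow warnings so the result itself is easy to read.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(x)   # exp(1000) overflows to inf, and inf / inf is nan

stable = stable_softmax(x)     # equivalent to softmax([-2, -1, 0])
print("naive: ", naive)
print("stable:", stable)
```

Because softmax is invariant to adding a constant to every input, the shifted version returns the same distribution it would for `[-2, -1, 0]`.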

### 4. self_attention(Q, K, V) function:

__Purpose:__ This is the core implementation of the Scaled Dot-Product Attention mechanism. It calculates how much attention each element in the sequence should pay to other elements.

__Parameters:__

* Q: The Query matrix.

* K: The Key matrix.

* V: The Value matrix.

__Mechanisms:__
* d_k = Q.shape[-1]: d_k is the dimension of the key vectors (which is the same as the query vectors). This is used for scaling.
* scores = np.dot(Q, K.T) / np.sqrt(d_k): This is the "dot-product" part.

    1. np.dot(Q, K.T): Computes the dot product between the Query matrix and the transpose of the Key matrix. This measures the similarity or "relevance" between each query vector and all key vectors. The result is a (seq_len, seq_len) matrix, where each element (i, j) represents the similarity between the i-th query and the j-th key.

    2. / np.sqrt(d_k): The scores are divided by the square root of d_k. This "scaling" factor is crucial to prevent the dot products from becoming too large, especially with high-dimensional d_k, which can push the softmax function into regions with extremely small gradients, leading to vanishing gradients.

* weights = softmax(scores): The scores are passed through the softmax function. This converts the raw similarity scores into attention weights. Each row of weights represents how much attention a particular query element pays to every key element in the sequence (including its own). The sum of weights in each row will be 1.

* output = np.dot(weights, V): This is the "weighted sum" part. The attention weights are multiplied by the Value matrix. This operation essentially creates a weighted sum of the Value vectors, where the weights determine the importance of each Value vector for generating the output for a specific query. The output captures the relevant information from the input sequence, weighted by the attention scores.

__Return:__ It returns two things:

* output: The final attention output matrix, with the same shape as V (seq_len, d_model).

* weights: The attention weights matrix (seq_len, seq_len).

__Example Usage:__

* Q, K, V = generate_random_qkv(): Calls the function to generate random Query, Key, and Value matrices.

* out, attn_weights = self_attention(Q, K, V): Calls the self_attention function with the generated Q, K, and V, storing the results in out (the attention output) and attn_weights (the attention weights).

* print("Attention Output:\n", out): Prints the resulting attention output matrix.

* print("Attention Weights:\n", attn_weights): Prints the attention weights matrix.

### In essence, this code demonstrates how self-attention works:

* For each element in an input sequence (represented by its Query vector), it calculates its similarity to all other elements (represented by their Key vectors). 
* These similarities are then scaled and transformed into probabilities (attention weights) using softmax. 
* Finally, these attention weights are used to compute a weighted sum of the Value vectors, producing an output that "attends" to the most relevant parts of the input sequence. 
* This mechanism allows the model to capture long-range dependencies and contextual relationships within the data.
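
The two properties described above, rows of the weight matrix summing to 1 and scaling keeping softmax away from near one-hot outputs, can be checked numerically. This is a sketch under assumptions: it uses standard-normal inputs and a larger `d_k` (512) than the notebook's default so that the effect of scaling is visible.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_k = 4, 512  # larger d_k than the notebook default, chosen for illustration

Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))

raw_scores = Q @ K.T                       # spread grows with d_k
scaled_scores = raw_scores / np.sqrt(d_k)  # spread stays near 1

weights_raw = softmax(raw_scores)
weights_scaled = softmax(scaled_scores)

print("row sums:", weights_scaled.sum(axis=-1))
print("largest weight per row, unscaled:", weights_raw.max(axis=-1))
print("largest weight per row, scaled:  ", weights_scaled.max(axis=-1))
```

Without scaling, each row tends to put almost all of its weight on a single key, which is the saturated regime where softmax gradients become very small.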

### Discussion:
Walk students through the QK^T score computation, scaling, and softmax. Explain how this captures relationships between tokens.

### Part 2: Mini Transformer Block in PyTorch

In [None]:
import torch
import torch.nn as nn

class MiniTransformerBlock(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=2, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * 4),
            nn.ReLU(),
            nn.Linear(embed_dim * 4, embed_dim)
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
    
    def forward(self, x):
        attn_output, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_output)
        ffn_output = self.ffn(x)
        x = self.norm2(x + ffn_output)
        return x

x = torch.randn(1, 5, 16)  # batch_size=1, seq_len=5, embed_dim=16
model = MiniTransformerBlock(embed_dim=16)
out = model(x)
print(out.shape)


Goal: Show how self-attention and the FFN combine with residual connections and layer normalization in PyTorch.

### 1. Imports

* torch: This is the main PyTorch library, providing fundamental tensor operations and utilities.

* torch.nn: This sub-library provides modules for building neural networks, including layers like linear transformations, activation functions, and attention mechanisms.

### 2. MiniTransformerBlock Class
This class inherits from nn.Module, which is the base class for all neural network modules in PyTorch.

`__init__(self, embed_dim)`:
This is the constructor method where the layers of the Transformer block are defined.

* `super().__init__()`: Calls the constructor of the parent class nn.Module. This is crucial for proper initialization of PyTorch modules.

* self.attn = nn.MultiheadAttention(embed_dim, num_heads=2, batch_first=True):

    * This defines the Multi-Head Attention mechanism.

    * embed_dim: This is the dimension of the input and output features for each token in the sequence.

    * num_heads=2: This specifies that the attention mechanism will be split into 2 "heads." Each head can learn different aspects of the relationships between tokens. Their outputs are then concatenated and linearly transformed.

    * batch_first=True: This tells the MultiheadAttention module that the input tensor will have the batch dimension as the first dimension (e.g., [batch_size, sequence_length, embedding_dimension]).

* self.ffn = nn.Sequential(...):

    * This defines the Feed-Forward Network (FFN), which is applied independently to each position in the sequence. It's wrapped in nn.Sequential to define a sequence of operations.

    * nn.Linear(embed_dim, embed_dim * 4): A linear (dense) layer that projects the embed_dim input to a higher dimension (embed_dim * 4).

    * nn.ReLU(): The Rectified Linear Unit (ReLU) activation function, which introduces non-linearity.

    * nn.Linear(embed_dim * 4, embed_dim): Another linear layer that projects the higher-dimensional output back to the original embed_dim.

* self.norm1 = nn.LayerNorm(embed_dim):

    * This defines the first Layer Normalization layer. Layer normalization normalizes the inputs across the feature dimension for each sample independently. This helps stabilize training.

* self.norm2 = nn.LayerNorm(embed_dim):

    * This defines the second Layer Normalization layer.
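
The layers above can be exercised one at a time to see what the first half of `forward` does: self-attention, a residual connection, then layer normalization. This is a minimal sketch with freshly initialized (untrained) layers and a fixed seed; the variable names are chosen for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim = 16
x = torch.randn(1, 5, embed_dim)  # batch_size=1, seq_len=5

attn = nn.MultiheadAttention(embed_dim, num_heads=2, batch_first=True)
norm = nn.LayerNorm(embed_dim)

with torch.no_grad():
    # Self-attention: query, key, and value are all the same tensor.
    attn_output, attn_weights = attn(x, x, x)
    # Residual connection followed by layer normalization.
    y = norm(x + attn_output)

print(attn_output.shape)   # same shape as the input
print(attn_weights.shape)  # (batch, seq_len, seq_len), averaged over heads by default
print(y.mean(dim=-1))                  # close to 0 for every token
print(y.var(dim=-1, unbiased=False))   # close to 1 for every token
```

The second half of `forward` repeats the same residual-plus-norm pattern around the feed-forward network.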

### Part 3: Next Token Prediction using HuggingFace

#### Option 1: Use a Publicly Available Model
Use a non-gated model such as TinyLlama or mistralai/Mistral-7B-v0.1

To avoid a dependency conflict, upgrade PyArrow first: `datasets` 3.0.1 requires `pyarrow>=15.0.0`, but this environment had `pyarrow` 14.0.2 installed.

In [None]:
# Quote the requirement so the shell does not treat ">" as output redirection
!pip install --upgrade "pyarrow>=15.0.0"

Tell Python to use UTF-8 for I/O operations within your Jupyter Notebook.

In [5]:
import os
os.environ["PYTHONIOENCODING"] = "utf-8"

Use the `login()` function from `huggingface_hub` to authenticate and save the Hugging Face token to the Git credential helper. (Git version 2.50.1.windows.1 was installed in this environment.)

In [7]:
from huggingface_hub import login

# Get your token from Hugging Face website (Settings -> Access Tokens)
your_huggingface_token = "MyToken" # The placeholder for my token

login(token=your_huggingface_token, add_to_git_credential=True)

In [7]:
# Step 1: Login via CLI
! pip install --upgrade "huggingface_hub[cli]"
# To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
! huggingface-cli login --token MyToken




The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `hf`CLI if you want to set the git credential as well.
Token is valid (permission: read).
The token `Read30July` has been saved to C:\Users\ch939\.cache\huggingface\stored_tokens
Your token has been saved to C:\Users\ch939\.cache\huggingface\token
Login successful.
The current active token is: `Read30July`


When you want to optimize and scale your deep learning workflows, even for publicly available models, __accelerate__ becomes highly beneficial and often necessary. Using a `device_map` or `tp_plan` requires `accelerate`.

A __device_map__ is essentially a plan or a blueprint that tells accelerate (and by extension, transformers) where to place each individual layer (or component) of your large language model across your available computational devices.

__tp_plan__ stands for Tensor Parallelism Plan. This is a more advanced technique used for distributed inference and training of extremely large models, especially when a single layer of the model itself is too large to fit on a single GPU.

In [None]:
!pip install accelerate

The installation below is optional and can be skipped.

In [None]:
!pip install hf_xet


In [9]:
import ipywidgets as widgets
from IPython.display import display

On my NVIDIA GeForce RTX 4070 SUPER (12 GB of VRAM), with accelerate installed and the model terms accepted, the most probable cause for the repeated "Error displaying widget" during model loading is:

__Insufficient GPU Memory (VRAM) for the model's default loading precision.__

Mistral-7B, even with device_map="auto", might be attempting to load in float32 (full precision) by default. At 4 bytes per parameter, roughly 7 billion weights need about 28 GB, which far exceeds the VRAM of a 4070 SUPER.

__Solution: Load the Model in Lower Precision (8-bit or 4-bit)__

You need to tell transformers to load the model in a more memory-efficient format. This is typically done using bitsandbytes for 8-bit or 4-bit quantization.
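
A back-of-the-envelope estimate makes the precision choice concrete. This is a rough sketch: `weight_memory_gib` is a helper written for this illustration, the parameter count of about 7.24 billion for Mistral-7B is an assumption, and the figures cover weights only (no activations, KV cache, or quantization overhead).

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

N_PARAMS = 7.24e9  # assumed approximate parameter count for Mistral-7B

for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>17}: {weight_memory_gib(N_PARAMS, bits):5.1f} GiB")
```

Under these assumptions, only the 8-bit and 4-bit variants come in under 12 GiB, and actual usage will be higher than the weight-only figure.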

In [11]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login
# login(token="MyToken")  # optional if already logged in via CLI

import torch # Make sure to import torch for torch_dtype
print(torch.__version__)
print(torch.version.cuda)

2.7.1+cu128
12.8


In an earlier run, `torch.version.cuda` printed `None`, and that was the root cause of all my previous GPU-related errors. Even with a powerful RTX 4070 SUPER, a CPU-only PyTorch installation is incapable of communicating with the GPU. (The output above, `2.7.1+cu128` and `12.8`, is from after the fix.)

To solve the problem, I first uninstalled the CPU-only PyTorch from my Anaconda environment:

pip uninstall torch torchvision torchaudio

__Determine the Correct PyTorch Installation Command:__

Go to the official PyTorch website to get the most accurate and up-to-date installation command for my system:
https://pytorch.org/get-started/locally/

On this page, select the following options:

PyTorch Build: Stable

Your OS: Windows

Package: Pip

Compute Platform: the highest available CUDA 12.x version. CUDA 12.x builds are generally compatible with 40-series cards such as my RTX 4070 SUPER.

I selected CUDA 12.8. The website then provides the exact pip install command for the CUDA-enabled build.

In [13]:
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name, token=True)

This was the previous command used to load the model (without quantization):

model = AutoModelForCausalLM.from_pretrained(model_name, token=True, device_map="auto")

In [15]:
import torch
print(torch.__version__)
print(torch.version.cuda)

2.7.1+cu128
12.8


Install bitsandbytes using the Gohlke wheel (if PyTorch is CUDA-enabled):
This remains the most robust method for Windows.

* Go to https://www.lfd.uci.edu/~gohlke/pythonlibs/#bitsandbytes

* Download the .whl file that matches your Python version and the CUDA version your PyTorch is built with (e.g., if PyTorch is 2.3.0+cu121 and your Python is 3.10, look for cp310 and cu121).

* Navigate to your Downloads folder in your Anaconda Prompt/terminal (not necessarily Jupyter for this step, though !pip install in Jupyter can work if you provide the full path to the .whl file).

* Run:

pip install C:\Users\YourUser\Downloads\your_downloaded_bitsandbytes_wheel_file.whl

Replace the path and filename with those of your actual downloaded file.

Unfortunately, the website was out of service when I tried it.

Prioritized Approach (No Gohlke wheels needed if this works):

In [13]:
import torch
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.version.cuda)
print(torch.cuda.device_count())

2.7.1+cu128
True
12.8
1


In [46]:
!pip install bitsandbytes accelerate

Collecting bitsandbytes
  Using cached bitsandbytes-0.46.1-py3-none-win_amd64.whl.metadata (10 kB)
Using cached bitsandbytes-0.46.1-py3-none-win_amd64.whl (72.2 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.46.1


__Run my Model Loading Code with BitsAndBytesConfig:__

After the kernel restarts, run the following code. This uses the BitsAndBytesConfig which is the recommended way to load models in 8-bit or 4-bit, and bitsandbytes should now correctly detect your CUDA-enabled GPU.

In [19]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from huggingface_hub import login
import torch # Make sure to import torch

# Define your quantization configuration
# For 8-bit loading:
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True
)

# OR, for 4-bit loading (often better for RTX 40-series cards, try this if 8-bit still gives OOM):
# This configuration is generally recommended for RTX 40-series for optimal 4-bit performance
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4", # Use NormalFloat4 quantization
#     bnb_4bit_compute_dtype=torch.bfloat16, # Use bfloat16 for faster computation on 40-series GPUs
#     bnb_4bit_use_double_quant=True, # Apply double quantization
# )

# You've already accepted the agreement on the HF website for Mistral-7B-Instruct-v0.1.
# login(token="MyToken") # Optional if already logged in via CLI

model_name = "mistralai/Mistral-7B-Instruct-v0.1"

print(f"Loading tokenizer for {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name, token=True)
print("Tokenizer loaded.")

print(f"Loading model {model_name} with quantization...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=True,
    device_map="auto",
    quantization_config=bnb_config # Pass the defined bnb_config here
)
print("Model loaded successfully!")

# You can then test it
# prompt = "What is the capital of France?"
# inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# outputs = model.generate(**inputs, max_new_tokens=50)  # unpack input_ids and attention_mask
# print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Loading tokenizer for mistralai/Mistral-7B-Instruct-v0.1...
Tokenizer loaded.
Loading model mistralai/Mistral-7B-Instruct-v0.1 with quantization...




Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model loaded successfully!


In [8]:
import os
import gc
import torch

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("Environment variables set and cache cleared.")

Environment variables set and cache cleared.


In [10]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from huggingface_hub import login
import torch

# Define your 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model_name = "mistralai/Mistral-7B-Instruct-v0.1"
# login(token="MyToken") # Optional; never paste a real token into a notebook

In [12]:
print(f"Loading tokenizer for {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name, token=True)
print("Tokenizer loaded.")

print(f"Loading model {model_name} with 4-bit quantization and device_map='auto'...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=True,
    device_map="auto",
    quantization_config=bnb_config
)
print("Model loaded successfully!")

# Verify memory usage after loading
if torch.cuda.is_available():
    print(f"CUDA memory allocated after model load: {torch.cuda.memory_allocated() / (1024**3):.2f} GB")
    print(f"CUDA memory reserved after model load: {torch.cuda.memory_reserved() / (1024**3):.2f} GB")

Loading tokenizer for mistralai/Mistral-7B-Instruct-v0.1...
Tokenizer loaded.
Loading model mistralai/Mistral-7B-Instruct-v0.1 with 4-bit quantization and device_map='auto'...


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model loaded successfully!
CUDA memory allocated after model load: 7.20 GB
CUDA memory reserved after model load: 7.97 GB


This should now succeed without running out of memory (OOM). The "Error displaying widget" message might still appear, but it is cosmetic; "Model loaded successfully!" is the true indicator.

In [15]:
import gc  # <--- Make sure this is here for inference
import torch # Re-import torch if this is a separate cell and not already imported

prompt = "What is the capital of France?"
messages = [
    {"role": "user", "content": prompt}
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device) # Use model.device for correct placement

gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("Generating response...")
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=100, # Start with a lower number, increase if successful
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\n--- Generated Output ---")
print(generated_text)

Generating response...


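AttributeError: 

This call most likely fails because `inputs` is the tokenizer's whole `BatchEncoding` (input IDs plus attention mask), passed positionally where `generate` expects a tensor of input IDs. The next cell passes `input_ids` and `attention_mask` explicitly; unpacking with `model.generate(**inputs, ...)` would also work.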

In [17]:
# Ensure these are imported if you are running this in a new cell
import gc
import torch

# Define your prompt using the message format expected by instruct models
prompt = "Write a short, engaging story about a lost cat who finds its way home."
messages = [
    {"role": "user", "content": prompt}
]

# Apply the chat template to format the prompt correctly for Mistral-Instruct
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Tokenize the input text
inputs_tokenized = tokenizer(input_text, return_tensors="pt")

# Explicitly move input_ids and attention_mask to the correct device
# model.device will be the primary device assigned by device_map="auto" (e.g., 'cuda:0')
input_ids = inputs_tokenized["input_ids"].to(model.device)
attention_mask = inputs_tokenized["attention_mask"].to(model.device)


# Clear cache before generation (good practice for memory management)
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("Generating response...")
with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids,          # Pass input_ids explicitly
        attention_mask=attention_mask, # Pass attention_mask explicitly
        max_new_tokens=200,           # Adjust as needed (start low if OOM still occurs here)
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        # Add pad_token_id to avoid warnings for some models when batching
        pad_token_id=tokenizer.eos_token_id
    )

# Decode the generated text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("\n--- Generated Output ---")
print(generated_text)

Generating response...

--- Generated Output ---
[INST] Write a short, engaging story about a lost cat who finds its way home. [/INST] Once upon a time, in a cozy little town nestled between rolling hills, there lived a cat named Whiskers. Whiskers was not your ordinary cat; she had a special talent for getting lost. No matter how carefully her owner, Mrs. Smith, tried to keep track of her, Whiskers always seemed to find her way into the most unexpected places.

One sunny morning, Whiskers decided to go for a walk. She loved exploring the town and meeting new people (and cats). But this time, she took a turn down a strange alleyway and found herself face-to-face with a mysterious-looking door. Without thinking twice, Whiskers stepped through the door and was suddenly transported to an entirely different world.

The new world was bright and colorful, filled with strange creatures and wondrous sights. Whiskers was terrified at first, but her adventurous spirit soon took over. She began t

__Comments__
Some of these notes are from earlier runs.

"Error displaying widget": This message appeared during the model loading phase. It means the progress bar widget (usually provided by tqdm and displayed using ipywidgets in Jupyter) failed to render correctly.

"Model loaded successfully!": This is the most important part. It confirms that, despite the progress bar error, the model was downloaded and loaded into GPU memory using the configured quantization.

"Error displaying widget" is therefore a cosmetic issue with the progress bar, not a critical error preventing the model from loading. The interactive progress bar could not be shown, but the underlying process completed. This often happens when the Jupyter frontend (your browser's view) does not fully load the ipywidgets JavaScript components for tqdm's notebook integration.

__Expected Outcome:__

With PyTorch correctly using CUDA and bitsandbytes enabled for GPU quantization, the `from_pretrained()` call should download the model (if not cached) and load it into GPU memory using 8-bit (or 4-bit) precision. If ipywidgets renders correctly, you will see download progress instead of "Error displaying widget"; either way, the "Model loaded successfully!" message confirms the load.

If you're using a Mac with Apple Silicon (M1/M2/M3), check whether `torch.backends.mps.is_available()` returns `True`. If it does, the model can run on the MPS backend.

If you are not sure which accelerator your device has, the following cell selects the best available device in the order CUDA → MPS → CPU and prints which one it chose:

In [20]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login
import torch

# Optional if already logged in via CLI
# login(token="your_hf_token")

# ------------------------------------------
# 📦 Device Selection: CUDA > MPS > CPU
# ------------------------------------------
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("✅ Using CUDA (GPU)")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("🟡 Using MPS (Apple Silicon GPU)")
else:
    device = torch.device("cpu")
    print("🔴 Using CPU")

# ------------------------------------------
# 🧠 Load Model from Hugging Face
# ------------------------------------------
model_name = "mistralai/Mistral-7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name, token=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=True,
    torch_dtype=torch.float16 if device.type != "cpu" else torch.float32  # avoid FP16 on CPU
).to(device)

# ------------------------------------------
# 📝 Prompt + Inference
# ------------------------------------------
prompt = "The Eiffel Tower is located in"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=10)
    print("📝 Generated Output:", tokenizer.decode(outputs[0], skip_special_tokens=True))


✅ Using CUDA (GPU)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


📝 Generated Output: The Eiffel Tower is located in Paris, France, and is one of the most
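
The sampling arguments used earlier (`do_sample=True`, `temperature=0.7`, `top_p=0.9`) all act on the model's next-token distribution. The following is a toy sketch of that step, not the Transformers implementation: the five logits are made up, and `top_p_filter` is a helper written for this illustration.

```python
import torch

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p."""
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # A token is kept if the mass *before* it is still below top_p.
    keep = (cumulative - sorted_probs) < top_p
    filtered = torch.zeros_like(probs)
    filtered[sorted_idx[keep]] = sorted_probs[keep]
    return filtered / filtered.sum()  # renormalize over the kept tokens

torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])  # toy scores for a 5-token vocabulary
temperature = 0.7

# Temperature below 1 sharpens the distribution before sampling.
probs = torch.softmax(logits / temperature, dim=-1)
nucleus = top_p_filter(probs, top_p=0.9)

greedy_token = int(torch.argmax(probs))                        # what do_sample=False would pick
sampled_token = int(torch.multinomial(nucleus, num_samples=1))  # what sampling might pick

print("probabilities:", probs)
print("after top-p:  ", nucleus)
print("greedy:", greedy_token, "sampled:", sampled_token)
```

With these toy logits only the two most likely tokens survive the top-p filter, so sampling can never return the low-probability tail.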


### Part 4: DPO vs PPO – Side-by-Side Educational Example
#### DPO: Direct Preference Optimization

In [23]:
import torch
import torch.nn.functional as F

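# Simulated log-probs of chosen vs rejected completions.
# Simplification: no reference model is used here. In full DPO each value would be
# the log-ratio log π_θ(y|x) - log π_ref(y|x) rather than a raw log-prob.
chosen_logp = torch.tensor([[-1.0]])
rejected_logp = torch.tensor([[-2.0]])

def dpo_loss(chosen_logp, rejected_logp, beta=0.1):
    # beta multiplies the preference margin inside the sigmoid
    return -F.logsigmoid(beta * (chosen_logp - rejected_logp)).mean()

print(f"DPO Loss: {dpo_loss(chosen_logp, rejected_logp).item():.4f}")



DPO Loss: 0.6444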
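
As published (Rafailov et al., 2023), the DPO objective compares the policy against a frozen reference model and scales the margin by β. The following is a minimal sketch of that form with simulated sequence log-probabilities; `dpo_loss_with_reference` and all numeric values are assumptions made for illustration.

```python
import math
import torch
import torch.nn.functional as F

def dpo_loss_with_reference(policy_chosen, policy_rejected,
                            ref_chosen, ref_rejected, beta=0.1):
    # Log-ratios of the policy relative to the frozen reference model.
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    # beta scales the margin between the two log-ratios.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Simulated reference log-probs for one (chosen, rejected) pair.
ref_chosen = torch.tensor([-1.5])
ref_rejected = torch.tensor([-1.5])

# At the start of training the policy equals the reference: zero margin, loss = log 2.
loss_start = dpo_loss_with_reference(ref_chosen, ref_rejected, ref_chosen, ref_rejected)

# After the policy shifts probability toward the chosen completion, the loss drops.
loss_better = dpo_loss_with_reference(torch.tensor([-1.0]), torch.tensor([-2.0]),
                                      ref_chosen, ref_rejected)

print("loss when policy == reference:", loss_start.item())
print("loss after preferring chosen: ", loss_better.item())
```

The reference terms are what keep the policy from drifting arbitrarily far from its starting point while it learns the preference.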


#### PPO: Proximal Policy Optimization (simplified for in-class demo)

In [25]:
import torch
import torch.nn.functional as F

# Simulated old and new policy log-probs (log π_θ_old(a|s) and log π_θ(a|s))
old_log_prob = torch.tensor([[-1.0]])  # from the policy before the PPO update (the one that sampled the action)
new_log_prob = torch.tensor([[-0.8]])  # from updated policy
reward = torch.tensor([[1.0]])         # reward from human or reward model
epsilon = 0.2                          # PPO clipping parameter

# Compute ratio of new to old policy
log_ratio = new_log_prob - old_log_prob
ratio = torch.exp(log_ratio)

# Clip the ratio to [1 - epsilon, 1 + epsilon]; the advantage itself is not clipped
advantage = reward  # assume reward ~ advantage for simplicity
clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)

# PPO loss (negative of the clipped surrogate objective)
ppo_loss = -torch.min(ratio * advantage, clipped_ratio * advantage).mean()
print("PPO Loss:", ppo_loss.item())


PPO Loss: -1.2000000476837158
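
The clipping only matters once the ratio leaves the interval [1 - ε, 1 + ε], and its effect depends on the sign of the advantage. The following is a small sketch of three cases; `ppo_clip_loss` is a helper written for this illustration and the ratios are made-up values.

```python
import torch

def ppo_clip_loss(ratio, advantage, epsilon=0.2):
    clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    # Pessimistic choice: take the smaller of the unclipped and clipped objectives.
    return -torch.min(ratio * advantage, clipped_ratio * advantage).mean()

cases = [
    (1.5, 1.0),   # good action, ratio too high: gain is capped at 1 + epsilon
    (0.5, -1.0),  # bad action, ratio already low: penalty is capped at 1 - epsilon
    (1.5, -1.0),  # bad action made more likely: unclipped term wins, full penalty
]

losses = [
    ppo_clip_loss(torch.tensor([r]), torch.tensor([a])).item()
    for r, a in cases
]

for (r, a), loss in zip(cases, losses):
    print(f"ratio={r}, advantage={a:+}: loss={loss:.2f}")
```

In the first two cases the clipped term is selected, so the gradient with respect to the ratio is zero and the policy is not pushed further in that direction.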


## 🔍 DPO vs PPO: Alignment Loss Comparison

| Criterion         | DPO (Direct Preference Optimization)     | PPO (Proximal Policy Optimization)        |
|------------------|-------------------------------------------|--------------------------------------------|
| 🧠 Origin         | Preference modeling (Rafailov et al., Stanford, 2023) | Reinforcement learning (Schulman et al., OpenAI, 2017) |
| ✅ Rejection Signal | Yes: uses chosen vs rejected pairs        | No: requires scalar reward                 |
| 🏆 Reward Signal   | Implicit via log-probability ratios        | Explicit reward model needed               |
| 📐 Loss Function   | `-log(sigmoid(β * (chosen - rejected)))`   | `-min(ratio * A, clipped_ratio * A)`       |
| 🔧 Optimization    | Binary classification over preferences     | Policy gradient with clipped surrogate     |
| 🎯 Application     | DPO-tuned models like LLaMA 3              | RLHF-tuned models like InstructGPT         |
| ⚙️ Complexity      | Simpler (no reward model needed)           | More complex (needs reward model + sampling) |



Explain how aligning models toward human preference uses logit differences.

### Bonus: Inference with Quantization (O1 & O3)

In [None]:
# Load the model with torch_dtype=torch.float16 (half-precision weights)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

Concept: Explain how FP16/O1 optimizes memory and speed at inference.

### 🧠 Concept Breakdown: How FP16 & O1 Optimize Inference

#### 📌 What is FP16?

- **FP16** = *16-bit floating point*, also called **half precision**.
- It uses **less memory** than FP32 (standard 32-bit float), with:
  - 1 sign bit
  - 5 exponent bits
  - 10 mantissa bits
- A value such as `0.123456789` keeps about 7 significant decimal digits in FP32
- In FP16 it is stored as roughly `0.1235` (about 3 significant digits: lower precision, but usually good enough for inference)
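
The precision and size difference can be inspected directly with NumPy's `float16` and `float32` types. This is a small sketch using the example value from the list above.

```python
import numpy as np

x = 0.123456789

x32 = np.float32(x)
x16 = np.float16(x)

print(f"float32: {float(x32):.10f}")
print(f"float16: {float(x16):.10f}")

# Storage per value: 4 bytes vs 2 bytes.
print("bytes:", np.dtype(np.float32).itemsize, "vs", np.dtype(np.float16).itemsize)

# With only 5 exponent bits, FP16 also has a much smaller range than FP32.
print("float16 max:", float(np.finfo(np.float16).max))
print("float32 max:", float(np.finfo(np.float32).max))
```

The limited range is one reason why numerically sensitive operations are often kept in FP32 even when most of the model runs in half precision.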

---

#### 📉 Why Use FP16?

| Feature        | FP32                     | FP16                    |
|----------------|--------------------------|-------------------------|
| Memory usage   | 4 bytes per value         | 2 bytes per value       |
| Compute speed  | Slower on GPUs            | Much faster on GPUs (especially A100/H100) |
| Energy usage   | Higher                    | Lower                   |
| Precision      | High                      | Slightly reduced (acceptable for inference) |

🧠 FP16 helps run **large models** on GPUs with limited memory (e.g., 24GB vs 80GB cards).

---

#### ⚙️ What Is O1 Optimization?

`O1` is one of the optimization levels (`opt_level`) defined by NVIDIA's **Apex AMP** library for **mixed-precision training**. **[DeepSpeed](https://www.deepspeed.ai/)** and **[Accelerate](https://huggingface.co/docs/accelerate)** expose comparable mixed-precision settings (such as `fp16` and `bf16`) under their own names.

| Optimization Level | Description                           |
|--------------------|---------------------------------------|
| O0                 | Full precision (FP32)                 |
| **O1**             | **Mixed precision (auto FP16 + FP32 fallback)** |
| O2                 | "Almost FP16": FP16 model weights with FP32 master weights and FP32 batch norm |
| O3                 | Pure FP16                             |

##### 🔧 What Does O1 Do?
- Automatically **casts compatible operations** (like matmul) to FP16
- **Keeps numerically sensitive ops** (e.g., layer norm, softmax) in FP32
- Result: **Best balance** between speed and stability
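
PyTorch's built-in `torch.autocast` follows the same idea as O1: parameters stay in FP32 while eligible operations run in a lower-precision type. The following is a minimal sketch, not Apex itself; it assumes CPU autocast with `bfloat16` so that it runs without a GPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(16, 16)   # parameters are created in FP32
x = torch.randn(4, 16)

# Inside the autocast region, eligible ops such as the matmul in Linear
# are executed in bfloat16; the stored parameters are left untouched.
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = layer(x)

print("weight dtype:", layer.weight.dtype)
print("output dtype:", y.dtype)
```

On a CUDA device the same pattern would typically use `device_type="cuda"` with `torch.float16` or `torch.bfloat16`.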

---

#### ⚡ Benefits at Inference Time

| Metric            | Before (FP32 / O0) | After (FP16 / O1) |
|------------------|---------------------|--------------------|
| VRAM usage       | High                | ~2x lower          |
| Batch size limit | Smaller             | Larger             |
| Latency          | Higher              | Lower              |
| Throughput       | Lower               | Higher             |

**Example:** Running LLaMA-7B in FP32 might require ~30GB VRAM, while FP16 can bring that down to ~16GB.

---

#### 💡 Code Example (Hugging Face Transformers)

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "openchat/openchat-3.5-1210"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # 👈 Enable half-precision
    device_map="auto"
)

#### 🧪 Bonus: Combine with O3
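In this notebook, "O3" is shorthand for going further than half precision with quantization, sparse attention, and custom kernels (in Apex AMP itself, `O3` denotes pure FP16).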

Supported by tools like DeepSpeed, vLLM, AWQ, and TensorRT

