<div align="center">
    <h1>Introduction to Machine Learning</h1>
    <h2>Chapter 5: Natural Language Processing</h2>
    <h3>Large Language Models & Adaptation</h3>
    <h4>Author: Sina Daneshgar</h4>
</div>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SharifiZarchi/Introduction_to_Machine_Learning/blob/main/Jupyter_Notebooks/Chapter_05_Natural_Language_Processing/04-LLM_and_Adaptation.ipynb)
[![Open In kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://raw.githubusercontent.com/SharifiZarchi/Introduction_to_Machine_Learning/main/Jupyter_Notebooks/Chapter_05_Natural_Language_Processing/04-LLM_and_Adaptation.ipynb)

---

## Table of Contents

1.  [Language Modeling Basics](#1.-Language-Modeling-Basics)
2.  [Text Generation Strategies](#2.-Text-Generation-Strategies)
3.  [In-Context Learning (Zero-shot & Few-shot)](#3.-In-Context-Learning-(Zero-shot-&-Few-shot))
4.  [Model Adaptation: Parameter-Efficient Fine-Tuning (PEFT)](#4.-Model-Adaptation:-Parameter-Efficient-Fine-Tuning-(PEFT))

In [None]:
# Install necessary libraries if not already installed
# !pip install transformers torch numpy matplotlib
# !pip install peft  # needed for the adaptation section

import torch
import numpy as np
import matplotlib.pyplot as plt
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch.nn.functional as F

# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

## 1. Language Modeling Basics

At its core, a causal language model (like GPT) is trained to predict the probability of the next token given a sequence of previous tokens:
$$ P(w_t | w_{1}, w_{2}, ..., w_{t-1}) $$
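
To make the formula concrete, the sketch below turns a vector of raw scores (logits) into a next-token distribution with a softmax, then multiplies conditional probabilities to score a whole sequence via the chain rule. The four-token vocabulary and every number are invented for illustration; none of them come from GPT-2.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and made-up logits for the context "The capital of France is"
vocab = [" Paris", " London", " the", " a"]
logits = [4.0, 2.0, 1.0, 0.5]

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"P({token!r} | context) = {p:.3f}")

# Chain rule: P(w_1..w_T) is the product of P(w_t | w_<t).
# Summing log-probabilities avoids underflow on long sequences.
step_probs = [0.20, 0.50, 0.90]  # made-up conditional probabilities for 3 tokens
sequence_log_prob = sum(math.log(p) for p in step_probs)
sequence_prob = math.exp(sequence_log_prob)
print(f"Sequence probability: {sequence_prob:.3f}")
```

A real model produces one logit per vocabulary entry (about 50k for GPT-2), but the softmax step is the same.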

Let's load a pre-trained GPT-2 model and see this in action.

In [None]:
# Load pre-trained model and tokenizer
model_name = "gpt2"  # You can use "gpt2-medium" or "gpt2-large" for better results
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name).to(device)
model.eval()

print("Model loaded successfully!")

### Predicting the Next Token
Let's give the model a sentence and see what it thinks the next word should be.

In [None]:
def predict_next_token(text, top_k=5):
    # Encode input text
    input_ids = tokenizer.encode(text, return_tensors="pt").to(device)
    
    # Get model output
    with torch.no_grad():
        outputs = model(input_ids)
        predictions = outputs.logits[0, -1, :]  # Get logits for the last token
    
    # Apply softmax to get probabilities
    probs = F.softmax(predictions, dim=-1)
    
    # Get top k predictions
    top_probs, top_indices = torch.topk(probs, top_k)
    
    print(f"Input text: '{text}'")
    print("Top next token predictions:")
    for prob, idx in zip(top_probs, top_indices):
        token = tokenizer.decode([idx.item()])  # convert the 0-dim tensor to a plain int
        print(f"  '{token}': {prob.item():.4f}")

# Test with a few examples
predict_next_token("The capital of France is")
print("-" * 30)
predict_next_token("Machine learning is a subfield of")

## 2. Text Generation Strategies

Generating text involves repeatedly predicting the next token and appending it to the input. However, *how* we select that next token matters significantly.

### A. Greedy Search
Simply select the token with the highest probability at each step.
- **Pros**: Fast, deterministic.
- **Cons**: Can get stuck in repetitive loops, and can miss high-probability sequences that begin with a lower-probability token.
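
The second drawback can be seen in a toy setting. The sketch below runs greedy decoding over a hand-written next-token table (a hypothetical bigram model, not GPT-2) and compares the greedy path with an alternative path that starts with a less likely token. All names and probabilities are assumptions made for illustration.

```python
# Hypothetical next-token table: NEXT[prev][cur] = P(cur | prev).
# A real LLM conditions on the whole prefix; a bigram table keeps the sketch small.
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "the": 0.2},
    "a": {"dog": 0.9, "cat": 0.1},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"ran": 0.8, "</s>": 0.2},
    "sat": {"</s>": 1.0},
    "ran": {"</s>": 1.0},
}

def greedy_decode(start="<s>", end="</s>", max_new_tokens=10):
    """Append the single most likely next token until the end token appears."""
    tokens = [start]
    for _ in range(max_new_tokens):
        candidates = NEXT[tokens[-1]]
        best = max(candidates, key=candidates.get)
        tokens.append(best)
        if best == end:
            break
    return tokens

def path_probability(tokens):
    """Multiply the conditional probabilities along a token path."""
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= NEXT[prev][cur]
    return prob

greedy_tokens = greedy_decode()
alternative = ["<s>", "a", "dog", "ran", "</s>"]
print("greedy:     ", " ".join(greedy_tokens), f"{path_probability(greedy_tokens):.3f}")
print("alternative:", " ".join(alternative), f"{path_probability(alternative):.3f}")
```

In this table the greedy path commits to "the" because it is the best first token, yet the path that starts with "a" has a higher total probability.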

In [None]:
def generate_greedy(text, max_new_tokens=20):
    input_ids = tokenizer.encode(text, return_tensors="pt").to(device)
    
    output = model.generate(
        input_ids, 
        max_new_tokens=max_new_tokens, 
        do_sample=False,  # Disable sampling for greedy search
        pad_token_id=tokenizer.eos_token_id
    )
    
    return tokenizer.decode(output[0], skip_special_tokens=True)

print("Greedy Generation:")
print(generate_greedy("Once upon a time,"))

### B. Beam Search
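Keeps the `num_beams` most likely partial sequences (beams) at each step, extends each of them, and again keeps only the most likely results.
- **Pros**: Usually finds a higher-probability sequence than greedy search.
- **Cons**: Slower (roughly `num_beams` times the work per step), and the output can still be repetitive.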
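
As a rough illustration of the procedure, here is a minimal beam search over a hand-written next-token table (a hypothetical bigram model with invented probabilities, not GPT-2). It is a sketch under those assumptions: it scores beams by summed log-probability and applies no length normalization, which practical implementations often add.

```python
import math

# Hypothetical next-token table: NEXT[prev][cur] = P(cur | prev).
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "the": 0.2},
    "a": {"dog": 0.9, "cat": 0.1},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"ran": 0.8, "</s>": 0.2},
    "sat": {"</s>": 1.0},
    "ran": {"</s>": 1.0},
}

def beam_search(num_beams=2, max_new_tokens=10, start="<s>", end="</s>"):
    """Return the surviving beams as (log_probability, tokens), best first."""
    beams = [(0.0, [start])]
    for _ in range(max_new_tokens):
        candidates = []
        for log_p, tokens in beams:
            if tokens[-1] == end:
                candidates.append((log_p, tokens))  # finished beams carry over unchanged
                continue
            for token, p in NEXT[tokens[-1]].items():
                candidates.append((log_p + math.log(p), tokens + [token]))
        # Keep only the num_beams highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
        if all(tokens[-1] == end for _, tokens in beams):
            break
    return beams

best_log_p, best_tokens = beam_search(num_beams=2)[0]
print("beam (2):", " ".join(best_tokens), f"{math.exp(best_log_p):.3f}")

# With a single beam the procedure reduces to greedy search.
greedy_log_p, greedy_tokens = beam_search(num_beams=1)[0]
print("beam (1):", " ".join(greedy_tokens), f"{math.exp(greedy_log_p):.3f}")
```

With two beams the search keeps the "a" branch alive long enough to discover that it leads to a more probable sentence than the greedy choice.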

In [None]:
def generate_beam(text, num_beams=5, max_new_tokens=20):
    input_ids = tokenizer.encode(text, return_tensors="pt").to(device)
    
    output = model.generate(
        input_ids, 
        max_new_tokens=max_new_tokens, 
        num_beams=num_beams, 
        early_stopping=True,
        pad_token_id=tokenizer.eos_token_id
    )
    
    return tokenizer.decode(output[0], skip_special_tokens=True)

print("Beam Search Generation:")
print(generate_beam("Once upon a time,"))

### C. Sampling (Temperature, Top-k, Top-p)
Instead of always picking the most likely token, we sample the next token from the model's probability distribution.
- **Temperature**: Divides the logits by $T$ before the softmax. $T < 1$ sharpens the distribution (more conservative output); $T > 1$ flattens it (more random output).
- **Top-k**: Sample only from the $k$ most likely tokens.
- **Top-p (Nucleus Sampling)**: Sample from the smallest set of tokens whose cumulative probability exceeds $p$.
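
The three controls can be sketched in plain Python on a made-up logit vector. The helper names (`softmax`, `top_k_filter`, `top_p_filter`) and the numbers are assumptions for illustration; this is not how `model.generate` is implemented internally, only the same idea at small scale.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely token indices and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary

base = softmax(logits)
sharp = softmax(logits, temperature=0.5)  # low temperature: mass moves to the top token
flat = softmax(logits, temperature=2.0)   # high temperature: mass spreads out
print("top-token probability:", f"T=0.5 {sharp[0]:.3f}, T=1 {base[0]:.3f}, T=2 {flat[0]:.3f}")

top2 = top_k_filter(base, k=2)
nucleus = top_p_filter(base, p=0.8)
print("top-k keeps:", sorted(top2), "top-p keeps:", sorted(nucleus))

# Draw one token index from the filtered, renormalized distribution.
random.seed(0)
candidates = list(nucleus)
sampled = random.choices(candidates, weights=[nucleus[i] for i in candidates], k=1)[0]
print("sampled token index:", sampled)
```

Top-k always keeps a fixed number of candidates, while top-p keeps more candidates when the distribution is flat and fewer when it is peaked.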

In [None]:
def generate_sampling(text, temperature=0.7, top_k=50, top_p=0.9, max_new_tokens=40):
    input_ids = tokenizer.encode(text, return_tensors="pt").to(device)
    
    output = model.generate(
        input_ids, 
        max_new_tokens=max_new_tokens, 
        do_sample=True, 
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        pad_token_id=tokenizer.eos_token_id
    )
    
    return tokenizer.decode(output[0], skip_special_tokens=True)

print("Sampling Generation (Creative):")
print(generate_sampling("The future of artificial intelligence is"))

## 3. In-Context Learning (Zero-shot & Few-shot)

One of the most powerful features of LLMs is their ability to learn from the context provided in the prompt, without updating weights.

### Zero-shot Learning
Asking the model to perform a task without any examples.

In [None]:
# Zero-shot Sentiment Analysis
prompt_zero_shot = """
Classify the sentiment of the following sentence as Positive or Negative.
Sentence: The movie was shockingly bad!
Sentiment:"""

print("--- Zero-shot ---")
print(generate_greedy(prompt_zero_shot, max_new_tokens=3))

### Few-shot Learning
Providing a few examples (shots) in the prompt to guide the model.
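
Zero-shot and few-shot prompts differ only in how many solved examples precede the query, so both can be produced by one small helper. The function `build_prompt` below is a hypothetical convenience written for this sketch, not part of any library; it assumes the same `Sentence:` / `Sentiment:` layout used in this notebook.

```python
def build_prompt(instruction, examples, query):
    """Assemble a zero-shot (no examples) or few-shot prompt as plain text."""
    parts = [instruction, ""]
    for sentence, label in examples:
        parts += [f"Sentence: {sentence}", f"Sentiment: {label}", ""]
    # The prompt ends right after "Sentiment:" so the model's next tokens are the label.
    parts += [f"Sentence: {query}", "Sentiment:"]
    return "\n".join(parts)

instruction = "Classify the sentiment of the following sentences as Positive or Negative."
shots = [
    ("I hated this movie.", "Negative"),
    ("This is the best day of my life.", "Positive"),
]
query = "The movie was shockingly bad!"

zero_shot_prompt = build_prompt(instruction, [], query)
few_shot_prompt = build_prompt(instruction, shots, query)
print(few_shot_prompt)
```

Keeping the example format identical to the query format matters: the model continues the pattern it sees.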

In [None]:
# Few-shot Sentiment Analysis
prompt_few_shot = """
Classify the sentiment of the following sentences as Positive or Negative.

Sentence: I hated this movie.
Sentiment: Negative

Sentence: This is the best day of my life.
Sentiment: Positive

Sentence: The movie was shockingly bad!
Sentiment:"""

print("--- Few-shot ---")
print(generate_greedy(prompt_few_shot, max_new_tokens=3))

## 4. Model Adaptation: Parameter-Efficient Fine-Tuning (PEFT)

Fine-tuning a full LLM (billions of parameters) is expensive. **PEFT** methods allow us to adapt models by training only a small number of extra parameters.

### LoRA (Low-Rank Adaptation)
LoRA freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture.

$$ W_{new} = W_{pretrained} + B \cdot A $$

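Where $W \in \mathbb{R}^{d \times k}$ is a weight matrix (e.g., in the attention layer), and $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ are small trainable matrices with rank $r \ll \min(d, k)$. In practice the update $B \cdot A$ is also multiplied by a scaling factor (`lora_alpha / r` in the `peft` library).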
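
The saving comes from the shapes: for a $d \times k$ weight matrix, $B$ is $d \times r$ and $A$ is $r \times k$, so the update has $r(d + k)$ trainable parameters instead of $d \cdot k$. The sketch below shows a tiny rank-1 update and then counts parameters for one layer. The layer shape (768 inputs, 2304 outputs, as in a GPT-2 small attention projection) and $r = 8$ are assumptions chosen for illustration.

```python
def matmul(B, A):
    """Multiply two matrices given as lists of rows."""
    return [[sum(b * a for b, a in zip(row, col)) for col in zip(*A)] for row in B]

# Tiny illustration: B is 3x1 and A is 1x3, so B @ A is a 3x3 matrix of rank 1,
# described by only 6 numbers instead of 9.
B = [[1.0], [2.0], [3.0]]
A = [[0.5, 0.0, -1.0]]
delta_W = matmul(B, A)
for row in delta_W:
    print(row)

def lora_param_count(d, k, r):
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)

# Assumed shapes: one projection with 768 inputs and 2304 outputs, rank r = 8.
d, k, r = 2304, 768, 8
full_params = d * k
lora_params = lora_param_count(d, k, r)
print(f"full update: {full_params:,} parameters")
print(f"LoRA update: {lora_params:,} parameters ({100 * lora_params / full_params:.2f}% of full)")
```

Every row of `delta_W` is a multiple of the single row of `A`, which is what "low rank" means here.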

*Note: The code below requires the `peft` library.*

In [None]:
# This is a demonstration of how to set up LoRA. 
# It requires the 'peft' library installed.

try:
    from peft import LoraConfig, get_peft_model, TaskType

    # Define LoRA Configuration
    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM, 
        inference_mode=False, 
        r=8,            # Rank of the low-rank matrices
        lora_alpha=32,  # Scaling factor
        lora_dropout=0.1
    )

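    # Wrap the base model with LoRA
    # We reuse the 'model' loaded earlier; note that get_peft_model injects the
    # LoRA layers into it in place, so 'model' itself is modified as well
    peft_model = get_peft_model(model, peft_config)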

    print("LoRA Model Created!")
    peft_model.print_trainable_parameters()
    
    # Now 'peft_model' can be trained just like a standard PyTorch model, 
    # but only a tiny fraction of parameters will be updated!

except ImportError:
    print("The 'peft' library is not installed. Please install it with `pip install peft` to run this section.")

### Fine-tuning with LoRA (Toy Example)

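Now that we have our `peft_model`, we can train it using a standard PyTorch loop. Notice that we are only updating the LoRA parameters (roughly 0.24% of the total for this configuration; `print_trainable_parameters()` above reports the exact figure). The backward pass still runs through the whole network, but gradients and optimizer state are stored only for the LoRA parameters, which makes training much less memory-intensive.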

Let's try to fine-tune the model on a single sentence to see the loss decrease.
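
The loss being minimized is the average negative log-likelihood of each token given the tokens before it. When `labels` equals `input_ids`, Hugging Face causal LM models shift the labels by one position internally, so position $t$ is scored against token $t+1$. The sketch below reproduces that computation by hand on invented probabilities; `causal_lm_loss` is a hypothetical helper, and the three-token vocabulary is made up.

```python
import math

def causal_lm_loss(step_probs, token_ids):
    """Average negative log-likelihood of each token given the tokens before it.

    step_probs[t] is the distribution over the vocabulary after reading
    token_ids[: t + 1]; it is scored against token_ids[t + 1] (the shift).
    """
    losses = []
    for t in range(len(token_ids) - 1):
        target = token_ids[t + 1]
        losses.append(-math.log(step_probs[t][target]))
    return sum(losses) / len(losses)

token_ids = [0, 2, 1]  # hypothetical 3-token sequence over a vocabulary of size 3
step_probs = [
    [0.1, 0.2, 0.7],  # after token 0: assigns 0.7 to the true next token (2)
    [0.2, 0.5, 0.3],  # after tokens 0, 2: assigns 0.5 to the true next token (1)
    [0.3, 0.3, 0.4],  # after the last token: no target exists, so it is ignored
]

loss = causal_lm_loss(step_probs, token_ids)
perplexity = math.exp(loss)  # perplexity is the exponential of the average loss
print(f"loss = {loss:.4f}, perplexity = {perplexity:.4f}")
```

As training raises the probability assigned to each true next token, this average falls toward zero.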


In [None]:
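# Define optimizer over the trainable (LoRA) parameters only; frozen base weights are excluded
trainable_params = [p for p in peft_model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-3)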

# Dummy training data
text = "Fine-tuning LLMs is efficient with LoRA."
inputs = tokenizer(text, return_tensors="pt").to(device)

# Labels are the same as inputs for Causal LM
# (the model shifts them by one position internally when computing the loss)
labels = inputs.input_ids.clone()

print("Starting training...")
peft_model.train()

for epoch in range(10):
    optimizer.zero_grad()
    
    # Forward pass
    outputs = peft_model(input_ids=inputs.input_ids, labels=labels)
    loss = outputs.loss
    
    # Backward pass
    loss.backward()
    optimizer.step()
    
    print(f"Epoch {epoch+1}: Loss = {loss.item():.4f}")

print("Training finished! The model has adapted to this specific sentence.")


## Summary
In this notebook, we covered:
1.  **Language Modeling**: Predicting the next token is the fundamental task of GPT models.
2.  **Generation**: Sampling strategies (Temperature, Top-p) allow for diverse and creative outputs compared to greedy search.
3.  **In-Context Learning**: LLMs can perform tasks given just a prompt (Zero-shot) or a few examples (Few-shot).
4.  **Adaptation**: Techniques like LoRA allow us to fine-tune massive models efficiently.