<div align="center">
<img src="https://poorit.in/image.png" alt="Poorit" width="40" style="vertical-align: middle;"> <b>AI SYSTEMS ENGINEERING 1</b>

## Unit 3: Running Open-Source LLMs

**CV Raman Global University, Bhubaneswar**  
*AI Center of Excellence*

---

</div>


### What You'll Learn

In this notebook, you will:

1. **Load and run an open-source LLM** (GPT-2) using the Transformers library
2. **Understand generation parameters** — temperature, top_p, sampling
3. **Use a chat-style model** (TinyLlama) for instruction-following

**Duration:** ~45 minutes

**Note:** This notebook works on CPU but runs faster on GPU. On Colab, go to **Runtime > Change runtime type > T4 GPU**.

---

## 1. Environment Setup

In [None]:
# Install required packages
!pip install -q transformers accelerate torch

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Check if GPU is available
if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("Running on CPU — this is fine! Models will just be a bit slower.")
    print("For faster inference, use Colab with a T4 GPU runtime.")

---

## 2. Loading a Small Model (GPT-2)

In notebook 01 we used the `pipeline` API. Here we'll load the model and tokenizer directly, which gives us more control.

The base `gpt2` checkpoint has about 124M parameters, small enough to run on CPU.

In [None]:
# Load GPT-2 model and tokenizer
model_name = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

print(f"Model loaded on: {device}")
print(f"Parameters: {model.num_parameters() / 1e6:.1f}M")

In [None]:
# Generate text
prompt = "India is a country known for"

inputs = tokenizer(prompt, return_tensors="pt").to(device)

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
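
Under the hood, `generate` runs an autoregressive loop: get a probability distribution over the next token, pick one, append it, and repeat until `max_new_tokens` is reached or an end-of-sequence token appears. The sketch below imitates that loop with a hand-written bigram table standing in for GPT-2. The table, `toy_generate`, and `EOS` are hypothetical names made up for illustration, not part of the Transformers API.

```python
import random

# Hypothetical toy "model": next-token probabilities given only the previous token.
BIGRAMS = {
    "India": {"is": 1.0},
    "is": {"a": 0.7, "known": 0.3},
    "a": {"country": 1.0},
    "country": {"known": 0.6, "<eos>": 0.4},
    "known": {"for": 1.0},
    "for": {"<eos>": 1.0},
}
EOS = "<eos>"


def toy_generate(prompt_tokens, max_new_tokens, do_sample, rng):
    """Extend prompt_tokens one token at a time, like a tiny generate()."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = BIGRAMS[tokens[-1]]  # distribution over the next token
        if do_sample:
            # Sampling: draw a token in proportion to its probability.
            next_token = rng.choices(list(dist), weights=list(dist.values()))[0]
        else:
            # Greedy decoding: always take the most likely token.
            next_token = max(dist, key=dist.get)
        if next_token == EOS:
            break  # stop early at end-of-sequence
        tokens.append(next_token)
    return tokens


greedy = toy_generate(["India", "is"], max_new_tokens=10, do_sample=False,
                      rng=random.Random(0))
print(" ".join(greedy))  # -> India is a country known for

sampled = toy_generate(["India", "is"], max_new_tokens=10, do_sample=True,
                       rng=random.Random(0))
print(" ".join(sampled))
```

With `do_sample=False` the loop always follows the most likely continuation, so repeated runs give identical text. With `do_sample=True` the continuation can differ from run to run.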

---

## 3. Understanding Generation Parameters

When generating text, several parameters control the output:

| Parameter | Description | Effect |
|-----------|-------------|--------|
| `max_new_tokens` | Maximum tokens to generate | Controls output length |
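| `temperature` | Scales the logits before sampling; must be greater than 0 (typical range 0.1 to 2.0) | Higher = more creative/random; only takes effect when `do_sample=True` |
| `top_p` | Nucleus sampling threshold (between 0 and 1) | Samples only from the smallest set of tokens whose cumulative probability reaches `top_p` |
| `do_sample` | Enable sampling | `False` = greedy decoding (always picks the most likely token) |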
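
To make the `top_p` row concrete, here is a minimal sketch of nucleus filtering on a made-up four-token distribution. The function `top_p_filter` and the probabilities are illustrative assumptions; the real implementation in Transformers works on logit tensors, but the idea is the same.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # enough probability mass collected
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical next-token probabilities after "India is a country known for".
probs = {"food": 0.5, "culture": 0.3, "cricket": 0.15, "snow": 0.05}

nucleus = top_p_filter(probs, top_p=0.9)
print(nucleus)  # "snow" is dropped; the other three are rescaled
```

Sampling then draws only from the surviving tokens, which removes the unlikely tail without forcing greedy decoding.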

Let's see how **temperature** affects the output.

In [None]:
# Compare different temperatures
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

for temp in [0.3, 0.7, 1.2]:
    outputs = model.generate(
        **inputs,
        max_new_tokens=30,
        do_sample=True,
        temperature=temp,
        pad_token_id=tokenizer.eos_token_id
    )
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Temperature {temp}:")
    print(f"  {text}\n")

**What you should notice:**
- **Low temperature (0.3)** — more focused and repetitive
- **Medium temperature (0.7)** — good balance of coherence and variety
- **High temperature (1.2)** — more creative but less predictable
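
The reason behind this pattern is that temperature divides the model's raw scores (logits) before they are turned into probabilities with softmax. The sketch below uses three made-up logits; `softmax_with_temperature` is a hypothetical helper written for this illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens

for temp in [0.3, 0.7, 1.2]:
    probs = softmax_with_temperature(logits, temp)
    print(f"temperature={temp}: {[round(p, 3) for p in probs]}")
```

At low temperature almost all the probability sits on the top-scoring token, so sampling behaves nearly like greedy decoding. At higher temperature the distribution flattens and less likely tokens get picked more often.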

---

## 4. Chat-Style Models (TinyLlama)

GPT-2 is a base model: it only predicts the next token and has not been trained to follow instructions. Chat models like TinyLlama-Chat are **instruction-tuned** and can follow chat-style prompts.

TinyLlama (1.1B parameters) is small enough for CPU but much more capable than GPT-2.
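
A chat model still receives one flat string of text. The list of messages is converted into that string by the tokenizer's chat template, which wraps each turn in role markers. The sketch below approximates the layout shown on the TinyLlama model card. `render_chat` is a hypothetical helper for illustration only, and the exact whitespace may differ from the real template; in practice `tokenizer.apply_chat_template(...)` does this for you.

```python
def render_chat(messages, add_generation_prompt=True):
    """Flatten chat messages into one prompt string (approximate layout)."""
    parts = [f"<|{m['role']}|>\n{m['content']}</s>\n" for m in messages]
    if add_generation_prompt:
        # Cue the model that the assistant's turn comes next.
        parts.append("<|assistant|>\n")
    return "".join(parts)


example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain machine learning in simple terms."},
]

prompt_text = render_chat(example_messages)
print(prompt_text)
```

Because the model was fine-tuned on text in this layout, it continues the string by writing the assistant's reply.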

In [None]:
# Load TinyLlama using the pipeline API (simpler than manual loading)
chat = pipeline(
    "text-generation",
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device=0 if torch.cuda.is_available() else -1
)

print("TinyLlama loaded!")

In [None]:
# Chat with TinyLlama using the messages format
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain machine learning in simple terms."}
]

result = chat(
    messages,
    max_new_tokens=150,
    do_sample=True,
    temperature=0.7
)

print(result[0]["generated_text"][-1]["content"])
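
The indexing in the last line is easier to read once you see the shape of the return value. When given a list of messages, the pipeline returns the whole conversation with the model's reply appended as the final message. The sketch below uses a hand-written stand-in (`mock_result`, an assumed shape with placeholder reply text) so it runs without loading a model.

```python
# Hand-written stand-in for the pipeline's return value (assumed shape).
mock_result = [
    {
        "generated_text": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain machine learning in simple terms."},
            {"role": "assistant", "content": "(the model's reply would appear here)"},
        ]
    }
]

first_output = mock_result[0]                 # one entry per generated sequence
conversation = first_output["generated_text"] # full list of chat messages
reply = conversation[-1]                      # the newly added assistant turn

print(reply["role"])
print(reply["content"])
```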

In [None]:
# Try another question
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are 3 tips for learning to code?"}
]

result = chat(
    messages,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7
)

print(result[0]["generated_text"][-1]["content"])

---

## 5. Exercise

Experiment with generation parameters! Change the `temperature` and `max_new_tokens` below and observe how the output changes.

In [None]:
# Exercise: Change the parameters and run this cell multiple times
# Try: temperature=0.2, temperature=1.5, max_new_tokens=50 vs 200

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short poem about coding."}
]

result = chat(
    messages,
    max_new_tokens=100,   # Try changing this
    do_sample=True,
    temperature=0.7       # Try changing this
)

print(result[0]["generated_text"][-1]["content"])

---

### Going Further

In this notebook we used small models (GPT-2 at 124M and TinyLlama at 1.1B parameters). To run larger, more capable models:

- **Quantization** stores model weights at lower precision (e.g., 4-bit instead of 16-bit), cutting the memory needed for the weights by roughly 4x. This lets you run 7B parameter models on a consumer GPU. Libraries like `bitsandbytes` make this easy.
- **Gated models** (like Llama 3) require a free Hugging Face account and access approval. You authenticate using `huggingface_hub.login()` with a token from [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
- **Larger GPUs** (A100, H100) or cloud services let you run even bigger models without quantization.
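
A rough back-of-the-envelope calculation shows why quantization matters. The sketch below estimates the memory for the weights alone as parameters × bits per parameter. `weight_memory_gb` is a hypothetical helper, and the estimate deliberately ignores activations, the KV cache, and any quantization overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


models = [
    ("GPT-2", 124e6),
    ("TinyLlama", 1.1e9),
    ("7B model", 7e9),
]

for name, num_params in models:
    fp16 = weight_memory_gb(num_params, 16)
    int4 = weight_memory_gb(num_params, 4)
    print(f"{name:10s} 16-bit: {fp16:5.2f} GB   4-bit: {int4:5.2f} GB")
```

By this estimate a 7B model needs about 14 GB for 16-bit weights but about 3.5 GB at 4-bit.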

---

## Key Takeaways

1. **AutoModelForCausalLM** loads text generation models; pair it with the matching tokenizer

2. **Temperature controls randomness** — low for focused output, high for creative output

3. **Chat-style models** (like TinyLlama) follow instructions and can be used with the `pipeline` API and message format

---

## Additional Resources

- [Hugging Face Model Hub](https://huggingface.co/models)
- [Text Generation Strategies](https://huggingface.co/docs/transformers/generation_strategies)
- [TinyLlama Model Card](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)

---

**Course Information:**
- **Institution:** CV Raman Global University, Bhubaneswar
- **Program:** AI Center of Excellence
- **Course:** AI Systems Engineering 1
- **Developed by:** [Poorit Technologies](https://poorit.in) — *Transform Graduates into Industry-Ready Professionals*

---