# 🧠 Mini-LLM Demo

This notebook demonstrates text generation with Mini-LLM, an 80M-parameter language model built from scratch using components common in modern LLM architectures (RoPE, RMSNorm, SwiGLU, GQA).

**Model on HuggingFace:** https://huggingface.co/Ashx098/Mini-LLM

---

## Setup

First, let's install the required dependencies:

In [None]:
!pip install -q torch transformers sentencepiece

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

## Load Mini-LLM from HuggingFace

In [None]:
model_name = "Ashx098/Mini-LLM"

print(f"Loading model from {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
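model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # bfloat16 halves weight memory on GPU; stay in float32 on CPU
    torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32,
).to(device)  # explicit placement; device_map="auto" would need accelerate, which is not installed above
model.eval()  # inference mode: disables dropout

print("Model loaded successfully!")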
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
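As a rough sanity check on the dtype choice above, weight memory is simply the parameter count times the bytes per parameter. The sketch below uses the nominal 80M figure quoted in this notebook rather than the exact count printed by the cell; the helper name `estimate_weight_memory_mb` is made up for illustration, and the estimate ignores activations and the KV cache.

```python
# Bytes per parameter for common dtypes (standard sizes).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def estimate_weight_memory_mb(num_params: int, dtype: str) -> float:
    """Approximate weight memory in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


# Assumption: 80M is the nominal size; the real count is whatever
# sum(p.numel() ...) reports for the loaded model.
nominal_params = 80_000_000
for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: ~{estimate_weight_memory_mb(nominal_params, dtype):.0f} MB")
# float32: ~320 MB
# bfloat16: ~160 MB
```

Either way the weights are small enough to fit comfortably on a CPU-only machine, which is why the float32 fallback is practical here.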

## Text Generation Function

In [None]:
def generate_text(prompt, max_new_tokens=100, temperature=0.8, top_p=0.95):
    """
    Generate text using Mini-LLM.

    Args:
        prompt: Input text to start generation
        max_new_tokens: Maximum number of tokens to generate
        temperature: Controls randomness (higher = more varied output)
        top_p: Nucleus sampling threshold (cumulative probability cutoff)

    Returns:
        A tuple (full text including the prompt, newly generated text only)
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
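    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Decode only the newly generated tokens. Slicing the decoded string with
    # len(prompt) is fragile, because decoding does not always reproduce the
    # prompt character for character.
    prompt_length = inputs["input_ids"].shape[1]
    generated_only = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

    return generated_text, generated_only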

## Try It Out!

### Example 1: Creative Writing

In [None]:
prompt = "Once upon a time, there was a"
full_output, generated = generate_text(prompt, max_new_tokens=80, temperature=0.8)

print(f"Prompt: {prompt}")
print(f"Generated: {generated}")

### Example 2: Coding

In [None]:
prompt = "def fibonacci(n):"
full_output, generated = generate_text(prompt, max_new_tokens=60, temperature=0.3)

print(f"Prompt: {prompt}")
print(f"Generated:\n{generated}")

### Example 3: Knowledge Retrieval

In [None]:
prompt = "The capital of France is"
full_output, generated = generate_text(prompt, max_new_tokens=30, temperature=0.5)

print(f"Prompt: {prompt}")
print(f"Generated: {generated}")

### Example 4: Story Continuation

In [None]:
prompt = "In the year 2050, artificial intelligence had"
full_output, generated = generate_text(prompt, max_new_tokens=100, temperature=0.9)

print(f"Prompt: {prompt}")
print(f"Generated: {generated}")

### Try Your Own Prompt!

In [None]:
# Change this prompt to whatever you want!
custom_prompt = "The meaning of life is"

full_output, generated = generate_text(custom_prompt, max_new_tokens=80, temperature=0.7)

print(f"Prompt: {custom_prompt}")
print(f"Generated: {generated}")

## Understanding the Parameters

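| Parameter | What It Does | Low Value | High Value |
|-----------|--------------|-----------|------------|
| **temperature** | Scales the logits before sampling | Sharper distribution: more predictable, more repetitive | Flatter distribution: more varied, less coherent |
| **top_p** | Nucleus sampling: samples only from the smallest set of tokens whose cumulative probability reaches `top_p` | Fewer candidate tokens, more conservative | More candidate tokens, more diverse vocabulary |
| **max_new_tokens** | Upper bound on the number of generated tokens | Shorter outputs | Longer outputs |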

**Tips:**
- Use `temperature=0.5-0.7` for factual content
- Use `temperature=0.8-1.0` for creative writing
- Use `temperature=0.2-0.4` for code generation
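
To make the effect of `temperature` and `top_p` concrete, here is a simplified pure-Python sketch of both steps on a toy distribution. It is not the code `transformers` runs internally (which works on logit tensors); the function names and the four toy logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens

# Lower temperature concentrates probability on the top token.
for t in (0.5, 1.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])

# A smaller top_p removes the unlikely tail before sampling.
print(top_p_filter(softmax_with_temperature(toy_logits, 1.0), top_p=0.8))
```

The two settings compose: temperature reshapes the distribution first, and `top_p` then trims its tail, so a low temperature combined with a low `top_p` leaves very few candidate tokens.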

## About Mini-LLM

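**Mini-LLM** is an 80M-parameter decoder-only transformer built from scratch with:

- **RoPE** (Rotary Position Embeddings) - Encodes position by rotating query and key vectors, so attention scores depend on relative position
- **RMSNorm** - Normalizes by the root mean square only, skipping LayerNorm's mean-centering step, which makes it cheaper to compute
- **SwiGLU** - Gated feed-forward activation used in many recent LLMs
- **GQA** (Grouped Query Attention) - Groups of query heads share key/value heads, which shrinks the KV cache
- **SentencePiece BPE** - Subword tokenization with a 32K vocabulary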
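
Two of the components above are small enough to sketch in plain Python. This is an illustrative sketch only, not Mini-LLM's actual implementation: the function names, the `eps` value, and the head counts (8 query heads, 2 key/value heads) are assumptions chosen for the example.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: scale x by the reciprocal of its root mean square,
    then by a learned per-element weight. No mean is subtracted."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """GQA: map a query head index to the key/value head its group shares."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


# After normalization the vector has a root mean square of about 1.
print(rms_norm([3.0, 4.0], weight=[1.0, 1.0]))

# Hypothetical head counts, not Mini-LLM's actual configuration.
print([kv_head_for_query_head(h, num_q_heads=8, num_kv_heads=2) for h in range(8)])
# [0, 0, 0, 0, 1, 1, 1, 1]
```

With this hypothetical layout, only 2 key/value heads need to be cached instead of 8, which is where GQA's memory saving during generation comes from.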

**GitHub:** https://github.com/yourusername/Mini-LLM  
**HuggingFace:** https://huggingface.co/Ashx098/Mini-LLM

---

*Built with ❤️ by MSR Avinash*