# Module 2: LLM Basics & Inference (Low VRAM)

**Goal**: Load a capable LLM compressed to 4-bit precision so that it fits on a 4GB GPU, then chat with it.

**The Model**: `Qwen/Qwen2.5-1.5B-Instruct`.
Why this model? At 1.5 billion parameters it is small enough for low-VRAM hardware, yet it is among the stronger models of its size and outperforms many older, larger models. That makes it a good fit for learning.

**Technique**:
- **Quantization**: We store each weight with fewer bits (4 instead of 16). This cuts the VRAM needed for the weights by roughly 70% with only a small loss in quality.
- **Library**: We use `bitsandbytes`, which quantizes the weights on the fly as the model loads.
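
To make the saving concrete, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It assumes a round figure of 1.5 billion parameters and ignores activations, the KV cache, and quantization constants, so treat the numbers as rough guidance rather than a measurement.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 1.5e9  # assumption: round figure for a "1.5B" model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)
saving = 1 - int4_gb / fp16_gb

print(f"16-bit weights: {fp16_gb:.2f} GB")   # 3.00 GB
print(f"4-bit weights:  {int4_gb:.2f} GB")   # 0.75 GB
print(f"Ideal saving:   {saving:.0%}")       # 75%
```

The ideal figure is 75%. In practice the saving is somewhat lower, which is where the ~70% estimate comes from: some parts of the model (such as embeddings and normalization layers) typically stay in 16-bit, and the quantization constants take a little extra space.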

## 1. Import Libraries

In [2]:
import gc

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Free memory left over from earlier runs (useful when re-running cells)
gc.collect()
torch.cuda.empty_cache()

## 2. Configure 4-bit Quantization
This is the key step. We create a config that tells `transformers` to load the weights in 4-bit NF4 (NormalFloat 4) format, which keeps memory use to a minimum.

In [3]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,      # Also quantize the quantization constants (extra space saving)
    bnb_4bit_quant_type="nf4",           # NormalFloat 4: levels chosen to suit normally distributed weights
    bnb_4bit_compute_dtype=torch.float16 # Weights are stored in 4-bit, but the math runs in 16-bit
)
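
To build intuition for what 4-bit storage means, here is a deliberately simplified sketch of block-wise quantization in plain Python. It is not how `bitsandbytes` implements NF4: real NF4 uses 16 non-uniformly spaced levels, while this sketch uses evenly spaced integer codes. The function names and the example block are made up for illustration.

```python
def quantize_block(weights, levels=7):
    """Map floats to integer codes in [-levels, levels] with one scale per block.

    With levels=7 there are 15 possible codes, which fit in 4 bits.
    """
    scale = max(abs(w) for w in weights) or 1.0  # absmax scaling; avoid dividing by 0
    codes = [round(w / scale * levels) for w in weights]
    return codes, scale


def dequantize_block(codes, scale, levels=7):
    """Recover approximate floats from the integer codes and the block scale."""
    return [c / levels * scale for c in codes]


block = [0.12, -0.50, 0.03, 0.31]        # hypothetical weights
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)

print(codes)                              # [2, -7, 0, 4]
print([round(r, 3) for r in restored])
```

Each weight is stored as a small integer plus one shared scale per block. The scales are the "quantization constants" that `bnb_4bit_use_double_quant=True` compresses further.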

## 3. Load the Model
The first run downloads the full 16-bit checkpoint (~3GB), which might take a minute. Quantization to 4-bit happens as the weights are loaded.

In [4]:
model_id = "Qwen/Qwen2.5-1.5B-Instruct"

print("Loading model... this controls the VRAM usage.")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto" # Automatically places layers on the available devices (GPU first)
)

print("Model loaded successfully!")

Loading model... this controls the VRAM usage.
Model loaded successfully!


## 4. Run Inference (Chat)
Let's ask it a question.
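
Chat models expect the conversation in a specific text layout. The cell below uses `tokenizer.apply_chat_template` to produce it. As a rough illustration of the ChatML layout, here is a hand-written formatter. The `to_chatml` helper is hypothetical, and the model's real template may differ in details (for example, how it handles a missing system message), so prefer the tokenizer's template in practice.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in plain ChatML layout."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


demo_messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Hello!"},
]
prompt = to_chatml(demo_messages)
print(prompt)
```

Setting `add_generation_prompt=True` ends the prompt with an open assistant turn, so the model writes a reply instead of continuing the user's message.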

In [5]:
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to check for prime numbers."}
]

# Format the prompt with the model's chat template (Qwen uses the ChatML format)
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True
    )

# Decode only the new tokens by slicing the prompt off each output sequence
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Sure! Here's an example Python function that checks if a number is prime:

```python
def is_prime(n):
    # Check if n is less than 2 (not prime)
    if n < 2:
        return False
    
    # Iterate from 2 to the square root of n
    for i in range(2, int(n**0.5) + 1):
        # If n is divisible by any number between 2 and its square root,
        # it's not prime
        if n % i == 0:
            return False
    
    # If we've gone through all possible divisors without finding one,
    # n must be prime
    return True
```

This function takes an integer `n` as input and returns `True` if `n` is prime, or `False` otherwise.

Here's how you can use this function:

```python
# Example usage
print(is_prime(7))   # Output: True
print(is_prime(4))   # Output: False
print(is_prime(9))   # Output: False
```

Note that this implementation uses a basic primality test based on trial division. It may not perform well for very large numbers due to time complexity constraints. There are more 
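
The generation call above passes `do_sample=True` and `temperature=0.7`. As a rough sketch of what temperature does, the snippet below rescales a made-up set of logits before turning them into probabilities. The helper name and the logits are illustrative assumptions; the library's sampling pipeline involves more steps than this.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]                     # hypothetical scores for three tokens
for t in (1.0, 0.7, 0.2):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the top-scoring token, which makes output more predictable. Higher temperatures flatten the distribution and make output more varied.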

## 5. Memory Check
Let's see how much GPU memory PyTorch currently has allocated for the model and its tensors. It should be well under 4GB.

In [6]:
print(f"Updated VRAM usage: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")

Updated VRAM usage: 1.17 GB
