<a href="https://colab.research.google.com/github/AnasAlhasan/large-models-course/blob/main/notebooks/InferenceForLLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**3. Techniques to Make Inference Efficient**
**(A) Quantization**

Store weights in lower precision (e.g., 8-bit instead of 16/32-bit).

Example:

Original weight = 123.4567 (float32)

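Quantized = 123 (int8). This assumes a scale factor of 1; in practice each weight is first divided by a scale factor so that the whole tensor fits the int8 range of -128 to 127.

Trade-off: the model is smaller and usually faster, at the cost of a small loss in accuracy.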
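
To make the idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. The function names `quantize_int8` and `dequantize` and the sample weights are illustrative assumptions, not part of any library, and real schemes such as the one used by `bitsandbytes` are more involved.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> int8 values plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 127; guard against an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [v * scale for v in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q)          # int8 codes, e.g. the largest-magnitude weight maps to -127
print(max_error)  # small compared with the weights themselves
```

Note that the tiny weight 0.003 rounds to 0: this is where the accuracy loss mentioned above comes from.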

👉 Colab Demo (quantization on a small Hugging Face model):

In [13]:
# First install the required dependencies
!pip install -U bitsandbytes
!pip install -U transformers accelerate

In [17]:

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Configure 8-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

# Load model with error handling
model_name = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(model_name)

try:
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        quantization_config=quantization_config
    )
except Exception as e:
    print(f"Error loading quantized model: {e}")
    print("Falling back to FP16 precision...")
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        torch_dtype=torch.float16
    )

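# Run inference; send the inputs to whichever device the model was placed on
# (hardcoding "cuda" fails on CPU-only runtimes)
inputs = tokenizer("The future of AI is", return_tensors="pt").to(model.device)
with torch.no_grad():
    # max_length counts the prompt tokens as well as the generated ones
    outputs = model.generate(**inputs, max_length=30)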
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Error loading quantized model: Using `bitsandbytes` 8-bit quantization requires the latest version of bitsandbytes: `pip install -U bitsandbytes`
Falling back to FP16 precision...
The future of AI is in the hands of the AI community

The future of AI is in the hands of the AI community

The

Note: in this run the 8-bit load failed and the code fell back to FP16, so the text above was generated by the unquantized model. In Colab, restarting the runtime after installing `bitsandbytes` usually lets the 8-bit path load.


**(B) Pruning**

Remove unnecessary neurons/weights (like trimming branches of a tree).

Example: if a neuron’s output is always near zero → cut it.

Pruning can speed up inference, but the model usually needs fine-tuning afterwards to recover the lost accuracy.
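
As a rough illustration, the sketch below prunes by weight magnitude, which is a common stand-in for the "output is always near zero" criterion above (it looks at the weights themselves rather than at neuron activations). The function name `magnitude_prune` and the sample weights are assumptions made for this example.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_prune = set(order[:n_prune])
    return [0.0 if i in to_prune else w for i, w in enumerate(weights)]


# Hypothetical weights of a single neuron, for illustration only.
weights = [0.8, -0.01, 0.3, 0.002, -0.6, 0.05]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)  # the three smallest-magnitude weights are now 0.0
```

Zeroing individual weights like this only saves time when the runtime can exploit sparsity; removing whole neurons or attention heads shrinks the matrices directly. PyTorch ships comparable utilities in `torch.nn.utils.prune`.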

**(C) Distillation**

Train a smaller “student” model to mimic a large teacher model.

Example: DistilBERT is a distilled version of BERT. Its authors report that it is about 40% smaller and 60% faster while retaining roughly 97% of BERT's language-understanding performance.
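
The "mimic" step is typically a loss that pulls the student's output distribution toward the teacher's. Below is a minimal sketch of one common formulation, a temperature-softened KL divergence, written in plain Python. The function names, logits, and temperature value are illustrative assumptions; real training code would compute this on batched tensors and usually mix it with the ordinary task loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits over three classes.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]     # disagrees with the teacher

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(loss_close < loss_far)  # a student that matches the teacher gets a lower loss
```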

#**5. Hands-On: Prompt Engineering**

Once the model runs, how you phrase the prompt can matter as much as model size for output quality.

Example:

In [21]:
prompt = "Translate this English text to French: 'How are you today?'"
# Reuses `tokenizer` and `model` from the quantization cell above
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


Translate this English text to French: 'How are you today?'

The following is a translation of the text of the letter from the French Ministry of Foreign Affairs to the French Ambassador to the United States.

Dear Ambassador,
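
The output above shows the model continuing the text instead of translating it. OPT-350m is a base language model trained to predict the next token, not to follow instructions. One common workaround is few-shot prompting: show a few input/output pairs so that the translation becomes the most natural continuation. The helper below is a hypothetical sketch of that pattern; the function name and example pairs are assumptions, and a model this small may still translate poorly.

```python
def build_few_shot_prompt(examples, query):
    """Format (English, French) pairs, then the new query, as a completion-style prompt."""
    lines = []
    for english, french in examples:
        lines.append(f"English: {english}")
        lines.append(f"French: {french}")
        lines.append("")  # blank line between examples
    lines.append(f"English: {query}")
    lines.append("French:")  # the model is expected to continue from here
    return "\n".join(lines)


# Illustrative example pairs, not taken from the original notebook.
examples = [
    ("Good morning.", "Bonjour."),
    ("Thank you very much.", "Merci beaucoup."),
]
prompt = build_few_shot_prompt(examples, "How are you today?")
print(prompt)
```

The resulting `prompt` string can be passed to `tokenizer` and `model.generate` exactly as in the cell above.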

