# Lecture 6: Loading Models with `from_pretrained`
In this notebook, we will explore how to load pre-trained models using the `from_pretrained` method from the Hugging Face Transformers library. We will also dive into the configuration, weights, and caching mechanisms.

The Google Colab version of this notebook is available here: https://colab.research.google.com/drive/1gb3hu83Wktk5cUDObPJqjlM5olAhbvam?usp=sharing


## **Feeling Brave?**

Try out the code for this lecture on your laptop or desktop. The code from the Google Colab linked above is also provided here for anyone who wants to take on the challenge of running this lecture on their own machine.


**Consider the following** before running this notebook on your computer:
1. Make sure you have plenty of RAM (ideally >= 16 GB)

2. If you **do not have** an NVIDIA GPU, you will have to install the CPU version of bitsandbytes by following these steps:
    - In your terminal, activate your virtual environment and run `pip uninstall bitsandbytes` or in your notebook cell run `!pip uninstall -y bitsandbytes` (the `-y` flag skips the confirmation prompt, which a notebook cell cannot answer)
    - In your terminal run `pip install bitsandbytes-cpu` or in your notebook cell run `!pip install bitsandbytes-cpu`

3. If you **do** have an NVIDIA GPU
    - In your terminal, activate your virtual environment and run `pip uninstall torch torchvision torchaudio` or in your notebook cell run `!pip uninstall -y torch torchvision torchaudio`
    - In your terminal run `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu129` or in your notebook cell run `!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu129`

4. Make sure you have a stable internet connection, as the model download can take some time.

*Note* that running models without a GPU can be extremely time-consuming and may cause older machines to overheat.
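
Before choosing between options 2 and 3 above, it can help to check whether PyTorch can actually see an NVIDIA GPU. The following is a minimal sketch that also works when PyTorch is not installed yet; `pick_device` is a hypothetical helper name, not part of any library.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and sees an NVIDIA GPU, else "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed yet, so fall back to CPU for now
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Detected device: {device}")
```

If this prints `cpu` on a machine that has an NVIDIA GPU, the installed PyTorch build is probably a CPU-only one, which is the situation option 3 is meant to fix.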

# Step 1: Load libraries and log in to Hugging Face

In [None]:
import os
import torch
from huggingface_hub import login
from dotenv import load_dotenv

load_dotenv()

hf_token = os.getenv("HUGGINGFACE_API_KEY")
login(hf_token, add_to_git_credential=True)
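
`os.getenv` returns `None` when the variable is missing, and the problem then only surfaces later during login. A small guard makes the failure explicit. This is a sketch: `require_token` is a hypothetical helper, not part of `huggingface_hub`, and it assumes the token is stored under `HUGGINGFACE_API_KEY` as in the cell above.

```python
import os


def require_token(var_name: str = "HUGGINGFACE_API_KEY", env=None) -> str:
    """Return the token stored under var_name, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    token = env.get(var_name)
    if not token or not token.strip():
        raise RuntimeError(
            f"{var_name} is not set. Add it to your .env file before logging in."
        )
    # Strip stray whitespace that often sneaks in when copying tokens
    return token.strip()


# Example with an explicit mapping instead of the real environment
example_env = {"HUGGINGFACE_API_KEY": "  hf_example_token  "}
print(require_token(env=example_env))
```

In the notebook, the result of `require_token()` could be passed to `login` in place of `hf_token`.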

# Step 2: Load quantization configuration for model

In [None]:
from transformers import BitsAndBytesConfig

# Quantization config - 4-bit quantization lets us load the model with a much smaller memory footprint
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)

# This is a CPU equivalent of the GPU quantization config
# Uncomment the below if you are only using CPU

# quant_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_compute_dtype=torch.float32,
#     bnb_4bit_quant_type="nf4"
# )
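
To see why quantization matters, here is some back-of-the-envelope arithmetic. It assumes roughly 8 billion parameters (the "8B" in the model name used in Step 3) and counts the weights only. It ignores activations, the KV cache, and the small overhead that quantization adds for its scaling constants, so treat the numbers as rough estimates.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: about 8 billion parameters

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight memory to roughly a quarter, which is what makes the model fit on consumer hardware.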

# Step 3: Load a Pre-trained Model
We will use the `meta-llama/Meta-Llama-3.1-8B-Instruct` model as an example. This step demonstrates how to load the model; the tokenizer is loaded in Step 6.

In [None]:
from transformers import AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quant_config)

print(f"Model '{model_name}' loaded successfully!")

# Step 4: Explore the Model Configuration
The configuration of a model contains important details such as the number of layers, hidden size, and more.

In [None]:
config = model.config
print("Model Configuration:")
print(config)
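
The fields in the configuration are enough to estimate the size of the model. The sketch below hard-codes values that we believe match the published Llama 3.1 8B configuration; treat them as assumptions and compare them with what `print(config)` shows. The key names follow the usual Llama config attributes, and `estimate_params` is a hypothetical helper written for this illustration.

```python
# Assumed values; verify them against the printed config
cfg = {
    "vocab_size": 128256,
    "hidden_size": 4096,
    "intermediate_size": 14336,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
}


def estimate_params(cfg: dict, tied_embeddings: bool = False) -> int:
    """Estimate the parameter count of a Llama-style decoder from its config."""
    h = cfg["hidden_size"]
    head_dim = h // cfg["num_attention_heads"]
    kv_dim = head_dim * cfg["num_key_value_heads"]  # grouped-query attention

    attention = 2 * h * h + 2 * h * kv_dim    # q and o projections, k and v projections
    mlp = 3 * h * cfg["intermediate_size"]    # gate, up, and down projections
    norms = 2 * h                             # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms

    embeddings = cfg["vocab_size"] * h
    lm_head = 0 if tied_embeddings else embeddings
    final_norm = h
    return cfg["num_hidden_layers"] * per_layer + embeddings + lm_head + final_norm


total = estimate_params(cfg)
print(f"Estimated parameters: {total:,} (~{total / 1e9:.2f}B)")
```

The estimate should land close to the "8B" in the model name, which is a quick sanity check that the configuration fields mean what we think they mean.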

Let's have a look at the actual layers (just for fun!)

In [None]:
model

# Step 5: Understand Caching
When you load a model, it is cached locally so that it does not have to be downloaded again. By default, models are stored in `~/.cache/huggingface/hub`.

For further reference, see the cache management guide in the Hugging Face Hub documentation.
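
To find the cached files on your own machine, you can work out the folder from the environment. The sketch below uses only the standard library. It assumes the usual resolution order (`HF_HUB_CACHE`, then `HF_HOME`, then the default path above) and the `models--<org>--<name>` folder naming; it ignores other overrides such as `XDG_CACHE_HOME`. Both helper names are hypothetical.

```python
import os
from pathlib import Path


def hub_cache_dir(env=None) -> Path:
    """Resolve the Hub cache directory from environment variables (simplified)."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


def repo_folder_name(model_id: str) -> str:
    """Map a model ID to the folder name used inside the cache (assumed layout)."""
    return "models--" + model_id.replace("/", "--")


model_dir = hub_cache_dir() / repo_folder_name("meta-llama/Meta-Llama-3.1-8B-Instruct")
print(model_dir)
print("Already cached:", model_dir.exists())
```

Deleting that folder forces a fresh download the next time `from_pretrained` is called for the same model.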

# Step 6: Tokenizing a Prompt and Generating Text
In this step, we will tokenize a list of messages, pass it to the model, and generate text as output.

In [None]:
# An instruct model requires a list of messages as we saw in AI Engineering Essentials Part 1
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is the best way to structure and organize my thoughts?"}
]

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
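# add_generation_prompt=True appends the assistant header so the model writes a reply
# model.device keeps the inputs on the same device as the model (GPU or CPU)
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)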

print(inputs)

In [None]:
# Pass the input IDs to the model to generate output
output_ids = model.generate(inputs, max_new_tokens=80)

# Display the generated output IDs
print("Generated Output IDs:", output_ids)

In [None]:
# generate returns a batch of sequences; decode the first (and only) one
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print("Generated Text:", generated_text)
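
For a decoder-only model like this one, `generate` returns the prompt tokens followed by the newly generated tokens, so the decoded text repeats the prompt. To show only the model's reply, slice off the prompt before decoding. The sketch below uses plain Python lists with made-up token IDs as stand-ins for the real tensors, so it runs without a model.

```python
# Hypothetical token IDs, standing in for the real tensors
prompt_ids = [101, 7, 42, 9]                 # what was passed to generate
output_ids = [[101, 7, 42, 9, 55, 56, 2]]    # batch of one: prompt + new tokens

prompt_length = len(prompt_ids)
new_token_ids = output_ids[0][prompt_length:]  # keep only the generated part

print("New token IDs:", new_token_ids)
```

With the real tensors, the equivalent slice would be along the lines of `output_ids[0][inputs.shape[-1]:]`, assuming `inputs` is the 2-D tensor of input IDs that was passed to `generate`.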

# Takeaways
- The `from_pretrained` method simplifies loading pre-trained models and tokenizers.
- Models are cached locally for efficiency.
- You can explore model configurations and map models to devices for optimized inference.

# Your Challenge
Fork this notebook, change the model ID to a model trained for your native language, and share your results in the course repository!