# Models

A look at the lower-level API of Transformers: the model classes that wrap the PyTorch code for the transformers themselves.

This notebook can run on a low-cost or free T4 runtime.

## Please note

I've added some new material in the middle of this lab to get more intuition on what a Transformer actually is. Later in the course, when we fine-tune LLMs, you'll get a deeper understanding of this.

Installing the necessary libraries

In [None]:
!pip install -q --force-reinstall torch==2.6.0

In [None]:
!pip install -q bitsandbytes --upgrade

In [None]:
!pip install -q requests torch bitsandbytes transformers sentencepiece accelerate

Importing the necessary libraries

In [None]:
from google.colab import userdata
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig
import torch
import gc

This code authenticates you with the Hugging Face Hub using the `HF_TOKEN` stored in your Colab secrets. It also writes the token into your Git credentials, so you can clone or push to private Hugging Face repos without entering it again.

In [None]:
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

Each line assigns a friendly variable name to the exact Hugging Face model identifier you’ll use when loading the tokenizer and model.

In [None]:
LLAMA = "meta-llama/Meta-Llama-3.1-8B-Instruct"
PHI3 = "microsoft/Phi-3-mini-4k-instruct"
GEMMA2 = "google/gemma-2-2b-it"
QWEN2 = "Qwen/Qwen2-7B-Instruct"
MIXTRAL = "mistralai/Mixtral-8x7B-Instruct-v0.1"

This creates a **list of message objects** that you’ll feed into a chat model. Each message is a dictionary with a `role` and a `content` field.

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
  ]

Quantization configuration: this tells the loader to store the model weights in 4-bit precision (NF4) while computing in bfloat16, so the model takes up far less GPU memory.

In [None]:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)
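To get a feel for why 4-bit loading matters on a T4, here is a rough back-of-the-envelope estimate of weight memory at different precisions. It is only a sketch: the helper `weight_memory_gb` is a hypothetical name, and the 8e9 parameter count is a round-number assumption for an "8B" model. The footprint actually measured after loading is typically somewhat higher, since not every tensor is quantized and the quantization constants themselves take space.

```python
# Back-of-the-envelope weight memory for a model at different precisions.
# The 8e9 parameter count is a round-number assumption for an "8B" model.

def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params = 8e9
for label, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8), ("nf4 (4-bit)", 4)]:
    print(f"{label:>12}: {weight_memory_gb(params, bits):5.1f} GB")
```

Under these assumptions the 4-bit weights need roughly an eighth of the float32 memory, which is what lets an 8B model fit on a single T4.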

This prepares text for a language model (Llama) by converting it into the numerical format the model understands. It loads the tokenizer, sets the padding token, and applies the model's chat template to turn the messages into input token IDs.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(LLAMA)
tokenizer.pad_token = tokenizer.eos_token
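# add_generation_prompt=True ends the prompt with the assistant header, so the model answers rather than continuing the user's turn
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")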
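It can help to see what a chat template does before the text is tokenized. The sketch below is a simplified, illustrative stand-in and not the real template: `render_chat` is a hypothetical helper, and the special-token strings imitate the Llama 3 style as an assumption. The actual template ships with each model's tokenizer and may differ in its details.

```python
# Simplified, illustrative stand-in for a chat template. The special-token
# strings imitate the Llama 3 style but are an assumption; the real template
# is defined by each model's tokenizer and may differ.

def render_chat(messages, add_generation_prompt=True):
    """Join role/content messages into one prompt string."""
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model knows it should reply next.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

demo_messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"},
]
prompt = render_chat(demo_messages)
print(prompt)
```

The real `apply_chat_template` goes one step further and converts the rendered string into token IDs. Its `add_generation_prompt` flag plays the same role as in this sketch: it appends the opening of an assistant turn.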

## Llama

**Loading the model:**

This downloads the Llama weights and places them on the GPU, applying the quantization config defined earlier so the model uses less memory.

In [None]:
model = AutoModelForCausalLM.from_pretrained(LLAMA, device_map="auto", quantization_config=quant_config)

In [None]:
memory = model.get_memory_footprint() / 1e6
print(f"Memory footprint: {memory:,.1f} MB")

In [None]:
model

In [None]:
outputs = model.generate(inputs, max_new_tokens=80)
print(tokenizer.decode(outputs[0]))
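Conceptually, generation is a loop: predict one token, append it, and repeat until the token budget runs out or an end-of-sequence token appears. The toy sketch below illustrates that loop with greedy selection, which is only one possible decoding strategy. `toy_next_token` is a hypothetical stand-in for a real model's forward pass, and the `EOS` id is an assumption made for this example.

```python
# Toy illustration of an autoregressive generation loop. `toy_next_token`
# is a hypothetical stand-in for a real model's forward pass plus argmax.

EOS = 0  # assumed end-of-sequence token id for this toy example

def toy_next_token(token_ids):
    # Pretend model: counts down from the last token until it reaches EOS.
    return max(token_ids[-1] - 1, EOS)

def greedy_generate(prompt_ids, max_new_tokens):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        nxt = toy_next_token(ids)
        ids.append(nxt)
        if nxt == EOS:
            break  # stop early once the model emits end-of-sequence
    return ids

prompt_ids = [7, 5, 3]
output_ids = greedy_generate(prompt_ids, max_new_tokens=80)
new_ids = output_ids[len(prompt_ids):]  # keep only the newly generated tokens
print(output_ids)  # [7, 5, 3, 2, 1, 0]
print(new_ids)     # [2, 1, 0]
```

As in this sketch, the sequence returned for a decoder-only model starts with the prompt tokens, which is why decoding `outputs[0]` prints the prompt as well as the reply. Slicing off the first `len(prompt_ids)` tokens leaves only the new text.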

In [None]:
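def generate(model_name, messages):
  # Load the tokenizer and turn the chat messages into input token IDs
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  tokenizer.pad_token = tokenizer.eos_token
  inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
  # Stream tokens to the output as they are generated
  streamer = TextStreamer(tokenizer)
  model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quant_config)
  outputs = model.generate(inputs, max_new_tokens=80, streamer=streamer)
  memory = model.get_memory_footprint() / 1e6
  print(f"Memory footprint: {memory:,.1f} MB")
  # Release GPU memory so the next model can be loaded
  del model, inputs, tokenizer, outputs, streamer
  gc.collect()
  torch.cuda.empty_cache()

In [None]:
# Gemma's chat template does not accept a system role, so only a user message is sent
messages = [
    {"role": "user", "content": "Tell a light-hearted joke for Data Scientists"}
  ]
generate(GEMMA2, messages)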

How to quantize a model and push it to the Hugging Face Hub

In [None]:
from huggingface_hub import notebook_login

model_name = "google/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# Load the model with 4-bit quantization applied
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quant_config)
# Log in with a token that has write access, then upload the model and tokenizer
notebook_login()
model.push_to_hub("korarishi1027/rishi-2-2b-it")
tokenizer.push_to_hub("korarishi1027/rishi-2-2b-it")
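Once pushed, the quantized checkpoint can in principle be loaded back by its repo id. The following is a minimal sketch rather than a tested recipe. It assumes the push above succeeded, that a GPU is available, and that the installed `transformers` and `bitsandbytes` versions support loading pre-quantized 4-bit checkpoints. `load_quantized` is a hypothetical helper name.

```python
def load_quantized(repo_id: str):
    """Load a tokenizer and a pre-quantized model from the Hub (needs a GPU and network access)."""
    # Imported lazily so that defining the helper does not require the libraries.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    # Assumption: the quantization settings were saved in the repo's config
    # at push time, so no quantization_config is passed here.
    model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
    return tokenizer, model

# Example call (not executed here):
# tokenizer, model = load_quantized("korarishi1027/rishi-2-2b-it")
```

If loading works as assumed, `model.get_memory_footprint()` should report a figure close to the one measured before the push.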