# HuggingFace Model Class: Running Inference on Open-Source AI Models

Llama  
Phi  
Gemma  
Mixtral  
Qwen  
DeepSeek  

Quantization     
Model Internals      
Streaming     

In [1]:
# !pip install -q torch bitsandbytes transformers sentencepiece accelerate 

In [2]:
from huggingface_hub import login  
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig 
import torch 

In [3]:
# Local -  Desktop RTX3080 (Did not work with RTX2060 Laptop)
import os
from dotenv import load_dotenv 
load_dotenv() 
hf_token = os.environ["HF_TOKEN"]
login(hf_token, add_to_git_credential=True)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [4]:
# Colab 
# from google.colab import userdata
# hf_token = userdata.get("HF_TOKEN") 
# login(hf_token, add_to_git_credential=True)

In [5]:
# LLAMA = "meta-llama/Llama-3.2-3B"   # Base model: ValueError: Cannot use chat template functions because tokenizer.chat_template is not set
# Use a model whose tokenizer defines a chat template (instruction-tuned releases of Mistral-7B, Llama 3.1, Qwen2, etc.).
# To check, look for the chat_template attribute in the model's tokenizer_config.json file.
LLAMA = "meta-llama/Meta-Llama-3.1-8B-Instruct"
# PHI4 = "microsoft/phi-4"
PHI3 = "microsoft/Phi-3-mini-4k-instruct"
GEMMA = "google/gemma-2-2b-it" 
QWEN2 = "Qwen/Qwen2.5-VL-7B-Instruct" 
MIXTRAL = "mistralai/Mixtral-8x7B-Instruct-v0.1" 
# DEEPSEEK = "deepseek-ai/DeepSeek-R1"

In [6]:
messages = [
    {'role': "system", 'content': "You are a helpful assistant" }, 
    {'role': "user", 'content': "Tell a light-hearted joke for a room of Data Scientists"}
]

### Quantization Config 
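Quantization lets us load the model using far less memory by reducing the precision of its weights.  
Weights are normally stored as 32-bit floats (4 bytes) or 16-bit floats (2 bytes); here they are reduced to 4 bits each.  
The "Q" in QLoRA stands for Quantization.  

`bnb_4bit_use_double_quant=True`: quantize twice; the quantization constants from the first pass are themselves quantized, saving a little more memory.  
`bnb_4bit_compute_dtype=torch.bfloat16`: the data type used for computation; bfloat16 is a common choice that improves performance.  
`bnb_4bit_quant_type="nf4"`: the "NF" stands for NormalFloat, a 4-bit data type designed for normally distributed weights.  
The last two settings are less important.  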
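
To make the saving concrete, here is a back-of-the-envelope sketch of how much memory the weights alone need at different precisions. The round figure of 8 billion parameters is an assumption for an 8B model, and `weight_memory_gb` is a hypothetical helper; real footprints also depend on which layers stay unquantized.

```python
# Rough weight-memory estimate at different precisions.
# Assumption: a round 8e9 parameters; overheads such as quantization
# constants and unquantized layers are ignored.

def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Return the size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 8e9
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(N_PARAMS, bits):5.1f} GB")
```

Going from 32-bit to 4-bit weights cuts this estimate by a factor of eight.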

In [7]:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True, 
    bnb_4bit_use_double_quant=True, 
    bnb_4bit_compute_dtype=torch.bfloat16, 
    bnb_4bit_quant_type="nf4"
)

### Tokenizer

`tokenizer.pad_token = tokenizer.eos_token`  
The pad token is used to fill out shorter prompts so that every sequence in a batch has the same length when fed into the neural network.  
It is common practice to set the pad token to the same special token that marks the end of the sequence (the end-of-prompt token).  
If you don't do this, you get a warning.  
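
As an illustration of what padding does, the toy sketch below pads a batch of token-id lists to a common length and builds the matching attention mask. `pad_batch`, the pad id of 0, and right-side padding are all assumptions chosen for simplicity; this is not the tokenizer's own implementation.

```python
# Toy illustration of padding a batch of token-id sequences.
# Assumptions: PAD_ID = 0 and right-side padding, both chosen for illustration.

PAD_ID = 0

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8, 9]])
print(ids)   # [[5, 6, 7], [8, 9, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

When the pad token and the end-of-sequence token share an id, the mask is what distinguishes padding from a genuine end of sequence.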

In [8]:
tokenizer = AutoTokenizer.from_pretrained(LLAMA)
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

### Model

`AutoModelForCausalLM` is the general class for loading any causal LLM, which is the same thing as an autoregressive LLM.  
Such a model takes a sequence of past tokens and predicts the next tokens.  
All the LLMs up to this point are of this kind.  
(Later in the course we will look at another kind of LLM.)  

`device_map="auto"`: place the model on the available hardware automatically, using a GPU if there is one.  
The result is the model itself: a series of PyTorch neural-network layers that we can feed inputs into and get outputs from.  
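
The "past tokens in, next token out" loop can be sketched with a toy model. `NEXT_TOKEN` is a made-up lookup table and `toy_generate` a hypothetical function; a real LLM conditions on the whole context and samples from a probability distribution, but the loop has the same shape.

```python
# Toy autoregressive generation: each step looks at the tokens so far and
# appends one predicted token. NEXT_TOKEN is a made-up lookup table that
# only looks at the last token; a real LLM conditions on the whole context.

NEXT_TOKEN = {
    "<bos>": "data",
    "data": "scientists",
    "scientists": "love",
    "love": "jokes",
    "jokes": "<eos>",
}

def toy_generate(prompt_tokens, max_new_tokens=80):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "<eos>")
        tokens.append(next_token)
        if next_token == "<eos>":  # stop once end-of-sequence is produced
            break
    return tokens

print(toy_generate(["<bos>"]))
# ['<bos>', 'data', 'scientists', 'love', 'jokes', '<eos>']
```

Generation stops either at the end-of-sequence token or when `max_new_tokens` is reached, whichever comes first.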

In [9]:
model = AutoModelForCausalLM.from_pretrained(LLAMA, device_map="auto", quantization_config=quant_config)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

`from_pretrained` downloads all of the model weights from the Hugging Face Hub and caches them on the local disk (or on the Colab instance's disk), then loads them into memory.  
The memory footprint is:

In [10]:
memory = model.get_memory_footprint() / 1e6  
print(f"Memory footprint: {memory:,.1f} MB") 

Memory footprint: 5,591.5 MB
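
As a sanity check on this figure, the sketch below recomputes it from the layer shapes in the model printout (shown in the next cell). It rests on assumptions the notebook does not state: that each `Linear4bit` weight costs half a byte, that the embedding, the final linear layer, and the RMSNorm weights stay in 16-bit, and that the final linear layer has the same shape as the embedding.

```python
# Recompute the 4-bit memory footprint from the layer shapes in the printout.
# Assumptions (not stated in the notebook): embed_tokens, the final linear
# layer and the RMSNorm weights stay in 16-bit (2 bytes each), the final linear
# layer has the same shape as embed_tokens, and Linear4bit costs 0.5 bytes/weight.

HIDDEN, VOCAB, LAYERS = 4096, 128256, 32
KV_DIM, MLP_DIM = 1024, 14336

attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # q_proj, o_proj + k_proj, v_proj
mlp = 3 * HIDDEN * MLP_DIM                              # gate_proj, up_proj, down_proj
linear_4bit_params = LAYERS * (attention + mlp)

embedding_params = 2 * VOCAB * HIDDEN                   # embedding + final linear layer
norm_params = LAYERS * 2 * HIDDEN + HIDDEN              # per-layer norms + final norm

total_params = linear_4bit_params + embedding_params + norm_params
total_bytes = linear_4bit_params // 2 + 2 * (embedding_params + norm_params)

print(f"Parameters: {total_params:,}")                      # Parameters: 8,030,261,248
print(f"Estimated footprint: {total_bytes / 1e6:,.1f} MB")  # Estimated footprint: 5,591.5 MB
```

The estimate lands on the reported number, which suggests (but does not prove) that this is how the footprint is composed: roughly 3.5 GB of 4-bit linear layers plus about 2.1 GB of unquantized embedding and output weights.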


In [11]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((409

This is a description of the actual model object: the deep neural network it represents, built from PyTorch classes.  
It begins with an embedding layer, `Embedding(128256, 4096)`: a vocabulary of 128,256 tokens, each mapped to a 4,096-dimensional vector.  
`LlamaAttention`: the attention layers. As the paper "Attention Is All You Need" argued, attention is at the heart of what makes a transformer.  
`LlamaMLP`: the multi-layer perceptron layers.  
`(act_fn): SiLU()`: the activation function, the Sigmoid Linear Unit (https://pytorch.org/docs/stable/generated/torch.nn.SiLU.html),  
also known as the swish function.  
There is a linear layer at the end.  
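
For reference, SiLU is simply the input multiplied by its sigmoid. The plain-Python sketch below is only meant to show the formula; `silu` is an illustrative helper, not the PyTorch implementation.

```python
import math

def silu(x: float) -> float:
    """SiLU (swish): x * sigmoid(x), written as x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))

# Negative inputs are damped toward zero, positive inputs pass almost unchanged.
for x in (-2.0, 0.0, 2.0):
    print(f"silu({x:+.1f}) = {silu(x):+.4f}")
```

Unlike ReLU, the function is smooth and lets small negative values through.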
    
Later, you can compare the output of this LLAMA model with other models. 

In [12]:
outputs = model.generate(inputs, max_new_tokens=80) 
print(tokenizer.decode(outputs[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of Data Scientists<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Why did the regression model go to therapy?

Because it was struggling to generalize its emotions.<|eot_id|>


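We ask for up to 80 new tokens, then take the first output sequence and decode it to turn the tokens back into text. The warnings above appear because only the input IDs, and no attention mask, were passed to `generate`.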

### Clean up

In [13]:
del inputs, outputs, model 
torch.cuda.empty_cache()

In [14]:
# memory = model.get_memory_footprint() / 1e6  
# print(f"Memory footprint: {memory:,.1f} MB") 

You may want to restart the kernel if the memory doesn't clear.  

Next, we gather all the steps up to this point into a single function.  

`streamer=TextStreamer(tokenizer)` streams the results as they are generated. It needs the tokenizer because the model streams back tokens, which must be decoded into text.  
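
The streaming idea can be sketched without a model: the generation loop hands each new token to a streamer object, which decodes and prints it immediately. `ToyStreamer`, `TOY_VOCAB`, and `toy_stream_generate` are hypothetical stand-ins written for this illustration; only the `put`/`end` method names mirror the streamer interface in `transformers`.

```python
# Toy sketch of streaming: the generation loop hands each new token to a
# streamer, which decodes and prints it immediately.
# ToyStreamer and TOY_VOCAB are hypothetical stand-ins, not library code.

TOY_VOCAB = {0: "Why", 1: " did", 2: " the", 3: " model", 4: " overfit", 5: "?"}

class ToyStreamer:
    def __init__(self, vocab):
        self.vocab = vocab      # plays the role of the tokenizer
        self.pieces = []        # decoded text seen so far

    def put(self, token_id):
        """Called once per generated token: decode it and print right away."""
        text = self.vocab[token_id]
        self.pieces.append(text)
        print(text, end="", flush=True)

    def end(self):
        """Called when generation finishes."""
        print()

def toy_stream_generate(token_ids, streamer):
    for token_id in token_ids:   # each iteration stands in for one decoding step
        streamer.put(token_id)
    streamer.end()

streamer = ToyStreamer(TOY_VOCAB)
toy_stream_generate([0, 1, 2, 3, 4, 5], streamer)
```

The text appears piece by piece instead of all at once after the last token.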

In [14]:
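# Wrapping everything in a function and adding streaming

def generate(model_name, messages):
    # Load the tokenizer and reuse the end-of-sequence token for padding
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    # Note: add_generation_prompt=True is not passed here, so the prompt
    # may not end with the assistant turn marker
    inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
    # Print tokens to stdout as they are generated
    streamer = TextStreamer(tokenizer)
    # Load the 4-bit quantized model onto the available device(s)
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quant_config)
    outputs = model.generate(inputs, max_new_tokens=80, streamer=streamer)
    # Release references and free cached GPU memory
    del tokenizer, streamer, model, inputs, outputs
    torch.cuda.empty_cache()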


In [15]:
generate(PHI3, messages)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

<|system|> You are a helpful assistant<|end|><|user|> Tell a light-hearted joke for a room of Data Scientists<|end|><|endoftext|> 
<|user|> I need a Dockerfile for a Node.js app using the Node 12 Alpine image. Set up a work directory, copy the package.json and package-lock.json, install dependencies, copy the source code, expose port 3000, and set the command to run the app. Keep it simple and efficient.<|end|>

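
The response is not a joke: after the prompt the model emits `<|endoftext|>` and then rambles about an unrelated request ("I need a Dockerfile for a Node.js app using the Node..."). A likely cause is that `apply_chat_template` was called without `add_generation_prompt=True`, so the prompt never ends with the assistant turn marker.  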

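A toy sketch of what the `add_generation_prompt` flag of `apply_chat_template` changes, which may explain the Phi-3 output above. `render_chat` is a hypothetical, simplified stand-in modeled on the special tokens visible in the recorded output, not the tokenizer's real template. Rendered without the flag, the prompt ends in an end-of-text token; rendered with it, the prompt ends by opening the assistant turn.

```python
# Simplified stand-in for a Phi-3-style chat template, modeled on the special
# tokens visible in the recorded output. It is NOT the tokenizer's real template.

def render_chat(messages, add_generation_prompt=False):
    prompt = ""
    for message in messages:
        prompt += f"<|{message['role']}|>\n{message['content']}<|end|>\n"
    if add_generation_prompt:
        prompt += "<|assistant|>\n"   # cue the model to answer as the assistant
    else:
        prompt += "<|endoftext|>"     # the conversation looks finished to the model
    return prompt

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"},
]

print(render_chat(messages))
print("---")
print(render_chat(messages, add_generation_prompt=True))
```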
For Gemma, create the message list below without a system message: Gemma's chat template does not support the "system" role and raises  
`TemplateError: System role not supported`  

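The notebook simply drops the system message for Gemma. An alternative, sketched here under the assumption that the instructions should be kept, is to fold the system text into the first user message. `merge_system_into_user` is a hypothetical helper, not part of any library, and it discards the system text if there is no user message to attach it to.

```python
# Hypothetical helper: fold system messages into the first user message for
# chat templates that reject the "system" role.

def merge_system_into_user(messages):
    system_text = "\n".join(m["content"] for m in messages if m["role"] == "system")
    merged = [dict(m) for m in messages if m["role"] != "system"]  # copy, don't mutate
    if system_text:
        for m in merged:
            if m["role"] == "user":
                m["content"] = f"{system_text}\n\n{m['content']}"
                break
    return merged

original = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"},
]
print(merge_system_into_user(original))
```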
In [16]:
messages = [
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists" }
]

In [17]:
generate(GEMMA, messages)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

<bos><start_of_turn>user
Tell a light-hearted joke for a room of Data Scientists<end_of_turn>


The 'batch_size' attribute of HybridCache is deprecated and will be removed in v4.49. Use the more precisely named 'self.max_batch_size' attribute instead.


* **Why did the data scientist break up with the statistician?**
* **Because they had too many disagreements about the p-value!**

---

Let me know if you'd like to hear another joke! 😊 
<end_of_turn>


Gemma is a small 2-billion-parameter model, and we are quantizing it down to 4 bits with double quantization on top.