<a href="https://colab.research.google.com/github/SadeghMahmoudAbadi/Open-Source-LLM-on-Colab/blob/main/4-Tokenizers/Tokenizers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q --upgrade bitsandbytes accelerate

In [2]:
from google.colab import userdata
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig
import torch
import gc

In [3]:
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

In [4]:
PHI = "microsoft/Phi-4-mini-instruct"
QWEN = "Qwen/Qwen3-4B-Instruct-2507"
DEEPSEEK = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
GEMMA = "google/gemma-3-270m-it"

In [5]:
messages = [
    {"role": "user", "content": "Tell me a fun fact."}
]

In [6]:
# Quantization config: load the weights in 4-bit NF4 so the model needs less GPU memory

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)
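
As a rough illustration of why 4-bit loading helps, the sketch below estimates the size of the weights alone at 16 bits and at 4 bits per parameter. The parameter count is an assumed round number for illustration, not a measured value for any model above. The estimate ignores quantization constants, layers that stay unquantized, activations, and the KV cache, so real usage will be higher.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Rough size of the model weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Assumed parameter count for illustration only (roughly a "4B" model).
num_params = 4e9

bf16_gb = estimate_weight_memory_gb(num_params, 16)
nf4_gb = estimate_weight_memory_gb(num_params, 4)

print(f"bf16 weights:  ~{bf16_gb:.1f} GB")
print(f"4-bit weights: ~{nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the bf16 footprint.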

In [7]:
def generate(model_name, messages, quant=True, max_new_tokens=500):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    # return_dict=True returns both input_ids and the matching attention_mask
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
    ).to("cuda")
    streamer = TextStreamer(tokenizer)
    # Let from_pretrained place the weights on the GPU; calling .to("cuda") on a
    # bitsandbytes-quantized model is rejected by some transformers versions.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config if quant else None,
        device_map="cuda",
    )
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, streamer=streamer)

    # Drop the references first, otherwise there is nothing for gc to free
    del model, outputs, inputs
    gc.collect()
    torch.cuda.empty_cache()

In [8]:
generate(PHI, messages)

<|user|>Tell me a fun fact.<|end|><|assistant|>Did you know that octopuses have three hearts? Two pump blood to the gills, while the third pumps it to the rest of the body. Additionally, octopuses have blue blood, which contains copper-based molecules that bind to oxygen. This adaptation allows them to survive in low-oxygen environments. Isn't that fascinating?<|end|>


In [9]:
generate(QWEN, messages)

<|im_start|>user
Tell me a fun fact.<|im_end|>
<|im_start|>assistant
Sure! Here's a fun fact: **Octopuses have three hearts — two of which stop beating when they swim!** 🐚❤️  
Even more fascinating, they use one heart to pump blood to the gills, and the other two work together to circulate blood to the rest of their body. But when an octopus swims, it stops using two of its hearts — the other one keeps going. It’s like a super-powered, three-heart-powered underwater superhero with a built-in “safety switch”! 🌊🐙✨<|im_end|>


In [10]:
generate(DEEPSEEK, messages, quant=False)

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


<｜begin▁of▁sentence｜><｜User｜>Tell me a fun fact.<｜Assistant｜><think>
Alright, the user is asking for a fun fact. I need to come up with something engaging. Let me think about what's always popular. Maybe something related to animals or nature. How about something about a specific animal's lifespan? 

I remember that polar bears are really long-lived. They live over 70 years. That's a fun fact. I should explain why that's interesting—maybe mention how long that is compared to other animals. Also, I can add something about their care since they're so dependent on their parents. 

I should keep it concise and make sure it's clear and interesting. Maybe structure it with a question to engage the user and then provide the answer. That way, it's both informative and fun to read.
</think>

Sure! Here's a fun fact: Polar bears are some of the longest-lived animals on Earth, living over 70 years. They’re particularly dependent on their parents because they rely on their body
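
The output above wraps the model's reasoning in `<think>` and `</think>` before the final answer. A minimal sketch of separating the two parts of a decoded string is shown below. `split_reasoning` is a hypothetical helper, not part of transformers, and `sample` is a made-up string rather than real model output.

```python
import re

def split_reasoning(generated: str):
    """Hypothetical helper: separate a <think>...</think> block from the final answer."""
    match = re.search(r"<think>(.*?)</think>", generated, flags=re.DOTALL)
    if match is None:
        # No reasoning block found: treat the whole string as the answer
        return "", generated.strip()
    reasoning = match.group(1).strip()
    answer = generated[match.end():].strip()
    return reasoning, answer

# Illustrative input only, not actual model output
sample = "<think>\nThe user wants a fun fact.\n</think>\n\nSure! Here's a fun fact."
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

If generation stops before `</think>` is emitted, this sketch returns the whole string as the answer, which may or may not be the behavior you want.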

In [11]:
generate(GEMMA, messages, quant=False)

Setting `pad_token_id` to `eos_token_id`:1 for open-end generation.


<bos><start_of_turn>user
Tell me a fun fact.<end_of_turn>
<start_of_turn>model
Okay, here's a fun fact:
<end_of_turn>


In [12]:
tokenizer = AutoTokenizer.from_pretrained(QWEN)

In [13]:
text = "My name is Sadegh Mahmoud Abadi, and I'm eager about LLMs!"
tokens = tokenizer.encode(text)
tokens

[5050,
 829,
 374,
 30681,
 791,
 71,
 93186,
 3680,
 2767,
 11,
 323,
 358,
 2776,
 23541,
 911,
 444,
 10994,
 82,
 0]

In [14]:
character_count = len(text)
word_count = len(text.split())  # split on any run of whitespace
token_count = len(tokens)
print(f"There are {character_count} characters, {word_count} words and {token_count} tokens")

There are 58 characters, 11 words and 19 tokens
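
The counts above can be turned into two rough ratios that are often used to compare tokenizers. The sketch below hardcodes the three numbers printed above, so it runs without loading a tokenizer. The ratios describe this one sentence only and will differ for other text, languages, and tokenizers.

```python
# Counts copied from the output above
character_count = 58
word_count = 11
token_count = 19

chars_per_token = character_count / token_count
tokens_per_word = token_count / word_count

print(f"{chars_per_token:.2f} characters per token")  # 3.05 characters per token
print(f"{tokens_per_word:.2f} tokens per word")       # 1.73 tokens per word
```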


In [15]:
tokenizer.decode(tokens)

"My name is Sadegh Mahmoud Abadi, and I'm eager about LLMs!"

In [16]:
# Each ID is decoded as its own sequence, which shows how the text was split into tokens
tokenizer.batch_decode(tokens)

['My',
 ' name',
 ' is',
 ' Sad',
 'eg',
 'h',
 ' Mahmoud',
 ' Ab',
 'adi',
 ',',
 ' and',
 ' I',
 "'m",
 ' eager',
 ' about',
 ' L',
 'LM',
 's',
 '!']
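
The pieces above carry their own leading spaces, so joining them restores the original text. The sketch below checks that and groups the pieces by word to show which words were split into several tokens. The `pieces` list is copied from the output above. The grouping rule (a piece starting with a space begins a new word) is an assumption that fits this ASCII example and is not a general property of every tokenizer.

```python
# Token pieces copied from the batch_decode output above
pieces = ['My', ' name', ' is', ' Sad', 'eg', 'h', ' Mahmoud', ' Ab', 'adi', ',',
          ' and', ' I', "'m", ' eager', ' about', ' L', 'LM', 's', '!']
text = "My name is Sadegh Mahmoud Abadi, and I'm eager about LLMs!"

# Joining the pieces with no separator gives back the original string
reconstructed = "".join(pieces)
print(reconstructed == text)  # True

# Group pieces into words: a piece that starts with a space begins a new word
words = []
for piece in pieces:
    if piece.startswith(" ") or not words:
        words.append([piece])
    else:
        words[-1].append(piece)

# Words that needed more than one token
split_words = ["".join(group).strip() for group in words if len(group) > 1]
print(split_words)  # ['Sadegh', 'Abadi,', "I'm", 'LLMs!']
```

Common words map to a single token here, while the name and the acronym are broken into smaller pieces.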

In [17]:
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

<|im_start|>user
Tell me a fun fact.<|im_end|>
<|im_start|>assistant
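
To make the layout of this prompt concrete, the sketch below rebuilds the same string by hand. `render_chatml` is a hypothetical helper written for illustration. The real template ships with the tokenizer and also handles cases such as system prompts and tools, which this sketch does not attempt.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Hypothetical helper: hand-rolled approximation of the prompt layout shown above."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|>
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # An open assistant turn tells the model it is its turn to respond
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [{"role": "user", "content": "Tell me a fun fact."}]
prompt = render_chatml(messages)
print(prompt)
```

In practice `tokenizer.apply_chat_template` should be preferred, since each model family uses its own special tokens, as the Phi, DeepSeek, and Gemma outputs earlier in the notebook show.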

