<h1>Chapter 2 - Tokens and Token Embeddings</h1>
<i>Exploring tokens and embeddings as an integral part of building LLMs</i>


<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961"><img src="https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon"></a>
<a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/"><img src="https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K"></a>
<a href="https://github.com/HandsOnLLM/Hands-On-Large-Language-Models"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter02/Chapter%202%20-%20Tokens%20and%20Token%20Embeddings.ipynb)

---

This notebook is for Chapter 2 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

---

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">
<img src="https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png" width="350"/></a>


### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **uncomment and run** the following codeblock to install the dependencies for this chapter:

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---

In [1]:
%%capture
!pip install "transformers>=4.41.2" "sentence-transformers>=3.0.1" "gensim>=4.3.2" "scikit-learn>=1.5.0" "accelerate>=0.31.0"

# Downloading and Running An LLM

The first step is to load our model onto the GPU for faster inference. Note that we load the model and tokenizer separately and keep them as such so that we can explore them separately.

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",  # place the model weights on the GPU
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



I want to load a Hugging Face pretrained transformer model directly onto the GPU (there is not enough CPU memory), e.g. loading BERT.
<details>
  <summary> Answer  </summary>

  Hugging Face `accelerate` can help by moving the model to the GPU before it is fully loaded into CPU memory. With `device_map = 'cuda'`, this worked in the case where
  GPU memory > model size > CPU memory.

  
  ```python
  !pip install accelerate
  ```
  
  then use

  ```python
  from transformers import AutoModel
  model = AutoModel.from_pretrained("bert-base-uncased", device_map = 'cuda')
  ```
</details>


[Official Doc for accelerate package](https://huggingface.co/docs/accelerate/en/index)

In [3]:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|>"

<details>
  <summary> <h3>What is the `assistant` tag doing there? </h3></summary>



<u>Question</u>
  Role
  You are an expert in LLMs and in using Hugging Face to accomplish the required tasks.

  Context
  I'm reading a book that covers tokens and tokenization. Could you please clear up the doubts I have about it?
  We are using the Phi instruct model from Hugging Face.
  Task/Query:
  There is a prompt as below
  ```python
  prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|>"
  ```
  Could you explain why the `<|assistant|>` tag is there, and what other tags related to the Phi model we can use?

<u>Answer</u>

As an expert in LLMs and Hugging Face, I'll explain the tokens and tags for the Phi-3 model.


The `<|assistant|>` tag in the Phi model is a special token that indicates where the model's response should begin. These special tokens are part of the model's instruction-following format, which helps the model understand the structure of the conversation.

For the Phi model family, here are the key special tokens:

```python
# Main special tokens for Phi-3
"<|system|>"      # Used for system-level instructions/context
"<|user|>"        # Indicates user input
"<|assistant|>"   # Indicates where model should respond
"<|end|>"         # Marks the end of a turn

# Example of using these tokens properly
prompt = """<|system|>You are a helpful AI assistant.<|end|>
<|user|>Write an email apologizing to Sarah for the tragic gardening mishap.<|end|>
<|assistant|>"""
```

You can verify these tokens using the tokenizer:

```python
from transformers import AutoTokenizer

# Load the Phi-3 tokenizer (the one used in this notebook)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Check the special tokens
print("Special tokens:", tokenizer.special_tokens_map)

# See how tokens are encoded
tokens = tokenizer.tokenize(prompt)
print("Tokenized prompt:", tokens)

# Get the vocabulary size and special token IDs
print("Vocab size:", len(tokenizer))
print("Assistant token ID:", tokenizer.convert_tokens_to_ids("<|assistant|>"))
```

Best practices for using these tokens with Phi:

1. Full conversation format:
```python
prompt = """<|system|>You are a helpful AI assistant.<|end|>
<|user|>Write an email about gardening.<|end|>
<|assistant|>Here's a draft email...<|end|>
<|user|>Make it more formal.<|end|>
<|assistant|>"""
```

2. Single instruction format:
```python
prompt = """<|user|>Write an email apologizing to Sarah.<|end|>
<|assistant|>"""
```

3. For system-level instructions:
```python
prompt = """<|system|>You are an expert gardener who writes in a professional tone.<|end|>
<|user|>Write an apology email to Sarah about the damaged roses.<|end|>
<|assistant|>"""
```

Important notes:
- Always include `<|assistant|>` at the end of your prompt when you want the model to generate a response
- The system token is optional but useful for setting context
- You don't always need to use all tokens - the minimal format with just user and assistant tokens often works well
- The model is trained to recognize these tokens as conversation boundaries and role indicators

If you want to explore the tokenizer's behavior more deeply:

```python
# Analyze how the tokenizer handles special tokens
def analyze_tokens(text):
    tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.encode(text)
    
    print(f"Original text: {text}")
    print(f"Tokens: {tokens}")
    print(f"Token IDs: {token_ids}")
    
# Example usage
analyze_tokens("<|system|>You are a helpful assistant.<|user|>Hello<|assistant|>")
```

This will help you understand how the model processes these special tokens during inference.
  
</details>
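
The role tags discussed above follow a fixed pattern, so a prompt can be assembled from a list of messages. The helper below is only a sketch: `build_phi3_prompt` is a hypothetical name, and the layout it produces (a newline after each role tag, `<|end|>` closing each turn) is an assumption about the Phi-3 chat format. In practice, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` builds the prompt from the template shipped with the model and is the safer choice.

```python
def build_phi3_prompt(messages):
    """Assemble a Phi-3-style chat prompt from a list of {'role', 'content'} dicts.

    Assumed layout: each turn is the role tag, a newline, the content and an
    <|end|> marker; an open <|assistant|> tag at the end asks for a reply.
    """
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    # Leave the assistant turn open so the model continues from here
    parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Write an email apologizing to Sarah."},
]
chat_prompt = build_phi3_prompt(messages)
print(chat_prompt)
```

Keeping the messages as a list of dictionaries mirrors the structure that `apply_chat_template` expects, so switching to the real template later needs no changes to the data.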

In [4]:
# Tokenize the input prompt and send it to the GPU
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

In [5]:
# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=20
)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
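
The first warning above concerns the attention mask, which tells the model which positions hold real tokens and which are padding. As a rough illustration in plain Python (the `pad_batch` helper and the pad ID of 0 are made up for this sketch), padding and masking a batch looks like this. With the real tokenizer, `tokenizer(prompt, return_tensors="pt")` returns an `attention_mask` alongside `input_ids`, and passing both to `model.generate` should avoid the warning.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-ID lists to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Two sequences of different lengths (IDs chosen only for illustration)
ids, mask = pad_batch([[14350, 385, 4876], [3323, 622]])
print(ids)
print(mask)
```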


In [6]:
type(generation_output), generation_output

(torch.Tensor,
 tensor([[14350,   385,  4876, 27746,  5281,   304, 19235,   363,   278, 25305,
            293, 16423,   292,   286,   728,   481, 29889, 12027,  7420,   920,
            372,  9559, 29889, 32001,  3323,   622, 29901,   317,  3742,   406,
           6225, 11763,   363,   278, 19906,   292,   341,   728,   481,    13,
             13,    13, 29928,   799]], device='cuda:0'))

In [7]:
# Print the output
print(tokenizer.decode(generation_output[0]))

Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|> Subject: Sincere Apologies for the Gardening Mishap


Dear


In [8]:
print(input_ids)

tensor([[14350,   385,  4876, 27746,  5281,   304, 19235,   363,   278, 25305,
           293, 16423,   292,   286,   728,   481, 29889, 12027,  7420,   920,
           372,  9559, 29889, 32001]], device='cuda:0')


In [9]:
for token_id in input_ids[0]:
    print(tokenizer.decode(token_id))

Write
an
email
apolog
izing
to
Sarah
for
the
trag
ic
garden
ing
m
ish
ap
.
Exp
lain
how
it
happened
.
<|assistant|>


In [10]:
generation_output

tensor([[14350,   385,  4876, 27746,  5281,   304, 19235,   363,   278, 25305,
           293, 16423,   292,   286,   728,   481, 29889, 12027,  7420,   920,
           372,  9559, 29889, 32001,  3323,   622, 29901,   317,  3742,   406,
          6225, 11763,   363,   278, 19906,   292,   341,   728,   481,    13,
            13,    13, 29928,   799]], device='cuda:0')

In [11]:
print(tokenizer.decode(3323))
print(tokenizer.decode(622))
print(tokenizer.decode([3323, 622]))
print(tokenizer.decode(29901))

Sub
ject
Subject
:
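
The cells above show that token IDs are indices into a vocabulary of subword pieces, and that decoding joins those pieces back into text. The toy below sketches that idea with a tiny hand-made vocabulary and greedy longest-match lookup. It is an illustration only: the vocabulary and IDs are invented, and the real tokenizer uses a learned vocabulary and its own segmentation algorithm.

```python
# Invented vocabulary: piece -> ID (these are not the real Phi-3 IDs)
toy_vocab = {"Sub": 0, "ject": 1, ":": 2, "apolog": 3, "izing": 4}
id_to_piece = {i: p for p, i in toy_vocab.items()}


def toy_encode(word):
    """Split a word into the longest vocabulary pieces, left to right."""
    ids, start = [], 0
    while start < len(word):
        # Try the longest candidate first, shrinking until a piece matches
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in toy_vocab:
                ids.append(toy_vocab[piece])
                start = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches {word[start:]!r}")
    return ids


def toy_decode(ids):
    """Join the pieces for the given IDs back into text."""
    return "".join(id_to_piece[i] for i in ids)


subject_ids = toy_encode("Subject:")
print(subject_ids, toy_decode(subject_ids))
```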


# Comparing Trained LLM Tokenizers


In [12]:
from transformers import AutoModelForCausalLM, AutoTokenizer

In [13]:
colors_list = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]

In [14]:
def show_tokens(sentence, tokenizer_name):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name) # initialize the tokenizer
    token_ids = tokenizer(sentence).input_ids # ids for the tokens
    for idx, t in enumerate(token_ids):
        print(
            f'\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m' +
            tokenizer.decode(t) +
            '\x1b[0m',
            end=' '
        )

In [15]:
text = """
English and CAPITALIZATION
🎵 鸟
show_tokens False None elif == >= else: two tabs:"    " Three tabs: "       "
12.0*50=600
"""

The method above can be used to check which tokenizer splits a given text most cleanly and efficiently.
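
One simple way to make such a comparison concrete is to count how many tokens each tokenizer needs for the same text; fewer tokens usually means the text fits the vocabulary better. The sketch below assumes each tokenizer is just a callable that maps a string to a list of tokens. The two stand-ins (whitespace and character splitting) are placeholders; with Hugging Face you could pass `AutoTokenizer.from_pretrained(name).tokenize` instead.

```python
def count_tokens(text, tokenizers):
    """Return {name: number of tokens} for each tokenizer callable."""
    return {name: len(tokenize(text)) for name, tokenize in tokenizers.items()}


# Stand-in tokenizers for illustration only
stand_ins = {
    "whitespace": str.split,
    "characters": list,
}

counts = count_tokens("show_tokens False None elif", stand_ins)
# Print the most compact tokenization first
for name, n in sorted(counts.items(), key=lambda item: item[1]):
    print(f"{name}: {n} tokens")
```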

In [16]:
show_tokens(text, "bert-base-uncased")

[CLS] english and capital ##ization [UNK] [UNK] show _ token ##s false none eli ##f = = > = else : two tab ##s : " " three tab ##s : ...



In [17]:
show_tokens(text, "bert-base-cased")

[CLS] English and CA ##PI ##TA ##L ##I ##Z ##AT ##ION [UNK] [UNK] show _ token ##s F ##als ##e None el ##if = = > = else : two ta ...

In [18]:
show_tokens(text, "gpt2")


 English  and  CAP ITAL IZ ATION 
 � � �  � � � 
 show _ t ok ens  False  None  el if  ==  >=  else :  two  tabs :"   ...

In [19]:
show_tokens(text, "google/flan-t5-small")

English and CA PI TAL IZ ATION  <unk>  <unk> show _ to ken s Fal s e None  e l if = = > = else : two tab ...

In [20]:
# The official tokenizer is `tiktoken`, but this is the same tokenizer on the HF platform
show_tokens(text, "Xenova/gpt-4")


 English  and  CAPITAL IZATION 
 � � �  � � � 
 show _tokens  False  None  elif  ==  >=  else :  two  tabs :"      "  Three  tabs :  " ...

In [None]:
# You need to request access before being able to use this tokenizer
show_tokens(text, "bigcode/starcoder2-15b")


 English  and  CAPITAL IZATION 
 � � �   � � 
 show _ tokens  False  None  elif  ==  >=  else :  two  tabs :"      "  Three  tabs : ...

In [21]:
show_tokens(text, "facebook/galactica-1.3b")


 English  and  CAP ITAL IZATION 
 � � � �  � � � 
 show _ tokens  False  None  elif   ==   > =  else :  two  t abs : ...

In [22]:
show_tokens(text, "microsoft/Phi-3-mini-4k-instruct")


 English and C AP IT AL IZ ATION 
 � � � �  � � � 
 show _ to kens False None elif == >= else : two ...

# Contextualized Word Embeddings From a Language Model (Like BERT)

In [23]:
from transformers import AutoModel, AutoTokenizer

In [24]:
# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")


In [25]:
# Load a language model
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")


In [26]:
# Tokenize the sentence
tokens = tokenizer('Hello world', return_tensors='pt')

In [27]:
# Process the tokens
output = model(**tokens)
# output = model(**tokens)[0] # original code

In [28]:
type(output), output[0]

(transformers.modeling_outputs.BaseModelOutput,
 tensor([[[-3.4816,  0.0861, -0.1819,  ..., -0.0612, -0.3911,  0.3017],
          [ 0.1898,  0.3208, -0.2315,  ...,  0.3714,  0.2478,  0.8048],
          [ 0.2071,  0.5036, -0.0485,  ...,  1.2175, -0.2292,  0.8582],
          [-3.4278,  0.0645, -0.1427,  ...,  0.0658, -0.4367,  0.3834]]],
        grad_fn=<NativeLayerNormBackward0>))

In [30]:
# output.shape
type(output[0]), output[0].shape

(torch.Tensor, torch.Size([1, 4, 384]))

In [31]:
# what is the input sent to the model
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))

[CLS]
Hello
 world
[SEP]


In [32]:
output

BaseModelOutput(last_hidden_state=tensor([[[-3.4816,  0.0861, -0.1819,  ..., -0.0612, -0.3911,  0.3017],
         [ 0.1898,  0.3208, -0.2315,  ...,  0.3714,  0.2478,  0.8048],
         [ 0.2071,  0.5036, -0.0485,  ...,  1.2175, -0.2292,  0.8582],
         [-3.4278,  0.0645, -0.1427,  ...,  0.0658, -0.4367,  0.3834]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)

# Text Embeddings (For Sentences and Whole Documents)

What is the difference between text embeddings from SentenceTransformers and general contextualized embeddings from LLMs like microsoft/deberta-base? Could you explain it from a beginner's perspective?

<details>
  <summary> <h4>A great answer explained by Claude </h4></summary>
  Let me explain the key differences between SentenceTransformers and contextual embeddings from traditional LLMs in a beginner-friendly way:

1. Purpose and Optimization:

```python
# SentenceTransformers Example
from sentence_transformers import SentenceTransformer

# Specifically optimized for sentence embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
    "I love programming",
    "I enjoy coding"
]
embeddings = model.encode(sentences)
# These embeddings are directly optimized for similarity comparisons
```

```python
# Traditional LLM Example (like DeBERTa)
from transformers import AutoTokenizer, AutoModel
import torch

model = AutoModel.from_pretrained('microsoft/deberta-base')
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-base')

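# Need to handle tokenization and embedding extraction manually
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Need to process these embeddings further (usually mean pooling).
# Weight by the attention mask so padding tokens do not skew the average.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)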
```

2. Main Differences:

```python
# 1. Ease of Use
# SentenceTransformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Simple one-line encoding
text1 = "The cat sits on the mat"
text2 = "A cat is on the mat"
emb1 = model.encode(text1)
emb2 = model.encode(text2)

# Direct similarity computation
similarity = util.cos_sim(emb1, emb2)

# 2. Size and Speed
print(f"Embedding dimension: {len(emb1)}")  # Usually smaller, like 384

# 3. Specific Use Cases
query = "How to make pasta"
documents = [
    "Boil water and add pasta",
    "Pizza is delicious",
    "Pasta cooking instructions"
]

# Easy semantic search
query_emb = model.encode(query)
doc_emb = model.encode(documents)
similarities = util.cos_sim(query_emb, doc_emb)
```

Here's a practical comparison:

```python
# Comparing both approaches for a real task

# 1. Using SentenceTransformers for Semantic Search
from sentence_transformers import SentenceTransformer, util

def semantic_search_sentencetransformers():
    # Load model
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    # Example database
    documents = [
        "The quick brown fox jumps over the lazy dog",
        "A man is walking his dog in the park",
        "The cat sits on the windowsill watching birds",
    ]
    
    # Query
    query = "Animal jumping over something"
    
    # Get embeddings
    doc_embeddings = model.encode(documents)
    query_embedding = model.encode(query)
    
    # Calculate similarities
    similarities = util.cos_sim(query_embedding, doc_embeddings)
    
    return similarities

# 2. Using DeBERTa for the same task
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

def semantic_search_deberta():
    # Load model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-base')
    model = AutoModel.from_pretrained('microsoft/deberta-base')
    
    # Same documents and query
    documents = [
        "The quick brown fox jumps over the lazy dog",
        "A man is walking his dog in the park",
        "The cat sits on the windowsill watching birds",
    ]
    query = "Animal jumping over something"
    
    # Function to get embeddings
    def get_embedding(text):
        inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
        outputs = model(**inputs)
        # Need to handle the embedding extraction manually
        # Using mean pooling as an example
        embeddings = outputs.last_hidden_state.mean(dim=1)
        return embeddings
    
    # Get embeddings
    doc_embeddings = torch.cat([get_embedding(doc) for doc in documents])
    query_embedding = get_embedding(query)
    
    # Calculate similarities
    similarities = F.cosine_similarity(query_embedding, doc_embeddings)
    
    return similarities
```

Key Takeaways:

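1. SentenceTransformers:
   - Purpose-built for getting text embeddings
   - Trained so that similar texts get similar vectors
   - Easier to use (`model.encode` handles tokenization and pooling)
   - Often smaller embedding dimensions (e.g. 384 for all-MiniLM-L6-v2)
   - Often faster inference, since many of these models are small
   - Best for: Semantic search, clustering, similarity comparison

2. LLM Contextual Embeddings (like DeBERTa):
   - More general-purpose
   - One vector per token, each reflecting the surrounding context
   - Require more processing (manual tokenization and pooling)
   - Often larger embedding dimensions (e.g. 768 for deberta-base)
   - Often more computational resources
   - Best for: Token-level tasks and fine-tuning for specific tasks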

Real-world Example:

```python
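# Example: Document Similarity System
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModel, AutoTokenizer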

class DocumentSimilaritySystem:
    def __init__(self, use_sentence_transformers=True):
        if use_sentence_transformers:
            self.model = SentenceTransformer('all-MiniLM-L6-v2')
            self.get_embeddings = self._get_st_embeddings
        else:
            self.tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-base')
            self.model = AutoModel.from_pretrained('microsoft/deberta-base')
            self.get_embeddings = self._get_deberta_embeddings
    
    def _get_st_embeddings(self, texts):
        return self.model.encode(texts)
    
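    def _get_deberta_embeddings(self, texts):
        inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
        with torch.no_grad():
            outputs = self.model(**inputs)
        # Mask-weighted mean so padding tokens do not skew the average
        mask = inputs['attention_mask'].unsqueeze(-1)
        return (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)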
    
    def find_similar_documents(self, query, documents, top_k=3):
        query_emb = self.get_embeddings([query])
        doc_emb = self.get_embeddings(documents)
        
        similarities = util.cos_sim(query_emb, doc_emb)[0]
        top_results = torch.topk(similarities, min(top_k, len(documents)))
        
        return [(documents[idx], score.item())
                for score, idx in zip(top_results.values, top_results.indices)]

# Usage
system = DocumentSimilaritySystem(use_sentence_transformers=True)
results = system.find_similar_documents(
    query="Looking for information about machine learning",
    documents=[
        "Introduction to machine learning algorithms",
        "Today's weather forecast",
        "Basic principles of ML and AI",
        "Recipe for chocolate cake"
    ]
)
```

This comparison should give you a clear understanding of when to use each type of embedding based on your specific needs!
</details>
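
The DeBERTa examples in the answer above turn per-token vectors into one sentence vector by averaging them. A minimal sketch of that pooling step in plain Python is shown below, with an attention mask so that padding positions do not dilute the average. The function name `masked_mean_pool` and the toy numbers are assumptions made for illustration.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    n_real = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            n_real += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if n_real == 0:
        raise ValueError("attention mask selects no tokens")
    return [total / n_real for total in totals]


# Three token vectors of dimension 2; the last position is padding
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
sentence_vector = masked_mean_pool(token_vectors, [1, 1, 0])
print(sentence_vector)
```

Libraries such as SentenceTransformers perform a pooling step like this internally, which is part of why `model.encode` returns a single vector per text.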

In [34]:
from sentence_transformers import SentenceTransformer

In [35]:
# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')



In [37]:
# Convert text to text embeddings
vector = model.encode("Best movie ever!")

In [38]:
vector.shape

(768,)
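
A single 768-dimensional vector for a whole sentence has to be built from the per-token vectors somehow. One common approach is mean pooling: average the token vectors while skipping padding positions. The sketch below shows that idea with plain Python lists. The values are made up, `mean_pool` is a hypothetical helper, and whether a particular model actually pools this way depends on its pooling configuration.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                totals[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in totals]

# Two real tokens followed by one padding position (made-up 2-d vectors)
toy_token_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
toy_mask = [1, 1, 0]

pooled = mean_pool(toy_token_vectors, toy_mask)
print(pooled)  # [2.0, 3.0]
```

Averaging over all three rows instead would pull the result toward the padding vector, which is why the mask matters when sentences in a batch have different lengths.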

# Word Embeddings Beyond LLMs


A follow-up question: how do the two methods above (text embeddings from SentenceTransformers and general contextualized embeddings from LLMs such as `microsoft/deberta-base`) differ from the word embeddings beyond LLMs loaded with `model = api.load("glove-wiki-gigaword-50")`?

<details>
  <summary>
    Very in-depth explanation by Claude
  </summary>
</details>
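
One way to see the core difference: static word embeddings such as GloVe are a lookup table, so a word gets the same vector wherever it appears, while contextual models recompute each token's vector from the surrounding tokens. The toy sketch below illustrates only that contrast. The 2-d vectors are made up, and `toy_contextual_embed` is a hypothetical stand-in (simple neighbor averaging), not how a transformer actually computes its outputs.

```python
# Static embeddings: a fixed lookup table, one vector per word (made-up values)
static_table = {
    "river": [1.0, 0.0],
    "bank":  [0.5, 0.5],
    "money": [0.0, 1.0],
}

def static_embed(words):
    """Look each word up; the surrounding words play no role."""
    return [static_table[w] for w in words]

def toy_contextual_embed(words):
    """Hypothetical stand-in for a contextual model: average each word's
    static vector with those of its immediate neighbors."""
    static = static_embed(words)
    out = []
    for i in range(len(words)):
        window = static[max(0, i - 1): i + 2]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out

sent_a = ["river", "bank"]
sent_b = ["money", "bank"]

# Static: "bank" gets the same vector in both sentences
print(static_embed(sent_a)[1] == static_embed(sent_b)[1])  # True

# Toy contextual: "bank" differs depending on its neighbor
print(toy_contextual_embed(sent_a)[1])  # [0.75, 0.25]
print(toy_contextual_embed(sent_b)[1])  # [0.25, 0.75]
```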

In [51]:
import gensim.downloader as api

# Download embeddings (66MB, glove, trained on wikipedia, vector size: 50)
# Other options include "word2vec-google-news-300"
# More options at https://github.com/RaRe-Technologies/gensim-data
model = api.load("glove-wiki-gigaword-50")

In [52]:
model.most_similar([model['king']], topn=11)

[('king', 1.0000001192092896),
 ('prince', 0.8236179351806641),
 ('queen', 0.7839043140411377),
 ('ii', 0.7746230363845825),
 ('emperor', 0.7736247777938843),
 ('son', 0.766719400882721),
 ('uncle', 0.7627150416374207),
 ('kingdom', 0.7542161345481873),
 ('throne', 0.7539914846420288),
 ('brother', 0.7492411136627197),
 ('ruler', 0.7434253692626953)]

# Recommending songs by embeddings

In [53]:
import pandas as pd
from urllib import request

# Get the playlist dataset file
data = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt')

# Parse the playlist dataset file. Skip the first two lines as
# they only contain metadata
lines = data.read().decode("utf-8").split('\n')[2:]

# Remove playlists with only one song
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]

# Load song metadata
songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')
songs_file = songs_file.read().decode("utf-8").split('\n')
songs = [s.rstrip().split('\t') for s in songs_file]
songs_df = pd.DataFrame(data=songs, columns = ['id', 'title', 'artist'])
songs_df = songs_df.set_index('id')

In [54]:
songs_df.head()

| id | title | artist |
|---|---|---|
| 0 | Gucci Time (w\/ Swizz Beatz) | Gucci Mane |
| 1 | Aston Martin Music (w\/ Drake & Chrisette Mich... | Rick Ross |
| 2 | Get Back Up (w\/ Chris Brown) | T.I. |
| 3 | Hot Toddy (w\/ Jay-Z & Ester Dean) | Usher |
| 4 | Whip My Hair | Willow |


In [55]:
print( 'Playlist #1:\n ', playlists[0], '\n')
print( 'Playlist #2:\n ', playlists[1])

Playlist #1:
  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43'] 

Playlist #2:
  ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62', '114', '115', '84', '116', '117',

In [56]:
from gensim.models import Word2Vec

# Train our Word2Vec model
model = Word2Vec(
    playlists, vector_size=32, window=20, negative=50, min_count=1, workers=4
)

In [57]:
model

<gensim.models.word2vec.Word2Vec at 0x791b443cbeb0>
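
Word2Vec treats each playlist as a "sentence" and each song ID as a "word", so songs that appear near each other in playlists end up with similar vectors. The `window` parameter controls how far apart two songs can be and still count as context for each other. The sketch below only enumerates those (center, context) pairs. `context_pairs` is a hypothetical helper, and the actual training objective (CBOW or skip-gram, negative sampling, and so on) is not shown.

```python
def context_pairs(sequence, window):
    """Return (center, context) pairs for items at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(sequence):
        start = max(0, i - window)
        end = min(len(sequence), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs

toy_playlist = ["0", "1", "2", "3"]
print(context_pairs(toy_playlist, window=1))
# [('0', '1'), ('1', '0'), ('1', '2'), ('2', '1'), ('2', '3'), ('3', '2')]
```

With `window=20`, as in the training call above, nearly every song in a short playlist is context for every other song, so the model mostly learns "these songs share playlists" rather than strict ordering.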

In [58]:
song_id = 2172 # queried song which is to be matched with all the songs

# Ask the model for songs similar to song #2172
model.wv.most_similar(positive=str(song_id))

[('2849', 0.9977474212646484),
 ('1954', 0.9963516592979431),
 ('6641', 0.9961868524551392),
 ('1922', 0.9961240887641907),
 ('3167', 0.9958761930465698),
 ('6626', 0.9956526756286621),
 ('5586', 0.9956172704696655),
 ('3094', 0.9952870607376099),
 ('2014', 0.9949393272399902),
 ('10105', 0.994709849357605)]

In [59]:
print(songs_df.iloc[2172])

title     Fade To Black
artist        Metallica
Name: 2172 , dtype: object


In [60]:
import numpy as np

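def print_recommendations(song_id):
    # most_similar returns (song_id, score) pairs; keep only the IDs as integers
    similar_songs = [
        int(sid) for sid, _ in model.wv.most_similar(positive=str(song_id), topn=5)
    ]
    # Song IDs match row positions in songs_df, so they can be used with iloc
    return songs_df.iloc[similar_songs]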

# Extract recommendations
print_recommendations(2172)

| id | title | artist |
|---|---|---|
| 2849 | Run To The Hills | Iron Maiden |
| 1954 | The Number Of The Beast | Iron Maiden |
| 6641 | Shout At The Devil | Motley Crue |
| 1922 | One | Metallica |
| 3167 | Unchained | Van Halen |


In [62]:
print_recommendations(842)

| id | title | artist |
|---|---|---|
| 27081 | Give Me Everything (w\/ Ne-Yo, Afrojack & Nayer) | Pitbull |
| 5698 | Turnin' Me On (w\/ Lil Wayne) | Keri Hilson |
| 63 | Love The Way You Lie (w\/ Rihanna) | Eminem |
| 5668 | How We Do (w\/ 50 Cent) | The Game |
| 453 | Temperature | Sean Paul |


# What is the difference between tokenization, tokens, and token embeddings?


Let me break down these related but distinct concepts with examples:

1. **Tokenization** - The process of breaking text into smaller units (tokens):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Example of tokenization
text = "I love AI and ML!"

# Process of tokenization
tokens = tokenizer.tokenize(text)
print(f"Original text: {text}")
print(f"After tokenization: {tokens}")
# Output might be: ['i', 'love', 'ai', 'and', 'ml', '!']

# Different types of tokenization
def show_tokenization_types():
    # 1. Word-based tokenization
    word_tokens = text.split()  # Simple word tokenization
    
    # 2. Subword tokenization (what most modern models use)
    subword_text = "unprecedented antidisestablishmentarianism"
    subword_tokens = tokenizer.tokenize(subword_text)
    
    # 3. Character-based tokenization
    char_tokens = list(text)
    
    return {
        "word_tokens": word_tokens,
        "subword_tokens": subword_tokens,
        "char_tokens": char_tokens
    }
```

2. **Tokens** - The actual units/pieces after tokenization:

```python
# Example showing different types of tokens
def explain_tokens():
    text = "AI is awesome! 🤖"
    
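    # Get tokens and their IDs
    tokens = tokenizer.tokenize(text)
    # encode() also adds the special tokens ([CLS] ... [SEP]), so
    # token_ids is two entries longer than tokens
    token_ids = tokenizer.encode(text)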
    
    # Special tokens
    special_tokens = {
        "PAD": tokenizer.pad_token,
        "UNK": tokenizer.unk_token,
        "CLS": tokenizer.cls_token,
        "SEP": tokenizer.sep_token
    }
    
    # Visualize different token types
    print("Regular tokens:", tokens)
    print("Token IDs:", token_ids)
    print("Special tokens:", special_tokens)

# Example of token manipulation
def token_operations():
    # Convert tokens to IDs and back
    text = "Machine learning is fascinating!"
    
    # Tokenization
    tokens = tokenizer.tokenize(text)
    
    # Convert to IDs
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    
    # Convert back to tokens
    recovered_tokens = tokenizer.convert_ids_to_tokens(token_ids)
    
    return {
        "original_tokens": tokens,
        "token_ids": token_ids,
        "recovered_tokens": recovered_tokens
    }
```

3. **Token Embeddings** - The vector representations of tokens:

```python
from transformers import AutoModel
import torch

def explain_token_embeddings():
    # Load model and tokenizer
    model = AutoModel.from_pretrained("bert-base-uncased")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    
    text = "AI is amazing!"
    
    # Get tokens
    inputs = tokenizer(text, return_tensors="pt")
    
    # Get embeddings
    with torch.no_grad():
        outputs = model(**inputs)
    
    # The last hidden state holds one contextual embedding per token,
    # with shape (batch_size, sequence_length, hidden_size)
    token_embeddings = outputs.last_hidden_state
    
    return {
        "embedding_shape": token_embeddings.shape,
        "sample_embedding": token_embeddings[0][0][:5]  # First 5 dimensions of the first token ([CLS])
    }

# Practical example showing all three concepts together
class TokenizationDemo:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        self.model = AutoModel.from_pretrained("bert-base-uncased")
    
    def process_text(self, text):
        # 1. Tokenization (process)
        tokens = self.tokenizer.tokenize(text)
        
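        # 2. Tokens (units) and their vocabulary IDs (no special tokens here)
        token_ids = self.tokenizer.convert_tokens_to_ids(tokens)
        
        # 3. Token Embeddings (vector representations)
        # Calling the tokenizer adds [CLS] and [SEP], so the sequence
        # dimension of the embeddings is len(tokens) + 2
        inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)
        embeddings = outputs.last_hidden_state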
        
        return {
            "tokenization_result": tokens,
            "token_ids": token_ids,
            "embedding_shape": embeddings.shape
        }

# Usage example
demo = TokenizationDemo()
result = demo.process_text("Machine learning is amazing!")
```
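
To make the "subword tokenization" step less of a black box, here is a rough sketch of greedy longest-match-first splitting, in the spirit of the WordPiece scheme used by BERT. The vocabulary is made up, `wordpiece_tokenize` is a hypothetical helper, and real tokenizers also normalize the text and split on whitespace and punctuation first, so treat this as an approximation rather than the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Made-up vocabulary, for illustration only
toy_vocab = {"un", "##pre", "##ced", "##ented", "token", "##ization", "##s"}

print(wordpiece_tokenize("unprecedented", toy_vocab))  # ['un', '##pre', '##ced', '##ented']
print(wordpiece_tokenize("tokenizations", toy_vocab))  # ['token', '##ization', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))            # ['[UNK]']
```

The `##` prefix marks a piece that continues a word, which is how the original word boundaries can be recovered from the token list.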

Visual Comparison:

```python
def visual_token_comparison():
    text = "I love machine learning!"
    
    # 1. Tokenization (Process)
    print("1. Tokenization (Process):")
    print("Input text ──> Tokenizer ──> Individual tokens")
    print(f"'{text}' ──> [tokenizer] ──> {tokenizer.tokenize(text)}")
    
    # 2. Tokens (Units)
    print("\n2. Tokens (Units):")
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    print("Tokens:", tokens)
    print("Token IDs:", token_ids)
    
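    # 3. Token Embeddings (Vectors)
    print("\n3. Token Embeddings (Vectors):")
    # Load the encoder here so the function does not depend on a global `model`
    model = AutoModel.from_pretrained("bert-base-uncased")
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state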
    print(f"Shape of embeddings: {embeddings.shape}")
    print("Each token has a vector representation of size:", embeddings.shape[-1])

# Key differences summarized:
differences = {
    "Tokenization": {
        "What": "Process of splitting text",
        "Example": "'hello world' → ['hello', 'world']",
        "Purpose": "Break text into processable units"
    },
    "Tokens": {
        "What": "Individual units after tokenization",
        "Example": "['hello', 'world', '!']",
        "Purpose": "Discrete units for model processing"
    },
    "Token Embeddings": {
        "What": "Vector representations of tokens",
        "Example": "token 'hello' → [0.1, -0.3, 0.5, ...]",
        "Purpose": "Mathematical representations for model computations"
    }
}
```
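
The step from tokens to token embeddings is, at its simplest, two lookups: token to ID through the vocabulary, then ID to a row of an embedding matrix. The sketch below shows that with a made-up four-entry vocabulary and made-up 3-d vectors. The names (`toy_token_to_id`, `toy_embedding_matrix`, and the two helpers) are hypothetical, and in a real model the matrix values are learned during training.

```python
# Made-up vocabulary: token -> integer ID
toy_token_to_id = {"[UNK]": 0, "hello": 1, "world": 2, "!": 3}

# One row per vocabulary entry; values are made up, real ones are learned
toy_embedding_matrix = [
    [0.0, 0.0, 0.0],   # [UNK]
    [0.1, -0.3, 0.5],  # hello
    [0.7, 0.2, -0.1],  # world
    [0.0, 0.9, 0.4],   # !
]

def tokens_to_ids(tokens, vocab):
    """Map tokens to integer IDs, falling back to the [UNK] ID."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

def ids_to_embeddings(token_ids, matrix):
    """Look up one embedding row per token ID."""
    return [matrix[i] for i in token_ids]

toy_tokens = ["hello", "world", "!"]
toy_ids = tokens_to_ids(toy_tokens, toy_token_to_id)
toy_vectors = ids_to_embeddings(toy_ids, toy_embedding_matrix)

print(toy_ids)         # [1, 2, 3]
print(toy_vectors[0])  # [0.1, -0.3, 0.5]
```

This lookup gives the static input embeddings. The contextual embeddings returned in `last_hidden_state` are what the model produces after passing these rows through its layers.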

Practical Differences:

1. **Tokenization**:
   - Is a process/action
   - Converts raw text into tokens
   - Different models may use different tokenization strategies
   - Examples: WordPiece (BERT), BPE (GPT), SentencePiece (T5)

2. **Tokens**:
   - Are the output units of tokenization
   - Can be words, subwords, or characters
   - Have corresponding IDs in the vocabulary
   - Include special tokens like [CLS], [SEP], [PAD]

3. **Token Embeddings**:
   - Are numerical representations (vectors)
   - Capture semantic meaning
   - Used by the model for computations
   - Can be static or contextual

This understanding is crucial when working with transformer models and text processing tasks!

# Future improvements:

1. Run the code from the LLM answers to the questions above.
1. Make a separate section for working through that code.
1. Add some simple research on token embeddings, word embeddings, and text embeddings from Aman's blog or other blogs.
1. Answer the questions:
   1. GloVe provides static word embeddings and SentenceTransformers provides text embeddings built from contextual token embeddings, but how do you create these for your custom data?
1. Aha moments:
   1. You can use contextual embeddings to solve the interview questions.