In [7]:
from tensorflow.keras.preprocessing.text import Tokenizer
from transformers import LlamaTokenizer, AutoTokenizer

In [2]:
# 1. Define the input text
data = [
  "The earth is spherical.",
  "The earth is a planet."
]

# 2. Initialize Tokenizer (simulating the vocabulary build)
tokenizer = Tokenizer(num_words=15, lower=True, split=' ')
tokenizer.fit_on_texts(data)

# 3. Convert text to Sequence of Integers (Token IDs)
ID_sequences = tokenizer.texts_to_sequences(data)

# Output the dictionary (Vocabulary) and the IDs
print("ID dictionary:", tokenizer.word_index)
print("ID sequences:", ID_sequences)

ID dictionary: {'the': 1, 'earth': 2, 'is': 3, 'spherical': 4, 'a': 5, 'planet': 6}
ID sequences: [[1, 2, 3, 4], [1, 2, 3, 5, 6]]
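
To make the vocabulary build concrete, here is a minimal pure-Python sketch that approximates what `fit_on_texts` and `texts_to_sequences` do for this input. It is a simplification, not the Keras implementation: the helper names `build_word_index` and `to_sequences` are hypothetical, all ASCII punctuation is replaced by spaces (Keras uses its own filter set), and `num_words` and out-of-vocabulary handling are ignored.

```python
from collections import Counter
import string

# Map every ASCII punctuation character to a space (assumed simplification)
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def split_words(text):
    """Lowercase the text, drop punctuation, and split on whitespace."""
    return text.lower().translate(PUNCT_TO_SPACE).split()

def build_word_index(texts):
    """Assign IDs starting at 1, most frequent word first."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: idx for idx, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index):
    """Replace each known word with its integer ID."""
    return [[word_index[w] for w in split_words(t) if w in word_index] for t in texts]

data = [
    "The earth is spherical.",
    "The earth is a planet."
]

word_index = build_word_index(data)
sequences = to_sequences(data, word_index)
print("ID dictionary:", word_index)
print("ID sequences:", sequences)
```

For the two sentences above, this sketch yields the same dictionary and sequences as the Keras output.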


In [5]:
# 1. Load a pre-trained tokenizer
tokenizer = LlamaTokenizer.from_pretrained('openlm-research/open_llama_3b_v2')

# 2. Define the input string
prompt = 'Is the earth spherical?'

# 3. Convert text to tokens (input_ids)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

print(input_ids)

tensor([[    1,  1383,   268,  5701,   618, 28649, 29584]])


In [10]:
# 1. Load a pre-trained tokenizer with a larger vocabulary (GPT-2)
tokenizer_larger = AutoTokenizer.from_pretrained('gpt2')

# 2. Define the input string
prompt = 'Is the earth spherical?'

# 3. Convert text to tokens (input_ids)
input_ids = tokenizer_larger(prompt, return_tensors="pt").input_ids

print(input_ids)

tensor([[ 3792,   262,  4534, 43180,    30]])
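
The two printed tensors can be compared directly. The sketch below copies the IDs from the outputs above and computes how many characters of the prompt each token covers on average. The helper name `chars_per_token` is hypothetical, and the OpenLLaMA count includes the leading ID `1`, which is that tokenizer's beginning-of-sequence marker.

```python
prompt = 'Is the earth spherical?'

# Token IDs copied from the two printed tensors above
open_llama_ids = [1, 1383, 268, 5701, 618, 28649, 29584]
gpt2_ids = [3792, 262, 4534, 43180, 30]

def chars_per_token(text, ids):
    """Average number of characters of `text` covered by one token ID."""
    return len(text) / len(ids)

for name, ids in [("open_llama_3b_v2", open_llama_ids), ("gpt2", gpt2_ids)]:
    ratio = chars_per_token(prompt, ids)
    print(f"{name}: {len(ids)} tokens, {ratio:.2f} characters per token")
```

A higher characters-per-token figure means the tokenizer compresses this particular prompt into a shorter sequence. A single sentence is only an illustration, not a benchmark.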


### ***Observations:***

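• *Case Sensitivity:* Subword tokenizers such as the two above are case-sensitive, so "Hello" and "hello" map to different tokens.
• *Space Handling:* Whitespace is part of the token. "hello world" typically splits into "hello" and " world" (the leading space belongs to the second token), so "hello world" and "hello  world" (two spaces) produce different token sequences.
• *Vocabulary Size:* Vocabulary size trades off against sequence length. A larger vocabulary (e.g., 100k symbols) encodes the same text in a shorter sequence of integers. In the runs above, GPT-2 needs 5 IDs for the prompt, while OpenLLaMA, with its smaller vocabulary, needs 7 (including the leading beginning-of-sequence ID 1).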
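
The three observations can be reproduced with a toy tokenizer. The sketch below uses greedy longest-match over a hand-written vocabulary. This is an assumption made for illustration, not the byte-pair or SentencePiece algorithms the real tokenizers use, and the function `greedy_tokenize` and both vocabularies are hypothetical.

```python
import string

def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No entry matches: emit the single character as its own token
            tokens.append(text[i])
            i += 1
    return tokens

# Small vocabulary: single characters only
char_vocab = set(string.ascii_letters + " ?")
# Larger vocabulary: the characters plus a few multi-character pieces
word_vocab = char_vocab | {"hello", " world", "Is", " the", " earth", " spher", "ical"}

# Case sensitivity: "Hello" is not in the vocabulary, so it falls back to characters
print(greedy_tokenize("hello world", word_vocab))
print(greedy_tokenize("Hello world", word_vocab))

# Space handling: the extra space becomes a token of its own
print(greedy_tokenize("hello  world", word_vocab))

# Vocabulary size: the larger vocabulary yields a shorter sequence
prompt = "Is the earth spherical?"
print(len(greedy_tokenize(prompt, char_vocab)), len(greedy_tokenize(prompt, word_vocab)))
```

With the character-only vocabulary the prompt needs one token per character, while the larger vocabulary covers it in a handful of pieces.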

### **For comparison, let us do legacy tokenization**

In [14]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import string

# Download the necessary NLTK data (only if it is missing)
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

# Newer NLTK releases load the word_tokenize models from 'punkt_tab'.
# That package contains no english.pickle, so check for its 'english' directory instead.
try:
    nltk.data.find('tokenizers/punkt_tab/english/')
except LookupError:
    nltk.download('punkt_tab')

try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')

# 1. Define the input text
text = 'Is the earth spherical?'

print(f"Original text: '{text}'")

# 2. Tokenization (splitting into words)
tokens = word_tokenize(text)
print(f"\nTokens: {tokens}")

# 3. Lowercasing
lower_tokens = [word.lower() for word in tokens]
print(f"Lowercase tokens: {lower_tokens}")

# 4. Remove punctuation
punctuation_free_tokens = [word for word in lower_tokens if word not in string.punctuation]
print(f"Punctuation-free tokens: {punctuation_free_tokens}")

# 5. Remove stop words
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in punctuation_free_tokens if word not in stop_words]
print(f"Stop-word free tokens: {filtered_tokens}")

# 6. Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(f"Lemmatized tokens: {lemmatized_tokens}")

Original text: 'Is the earth spherical?'

Tokens: ['Is', 'the', 'earth', 'spherical', '?']
Lowercase tokens: ['is', 'the', 'earth', 'spherical', '?']
Punctuation-free tokens: ['is', 'the', 'earth', 'spherical']
Stop-word free tokens: ['earth', 'spherical']
Lemmatized tokens: ['earth', 'spherical']


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### ***Observations:***

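The legacy tokenization pipeline takes far more code, and it is lossy: lowercasing, punctuation removal, and stop-word removal reduce 'Is the earth spherical?' to ['earth', 'spherical'], so the original question cannot be reconstructed. That makes the output unsuitable for GenAI models, which need tokens that preserve the full text.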
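
The information loss can be checked with a small round-trip test. The sketch below is only an illustration: the regular expression is a stand-in for a subword tokenizer that keeps leading spaces inside tokens (an assumption, not the GPT-2 or OpenLLaMA algorithm), and the legacy tokens are copied from the output above.

```python
import re

text = 'Is the earth spherical?'

# Toy lossless split: each piece keeps its leading space, punctuation stays
lossless_pieces = re.findall(r' ?\w+|[^\w\s]', text)

# Final result of the legacy pipeline, copied from the output above
legacy_tokens = ['earth', 'spherical']

print(lossless_pieces)
print("Lossless round trip:", ''.join(lossless_pieces) == text)
print("Legacy round trip:  ", ' '.join(legacy_tokens) == text)
```

Joining the lossless pieces restores the prompt exactly, while the legacy tokens have lost the casing, the stop words, and the question mark.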