## tiktoken

In [None]:
pip install tiktoken



In [None]:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("hello world")
print(tokens)
print(enc.decode(tokens))


[15339, 1917]
hello world


## tokenizer

### Phase 1: Prepare Your Data

- Convert all your training text to **UTF-8 bytes**.
- Collect statistics over these byte sequences.

In [None]:
# In Colab
!wget -O tiny_shakespeare.txt \
     https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt


--2025-07-04 18:20:50--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘tiny_shakespeare.txt’


2025-07-04 18:20:51 (28.0 MB/s) - ‘tiny_shakespeare.txt’ saved [1115394/1115394]



1. Load and Read the File

In [None]:
# In a Colab code cell

# Read the entire file into one big string
with open("tiny_shakespeare.txt", "r", encoding="utf-8") as f:
    text_data = f.read()
print(f"Loaded {len(text_data):,} characters.")

Loaded 1,115,394 characters.


2. Convert All Text to UTF‑8 Bytes

In [None]:
# Convert the string to raw bytes
byte_data = text_data.encode("utf-8")
print(f"Total bytes: {len(byte_data):,}")

Total bytes: 1,115,394
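
The character count and the byte count match because every character in this file encodes to a single byte, which is only true of ASCII text. As a small illustration, using a made-up sample string that is not part of the dataset, non-ASCII characters take more than one byte in UTF-8:

```python
# Hypothetical sample string, chosen only to show multi-byte characters.
sample = "na\u00efve caf\u00e9"  # "naïve café"

sample_bytes = sample.encode("utf-8")
print(len(sample))        # number of Unicode code points
print(len(sample_bytes))  # number of UTF-8 bytes (ï and é take two bytes each)
print(list("\u00e9".encode("utf-8")))  # the two byte values that make up é
```

For a corpus like this, a byte-level tokenizer would see more starting tokens than there are characters.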


In [None]:
print("First 100 characters of text_data:")
print(text_data[:100])
print("--------------------------------------")
print("\nFirst 100 bytes of byte_data:")
print(byte_data[:100])

First 100 characters of text_data:
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You
--------------------------------------

First 100 bytes of byte_data:
b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'


If you want to see the integer values of each byte, you can do:

In [None]:
print(list(byte_data[:20]))
# e.g. [70, 105, 114, 115, 116, 32, 67, 105, 116, 105, 122, 101, ...]


[70, 105, 114, 115, 116, 32, 67, 105, 116, 105, 122, 101, 110, 58, 10, 66, 101, 102, 111, 114]


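If you want to see them as actual characters, decode each byte back to a one-character string. A byte outside the ASCII range is not valid UTF-8 on its own, so with `"replace"` it shows up as the replacement character (U+FFFD) instead of raising an error: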

In [None]:
chars = [bytes([b]).decode("utf-8", "replace") for b in byte_data[:20]]
print(chars)


['F', 'i', 'r', 's', 't', ' ', 'C', 'i', 't', 'i', 'z', 'e', 'n', ':', '\n', 'B', 'e', 'f', 'o', 'r']


3. Count Byte Frequencies

In [None]:
from collections import Counter

byte_freqs = Counter(byte_data)
print("Most common bytes:")
for byte, freq in byte_freqs.most_common(10):
    char = bytes([byte]).decode("utf-8", "replace")
    print(f"  Byte {byte!r} ({char!r}): {freq:,} occurrences")

Most common bytes:
  Byte 32 (' '): 169,892 occurrences
  Byte 101 ('e'): 94,611 occurrences
  Byte 116 ('t'): 67,009 occurrences
  Byte 111 ('o'): 65,798 occurrences
  Byte 97 ('a'): 55,507 occurrences
  Byte 104 ('h'): 51,310 occurrences
  Byte 115 ('s'): 49,696 occurrences
  Byte 114 ('r'): 48,889 occurrences
  Byte 110 ('n'): 48,529 occurrences
  Byte 105 ('i'): 45,537 occurrences


4. Inspect the Distribution

In [None]:
total = sum(byte_freqs.values())
print("\nTop byte frequencies (% of total):")
for byte, freq in byte_freqs.most_common(10):
    pct = freq / total * 100
    char = bytes([byte]).decode("utf-8", "replace")
    print(f"  {char!r} (byte {byte}): {pct:.2f}%")



Top byte frequencies (% of total):
  ' ' (byte 32): 15.23%
  'e' (byte 101): 8.48%
  't' (byte 116): 6.01%
  'o' (byte 111): 5.90%
  'a' (byte 97): 4.98%
  'h' (byte 104): 4.60%
  's' (byte 115): 4.46%
  'r' (byte 114): 4.38%
  'n' (byte 110): 4.35%
  'i' (byte 105): 4.08%


### Phase 2: Pre-tokenization (optional)

- Use a regex to chunk text before merging (GPT-2 and later OpenAI tokenizers do this).
- **Or** just split into raw bytes (pure byte-level BPE).

Whitespace & Punctuation Splitting

In [None]:
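!pip install regex
import regex as re  # third-party module; supports \p{L} and \p{N}, unlike the built-in re

def pre_tokenize(text):
    # NOTE: lowercasing is lossy, so decode() cannot restore the original case
    text = text.lower()
    # Split into contractions, letter runs, digit runs, punctuation runs, and whitespace
    tokens = re.findall(
        r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
        text
    )
    return tokens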




In [None]:
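chunks = pre_tokenize(text_data)
byte_chunks = [chunk.encode('utf-8') for chunk in chunks]
# Joining the chunks back together drops the chunk boundaries, so the training loop
# in Phase 3 can still merge across them; only the lowercasing carries over.
byte_data = b''.join(byte_chunks)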


Normalization (Optional)

In [None]:
import unicodedata
norm_text = unicodedata.normalize('NFKC', text_data.lower())


Chunk → Bytes

In [None]:
byte_sequences = [chunk.encode('utf-8') for chunk in chunks]
# Print the count rather than the full list, which is far too large to display
print(f"{len(byte_sequences):,} chunks")

In [None]:
# Inspect a few
for seq in byte_sequences[:10]:
    print(seq)

b'first'
b' citizen'
b':'
b'\n'
b'before'
b' we'
b' proceed'
b' any'
b' further'
b','
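
Once the chunks are joined back into one byte string, pair counts can span two neighbouring chunks (for example the last letter of one word and the space that starts the next). If the goal of pre-tokenization is to keep merges inside chunks, pairs would instead be counted per chunk. A minimal sketch of that idea, using a hypothetical helper `count_pairs_within_chunks` and toy input rather than the Shakespeare data:

```python
from collections import Counter

def count_pairs_within_chunks(byte_chunks):
    """Count adjacent byte pairs without letting a pair span two chunks."""
    pair_freqs = Counter()
    for chunk in byte_chunks:
        ids = list(chunk)
        pair_freqs.update(zip(ids, ids[1:]))
    return pair_freqs

# Toy chunks in the same shape that pre_tokenize produces (leading space kept)
toy_chunks = [b"the", b" tree", b" we"]

within = count_pairs_within_chunks(toy_chunks)
joined = list(b"".join(toy_chunks))
across = Counter(zip(joined, joined[1:]))

e_space = (ord("e"), ord(" "))
print("within chunks:", within[e_space])
print("joined stream:", across[e_space])
```

In this toy case the pair `('e', ' ')` only exists in the joined stream, which is why it can end up as the top merge candidate when training on `byte_data`.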


### Phase 3: Learn BPE Merges

1. Start with a vocabulary of **single bytes** (0–255).
2. Count all **adjacent pairs** of tokens in the dataset.
3. Find the most frequent pair and merge it.
4. Add the merged pair to the vocabulary.
5. Repeat until you hit the target vocab size.
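
The five steps above can be sketched on a tiny input before running them on the full corpus. This is an illustrative toy, not the notebook's training code: the function names (`count_pairs`, `merge_pair`, `train_bpe`) and the sample bytes are assumptions made for the example, and ties between equally frequent pairs are broken by whichever pair was seen first.

```python
from collections import Counter

def count_pairs(tokens):
    """Count every adjacent pair of token IDs."""
    return Counter(zip(tokens, tokens[1:]))

def merge_pair(tokens, pair, new_id):
    """Replace each non-overlapping occurrence of pair with new_id, left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(data, num_merges):
    tokens = list(data)                          # step 1: start from single bytes
    vocab = {i: bytes([i]) for i in range(256)}
    merges = {}
    for step in range(num_merges):
        pairs = count_pairs(tokens)              # step 2: count adjacent pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)         # step 3: most frequent pair
        new_id = 256 + step
        vocab[new_id] = vocab[best[0]] + vocab[best[1]]  # step 4: extend the vocab
        merges[best] = new_id
        tokens = merge_pair(tokens, best, new_id)
    return tokens, vocab, merges                 # step 5: loop ends at the merge budget

toy_tokens, toy_vocab, toy_merges = train_bpe(b"aaabdaaabac", num_merges=3)
print(toy_tokens)      # [258, 100, 258, 97, 99]
print(toy_vocab[258])  # b'aaab'
```

Three merges shrink the 11-byte input to 5 tokens, and joining the vocabulary entries for those tokens gives back the original bytes.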

Step 1: Initialization

In [None]:
from collections import Counter

tokens = list(byte_data)  # Start with raw bytes
vocab = {i: bytes([i]) for i in range(256)}  # Initial vocab
next_token_id = 256  # First new token ID
target_vocab_size = 50257  # Final vocab size


Step 2: Count Adjacent Pairs

In [None]:
pair_freqs = Counter()
for i in range(len(tokens) - 1):
    pair = (tokens[i], tokens[i+1])
    pair_freqs[pair] += 1

# Show the top 10 pairs
for (b1, b2), freq in pair_freqs.most_common(10):
    c1 = bytes([b1]).decode('utf-8', 'replace')
    c2 = bytes([b2]).decode('utf-8', 'replace')
    print(f"Pair ({b1},{b2}) → ('{c1}','{c2}'): {freq:,} occurrences")

Pair (101,32) → ('e',' '): 27,965 occurrences
Pair (116,104) → ('t','h'): 26,047 occurrences
Pair (32,116) → (' ','t'): 24,243 occurrences
Pair (104,101) → ('h','e'): 19,268 occurrences
Pair (116,32) → ('t',' '): 16,508 occurrences
Pair (115,32) → ('s',' '): 15,486 occurrences
Pair (100,32) → ('d',' '): 14,542 occurrences
Pair (44,32) → (',',' '): 14,098 occurrences
Pair (32,97) → (' ','a'): 13,939 occurrences
Pair (111,117) → ('o','u'): 13,078 occurrences


Step 3: Merge Most Frequent Pair

In [None]:
if not pair_freqs:
    print("No more pairs to merge.")
else:
    most_common_pair, freq = pair_freqs.most_common(1)[0]
    b1, b2 = most_common_pair

    new_token = next_token_id
    vocab[new_token] = vocab[b1] + vocab[b2]
    next_token_id += 1


Step 4: Replace Pair in Tokens

In [None]:
new_tokens = []
i = 0
while i < len(tokens):
    if i < len(tokens) - 1 and (tokens[i], tokens[i+1]) == most_common_pair:
        new_tokens.append(new_token)
        i += 2
    else:
        new_tokens.append(tokens[i])
        i += 1
tokens = new_tokens


#### Wrap in a Loop to Build Full Vocabulary

In [None]:
from collections import Counter

tokens = list(byte_data)
vocab = {i: bytes([i]) for i in range(256)}
merges = {}
next_token_id = 256
target_vocab_size = 50257

while len(vocab) < target_vocab_size:
    # Step 1: Count adjacent token pairs
    pair_freqs = Counter()
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i+1])
        pair_freqs[pair] += 1

    if not pair_freqs:
        print("No more pairs to merge.")
        break

    # Step 2: Find most frequent pair
    most_common_pair, freq = pair_freqs.most_common(1)[0]
    b1, b2 = most_common_pair

    # Step 3: Merge and assign new token ID
    new_token = next_token_id
    vocab[new_token] = vocab[b1] + vocab[b2]
    merges[(b1, b2)] = new_token
    next_token_id += 1

    # Step 4: Replace in token stream
    new_tokens = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i+1]) == most_common_pair:
            new_tokens.append(new_token)
            i += 2
        else:
            new_tokens.append(tokens[i])
            i += 1
    tokens = new_tokens

    if next_token_id % 1000 == 0:
        print(f"Vocab size: {len(vocab)} | Token count: {len(tokens)}")


Vocab size: 1000 | Token count: 410177
Vocab size: 2000 | Token count: 331700
Vocab size: 3000 | Token count: 296971
Vocab size: 4000 | Token count: 276174
Vocab size: 5000 | Token count: 261540
Vocab size: 6000 | Token count: 250416
Vocab size: 7000 | Token count: 241612
Vocab size: 8000 | Token count: 234259
Vocab size: 9000 | Token count: 228056
Vocab size: 10000 | Token count: 222734


KeyboardInterrupt: 

Add <|endoftext|> to Your Vocabulary

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Step 1: Choose a Token ID

In [None]:
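END_OF_TEXT_TOKEN = "<|endoftext|>"
# GPT-2 reserves 50256, the last of its 50,257 IDs, for this token. To keep that slot
# free here, stop merging at 50,256 entries (IDs 0 to 50255) rather than at target_vocab_size.
eot_token_id = 50256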


Step 2: Add It to Your Vocab

In [None]:
vocab[eot_token_id] = END_OF_TEXT_TOKEN.encode("utf-8")
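
Assigning straight into `vocab` silently overwrites whatever already sits at that ID, for instance the last merge if training ran all the way to 50,257 entries. One possible safeguard, sketched below with hypothetical helpers (`add_special_token` and `split_on_special` are not part of the notebook), is to refuse an occupied ID and to split incoming text on the special token so it can be mapped to its reserved ID instead of going through BPE:

```python
import re

def add_special_token(vocab, token_text, token_id):
    """Register a special token, refusing to overwrite an existing ID."""
    if token_id in vocab:
        raise ValueError(f"ID {token_id} is already used by {vocab[token_id]!r}")
    vocab[token_id] = token_text.encode("utf-8")
    return token_id

def split_on_special(text, special_tokens):
    """Split text so that each special token becomes its own piece."""
    pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
    return [piece for piece in re.split(pattern, text) if piece]

# Toy base vocabulary of single bytes, standing in for a trained vocab
demo_vocab = {i: bytes([i]) for i in range(256)}
eot_id = add_special_token(demo_vocab, "<|endoftext|>", 50256)

pieces = split_on_special("first doc<|endoftext|>second doc", {"<|endoftext|>": eot_id})
print(pieces)  # ['first doc', '<|endoftext|>', 'second doc']
```

An encoder built on this would emit `eot_id` for the special piece and run ordinary BPE on the others.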


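Bonus: Add Lookup Mappings (ID → Bytes and Bytes → ID)

In [None]:
id_to_token = dict(vocab)  # ID → bytes; vocab already maps this way, so this is a copy
token_to_id = {token: i for i, token in vocab.items()}  # bytes → ID, the actual reverse mapping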


### Phase 4: Tokenization Function

- Write a `tokenize(text)` or `encode(text)` function:
    - Pre-tokenize the text, then convert each chunk → UTF-8 bytes.
    - Apply your BPE merges in the order they were learned (lowest new token ID first).
    - Output final token list.

In [None]:
# Defaults are bound when this cell runs; pass vocab/merges explicitly after reloading them
def encode(text, vocab=vocab, merges=merges, special_tokens=None):
    # A text that is exactly a special token maps straight to its reserved ID
    if special_tokens and text in special_tokens:
        return [special_tokens[text]]

    tokens = []
    for chunk in pre_tokenize(text):
        ids = list(chunk.encode('utf-8'))

        # Replay the learned merges on this chunk, earliest merge first
        while len(ids) >= 2:
            # Merge IDs were assigned sequentially during training, so among the
            # adjacent pairs that have a merge rule, the lowest ID was learned first
            candidates = [pair for pair in zip(ids, ids[1:]) if pair in merges]
            if not candidates:
                break  # No learned merge applies to this chunk any more
            best_pair = min(candidates, key=merges.get)

            # Replace every occurrence of best_pair with its merged token ID
            new_id = merges[best_pair]
            new_ids = []
            i = 0
            while i < len(ids):
                if i < len(ids) - 1 and (ids[i], ids[i+1]) == best_pair:
                    new_ids.append(new_id)
                    i += 2
                else:
                    new_ids.append(ids[i])
                    i += 1
            ids = new_ids

        tokens.extend(ids)

    return tokens

### Phase 5: Decoding Function

- Map tokens → bytes → convert back to text (UTF-8 decode).

In [None]:
def decode(tokens, vocab=vocab):
    """
    Decodes a list of token IDs into a UTF-8 string.

    Args:
        tokens (List[int]): List of token IDs.
        vocab (dict): Token ID → byte sequence.

    Returns:
        str: The decoded string.
    """
    byte_stream = b''.join([vocab[token] for token in tokens])
    return byte_stream.decode('utf-8', errors='replace')
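
The function joins all token bytes before decoding because a single token is not guaranteed to hold a complete UTF-8 character. A small illustration with a hypothetical three-token table (the IDs and byte strings are made up for the example):

```python
# Hypothetical token table: the two bytes of "é" (b"\xc3\xa9") sit in separate tokens
demo_vocab = {0: b"caf", 1: b"\xc3", 2: b"\xa9"}
demo_tokens = [0, 1, 2]

# Decoding token by token breaks the character apart
piecewise = "".join(
    demo_vocab[t].decode("utf-8", errors="replace") for t in demo_tokens
)

# Joining the bytes first keeps the character intact
joined = b"".join(demo_vocab[t] for t in demo_tokens).decode("utf-8", errors="replace")

print(repr(piecewise))  # 'caf' followed by two U+FFFD replacement characters
print(repr(joined))     # 'café'
```

`errors='replace'` still matters after joining: an arbitrary token list, such as one sampled from a model, can contain byte sequences that are not valid UTF-8 at all.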


# Test the Tokenizer End-to-End

In [None]:
text = "Hello, world!"
tokens = encode(text)
decoded = decode(tokens)

print("Original:", text)
print("Tokens:", tokens)
print("Decoded:", decoded)


Original: Hello, world!
Tokens: [1359, 275, 111, 44, 3483, 114, 108, 100, 33]
Decoded: hello, world!


# Save the Vocabulary

In [None]:
import json

# Convert tuple keys in merges to strings for JSON serialization
merges_serializable = {str(k): v for k, v in merges.items()}
with open("merges_dict.json", "w") as f:
    json.dump(merges_serializable, f)
print("merges_dict.json saved.")

# Convert byte values in vocab to list of integers for JSON serialization
vocab_serializable = {k: list(v) for k, v in vocab.items()}
with open("vocab.json", "w") as f:
    json.dump(vocab_serializable, f)
print("vocab.json saved.")

merges_dict.json saved.
vocab.json saved.
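
Storing each pair as a stringified tuple works, but the loader then has to parse strings such as `"(101, 32)"`. An alternative sketch, assuming you are free to pick the file format, saves the merges as a list of `[left, right, new_id]` triples. The file name `merges_list.json` and the toy merge values below are made up for the example:

```python
import json
import os
import tempfile

# Toy merge table standing in for the trained `merges` dict
demo_merges = {(101, 32): 256, (116, 104): 257, (256, 257): 258}

# Save as a list of [left, right, new_id] triples, in merge order
path = os.path.join(tempfile.mkdtemp(), "merges_list.json")
with open(path, "w") as f:
    json.dump([[a, b, new_id] for (a, b), new_id in demo_merges.items()], f)

# Load: each triple maps straight back to a (left, right) -> new_id entry
with open(path, "r") as f:
    restored = {(a, b): new_id for a, b, new_id in json.load(f)}

print(restored == demo_merges)  # True
```

A JSON array also makes the merge order explicit, which matters when merges are replayed during encoding.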


# Load the Vocabulary

In [None]:
import json

# Load vocabulary
with open("vocab.json", "r") as f:
    loaded_vocab_str_keys = json.load(f)
    # Convert string keys back to integers and list values back to bytes
    loaded_vocab = {int(k): bytes(v) for k, v in loaded_vocab_str_keys.items()}

print(f"Loaded vocabulary with {len(loaded_vocab)} tokens.")

# Merges were saved to merges_dict.json (no merges.txt was ever written),
# so they are loaded from that file in the next cell.

In [None]:
import json

# merges_dict.json stores each pair as a string key such as "(101, 32)",
# because JSON object keys must be strings. Convert the keys back to tuples on load.
try:
    with open("merges_dict.json", "r") as f:
        loaded_merges_str_keys = json.load(f)
        # Convert string keys "(b1, b2)" back to tuple (int, int)
        loaded_merges = {}
        for key_str, new_id in loaded_merges_str_keys.items():
            # Assuming key_str is like "(b1, b2)"
            b1_str, b2_str = key_str.strip("()").split(", ")
            loaded_merges[(int(b1_str), int(b2_str))] = new_id

    print(f"Loaded {len(loaded_merges)} merge rules.")

except FileNotFoundError:
    print("merges_dict.json not found. Please ensure you have saved the merges dictionary.")
    loaded_merges = {} # Initialize as empty if file not found

# Now you can use loaded_vocab and loaded_merges to continue training or tokenize.

# Example of using the loaded tokenizer (assuming you have encode and decode functions)
# text_to_tokenize = "This is a test sentence."
# tokens = encode(text_to_tokenize, loaded_vocab, loaded_merges)
# decoded_text = decode(tokens, loaded_vocab)
# print("Original:", text_to_tokenize)
# print("Tokens:", tokens)
# print("Decoded:", decoded_text)

### Steps to Further Train the Tokenizer with New Data

To incrementally train your tokenizer with additional data, you need to load the current state of your vocabulary and merge rules and then continue the BPE training process with the new data.

Here are the general steps:

1.  **Load Existing Tokenizer State:**
    *   Load the saved `vocab` dictionary (from `vocab.json`).
    *   Load the saved `merges` dictionary (from `merges_dict.json`).

2.  **Prepare New Data:**
    *   Load your new text data.
    *   Combine the new data with your existing training data if you want to find merges across the entire corpus, or process the new data separately depending on your strategy.
    *   Apply your pre-tokenization function to the new or combined text (it operates on strings, not bytes).
    *   Convert the resulting chunks to UTF-8 bytes to get the initial list of tokens (integer byte values).

3.  **Continue BPE Training:**
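    *   Initialize the BPE training loop (similar to the full training loop in Phase 3) using the `loaded_vocab` and `loaded_merges`.
    *   Set `next_token_id` to one past the highest merged ID, `max(loaded_merges.values()) + 1` (or 256 if there are no merges yet). `max(loaded_vocab.keys()) + 1` overshoots when a special token such as `<|endoftext|>` occupies a reserved high ID.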
    *   Use the pre-tokenized list from the new or combined data as the starting `tokens` for the loop.
    *   The loop will continue to find the most frequent pairs in the current `tokens` and add new merges to your `merges` dictionary and new tokens to your `vocab` dictionary until your desired vocabulary size is reached or another stopping criterion is met.

4.  **Save Updated Tokenizer State:**
    *   After training, save the updated `vocab` and `merges` dictionaries back to their respective files to preserve the progress.

By following these steps, you can incrementally build and refine your tokenizer's vocabulary and merge rules as more data becomes available.

In [None]:
new_text_data = "new text data to train your tokenizer further"

In [None]:
# Step 1: Load Existing Tokenizer State
import json
from collections import Counter

try:
    with open("vocab.json", "r") as f:
        loaded_vocab_str_keys = json.load(f)
        loaded_vocab = {int(k): bytes(v) for k, v in loaded_vocab_str_keys.items()}
    print(f"Loaded vocabulary with {len(loaded_vocab)} tokens.")

    with open("merges_dict.json", "r") as f:
        loaded_merges_str_keys = json.load(f)
        loaded_merges = {}
        for key_str, new_id in loaded_merges_str_keys.items():
            b1_str, b2_str = key_str.strip("()").split(", ")
            loaded_merges[(int(b1_str), int(b2_str))] = new_id
    print(f"Loaded {len(loaded_merges)} merge rules.")

except FileNotFoundError:
    print("Could not find vocab.json or merges_dict.json. Starting with a fresh tokenizer.")
    loaded_vocab = {i: bytes([i]) for i in range(256)}
    loaded_merges = {}

# Initialize vocab and merges with the loaded state
vocab = loaded_vocab
merges = loaded_merges

# Step 2: Prepare New Data (and combine with old data for comprehensive training)
# text_data is the original training text; new_text_data is the additional text defined above
combined_text_data = text_data + new_text_data

# Pre-tokenize the combined text, then flatten the chunks into one list of byte values.
# These raw bytes are the starting token IDs; no merges have been applied yet.
combined_chunks = pre_tokenize(combined_text_data)
initial_tokens = []
for chunk in combined_chunks:
    initial_tokens.extend(chunk.encode('utf-8'))

# Initialize tokens for the BPE loop with the initial tokens from combined data
tokens = initial_tokens

# Step 3: Continue BPE Training
# Merge IDs are assigned sequentially from 256, so continue after the highest merged ID.
# (max(vocab.keys()) + 1 would overshoot if a special token such as <|endoftext|>
# already occupies a high reserved ID like 50256.)
next_token_id = max(merges.values(), default=255) + 1

current_vocab_size = len(vocab)
print(f"Starting training from vocab size: {current_vocab_size}")

# Reuse target_vocab_size from the Phase 3 training loop if this session still has it;
# otherwise fall back to the same value. Raise it to leave room for merges from the new data.
if "target_vocab_size" not in globals():
    target_vocab_size = 50257
while len(vocab) < target_vocab_size:
    # Step 1: Count adjacent token pairs in the current 'tokens' stream
    pair_freqs = Counter()
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i+1])
        pair_freqs[pair] += 1

    # Step 2: Pick the most frequent adjacent pair
    if not pair_freqs:
        print("No more pairs to merge.")
        break
    best_pair, max_freq = pair_freqs.most_common(1)[0]

    if max_freq < 2:  # A pair seen only once is not worth a new token
        print("No more pairs to merge with frequency >= 2.")
        break

    # Step 3: Merge and assign new token ID if this pair hasn't been merged before
    if best_pair not in merges:
        new_token = next_token_id
        vocab[new_token] = vocab[best_pair[0]] + vocab[best_pair[1]]
        merges[best_pair] = new_token
        next_token_id += 1
    else:
         # This pair has been merged before, get the existing new token ID
         new_token = merges[best_pair]


    # Step 4: Replace in token stream
    new_tokens = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i+1]) == best_pair:
            new_tokens.append(new_token)
            i += 2
        else:
            new_tokens.append(tokens[i])
            i += 1
    tokens = new_tokens

    if len(vocab) % 1000 == 0:
        print(f"Vocab size: {len(vocab):,} | Token count: {len(tokens):,}")
    if len(vocab) >= target_vocab_size:
         print(f"Reached target vocab size: {target_vocab_size:,}")
         break


print(f"Final vocab size after training: {len(vocab):,}")
print(f"Final token count after training: {len(tokens):,}")


# Step 4: Save Updated Tokenizer State
# Convert tuple keys in merges to strings for JSON serialization
merges_serializable = {str(k): v for k, v in merges.items()}
with open("merges_dict.json", "w") as f:
    json.dump(merges_serializable, f)
print("Updated merges_dict.json saved.")

# Convert byte values in vocab to list of integers for JSON serialization
vocab_serializable = {k: list(v) for k, v in vocab.items()}
with open("vocab.json", "w") as f:
    json.dump(vocab_serializable, f)
print("Updated vocab.json saved.")
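
The loop above restarts from raw bytes and re-selects pairs by their frequency in the combined corpus, reusing an existing ID whenever the chosen pair was merged before. An alternative, sketched here under the assumption that merge IDs were assigned in increasing order during training, is to replay the existing merges in that order first and only then look for new pairs. `replay_merges` and the toy merge table are hypothetical:

```python
def replay_merges(tokens, merges):
    """Apply previously learned merges to a token list, oldest merge first."""
    # Sorting by the merged token ID recovers the order the merges were learned in
    for pair, new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        out, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# Toy merge table: 't'+'h' -> 256, then 256+'e' -> 257
demo_merges = {(116, 104): 256, (256, 101): 257}
demo_tokens = replay_merges(list(b"the then"), demo_merges)
print(demo_tokens)  # [257, 32, 257, 110]
```

After the replay, the token stream is segmented the way the existing tokenizer would segment it, so any pair counted next is a candidate for a new merge.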

# Problems

There is no single established term for this problem in BPE training, where pairs that become frequent in new data are never merged because the vocabulary size limit has already been reached. It relates to broader challenges in machine learning:

- **Incremental / online learning:** training a model (or a component such as a tokenizer) on new data over time.
- **Data drift:** the distribution of incoming data changes relative to the data the system was first trained on. Here, the new data has different pair-frequency patterns.
- **Fixed vocabulary limitations:** with a predefined maximum vocabulary size, rare or newly appearing patterns must be represented as sequences of existing tokens rather than merged into new, more efficient ones.

It could be described as "vocabulary adaptation in incremental BPE training under data drift" or "handling emergent frequent patterns with a fixed-size BPE vocabulary".

The remedy noted in the incremental training steps is to raise the target vocabulary size when new data introduces new frequent patterns.
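
One rough way to notice this kind of drift is to compare how many bytes each token covers on old versus new text: a drop suggests the new text falls back to short tokens. The sketch below is only an illustration. `bytes_per_token` is a hypothetical helper, and `toy_encode` is a stand-in tokenizer that knows a single merged token, not the `encode` function from Phase 4:

```python
def bytes_per_token(text, encode_fn):
    """Average number of UTF-8 bytes represented by each token."""
    n_tokens = len(encode_fn(text))
    return len(text.encode("utf-8")) / n_tokens if n_tokens else 0.0

def toy_encode(text):
    """Stand-in tokenizer: one merged token (256) for b"the", raw bytes otherwise."""
    data, out, i = text.encode("utf-8"), [], 0
    while i < len(data):
        if data[i:i + 3] == b"the":
            out.append(256)
            i += 3
        else:
            out.append(data[i])
            i += 1
    return out

old_ratio = bytes_per_token("the the the", toy_encode)  # text the tokenizer has merges for
new_ratio = bytes_per_token("xyz qrs uvw", toy_encode)  # text with unseen patterns
print(f"old: {old_ratio:.2f} bytes/token, new: {new_ratio:.2f} bytes/token")
```

With a real tokenizer, the same comparison could be run on a held-out sample of the original corpus and a sample of the new data.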