# Introduction

This notebook serves as a guide for answering all the research questions proposed in the thesis document.

We will go over the following questions:
1. Section *1.2. Objectives*
    1. Is it possible for an English-trained model to achieve comparable performance in European Portuguese by strategically modifying the tokenizer?
    1. Can tokenizer adaptation accelerate the training process for new language adaptation?
    1. How much can inference efficiency be improved by adding language-specific tokens to the model’s vocabulary?
    1. What embedding initialization strategies are most effective for integrating new tokens into a pre-trained model?
1. Section *4. Exploratory Analysis and Design Rationale* -> *4.1. Research Questions*
    1. What impact does tokenizer adaptation have on model performance for European Portuguese? Does it achieve comparable performance to the baseline model?
    1. How does tokenizer adaptation affect token efficiency (as measured by the Fertility metric) for Portuguese text processing?
    1. Does tokenizer adaptation affect all models equally, or are there differences based on model architecture and size?
    1. What embedding initialization strategies are most effective for integrating new tokens into a pre-trained model?
    1. Does a model lose performance in English after going through the hacking process?


In [1]:
import re
import pandas as pd
import hack_tokenizer
from hack_tokenizer.utils import constants

with open('../data/tokenizer_pt-pt.txt', 'r') as f:
    DATASET_TOKENIZER = f.read()
ENGLISH_ONLY_MODELS = ['HuggingFaceTB/SmolLM2-135M']   # English-only models which we ran the evaluation on
MULTI_LANGUAL_MODELS = ["Qwen/Qwen2.5-1.5B-Instruct", "HuggingFaceTB/SmolLM3-3B"] # Multi-language models which we ran the evaluations on
NUMBER_NEW_TOKENS = constants.NUMBER_NEW_TOKENS
EMBED_INIT_METHOD = constants.EMBED_INIT_METHOD
BATCH_SIZE  = constants.GENERATION_BATCH_SIZE
DEVICE      = constants.DEVICE
# TEMPERATURE = constants.TEMPERATURE
TEMPERATURE = 0.8
TOP_P       = 0.9
TOP_K       = 100

# Section 1.2 - Objectives

Answering all questions in section 1.2

## Research Question 1.2.1
Is it possible for an English-trained model to achieve comparable performance in European Portuguese by strategically modifying the tokenizer?

### Quantitative Analysis

In [None]:
import pandas as pd

df_121 = pd.read_csv('RESULTS_SUMMARY_20250823164827.csv')
df_121 = df_121[['number_new_tokens', 'model', 'model_type', 'MMLU', 'CalamePT', 'SupergluePTPT']]
df_121[['number_new_tokens', 'model_type', 'CalamePT', 'SupergluePTPT']].to_html()
# df_121 = df_121.query('model in @ENGLISH_ONLY_MODELS')
# print(df_121[['model', 'number_new_tokens', 'model_type', 'CalamePT', 'SupergluePTPT']].to_html())   # To add to Markdown below
df_121

### Qualitative Analysis

We prompt an actual LLM with the same prompts in Portuguese and in English, before and after hacking, to see whether its Portuguese output reaches a quality similar to its English output.

In [None]:
PROMPTS = [
    {
        'PT': 'Para calcular a raiz quadrada de um número manualmente,',
        'EN': 'To calculate the square root of a number by hand,'
    },
    {
        'PT': 'Nos dias de hoje, democracia é o sistema politico',
        'EN': 'Nowadays, democracy is the political system'
    },
    {
        'PT': 'O poema seguinte contém várias palavras-chave: azul, borboleta e sol:',
        'EN': 'The following poem contains several keywords: Blue, Butterfly, and Sun:'
    }
]
RESPONSES = []

def obtain_responses(prompt_kwargs):
    responses = []
    for prompt in PROMPTS:
        responses.append({})
        for language in prompt.keys():
            generation = hacker.prompt(content=prompt[language], **prompt_kwargs)
            responses[-1][language] = ''.join(generation)
    return responses

for model_name in ENGLISH_ONLY_MODELS:
    model, tokenizer   = hack_tokenizer.utils.loader.load_model_and_tokenizer(model_name, DEVICE)
    encoding_tokenizer = hack_tokenizer.utils.loader.load_model_and_tokenizer(model_name, DEVICE)[1]
    hacker = hack_tokenizer.hack.ModelHacker(
        dataset=DATASET_TOKENIZER,
        batch_size=BATCH_SIZE
    )
    prompt_kwargs = {
        'model': model,
        'tokenizer': tokenizer,
        'encoding_tokenizer': encoding_tokenizer,
        'max_new_tokens': 50,
        'stop_words': ['<|im_end|>', '<|endoftext|>'],
        'temperature': TEMPERATURE,
        'top_p': TOP_P,
        'top_k': TOP_K,
        'print_response': False
    }

    # Obtain responses BEFORE hacking
    RESPONSES.append({'BASELINE': obtain_responses(prompt_kwargs)})

    model, tokenizer = hacker.hack(
        model, tokenizer,
        encoding_tokenizer,
        num_tokens=NUMBER_NEW_TOKENS,
        embed_initializer_method=EMBED_INIT_METHOD,
        show_progress=True,
        train=False,
    )

    # Go over all the prompts and check responses
    prompt_kwargs.update({'model': model, 'tokenizer': tokenizer})
    RESPONSES[-1]['INITIALIZED_NO_TRAINING'] = obtain_responses(prompt_kwargs)

    # Visually print all prompts
    print(f'{"":=^200s}\n\n{"MODEL: `" + model_name + "`": ^200s}\n\n')
    for n, prompt in enumerate(PROMPTS):
        # Print responses
        for model_type in RESPONSES[-1].keys():
            for lan in RESPONSES[-1][model_type][n].keys():
                # Use regex to remove multiple newlines to a maximum of 2
                response = re.sub(r'\n{3,}', '\n\n', RESPONSES[-1][model_type][n][lan])
                print(f'{model_type}[{lan}]: <PROMPT>{prompt[lan]}</PROMPT> <RESPONSE>{response}</RESPONSE>')

### Answer

The answer to this question is **NO**.
Judging by the results, the model does not get better at Portuguese (`CalamePT` and `SupergluePTPT`) when we only add new tokens and initialize their rows in the embedding table.
One interesting takeaway is that `Qwen/Qwen2.5-1.5B-Instruct` saw a slight drop in both `CalamePT` and `SupergluePTPT` when adding 7500 tokens.
`HuggingFaceTB/SmolLM2-135M`, however, showed a slight increase in `SupergluePTPT` (from 0.014678 to 0.015055), while its `CalamePT` score did not change. Neither of these changes is statistically significant, but further investigation may reveal more details.

<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;"> <th></th> <th>model</th> <th>number_new_tokens</th> <th>model_type</th> <th>CalamePT</th> <th>SupergluePTPT</th> </tr>
</thead>
<tbody>
<tr> <th>0</th> <td>HuggingFaceTB/SmolLM2-135M</td> <td>0</td> <td>BASELINE</td> <td>0.135356</td> <td>0.014678</td> </tr>
<tr> <th>1</th> <td>HuggingFaceTB/SmolLM2-135M</td> <td>1000</td> <td>INITIALIZED_NO_TRAINING</td> <td>0.135356</td> <td>0.014678</td> </tr>
<tr> <th>2</th> <td>HuggingFaceTB/SmolLM2-135M</td> <td>1000</td> <td>INITIALIZED_WITH_TRAINING</td> <td>0.135356</td> <td>0.014678</td> </tr>
<tr> <th>3</th> <td>HuggingFaceTB/SmolLM2-135M</td> <td>5000</td> <td>INITIALIZED_NO_TRAINING</td> <td>0.135356</td> <td>0.015055</td> </tr>
<tr> <th>4</th> <td>HuggingFaceTB/SmolLM2-135M</td> <td>5000</td> <td>INITIALIZED_WITH_TRAINING</td> <td>0.135356</td> <td>0.015055</td> </tr>
<tr> <th>5</th> <td>HuggingFaceTB/SmolLM2-135M</td> <td>7500</td> <td>INITIALIZED_NO_TRAINING</td> <td>0.135356</td> <td>0.015055</td> </tr>
<tr> <th>6</th> <td>HuggingFaceTB/SmolLM2-135M</td> <td>7500</td> <td>INITIALIZED_WITH_TRAINING</td> <td>0.135356</td> <td>0.015055</td> </tr>
<tr> <th>7</th> <td>HuggingFaceTB/SmolLM3-3B</td> <td>0</td> <td>BASELINE</td> <td>0.585260</td> <td>0.496864</td> </tr>
<tr> <th>8</th> <td>HuggingFaceTB/SmolLM3-3B</td> <td>1000</td> <td>INITIALIZED_NO_TRAINING</td> <td>0.585260</td> <td>0.496864</td> </tr>
<tr> <th>9</th> <td>HuggingFaceTB/SmolLM3-3B</td> <td>1000</td> <td>INITIALIZED_WITH_TRAINING</td> <td>0.585260</td> <td>0.496864</td> </tr>
<tr> <th>10</th> <td>HuggingFaceTB/SmolLM3-3B</td> <td>5000</td> <td>INITIALIZED_NO_TRAINING</td> <td>0.585260</td> <td>0.496864</td> </tr>
<tr> <th>11</th> <td>HuggingFaceTB/SmolLM3-3B</td> <td>5000</td> <td>INITIALIZED_WITH_TRAINING</td> <td>0.585260</td> <td>0.496864</td> </tr>
<tr> <th>12</th> <td>HuggingFaceTB/SmolLM3-3B</td> <td>7500</td> <td>INITIALIZED_NO_TRAINING</td> <td>0.585260</td> <td>0.496864</td> </tr>
<tr> <th>13</th> <td>HuggingFaceTB/SmolLM3-3B</td> <td>7500</td> <td>INITIALIZED_WITH_TRAINING</td> <td>0.585260</td> <td>0.496864</td> </tr>
<tr> <th>14</th> <td>Qwen/Qwen2.5-1.5B-Instruct</td> <td>0</td> <td>BASELINE</td> <td>0.496146</td> <td>0.402396</td> </tr>
<tr> <th>15</th> <td>Qwen/Qwen2.5-1.5B-Instruct</td> <td>1000</td> <td>INITIALIZED_NO_TRAINING</td> <td>0.496146</td> <td>0.398570</td> </tr>
<tr> <th>16</th> <td>Qwen/Qwen2.5-1.5B-Instruct</td> <td>1000</td> <td>INITIALIZED_WITH_TRAINING</td> <td>0.496146</td> <td>0.397943</td> </tr>
<tr> <th>17</th> <td>Qwen/Qwen2.5-1.5B-Instruct</td> <td>5000</td> <td>INITIALIZED_NO_TRAINING</td> <td>0.496146</td> <td>0.402835</td> </tr>
<tr> <th>18</th> <td>Qwen/Qwen2.5-1.5B-Instruct</td> <td>5000</td> <td>INITIALIZED_WITH_TRAINING</td> <td>0.496146</td> <td>0.402083</td> </tr>
<tr> <th>19</th> <td>Qwen/Qwen2.5-1.5B-Instruct</td> <td>7500</td> <td>INITIALIZED_NO_TRAINING</td> <td>0.495665</td> <td>0.401393</td> </tr>
<tr> <th>20</th> <td>Qwen/Qwen2.5-1.5B-Instruct</td> <td>7500</td> <td>INITIALIZED_WITH_TRAINING</td> <td>0.495665</td> <td>0.400640</td> </tr>
</tbody>
</table>

Also, looking at the output of the *Qualitative Analysis* cell above, we can see that the quality of the English responses is still higher than that of the Portuguese ones, and this does not seem to change after the new tokens are added.
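
The significance remark in the answer above can be sanity-checked with a rough two-proportion z-test. The sketch below is only an illustration, not the procedure used in the thesis: the benchmark size `n_items = 2000` is a hypothetical placeholder (the real number of evaluation items is not stated in this notebook), and the test treats the two runs as independent samples, whereas a paired test on per-item results would be more appropriate when both runs score the same items.

```python
import math


def two_proportion_z_test(acc_a: float, acc_b: float, n_a: int, n_b: int):
    """Two-sided z-test for the difference between two accuracies (proportions)."""
    pooled = (acc_a * n_a + acc_b * n_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (acc_b - acc_a) / standard_error
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


n_items = 2000  # HYPOTHETICAL benchmark size, for illustration only

# SupergluePTPT for Qwen/Qwen2.5-1.5B-Instruct: baseline vs. 7500 new tokens (no training)
z, p_value = two_proportion_z_test(0.402396, 0.401393, n_items, n_items)
print(f'z = {z:.3f}, p = {p_value:.3f}')
```

Under these assumptions a p-value well above 0.05 would be consistent with the drop being noise; with the true benchmark sizes the numbers will differ.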





## Research Question 1.2.2
Can tokenizer adaptation accelerate the training process for new language adaptation?

### Answer

```
============================================================
                 REMOVE THIS QUESTION
============================================================
```

## Research Question 1.2.3
How much can inference efficiency be improved by adding language-specific tokens to the model’s vocabulary?

In [None]:
import pandas as pd

df_123 = pd.read_csv('RESULTS_SUMMARY_20250823164827.csv')
df_123 = df_123[['number_new_tokens', 'model', 'model_type', 'FertilityOutput', 'FertilityBoost']]
df_123

### Answer

The `FertilityBoost` results show that adding the new tokens can increase generation speed by upwards of `16%`. This increase, however, is mostly seen in the smallest, monolingual model. We have not explored whether it is due to the model being monolingual or to its size (future work: explore bigger monolingual models and small multilingual models).
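
To make the efficiency metric concrete, here is a minimal sketch of how fertility can be computed. It assumes fertility means the average number of tokens per whitespace-separated word, and it reports the relative reduction between two tokenizers. How `FertilityOutput` and `FertilityBoost` are actually computed in `hack_tokenizer` is not shown in this notebook, so the definitions, the `chunk_tokenizer` stand-in, and the sample texts are all illustrative assumptions.

```python
from typing import Callable, Iterable


def fertility(texts: Iterable[str], tokenize: Callable[[str], list[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    return n_tokens / n_words


def chunk_tokenizer(chunk_size: int) -> Callable[[str], list[str]]:
    """Toy stand-in for a subword tokenizer: cuts every word into fixed-size chunks."""
    def tokenize(text: str) -> list[str]:
        return [
            word[i:i + chunk_size]
            for word in text.split()
            for i in range(0, len(word), chunk_size)
        ]
    return tokenize


texts = ['a democracia portuguesa', 'calcular a raiz quadrada']
baseline = fertility(texts, chunk_tokenizer(3))   # shorter tokens, like an English-centric vocabulary
adapted = fertility(texts, chunk_tokenizer(6))    # longer tokens, like a vocabulary with added Portuguese tokens
relative_reduction = 1 - adapted / baseline       # fraction of tokens saved per word

print(f'baseline={baseline:.3f} adapted={adapted:.3f} reduction={relative_reduction:.1%}')
```

A lower fertility means fewer tokens have to be generated for the same text, which is where the speed-up comes from.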


## Research Question 1.2.4
What embedding initialization strategies are most effective for integrating new tokens into a pre-trained model?

### Approach

To answer this question, we evaluate the logits of the `new_tokens` in contexts where they are **expected** to be predicted, using pre-determined phrases in which we know those tokens occur.

Then, we compare the results of multiple initialization methods to determine the best ones.
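
As a small illustration of the quantity being compared, the sketch below computes the score and rank of one target token from a vector of next-token scores. Here the rank is the number of tokens with a strictly higher score, so rank 0 means the target token would be the greedy prediction. The score values are made up, and `token_rank` is a hypothetical helper used only for this example.

```python
def token_rank(scores: list[float], token_id: int) -> dict:
    """Score of `token_id`, how many tokens outscore it, and the best score overall."""
    token_score = scores[token_id]
    rank = sum(1 for score in scores if score > token_score)
    return {'score': token_score, 'rank': rank, 'best_score': max(scores)}


# Made-up next-token scores over a 5-token vocabulary
logits = [0.1, 2.5, -1.0, 1.7, 0.3]

result = token_rank(logits, token_id=3)
print(result)
```

A well-initialized embedding should give the expected new token a low rank and a score close to `best_score`.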

In [2]:
# -------------------------------------------------------------------
#                       STEP 1. Fetch a dataset
#
# In step 1 we're fetching a dataset and selecting a random number of lines which CONTAIN any added token

# Imports
import os
import tqdm
import numpy as np
np.random.seed(42)

def calculate_token_scores(model, tokenizer, phrase: str, token_id: int):
    '''
    Returns the score of a specific token_id when generating a new token with `phrase` as input.

    Parameters
    ----------
    model: Any
        model to generate the phrase with
    
    tokenizer: Any
        tokenizer to encode the given phrase
    
    phrase: str
        phrase to give as input to the model
    
    token_id: int
        token to retrieve the scores to

    Returns
    -------
    dict[Literal['score', 'rank', 'best_score'], float]
    '''
    inputs = tokenizer(phrase, return_tensors='pt')
    for key in inputs.keys(): inputs[key] = inputs[key].to(model.device)
    generation = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        pad_token_id = tokenizer.eos_token_id,
        output_scores=True,
        return_dict_in_generate=True,
        max_new_tokens=1,
    )
    scores = generation['scores'][0][0]
    token_score = scores[token_id]
    token_rank = (scores > token_score).sum()
    return {'score': token_score.item(), 'rank': token_rank.item(), 'best_score': scores.max().item()}

def choose_random_line(file, file_size: int) -> bytes:
    """
    Seek to a random byte position and return the next full line.
    
    Args:
        file: A file object opened in binary mode.
        file_size: The total size of the file in bytes.
        
    Returns:
        A full line (bytes), starting from the next newline after the random byte.
    """
    byte_pos = np.random.randint(0, file_size - 1)
    file.seek(byte_pos)
    
    # Skip partial line
    file.readline()
    
    # Return the next full line
    return file.readline()

def sample_lines(file_path: str, sample_size: int, musthave_chars_list: list[str]) -> list[str]:
    """
    Randomly sample a given number of unique lines from a large file.
    
    Args:
        file_path: Path to the file.
        sample_size: Number of unique lines to sample.
        musthave_chars_list: Substrings (the added tokens); a line is kept only if it contains at least one of them.
        
    Returns:
        A list of decoded strings (lines).
    """
    file_size = os.path.getsize(file_path)
    # Find #SAMPLE_SIZE phrases containing any of the new tokens
    selected_lines = set()
    with open(file_path, 'rb') as f:
        # NOTE: loops until `sample_size` unique matching lines are found, so the file must contain at least that many
        while len(selected_lines) < sample_size:
            line = choose_random_line(f, file_size).decode()   # Decode once and reuse
            # Ignore empty lines and lines which don't contain any added token
            if line.strip() and any(t in line for t in musthave_chars_list):
                selected_lines.add(line)
    return list(selected_lines)

def print_prefix(model: str, method: str, start: pd.Timestamp):
    timestamp_print = f'{pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S")} | {(pd.Timestamp.now() - start).total_seconds():<5.2f} sec'
    return f'[`{model}`/{"`"+method+"`":<17s} @ {timestamp_print}]'

def analyze_init_method(model_name: str, num_new_tokens: int, embed_init_method: str, dataset_path: str, sample_size: int):
    start = pd.Timestamp.now()
    print_args = [model_name, embed_init_method, start]
    print(f'{print_prefix(*print_args)} - Initializing model and tokenizer')
    # "Hacking" a model and training a tokenizer to get "new_tokens"
    model, tokenizer   = hack_tokenizer.utils.loader.load_model_and_tokenizer(model_name, DEVICE)
    encoding_tokenizer = hack_tokenizer.utils.loader.load_model_and_tokenizer(model_name, DEVICE)[1]
    hacker = hack_tokenizer.hack.ModelHacker(dataset=DATASET_TOKENIZER, batch_size=BATCH_SIZE)

    print(f'{print_prefix(*print_args)} - Hacking model and tokenizer')
    model, tokenizer = hacker.hack(
        model, tokenizer,
        encoding_tokenizer,
        num_tokens=num_new_tokens,
        embed_initializer_method=embed_init_method,
        show_progress=False,
        train=False,
    )

    # Select lines
    lines = sample_lines(dataset_path, sample_size, hacker.new_tokens)

    # Iterate over them
    scores = []
    for line in tqdm.tqdm(lines, desc=f'{print_prefix(*print_args)} - Analyzing results'):
        # Pick a random `new_token` in `line` to simulate a generation for it
        new_tokens_in_line = [t for t in hacker.new_tokens if t in line]
        split_token = np.random.choice(new_tokens_in_line, 1)[0].item()   # Randomly choose a token to split the word on
        splitted_line = line.split(split_token)
        eval_phrase = splitted_line.pop(0)
        while len(eval_phrase) < 10 and len(splitted_line) > 0:    # While we don't have 10 characters, continuously expand the phrase
            eval_phrase += split_token + splitted_line.pop(0)
        # Calculate the score for it
        scores.append(calculate_token_scores(model, encoding_tokenizer, eval_phrase, tokenizer.encode(split_token)))
    print(f'{print_prefix(*print_args)} - Finished Analysis')
    return scores

#### Analysis

Using the auxiliary functions above, we analyze every available initialization method and compare the scores each one produces.

In [10]:
import json
import os

SAMPLE_SIZE = 1_000
MODEL = 'HuggingFaceTB/SmolLM2-135M'
NUMBER_NEW_TOKENS = constants.NUMBER_NEW_TOKENS
DATASET_PATH = constants.DATA_DIR / 'FULL_opensubtitles_pt-pt.txt'
AVAILABLE_INIT_METHODS = [
    'mean',
    'min',
    'max',
    'quantile(0.25)', 'quantile(0.5)', 'quantile(0.75)'
] + [f'weighted_drop({i/10:.1f})' for i in range(5, 51, 5)]

# Results path (to store results as it takes a bit of time to run)
RQ124_RESULTS_PATH = 'RQ_1.2.4_Results.json'

# Iterating over all available init methods
if not os.path.isfile(RQ124_RESULTS_PATH):
    results = {}
    for embed_init_method in AVAILABLE_INIT_METHODS:
        results[embed_init_method] = analyze_init_method(MODEL, NUMBER_NEW_TOKENS, embed_init_method, DATASET_PATH, SAMPLE_SIZE)
    with open(RQ124_RESULTS_PATH, 'w') as f:
        json.dump(results, f, indent=4)

with open(RQ124_RESULTS_PATH, 'r') as f:
    results = json.load(f)


In [40]:
df = []
for key in results.keys():
    for n, i in enumerate(results[key]):
        i.update({'method': key, 'phrase_id': n})
    df += results[key]
df = pd.DataFrame(df)


# Count number of times each method is "number 1" in rank (meaning how many times each method had the lowest "rank")
min_values = df.groupby(by=['phrase_id'], as_index=False)[['rank']].min()
df = df.merge(min_values, how='left', on=['phrase_id'], suffixes=('', '_min'))
df['min_ranked'] = df['rank'] == df['rank_min']
evaluation = df.groupby(by=['method'])['min_ranked'].sum()

pd.options.plotting.backend = "plotly"

evaluation.plot(kind='bar', y='min_ranked')


# df.pivot(
#     index='phrase_id', columns='method', values='rank'
# ).reset_index()


# df.plot(x='phrase_id', figsize=(20, 10), logy=True, logx=True)
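
The aggregation in the cell above counts, for every method, how many phrases it "wins", meaning it reaches the lowest rank for that phrase. The dependency-free sketch below restates that logic on a toy table with made-up ranks. Note that a tie counts as a win for every tied method, so the wins can sum to more than the number of phrases.

```python
from collections import Counter

# Made-up ranks: one record per (method, phrase), lower rank is better
records = [
    {'method': 'mean', 'phrase_id': 0, 'rank': 12},
    {'method': 'min', 'phrase_id': 0, 'rank': 40},
    {'method': 'weighted_drop(2.0)', 'phrase_id': 0, 'rank': 5},
    {'method': 'mean', 'phrase_id': 1, 'rank': 3},
    {'method': 'min', 'phrase_id': 1, 'rank': 9},
    {'method': 'weighted_drop(2.0)', 'phrase_id': 1, 'rank': 3},
]

# Step 1: best (lowest) rank per phrase, like groupby('phrase_id')['rank'].min()
best_rank = {}
for record in records:
    phrase_id = record['phrase_id']
    best_rank[phrase_id] = min(best_rank.get(phrase_id, record['rank']), record['rank'])

# Step 2: count how often each method matches the best rank of its phrase
wins = Counter({record['method']: 0 for record in records})
for record in records:
    if record['rank'] == best_rank[record['phrase_id']]:
        wins[record['method']] += 1

print(dict(wins))
```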

### Answer

After evaluating five families of initialization methods (`mean`, `min`, `max`, `quantile`, and `weighted_drop`, 16 configurations in total), we decided to use `weighted_drop`.

More specifically, `weighted_drop(2)` (listed as `weighted_drop(2.0)` in `AVAILABLE_INIT_METHODS`). We chose `weighted_drop(2)` over `weighted_drop(5)` because with $K=5$ (in $weighted\_drop(K)$) we would include almost **no** information from tokens other than the first one.

For that reason, we picked the **first** tested `weighted_drop` configuration that scored higher than all the other methods.

Final answer: `weighted_drop(2)`
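
To illustrate why a large $K$ leaves little room for tokens other than the first, the sketch below uses a purely hypothetical weighting in which the $i$-th constituent token (0-indexed) gets a weight proportional to $(i+1)^{-K}$. The actual formula behind `weighted_drop(K)` in `hack_tokenizer` is not reproduced in this notebook, so `weighted_drop_weights`, `combine`, and the toy embeddings are assumptions made only for this example.

```python
def weighted_drop_weights(n_tokens: int, k: float) -> list[float]:
    """HYPOTHETICAL weighting: w_i proportional to (i + 1) ** -k, normalised to sum to 1."""
    raw = [(i + 1) ** -k for i in range(n_tokens)]
    total = sum(raw)
    return [weight / total for weight in raw]


def combine(embeddings: list[list[float]], weights: list[float]) -> list[float]:
    """Weighted sum of the constituent-token embeddings, one value per dimension."""
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]


# Toy 2-dimensional embeddings of the three old tokens that make up one new token
sub_token_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

for k in (2, 5):
    weights = weighted_drop_weights(len(sub_token_embeddings), k)
    new_embedding = combine(sub_token_embeddings, weights)
    print(f'K={k}: first-token weight={weights[0]:.3f}, new embedding={new_embedding}')
```

Under this assumed weighting the first token receives roughly 73% of the weight at $K=2$ and more than 96% at $K=5$, which matches the qualitative argument above.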