# Task 1: Language model inference

The goal of this first task is to familiarize yourself with the Hugging Face `transformers` and `datasets` libraries. You will learn how to load and tokenize a dataset, how to load a pre-trained language model, and finally how to run a model in inference mode.

Your task is to complete the missing code blocks below.

In [1]:
# import dependencies
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import torch

from datasets import (
    load_dataset,
    load_dataset_builder,
    get_dataset_split_names,
    get_dataset_config_names,
)
from transformers import (
    XGLMTokenizer,
    XGLMTokenizerFast,
    XGLMForCausalLM,
    AutoModelForCausalLM,
    AutoTokenizer,
    GenerationConfig
)

# set up figure parameters to make them look nice
plt.rcParams["axes.formatter.use_mathtext"] = True
matplotlib.rcParams["font.family"] = "cmr10"
matplotlib.rcParams["axes.unicode_minus"] = False
matplotlib.rcParams.update({"font.size": 11})

# other utils
from utils import *

## Explore dataset

In [2]:
DATA_SET_NAME = "facebook/flores" # specify dataset name
MODEL_NAME = "facebook/xglm-564M" # specify model name
# MODEL_NAME = "gpt2" # specify model name

In [3]:
# Explore a dataset
LANGUAGE_CODE = "deu_Latn" # Language to explore

# covered language codes can be found here: https://github.com/openlanguagedata/flores?tab=readme-ov-file#language-coverage

ds_builder = load_dataset_builder(DATA_SET_NAME, LANGUAGE_CODE, trust_remote_code=True)
print(ds_builder.info.description) # print the dataset description

The creation of FLORES-200 doubles the existing language coverage of FLORES-101. 
Given the nature of the new languages, which have less standardization and require 
more specialized professional translations, the verification process became more complex. 
This required modifications to the translation workflow. FLORES-200 has several languages 
which were not translated from English. Specifically, several languages were translated 
from Spanish, French, Russian and Modern Standard Arabic. Moreover, FLORES-200 also 
includes two script alternatives for four languages. FLORES-200 consists of translations 
from 842 distinct web articles, totaling 3001 sentences. These sentences are divided 
into three splits: dev, devtest, and test (hidden). On average, sentences are approximately 
21 words long.



In [4]:
# print the features (columns) of the dataset
pprint(ds_builder.info.features)

{'URL': Value(dtype='string', id=None),
 'domain': Value(dtype='string', id=None),
 'has_hyperlink': Value(dtype='int32', id=None),
 'has_image': Value(dtype='int32', id=None),
 'id': Value(dtype='int32', id=None),
 'sentence': Value(dtype='string', id=None),
 'topic': Value(dtype='string', id=None)}


In [5]:
# get the available splits
pprint(ds_builder.info.splits)

None


## Load data, tokenize, and batchify

In [6]:
# specify languages
LANGUAGES = [
    "eng_Latn",
    "spa_Latn",
    "ita_Latn",
    "deu_Latn",
    "arb_Arab",
    "tel_Telu",
    "tam_Taml",
    "quy_Latn"
]

In [7]:
# Set up the splits to download
USE_SPLITS = ["dev", "devtest"]

"""
load flores data for each language
structure: 
dataset_per_lang = {
  language: {
      "dataset": {
           split (dev/devtest): {
               "raw": raw dataset (without tokenization),
               "tokenized": tokenized dataset
           }
      }, 
      "dataloader": None}
  }
}
"""
dataset_per_lang = {}
for language in LANGUAGES:
    print(f"Loading dataset for {language}", end="... ")

    # the "dataloader" key starts out as None; the loaders themselves are
    # created in the cell tagged @dataloader-creation
    dataset_per_lang[language] = {"dataset": {}, "dataloader": None}

    for split in USE_SPLITS:
        dataset_per_lang[language]["dataset"][split] = {}
        dataset_per_lang[language]["dataset"][split]["raw"] = load_dataset(
            DATA_SET_NAME,
            language,
            split=split,
            trust_remote_code=True,
            cache_dir="../cache/languages",
        )

    print("done")

Loading dataset for eng_Latn... done
Loading dataset for spa_Latn... done
Loading dataset for ita_Latn... done
Loading dataset for deu_Latn... done
Loading dataset for arb_Arab... done
Loading dataset for tel_Telu... done
Loading dataset for tam_Taml... done
Loading dataset for quy_Latn... done


In [8]:
# let's look at the English subset
EX_DATASET_LANG = "eng_Latn"
english_dataset = dataset_per_lang[EX_DATASET_LANG]["dataset"]["dev"]["raw"]
print(f"Size of the english dataset: {english_dataset.info.dataset_size}")
print("Features:")
pprint_tab(english_dataset.info.features)
print("\nSplits:")
print(english_dataset.info.splits)

Size of the english dataset: 501481
Features:
	{'URL': Value(dtype='string', id=None),
	 'domain': Value(dtype='string', id=None),
	 'has_hyperlink': Value(dtype='int32', id=None),
	 'has_image': Value(dtype='int32', id=None),
	 'id': Value(dtype='int32', id=None),
	 'sentence': Value(dtype='string', id=None),
	 'topic': Value(dtype='string', id=None)}

Splits:
{'dev': SplitInfo(name='dev', num_bytes=245488, num_examples=997, shard_lengths=None, dataset_name='flores'), 'devtest': SplitInfo(name='devtest', num_bytes=255993, num_examples=1012, shard_lengths=None, dataset_name='flores')}


In [9]:
# let's look at an individual sample from the dataset
def get_sample(idx: int, lang: str, split: str, data: str):
    return dataset_per_lang[lang]['dataset'][split][data][idx]

print(f"Viewing raw samples from {EX_DATASET_LANG}:")
for split in USE_SPLITS:
    first_sample = get_sample(0, EX_DATASET_LANG, split, "raw")
    last_sample = get_sample(-1, EX_DATASET_LANG, split, "raw")
    # index of the last sample (dataset length minus one)
    last_idx = len(dataset_per_lang[EX_DATASET_LANG]["dataset"][split]["raw"]) - 1

    print("")
    print(f"\tFirst sample from {split} split:")
    pprint_tab(first_sample, indent="\t\t")
    print("")
    print(f"\t{last_idx}-th sample from {split} split:")
    pprint_tab(last_sample, indent="\t\t")

Viewing raw samples from eng_Latn:

	First sample from dev split:
		{'URL': 'https://en.wikinews.org/wiki/Scientists_say_new_medical_diagnostic_chip_can_sort_cells_anywhere_with_an_inkjet',
		 'domain': 'wikinews',
		 'has_hyperlink': 0,
		 'has_image': 0,
		 'id': 1,
		 'sentence': 'On Monday, scientists from the Stanford University School of '
		             'Medicine announced the invention of a new diagnostic tool that '
		             'can sort cells by type: a tiny printable chip that can be '
		             'manufactured using standard inkjet printers for possibly about '
		             'one U.S. cent each.',
		 'topic': 'health'}

	996-th sample from dev split:
		{'URL': 'https://en.wikivoyage.org/wiki/Funeral_travel',
		 'domain': 'wikivoyage',
		 'has_hyperlink': 0,
		 'has_image': 0,
		 'id': 997,
		 'sentence': 'In all cases, you must book by phone directly with the airline.',
		 'topic': 'Reason to travel/Funeral travel'}

	First sample from devtest split:
		{'URL': 'https

In [10]:
# tokenize the data

# load a pre-trained tokenizer from the huggingface hub
# if this raises an error in a conda environment, install the sentencepiece
# library first: conda install -c huggingface sentencepiece
try:
    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_NAME, cache_dir="../cache/tokenizers"
    )
except ValueError:
    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_NAME, cache_dir="../cache/tokenizers", use_fast=False
    )

# gpt2 does not have a padding token, so we have to add it manually
if MODEL_NAME == "gpt2":
    tokenizer.add_special_tokens({"pad_token": tokenizer.unk_token})


# specify the tokenization function
def tokenization(example):
    return tokenizer(
        example["sentence"],
        padding="longest",
        truncation=True,
        return_tensors="pt",
    )


def add_batch_dimension(example):
    """
    Adds a batch dimension to the tensors.
    This function assumes the tensors are already in PyTorch tensors and simply
    unsqueezes them at the first dimension.
    See https://stackoverflow.com/questions/57237352/what-does-unsqueeze-do-in-pytorch
    """
    example["input_ids"] = example["input_ids"].unsqueeze(0)
    example["attention_mask"] = example["attention_mask"].unsqueeze(0)
    return example


for language in dataset_per_lang:
    for split in dataset_per_lang[language]["dataset"]:
        raw_dataset = copy(dataset_per_lang[language]["dataset"][split]["raw"])

        # Tokenize the dataset
        tokenized_dataset = raw_dataset.map(
            lambda example: tokenization(example), batched=True
        )

        # Update the dataset with Pytorch format
        tokenized_dataset.set_format(
            type="torch", columns=["input_ids", "attention_mask"]
        )

        # Apply unsqueeze operation
        tokenized_dataset = tokenized_dataset.map(
            lambda example: add_batch_dimension(example),
            batched=False,  # Set batched=False to apply function to each example individually
        )

        dataset_per_lang[language]["dataset"][split]["tokenized"] = tokenized_dataset

Map:   0%|          | 0/997 [00:00<?, ? examples/s]

In [11]:
# let's take a look at a tokenized sample
LOOKAT_SAMPLE_ID = 17

# get raw and tokenized sample
raw_sample = get_sample(LOOKAT_SAMPLE_ID, EX_DATASET_LANG, "dev", "raw")
tokenized_sample = get_sample(LOOKAT_SAMPLE_ID, EX_DATASET_LANG, "dev", "tokenized")

print(f"Viewing {LOOKAT_SAMPLE_ID}-th sample from {EX_DATASET_LANG}:")
print("\tRaw sample:")
pprint_tab(raw_sample, indent="\t\t")
print("\n\tTokenized sample:")
pprint_tab(tokenized_sample, indent="\t\t")

EX_DATASET_LANG = "spa_Latn"

raw_sample = get_sample(LOOKAT_SAMPLE_ID, EX_DATASET_LANG, "dev", "raw")
tokenized_sample = get_sample(LOOKAT_SAMPLE_ID, EX_DATASET_LANG, "dev", "tokenized")

print(f"\nViewing {LOOKAT_SAMPLE_ID}-th sample from {EX_DATASET_LANG}:")
print("\tRaw sample:")
pprint_tab(raw_sample, indent="\t\t")
print("\n\tTokenized sample:")
pprint_tab(tokenized_sample, indent="\t\t")

Viewing 17-th sample from eng_Latn:
	Raw sample:
		{'URL': 'https://en.wikinews.org/wiki/Investigation_of_Deutsche_Bank_headquarters_spills_into_second_day',
		 'domain': 'wikinews',
		 'has_hyperlink': 0,
		 'has_image': 0,
		 'id': 18,
		 'sentence': 'British newspaper The Guardian suggested Deutsche Bank '
		             'controlled roughly a third of the 1200 shell companies used to '
		             'accomplish this.',
		 'topic': 'crime'}

	Tokenized sample:
		{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
		         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
		         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
		         0]]),
		 'input_ids': tensor([[     2,  23409, 123980,    268,  67521, 102943,  22532,   5355, 170318,
		              6, 208717,     11,  27643,     48,     32,  27933, 105094,  33409,
		           3964,     33, 169662,    319,      5,      1,      1, 
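
The trailing `1` entries in `input_ids` and the zeros in `attention_mask` above come from `padding="longest"`: shorter sequences are filled up to the length of the longest sequence in the batch, and the mask records which positions hold real tokens. The sketch below reproduces that idea in plain Python. `pad_to_longest` is a hypothetical helper written for illustration (it is not part of `transformers`), the token ids are made up, and the pad id of 1 is read off the output above.

```python
def pad_to_longest(sequences, pad_id=1):
    """Right-pad token-id lists to a common length and build attention masks.

    Hypothetical helper for illustration only; a Hugging Face tokenizer does
    the equivalent internally when called with padding="longest".
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)       # real tokens, then pads
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real, 0 = pad
    return input_ids, attention_mask


# toy token ids (made up, not real vocabulary entries)
ids, mask = pad_to_longest([[2, 10, 11, 12], [2, 20]])
print(ids)
print(mask)
```

Because `tokenization` is mapped with `batched=True`, the "longest" sequence is the longest one in each map batch, which is why a short sentence can end up with many padded positions.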

In [12]:
# construct a pytorch data loader for each dataset
BATCH_SIZE = 2  # for testing purposes, we start with a batch size of 2. You can change this later.

for language in dataset_per_lang:
    for split in dataset_per_lang[language]["dataset"]:
        tokenized_dataset = dataset_per_lang[language]["dataset"][split]["tokenized"]
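        # note: this key is overwritten for every split, so after the loop it
        # holds only the loader of the last split in USE_SPLITS
        dataset_per_lang[language]["dataloader"] = torch.utils.data.DataLoader(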
            tokenized_dataset, batch_size=BATCH_SIZE, shuffle=False
        )
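
With `shuffle=False`, a data loader walks through the dataset in order and groups `BATCH_SIZE` consecutive examples; by default the final batch is simply shorter when the dataset size is not a multiple of the batch size. The plain-Python sketch below mimics that grouping and keeps one entry per split. `batchify` and the toy `splits` dictionary are hypothetical stand-ins, not the notebook's actual objects.

```python
def batchify(examples, batch_size):
    """Yield consecutive chunks of `examples`; the last chunk may be shorter."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


# stand-ins for the tokenized datasets of two splits
splits = {"dev": list(range(5)), "devtest": list(range(4))}

# one list of batches per split, keyed by split name
batches_per_split = {
    name: list(batchify(data, batch_size=2)) for name, data in splits.items()
}

for name, batches in batches_per_split.items():
    print(name, batches)
```

Keying the result by split name is one way to keep the loaders of different splits apart.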

## Load model

In [13]:
# load pre-trained model from the huggingface hub
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, cache_dir="../cache/models")

# move the model to the target device and put it into evaluation mode
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = model.to(device)
model.eval()

print(f"Using device: {device}")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Using device: cuda


In [14]:
# test on a sample
inputs = tokenized_sample["input_ids"].to(device)
labels = tokenized_sample["input_ids"].to(device)

# torch.inference_mode() is now preferred over torch.no_grad().
# See: https://discuss.pytorch.org/t/pytorch-torch-no-grad-vs-torch-inference-mode/134099/2?u=timgianitsos
with torch.inference_mode():
    outputs = model(inputs, labels=labels, attention_mask=tokenized_sample["attention_mask"].to(device))
    loss = outputs.loss.item()

print(loss)

25.76503562927246
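
One thing worth checking about the loss above: `labels` is a plain copy of `input_ids`, so the padded positions are also treated as prediction targets. The attention mask controls what the model attends to, not which positions enter the loss. In the Hugging Face causal LM models, label positions set to `-100` are ignored by the loss. The sketch below illustrates the effect in plain Python with made-up probabilities. It is a simplification: it skips the one-position shift between inputs and targets, and `mask_pad_labels` and `mean_nll` are hypothetical helpers.

```python
import math

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy skips by default


def mask_pad_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids, replacing padded positions with ignore_index."""
    return [
        tok if keep == 1 else ignore_index
        for tok, keep in zip(input_ids, attention_mask)
    ]


def mean_nll(target_probs, labels, ignore_index=IGNORE_INDEX):
    """Average negative log-likelihood over positions that are not ignored."""
    kept = [
        -math.log(p) for p, lab in zip(target_probs, labels) if lab != ignore_index
    ]
    return sum(kept) / len(kept)


# toy example: 3 real tokens followed by 2 pad tokens (pad id 1)
input_ids = [23, 57, 9, 1, 1]
attention_mask = [1, 1, 1, 0, 0]
# made-up probabilities the model assigns to each target token
target_probs = [0.5, 0.25, 0.5, 0.001, 0.001]

masked_labels = mask_pad_labels(input_ids, attention_mask)
unmasked = mean_nll(target_probs, input_ids)
masked = mean_nll(target_probs, masked_labels)
print(f"pads counted: {unmasked:.3f}, pads ignored: {masked:.3f}")
```

On PyTorch tensors the same masking can be written as `labels = input_ids.clone()` followed by `labels[attention_mask == 0] = -100`. Whether this accounts for the size of the value printed above is something to verify on the actual data.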


In [15]:
losses = {lang: [] for lang in LANGUAGES} # store per-batch losses for each language

# delete leftover tensors from the previous cell first, then release the
# cached GPU memory they occupied
del inputs, labels, outputs
torch.cuda.empty_cache()

# TODO: Revise this to make sure it uses the cross-entropy loss
# iterate over the dataset for each language and compute the cross-entropy loss per batch 
for language in dataset_per_lang:
    print(f"Computing losses for {language}", end="... \n")
    for split in dataset_per_lang[language]["dataset"]:
        print(f"\tSplit: {split}", end="... ")
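        # build the loader from this split's own data; the shared "dataloader"
        # key only holds the loader of the last split that was processed
        dataloader = torch.utils.data.DataLoader(
            dataset_per_lang[language]["dataset"][split]["tokenized"],
            batch_size=BATCH_SIZE,
            shuffle=False,
        )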
        for batch in dataloader:
            inputs = batch["input_ids"].to(device)
            labels = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            
            # torch.inference_mode() is now preferred over torch.no_grad().
            # See: https://discuss.pytorch.org/t/pytorch-torch-no-grad-vs-torch-inference-mode/134099/2?u=timgianitsos
            with torch.inference_mode():
                outputs = model(inputs, labels=labels, attention_mask=attention_mask)
                loss = outputs.loss.item()
            
            losses[language].append(loss)

            # Explicitly delete tensors to free up GPU memory
            del inputs, labels, attention_mask, outputs

        print("done")

    # After processing each language, try to free up memory explicitly
    torch.cuda.empty_cache()  # Frees unused memory so it can be used by other tensors

Computing losses for eng_Latn... 
	Split: dev... 
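
The values collected in `losses` are per-batch mean cross-entropies, measured in nats. Two follow-up quantities are often useful when comparing languages: a token-weighted average, in which batches with more real tokens count more, and the perplexity, which is the exponential of the mean loss. The sketch below uses made-up numbers; `token_counts` is hypothetical, since the loop above does not record how many tokens each batch contains.

```python
import math


def weighted_mean_loss(batch_losses, token_counts):
    """Average per-batch mean losses, weighting each by its token count."""
    total_tokens = sum(token_counts)
    return sum(l * n for l, n in zip(batch_losses, token_counts)) / total_tokens


def perplexity(mean_loss):
    """Perplexity for a mean cross-entropy given in nats."""
    return math.exp(mean_loss)


batch_losses = [2.0, 4.0]  # made-up per-batch mean losses
token_counts = [30, 10]    # made-up number of scored tokens per batch

simple_avg = sum(batch_losses) / len(batch_losses)
weighted_avg = weighted_mean_loss(batch_losses, token_counts)
print(simple_avg, weighted_avg, perplexity(weighted_avg))
```

The two averages differ whenever batches contain different numbers of scored tokens, so it helps to state which one a plot reports.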

## Visualize loss per language

In [None]:
# create a figure
fig, axes = plt.subplots(figsize=(8, 5))

# create a bar plot for each language
x = np.arange(len(LANGUAGES))
y = [np.mean(losses[language]) for language in LANGUAGES]  # mean loss per language

axes.bar(x, y)

fig.tight_layout()

# format plot
axes.set_xlabel("Language") # x-axis label
axes.set_xticks(range(len(LANGUAGES))) # x-axis ticks
axes.set_xticklabels(losses.keys()) # x-axis tick labels
axes.set_ylabel("Loss") # y-axis label
axes.set_ylim(0, 9) # range of y-axis
axes.set_title(MODEL_NAME); # title
axes.grid(True, which='major', color='k', linestyle='-', alpha=0.2)
axes.grid(True, which='minor', color='k', linestyle='--', alpha=0.1)
axes.minorticks_on()

## Comparing XGLM to GPT2

Your next task is to re-run the analysis above using `gpt2` as the pre-trained language model. For this exercise, focus on your native language, unless it is English or is not covered by FLORES; in that case, pick another language that you can read well.

Compare the language modeling loss of XGLM and GPT2. What do you observe? Investigate the differences in tokenization for XGLM and GPT2. What do you observe? How can the good (or bad) performance of GPT2 be explained?
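
One possible starting point for the tokenization comparison: GPT-2 uses a byte-level BPE tokenizer, so before any learned merges are applied a string is as long as its UTF-8 encoding. The sketch below counts UTF-8 bytes per character for short strings in a few scripts. The sample strings are illustrative and not taken from FLORES, and `utf8_bytes_per_char` is a hypothetical helper. It says nothing about the merges either tokenizer has learned, so treat it as a lower-level intuition and not as a replacement for inspecting the actual tokenizer output.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    return len(text.encode("utf-8")) / len(text)


# short sample strings per script, written as escaped code points
samples = {
    "eng_Latn": "language",
    "deu_Latn": "Sprache",
    "arb_Arab": "\u0644\u063a\u0629",        # code points from the Arabic block
    "tel_Telu": "\u0c2d\u0c3e\u0c37",        # code points from the Telugu block
    "tam_Taml": "\u0bae\u0bca\u0bb4\u0bbf",  # code points from the Tamil block
}

for lang, text in samples.items():
    print(f"{lang}: {utf8_bytes_per_char(text):.1f} bytes per character")
```

To see the learned behaviour, compare `len(tokenizer(sentence)["input_ids"])` for the same sentence under both tokenizers.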

In [None]:
# TODO: your code goes here