# Task 1: Language model inference

The goal of this first task is to familiarize yourself with the Hugging Face `transformers` and `datasets` libraries. You will learn how to load and tokenize a dataset, how to load a pre-trained language model, and finally, how to run the model in inference mode.

Your task is to complete the missing code blocks below.

In [5]:
# import dependencies
import matplotlib.pyplot as plt
import numpy as np
import torch

from datasets import load_dataset, load_dataset_builder, get_dataset_split_names, get_dataset_config_names
from transformers import XGLMTokenizer, XGLMTokenizerFast, XGLMForCausalLM, AutoModelForCausalLM, AutoTokenizer, GenerationConfig

## Explore dataset

In [6]:
DATA_SET_NAME = "facebook/flores" # specify dataset name
MODEL_NAME = "facebook/xglm-564M" # specify model name
# MODEL_NAME = "gpt2" # specify model name

In [7]:
# Explore a dataset

# covered language codes can be found here: https://github.com/openlanguagedata/flores?tab=readme-ov-file#language-coverage

ds_builder = load_dataset_builder("facebook/flores", "deu_Latn")
print(ds_builder.info.description) # print the dataset description

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


The creation of FLORES-200 doubles the existing language coverage of FLORES-101. 
Given the nature of the new languages, which have less standardization and require 
more specialized professional translations, the verification process became more complex. 
This required modifications to the translation workflow. FLORES-200 has several languages 
which were not translated from English. Specifically, several languages were translated 
from Spanish, French, Russian and Modern Standard Arabic. Moreover, FLORES-200 also 
includes two script alternatives for four languages. FLORES-200 consists of translations 
from 842 distinct web articles, totaling 3001 sentences. These sentences are divided 
into three splits: dev, devtest, and test (hidden). On average, sentences are approximately 
21 words long.



In [20]:
# print the features (columns) of the dataset
# TODO: your code goes here
print(ds_builder.info.features)

{'id': Value(dtype='int32', id=None), 'URL': Value(dtype='string', id=None), 'domain': Value(dtype='string', id=None), 'topic': Value(dtype='string', id=None), 'has_image': Value(dtype='int32', id=None), 'has_hyperlink': Value(dtype='int32', id=None), 'sentence': Value(dtype='string', id=None)}


In [39]:
# get the available configurations (one per language code); the splits are queried in the next cell
# TODO: your code goes here
configs = get_dataset_config_names("facebook/flores")
print(configs)

['ace_Arab', 'bam_Latn', 'dzo_Tibt', 'hin_Deva', 'khm_Khmr', 'mag_Deva', 'pap_Latn', 'sot_Latn', 'tur_Latn', 'ace_Latn', 'ban_Latn', 'ell_Grek', 'hne_Deva', 'kik_Latn', 'mai_Deva', 'pbt_Arab', 'spa_Latn', 'twi_Latn', 'acm_Arab', 'bel_Cyrl', 'eng_Latn', 'hrv_Latn', 'kin_Latn', 'mal_Mlym', 'pes_Arab', 'srd_Latn', 'tzm_Tfng', 'acq_Arab', 'bem_Latn', 'epo_Latn', 'hun_Latn', 'kir_Cyrl', 'mar_Deva', 'plt_Latn', 'srp_Cyrl', 'uig_Arab', 'aeb_Arab', 'ben_Beng', 'est_Latn', 'hye_Armn', 'kmb_Latn', 'min_Arab', 'pol_Latn', 'ssw_Latn', 'ukr_Cyrl', 'afr_Latn', 'bho_Deva', 'eus_Latn', 'ibo_Latn', 'kmr_Latn', 'min_Latn', 'por_Latn', 'sun_Latn', 'umb_Latn', 'ajp_Arab', 'bjn_Arab', 'ewe_Latn', 'ilo_Latn', 'knc_Arab', 'mkd_Cyrl', 'prs_Arab', 'swe_Latn', 'urd_Arab', 'aka_Latn', 'bjn_Latn', 'fao_Latn', 'ind_Latn', 'knc_Latn', 'mlt_Latn', 'quy_Latn', 'swh_Latn', 'uzn_Latn', 'als_Latn', 'bod_Tibt', 'fij_Latn', 'isl_Latn', 'kon_Latn', 'mni_Beng', 'ron_Latn', 'szl_Latn', 'vec_Latn', 'amh_Ethi', 'bos_Latn', 'fi

In [42]:
deu_splits = get_dataset_split_names("facebook/flores", "deu_Latn")
print(deu_splits)

tam_splits = get_dataset_split_names("facebook/flores", "tam_Taml")
print(tam_splits)

['dev', 'devtest']
['dev', 'devtest']


## Load data, tokenize, and batchify

In [8]:
# specify languages
LANGUAGES = [
    "eng_Latn",
    "spa_Latn",
    "ita_Latn",
    "deu_Latn",
    "arb_Arab",
    "tel_Telu",
    "tam_Taml",
    "quy_Latn"
]

In [9]:
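# create a dataset builder for each language
# note: load_dataset_builder only fetches metadata; load_dataset is needed to get the actual sentences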
# TODO: your code goes here
language_datasets = {}
for lang in LANGUAGES:
    try:
        dataset_builder = load_dataset_builder("facebook/flores", lang)
        language_datasets[lang] = dataset_builder
        print(f"Dataset loaded successfully for {lang}.")
    except Exception as e:
        print(f"Failed to load dataset for {lang}: {str(e)}")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Dataset loaded successfully for eng_Latn.
Dataset loaded successfully for spa_Latn.
Dataset loaded successfully for ita_Latn.
Dataset loaded successfully for deu_Latn.
Dataset loaded successfully for arb_Arab.
Dataset loaded successfully for tel_Telu.
Dataset loaded successfully for tam_Taml.
Dataset loaded successfully for quy_Latn.


In [10]:
# let's look at the English subset
# TODO: your code goes here

english_latn_dataset = language_datasets.get("eng_Latn")
print(english_latn_dataset.info.features)

{'id': Value(dtype='int32', id=None), 'URL': Value(dtype='string', id=None), 'domain': Value(dtype='string', id=None), 'topic': Value(dtype='string', id=None), 'has_image': Value(dtype='int32', id=None), 'has_hyperlink': Value(dtype='int32', id=None), 'sentence': Value(dtype='string', id=None)}


In [26]:
# let's look at an individual sample from the dataset
# TODO: your code goes here

In [37]:
# tokenize the data
from tqdm import tqdm
# load a pre-trained tokenizer from the huggingface hub
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# gpt2 does not have a padding token, so we have to add it manually
if MODEL_NAME == "gpt2":
    tokenizer.add_special_tokens({'pad_token': tokenizer.unk_token})

# specify the tokenization function
def tokenization(example):
    # tokenize the "sentence" column; pad and truncate to the model's maximum length
    return tokenizer(example["sentence"], padding="max_length", truncation=True)
# TODO: your code goes here

tokenized_datasets = {}



In [None]:
# let's take a look at a tokenized sample
# TODO: your code goes here

In [None]:
# construct a pytorch data loader for each dataset
BATCH_SIZE = 2 # for testing purposes, we start with a batch size of 2. You can change this later.

# TODO: your code goes here
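
As a rough illustration of what batching does to tokenized examples, the sketch below pads variable-length ID sequences to a common length and builds the matching attention mask in plain Python, so it runs without downloading anything. The helpers `collate` and `batches` and the value `PAD_ID = 1` are assumptions made for this example; in the notebook, the tokenizer's own `pad_token_id` and a `torch.utils.data.DataLoader` would play these roles.

```python
from typing import Dict, Iterator, List

PAD_ID = 1  # hypothetical padding id; in practice use tokenizer.pad_token_id


def collate(batch: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def batches(examples: List[List[int]], batch_size: int) -> Iterator[Dict[str, List[List[int]]]]:
    """Yield consecutive padded batches of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield collate(examples[start:start + batch_size])


# toy token id sequences of different lengths
examples = [[5, 6, 7], [8, 9], [10], [11, 12, 13, 14]]
first = next(batches(examples, batch_size=2))
print(first["input_ids"])
print(first["attention_mask"])
```

The attention mask matters later on: padded positions should not contribute to the loss.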

## Load model

In [None]:
# load pre-trained model from the huggingface hub
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
# put the model into evaluation mode
# TODO: your code goes here

In [None]:
losses = {lang: [] for lang in LANGUAGES} # store per-batch losses for each language

# iterate over the dataset for each language and compute the cross-entropy loss per batch
# TODO: your code goes here
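
To make the quantity being measured concrete, here is a minimal plain-Python sketch of the next-token cross-entropy for a single sequence: the logits at position t are scored against the token at position t+1, and labels equal to `-100` (the default `ignore_index` of PyTorch's cross-entropy) are skipped. The function names and the toy logits are assumptions for illustration, not the notebook's intended solution.

```python
import math
from typing import List

IGNORE_INDEX = -100  # label value that is excluded from the loss


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def causal_lm_loss(logits: List[List[float]], labels: List[int]) -> float:
    """Mean negative log-likelihood of token t+1 given the logits at position t."""
    total, count = 0.0, 0
    # shift: drop the last logits and the first label so the pairs line up
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # e.g. a padded position
        total -= log_softmax(step_logits)[target]
        count += 1
    return total / count if count else float("nan")


# toy sequence over a vocabulary of 3 tokens with uniform logits;
# the last label is masked out, as a padded position would be
logits = [[0.0, 0.0, 0.0] for _ in range(3)]
labels = [0, 2, IGNORE_INDEX]
loss = causal_lm_loss(logits, labels)
print(loss)
```

With uniform logits every token has probability 1/3, so the loss equals the natural logarithm of the vocabulary size; a model that has learned something about the language should score below that.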

## Visualize loss per language

In [None]:
# create a figure
fig, axes = plt.subplots(figsize=(8, 5))

# create a bar plot for each language
# TODO: your code goes here

# format plot
axes.set_xlabel("language") # x-axis label
axes.set_xticks(range(len(LANGUAGES))) # x-axis ticks
axes.set_xticklabels(losses.keys()) # x-axis tick labels
axes.set_ylabel("loss") # y-axis label
axes.set_ylim(0, 9) # range of y-axis
axes.set_title(MODEL_NAME); # title
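
Before plotting, the per-batch losses have to be reduced to one number per language. The sketch below shows one way to do that with made-up values and placeholder language names (`toy_losses` is not a real result); the text bars only stand in for the matplotlib plot.

```python
from statistics import fmean

# made-up per-batch losses for two placeholder languages (not real results)
toy_losses = {"lang_a": [2.0, 3.0, 4.0], "lang_b": [5.0, 7.0]}

# average the per-batch losses to get a single value per language
mean_loss = {lang: fmean(values) for lang, values in toy_losses.items()}

# crude text rendering of the bar heights
for lang, value in mean_loss.items():
    print(f"{lang}: {value:.2f} " + "#" * round(value))
```

In the notebook, the same means could be passed to `axes.bar(range(len(mean_loss)), list(mean_loss.values()))`. Note that averaging per-batch means matches the per-token average only when every batch contains the same number of unpadded tokens.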

## Comparing XGLM to GPT2

Your next task is to re-run the analysis above using `gpt2` as the pre-trained language model. For this exercise, focus on your native language, unless it is English or is not covered by FLORES; in that case, pick another language that you can read well.

Compare the language modeling loss of XGLM and GPT2. What do you observe? Investigate the differences in tokenization for XGLM and GPT2. What do you observe? How can the good (or bad) performance of GPT2 be explained?
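
One possible starting point for the tokenization comparison: GPT-2's tokenizer is a byte-level BPE, so it builds tokens from UTF-8 bytes rather than from characters. The sketch below only measures how many bytes each character takes in a few sample words; the helper `bytes_per_char` and the sample words are assumptions for illustration, and the sketch does not run either tokenizer.

```python
def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)


# sample words chosen for illustration only
samples = {
    "English": "language",
    "German": "Größe",
    "Tamil": "தமிழ்",
}

ratios = {name: bytes_per_char(word) for name, word in samples.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} bytes per character")
```

A script that needs several bytes per character gives a byte-level tokenizer more pieces to merge, so it is worth checking with `tokenizer.tokenize` whether this shows up as longer token sequences for the same sentence.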

In [None]:
# TODO: your code goes here