<a href="https://colab.research.google.com/github/LuciAkirami/nlp-cookbook/blob/main/NLP_3_Machine_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers[torch] datasets huggingface_hub umap-learn > /dev/null

In this notebook we explore the dataset for a multilingual named entity recognition (NER) task

Here we work with the XTREME benchmark. More precisely, we use its WikiANN subset, also called PAN-X, which contains Wikipedia articles in many languages
- Each article is annotated with `LOC`, `PER` and `ORG` tags in inside-outside-beginning (IOB2) format
- In IOB2, a `B-` prefix marks the beginning of an entity, consecutive tokens belonging to the same entity are given an `I-` prefix, and an `O` tag marks a token that doesn't belong to any entity
- Whenever a dataset has multiple subsets (configurations), we can use `get_dataset_config_names()` to find out which ones are available

Example: Jeff Dean is a computer scientist at Google in California

| Tokens    | Jeff     | Dean     | is       | a        | computer | scientist | at       | Google   | in       | California |
|-----------|----------|----------|----------|----------|----------|-----------|----------|----------|----------|------------|
| Tags      | B-PER    | I-PER    | O        | O        | O        | O         | O        | B-ORG    | O        | B-LOC      |
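To make the IOB2 scheme concrete, here is a minimal sketch that groups tagged tokens back into entity spans. The helper name `iob2_to_spans` is hypothetical (it is not part of any library used in this notebook), and the sketch simply drops an `I-` tag that does not continue an open entity of the same type.

```python
def iob2_to_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes the open entity, if any
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    # Flush an entity that runs to the end of the sentence
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = "Jeff Dean is a computer scientist at Google in California".split()
tags = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-ORG", "O", "B-LOC"]
spans = iob2_to_spans(tokens, tags)
print(spans)
# [('PER', 'Jeff Dean'), ('ORG', 'Google'), ('LOC', 'California')]
```

Note how `Jeff` and `Dean` merge into one `PER` span because the second token carries an `I-PER` tag rather than a new `B-PER`.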


In [8]:
from datasets import get_dataset_config_names, load_dataset

xtreme_subsets = get_dataset_config_names("xtreme")
print(f"XTREME has {len(xtreme_subsets)} configurations")

XTREME has 183 configurations


In [6]:
# we are interested in the PAN-X subsets
panx_subsets = [x for x in xtreme_subsets if x.startswith("PAN")]
# the letters after the "." are the language code, e.g. .de for German
panx_subsets[:5]

['PAN-X.af', 'PAN-X.ar', 'PAN-X.bg', 'PAN-X.bn', 'PAN-X.de']

In [9]:
# let's try loading the German data
panx_german = load_dataset("xtreme", name="PAN-X.de")

Downloading data:   0%|          | 0.00/234M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/20000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [13]:
panx_german

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 20000
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 10000
    })
})

In [12]:
# moving through different splits
for split in panx_german:
  print(split)

train
validation
test


In [20]:
# getting the number of rows in a particular split
panx_german['test'].num_rows

10000

In [26]:
# shuffling the training data and getting only a subset of it
panx_german['train'].shuffle(seed=0).select(range(int(panx_german['train'].num_rows*0.5)))

Dataset({
    features: ['tokens', 'ner_tags', 'langs'],
    num_rows: 10000
})
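The shuffle-then-select pattern above can be illustrated without the `datasets` library. The sketch below is only a pure-Python analogy: the helper `shuffle_and_select` is hypothetical, and it uses Python's `random` module, so it will not reproduce the exact row order that `Dataset.shuffle(seed=0)` produces.

```python
import random


def shuffle_and_select(rows, fraction, seed=0):
    """Return a reproducible random subset holding `fraction` of the rows."""
    rng = random.Random(seed)          # fixed seed makes the shuffle repeatable
    indices = list(range(len(rows)))
    rng.shuffle(indices)               # analogous to .shuffle(seed=seed)
    keep = int(len(rows) * fraction)   # same arithmetic as int(num_rows * 0.5)
    return [rows[i] for i in indices[:keep]]  # analogous to .select(range(keep))


rows = [f"example-{i}" for i in range(10)]
subset = shuffle_and_select(rows, 0.5)
print(len(subset))  # 5
```

Shuffling before selecting matters: taking the first half of an unshuffled split could bias the sample if the rows are stored in some meaningful order.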

In [28]:
# shuffling the test data and keeping a subset of it
from datasets import DatasetDict

# creating a new, empty DatasetDict to hold the downsampled split
new_german = DatasetDict()

# downsampling the original test split
new_german['test'] = (
    panx_german['test']
    # fix the seed so the shuffle is reproducible
    .shuffle(seed=0)
    # keep the first half of the shuffled rows;
    # here that is range(0, 5000) since test has 10k rows
    .select(range(int(panx_german['test'].num_rows * 0.5))))
new_german

DatasetDict({
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 5000
    })
})

In [15]:
# calling DatasetDict() with no arguments returns an empty DatasetDict
from datasets import DatasetDict

DatasetDict()

DatasetDict({
    
})

Dict vs defaultdict

- A `defaultdict` is preferred over a normal `dict` when missing keys should get a default value
- If we access a key that doesn't exist in a normal `dict`, it raises a `KeyError`
- If we access a key that doesn't exist in a `defaultdict`, it calls the function it was created with, stores the result under that key, and returns it as the default value
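The difference listed above can be seen in a small standalone sketch that uses only the standard library. The keys and values are made up for illustration.

```python
from collections import defaultdict

# A plain dict raises KeyError on a missing key
plain = {}
try:
    plain["missing"]
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# defaultdict(int) calls int() for a missing key, which returns 0
counts = defaultdict(int)
counts["missing"] += 1
print(counts["missing"])  # 1

# defaultdict(list) calls list() for a missing key, which returns []
groups = defaultdict(list)
groups["de"].append("Bucht")
print(dict(groups))  # {'de': ['Bucht']}
```

Keep in mind that merely reading a missing key from a `defaultdict` inserts it, so the dictionary grows as a side effect of the lookup.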

In [16]:
# a defaultdict is very similar to a normal dict
# it takes a function, which it calls to produce a default value
# when we access a key that is not present
from datasets import DatasetDict
from collections import defaultdict

# passing the fn which returns an empty DatasetDict as default value
ds = defaultdict(DatasetDict)

# accessing a key that is not present in the dict
ds[1]

DatasetDict({
    
})

Here we assume we are building something for Switzerland, where the four languages considered are German, French, Italian and English
- the `langs` list contains the codes of these languages
- the `fracs` list contains the fraction of the population that speaks each of them, e.g. 62.9% of people in Switzerland speak German

To make a realistic Swiss corpus, we'll sample the German (`de`), French (`fr`), Italian (`it`), and English (`en`) corpora from PAN-X according to their spoken proportions. This creates a language imbalance that is very common in real-world datasets, where acquiring labeled examples in a minority language can be expensive due to the lack of domain experts who are fluent in that language. The imbalanced dataset therefore simulates a common situation when working on multilingual applications.

In [29]:
from collections import defaultdict
from datasets import DatasetDict

langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]
# Return a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict)

for lang, frac in zip(langs, fracs):
    # Load monolingual corpus
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
    # Shuffle and downsample each split according to spoken proportion
    for split in ds:
        panx_ch[lang][split] = (
            ds[split]
            .shuffle(seed=0)
            .select(range(int(frac * ds[split].num_rows))))

Generating train split:   0%|          | 0/20000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/20000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/20000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [32]:
panx_ch

defaultdict(datasets.dataset_dict.DatasetDict,
            {'de': DatasetDict({
                 train: Dataset({
                     features: ['tokens', 'ner_tags', 'langs'],
                     num_rows: 12580
                 })
                 validation: Dataset({
                     features: ['tokens', 'ner_tags', 'langs'],
                     num_rows: 6290
                 })
                 test: Dataset({
                     features: ['tokens', 'ner_tags', 'langs'],
                     num_rows: 6290
                 })
             }),
             'fr': DatasetDict({
                 train: Dataset({
                     features: ['tokens', 'ner_tags', 'langs'],
                     num_rows: 4580
                 })
                 validation: Dataset({
                     features: ['tokens', 'ner_tags', 'langs'],
                     num_rows: 2290
                 })
                 test: Dataset({
                     features: ['tokens', 'ner_tags', 'la

In [33]:
panx_ch['it']

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 1680
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 840
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 840
    })
})

In [36]:
# checking the training data in each language
import pandas as pd

pd.DataFrame({lang: panx_ch[lang]['train'].num_rows for lang in panx_ch},
             index=['Number of training examples'])

|                             | de    | fr   | it   | en   |
|-----------------------------|-------|------|------|------|
| Number of training examples | 12580 | 4580 | 1680 | 1180 |
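These counts follow directly from the `int(frac * num_rows)` expression used when downsampling. As a quick check, the sketch below redoes the arithmetic without loading any data. It assumes every language has splits of 20000/10000/10000 rows, which matches the progress bars shown earlier, and the names `split_sizes` and `sample_sizes` are introduced here only for illustration.

```python
langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]
# Assumed per-language split sizes, taken from the loading output above
split_sizes = {"train": 20000, "validation": 10000, "test": 10000}

# Same arithmetic as int(frac * ds[split].num_rows) in the sampling loop
sample_sizes = {
    lang: {split: int(frac * n) for split, n in split_sizes.items()}
    for lang, frac in zip(langs, fracs)
}

for lang in langs:
    print(lang, sample_sizes[lang]["train"])
```

Because `int()` truncates toward zero, a fraction that does not divide a split evenly would round the sample size down rather than to the nearest integer.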


In [41]:
# exploring the data
for key, value in panx_ch["de"]["train"][0].items():
  print(key,value,sep=": ")

tokens: ['2.000', 'Einwohnern', 'an', 'der', 'Danziger', 'Bucht', 'in', 'der', 'polnischen', 'Woiwodschaft', 'Pommern', '.']
ner_tags: [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0]
langs: ['de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de']


In [45]:
# exploring the features
for k, v in panx_ch["de"]["train"].features.items():
  print(k,v,sep=": ")

tokens: Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)
ner_tags: Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None), length=-1, id=None)
langs: Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)


We see that the integer `ner_tags` printed when exploring the data are indices into the class names printed when exploring the features
- The `Sequence` class specifies that the field contains a list of features, which in the case of `ner_tags` corresponds to a list of `ClassLabel` features.
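Conceptually, a `ClassLabel` is an index lookup into its list of names. The sketch below mimics that mapping in plain Python, using the names and the example `ner_tags` from the outputs above. The variables `index2tag` and `tag2index` are illustrative stand-ins, not the library's own implementation.

```python
# Class names in the order reported by the ClassLabel feature
names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

# Integer id -> tag string, and the reverse mapping
index2tag = dict(enumerate(names))
tag2index = {tag: idx for idx, tag in index2tag.items()}

# ner_tags of the first German training example
ner_tags = [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0]
decoded = [index2tag[i] for i in ner_tags]
print(decoded)
# ['O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'B-LOC', 'B-LOC', 'I-LOC', 'O']
```

Storing integers instead of strings keeps the dataset compact, while the list of names is kept once in the feature definition.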

In [51]:
# extracting the ClassLabel feature from the Sequence class
# note the attribute after ['ner_tags'] is `feature`, without an "s"
tags = panx_ch['de']['train'].features['ner_tags'].feature
tags

ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None)

In [53]:
# creating a function to convert a list of ints to a list of NER labels
# (with map's default settings, `batch` here is a single example)
def create_ner_tags(batch):
  return {'ner_tags_str':[tags.int2str(i) for i in batch['ner_tags']]}

In [56]:
eg = panx_ch['de']['train'][0]
eg

{'tokens': ['2.000',
  'Einwohnern',
  'an',
  'der',
  'Danziger',
  'Bucht',
  'in',
  'der',
  'polnischen',
  'Woiwodschaft',
  'Pommern',
  '.'],
 'ner_tags': [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0],
 'langs': ['de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de']}

In [58]:
# taking the integer ner_tags of the above and converting them back to labels
ner_labels = create_ner_tags(eg)
ner_labels

{'ner_tags_str': ['O',
  'O',
  'O',
  'O',
  'B-LOC',
  'I-LOC',
  'O',
  'O',
  'B-LOC',
  'B-LOC',
  'I-LOC',
  'O']}

In [68]:
# mapping over the DatasetDict to add a new column with the string labels
panx_de = panx_ch['de'].map(create_ner_tags)
panx_de

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'ner_tags_str'],
        num_rows: 12580
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'ner_tags_str'],
        num_rows: 6290
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'ner_tags_str'],
        num_rows: 6290
    })
})

In [72]:
# creating a dataframe to visualize
eg = panx_de['train'][0]
pd.DataFrame(
    [
        eg['tokens'],
        eg['ner_tags_str']
    ],
    index=['Tokens',"NER Tags"]
)

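|          | 0     | 1          | 2  | 3   | 4        | 5     | 6  | 7   | 8          | 9            | 10      | 11 |
|----------|-------|------------|----|-----|----------|-------|----|-----|------------|--------------|---------|----|
| Tokens   | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | .  |
| NER Tags | O     | O          | O  | O   | B-LOC    | I-LOC | O  | O   | B-LOC      | B-LOC        | I-LOC   | O  |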


In [75]:
# checking the frequencies of tags to see if there is any imbalance
from collections import Counter

split2freqs = defaultdict(Counter)
for split, dataset in panx_de.items():
    for row in dataset["ner_tags_str"]:
        for tag in row:
            if tag.startswith("B"):
                tag_type = tag.split("-")[1]
                split2freqs[split][tag_type] += 1
pd.DataFrame.from_dict(split2freqs, orient="index")

|            | LOC  | ORG  | PER  |
|------------|------|------|------|
| train      | 6186 | 5366 | 5810 |
| validation | 3172 | 2683 | 2893 |
| test       | 3180 | 2573 | 3071 |


The entity frequencies look roughly balanced: `LOC`, `ORG` and `PER` occur in similar proportions within each of the train, validation and test splits
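Raw counts are hard to compare across splits of different sizes, so one way to back up this observation is to convert them into per-split proportions. The sketch below does that with the counts reported above. The names `split2props` and `max_gap` are introduced here for illustration.

```python
# B- tag counts per split, copied from the frequency table above
split2freqs = {
    "train": {"LOC": 6186, "ORG": 5366, "PER": 5810},
    "validation": {"LOC": 3172, "ORG": 2683, "PER": 2893},
    "test": {"LOC": 3180, "ORG": 2573, "PER": 3071},
}

# Share of each entity type within its split
split2props = {}
for split, freqs in split2freqs.items():
    total = sum(freqs.values())
    split2props[split] = {tag: count / total for tag, count in freqs.items()}

for split, props in split2props.items():
    print(split, {tag: round(p, 3) for tag, p in props.items()})

# Largest difference in share for any entity type across the three splits
max_gap = max(
    max(split2props[s][tag] for s in split2props)
    - min(split2props[s][tag] for s in split2props)
    for tag in ("LOC", "ORG", "PER")
)
print(round(max_gap, 3))
```

Each entity type makes up roughly a third of the entities in every split, and its share differs by only about two percentage points between splits, so the validation and test sets should be a reasonable proxy for the training distribution.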