In [28]:
!nvidia-smi

Wed Jun 26 19:40:14 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P8               9W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [29]:
!git clone https://github.com/nlp-with-transformers/notebooks.git
%cd notebooks
from install import *
install_requirements()

Cloning into 'notebooks'...
remote: Enumerating objects: 526, done.
remote: Counting objects: 100% (526/526), done.
remote: Compressing objects: 100% (289/289), done.
remote: Total 526 (delta 251), reused 481 (delta 231), pack-reused 0
Receiving objects: 100% (526/526), 29.30 MiB | 15.13 MiB/s, done.
Resolving deltas: 100% (251/251), done.
/content/notebooks/notebooks
⏳ Installing base requirements ...
✅ Base requirements installed!
⏳ Installing Git LFS ...
✅ Git LFS installed!


In [30]:
from utils import *
setup_chapter()

Using transformers v4.16.2
Using datasets v1.16.1


In [31]:
import pandas as pd
toks = "Jeff Dean is a computer scientist at Google in California".split()
lbls = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-ORG", "O", "B-LOC"]
df = pd.DataFrame(data=[toks, lbls], index=['Tokens', 'Tags'])
df

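| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Tokens | Jeff | Dean | is | a | computer | scientist | at | Google | in | California |
| Tags | B-PER | I-PER | O | O | O | O | O | B-ORG | O | B-LOC |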


In [32]:
from datasets import get_dataset_config_names

xtreme_subsets = get_dataset_config_names("xtreme")
print(f"XTREME has {len(xtreme_subsets)} configurations")

XTREME has 183 configurations


In [33]:
panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]
panx_subsets[:3]

['PAN-X.af', 'PAN-X.ar', 'PAN-X.bg']

In [34]:
from datasets import load_dataset

load_dataset("xtreme", name="PAN-X.de")

DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 20000
    })
})

### First Code Block
```python
import pandas as pd

toks = "Jeff Dean is a computer scientist at Google in California".split()
lbls = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-ORG", "O", "B-LOC"]
df = pd.DataFrame(data=[toks, lbls], index=["Tokens", "Tags"])
```
- `toks = ...split()`: This line splits a string into a list of words. By default, `split()` splits on whitespace.
- `lbls = [...]`: This line defines one label per token in `toks`. Each label says whether the token belongs to a named entity (person, organization, location) and where it sits in that entity: `B-` marks the first token, `I-` marks a continuation token, and `O` marks a token outside any entity.
- `df = pd.DataFrame(...)`: This creates a pandas DataFrame from a list of two rows, one holding the tokens and one holding the tags. The `index` parameter labels those rows "Tokens" and "Tags"; the columns keep the default integer labels 0 to 9, one per token.
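
To make the tagging scheme concrete, the sketch below groups the tokens from this example into entity spans. The helper name `iob2_to_spans` is hypothetical (it is not part of pandas or `datasets`), and it assumes well-formed tags in which every entity starts with a `B-` tag.

```python
def iob2_to_spans(tokens, tags):
    """Group tokens into (entity_type, text) spans, assuming well-formed B-/I-/O tags."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity
            current_tokens.append(token)
        else:
            # "O" (or an I- tag that does not continue the open entity) closes it
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


toks = "Jeff Dean is a computer scientist at Google in California".split()
lbls = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-ORG", "O", "B-LOC"]
spans = iob2_to_spans(toks, lbls)
print(spans)
# [('PER', 'Jeff Dean'), ('ORG', 'Google'), ('LOC', 'California')]
```

Note how "Jeff" and "Dean" merge into a single person entity because the second token carries an `I-PER` tag rather than a new `B-PER`.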

### Second Code Block
```python
from datasets import get_dataset_config_names
xtreme_subsets = get_dataset_config_names("xtreme")
print(f"XTREME has {len(xtreme_subsets)} configurations")
```
- `get_dataset_config_names("xtreme")`: This fetches all available dataset configurations for a dataset named "xtreme" from the `datasets` library. The function returns a list of configuration names.
- `print(...)`: Displays the number of configurations found for the "xtreme" dataset.

### Third Code Block
```python
panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]
panx_subsets[:3]
```
- `panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]`: This list comprehension keeps only the configuration names in `xtreme_subsets` that start with "PAN", which is the prefix shared by the PAN-X configurations (for example `PAN-X.af`).
- `panx_subsets[:3]`: This slices the list to show only the first three entries. It's common to slice lists to examine a subset of the data without printing everything.

### Fourth Code Block
```python
from datasets import load_dataset
load_dataset("xtreme", name="PAN-X.de")
```
- `load_dataset("xtreme", name="PAN-X.de")`: This function loads one configuration of a dataset. Here "xtreme" is the dataset and "PAN-X.de" is the configuration name; the suffix after the dot is the language code, so this loads the German subset. The result is a `DatasetDict` with `train`, `validation`, and `test` splits.



The "PAN-X" subsets within the "XTREME" dataset are part of a collection of benchmark datasets designed for evaluating the performance of models on cross-lingual natural language processing (NLP) tasks, including Named Entity Recognition (NER). The "XTREME" benchmark, which stands for Cross-lingual TRansfer Evaluation of Multilingual Encoders, includes multiple types of NLP tasks, and "PAN-X" specifically targets the NER challenge.

### Details of PAN-X

- **Purpose**: The PAN-X subset is used to assess how well NLP models can recognize named entities (like names of people, locations, organizations) in text across different languages.
- **Content**: Each language subset contains tokenized sentences in which every token carries a tag. `B-` marks the first token of an entity, `I-` marks a continuation token, and `O` marks a token outside any entity. The entity types are person (PER), organization (ORG), and location (LOC), so B-PER is the beginning of a person's name and I-PER is its continuation.
- **Utility**: This subset is crucial for developing and testing the effectiveness of multilingual models in identifying and categorizing entities correctly across different linguistic contexts.

### Why It's Important

For NER systems, performance can vary significantly between languages because of linguistic differences and the uneven availability of training data. With datasets like PAN-X, researchers and developers can train models that work not only in high-resource languages (like English) but also in lower-resource languages, which broadens the reach of NLP technologies.

The "PAN-X" subsets in "XTREME" allow researchers to benchmark their models on a consistent set of data across many languages, making it an essential tool for advancing the state of multilingual NLP.

### Training Models with PAN-X

1. **Multilingual NER**: PAN-X provides annotated text data in various languages. Each piece of text is labeled with tags that identify different types of named entities such as persons (PER), locations (LOC), organizations (ORG), etc. This annotation format is crucial for training NER models, as it teaches the model to recognize where entities start and end in text and what type of entity they are.

2. **Cross-lingual Generalization**: One of the main goals of using a dataset like PAN-X is to train models that not only perform well in high-resource languages but also generalize across languages that may have less training data available. This is particularly important in building NER systems that are globally applicable.

3. **Model Evaluation**: Besides training, PAN-X is also used for validating and testing the performance of NER models. This helps in understanding the effectiveness of different model architectures and training approaches in handling the linguistic nuances of different languages.

4. **Benchmarking**: PAN-X serves as a benchmark dataset for comparing different models and approaches in multilingual NER. By using a consistent dataset across various studies, it provides a standardized way to measure progress in the field.

### Use in Research and Development

Researchers and developers use PAN-X to:
- **Train models**: By feeding the labeled data from PAN-X into machine learning algorithms, models learn to identify and classify named entities in text.
- **Fine-tune models**: For models pre-trained on large datasets (like BERT or other transformer-based models), PAN-X can be used to fine-tune these models on the specific task of NER across multiple languages.
- **Test cross-lingual capabilities**: After training, models are tested on language subsets they were not explicitly trained on to evaluate their ability to transfer learning across languages.

In [35]:
from collections import defaultdict
from datasets import DatasetDict

langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]
# Return a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict) # missing keys are filled with a new, empty DatasetDict

for lang, frac in zip(langs, fracs):
    # Load monolingual corpus
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
    # Shuffle and downsample each split according to spoken proportion
    for split in ds:
        panx_ch[lang][split] = (
            ds[split].shuffle(seed=0).select(range(int(frac * ds[split].num_rows))))

The cell above combines `defaultdict`, `DatasetDict`, `zip`, and the dataset manipulation methods `shuffle()` and `select()`. Each piece is broken down here:

### Understanding the Code

**Code Setup:**
```python
from collections import defaultdict
from datasets import DatasetDict

langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]
# Return a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict)
```
- `defaultdict(DatasetDict)`: This initializes `panx_ch` as a `defaultdict` that automatically creates a new `DatasetDict` if an accessed key doesn't exist. A `DatasetDict` is a special dictionary structure provided by the `datasets` library, typically used to store train, validation, and test datasets under respective keys.

**Loop Explanation:**
```python
for lang, frac in zip(langs, fracs):
    # Load monolingual corpus
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
    # Shuffle and downsample each split according to spoken proportion
    for split in ds:
        panx_ch[lang][split] = (
            ds[split]
            .shuffle(seed=0)
            .select(range(int(frac * ds[split].num_rows))))
```
- `for lang, frac in zip(langs, fracs)`: This loop iterates over the languages and their fractions in parallel. `zip()` pairs each language with its fraction, which sets how much of that language's data is kept. The fractions sum to roughly 1, and the comment in the code describes them as spoken proportions; the `_ch` suffix in `panx_ch` suggests they approximate the share of each language spoken in Switzerland.

- `ds = load_dataset("xtreme", name=f"PAN-X.{lang}")`: This line loads the dataset for one language. The f-string fills in the language code from the current loop iteration, producing configuration names such as `PAN-X.de` and `PAN-X.fr`.

- `for split in ds:`: This nested loop iterates over each data split in the dataset `ds`, typically 'train', 'validation', and 'test'.

- `panx_ch[lang][split] = ...`: Here, data for each language and split is stored in `panx_ch`. If the language key does not exist, `defaultdict` automatically creates a new `DatasetDict` for it.

- `ds[split].shuffle(seed=0)`: This method shuffles the dataset split to ensure that data sampling is random, controlled by a seed for reproducibility.

- `.select(range(int(frac * ds[split].num_rows)))`: After shuffling, this method selects a subset of the data according to the specified fraction `frac`. This is done by creating a range that spans from 0 to the product of the fraction and the number of rows in the dataset split, effectively downsampling the data according to the given proportion.
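
The shuffle-then-select pattern can be sketched in plain Python without the `datasets` library. This is only an illustration: `shuffle_and_downsample` is a hypothetical helper, Python's `random` module will not reproduce the exact row order that `Dataset.shuffle(seed=0)` produces, and the sketch assumes every language has a 20,000-row train split like the German one shown earlier.

```python
import random


def shuffle_and_downsample(rows, frac, seed=0):
    """Shuffle a copy of rows with a fixed seed, then keep the first int(frac * n) rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the result is reproducible
    return shuffled[: int(frac * len(shuffled))]  # int() truncates toward zero


langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]
num_train_rows = 20000  # assumption: same train size for all four languages

train_sizes = {
    lang: len(shuffle_and_downsample(range(num_train_rows), frac))
    for lang, frac in zip(langs, fracs)
}
print(train_sizes)
```

Shuffling before selecting matters: taking the first `k` rows of an unshuffled split would keep whatever ordering the source file happened to have, rather than a random sample.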

### Purpose of the Code

The primary purpose of this code is to load multilingual datasets, shuffle them for randomness, and then downsample each according to the linguistic proportion specified by `fracs`. This can be useful in scenarios where you need to balance or weight datasets according to linguistic demographics or usage frequencies, ensuring that models trained on this data are better tailored to the actual linguistic landscape they will operate within.

A `defaultdict` is a subclass of the built-in `dict` class in Python, provided by the `collections` module. It's used to provide a default value for the dictionary entries that haven't been set yet. When you access or modify a key that does not exist in the dictionary, `defaultdict` automatically creates a new entry for it with a default value determined by a function you provide when you initialize the `defaultdict`.

### How `defaultdict` Works
When you declare a `defaultdict`, you must provide a function that Python can call to produce a default value whenever a key is accessed that does not exist in the dictionary. This function should not take any arguments and return the default value you want for your new keys.

### Example of `defaultdict` Usage
Suppose you have a `defaultdict(int)`. Here, `int` is a function that, when called without any arguments, returns `0`. So, any time you try to access or modify a key that doesn't exist, `defaultdict` will automatically create that key with a value of `0`.
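
A minimal sketch of that behaviour, using a made-up sentence as input:

```python
from collections import defaultdict

word_counts = defaultdict(int)  # int() returns 0, so missing keys start at 0
for word in "the cat saw the dog".split():
    word_counts[word] += 1  # no KeyError the first time a word is seen

print(dict(word_counts))
# {'the': 2, 'cat': 1, 'saw': 1, 'dog': 1}

# Merely reading a missing key also inserts it with the default value
before = "bird" in word_counts
value = word_counts["bird"]
after = "bird" in word_counts
print(before, value, after)
# False 0 True
```

The last three lines show a side effect worth remembering: a lookup on a `defaultdict` is not read-only, so use `in` or `.get()` when you only want to test for a key.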

### `defaultdict(DatasetDict)`
In this example, `defaultdict(DatasetDict)` initializes a `defaultdict` where the default value for any new key is an empty `DatasetDict` object. `DatasetDict` is a class provided by the `datasets` library (commonly used in NLP tasks), which typically organizes datasets into splits like training, validation, and testing. Here's how it's useful:

1. **Automatic Handling of New Keys**: When you access `panx_ch[lang]` and if `lang` has not been used as a key in `panx_ch` before, `defaultdict` automatically creates a new entry for `lang` with a `DatasetDict` as its value. You don't need to check if the key exists or initialize it manually, which makes your code cleaner and less error-prone.

2. **Immediate Usability**: Since the default value is a `DatasetDict`, you can immediately start using it to store dataset splits (like 'train', 'validation', 'test'). You don't need to initialize or set up these splits manually; you can directly assign datasets to them as your code progresses.

### Usage in the Code
In the code above, `panx_ch` is a `defaultdict` of `DatasetDict`. When looping through the languages, a language that hasn't been added to `panx_ch` yet is added with a new, empty `DatasetDict` on first access. This allows the nested loop to fill in the dataset splits without needing to check or initialize the dictionary structure for each language beforehand.

This setup is particularly useful when dealing with dynamic or unknown sets of keys, where you expect to populate a dictionary in a structured way (like handling multiple languages and their dataset splits) without needing extensive checks or initializations for each key.
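
The saving is easiest to see side by side. In this sketch a plain `dict` stands in for `DatasetDict` (an assumption made so the example runs without the `datasets` library), and the string values are placeholders for real dataset splits.

```python
from collections import defaultdict

langs, splits = ["de", "fr"], ["train", "validation", "test"]

# Without defaultdict: every new language needs an explicit existence check
manual = {}
for lang in langs:
    if lang not in manual:
        manual[lang] = {}
    for split in splits:
        manual[lang][split] = f"{lang}/{split}"

# With defaultdict: the inner container is created on first access
auto = defaultdict(dict)
for lang in langs:
    for split in splits:
        auto[lang][split] = f"{lang}/{split}"

print(dict(auto) == manual)
# True
```

Both loops build the same nested structure; the `defaultdict` version simply drops the bookkeeping.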

The `DatasetDict` is a specialized data structure provided by the `datasets` library, which is popular in the natural language processing (NLP) and machine learning communities for handling datasets. This data structure is designed specifically to manage multiple splits of datasets commonly used in machine learning workflows, such as training, validation, and testing sets.

### Key Features of `DatasetDict`

1. **Organized by Splits**: A `DatasetDict` typically organizes data into different subsets or splits. These splits are often named 'train', 'validation', and 'test', although the specific names can vary depending on the dataset. Each split is stored as a key in the `DatasetDict`, with the value being a `Dataset` object that contains the actual data for that split.

2. **Easy Data Manipulation**: Since each split is a `Dataset` object, you can easily apply transformations, filtering, or sampling operations on individual splits. This is particularly useful for preparing datasets for machine learning models, where different preprocessing might be required for training versus testing data.

3. **Consistency Across Tasks**: By using a `DatasetDict`, you maintain a consistent structure for accessing and manipulating your data across different parts of your machine learning pipeline. This consistency helps reduce bugs and improve the clarity of data handling code.

4. **Integration with `datasets` Library**: `DatasetDict` is tightly integrated with the `datasets` library's functionality. It supports seamless operations like serialization, loading data from disk, and more complex transformations like tokenization and feature encoding directly within the dictionary structure.

In [36]:
import pandas as pd

pd.DataFrame({lang: [panx_ch[lang]["train"].num_rows] for lang in langs}, index=["Number of training examples"])

| | de | fr | it | en |
|---|---|---|---|---|
| Number of training examples | 12580 | 4580 | 1680 | 1180 |


```python
import pandas as pd

pd.DataFrame({lang: [panx_ch[lang]["train"].num_rows] for lang in langs}, index=["Number of training examples"])
```
- **Import Statement**: The `import pandas as pd` line imports the pandas library, which is used for data manipulation and analysis.
- **Data Frame Creation**: This line creates a pandas DataFrame from a dictionary comprehension. The dictionary's keys are language codes (from the list `langs`), and each value is a one-element list holding the number of training examples for that language, read from `num_rows` on the "train" split of `panx_ch[lang]`.
- **Index Specification**: The `index` parameter sets the row label for the DataFrame. Here, it is set to "Number of training examples", which will be the label of the row displaying each language's number of training examples.

### Output
The resulting DataFrame shows the number of training examples for German (de), French (fr), Italian (it), and English (en). The counts follow the sampling fractions: German dominates with 12,580 examples, while English has only 1,180.


In [37]:
element = panx_ch["de"]["train"][0]
for key, value in element.items():
    print(f"{key}: {value}")

tokens: ['2.000', 'Einwohnern', 'an', 'der', 'Danziger', 'Bucht', 'in', 'der',
'polnischen', 'Woiwodschaft', 'Pommern', '.']
ner_tags: [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0]
langs: ['de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de']


In [38]:
# inspect the features to see how the integer "ner_tags" map to label names
for key, value in panx_ch["de"]["train"].features.items():
    print(f"{key}: {value}")

tokens: Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)
ner_tags: Sequence(feature=ClassLabel(num_classes=7, names=['O', 'B-PER',
'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], names_file=None, id=None),
length=-1, id=None)
langs: Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)


In [39]:
tags = panx_ch["de"]["train"].features["ner_tags"].feature
print(tags)

ClassLabel(num_classes=7, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
'B-LOC', 'I-LOC'], names_file=None, id=None)
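
The `ClassLabel` output above lists the label names in ID order, so converting an integer tag to its name amounts to a list lookup. The sketch below imitates that lookup in plain Python; `id_to_label` and `label_to_id` are hypothetical names rather than the `datasets` API, and the `names` list is copied from the output above.

```python
# Label names copied from the ClassLabel output; position in the list is the class ID
names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]


def id_to_label(idx):
    """Hypothetical stand-in for the ClassLabel lookup: class ID -> label name."""
    return names[idx]


# Reverse mapping: label name -> class ID
label_to_id = {name: idx for idx, name in enumerate(names)}

ner_tags = [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0]  # first German training example
ner_tags_str = [id_to_label(idx) for idx in ner_tags]
print(ner_tags_str)
# ['O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'B-LOC', 'B-LOC', 'I-LOC', 'O']
```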


In [40]:
def create_tag_names(batch):
    return {"ner_tags_str": [tags.int2str(idx) for idx in batch["ner_tags"]]}

panx_de = panx_ch["de"].map(create_tag_names)

In [41]:
print(panx_ch)
print(panx_de)

defaultdict(<class 'datasets.dataset_dict.DatasetDict'>, {'de': DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 6290
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 6290
    })
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 12580
    })
}), 'fr': DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 2290
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 2290
    })
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 4580
    })
}), 'it': DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 840
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 840
    })
    train: Dataset({
        features: ['tokens

In [42]:
de_example = panx_de["train"][0]
pd.DataFrame([de_example["tokens"], de_example["ner_tags_str"]], ['Tokens', 'Tags']) # can see "ner_tags_str" that was just converted being used here

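| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tokens | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | . |
| Tags | O | O | O | O | B-LOC | I-LOC | O | O | B-LOC | B-LOC | I-LOC | O |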


In [43]:
from collections import Counter

# Quick sanity check: count entities per split to spot any unusual tag imbalance
split2freqs = defaultdict(Counter)
for split, dataset in panx_de.items(): # split is one of train, validation, test
    for row in dataset["ner_tags_str"]:
        for tag in row:
            if tag.startswith("B"): # each entity has exactly one B- tag, so this counts entities
                tag_type = tag.split("-")[1]
                split2freqs[split][tag_type] += 1
pd.DataFrame.from_dict(split2freqs, orient="index")

| | ORG | LOC | PER |
|---|---|---|---|
| validation | 2683 | 3172 | 2893 |
| test | 2573 | 3180 | 3071 |
| train | 5366 | 6186 | 5810 |
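
Raw counts are hard to compare across splits of different sizes, so one way to read this table is to convert each row into shares. The sketch below does that with the counts copied from the table; treating a spread of under 10 percentage points as "roughly balanced" is an arbitrary threshold chosen for illustration.

```python
# Entity counts copied from the table above
counts = {
    "validation": {"ORG": 2683, "LOC": 3172, "PER": 2893},
    "test": {"ORG": 2573, "LOC": 3180, "PER": 3071},
    "train": {"ORG": 5366, "LOC": 6186, "PER": 5810},
}

# Convert counts to per-split shares so splits of different sizes are comparable
fractions = {}
for split, freqs in counts.items():
    total = sum(freqs.values())
    fractions[split] = {tag: n / total for tag, n in freqs.items()}

for split, shares in fractions.items():
    spread = max(shares.values()) - min(shares.values())
    rounded = {tag: round(share, 3) for tag, share in shares.items()}
    print(f"{split}: {rounded} spread={spread:.3f}")
```

In every split each entity type accounts for roughly a third of the entities, with LOC slightly ahead, so the validation and test sets look like reasonable proxies for the training distribution.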


### Multilingual Transformers
