In [189]:
!nvidia-smi

Thu Jun 27 09:21:10 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P0              26W /  70W |   2351MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [190]:
!git clone https://github.com/nlp-with-transformers/notebooks.git
%cd notebooks
from install import *
install_requirements()

Cloning into 'notebooks'...
remote: Enumerating objects: 526, done.
remote: Counting objects: 100% (526/526), done.
remote: Compressing objects: 100% (289/289), done.
remote: Total 526 (delta 251), reused 480 (delta 231), pack-reused 0
Receiving objects: 100% (526/526), 29.30 MiB | 25.02 MiB/s, done.
Resolving deltas: 100% (251/251), done.
/content/notebooks/notebooks/notebooks/notebooks/notebooks/notebooks/notebooks
⏳ Installing base requirements ...
✅ Base requirements installed!
⏳ Installing Git LFS ...
✅ Git LFS installed!


In [191]:
from utils import *
setup_chapter()

Using transformers v4.16.2
Using datasets v1.16.1


In [192]:
import pandas as pd
toks = "Jeff Dean is a computer scientist at Google in California".split()
lbls = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-ORG", "O", "B-LOC"]
df = pd.DataFrame(data=[toks, lbls], index=['Tokens', 'Tags'])
df

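| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Tokens | Jeff | Dean | is | a | computer | scientist | at | Google | in | California |
| Tags | B-PER | I-PER | O | O | O | O | O | B-ORG | O | B-LOC |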


In [193]:
from datasets import get_dataset_config_names

xtreme_subsets = get_dataset_config_names("xtreme")
print(f"XTREME has {len(xtreme_subsets)} configurations")

XTREME has 183 configurations


In [194]:
panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]
panx_subsets[:3]

['PAN-X.af', 'PAN-X.ar', 'PAN-X.bg']

In [195]:
from datasets import load_dataset

load_dataset("xtreme", name="PAN-X.de")

DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 20000
    })
})

### First Code Block
```python
toks = "Jeff Dean is a computer scientist at Google in California".split()
lbls = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-ORG", "O", "B-LOC"]
df = pd.DataFrame(data=[toks, lbls], index=['Tokens', 'Tags'])
```
- `toks = ...split()`: This line splits a string into a list of words. `split()` by default splits by whitespace.
- `lbls = [...]`: This line defines a list of labels corresponding to each token in the `toks` list, indicating whether each token is part of a named entity (e.g., person, organization, location) and its position in the entity (beginning or inside).
- `df = pd.DataFrame(...)`: This creates a pandas DataFrame from a list of two lists, so each list becomes a row: one row of tokens and one row of tags. The `index` parameter labels those two rows "Tokens" and "Tags"; the columns are numbered 0 to 9, one per token.
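
To make the B-/I-/O labelling concrete, here is a minimal sketch that groups the tokens above into entity spans. The helper name `extract_entities` is hypothetical (it is not part of the notebook or of any library), and the sketch assumes well-formed tags in which an `I-` tag always continues an entity of the same type.

```python
def extract_entities(tokens, tags):
    """Group tokens into (text, entity_type) spans based on B-/I-/O tags."""
    entities, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: flush any entity still in progress.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes the current entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


toks = "Jeff Dean is a computer scientist at Google in California".split()
lbls = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-ORG", "O", "B-LOC"]
entities = extract_entities(toks, lbls)
print(entities)
```

The `B-` prefix is what lets two adjacent entities of the same type be told apart, which is why the tag set distinguishes "beginning" from "inside".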

### Second Code Block
```python
from datasets import get_dataset_config_names
xtreme_subsets = get_dataset_config_names("xtreme")
print(f"XTREME has {len(xtreme_subsets)} configurations")
```
- `get_dataset_config_names("xtreme")`: This fetches all available dataset configurations for a dataset named "xtreme" from the `datasets` library. The function returns a list of configuration names.
- `print(...)`: Displays the number of configurations found for the "xtreme" dataset.

### Third Code Block
```python
panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]
panx_subsets[:3]
```
- `panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]`: This line uses a list comprehension to keep only the configurations in `xtreme_subsets` whose names start with "PAN", the prefix shared by the PAN-X subsets of XTREME (for example `PAN-X.af`).
- `panx_subsets[:3]`: This slices the list to show only the first three entries. It's common to slice lists to examine a subset of the data without printing everything.

### Fourth Code Block
```python
from datasets import load_dataset
load_dataset("xtreme", name="PAN-X.de")
```
- `load_dataset("xtreme", name="PAN-X.de")`: This function loads a specific dataset. It specifies "xtreme" as the dataset and "PAN-X.de" as the configuration name, where the `de` suffix is the language code for German. As the output shows, the result is a `DatasetDict` with `train`, `validation`, and `test` splits.



The "PAN-X" subsets within the "XTREME" dataset are part of a collection of benchmark datasets designed for evaluating the performance of models on cross-lingual natural language processing (NLP) tasks, including Named Entity Recognition (NER). The "XTREME" benchmark, which stands for Cross-lingual TRansfer Evaluation of Multilingual Encoders, includes multiple types of NLP tasks, and "PAN-X" specifically targets the NER challenge.

### Details of PAN-X

- **Purpose**: The PAN-X subset is used to assess how well NLP models can recognize named entities (like names of people, locations, organizations) in text across different languages.
- **Content**: It typically contains data in multiple languages, each labeled with annotations that mark named entities according to a standard schema (such as B-PER for beginning of a person's name, I-PER for continuation of a person's name, etc.).
- **Utility**: This subset is crucial for developing and testing the effectiveness of multilingual models in identifying and categorizing entities correctly across different linguistic contexts.

### Why It's Important

For NER systems, performance can vary significantly between languages due to linguistic differences and the availability of training data. By using datasets like PAN-X, researchers and developers can train models that are not only effective in languages with rich resources (like English) but also in less resource-dense languages, thus ensuring broader applicability and utility of NLP technologies globally.

The "PAN-X" subsets in "XTREME" allow researchers to benchmark their models on a consistent set of data across many languages, making it an essential tool for advancing the state of multilingual NLP.

### Training Models with PAN-X

1. **Multilingual NER**: PAN-X provides annotated text data in various languages. Each piece of text is labeled with tags that identify different types of named entities such as persons (PER), locations (LOC), organizations (ORG), etc. This annotation format is crucial for training NER models, as it teaches the model to recognize where entities start and end in text and what type of entity they are.

2. **Cross-lingual Generalization**: One of the main goals of using a dataset like PAN-X is to train models that not only perform well in high-resource languages but also generalize across languages that may have less training data available. This is particularly important in building NER systems that are globally applicable.

3. **Model Evaluation**: Besides training, PAN-X is also used for validating and testing the performance of NER models. This helps in understanding the effectiveness of different model architectures and training approaches in handling the linguistic nuances of different languages.

4. **Benchmarking**: PAN-X serves as a benchmark dataset for comparing different models and approaches in multilingual NER. By using a consistent dataset across various studies, it provides a standardized way to measure progress in the field.

### Use in Research and Development

Researchers and developers use PAN-X to:
- **Train models**: By feeding the labeled data from PAN-X into machine learning algorithms, models learn to identify and classify named entities in text.
- **Fine-tune models**: For models pre-trained on large datasets (like BERT or other transformer-based models), PAN-X can be used to fine-tune these models on the specific task of NER across multiple languages.
- **Test cross-lingual capabilities**: After training, models are tested on language subsets they were not explicitly trained on to evaluate their ability to transfer learning across languages.

In [196]:
from collections import defaultdict
from datasets import DatasetDict

langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]
# Return a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict) # initialises panx_ch as a defaultdict that automatically creates an empty DatasetDict for any missing key

for lang, frac in zip(langs, fracs):
    # Load monolingual corpus
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
    # Shuffle and downsample each split according to spoken proportion
    for split in ds:
        panx_ch[lang][split] = (
            ds[split].shuffle(seed=0).select(range(int(frac * ds[split].num_rows))))

The explanation below breaks down the code above, focusing on the use of `defaultdict`, `DatasetDict`, `zip`, and dataset manipulation methods:

### Understanding the Code

**Code Setup:**
```python
from collections import defaultdict
from datasets import DatasetDict

langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]
# Return a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict)
```
- `defaultdict(DatasetDict)`: This initializes `panx_ch` as a `defaultdict` that automatically creates a new `DatasetDict` if an accessed key doesn't exist. A `DatasetDict` is a special dictionary structure provided by the `datasets` library, typically used to store train, validation, and test datasets under respective keys.

**Loop Explanation:**
```python
for lang, frac in zip(langs, fracs):
    # Load monolingual corpus
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
    # Shuffle and downsample each split according to spoken proportion
    for split in ds:
        panx_ch[lang][split] = (
            ds[split]
            .shuffle(seed=0)
            .select(range(int(frac * ds[split].num_rows))))
```
- `for lang, frac in zip(langs, fracs)`: This loop iterates over languages and their corresponding fractions simultaneously. `zip()` is used to pair each language with its fraction, which represents the proportion of that language spoken in a specific context (here possibly weighted by linguistic presence in a country like Switzerland).

- `ds = load_dataset("xtreme", name=f"PAN-X.{lang}")`: This line loads the dataset for a specific language. The `name` parameter is dynamically formatted to match the dataset name structure, loading different language data as specified by the loop iteration.

- `for split in ds:`: This nested loop iterates over each data split in the dataset `ds`, typically 'train', 'validation', and 'test'.

- `panx_ch[lang][split] = ...`: Here, data for each language and split is stored in `panx_ch`. If the language key does not exist, `defaultdict` automatically creates a new `DatasetDict` for it.

- `ds[split].shuffle(seed=0)`: This method shuffles the dataset split to ensure that data sampling is random, controlled by a seed for reproducibility.

- `.select(range(int(frac * ds[split].num_rows)))`: After shuffling, this method selects a subset of the data according to the specified fraction `frac`. This is done by creating a range that spans from 0 to the product of the fraction and the number of rows in the dataset split, effectively downsampling the data according to the given proportion.
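
The shuffle-then-select pattern can be sketched with plain Python lists, without downloading anything. This is only an illustration: `shuffle_and_select` is a hypothetical helper, the split sizes (20,000 / 10,000 / 10,000) are taken from the German `DatasetDict` printed earlier and assumed for the other languages, and Python's `random` module will not reproduce the exact permutation that `Dataset.shuffle(seed=0)` produces.

```python
import random
from collections import defaultdict

langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]
# Assumed split sizes, mirroring the PAN-X.de output shown earlier.
split_sizes = {"train": 20000, "validation": 10000, "test": 10000}


def shuffle_and_select(rows, frac, seed=0):
    """Shuffle a copy of rows reproducibly, then keep the first frac of them."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return shuffled[: int(frac * len(rows))]


toy_ch = defaultdict(dict)  # a plain dict stands in for DatasetDict here
for lang, frac in zip(langs, fracs):
    for split, n_rows in split_sizes.items():
        toy_ch[lang][split] = shuffle_and_select(list(range(n_rows)), frac)

train_sizes = {lang: len(toy_ch[lang]["train"]) for lang in langs}
print(train_sizes)
```

Because `int()` truncates, each split keeps at most `frac` of its rows, and shuffling first ensures the kept rows are a random sample rather than the first rows of the file.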

### Purpose of the Code

The primary purpose of this code is to load multilingual datasets, shuffle them for randomness, and then downsample each according to the linguistic proportion specified by `fracs`. This can be useful in scenarios where you need to balance or weight datasets according to linguistic demographics or usage frequencies, ensuring that models trained on this data are better tailored to the actual linguistic landscape they will operate within.

A `defaultdict` is a subclass of the built-in `dict` class in Python, provided by the `collections` module. It's used to provide a default value for the dictionary entries that haven't been set yet. When you access or modify a key that does not exist in the dictionary, `defaultdict` automatically creates a new entry for it with a default value determined by a function you provide when you initialize the `defaultdict`.

### How `defaultdict` Works
When you declare a `defaultdict`, you must provide a function that Python can call to produce a default value whenever a key is accessed that does not exist in the dictionary. This function should not take any arguments and return the default value you want for your new keys.

### Example of `defaultdict` Usage
Suppose you have a `defaultdict(int)`. Here, `int` is a function that, when called without any arguments, returns `0`. So, any time you try to access or modify a key that doesn't exist, `defaultdict` will automatically create that key with a value of `0`.
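
A minimal sketch of that behaviour, using a made-up list of tags purely for illustration, and contrasting it with a plain `dict`:

```python
from collections import defaultdict

tag_counts = defaultdict(int)
for tag in ["B-PER", "I-PER", "O", "O", "B-LOC"]:
    tag_counts[tag] += 1  # a missing key starts at int() == 0

had_org_before = "B-ORG" in tag_counts  # the key has never been touched
org_count = tag_counts["B-ORG"]         # reading a missing key inserts it with 0
has_org_after = "B-ORG" in tag_counts   # the lookup itself created the entry

# A plain dict raises KeyError for the same operation.
plain = {}
try:
    plain["B-ORG"] += 1
    plain_raised = False
except KeyError:
    plain_raised = True

print(dict(tag_counts), plain_raised)
```

Note that merely reading a missing key adds it to the dictionary, which is convenient for counting but can be surprising when checking membership afterwards.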

### `defaultdict(DatasetDict)`
In your specific example, `defaultdict(DatasetDict)` initializes a `defaultdict` where the default value for any new key is a `DatasetDict` object. `DatasetDict` is a class provided by the `datasets` library (commonly used in NLP tasks), which typically organizes datasets into splits like training, validation, and testing. Here's how it's useful:

1. **Automatic Handling of New Keys**: When you access `panx_ch[lang]` and if `lang` has not been used as a key in `panx_ch` before, `defaultdict` automatically creates a new entry for `lang` with a `DatasetDict` as its value. You don't need to check if the key exists or initialize it manually, which makes your code cleaner and less error-prone.

2. **Immediate Usability**: Since the default value is a `DatasetDict`, you can immediately start using it to store dataset splits (like 'train', 'validation', 'test'). You don't need to initialize or set up these splits manually; you can directly assign datasets to them as your code progresses.

### Usage in the Code
In your provided code, `panx_ch` is a `defaultdict` of `DatasetDict`. When looping through different languages, if a specific language hasn't been added to `panx_ch` yet, it will be added with a new, empty `DatasetDict`. This allows the nested loop to fill in the dataset splits without needing to check or initialize the dictionary structure for each language beforehand.

This setup is particularly useful when dealing with dynamic or unknown sets of keys, where you expect to populate a dictionary in a structured way (like handling multiple languages and their dataset splits) without needing extensive checks or initializations for each key.

The `DatasetDict` is a specialized data structure provided by the `datasets` library, which is popular in the natural language processing (NLP) and machine learning communities for handling datasets. This data structure is designed specifically to manage multiple splits of datasets commonly used in machine learning workflows, such as training, validation, and testing sets.

### Key Features of `DatasetDict`

1. **Organized by Splits**: A `DatasetDict` typically organizes data into different subsets or splits. These splits are often named 'train', 'validation', and 'test', although the specific names can vary depending on the dataset. Each split is stored as a key in the `DatasetDict`, with the value being a `Dataset` object that contains the actual data for that split.

2. **Easy Data Manipulation**: Since each split is a `Dataset` object, you can easily apply transformations, filtering, or sampling operations on individual splits. This is particularly useful for preparing datasets for machine learning models, where different preprocessing might be required for training versus testing data.

3. **Consistency Across Tasks**: By using a `DatasetDict`, you maintain a consistent structure for accessing and manipulating your data across different parts of your machine learning pipeline. This consistency helps reduce bugs and improve the clarity of data handling code.

4. **Integration with `datasets` Library**: `DatasetDict` is tightly integrated with the `datasets` library's functionality. It supports seamless operations like serialization, loading data from disk, and more complex transformations like tokenization and feature encoding directly within the dictionary structure.

In [197]:
import pandas as pd

pd.DataFrame({lang: [panx_ch[lang]["train"].num_rows] for lang in langs}, index=["Number of training examples"])

| | de | fr | it | en |
|---|---|---|---|---|
| Number of training examples | 12580 | 4580 | 1680 | 1180 |


```python
import pandas as pd

pd.DataFrame({lang: [panx_ch[lang]["train"].num_rows] for lang in langs}, index=["Number of training examples"])
```
- **Import Statement**: The `import pandas as pd` line imports the pandas library, which is used for data manipulation and analysis.
- **Data Frame Creation**: This line creates a pandas DataFrame from a dictionary comprehension. The dictionary's keys are language codes (from the list `langs`), and the values are single-element lists containing the number of training examples for each language, read from `panx_ch` (a `defaultdict` of `DatasetDict` objects) using `num_rows` on the "train" split.
- **Index Specification**: The `index` parameter sets the row label for the DataFrame. Here, it is set to "Number of training examples", which will be the label of the row displaying each language's number of training examples.

### Output
This DataFrame is displayed showing the number of training examples for each language: German (de), French (fr), Italian (it), and English (en), providing a quick overview of the dataset sizes for training.


In [198]:
element = panx_ch["de"]["train"][0]
for key, value in element.items():
    print(f"{key}: {value}")

tokens: ['2.000', 'Einwohnern', 'an', 'der', 'Danziger', 'Bucht', 'in', 'der',
'polnischen', 'Woiwodschaft', 'Pommern', '.']
ner_tags: [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0]
langs: ['de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de']


In [199]:
# decode the integer "ner_tags" into readable label names by inspecting the dataset features
for key, value in panx_ch["de"]["train"].features.items():
    print(f"{key}: {value}")

tokens: Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)
ner_tags: Sequence(feature=ClassLabel(num_classes=7, names=['O', 'B-PER',
'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], names_file=None, id=None),
length=-1, id=None)
langs: Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)


In [200]:
tags = panx_ch["de"]["train"].features["ner_tags"].feature
print(tags)

ClassLabel(num_classes=7, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
'B-LOC', 'I-LOC'], names_file=None, id=None)


In [201]:
def create_tag_names(batch):
    return {"ner_tags_str": [tags.int2str(idx) for idx in batch["ner_tags"]]}

panx_de = panx_ch["de"].map(create_tag_names)

  0%|          | 0/6290 [00:00<?, ?ex/s]

  0%|          | 0/6290 [00:00<?, ?ex/s]

  0%|          | 0/12580 [00:00<?, ?ex/s]

In [202]:
print(panx_ch)
print(panx_de)

defaultdict(<class 'datasets.dataset_dict.DatasetDict'>, {'de': DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 6290
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 6290
    })
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 12580
    })
}), 'fr': DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 2290
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 2290
    })
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 4580
    })
}), 'it': DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 840
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 840
    })
    train: Dataset({
        features: ['tokens

In [203]:
de_example = panx_de["train"][0]
pd.DataFrame([de_example["tokens"], de_example["ner_tags_str"]], ['Tokens', 'Tags']) # can see "ner_tags_str" that was just converted being used here

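| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tokens | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | . |
| Tags | O | O | O | O | B-LOC | I-LOC | O | O | B-LOC | B-LOC | I-LOC | O |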
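
The `ner_tags_str` values come from a simple index lookup into the list of class names. The sketch below mimics that lookup with a plain list; `toy_int2str` and `toy_str2int` are hypothetical stand-ins for the `ClassLabel.int2str` and `ClassLabel.str2int` methods, not the library implementation.

```python
# Class names and integer tags copied from the outputs shown earlier.
names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
ner_tags = [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0]


def toy_int2str(idx):
    """Map an integer class ID to its label name."""
    return names[idx]


def toy_str2int(name):
    """Map a label name back to its integer class ID."""
    return names.index(name)


ner_tags_str = [toy_int2str(idx) for idx in ner_tags]
round_trip = [toy_str2int(name) for name in ner_tags_str]
print(ner_tags_str)
```

Keeping the labels as integers in the dataset and converting them only for display is what allows the same `ner_tags` column to be fed directly to a loss function later on.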


In [204]:
from collections import Counter

# this is for quick checking to make sure no unusual imbalance in tags
split2freqs = defaultdict(Counter)
for split, dataset in panx_de.items(): # split means train, test, validate
    for row in dataset["ner_tags_str"]:
        for tag in row:
            if tag.startswith("B"): # checking B-LOC, B-PER only
                tag_type = tag.split("-")[1]
                split2freqs[split][tag_type] += 1
pd.DataFrame.from_dict(split2freqs, orient="index")

| | ORG | LOC | PER |
|---|---|---|---|
| validation | 2683 | 3172 | 2893 |
| test | 2573 | 3180 | 3071 |
| train | 5366 | 6186 | 5810 |


### Multilingual Transformers


In [205]:
from transformers import AutoTokenizer

bert_model_name = "bert-base-cased"
xlmr_model_name = "xlm-roberta-base"
bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
xlmr_tokenizer = AutoTokenizer.from_pretrained(xlmr_model_name)

In [206]:
text = "Jack Sparrow loves New York!"
bert_tokens = bert_tokenizer(text).tokens()
xlmr_tokens = xlmr_tokenizer(text).tokens()

In [207]:
df = pd.DataFrame([bert_tokens, xlmr_tokens], index=["BERT", "XLM-R"])
df

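| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT | `[CLS]` | `Jack` | `Spa` | `##rrow` | `loves` | `New` | `York` | `!` | `[SEP]` | |
| XLM-R | `<s>` | `▁Jack` | `▁Spar` | `row` | `▁love` | `s` | `▁New` | `▁York` | `!` | `</s>` |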


In [208]:
"".join(xlmr_tokens).replace(u"\u2581", " ")

'<s> Jack Sparrow loves New York!</s>'
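
The two tokenizers mark word boundaries differently: XLM-R's SentencePiece tokens carry the whitespace explicitly as `▁` (U+2581), while BERT's WordPiece tokens mark continuations with `##`. The sketch below hard-codes the tokens from the table above so it runs without downloading any tokenizer; `merge_wordpieces` is a hypothetical helper written for this illustration.

```python
# Tokens copied from the tokenizer outputs above.
xlmr_tokens = ["<s>", "\u2581Jack", "\u2581Spar", "row", "\u2581love", "s",
               "\u2581New", "\u2581York", "!", "</s>"]
bert_tokens = ["[CLS]", "Jack", "Spa", "##rrow", "loves", "New", "York", "!", "[SEP]"]

# SentencePiece: whitespace is encoded in the tokens, so joining and
# replacing U+2581 with a space recovers the original spacing.
xlmr_text = "".join(xlmr_tokens).replace("\u2581", " ")


def merge_wordpieces(tokens):
    """Glue '##' continuation pieces onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return words


bert_words = merge_wordpieces(bert_tokens)
print(xlmr_text)
print(bert_words)
```

With WordPiece the information about whether `!` was attached to `York` is lost, whereas the SentencePiece tokens preserve it, which is one reason the `▁` convention is convenient for multilingual text.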

## Creating Custom Model for Token Classification

In [209]:
import torch.nn as nn
from transformers import XLMRobertaConfig
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.models.roberta.modeling_roberta import RobertaModel
from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel

class XLMRobertaForTokenClassification(RobertaPreTrainedModel):
    # forward() serves both training and inference: when no labels are passed, the loss calculation is skipped and only the logits are returned.
    # The dropout layer is only active in training mode; after model.eval() it acts as the identity, so it does not affect inference.
    config_class = XLMRobertaConfig

    def __init__(self, config):

        # super().__init__(config) runs the initialisation of "RobertaPreTrainedModel"
        #           - This abstract class handles the initialisation or loading of pretrained weights.
        #           - The model body (RobertaModel) is then loaded below.
        #           - It is extended with our own classification head consisting of dropout and a standard feed-forward layer.
        super().__init__(config) # the config object supplies the default hyperparameters. To change them, override the defaults in the configuration.
        self.num_labels = config.num_labels
        # Load model body
        self.roberta = RobertaModel(config, add_pooling_layer=False) # add_pooling_layer=False skips the pooler, which would only add a pooled representation of the first token; token classification needs the hidden state of every token
        # Set up token classification head
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        # Load and initialize weights
        self.init_weights()                 # initialises all the weights; when the model is created with from_pretrained(), the pretrained weights are loaded into the model body and the token classification head stays randomly initialised

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, labels=None, **kwargs):
        # Use model body to get encoder representations
        outputs = self.roberta(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, **kwargs)       # calls the attribute declared
        # Apply classifier to encoder representation
        sequence_output = self.dropout(outputs[0]) # outputs[0] is the first element of the model output: the last-layer hidden states for every token, shape [batch_size, sequence_length, hidden_size]
        logits = self.classifier(sequence_output)
        # Calculate losses
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
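            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) # .view flattens the batch and sequence dimensions: logits become [batch_size * sequence_length, num_labels] and labels become [batch_size * sequence_length], the shapes CrossEntropyLoss expects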
        # Return model output object
        return TokenClassifierOutput(loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions)


### Components of the Line

1. **`logits.view(-1, self.num_labels)`**:
   - `logits` is a tensor containing the raw output scores from the last layer of your neural network. These scores are typically not yet passed through an activation function like softmax.
   - `.view(-1, self.num_labels)` reshapes the `logits` tensor. The `-1` tells PyTorch to infer the size of this dimension based on the other dimensions and the total number of elements in the tensor. The second dimension, `self.num_labels`, explicitly sets the size of the second dimension to the number of different labels (or classes) the model predicts.
   - This reshaping is crucial for classification tasks where `logits` originally has a shape that includes dimensions for batch size and sequence length (if applicable). The reshaping flattens the logits into a 2D tensor where each row corresponds to a prediction vector for a single data point.

2. **`labels.view(-1)`**:
   - `labels` contains the true class labels for each data point in the batch.
   - `.view(-1)` reshapes the `labels` tensor to a one-dimensional tensor. This operation flattens any tensor of labels into a single long vector.
   - This reshaping is necessary to match the prediction vectors in the loss calculation, where each element of this flattened array corresponds to the true class label of each input example.

3. **`loss_fct(...)`**:
   - `loss_fct` is typically an instance of a loss class from PyTorch, like `nn.CrossEntropyLoss()`. This function computes the loss between the logits (predictions) and the labels (true data).
   - `CrossEntropyLoss` is often used in multi-class classification tasks. It combines `nn.LogSoftmax()` and `nn.NLLLoss()` (negative log-likelihood loss) in one single class. This means it first applies a log-softmax to the logits to turn them into log-probabilities, and then computes the negative log-likelihood loss.

### How the Line Works

- The reshaped `logits` tensor, now a 2D tensor with each row representing the logits for a single example, and the flattened `labels` tensor, are passed to the loss function.
- The `CrossEntropyLoss` function calculates the loss for each pair of predicted logits and true label by evaluating how well the predicted probabilities match the actual labels.
- The loss values for all examples are then typically averaged (or summed, depending on the loss function's parameters) to produce a single scalar value that represents the total loss for the batch of data. This scalar is used during the backpropagation process to update the model's weights.

### Practical Example

Suppose a batch holds 10 sequences of 12 tokens each, and the model classifies every token into one of 5 possible classes (`self.num_labels` is 5):
- Original `logits` has a shape of `[10, 12, 5]` (10 sequences, 12 tokens each, 5 scores per token).
- Original `labels` has a shape of `[10, 12]` (1 label per token).
- After reshaping: `logits` becomes `[120, 5]`, and `labels` becomes `[120]`.
- `loss_fct` computes how well each set of scores predicts the true class for its corresponding token, then averages these individual losses to get the batch loss.

This line is central in training neural networks, as minimizing this loss value is how the network learns to improve its predictions.
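
The flattening and the loss can be reproduced in pure Python to see what `.view(-1, num_labels)` and `CrossEntropyLoss` do numerically. This is a sketch using nested lists instead of tensors; the helper names and the toy shapes (2 sequences, 4 tokens, 3 labels) are assumptions made for illustration, and the loss uses the mean reduction described above.

```python
import math


def flatten_logits(logits):
    """[batch][seq][num_labels] -> [batch * seq][num_labels], row-major like .view."""
    return [token_scores for sequence in logits for token_scores in sequence]


def flatten_labels(labels):
    """[batch][seq] -> [batch * seq]."""
    return [label for sequence in labels for label in sequence]


def mean_cross_entropy(flat_logits, flat_labels):
    """Mean over tokens of -log_softmax(scores)[label]."""
    losses = []
    for scores, label in zip(flat_logits, flat_labels):
        top = max(scores)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_sum_exp - scores[label])
    return sum(losses) / len(losses)


batch_size, seq_len, num_labels = 2, 4, 3
# All-zero logits: the model is maximally unsure about every token.
logits = [[[0.0] * num_labels for _ in range(seq_len)] for _ in range(batch_size)]
labels = [[0, 1, 2, 0], [2, 2, 1, 0]]

flat_logits = flatten_logits(logits)
flat_labels = flatten_labels(labels)
loss = mean_cross_entropy(flat_logits, flat_labels)
print(len(flat_logits), len(flat_labels), loss)
```

With uniform scores the loss equals `log(num_labels)`, a useful sanity check: a freshly initialised classification head should start near that value.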

The following example walks through how the `forward()` method processes the data and what exactly `outputs` and `sequence_output` represent in the context of a token classification task using a model like `XLMRobertaForTokenClassification`.

### Example Scenario: Sentiment Analysis

Suppose you have the following sentence for analysis: "The food was great but the service was terrible."

#### Step 1: Input Preparation
First, the sentence is tokenized into words or subword units (depending on the tokenizer specifics of RoBERTa):
```plaintext
input_ids: [101, 592, 7954, 1012, 953, 2123, 1012, 502, 1104, 9124, 102]
```
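Here, `101` and `102` are the IDs that BERT's tokenizer uses for its `[CLS]` and `[SEP]` special tokens, so the values above are illustrative only. The XLM-R tokenizer instead wraps the sequence in `<s>` (ID 0) and `</s>` (ID 2), but the role is the same: marking the start and end of the sequence.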

#### Step 2: Model Encoding
When this input is passed to the RoBERTa model within the `forward` method:
```python
outputs = self.roberta(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, **kwargs)
```
- **RoBERTa Model**: The model processes these `input_ids` using its multiple layers of transformers. Each transformer layer uses self-attention mechanisms to encode the sentence, considering the context provided by other words.

#### What are `outputs`?
`outputs` is a model output object that exposes named fields (such as `outputs.hidden_states`) and also supports tuple-style indexing, where:
- `outputs[0]` (often referred to as `sequence_output`) is a tensor representing the last layer hidden states of the encoder. The size of this tensor is typically `[batch_size, sequence_length, hidden_size]`. This tensor contains the embeddings (vector representations) for each token that encapsulate both the meaning of the token itself and its context within the sentence.

For our sentence with 11 tokens and assuming a `hidden_size` of 768 (common in many models):
```python
sequence_output.shape  # This would output: torch.Size([1, 11, 768])
```

#### Step 3: Dropout Application
```python
sequence_output = self.dropout(outputs[0])
```
- **Dropout**: This randomly sets a fraction of the input units to 0 at each update during training time, which helps to prevent overfitting. The `sequence_output` now has some of its elements zeroed out randomly.
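
The difference between training and evaluation mode can be sketched in pure Python. This mimics the documented behaviour of `nn.Dropout` (zero each element with probability `p` and scale the survivors by `1 / (1 - p)` in training mode, do nothing in evaluation mode); the function `toy_dropout` and the value `p=0.1` are assumptions chosen for illustration.

```python
import random


def toy_dropout(values, p, training, rng):
    """Inverted dropout on a flat list of floats."""
    if not training or p == 0.0:
        # Evaluation mode: dropout is the identity.
        return list(values)
    keep_scale = 1.0 / (1.0 - p)
    # Training mode: drop with probability p, rescale what survives so the
    # expected value of each element is unchanged.
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


hidden = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rng = random.Random(0)

train_out = toy_dropout(hidden, p=0.1, training=True, rng=rng)
eval_out = toy_dropout(hidden, p=0.1, training=False, rng=rng)
print(train_out)
print(eval_out)
```

This is why the same `forward()` can be used for inference: once the model is switched to evaluation mode with `model.eval()`, the dropout layer passes `outputs[0]` through unchanged.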

#### Step 4: Classification
```python
logits = self.classifier(sequence_output)
```
- **Classifier**: This linear layer maps the `sequence_output` from `hidden_size` dimensions to the number of classes you have. For a sentiment analysis task with three classes (positive, neutral, negative), this layer would output a `[batch_size, sequence_length, 3]` tensor, providing raw scores for each class, for each token.
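
A linear layer applied to a `[batch_size, sequence_length, hidden_size]` input acts on each token vector independently with the same weights. The sketch below shows this with nested lists; the tiny sizes (`hidden_size=4`, 3 classes, 5 tokens) and the weight values are made up for illustration and have nothing to do with trained parameters.

```python
def linear(token_vector, weight, bias):
    """Compute weight @ token_vector + bias for a single token."""
    return [
        sum(w * x for w, x in zip(row, token_vector)) + b
        for row, b in zip(weight, bias)
    ]


hidden_size, num_labels, seq_len = 4, 3, 5
# One weight row per class, shape [num_labels][hidden_size].
weight = [[0.1 * (i + j) for j in range(hidden_size)] for i in range(num_labels)]
bias = [0.0] * num_labels

# A batch with one sequence of five identical token vectors.
sequence_output = [[[1.0] * hidden_size for _ in range(seq_len)]]

# Apply the same classifier to every token of every sequence.
logits = [[linear(tok, weight, bias) for tok in seq] for seq in sequence_output]
print(len(logits), len(logits[0]), len(logits[0][0]))
print(logits[0][0])
```

Only the last dimension changes, from `hidden_size` to the number of classes, so each token ends up with its own vector of raw class scores.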

#### Step 5: Loss Calculation
If labels are provided (during training), the loss between these logits and the true labels is calculated using cross-entropy to guide the model training:
```python
loss_fct = nn.CrossEntropyLoss()
loss = loss_fct(logits.view(-1, num_classes), labels.view(-1))
```
Here, `logits` are reshaped to combine the batch and sequence length dimensions, and `labels` are similarly flattened, allowing the loss function to compute the loss for each token's prediction against its label.

### Summary
In this example, `outputs[0]` (or `sequence_output`) plays a crucial role as it carries the contextual embeddings for each token in the input sequence. These embeddings are then used to predict the class (e.g., sentiment) for each token, and potentially compute a loss for training. Understanding these outputs is key to understanding how models like RoBERTa process text data and learn from it.

The output from a transformer model like RoBERTa, when you call it in a forward pass, bundles several components. In Transformers v4 the return value is a `ModelOutput` object with named fields that can also be indexed like a tuple, which is why `outputs[0]` works. The order of the elements follows what most users need most often. Let's dive into the structure and specifically why the last layer of hidden states is typically the first element.

### Standard Output Structure

When you run a forward pass through a model like RoBERTa, the output tuple is designed to provide the most essential information first. Here's a common structure:

1. **`outputs[0]` - Last Layer Hidden States**: The first element of the tuple (`outputs[0]`) is the last layer hidden states. This is because, in many applications of transformer models, the last layer's hidden states are the most significant. These states capture the most refined and high-level representations of the input data, having been processed through all the transformer layers. They are directly useful for a wide range of downstream tasks like classification, named entity recognition, and more.

2. **`outputs[1]` - Pooler Output**: For model classes that include a pooling layer (such as `RobertaModel` with its default settings), this is a single vector per sequence, computed by passing the last-layer hidden state of the first token (`<s>`) through a dense layer with a tanh activation. It is useful for classification tasks that need one vector to represent the entire sequence.

3. **Additional Outputs**: Hidden states from all layers (not just the last) and attention weights from each layer are included only when requested, for example with `output_hidden_states=True` or `output_attentions=True`. They are useful for specific tasks or detailed model analysis.

### Why is the Last Layer Hidden States the First Element?

The decision to place the last layer hidden states as the first element (`outputs[0]`) is primarily based on usage patterns. In practice, most applications that utilize models like RoBERTa need access to the most processed, high-level features of the input, which are represented in the last layer hidden states. These features are what most downstream tasks build upon:

- **Fine-tuning for specific tasks**: When fine-tuning a pretrained model on a specific task like token classification or sentiment analysis, the last layer's hidden states are typically the main features used to predict outputs.
- **Convenience**: For a base model like `RobertaModel`, the last layer hidden states sit at index 0 regardless of which optional outputs are enabled, so typical code stays short.

### Example in Context

Consider this simplified use case:

```python
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

# Encode some text
input_ids = tokenizer("Hello, world!", return_tensors="pt").input_ids

# Get model outputs
outputs = model(input_ids)

# Most often directly used:
last_layer_hidden_states = outputs[0]
```

In this example:
- `outputs[0]` directly gives you the last layer hidden states, which you can immediately use for further processing or as input to another model component (like a classifier head).

This convention of making the last layer hidden states the first element in the output tuple streamlines the workflow, allowing developers to quickly access the most critical data without additional steps, reflecting a design choice oriented around common use cases and efficiency.

### Loading a Custom Model

In [210]:
index2tag = {idx: tag for idx, tag in enumerate(tags.names)} # the model needs to know the tags we label entities with, plus the mapping from each tag to an ID and vice versa
tag2index = {tag: idx for idx, tag in enumerate(tags.names)} # so we create both mappings first

In [211]:
from transformers import AutoConfig

# store these mappings and the tags.num_classes attribute in an AutoConfig object; keyword arguments passed to from_pretrained() override the default values
xlmr_config = AutoConfig.from_pretrained(xlmr_model_name,
                                         num_labels=tags.num_classes,
                                         id2label=index2tag, label2id=tag2index)

In [212]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
xlmr_model = (XLMRobertaForTokenClassification.from_pretrained(xlmr_model_name, config=xlmr_config).to(device)) # now loading model weights as usual. Using the just set xlmr_config

In [213]:
# quickly checking whether we have loaded our tokenizer correctly
input_ids = xlmr_tokenizer.encode(text, return_tensors="pt")          # pt means PyTorch Tensor
pd.DataFrame([xlmr_tokens, input_ids[0].numpy()], index=["Tokens", "Input IDs"])  # creates a pandas DF to display tokens and their corresponding input_ids

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
Tokens,<s>,▁Jack,▁Spar,row,▁love,s,▁New,▁York,!,</s>
Input IDs,0,21763,37456,15555,5161,7,2356,5753,38,2


In [214]:
outputs = xlmr_model(input_ids.to(device)).logits         # move input_ids to the model's device, run the forward pass, and take the raw logits
predictions = torch.argmax(outputs, dim=-1)               # index of the highest logit along the last (class) dimension, i.e. the most likely tag per token
print(f"Number of tokens in sequence: {len(xlmr_tokens)}")
print(f"Shape of outputs: {outputs.shape}")
print(outputs)
print(predictions)

Number of tokens in sequence: 10
Shape of outputs: torch.Size([1, 10, 7])
tensor([[[ 0.5666,  0.5653,  0.4711, -0.4355,  0.5185,  0.7065, -0.2554],
         [ 0.6985,  0.3579,  0.8460, -0.2862,  0.5535,  0.6808, -0.0017],
         [ 0.5750,  0.3172,  0.8300, -0.3658,  0.4525,  0.6472, -0.0040],
         [ 0.7035,  0.4018,  0.8587, -0.3460,  0.4313,  0.7283,  0.0482],
         [ 0.6915,  0.4432,  0.6869, -0.3848,  0.4834,  0.6644, -0.0465],
         [ 0.6784,  0.5278,  0.6606, -0.3786,  0.4037,  0.6188, -0.0089],
         [ 0.6710,  0.4219,  0.7068, -0.3398,  0.6072,  0.6033, -0.0267],
         [ 0.7022,  0.4303,  0.7866, -0.2516,  0.5757,  0.6138, -0.0426],
         [ 0.6514,  0.3561,  0.8552, -0.4248,  0.4862,  0.7517, -0.0715],
         [ 0.5855,  0.5267,  0.4481, -0.4053,  0.4768,  0.8010, -0.3727]]],
       device='cuda:0', grad_fn=<ViewBackward0>)
tensor([[5, 2, 2, 2, 0, 0, 2, 2, 2, 5]], device='cuda:0')


A common point of confusion is what `dim=-1` does in `torch.argmax`. Let's clarify what it really means.

### Meaning of `dim=-1` in PyTorch

In PyTorch, specifying `dim=-1` in a function like `torch.argmax` tells the function to operate along the last dimension of the tensor. This doesn't mean that it only considers the last "tag" or the last element in that dimension; rather, it processes the entire dimension to determine which index holds the maximum value along that axis. Here's a more detailed explanation:

- **Tensor Dimensions**: Consider a tensor of shape `[1, 10, 7]` (as in the logits output above):
  - `1` is the batch size.
  - `10` represents the number of tokens in the sequence.
  - `7` represents the number of classes (or "tags") that each token can be classified into.

- **Using `argmax` with `dim=-1`**:
  - When you apply `torch.argmax(outputs, dim=-1)`, the function looks across the last dimension (the 7 classes) for each of the 10 tokens.
  - It evaluates all 7 class scores for each token and returns the index (from 0 to 6) of the class with the highest score.
  - The result is a tensor where each element represents the predicted class index for each token, based on which class had the highest score.

### Practical Example

If the model outputs the following scores for three tokens (simplified to three tokens for clarity):
```
[
  [0.1, 0.2, 0.7],  # Token 1
  [0.4, 0.5, 0.1],  # Token 2
  [0.3, 0.6, 0.1]   # Token 3
]
```
The shape here is `[3, 3]`: 3 tokens, 3 classes. With a leading batch dimension of 1, as the model would return it, the shape is `[1, 3, 3]`.

Applying `argmax(dim=-1)` would result in:
```
[2, 1, 1]
```
- For Token 1, the highest value is 0.7 at index 2.
- For Token 2, the highest value is 0.5 at index 1.
- For Token 3, the highest value is 0.6 at index 1.

This result tells you which class (index) each token is most likely to belong to based on the model's predictions.
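The practical example can be reproduced with a small pure-Python stand-in for `torch.argmax(..., dim=-1)`. The helper `argmax_last_dim` is a hypothetical name written for this sketch. It recurses into the nested list and reduces only the innermost lists, so every leading dimension is preserved.

```python
def argmax_last_dim(tensor):
    """Argmax over the innermost lists of a nested list, keeping the outer structure."""
    if not isinstance(tensor[0], list):
        # Innermost level: return the index of the largest score
        return max(range(len(tensor)), key=tensor.__getitem__)
    return [argmax_last_dim(sub) for sub in tensor]

# Shape [1, 3, 3]: 1 batch, 3 tokens, 3 classes
scores = [[[0.1, 0.2, 0.7],
           [0.4, 0.5, 0.1],
           [0.3, 0.6, 0.1]]]

predictions = argmax_last_dim(scores)  # [[2, 1, 1]], shape [1, 3]
```

The class dimension disappears from the result while the batch and token dimensions remain, mirroring how logits of shape `[1, 10, 7]` become predictions of shape `[1, 10]`.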

### Summary

So, `dim=-1` is used to specify that the operation should consider all elements along the last dimension and find the maximum among them for each "slice" of the data along other dimensions (each token in this case). This is crucial for tasks like classification where you need to decide among multiple categories (classes) for each item (token) in your dataset.

In [215]:
# see what the model with its untrained classification head predicts
preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
pd.DataFrame([xlmr_tokens, preds], index=["Tokens", "Tags"]) # unsurprisingly, the randomly initialized head makes mostly wrong predictions

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
Tokens,<s>,▁Jack,▁Spar,row,▁love,s,▁New,▁York,!,</s>
Tags,B-LOC,I-PER,I-PER,I-PER,O,O,I-PER,I-PER,I-PER,B-LOC
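The lookup from predicted IDs to tag names can be checked by hand. The sketch below assumes the tag order `O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC`; the outputs in this notebook confirm the positions of `O`, `I-PER`, `B-LOC`, and `I-LOC`, while the remaining positions are an assumption. In the notebook itself the list comes from `tags.names`.

```python
# Assumed tag order (stands in for tags.names)
tag_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

index2tag = dict(enumerate(tag_names))
tag2index = {tag: idx for idx, tag in index2tag.items()}

# Predicted class indices printed by the argmax cell above
predicted_ids = [5, 2, 2, 2, 0, 0, 2, 2, 2, 5]
predicted_tags = [index2tag[i] for i in predicted_ids]
```

Under that assumption, `predicted_tags` reproduces the `Tags` row shown above, starting and ending with `B-LOC` for the `<s>` and `</s>` tokens.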


In [216]:
# the head needs fine-tuning on labeled data; first wrap the steps above in a helper for later use
def tag_text(text, tags, model, tokenizer):
    # Get tokens with special characters
    tokens = tokenizer(text).tokens()
    # Encode the sequence into IDs (use the tokenizer passed in, not the global one)
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    # Get per-token logits over the 7 possible classes
    outputs = model(input_ids)[0]         # index 0 is the logits here (no labels are passed), not the last hidden state
    # Take argmax to get most likely class per token
    predictions = torch.argmax(outputs, dim=2)
    # Convert to DataFrame
    preds = [tags.names[p] for p in predictions[0].cpu().numpy()] # predictions are integer IDs, so map them back to readable tag names
    return pd.DataFrame([tokens, preds], index=["Tokens", "Tags"])


### The Line of Code in Question

```python
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
```

Here's what's happening in this line:

1. **Tokenization**:
   - `tokenizer(text, return_tensors="pt")`: This part of the code calls the tokenizer (here `xlmr_tokenizer`, passed in as the `tokenizer` argument) with the input `text`. The tokenizer is responsible for converting raw text into a format that the model can process.
   - `return_tensors="pt"` tells the tokenizer to return the output in the form of PyTorch tensors. This is particularly useful when you're working in a PyTorch environment and need the data to be in tensor format for processing by PyTorch models.

2. **Accessing `input_ids`**:
   - The tokenizer outputs a data structure that includes various fields necessary for model input. One of these fields is `input_ids`, which are the numeric representations of the tokens.
   - `.input_ids` accesses this specific field from the tokenizer's output. The `input_ids` are essentially a sequence of integers where each integer uniquely represents a token from the input text.

3. **Device Placement**:
   - `.to(device)`: This method moves the `input_ids` tensor to the chosen device (GPU or CPU). The inputs must live on the same device as the model; otherwise PyTorch raises an error in the forward pass. Running on a GPU also makes computation faster.
   - `device` is defined earlier in the notebook and is set to either `cuda` (GPU) or `cpu` depending on whether CUDA is available.

### Visual Representation

Here's a breakdown of how each part of that line of code works, visually represented:

- **Text Input**:
  - Input: `"Hello, world!"`
- **Tokenizer Processing**:
  - Processes `"Hello, world!"` to a series of tokens and then to token IDs.
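  - Output: `{'input_ids': tensor([[  101, 7592, 1010, 2088,  999,   102]])}` (illustrative IDs only; the real output also contains an `attention_mask`, and with the XLM-R tokenizer a sequence starts with ID `0` for `<s>` and ends with ID `2` for `</s>`, as in the table above)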
- **Accessing `input_ids`**:
  - Extracts `input_ids` from the tokenizer's output.
  - Intermediate Output: `tensor([[  101, 7592, 1010, 2088,  999,   102]])`
- **Sending to Device**:
  - Moves the `input_ids` tensor to GPU if available.
  - Final Variable Value in `input_ids`: `tensor([[  101, 7592, 1010, 2088,  999,   102]], device='cuda:0')`

### Tokenising Texts for NER

In [225]:
words, labels = de_example["tokens"], de_example["ner_tags"]

In [226]:
tokenized_input = xlmr_tokenizer(de_example["tokens"], is_split_into_words=True) # is_split_into_words=True tells the tokenizer our input sequence has already been split into words
tokens = xlmr_tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])

In [227]:
pd.DataFrame([tokens], index=["Tokens"])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
Tokens,<s>,▁2.000,▁Einwohner,n,▁an,▁der,▁Dan,zi,ger,▁Buch,...,▁Wo,i,wod,schaft,▁Po,mmer,n,▁,.,</s>


In [228]:
word_ids = tokenized_input.word_ids() # word_ids() maps each token to the index of the word it came from; special tokens such as <s> and </s> map to None
pd.DataFrame([tokens, word_ids], index=["Tokens", "Word IDs"])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
Tokens,<s>,▁2.000,▁Einwohner,n,▁an,▁der,▁Dan,zi,ger,▁Buch,...,▁Wo,i,wod,schaft,▁Po,mmer,n,▁,.,</s>
Word IDs,,0,1,1,2,3,4,4,4,5,...,9,9,9,9,10,10,10,11,11,
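One way to read the word IDs is to group the subword tokens by the word they belong to. The sketch below does this by hand for the first ten columns of the table above, so it needs no tokenizer. The variable name `pieces_per_word` is invented for this example.

```python
from collections import defaultdict

# First ten tokens and word IDs copied from the table above
tokens = ["<s>", "▁2.000", "▁Einwohner", "n", "▁an", "▁der", "▁Dan", "zi", "ger", "▁Buch"]
word_ids = [None, 0, 1, 1, 2, 3, 4, 4, 4, 5]

pieces_per_word = defaultdict(list)
for token, word_id in zip(tokens, word_ids):
    if word_id is not None:  # special tokens such as <s> have no source word
        pieces_per_word[word_id].append(token)
```

Word 4 collects the three pieces `▁Dan`, `zi`, and `ger`, while `<s>` is skipped because its word ID is `None`.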


In [229]:
previous_word_idx = None
label_ids = []

for word_idx in word_ids:
    if word_idx is None or word_idx == previous_word_idx:
        label_ids.append(-100) # -100 for special tokens (word_idx is None) and for every subword after the first one of a word
    elif word_idx != previous_word_idx:
        label_ids.append(labels[word_idx])
    previous_word_idx = word_idx

labels = [index2tag[l] if l != -100 else "IGN" for l in label_ids]
index = ["Tokens", "Word IDs", "Label IDs", "Labels"]

"""
We use -100 as the ID to mask subword representations because PyTorch's cross-entropy loss class has an attribute called "ignore_index" whose default value is -100.
Targets with this index are ignored during training, so we can use it to skip the tokens associated with consecutive subwords.
"""

pd.DataFrame([tokens, word_ids, label_ids, labels], index=index)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
Tokens,<s>,▁2.000,▁Einwohner,n,▁an,▁der,▁Dan,zi,ger,▁Buch,...,▁Wo,i,wod,schaft,▁Po,mmer,n,▁,.,</s>
Word IDs,,0,1,1,2,3,4,4,4,5,...,9,9,9,9,10,10,10,11,11,
Label IDs,-100,0,0,-100,0,0,5,-100,-100,6,...,5,-100,-100,-100,6,-100,-100,0,-100,-100
Labels,IGN,O,O,IGN,O,O,B-LOC,IGN,IGN,I-LOC,...,B-LOC,IGN,IGN,IGN,I-LOC,IGN,IGN,O,IGN,IGN
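The effect of the `-100` labels on the loss can be sketched in plain Python. This is an illustration of the idea only, not PyTorch's implementation: positions whose label equals `ignore_index` are skipped, and the mean is taken over the remaining tokens. The logits below are made up.

```python
import math

def cross_entropy_ignore(logits, labels, ignore_index=-100):
    """Mean cross-entropy over tokens whose label is not ignore_index."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked token: contributes nothing to the loss
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))  # stable log-sum-exp
        losses.append(log_z - row[label])
    return sum(losses) / len(losses)  # assumes at least one unmasked label

# Four tokens, three classes; the first and last tokens are masked
logits = [[9.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 2.0], [0.0, 9.0, 0.0]]
labels = [-100, 0, 2, -100]

masked_loss = cross_entropy_ignore(logits, labels)
kept_only_loss = cross_entropy_ignore(logits[1:3], labels[1:3])
```

Both values are identical, which shows that whatever the model predicts for masked positions has no influence on training.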


In [234]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = xlmr_tokenizer(examples["tokens"], truncation=True, # tokenization: converts the pre-split words into input IDs using xlmr_tokenizer
                                      is_split_into_words=True)            # truncation=True: sequences longer than the model's maximum length are truncated to fit
                                                                           # is_split_into_words=True: the input is pretokenized, i.e. already split into words
    labels = []
    for idx, label in enumerate(examples["ner_tags"]): # loops over each example's list of tags in ner_tags; idx runs 0, 1, 2, 3, ...
        word_ids = tokenized_inputs.word_ids(batch_index=idx) # word index for each token of example idx in the batch

        # from here on, the same alignment logic as in the previous cell
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None or word_idx == previous_word_idx:
                label_ids.append(-100)
            else:
                label_ids.append(label[word_idx])
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
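A quick sanity check for the alignment is that, when no truncation occurs, exactly one token per word keeps a real label. The helper below is a hypothetical addition that is not part of the notebook. It assumes word IDs are contiguous, and it is tried here on the first ten columns of the table above.

```python
def check_alignment(word_ids, label_ids, ignore_index=-100):
    """Return True if the number of labeled tokens equals the number of distinct words."""
    n_words = len({w for w in word_ids if w is not None})
    n_labeled = sum(label != ignore_index for label in label_ids)
    return n_words == n_labeled

# First ten word IDs and label IDs copied from the table above
word_ids = [None, 0, 1, 1, 2, 3, 4, 4, 4, 5]
label_ids = [-100, 0, 0, -100, 0, 0, 5, -100, -100, 6]

alignment_ok = check_alignment(word_ids, label_ids)
```

If a continuation subword accidentally received a label, or a first subword was masked, the counts would differ and the check would return `False`.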

In [235]:
# use map() to apply the function to every example in each split of the corpus
def encode_panx_dataset(corpus):
    return corpus.map(tokenize_and_align_labels, batched=True, remove_columns=['langs', 'ner_tags', 'tokens'])

In [233]:
# By applying the function to a DatasetDict object, we get an encoded Dataset object per split. Now that we have a model and a dataset, we need to define a performance metric!
panx_de_encoded = encode_panx_dataset(panx_ch["de"])

  0%|          | 0/7 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

  0%|          | 0/13 [00:00<?, ?ba/s]