<a href="https://colab.research.google.com/github/Jopat2409/com3610_notebooks/blob/main/xlm_r.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
%%capture
!pip install seqeval
!pip install transformers
!pip install datasets

In [3]:
import seqeval.metrics
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline, XLMRobertaForTokenClassification
from datasets import load_dataset, Dataset

## Initialisation

Create the label → ID and ID → label dictionaries for the CoNLL-2003 tag set

In [4]:
TOKEN_TO_ID = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
ID_TO_TOKEN = {v: k for k, v in TOKEN_TO_ID.items()}
PIPELINE_KWARGS = {"ignore_labels": []}

# The model to work on
NER_MODEL = "FacebookAI/xlm-roberta-large-finetuned-conll03-english"
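
As a quick sanity check, the two dictionaries should be exact inverses of each other. The sketch below redefines the same mapping so that it runs on its own, then converts a row of integer tags into label strings. The `example_tags` values are made up for illustration.

```python
TOKEN_TO_ID = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4,
               'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
ID_TO_TOKEN = {v: k for k, v in TOKEN_TO_ID.items()}

# every label must survive a round trip through both dictionaries
assert all(ID_TO_TOKEN[TOKEN_TO_ID[label]] == label for label in TOKEN_TO_ID)

# hypothetical row of integer tags, in the format of the dataset's "ner_tags" column
example_tags = [3, 0, 7, 0]
example_labels = [ID_TO_TOKEN[t] for t in example_tags]
print(example_labels)  # ['B-ORG', 'O', 'B-MISC', 'O']
```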

We can now create the tokenizer, model and pipeline on the GPU (`device=0`)

In [5]:
tokenizer = AutoTokenizer.from_pretrained(NER_MODEL)
model = AutoModelForTokenClassification.from_pretrained(NER_MODEL)
pipe = pipeline(task="ner", model=model, tokenizer=tokenizer, device=0)

Some weights of the model checkpoint at FacebookAI/xlm-roberta-large-finetuned-conll03-english were not used when initializing XLMRobertaForTokenClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
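
The pipeline call above hard-codes `device=0`, which assumes a GPU runtime. A minimal sketch of a fallback, assuming PyTorch is the backend: `pick_device` is a hypothetical helper, and `-1` is the value that `pipeline` interprets as CPU.

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) when one is available, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed: fall back to the CPU value
        return -1
    return 0 if torch.cuda.is_available() else -1

device = pick_device()
print(device)
```

The result could then be passed on as `pipeline(..., device=device)`.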


## Dataset creation and evaluation

We can now load the data (the CoNLL-2003 NER test split in this case), join each token list into a space-separated sentence, and run the pipeline over those sentences to get predictions

In [6]:
data = load_dataset("conll2003", split="test")
data_joined = data.map(lambda x: {"tokens": " ".join(x["tokens"])})

predictions = pipe(data_joined["tokens"], **PIPELINE_KWARGS)

The repository for conll2003 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/conll2003.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Generating train split:   0%|          | 0/14041 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3250 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3453 [00:00<?, ? examples/s]

Map:   0%|          | 0/3453 [00:00<?, ? examples/s]

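The pipeline returns one prediction per sub-word token, whereas the gold labels are per word. To get an entity-level F1 score we take the first predicted label for each word, which requires the character offsets of every word in the joined sentence.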

In [7]:
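def generate_word_positions(sentence: str):
    """Return inclusive (start, end) character offsets of each space-separated word."""
    # index of the first non-space character, i.e. the start of the first word
    first_word = next(i for i in range(len(sentence)) if sentence[i] != ' ')
    # absolute positions of every space from the first word onwards
    spaces = [first_word + i for i, ch in enumerate(sentence[first_word:]) if ch == ' ']
    return list(zip([first_word] + [sp + 1 for sp in spaces], [sp - 1 for sp in spaces] + [len(sentence) - 1]))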
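
To make the offsets concrete, here is a small standalone sketch of the same idea using a regular expression instead of scanning for spaces. `word_spans` is a hypothetical helper, not the function above; for single-spaced input it returns the same inclusive `(start, end)` pairs, and it also tolerates repeated spaces.

```python
import re

def word_spans(sentence: str):
    """Return inclusive (start, end) character offsets for each whitespace-separated word."""
    return [(m.start(), m.end() - 1) for m in re.finditer(r"\S+", sentence)]

spans = word_spans("EU rejects German call")
print(spans)  # [(0, 1), (3, 9), (11, 16), (18, 21)]
```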


In [12]:
class BERTError:

    def __init__(self, sentence, expected, predicted):
        self.sentence = sentence
        self.expected = expected
        self.predicted = predicted

        self.incorrect = [i for i, tag in enumerate(self.expected) if not self.token_matches(tag, self.predicted[i])]
        self.incorrect_mappings = [(self.expected[i], self.predicted[i]) for i in self.incorrect]

    def token_matches(self, t1, t2):
        return (t1 == t2 == "O") or (t1[2:] == t2[2:])

    def __repr__(self) -> str:
        words = self.sentence.split(' ')
        for index in self.incorrect:
            words[index] = f"[E:{self.expected[index]},P:{self.predicted[index]}]({words[index]})"
        return ' '.join(words)

    def to_error_dict(self):
        return {
            "sentence": self.sentence,
            "expected": self.expected,
            "tokens": self.sentence.split(' '),
            "predicted": self.predicted,
            "incorrect": self.incorrect
        }
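
A standalone illustration of the comparison that `BERTError` performs: labels are matched on entity type only, so a `B-LOC` gold tag and an `I-LOC` prediction count as agreement. The sentence and tags below are invented for the example, and `mark_errors` is a hypothetical helper that mirrors `__repr__`.

```python
def type_matches(t1: str, t2: str) -> bool:
    """Compare two tags by entity type, ignoring the B-/I- prefix ('O' has an empty type)."""
    return t1[2:] == t2[2:]

def mark_errors(sentence, expected, predicted):
    """Wrap each mislabelled word as [E:expected,P:predicted](word)."""
    words = sentence.split(' ')
    for i, (exp, pred) in enumerate(zip(expected, predicted)):
        if not type_matches(exp, pred):
            words[i] = f"[E:{exp},P:{pred}]({words[i]})"
    return ' '.join(words)

marked = mark_errors("Japan beat Syria", ["B-LOC", "O", "B-LOC"], ["I-LOC", "O", "I-ORG"])
print(marked)  # Japan beat [E:B-LOC,P:I-ORG](Syria)
```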

In [13]:

preds = []
ner_labels = [list(map(lambda x: ID_TO_TOKEN[x], labels)) for labels in data["ner_tags"]]

prediction_info = []

for i, prediction in enumerate(predictions):
    pred_processed = []

    # get a list of tuples giving the indexes of the start and end character of each word
    word_offsets = generate_word_positions(data_joined["tokens"][i])

    token_index = 0
    for word_offset in word_offsets:
        # for each word, keep only the label predicted for its first token and skip the remaining sub-word tokens
        while prediction[token_index]["start"] < word_offset[0]:
            token_index += 1

        if prediction[token_index]["start"] > word_offset[0]:  # bad indexing
            pred_processed.append("O")
        elif prediction[token_index]["start"] == word_offset[0]:
            pred_processed.append(prediction[token_index]["entity"])

    preds.append(pred_processed)
    if not all((pred_processed[x] == ner_labels[i][x] == 'O') or (pred_processed[x][2:] == ner_labels[i][x][2:]) for x in range(len(pred_processed))):
        prediction_info.append(BERTError(data_joined["tokens"][i], ner_labels[i], pred_processed))

print(seqeval.metrics.classification_report(ner_labels, preds, digits=4))

              precision    recall  f1-score   support

         LOC     0.9440    0.9400    0.9420      1668
        MISC     0.8043    0.8433    0.8234       702
         ORG     0.8952    0.9308    0.9126      1661
         PER     0.9783    0.9740    0.9761      1617

   micro avg     0.9210    0.9350    0.9280      5648
   macro avg     0.9055    0.9220    0.9135      5648
weighted avg     0.9221    0.9350    0.9284      5648
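
The alignment loop can be exercised without loading the model. The sketch below feeds hand-written token predictions through a simplified version of the loop. The predictions are hypothetical values shaped like the pipeline's per-token dictionaries, keeping only the `start` and `entity` keys used above. `first_token_labels` is an illustrative name, and unlike the cell above it also guards against running past the end of the token list.

```python
def first_token_labels(word_offsets, token_preds):
    """Assign each word the label of the token that starts at the word's first character."""
    labels, t = [], 0
    for start, _end in word_offsets:
        # skip tokens that begin before this word (later sub-word pieces of earlier words)
        while t < len(token_preds) and token_preds[t]["start"] < start:
            t += 1
        if t < len(token_preds) and token_preds[t]["start"] == start:
            labels.append(token_preds[t]["entity"])
        else:
            # no token starts exactly at this offset: fall back to "O"
            labels.append("O")
    return labels

# "Leicestershire beat Somerset", with the first word split into two sub-word tokens
offsets = [(0, 13), (15, 18), (20, 27)]
mock_preds = [
    {"start": 0, "entity": "I-ORG"},
    {"start": 7, "entity": "I-ORG"},   # second piece of the first word, ignored
    {"start": 15, "entity": "O"},
    {"start": 20, "entity": "I-ORG"},
]
aligned = first_token_labels(offsets, mock_preds)
print(aligned)  # ['I-ORG', 'O', 'I-ORG']
```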



In [14]:
from collections import defaultdict

# count how often each (expected, predicted) label pair occurs among the errors
tot = defaultdict(int)
for error in prediction_info:
    for err in error.incorrect_mappings:
        tot[err] += 1

print(tot)

defaultdict(<class 'int'>, {('B-PER', 'I-LOC'): 7, ('B-PER', 'I-ORG'): 17, ('B-MISC', 'O'): 17, ('B-LOC', 'I-ORG'): 53, ('B-PER', 'O'): 8, ('O', 'I-MISC'): 126, ('B-ORG', 'I-PER'): 5, ('B-MISC', 'I-LOC'): 14, ('I-MISC', 'O'): 11, ('B-ORG', 'I-LOC'): 38, ('O', 'I-LOC'): 31, ('I-LOC', 'O'): 2, ('B-ORG', 'I-MISC'): 25, ('I-ORG', 'I-MISC'): 10, ('B-ORG', 'O'): 15, ('B-LOC', 'I-MISC'): 27, ('I-ORG', 'O'): 16, ('I-LOC', 'I-ORG'): 12, ('O', 'I-ORG'): 105, ('B-MISC', 'I-ORG'): 41, ('I-MISC', 'I-ORG'): 19, ('B-LOC', 'O'): 3, ('I-MISC', 'I-LOC'): 8, ('O', 'I-PER'): 18, ('I-PER', 'I-ORG'): 8, ('B-LOC', 'I-PER'): 4, ('I-LOC', 'I-PER'): 2, ('B-MISC', 'I-PER'): 7, ('I-ORG', 'I-LOC'): 14, ('I-ORG', 'I-PER'): 2, ('B-PER', 'I-MISC'): 2, ('I-PER', 'I-LOC'): 1, ('I-MISC', 'I-PER'): 4})
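
The raw counts are easier to read once they are sorted and collapsed to entity types. The sketch below copies five of the largest entries from the output above (not the full dictionary) and strips the `B-`/`I-` prefixes. `entity_type` and `by_type` are illustrative names.

```python
from collections import Counter

# five of the largest (expected, predicted) counts, copied from the printed output
subset = {
    ('O', 'I-MISC'): 126, ('O', 'I-ORG'): 105, ('B-LOC', 'I-ORG'): 53,
    ('B-MISC', 'I-ORG'): 41, ('B-ORG', 'I-LOC'): 38,
}

def entity_type(tag: str) -> str:
    """Strip the B-/I- prefix; 'O' is returned unchanged."""
    return tag if tag == 'O' else tag[2:]

by_type = Counter()
for (expected, predicted), count in subset.items():
    by_type[(entity_type(expected), entity_type(predicted))] += count

# print the pairs from most to least frequent
for pair, count in by_type.most_common():
    print(pair, count)
```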


In [17]:
import json

with open("xlmr.json", "w") as f:
  json.dump({"model": NER_MODEL, "errors": [e.to_error_dict() for e in prediction_info]}, f)
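
The saved file can later be reloaded for error analysis. This sketch writes a single made-up record in the same shape as `to_error_dict()` to a temporary file and reads it back. The record contents, the model name and the temporary path are placeholders, not real results.

```python
import json
import os
import tempfile

# one invented record with the same keys as BERTError.to_error_dict()
record = {
    "sentence": "Japan beat Syria",
    "expected": ["B-LOC", "O", "B-LOC"],
    "tokens": ["Japan", "beat", "Syria"],
    "predicted": ["I-LOC", "O", "I-ORG"],
    "incorrect": [2],
}

path = os.path.join(tempfile.mkdtemp(), "errors_example.json")
with open(path, "w") as f:
    json.dump({"model": "example-model", "errors": [record]}, f)

with open(path) as f:
    loaded = json.load(f)

# list the mislabelled words of each stored sentence
for err in loaded["errors"]:
    wrong = [err["tokens"][i] for i in err["incorrect"]]
    print(err["sentence"], "->", wrong)
```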

In [18]:
from google.colab import files
files.download('xlmr.json')
