<a href="https://colab.research.google.com/github/Calcifer777/learn-nlp/blob/main/learn-transformers/ner_ft.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers datasets

In [70]:
import pandas as pd

import transformers
from datasets import load_dataset

from transformers import (
    RobertaForTokenClassification,
    RobertaConfig,
    AutoConfig,
    AutoModelForTokenClassification,
    AutoTokenizer,
)

import torch

In [73]:
ds = load_dataset("xtreme", name="PAN-X.en")

Downloading and preparing dataset xtreme/PAN-X.en to /root/.cache/huggingface/datasets/xtreme/PAN-X.en/1.0.0/29f5d57a48779f37ccb75cb8708d1095448aad0713b425bdc1ff9a4a128a56e4...


Generating train split:   0%|          | 0/20000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Dataset xtreme downloaded and prepared to /root/.cache/huggingface/datasets/xtreme/PAN-X.en/1.0.0/29f5d57a48779f37ccb75cb8708d1095448aad0713b425bdc1ff9a4a128a56e4. Subsequent calls will reuse this data.



In [74]:
ds.set_format("pd")

In [75]:
pd.DataFrame(ds["train"][:5])

,tokens,ner_tags,langs
0,"[R.H., Saunders, (, St., Lawrence, River, ), (...","[3, 4, 0, 3, 4, 4, 0, 0, 0, 0, 0]","[en, en, en, en, en, en, en, en, en, en, en]"
1,"[;, ', '', Anders, Lindström, '', ']","[0, 0, 0, 1, 2, 0, 0]","[en, en, en, en, en, en, en]"
2,"[Karl, Ove, Knausgård, (, born, 1968, )]","[1, 2, 2, 0, 0, 0, 0]","[en, en, en, en, en, en, en]"
3,"[Atlantic, City, ,, New, Jersey]","[5, 6, 6, 6, 6]","[en, en, en, en, en]"
4,"[Her, daughter, from, the, second, marriage, w...","[0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, ...","[en, en, en, en, en, en, en, en, en, en, en, e..."


In [76]:
ds.reset_format()

In [77]:
class_labels = ds["train"].features["ner_tags"].feature

In [78]:
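# Add a string version of every tag sequence; each element of
# batch["ner_tags"] is the list of integer tags for one sentence.
ds = ds.map(
    function=lambda batch: {"ner_tags_str": [class_labels.int2str(tags) for tags in batch["ner_tags"]]},
    batched=True,
    batch_size=64,
)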


In [79]:
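# Relative frequency of each tag, computed separately per split.
tags_freqs = {}

for split, ds_split in ds.items():
    df_split = pd.DataFrame(ds_split[:])
    # One row per token tag, then count the occurrences of each tag.
    tmp = df_split.explode("ner_tags_str").groupby("ner_tags_str").size()
    tags_freqs[split] = tmp / tmp.sum()

pd.DataFrame.from_dict(tags_freqs, orient="columns")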

ner_tags_str,train,validation,test
B-LOC,0.058263,0.060023,0.057976
B-ORG,0.058743,0.058073,0.059072
B-PER,0.057134,0.057552,0.056719
I-LOC,0.082154,0.078934,0.08026
I-ORG,0.144806,0.144507,0.144499
I-PER,0.091637,0.093374,0.093121
O,0.507263,0.507537,0.508353
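
The table above is the result of flattening every sentence's tag list and normalizing the counts. A minimal pure-Python sketch of the same computation follows, using three of the training rows printed earlier. The string tags are written out by hand and assume the label order `O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC`; `tag_frequencies` is a hypothetical helper, not something the notebook defines.

```python
from collections import Counter
from itertools import chain


def tag_frequencies(tag_sequences):
    """Return the relative frequency of each tag across all sequences."""
    # Flatten the per-sentence tag lists, as DataFrame.explode does above.
    counts = Counter(chain.from_iterable(tag_sequences))
    total = sum(counts.values())
    return {tag: n / total for tag, n in sorted(counts.items())}


# Rows 1-3 of the training sample, with tags converted to strings
# (assumed label order, see the note above).
sample_tags = [
    ["O", "O", "O", "B-PER", "I-PER", "O", "O"],
    ["B-PER", "I-PER", "I-PER", "O", "O", "O", "O"],
    ["B-LOC", "I-LOC", "I-LOC", "I-LOC", "I-LOC"],
]

freqs = tag_frequencies(sample_tags)
for tag, freq in freqs.items():
    print(f"{tag}: {freq:.3f}")
```

With only three sentences the proportions are not representative; the point is only that the frequencies sum to 1 within a split, which is what makes the three columns comparable.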


In [51]:
model_name = "roberta-base"

In [57]:
idx2tag = {idx: v for idx, v in enumerate(class_labels.names)}
tag2idx = {v: idx for idx, v in enumerate(class_labels.names)}
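
To make the two mappings concrete, here is a small sketch that decodes the integer tags of one training row shown earlier. The hard-coded `label_names` list is an assumption standing in for `class_labels.names`; its order is inferred from the sample rows and the tag names in the frequency table.

```python
# Assumed label order; in the notebook this comes from class_labels.names.
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

idx2tag = dict(enumerate(label_names))
tag2idx = {tag: idx for idx, tag in idx2tag.items()}

# Row 2 of the training sample printed earlier.
tokens = ["Karl", "Ove", "Knausgård", "(", "born", "1968", ")"]
ner_tags = [1, 2, 2, 0, 0, 0, 0]

decoded = [idx2tag[i] for i in ner_tags]
for token, tag in zip(tokens, decoded):
    print(f"{token}\t{tag}")

# The two mappings are inverses of each other.
assert [tag2idx[tag] for tag in decoded] == ner_tags
```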

In [58]:
config = AutoConfig.from_pretrained(
    model_name,
    num_labels=class_labels.num_classes,
    id2label=idx2tag,
    label2id=tag2idx,
)


In [69]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = (
    RobertaForTokenClassification
    .from_pretrained(model_name, config=config)
    .to(device)
)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForTokenClassification: ['lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

In [72]:
# RoBERTa uses a byte-level BPE tokenizer; for background see
# https://stackoverflow.com/questions/61134275/difficulty-in-understanding-the-tokenizer-used-in-roberta-model
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [83]:
tokenizer.convert_ids_to_tokens(
    tokenizer("hi, my name is marco")["input_ids"]
)

['<s>', 'hi', ',', 'Ġmy', 'Ġname', 'Ġis', 'Ġmar', 'co', '</s>']
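
In this output, `Ġ` marks a token that was preceded by a space, and `<s>` / `</s>` are the sequence start and end tokens, so `marco` has been split into the two pieces `Ġmar` and `co`. Since the dataset has one tag per word, the subword pieces eventually have to be mapped back to words. The sketch below does that grouping by hand for the tokens above; `group_subwords` is a hypothetical helper written for illustration, and it assumes words are separated by single spaces.

```python
SPECIAL_TOKENS = {"<s>", "</s>"}
SPACE_MARKER = "Ġ"  # byte-level BPE symbol for a leading space


def group_subwords(tokens):
    """Group subword tokens into whitespace-delimited words."""
    words = []
    for token in tokens:
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith(SPACE_MARKER):
            # A leading-space marker starts a new word; drop the marker.
            words.append([token[len(SPACE_MARKER):]])
        elif not words:
            # The first real token has no marker but still starts a word.
            words.append([token])
        else:
            # Anything else continues the current word.
            words[-1].append(token)
    return words


tokens = ['<s>', 'hi', ',', 'Ġmy', 'Ġname', 'Ġis', 'Ġmar', 'co', '</s>']
words = group_subwords(tokens)
text = " ".join("".join(pieces) for pieces in words)

print(words)
print(text)
```

Note that this groups by whitespace, so `hi` and `,` end up in the same word, whereas the dataset's `tokens` column treats punctuation as separate items. In practice it is probably simpler to pass the pre-split words to a fast tokenizer with `is_split_into_words=True` and read the alignment from `word_ids()`; for RoBERTa this likely also requires loading the tokenizer with `add_prefix_space=True`.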