Reference doc: https://huggingface.co/docs/transformers/tasks/token_classification

Installing all the required libraries

In [None]:
pip install transformers datasets evaluate seqeval

Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)

In [None]:
import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Importing required libraries

In [None]:
import numpy as np
import pandas as pd

Loading the dataset.
Link: https://www.kaggle.com/datasets/naseralqaydeh/named-entity-recognition-ner-corpus?select=ner.csv


In [None]:
ner_data = pd.read_csv("ner.csv", encoding = 'unicode_escape')

ner_data.head()

Unnamed: 0,Sentence #,Sentence,POS,Tag
0,Sentence: 1,Thousands of demonstrators have marched throug...,"['NNS', 'IN', 'NNS', 'VBP', 'VBN', 'IN', 'NNP'...","['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', '..."
1,Sentence: 2,Families of soldiers killed in the conflict jo...,"['NNS', 'IN', 'NNS', 'VBN', 'IN', 'DT', 'NN', ...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
2,Sentence: 3,They marched from the Houses of Parliament to ...,"['PRP', 'VBD', 'IN', 'DT', 'NNS', 'IN', 'NN', ...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
3,Sentence: 4,"Police put the number of marchers at 10,000 wh...","['NNS', 'VBD', 'DT', 'NN', 'IN', 'NNS', 'IN', ...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
4,Sentence: 5,The protest comes on the eve of the annual con...,"['DT', 'NN', 'VBZ', 'IN', 'DT', 'NN', 'IN', 'D...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."


In [None]:
ner_data.shape

(47959, 4)

Preprocessing the Tags

ast is Python's built-in Abstract Syntax Tree module. It lets you parse and inspect Python source code as a tree of syntax nodes.

Each value in the 'Tag' column is a string that looks like a Python list (for example "['O', 'B-geo']"). For each row, preprocess_data passes that string to ast.literal_eval, which safely parses a string containing a Python literal (strings, numbers, tuples, lists, dicts, sets, booleans, None) and returns the corresponding object without executing arbitrary code.
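As a small illustration of the parsing step described above, the sketch below applies ast.literal_eval to a hand-written string shaped like a 'Tag' cell (the sample string is made up, not taken from ner.csv) and shows that a non-literal expression is rejected.

```python
import ast

# Hypothetical cell value, shaped like the strings stored in the 'Tag' column.
raw_tag_cell = "['O', 'B-geo', 'I-geo', 'O']"

tags = ast.literal_eval(raw_tag_cell)  # parsed into a real Python list
print(type(tags).__name__, tags)

# literal_eval only accepts literals; anything that would run code is rejected.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print("non-literal rejected:", rejected)
```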

In [None]:
import ast

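def preprocess_data(df):
    # Each 'Tag' cell is a string such as "['O', 'B-geo']"; parse it into a
    # real list of str tags. Assigning the whole column avoids the chained
    # indexing (df['Tag'][i] = ...) that triggers SettingWithCopyWarning.
    df['Tag'] = df['Tag'].apply(lambda s: [str(tag) for tag in ast.literal_eval(s)])

    return df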

Applying preprocessing to the dataframe

In [None]:
ner_data = preprocess_data(ner_data)

ner_data.sample(10)

Unnamed: 0,Sentence #,Sentence,POS,Tag
19944,Sentence: 19945,More than 400 prisoners are being held at Guan...,"['JJR', 'IN', 'CD', 'NNS', 'VBP', 'VBG', 'VBN'...","[O, O, O, O, O, O, O, O, B-geo, O, O, O, O, O,..."
47337,Sentence: 47338,"Meanwhile , Germany 's Foreign Minister Frank-...","['RB', ',', 'NNP', 'POS', 'NNP', 'NNP', 'NNP',...","[O, O, B-geo, O, O, B-per, I-per, I-per, O, O,..."
46284,Sentence: 46285,Mr. Struck opened the conference by suggesting...,"['NNP', 'NNP', 'VBD', 'DT', 'NN', 'IN', 'VBG',...","[B-per, I-per, O, O, O, O, O, O, O, O, O, B-or..."
23373,Sentence: 23374,Three others were wounded .,"['CD', 'NNS', 'VBD', 'VBN', '.']","[O, O, O, O, O]"
14728,Sentence: 14729,The Sudanese government has been accused of ar...,"['DT', 'JJ', 'NN', 'VBZ', 'VBN', 'VBN', 'IN', ...","[O, B-gpe, O, O, O, O, O, O, B-gpe, O, O, O, O..."
6857,Sentence: 6858,Authorities said they arrested seven suspects ...,"['NNS', 'VBD', 'PRP', 'VBN', 'CD', 'NNS', ',',...","[O, O, O, O, O, O, O, O, O, O, O, O, O]"
22067,Sentence: 22068,Two U.S. senators are calling for an investiga...,"['CD', 'NNP', 'NNS', 'VBP', 'VBG', 'IN', 'DT',...","[O, B-geo, O, O, O, O, O, O, O, O, O, O, O, O,..."
25652,Sentence: 25653,A published report in the United States predic...,"['DT', 'VBN', 'NN', 'IN', 'DT', 'NNP', 'NNPS',...","[O, O, O, O, O, B-geo, I-geo, O, O, O, O, O, O..."
23579,Sentence: 23580,In a videotaped message discussing this year '...,"['IN', 'DT', 'VBN', 'NN', 'VBG', 'DT', 'NN', '...","[O, O, O, O, O, O, O, O, O, O, O, O, O, B-org,..."
29044,Sentence: 29045,Security along the countries ' mountainous bor...,"['NN', 'IN', 'DT', 'NNS', 'POS', 'JJ', 'NN', '...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."


Next, we tokenize each sentence with NLTK's word_tokenize.

In [None]:
from nltk.tokenize import word_tokenize

# Function to tokenize sentences
def tokenize_sentence(sentence):
    return word_tokenize(sentence)

# Apply tokenization to the DataFrame
ner_data['tokens'] = ner_data['Sentence'].apply(tokenize_sentence)

ner_data

Unnamed: 0,Sentence #,Sentence,POS,Tag,tokens
0,Sentence: 1,Thousands of demonstrators have marched throug...,"['NNS', 'IN', 'NNS', 'VBP', 'VBN', 'IN', 'NNP'...","[O, O, O, O, O, O, B-geo, O, O, O, O, O, B-geo...","[Thousands, of, demonstrators, have, marched, ..."
1,Sentence: 2,Families of soldiers killed in the conflict jo...,"['NNS', 'IN', 'NNS', 'VBN', 'IN', 'DT', 'NN', ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[Families, of, soldiers, killed, in, the, conf..."
2,Sentence: 3,They marched from the Houses of Parliament to ...,"['PRP', 'VBD', 'IN', 'DT', 'NNS', 'IN', 'NN', ...","[O, O, O, O, O, O, O, O, O, O, O, B-geo, I-geo...","[They, marched, from, the, Houses, of, Parliam..."
3,Sentence: 4,"Police put the number of marchers at 10,000 wh...","['NNS', 'VBD', 'DT', 'NN', 'IN', 'NNS', 'IN', ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O]","[Police, put, the, number, of, marchers, at, 1..."
4,Sentence: 5,The protest comes on the eve of the annual con...,"['DT', 'NN', 'VBZ', 'IN', 'DT', 'NN', 'IN', 'D...","[O, O, O, O, O, O, O, O, O, O, O, B-geo, O, O,...","[The, protest, comes, on, the, eve, of, the, a..."
...,...,...,...,...,...
47954,Sentence: 47955,Indian border security forces are accusing the...,"['JJ', 'NN', 'NN', 'NNS', 'VBP', 'VBG', 'PRP$'...","[B-gpe, O, O, O, O, O, O, B-gpe, O, O, O, O, O...","[Indian, border, security, forces, are, accusi..."
47955,Sentence: 47956,Indian officials said no one was injured in Sa...,"['JJ', 'NNS', 'VBD', 'DT', 'NN', 'VBD', 'VBN',...","[B-gpe, O, O, O, O, O, O, O, B-tim, O, O, O, O...","[Indian, officials, said, no, one, was, injure..."
47956,Sentence: 47957,Two more landed in fields belonging to a nearb...,"['CD', 'JJR', 'VBD', 'IN', 'NNS', 'VBG', 'TO',...","[O, O, O, O, O, O, O, O, O, O, O]","[Two, more, landed, in, fields, belonging, to,..."
47957,Sentence: 47958,They say not all of the rockets exploded upon ...,"['PRP', 'VBP', 'RB', 'DT', 'IN', 'DT', 'NNS', ...","[O, O, O, O, O, O, O, O, O, O, O]","[They, say, not, all, of, the, rockets, explod..."


In [None]:
lengths_column_tag = ner_data['Tag'].apply(lambda x: len(x))
lengths_column_tokens = ner_data['tokens'].apply(lambda x: len(x))

# Filter rows with matching lengths
matching_lengths = lengths_column_tag== lengths_column_tokens

ner_data = ner_data[matching_lengths]

ner_data.shape

(47738, 5)
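The filter above drops rows where the number of tokens and the number of tags disagree, presumably because word_tokenize can split a sentence differently from the way the corpus was originally tokenized. A minimal sketch of the same check on plain Python lists, using made-up rows, not real corpus data:

```python
# Hypothetical rows: (tokens, tags). The second row has one token too many,
# as can happen when a tokenizer splits a word the corpus kept whole.
rows = [
    (["Three", "others", "were", "wounded", "."], ["O", "O", "O", "O", "O"]),
    (["He", "can", "not", "go", "."], ["O", "O", "O", "O"]),
]

# Keep only rows whose token and tag counts agree.
kept = [(tokens, tags) for tokens, tags in rows if len(tokens) == len(tags)]
print(len(rows), "->", len(kept))
```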

- B- indicates the beginning of an entity.
- I- indicates a token is contained inside the same entity (for example, the State token is a part of an entity like Empire State Building).
- O indicates the token doesn’t correspond to any entity.

`art` - artifact (for example, a creative work)
`gpe` - geo-political entity
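To make the B-/I-/O scheme concrete, the sketch below builds the tag sequence for a sentence from a list of entity spans. to_bio is an illustrative helper, and the 'geo' annotation for Empire State Building is assumed for the example only.

```python
def to_bio(num_tokens, spans):
    """Build BIO tags for a sentence.

    spans: list of (start, end, entity_type), end exclusive, non-overlapping.
    """
    tags = ["O"] * num_tokens
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"      # remaining tokens of the entity
    return tags

tokens = ["The", "Empire", "State", "Building", "is", "tall", "."]
# Assumed annotation for illustration only: tokens 1..3 form one 'geo' entity.
tags = to_bio(len(tokens), [(1, 4, "geo")])
for token, tag in zip(tokens, tags):
    print(f"{token:10s} {tag}")
```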

In [None]:
# Flatten the lists in the 'Tag' column
all_words = [word for sublist in ner_data['Tag'].tolist() for word in sublist]

# Collect the unique tags in sorted order
unique_tags = sorted(set(all_words))

unique_tags

['B-art',
 'B-eve',
 'B-geo',
 'B-gpe',
 'B-nat',
 'B-org',
 'B-per',
 'B-tim',
 'I-art',
 'I-eve',
 'I-geo',
 'I-gpe',
 'I-nat',
 'I-org',
 'I-per',
 'I-tim',
 'O']

Defining mapping for each tag to number

In [None]:
tag_mapper = {'O': 0, 'B-art': 1,'B-eve': 2,'B-geo': 3,'B-gpe': 4, 'B-nat': 5,'B-org': 6,'B-per': 7,'B-tim': 8,
              'I-art': 9, 'I-eve': 10, 'I-geo': 11, 'I-gpe': 12, 'I-nat': 13, 'I-org': 14, 'I-per': 15, 'I-tim': 16}

tag_mapper

{'O': 0,
 'B-art': 1,
 'B-eve': 2,
 'B-geo': 3,
 'B-gpe': 4,
 'B-nat': 5,
 'B-org': 6,
 'B-per': 7,
 'B-tim': 8,
 'I-art': 9,
 'I-eve': 10,
 'I-geo': 11,
 'I-gpe': 12,
 'I-nat': 13,
 'I-org': 14,
 'I-per': 15,
 'I-tim': 16}
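The mapping above is written by hand. Here is a sketch of deriving an equivalent mapping in code, assuming the convention used here that 'O' takes id 0 and the remaining tags follow in sorted order. The variable names are illustrative. Note that plain sorted() puts 'O' last, so a tag's position in a sorted list is not the same as its id in tag_mapper.

```python
entity_types = ["art", "eve", "geo", "gpe", "nat", "org", "per", "tim"]
sorted_tags = sorted(
    ["O"] + [f"{prefix}-{t}" for prefix in ("B", "I") for t in entity_types]
)

# Put 'O' first, then every other tag in sorted order.
ordered_tags = ["O"] + [tag for tag in sorted_tags if tag != "O"]
derived_mapper = {tag: idx for idx, tag in enumerate(ordered_tags)}

print(derived_mapper["O"], derived_mapper["B-art"], derived_mapper["I-tim"])
print(sorted_tags.index("O"))  # position of 'O' in plain sorted order
```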

Replacing tags with numbers in the 'Tag' column

In [None]:
ner_data = ner_data.copy()

ner_data['Tag'] = ner_data['Tag'].apply(lambda x: [tag_mapper[word] for word in x])

ner_data.sample(10)

Unnamed: 0,Sentence #,Sentence,POS,Tag,tokens
33809,Sentence: 33810,In early Asian trading Friday crude oil was at...,"['IN', 'JJ', 'JJ', 'NN', 'NNP', 'JJ', 'NN', 'V...","[0, 0, 8, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[In, early, Asian, trading, Friday, crude, oil..."
10428,Sentence: 10429,"They spoke Monday , ahead of this week 's meet...","['PRP', 'VBD', 'NNP', ',', 'RB', 'IN', 'DT', '...","[0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[They, spoke, Monday, ,, ahead, of, this, week..."
5524,Sentence: 5525,He said 10 others have died in combat or suici...,"['PRP', 'VBD', 'CD', 'NNS', 'VBP', 'VBN', 'IN'...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[He, said, 10, others, have, died, in, combat,..."
33608,Sentence: 33609,Mr. Chavez has accused foreign oil companies o...,"['NNP', 'NNP', 'VBZ', 'VBN', 'JJ', 'NN', 'NNS'...","[7, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...","[Mr., Chavez, has, accused, foreign, oil, comp..."
39611,Sentence: 39612,One attacker was killed in a shootout followin...,"['CD', 'NN', 'VBD', 'VBN', 'IN', 'DT', 'NN', '...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[One, attacker, was, killed, in, a, shootout, ..."
1479,Sentence: 1480,Nepal 's new government has released more Maoi...,"['NNP', 'POS', 'JJ', 'NN', 'VBZ', 'VBN', 'RBR'...","[4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[Nepal, 's, new, government, has, released, mo..."
30396,Sentence: 30397,Today I overheard a little boy say he was goin...,"['NN', 'PRP', 'VBD', 'DT', 'JJ', 'NN', 'VBP', ...","[8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[Today, I, overheard, a, little, boy, say, he,..."
9206,Sentence: 9207,A NATO statement says the soldier was killed S...,"['DT', 'NNP', 'NN', 'VBZ', 'DT', 'NN', 'VBD', ...","[0, 6, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, ...","[A, NATO, statement, says, the, soldier, was, ..."
45565,Sentence: 45566,He urged the Palestinian leadership to end the...,"['PRP', 'VBD', 'DT', 'JJ', 'NN', 'TO', 'VB', '...","[0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 3, 0, 3, 0, 0, ...","[He, urged, the, Palestinian, leadership, to, ..."
37945,Sentence: 37946,He noted they were in Iran either to visit fam...,"['PRP', 'VBD', 'PRP', 'VBD', 'IN', 'NNP', 'CC'...","[0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[He, noted, they, were, in, Iran, either, to, ..."


In [None]:
ner_data.drop(columns = ['Sentence', 'Sentence #','POS'], inplace = True)

ner_data.sample(5)

Unnamed: 0,Tag,tokens
16768,"[6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[U.S., Government, experts, are, drawing, a, m..."
4016,"[0, 3, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0]","[The, United, States, ', ``, Dream, Team, ``, ..."
11616,"[0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[In, a, statement, Friday, the, ministry, says..."
36036,"[0, 0, 6, 14, 7, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0,...","[Prosecutors, accused, Omar, Said, Omar, Frida..."
11244,"[0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0]","[He, said, their, rights, are, guaranteed, by,..."


Converting pandas df to huggingface dataset

In [None]:
from datasets import Dataset

ner_ds = Dataset.from_pandas(ner_data)

ner_ds

Dataset({
    features: ['Tag', 'tokens', '__index_level_0__'],
    num_rows: 47738
})

Splitting the dataset into a train and test set with the train_test_split method

In [None]:
ner_ds = ner_ds.train_test_split(test_size = 0.2)

ner_ds

DatasetDict({
    train: Dataset({
        features: ['Tag', 'tokens', '__index_level_0__'],
        num_rows: 38190
    })
    test: Dataset({
        features: ['Tag', 'tokens', '__index_level_0__'],
        num_rows: 9548
    })
})

Checking an instance from training set

In [None]:
example = ner_ds["train"][2]

example

{'Tag': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  3,
  0,
  3,
  0,
  3,
  0,
  0,
  3,
  11,
  0,
  0,
  0,
  0,
  8,
  0,
  0,
  0,
  0,
  0],
 'tokens': ['The',
  'response',
  'followed',
  'allegations',
  'by',
  'several',
  'other',
  'countries',
  ',',
  'including',
  'India',
  ',',
  'Japan',
  ',',
  'Britain',
  'and',
  'the',
  'United',
  'States',
  ',',
  'who',
  'all',
  'said',
  'Wednesday',
  "'s",
  'vote',
  'was',
  'flawed',
  '.'],
 '__index_level_0__': 2641}

The next step is to load a DistilBERT tokenizer to preprocess the tokens field

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

Tokenize an instance

is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace) which it will tokenize. This is useful for NER or token classification.

In [None]:
example = ner_ds["train"][3]

tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)

tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])

tokens


['[CLS]',
 'tibetan',
 'exiles',
 'in',
 'india',
 'fast',
 '##ed',
 'and',
 'prayed',
 'for',
 'peace',
 'on',
 'saturday',
 ',',
 'and',
 'their',
 'spiritual',
 'leader',
 ',',
 'the',
 'dalai',
 'lama',
 ',',
 'joined',
 'in',
 'from',
 'a',
 'hospital',
 'bed',
 'in',
 'mumbai',
 '.',
 '[SEP]']

In [None]:
tokenized_input = tokenizer(example['tokens'], is_split_into_words = True)

tokenized_input

{'input_ids': [101, 11953, 27127, 1999, 2634, 3435, 2098, 1998, 14283, 2005, 3521, 2006, 5095, 1010, 1998, 2037, 6259, 3003, 1010, 1996, 28511, 18832, 1010, 2587, 1999, 2013, 1037, 2902, 2793, 1999, 8955, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Tokenization adds the special tokens [CLS] and [SEP], and subword tokenization creates a mismatch between the input and the labels: a single word corresponding to a single label may now be split into two subwords (here 'fasted' becomes 'fast' and '##ed'). You’ll need to realign the tokens and labels by:

- Mapping all tokens to their corresponding word with the word_ids method.
- Assigning the label -100 to the special tokens [CLS] and [SEP] so they’re ignored by the PyTorch loss function (see CrossEntropyLoss).
- Only labeling the first token of a given word. Assign -100 to other subtokens from the same word.

Here is how you can create a function to realign the tokens and labels, and truncate sequences to be no longer than DistilBERT’s maximum input length:

In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation = True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["Tag"]):
        # Map tokens to their respective word.
        word_ids = tokenized_inputs.word_ids(batch_index = i)

        previous_word_idx = None
        label_ids = []

        for word_idx in word_ids:
            # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            # Only label the first token of a given word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # Later subwords of the same word are masked with -100.
            else:
                label_ids.append(-100)

            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
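To see the alignment rule in isolation, here is a tokenizer-free sketch of the inner loop. The word_ids list is written by hand to mimic what a fast tokenizer would report for three words where the third is split into two subwords. align_labels is an illustrative helper, not part of the notebook.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label the first subword of each word; mask everything else."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:                  # special token such as [CLS] or [SEP]
            aligned.append(ignore_index)
        elif word_idx != previous_word_idx:   # first subword of a new word
            aligned.append(word_labels[word_idx])
        else:                                 # later subword of the same word
            aligned.append(ignore_index)
        previous_word_idx = word_idx
    return aligned

# [CLS] tibetan exiles fast ##ed [SEP]
word_ids = [None, 0, 1, 2, 2, None]
word_labels = [4, 0, 0]  # B-gpe, O, O
print(align_labels(word_ids, word_labels))
```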

To apply the preprocessing function over the entire dataset, use 🤗 Datasets map function. We can speed up the map function by setting batched=True to process multiple elements of the dataset at once.

In [None]:
tokenized_ner_ds = ner_ds.map(tokenize_and_align_labels, batched = True)

tokenized_ner_ds

DatasetDict({
    train: Dataset({
        features: ['Tag', 'tokens', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 38190
    })
    test: Dataset({
        features: ['Tag', 'tokens', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 9548
    })
})

Viewing a training example with its aligned labels

In [None]:
print(tokenized_ner_ds['train'][3]['Tag'])
print(tokenized_ner_ds['train'][3]['tokens'])
print(tokenized_ner_ds['train'][3]['labels'])

[4, 0, 0, 3, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 3, 11, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0]
['Tibetan', 'exiles', 'in', 'India', 'fasted', 'and', 'prayed', 'for', 'peace', 'on', 'Saturday', ',', 'and', 'their', 'spiritual', 'leader', ',', 'the', 'Dalai', 'Lama', ',', 'joined', 'in', 'from', 'a', 'hospital', 'bed', 'in', 'Mumbai', '.']
[-100, 4, 0, 0, 3, 0, -100, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 3, 11, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, -100]


Now create a batch of examples using DataCollatorForTokenClassification. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer = tokenizer, return_tensors = "tf")
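A rough picture of what dynamic padding does, written without any library: each batch is padded only to its own longest sequence, with inputs padded by the [PAD] id (0 for this tokenizer) and labels padded by -100 so the loss ignores them. pad_batch is a hypothetical helper and the ids are made up; the real collator also converts the result to tensors.

```python
def pad_batch(batch, pad_token_id=0, label_pad_id=-100):
    """Pad every example in `batch` to the longest example in that batch."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in batch:
        gap = max_len - len(ex["input_ids"])
        padded["input_ids"].append(ex["input_ids"] + [pad_token_id] * gap)
        padded["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * gap)
        padded["labels"].append(ex["labels"] + [label_pad_id] * gap)
    return padded

# Toy batch with made-up ids: lengths 5 and 3, so the batch width is 5.
batch = [
    {"input_ids": [101, 7, 8, 9, 102], "labels": [-100, 3, 11, 0, -100]},
    {"input_ids": [101, 7, 102], "labels": [-100, 0, -100]},
]
padded = pad_batch(batch)
print(padded["input_ids"][1], padded["labels"][1])
```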

Including a metric during training is often helpful for evaluating your model’s performance. For this task, we load the seqeval framework (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric). Seqeval actually produces several scores: precision, recall, F1, and accuracy.

In [None]:
import evaluate

seqeval = evaluate.load("seqeval")

Get the NER labels first, then create a function that passes the true predictions and true labels to compute to calculate the scores.

In [None]:
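# Tag names indexed by label id. unique_tags is sorted alphabetically ('O' last),
# so it does not line up with the ids in tag_mapper ('O' is 0).
label_list = sorted(tag_mapper, key = tag_mapper.get)

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Drop positions labelled -100 and convert the remaining ids to tag names.
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]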

    results = seqeval.compute(predictions = true_predictions, references = true_labels)

    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
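The two list comprehensions in compute_metrics do one job: drop every position whose label is -100 and turn the remaining ids into tag strings. A small sketch on already-argmaxed predictions, with a made-up three-tag label list (label_list and the numbers here are illustrative only):

```python
label_list = ["O", "B-geo", "I-geo"]  # hypothetical: list index == label id

# One sentence; -100 marks special tokens and non-first subwords.
labels      = [[-100, 1, 2, 0, -100, 0, -100]]
predictions = [[   0, 1, 0, 0,    2, 0,    0]]

true_labels = [
    [label_list[l] for l in label if l != -100] for label in labels
]
true_predictions = [
    [label_list[p] for p, l in zip(pred, label) if l != -100]
    for pred, label in zip(predictions, labels)
]
print(true_labels)
print(true_predictions)
```

Both lists end up the same length per sentence, which is what an entity-level scorer needs in order to compare them position by position.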

Creating a map of the expected ids to their labels with id2label and label2id

In [None]:
id2label = {
 0: 'O',
 1: 'B-art',
 2: 'B-eve',
 3: 'B-geo',
 4: 'B-gpe',
 5: 'B-nat',
 6: 'B-org',
 7: 'B-per',
 8: 'B-tim',
 9: 'I-art',
 10: 'I-eve',
 11: 'I-geo',
 12: 'I-gpe',
 13: 'I-nat',
 14: 'I-org',
 15: 'I-per',
 16: 'I-tim'
}



In [None]:
label2id = {v: k for k, v in id2label.items()}

label2id

{'O': 0,
 'B-art': 1,
 'B-eve': 2,
 'B-geo': 3,
 'B-gpe': 4,
 'B-nat': 5,
 'B-org': 6,
 'B-per': 7,
 'B-tim': 8,
 'I-art': 9,
 'I-eve': 10,
 'I-geo': 11,
 'I-gpe': 12,
 'I-nat': 13,
 'I-org': 14,
 'I-per': 15,
 'I-tim': 16}

For finetuning a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters

In [None]:
from transformers import create_optimizer

batch_size = 16
num_train_epochs = 3  # matches epochs = 3 passed to model.fit below

num_train_steps = (len(tokenized_ner_ds["train"]) // batch_size) * num_train_epochs

optimizer, lr_schedule = create_optimizer(
    init_lr = 2e-5,
    num_train_steps = num_train_steps,
    weight_decay_rate = 0.01,
    num_warmup_steps = 0,
)
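For a sense of scale, the training-step arithmetic and the shape of the schedule can be worked through by hand. The sketch below assumes the 38190-row training split shown earlier and a linear decay from init_lr to zero with no warmup, which is my understanding of what create_optimizer sets up by default. linear_decay_lr is an illustrative function, not a library call.

```python
def linear_decay_lr(step, init_lr, total_steps):
    """Learning rate after `step` updates under linear decay to zero."""
    remaining = max(total_steps - step, 0)
    return init_lr * remaining / total_steps

train_rows, batch_size = 38190, 16
steps_per_epoch = train_rows // batch_size
print(steps_per_epoch)

# Total steps, and the learning rate halfway through, for two epoch counts.
for epochs in (3, 5):
    total = steps_per_epoch * epochs
    print(epochs, total, linear_decay_lr(total // 2, 2e-5, total))
```

If training runs for fewer epochs than the schedule was built for, the learning rate stops partway down the ramp instead of reaching zero.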

We load DistilBERT with TFAutoModelForTokenClassification along with the number of expected labels, and the label mappings

In [None]:
from transformers import TFAutoModelForTokenClassification

model = TFAutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels = 17, id2label = id2label, label2id = label2id
)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForTokenClassification: ['vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing TFDistilBertForTokenClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForTokenClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForTokenClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able t

Converting train and validation datasets to the tf.data.Dataset format with prepare_tf_dataset()

In [None]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_ner_ds["train"],
    shuffle = True,
    batch_size = 16,
    collate_fn = data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_ner_ds["test"],
    shuffle = False,
    batch_size = 16,
    collate_fn = data_collator,
)

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


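Compiling the model. Transformers models all have a default task-relevant loss function, so you don’t need to specify one unless you want to.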

In [None]:
import tensorflow as tf

model.compile(optimizer = optimizer)

model.summary()

Model: "tf_distil_bert_for_token_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  13073     
                                                                 
Total params: 66375953 (253.20 MB)
Trainable params: 66375953 (253.20 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


The last thing to set up before training is a way to compute the seqeval scores from the predictions. This is done with a Keras callback.

Passing compute_metrics function to KerasMetricCallback:

In [None]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn = compute_metrics, eval_dataset = tf_validation_set)

Finally, we’re ready to start training the model! Call fit with the training and validation datasets, the number of epochs, and the callbacks to fine-tune the model.

In [None]:
model.fit(x = tf_train_set, validation_data = tf_validation_set, epochs = 3, callbacks = [metric_callback])

Epoch 1/3

  _warn_prf(average, modifier, msg_start, len(result))


Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x799d8e162950>

In [None]:
from transformers import pipeline

classifier = pipeline("ner", model = model, tokenizer = tokenizer)

In [None]:
# Alternative inputs; uncomment one at a time to try it.
# text = "Apple Inc. is located in Cupertino, California."
# text = "Rome is a beautiful city in Italy."
# text = "Elon Musk is the CEO of SpaceX and Tesla."

text = "The Golden State Warriors are an American professional basketball team based in San Francisco."


classifier(text)

[{'entity': 'B-org',
  'score': 0.9567796,
  'index': 2,
  'word': 'golden',
  'start': 4,
  'end': 10},
 {'entity': 'I-org',
  'score': 0.94898456,
  'index': 3,
  'word': 'state',
  'start': 11,
  'end': 16},
 {'entity': 'I-org',
  'score': 0.9478982,
  'index': 4,
  'word': 'warriors',
  'start': 17,
  'end': 25},
 {'entity': 'B-gpe',
  'score': 0.9788721,
  'index': 7,
  'word': 'american',
  'start': 33,
  'end': 41},
 {'entity': 'B-geo',
  'score': 0.9883619,
  'index': 13,
  'word': 'san',
  'start': 80,
  'end': 83},
 {'entity': 'I-geo',
  'score': 0.9842614,
  'index': 14,
  'word': 'francisco',
  'start': 84,
  'end': 93}]
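The pipeline returns one entry per token. A sketch of merging consecutive B-/I- entries into whole entities, using the start/end character offsets from the output above (scores and other fields omitted for brevity; merge_entities is an illustrative helper):

```python
def merge_entities(text, token_preds):
    """Merge B-/I- token predictions into (entity_type, surface_text) pairs."""
    spans = []  # each item: [entity_type, start, end]
    for pred in token_preds:
        prefix, entity_type = pred["entity"].split("-", 1)
        if prefix == "I" and spans and spans[-1][0] == entity_type:
            spans[-1][2] = pred["end"]        # extend the currently open entity
        else:
            spans.append([entity_type, pred["start"], pred["end"]])
    return [(etype, text[start:end]) for etype, start, end in spans]

text = ("The Golden State Warriors are an American professional "
        "basketball team based in San Francisco.")
token_preds = [
    {"entity": "B-org", "start": 4, "end": 10},
    {"entity": "I-org", "start": 11, "end": 16},
    {"entity": "I-org", "start": 17, "end": 25},
    {"entity": "B-gpe", "start": 33, "end": 41},
    {"entity": "B-geo", "start": 80, "end": 83},
    {"entity": "I-geo", "start": 84, "end": 93},
]
entities = merge_entities(text, token_preds)
print(entities)
```

Slicing the original text by offsets restores the original casing, which the uncased tokenizer's 'word' field does not preserve. The pipeline also accepts an aggregation_strategy argument that performs a similar grouping.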