
In [30]:
!pip install datasets
!pip install transformers
!pip install accelerate



In [38]:
import datasets
import transformers
import torch
import accelerate
from tqdm.auto import tqdm

## Dataset

Preparing the data: load the CoNLL-2003 dataset.

In [3]:
raw_datasets = datasets.load_dataset("conll2003")

The dataset has three splits: train (14041 examples), validation (3250 examples), and test (3453 examples).

Since the task is NER, we look at the tokens and their NER tags.

In [4]:
print(raw_datasets["test"][0]["tokens"])
print(raw_datasets["test"][0]["ner_tags"])

['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.']
[0, 0, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0]


To access the correspondence between the integer tags and the class names, inspect the `ner_tags` feature:

In [5]:
ner_feature = raw_datasets["train"].features["ner_tags"]

To get just the names of the classes:

In [6]:
ner_feature.feature.names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
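
As a quick illustration of how these names relate to the integer tags, the sketch below decodes the test example printed earlier. The token and tag lists are copied from the outputs above; the variable names `tokens`, `tags`, and `entities` are only for illustration. In this tagging scheme a `B-` prefix marks the first word of an entity and an `I-` prefix marks a continuation.

```python
# class names and example copied from the cell outputs above
names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
tokens = ['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.']
tags = [0, 0, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0]

# keep only the tokens that are part of an entity (any tag other than 'O')
entities = [(token, names[tag]) for token, tag in zip(tokens, tags) if names[tag] != 'O']
print(entities)  # [('JAPAN', 'B-LOC'), ('CHINA', 'B-PER')]
```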

Now decode the labels of one example and print them aligned under its tokens:

In [7]:
# choose a different instance by changing i
i = 3
names = ner_feature.feature.names
words = raw_datasets["train"][i]["tokens"]
labels = raw_datasets["train"][i]["ner_tags"]
l_1 = ""
l_2 = ""
# iterate through the words and their corresponding labels
for word, label in zip(words, labels):
    # retrieve the label name
    label_name = names[label]
    # pad both strings to the longer of the two so the columns line up
    max_length = max(len(word), len(label_name))
    # l_1 holds the tokens
    l_1 += word + " " * (max_length - len(word) + 1)
    # l_2 holds the labels
    l_2 += label_name + " " * (max_length - len(label_name) + 1)

print(l_1, l_2, sep= "\n")

The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep . 
O   B-ORG    I-ORG      O    O  O        O  O         O    B-MISC O      O  O         O  O    B-MISC  O    O     O          O         O       O   O   O       O   O  O           O  O     O 


## Processing the data

In [8]:
checkpoint = "dslim/bert-base-NER"

In [9]:
# load the tokenizer that matches the checkpoint
tokenizer = transformers.BertTokenizerFast.from_pretrained(checkpoint)


In [10]:
# to check if the tokenizer is fast or not
tokenizer.is_fast

True

In [11]:
inputs = tokenizer(
    raw_datasets["train"][0]["tokens"], # the dataset feature we want to tokenize
    is_split_into_words= True, # the tokens feature is already split into words
)

An example of the tokenizer output: the tokens, the word ID of each token, and the original NER tags of the sentence.

In [17]:
print(inputs.tokens(), inputs.word_ids(), raw_datasets["train"][0]["ner_tags"], sep= "\n")

['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.', '[SEP]']
[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]
[3, 0, 7, 0, 0, 0, 7, 0, 0]
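
The output shows why the labels cannot be used as they are: there are 12 tokens but only 9 labels, because the tokenizer adds special tokens and splits `lamb` into two pieces. A small sketch, using the values copied from the output above (the name `pieces_per_word` is only illustrative), groups the tokens by word ID:

```python
# tokens and word IDs copied from the output above
tokens = ['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.', '[SEP]']
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

# collect the sub-word tokens that belong to each original word
pieces_per_word = {}
for token, word_id in zip(tokens, word_ids):
    if word_id is None:  # special tokens such as [CLS] and [SEP] belong to no word
        continue
    pieces_per_word.setdefault(word_id, []).append(token)

print(pieces_per_word[7])    # ['la', '##mb']
print(len(pieces_per_word))  # 9
```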


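We need a function that expands the word-level labels to token-level labels:
* special tokens get a label of -100, so that the loss ignores them
* the first token of each word gets the label of that word
* the remaining tokens of a word get the same label, with `B-` replaced by `I-` because they are inside the entity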

In [18]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        # start of a new word (or a special token that follows a word)
        if word_id != current_word:
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        # special token that follows another special token
        elif word_id is None:
            new_labels.append(-100)
        # same word as the previous token
        else:
            label = labels[word_id]
            # B- labels have odd IDs; adding 1 turns B-XXX into I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels
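
To see the function at work, the sketch below applies it to the first training example, using the labels and word IDs printed earlier. The function body is repeated in compact form only so that the snippet runs on its own. The second call is a hypothetical case, not taken from the dataset, that shows the `B-` to `I-` conversion.

```python
def align_labels_with_tokens(labels, word_ids):
    # same logic as the function above, repeated so this snippet runs on its own
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            current_word = word_id
            new_labels.append(-100 if word_id is None else labels[word_id])
        elif word_id is None:
            new_labels.append(-100)
        else:
            label = labels[word_id]
            new_labels.append(label + 1 if label % 2 == 1 else label)
    return new_labels

# first training example: labels and word IDs copied from the output above
labels = [3, 0, 7, 0, 0, 0, 7, 0, 0]
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]
aligned = align_labels_with_tokens(labels, word_ids)
print(aligned)     # [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]

# hypothetical case: a B-PER word (label 1) split into two tokens, followed by an I-PER word (label 2)
split_name = align_labels_with_tokens([1, 2], [None, 0, 0, 1, None])
print(split_name)  # [-100, 1, 2, 2, -100]
```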

In [19]:
def tokenize_and_align_labels(example):
    # `example` is a batch of examples because map is called with batched=True
    tokenized_input = tokenizer(example["tokens"],
                                truncation= True,
                                is_split_into_words= True)
    all_labels = example["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        # word IDs of the i-th example in the batch
        word_ids = tokenized_input.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))
    # assign the aligned labels once, after the loop
    tokenized_input["labels"] = new_labels
    return tokenized_input

In [20]:
# apply the function to every split with the map method
tokenized_datasets = raw_datasets.map(tokenize_and_align_labels, 
                                      batched= True,
                                      remove_columns= raw_datasets["train"].column_names)


In [21]:
tokenized_datasets["train"][0]

{'input_ids': [101,
  7270,
  22961,
  1528,
  1840,
  1106,
  21423,
  1418,
  2495,
  12913,
  119,
  102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]}

## Preparing data for custom loop

In [23]:
tokenized_datasets.set_format("torch")

In [24]:
# data collator: pads inputs and labels dynamically to the longest sequence in each batch
data_collator = transformers.DataCollatorForTokenClassification(tokenizer= tokenizer)
# preparing train data
train_dataloader = torch.utils.data.DataLoader(
    tokenized_datasets["train"],
    shuffle= True,
    collate_fn= data_collator,
    batch_size= 16,
    )
# preparing validation data
eval_dataloader = torch.utils.data.DataLoader(
    tokenized_datasets["validation"],
    collate_fn= data_collator,
    batch_size= 16,
    )
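
What dynamic padding means can be sketched in plain Python. This is only an illustration of the idea, not the collator's implementation: the helper `pad_batch` and the two short examples are hypothetical, and 0 is assumed to be the ID of the padding token. The labels are padded with -100, the value that the loss ignores.

```python
def pad_batch(sequences, pad_value):
    # pad every sequence on the right to the length of the longest one in this batch
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]

# two hypothetical examples of different lengths
batch_input_ids = [[101, 7270, 102], [101, 1840, 1106, 21423, 102]]
batch_labels = [[-100, 3, -100], [-100, 0, 0, 0, -100]]

padded_input_ids = pad_batch(batch_input_ids, pad_value=0)  # 0 is assumed to be the [PAD] id
padded_labels = pad_batch(batch_labels, pad_value=-100)     # -100 is ignored by the loss
print(padded_input_ids)  # [[101, 7270, 102, 0, 0], [101, 1840, 1106, 21423, 102]]
print(padded_labels)     # [[-100, 3, -100, -100, -100], [-100, 0, 0, 0, -100]]
```

Because the padding length is chosen per batch, short batches stay short instead of being padded to the longest sequence in the whole dataset.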

## Model for fine-tuning

In [27]:
id2label = {i: label for i, label in enumerate(names)}
label2id = {k: v for v, k in id2label.items()}

In [28]:
model = transformers.BertForTokenClassification.from_pretrained(checkpoint,
                                                                id2label= id2label,
                                                                label2id= label2id)


## Preparing model for training

In [29]:
optimizer = torch.optim.AdamW(model.parameters(), lr= 2e-5)

Create an `Accelerator` and pass the model, optimizer, and dataloaders through its `prepare` method:

In [32]:
accl = accelerate.Accelerator()

In [33]:
model, optimizer, train_dataloader, eval_dataloader = accl.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

Define the learning rate scheduler:

In [34]:
num_train_epochs = 4
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

In [35]:
num_training_steps

3512
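
This number can be checked by hand. The sketch below assumes a single process, so that the dataloader iterates over all 14041 training examples in batches of 16; the variable names `num_examples` and `steps_per_epoch` are only illustrative.

```python
import math

num_examples = 14041   # size of the train split
batch_size = 16
num_train_epochs = 4

# the last batch of an epoch is smaller, so round up
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)  # 878 3512
```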

In [36]:
lr_sch = transformers.get_scheduler(
    name= "linear",
    optimizer= optimizer,
    num_warmup_steps= 200,
    # total number of steps, warmup included; the decay reaches zero at this step
    num_training_steps= num_training_steps,
)
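
The shape of a linear schedule with warmup can be sketched in a few lines. This is a minimal illustration, assuming 200 warmup steps and 3512 total steps; the function name `linear_warmup_multiplier` is hypothetical and the code is not the library's implementation. The learning rate rises linearly from 0 to its base value during warmup, then falls linearly to 0 at the last step.

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    # ramp up linearly from 0 to 1 during warmup
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    # then decay linearly, reaching 0 at num_training_steps
    return max(0.0, (num_training_steps - step) / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
for step in (0, 100, 200, 1856, 3512):
    # learning rate actually used at this step
    print(step, base_lr * linear_warmup_multiplier(step, 200, 3512))
```

The multiplier is 0.5 halfway through warmup (step 100), 1.0 at the end of warmup (step 200), 0.5 again halfway through the decay (step 1856), and 0 at step 3512.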

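A function to postprocess the model predictions and the labels for evaluation: it drops the positions labeled -100 and converts the remaining IDs to label names.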

In [37]:
def postprocess(predictions, labels):
    predictions = predictions.detach().cpu().clone().numpy()
    labels = labels.detach().cpu().clone().numpy()
    # filter out the positions labeled -100 (special tokens and padding)
    true_labels = [[names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return true_labels, true_predictions
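
The filtering step can be tried on plain Python lists, without tensors. The batch below is hypothetical, a single sequence with made-up predictions, and only illustrates which positions survive.

```python
names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# hypothetical batch of one sequence: label IDs and predicted IDs for each token
labels = [[-100, 3, 0, 7, -100, -100]]
predictions = [[0, 3, 0, 0, 5, 5]]

# keep only the positions whose label is not -100, in both lists
true_labels = [[names[l] for l in label if l != -100] for label in labels]
true_predictions = [
    [names[p] for p, l in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
print(true_labels)       # [['B-ORG', 'O', 'B-MISC']]
print(true_predictions)  # [['B-ORG', 'O', 'O']]
```

Whatever the model predicts at a position labeled -100 is discarded, so the two lists always have the same length.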

## Training Loop

During evaluation, two processes may have padded the inputs and labels to different shapes, so we use `accl.pad_across_processes` to bring the predictions and labels to the same shape before gathering them.
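
The idea can be illustrated without any distributed setup. In the sketch below, two hypothetical processes hold label batches of different widths, and a helper named `pad_to_common_width` (an illustrative stand-in, not the `accelerate` implementation) pads both with -100 so that they can be concatenated.

```python
def pad_to_common_width(batches, pad_index):
    # find the longest sequence length across all processes
    width = max(len(row) for batch in batches for row in batch)
    # pad every row of every batch on the right up to that length
    return [[row + [pad_index] * (width - len(row)) for row in batch] for batch in batches]

# labels held by two hypothetical processes, each padded only to its own longest sequence
labels_process_0 = [[-100, 3, 0, -100], [-100, 0, -100, -100]]
labels_process_1 = [[-100, 1, 2, 0, 0, -100]]

padded = pad_to_common_width([labels_process_0, labels_process_1], pad_index=-100)
# after padding, the rows from all processes can be stacked into one batch
gathered = [row for batch in padded for row in batch]
print(gathered)
```

Padding with -100 keeps the added positions out of the evaluation, since positions labeled -100 are filtered out during postprocessing.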