<a href="https://colab.research.google.com/github/Mahdi-Golizadeh/Natural-Language-Processing/blob/main/transformers/token_classification/POS_Tagger.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# POS_Tagger
* This notebook contains necessary codes for implementing a pos_tagger based on transformers
* To do so 🤗 transformers library has been used
* To reach our goal I have fine-tuned BERT-base


## Install & Import necessary libraries

In [2]:
!pip install datasets
!pip install transformers
!pip install poseval
!pip install evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
[31mERROR: Could not find a version that satisfies the requirement poseval (from versions: none)[0m
[31mERROR: No matching distribution found for poseval[0m
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
[K     |████████████████████████████████| 43 kB 1.8 MB/s 
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... [?25l[?25hdone
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16179 sha256=651ef19dbc51b89fac681f477f770151241c3b710c76794138229bc1cb281fb0
  Stored in directory: /root/.cache/pip/wheels/ad/5c/ba/05fa3

In [3]:
# to download and use necessary dataset
import datasets
# transformers library contains required models and tokenizers for token classification task
import transformers
# to load metric and evaluate performance our model
import evaluate
import numpy as np

## Dataset

To fine-tune the model I've used [conll2003](https://huggingface.co/datasets/conll2003) library which contains necessary tag labels for training a pos tagger

In [4]:
raw_datasets = datasets.load_dataset("conll2003")

Downloading builder script:   0%|          | 0.00/9.57k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/3.73k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/12.2k [00:00<?, ?B/s]

Downloading and preparing dataset conll2003/conll2003 to /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/9a4d16a94f8674ba3466315300359b0acd891b68b6c8743ddf60b9c702adce98...


Downloading data:   0%|          | 0.00/983k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14041 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3250 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3453 [00:00<?, ? examples/s]

Dataset conll2003 downloaded and prepared to /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/9a4d16a94f8674ba3466315300359b0acd891b68b6c8743ddf60b9c702adce98. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [5]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

As you can see above, our datasets contains train, validation and test splits and feature names and number of examples per each split can be seen too

Now to see a sample of data and it's related labels for part of speech tagging 

In [6]:
raw_datasets["train"][0]["tokens"], raw_datasets["train"][0]["pos_tags"]

(['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'],
 [22, 42, 16, 21, 35, 37, 16, 21, 7])

We will store pos tag names in pos_features variable for next uses

In [7]:
pos_features = raw_datasets["train"].features["pos_tags"]

In [8]:
pos_features

Sequence(feature=ClassLabel(names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None), length=-1, id=None)

In [9]:
label_names = pos_features.feature.names

Now see a sample

In [10]:
names = pos_features.feature.names
words = raw_datasets["train"][5]["tokens"]
labels = raw_datasets["train"][5]["pos_tags"]
l_1 = ""
l_2 = ""
for word, label in zip(words, labels):
    label_name = names[label]
    max_length = max(len(word), len(label_name))
    l_1 += word + " " * (max_length - len(word) + 1)
    if label_name in ['"', "''", '#',  '$',  '(',  ')', ',', '.', ':', '``',]:
        l_2 +=" " * (max_length + 1)
    else:
        l_2 += label_name + " " * (max_length - len(label_name) + 1)

print(l_1, l_2, sep= "\n")

" We  do  n't support any such recommendation because we  do  n't see any grounds for it  , " the Commission 's  chief spokesman Nikolaus van der Pas told a  news briefing . 
  PRP VBP RB  VB      DT  JJ   NN             IN      PRP VBP RB  VB  DT  NNS     IN  PRP     DT  NNP        POS JJ    NN        NNP      NNP FW  NNP VBD  DT NN   NN         


## Defining Tokenizer

In [11]:
# choosing a model that is compatible for token classification
checkpoint = "bert-base-cased"

In [12]:
tokenizer = transformers.AutoTokenizer.from_pretrained(checkpoint)

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/436k [00:00<?, ?B/s]

To check if tokenizer is [fast](https://huggingface.co/docs/tokenizers/v0.13.2/en/index#tokenizers) or not 

In [14]:
tokenizer.is_fast

True

Checking tokenizer and its generated outputs

In [15]:
inputs = tokenizer(raw_datasets["train"][2]["tokens"],
                   is_split_into_words= True)

In [16]:
inputs

{'input_ids': [101, 26660, 13329, 12649, 15928, 1820, 118, 4775, 118, 1659, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [17]:
inputs.tokens()

['[CLS]', 'BR', '##US', '##SE', '##LS', '1996', '-', '08', '-', '22', '[SEP]']

In [18]:
inputs.word_ids()

[None, 0, 0, 0, 0, 1, 1, 1, 1, 1, None]

we are required to convert special tokens into -100 in order to be ignored by loss function

In [19]:
def special_token_care(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            new_labels.append(-100)
        else:
            label = labels[word_id]
    return new_labels

In [20]:
def tokenize_and_care(example):
    tokenized_input = tokenizer(example["tokens"],
                                truncation= True,
                                is_split_into_words= True)
    all_labels = example["pos_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_input.word_ids(i)
        new_labels.append(special_token_care(labels, word_ids))
        tokenized_input["labels"] = new_labels
    return tokenized_input

Now applying the function to dataset using dataset map methd

In [22]:
tokenized_datasets = raw_datasets.map(tokenize_and_care, 
                                      batched= True,
                                      remove_columns= raw_datasets["train"].column_names)

  0%|          | 0/15 [00:00<?, ?ba/s]



In [23]:
tokenized_datasets["train"][12]

{'input_ids': [101,
  2809,
  1699,
  1105,
  2855,
  5534,
  17355,
  9022,
  2879,
  112,
  188,
  5835,
  119,
  102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [-100, 30, 22, 10, 22, 38, 22, 27, 21, 7, -100]}

I chose `DataCollatorForTokenClassification()` which is a data collator that will dynamically pad the inputs received, as well as the labels.

In [24]:
data_collator = transformers.DataCollatorForTokenClassification(tokenizer= tokenizer)

To test it

In [25]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [26]:
batch["input_ids"]

tensor([[  101,  7270, 22961,  1528,  1840,  1106, 21423,  1418,  2495, 12913,
           119,   102],
        [  101,  1943, 14428,   102,     0,     0,     0,     0,     0,     0,
             0,     0]])

## Metric

to evaluate model poseval is selected which is optimized for pos tagging taks

In [27]:
metric = evaluate.load("poseval")

Downloading builder script:   0%|          | 0.00/4.46k [00:00<?, ?B/s]

In [28]:
def compute_metric(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis= -1)
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [[label_names[p] for (p,l) in zip(prediction, label) if l != -100] for prediction, label in zip(predictions, labels)]
    metrics = metric.compute(predictions= true_predictions,
                             references= true_labels)
    return {"precision": metrics["weighted avg"]["precision"],
            "recall": metrics["weighted avg"]["recall"],
            "f1":metrics["weighted avg"]["f1-score"],}

A sample of how metric works

In [29]:
labels = raw_datasets["train"][0]["pos_tags"]
labels = [label_names[i] for i in labels]
predictions = labels.copy()
predictions[2] = "VB"
result = metric.compute(predictions= [predictions],
               references= [labels])

In [30]:
result["weighted avg"]

{'precision': 0.9444444444444444,
 'recall': 0.8888888888888888,
 'f1-score': 0.8888888888888888,
 'support': 9}

## Defining the Model

In [33]:
# dictionaries of labels are required for this model
id2labels = {i: label for i, label in enumerate(label_names)}
label2ids = {v: k for k, v in id2labels.items()}

In [34]:
model = transformers.AutoModelForTokenClassification.from_pretrained(checkpoint,
                                                                     id2label=id2labels,
                                                                     label2id= label2ids)

Downloading:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

to check if number of classes are correct

In [35]:
model.config.num_labels

47

In [36]:
len(id2labels)

47

## Defining model training arguments

In [38]:
args = transformers.TrainingArguments("bert-finetuned-pos",
                                      evaluation_strategy= "epoch",
                                      save_strategy= "epoch",
                                      save_total_limit= 2,
                                      fp16= True,
                                      learning_rate= 2e-5,
                                      num_train_epochs= 4,
                                      weight_decay= .01)

In [39]:
trainer = transformers.Trainer(model= model,
                               args= args,
                               train_dataset= tokenized_datasets["train"],
                               eval_dataset= tokenized_datasets["validation"],
                               data_collator= data_collator,
                               compute_metrics= compute_metric,
                               tokenizer= tokenizer)

Using cuda_amp half precision backend


## Training and evaluation

In [40]:
trainer.train()

***** Running training *****
  Num examples = 14041
  Num Epochs = 4
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 7024
  Number of trainable parameters = 107755823


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,1.0929,0.806059,0.702448,0.70044,0.695269
2,0.6643,0.592377,0.785479,0.783945,0.780495
3,0.5284,0.466001,0.838253,0.839862,0.837452
4,0.4331,0.41702,0.860247,0.861649,0.859749


***** Running Evaluation *****
  Num examples = 3250
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to bert-finetuned-pos/checkpoint-1756
Configuration saved in bert-finetuned-pos/checkpoint-1756/config.json
Model weights saved in bert-finetuned-pos/checkpoint-1756/pytorch_model.bin
tokenizer config file saved in bert-finetuned-pos/checkpoint-1756/tokenizer_config.json
Special tokens file saved in bert-finetuned-pos/checkpoint-1756/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 3250
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to bert-finetuned-pos/checkpoint-3512
Configuration saved in bert-finetuned-pos/checkpoint-3512/config.json
Model weights 

TrainOutput(global_step=7024, training_loss=0.8000684536671041, metrics={'train_runtime': 743.6565, 'train_samples_per_second': 75.524, 'train_steps_per_second': 9.445, 'total_flos': 1232332665517476.0, 'train_loss': 0.8000684536671041, 'epoch': 4.0})

after training we can use 🤗 transformers `pipeline` method to easily use our model

In [44]:
pos_tagger = transformers.pipeline("token-classification", model= "/content/bert-finetuned-pos/checkpoint-7024")

loading configuration file /content/bert-finetuned-pos/checkpoint-7024/config.json
Model config BertConfig {
  "_name_or_path": "/content/bert-finetuned-pos/checkpoint-7024",
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "\"",
    "1": "''",
    "2": "#",
    "3": "$",
    "4": "(",
    "5": ")",
    "6": ",",
    "7": ".",
    "8": ":",
    "9": "``",
    "10": "CC",
    "11": "CD",
    "12": "DT",
    "13": "EX",
    "14": "FW",
    "15": "IN",
    "16": "JJ",
    "17": "JJR",
    "18": "JJS",
    "19": "LS",
    "20": "MD",
    "21": "NN",
    "22": "NNP",
    "23": "NNPS",
    "24": "NNS",
    "25": "NN|SYM",
    "26": "PDT",
    "27": "POS",
    "28": "PRP",
    "29": "PRP$",
    "30": "RB",
    "31": "RBR",
    "32": "RBS",
    "33": "RP",
    "34": "SYM",
    "35":

In [45]:
pos_tagger("I'm going to sleep.")

[{'entity': 'PRP',
  'score': 0.99791414,
  'index': 1,
  'word': 'I',
  'start': 0,
  'end': 1},
 {'entity': 'VBP',
  'score': 0.98194695,
  'index': 2,
  'word': "'",
  'start': 1,
  'end': 2},
 {'entity': 'VBG',
  'score': 0.9878749,
  'index': 3,
  'word': 'm',
  'start': 2,
  'end': 3},
 {'entity': 'TO',
  'score': 0.9908902,
  'index': 4,
  'word': 'going',
  'start': 4,
  'end': 9},
 {'entity': 'VB',
  'score': 0.98916924,
  'index': 5,
  'word': 'to',
  'start': 10,
  'end': 12},
 {'entity': '.',
  'score': 0.99833655,
  'index': 6,
  'word': 'sleep',
  'start': 13,
  'end': 18},
 {'entity': '.',
  'score': 0.99425054,
  'index': 7,
  'word': '.',
  'start': 18,
  'end': 19}]