<a href="https://colab.research.google.com/github/nyp-sit/it3103/blob/main/week13/4b_token_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Practical 4b - Token Classification

In this practical we will learn how to use the HuggingFace Transformers library to perform token classification.

As in Practical 3a, we will use the DistilBERT transformer architecture, which can also classify each individual word in a sentence.

#### **NOTE: Be sure to set your runtime to a GPU instance!**

## Section 1 - Install Transformers

Run the following cell to install the HuggingFace Transformers library.

In [1]:
!pip install transformers



## Section 2 - Import, Define Classes and Helper Functions

Run the following cell to import all necessary libraries, define the necessary variables, classes and functions required for our processing.


In [2]:
import pandas as pd
import numpy as np
import torch
from transformers import AutoTokenizer


# Initialize the DistilBERT tokenizer.
#
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# Define a function that loads up a space-, comma- or tab-separated file
# and extracts the input word and label for each word.
# 
# It is assumed that the file is in the CONLL format:
#
#      sentence1-word1, ..., label1-1
#      sentence1-word2, ..., label1-2
#      sentence1-word3, ..., label1-3
#      <empty line>
#      sentence2-word1, ..., label2-1
#      sentence2-word2, ..., label2-2
#      ...
#      sentence2-wordn, ..., label2-n
#      <empty line>
#      ...
#
# This function returns a 2D list of words and a 2D list of labels
# corresponding to each word.
#
def load_conll(filepath, delimiter=' ', word_column_index=0, label_column_index=3):
    all_texts = []
    all_tags = []

    texts = []
    tags = []

    # Opens the file.
    #
    with open(filepath, "r") as f:

        # Loops through each line 
        for line in f:

            # Split each line by its delimiter (default is a space)
            tokens = line.split(delimiter)

            # If the line is empty, treat it as the end of the
            # previous sentence, and construct a new sentence
            #
            if len(tokens) == 1:
                # Append the sentence
                # 
                all_texts.append(texts)
                all_tags.append(tags)

                # Create a new sentence
                #
                texts = []
                tags = []
            else:
                # Not yet end of the sentence, continue to add
                # words into the current sentence
                #
                thistext = tokens[word_column_index].replace('\n', '')
                thistag = tokens[label_column_index].replace('\n', '')

                texts.append(thistext)
                tags.append(thistag)

    # Insert the last sentence if it contains at least 1 word.
    #
    if len(texts) > 0:
        all_texts.append(texts)
        all_tags.append(tags)

    # Return the result to the caller
    #
    return all_texts, all_tags


# This function is taken from HuggingFace's documentation at:
# https://huggingface.co/transformers/custom_datasets.html
#
# This function simply converts the string classification tags for each
# word into their index using the token_labels_id_by_label dictionary.
#
# Also, it uses the offset_mapping to determine which words are [CLS],
# [SEP] and sub-words so that we can leave the tag as a -100 value 
# (ignored).
# 
def encode_tags(tags, encodings):
    labels = [[token_labels_id_by_label[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
        # create an empty array of -100
        doc_enc_labels = np.ones(len(doc_offset),dtype=int) * -100
        arr_offset = np.array(doc_offset)

        # set labels whose first offset position is 0 and the second is not 0
        doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())

    return encoded_labels
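
To see what the offset rule in `encode_tags` selects, here is a minimal sketch in plain Python. The offsets below are hand-written for illustration (they are not real tokenizer output), and the helper name `align_labels` is hypothetical. It assumes, as `encode_tags` does, that the first sub-word of every word has an offset starting at 0 with a non-zero end, while `[CLS]`, `[SEP]` and padding have the offset `(0, 0)`.

```python
IGNORE_INDEX = -100  # positions with this label are ignored when computing the loss


def align_labels(word_label_ids, offsets):
    """Give each word's label to its first sub-word and -100 to everything else."""
    labels = [IGNORE_INDEX] * len(offsets)
    remaining = iter(word_label_ids)
    for i, (start, end) in enumerate(offsets):
        # First sub-word of a word: starts at 0 and is not an empty span.
        if start == 0 and end != 0:
            labels[i] = next(remaining)
    return labels


# Illustrative offsets for two words, each split into two sub-words:
# [CLS], word1-part1, word1-part2, word2-part1, word2-part2, [SEP], [PAD]
offsets = [(0, 0), (0, 3), (3, 5), (0, 3), (3, 5), (0, 0), (0, 0)]
word_label_ids = [1, 2]  # B-PER, I-PER

aligned = align_labels(word_label_ids, offsets)
print(aligned)
# [-100, 1, -100, 2, -100, -100, -100]
```

Only the first sub-word of each word carries a real label, so the model is trained and evaluated once per word rather than once per sub-word.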




## Section 3 - Defining Our Classification Labels

Run the following cell to declare the token classification labels that we will be using.
 

In [3]:

# Define a list of unique token labels that we will recognize
#
token_labels = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Create a reverse-mapping dictionary of the label -> index.
#
token_labels_id_by_label = {tag: idx for idx, tag in enumerate(token_labels)}



In [4]:
token_labels_id_by_label

{'B-LOC': 5,
 'B-MISC': 7,
 'B-ORG': 3,
 'B-PER': 1,
 'I-LOC': 6,
 'I-MISC': 8,
 'I-ORG': 4,
 'I-PER': 2,
 'O': 0}
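
The model predicts indexes rather than label names, so the opposite lookup is also handy when reading predictions. A minimal sketch follows; the name `token_labels_label_by_id` and the sample predictions are hypothetical and only serve as an illustration.

```python
token_labels = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# label -> index, as defined in this section
token_labels_id_by_label = {tag: idx for idx, tag in enumerate(token_labels)}

# index -> label (hypothetical helper dictionary)
token_labels_label_by_id = {idx: tag for tag, idx in token_labels_id_by_label.items()}

predicted_ids = [1, 2, 0, 5]  # made-up predictions for a four-word sentence
predicted_tags = [token_labels_label_by_id[i] for i in predicted_ids]
print(predicted_tags)
# ['B-PER', 'I-PER', 'O', 'B-LOC']
```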

## Section 4 - Load and Split Our Data

We are now prepared to process our data. 

Go ahead and upload the token_train.txt and token_test.txt files into Colab.

Then, fill in the code below to load the data from the token_train.txt and token_test.txt files.
   ```
   train_texts, train_tags = load_conll("token_train.txt")
   val_texts, val_tags = load_conll("token_test.txt")
   ```



In [5]:
load_conll('try.txt')

([['-DOCSTART-'],
  ['SOCCER',
   '-',
   'JAPAN',
   'GET',
   'LUCKY',
   'WIN',
   ',',
   'CHINA',
   'IN',
   'SURPRISE',
   'DEFEAT',
   '.'],
  ['Nadim', 'Ladki'],
  ['AL-AIN', ',']],
 [['O'],
  ['O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'B-PER', 'O', 'O', 'O', 'O'],
  ['B-PER', 'I-PER'],
  ['B-LOC', 'O']])
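
The contents of `try.txt` are not shown above. As a rough sketch, a file along the following lines would give a similar result. The sample rows are made up, and they assume the four-column layout (word, part-of-speech tag, chunk tag, entity label) that matches the default `label_column_index=3`. The helper `read_sentences` is a hypothetical, simplified stand-in for `load_conll` that returns (word, label) pairs.

```python
import os
import tempfile

# Made-up sample in a four-column, space-separated layout.
SAMPLE = """-DOCSTART- -X- -X- O

Nadim NNP B-NP B-PER
Ladki NNP I-NP I-PER

AL-AIN NNP B-NP B-LOC
, , O O
"""


def read_sentences(path, word_col=0, label_col=3):
    """Group (word, label) pairs into sentences separated by blank lines."""
    sentences, current = [], []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            columns = line.split()
            if not columns:
                # A blank line ends the current sentence.
                if current:
                    sentences.append(current)
                current = []
            else:
                current.append((columns[word_col], columns[label_col]))
    # Keep the last sentence if the file does not end with a blank line.
    if current:
        sentences.append(current)
    return sentences


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_conll.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(SAMPLE)
    sentences = read_sentences(path)

for sentence in sentences:
    print(sentence)
```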

In [6]:
# TODO:
# Load the training and test text files.
#...#
train_texts, train_tags = load_conll("token_train.txt")
val_texts, val_tags = load_conll("token_test.txt")



print(train_texts[0:5])
print(train_tags[0:5])
print(len(train_texts))
print(len(val_texts))


[['-DOCSTART-'], ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], ['Peter', 'Blackburn'], ['BRUSSELS', '1996-08-22'], ['The', 'European', 'Commission', 'said', 'on', 'Thursday', 'it', 'disagreed', 'with', 'German', 'advice', 'to', 'consumers', 'to', 'shun', 'British', 'lamb', 'until', 'scientists', 'determine', 'whether', 'mad', 'cow', 'disease', 'can', 'be', 'transmitted', 'to', 'sheep', '.']]
[['O'], ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O'], ['B-PER', 'I-PER'], ['B-LOC', 'O'], ['O', 'B-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
14987
3684


In [7]:
len(train_texts[4])

30

In [8]:
len(train_tags[4])

30

## Section 5 - Preparing Our Data for Training

Modify the following cell to:

1. Tokenize all the training and validation input texts into individual word indexes and attention masks.
   ```
   train_encodings = tokenizer(train_texts, is_split_into_words=True, return_offsets_mapping=True, padding=True, truncation=True)
   val_encodings = tokenizer(val_texts, is_split_into_words=True, return_offsets_mapping=True, padding=True, truncation=True)
   ```

2. Convert the individual word tags into their corresponding indexes.
   ```
   train_labels = encode_tags(train_tags, train_encodings)
   val_labels = encode_tags(val_tags, val_encodings)
   ```

3. Remove the 'offset_mapping' since we do not require that for training.
   ```
   train_encodings.pop("offset_mapping") # we don't want to pass this to the model
   val_encodings.pop("offset_mapping")
   ```

4. Construct the TokenClassificationDataset in preparation for training.
   ```
   train_dataset = TokenClassificationDataset(train_encodings, train_labels)
   val_dataset = TokenClassificationDataset(val_encodings, val_labels)
   ```
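
The `TokenClassificationDataset` class used in the last step is not defined in the cells shown in this practical. A minimal sketch follows, assuming it follows the usual PyTorch dataset pattern (a class with `__len__` and `__getitem__`), which is the kind of dataset the PyTorch `Trainer` in Section 6 works with. The toy encodings and labels are made up so the class can be tried without the tokenizer.

```python
import torch


class TokenClassificationDataset(torch.utils.data.Dataset):
    """Assumed implementation: pairs the tokenizer encodings with per-token labels."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # One sample: every encoding field (input_ids, attention_mask, ...) plus its labels.
        item = {key: torch.tensor(values[idx]) for key, values in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


# Tiny hand-made stand-in for the tokenizer output (two padded "sentences").
toy_encodings = {
    "input_ids": [[101, 2054, 102], [101, 2129, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
toy_labels = [[-100, 0, -100], [-100, 1, -100]]

toy_dataset = TokenClassificationDataset(toy_encodings, toy_labels)
print(len(toy_dataset))
print(toy_dataset[0])
```

Each item is a dictionary of tensors whose keys match the model's input names, with the encoded tags stored under `labels`.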

In [9]:
# TODO:
# Call the tokenizer to assign word indexes to each word.
#
# NOTE: When loading up the data from token_train.txt and token_test.txt (CoNLL format),
# the words have already been split up. 
#...#
train_encodings = tokenizer(train_texts, is_split_into_words=True, return_offsets_mapping=True, padding=True, truncation=True)
val_encodings = tokenizer(val_texts, is_split_into_words=True, return_offsets_mapping=True, padding=True, truncation=True)


# TODO:
# Call the encode_tags function to convert the string-based tag per word
# into numeric indexes.
#...#
train_labels = encode_tags(train_tags, train_encodings)
val_labels = encode_tags(val_tags, val_encodings)




# TODO:
# Remove the offset_mapping list as we don't need it for training.
#...#




# TODO:
# Construct the data set to be used for training.
#...#




In [10]:
import tensorflow as tf

train_encodings.pop("offset_mapping") # we don't want to pass this to the model
val_encodings.pop("offset_mapping")

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))

Run the following cell to see the first few lines of train_texts alongside the corresponding samples in the dataset.

In [12]:
print(len(train_texts))
print(len(val_texts))

for i in range(10):
    print(train_texts[i])
    # skip(i).take(1) selects the i-th sample; take(1) alone always returns the first one
    print(list(train_dataset.skip(i).take(1).as_numpy_iterator()))
    print("---")



14987
3684
['-DOCSTART-']
[({'input_ids': array([  101,  1011,  9986, 14117,  2102,  1011,   102,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           ...

## Section 6 - Train our Token Classification Model

Run the following cell to train the token classification model.

Each training epoch will take a while to complete. If it takes too long, make sure that you have set your runtime to a GPU instance. If it still takes too long, we'll leave it running for 5 minutes and then use a saved model that I've already trained on this same dataset.


In [None]:
from transformers import DistilBertForTokenClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=2,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

token_model = DistilBertForTokenClassification.from_pretrained('distilbert-base-uncased', num_labels=len(token_labels))

trainer = Trainer(
    model=token_model,                   # the instantiated Token Classification 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()
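
To get a feel for how long training will run, we can estimate the number of optimizer steps from the settings above. This is a rough sketch that assumes a single GPU, no gradient accumulation, and that the last partial batch of each epoch is kept. The example count is the training set size printed in Section 4.

```python
import math

num_train_examples = 14987        # len(train_texts) from Section 4
per_device_train_batch_size = 16  # from TrainingArguments above
num_train_epochs = 2
warmup_steps = 500

# One optimizer step per batch (assumes one device and no gradient accumulation).
steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
warmup_fraction = round(warmup_steps / total_steps, 2)

print(steps_per_epoch)   # 937
print(total_steps)       # 1874
print(warmup_fraction)   # 0.27
```

Under these assumptions the learning rate warms up over roughly the first quarter of training, and with `logging_steps=10` the loss is logged close to 190 times.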

## Section 7 - Save our Token Classification Model

Once your training is complete, save the model and download it to your own computer before the session expires!

Alternatively, you can connect to and push your model to Google Drive once your training has completed.

In [None]:
torch.save(token_model, 'tokenclassification.model')
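
For the Google Drive option, one possible approach is to mount Drive in Colab and copy the saved file across. The sketch below only shows the copy step so that it can run anywhere. The helper name `copy_to_folder` and the folder name in the comments are hypothetical, and the demo uses a temporary directory with a placeholder file instead of a real model.

```python
import shutil
import tempfile
from pathlib import Path


def copy_to_folder(model_file, target_dir):
    """Copy a saved model file into target_dir and return the new path."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(model_file, target / Path(model_file).name))


# In Colab, Google Drive can be mounted first with:
#   from google.colab import drive
#   drive.mount('/content/drive')
# A call like the following would then place the file in Drive
# (the folder name 'it3103_models' is a made-up example):
#   copy_to_folder('tokenclassification.model', '/content/drive/MyDrive/it3103_models')

# Local demonstration with a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "tokenclassification.model"
    source.write_bytes(b"placeholder")
    copied = copy_to_folder(source, Path(tmp) / "backup")
    copy_ok = copied.exists()

print(copy_ok)
```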

## Section 8 - Evaluate the Model

Run the following cells to evaluate your model's performance.

You can only do this after your training has completed.

In [None]:
import numpy as np

from transformers import DistilBertTokenizerFast

# Initialize the DistilBERT tokenizer.
#
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Define a list of unique labels that we will recognize
#
token_labels = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Define the function to infer the individual tokens
#
def infer_tokens(text):
    encodings = tokenizer([text], is_split_into_words=True, padding=True, truncation=True, return_offsets_mapping=True, return_tensors="pt")

    # Mark the first sub-word of each word (offset starts at 0 and is not empty).
    # [CLS], [SEP] and padding have the offset (0, 0) and are skipped.
    #
    label_mapping = [0] * len(encodings.offset_mapping[0])
    for i, offset in enumerate(encodings.offset_mapping[0]):
        if offset[0] == 0 and offset[1] != 0:
            label_mapping[i] = 1

    encodings.pop("offset_mapping")

    # Move the inputs to the same device as the model.
    #
    encodings = encodings.to(token_model.device)

    # Use the token classification model to predict the labels
    # for each word. Evaluation mode disables dropout, and no_grad
    # skips gradient tracking since we are only predicting.
    #
    token_model.eval()
    with torch.no_grad():
        output = token_model(**encodings)[0].to("cpu")

    result = []

    for i in range(output.shape[1]):
        if label_mapping[i] == 1:
            result.append(output[0][i].argmax().item())

    return result



In [None]:
from tqdm import tqdm

# This function takes in a list of sentences (texts) and passes them into the
# infer_tokens method to tokenize and predict each word's label.
# 
# It will then convert the list of labels into their numeric index, and
# return both actual label and predicted label to the caller.
#
def get_actual_pred_y(texts, labels):
    all_actual_y = []
    all_pred_y = []

    for i in tqdm(range(len(texts))):
        x = texts[i]

        actual_y = [label for label in labels[i] if label != -100]
        pred_y = infer_tokens(x)

        if len(actual_y) == len(pred_y):
            all_actual_y += actual_y
            all_pred_y += pred_y
        else:
            print("Error: %d, %d, %d, %s " % (i, len(actual_y), len(pred_y), x))

    return all_actual_y, all_pred_y

# Get the actual and predicted labels for all words in all sentences
# for both the training and the test set.
# 
actual_y_train, pred_y_train = get_actual_pred_y(train_texts, train_labels)
actual_y_test, pred_y_test = get_actual_pred_y(val_texts, val_labels)



In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns

def display_model_evaluation_results(y_train, pred_y_train, y_test, pred_y_test, labels):

    plt.figure(figsize=(20,6))

    labels = np.array(labels)

    # Pass the full list of label indexes explicitly so that each matrix
    # stays len(labels) x len(labels) even if a label never appears.
    #
    label_ids = list(range(len(labels)))

    # Plot the first Confusion Matrix for the training data
    #
    cm = confusion_matrix(y_train, pred_y_train, labels=label_ids)
    print(cm.shape)

    cm_df = pd.DataFrame(cm, labels, labels)
    plt.subplot(1, 2, 1)
    plt.title('Confusion Matrix (Train Data)')
    sns.heatmap(cm_df, annot=True, fmt='d')
    plt.ylabel('Actual')
    plt.xlabel('Predicted')

    # Plot the second Confusion Matrix for the test data
    #
    cm = confusion_matrix(y_test, pred_y_test, labels=label_ids)

    cm_df = pd.DataFrame(cm, labels, labels)
    plt.subplot(1, 2, 2)
    plt.title('Confusion Matrix (Test Data)')
    sns.heatmap(cm_df, annot=True, fmt='d')
    plt.ylabel('Actual')
    plt.xlabel('Predicted')

    plt.show()

    # Finally display the classification reports
    #
    print("Train Data")
    print("--------------------------------------------------------")
    print(classification_report(y_train, pred_y_train, labels=label_ids, target_names=labels))
    print("")
    print("Test Data")
    print("--------------------------------------------------------")
    print(classification_report(y_test, pred_y_test, labels=label_ids, target_names=labels))


display_model_evaluation_results(actual_y_train, pred_y_train, actual_y_test, pred_y_test, token_labels)
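
Besides the plots, it can help to list the most frequent mistakes by name. A small sketch in plain Python follows. The actual and predicted indexes are made-up toy values rather than real model output, and the variable names are only illustrative.

```python
from collections import Counter

token_labels = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Made-up toy values standing in for actual_y_test and pred_y_test.
toy_actual_y = [0, 1, 2, 0, 5]
toy_pred_y = [0, 1, 0, 0, 3]

# Fraction of words whose predicted label matches the actual label.
accuracy = sum(a == p for a, p in zip(toy_actual_y, toy_pred_y)) / len(toy_actual_y)

# Count each (actual label, predicted label) pair that disagrees.
confusions = Counter(
    (token_labels[a], token_labels[p])
    for a, p in zip(toy_actual_y, toy_pred_y)
    if a != p
)

print(accuracy)  # 0.6
for (actual, predicted), count in confusions.most_common():
    print(actual, "->", predicted, count)
```

Because most words are tagged `O`, overall accuracy tends to look high. The per-label rows of the classification report and the mismatched pairs are usually more informative.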