<h1 align="center"> From Mongo Annotations To BERT Data </h1>

<div> In this notebook I present a way to transform the annotations stored in MongoDB by the annotation tool (built with Vue) so that they can feed TensorFlow BERT models (or others). Only a few libraries are needed, and each can be swapped for another with similar functionality if desired.
</div>
<div>
    
- **spaCy:** Used for basic text tokenization. Other basic tokenizers, such as whitespace tokenization, can be used instead.
    
- **tensorflow_text:** Used to obtain the subtoken words.

- **tensorflow:** Fine-tuning BERT models

- **tensorflow_hub:** Pre-trained BERT models
</div>


In [None]:
import pandas as pd
import numpy as np

from extraction.db.db_handler import MongoHandler
import sys
sys.path.append("../../tmp/")

import tensorflow_hub as hub
import tensorflow_text as text
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

import spacy


In [None]:
mh = MongoHandler()
annot = mh.get_all_annotations()

If the Mongo database has been filled with annotations, you can use the annot variable defined in the previous chunk. Otherwise, use the example annot shown in the following cell, which has the same structure.

In [None]:
annot = [
    {
        'docid': 1,
        'text': 'Declaration of final dividend\n\nThe Board has declared a final ordinary dividend of 506 cents per share for the year\n\nended 30 September 2021. This, together with the interim ordinary dividend of 320\n\ncents per share, brings the total dividend for the year to 826 cents. In view of the\n\ncompany’s ungeared balance sheet and strong cash generating ability, it has been\n\ndecided to determine this year’s total dividend on the company’s adjusted headline\n\nearnings. Consequently, HEPS was adjusted to exclude the impact of the product\n\nrecall and the civil unrest, which took place in July this year. The Company’s\n\ndividend policy of 1.75x cover has therefore been applied to HEPS after the\n\naforementioned adjustments.\n\n\nIn accordance with paragraphs 11.17 (a) (i) to (x) and 11.17 (c) of the JSE Listings\n\nRequirements, the following additional information is disclosed:\n\n\n   •   The ordinary final dividend has been declared out of income reserves\n\n   •   The local Dividends Tax rate is 20% (twenty percent) effective 22 February\n\n       2017\n\n   •   The gross final dividend amount of 506.00000 cents per ordinary share will be\n\n   •   paid to shareholders who are exempt from the Dividends Tax\n\n   •   The net final dividend amount of 404.80000 cents per ordinary share will be\n\n       paid to\n\n   •   shareholders who are liable for the Dividends Tax\n\n   •   Tiger Brands has 189 818 926 ordinary shares in issue (which includes 10\n\n       326 758 treasury shares)\n\n\n•   Tiger Brands Limited’s income tax reference number is 9325/110/71/7.',
        'annotations': [],
        'userannotations': [
            {
                'tagid': 1,
                'label': 'per_share_amount',
                'label_id': 11,
                'start': 83,
                'end': 86,
                'confidence': 1.0,
                'annotatedby': 'user'
            },
            {
                'tagid': 2,
                'label': 'currency',
                'label_id': 0,
                'start': 87,
                'end': 92,
                'confidence': 1.0,
                'annotatedby': 'user'
            },
            {
                'tagid': 3,
                'label': 'announcement_date',
                'label_id': 7,
                'start': 123,
                'end': 140,
                'confidence': 1.0,
                'annotatedby': 'user'
            },
            {
                'tagid': 4,
                'label': 'per_share_amount',
                'label_id': 11,
                'start': 195,
                'end': 198,
                'confidence': 1.0,
                'annotatedby': 'user'
            },
            {
                'tagid': 5,
                'label': 'currency',
                'label_id': 0,
                'start': 200,
                'end': 205,
                'confidence': 1.0,
                'annotatedby': 'user'
            }
        ]
    },
    {
        'docid': 1,
        'text': 'Declaration of final dividend\n\nThe Board has declared a final ordinary dividend of 506 cents per share for the year\n\nended 30 September 2021. This, together with the interim ordinary dividend of 320\n\ncents per share, brings the total dividend for the year to 826 cents. In view of the\n\ncompany’s ungeared balance sheet and strong cash generating ability, it has been\n\ndecided to determine this year’s total dividend on the company’s adjusted headline\n\nearnings. Consequently, HEPS was adjusted to exclude the impact of the product\n\nrecall and the civil unrest, which took place in July this year. The Company’s\n\ndividend policy of 1.75x cover has therefore been applied to HEPS after the\n\naforementioned adjustments.\n\n\nIn accordance with paragraphs 11.17 (a) (i) to (x) and 11.17 (c) of the JSE Listings\n\nRequirements, the following additional information is disclosed:\n\n\n   •   The ordinary final dividend has been declared out of income reserves\n\n   •   The local Dividends Tax rate is 20% (twenty percent) effective 22 February\n\n       2017\n\n   •   The gross final dividend amount of 506.00000 cents per ordinary share will be\n\n   •   paid to shareholders who are exempt from the Dividends Tax\n\n   •   The net final dividend amount of 404.80000 cents per ordinary share will be\n\n       paid to\n\n   •   shareholders who are liable for the Dividends Tax\n\n   •   Tiger Brands has 189 818 926 ordinary shares in issue (which includes 10\n\n       326 758 treasury shares)\n\n\n•   Tiger Brands Limited’s income tax reference number is 9325/110/71/7.',
        'annotations': [],
        'userannotations': [
            {
                'tagid': 1,
                'label': 'per_share_amount',
                'label_id': 11,
                'start': 83,
                'end': 86,
                'confidence': 1.0,
                'annotatedby': 'user'
            },
            {
                'tagid': 2,
                'label': 'currency',
                'label_id': 0,
                'start': 87,
                'end': 92,
                'confidence': 1.0,
                'annotatedby': 'user'
            },
            {
                'tagid': 3,
                'label': 'announcement_date',
                'label_id': 7,
                'start': 123,
                'end': 140,
                'confidence': 1.0,
                'annotatedby': 'user'
            },
            {
                'tagid': 4,
                'label': 'per_share_amount',
                'label_id': 11,
                'start': 195,
                'end': 198,
                'confidence': 1.0,
                'annotatedby': 'user'
            },
            {
                'tagid': 5,
                'label': 'currency',
                'label_id': 0,
                'start': 200,
                'end': 205,
                'confidence': 1.0,
                'annotatedby': 'user'
            }
        ]
    }    
]
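
The start and end fields are character offsets into the text, with end exclusive, so slicing the text recovers each annotated span. A minimal sketch, using a shortened copy of the example text (the `sample` dictionary is an illustrative assumption, not data read from the database):

```python
# Shortened, illustrative copy of one annotation document.
sample = {
    "text": (
        "Declaration of final dividend\n\n"
        "The Board has declared a final ordinary dividend of 506 cents per share"
    ),
    "userannotations": [
        {"label": "per_share_amount", "start": 83, "end": 86},
        {"label": "currency", "start": 87, "end": 92},
    ],
}

# Slice the text with each (start, end) pair to recover the annotated span.
spans = [
    (tag["label"], sample["text"][tag["start"]:tag["end"]])
    for tag in sample["userannotations"]
]

for label, span in spans:
    print(f"{label}: {span!r}")
```

Running a check like this on real annotations is a quick way to spot offsets that drifted from the stored text.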

Once the annotations are extracted from Mongo, the text must be tokenized and the labels aligned with the tokens for later training. I use spaCy to tokenize the text, but other packages or methods can be used.

The following chunk of code uses the character offsets of the labels to align the sequence of labels with the sequence of tokens.

In [None]:
nlp = spacy.load("en_core_web_sm")
docs = [nlp(annt["text"]) for annt in annot]
doc_tok = []

for num, doc in enumerate(docs):
    tags = [
        (tag["start"],tag["end"], tag["label"]) 
        for tag in annot[num]["userannotations"]
    ]
    tok = []
    
    for sent in doc.sents:    
        tok_sent = []
        last_ent = "O"
        for token in sent:
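            token_end = token.idx + len(token.text)
            # Default to "O" so the label is defined even when no tag matches
            # (or when the document has no annotations at all).
            label = "O"
            for tag in tags:
                # A token is labeled only if it lies fully inside the tag span.
                if token.idx >= tag[0] and token_end <= tag[1]:
                    # Continue the entity if the previous token had the same
                    # label, otherwise begin a new one.
                    prefix = "I" if last_ent == tag[2] else "B"
                    label = f"{prefix}-{tag[2]}"
                    break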
            last_ent = label[2:]
            t = {
                    "token": token,
                    "start": token.idx,
                    "end": token_end,
                    "label": label,
                }
            tok_sent.append(t)
        tok.append(tok_sent)
    doc_tok.append(tok)

# if using the annot example:
assert len(doc_tok)==2
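
The same offset-based alignment can be sketched without spaCy, using whitespace tokenization. This is a simplified illustration under that assumption; the function name `bio_align` and the example string are hypothetical and not part of the notebook.

```python
import re

def bio_align(text, tags):
    """Assign a BIO label to each whitespace-delimited token.

    `tags` is a list of (start, end, label) character spans, end exclusive.
    """
    aligned = []
    previous = "O"  # entity type of the previous token, or "O"
    for match in re.finditer(r"\S+", text):
        label = "O"
        for start, end, name in tags:
            # A token belongs to a span only if it lies fully inside it.
            if match.start() >= start and match.end() <= end:
                prefix = "I" if previous == name else "B"
                label = f"{prefix}-{name}"
                break
        previous = label[2:] if label != "O" else "O"
        aligned.append((match.group(), label))
    return aligned

example = "ended 30 September 2021 and"
example_tags = [(6, 23, "announcement_date")]
aligned = bio_align(example, example_tags)

for token, label in aligned:
    print(token, label)
```

With whitespace tokenization, a token with attached punctuation such as "2021." would extend past the end of the span and receive the label O. That is one reason a tokenizer that separates punctuation, such as spaCy, is convenient here.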

For this example I use the pre-trained model offered by TensorFlow Hub, *https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/2*. The choice was not based on benchmark results or any other performance metric; it only serves to illustrate the concepts.

In [None]:
bert_layer = hub.KerasLayer(
    'https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/2',
    trainable=False
)
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
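# Pass the model's casing flag so the tokenizer matches the vocabulary.
tokenizer = text.BertTokenizer(f"../..{vocab_file.decode()}", lower_case=bool(do_lower_case))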
#tokenizer = text.BertTokenizer("/workspace/data/tf_vocab/vocab.txt")

In [None]:
def tokenize_with_labels(tokens, text_labels):
    tokenized_sentence_id = []
    labels = []

    for word, label in zip(tokens, text_labels):
        if isinstance(word, str):
            tokenized_word = tokenizer.tokenize(word)
        else:
            if word.text == "\n\n":
                continue
            tokenized_word = tokenizer.tokenize(word.text)
        try:
            tokenized_word = tokenized_word.to_list()[0][0]
        except IndexError:
            continue
        n_subwords = len(tokenized_word)
        
        tokenized_sentence_id.extend(tokenized_word)
        labels.extend([label] * n_subwords)

    return tokenized_sentence_id, labels
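
The function above repeats each word-level label once per subtoken. The sketch below shows that expansion in isolation. It assumes a hypothetical `toy_wordpiece` splitter that cuts words into fixed-size chunks; this stands in for the real tokenizer only to keep the example self-contained and does not reproduce the WordPiece algorithm.

```python
def toy_wordpiece(word, size=4):
    """Hypothetical stand-in for a subword tokenizer: fixed-size chunks."""
    pieces = [word[i:i + size] for i in range(0, len(word), size)]
    # Mark continuation pieces with "##", as WordPiece vocabularies do.
    return [p if i == 0 else f"##{p}" for i, p in enumerate(pieces)]

def expand_labels(words, word_labels):
    """Repeat each word label for every subtoken produced from that word."""
    subtokens, sub_labels = [], []
    for word, label in zip(words, word_labels):
        pieces = toy_wordpiece(word)
        subtokens.extend(pieces)
        sub_labels.extend([label] * len(pieces))
    return subtokens, sub_labels

subtokens, sub_labels = expand_labels(
    ["30", "September", "2021"],
    ["B-announcement_date", "I-announcement_date", "I-announcement_date"],
)

for piece, label in zip(subtokens, sub_labels):
    print(piece, label)
```

Note that a word labeled B- yields several B- subtokens under this scheme. Some pipelines instead keep B- on the first subtoken and use I- for the rest; which convention to use is a design choice.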

In [None]:
sents = [[token["token"] for token in sent] for doc in doc_tok for sent in doc]
labels = [[token["label"] for token in sent] for doc in doc_tok for sent in doc]

tokenized_texts_and_labels = [
    tokenize_with_labels(sent, labs)
    for sent, labs in zip(sents, labels)
]

In [None]:
tokenized_texts = [token_label_pair[0] for token_label_pair in tokenized_texts_and_labels]
labels = [token_label_pair[1] for token_label_pair in tokenized_texts_and_labels]

In [None]:
tags = [
    'B-currency','B-symbol','B-company_name',
    'B-security_description','B-type','B-frequency',
    'B-announcement_date','B-record_date','B-ex_date',
    'B-payment_date','B-per_share_amount',
    'I-currency','I-symbol','I-company_name',
    'I-security_description','I-type','I-frequency',
    'I-announcement_date','I-record_date','I-ex_date',
    'I-payment_date','I-per_share_amount','O',
]

In [None]:
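maxlen = 68
# Append PAD only once so that re-running the cell does not duplicate it.
if "PAD" not in tags:
    tags.append("PAD")
tag2idx = {t: i for i, t in enumerate(tags)}

# Use a separate name so the list of tag names is not overwritten.
tag_ids = pad_sequences(
    [[tag2idx[l] for l in lab] for lab in labels],
    maxlen=maxlen, value=tag2idx["PAD"], padding="post",
    dtype="long", truncating="post"
)

n_tags = len(tag2idx)
pad_tags = [to_categorical(i, num_classes=n_tags) for i in tag_ids]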
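
To make the padding and one-hot steps concrete, here is a plain-Python sketch of what happens to a single sentence. The helpers `pad_post` and `one_hot` and the four-tag vocabulary are illustrative assumptions; the notebook itself relies on the Keras utilities.

```python
def pad_post(seq, maxlen, value):
    """Truncate or right-pad `seq` to exactly `maxlen` items."""
    return list(seq[:maxlen]) + [value] * max(0, maxlen - len(seq))

def one_hot(index, num_classes):
    """Return a one-hot list with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

toy_tags = ["B-currency", "I-currency", "O", "PAD"]  # illustrative subset
toy_tag2idx = {t: i for i, t in enumerate(toy_tags)}

sentence_labels = ["O", "B-currency", "I-currency"]
padded = pad_post(
    [toy_tag2idx[label] for label in sentence_labels],
    maxlen=5,
    value=toy_tag2idx["PAD"],
)
encoded = [one_hot(i, len(toy_tags)) for i in padded]

print(padded)
print(encoded[-1])
```

Keeping PAD as the last tag means the padded positions always map to the final one-hot component, which the masked loss further down depends on.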

In [None]:
# int32 matches the dtype declared for the model inputs.
input_ids = tf.convert_to_tensor(
    pad_sequences(
        tokenized_texts,
        maxlen=maxlen, dtype="int32", value=0,
        truncating="post", padding="post"
    )
)
# 1 for real tokens, 0 for padding (id 0 is the padding value used above).
attention_masks = tf.cast(tf.not_equal(input_ids, 0), tf.int32)
# Single-segment inputs: every position belongs to segment 0.
input_type_ids = tf.zeros_like(input_ids)

Once the documents are segmented into sentences, tokenized, and padded, the input structure expected by BERT models must be created.

In [None]:
dataset = {
    "input_word_ids":input_ids,
    "input_mask":attention_masks,
    "input_type_ids":input_type_ids,
}
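
All three inputs can be derived from the padded token ids alone. The sketch below builds them with plain lists, assuming 0 is the padding id as in the cells above; the token ids are made up for illustration and are not real vocabulary entries.

```python
PAD_ID = 0  # padding value used when the sequences were padded

# Two padded sequences with made-up token ids.
padded_ids = [
    [1109, 2464, 1144, 0, 0],
    [140, 0, 0, 0, 0],
]

# 1 marks a real token, 0 marks padding.
attention_mask = [[int(i != PAD_ID) for i in row] for row in padded_ids]
# Single-segment input: every position belongs to segment 0.
type_ids = [[0] * len(row) for row in padded_ids]

toy_dataset = {
    "input_word_ids": padded_ids,
    "input_mask": attention_mask,
    "input_type_ids": type_ids,
}

for name, value in toy_dataset.items():
    print(name, value)
```

The dictionary keys match the names given to the Keras input layers, which is how the values are routed to the right input.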

For the model, everything at PAD positions is irrelevant. To keep those positions from contributing to the loss, the following function sets their error to 0. With this modification of the loss function, the model should disregard the padded elements and focus on the actual text of the documents.

In [None]:
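# The model's last layer already applies softmax, so the loss receives
# probabilities rather than logits.
loss_object = tf.keras.losses.CategoricalCrossentropy(
    from_logits=False, reduction='none')

def loss_function(real, pred):
    # One-hot vector of the PAD tag, which is the last entry of tag2idx.
    pad_one_hot = to_categorical(n_tags-1, num_classes=n_tags)
    # A position is padding when all n_tags components match the PAD vector.
    matches = tf.math.reduce_sum(
       tf.cast(tf.equal(real, pad_one_hot), tf.float32),
       axis=-1, keepdims=False
    )
    mask = tf.math.logical_not(tf.equal(matches, n_tags))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    # Padded positions contribute 0 but still count in the mean.
    return tf.reduce_mean(loss_)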
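
The effect of the mask can be checked by hand on a tiny example. The following plain-Python sketch computes the same quantity for one short sequence, assuming the predictions are probabilities and that the last class is PAD; the function name and the numbers are illustrative only.

```python
import math

def masked_cross_entropy(y_true, y_pred, pad_index):
    """Mean categorical cross-entropy with PAD positions zeroed out.

    y_true holds one-hot rows and y_pred holds probability rows.
    The mean is taken over all positions, padded ones included,
    so padded positions lower the average without adding error.
    """
    losses = []
    for true_row, pred_row in zip(y_true, y_pred):
        is_pad = true_row[pad_index] == 1.0
        ce = -sum(t * math.log(p) for t, p in zip(true_row, pred_row))
        losses.append(0.0 if is_pad else ce)
    return sum(losses) / len(losses)

# Three positions, three classes; the last position is padding.
y_true = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y_pred = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.8, 0.1, 0.1]]

loss = masked_cross_entropy(y_true, y_pred, pad_index=2)
print(loss)
```

The badly predicted PAD position does not change the result. An alternative design would divide by the number of unmasked positions instead of the full sequence length, which keeps the loss scale independent of how much padding a batch contains.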

Here, the model is just for demonstration purposes: it shows how the inputs are fed in and how the output is generated.

In [None]:
def create_model():
    
    input_word_ids = tf.keras.layers.Input(
        shape=(maxlen,),
        dtype=tf.int32,
        name="input_word_ids"
    )
    input_mask = tf.keras.layers.Input(
        shape=(maxlen,),
        dtype=tf.int32,
        name="input_mask"
    )
    input_type_ids = tf.keras.layers.Input(
        shape=(maxlen,),
        dtype=tf.int32,
        name="input_type_ids"
    )
    pooled_output, sequence_output = bert_layer(
        [input_word_ids, input_mask, input_type_ids]
    )
    
    output = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(64, activation="tanh")
    )(sequence_output)
   
    output = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(len(tag2idx.keys()), activation="softmax")
    )(output)

    model = tf.keras.Model(
      inputs={
        'input_word_ids': input_word_ids,
        'input_mask': input_mask,
        'input_type_ids': input_type_ids}, 
      outputs=output)
    return model

In [None]:
classifier_model = create_model()

In [None]:
classifier_model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=loss_function,
    metrics=[tf.keras.metrics.Precision()]
)

In [None]:
history = classifier_model.fit(
    dataset,
    np.array(pad_tags),
    epochs=200,
    verbose=1)

In [None]:
bert_raw_result = classifier_model(dataset)
print(bert_raw_result)

In [None]:
bert_result=[np.argmax(i) for i in bert_raw_result[0]]
bert_result
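
The class indices obtained above can be mapped back to tag names by inverting the tag-to-index dictionary, for example with `{i: t for t, i in tag2idx.items()}`. A minimal sketch with a three-tag vocabulary and made-up probabilities (the `decode` helper is hypothetical, and dropping every position predicted as PAD is a simplification):

```python
def decode(probabilities, idx2tag, pad_tag="PAD"):
    """Map each position's highest-scoring class to its tag name, dropping PAD."""
    decoded = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)
        tag = idx2tag[best]
        if tag != pad_tag:
            decoded.append(tag)
    return decoded

toy_idx2tag = {0: "B-currency", 1: "O", 2: "PAD"}
toy_probs = [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.05, 0.05, 0.9]]

predicted = decode(toy_probs, toy_idx2tag)
print(predicted)
```

A stricter variant would use the attention mask to decide which positions to keep, rather than trusting the predicted PAD class.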

In [None]:
#tokenizer.detokenize(tokenizer.tokenize('Declaration'))