<a href="https://colab.research.google.com/github/Neilus03/NLP-2023/blob/main/NER_and_Sequence_Labeling_Simple_Baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# Sequence Labeling - NER

Extracting relevant information from historical handwritten document collections is a key step in making these manuscripts accessible and searchable. In this context, the objective is to move beyond pure transcription towards document understanding. Concretely, the aim is to detect named entities and assign each one a semantic category, such as family name, place, or occupation.


A typical application scenario for named entity recognition is demographic documents, since they contain people's names, birthplaces, occupations, etc. Here, extracting the key contents and storing them in databases makes those contents accessible and enables innovative services based on genealogical, social, or demographic searches.

<p style = 'text-align: center'>
<img src = "http://dag.cvc.uab.es/wp-content/uploads/2016/07/esposalla_detall.jpg">
</p>

For further questions, contact oriol.ramos@uab.cat and alicia.fornes@uab.cat.

Using Google Colab is not mandatory but highly recommended, since most of the notebook expects a Linux VM with IPython bindings.



## First, we will install the unmet dependencies.

This will download some packages and the required data. It may take a while.

In [None]:
#@title 
from IPython.display import clear_output

!git clone https://github.com/EauDeData/nlp-resources
!cp -r nlp-resources/ resources/
!rm -rf nlp-resources/


!pip install nltk 
!pip install git+https://github.com/MeMartijn/updated-sklearn-crfsuite
clear_output()

from typing import * 

import nltk
import numpy as np
import copy
import random
from collections import Counter

import sklearn_crfsuite as crfs
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

from resources.data.dataloaders import EsposallesTextDataset

## Data Curation
Loading the dataset

In [None]:
random.seed(42)
train_loader = EsposallesTextDataset('resources/data/esposalles/') 
test_loader = copy.deepcopy(train_loader)
test_loader.test()

Example of data from each loader:
> Format string: ```word```:```label```


In [None]:
print([f"{x}:{y}" for x,y in zip(*train_loader[0])])
print([f"{x}:{y}" for x,y in zip(*test_loader[0])])

['Dilluns:other', 'a:other', '5:other', 'rebere:other', 'de:other', 'Hyacinto:name', 'Boneu:surname', 'hortola:occupation', 'de:other', 'Bara:location', 'fill:other', 'de:other', 'Juan:name', 'Boneu:surname', 'parayre:occupation', 'defunct:other', 'y:other', 'de:other', 'Maria:name', 'ab:other', 'Anna:name', 'donsella:state', 'filla:other', 'de:other', 't:name', 'Cases:surname', 'pages:occupation', 'de:other', 'Bara:location', 'defunct:other', 'y:other', 'de:other', 'Peyrona:name']
['Divendres:other', 'a:other', '18:other', 'rebere:other', 'de:other', 'Juan:name', 'Torres:surname', 'pages:occupation', 'habitant:other', 'en:other', 'Sabadell:location', 'fill:other', 'de:other', 'Bernat:name', 'Torres:surname', 'pages:occupation', 'de:other', 'Moya:location', 'bisbat:location', 'de:location', 'Vich:location', 'y:other', 'de:other', 'Antiga:name', 'defucts:other', 'ab:other', 'Margarida:name', 'donsella:state', 'filla:other', 'de:other', 'Juan:name', 'Argemir:surname', 'pages:occupation',

If the dataset was downloaded correctly, you will see two different samples above, and both checks below will pass.

In [None]:
#@title

# Dataset check: every sample must have one label per word

for idx in range(len(train_loader)):

  x, y = train_loader[idx]
  if len(x) != len(y): 
    print('train_set test not passed')
    break

else: print('train_set test passed')

for idx in range(len(test_loader)):

  x, y = test_loader[idx]
  if len(x) != len(y):
    print('test_set test not passed')
    break

else: print('test_set test passed')

train_set test passed
test_set test passed


Since most of the computation won't be done with strings, the following function will create a Look Up Table (LUT) that transforms string tokens into ```int``` tokens. 

In [None]:
def create_tokens_lut(train_dataset) -> Dict:
    '''
    Input:
        train_dataset: Training dataset. 

        Don't tokenize test_set as later on,
        we will be considering out-of-vocabulary words as <unk> tokens.

        NOTE: Tokens MUST be lowered (.lower()) before considering them. 

    Output:
        LUT[Dict]: {
        word1: 0,
        word2: 1,
            ...
        wordn: n - 1
        }

    '''
    token_to_id = {}
    current_id = 0
    
    for i in range(len(train_dataset)):
        for word, label in zip(*train_dataset[i]):
            token = word.lower()  # Lowercase the token

            if token not in token_to_id:
                token_to_id[token] = current_id # Each new token is saved with the corresponding ID
                current_id += 1

    return token_to_id

LUT = create_tokens_lut(train_loader)
LUT

{'dilluns': 0,
 'a': 1,
 '5': 2,
 'rebere': 3,
 'de': 4,
 'hyacinto': 5,
 'boneu': 6,
 'hortola': 7,
 'bara': 8,
 'fill': 9,
 'juan': 10,
 'parayre': 11,
 'defunct': 12,
 'y': 13,
 'maria': 14,
 'ab': 15,
 'anna': 16,
 'donsella': 17,
 'filla': 18,
 't': 19,
 'cases': 20,
 'pages': 21,
 'peyrona': 22,
 'dit': 23,
 'dia': 24,
 'bernat': 25,
 'call': 26,
 'st': 27,
 'esteva': 28,
 'palau': 29,
 'tordera': 30,
 'guille': 31,
 'catherina': 32,
 'guillem': 33,
 'boix': 34,
 'la': 35,
 'mora': 36,
 'bisbat': 37,
 'vich': 38,
 'margarida': 39,
 'jua': 40,
 'perega': 41,
 'comissari': 42,
 'real': 43,
 'habitant': 44,
 'en': 45,
 'viudo': 46,
 'gratia': 47,
 'carbonell': 48,
 'negociant': 49,
 'defuncts': 50,
 'dijous': 51,
 '10': 52,
 'montserrat': 53,
 'gibert': 54,
 'treballador': 55,
 'traginer': 56,
 'mar': 57,
 'janota': 58,
 'magdalena': 59,
 'sabater': 60,
 'dissapte': 61,
 '26': 62,
 'jaume': 63,
 '#': 64,
 'oller': 65,
 'andreu': 66,
 'pa': 67,
 'lomar': 68,
 'eularia': 69,
 'pere': 

In [None]:
def check_oov_words(LUT = LUT, test_set = test_loader):
    # Out of vocabulary list
    oov_words = []

    for i in range(len(test_set)):
        for word, label in zip(*test_set[i]):
            token = word.lower()  # Lowercase the token

            if token not in LUT:
                oov_words.append(token) # Add the tokens of the test dataloader to the list if they don't belong to the LUT

    return oov_words
            
print(list(check_oov_words()))


['argemir', 'nyella', 'rianna', 'bonastra', 'angli', 'victo', 'theodora', 'payas', 'monllor', 'rotxe', 'moreno', 'islla', 'brasil', 'gusman', 'more', 'islla', 'sto', 'thome', 'habitants', 'melcior', 'bachs', 'pachs', 'plans', 'begas', 'rius', 'galeras', 'box', 'deseny', 'francesa', 'ortiz', 'faneca', 'faneca', 'sobrevila', 'sobrevila', 'cabrer', 'carantela', 'mso', 'tibau', 'monblanch', 'tibau', 'broquets', 'noguera', 'gassull', 'llondra', 'quart', 'pallissa', 'vivints', 'cabus', 'felis', 'buyra', 'scrivent', 'castigaleu', 'comptat', 'ribagossa', 'fontanilles', 'conteso', 'tatare', 'idrach', 'peramon', 'peramon', 'terre', 'arisart', 'arisart', 'faja', 'pinya', 'faliu', 'campanya', 'faliu', 'majol', 'majol', 'guardi#', 'debarca', 'constansa', 'llorenci', 'caxaler', 'llorenci', 'aguller', 'poses', 'miro', 'darder', 'darder', 'garces', 'imaginayre', 'juan#', 'villaro', 'vilademaser', 'muntells', 'muntells', 'sengermes', 'cebriana', 'tamuyell', 'cabaner', 'manader', 'manader', 'campprecios

Because some modules compute in batches, sequences must have constant length. As is common practice, we will create three new tokens: ```<bos>``` and ```<eos>``` for the start and end of a given sequence, and ```<unk>``` for unknown tokens at application (test) time, which also serves as padding during training. Manually add those tokens to the ```LUT```. 
 

Under those constraints, fill in the corresponding functions that will post-process each batch. Feel free to code more post-processing functions if you need them.





In [None]:
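# Existing ids run from 0 to len(LUT) - 1, so len(LUT) is the next free id.
# Using len(LUT) + 1 would skip an id and push the largest id past the vocabulary size.
LUT['<unk>'] = len(LUT)
LUT['<bos>'] = len(LUT)
LUT['<eos>'] = len(LUT)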
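As a minimal sketch of how the LUT and the special tokens might be used together, the snippet below maps a sequence of words to integer ids and falls back to ```<unk>``` for out-of-vocabulary words. The helper name `encode` and the toy vocabulary are assumptions made for illustration; they are not part of the notebook or of the Esposalles data.

```python
from typing import Dict, List


def encode(tokens: List[str], lut: Dict[str, int]) -> List[int]:
    """Map string tokens to integer ids; OOV words map to the <unk> id."""
    unk_id = lut['<unk>']
    return [lut.get(token.lower(), unk_id) for token in tokens]


# Toy LUT for illustration only (hypothetical, not the real vocabulary).
toy_lut = {'dilluns': 0, 'a': 1, 'rebere': 2, 'de': 3}
for special in ('<unk>', '<bos>', '<eos>'):
    toy_lut[special] = len(toy_lut)  # next free id, keeps ids contiguous

# 'Argemir' is not in the toy vocabulary, so it is encoded as <unk>.
encoded = encode(['Dilluns', 'a', 'rebere', 'de', 'Argemir'], toy_lut)
print(encoded)
```

Lowercasing inside `encode` mirrors the lowercasing applied when the LUT is built, so lookups stay consistent.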

The following lines are commented because they represent an alternative approach to preprocessing the input data

In [None]:
#Adds special tokens <bos> (beginning of sequence) and <eos> (end of sequence) to the beginning and end of a list of tokens, respectively.
"""
def add_bos_eos(tokens: List[str]) -> List[str]:
    return ['<bos>'] + tokens + ['<eos>']
"""

In [None]:
#Pads the input list of tokens to a specified max_length using a specified padding_token. 
#If the input list is shorter than max_length, it appends the padding tokens to the list;
#if the input list is longer, it truncates the list to max_length.
"""
def pad_sequence(tokens: List[str], max_length: int, padding_token: str = '<unk>') -> List[str]:
    if len(tokens) < max_length:
        return tokens + [padding_token] * (max_length - len(tokens))
    else:
        return tokens[:max_length]
"""

In [None]:
#Processes a batch of data, applying the add_bos_eos and pad_sequence functions to each list of words and labels in the batch.
"""
def process_batch(batch: List[Tuple[List[str], List[str]]], max_length: int) -> List[Tuple[List[str], List[str]]]:
    processed_batch = []

    for words, labels in batch:
        words = add_bos_eos(words)
        words = pad_sequence(words, max_length)

        labels = add_bos_eos(labels)
        labels = pad_sequence(labels, max_length, padding_token='other')

        processed_batch.append((words, labels))

    return processed_batch
"""

In [None]:
MAX_SEQUENCE_LENGTH = 50
def complete_seq(X) -> List[List]:
  '''

    Input: 
      X: A batch of N sequences [
        [word1, ..., wordn],
        [word1, ..., wordm]
      ]

    Output:
      A batch of N sequences with MAX_SEQUENCE_LENGTH tokens.
        - The starting token will always be <bos>
        - The last 'real' token will be <eos>
        - Tokens after <eos> up to MAX_SEQUENCE_LENGTH will be <unk>, used as padding.

  '''
  complete_seq = []
  for seq in X:
      seq = ['<bos>'] + seq + ['<eos>']
      seq += ['<unk>'] * (MAX_SEQUENCE_LENGTH - len(seq))
      complete_seq.append(seq[:MAX_SEQUENCE_LENGTH])
  return complete_seq

def post_process(X, functions = [complete_seq,]):
  for f in functions: X = f(X)
  return X
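To make the padding behavior concrete, here is a small self-contained sketch of the same idea: wrap with ```<bos>``` and ```<eos>```, pad with ```<unk>```, then truncate. The function name `pad_with_markers` and the short `MAX_LEN` are assumptions chosen so the output is easy to read; they are not the notebook's own definitions.

```python
from typing import List

MAX_LEN = 8  # hypothetical small length, chosen only for the demo


def pad_with_markers(seq: List[str], max_len: int = MAX_LEN) -> List[str]:
    """Wrap a sequence with <bos>/<eos>, pad with <unk>, truncate to max_len."""
    seq = ['<bos>'] + list(seq) + ['<eos>']
    # A negative multiplier yields an empty list, so long inputs get no padding.
    seq += ['<unk>'] * (max_len - len(seq))
    return seq[:max_len]


short = pad_with_markers(['juan', 'torres', 'pages'])
long = pad_with_markers([f'w{i}' for i in range(10)])
print(short)
print(long)
```

Note that with plain truncation a sequence longer than the limit loses its ```<eos>``` marker. Whether that matters depends on the downstream model.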

## NER - Baseline Approach

The first approach we will try is based on computing the probabilities for each word in our training corpus. This means computing the most likely category for each word in the dictionary.

Compute the test categories predictions and measure the performance for this simple model.
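
The baseline described above can be read as a maximum-likelihood estimate over token-label counts. As a sketch, writing $\mathrm{count}(w, c)$ for the number of times the lowercased word $w$ carries category $c$ in the training set (notation introduced here for illustration):

$$
\hat{P}(c \mid w) = \frac{\mathrm{count}(w, c)}{\sum_{c'} \mathrm{count}(w, c')}, \qquad \hat{c}(w) = \arg\max_{c} \hat{P}(c \mid w)
$$

For a hypothetical word seen 4 times in training, 3 times as an occupation and once as a surname, this gives $\hat{P}(occupation \mid w) = 3/4 = 0.75$, so the baseline would always predict occupation for that word.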

In [None]:
def compute_emissions_dict(dataloader) -> Dict:
  '''

    Given the train loader ```dataloader```
     this function will compute the max likelihood dictionary for each word.

  Input:
    dataloader: train loader with EsposallesTextDataset
  
  Outputs:
    Dict: {
      pagès: {name: X occupation: X}, # REMEMBER TO LOWER YOUR TOKENS!
            ...
      LUT - wordn: {category: x%, ...}
    }

  '''
  emissions_dict = {}
  
  token_counts = Counter()
  label_counts = Counter()
  token_label_counts = Counter()
  

  for i in range(len(dataloader)):
      for word,label in zip(*dataloader[i]):
          token = word.lower()
          token_counts[token] += 1
          label_counts[label] += 1
          token_label_counts[(token,label)] += 1
  
  for (token, label), count in token_label_counts.items():
    if token not in emissions_dict:
        emissions_dict[token] = {}
    # Relative frequency of this label among all occurrences of the token
    emissions_dict[token][label] = count / token_counts[token]

  return emissions_dict


In [None]:
priors = compute_emissions_dict(train_loader)

At this point, as an example, your emissions dictionary should yield the following emission:

$P(location |$ ```Prats``` $) = 18\%$

$P(surname |$ ```Prats``` $) = 72\%$

$P(other |$ ```Prats``` $) = 9\%$

In [None]:
priors['prats']

{'location': 0.18181818181818182,
 'surname': 0.7272727272727273,
 'other': 0.09090909090909091}

This method is limited by its lack of context and, therefore, its low expressiveness. 

The following function will compute the confusion matrix for the predictions in the ```test_set``` in order to find the most problematic words. 

* What do they all have in common? 
* What kinds of words perform worst?
* What's your solution for out-of-vocabulary words? Can you provide a prediction for those?

In [None]:
def predict_test_set(emissions, test_set):
  
  '''

  s: casament eduard pages

  prediccio: [None, nom, ofici]

  Important: 
    Remember to check if you can provide a label for each word (OOVs?).
    What's your solution for those you cannot classify? Justify.

      tip: be as creative as you want.

  '''
  predictions = []
  for i in range(len(test_set)):
        seq_predictions = []
        for word, label in zip(*test_set[i]):
            token = word.lower()
            if token in emissions:
                pred_label = max(emissions[token], key=emissions[token].get)
            else:
                # Use the most frequent label for OOV words (simple but effective)
                pred_label = 'other'
            seq_predictions.append(pred_label)

        predictions.append(seq_predictions)

  return predictions


predictions = predict_test_set(priors, test_loader)
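
One possible refinement of the OOV fallback, sketched under assumptions: rather than hardcoding the fallback label, it could be computed as the most frequent label in the training data. The helper names `most_frequent_label` and `predict_with_fallback` and the toy data below are hypothetical and used only for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def most_frequent_label(samples: List[Tuple[List[str], List[str]]]) -> str:
    """Return the label that occurs most often across all training samples."""
    counts = Counter(label for _, labels in samples for label in labels)
    return counts.most_common(1)[0][0]


def predict_with_fallback(words: List[str],
                          emissions: Dict[str, Dict[str, float]],
                          fallback: str) -> List[str]:
    """Pick the most likely label per word; use the fallback for OOV words."""
    preds = []
    for word in words:
        dist = emissions.get(word.lower())
        preds.append(max(dist, key=dist.get) if dist else fallback)
    return preds


# Toy data for illustration only (hypothetical, not the real corpus).
toy_train = [
    (['rebere', 'de', 'Juan', 'Torres'], ['other', 'other', 'name', 'surname']),
    (['fill', 'de', 'Bernat'], ['other', 'other', 'name']),
]
toy_emissions = {'de': {'other': 1.0}, 'juan': {'name': 1.0}}

fallback = most_frequent_label(toy_train)
toy_preds = predict_with_fallback(['de', 'Juan', 'Argemir'], toy_emissions, fallback)
print(fallback, toy_preds)
```

Deriving the fallback from the data keeps the rule valid if the label distribution changes, at the cost of one extra pass over the training set.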



In [None]:
def find_common_errors(x_test: List[List], y_pred: List[List], y_true: List[List]) -> Dict:
    '''
        Input: 
        x_test: A list with each sample in the corpus with the words for which we
    ran each prediction
            [
            ['lorem', 'ipsum', 'dolor', 'sit', 'amet'],
            ['Hello', 'world', '!!!'],
            ]
        
        y_pred: A list with the predicted labels for each word in x_test corpus.
            [
            ['1', '0', '0', '1', '2'],
            ['2', '1', '0'],
            ]
        
        y_true: GT for the x_test sample
            [
            ['0', '0', '0', '1', '2'],
            ['0', '1', '0'],
            ]
        Output:
            A dict keyed by each misclassified word. Only the last error
            seen for a given word is kept.
        {
        pages: {'pred': prediction, 'gt': label},
        }
    '''
    # Flatten the batch of sequences into parallel token-level lists
    x_test_words = [word for seq in x_test for word in seq]
    y_pred_labels = [label for seq in y_pred for label in seq]
    y_true_labels = [label for seq in y_true for label in seq]

    errors = {word: {"pred": pred, 'gt': gt} for word, pred, gt in zip(x_test_words, y_pred_labels, y_true_labels) if pred != gt}

    return errors

The following code cell is for testing.

In [None]:
x_test = [['lorem', 'ipsum', 'dolor', 'sit', 'amet'], ['Hello', 'world', '!!!']]
y_pred = [['1', '0', '0', '1', '2'],['2', '1', '0']]
y_true = [['0', '0', '0', '1', '2'],['0', '1', '0']]

find_common_errors(x_test, y_pred, y_true)

{'lorem': {'pred': '1', 'gt': '0'}, 'Hello': {'pred': '2', 'gt': '0'}}

In [None]:
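def compute_token_precision(x_test: List[List], y_pred: List[List], y_true: List[List]):
    # NOTE: errors is keyed by word, so a word misclassified several times counts once.
    # On corpora with repeated words this underestimates the token-level error rate.
    errors = find_common_errors(x_test, y_pred, y_true)
    x_test_words = [word for seq in x_test for word in seq]
    error_rate = len(errors) / len(x_test_words)
    return 1 - error_rate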

The following code cell is for testing.

In [None]:
x_test = [['lorem', 'ipsum', 'dolor', 'sit', 'amet'], ['Hello', 'world', '!!!']]
y_pred = [['1', '0', '0', '1', '2'],['2', '1', '0']]
y_true = [['0', '0', '0', '1', '2'],['0', '1', '0']]

compute_token_precision(x_test, y_pred, y_true)

0.75

In [None]:
x_test = [test_loader[idx][0] for idx in range(len(test_loader))]
y_true = [test_loader[idx][1] for idx in range(len(test_loader))]

find_common_errors(x_test, predictions, y_true)['Esteva']
compute_token_precision(x_test, predictions, y_true)

0.9308014161570647

Analyze the results for the common errors.
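
As a starting point for that analysis, the sketch below counts every (ground truth, prediction) pair at token level, which gives a sparse confusion matrix and an accuracy that counts repeated words each time they occur. The function names `confusion_counts` and `token_accuracy` are assumptions for illustration, and the toy labels reuse the small test case from this notebook.

```python
from collections import Counter
from typing import List, Tuple


def confusion_counts(y_pred: List[List[str]],
                     y_true: List[List[str]]) -> Counter:
    """Count (ground_truth, prediction) pairs over every token."""
    pairs: Counter = Counter()
    for pred_seq, true_seq in zip(y_pred, y_true):
        for pred, gt in zip(pred_seq, true_seq):
            pairs[(gt, pred)] += 1
    return pairs


def token_accuracy(pairs: Counter) -> float:
    """Fraction of tokens whose prediction equals the ground truth."""
    total = sum(pairs.values())
    correct = sum(n for (gt, pred), n in pairs.items() if gt == pred)
    return correct / total if total else 0.0


toy_pred = [['1', '0', '0', '1', '2'], ['2', '1', '0']]
toy_true = [['0', '0', '0', '1', '2'], ['0', '1', '0']]

pairs = confusion_counts(toy_pred, toy_true)
# Keep only the off-diagonal entries, i.e. the actual confusions.
confusions = {pair: n for pair, n in pairs.items() if pair[0] != pair[1]}
print(confusions)
print(token_accuracy(pairs))
```

Sorting `confusions` by count would surface which label pairs the baseline mixes up most often.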

## Conclusions

Here, write a brief conclusion for this notebook, referring to the main differences, advantages, and disadvantages of each approach.




> Your conclusions here

After analyzing the given code, we can draw the following conclusions:

The task at hand is Named Entity Recognition (NER) and sequence labeling in historical handwritten documents. The goal is to extract information from these documents and categorize the entities into semantic categories such as family names, places, occupations, etc.

The code uses a simple baseline approach to predict the named entity categories, which involves computing the most likely category for each word based on the training data.

The baseline approach does not take context into consideration and is limited in its expressiveness. It also faces challenges in handling out-of-vocabulary (OOV) words. The current solution for OOV words is to assign them the most frequent label, which may not be the best approach in all cases.

Common errors are collected per word, together with a token-level score (about 0.93 on the test set), to analyze the model's performance. The most problematic words are identified and examined for common patterns or characteristics.

Based on these conclusions, the baseline approach can be improved upon by incorporating context-aware methods or other techniques for handling OOV words. These improvements could potentially lead to better performance and more accurate predictions for the NER task in historical handwritten documents.
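
One context-aware direction, sketched here under assumptions, is to describe each word by features of itself and its neighbors, which is the kind of input a CRF tagger works with. The feature names and the helpers `word2features` and `sentence2features` are illustrative choices, not something defined in this notebook, and the toy sentence is hypothetical.

```python
from typing import Dict, List


def word2features(sentence: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for the word at position i, including neighbors."""
    word = sentence[i]
    features: Dict[str, object] = {
        'bias': 1.0,
        'word.lower': word.lower(),
        'word.istitle': word.istitle(),  # capitalised words are often names
        'word.isdigit': word.isdigit(),
        'suffix3': word.lower()[-3:],    # helps generalise to OOV words
    }
    if i > 0:
        features['prev.lower'] = sentence[i - 1].lower()
    else:
        features['BOS'] = True
    if i < len(sentence) - 1:
        features['next.lower'] = sentence[i + 1].lower()
    else:
        features['EOS'] = True
    return features


def sentence2features(sentence: List[str]) -> List[Dict[str, object]]:
    """Feature dicts for every position in the sentence."""
    return [word2features(sentence, i) for i in range(len(sentence))]


toy_sentence = ['rebere', 'de', 'Juan', 'Torres', 'pages']
feats = sentence2features(toy_sentence)
print(feats[2])
```

Unlike the per-word baseline, these features let a model use cues such as the preceding word or capitalization, which may help with out-of-vocabulary names.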