<a href="https://colab.research.google.com/github/Neilus03/NLP-2023/blob/main/NER_and_Sequence_Labeling_Neil.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# Sequence Labeling - NER

Extracting relevant information from historical handwritten document collections is a key step in making these manuscripts accessible and searchable. In this context, the objective goes beyond pure transcription and moves towards document understanding. Concretely, the aim is to detect the named entities and assign each of them a semantic category, such as family name, place or occupation.


A typical application scenario for named entity recognition is demographic documents, since they contain people's names, birthplaces, occupations, etc. In this scenario, extracting the key contents and storing them in databases gives access to those contents and enables innovative services based on genealogical, social or demographic searches.

<p style = 'text-align: center'>
<img src = "http://dag.cvc.uab.es/wp-content/uploads/2016/07/esposalla_detall.jpg">
</p>

For further questions, contact oriol.ramos@uab.cat or alicia.fornes@uab.cat.

Using Google Colab is not mandatory but highly recommended, since the notebook assumes a Linux VM with IPython bindings (for example, the `!` shell commands in the setup cell).

## First, we will install the unmet dependencies.

This will download some packages and the required data; it may take a while.

In [None]:
#@title 
from IPython.display import clear_output

!git clone https://github.com/EauDeData/nlp-resources
!cp -r nlp-resources/ resources/
!rm -rf nlp-resources/


!pip install nltk 
!pip install git+https://github.com/MeMartijn/updated-sklearn-crfsuite
clear_output()

from typing import * 

import nltk
import numpy as np
import copy
import random
from collections import Counter

import sklearn_crfsuite as crfs
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

from resources.data.dataloaders import EsposallesTextDataset

## Data Curation
Loading the dataset

In [None]:
random.seed(42)
train_loader = EsposallesTextDataset('resources/data/esposalles/') 
test_loader = copy.deepcopy(train_loader)
test_loader.test()

Example of data from each loader:
> Format string: ```word```:```label```


In [None]:
print([f"{x}:{y}" for x,y in zip(*train_loader[0])])
print([f"{x}:{y}" for x,y in zip(*test_loader[0])])

['Dilluns:other', 'a:other', '5:other', 'rebere:other', 'de:other', 'Hyacinto:name', 'Boneu:surname', 'hortola:occupation', 'de:other', 'Bara:location', 'fill:other', 'de:other', 'Juan:name', 'Boneu:surname', 'parayre:occupation', 'defunct:other', 'y:other', 'de:other', 'Maria:name', 'ab:other', 'Anna:name', 'donsella:state', 'filla:other', 'de:other', 't:name', 'Cases:surname', 'pages:occupation', 'de:other', 'Bara:location', 'defunct:other', 'y:other', 'de:other', 'Peyrona:name']
['Divendres:other', 'a:other', '18:other', 'rebere:other', 'de:other', 'Juan:name', 'Torres:surname', 'pages:occupation', 'habitant:other', 'en:other', 'Sabadell:location', 'fill:other', 'de:other', 'Bernat:name', 'Torres:surname', 'pages:occupation', 'de:other', 'Moya:location', 'bisbat:location', 'de:location', 'Vich:location', 'y:other', 'de:other', 'Antiga:name', 'defucts:other', 'ab:other', 'Margarida:name', 'donsella:state', 'filla:other', 'de:other', 'Juan:name', 'Argemir:surname', 'pages:occupation',

If the dataset was downloaded correctly, you will see two different samples above and both checks passing below.

In [None]:
#@title

# Dataset check

for idx in range(len(train_loader)):

  x, y = train_loader[idx]
  if len(x) != len(y): 
    print('train_set test not passed')
    break

else: print('train_set test passed')

for idx in range(len(test_loader)):

  x, y = test_loader[idx]
  if len(x) != len(y):
    print('test_set test not passed')
    break

else: print('test_set test passed')

train_set test passed
test_set test passed


Since most of the computation won't be done with strings, the following function will create a Look Up Table (LUT) that transforms string tokens into ```int``` tokens. 

In [None]:
def create_tokens_lut(train_dataset) -> Dict:
  '''
  Input:
    train_dataset: Training dataset. 

    Don't tokenize test_set: later on, out-of-vocabulary words
       will be treated as <unk> tokens.

    NOTE: Tokens MUST be lowercased (.lower()) before being added.

  Output:
    LUT[Dict]: {
      word1: 0,
      word2: 1,
        ...
      wordn: n - 1
    }

  '''

  pass

  
LUT = create_tokens_lut(train_loader)
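
As a rough illustration of the lookup table described above, here is a minimal sketch that runs on a toy list of `(words, labels)` pairs instead of the real loader. The name `build_token_lut` and the toy data are assumptions made for this example, not part of the notebook; with the real loader you would probably index it with `range(len(...))` as the dataset check does.

```python
from typing import Dict, Iterable, List, Tuple


def build_token_lut(dataset: Iterable[Tuple[List[str], List[str]]]) -> Dict[str, int]:
    """Map each lowercased training token to a consecutive integer id."""
    lut: Dict[str, int] = {}
    for words, _labels in dataset:
        for word in words:
            token = word.lower()
            if token not in lut:
                # Ids are assigned in order of first appearance: 0, 1, 2, ...
                lut[token] = len(lut)
    return lut


# Toy stand-in for the training loader (hypothetical data).
toy_train = [
    (["Dilluns", "a", "5", "rebere", "de", "Hyacinto", "Boneu"],
     ["other", "other", "other", "other", "other", "name", "surname"]),
    (["fill", "de", "Juan", "Boneu"],
     ["other", "other", "name", "surname"]),
]

toy_lut = build_token_lut(toy_train)
print(toy_lut)
```

Repeated tokens such as `de` and `Boneu` keep the id they received the first time, so the ids stay dense.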

In [None]:
def check_oov_words(LUT = LUT, test_set = test_loader):
  pass

print(list(check_oov_words()))
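
One possible way to list out-of-vocabulary words is a generator that yields every lowercased test token missing from the table. This is only a sketch on toy data; `iter_oov_words`, `toy_lut` and `toy_test` are hypothetical names introduced for the example.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def iter_oov_words(
    lut: Dict[str, int],
    dataset: Iterable[Tuple[List[str], List[str]]],
) -> Iterator[str]:
    """Yield each lowercased token of `dataset` that is absent from `lut`, once."""
    seen = set()
    for words, _labels in dataset:
        for word in words:
            token = word.lower()
            if token not in lut and token not in seen:
                seen.add(token)
                yield token


# Toy vocabulary and test samples (hypothetical data).
toy_lut = {"rebere": 0, "de": 1, "juan": 2}
toy_test = [
    (["Rebere", "de", "Bernat", "Torres"], ["other", "other", "name", "surname"]),
    (["de", "Bernat"], ["other", "name"]),
]

oov = list(iter_oov_words(toy_lut, toy_test))
print(oov)
```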


Because some modules compute in batches, sequences must have a constant length. As is common practice, we will create three new tokens: ```<bos>``` and ```<eos>``` for the start and the end of a given sequence, and ```<unk>``` for unknown tokens at application (test) time, which also serves as the padding token during training. Manually add those tokens to the ```LUT```. 
 

Under those constraints, fill in the corresponding functions that will post-process each batch. Feel free to write more post-processing functions if you need them.





In [None]:
# Existing ids run from 0 to len(LUT) - 1, so len(LUT) is the next free id.
LUT['<unk>'] = len(LUT)
LUT['<bos>'] = len(LUT)
LUT['<eos>'] = len(LUT)

In [None]:
MAX_SEQUENCE_LENGTH = 50
def complete_seq(X) -> List[List]:
  
  '''

    Input: 
      X: A batch of N sequences [
        [word1, ..., wordn],
        [word1, ..., wordm]
      ]

    Output:
      A batch of N sequences with MAX_SEQUENCE_LENGTH tokens.
        - The starting token will always be <bos>
        - The last 'real' token will be <eos>
        - Positions after <eos> up to MAX_SEQUENCE_LENGTH will be filled with <unk> as padding.

  '''
  pass

def post_process(X, functions = [complete_seq,]):
  for f in functions: X = f(X)
  return X
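
The padding rule in the docstring could look roughly like the sketch below. It works on string tokens and uses a small hypothetical `MAX_LEN`; truncating sequences that are too long is an assumption of this example, since the notebook does not say how to handle them.

```python
from typing import List

MAX_LEN = 8  # hypothetical value for the example; the notebook uses MAX_SEQUENCE_LENGTH = 50


def pad_batch(
    batch: List[List[str]],
    max_len: int = MAX_LEN,
    bos: str = "<bos>",
    eos: str = "<eos>",
    pad: str = "<unk>",
) -> List[List[str]]:
    """Wrap each sequence in <bos>/<eos> and pad it to exactly `max_len` tokens."""
    if max_len < 2:
        raise ValueError("max_len must leave room for <bos> and <eos>")
    padded = []
    for seq in batch:
        # Assumption: sequences that do not fit are truncated to make room for <bos> and <eos>.
        body = list(seq[: max_len - 2])
        out = [bos] + body + [eos]
        out += [pad] * (max_len - len(out))
        padded.append(out)
    return padded


toy_batch = [["rebere", "de"], list("abcdefghij")]
result = pad_batch(toy_batch, max_len=6)
print(result)
```

The same function works on integer ids if `bos`, `eos` and `pad` are replaced by their entries in the lookup table.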

## NER - Baseline Approach

The first approach we will try is based on estimating, for each word in our training corpus, the probability of each category. The prediction for a word is then its most likely category.

Compute the category predictions for the test set and measure the performance of this simple model.

In [None]:
def compute_emissions_dict(dataloader) -> Dict:
  '''

    Given the train loader ```dataloader```
     this function will compute the max likelihood dictionary for each word.

  Input:
    dataloader: train loader with EsposallesTextDataset
  
  Outputs:
    Dict: {
      pagès: {name: X, occupation: X}, # REMEMBER TO LOWERCASE YOUR TOKENS!
            ...
      LUT - wordn: {category: x%, ...}
    }

  '''

  pass

In [None]:
priors = compute_emissions_dict(train_loader)

At this point, as an example, your emissions dictionary should yield the following emissions:

$P(\text{location} \mid \texttt{Prats}) \approx 18\%$

$P(\text{surname} \mid \texttt{Prats}) \approx 73\%$

$P(\text{other} \mid \texttt{Prats}) \approx 9\%$

In [None]:
priors['prats']

{'location': 0.1818, 'surname': 0.7273, 'other': 0.0909}
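
For reference, relative-frequency estimates of this kind can be sketched as follows. The function name `estimate_emissions` and the toy corpus are assumptions; the toy counts (8, 2 and 1 out of 11) were chosen only because they give the same proportions as the output above, and the real corpus counts may differ.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def estimate_emissions(
    dataset: Iterable[Tuple[List[str], List[str]]],
) -> Dict[str, Dict[str, float]]:
    """Estimate P(label | word) as the relative frequency of each label per lowercased word."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for words, labels in dataset:
        for word, label in zip(words, labels):
            counts[word.lower()][label] += 1
    return {
        word: {label: n / sum(label_counts.values()) for label, n in label_counts.items()}
        for word, label_counts in counts.items()
    }


# Hypothetical toy corpus: 8 surnames, 2 locations and 1 "other" for the same word.
toy_train = (
    [(["Prats"], ["surname"])] * 8
    + [(["Prats"], ["location"])] * 2
    + [(["prats"], ["other"])]
)

toy_priors = estimate_emissions(toy_train)
print({label: round(p, 4) for label, p in toy_priors["prats"].items()})
```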

This method is limited by its lack of context and, therefore, its low expressiveness. 

The following function will compute the confusion matrix for the predictions in the ```test_set``` in order to find the most problematic words. 

* What do they all have in common? 
* Which kinds of words perform worst?
* What's your solution for out-of-vocabulary words? Can you provide a prediction for those?
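
As one possible answer to the last question, the sketch below predicts the most likely category for known words and falls back to the most frequent training label for out-of-vocabulary words. Both the fallback strategy and the names `predict_labels` and `most_frequent_label` are assumptions made for illustration; other fallbacks (for example, guessing from capitalization) may work better.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def most_frequent_label(dataset: Iterable[Tuple[List[str], List[str]]]) -> str:
    """Return the label that occurs most often in the training data."""
    counts = Counter(label for _words, labels in dataset for label in labels)
    return counts.most_common(1)[0][0]


def predict_labels(
    emissions: Dict[str, Dict[str, float]],
    sentences: Iterable[List[str]],
    fallback_label: str,
) -> List[List[str]]:
    """Pick the most likely label per word; unknown words get `fallback_label`."""
    predictions = []
    for words in sentences:
        sentence_pred = []
        for word in words:
            dist = emissions.get(word.lower())
            if dist:
                sentence_pred.append(max(dist, key=dist.get))
            else:
                sentence_pred.append(fallback_label)  # out-of-vocabulary word
        predictions.append(sentence_pred)
    return predictions


# Hypothetical toy data.
toy_train = [
    (["de", "Eduard", "pages"], ["other", "name", "occupation"]),
    (["y", "de"], ["other", "other"]),
]
toy_emissions = {
    "pages": {"occupation": 0.9, "surname": 0.1},
    "eduard": {"name": 1.0},
}

fallback = most_frequent_label(toy_train)
toy_predictions = predict_labels(toy_emissions, [["casament", "Eduard", "pages"]], fallback)
print(toy_predictions)
```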

In [None]:
def predict_test_set(emissions, test_set):
  
  '''

  s: casament eduard pages

  prediction: [None, name, occupation]

  Important: 
    Remember to check if you can provide a label for each word (OOVs?).
    What's your solution for those you cannot classify? Justify.

      tip: be as creative as you want.

  '''
  predictions = []
  pass


predictions = predict_test_set(priors, test_loader)

def find_common_errors(x_test: List[List], y_pred: List[List], y_true: List[List]) -> Dict:

  '''
    Input: 
      x_test: A list with each sample in the corpus with the words for which we
          ran each prediction
          [
           ['lorem', 'ipsum', 'dolor', 'sit', 'amet'],
           ['Hello', 'world', '!!!'],
          ]
      
      y_pred: A list with the predicted labels for each word in x_test corpus.
          [
           ['1', '0', '0', '1', '2'],
           ['2', '1', '0'],
          ]
      
      y_true: Ground-truth (GT) labels for each x_test sample
          [
           ['0', '0', '0', '1', '2'],
           ['0', '1', '0'],
          ]

    Output:
      {
        pages: [{'pred': prediction, 'gt': label}, {'pred': prediction, 'gt': label}, ...]
      }
  '''

  errors = {}
  pass

def compute_token_precision(x_test: List[List], y_pred: List[List], y_true: List[List]):
    pass



In [None]:
find_common_errors([test_loader[idx][0] for idx in range(len(test_loader))], predictions, [test_loader[idx][1] for idx in range(len(test_loader))])['Esteva']
compute_token_precision([test_loader[idx][0] for idx in range(len(test_loader))], predictions, [test_loader[idx][1] for idx in range(len(test_loader))])

0.8834888960411973

Analyze the results for the common errors.
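
To make the analysis concrete, here is a small sketch that groups mismatches by word and computes the fraction of correctly labeled tokens, using the toy example from the docstring. The names `collect_errors` and `token_accuracy` are hypothetical, and reading "token precision" as token-level accuracy is an assumption.

```python
from collections import defaultdict
from typing import Dict, List


def collect_errors(
    x_test: List[List[str]],
    y_pred: List[List[str]],
    y_true: List[List[str]],
) -> Dict[str, List[Dict[str, str]]]:
    """Group every wrong prediction by the word (as written) it was made for."""
    errors = defaultdict(list)
    for words, preds, golds in zip(x_test, y_pred, y_true):
        for word, pred, gold in zip(words, preds, golds):
            if pred != gold:
                errors[word].append({"pred": pred, "gt": gold})
    return dict(errors)


def token_accuracy(y_pred: List[List[str]], y_true: List[List[str]]) -> float:
    """Fraction of tokens whose predicted label equals the ground truth."""
    total = sum(len(golds) for golds in y_true)
    correct = sum(
        pred == gold
        for preds, golds in zip(y_pred, y_true)
        for pred, gold in zip(preds, golds)
    )
    return correct / total if total else 0.0


x_toy = [["lorem", "ipsum", "dolor", "sit", "amet"], ["Hello", "world", "!!!"]]
pred_toy = [["1", "0", "0", "1", "2"], ["2", "1", "0"]]
true_toy = [["0", "0", "0", "1", "2"], ["0", "1", "0"]]

toy_errors = collect_errors(x_toy, pred_toy, true_toy)
toy_accuracy = token_accuracy(pred_toy, true_toy)
print(toy_errors, toy_accuracy)
```

Sorting the error dictionary by the length of each list gives a quick view of the words that fail most often.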

## Conclusions

Here, write a brief conclusion for this notebook, referring to the main differences, advantages and disadvantages of each approach.




> Your conclusions here

