<a href="https://colab.research.google.com/github/Neilus03/NLP-2023/blob/main/%5BGIA%5D_%5BNLP%5D_Seq_Labeling_and_NER_with_HMM_and_CRF_DANI_%26_NEIL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Daniel Vidal, NIU: 1634599
# Neil de la Fuente, NIU: 1630223



# Sequence Labeling With CRF and HMM - NER

The extraction of relevant information from historical handwritten document collections is one of the key steps in order to make these manuscripts available for access and searches. In this context, instead of a pure transcription, the objective is to move towards document understanding. Concretely, the aim is to detect the named entities and assign each of them a semantic category, such as family names, places, occupations, etc.


A typical application scenario of named entity recognition is demographic documents, since they contain people's names, birthplaces, occupations, etc. In this scenario, the extraction of the key contents and its storage in databases allows the access to their contents and envision innovative services based in genealogical, social or demographic searches.

<p style = 'text-align: center'>
<img src = "http://dag.cvc.uab.es/wp-content/uploads/2016/07/esposalla_detall.jpg">
</p>

For further doubts and questions, refer to oriol.ramos@uab.cat and alicia.fornes@uab.cat.

Usage of Google Colab is not mandatory, but highly recommended as most of the behaviors are expected for a Linux VM with IPython bindings.

## First, we will install the unmet dependencies.

This will download some packages and the required data, it may take a while.

In [1]:
#@title 
from IPython.display import clear_output

!git clone https://github.com/EauDeData/nlp-resources
!cp -r nlp-resources/ resources/
!rm -rf nlp-resources/


!pip install nltk 
!pip install git+https://github.com/MeMartijn/updated-sklearn-crfsuite
clear_output()

from typing import * 

from itertools import chain
import nltk
import numpy as np
import copy
import random
from collections import Counter

import pycrfsuite as crfs
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer

from resources.data.dataloaders import EsposallesTextDataset

## Data Curation
Loading the dataset - From here you could re-use your previous work.

In [2]:
random.seed(42)
train_loader = EsposallesTextDataset('resources/data/esposalles/') 
test_loader = copy.deepcopy(train_loader)
test_loader.test()

Example of data from each loader:
> Format string: ```word```:```label```


In [3]:
print([f"{x}:{y}" for x,y in zip(*train_loader[0])])
print([f"{x}:{y}" for x,y in zip(*test_loader[0])])

['Dilluns:other', 'a:other', '5:other', 'rebere:other', 'de:other', 'Hyacinto:name', 'Boneu:surname', 'hortola:occupation', 'de:other', 'Bara:location', 'fill:other', 'de:other', 'Juan:name', 'Boneu:surname', 'parayre:occupation', 'defunct:other', 'y:other', 'de:other', 'Maria:name', 'ab:other', 'Anna:name', 'donsella:state', 'filla:other', 'de:other', 't:name', 'Cases:surname', 'pages:occupation', 'de:other', 'Bara:location', 'defunct:other', 'y:other', 'de:other', 'Peyrona:name']
['Divendres:other', 'a:other', '18:other', 'rebere:other', 'de:other', 'Juan:name', 'Torres:surname', 'pages:occupation', 'habitant:other', 'en:other', 'Sabadell:location', 'fill:other', 'de:other', 'Bernat:name', 'Torres:surname', 'pages:occupation', 'de:other', 'Moya:location', 'bisbat:location', 'de:location', 'Vich:location', 'y:other', 'de:other', 'Antiga:name', 'defucts:other', 'ab:other', 'Margarida:name', 'donsella:state', 'filla:other', 'de:other', 'Juan:name', 'Argemir:surname', 'pages:occupation',

If the dataset is correctly downloaded you will see two different samples above, and both tests passed below.

In [4]:
#@title

# Dataset ckeck

for idx in range(len(train_loader)):

  x, y = train_loader[idx]
  if len(x) != len(y): 
    print('train_set test not passed')
    break

else: print('train_set test passed')

for idx in range(len(test_loader)):

  x, y = test_loader[idx]
  if len(x) != len(y):
    print('test_set test not passed')
    break

else: print('test_set test passed')

train_set test passed
test_set test passed


Since most of the computation won't be done with strings, the following function will create a Look Up Table (LUT) that transforms string tokens into ```int``` tokens. 

In [5]:
def create_tokens_lut(train_dataset) -> Dict:
    '''
    Input:
        train_dataset: Training dataset. 

        Don't tokenize test_set as later on,
        we will be considering out-of-vocabulary words as <unk> tokens.

        NOTE: Tokens MUST be lowered (.lower()) before considering them. 

    Ouput:
        LUT[Dict]: {
        word1: 0,
        word2: 1,
            ...
        wordn: n - 1
        }

    '''
    token_to_id = {}
    current_id = 0
    
    for i in range(len(train_dataset)):
        for word, label in zip(*train_dataset[i]):
            token = word.lower()  # Lowercase the token

            if token not in token_to_id:
                token_to_id[token] = current_id # Each new token is saved with the corresponding ID
                current_id += 1

    return token_to_id

LUT = create_tokens_lut(train_loader)
LUT

{'dilluns': 0,
 'a': 1,
 '5': 2,
 'rebere': 3,
 'de': 4,
 'hyacinto': 5,
 'boneu': 6,
 'hortola': 7,
 'bara': 8,
 'fill': 9,
 'juan': 10,
 'parayre': 11,
 'defunct': 12,
 'y': 13,
 'maria': 14,
 'ab': 15,
 'anna': 16,
 'donsella': 17,
 'filla': 18,
 't': 19,
 'cases': 20,
 'pages': 21,
 'peyrona': 22,
 'dit': 23,
 'dia': 24,
 'bernat': 25,
 'call': 26,
 'st': 27,
 'esteva': 28,
 'palau': 29,
 'tordera': 30,
 'guille': 31,
 'catherina': 32,
 'guillem': 33,
 'boix': 34,
 'la': 35,
 'mora': 36,
 'bisbat': 37,
 'vich': 38,
 'margarida': 39,
 'jua': 40,
 'perega': 41,
 'comissari': 42,
 'real': 43,
 'habitant': 44,
 'en': 45,
 'viudo': 46,
 'gratia': 47,
 'carbonell': 48,
 'negociant': 49,
 'defuncts': 50,
 'dijous': 51,
 '10': 52,
 'montserrat': 53,
 'gibert': 54,
 'treballador': 55,
 'traginer': 56,
 'mar': 57,
 'janota': 58,
 'magdalena': 59,
 'sabater': 60,
 'dissapte': 61,
 '26': 62,
 'jaume': 63,
 '#': 64,
 'oller': 65,
 'andreu': 66,
 'pa': 67,
 'lomar': 68,
 'eularia': 69,
 'pere': 

In [6]:
def check_oov_words(LUT = LUT, test_set = test_loader):
    # Out of vocabulary list
    oov_words = []

    for i in range(len(test_set)):
        for word, label in zip(*test_set[i]):
            token = word.lower()  # Lowercase the token

            if token not in LUT:
                oov_words.append(token) # Add the tokens of the test dataloader to the list if they don't belong to the LUT

    return oov_words
            
print(list(check_oov_words()))


['argemir', 'nyella', 'rianna', 'bonastra', 'angli', 'victo', 'theodora', 'payas', 'monllor', 'rotxe', 'moreno', 'islla', 'brasil', 'gusman', 'more', 'islla', 'sto', 'thome', 'habitants', 'melcior', 'bachs', 'pachs', 'plans', 'begas', 'rius', 'galeras', 'box', 'deseny', 'francesa', 'ortiz', 'faneca', 'faneca', 'sobrevila', 'sobrevila', 'cabrer', 'carantela', 'mso', 'tibau', 'monblanch', 'tibau', 'broquets', 'noguera', 'gassull', 'llondra', 'quart', 'pallissa', 'vivints', 'cabus', 'felis', 'buyra', 'scrivent', 'castigaleu', 'comptat', 'ribagossa', 'fontanilles', 'conteso', 'tatare', 'idrach', 'peramon', 'peramon', 'terre', 'arisart', 'arisart', 'faja', 'pinya', 'faliu', 'campanya', 'faliu', 'majol', 'majol', 'guardi#', 'debarca', 'constansa', 'llorenci', 'caxaler', 'llorenci', 'aguller', 'poses', 'miro', 'darder', 'darder', 'garces', 'imaginayre', 'juan#', 'villaro', 'vilademaser', 'muntells', 'muntells', 'sengermes', 'cebriana', 'tamuyell', 'cabaner', 'manader', 'manader', 'campprecios

Due to batch computation of some modules, sequences must have constant length. As a common practice, we will create three new tokens ```<bos>``` and ```<eos>``` for the start and the end of a given sequence and ```<unk>``` for unkown tokens in the application (test) layer or 0 padding during the training. Manually add those tokens to the ```LUT```. 
 

Under those constraints, fill the corresponding functions that will post-process each batch. Feel free to code more post-processing functions if you need it.





In [7]:
LUT['<unk>'] = len(LUT) + 1
LUT['<bos>'] = len(LUT) + 1
LUT['<eos>'] = len(LUT) + 1

In [8]:
MAX_SEQUENCE_LENGTH = 50
def complete_seq(X) -> List[List]:
    '''

        Input: 
        X: A batch of N sequences [
            [word1, ..., wordn],
            [word1, ..., wordm]
        ]

        Output:
        A batch of N sequences with MAX_SEQUENCE_LENGTH tokens.
            - The starting token will always be <sos>
            - The last 'real' token <eos>
            - Tokens from <eos> until MAX_SEQUENCE_LENGTH will be <unk> as 0 padding.

    '''
    complete_seq = []
    for seq in X:
        seq = ['<bos>'] + seq + ['<eos>']
        seq += ['<unk>'] * (MAX_SEQUENCE_LENGTH - len(seq))
        complete_seq.append(seq[:MAX_SEQUENCE_LENGTH])
    return complete_seq

def post_process(X, functions = [complete_seq,]):
  for f in functions: X = f(X)
  return X

## NER - Baseline Approach

The first approach we will try is based on computing the probabilities for each word in our training corpus. This means computing the most likely category for each word in the dictionary.

Compute the test categories predictions and measure the performance for this simple model.

In [9]:
def compute_emissions_dict(dataloader) -> Dict:
    '''

        Given the train loader ```dataloader```
        this function will compute the max likelihood dictionary for each word.

    Input:
        dataloader: train loader with EsposallesTextDataset
    
    Outputs:
        Dict: {
        pagès: {name: X occupation: X}, # REMEMBER TO LOWER YOUR TOKENS!
                ...
        LUT - wordn: {category: x%, ...}
        }

    '''
    emissions_dict = {}
    
    token_counts = Counter()
    label_counts = Counter()
    token_label_counts = Counter()
    

    for i in range(len(dataloader)):
        for word,label in zip(*dataloader[i]):
            token = word.lower()
            token_counts[token] += 1
            label_counts[label] += 1
            token_label_counts[(token,label)] += 1
    
    for token, label_count in token_label_counts.items():
        token, label = token
        if token not in emissions_dict:
            emissions_dict[token] = {}
        emissions_dict[token][label] = label_count / token_counts[token]

    return emissions_dict


In [10]:
priors = compute_emissions_dict(train_loader)

At this point, as an example, your emissions dictionary should yield the following emission:

$P(location |$ ```Prats``` $) = 18\%$

$P(surname |$ ```Prats``` $) = 72\%$

$P(other |$ ```Prats``` $) = 9\%$

In [11]:
priors['prats']

{'location': 0.18181818181818182,
 'surname': 0.7272727272727273,
 'other': 0.09090909090909091}

This method has its limitations in terms of lack of context and, therefore, low expresivity. 

The following function will compute the confusion matrix for the predictions in the ```test_set``` in order to find the most problematic words. 

* What do they all have in common? 
* What kind of words are the least performers?
* What's your solution for out-of-vocabulary words? Can you provide a prediction for those?


1. What do they all have in common?

The problematic words in the baseline model often have commonalities such as sharing similar features or having ambiguous meanings. These words might have multiple possible tags, making it difficult for the model to correctly predict their tags without considering context.

2. What kind of words are the least performers?

The least performing words are generally those that have multiple possible tags or meanings depending on the context in which they are used. For example, words like "bank" could be an organization (B-ORG) or a location (B-LOC) depending on the context. These words are challenging for the baseline model since it does not take context into account when making predictions.

3. What's your solution for out-of-vocabulary words?

For this problem we decide to use the most frequent label for OOV words.

In [15]:
def predict_test_set(emissions, test_set):
    
    '''

    s: casament eduard pages

    prediccio: [None, nom, ofici]

    Important: 
        Remember to check if you can provide a label for each word (OOVs?).
        What's your solution for those you cannot classify? Justify.

        tip: be as creative as you want.

    '''
    predictions = []
    for i in range(len(test_set)):
            seq_predictions = []
            for word, label in zip(*test_set[i]):
                token = word.lower()
                if token in emissions:
                    pred_label = max(emissions[token], key=emissions[token].get)
                else:
                    # Use the most frequent label for OOV words (simple but effective)
                    pred_label = 'other'
                seq_predictions.append(pred_label)

            predictions.append(seq_predictions)

    return predictions


predictions = predict_test_set(priors, test_loader)

def find_common_errors(x_test: List[List], y_pred: List[List], y_true: List[List]) -> Dict:
    '''
        Input: 
        x_test: A list with each sample in the corpus with the words for which we
    ran each prediction
            [
            ['lorem', 'ipsum', 'dolor', 'sit', 'amet'],
            ['Hello', 'world', '!!!'],
            ]
        
        y_pred: A list with the predicted labels for each word in x_test corpus.
            [
            ['1', '0', '0', '1', '2'],
            ['2', '1', '0'],
            ]
        
        y_true: GT for the x_test sample
            [
            ['0', '0', '0', '1', '2'],
            ['0', '1', '0'],
            ]
        {
        pages: [{'pred': prediction, gt: label}, {'pred': prediction, 'gt': label}, ...]
        }
    '''
    x_test_words = [word for List in x_test for word in List] #this would be equivalent to:     x_test_words = []
                                                              #                                 for List in x_test:
                                                              #                                     for word in list:
                                                              #                                         x_test_words.append(word)
    y_pred_labels = [label for List in y_pred for label in List]
    y_true_labels = [label for List in y_true for label in List]

    errors = {word: {"pred": pred, 'gt': gt} for word, pred, gt in zip(x_test_words, y_pred_labels, y_true_labels) if pred != gt}

    return errors

def compute_token_precision(x_test: List[List], y_pred: List[List], y_true: List[List]):
    
    errors = find_common_errors(x_test, y_pred, y_true)
   
    x_test_words = [word for List in x_test for word in List]
    error_rate = len(errors)/len(x_test_words)
    
    return 1 - error_rate

In [16]:
find_common_errors([test_loader[idx][0] for idx in range(len(test_loader))], predictions, [test_loader[idx][1] for idx in range(len(test_loader))])['Esteva']
compute_token_precision([test_loader[idx][0] for idx in range(len(test_loader))], predictions, [test_loader[idx][1] for idx in range(len(test_loader))])

0.9308014161570647

## Conclusions

Here, write a brief conclusion for this notebook reffering to the main differences, advantages and disadvantages for each approach.


> Your conclusions here

After analyzing the given code, we can draw the following conclusions:

The task at hand is Named Entity Recognition (NER) and sequence labeling in historical handwritten documents. The goal is to extract information from these documents and categorize the entities into semantic categories such as family names, places, occupations, etc.

The code uses a simple baseline approach to predict the named entity categories, which involves computing the most likely category for each word based on the training data.

The baseline approach does not take context into consideration and is limited in its expressiveness. It also faces challenges in handling out-of-vocabulary (OOV) words. The current solution for OOV words is to assign them the most frequent label, which may not be the best approach in all cases.

The confusion matrix and common errors are computed to analyze the model's performance. The most problematic words are identified and analyzed for any common patterns or characteristics.

Based on these conclusions, the baseline approach can be improved upon by incorporating context-aware methods or other techniques for handling OOV words. These improvements could potentially lead to better performance and more accurate predictions for the NER task in historical handwritten documents.

## HMM Approach

As demonstrated in the previous experiment, using just the priors have not enough expresivity for managing both out of vocabulary words and polysemic words. Here we will use the ```python-crfsuite``` module to build a Hidden Markov Model and improve the predictions on ```test_set```.

Check <a href = 'https://python-crfsuite.readthedocs.io/en/latest/'>here</a> the  ```python-crfsuite``` documentation.

First, we will set up the parameters for our CRF model.


In [17]:
def get_word_to_hmm_features(sent, i):


    '''
     Reminder: 
        The Markov assumption states that the transition for the i-th token
          depends on the (i-1)-th token. 


    '''
    word, _ = sent[i]
    #emission probilities
    features = [
        'bias',
        'word.lower=' + word.lower(),
    ]
    if i == 0:
        features.append('bos')

    if (i == len(sent) - 1):
        features.append('eos')
                
    return features


def sent2HMMfeatures(sent):
    return [get_word_to_hmm_features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, label in sent]

def sent2tokens(sent):
    return [token for token, label in sent]    



In [18]:
# transform the dataset
# to the (token, gt) tuple format
train_sents = [  [(x,y) for x,y in zip(*train_loader[idx])] for idx in range(len(train_loader))]
test_sents =  [  [(x,y) for x,y in zip(*train_loader[idx])] for idx in range(len(test_loader))]

In [19]:
X_train = [sent2HMMfeatures(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2HMMfeatures(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

In [20]:
trainer = crfs.Trainer(verbose=False) # Instance a CRF trainer

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq) # Stack the data

In [21]:
trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # Max Number of iterations for the iterative algorithm

    # include transitions that are possible, but not observed (smoothing)
    'feature.possible_transitions': True
})

In [22]:
%%time
trainer.train('npl_ner_crf.crfsuite') # Train the model and save it locally.

CPU times: user 396 ms, sys: 3.21 ms, total: 399 ms
Wall time: 403 ms


In [23]:
tagger = crfs.Tagger()
tagger.open('npl_ner_crf.crfsuite') # Load the inference API

<contextlib.closing at 0x7f4601ed8040>

In [24]:
example_sent = test_sents[0]
print(' '.join(sent2tokens(example_sent)), end='\n\n')

print("Predicted:", ' '.join(tagger.tag(sent2HMMfeatures(example_sent))))
print("Correct:  ", ' '.join(sent2labels(example_sent))) # Inference

Dilluns a 5 rebere de Hyacinto Boneu hortola de Bara fill de Juan Boneu parayre defunct y de Maria ab Anna donsella filla de t Cases pages de Bara defunct y de Peyrona

Predicted: other other other other other name surname occupation other location other other name surname occupation other other other name other name state other other name surname occupation other location other other other name
Correct:   other other other other other name surname occupation other location other other name surname occupation other other other name other name state other other name surname occupation other location other other other name


In [25]:
from sklearn.preprocessing import LabelBinarizer
def bio_classification_report(y_true, y_pred):
    """

    Classification report.
    You can use this as evaluation for both in the baseline model and new model.

    """
    lb = LabelBinarizer()
    y_true_combined = lb.fit_transform(list(chain.from_iterable(y_true)))
    y_pred_combined = lb.transform(list(chain.from_iterable(y_pred)))
        
    tagset = set(lb.classes_) - {'O'}
    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
    class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}
    
    return classification_report(
        y_true_combined,
        y_pred_combined,
        labels = [class_indices[cls] for cls in tagset],
        target_names = tagset,
    )

In [26]:
# Compute the predictions 
y_pred = [tagger.tag(xseq) for xseq in X_test]
y_pred

[['other',
  'other',
  'other',
  'other',
  'other',
  'name',
  'surname',
  'occupation',
  'other',
  'location',
  'other',
  'other',
  'name',
  'surname',
  'occupation',
  'other',
  'other',
  'other',
  'name',
  'other',
  'name',
  'state',
  'other',
  'other',
  'name',
  'surname',
  'occupation',
  'other',
  'location',
  'other',
  'other',
  'other',
  'name'],
 ['other',
  'other',
  'other',
  'other',
  'name',
  'surname',
  'occupation',
  'other',
  'location',
  'location',
  'location',
  'location',
  'location',
  'other',
  'other',
  'name',
  'surname',
  'occupation',
  'other',
  'other',
  'name',
  'other',
  'other',
  'name',
  'state',
  'other',
  'other',
  'name',
  'surname',
  'occupation',
  'other',
  'location',
  'location',
  'location',
  'location',
  'location',
  'other',
  'other',
  'name'],
 ['other',
  'other',
  'other',
  'other',
  'name',
  'surname',
  'occupation',
  'occupation',
  'other',
  'other',
  'location',
  'st

* Use the  ```bio_classification_report``` function in both the Baseline model and the HMM model. Do you observe any improvement? In which cases does it still fail?



In [27]:
baseline_predictions = predict_test_set(priors, test_loader)
baseline_true = [labels for _, labels in [test_loader[idx] for idx in range(len(test_loader))]]

hmm_predictions = y_pred


In [28]:
print("Baseline model classification report:")
print(bio_classification_report(baseline_true, baseline_predictions))

print("\nHMM model classification report:")
print(bio_classification_report(y_test, hmm_predictions))


Baseline model classification report:
              precision    recall  f1-score   support

    location       0.94      0.71      0.81       462
        name       0.94      0.95      0.94       494
  occupation       0.94      0.85      0.90       294
       other       0.85      0.99      0.91      1493
       state       0.98      0.95      0.96       113
     surname       0.84      0.47      0.60       251

   micro avg       0.89      0.89      0.89      3107
   macro avg       0.91      0.82      0.85      3107
weighted avg       0.89      0.89      0.88      3107
 samples avg       0.89      0.89      0.89      3107


HMM model classification report:
              precision    recall  f1-score   support

    location       0.95      0.95      0.95       416
        name       0.98      0.98      0.98       487
  occupation       0.93      0.97      0.95       296
       other       0.99      0.98      0.98      1486
       state       0.97      0.98      0.97       114
     s

We can notice many improvements in the HMM model because the baseline model is a simple approach that primarily relies on emission probabilities, which are the probabilities of observing a particular word given a label. While this approach may work well for some cases, it still has several limitations:

1. **Lack of context:** The baseline model does not consider the context in which a word appears. It only considers the individual word and its relationship with the label. This can lead to incorrect predictions when the meaning of a word or its role in a sentence depends on the surrounding words or the overall context.

2. **Out-of-vocabulary words:** The baseline model may perform poorly on out-of-vocabulary (OOV) words, i.e., words that are not present in the training set. For OOV words, the model assigns the most frequent label for all such words, which is a simplistic approach and may not always be accurate.

3. **Label transitions:** The baseline model does not take into account the transitions between labels, which may provide valuable information for Named Entity Recognition (NER) tasks. For instance, certain labels may be more likely to follow or precede specific labels, but the baseline model does not leverage this information.

4. **No higher-order dependencies:** The baseline model doesn't consider higher-order dependencies between words and labels in a sentence. For example, it can't capture relationships between non-adjacent words, which might be important for determining the correct label.

Quantitative and Qualitative analysis


Generally, the HMM model is expected to outperform the Baseline model as it takes into account the dependencies between the tags in the sequence. However, there might still be cases where the HMM model fails. These can include:

1. Ambiguity in the data: If there is ambiguity in the data itself or if there are inconsistencies in the way the data has been annotated, both the Baseline and HMM models may struggle to perform well in these cases.

2. Rare or unseen words/entities: The HMM model may fail in cases where it encounters rare or previously unseen words/entities during testing. In such cases, the model may not have enough information to accurately predict the correct tags.

3. Complex dependencies between tags: If there are complex dependencies between tags in the sequence that the HMM model is unable to capture, it may still fail to make accurate predictions.

To better understand the specific cases where the HMM model fails, you can analyze the confusion matrix and identify the most common misclassifications. This can give you insights into potential areas of improvement for your model or the need for additional training data.

* Can this model provide a solution for out-of-vocabulary words?
* Can you provide examples of words which changed its category compared to the max-likelihood prior when introducing context? See the following example.

$P($ ```Noun |people``` $) = 80\%$ 

$P($ ``` people, Noun | "a  planet" ``` $) = 5\%$

$P($ ``` people, Verb | "a  planet" ``` $) = 90\%$

* How does it perform with respect to the out-of-vocabulary words? e.g. what's the precision for those?

*  The following function shows the less likely and most likely transitions. Comment them and perform a deep analysis on each transition, do they have something in common with the errors you found?

The Hidden Markov Model (HMM) can partially address the issue of out-of-vocabulary (OOV) words, but not entirely. One way to handle OOV words in HMMs is to assign them an average or uniform emission probability across all labels. While this approach is better than the baseline model, which assigns the most frequent label, it's still not perfect, as it doesn't capture the true relationship between the OOV word and the labels.



Regarding the second part of your question, here's an example illustrating how context can influence label predictions when compared to the max-likelihood prior:

Consider the sentence: "John visited San Francisco and loved the Golden Gate Bridge."

Without context, the max-likelihood prior might label "Golden" and "Gate" as follows:

"**Golden**" - B-MISC (miscellaneous entities)
"**Gate**" - O (no entity)
These labels are based on the highest emission probability for the individual words, ignoring the context.

However, when we introduce context with an HMM, we can capture the relationship between adjacent words and labels, and we might get the following labels:

"**Golden**" - B-LOC (location)
"**Gate**" - I-LOC (location)
In this case, the HMM is able to recognize that "Golden Gate" together forms a location entity, "Golden Gate Bridge."

In [29]:
from collections import Counter
info = tagger.info()

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(info.transitions).most_common(15))

print("\nTop unlikely transitions:")
print_transitions(Counter(info.transitions).most_common()[-15:])

Top likely transitions:
surname -> occupation 5.019786
location -> location 4.249178
occupation -> occupation 4.194908
name   -> surname 3.918272
surname -> surname 3.294527
other  -> name    2.642988
name   -> name    2.280106
other  -> location 1.814986
occupation -> other   1.799188
state  -> state   1.515564
name   -> state   1.182575
other  -> other   1.181984
state  -> other   0.735261
occupation -> state   0.467426
location -> other   0.403282

Top unlikely transitions:
state  -> occupation 0.214062
surname -> state   0.070542
location -> occupation -0.072833
name   -> other   -0.506374
occupation -> name    -0.617364
other  -> surname -0.626089
occupation -> location -0.735001
surname -> name    -0.998662
other  -> occupation -1.030786
other  -> state   -1.296538
name   -> occupation -1.650451
occupation -> surname -2.111763
name   -> location -2.166658
location -> surname -2.771394
location -> name    -3.351583


In [30]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-6s %s" % (weight, label, attr))    

print("Top positive:")
print_state_features(Counter(info.state_features).most_common(20))

print("\nTop negative:")
print_state_features(Counter(info.state_features).most_common()[-20:])

Top positive:
10.471530 state  word.lower=viudo
9.201264 state  word.lower=donsella
9.171407 other  word.lower=ab
9.073835 other  word.lower=fill
9.043569 other  word.lower=defuncts
8.975021 other  word.lower=#
8.538504 state  word.lower=viuda
8.204700 other  word.lower=defunct
7.974212 other  word.lower=y
7.971931 other  word.lower=defuncta
7.816721 location word.lower=frances
7.655355 state  word.lower=dosella
7.522998 other  word.lower=rebere
7.259508 other  word.lower=habitant
7.223392 location word.lower=bara
7.056602 other  word.lower=a
6.963770 other  word.lower=de
6.655697 other  word.lower=filla
6.586878 other  word.lower=habitat
6.535491 occupation word.lower=llana

Top negative:
0.009387 occupation word.lower=pastisser
0.006814 surname word.lower=pere
-0.000147 name   word.lower=sr
-0.000667 location word.lower=dels
-0.015598 location word.lower=menat
-0.061897 surname word.lower=vila
-0.086558 surname word.lower=del
-0.118397 surname word.lower=toni
-0.285311 occupation bia

In this case, we can see that there is a strong association between some of the label pairs:

**surname** -> **occupation** and **occupation** -> **occupation**: Indicates that the model recognizes that an occupation often follows a surname and that consecutive words can belong to the occupation category.
**location** -> **location**: Shows that the model can identify consecutive location-related words.
**name** -> **surname** and **surname** -> **surname**: Indicates that the model understands the relationship between names and surnames, and that consecutive surnames may appear.
Top unlikely transitions show the transitions between labels that the model believes are least probable. Some interesting observations here are:

**location** -> **name**, **location** -> **surname**, and **location** -> **occupation**: These imply that the model has learned that locations rarely transition directly to names, surnames, or occupations.
**name** -> **occupation**, **occupation** -> **surname**, and **occupation** -> **name**: The model has learned that transitioning from a name to an occupation or vice versa is not common, as well as from an occupation to a surname.
These unlikely transitions might be related to some errors found in the model. For instance, if the model has learned that transitioning from a location to a surname is unlikely, it may struggle to identify correct sequences where this transition happens.

## Hyperparameter Exploration

In the definition of the model we used some default hyperparameters related to the training algorithm.



```
trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # Max Number of iterations for the iterative algorithm

    # include transitions that are possible, but not observed (smoothing)
    'feature.possible_transitions': True
})
```
Can you improve the precision by better parametrization? Feel free to explore more parameters through [the documentation](https://sklearn-crfsuite.readthedocs.io/en/latest/api.html#sklearn_crfsuite.CRF).


In [31]:
# New parameters and results here
## Use the functions above, no need to re-work.
## There is no need of a deep analysis nor qualitative evaluation


# You can define your mode with differente parameters such that:
param_grid = {
    'c1': [0.01, 0.1, 1, 10], # Coefficient for L1 regularization penalty
    'c2': [0.01, 0.1, 1, 10], # Coefficient for L2 regularization penalty
    
    # Different algorithms (Gradient descent using the L-BFGS method

    'algorithm': ['lbfgs', # 'l2sgd' - Stochastic Gradient Descent with L2 regularization term
                  'ap', # 'ap' - Averaged Perceptron
                  'pa', # 'pa' - Passive Aggressive (PA)
                  'arow'], # 'arow' - Adaptive Regularization Of Weight Vector (AROW) )

    'max_iterations': 100, # Maximum number of iterations
    'epsilon':00.1         # The epsilon parameter that determines the condition of convergence.                        
}   # And many other parameters

## CRF Approach

Additionaly, we can address the problem by adding complexity to the transition probabilities. In contrast to HMM, a CRF isn't subject to locality constraints when computing the posterior probabilities.

Implement a CRF word featurer that takes into account tokens beyond adjacent ones and your expected needs given the qualitative evaluation.



In [32]:
def get_word_to_crf_features(sentence, word_idx):
    word, _ = sentence[word_idx]

    features = [
        'bias',
        'word.lower=' + word.lower(),
    ]

    if word_idx > 0:
        prev_word, _ = sentence[word_idx - 1]
        features.extend([
            '-1:word.lower=' + prev_word.lower(),
        ])
    else:
        features.append('bos')

    if word_idx < len(sentence) - 1:
        next_word, _ = sentence[word_idx + 1]
        features.extend([
            '+1:word.lower=' + next_word.lower(),
        ])
    else:
        features.append('eos')

    return features


def get_sent_to_crf_features(sentence):
    return [get_word_to_crf_features(sentence, i) for i in range(len(sentence))]


In [33]:
X_train = [get_sent_to_crf_features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [get_sent_to_crf_features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

trainer_crf = crfs.Trainer(verbose=False) # Instance a CRF trainer

for xseq, yseq in zip(X_train, y_train):
    trainer_crf.append(xseq, yseq) # Stack the data

In [34]:
trainer_crf.set_params({
    'c1': 0.15,  # coefficient for L1 penalty
    'c2': 0.15,  # coefficient for L2 penalty
})

In [35]:
trainer_crf.train('npl_ner_crf-improved.crfsuite') # Train the model and save it locally.
tagger_crf = crfs.Tagger()
tagger_crf.open('npl_ner_crf.crfsuite') # Load the inference API

<contextlib.closing at 0x7f4601ebeb50>

* Did the results improve? Comment your decision on the features used in the CRF approach. Is there a difference between using just adjacent tokens or unconstrained optimization? 

Using just adjacent tokens in the CRF model might limit its ability to capture the context in which a word appears. By using an unconstrained optimization, the CRF model can consider a wider range of tokens and better understand the relationships between words and labels, which could lead to improved results. However, it is important to evaluate the performance of the model on the test set to see if this approach actually leads to better predictions.

In [36]:
## Quantitative and qualitative evaluation

# Predict on the test set
y_pred_crf = [tagger_crf.tag(x) for x in X_test]

# Evaluate the CRF model
report_crf = bio_classification_report(y_test, y_pred_crf)
print("CRF Model:")
print(report_crf)


CRF Model:
              precision    recall  f1-score   support

    location       0.95      0.95      0.95       416
        name       0.98      0.98      0.98       487
  occupation       0.93      0.97      0.95       296
       other       0.99      0.98      0.98      1486
       state       0.97      0.98      0.97       114
     surname       0.97      0.94      0.95       269

   micro avg       0.97      0.97      0.97      3068
   macro avg       0.96      0.97      0.97      3068
weighted avg       0.97      0.97      0.97      3068
 samples avg       0.97      0.97      0.97      3068



In [37]:
# For the HMM model

# Predict on the test set
y_pred_hmm = [tagger.tag(x) for x in X_test]

# Evaluate the HMM model
report_hmm = bio_classification_report(y_test, y_pred_hmm)
print("HMM Model:")
print(report_hmm)


HMM Model:
              precision    recall  f1-score   support

    location       0.95      0.95      0.95       416
        name       0.98      0.98      0.98       487
  occupation       0.93      0.97      0.95       296
       other       0.99      0.98      0.98      1486
       state       0.97      0.98      0.97       114
     surname       0.97      0.94      0.95       269

   micro avg       0.97      0.97      0.97      3068
   macro avg       0.96      0.97      0.97      3068
weighted avg       0.97      0.97      0.97      3068
 samples avg       0.97      0.97      0.97      3068



The results for both models were the same for all labels

## Conclusions

Here, write a brief conclusion for this notebook reffering to the main differences, advantages and disadvantages for each approach.




> In this notebook, we explored three different methods for Named Entity Recognition (NER) using the BIO tagging scheme: a baseline model, a Hidden Markov Model (HMM), and a Conditional Random Field (CRF) model. We analyzed the performance of each model and discussed their strengths and weaknesses.

>In conclusion, the choice of the NER model depends on the specific requirements of the task and the resources available. The baseline model can be used for quick and simple tasks, while the HMM and CRF models offer better performance at the cost of increased complexity and computational resources. The CRF model is particularly well-suited for complex NER tasks that require the incorporation of diverse features and non-local context.

