# ReFinED use case

This notebook documents multiple executions of the code in `refined_test.py` and summarizes the main findings.

## Conclusions

1. Great tool. It performs word sense disambiguation (WSD) and entity linking from scratch; NER is presupposed.
2. Results:
   1. **Apparently** good results with the Venice example (invariant entities, conventional entities, general-domain NER and WSD).
   2. Massive drop on the titles dataset.
      - Possibly because titles are not natural running text and lack the canonical linguistic structure that the part-of-speech tagging and NER steps expect. As a result, few entities are detected and disambiguated.
3. NER is not optimal; the tokenizer seems to be an issue.
4. WSD is okay but not impressive.
5. Performance (throughput) is not great given the results.

# Experiments

In [386]:
%load_ext autoreload
%autoreload

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [387]:
import dtale
import pandas as pd

The `WikipediaAnnotator` class implements an entity extractor that uses fuzzy string matching to retrieve candidate entities from Wikipedia. We expect it to provide a relatively strong baseline, but it is not a realistic competitor as of now. It relies on Wikipedia's public search API, which would not be possible in a production environment, and on cached look-up, which would be acceptable in any production setting and is in fact indispensable for running the baseline:

In [388]:
from wikipedia_annotator import WikipediaAnnotator
BASELINE = WikipediaAnnotator()
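
To make the two ingredients concrete, here is a minimal sketch of fuzzy title matching behind a cache. It is not the `WikipediaAnnotator` implementation, which is not shown in this notebook: the local `TITLES` tuple stands in for Wikipedia's search API, and `lookup_candidates` and its parameters are hypothetical names.

```python
from difflib import get_close_matches
from functools import lru_cache

# Hypothetical local title index; the real baseline queries Wikipedia's search API.
TITLES = ("Adriatic Sea", "Byzantine Empire", "Republic of Venice", "Lake Garda")
_LOWER_TO_TITLE = {title.lower(): title for title in TITLES}


@lru_cache(maxsize=4096)
def lookup_candidates(term: str, limit: int = 3, cutoff: float = 0.6) -> tuple:
    """Return up to `limit` titles whose spelling is close to `term`."""
    matches = get_close_matches(term.lower(), list(_LOWER_TO_TITLE), n=limit, cutoff=cutoff)
    return tuple(_LOWER_TO_TITLE[m] for m in matches)


print(lookup_candidates("Adriatic"))
print(lookup_candidates("Adriatic"))        # identical call, served from the cache
print(lookup_candidates.cache_info().hits)  # number of cache hits so far
```

With a remote API in place of the local index, the cache is what keeps repeated look-ups of the same surface form from triggering repeated requests.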

Target system: ReFinED

In [389]:
from refined_test import ReFinED
RFED = ReFinED()

The test examples are defined in a separate module, both for tracking and for convenience, and are imported in the cell below. The texts were hand-picked and meet no specific selection criteria:

In [390]:
from texts import TEXT__PAPER_TITLES, TEXT_VENICE

We also need evaluation metrics:

In [391]:
from evaluation import character_coverage
from evaluation import diff_annotations
from evaluation import f_score

## _Venice_ text

Let us check the results for the text about the city of Venice first. We print the text and then run both annotation pipelines, starting with the baseline:

In [392]:
print(TEXT_VENICE)

From the 9th to the 12th centuries, Venice developed into a powerful maritime empire (an Italian thalassocracy known also as repubblica marinara). In addition to Venice there were seven others: the most important ones were Genoa, Pisa, and Amalfi; and the lesser known were Ragusa, Ancona, Gaeta and Noli. Its own strategic position at the head of the Adriatic made Venetian naval and commercial power almost invulnerable. With the elimination of pirates along the Dalmatian coast, the city became a flourishing trade centre between Western Europe and the rest of the world, especially with the Byzantine Empire and Asia, where its navy protected sea routes against piracy. The Republic of Venice seized a number of places on the eastern shores of the Adriatic before 1200, mostly for commercial reasons, because pirates based there were a menace to trade. The doge already possessed the titles of Duke of Dalmatia and Duke of Istria. Later mainland possessions, which extended across Lake Garda as f

In [393]:
baseline_entities_venice = BASELINE.extract_terms(TEXT_VENICE)

rfed_entities_venice = RFED.extract_terms(TEXT_VENICE)


User provided device_type of 'cuda', but CUDA is not available. Disabling



In [394]:
baseline_coverage = character_coverage(TEXT_VENICE, baseline_entities_venice)
test_coverage = character_coverage(TEXT_VENICE, rfed_entities_venice)
print(f'baseline coverage{baseline_coverage:7.2f}\ntest coverage {test_coverage:10.2f}')

baseline coverage   0.29
test coverage       0.18
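
`character_coverage` is imported from `evaluation` and its implementation is not shown here. A minimal sketch of what such a metric might compute, assuming each entity is a tuple that starts with `(start, end, ...)` character offsets and that coverage is the fraction of characters inside at least one span; `character_coverage_sketch` is a hypothetical name:

```python
def character_coverage_sketch(text, entities):
    """Fraction of characters in `text` covered by at least one entity span."""
    if not text:
        return 0.0
    covered = set()
    for start, end, *_ in entities:
        # Clamp offsets to the text and count overlapping spans only once.
        covered.update(range(max(start, 0), min(end, len(text))))
    return len(covered) / len(text)


sample = "Venice ruled the Adriatic."
spans = [(0, 6, "Venice", "Venice"), (17, 25, "Adriatic", "Adriatic Sea")]
print(f"{character_coverage_sketch(sample, spans):.2f}")  # 14 of 26 characters
```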


In [395]:
baseline_diff_venice, rfed_diff_venice = diff_annotations(baseline_entities_venice, rfed_entities_venice)

In [396]:
for x in baseline_diff_venice[:3]:
    print(x)

(78, 84, 'empire', 'Empire')
(150, 158, 'addition', 'Addition')
(340, 344, 'head', 'Head')


In [397]:
for x in rfed_diff_venice[:3]:
    print(x)

(125, 144, 'repubblica marinara', 'Maritime republics')
(274, 280, 'Ragusa', 'Ragusa, Sicily')
(352, 360, 'Adriatic', 'Adriatic Sea')
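
The implementation of `diff_annotations` is not shown either. One plausible reading, given the tuples printed above, is a two-way set difference over `(start, end, surface, title)` annotations. The sketch below assumes exactly that; `diff_annotations_sketch` is a hypothetical name and the real function may compare annotations differently (for example by span only):

```python
def diff_annotations_sketch(left, right):
    """Return (annotations only in `left`, annotations only in `right`)."""
    left_set, right_set = set(left), set(right)
    only_left = sorted(left_set - right_set)    # sorted by start offset first
    only_right = sorted(right_set - left_set)
    return only_left, only_right


left = [(0, 6, "Venice", "Venice"), (78, 84, "empire", "Empire")]
right = [(0, 6, "Venice", "Venice"), (352, 360, "Adriatic", "Adriatic Sea")]
only_left, only_right = diff_annotations_sketch(left, right)
print(only_left)
print(only_right)
```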


The baseline's Recall, approximated here by character coverage, is 11 percentage points higher than that of the ReFinED pipeline (61% higher in relative terms), which is substantial. However, out of the

In [398]:
len(baseline_diff_venice)

43

total entities, 29 were found to be correct and the remaining 14 were false positives. The false positives fall into two main error types:
1. Verbs
2. Generic nouns

This results in a Precision of 67% (29 of 43) for the baseline, in contrast with ReFinED's Precision, which manual review put at a substantial 91%.
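
The arithmetic behind these figures can be checked directly. The counts and coverage values are the ones reported above; the variable names are illustrative:

```python
reviewed = 43                 # len(baseline_diff_venice)
true_positives = 29           # judged correct in the manual review
false_positives = reviewed - true_positives
precision = true_positives / (true_positives + false_positives)

baseline_recall, test_recall = 0.29, 0.18   # character coverage used as Recall
absolute_gap = baseline_recall - test_recall
relative_gap = absolute_gap / test_recall

print(f"false positives = {false_positives}")
print(f"precision = {precision:.2f}")
print(f"recall gap = {absolute_gap:.2f} absolute, {relative_gap:.0%} relative")
```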

In [407]:
venice_baseline_recall = 0.29
venice_test_recall = 0.18

venice_baseline_precision = 0.67
venice_test_precision = 0.91

Considering these estimates of Precision and Recall, their respective F-$\beta$ scores with $\beta$ = `0.5, 1.0, 2.0` are:

**F-$\beta$ score with $\beta$** = _0.5_

In [404]:
f = 0.5
baseline_f = f_score(venice_baseline_precision, venice_baseline_recall, f=f)
test_f = f_score(venice_test_precision, venice_test_recall, f=f)
print(f'baseline F_beta = {baseline_f:6.2f}\ntest F_beta = {test_f:10.2f}')

baseline F_beta =   0.53
test F_beta =       0.50


**F-$\beta$ score with $\beta$** = _1.0_

In [405]:
f = 1.0
baseline_f = f_score(venice_baseline_precision, venice_baseline_recall, f=f)
test_f = f_score(venice_test_precision, venice_test_recall, f=f)
print(f'baseline F_beta = {baseline_f:6.2f}\ntest F_beta = {test_f:10.2f}')

baseline F_beta =   0.40
test F_beta =       0.30


**F-$\beta$ score with $\beta$** = _2.0_

In [406]:
f = 2.0
baseline_f = f_score(venice_baseline_precision, venice_baseline_recall, f=f)
test_f = f_score(venice_test_precision, venice_test_recall, f=f)
print(f'baseline F_beta = {baseline_f:6.2f}\ntest F_beta = {test_f:10.2f}')

baseline F_beta =   0.33
test F_beta =       0.21
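
The scores above are consistent with the standard definition $F_\beta = (1 + \beta^2) \cdot P \cdot R / (\beta^2 \cdot P + R)$. The implementation of `f_score` in `evaluation` is not shown, so the following is only a sketch of that formula under a hypothetical name, `f_beta`:

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta: recall is weighted beta times as much as precision."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0  # no true positives at all
    return (1 + b2) * precision * recall / denominator


for beta in (0.5, 1.0, 2.0):
    baseline = f_beta(0.67, 0.29, beta)
    test = f_beta(0.91, 0.18, beta)
    print(f"beta={beta}: baseline={baseline:.2f} test={test:.2f}")
```

Because Recall is low for both systems, the gap widens in the baseline's favour as $\beta$ grows and Recall gains weight.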


### Error handling

If the two error categories detected during the error analysis were handled, the baseline system would produce fewer false positives. Each category comprised 7 cases, so resolving one category would halve the 14 false positives and resolving both would remove them all.

Out of the two error categories, **verbs** might be easier to handle using part-of-speech tagging and restricting Wikipedia look-up to noun phrases.
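
A minimal sketch of such a filter, assuming the text has already been tagged by some external part-of-speech tagger that emits Universal POS labels (`NOUN`, `PROPN`, `ADJ`, `VERB`, ...). The tagger itself is out of scope here, and `noun_phrase_candidates` is a hypothetical helper, not part of the baseline:

```python
NOMINAL_TAGS = {"NOUN", "PROPN"}
MODIFIER_TAGS = {"ADJ"}


def noun_phrase_candidates(tagged_tokens):
    """Group contiguous ADJ/NOUN/PROPN tokens into phrases that end in a noun."""
    phrases, current = [], []

    def flush():
        # Drop trailing modifiers so that every phrase ends in a nominal token.
        while current and current[-1][1] not in NOMINAL_TAGS:
            current.pop()
        if current:
            phrases.append(" ".join(word for word, _ in current))
        current.clear()

    for word, tag in tagged_tokens:
        if tag in NOMINAL_TAGS or tag in MODIFIER_TAGS:
            current.append((word, tag))
        else:
            flush()  # any other tag (verbs included) ends the current phrase
    flush()
    return phrases


tagged = [("The", "DET"), ("Republic", "PROPN"), ("of", "ADP"), ("Venice", "PROPN"),
          ("seized", "VERB"), ("eastern", "ADJ"), ("shores", "NOUN")]
print(noun_phrase_candidates(tagged))
```

Only the returned phrases would be sent to the Wikipedia look-up, so a verb such as "seized" never becomes a candidate.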

The second error type, **generic nouns**, could be addressed either by extending the list of stopwords manually (easy, but it may require maintenance), implementing word sense disambiguation, or using some type of context-based filtering. Generally speaking, single terms above a certain frequency should either be avoided entirely, or should undergo further validation before being extracted.
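
The stopword and frequency checks could be combined in a single gate, sketched below. The helper name `needs_validation`, the threshold, and the counts are all illustrative assumptions; real frequencies would have to come from a reference corpus:

```python
def needs_validation(term, frequency, stopwords, max_frequency):
    """True for single-word terms that are stopwords or too frequent."""
    words = term.lower().split()
    if len(words) != 1:
        return False  # multi-word terms are left to the normal look-up
    word = words[0]
    return word in stopwords or frequency.get(word, 0) > max_frequency


# Illustrative values only, not measured counts.
stopwords = {"addition"}
frequency = {"head": 950, "empire": 400, "thalassocracy": 1}
candidates = ["head", "addition", "thalassocracy", "Byzantine Empire"]

kept = [term for term in candidates
        if not needs_validation(term, frequency, stopwords, max_frequency=300)]
print(kept)
```

Terms flagged by the gate need not be discarded; they could instead be routed to a further validation step.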

More specifically, consider any single word whose Wikipedia look-up returns multiple candidate senses with article titles that are similar once parentheticals are removed (likely after triggering a `DisambiguationError`). Such a word should be assigned the sense with the highest semantic similarity to its current context of occurrence. The similarity could be measured between the context, as the target, and any of the following sources of information for each candidate (non-exhaustive list):
- the term's Wikipedia summary
- each of the term's Wikipedia categories
- entities from pages linked to the term's page
- the term's full Wikipedia page
- and so on.
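
The selection step can be sketched with a deliberately crude similarity measure: cosine similarity over bag-of-words counts, using the candidate summary as the information source. This is a stand-in for whatever semantic similarity model would be used in practice. The candidate titles and summaries below are written for illustration and are not real Wikipedia content, and `pick_sense` is a hypothetical name:

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list so that function words do not dominate.
STOPWORDS = {"the", "of", "and", "a", "an", "at", "or", "for", "its"}


def bag_of_words(text):
    """Lower-cased word counts with stopwords removed."""
    return Counter(w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS)


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pick_sense(context, candidate_summaries):
    """Return the candidate title whose summary is most similar to the context."""
    ctx = bag_of_words(context)
    return max(candidate_summaries,
               key=lambda title: cosine(ctx, bag_of_words(candidate_summaries[title])))


context = ("Its own strategic position at the head of the Adriatic made "
           "Venetian naval and commercial power almost invulnerable.")
candidates = {  # hypothetical senses and summaries
    "Head (anatomy)": "upper part of the body of an animal containing the brain, eyes and mouth",
    "Head (geography)": "the innermost end of a sea or gulf, a strategic position for naval and commercial ports",
}
print(pick_sense(context, candidates))
```

Swapping `cosine` over word counts for a sentence-embedding similarity would keep the same structure while capturing meaning rather than word overlap.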

If both **verb** and **generic noun** errors were successfully handled and all false positives were removed, the maximum performance expected for the baseline would increase by 14 percentage points, up to 67%, for $\beta = 0.5$ (26% relative improvement) and by 5 percentage points, up to 45%, for $\beta = 1.0$ (12.5% relative improvement):

In [413]:
venice_baseline_precision_revised = 1.0

for f in [0.5, 1.0, 2.0]:
    baseline_f = f_score(venice_baseline_precision_revised, venice_baseline_recall, f=f)
    test_f = f_score(venice_test_precision, venice_test_recall, f=f)
    print(f'baseline F_beta = {baseline_f:6.2f}\ntest F_beta = {test_f:10.2f}\n')

baseline F_beta =   0.67
test F_beta =       0.50

baseline F_beta =   0.45
test F_beta =       0.30

baseline F_beta =   0.34
test F_beta =       0.21
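
With Precision fixed at 1.0, the standard F-$\beta$ formula reduces to $(1 + \beta^2) \cdot R / (\beta^2 + R)$, so Recall alone caps the score. The sketch below uses a hypothetical helper, `f_beta_at_perfect_precision`, to recompute the ceilings and compare them with the rounded baseline scores reported earlier:

```python
def f_beta_at_perfect_precision(recall, beta):
    """F-beta when precision equals 1.0, so recall alone determines the score."""
    b2 = beta ** 2
    return (1 + b2) * recall / (b2 + recall)


# Baseline F-beta scores reported earlier for precision 0.67 and recall 0.29.
reported = {0.5: 0.53, 1.0: 0.40, 2.0: 0.33}

for beta, current in reported.items():
    ceiling = round(f_beta_at_perfect_precision(0.29, beta), 2)
    gain = ceiling - current
    print(f"beta={beta}: {current:.2f} -> {ceiling:.2f} "
          f"(+{gain:.2f} absolute, {gain / current:.1%} relative)")
```

The gain shrinks as $\beta$ grows because Recall, which the error handling leaves unchanged, takes on more weight.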



## AI/NLP paper titles text

In [414]:
print(TEXT__PAPER_TITLES)

GRAMMAR: Grounded and Modular Evaluation of Domain-Specific Retrieval-Augmented Language Models. Rumour Evaluation with Very Large Language Models. A Legal Framework for Natural Language Processing Model Training in Portugal. Investigating Automatic Scoring and Feedback using Large Language Models. When Quantization Affects Confidence of Large Language Models. Better & Faster Large Language Models via Multi-token Prediction. Self-Refine Instruction-Tuning for Aligning Reasoning in Language Models. Computational Job Market Analysis with Natural Language Processing. A Survey of Generative Search and Recommendation in the Era of Large Language Models. Octopus v4: Graph of language models. Utilizing Large Language Models to Identify Reddit Users Considering Vaping Cessation for Digital Interventions. BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers. ChatGPT Is Here to Help, Not to Replace Anybody -- An Evaluation of Students' Opinions On Integrating ChatGPT In

In [415]:
baseline_entities_paper = BASELINE.extract_terms(TEXT__PAPER_TITLES)

rfed_entities_paper = RFED.extract_terms(TEXT__PAPER_TITLES)

In [417]:
baseline_coverage = character_coverage(TEXT__PAPER_TITLES, baseline_entities_paper)
test_coverage = character_coverage(TEXT__PAPER_TITLES, rfed_entities_paper)
print(f'baseline coverage{baseline_coverage:7.2f}\ntest coverage {test_coverage:10.2f}')

baseline coverage   0.45
test coverage       0.05
