<a href="https://colab.research.google.com/github/LorenzoBellomo/InformationRetrieval/blob/main/notebooks/3_NER_EL_PoS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced NLP, phrases and semantics
In this lecture, we focus on Named Entity Recognition, Entity Linking, and Part of Speech Tagging.


In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 41.8 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import requests
import json
import spacy
from spacy import displacy

## Part of Speech (PoS) Tagging


Let's now use spaCy to see the PoS tagging process, and how we can find dependencies between words in a sentence programmatically.

In [None]:
en_model = spacy.load("en_core_web_sm")
sentence = "He was being opposed by her without any reason.\
	    A plan is being prepared by charles for next project"
processed_sent = en_model(sentence)
for token in processed_sent:
    print(f'{token.text:{10}} {token.tag_:>{10}}\t{spacy.explain(token.tag_):<{50}} 			{token.pos_:>{5}}')

He                PRP	pronoun, personal                                  			 PRON
was               VBD	verb, past tense                                   			  AUX
being             VBG	verb, gerund or present participle                 			  AUX
opposed           VBN	verb, past participle                              			 VERB
by                 IN	conjunction, subordinating or preposition          			  ADP
her               PRP	pronoun, personal                                  			 PRON
without            IN	conjunction, subordinating or preposition          			  ADP
any                DT	determiner                                         			  DET
reason             NN	noun, singular or mass                             			 NOUN
.                   .	punctuation mark, sentence closer                  			PUNCT
	                 _SP	whitespace                                         			SPACE
A                  DT	determiner                                         			  DET
plan            

In [None]:
displacy.render(processed_sent, style="dep", jupyter=True)

## Named Entity Recognition

Now we load the spaCy English model and test its default NER pipeline on a sample text.

In [None]:
en_model = spacy.load("en_core_web_sm") # Loading the English model

text = "Apple is looking at buying U.K. startup for $1 billion"
doc = en_model(text)
displacy.render(doc, style="ent", jupyter=True)
#for entity in doc.ents:
#    print(entity.text, entity.label_)

In [None]:
for entity in doc.ents:
    print(entity.text, entity.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


As we can see, verbs and stopwords are ignored, while named entities are properly categorized.

Let's test it on another, longer text.

In [None]:
!wget https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/main/data/Leonardo.txt

--2025-01-28 10:03:07--  https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/main/data/Leonardo.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 604 [text/plain]
Saving to: ‘Leonardo.txt.1’


2025-01-28 10:03:07 (26.5 MB/s) - ‘Leonardo.txt.1’ saved [604/604]



In [None]:
with open("Leonardo.txt", 'r') as txt_file:
  leonardo = txt_file.read()
doc = en_model(leonardo)
displacy.render(doc, style="ent", jupyter=True)

Something goes wrong here: why is Renaissance labeled as ORG? Let's look at the meaning of each category.

In [None]:
for entity in doc.ents:
  print(entity.text, entity.label_, " - ", spacy.explain(entity.label_))

Leonardo da Vinci PERSON  -  People, including fictional
Italian NORP  -  Nationalities or religious or political groups
Renaissance ORG  -  Companies, agencies, institutions, etc.
Leonardo PERSON  -  People, including fictional
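
When checking a labelling like the one above, it can help to group the mentions by label so that odd assignments stand out. The following is a minimal sketch: the pairs are hard-coded from the output above so that it runs without loading a model, and `group_by_label` is a hypothetical helper name, not part of spaCy.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above
entities = [
    ("Leonardo da Vinci", "PERSON"),
    ("Italian", "NORP"),
    ("Renaissance", "ORG"),
    ("Leonardo", "PERSON"),
]

def group_by_label(pairs):
    """Map each label to the list of entity texts that received it (hypothetical helper)."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

for label, texts in group_by_label(entities).items():
    print(label, texts)
```

With a processed document, the same pairs can be built with `[(e.text, e.label_) for e in doc.ents]`.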


This shows how crucial it can be to train the model for your specific domain.

### Finetuning NER

Now we download the Anatomical Entity Mention (AnEM) corpus from GitHub. It is a NER dataset for the biomedical domain, stored in the CoNLL format.
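
CoNLL-style files typically hold one token per line together with its tag, with a blank line between sentences. The sketch below reads such a layout into Python lists. It assumes whitespace-separated columns with the token first and the tag last; the sample lines and the `B-Anatomy`/`I-Anatomy` tag names are made up for illustration and are not taken from AnEM, and `read_conll` is a hypothetical helper.

```python
def read_conll(lines):
    """Parse token/tag lines into sentences, each a list of (token, tag) pairs.

    Assumes the token is the first column and the tag is the last one.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences

# Made-up sample; tag names are illustrative, not taken from AnEM
sample = """The O
spinal B-Anatomy
cord I-Anatomy
is O
long O

Cells B-Anatomy
divide O
"""
sentences = read_conll(sample.splitlines())
print(len(sentences), sentences[1])
```

Looking at a few parsed sentences this way is a quick sanity check before handing the files to the converter.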

In [None]:
!wget https://raw.githubusercontent.com/juand-r/entity-recognition-datasets/master/data/AnEM/CONLL-format/data/AnEM.train
!wget https://raw.githubusercontent.com/juand-r/entity-recognition-datasets/master/data/AnEM/CONLL-format/data/AnEM.test

--2025-01-28 09:51:59--  https://raw.githubusercontent.com/juand-r/entity-recognition-datasets/master/data/AnEM/CONLL-format/data/AnEM.train
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1220623 (1.2M) [text/plain]
Saving to: ‘AnEM.train’


2025-01-28 09:52:00 (18.5 MB/s) - ‘AnEM.train’ saved [1220623/1220623]

--2025-01-28 09:52:00--  https://raw.githubusercontent.com/juand-r/entity-recognition-datasets/master/data/AnEM/CONLL-format/data/AnEM.test
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 772970 (755K) [text/plain]
Saving

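spaCy does not train directly on CoNLL files, so we use its `convert` command to produce the binary `.spacy` format. We first rename the files to avoid a name collision: the converter names its output after the input file with the extension replaced, so `AnEM.train` and `AnEM.test` would both be written to `AnEM.spacy`.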

In [None]:
!mv AnEM.train AnEMTrain
!mv AnEM.test AnEMTest
!spacy convert AnEMTrain -c conll .
!spacy convert AnEMTest -c conll .

ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (2815 documents): AnEMTrain.spacy
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (1882 documents): AnEMTest.spacy


**Now it is time to generate the `base_config.cfg` file.**
Instructions are here: https://spacy.io/usage/training#quickstart

In [None]:
!wget https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/main/base_config.cfg
!python -m spacy init fill-config base_config.cfg config.cfg

--2025-01-28 10:03:39--  https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/main/base_config.cfg
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1713 (1.7K) [text/plain]
Saving to: ‘base_config.cfg’


2025-01-28 10:03:39 (22.4 MB/s) - ‘base_config.cfg’ saved [1713/1713]

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


To keep the lecture on schedule, training is limited to 2 epochs (`--training.max_epochs 2`).

In [None]:
!mkdir output
#!python -m spacy train config.cfg --output ./output --paths.train ./AnEMTrain.spacy --paths.dev ./AnEMTest.spacy
!python -m spacy train config.cfg --output ./output --paths.train ./AnEMTrain.spacy --paths.dev ./AnEMTest.spacy --training.max_epochs 2

mkdir: cannot create directory ‘output’: File exists
ℹ Saving to output directory: output
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     27.87    0.00    0.00    0.00    0.00
  0     200        308.78   1644.50    8.64   13.06    6.45    0.09
  0     400         85.48    866.24   15.94   33.76   10.43    0.16
  1     600        139.98   1002.82   25.36   33.55   20.38    0.25
  1     800        236.84   1008.65   37.06   50.41   29.30    0.37
✔ Saved pipeline to output directory
output/model-last
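
The three entity columns of the log are tied together: ENTS_F is the harmonic mean (F1) of ENTS_P (precision) and ENTS_R (recall). As a quick check, the sketch below recomputes it from the last row of the training log above; `f_score` is a hypothetical helper written for this notebook, not a spaCy function.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1); 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# ENTS_P and ENTS_R copied from the last row of the training log above
print(round(f_score(50.41, 29.30), 2))  # matches the ENTS_F column: 37.06
```

The gap between precision and recall in that row suggests that, after two epochs, the model misses many entities but is right about half of the time when it does predict one.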


In [None]:
trained_model = spacy.load("output/model-best")
text = "Apple is looking at buying U.K. startup for $1 billion"
doc = trained_model(text)
displacy.render(doc, style="ent", jupyter=True)



As expected, a general-purpose sentence yields no named entities: the new pipeline was trained only on the AnEM labels.

In [None]:
text = "In the posterior (dorsal) cavity, the cranial cavity houses the brain, and the spinal cavity (or vertebral cavity) encloses the spinal cord."
doc = trained_model(text)
displacy.render(doc, style="ent", jupyter=True)

And as expected, the new anatomical categories appear!

## Entity Linking

In [None]:
KEY = "6ab7daea-c174-4254-a9d2-d85f6117bf20-843339462" # this is the key we will be using for REST calls

In [None]:
TAGME_ENDPOINT = "https://tagme.d4science.org/tagme/tag"
LANG = "en" # Also works in Italian and German

Now create the function that will "wrap" the REST call. It needs a textual input

In [None]:
def query_tagme(text):
    payload = {"text": text, "gcube-token": KEY, "lang": LANG}
    # Now we issue a post HTTP request
    r = requests.post(TAGME_ENDPOINT, payload)
    if r.status_code != 200:
        # this means something went wrong with the query
        raise Exception("Error on text: {}\n{}".format(text, r.text))
    return r.json()

And now we display the result for a simple textual query. The interesting part, for us, is under the key _annotations_.
This will be a list of annotations containing the following fields:
- **spot (string)**: how the anchor appears in the text.
- **start (int)**: the index of the starting character of the anchor.
- **end (int)**: the index of the ending character of the anchor.
- **link_probability (float ∈ [0, 1])**: number of times the spot occurs as an anchor in Wikipedia, divided by the total number of occurrences of the spot in Wikipedia.
- **rho (float ∈ [0, 1])**: semantic coherence of the entity with respect to the query.
- **id (int)**: the Wikipedia identifier of the page _(https://en.wikipedia.org/?curid=<id>)_.
- **title (string)**: title of the Wikipedia page.
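
To make the fields above concrete, here is a small sketch that slices the anchor out of the text with `start` and `end` and builds the Wikipedia URL from `id`. The annotation values are taken from the response this notebook shows for the sample query; `describe` and `WIKI_URL` are hypothetical names introduced only for this illustration.

```python
WIKI_URL = "https://en.wikipedia.org/?curid={}"

text = "Italy will not be competing in the 2022 world cup"
# One annotation with values taken from the TagME response shown in this notebook
annotation = {"spot": "2022 world cup", "start": 35, "end": 49,
              "id": 17742072, "title": "2022 FIFA World Cup"}

def describe(text, ann):
    """Return the anchor sliced from the text and the URL of the linked page."""
    surface = text[ann["start"]:ann["end"]]
    return surface, WIKI_URL.format(ann["id"])

surface, url = describe(text, annotation)
print(surface, "->", url)
```

In this sample, `end` behaves like the exclusive end of a Python slice, so `text[start:end]` reproduces the `spot` exactly.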

In [None]:
short_text = "Italy will not be competing in the 2022 world cup"
resp = query_tagme(short_text)
resp

{'test': '5',
 'annotations': [{'spot': 'Italy',
   'start': 0,
   'link_probability': 0.4437723457813263,
   'rho': 0.4525856375694275,
   'end': 5,
   'id': 362466,
   'title': 'Italy national football team'},
  {'spot': 'will',
   'start': 6,
   'link_probability': 0.0036389119923114777,
   'rho': 0.06729841977357864,
   'end': 10,
   'id': 32828260,
   'title': 'Will (2011 film)'},
  {'spot': '2022 world cup',
   'start': 35,
   'link_probability': 0.3492063581943512,
   'rho': 0.3398236632347107,
   'end': 49,
   'id': 17742072,
   'title': '2022 FIFA World Cup'}],
 'time': 38,
 'api': 'tag',
 'lang': 'en',
 'timestamp': '2025-01-28T10:08:47'}

### Handle longer texts / filtering noisy annotations

TagME has been designed for handling short texts, but we also have a way to obtain competitive results on longer ones.
This requires modifying the window of spots that are checked by TagME when doing disambiguation.

Now open a new text file with a slightly longer text and annotate it with TagME

In [None]:
with open("Leonardo.txt", 'r') as long_file:
    # the text is not a json object, it is just a plaintext, so just read it regularly with read()
    text = long_file.read()
text

'Leonardo da Vinci was an Italian Renaissance polymath whose areas of interest included invention, painting, sculpting, architecture, science, music, mathematics, engineering, literature, anatomy, geology, astronomy, botany, writing, history, and cartography. \nHe has been variously called the father of palaeontology, ichnology, and architecture, and is widely considered one of the greatest painters of all time. Leonardo is revered for his technological ingenuity. He conceptualised flying machines, a type of armoured fighting vehicle, concentrated solar power, an adding machine, and the double hull.'

Now we will change the tagging function we made before, by adding an optional boolean parameter. If true, this means that the text is long, otherwise it is short and we can avoid changing the window.

In [None]:
def query_tagme(text, long_text=False):
    payload = {"text": text, "gcube-token": KEY, "lang": LANG}
    if long_text:
        # long_text is False by default; when the caller sets it, we use a window size of 5
        payload["long_text"] = 5
    r = requests.post(TAGME_ENDPOINT, payload)
    if r.status_code != 200:
        raise Exception("Error on text: {}\n{}".format(text, r.text))
    return r.json()

But how do we deal with noisy annotations? TagME gives us a "content relevance" score in the form of the **rho score**.
We can remove noise by discarding the annotations with the lowest rho. A common threshold for this task is 0.3.

In [None]:
# Try changing the min_rho parameter and see how it impacts the returned entities
def get_tagme_entities(tagme_response, min_rho=0.3):
    ann = tagme_response["annotations"]
    ann = [a for a in ann if a["rho"] > min_rho] # keep only the annotations whose rho score exceeds the threshold
    return [a["title"] for a in ann if "title" in a] # return just the page titles
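
To get a feel for the threshold without calling the service, the sketch below sweeps a few values over the three annotations returned earlier for the World Cup sentence. The response is trimmed to the `title` and `rho` fields, and `titles_above` is a hypothetical helper that repeats the same filter so that the cell runs on its own.

```python
# Trimmed annotations (title and rho only) from the World Cup response shown earlier
sample_response = {"annotations": [
    {"title": "Italy national football team", "rho": 0.4525856375694275},
    {"title": "Will (2011 film)", "rho": 0.06729841977357864},
    {"title": "2022 FIFA World Cup", "rho": 0.3398236632347107},
]}

def titles_above(response, min_rho):
    """Same filter as get_tagme_entities, repeated here to stay self-contained."""
    return [a["title"] for a in response["annotations"]
            if a["rho"] > min_rho and "title" in a]

for threshold in (0.0, 0.3, 0.4):
    print(threshold, titles_above(sample_response, threshold))
```

At 0.3 the spurious link for "will" is dropped while both football entities survive; raising the threshold to 0.4 also removes the World Cup page.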

Now see which entities _disappear_ when filtering

In [None]:
print("BEFORE FILTERING")
resp = query_tagme(text, long_text=True)
before_filtering = [a["title"] for a in resp['annotations'] if "title" in a]
before_filtering

BEFORE FILTERING


['Leonardo da Vinci',
 'Leonardo da Vinci',
 'Italian Renaissance',
 'Polymath',
 'Attention',
 'Invention',
 'Painting',
 'Sculpture',
 'Architecture',
 'Science',
 'Music and mathematics',
 'Engineering',
 'Literature',
 'Anatomy',
 'Geology',
 'Astronomy',
 'Botany',
 'Writing',
 'History',
 'Cartography',
 'Clergy',
 'Paleontology',
 'Ichnology',
 'Architecture',
 'Neoplatonism',
 'Greatest!',
 'Painting',
 'Time (magazine)',
 'Leonardo da Vinci',
 'Canonization',
 'Technology',
 'Ingenuity',
 'Concept',
 'Flying Machines s.r.o.',
 'Granite',
 'Stellar classification',
 'Armoured fighting vehicle',
 'Concentrated solar power',
 'Adding machine',
 'Double hull']

In [None]:
print("AFTER FILTERING")
after_filtering = get_tagme_entities(resp)
after_filtering

AFTER FILTERING


['Leonardo da Vinci',
 'Leonardo da Vinci',
 'Italian Renaissance',
 'Polymath',
 'Music and mathematics',
 'Geology',
 'Astronomy',
 'Botany',
 'Cartography',
 'Paleontology',
 'Ichnology',
 'Armoured fighting vehicle',
 'Concentrated solar power',
 'Adding machine']

In [None]:
print("The annotations that were filtered out are:")
[a for a in before_filtering if a not in after_filtering]

The annotations that were filtered out are:


['Attention',
 'Invention',
 'Painting',
 'Sculpture',
 'Architecture',
 'Science',
 'Engineering',
 'Literature',
 'Anatomy',
 'Writing',
 'History',
 'Clergy',
 'Architecture',
 'Neoplatonism',
 'Greatest!',
 'Painting',
 'Time (magazine)',
 'Canonization',
 'Technology',
 'Ingenuity',
 'Concept',
 'Flying Machines s.r.o.',
 'Granite',
 'Stellar classification',
 'Double hull']
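
Note that the comparison above works on titles, so repeated titles are not counted: 'Leonardo da Vinci' appears three times before filtering and only twice after, yet it is missing from the list of removed annotations. A possible alternative, sketched here on shortened versions of the two lists, is to subtract `collections.Counter` objects so that each occurrence is counted.

```python
from collections import Counter

# Shortened versions of the two lists above, keeping the repeated titles
before = ["Leonardo da Vinci", "Leonardo da Vinci", "Polymath",
          "Architecture", "Architecture", "Leonardo da Vinci"]
after = ["Leonardo da Vinci", "Leonardo da Vinci", "Polymath"]

# Counter subtraction keeps only positive counts: occurrences dropped by the filter
removed = Counter(before) - Counter(after)
print(dict(removed))
```

This reports one removed occurrence of 'Leonardo da Vinci' in addition to the two 'Architecture' annotations.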

### TRY OTHER ANNOTATORS: SWAT

TagME is not the only available annotator. There are several more, each with its own strengths and weaknesses.
Most of them can be found on [this](https://sobigdata.d4science.org/web/tagme/service-overview) page of the SoBigData Infrastructure.

**SWAT** is specifically a salient entity linker, which works best on long, well-constructed texts.
The fields returned are:
- **salience_class (int)**: 1 if the entity is deemed salient, 0 otherwise
- **salience_score (float ∈ [0, 1])**: the salience of the entity in the text (similar to the rho score in TagME)
- **spans (list)**: list of the occurrences of this entity in the text, each described by:
    - *start (int)*: the index of the starting character of the anchor
    - *end (int)*: the index of the ending character of the anchor
- **wiki_id (int)**: the Wikipedia identifier of the page
- **wiki_title (string)**: title of the Wikipedia page
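
As a sketch of how these fields fit together, the snippet below keeps only the entities flagged as salient and recovers their surface forms from `spans`. The annotations are trimmed from this notebook's SWAT output for the Leonardo text (only the first span of each entity is kept), the content is cut to the first words of that text, and `salient_mentions` is a hypothetical helper name.

```python
# First words of the Leonardo text, enough to cover the spans used below
content = ("Leonardo da Vinci was an Italian Renaissance polymath "
           "whose areas of interest")

# Trimmed annotations with values from this notebook's SWAT output
annotations = [
    {"salience_class": 1.0, "salience_score": 0.9471508264541626,
     "spans": [{"start": 0, "end": 17}], "wiki_title": "Leonardo_da_Vinci"},
    {"salience_class": 1.0, "salience_score": 0.5190669894218445,
     "spans": [{"start": 25, "end": 32}], "wiki_title": "Italy"},
    {"salience_class": 0.0, "salience_score": 0.4803982079029083,
     "spans": [{"start": 60, "end": 65}], "wiki_title": "Ecology"},
]

def salient_mentions(content, annotations):
    """Return (wiki_title, [surface forms]) for entities flagged as salient."""
    result = []
    for ann in annotations:
        if ann["salience_class"] == 1:
            surfaces = [content[s["start"]:s["end"]] for s in ann["spans"]]
            result.append((ann["wiki_title"], surfaces))
    return result

for title, surfaces in salient_mentions(content, annotations):
    print(title, surfaces)
```

The non-salient annotation, which links the word "areas" to the page 'Ecology', is discarded by the `salience_class` check.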

In [None]:
# this is the new URL of the annotator on the SoBigData Infrastructure
SWAT_ENDPOINT = "https://swat.d4science.org/salience"

# SWAT also requires a title of the content
def query_swat(title, content):
    document = json.dumps({"title": title, "content": content})
    r = requests.post(SWAT_ENDPOINT, data = document, params={'gcube-token': KEY})
    if r.status_code != 200:
        raise Exception("Error on article titled: {}\n{}".format(title, r.text))
    return r.json()["annotations"]

query_swat("Leonardo da Vinci", text)[:7]

[{'salience_class': 1.0,
  'salience_score': 0.9471508264541626,
  'spans': [{'end': 17, 'start': 0}, {'end': 422, 'start': 414}],
  'wiki_id': 18079,
  'wiki_title': 'Leonardo_da_Vinci'},
 {'salience_class': 1.0,
  'salience_score': 0.5190669894218445,
  'spans': [{'end': 32, 'start': 25}],
  'wiki_id': 14532,
  'wiki_title': 'Italy'},
 {'salience_class': 1.0,
  'salience_score': 0.5682003498077393,
  'spans': [{'end': 44, 'start': 33}],
  'wiki_id': 25532,
  'wiki_title': 'Renaissance'},
 {'salience_class': 0.0,
  'salience_score': 0.4803982079029083,
  'spans': [{'end': 65, 'start': 60}],
  'wiki_id': 9630,
  'wiki_title': 'Ecology'},
 {'salience_class': 0.0,
  'salience_score': 0.35197311639785767,
  'spans': [{'end': 77, 'start': 69}],
  'wiki_id': 146738,
  'wiki_title': 'Interest'},
 {'salience_class': 0.0,
  'salience_score': 0.42167073488235474,
  'spans': [{'end': 96, 'start': 87}],
  'wiki_id': 44312,
  'wiki_title': 'Invention'},
 {'salience_class': 1.0,
  'salience_score':

### RELATEDNESS
OK, but now that I have entities, how do I work with them? I need to know which are similar and which are not.
If we have no way of comparing entities, how do we unlock their full potential? How is this method more powerful than treating generic words as tokens?

There are several ways to obtain the relatedness of pairs of entities.
The main one shown in this notebook is to query TagME itself. TagME has an internal relatedness computation framework, so I can ask it how close two entities are to one another. This metric is computed directly on the Wikipedia Knowledge Graph.

In [None]:
# The URL where the relatedness is given
ENDPOINT_RELATEDNESS = "https://tagme.d4science.org/tagme/rel"

# In case I need efficiency I can do batch queries of 100 couples per HTTP call
def query_relatedness(e1, e2):
    # Entity titles need underscores in place of spaces; a single space separates entity one from entity two
    tt = e1.replace(" ", "_") + " " + e2.replace(" ", "_")
    payload = {"tt": tt, "gcube-token": KEY, "lang": LANG}
    r = requests.post(ENDPOINT_RELATEDNESS, payload)
    if r.status_code != 200:
        raise Exception("Error on relatedness computation: {}\n{}".format(tt, r.text))
    return r.json()
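
The comment in the cell above mentions batches of up to 100 pairs per HTTP call. The sketch below only prepares such batches: it formats every unordered pair of titles as a `tt` string, using the same underscore convention as `query_relatedness`, and splits them into chunks. How several pairs are packed into a single request is not shown in this notebook, so that part is left out; `relatedness_batches` is a hypothetical helper.

```python
from itertools import combinations

def relatedness_batches(titles, batch_size=100):
    """Yield lists of 'tt' strings, one per unordered pair, at most batch_size each."""
    pairs = [a.replace(" ", "_") + " " + b.replace(" ", "_")
             for a, b in combinations(titles, 2)]
    for i in range(0, len(pairs), batch_size):
        yield pairs[i:i + batch_size]

# Three titles give three pairs, which fit in a single batch
batches = list(relatedness_batches(["Biology", "Biotechnology", "Barack Obama"]))
print(batches)
```

Since the number of pairs grows quadratically with the number of entities, chunking of this kind matters quickly on longer documents.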

Now let's test the relatedness of three entities.
Two are closely related to one another (Biology and Biotechnology).
The last one (Barack Obama) is completely out of context.

In [None]:
first = query_relatedness("Biology", "Biotechnology")
second = query_relatedness("Barack Obama", "Biotechnology")
third = query_relatedness("Barack Obama", "Biology")
print(first['result'])
print(second['result'])
print(third['result'])

[{'couple': 'Biology Biotechnology', 'rel': 0.6070536971092224}]
[{'couple': 'Barack_Obama Biotechnology', 'rel': 0.23863035440444946}]
[{'couple': 'Barack_Obama Biology', 'rel': 0.16491788625717163}]


Let us now go back to that sentence we used for the PoS tagging, and see what TagME would find

In [None]:
sentence

'He was being opposed by her without any reason.\t    A plan is being prepared by charles for next project'

In [None]:
resp = query_tagme(sentence)
[(a["spot"], a["rho"]) for a in resp["annotations"]]

[('reason', 0.09262093156576157),
 ('plan', 0.07667145878076553),
 ('charles', 0.0429387167096138),
 ('for next', 0.07545013725757599),
 ('next', 0.001524612889625132),
 ('project', 0.04952556639909744)]

As expected, the results are poor: the sentence contains no named entities, and every rho score is well below the 0.3 threshold used earlier. Let's see a visual representation of the PoS tagging, which works much better in this context.