# Named Entity Recognition and Disambiguation

**Named-entity recognition (NER)** is a subtask of information extraction that seeks to locate named entities mentioned in text and classify them into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages.

**Named-entity disambiguation (NED)**, also known as named-entity linking, is the task of assigning a unique identity to entities (such as famous individuals, locations, or companies) mentioned in text.

## Named entity recognition

![](images/ner.png)

Ok, but how does it work?

### Rule-based approaches

1) Regular expressions, e.g. to extract:

- telephone numbers
- e-mails
- dates
- prices
- locations (e.g., word + “river” indicates a river)

2) Gazetteers: lists of proper names of people, locations, organisations, etc.

3) Context patterns, such as:
- [PERSON] earns [MONEY]
- [PERSON] joined [ORGANIZATION]
- [PERSON] flies to [LOCATION]
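
The three kinds of rules listed above can be combined in a few lines of Python. The sketch below is a toy illustration, not a production tagger: the regular expressions, the tiny gazetteer, the example sentence, and the function name `rule_based_ner` are all assumptions made up for this example.

```python
import re

# Hypothetical, deliberately simple patterns (assumptions for illustration only).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PRICE": re.compile(r"[£$€]\d+(?:[.,]\d+)*"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    # Rule from the list above: a capitalised word followed by "river" indicates a river.
    "LOCATION": re.compile(r"\b[A-Z][a-z]+ [Rr]iver\b"),
}

# Toy gazetteer: known proper names mapped to their tags.
GAZETTEER = {"Genoa": "LOCATION", "Marseilles": "LOCATION", "Victor Hugo": "PERSON"}

# Context pattern "[PERSON] joined [ORGANIZATION]", approximated with capitalised words.
JOINED = re.compile(r"\b([A-Z][a-z]+) joined ([A-Z]\w+)")


def rule_based_ner(text: str) -> list[tuple[str, str]]:
    """Return (entity_text, tag) pairs found by the regexes, the gazetteer and the context pattern."""
    found = []
    # 1) Regular expressions
    for tag, pattern in PATTERNS.items():
        found.extend((m.group(), tag) for m in pattern.finditer(text))
    # 2) Gazetteer lookup (whole-word match)
    for name, tag in GAZETTEER.items():
        if re.search(rf"\b{re.escape(name)}\b", text):
            found.append((name, tag))
    # 3) Context pattern
    for m in JOINED.finditer(text):
        found.append((m.group(1), "PERSON"))
        found.append((m.group(2), "ORGANIZATION"))
    return found


example = (
    "Victor Hugo sailed down the Seine river on 03/04/1862. "
    "Alice joined Acme and paid £12.50; contact info@example.org."
)
print(rule_based_ner(example))
```

Rules like these are easy to read and debug, but each one only covers the cases its author anticipated, which is why the approaches described next learn the patterns from annotated data instead.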


### Feature engineering-based Machine Learning

First, you need to have (a lot of) textual data annotated with the NER tags you are interested in discovering. Then you train an algorithm that, provided with the word plus additional information about it (features), predicts its NER tag. Such features include, for instance:
- if it starts with a capital letter
- if it's at the beginning of a sentence
- if it is made up of numbers and letters, etc.
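
As a minimal sketch of what such hand-crafted features could look like in code, the function below turns one token into a dictionary of features. The function name `word_features`, the feature names, and the example sentence are assumptions for illustration; in practice these dictionaries would be paired with the annotated tags and passed to a classifier of your choice.

```python
def word_features(tokens: list[str], i: int) -> dict:
    """Hand-crafted features for the token at position i of a tokenised sentence."""
    word = tokens[i]
    return {
        "word_lower": word.lower(),
        # Does the word start with a capital letter?
        "starts_with_capital": word[:1].isupper(),
        # Is the word at the beginning of the sentence?
        "is_sentence_start": i == 0,
        # Does the word mix digits and letters (e.g. "A40")?
        "has_digits_and_letters": any(c.isdigit() for c in word)
        and any(c.isalpha() for c in word),
    }


# Made-up example sentence, already split into tokens.
sentence = ["Philip", "took", "the", "A40", "to", "Genoa"]
features = [word_features(sentence, i) for i in range(len(sentence))]

for token, feats in zip(sentence, features):
    print(token, feats)
```

Note that "Philip" is both capitalised and sentence-initial, so capitalisation alone says little about it, whereas the capital letter in "Genoa" is a stronger hint. Combining several features is what lets the algorithm tell such cases apart.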

### Deep Learning

Instead of providing the algorithm with additional information that you think is relevant for the task, you provide a vector representation of each unit under study (for instance each token) and the algorithm learns how to perform the task. We'll cover this in more detail on the last day.
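
To make "a vector representation of each token" concrete, here is a toy sketch using only the standard library. It is not a trained model: the helper names `build_vocab` and `init_embeddings`, the vector size of 4, and the random initial values are assumptions for illustration. In a real system the vectors are learned during training or taken from a pre-trained model.

```python
import random


def build_vocab(tokens):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def init_embeddings(vocab_size, dim, seed=0):
    """Create one small random vector per vocabulary entry (placeholder for learned values)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]


tokens = ["Philip", "took", "the", "road", "to", "Genoa"]
vocab = build_vocab(tokens)
embeddings = init_embeddings(len(vocab), dim=4)

# Each token is replaced by its vector; this sequence of vectors is the model's input.
vectors = [embeddings[vocab[tok]] for tok in tokens]
print(len(vectors), "tokens, each represented by", len(vectors[0]), "numbers")
```

Unlike the hand-crafted features of the previous approach, nothing here encodes capitalisation or position explicitly; the model is expected to pick up whatever regularities matter from the data.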


In [13]:
import spacy

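# Uncomment to download the model the first time you run this notebook
#spacy.cli.download("en_core_web_sm")

# Load the small English model
nlp = spacy.load("en_core_web_sm")

def ner_text(text: str) -> list:
    """Return a list of (entity text, entity label) pairs found in text."""
    assert isinstance(text, str)
    processed_text = nlp(text)
    ner = [(ent.text, ent.label_) for ent in processed_text.ents]
    return ner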


In [14]:
import pandas as pd

# Load the sample of books
sample_blbooks_df = pd.read_pickle('data/bl_books_sample.pickle')
# Keep only the rows whose 'mean_wc_ocr' value is above 0.8
clean_blbooks_df = sample_blbooks_df[sample_blbooks_df['mean_wc_ocr'] > 0.8]
blbooks_content = clean_blbooks_df['text'].to_list()
print(len(blbooks_content))

1397


In [18]:
for content in blbooks_content[:2]:
    # Find named entities
    ners = ner_text(content)
    print(content)
    print(ners)
    print(" ")

HISTORY OF ENGLAND. 8 chap, of Vezelay on the borders of Burgundy :h Philip v^^and Richard, on their arrival there, found their 1190. combined army amount to 100,000 men ;' a mighty setbJttne. force) animated with glory and religion, conducted by two warlike monarch?, provided with every thing which their several dominions could supply, and not to be overcome but by their own mis- conduct, or by the unsurmountable obstacles of nature. THE French prince and the English here re °XZ^e iterated their promises of cordial friendship, pledged their faith not to invade each other's dominions during the crusade, mutually exchanged the oaths of all their barons and prelates to the same effect, and subjected themselves to the penalty of interdicts and excommunications, if they should ever violate this public and solemn engagement. They then separated ; Philip took the road to Genoa, Richard that to Marseilles, with a view of meeting their fleets, which were severally appointed to rendezvous 14th 

✏️ **Exercise:**

Write a program that, given a string of text, finds the named entities (like `ner_text`) but returns the original text with each identified entity replaced by (string, ner_tag). For instance:

input: "New York is a big city"

output: "(New York, GPE) is a big city"

## Entity Linking



In [9]:
import spacy
import spacyfishing  # provides the "entityfishing" pipeline component

text_en = "Victor Hugo and Honoré de Balzac are French writers who lived in Paris."

nlp_model_en = spacy.load("en_core_web_sm")

nlp_model_en.add_pipe("entityfishing")

doc_en = nlp_model_en(text_en)

for ent in doc_en.ents:
    print((ent.text, ent.label_, ent._.kb_qid, ent._.url_wikidata, ent._.nerd_score))

('Victor Hugo', 'PERSON', 'Q535', 'https://www.wikidata.org/wiki/Q535', 0.972)
('Honoré de Balzac', 'PERSON', 'Q9711', 'https://www.wikidata.org/wiki/Q9711', 0.9724)
('French', 'NORP', 'Q121842', 'https://www.wikidata.org/wiki/Q121842', 0.3739)
('Paris', 'GPE', 'Q90', 'https://www.wikidata.org/wiki/Q90', 0.5652)
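
The last value in each tuple is the score returned by `ent._.nerd_score`, and entities that could not be linked come back with `None`. One way to use it is to keep only the links above a chosen threshold. The sketch below works on the tuples printed above; the cut-off of 0.5 and the function name `confident_links` are assumptions for illustration, not values recommended by the library.

```python
# Linked entities copied from the output above: (text, label, QID, URL, score).
linked = [
    ("Victor Hugo", "PERSON", "Q535", "https://www.wikidata.org/wiki/Q535", 0.972),
    ("Honoré de Balzac", "PERSON", "Q9711", "https://www.wikidata.org/wiki/Q9711", 0.9724),
    ("French", "NORP", "Q121842", "https://www.wikidata.org/wiki/Q121842", 0.3739),
    ("Paris", "GPE", "Q90", "https://www.wikidata.org/wiki/Q90", 0.5652),
]

MIN_SCORE = 0.5  # hypothetical cut-off: an assumption to tune on your own data


def confident_links(entities, min_score=MIN_SCORE):
    """Keep entities that were linked (score is not None) with a score >= min_score."""
    return [e for e in entities if e[4] is not None and e[4] >= min_score]


for text, label, qid, url, score in confident_links(linked):
    print(text, qid, score)
```

With this threshold the link for "French" is dropped while the other three are kept. A stricter or looser cut-off trades missed links against wrong ones, so it is worth checking a sample of results by hand before settling on a value.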


In [11]:
for content in blbooks_content[:10]:
    doc_en = nlp_model_en(content)
    for ent in doc_en.ents:
        print((ent.text, ent.label_, ent._.kb_qid, ent._.url_wikidata, ent._.nerd_score))

('8', 'CARDINAL', 'Q5349142', 'https://www.wikidata.org/wiki/Q5349142', 0.0826)
('Burgundy', 'PERSON', 'Q1173', 'https://www.wikidata.org/wiki/Q1173', 0.1795)
('Philip v^^and', 'ORG', None, None, None)
('1190', 'DATE', 'Q19736', 'https://www.wikidata.org/wiki/Q19736', 0.0889)
('100,000', 'CARDINAL', 'Q720751', 'https://www.wikidata.org/wiki/Q720751', 0.1341)
('setbJttne', 'ORG', None, None, None)
('two', 'CARDINAL', 'Q200', 'https://www.wikidata.org/wiki/Q200', 0.0826)
('French', 'NORP', 'Q142', 'https://www.wikidata.org/wiki/Q142', 0.1047)
('English', 'LANGUAGE', 'Q1860', 'https://www.wikidata.org/wiki/Q1860', 0.1033)
('Philip', 'PERSON', 'Q43675', 'https://www.wikidata.org/wiki/Q43675', 0.0826)
('Genoa', 'GPE', 'Q1449', 'https://www.wikidata.org/wiki/Q1449', 0.2786)
('Richard', 'PERSON', None, None, None)
('Marseilles', 'PERSON', 'Q23482', 'https://www.wikidata.org/wiki/Q23482', 0.3798)
('14th', 'ORDINAL', 'Q188116', 'https://www.wikidata.org/wiki/Q188116', 0.0826)
('tnese harbours',