# Named Entity Recognition and Disambiguation

**Named-entity recognition (NER)** is a subtask of information extraction that seeks to locate and classify named entities mentioned in text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

**Named-entity disambiguation (NED)**, also known as named-entity linking, is the task of assigning a unique identity to entities (such as famous individuals, locations, or companies) mentioned in text.

## Named Entity Recognition

![](images/ner_pipeline.png)

Ok, but how does it work?

### Rule-Based Approaches

1) Regular expressions, e.g., to extract:

- telephone numbers
- e-mails
- dates
- prices
- locations (e.g., word + “river” indicates a river)

2) Gazetteers: lists of proper names of people, locations, organisations, etc.

3) Context patterns, such as:
- [PERSON] earns [MONEY]
- [PERSON] joined [ORGANIZATION]
- [PERSON] flies to [LOCATION]
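
The three kinds of rules above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a production tagger: the regular expressions, the two-entry gazetteer and the helper name `rule_based_ner` are all assumptions made up for this example, and each pattern covers a single simple format.

```python
import re

# Illustrative patterns only: real e-mails, dates and prices come in
# many more formats than these cover.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "PRICE": re.compile(r"[£$€]\d+(?:\.\d{2})?"),
    "LOCATION": re.compile(r"\b[A-Z][a-z]+ River\b"),  # word + "River"
}

# A tiny hypothetical gazetteer; real ones hold thousands of names.
GAZETTEER = {"London": "GPE", "Charles Dickens": "PERSON"}

# Context pattern "[PERSON] joined [ORGANIZATION]", approximating both
# slots with runs of capitalised words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
JOINED = re.compile(rf"({NAME}) joined ({NAME})")


def rule_based_ner(text):
    """Return (mention, label) tuples found by the hand-written rules."""
    entities = []
    # 1) Regular expressions
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append((match.group(), label))
    # 2) Gazetteer lookup (whole-word match)
    for name, label in GAZETTEER.items():
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            entities.append((name, label))
    # 3) Context pattern
    for match in JOINED.finditer(text):
        entities.append((match.group(1), "PERSON"))
        entities.append((match.group(2), "ORGANIZATION"))
    return entities


print(rule_based_ner("Charles Dickens paid £12.50 in London on 7/2/1850."))
print(rule_based_ner("Ada Lovelace joined Acme Corp in May."))
```

Rules like these are precise on the formats they anticipate but miss everything else, which is the main motivation for the learned approaches below.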


### Feature Engineering-Based Machine Learning

First, you need to have (a lot of) textual data annotated with the NER tags you are interested in discovering. Then you train an algorithm to predict the tag of each word from the word itself plus additional information (features) about it, for instance:
- if it starts with a capital letter
- if it's at the beginning of a sentence
- if it is made up of numbers and letters, etc.
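
As a rough sketch of what such hand-crafted features could look like in code, the hypothetical helper `token_features` below turns one token of a sentence into a feature dictionary. The function name, the feature names and the assumption that the sentence is already split into tokens are all illustrative choices, not part of any library.

```python
def token_features(tokens, i):
    """Build a feature dictionary for the token at position i of a sentence."""
    token = tokens[i]
    return {
        "word": token.lower(),
        "starts_with_capital": token[:1].isupper(),
        "is_sentence_start": i == 0,
        "has_digits_and_letters": (
            any(c.isdigit() for c in token) and any(c.isalpha() for c in token)
        ),
    }


sentence = ["London", "is", "a", "big", "city"]
features = [token_features(sentence, i) for i in range(len(sentence))]
print(features[0])
```

Dictionaries like these, paired with the annotated tags, are what a feature-based sequence classifier (for example a conditional random field) would be trained on.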

### Neural Network Approaches

Instead of providing the algorithm with additional information that you think is relevant for the task, you provide a vector representation of each unit under study (for instance each token) and the algorithm learns how to perform the task from those vectors. We'll cover this in more detail on the last day.
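
To make "a vector representation of each token" concrete, here is a toy sketch that looks up one vector per token. Everything in it is an assumption for illustration: the vectors are random and four-dimensional, whereas a real model would use learned or pre-trained vectors with hundreds of dimensions, and `vectorize` is a made-up helper name.

```python
import random

random.seed(0)  # make the toy vectors reproducible
EMBEDDING_DIM = 4  # real models use far more dimensions

sentence = ["London", "is", "a", "big", "city"]
vocabulary = sorted(set(sentence))

# One vector per known token; random here, learned in a real model.
embeddings = {
    tok: [random.uniform(-1, 1) for _ in range(EMBEDDING_DIM)]
    for tok in vocabulary
}


def vectorize(tokens, dim=EMBEDDING_DIM):
    """Map each token to its vector; unknown tokens get a zero vector."""
    return [embeddings.get(tok, [0.0] * dim) for tok in tokens]


vectors = vectorize(sentence)
print(len(vectors), "tokens,", len(vectors[0]), "dimensions each")
```

The network receives only these numbers, with no capitalisation or position flags, and has to work out during training which patterns in them signal an entity.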


In [None]:
import spacy

# Uncomment the next line the first time, to download the model
# spacy.cli.download("en_core_web_sm")

# Load the small English model
nlp = spacy.load("en_core_web_sm")

def ner_text(text: str) -> list:
    """
    Return the named entities identified in a string

    Args:
        text (str): a string

    Returns:
        list: A list containing a series of NER tuples (mention, label)
    """
    if not isinstance(text, str):
        raise TypeError("text must be a string")
    processed_text = nlp(text)
    # doc.ents holds entity spans (possibly several tokens long), not single tokens
    ners = [(ent.text, ent.label_) for ent in processed_text.ents]
    return ners


In [None]:
import pandas as pd

# We read a small sample of 19th Century British Library Books
sample_blbooks_df = pd.read_csv('data/bl_books_sample.csv')
sample_blbooks_df.head()

In [None]:

# We keep only books with a high OCR quality
clean_blbooks_df = sample_blbooks_df[sample_blbooks_df['mean_wc_ocr'] > 0.8]
# We convert the column text to a list
blbooks_content = clean_blbooks_df['text'].to_list()
print(len(blbooks_content))

In [None]:
first_content = blbooks_content[0]
print(first_content)
ners = ner_text(first_content)
print(ners)

✏️ **Exercise:**

Write a function that, given a string of text, finds the named entities (like `ner_text`) but returns the original text with each identified entity replaced by its NER label. For instance:

input: "London is a big city"

output: "GPE is a big city"

## Entity Linking



In [None]:
import spacy
import spacyfishing  # provides the "entityfishing" pipeline component

text_en = "Victor Hugo and Honoré de Balzac are French writers who lived in Paris."

nlp_model_en = spacy.load("en_core_web_sm")

nlp_model_en.add_pipe("entityfishing")

doc_en = nlp_model_en(text_en)

for ent in doc_en.ents:
    print((ent.text, ent.label_, ent._.kb_qid, ent._.url_wikidata, ent._.nerd_score))

In [None]:
doc_en = nlp_model_en(first_content)
for ent in doc_en.ents:
    print((ent.text, ent.label_, ent._.kb_qid, ent._.url_wikidata, ent._.nerd_score))

✏️ **Exercise:**

Write a function that, given a string of text, disambiguates the named entities and returns the original text with each identified entity replaced by its Wikidata ID. For instance:

input: "London is a big city"

output: "Q84 is a big city"

✏️ **Exercise:**

Extend your function above with a threshold that filters out entities with a very low score. The threshold can be hardcoded (for instance, keep only entities scoring over 0.2) or passed as a parameter of your function.