# Normalization

**Previously**:

* Introduction to Information Extraction
* Named Entity Recognition

Named Entity Recognition (NER) methods recognize mentions of target entities in text and (typically) assign each a type (e.g. `PERSON`, `LOCATION`).

---

<img width="90%" src="https://raw.githubusercontent.com/TurkuNLP/turku-ner-corpus/master/docs/example.png">

---

While NER is important as a starting point for structured information extraction, it doesn't identify the real-world entities referred to in text, i.e. make connections such as these:

---

<img src="https://github.com/TurkuNLP/Text_Mining_Course/raw/master/figs/entity-linking-marin.png">

---

Issues associating mentions in text to the things they refer to include:

* **Ambiguity** of names: e.g. _George Bush_ can refer either to the 41st or 43rd US president (among others)
    * common names like _Emma Korhonen_ have dozens of potential referents even in comparatively small Finland
* **Variability** of mentions: e.g. _George Bush_, _George Walker Bush_, _Bush Jr._, _GWB_ and _Dubya_ referring to the same person
    * Morphological variability: _Turku_, _Turun_, _Turkua_, _Turkuun_, _Turkuhan_, _Turkukaan_, _Turkukin_, _Turkuunkin_, _Turkuunkaan_, _Turkuakin_, ...

---

<a href="https://github.com/TurkuNLP/Text_Mining_Course/raw/master/figs/george-bush.png"><img src="https://github.com/TurkuNLP/Text_Mining_Course/raw/master/figs/george-bush.png"></a>

---


Tasks related to these challenges are variously termed _(named entity) normalization_, _grounding_, _entity linking_, and [_wikification_](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.cikm07.pdf). We can separate two closely related tasks, here termed for clarity:

* **Mention normalization**: restoring strings in text to a standardized surface form (e.g. _Turkuunkaan_ → _Turku_)
* **Entity linking**: associating mentions in text with entries representing them in a knowledge base (e.g. Wikipedia/Wikidata)

We'll introduce these in detail in the following.

Overall, we'll be sketching a normalization approach applying the following steps for each entity mention in text:

1. Normalize mention to a standardized ("dictionary") form
2. Find candidate entities in a knowledge base whose names or aliases match that form
3. Disambiguate between candidate entities based on e.g. mention context

Note that in an NLP pipeline, these steps would follow NER, and we assume NE tagging as a starting point.

---

# Mention normalization

Restoring text strings to standardized surface forms

Challenges include

* Lemmatization: _Turkuunkaan_ → _Turku_
* Truecasing: _TURKU_, _turku_ → _Turku_
* Multi-word names: _Turun yliopistonkin_, _Turunkin yliopisto_ → _Turun yliopisto_ (not _Turku yliopisto_!) 

## Lemmatization

Mapping inflected forms of words to their dictionary forms:

* _voi olla niin_ → _voida olla niin_
* _voi on pilalla_ → _voi olla pilalla_

Lemmatization as a task has been covered previously; here we'll simply use the [Turku neural parser](http://turkunlp.org/Turku-neural-parser-pipeline/) for lemmatization as a remotely set up service. You can easily set up similar services by following the [installation instructions](http://turkunlp.org/Turku-neural-parser-pipeline/docker.html).

If you're interested in the technical aspects of the lemmatization method, these are presented in detail by [Kanerva et al. 2020](https://www.cambridge.org/core/services/aop-cambridge-core/content/view/9341ECA9B562DAF55E2F3F966554A667/S1351324920000224a.pdf/div-class-title-universal-lemmatizer-a-sequence-to-sequence-model-for-lemmatizing-universal-dependencies-treebanks-div.pdf). [PDF]

In [1]:
!pip install --quiet requests conllu

import requests
import conllu


def parse_sentence(sentence):
    SERVER_URL = 'http://86.50.253.19:8002/parser/parse'
    response = requests.post(SERVER_URL, data={ 'text': sentence })
    return conllu.parse(response.text)[0]

(**NOTE**: if you're running this after the 2021 spring course, this service is likely no longer available at the above URL.)

In [2]:
for sentence in ('voi olla niin', 'voi on pilalla'):
    tokens = parse_sentence(sentence)
    print(sentence.split(), '→', [t['lemma'] for t in tokens])

['voi', 'olla', 'niin'] → ['voida', 'olla', 'niin']
['voi', 'on', 'pilalla'] → ['voi', 'olla', 'pilalla']


We can use a lemmatizer to identify the dictionary forms of simple names:

In [3]:
for form in ('Turun', 'Turkua', 'Turkuun', 'Turusta', 'TURKUUNKOHAN'):
    print(form, '→', parse_sentence(form)[0]['lemma'])

Turun → Turku
Turkua → Turku
Turkuun → Turku
Turusta → Turku
TURKUUNKOHAN → turkuunko


Note that statistical and machine learning-based lemmatizers may require context to work well, and may fail in particular for rare and nonstandard forms.

## Truecasing

Restoring "normal" case to text with non-standard case (e.g. _ALL UPPERCASE_)

* _GEORGE BUSH WENT TO WASHINGTON_ → _George Bush went to Washington_
* _the bush that george planted_ → _The bush that George planted_

Can be part of lemmatization or considered as a separate task: lemmatization methods do not necessarily perform well on input text with non-standard case.

In [4]:
sentence = 'TÄSSÄ ON PELKÄSTÄÄN ISOJA KIRJAIMIA' 
tokens = parse_sentence(sentence)
print(sentence.split(), '→', [t['lemma'] for t in tokens])

['TÄSSÄ', 'ON', 'PELKÄSTÄÄN', 'ISOJA', 'KIRJAIMIA'] → ['tämä', 'olla', 'pelkästään', 'iso', 'kirjain']


Given a sufficient amount of correctly cased text, truecasing can be performed highly reliably using a language modeling approach (see [Lita et al. 2003](https://www.aclweb.org/anthology/P03-1020.pdf) [PDF]).

Truecasing is rarely considered as a separate step in recent lemmatization methods, but may be of value in particular when working with irregularly cased documents.
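
As a rough illustration of the statistical idea (a minimal sketch, not the language modeling method of the paper cited above), we can count the most frequent casing of each word in correctly cased text and restore that casing. The training sentences and the function names `train_truecaser` and `truecase` are made up for this example.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Count how often each cased form occurs for every lowercased word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Skip the sentence-initial word: its capital letter is uninformative
        for word in sentence.split()[1:]:
            counts[word.lower()][word] += 1
    return counts


def truecase(sentence, counts):
    """Replace each word with its most frequent cased form, if known."""
    words = []
    for word in sentence.split():
        forms = counts.get(word.lower())
        words.append(forms.most_common(1)[0][0] if forms else word.lower())
    # Capitalize the first letter of the sentence
    result = ' '.join(words)
    return result[:1].upper() + result[1:]


# Tiny made-up training corpus; a real model needs far more text
corpus = [
    'Yesterday George Bush went to Washington',
    'They say George planted a bush',
    'He saw the bush in Washington',
]
counts = train_truecaser(corpus)
print(truecase('the bush that george planted', counts))
```

Because this sketch looks at each word in isolation, it cannot tell the name _Bush_ from the noun _bush_; resolving such cases is where the context provided by a language model comes in.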

## Multi-word names

Multi-word names can inflect in unpredictable ways, and lemmatizing words separately is not always correct:

* _Turun Energian_ → _Turku Energia_ ✅ 
* _Turun yliopiston_ → _Turku yliopisto_ ❌ 

In [5]:
for sentence in ('Turun Energian', 'Turun yliopiston'):
    tokens = parse_sentence(sentence)
    print(sentence.split(), '→', [t['lemma'] for t in tokens])

['Turun', 'Energian'] → ['Turku', 'energia']
['Turun', 'yliopiston'] → ['Turku', 'yli#opisto']


Note that lemmatizing _Turun_ → _Turku_ in _Turun yliopiston_ is not a mistake by the lemmatizer: this is the correct dictionary form. The issue is rather that the multi-word name does not use the dictionary form.

Only lemmatizing the head word will likewise fail in some cases:

* _Turun yliopiston_ → _Turun yliopisto_ ✅
* _Turun yliopistonkin_ → _Turun yliopisto_ ✅
* _Turunkin yliopiston_ → _Turunkin yliopisto_ ❌

There is no "standard" NLP task setting for this particular challenge (and some of the issues noted here are somewhat specific to Finnish). However, we can consider some options:

* Knowledge-based approach: gather standard forms of names from resources such as Wikipedia or Wikidata (see below)
* Statistical approach: identify most common forms of names in large corpora of automatically tagged text

Below, we'll briefly sketch the latter approach using summary data from the [Finnish internet parsebank](https://turkunlp.org/finnish_nlp.html#finnish-internet-parsebank-) tagged using the [Turku NER tagger](https://turkunlp.org/fin-ner.html).

In [6]:
!wget -nc https://a3s.fi/TKO_8964_2021/parsebank-freq-type-form.tsv

File ‘parsebank-freq-type-form.tsv’ already there; not retrieving.



The above file contains a frequency-sorted list of tagged strings and the most common types assigned to each by the tagger:

In [7]:
with open('parsebank-freq-type-form.tsv') as f:
    for i in range(5):
        print(next(f).rstrip())

4244115	CARDINAL	yksi
3024280	CARDINAL	kaksi
2842207	GPE	Suomen
2733604	GPE	Suomessa
2414878	CARDINAL	2


The idea is to find the most frequent form of the words other than the head word (here, heuristically, the last word) to determine the most likely form.

We'll here use stemming to minimize the computational cost. A serious implementation would lemmatize the mentions in context.

In [8]:
!pip install --quiet snowballstemmer

from snowballstemmer import stemmer


mention = 'Turunkin yliopiston'

finnish_stemmer = stemmer('finnish')

stems = finnish_stemmer.stemWords(mention.split())

with open('parsebank-freq-type-form.tsv') as f:
    for line in f:
        freq, type_, form = line.rstrip('\n').split('\t')
        if finnish_stemmer.stemWords(form.split()) == stems:
            print(f'{freq} {form} → {form.split()[:-1]}')
        if int(freq) < 1000:
            break

19642 Turun yliopiston → ['Turun']
10805 Turun yliopisto → ['Turun']
7597 Turun yliopistossa → ['Turun']
4757 Turun yliopistosta → ['Turun']


This concludes our brief look into mention (string) normalization. Let's next look at entity linking and how these techniques relate to that task.

---

# Entity linking

Associating mentions in text with identifiers that represent the real-world entities that they refer to

---

<img src="https://github.com/TurkuNLP/Text_Mining_Course/raw/master/figs/entity-linking-marin.png">

---

In the following, we'll assume that the mentions have been normalized to standardized forms (covered above).

By linking entities in text to representations of entities it is possible to create fully structured representations of statements, such that no part of the representation requires human interpretation.

We can illustrate the difference between unstructured (textual) and structured resources by comparing the [Wikipedia](https://en.wikipedia.org/wiki/Douglas_Adams) and [Wikidata](https://www.wikidata.org/wiki/Q42) entries for the same person:

---

<a href="https://github.com/TurkuNLP/Text_Mining_Course/raw/master/figs/douglas-adams.png"><img src="https://github.com/TurkuNLP/Text_Mining_Course/raw/master/figs/douglas-adams.png"></a>

---

The information that is stored in structured form is conveyed on the Wikipedia page in natural language, e.g.

* _Douglas Noel Adams [...] was an English author_ ⇒ `instance_of(Douglas_Adams, human)`
* _Douglas Noel Adams (11 March 1952 – 11 May 2001)_ ⇒ `date_of_birth(Douglas_Adams, 11 March 1952)`

This mapping is an information extraction task, specifically relation extraction. However, we won't consider this task in detail yet, and will instead focus on the use of these resources for entity linking.

Resources such as Wikipedia and Wikidata provide two key pieces of information for entity linking:

1. Unique identifiers associating mentions in text with real-world entities
2. The names and synonyms of those real-world entities

As an example, let's again have a look at the Wikidata page for Douglas Adams, <https://www.wikidata.org/wiki/Q42>:

---

<img src="https://github.com/TurkuNLP/Text_Mining_Course/raw/master/figs/douglas-adams-wikidata.png">

---

We find among many other pieces of information an ID, here `Q42`, and various names and aliases, including in other languages; via Wikipedia links, we can infer that Douglas Adams can be referred to (among many others) as e.g.

* Дуглас Адамс (Russian)
* ダグラス・アダムズ (Japanese)
* 더글러스 애덤스 (Korean)
* Ντάγκλας Άνταμς (Greek)
* דאגלס אדמס (Hebrew)

(Note that while these transliterations could mostly be generated straightforwardly from _Douglas Adams_, the same doesn't apply to all aliases, such as _Douglas Noel Adams_.)

With these pieces of information -- the ID and names -- we can characterize the stages of a common approach to entity linking:

* **Candidate generation**: for each mention in text, identify which entities (IDs) it may refer to
* **Entity mention disambiguation**: given a mention in text and a set of candidate entities, identify the mentioned entity (ID).

While these two subproblems can be addressed jointly in a single system, we'll treat them separately here for simplicity.

---

## Candidate generation for entity normalization

Candidate generation primarily seeks to address the _variability_ of entity names, i.e. the many possible ways in which people, places, etc. can be referred to.

We can formalize the candidate generation problem e.g. as follows:

- Given a knowledge base $K$ containing representations of real-world entities, and
- Given an entity mention $m$ occurring in a document $d$
- Return a subset $\{ k_1, k_2, \ldots, k_n \} \subset K$ of representations that includes the representation of the entity referred to by $m$

Note that there is a trivial solution: always return the full knowledge base $K$. While this would never omit the referred entity if it is included in the knowledge base, this is not normally viable in practice as knowledge bases can be very large (e.g. Wikidata contains nearly 100 million entries) and disambiguation is costly. Candidate generation should thus strike a balance between recall and efficiency.
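
One way to make this balance measurable, sketched here under the assumption that we have a small set of mentions with known ("gold") entity IDs, is to report the recall of the candidate sets together with their average size. The function names and the example data below are made up for illustration.

```python
def candidate_recall(gold_ids, candidate_sets):
    """Fraction of mentions whose gold entity ID is among the generated candidates."""
    hits = sum(1 for gold, cands in zip(gold_ids, candidate_sets) if gold in cands)
    return hits / len(gold_ids)


def mean_candidates(candidate_sets):
    """Average number of candidates passed on to disambiguation."""
    return sum(len(c) for c in candidate_sets) / len(candidate_sets)


# Made-up example: three mentions with their gold IDs and generated candidates
gold_ids = ['Q207', 'Q42', 'Q23']
candidate_sets = [{'Q207', 'Q23505'}, {'Q42'}, {'Q640'}]

print(f'recall: {candidate_recall(gold_ids, candidate_sets):.2f}')
print(f'mean candidates per mention: {mean_candidates(candidate_sets):.2f}')
```

Returning the full knowledge base would push recall to its maximum but also make the second number equal to the size of $K$.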

---

Practically, consider the following example, where the knowledge base $K$ is Wikidata, the document $d$ = 
```
Former President George Bush on Sunday congratulated President-elect Joe Biden and Vice President-elect Kamala Harris on their election. The 43rd president of the United States said ...
```

and the typed mention $m$ = (`George Bush`, `PERSON`). That is, we need to determine which entries in Wikidata the person name _George Bush_ could refer to in this context.

For simplicity, let's ignore the document context and try to find people named _George Bush_ in Wikidata. We'll consider two options: querying the knowledge base directly, and reading a database dump.

---

### Querying Wikidata

With roughly 100 million entries, Wikidata is the most extensive freely available broad-coverage resource with information on real-world entities. While the details of using and querying this resource fall outside our scope here (and _you do not need to know_ any RDF/SPARQL syntax), it's good to know some of the basic ideas.

Wikidata is represented in [RDF](https://en.wikipedia.org/wiki/Resource_Description_Framework), and (simplifying) we can think of its entries as triples of the format

```
ENTITY-1    RELATION    ENTITY-2
```

So e.g. the information that `Douglas Adams` is a `Person` could be written in the abstract as

```
Douglas Adams    instance_of    Person
```

However, instead of human-readable (and potentially ambiguous!) strings, Wikidata uses unique IDs. The ID for Douglas Adams, the English author, happens to be [`Q42`](https://www.wikidata.org/entity/Q42), the ID for the `instance_of` relation is [`P31`](https://www.wikidata.org/wiki/Property:P31), and the ID for humans (the species _Homo sapiens_) is [`Q5`](https://www.wikidata.org/entity/Q5). So, we would more specifically write

```
Q42    P31    Q5
```

to assert that Douglas Adams is a person (a member of the species _Homo sapiens_).

Wikidata can be queried using the [SPARQL](https://en.wikipedia.org/wiki/SPARQL) query language, where simple queries can be thought of in the abstract as taking forms such as

```
ENTITY-1     RELATION    ?
```

to ask what entities `ENTITY-1` is related to via `RELATION`. In practice, we can query e.g.

```
SELECT ?country WHERE {
    wd:Q42 wdt:P27 ?country.
}
```

to ask what was the country of citizenship ([`P27`](https://www.wikidata.org/wiki/Property:P27)) of Douglas Adams ([`Q42`](https://www.wikidata.org/entity/Q42)) (<a href="https://query.wikidata.org/#SELECT %3Fcountry %0AWHERE { wd%3AQ42 wdt%3AP27 %3Fcountry. }">Try this query!</a>). This gives as response _United Kingdom_ ([`Q145`](https://www.wikidata.org/wiki/Q145)).

We can similarly query e.g. for the names of cities in Finland (<a href="https://query.wikidata.org/#SELECT%20%3Fcity%20%3Fname%20WHERE%20%7B%0A%20%20%3Fcity%20wdt%3AP31%20wd%3AQ515.%0A%20%20%3Fcity%20wdt%3AP17%20wd%3AQ33.%0A%20%20%3Fcity%20wdt%3AP1705%20%3Fname%20%20%0A%7D%0ALIMIT%20100">Try this query!</a>):

```
SELECT ?city ?name WHERE {
  ?city wdt:P31 wd:Q515.        # explanation: ?city instance_of City.
  ?city wdt:P17 wd:Q33.         # explanation: ?city country Finland.
  ?city wdt:P1705 ?name         # explanation: ?city native_label ?name
}
```

(Queries such as these could be used to generate dictionaries for entity normalization.)
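
As a hedged sketch of how such a dictionary could be built programmatically, the following sends the city query to the public Wikidata SPARQL endpoint using only the Python standard library and maps each name to the entities carrying it. The helper names `run_query` and `name_dictionary` and the `User-Agent` string are assumptions made for this example; the code only defines the functions and makes no network request until `run_query` is called.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = 'https://query.wikidata.org/sparql'

CITY_QUERY = '''
SELECT ?city ?name WHERE {
  ?city wdt:P31 wd:Q515.
  ?city wdt:P17 wd:Q33.
  ?city wdt:P1705 ?name
}
LIMIT 100
'''


def run_query(query):
    """Send a SPARQL query to the Wikidata endpoint and return the parsed JSON."""
    url = ENDPOINT + '?' + urllib.parse.urlencode({'query': query, 'format': 'json'})
    # The User-Agent string is a placeholder; replace it with your own
    request = urllib.request.Request(url, headers={'User-Agent': 'text-mining-course-example/0.1'})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def name_dictionary(result):
    """Map each name in a query result to the set of entity URIs that carry it."""
    names = {}
    for row in result['results']['bindings']:
        names.setdefault(row['name']['value'], set()).add(row['city']['value'])
    return names
```

With network access, `name_dictionary(run_query(CITY_QUERY))` would then give a name-to-entity mapping for (up to 100) Finnish cities.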

Coming back to our motivating example above, we can query Wikidata for people who have the name (`rdfs:label`) or alias (`skos:altLabel`) "George Bush" as follows (<a href="https://query.wikidata.org/#SELECT%20DISTINCT%20%3Fperson%20%3Fdescription%0AWHERE%0A%7B%0A%20%20%3Fperson%20wdt%3AP31%20wd%3AQ5.%0A%20%20%7B%20%3Fperson%20rdfs%3Alabel%20%22George%20Bush%22%40en.%20%7D%20UNION%0A%20%20%7B%20%3Fperson%20skos%3AaltLabel%20%22George%20Bush%22%40en.%20%20%7D%0A%20%20%3Fperson%20schema%3Adescription%20%3Fdescription.%0A%20%20FILTER%28LANG%28%3Fdescription%29%20%3D%20%22en%22%29%0A%7D%0A">Try this query!</a>):

```
SELECT DISTINCT ?p ?d
WHERE
{
  ?p wdt:P31 wd:Q5.                            # explanation: ?p instance-of Human
  { ?p rdfs:label "George Bush"@en. } UNION    # explanation: ?p has-name "George Bush" (in English) or
  { ?p skos:altLabel "George Bush"@en.  }      # explanation: ?p has-alias "George Bush" (in English)
  ?p schema:description ?d.                    # explanation: ?p has-description ?d
  FILTER(LANG(?d) = "en")                      # explanation: the language of ?d is English
}
```

Producing the following result:

	
| person        | description                                     |
|:--------------|:------------------------------------------------|
| wd:Q5537484   | racing driver                                   |
| wd:Q100766406 | college basketball player (1950–1950) Toledo    |
| wd:Q5537488   | American biblical scholar and pastor            |
| wd:Q28445429  | association football player (1883-1936)         |
| wd:Q207       | 43rd president of the United States             |
| wd:Q23505     | 41st president of the United States (1924-2018) |

---

### Wikidata dumps

In practical systems, we likely don't want to make a SPARQL query every time we want to look up candidate names. Further, RDF triples are (arguably) not the most approachable of representations. Fortunately, the entire Wikidata knowledge base is [available for download](https://dumps.wikimedia.org/wikidatawiki/entities/), also in JSON format: https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 .

This data is over 60 GB compressed, so we won't be demonstrating the use of the entire knowledge base here. Instead, let's first look at the JSON structure of the Douglas Adams entry: https://www.wikidata.org/wiki/Special:EntityData/Q42.json

---

<img width="35%" src="https://github.com/TurkuNLP/Text_Mining_Course/raw/master/figs/wikidata-q42.png">

---

Note here the keys `id`, `labels`, and `aliases`. These are all we need here: the unique ID, and the names and aliases associated with that ID. Let's look at this in code:

In [9]:
import requests


response = requests.get('https://www.wikidata.org/wiki/Special:EntityData/Q42.json')
data = response.json()
entity = data['entities']['Q42']

id_ = entity['id']
label = entity['labels']['en']['value']
aliases = [a['value'] for a in entity['aliases']['en']]

print(f"entity {id_} has label '{label}' and aliases {aliases}")

entity Q42 has label 'Douglas Adams' and aliases ['Douglas Noel Adams', 'Douglas Noël Adams', 'Douglas N. Adams']


We've prepared the subset of Wikidata people (entities that have `instance_of human`) filtered just to these pieces of information in a JSON lines format ([conversion script](https://github.com/TurkuNLP/Text_Mining_Course/)):

In [10]:
!wget -nc https://a3s.fi/TKO_8964_2021/wikidata-people.jsonl

File ‘wikidata-people.jsonl’ already there; not retrieving.



We can use this data as above to (relatively quickly) access information about all representations of people in Wikidata:

In [11]:
import json


with open('wikidata-people.jsonl') as f:
    for i in range(10):
        entity = json.loads(f.readline())
        print(entity['id'], '\t', entity['labels']['en']['value'])

Q23 	 George Washington
Q42 	 Douglas Adams
Q1868 	 Paul Otlet
Q207 	 George W. Bush
Q297 	 Diego Velázquez
Q368 	 Augusto Pinochet
Q501 	 Charles Baudelaire
Q619 	 Nicolaus Copernicus
Q633 	 Neil Young
Q640 	 Harald Krichel


(There is also a wealth of off-the-shelf tools for working with Wikidata available at https://www.wikidata.org/wiki/Wikidata:Tools/For_programmers)

### Approximate string matching

The combination of mention normalization methods and access to names and aliases in a knowledge base offers one possible solution to candidate generation:

* Normalize the given mention $m$ (potentially using context $d$) to a standard ("dictionary") form $s$
* Return all entries in the knowledge base that contain a name or alias matching $s$
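
The lookup in the second step can be sketched as an index from names and aliases to entity IDs. The toy entries below mimic the `id`/`labels`/`aliases` structure of the Wikidata JSON shown earlier, but their alias lists are abbreviated and only illustrative, and `build_name_index` is a hypothetical helper name.

```python
from collections import defaultdict

# Toy entries in the structure of the Wikidata JSON; alias lists are illustrative only
entities = [
    {'id': 'Q207', 'labels': {'en': {'value': 'George W. Bush'}},
     'aliases': {'en': [{'value': 'George Bush'}, {'value': 'George Walker Bush'}]}},
    {'id': 'Q23505', 'labels': {'en': {'value': 'George H. W. Bush'}},
     'aliases': {'en': [{'value': 'George Bush'}]}},
    {'id': 'Q42', 'labels': {'en': {'value': 'Douglas Adams'}},
     'aliases': {'en': [{'value': 'Douglas Noel Adams'}]}},
]


def build_name_index(entities, lang='en'):
    """Map every label and alias in the given language to the IDs that carry it."""
    index = defaultdict(set)
    for entity in entities:
        label = entity.get('labels', {}).get(lang)
        names = [label['value']] if label else []
        names += [a['value'] for a in entity.get('aliases', {}).get(lang, [])]
        for name in names:
            index[name].add(entity['id'])
    return index


index = build_name_index(entities)
print(sorted(index['George Bush']))
```

An exact-match index like this is fast, but it only finds strings that appear in the knowledge base character for character.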

However, no knowledge base is ever absolutely complete, and even if one were, mentions can have typos or irregularities. If our knowledge base contains e.g. the name _George W. Bush_ but not the form _George W Bush_ (without the dot), we would still like to match the latter.

To solve minor deviations in string forms, we can apply _approximate_ (or _fuzzy_) _string matching_ methods. A core set of algorithms for this task involves the notion of _edit distance_, i.e. how many insertions, deletions or substitutions need to be made to edit one string into another.

For example, the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) between two strings can be calculated using tools available for `pip install`:

In [12]:
!pip install --quiet python-Levenshtein

from Levenshtein import distance

for s1, s2 in [('George W. Bush', 'George W Bush'),
               ('George W. Bush', 'George Bush'),
               ('Levenshtein', 'Lehvenstien')]:
    print(f'distance("{s1}", "{s2}") = {distance(s1, s2)}')

distance("George W. Bush", "George W Bush") = 1
distance("George W. Bush", "George Bush") = 3
distance("Levenshtein", "Lehvenstien") = 4
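
To make the definition concrete, the same distances can be reproduced with a short dynamic-programming implementation. This is a plain-Python sketch for illustration (the function name `levenshtein` is our own), not the code used by the library above.

```python
def levenshtein(s1, s2):
    """Minimum number of insertions, deletions and substitutions turning s1 into s2."""
    # previous[j] holds the distance between the processed prefix of s1 and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]    # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,                 # delete c1
                current[j - 1] + 1,              # insert c2
                previous[j - 1] + (c1 != c2),    # substitute (free if characters match)
            ))
        previous = current
    return previous[-1]


for s1, s2 in [('George W. Bush', 'George W Bush'),
               ('George W. Bush', 'George Bush'),
               ('Levenshtein', 'Lehvenstien')]:
    print(f'levenshtein("{s1}", "{s2}") = {levenshtein(s1, s2)}')
```

The nested loops make the cost proportional to the product of the string lengths for every pair compared.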


For large sets of strings, the computational cost of calculating pairwise edit distances for all strings can be too high. To allow approximate candidate generation from very large knowledge bases, we can use e.g. [Locality-sensitive hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) methods for strings, which aim to map similar strings to the same hash value, or dedicated methods such as [simstring](http://www.chokkan.org/software/simstring/).

We'll here demonstrate approximate matching using a Python implementation of simstring, [`simstring-pure`](https://pypi.org/project/simstring-pure/).

In [13]:
!pip install --quiet simstring-pure

We'll first build a simstring database with character N-gram features (this takes a while):

In [14]:
from simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor
from simstring.database.dict import DictDatabase


db = DictDatabase(CharacterNgramFeatureExtractor(2))


with open('wikidata-people.jsonl') as f:
    count = 0
    for line in f:
        entity = json.loads(line)
        try:
            db.add(entity['labels']['en']['value'])
            count += 1
        except KeyError:
            pass    # skip people without an English name

print(f'Added {count} entries to DB')

Added 3153159 entries to DB


We can then query that database for approximate string matches using e.g. n-gram cosine similarity very quickly:

In [15]:
from simstring.measure.cosine import CosineMeasure
from simstring.searcher import Searcher


THRESHOLD = 0.8    # minimum similarity for retrieved strings

searcher = Searcher(db, CosineMeasure())
for mention in ['George W. Bush', 'George W Bush']:
    print(mention, '→', searcher.search(mention, THRESHOLD))

George W. Bush → ['George Bush', 'George W. Buck', 'George W. Bush', 'George W. Brush']
George W Bush → ['George Bush', 'George W. Bush', 'George W. Brush']
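
For intuition about what is being compared, here is a minimal sketch of cosine similarity over sets of character bigrams. It is our own simplified version: the library's feature extractor may differ in details such as padding at the string boundaries, so exact scores are not guaranteed to match.

```python
from math import sqrt


def char_ngrams(string, n=2):
    """Set of character n-grams in a string (no boundary padding in this sketch)."""
    return {string[i:i + n] for i in range(len(string) - n + 1)}


def ngram_cosine(s1, s2, n=2):
    """Cosine similarity between the n-gram sets of two strings."""
    x, y = char_ngrams(s1, n), char_ngrams(s2, n)
    if not x or not y:
        return 0.0
    return len(x & y) / sqrt(len(x) * len(y))


for s1, s2 in [('George W Bush', 'George W. Bush'),
               ('George W Bush', 'George Bush'),
               ('George W Bush', 'Douglas Adams')]:
    print(f'{s1} / {s2}: {ngram_cosine(s1, s2):.2f}')
```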


With mention normalization providing us with standardized forms of strings appearing in text and approximate matching finding minor variations, we're practically done with candidate generation (right?)

### Alternative forms of reference

We have so far focused on cases where a mention can be approximately matched to a full known name or alias. However, entities are not always referenced by their full names. Common exceptions include

* Only first or last name after first mention in a document (e.g. _George Bush_ → _Bush_)
    * In particular in informal writing, first or last names can be used exclusively (e.g. _Obama_)
* Short "local" abbreviations for repeatedly mentioned names (e.g. _Emma Korhonen_ → _EK_)
* Reference by title or position (e.g. _The president_, _the Member for Cambridge_)

Some of these issues are best resolved by [coreference resolution](https://nlp.stanford.edu/projects/coref.shtml), i.e. using the same or similar methods as for deciding the referent of _he_ or _she_. However, it should be noted that some forms of reference not typically found in knowledge bases can be identified using resources such as Wikipedia where any referring string of text can potentially be linked to an entry associated with the full name of the referenced entity:

---

<img width="75%" src="https://github.com/TurkuNLP/Text_Mining_Course/raw/master/figs/paaministeri-marin.png">

---

Linked texts such as these can be used to augment the aliases found in knowledge bases for candidate generation.
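
A hedged sketch of this augmentation, assuming (anchor text, link target) pairs have already been harvested from a linked resource; the pairs below are made up for illustration, as is the function name `anchor_aliases`.

```python
from collections import Counter, defaultdict

# Made-up (anchor text, link target) pairs standing in for harvested links
links = [
    ('pääministeri', 'Sanna Marin'),
    ('pääministeri Marin', 'Sanna Marin'),
    ('Marin', 'Sanna Marin'),
    ('pääministeri', 'Suomen pääministeri'),
    ('pääministeri', 'Suomen pääministeri'),
]


def anchor_aliases(links):
    """Count, for each anchor text, how often it links to each target."""
    aliases = defaultdict(Counter)
    for anchor, target in links:
        aliases[anchor][target] += 1
    return aliases


aliases = anchor_aliases(links)
for anchor, targets in aliases.items():
    print(anchor, '→', targets.most_common())
```

The counts also give a rough prior over referents for each anchor text, which can be useful in the disambiguation step.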

## Entity mention disambiguation

Where candidate generation aimed to capture the _variability_ of names, their _ambiguity_ is addressed by disambiguation.

The final step of our normalization approach is to disambiguate between the options provided by the candidate generation step:

- Given an entity mention $m$ occurring in a document $d$ and
- Given a subset $\{ k_1, k_2, \ldots, k_n \} \subset K$ of entity representations from a knowledge base $K$
- Select the representation most likely referenced by the mention $m$, or `NIL` if the reference is not included

The option to select an "empty" value (`NIL`) is included to cover the common case where an entity referenced in a text is not found in a knowledge base. 

A wealth of methods have been proposed for this task, with various combinations of rule-based approaches, machine learning with explicitly engineered features, and recently feature learning-based methods (see e.g. [Shen et al. 2015](http://dbgroup.cs.tsinghua.edu.cn/wangjy/papers/TKDE14-entitylinking.pdf) for a survey of the field, and [Sevgili et al. 2021](https://arxiv.org/pdf/2006.00575.pdf)).

We will here discuss string similarity, one effective heuristic (popularity), and ranking-based machine learning methods.

### String similarity

Assuming our candidate generation includes some form of approximate matching, the string we queried with may not exactly match any of the names and aliases in our knowledge base. Consider our example from above:

In [16]:
searcher.search('George W Bush', THRESHOLD)

['George Bush', 'George W. Bush', 'George W. Brush']

It seems intuitively clear that it's more likely that our mention just lacks a dot (i.e. should match _George W. Bush_) than that it lacks a dot _and_ misspells _Brush_ as _Bush_.

Here we can again look at edit distance to identify likely candidates:

In [17]:
from Levenshtein import distance


s1 = 'George W Bush'

for s2 in searcher.search(s1, THRESHOLD):
    print(f'distance("{s1}", "{s2}") = {distance(s1, s2)}')

distance("George W Bush", "George Bush") = 2
distance("George W Bush", "George W. Bush") = 1
distance("George W Bush", "George W. Brush") = 2


We can also further refine this approach by assigning different costs to different edit operations: for example, removing or inserting a dot could have a lower cost than removing or inserting an alphabetic character.
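
A minimal sketch of such a weighted edit distance follows. The specific costs (0.2 for inserting or deleting a punctuation character, 1.0 for everything else) are arbitrary assumptions chosen for illustration, and the function names are our own.

```python
import string


def edit_cost(char):
    """Cost of inserting or deleting char: punctuation is cheap (an assumed weighting)."""
    return 0.2 if char in string.punctuation else 1.0


def weighted_distance(s1, s2):
    """Edit distance where insertions and deletions are weighted by edit_cost."""
    previous = [0.0]
    for c2 in s2:
        previous.append(previous[-1] + edit_cost(c2))    # build s2 prefixes by insertions
    for c1 in s1:
        current = [previous[0] + edit_cost(c1)]          # delete the s1 prefix entirely
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + edit_cost(c1),                     # delete c1
                current[j - 1] + edit_cost(c2),                  # insert c2
                previous[j - 1] + (0.0 if c1 == c2 else 1.0),    # substitute
            ))
        previous = current
    return previous[-1]


s1 = 'George W Bush'
for s2 in ['George Bush', 'George W. Bush', 'George W. Brush']:
    print(f'weighted_distance("{s1}", "{s2}") = {weighted_distance(s1, s2):.1f}')
```

With these weights, the candidate differing only by a dot is separated more clearly from the other two than with the unweighted distance.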

### Entity popularity

Although the concept is hard to define objectively, various measures of entity "popularity" provide both a strong baseline method for disambiguation and a valuable feature for machine learning approaches. We could measure the popularity of an entity e.g. by

* The number of words on the Wikipedia page for the entity
* The number of facts (relations) recorded on the Wikidata entry of the entity
* The number of incoming links to the Wikipedia/Wikidata pages (either internal, or in a web crawl)

As an example, consider our previous query for <a href="https://query.wikidata.org/#SELECT%20DISTINCT%20%3Fperson%20%3Fdescription%0AWHERE%0A%7B%0A%20%20%3Fperson%20wdt%3AP31%20wd%3AQ5.%0A%20%20%7B%20%3Fperson%20rdfs%3Alabel%20%22George%20Bush%22%40en.%20%7D%20UNION%0A%20%20%7B%20%3Fperson%20skos%3AaltLabel%20%22George%20Bush%22%40en.%20%20%7D%0A%20%20%3Fperson%20schema%3Adescription%20%3Fdescription.%0A%20%20FILTER%28LANG%28%3Fdescription%29%20%3D%20%22en%22%29%0A%7D%0A">people named George Bush</a> in Wikidata: the candidates include e.g. a [professional racing driver](https://en.wikipedia.org/wiki/George_Bush_(racing_driver)) of that name. While some references of the name do undoubtedly refer to this person, _in the absence of other evidence_ most people would likely assume that the name references a former US president.

We can implement something along these intuitive lines by ranking the entries by the number of [Wikidata sitelinks](https://www.wikidata.org/wiki/Help:Sitelinks), i.e. the number of links to an entry from any Wiki resource (<a href="https://query.wikidata.org/#SELECT%20DISTINCT%20%3Fperson%20%3Fdescription%20%3Flinkcount%0AWHERE%0A%7B%0A%20%20%3Fperson%20wdt%3AP31%20wd%3AQ5.%0A%20%20%7B%20%3Fperson%20rdfs%3Alabel%20%22George%20Bush%22%40en.%20%7D%20UNION%0A%20%20%7B%20%3Fperson%20skos%3AaltLabel%20%22George%20Bush%22%40en.%20%20%7D%0A%20%20%3Fperson%20schema%3Adescription%20%3Fdescription.%0A%20%20%3Fperson%20wikibase%3Asitelinks%20%3Flinkcount%0A%20%20FILTER%28LANG%28%3Fdescription%29%20%3D%20%22en%22%29%0A%7D%0AORDER%20BY%20DESC%28%3Flinkcount%29">try this query!</a>):

```
SELECT DISTINCT ?person ?description ?linkcount
WHERE
{
  ?person wdt:P31 wd:Q5.
  { ?person rdfs:label "George Bush"@en. } UNION
  { ?person skos:altLabel "George Bush"@en.  }
  ?person schema:description ?description.
  ?person wikibase:sitelinks ?linkcount               # this line is new
  FILTER(LANG(?description) = "en")
}
ORDER BY DESC(?linkcount)                             # this line is new
```

Running this updated query, we get the following result:

| person        | description                                     | linkcount |
|:--------------|:------------------------------------------------|----------:|
| wd:Q207       | 43rd president of the United States             | 266       |
| wd:Q23505     | 41st president of the United States (1924-2018) | 177       |
| wd:Q5537488   | American biblical scholar and pastor            | 6         |
| wd:Q5537484   | racing driver                                   | 2         |
| wd:Q28445429  | association football player (1883-1936)         | 2         |
| wd:Q100766406 | college basketball player (1950–1950) Toledo    | 0         |

We see the expected result with the two presidents ranked higher than the scholar, racing driver, and the football and basketball players named _George Bush_.

While imperfect and biased in various ways, popularity heuristics such as these can provide a strong baseline method: never mind the context, just always return the most popular candidate.
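
In code, this baseline is little more than a `max` over the candidates. The sketch below uses the sitelink counts from the table above; the function name `most_popular` is our own.

```python
# Sitelink counts copied from the query result above
sitelinks = {
    'Q207': 266,
    'Q23505': 177,
    'Q5537488': 6,
    'Q5537484': 2,
    'Q28445429': 2,
    'Q100766406': 0,
}


def most_popular(candidates, popularity):
    """Return the candidate with the highest popularity score, or NIL if there are none."""
    if not candidates:
        return 'NIL'
    return max(candidates, key=lambda c: popularity.get(c, 0))


print(most_popular(['Q5537484', 'Q23505', 'Q207'], sitelinks))
```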


### Machine learning for entity linking

In our discussion of entity linking, we have so far ignored the document context $d$ in which a mention $m$ occurs. However, in some cases context is absolutely required to disambiguate: consider again our previous example with the typed mention $m$ = (`George Bush`, `PERSON`). If our document is $d_1$ = 

```
Former President George Bush on Sunday congratulated President-elect [...] The 43rd
president of the United States said ...
```

we can infer that the correct entity ID in Wikidata is [Q207](https://www.wikidata.org/wiki/Q207), _George W. Bush, 43rd president of the United States_. However, if our context were instead $d_2$ =

```
Former President George Bush on Sunday congratulated President-elect [...] The 41st
president of the United States said ...
```

we should link to [Q23505](https://www.wikidata.org/wiki/Q23505), _George H. W. Bush, 41st president of the United States_.

While heuristics can be written for specific cases (e.g. search for "41" or "43" to disambiguate the Bushes), it is very difficult to write general rules to answer the question _which of these entities does this mention refer to in this context_. As is common for cases where we cannot code a solution to a problem, we can apply machine learning to approximate a solution.
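
To illustrate both the appeal and the limits of such heuristics, here is a crude sketch that ranks candidates by the number of words their Wikidata description shares with the context. The descriptions are taken from the query results above; the function names are hypothetical, and this is a toy baseline, not a recommended method.

```python
import re

# Candidate descriptions copied from the Wikidata query results above
descriptions = {
    'Q207': '43rd president of the United States',
    'Q23505': '41st president of the United States (1924-2018)',
    'Q5537484': 'racing driver',
}


def tokens(text):
    """Lowercased set of word tokens in a text."""
    return set(re.findall(r'\w+', text.lower()))


def rank_by_overlap(context, descriptions):
    """Rank candidate IDs by how many description words also occur in the context."""
    context_tokens = tokens(context)
    scores = {id_: len(context_tokens & tokens(desc)) for id_, desc in descriptions.items()}
    return sorted(scores, key=scores.get, reverse=True)


d1 = ('Former President George Bush on Sunday congratulated President-elect [...] '
      'The 43rd president of the United States said ...')
d2 = d1.replace('43rd', '41st')

print(rank_by_overlap(d1, descriptions)[0])
print(rank_by_overlap(d2, descriptions)[0])
```

The two presidents differ by a single overlapping token here, and a context that does not happen to repeat words from the description would give no signal at all.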

Machine learning for entity linking is an active area of research with many proposed methods (see e.g. [Sevgili et al. 2021](https://arxiv.org/pdf/2006.00575.pdf)) and few off-the-shelf tools. However, many state-of-the-art approaches can be broadly characterized as follows:

* Pre-train a _neural language model_ (LM) on a large corpus of unannotated data (e.g. BERT)
* Train an _entity encoder_ to create representations of knowledge-base entries
* Calculate the similarity of a mention-in-context vector encoded by the LM with the entity vector created by the entity encoder for each candidate entity
* (Optionally add in prior information such as entity popularity)
* Fine-tune with labelled data (mention-entity pairs)
* Use similarity of mention-in-context and entity representations to rank candidates
* Return top candidate if similarity higher than threshold, or `NIL` otherwise
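
The final ranking and thresholding steps can be sketched independently of any particular model. In the following, the three-dimensional vectors are made-up stand-ins for the outputs of a mention encoder and an entity encoder, and the threshold value and the function names `cosine` and `link` are likewise assumptions for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link(mention_vector, entity_vectors, threshold=0.5):
    """Return the ID of the most similar candidate, or NIL if none exceeds the threshold."""
    best_id, best_score = 'NIL', threshold
    for id_, vector in entity_vectors.items():
        score = cosine(mention_vector, vector)
        if score > best_score:
            best_id, best_score = id_, score
    return best_id


# Made-up vectors standing in for encoder outputs
entity_vectors = {'Q207': [0.9, 0.1, 0.2], 'Q23505': [0.2, 0.9, 0.1]}

print(link([0.8, 0.2, 0.1], entity_vectors))    # mention vector close to Q207
print(link([0.0, 0.0, 1.0], entity_vectors))    # mention vector far from both candidates
```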

---

<img width="75%" src="https://github.com/TurkuNLP/Text_Mining_Course/raw/master/figs/entity-ranking-sevgili-et-al-2021.png">
<div style="text-align:center; color:gray; font-size:80%">(Figure from <a href="https://arxiv.org/pdf/2006.00575.pdf">Sevgili et al. 2021</a>)</div>

---

To the best of my knowledge, no system of this type currently exists for Finnish, but perhaps one will have been created by the next time we give this course!