<a href="https://colab.research.google.com/github/TurkuNLP/textual-data-analysis-course/blob/main/relation_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Relation extraction

This notebook presents different approaches to relation extraction.

---

Start by setting up the usual libraries and loading a named entity recognition pipeline.

In [1]:
!pip install --quiet transformers

In [3]:
import transformers

(**Note**: the model we use here is intended for teaching purposes. It has not been carefully optimized and is not recommended for any serious use.)

In [4]:
pipe = transformers.pipeline(
    'ner',
    'spyysalo/example-turku-ner-model',
    aggregation_strategy='simple',
)


Next, define a simple wrapper function for convenience:

In [5]:
from collections import namedtuple

Mention = namedtuple('Mention', 'text type start end')

def tag(text):
    output = pipe(text)
    return [Mention(o['word'], o['entity_group'], o['start'], o['end']) for o in output]

---

# Co-occurrence-based relation extraction

Co-occurrence-based relation extraction rests primarily on the assumption that when entity mentions **occur together** in some context (e.g. sentence, paragraph, document), some relation is likely to hold between them.

The simplest form of co-occurrence-based relation extraction is to always return a relation for every pair of co-occurring entities. Let's illustrate this for a very simple sentence:

In [7]:
def cooccurrence_relations(mentions):
    relations = []
    for i, mention1 in enumerate(mentions):
        for mention2 in mentions[i+1:]:
            relations.append((mention1, mention2))
    return relations

mentions = tag('Paavo Nurmi syntyi Turussa vuonna 1897.')

for m1, m2 in cooccurrence_relations(mentions):
  print((m1.text, m2.text))

('Paavo Nurmi', 'Turussa')
('Paavo Nurmi', 'vuonna 1897')
('Turussa', 'vuonna 1897')


For this simple example, we get two valid relations:

* (Paavo Nurmi, Turussa)
* (Paavo Nurmi, vuonna 1897)

and one erroneous (or meaningless) relation:

* (Turussa, vuonna 1897)

In general, the naive co-occurrence based approach has _perfect recall_ (all relevant relations are included) but typically very _poor precision_: most of the returned relations are wrong. This is evident already with slightly longer text:

In [14]:
text = (
    'Nurmi juoksi kesällä 1920 Suomen olympiakarsinnoissa suomenennätyksiä, '
    'ja hänet valittiin Suomen joukkueeseen vuoden 1920 olympialaisiin. '
    'Nurmen ensimmäinen olympiajuoksu oli 17. elokuuta käyty 5 000 metrin '
    'kilpailu, jossa hän voitti hopeaa.'
)

mentions = tag(text)

for m1, m2 in cooccurrence_relations(mentions):
    print((m1.text, m2.text))

('Nurmi', '1920')
('Nurmi', 'Suomen')
('Nurmi', 'Suomen')
('Nurmi', 'vuoden 1920')
('Nurmi', 'Nurmen')
('Nurmi', '17')
('Nurmi', '. elokuuta')
('1920', 'Suomen')
('1920', 'Suomen')
('1920', 'vuoden 1920')
('1920', 'Nurmen')
('1920', '17')
('1920', '. elokuuta')
('Suomen', 'Suomen')
('Suomen', 'vuoden 1920')
('Suomen', 'Nurmen')
('Suomen', '17')
('Suomen', '. elokuuta')
('Suomen', 'vuoden 1920')
('Suomen', 'Nurmen')
('Suomen', '17')
('Suomen', '. elokuuta')
('vuoden 1920', 'Nurmen')
('vuoden 1920', '17')
('vuoden 1920', '. elokuuta')
('Nurmen', '17')
('Nurmen', '. elokuuta')
('17', '. elokuuta')


The combinatorial explosion of wrong or irrelevant pairs arising from naive co-occurrence based relation extraction can be alleviated by restricting the scope of considered co-occurrences to e.g. sentences, but this does compromise on the perfect recall of the approach: relations can involve entities mentioned in different sentences.
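As a rough sketch of sentence-level scoping, the following pairs mentions only when they start in the same sentence. The regular-expression sentence splitter, the `find_mention` helper, and the English example text are illustrative assumptions, not part of the pipeline above; real text would call for a proper sentence segmenter.

```python
import re
from collections import namedtuple
from itertools import combinations

Mention = namedtuple('Mention', 'text type start end')


def sentence_spans(text):
    # Naive splitter (assumption): a sentence ends at '.', '!' or '?'
    # followed by whitespace. Abbreviations etc. will break this.
    spans, start = [], 0
    for m in re.finditer(r'[.!?]\s+', text):
        spans.append((start, m.end()))
        start = m.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def sentence_cooccurrence_relations(text, mentions):
    relations = []
    for s_start, s_end in sentence_spans(text):
        # Keep only the mentions that start inside this sentence
        in_sentence = [m for m in mentions if s_start <= m.start < s_end]
        relations.extend(combinations(in_sentence, 2))
    return relations


def find_mention(text, surface, type_):
    # Hypothetical helper: build a Mention from the first occurrence of a string
    start = text.index(surface)
    return Mention(surface, type_, start, start + len(surface))


text = 'Paavo Nurmi was born in Turku. He won gold in Antwerp.'
mentions = [
    find_mention(text, 'Paavo Nurmi', 'PERSON'),
    find_mention(text, 'Turku', 'GPE'),
    find_mention(text, 'Antwerp', 'GPE'),
]
pairs = [(m1.text, m2.text)
         for m1, m2 in sentence_cooccurrence_relations(text, mentions)]
print(pairs)
```

In this toy example the cross-sentence pair (Paavo Nurmi, Antwerp) is no longer returned, which illustrates the recall cost of narrowing the scope.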

Note additionally that this approach does not provide any information about the _types_ of the relations.

---

## Metrics

As most combinations of entity pairs in a document do not represent meaningful relations, a trivial "method" that never returns any relations could have deceptively high _accuracy_: for example, if there are 100 possible pairs in a document and a relation holds for 5, returning an empty set of relations would give 95% accuracy.

For this reason, accuracy is rarely used to evaluate relation extraction; instead, the standard information retrieval metrics are used:

* **Precision**: the fraction of extracted relations that are correct
* **Recall**: the fraction of correct relations that are extracted
* **$F_1$-score**: the harmonic mean of precision and recall
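To make these definitions concrete, here is a minimal sketch that scores a set of predicted relations against a gold-standard set, using $F_1 = 2PR / (P + R)$. The function name is made up for illustration, and the example reuses the three pairs from the Paavo Nurmi sentence above, treating the two valid pairs as the gold standard.

```python
def precision_recall_f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against division by zero when a set is empty
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {
    ('Paavo Nurmi', 'Turussa'),
    ('Paavo Nurmi', 'vuonna 1897'),
}
predicted = {
    ('Paavo Nurmi', 'Turussa'),
    ('Paavo Nurmi', 'vuonna 1897'),
    ('Turussa', 'vuonna 1897'),
}

p, r, f = precision_recall_f1(predicted, gold)
print(f'precision={p:.2f} recall={r:.2f} f1={f:.2f}')
# prints: precision=0.67 recall=1.00 f1=0.80
```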

---

## Statistical approaches

Co-occurrence provides only weak evidence that the co-mentioned entities are associated. However, by accumulating such evidence over a large corpus, it is possible to identify associations with higher confidence than a single co-occurrence can give.

Let's illustrate this by looking at mentions tagged in a sample of English Wikipedia articles.

In [15]:
!wget -nc https://a3s.fi/TKO_8964_2021/en-wiki-sample-ontonotes.conll.gz



We'll use a reader function for the two-column (text, tag) CoNLL format. You should already be familiar with this representation from previous material.

In [16]:
import gzip


def read_conll_entities(stream):
  words, tags = [], []
  for line in stream:
    if line.startswith('#'):
      continue    # skip comments
    elif line.isspace():
      if words and tags:
        yield words, tags
      words, tags = [], []
    else:
      word, tag = line.rstrip('\n').split('\t')
      words.append(word)
      tags.append(tag)
  if words and tags:
    yield words, tags


conll_sentences = []
with gzip.open('en-wiki-sample-ontonotes.conll.gz', 'rt', encoding='utf-8') as f:
  for words, tags in read_conll_entities(f):
    conll_sentences.append((words, tags))

This NER annotation is represented in the familiar IOB (inside-outside-beginning) format:

In [18]:
for word, tag in zip(*conll_sentences[1]):
  print(f'{word}\t{tag}')

Ayn	B-PERSON
Rand	I-PERSON
(	O
;	O
born	O
Alisa	B-PERSON
Zinovyevna	I-PERSON
Rosenbaum	I-PERSON
;	O
–	O
March	B-DATE
6	I-DATE
,	I-DATE
1982	I-DATE
)	O
was	O
a	O
Russian	B-NORP
-	I-NORP
American	I-NORP
writer	O
and	O
philosopher	O
.	O


Let's next convert these IOB-tagged sentences into the mention format defined above, where each mention is defined by its start and end offsets, type, and text. (You don't need to understand the conversion steps in detail.)

In [22]:
def conll_to_mentions(words, tags):
  mentions, offset, start, label = [], 0, None, None
  sentence = ' '.join(words)
  for word, tag in zip(words, tags):
    if tag[0] in 'OB' and start is not None:    # current ends
      end = offset-1
      mentions.append(Mention(sentence[start:end], label, start, end))
      start, label = None, None
    if tag[0] == 'B':
      start, label = offset, tag[2:]
    elif tag[0] == 'I':
      if start is None:    # I without B, but nevermind
        start, label = offset, tag[2:]
    else:
      assert tag == 'O', 'unexpected tag {}'.format(tag)
    offset += len(word) + 1    # +1 for space
  if start is not None:    # span open at sentence end
    end = offset-1
    mentions.append(Mention(sentence[start:end], label, start, end))
  return mentions


sentences = []
mentions_by_sentence = []
for words, tags in conll_sentences:
  sentences.append(' '.join(words))
  mentions_by_sentence.append(conll_to_mentions(words, tags))

For the example sentence above, we now have:

In [23]:
for mention in mentions_by_sentence[1]:
  print(mention)

Mention(text='Ayn Rand', type='PERSON', start=0, end=8)
Mention(text='Alisa Zinovyevna Rosenbaum', type='PERSON', start=18, end=44)
Mention(text='March 6 , 1982', type='DATE', start=49, end=63)
Mention(text='Russian - American', type='NORP', start=72, end=90)


This is the representation that we want for relation extraction.

Let's next see if we can find anything of interest just by looking at the most common co-occurrences.

In [24]:
from collections import Counter


relation_counts = Counter()
for mentions in mentions_by_sentence:
  for relation in cooccurrence_relations(mentions):
    relation_counts[relation] += 1

for relation, count in relation_counts.most_common(10):
  m1, m2 = relation[0], relation[1]
  print(f'{count}\t({m1.text}/{m1.type}, {m2.text}/{m2.type})')

3939	(every 100/CARDINAL, age 18/DATE)
3738	(Hispanic/NORP, Latino/NORP)
1177	(25 to 44/DATE, 45 to 64/DATE)
1095	(the age of 18/DATE, 45 to 64/DATE)
1087	(the age of 18/DATE, 25 to 44/DATE)
1020	(the age of 18/DATE, 65 years of age or older/DATE)
999	(45 to 64/DATE, 65 years of age or older/DATE)
986	(25 to 44/DATE, 65 years of age or older/DATE)
816	(18/CARDINAL, 24/CARDINAL)
807	(24/CARDINAL, 45 to 64/DATE)


These aren't really interesting; most of the pairs combine two numeric mention types. On reflection, it seems reasonable to assume that at least one entity in each pair of potentially related entities should have something other than a numeric type. We might also get more relevant associations by focusing further on specific types. Let's try this out.

This data uses [OntoNotes types](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf#page=21):

In [25]:
ALL_TYPES = set(m.type for mentions in mentions_by_sentence for m in mentions)

ALL_TYPES

{'CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART'}

Let's see if we get more interesting pairings by filtering by type:

In [26]:
def filtered_cooccurrence_relations(mentions, types1, types2):
  relations = []
  for i, mention1 in enumerate(mentions):
    for mention2 in mentions[i+1:]:
      if (mention1.type in types1 and mention2.type in types2 and
          mention1.type != mention2.type):
        relations.append((mention1, mention2))
  return relations


# Try changing the following to other types! (see above for list)
TYPES1 = { 'PERSON' }
TYPES2 = { 'ORG' }

relation_counts = Counter()
for mentions in mentions_by_sentence:
  for relation in filtered_cooccurrence_relations(mentions, TYPES1, TYPES2):
    relation_counts[relation] += 1

for relation, count in relation_counts.most_common(20):
  m1, m2 = relation[0], relation[1]
  print(f'{count}\t({m1.text}/{m1.type}, {m2.text}/{m2.type})')

9	(Bosley Crowther/PERSON, `` The New York Times/ORG)
8	(Vincent Canby/PERSON, The New York Times/ORG)
8	(Janet Maslin/PERSON, The New York Times/ORG)
7	(Gene Siskel/PERSON, the `` Chicago Tribune/ORG)
6	(Charles Champlin/PERSON, the `` Los Angeles Times/ORG)
4	(Roger Ebert/PERSON, Chicago Sun-Times/ORG)
3	(Gary Arnold/PERSON, The Washington Post/ORG)
3	(A. O. Scott/PERSON, The New York Times/ORG)
3	(St John 's/PERSON, College/ORG)
3	(A. H. Weiler/PERSON, The New York Times/ORG)
3	(Häkkinen/PERSON, McLaren/ORG)
3	(Tuoba Xianbei/PERSON, the Northern Wei/ORG)
2	(Stephen Thomas Erlewine/PERSON, AllMusic/ORG)
2	(Martin Richards/PERSON, the University of Cambridge/ORG)
2	(Charles II/PERSON, the Hudson 's Bay Company/ORG)
2	(Charles II/PERSON, HBC/ORG)
2	(Cranmer/PERSON, Church/ORG)
2	(Cranmer/PERSON, Prayer Books/ORG)
2	(Roosevelt/PERSON, Congress/ORG)
2	(Mick LaSalle/PERSON, the `` San Francisco Chronicle/ORG)


That's better: there are definitely valid relations here that could serve as entries in a knowledge base, given normalization (and a relation type!).

In this simple statistical approach, we have ranked the relations by their total count in the corpus. While this simple ranking approach can work, it should be noted that it fails to account for the probability of chance co-occurrence.

An improved statistical approach would be to use the overall frequency of occurrence in the data to get an estimate of the expected number of co-occurrences for each pair, and divide the counts of actual co-occurrences by this number to determine which pairs co-occur more frequently than expected by chance.
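A minimal sketch of this idea follows, assuming that entities occur independently of each other, so that the expected number of sentences containing both $a$ and $b$ is $n_a \cdot n_b / N$ for $N$ sentences. The function name and the toy data (abstract entity names `A` to `D`) are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def association_scores(entity_sets):
    """Return observed/expected co-occurrence ratios for entity pairs.

    entity_sets: one set of entity names per sentence.
    """
    n = len(entity_sets)
    entity_counts, pair_counts = Counter(), Counter()
    for entities in entity_sets:
        entity_counts.update(entities)
        # sorted() gives each unordered pair a single canonical key
        pair_counts.update(combinations(sorted(entities), 2))
    scores = {}
    for (a, b), observed in pair_counts.items():
        # Expected count if a and b occurred independently of each other
        expected = entity_counts[a] * entity_counts[b] / n
        scores[(a, b)] = observed / expected
    return scores


# Toy data: each set holds the entities mentioned in one sentence
entity_sets = [
    {'A', 'B'}, {'A', 'B'}, {'A', 'C'},
    {'C', 'D'}, {'C'}, {'C'},
]

scores = association_scores(entity_sets)
for pair, score in sorted(scores.items(), key=lambda x: -x[1]):
    print(pair, round(score, 2))
```

A ratio above 1 means the pair co-occurs more often than chance would predict. Ratios for rare pairs are noisy (here the single co-occurrence of `C` and `D` already scores 1.5), so a minimum count threshold is a common addition.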

---

# Rule-based relation extraction

Broadly, rule-based relation extraction covers any method involving explicitly written instructions for how to extract relations. (Contrast this with machine learning approaches, where the "rules" are learned from examples.)

Let's first consider a simple extension of naive co-occurrence based relation extraction to assign candidate types to co-occurring mentions by looking for keywords in their context -- here, the words in between the mentions.



In [27]:
relation_keywords = {
    'date_of_birth': ['born', 'birth'],
    'date_of_death': ['died', 'dead', 'killed'],
}

relation_counts = Counter()
for sentence, mentions in zip(sentences, mentions_by_sentence):
  for i, mention1 in enumerate(mentions):
    for mention2 in mentions[i+1:]:
      if mention1.type == 'PERSON' and mention2.type == 'DATE':
        between_words = sentence[mention1.end:mention2.start].split()
        for rel, keywords in relation_keywords.items():
          if any(k in between_words for k in keywords):
            relation_counts[(rel, mention1.text, mention2.text)] += 1

for relation, count in relation_counts.most_common(20):
  rel, m1, m2 = relation[0], relation[1], relation[2]
  print(f'{count}\t{rel}({m1}, {m2})')

4	date_of_death(Richard Montgomery, 1775)
3	date_of_birth(Afonso, 1109)
3	date_of_death(Queen Victoria, 22 January 1901)
3	date_of_death(Garret Hobart, 1899)
2	date_of_death(Afonso, 1185)
2	date_of_death(Isabella, 1103)
2	date_of_death(Peter, 1103)
2	date_of_birth(Mozart, about four months)
2	date_of_death(Gur, October 1967)
2	date_of_death(Gur, 11)
2	date_of_birth(Britney Jean Spears, December 2 , 1981)
2	date_of_birth(Billy Bob Thornton, August 4 , 1955)
2	date_of_death(DeMille, November 1959)
2	date_of_birth(David Andrew, August 28 , 1962)
2	date_of_birth(Leo Fincher, August 28 , 1962)
2	date_of_birth(Denis Colin Leary, 18 August 1957)
2	date_of_birth(Eve Arden, April 30 , 1908)
2	date_of_death(Franco, 1975)
2	date_of_death(Louis XVIII, September 1824)
2	date_of_death(Charles II, 1700)


There's a lot of noise here, but at least the following relation extracted for [Queen Victoria](https://en.wikipedia.org/wiki/Queen_Victoria) appears to hold:

```
date_of_death(Queen Victoria, 22 January 1901)
```

Like co-occurrence based extraction, this simple approach suffers from low precision. To improve on this, let's look at an important subclass of rule-based relation extraction methods, ones that involve explicit **patterns** for relation extraction.

Extraction patterns can be expressed over the linear sequence of words ("flat" patterns) or over some representation of syntactic structure (e.g. dependency patterns). We'll look into both in the following.


## Flat pattern matching

Flat pattern matching can be implemented using standard tools such as [regular expressions](https://en.wikipedia.org/wiki/Regular_expression). (If you're not familiar with regular expression syntax or need a refresher, you may want to have a look at e.g. the detailed [regular expression HOWTO](https://docs.python.org/3/howto/regex.html) included in Python documentation.)

Let's first briefly revisit the extraction of is-a relations using [Hearst patterns](https://people.ischool.berkeley.edu/~hearst/papers/coling92.pdf). The basic idea is that we can infer _is-a_ (or _type-of_, or [_hyponym_](https://en.wikipedia.org/wiki/Hyponymy_and_hypernymy)) relations by looking for patterns such as the following:

```
authors such as Shakespeare    →  is-a(author, Shakespeare)
libraries and other buildings  →  is-a(building, library)
countries, including Finland   →  is-a(country, Finland)
```
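As a small sketch, the three example patterns can be written as regular expressions over plain strings. The pattern list and function name are illustrative assumptions; the single-word groups are a deliberate simplification, and the matched words are returned in their surface form (e.g. plural).

```python
import re

# Each entry: (compiled pattern, hypernym group, hyponym group)
HEARST_PATTERNS = [
    (re.compile(r'\b(\w+) such as (\w+)'), 1, 2),
    (re.compile(r'\b(\w+) and other (\w+)'), 2, 1),
    (re.compile(r'\b(\w+), including (\w+)'), 1, 2),
]


def hearst_relations(text):
    relations = []
    for pattern, hypernym_group, hyponym_group in HEARST_PATTERNS:
        for m in pattern.finditer(text):
            relations.append(
                ('is-a', m.group(hypernym_group), m.group(hyponym_group))
            )
    return relations


for phrase in [
    'authors such as Shakespeare',
    'libraries and other buildings',
    'countries, including Finland',
]:
    print(hearst_relations(phrase))
```

Note that in the "and other" pattern the hypernym comes second, which is why each pattern records which group plays which role.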

Let's try a "such as" pattern on our example Wikipedia data. Here we fix one participant in the relation to be an entity mention, and simply match the other as a string using a regular expression.

In [28]:
import re


# Explanation for regular expression:
# \b         matches word boundary
# [a-z]+s    matches any sequence of "a" to "z" characters, ending with "s"
# ( ,)?      matches optional comma
# such as $  matches the text "such as" at the end of the string
SUCH_AS_RE = re.compile(r'\b([a-z]+s)( ,)? such as $')


relation_counts = Counter()
for sentence, mentions in zip(sentences, mentions_by_sentence):
  for mention2 in mentions:
    before = sentence[:mention2.start]
    m = SUCH_AS_RE.search(before) 
    if m:
      mention1 = m.group(1)
      relation = ('is-a', mention1, mention2.text)
      relation_counts[relation] += 1


for relation, count in relation_counts.most_common(10):
  type_, m1, m2 = relation[0], relation[1], relation[2]
  print(f'{count}:\t{type_}({m1}, {m2})')

8:	is-a(countries, the United States)
8:	is-a(countries, China)
5:	is-a(languages, Hindi)
5:	is-a(countries, India)
5:	is-a(countries, Japan)
5:	is-a(countries, France)
4:	is-a(languages, French)
4:	is-a(languages, English)
4:	is-a(countries, the United Kingdom)
4:	is-a(countries, Germany)


Not bad, but those plural forms really don't work here. Let's add a quick fix to lemmatize these:

In [31]:
import nltk
from nltk.stem import WordNetLemmatizer 

nltk.download('omw-1.4')
nltk.download('wordnet')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [32]:
lemmatizer = WordNetLemmatizer()

for relation, count in relation_counts.most_common(20):
  type_, m1, m2 = relation[0], relation[1], relation[2]
  m1 = lemmatizer.lemmatize(m1)
  print(f'{count}:\t{type_}({m1}, {m2})')

8:	is-a(country, the United States)
8:	is-a(country, China)
5:	is-a(language, Hindi)
5:	is-a(country, India)
5:	is-a(country, Japan)
5:	is-a(country, France)
4:	is-a(language, French)
4:	is-a(language, English)
4:	is-a(country, the United Kingdom)
4:	is-a(country, Germany)
3:	is-a(power, France)
3:	is-a(country, Australia)
3:	is-a(city, London)
3:	is-a(country, Egypt)
3:	is-a(country, Saudi Arabia)
3:	is-a(band, Korn)
3:	is-a(language, Haskell)
3:	is-a(city, Bukhara)
3:	is-a(language, Sanskrit)
2:	is-a(city, Athens)


Or, alternatively, we can group these by hypernym:

In [33]:
from collections import defaultdict


names_by_hypernym = defaultdict(list)
for relation in relation_counts:
  type_, m1, m2 = relation[0], relation[1], relation[2]
  names_by_hypernym[m1].append(m2)

# only print longer lists to avoid excessive output
for m1, m2_list in names_by_hypernym.items():
  if len(m2_list) > 25:
    print(f'{m1:14s}:', ', '.join(m2_list))

writers       : Erika Holzer, Hesiod, Francis Bacon 's, Richard Hakluyt, Edward A. Pollard, Lactantius, Frank Báez, Virgil, Gail Simone, Hippolytus, Ursula K. Le Guin, James Fordyce, the 17th - century, Allen Ginsberg, Constance Cumbey, H. G. Wells, Goldstücker, Alexander Pope, Ahmad Shamlou, Giovanni Villani, Machiavelli, William S. Burroughs, Clark Ashton Smith, Saint Jerome, Michael Field, Elizabeth Harrower, Beaumont, Glenn Reynolds, Stephen Crane, Paul Huson, Alejo Carpentier, Salman Rushdie, Albert Camus, Moritz August von Thümmel, Lao She, Plutarch, Remarque, Helen Stevenson, Robert E. Howard, Emile Danoën, Stephen White, Alan Watts, Kerouac, Miguel de Cervantes, Samuel Presbiter, Mathew, Kristine Kathryn Rusch, Isaac Asimov, Richard Dawkins, Anna Seward, Olive Schreiner, John McGahern, Humfrey Barwick
languages     : Hindi, Eiffel, Python, Jalapa Mazatec, French, English, Catalan, JavaScript, CSS, Spanish, Hausa, Sindhi, Banda, Serui, Nicaraguan Sign Language, Korean, Latin, Ar

Note that the relations above are still expressed in terms of strings, not IDs representing the relevant entities in a knowledge base.

To have fully structured relations, we would need to associate the strings such as "author" with knowledge base IDs such as Wikidata [Q482980](https://www.wikidata.org/wiki/Q482980) and make use of normalization information for the named entity mention.

## Dependency pattern matching

The "flat" Hearst patterns succeed in part because the expressions that they match have limited syntactic variability. Let's look at a relation that is commonly expressed in more varied ways, finding strings containing "married" between two `PERSON` mentions:

In [34]:
between_counts = Counter()
for sentence, mentions in zip(sentences, mentions_by_sentence):
  for i, mention1 in enumerate(mentions):
    for mention2 in mentions[i+1:]:
      if mention1.type == 'PERSON' and mention2.type == 'PERSON':
        between = sentence[mention1.end:mention2.start]
        if 'married' in between:
          between_counts[between] += 1


for between, count in between_counts.most_common(10):
  print(f'{count}\tPERSON1{between}PERSON2')

214	PERSON1 married PERSON2
19	PERSON1 , married PERSON2
16	PERSON1 was married to PERSON2
7	PERSON1 married his first wife , PERSON2
6	PERSON1 had married PERSON2
6	PERSON1 , who married PERSON2
5	PERSON1 is married to PERSON2
5	PERSON1 married his second wife , PERSON2
5	PERSON1 , married to PERSON2
4	PERSON1 married actress PERSON2


Here we see much more variability in the linear word sequence. However, consider instead the dependency analyses for these sentences (try the [parser demo](http://bionlp-www.utu.fi/parser_demo/) live!):

<img width="65%" src="https://github.com/TurkuNLP/Text_Mining_Course/raw/master/figs/john-married-jane.png">

We see that the _dependency paths_ connecting the two named entities are much more regular:

```
John  ←      nsubj ← married → obj → Jane
John  ← nsubj:pass ← married → obj → Jane
John  ← nsubj:pass ← married → obj → Jane
John  ←      nsubj ← married → obj → wife → appos → Jane
```

(The definitions for the dependency types can be found in the [Universal Dependencies](https://universaldependencies.org/) documentation: [nsubj](https://universaldependencies.org/u/dep/nsubj.html), [nsubj:pass](https://universaldependencies.org/u/dep/nsubj-pass.html), [obj](https://universaldependencies.org/u/dep/obj.html), [appos](https://universaldependencies.org/u/dep/appos.html))

The observation that the _shortest dependency path_ connecting two entity mentions contains most of the information regarding their relation (if any) has inspired [many relation extraction methods](https://scholar.google.com/scholar?q="shortest+dependency+path"+"relation+extraction").

We won't go into full code here, but we can sketch a dependency pattern-based relation extraction approach as follows:

---

* Write extraction patterns over dependency paths
* For each sentence $s$:
    * Parse sentence $s$, producing dependency graph $d$
    * For each pair of entity mentions in the sentence $(m_1, m_2)$:
        * Find the shortest path in $d$ connecting $m_1$ to $m_2$
        * Linearize the shortest path (e.g. to `ENTITY1 ← nsubj ← married → obj → ENTITY2`)
        * Match each pattern against the dependency path representation and return a relation for each pattern that matches

---

Note that because dependency paths can be linearized to strings, we can use standard methods such as regular expressions to match patterns also against dependency paths. Alternatively, we could use dedicated tools such as [Tregex](https://nlp.stanford.edu/software/tregex.shtml).
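The steps sketched above could look roughly as follows for a single sentence. This is a toy illustration under several assumptions: the dependency edges are written by hand rather than produced by a parser, nodes are identified by word form (a real implementation would use token indices), and the `spouse` pattern and relation name are hypothetical.

```python
import re
from collections import deque

# Hand-written (head, dependent, relation) edges for the sentence
# "John married his first wife , Jane". This parse is an assumption made
# for illustration; in practice it would come from a dependency parser.
EDGES = [
    ('married', 'John', 'nsubj'),
    ('married', 'wife', 'obj'),
    ('wife', 'his', 'nmod:poss'),
    ('wife', 'first', 'amod'),
    ('wife', 'Jane', 'appos'),
]

# Hypothetical pattern for a "spouse" relation over linearized paths
SPOUSE_RE = re.compile(
    r'^ENTITY1 ← nsubj(:pass)? ← married → obj → (wife → appos → )?ENTITY2$'
)


def shortest_path(edges, source, target):
    # Treat the tree as an undirected graph, remembering edge direction
    neighbours = {}
    for head, dep, rel in edges:
        neighbours.setdefault(dep, []).append((head, rel, 'up'))
        neighbours.setdefault(head, []).append((dep, rel, 'down'))
    queue, seen = deque([(source, [])]), {source}
    while queue:    # breadth-first search finds a shortest path
        node, path = queue.popleft()
        if node == target:
            return path
        for nxt, rel, direction in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(rel, direction, nxt)]))
    return None


def linearize(path, target):
    parts = ['ENTITY1']
    for rel, direction, node in path:
        arrow = '←' if direction == 'up' else '→'
        label = 'ENTITY2' if node == target else node
        parts.append(f'{arrow} {rel} {arrow} {label}')
    return ' '.join(parts)


path = shortest_path(EDGES, 'John', 'Jane')
linear = linearize(path, 'Jane')
print(linear)
if SPOUSE_RE.match(linear):
    print('spouse(John, Jane)')
```

Because the pattern makes the `wife → appos` step optional, the same expression also matches the shorter path `ENTITY1 ← nsubj ← married → obj → ENTITY2`.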

## Open Information Extraction

The rule-based approaches considered above have required us to specify the relation types in advance, e.g. through keywords or patterns for each type. By contrast, Open Information Extraction (Open IE) methods aim to identify _any_ relation stated in text, extracting the relation type along with the participating entities.

Open IE methods can operate over the linear sequence of words (making use of e.g. part-of-speech patterns) or over syntactic structures, such as in the following example from the [Stanford Open IE system](https://nlp.stanford.edu/software/openie.html):

<img width="90%" src="https://nlp.stanford.edu/static/img/openie.png">

(Figure from https://nlp.stanford.edu/software/openie.html)


Open IE is at its best when applied to very large corpora. Instead of trying to run a large-scale implementation in this notebook, let's have a look at the results of the [ReVerb](http://reverb.cs.washington.edu/) Open IE system applied to 500 million web pages on this demo site: https://openie.allenai.org/

---

# Relation extraction using machine learning

The most accurate relation extraction methods are based on supervised machine learning, i.e. models trained on examples of which relations to extract from documents.

We'll first look at two options for annotating data for relation extraction, and then sketch a machine learning approach for learning to extract relations.


## Relation annotation

Like all supervised machine learning approaches, ML-based relation extraction methods must be trained on examples of inputs with correct outputs. Recall that in our task setting, the input includes document texts $d$ and entity mentions $M_d = \{m_1, m_2, \ldots, m_n \}$ in each document.

We will then assume that the output consists of typed binary relations $R_d \subset T \times M_d \times M_d$ where $T$ is the set of types. We could for example have $T = \{$ `employee`, `owner`, `founder` $\}$ for person-organization relations.

Using a **document-oriented** annotation approach (here the open-source tool [brat](https://brat.nlplab.org)), our starting point could then look like the following:

<img width="65%" src="https://github.com/TurkuNLP/Text_Mining_Course/raw/master/figs/relation-annotation-empty.png">

We would then read the text of the documents (here, sentences) and mark all applicable relations:

<img width="65%" src="https://github.com/TurkuNLP/Text_Mining_Course/raw/master/figs/relation-annotation-complete.png">

The annotated relations are then positive examples for machine learning:

```
founder(Bill Gates, Microsoft)
founder(Paul Allen, Microsoft)
employee(Steve Ballmer, Microsoft)
```

Here, we have not explicitly marked any _negative_ examples. For training machine learning methods, negatives are instead generated using the _closed world assumption_: any relation that is not explicitly present is assumed to be negative, including e.g.

```
founder(Allen, IBM)
employee(Allen, IBM)
owner(Allen, IBM)
...
founder(Steve Ballmer, Microsoft)
owner(Steve Ballmer, Microsoft)
```
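Generating negatives under the closed world assumption can be sketched as enumerating every candidate `(type, person, organization)` triple and keeping those that were not annotated. The function name is made up, and the entity lists are an assumption restricted to the entities in the positive examples above.

```python
from itertools import product

TYPES = ['employee', 'owner', 'founder']


def closed_world_negatives(persons, orgs, positives, types=TYPES):
    positives = set(positives)
    # Every candidate triple that is not annotated counts as a negative
    return [
        (t, p, o)
        for p, o in product(persons, orgs)
        for t in types
        if (t, p, o) not in positives
    ]


persons = ['Bill Gates', 'Paul Allen', 'Steve Ballmer']
orgs = ['Microsoft']
positives = [
    ('founder', 'Bill Gates', 'Microsoft'),
    ('founder', 'Paul Allen', 'Microsoft'),
    ('employee', 'Steve Ballmer', 'Microsoft'),
]

negatives = closed_world_negatives(persons, orgs, positives)
for t, p, o in negatives:
    print(f'{t}({p}, {o})')
```

With three persons, one organization and three types there are nine candidates, so three positives leave six negatives. This ratio grows quickly with more entities, which is one reason negatives usually far outnumber positives in relation extraction data.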

Alternatively, we can take a **relation-oriented** annotation approach (here using [Prodigy](https://prodi.gy/)), where the tool prompts the annotator to identify the type of relation (if any) that holds between two entities:

<img width="65%" src="https://github.com/TurkuNLP/Text_Mining_Course/raw/master/figs/prodigy-typed-relation-annotation.png">

Here negative examples are explicitly marked by the annotator (label `NONE`). While this approach thus requires the annotator to enter a larger number of labels, individual decisions can be faster to enter, and the explicitness can help reduce errors of omission.

This approach can be carried further by incorporating the type of the relation into the prompt, only asking the annotator whether a `(relation, mention-1, mention-2)` triple holds or not. In this case, the decision can be a simple binary yes/no, further accelerating the annotation process.

## Machine learning task formulation

Given manually annotated examples where each consists of

* Document text $d$
* Two entity mentions in that document, $m_1$ and $m_2$
* Relation type (e.g. `employee`) or `NONE` to signify no relation

relation extraction can be formulated as a classification task by creating a representation of the mentions $m_1$ and $m_2$ in their context ($d$) as input and the relation type (or `NONE`) as output. (Note that this assumes that at most one relation type holds per entity pair.)

In previous state-of-the-art approaches, considerable effort was invested into creating representations of the mentions in context for ML methods, frequently involving e.g. carefully engineered representations of dependency paths.

Fortunately for us, with recent Transformer-based approaches such as BERT, these representations can be simplified into marking the entities in the original text in some way, and providing the text with marked entities to the model as input.

For example, given

* Document text $d$ = `Bill Gates and Paul Allen founded Microsoft`
* Mentions $m_1$ = `(0, 10, PERSON, Bill Gates)` and $m_2$ = `(34, 43, ORG, Microsoft)`
* Relation type `founder`

We could formulate the classification example e.g. as

* input: `PERSON and Paul Allen founded ORG`
* output: `founder`

Here the type strings `PERSON` and `ORG` mark the two mentions under consideration. (Note that other mentions in the context are not marked.)
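A minimal sketch of this input construction is shown below, assuming non-overlapping mentions with character offsets and reusing the `Mention` field layout defined earlier in the notebook. The function name is made up, and the marker is simply each mention's `type` field, so the organization is marked with the string `ORG`. Other marking schemes (e.g. special marker tokens around the mention text) are equally possible.

```python
from collections import namedtuple

Mention = namedtuple('Mention', 'text type start end')


def mark_entities(text, mention1, mention2):
    # Replace the later span first so the earlier offsets stay valid.
    # Assumes the two mentions do not overlap.
    for m in sorted([mention1, mention2], key=lambda m: m.start, reverse=True):
        text = text[:m.start] + m.type + text[m.end:]
    return text


text = 'Bill Gates and Paul Allen founded Microsoft'
m1 = Mention('Bill Gates', 'PERSON', 0, 10)
m2 = Mention('Microsoft', 'ORG', 34, 43)

# One (input, output) training example for a text classifier
example = (mark_entities(text, m1, m2), 'founder')
print(example)
```

Each such pair can then be fed to any text classification model, with `NONE` as the label for pairs where no relation holds.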