# Part 1 - Information Extraction

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. Information Extraction can be a task like:

* **Named Entity Recognition**, retrieve entities (like Person, Location, etc.) in the text. 
* **Relation Extraction**, find the relation between two entities in the text.
* **Template Filling**, find the correct entity to fill a certain template.

In this BLU, we are going to learn some of the basic techniques to extract specific information from textual sources. We are going to focus on the task of **named-entity recognition (NER)** where our objective is to **retrieve all the mentions** of entities like people, location, time, etc.

We will first try a simple approach with regular expressions and then a more sophisticated one using SpaCy.

### Table of contents

[1. Information Extraction with Regular Expressions](#1.-Information-Extraction-with-Regular-Expressions)   
[2. Deeper look into information extraction using SpaCy](#2.-Deeper-look-into-information-extraction-using-SpaCy)   
&emsp;[2.1 SpaCy 101](#2.1-SpaCy-101)   
&emsp;[2.2 Information extraction with SpaCy](#2.2-Information-extraction-with-SpaCy)   
&emsp;&emsp;[2.2.1 Named entity recognition](#2.2.1-Named-entity-recognition)   
&emsp;&emsp;[2.2.2 Matcher](#2.2.2-Matcher)   
&emsp;&emsp;[2.2.3 Information extraction with complex patterns](#2.2.3-Information-extraction-with-complex-patterns)   
[3. Further directions and reading](#3.-Further-directions-and-reading)

![robot entities](./media/robot_entities.jpg)

In [1]:
import re
import json

import pandas as pd
import spacy

We are going to work on a corpus containing forum discussions. We extracted a sample from Reddit for this use. For more interesting examples, you may find more textual data available at https://files.pushshift.io/reddit/

Let's load the data:

In [2]:
docs = []
with open('./datasets/sample_data.json') as fp:
    for line in fp:
        entry = json.loads(line)
        docs.append(entry['body'])
        
print('I read {} documents'.format(len(docs)))

I read 1000 documents


## 1. Information Extraction with Regular Expressions

In BLU07, we became pros at regular expressions. In this BLU, we're going to try to use them for entity recognition. Take a moment to think about all the kinds of entities that we can find in a text. Do you think such a task is achievable using only regular expressions?

![regex](./media/regex.gif "regex")

As a refresher, let's say that you need to retrieve all the **dates** mentioned in our sample corpus. We learned in BLU07 that, if we follow a certain pattern for the dates, it is easy to use a regular expression to extract them.

In [3]:
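# Let's find all possible dates written as 1-2 digits / 1-2 digits / 2-4 digits (e.g. 7/12/2007)
data = ' '.join(docs)
# Use a raw string so that \d is passed to the regex engine untouched
re.findall(r'\d{1,2}/\d{1,2}/\d{2,4}', data)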

['14/09/30', '7/12/2007', '4/16/2007', '3/27/2007', '2/28/2007']
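A pattern like this only checks the shape of the string, not whether it is a real date. As a minimal sketch, the matches could be filtered with `datetime.date`. The function name `plausible_dates`, the sample sentence, and the rule that two-digit years belong to the 2000s are all assumptions made for illustration:

```python
import re
from datetime import date

# Same idea as the pattern above, but with word boundaries, capture groups,
# and the year restricted to 2 or 4 digits (an assumption of this sketch).
DATE_RE = re.compile(r'\b(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})\b')


def plausible_dates(text):
    """Return matched strings that form a real calendar date as D/M/Y or M/D/Y."""
    found = []
    for m in DATE_RE.finditer(text):
        first, second, year = (int(g) for g in m.groups())
        if year < 100:
            year += 2000  # assumption: two-digit years belong to the 2000s
        # Try day/month first, then month/day
        for day, month in ((first, second), (second, first)):
            try:
                date(year, month, day)  # raises ValueError for impossible dates
            except ValueError:
                continue
            found.append(m.group())
            break
    return found


# Hypothetical sample sentence, not taken from the corpus
sample = "Posted on 7/12/2007 and 4/16/2007, but 31/31/2007 is not a date."
print(plausible_dates(sample))  # ['7/12/2007', '4/16/2007']
```

Note that a string such as '7/12/2007' stays ambiguous (July 12 or December 7); the regex cannot tell which one the author meant.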

Ok, this looks like it's going to be a breeze. Next task is to retrieve all the **country names** from the corpus.

One possible approach is to get a list of all countries that exist and then look for occurrences of these names in the corpus. Let's try that, shall we?

![alt text](./media/countries_meme.jpg)

In [4]:
countries = []
with open('./datasets/countries.txt') as fp:
    for line in fp:
        countries.append(line.rstrip())

Again, we'll try to use regular expressions:

In [5]:
# Sort country list by length. This is important to match longer names before short 
# ones (like in 'Papua New Guinea' vs. 'Papua')
countries.sort(key=len, reverse=True)

# Make a regex to recognize all possible names.
# '|' stands for the logical OR operation
# \b matches a word boundary (the position between a word character and a non-word character such as punctuation or white space)
# re.escape returns all the non-alphanumeric characters backslashed to avoid 
# their misinterpretation as regex metacharacters
countries_regex = r'\b(' + '|'.join([re.escape(c) for c in countries]) + r')\b'

# finditer is similar to findall
# the flag re.I means to ignore casing (accept both lowercase and uppercase letters as the same)
for i, m in enumerate(re.finditer(countries_regex, data, flags=re.I)):
    print( (m.group(), m.start(), m.end()) )
    # stop after the first 22 matches (i runs from 0 to 21)
    if i > 20:
        break    

('us', 763, 765)
('United States', 827, 840)
('UK', 6971, 6973)
('US', 7000, 7002)
('Puerto rico', 8026, 8037)
('us', 8638, 8640)
('France', 19815, 19821)
('us', 21563, 21565)
('Puerto Rico', 27659, 27670)
('Puerto Rico', 27754, 27765)
('US', 28101, 28103)
('Canada', 29439, 29445)
('USA', 32880, 32883)
('Norway', 34749, 34755)
('Korea', 34837, 34842)
('USA', 35738, 35741)
('United States', 41060, 41073)
('us', 42290, 42292)
('us', 42403, 42405)
('Soviet', 44563, 44569)
('us', 49625, 49627)
('Chad', 51352, 51356)
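Why does the order of the names in the regex matter? Alternation with `|` tries the options from left to right and keeps the first one that matches. The sketch below shows the effect on a small, hypothetical list of names (not the contents of `countries.txt`); the helper `build_regex` is made up for this illustration:

```python
import re


def build_regex(names):
    """Join names into one case-insensitive alternation, in the order given."""
    return re.compile(r'\b(' + '|'.join(re.escape(n) for n in names) + r')\b', flags=re.I)


# Hypothetical mini list, used only to show the ordering effect
names = ['Papua', 'Guinea', 'Papua New Guinea']
text = 'She flew from Papua New Guinea to Guinea.'

unsorted_hits = build_regex(names).findall(text)
sorted_hits = build_regex(sorted(names, key=len, reverse=True)).findall(text)

print(unsorted_hits)  # ['Papua', 'Guinea', 'Guinea']
print(sorted_hits)    # ['Papua New Guinea', 'Guinea']
```

With the unsorted list, 'Papua' wins before the longer name gets a chance, so the full name is split into two shorter matches.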


**Is this approach working?**

It seems like the word **'us'**, for example, has caused some confusion. It could be the country _U.S._ or the pronoun _us_. In this case, we are not able to disambiguate the two by just comparing the word form. We will need either more **context** or more **linguistic information**, and regular expressions won't give us any of that.
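To see the ambiguity in isolation, here is a small sketch on a made-up sentence (the sentence and the two-item name list are assumptions for illustration). It compares the case-insensitive search we used with a case-sensitive one:

```python
import re

# Hypothetical sentence containing both the pronoun and the country
text = "They told us the US economy is growing, and the USA agreed with us."
names = ["US", "USA"]

# Longest names first, as in the country regex above
pattern = r'\b(' + '|'.join(re.escape(n) for n in sorted(names, key=len, reverse=True)) + r')\b'

insensitive = re.findall(pattern, text, flags=re.I)
sensitive = re.findall(pattern, text)

print(insensitive)  # ['us', 'US', 'USA', 'us']
print(sensitive)    # ['US', 'USA']
```

Dropping `re.I` removes the pronoun here, but it is only a partial workaround: it would miss spellings like 'Puerto rico' and would still match the pronoun in text written in all caps.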

Luckily, you already know an NLP library which can provide the correct information to disambiguate the word 'us'. In the next examples, we will use SpaCy as our NLP toolkit to give us just that.

## 2. Deeper look into information extraction using SpaCy

<img src="media/spacy.jpg" alt="Spacy" width="600"> 

If you remember BLU08, we used SpaCy to understand word vectors (aka word embeddings). We will make use of the medium-sized SpaCy English model once again. If you haven't downloaded it yet (in which case the code in the following cell throws an error), you can simply use this command in a Notebook cell:

```
!python -m spacy download en_core_web_md
```
   
SpaCy provides English models of different sizes - small, medium, large (en_core_web_sm, en_core_web_md, en_core_web_lg) - you can use the one that suits your needs and memory size.

In [6]:
nlp = spacy.load('en_core_web_md')

### 2.1 SpaCy 101
SpaCy is an NLP library designed for use in production. That means it is designed to get things done rather than to experiment, so it doesn't let you choose between many different algorithms the way libraries designed for research and teaching, e.g. NLTK and CoreNLP, do.

SpaCy provides way more information than the word counting algorithms we were using until now. SpaCy sees the forest, not just the trees. It provides information derived from context: Is this word a noun or a verb? What is the word next to it? What role does it play in the sentence, subject or object? (Yes, you'll have to remember your language classes to use SpaCy.) It can also recognize people and places, so-called named entities.

SpaCy contains pretrained models for different languages and also the option to pretrain your own model on your own corpus data. This is important if your text is very specific. For instance, if you'd like to analyze text from social media, a model trained on romantic novels will not work that well because these two types of language data will have a very different vocabulary and sentence structure.

This is the scheme of SpaCy's processing [pipeline](https://spacy.io/usage/processing-pipelines) that we loaded above. It consists of a tokenizer and other modules which add information about the tokens. The modules can be based on a model or on a set of rules. In a model-based module, the results are predicted, whereas in a rule-based one, they are determined by following some rule. For instance, a lemmatizer can determine the base form of words based on grammatical rules.

<img src="media/spacy_pipeline.png" alt="Spacy_pipeline"> 

Let's look at the pipeline components one by one:
- The **tokenizer** segments the input text into tokens and converts it into a `Doc` object. All further steps operate on this object and also output it.
- The **lemmatizer** outputs the base form of each token.
- The **tagger** assigns part-of-speech (POS) tags like verb, proper noun, adjective.
- The **morphologizer** predicts coarse-grained POS and morphological information like singular or plural, verb tense, first/second/third person.
- The **parser** predicts syntactic information for each token - its role in the sentence like subject, direct object, determiner.
- The **named entity recognizer** (NER) recognizes real-world objects like countries or companies.
- The **entity linker** links named entities to a knowledge base, e.g. the 'first female programmer' and 'Ada Lovelace' link to the same item in the knowledge base.
- The **Tok2Vec** assigns a word vector to each token.
- The **matcher** is like a sophisticated regex that also uses linguistic information.
- The **sentence boundary recognizer** detects sentences.
- The **text categorizer** predicts labels for whole chunks of text.
- Any **custom components** that a user can add.

The information from each processing step is added as attributes to each token. Let's look at an example (shamelessly copied from the SpaCy documentation):

In [7]:
# Process a text with the nlp pipeline, the result is a Doc object
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
# Show the token and some of its attributes - part-of-speech and syntactic dependency labels
for token in doc:
    print(token.text, token.pos_, token.dep_, token.tag_, token.shape_, token.is_stop, token.morph)

Apple PROPN nsubj NNP Xxxxx False Number=Sing
is AUX aux VBZ xx True Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
looking VERB ROOT VBG xxxx False Aspect=Prog|Tense=Pres|VerbForm=Part
at ADP prep IN xx True 
buying VERB pcomp VBG xxxx False Aspect=Prog|Tense=Pres|VerbForm=Part
U.K. PROPN compound NNP X.X. False Number=Sing
startup NOUN dobj NN xxxx False Number=Sing
for ADP prep IN xxx True 
$ SYM quantmod $ $ False 
1 NUM compound CD d False NumType=Card
billion NUM pobj CD xxxx False NumType=Card


Notice that some attribute names end in an underscore. SpaCy stores strings as hash values to minimize memory usage. An attribute name without the underscore returns the hash; add the underscore to get the value in human-readable form.

In [8]:
doc[0].text, doc[0].tag, doc[0].tag_

('Apple', 15794550382381185553, 'NNP')
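The mapping between strings and hashes works in both directions. As a minimal sketch, assuming SpaCy v3 is installed, we can reproduce it with a standalone `StringStore`, the same kind of object that sits behind `nlp.vocab.strings` (the variable names are made up for this illustration):

```python
from spacy.strings import StringStore

# A small, standalone string store (no language model needed)
store = StringStore(['NNP', 'PROPN'])

tag_hash = store['NNP']     # string -> integer hash
tag_text = store[tag_hash]  # integer hash -> original string

print(tag_hash, tag_text)
```

Looking up a hash only works if the string has been added to the store, which is why the hash alone is not enough to recover the text.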

So, according to SpaCy, 'Apple' in this sentence is a proper noun (PROPN) and a nominal subject (nsubj). But what is NNP? Fortunately, SpaCy is nice and explains:

In [9]:
spacy.explain("NNP")

'noun, proper singular'

For the full list of attribute values that SpaCy can explain, look [here](https://github.com/explosion/spaCy/blob/master/spacy/glossary.py).

We also got the information about the word shape and if it's a stopword. This is just a small selection of token attributes, see the whole list [here](https://spacy.io/api/token).

<img src="media/pronoun.jpg" alt="Pronoun meme" width="300"> 

### 2.2 Information extraction with SpaCy
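We will first process the documents with the complete NLP pipeline using the [`pipe`](https://spacy.io/api/language#pipe) method. It tokenizes each text and runs every pipeline component on it, working through the texts in batches, which is faster than calling `nlp` on each text separately. By default `pipe` runs in a single process; its `n_process` argument lets you spread the work over several CPU cores. The output of `pipe` is a generator, so we will convert it to a list.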

In [10]:
# We are going to use the function pipe to process all documents in batches.
# One of the strengths of SpaCy is that it can also run in parallel: pass n_process to use several cores.

docs = list(nlp.pipe(docs))

Let's take one of the processed Docs. We want to detect country names, so we will look at the named entities detected by the NER pipeline component. This is our example:

In [11]:
example = docs[250]
print(example)

It was more like $10 million dollars after the previous government infringed on his rights within the Charter of Rights and Freedoms (basically the Canadian constitution). Trudeau knew Khadr would win in court, and settled for paying $10 million instead of an amount multiple times more. 

Not saying I'm happy with Khadr getting $10 million, but this was more of a fuck up on the previous government since they blatantly violated his rights as a Canadian.


#### 2.2.1 Named entity recognition
The `.ents` attribute of the `Doc` object holds the information about the [named entities](https://spacy.io/usage/linguistic-features#named-entities). Here we show the entities, the place where they start and end in the text, and their category. For the complete list of categories, look [here](https://github.com/explosion/spaCy/blob/master/spacy/glossary.py#L326) or we can also ask SpaCy:

In [12]:
nlp.get_pipe("ner").labels

('CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART')

In [13]:
for ent in example.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

more like $10 million dollars 7 36 MONEY
the Charter of Rights and Freedoms 98 132 LAW
Canadian 148 156 NORP
Trudeau 172 179 PERSON
Khadr 185 190 PERSON
$10 million 234 245 MONEY
Khadr 316 321 PERSON
$10 million 330 341 MONEY
Canadian 447 455 NORP


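SpaCy labeled most of the entities in our example correctly: 'Trudeau' and 'Khadr' are PERSON entities and 'Canadian' is a NORP entity (nationalities or religious or political groups). The first MONEY span is a bit too generous, though, since it also swallowed the words 'more like'.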

There is a named entities category `GPE`:

In [14]:
spacy.explain('GPE')

'Countries, cities, states'

So we'll take these and use the `Matcher` to sort out the countries.

#### 2.2.2 Matcher
A `Matcher` is SpaCy's version of a regular expression - it searches for patterns in your text according to the rules you give it. However, it is much more powerful since it has access to the outputs of the aforementioned NLP pipeline. That means we can search patterns that include certain named entities or part-of-speech tags. 

The `Matcher` is initialized with the vocabulary object from the pipeline which we used to analyze the documents. The vocabulary object is just what it sounds like - a list of words with some information about them.

In [15]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab) # Pass the vocabulary object to Matcher.__init__()

Similar to regex, `Matcher` operates with patterns. We will define a pattern for country detection using the list of country names. We will define one pattern for each country and pass it to `Matcher` with the `add` method. Notice that each added pattern has a name or ID, like this `matcher.add(ID, pattern)`. 

In [16]:
for country in countries:
    # Build a pattern from the country name. 
    # For example: United States -> [{'LOWER': 'united'}, {'LOWER': 'states'}]
    # LOWER matches against the lowercased form of the token text.
    pattern = [[{'LOWER': c.lower()} for c in country.split()]]
    matcher.add(country, pattern)

Now we run the patterns on the list of documents and show the matches. The first number is the document number, followed by the start and end token indices of the matched span.

In [17]:
# for screen economy, let's just show the matches for the first 400 documents.
for i, doc in enumerate(docs[:400]):
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]  # the matched span
        print(i, start, end, span)

10 1 2 us
12 4 6 United States
58 22 23 UK
58 28 29 US
64 18 20 Puerto rico
69 50 51 us
146 4 5 France
167 29 30 us
213 99 101 Puerto Rico
213 121 123 Puerto Rico
213 198 199 US
229 4 5 Canada
255 86 87 USA
263 80 81 Norway
263 103 104 Korea
267 2 3 USA
312 4 6 United States
320 35 36 us
320 58 59 us
335 38 39 Soviet
335 38 39 Soviet
349 4 5 us
367 7 8 Chad
369 11 12 Chad
369 18 19 Chad
369 41 42 Chad
386 4 6 United States


So far, we used `Matcher` just like a regex and of course the results are the same. Let's now take it up a notch and also use the linguistic information. 

We are interested in matching the country names that were tagged as **Proper Nouns** ('PROPN' POS tag). A Proper Noun is a specific (i.e., not generic) name for a particular person, place, or thing. We will add a `'POS'` entry to the pattern dictionary with the `'PROPN'` tag as the value.

In [18]:
# new matcher instance
matcher = Matcher(nlp.vocab)

for country in countries:
    # same as before, but now with one more restriction: the Part-of-speech should be a Proper Noun.
    pattern = [[{'LOWER': c.lower(), 'POS': 'PROPN'} for c in country.split()]]
    matcher.add(country, pattern)

Again, we look at matches in the first 400 documents:

In [19]:
for i, doc in enumerate(docs[:400]):
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id] 
        span = doc[start:end]
        print(i, start, end, span)

12 4 6 United States
58 22 23 UK
58 28 29 US
64 18 20 Puerto rico
146 4 5 France
213 99 101 Puerto Rico
213 121 123 Puerto Rico
213 198 199 US
229 4 5 Canada
255 86 87 USA
263 80 81 Norway
263 103 104 Korea
267 2 3 USA
312 4 6 United States
349 4 5 us
367 7 8 Chad
369 11 12 Chad
369 18 19 Chad
369 41 42 Chad
386 4 6 United States


Much better! We only see one incorrectly matched 'us' now. We could also try out another pattern, matching entity types instead of POS:

In [20]:
# new matcher instance
matcher = Matcher(nlp.vocab)

for country in countries:
    # same as the first matcher, but now with a different restriction: the entity type should be GPE.
    pattern = [[{'LOWER': c.lower(), 'ENT_TYPE': 'GPE'} for c in country.split()]]
    matcher.add(country, pattern)

In [21]:
for i, doc in enumerate(docs[:400]):
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id] 
        span = doc[start:end]
        print(i, start, end, span)

58 22 23 UK
58 28 29 US
64 18 20 Puerto rico
213 99 101 Puerto Rico
213 121 123 Puerto Rico
213 198 199 US
263 80 81 Norway
263 103 104 Korea
267 2 3 USA
367 7 8 Chad


Now the list is even shorter. Why do you think this is? We can look at the surrounding words of the matched token to understand it:

In [22]:
for i, doc in enumerate(docs[:400]):
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id] 
        span = doc[start-3:end+3]
        print(i, start, end, span)

58 22 23 shameless where the UK version is better
58 28 29 better but the US version exists.
64 18 20 white people in Puerto rico. Its a
213 99 101 his tweets about Puerto Rico from the past
213 121 123 number of his Puerto Rico related tweets in
213 198 199 tax code for US, The American
263 80 81 [EU] Norway DMG looking to
263 103 104 [ASIA] Korea, looking for
267 2 3 
367 7 8 was not a Chad you're fucking


#### 2.2.3 Information extraction with complex patterns

Let's now look into other types of information extraction methods which use complex structures. For example, let's say we want to extract places. Usually, places come up in text in structures similar to:

* go to xx
* went from xxx
* going to xx

Such patterns could also be interesting for the task of relation extraction we mentioned in the intro.

In order to build a SpaCy pattern for the proposed sentence structure, we are going to use the lemma 'go' which is invariant for all possible verb inflections, a preposition (POS tag ADP) and a proper noun (POS tag PROPN).

In [23]:
matcher = Matcher(nlp.vocab)
pattern = [[{'LEMMA': 'go'}, {'POS': 'ADP'}, {'POS': 'PROPN'}]]
matcher.add('LOC', pattern)

In [24]:
for doc in docs:
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]  # the matched span
        span_text = span.text  # the span as a string
        print(start, end, span_text)

24 27 goes to GTA
246 249 going to Osaka
81 84 gone to Irvine
91 94 going with Robbie


These sure aren't all the locations that are present in our corpus and we also got some false positives! Not what we expected then 🙃 Anyway, the possibilities of `Matcher` are [endless](https://spacy.io/usage/rule-based-matching) and this kind of pattern might work in simpler situations.
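One way to cut false positives such as 'going with Robbie' is to restrict the preposition to a few directional ones and to let the proper noun span several tokens. The sketch below assumes SpaCy v3. To stay independent of the trained model, it builds a `Doc` by hand with made-up words, lemmas and POS tags, so it illustrates the pattern syntax only and says nothing about how the real tagger behaves:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

blank_nlp = spacy.blank("en")  # tokenizer-only pipeline, no model download needed

# Hand-annotated toy sentence (all annotations are assumptions for illustration)
words = ["I", "was", "going", "with", "Robbie", "but", "went", "to", "New", "York"]
lemmas = ["I", "be", "go", "with", "Robbie", "but", "go", "to", "New", "York"]
pos = ["PRON", "AUX", "VERB", "ADP", "PROPN", "CCONJ", "VERB", "ADP", "PROPN", "PROPN"]
toy_doc = Doc(blank_nlp.vocab, words=words, lemmas=lemmas, pos=pos)

toy_matcher = Matcher(blank_nlp.vocab)
pattern = [[
    {"LEMMA": "go"},                        # any inflection of 'go'
    {"LOWER": {"IN": ["to", "from"]}},      # only directional prepositions
    {"POS": "PROPN", "OP": "+"},            # one or more proper nouns
]]
# greedy="LONGEST" keeps only the longest of overlapping matches
toy_matcher.add("GO_PLACE", pattern, greedy="LONGEST")

matched = [toy_doc[start:end].text for _, start, end in toy_matcher(toy_doc)]
print(matched)  # ['went to New York']
```

The same pattern can be added to a `Matcher` built on `nlp.vocab` and run over the processed corpus; which matches it returns there depends on the tags the model predicts.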

## 3. Further directions and reading

[SpaCy 101](https://spacy.io/usage/spacy-101) - a good, concise introduction to SpaCy

If you run into the limits of the pretrained models, you can [train your own model on your own corpus](https://spacy.io/usage/training).

Guide to all [linguistic features](https://spacy.io/usage/linguistic-features) used by SpaCy.

Another possible way to go is to annotate examples in a corpus and train machine learning systems from scratch to automatically extract patterns from the annotated corpora. This class of machine learning methods is known as sequence labeling, and the best-known approaches are [CRFs](https://people.cs.umass.edu/~wallach/technical_reports/wallach04conditional.pdf), [Seq2seq](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf) and, most recently, [Transformers](https://jalammar.github.io/illustrated-bert/). You are not expected to know how these models work at this point, but it's good to keep them in mind and maybe play with them a little if you wish 😊

And of course, there are the [large language models](https://developers.google.com/machine-learning/resources/intro-llms), which you have most likely heard of or interacted with already.