# Named Entity Recognition

Named Entity Recognition (NER) is a widely applicable area of natural language processing that will likely be very useful for the leadership prize challenge. At a high level, the purpose of NER is to find the words in a sentence that refer to real-world things such as people, organizations, and places. This is done by extracting the name of each entity along with its class. For example, in the sentence

`
Mark works for Facebook.
`

we have two named entities that can be extracted. These can be seen in the table below:

|  Name    |    Class     |
|:--------:|:------------:|
|`Mark`  |`Person`   |
|`Facebook`|`Organization`|

# Modelling

The mathematics behind NER models was difficult to grasp, even for an Apple-Mathematician such as myself. In short, most models involve Markov chains, conditional probability, and graph theory. What is more interesting is how the models work. The aforementioned mathematics is used to construct dependency graphs, which convey the relationships between the words in a sentence. An example sentence is shown below:

<p align="center">
    <img src="./img/dependency_graph.png" />
</p>

This sentence includes annotations with both dependency and named entity information. Inside-outside-beginning (IOB) tagging is used to show how specific words relate to entities. We see that `The` is the sole beginning word, `House`, `of`, and `Representatives` are classified as inside words, while `votes`, `on`, `the`, and `measure` are classified as outside words. Arrows are also included to illustrate the dependency graph between words.
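
To make the IOB scheme concrete, here is a minimal sketch in plain Python that reproduces the tags described above. The function name `iob_tags` and the token-index span `(0, 4)` are assumptions made for illustration; a real NER model predicts these tags rather than being handed the span.

```python
def iob_tags(tokens, entity_spans):
    """Label each token B, I, or O given (start, end) token spans, end exclusive."""
    tags = ["O"] * len(tokens)          # everything starts outside an entity
    for start, end in entity_spans:
        tags[start] = "B"               # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I"               # remaining tokens inside the entity
    return tags


tokens = "The House of Representatives votes on the measure".split()
entity_spans = [(0, 4)]  # assumed span covering "The House of Representatives"

for token, tag in zip(tokens, iob_tags(tokens, entity_spans)):
    print(f"{token:16} {tag}")
```

Only the first token of an entity gets `B`, which is what lets two adjacent entities be told apart when the tags are read back.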

# Information Extraction in Python
> This is kind of interesting, but I don't think it'll be that relevant. Skip to the next section if you want to see more relevant stuff.

We can implement NER in Python using [NLTK](https://www.nltk.org/) and [SpaCy](https://spacy.io/). For the purpose of this example I will take some words that were said recently by a complete moron.

In [14]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

# >:-(
ex = "President Donald Trump told four congresswomen to go back to the countries where they came from."


def preprocess(sent):
    """
    Apply word tokenization and part-of-speech tagging.
    """
    sent = word_tokenize(sent)
    sent = pos_tag(sent)
    return sent


sent = preprocess(ex)
print(sent)


[('President', 'NNP'), ('Donald', 'NNP'), ('Trump', 'NNP'), ('told', 'VBD'), ('four', 'CD'), ('congresswomen', 'NNS'), ('to', 'TO'), ('go', 'VB'), ('back', 'RB'), ('to', 'TO'), ('the', 'DT'), ('countries', 'NNS'), ('where', 'WRB'), ('they', 'PRP'), ('came', 'VBD'), ('from', 'IN'), ('.', '.')]


What we are left with is a list of tuples, each containing a word and its part-of-speech tag. To make this more useful, we'll implement noun phrase chunking. This uses a *regular expression* over the tags to group words into candidate phrases, which is a first step towards finding named entities. Our regular expression states that a noun phrase (NP) should be formed when the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and finally a noun tagged `NN`.

In [15]:
pattern = 'NP: {<DT>?<JJ>*<NN>}'

cp = nltk.RegexpParser(pattern)
cs = cp.parse(sent)
print(cs)

(S
  President/NNP
  Donald/NNP
  Trump/NNP
  told/VBD
  four/CD
  congresswomen/NNS
  to/TO
  go/VB
  back/RB
  to/TO
  the/DT
  countries/NNS
  where/WRB
  they/PRP
  came/VBD
  from/IN
  ./.)


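As before, NLTK allows us to see IOB tags as well. Note that the tree above contains no `NP` subtree and every token below is tagged `O`: the pattern only matches the tag `NN`, while the nouns in this sentence are tagged `NNP` and `NNS`, so no chunk was formed.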

In [16]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint

iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)

[('President', 'NNP', 'O'),
 ('Donald', 'NNP', 'O'),
 ('Trump', 'NNP', 'O'),
 ('told', 'VBD', 'O'),
 ('four', 'CD', 'O'),
 ('congresswomen', 'NNS', 'O'),
 ('to', 'TO', 'O'),
 ('go', 'VB', 'O'),
 ('back', 'RB', 'O'),
 ('to', 'TO', 'O'),
 ('the', 'DT', 'O'),
 ('countries', 'NNS', 'O'),
 ('where', 'WRB', 'O'),
 ('they', 'PRP', 'O'),
 ('came', 'VBD', 'O'),
 ('from', 'IN', 'O'),
 ('.', '.', 'O')]
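
Every tag in the output above is `O`, which suggests the pattern never matched. The sketch below imitates tag-pattern chunking with only the standard-library `re` module, to show why `<NN>` finds nothing in this sentence and what a broader noun pattern would pick up. It is not NLTK's implementation: `find_chunks`, `NARROW`, and `BROAD` are names invented for this illustration. In NLTK's own syntax the broader pattern should correspond to something like `NP: {<DT>?<JJ>*<NN.*>+}`.

```python
import re

# Tagged tokens copied from the pos_tag output earlier in this section.
tagged = [
    ("President", "NNP"), ("Donald", "NNP"), ("Trump", "NNP"), ("told", "VBD"),
    ("four", "CD"), ("congresswomen", "NNS"), ("to", "TO"), ("go", "VB"),
    ("back", "RB"), ("to", "TO"), ("the", "DT"), ("countries", "NNS"),
    ("where", "WRB"), ("they", "PRP"), ("came", "VBD"), ("from", "IN"), (".", "."),
]


def find_chunks(tagged, tag_regex):
    """Return the word groups whose tag sequence matches tag_regex."""
    tag_string = ""
    starts = {}  # character offset in tag_string -> token index
    for index, (_, tag) in enumerate(tagged):
        starts[len(tag_string)] = index
        tag_string += f"<{tag}>"
    starts[len(tag_string)] = len(tagged)  # offset just past the last token

    chunks = []
    for match in re.finditer(tag_regex, tag_string):
        first, last = starts[match.start()], starts[match.end()]
        chunks.append([word for word, _ in tagged[first:last]])
    return chunks


NARROW = r"(?:<DT>)?(?:<JJ>)*<NN>"            # same shape as the pattern above
BROAD = r"(?:<DT>)?(?:<JJ>)*(?:<NN[^>]*>)+"   # also accepts NNS, NNP, NNPS

narrow_chunks = find_chunks(tagged, NARROW)
broad_chunks = find_chunks(tagged, BROAD)
print(narrow_chunks)  # []
print(broad_chunks)   # [['President', 'Donald', 'Trump'], ['congresswomen'], ['the', 'countries']]
```

The broader pattern still only groups noun phrases. It does not say which of them are named entities or what class they belong to.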


# NER in Python

For extracting named entities from text, we turn to our good old friend `SpaCy`. We will now use a new example to see how we can extract named entities with `SpaCy`:

In [10]:
import spacy
from spacy import displacy
from collections import Counter
from pprint import pprint
import en_core_web_sm
nlp = en_core_web_sm.load()

doc = nlp("Since reporting began in 1924, the federal government reports a total of 25 people in six provinces have died of rabies in Canada.")

pprint([(X.text, X.label_) for X in doc.ents])

[('1924', 'DATE'), ('25', 'CARDINAL'), ('six', 'CARDINAL'), ('Canada', 'GPE')]


We see here the named entities with their specific types, given to us by `SpaCy`. As per the docs, the types from the output are described as follows:

|Type  |Description                           |
|:-----|--------------------------------------|
|`DATE`|Absolute or relative dates or periods.|
|`CARDINAL`|Numerals that do not fall under another type.|
|`GPE`| Countries, cities, states. |

> See the full list [here](https://spacy.io/api/annotation)
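
The cell above imports `Counter` but never uses it. One plausible use, sketched here on the entity list copied from the output so that it runs without a model, is tallying how often each type appears. With a live `doc`, the same tally could presumably be written as `Counter(ent.label_ for ent in doc.ents)`.

```python
from collections import Counter

# (text, label) pairs copied from the output above.
ents = [("1924", "DATE"), ("25", "CARDINAL"), ("six", "CARDINAL"), ("Canada", "GPE")]

# Count how many entities were found for each type.
label_counts = Counter(label for _, label in ents)
print(label_counts.most_common())
```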


If we now choose to focus on *tokens* instead of entities, we can do the following:

In [11]:
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])

[(Since, 'O', ''),
 (reporting, 'O', ''),
 (began, 'O', ''),
 (in, 'O', ''),
 (1924, 'B', 'DATE'),
 (,, 'O', ''),
 (the, 'O', ''),
 (federal, 'O', ''),
 (government, 'O', ''),
 (reports, 'O', ''),
 (a, 'O', ''),
 (total, 'O', ''),
 (of, 'O', ''),
 (25, 'B', 'CARDINAL'),
 (people, 'O', ''),
 (in, 'O', ''),
 (six, 'B', 'CARDINAL'),
 (provinces, 'O', ''),
 (have, 'O', ''),
 (died, 'O', ''),
 (of, 'O', ''),
 (rabies, 'O', ''),
 (in, 'O', ''),
 (Canada, 'B', 'GPE'),
 (., 'O', '')]


We can see here that `IOB` tags are provided for every token, while types are provided only for tokens that belong to a named entity. For a better example, if we pick a sentence where certain entities contain multiple words, we can see the distinction between tokens and entities:

In [12]:
doc = nlp("Donald Trump is the president of the United States of America.")
print("ENTITIES:\n")

pprint([(X.text, X.label_) for X in doc.ents])

print("\nTOKENS:\n")

pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])

ENTITIES:

[('Donald Trump', 'PERSON'), ('the United States of America', 'GPE')]

TOKENS:

[(Donald, 'B', 'PERSON'),
 (Trump, 'I', 'PERSON'),
 (is, 'O', ''),
 (the, 'O', ''),
 (president, 'O', ''),
 (of, 'O', ''),
 (the, 'B', 'GPE'),
 (United, 'I', 'GPE'),
 (States, 'I', 'GPE'),
 (of, 'I', 'GPE'),
 (America, 'I', 'GPE'),
 (., 'O', '')]
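
The link between the two views can be sketched in plain Python: walking the token-level `B`/`I`/`O` tags and merging each `B` with the `I` tokens that follow it recovers the entity list. The helper `group_entities` is a hypothetical name, and joining words with a single space is an assumption made for simplicity.

```python
def group_entities(token_tags):
    """Merge (text, iob, type) token triples into (entity_text, type) pairs."""
    entities = []
    words, ent_type = [], None

    def flush():
        # Close the entity currently being built, if any.
        if words:
            entities.append((" ".join(words), ent_type))

    for text, iob, token_type in token_tags:
        if iob == "B":                 # a new entity starts here
            flush()
            words, ent_type = [text], token_type
        elif iob == "I" and words:     # continue the current entity
            words.append(text)
        else:                          # "O": outside any entity
            flush()
            words, ent_type = [], None
    flush()
    return entities


# Token triples copied from the TOKENS output above.
token_tags = [
    ("Donald", "B", "PERSON"), ("Trump", "I", "PERSON"), ("is", "O", ""),
    ("the", "O", ""), ("president", "O", ""), ("of", "O", ""),
    ("the", "B", "GPE"), ("United", "I", "GPE"), ("States", "I", "GPE"),
    ("of", "I", "GPE"), ("America", "I", "GPE"), (".", "O", ""),
]

entities = group_entities(token_tags)
print(entities)  # [('Donald Trump', 'PERSON'), ('the United States of America', 'GPE')]
```

The result matches the ENTITIES output, which shows that the token tags carry the same information as the entity spans.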


In conclusion, named entity recognition is really cool and we should definitely use it (with `SpaCy` over `nltk`) in some form in our final product. I hope you enjoyed reading this notebook!