# Week 9: Named Entity Recognition

This week we are going to look at named entity recognition in the fiction genre. In doing so, we will introduce the spaCy library (https://spacy.io/), which provides a number of very fast tools with state-of-the-art accuracy for carrying out NLP tasks including part-of-speech tagging, dependency parsing and named entity recognition.



In [None]:
#preliminary imports

from google.colab import drive
#mount google drive
drive.mount('/content/drive/')
import sys
sys.path.append('/content/drive/My Drive/NLE Notebooks/resources/')

import pandas as pd
import operator

## Project Gutenberg

The [Project Gutenberg electronic text archive](http://www.gutenberg.org/) contains around 25,000 free electronic books.

A small selection is made available through NLTK. For the full list, run the following two cells.

In [None]:
import nltk
nltk.download('gutenberg')

In [None]:
from nltk.corpus import gutenberg
gutenberg.fileids()

We can get the raw text of any of the novels using the `gutenberg.raw(fileid)` method. This returns a string.

In [None]:
emma=gutenberg.raw('austen-emma.txt')
emma[:500]

Now, we carry out a little bit of cleaning of the text.  Check you understand what each line in the `clean_text()` function does.

In [None]:
import re
def clean_text(astring):
    #replace newlines with space
    newstring=re.sub(r"\n"," ",astring)
    #remove title and chapter headings
    newstring=re.sub(r"\[[^\]]*\]"," ",newstring)
    newstring=re.sub(r"VOLUME \S+"," ",newstring)
    newstring=re.sub(r"CHAPTER \S+"," ",newstring)
    #collapse runs of whitespace into a single space
    newstring=re.sub(r"\s\s+"," ",newstring)
    #strip leading and trailing whitespace
    return newstring.strip()

clean_emma=clean_text(emma)
print(len(emma))
print(len(clean_emma))
clean_emma[:500]
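
To check your understanding, the substitutions in `clean_text()` can be traced one at a time. The following is a minimal sketch on an invented `sample` string that is only modelled on the opening of the novel; the variable names `step1` to `step5` are illustrative and not part of the notebook.

```python
import re

# Toy input modelled on the start of the raw text (an assumption, not the real file)
sample = "[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome,\nclever, and rich"

step1 = re.sub(r"\n", " ", sample)            # newlines become spaces
step2 = re.sub(r"\[[^\]]*\]", " ", step1)     # drop the bracketed title line
step3 = re.sub(r"VOLUME \S+", " ", step2)     # drop "VOLUME I"
step4 = re.sub(r"CHAPTER \S+", " ", step3)    # drop "CHAPTER I"
step5 = re.sub(r"\s\s+", " ", step4).strip()  # collapse whitespace, trim the ends

print(step5)
```

Printing each intermediate step in turn shows which pattern is responsible for which change.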

## spaCy

If working at home, you may need to install spaCy and download a set of English models. At the command line, run:

```
pip install spacy
python -m spacy download en_core_web_sm
```

In the lab, or once you have done this at home, you should then be able to set up a spaCy processing pipeline as follows. If working on Colab, then this should work automatically.

In [None]:
import spacy
#nlp=spacy.load('en')
nlp=spacy.load('en_core_web_sm')
type(nlp)

Now we can run any text string through the language processing pipeline stored in `nlp`.
This next cell might take a few minutes to run, since it carries out all of the spaCy NLP functionality on the input text. It will return a spaCy `Doc` object, which contains the text plus various annotations. See the spaCy documentation: https://spacy.io/api/doc

In [None]:
nlp_emma=nlp(clean_emma)

In [None]:
type(nlp_emma)

For example, we can now iterate over sentences in the document.

In [None]:
for s in nlp_emma.sents:
    print(s)
    break

We can iterate over the tokens in a sentence and find out the labels added by spaCy to each token.

In [None]:
emma_sents=list(nlp_emma.sents)

In [None]:
print(emma_sents[0])

In [None]:
def display_sent(asent):
    headings=["token","lower","lemma","pos","NER"]
    info=[]
    for t in asent:
        info.append([t.text,t.lower_,t.lemma_,t.pos_,t.ent_type_])
    return pd.DataFrame(info,columns=headings)
        
display_sent(emma_sents[3])

### Exercise 1.1
Run the `display_sent()` function on each of the first ten sentences of Emma (as stored in `emma_sents`).
* What errors do you see in the named entity recognition?
* Can you see any patterns in the words, lemmas or part-of-speech tags which might be used to improve the named entity recognition on these sentences?


### Exercise 1.2
Write a function `make_tag_lists()` which takes a list of sentences as input and which returns three lists:
1. tokens
2. POS tags
3. Named Entity tags

These lists should be the same length (189191, if applied to all of the sentences in `nlp_emma`) and maintain the order of the text, i.e., position i in each list should refer to the same token.

### Exercise 1.3
Write a function `extract_entities` which takes a list of tokens, a list of tags and a tag type, and returns a dictionary of all of the **chunks** which have the given tag type, together with their frequency in the text.

You can assume that two consecutive tokens with the same tag are part of the same chunk.
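
The chunking assumption can be illustrated on a toy example. This is a sketch of one possible approach, not the required solution: it uses `itertools.groupby` to merge runs of consecutive tokens that share a tag, and it assumes untagged tokens carry the empty string (as spaCy's `ent_type_` does). The `toy_toks` and `toy_tags` lists are invented for illustration.

```python
from collections import Counter
from itertools import groupby

# Invented toy data: one tag per token, "" meaning "not an entity"
toy_toks = ["Anne", "Cox", "visited", "Highbury", "with", "Anne", "Cox"]
toy_tags = ["PERSON", "PERSON", "", "GPE", "", "PERSON", "PERSON"]

toy_chunks = []
# groupby yields one group per run of consecutive (token, tag) pairs with the same tag
for tag, group in groupby(zip(toy_toks, toy_tags), key=lambda pair: pair[1]):
    words = [tok for tok, _ in group]
    if tag:  # skip runs of untagged tokens
        toy_chunks.append((" ".join(words), tag))

# Frequency of each chunk with a given tag type
person_counts = Counter(text for text, tag in toy_chunks if tag == "PERSON")
print(toy_chunks)
print(dict(person_counts))
```

Note that under this assumption two different people mentioned back to back would be merged into a single chunk.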

Test your code and you should get the following output (for the given input):

```python
extract_entities(toks,ner,"PERSON")
```

```
    {':-- Robert Martin': 1,
     ';"--': 1,
     'A. W. "': 1,
     'Absurd': 1,
     'Adair': 1,
     'Anne': 1,
     'Anne Cox': 2,
     ...
```

This tells us that "Anne Cox" is tagged twice as a named entity of type "PERSON" in the text.  How many occurrences of "Miss Woodhouse" tagged as a "PERSON" are there?

### Exercise 1.4
Use your code to find:
* the 20 most commonly referred to people in Emma
* the 20 most commonly referred to places in Emma
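
One way to rank a frequency dictionary is to sort its items by value. The sketch below uses `operator.itemgetter`; the `toy_counts` dictionary and its numbers are made up for illustration and are not results from the novel.

```python
import operator

# Made-up frequencies, standing in for the dictionary returned by extract_entities
toy_counts = {"Adair": 1, "Emma": 5, "Anne Cox": 2, "Harriet": 3}

# Sort (chunk, frequency) pairs by frequency, largest first, and keep the top 2
top_two = sorted(toy_counts.items(), key=operator.itemgetter(1), reverse=True)[:2]
print(top_two)
```

Changing the slice to `[:20]` gives the top 20; `collections.Counter(toy_counts).most_common(20)` is an equivalent alternative.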

### Exercise 1.5
Look at the lists of people and places generated.  Assuming no knowledge of the characters and plot of Emma, what errors can you see?

## Extensions

Code one or more of the following extensions.  In all cases, compare the lists of most frequently occurring named entities generated with the original ones.

### Expanding NER Chunks
* if the word immediately before or after a named entity chunk is POS-tagged as a PROPN, assume that this word is also part of the named entity chunk

For example, where the token "Miss" has pos-tag "PROPN" and is immediately followed by a token labelled with "PERSON", then it should also be labelled with "PERSON". 
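
A minimal sketch of this rule on parallel tag lists follows. The function name `expand_chunks` and the toy sentence are hypothetical, and the sketch assumes untagged tokens carry the empty string. It reads only the original tags while writing to a copy, so each chunk grows by at most one word on each side.

```python
def expand_chunks(pos, ner):
    """Return a copy of ner in which an untagged PROPN token adjacent to an
    entity token takes that neighbour's entity tag."""
    expanded = list(ner)
    for i, tag in enumerate(ner):
        if tag or pos[i] != "PROPN":
            continue  # already tagged, or not a proper noun
        if i + 1 < len(ner) and ner[i + 1]:
            expanded[i] = ner[i + 1]  # entity starts immediately after this word
        elif i > 0 and ner[i - 1]:
            expanded[i] = ner[i - 1]  # entity ends immediately before this word
    return expanded

# Invented toy sentence with hand-written tags
toy_toks = ["Miss", "Woodhouse", "smiled", "at", "Mr.", "Knightley"]
toy_pos = ["PROPN", "PROPN", "VERB", "ADP", "PROPN", "PROPN"]
toy_ner = ["", "PERSON", "", "", "", "PERSON"]

new_ner = expand_chunks(toy_pos, toy_ner)
print(list(zip(toy_toks, new_ner)))
```

When a PROPN sits between two entities of different types, this sketch arbitrarily prefers the following one; you may want a different policy.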

### Relabelling NER Chunks
* if a named entity occurs more frequently elsewhere in the text as a different type, assume that it has been mis-labelled here

For example, all 9 occurrences of "Jane Fairfax" labelled as "GPE" could be relabelled as "PERSON".
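
The relabelling idea can be sketched as a majority vote over the labels each chunk receives. The `toy_chunks` list below is invented (its counts are not taken from the novel), and the variable names are illustrative.

```python
from collections import Counter, defaultdict

# Invented (chunk text, label) pairs in text order
toy_chunks = (
    [("Jane Fairfax", "PERSON")] * 3
    + [("Jane Fairfax", "GPE")]
    + [("Highbury", "GPE")] * 2
)

# Count how often each chunk text receives each label
label_counts = defaultdict(Counter)
for text, label in toy_chunks:
    label_counts[text][label] += 1

# Most frequent label per chunk text (ties go to the label seen first)
majority = {text: counts.most_common(1)[0][0] for text, counts in label_counts.items()}

relabelled = [(text, majority[text]) for text, _ in toy_chunks]
print(majority)
```

A simple majority vote will also overwrite genuinely ambiguous names, so it is worth inspecting which chunks change label.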

### Linking NEs
* find candidates for named entity linking.  

For example, "Churchill" and "Frank Churchill" and "Frank" might all refer to the same person.
However, you should proceed with care.  Anyone who knows the story well would tell you that "Knightley" and "John Knightley" do not refer to the same character (they are brothers).  As a further extension, give your linking functionality access to a list of known characters e.g., from https://www.janeausten.org/emma/cast-of-characters.asp
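
One simple heuristic for generating candidates, sketched here as an assumption rather than a recommended method, is to pair names where the words of one are a proper subset of the words of the other. On the names mentioned above it proposes the "Churchill" links but also the misleading "Knightley" link, which is why the candidates need checking.

```python
from itertools import combinations

names = ["Churchill", "Frank Churchill", "Frank", "Knightley", "John Knightley"]

candidates = []
for a, b in combinations(names, 2):
    words_a, words_b = set(a.split()), set(b.split())
    # "<" on sets tests for a proper subset
    if words_a < words_b or words_b < words_a:
        candidates.append((a, b))

print(candidates)
```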

### Co-occurring NEs
* find NEs that tend to co-occur.

Can you find pairs of named entities which often occur together (or even better, occur more often together than one would expect if named entities occur independently)?  You could consider pairs of people or alternatively co-occurrences of people and places.
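
"More often than expected under independence" can be made concrete with pointwise mutual information (PMI), which is one possible measure among several. The sketch below assumes co-occurrence means appearing in the same sentence, and uses an invented list `toy_sents` of entity sets, one per sentence.

```python
import math
from collections import Counter
from itertools import combinations

# Invented data: the set of entities found in each of four sentences
toy_sents = [{"Emma", "Harriet"}, {"Emma", "Harriet"}, {"Emma"}, {"Knightley"}]
n = len(toy_sents)

# Number of sentences containing each entity, and each unordered pair
single_counts = Counter(e for s in toy_sents for e in s)
pair_counts = Counter(frozenset(p) for s in toy_sents for p in combinations(sorted(s), 2))

def pmi(a, b):
    """log2 of P(a,b) / (P(a) * P(b)), with probabilities estimated per sentence."""
    joint = pair_counts[frozenset((a, b))] / n
    if joint == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log2(joint / ((single_counts[a] / n) * (single_counts[b] / n)))

print(pmi("Emma", "Harriet"))
```

A positive score means the pair occurs together more often than independence would predict. PMI is known to favour rare entities, so a minimum frequency threshold is usually sensible.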

### NEs over Time
* record the position in the text of each named entity occurrence
* make a plot showing how the number of occurrences of a given named entity varies with position in the text

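If you store each text position in `list_of_indices`, you could use
`pd.Series(np.histogram(list_of_indices, bins=num_bins)[0])` to help you with this (after `import numpy as np`). `np.histogram` returns a pair `(counts, bin_edges)`, so the index `[0]` selects the counts.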
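
To see what the binning does, here is a small sketch without NumPy on an invented token list. It splits the whole text into `num_bins` equal-width segments and counts the occurrences in each. This differs slightly from `np.histogram`, which by default spaces its bins between the smallest and largest recorded position.

```python
# Invented toy text as a list of tokens
toy_toks = ["Emma", "saw", "Emma", ".", "Harriet", "left", ".",
            "Then", "Emma", "slept", ".", "End"]
target = "Emma"

# Record the position of every occurrence of the target
list_of_indices = [i for i, tok in enumerate(toy_toks) if tok == target]

num_bins = 3
bin_width = len(toy_toks) / num_bins
counts = [0] * num_bins
for position in list_of_indices:
    # min() keeps a position at the very end of the text inside the last bin
    counts[min(int(position // bin_width), num_bins - 1)] += 1

print(list_of_indices, counts)
```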
