# Week 10 Lab: Named Entity Recognition

This week we are going to look at named entity recognition in the fiction genre. In doing so, we will introduce the spaCy library (https://spacy.io/), which provides fast, accurate tools for carrying out NLP tasks including part-of-speech tagging, dependency parsing and named entity recognition.



In [1]:
#preliminary imports

#from google.colab import drive
##mount google drive
#drive.mount('/content/drive/')
#import sys
#sys.path.append('/content/drive/My Drive/NLE Notebooks/resources/')

import pandas as pd
import operator

## Project Gutenberg

The [Project Gutenberg electronic text archive](http://www.gutenberg.org/) contains around 25,000 free electronic books.

As seen previously, a small selection is made available through NLTK. To list the texts in that selection, run the following cells.

In [2]:
import nltk
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /home/poppy/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

In [3]:
from nltk.corpus import gutenberg
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

We can get the raw text of any of the novels using the `gutenberg.raw(fileid)` method. This returns a string.

In [4]:
emma=gutenberg.raw('austen-emma.txt')
emma[:500]

"[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever, and rich, with a comfortable home\nand happy disposition, seemed to unite some of the best blessings\nof existence; and had lived nearly twenty-one years in the world\nwith very little to distress or vex her.\n\nShe was the youngest of the two daughters of a most affectionate,\nindulgent father; and had, in consequence of her sister's marriage,\nbeen mistress of his house from a very early period.  Her mother\nhad died t"

Now, we carry out a little bit of cleaning of the text.  Check you understand what each line in the `clean_text()` function does.
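One way to check your understanding is to apply the same substitutions one at a time to a short string. The sketch below does this with the patterns from `clean_text()` (written as raw strings); the sample text is invented for illustration and is not taken from the corpus.

```python
import re

# Invented sample that mimics the layout of the raw Gutenberg text
sample = ("[A Title by An Author 1800]\n\nVOLUME I\n\nCHAPTER II\n\n\n"
          "First line\nsecond line.  Next sentence.")

step1 = re.sub(r"\n", " ", sample)          # newlines become spaces
step2 = re.sub(r"\[[^\]]*\]", " ", step1)   # drop the bracketed title line
step3 = re.sub(r"VOLUME \S+", " ", step2)   # drop "VOLUME" and its number
step4 = re.sub(r"CHAPTER \S+", " ", step3)  # drop "CHAPTER" and its number
step5 = re.sub(r"\s\s+", " ", step4)        # collapse runs of whitespace
cleaned = step5.strip()                     # trim leading/trailing spaces

print(cleaned)
```

Printing the intermediate `step` variables shows what each substitution removes.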

In [5]:
import re
def clean_text(astring):
    #replace newlines with space
    newstring=re.sub(r"\n"," ",astring)
    #remove title and chapter headings (raw strings avoid invalid-escape warnings)
    newstring=re.sub(r"\[[^\]]*\]"," ",newstring)
    newstring=re.sub(r"VOLUME \S+"," ",newstring)
    newstring=re.sub(r"CHAPTER \S+"," ",newstring)
    #collapse runs of whitespace into a single space
    newstring=re.sub(r"\s\s+"," ",newstring)
    #return re.sub("([^\.|^ ])  +",r"\1 .  ",newstring).lstrip().rstrip()
    #trim leading and trailing whitespace
    return newstring.strip()

clean_emma=clean_text(emma)
print(len(emma))
print(len(clean_emma))
clean_emma[:500]

887071
880067


"Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her. She was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister's marriage, been mistress of his house from a very early period. Her mother had died too long ago for her to have more than an indistinct "

## SpaCy

If working at home, you may need to install spaCy and download a set of English models. To do so, run the following at the command line:

```
pip install spacy
python -m spacy download en_core_web_sm
```

In the lab, or once you have done this at home, you should be able to set up a spaCy processing pipeline as follows. If you are working on Colab, this should work automatically.

In [6]:
import spacy
#nlp=spacy.load('en')
nlp=spacy.load('en_core_web_sm')
type(nlp)

2022-01-03 14:50:54.165540: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-01-03 14:50:54.165576: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


spacy.lang.en.English

Now we can run any text string through the language processing pipeline stored in `nlp`.
This next cell might take a few minutes to run, since it carries out all of the spaCy NLP functionality on the input text. It will return a spaCy `Doc` object which contains the text plus various annotations. See the spaCy documentation: https://spacy.io/api/doc

In [7]:
nlp_emma=nlp(clean_emma)

In [8]:
type(nlp_emma)

spacy.tokens.doc.Doc

For example, we can now iterate over sentences in the document.

In [9]:
for s in nlp_emma.sents:
    print(s)
    break

Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her.


In [10]:
emma_sents=list(nlp_emma.sents)

In [11]:
print(emma_sents[0])

Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her.


We can iterate over the tokens in a sentence and inspect the labels spaCy has added to each one. Here we consider the token itself (`t.text`), the lower-cased version of the token (`t.lower_`), the lemma (`t.lemma_`), the part-of-speech tag (`t.pos_`) and the named entity type (`t.ent_type_`). If you look at the spaCy documentation you will find other annotations as well (including those for dependency parsing, which are beyond the scope of this module).

In [12]:
def display_sent(asent):
    headings=["token","lower","lemma","pos","NER"]
    info=[]
    for t in asent:
        info.append([t.text,t.lower_,t.lemma_,t.pos_,t.ent_type_])
    return(pd.DataFrame(info,columns=headings))
        
display_sent(emma_sents[0])

       token      lower      lemma    pos     NER
0       Emma       emma       Emma  PROPN  PERSON
1  Woodhouse  woodhouse  Woodhouse  PROPN  PERSON
2          ,          ,          ,  PUNCT
3   handsome   handsome   handsome    ADJ
4          ,          ,          ,  PUNCT
5     clever     clever     clever    ADJ
6          ,          ,          ,  PUNCT
7        and        and        and  CCONJ
8       rich       rich       rich    ADJ
9          ,          ,          ,  PUNCT


### Exercise 1.1
Run the `display_sent()` function on each of the first ten sentences of Emma (as stored in `emma_sents`).
* What errors do you see in the named entity recognition?
* Can you see any patterns in the words, lemmas or part-of-speech tags which might be used to improve the named entity recognition on these sentences?


In [13]:
for i in range(10):
    print("")
    print('Sentence number:', i)
    print(display_sent(emma_sents[i]))
    print("")


Sentence number: 0
          token        lower        lemma    pos       NER
0          Emma         emma         Emma  PROPN    PERSON
1     Woodhouse    woodhouse    Woodhouse  PROPN    PERSON
2             ,            ,            ,  PUNCT          
3      handsome     handsome     handsome    ADJ          
4             ,            ,            ,  PUNCT          
5        clever       clever       clever    ADJ          
6             ,            ,            ,  PUNCT          
7           and          and          and  CCONJ          
8          rich         rich         rich    ADJ          
9             ,            ,            ,  PUNCT          
10         with         with         with    ADP          
11            a            a            a    DET          
12  comfortable  comfortable  comfortable    ADJ          
13         home         home         home   NOUN          
14          and          and          and  CCONJ          
15        happy        happy        

### Exercise 1.2
Write a function `make_tag_lists()` which takes a list of sentences as input and which returns 3 lists:
1. tokens
2. POS tags
3. Named Entity tags

These lists should be the same length (189191, if applied to all of the sentences in `nlp_emma`) and maintain the order of the text, i.e., position i in each list should refer to the same token.

In [14]:
def make_tag_lists(doc):
    tokens = []
    pos_tags = []
    ner_tags = []
    # iterating over a spaCy Doc yields tokens in text order,
    # so the whole Doc is passed in directly below
    for token in doc:
        tokens.append(token.text)
        # pos_ is the string label (e.g. 'PROPN'); pos is only its integer ID
        pos_tags.append(token.pos_)
        ner_tags.append(token.ent_type_)
    return tokens, pos_tags, ner_tags

tokens_list, pos_tags_list, ner_tags_list =  make_tag_lists(nlp_emma)

print(len(tokens_list))
print(len(pos_tags_list))
print(len(ner_tags_list))
print(tokens_list[1])
print(pos_tags_list[1])
print(ner_tags_list)


189175
189175
189175
Woodhouse
PROPN
['PERSON', 'PERSON', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'CARDINAL', 'CARDINAL', 'CARDINAL', 'CARDINAL', 'CARDINAL', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'CARDINAL', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'DATE', 'DATE', '', '', 'PERSON', '', '', '', 'PERSON', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'ORG', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'PERSON', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 

### Exercise 1.3
Write a function `extract_entities` which takes a list of tokens, a list of tags and a tag-type and returns a dictionary of all of the **chunks** which have the given tag-type; together with their frequency in the text.

You can assume that two consecutive tokens with the same tag are part of the same chunk.

Test your code and you should get the following output (for the given input):

<img src=output-13.png>

This tells us that "Anne Cox" is tagged twice as a named entity of type "PERSON" in the text.  How many occurrences of "Miss Woodhouse" tagged as a "PERSON" are there?

In [15]:
### SEE LAB SOLUTIONS ### doesn't get chunks, just single words...?

In [16]:
def extract_entities(tokenlist,taglist,tag_type):
    entities = {}
    inentity = False
    for (token,tag) in zip(tokenlist,taglist):
        if tag==tag_type:
            if inentity:
                # already inside a chunk: extend it with this token
                entity+=" "+token
            else:
                # otherwise start a new chunk with this token and set the flag
                entity = token
                inentity = True
        elif inentity:
            # the chunk has just ended: count it and set the flag back to false
            entities[entity] = entities.get(entity,0)+1
            inentity = False
    # count a chunk that runs right up to the end of the token list
    if inentity:
        entities[entity] = entities.get(entity,0)+1
    return entities
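As an alternative sketch of the same chunking idea, consecutive tokens with the same tag can be grouped with `itertools.groupby`. The function name `extract_entities_grouped` and the toy lists are made up for illustration; this is not the lab solution.

```python
from collections import Counter
from itertools import groupby

def extract_entities_grouped(tokenlist, taglist, tag_type):
    """Count chunks of consecutive tokens that all carry tag_type."""
    counts = Counter()
    # groupby yields one group per run of identical consecutive tags
    for tag, run in groupby(zip(tokenlist, taglist), key=lambda pair: pair[1]):
        if tag == tag_type:
            counts[" ".join(token for token, _ in run)] += 1
    return dict(counts)

# Invented toy input, not taken from the Emma output
toy_tokens = ["Anne", "Cox", "met", "Emma", "in", "Highbury", "with", "Anne", "Cox"]
toy_tags = ["PERSON", "PERSON", "", "PERSON", "", "GPE", "", "PERSON", "PERSON"]

toy_persons = extract_entities_grouped(toy_tokens, toy_tags, "PERSON")
print(toy_persons)
```

Because each run is handled as a whole, a chunk that ends on the final token is counted without any special case.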

In [17]:
# my attempt: counts each tagged token on its own, so multi-token names are not merged into chunks
def extract_entities_2(tokens, tags_list, tag_type):
    tag_type_dict = {}
    for i, tag in enumerate(tags_list):
        if tag == tag_type:
            word = tokens[i]
            if word in tag_type_dict:
                tag_type_dict[word] = tag_type_dict[word] + 1
            else:
                tag_type_dict[word] = 1  
    return tag_type_dict

In [18]:
tokens_list, pos_tags_list, ner_tags_list =  make_tag_lists(nlp_emma)
persons = extract_entities(tokens_list, ner_tags_list, "PERSON")
sortedDict = dict(sorted(persons.items(), key=lambda x: x[0].lower()) )
print(sortedDict)

{'& c.': 1, 'a Harriet Smith': 1, 'a Robert Martin': 1, "a Robert Martin 's": 1, 'Abbey': 9, 'Abbey - Mill': 2, 'Abbey Mill': 1, 'Alderneys': 1, 'Anna Weston': 1, 'Anne Cox': 2, 'are!--I': 1, 'arise--': 1, 'asparagus--': 1, 'Astley': 4, 'Augusta Hawkins': 1, 'Aunt Emma': 1, 'baker': 1, 'Bates': 126, 'Bates?--I': 1, 'Bateses': 4, 'Bath': 13, 'Beg': 1, 'Behold': 2, 'Bella': 3, 'Bickerton': 1, 'Bird': 1, 'Box Hill': 14, 'Box Hill--': 1, 'Bragge': 7, 'Campbell': 51, 'Candles everywhere.--I': 1, 'Catherine': 1, 'Chaperon': 1, 'cheerful--': 1, 'Churchill': 68, 'Churchills': 1, 'Clara Partridge': 1, 'Clifton': 1, 'Cole': 53, 'Cole--': 1, 'Coles': 15, 'Come Emma': 1, 'company--': 1, 'congratulation.--I': 1, 'congratulations.-- Harriet': 1, 'Cox': 2, 'decision;--': 1, 'declare--': 1, 'Dinner': 1, 'discerning;--': 1, 'disorder:--': 1, 'Dixon': 41, 'Dixon.--Very': 1, 'Donwell': 3, 'Donwell Abbey': 7, 'Donwell Lane': 1, 'E.': 10, 'Easter': 2, 'easy fortune': 1, 'Elizabeth': 5, 'Elizabeth Martin': 

### Exercise 1.4
Use your code to find 
* the 20 most commonly referred to people in Emma
* the 20 most commonly referred to places in Emma

In [19]:
from nltk import FreqDist
persons_fd = FreqDist(persons)
top_20_persons = persons_fd.most_common(20)
print(top_20_persons)



[('Emma', 718), ('Weston', 410), ('Elton', 373), ('Knightley', 283), ('Jane', 183), ('Harriet', 131), ('Woodhouse', 130), ('Frank Churchill', 130), ('Bates', 126), ('Fairfax', 108), ('Jane Fairfax', 85), ('Churchill', 68), ('Isabella', 63), ('Perry', 58), ('Goddard', 58), ('Frank', 54), ('John Knightley', 53), ('Cole', 53), ('Campbell', 51), ('Martin', 50)]


In [20]:
places = extract_entities(tokens_list, ner_tags_list, "GPE")
places_fd = FreqDist(places)
top_20_places = places_fd.most_common(20)
print(top_20_places)

[('Hartfield', 156), ('Randalls', 73), ('London', 44), ('Fairfax', 18), ('Perry', 16), ('Ireland', 14), ('Richmond', 13), ('Highbury', 10), ('Weston', 8), ('Kingston', 8), ('England', 8), ('Selina', 8), ('Enscombe', 6), ('Swisserland', 4), ('Elton', 3), ('women--', 3), ('Yorkshire', 3), ('indeed!--and', 2), ('Harriet', 2), ('thus--', 2)]


### Exercise 1.5
Look at the lists of people and places generated.  Assuming no knowledge of the characters and plot of Emma, what errors can you see?

In [21]:
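# 'Perry', 'Fairfax', 'Weston', 'Elton' and 'Harriet' appear in both the PERSON and GPE lists, so one of the two labels must be wrong for at least some occurrences
# 'women--', 'indeed!--and' and 'thus--' are tagged GPE but are ordinary words with punctuation still attached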
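One way to surface such conflicts without reading the lists by eye is to intersect the keys of the two frequency dictionaries. This is a minimal sketch: `conflicting_labels` is a hypothetical helper, and the two small dictionaries hold only a handful of the counts printed above.

```python
# A subset of the counts from the top-20 PERSON and GPE lists printed above
persons_sample = {"Weston": 410, "Elton": 373, "Harriet": 131, "Fairfax": 108, "Perry": 58}
places_sample = {"Hartfield": 156, "Fairfax": 18, "Perry": 16, "Weston": 8, "Elton": 3, "Harriet": 2}

def conflicting_labels(counts_a, counts_b):
    """Return {name: (count_in_a, count_in_b)} for names present in both dictionaries."""
    shared = counts_a.keys() & counts_b.keys()
    return {name: (counts_a[name], counts_b[name]) for name in sorted(shared)}

conflicts = conflicting_labels(persons_sample, places_sample)
print(conflicts)
```

With the full `persons` and `places` dictionaries in place of the samples, the same call lists every string that received both labels.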

## Extensions

Code one or more of the following extensions.  In all cases, compare the lists of most frequently occurring named entities generated with the original ones.

### Expanding NER Chunks
* if the word immediately before or after a named entity chunk is POS-tagged as a PROPN, assume that this word is also part of the named entity chunk

For example, where the token "Miss" has pos-tag "PROPN" and is immediately followed by a token labelled with "PERSON", then it should also be labelled with "PERSON". 
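A minimal sketch of this rule, working on the parallel POS and NER tag lists, is given below. It makes two assumptions that the exercise does not specify: only tokens with no existing entity label are changed, and when both neighbours are labelled the following token's label is preferred. The name `expand_chunks` and the toy data are invented.

```python
def expand_chunks(pos_tags, ner_tags):
    """Return a copy of ner_tags where unlabelled PROPN tokens next to a chunk join it."""
    expanded = list(ner_tags)
    for i, (pos, tag) in enumerate(zip(pos_tags, ner_tags)):
        if tag or pos != "PROPN":
            continue  # already labelled, or not a proper noun
        # neighbours are read from the ORIGINAL tags, so labels spread one step only
        before = ner_tags[i - 1] if i > 0 else ""
        after = ner_tags[i + 1] if i + 1 < len(ner_tags) else ""
        expanded[i] = after or before  # assumption: prefer the following label
    return expanded

# Invented toy sentence: "Miss Woodhouse walked to Randalls"
toy_pos = ["PROPN", "PROPN", "VERB", "ADP", "PROPN"]
toy_ner = ["", "PERSON", "", "", "GPE"]

toy_expanded = expand_chunks(toy_pos, toy_ner)
print(toy_expanded)
```

The expanded tag list can be passed to `extract_entities` in place of the original NER tags to see how the frequency lists change.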

### Relabelling NER Chunks
* if a named entity occurs more frequently elsewhere in the text as a different type, assume that it has been mis-labelled here

For example, all 4 occurrences of "Emma" labelled as "LOC" could be relabelled as "PERSON".
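The relabelling rule could be sketched as a majority vote over the labels each chunk receives. The helper `majority_relabel` and the toy occurrences are hypothetical; ties are resolved in favour of the label seen first, which is an arbitrary choice.

```python
from collections import Counter, defaultdict

def majority_relabel(chunks):
    """chunks: list of (text, label) pairs, one per entity occurrence.
    Returns the same list with every text given its most frequent label."""
    label_counts = defaultdict(Counter)
    for text, label in chunks:
        label_counts[text][label] += 1
    # most_common(1) picks the top label; ties go to the label seen first
    best = {text: counts.most_common(1)[0][0] for text, counts in label_counts.items()}
    return [(text, best[text]) for text, _ in chunks]

# Invented toy occurrences
toy_chunks = [("Emma", "PERSON"), ("Emma", "LOC"), ("Emma", "PERSON"), ("Hartfield", "GPE")]

toy_relabelled = majority_relabel(toy_chunks)
print(toy_relabelled)
```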

### Linking NEs
* find candidates for named entity linking.  

For example, "Churchill" and "Frank Churchill" and "Frank" might all refer to the same person.
However, you should proceed with care.  Anyone who knows the story well would tell you that "Knightley" and "John Knightley" do not refer to the same character (they are brothers).  As a further extension, give your linking functionality access to a list of known characters e.g., from https://www.janeausten.org/emma/cast-of-characters.asp
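As a starting point, candidate links can be generated by pairing each name with any longer name that contains all of its words. This is only a sketch with a made-up helper name, `link_candidates`; as the Knightley example warns, the pairs it returns are candidates to be checked, not confirmed links.

```python
def link_candidates(names):
    """Pair each name with every longer name that contains all of its words."""
    pairs = []
    for short in names:
        short_words = set(short.split())
        for long in names:
            # strict subset: every word of `short` appears in `long`, plus at least one more
            if short_words < set(long.split()):
                pairs.append((short, long))
    return pairs

names = ["Churchill", "Frank Churchill", "Frank", "Knightley", "John Knightley"]

candidates = link_candidates(names)
print(candidates)
```

The pair ("Knightley", "John Knightley") is proposed here even though the two names refer to different characters, which is exactly why a list of known characters is a useful extra filter.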

### Co-occurring NEs
* find NEs that tend to co-occur together.

Can you find pairs of named entities which often occur together (or even better, occur more often together than one would expect if named entities occur independently)?  You could consider pairs of people or alternatively co-occurrences of people and places.
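One common way to measure "more often together than expected under independence" is pointwise mutual information (PMI), computed here over sentences. The sketch assumes you have already built one set of entity names per sentence; `cooccurrence_pmi` and the toy data are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_pmi(sentence_entities):
    """sentence_entities: one set of entity names per sentence.
    Returns {(a, b): PMI} for every pair seen together at least once."""
    n = len(sentence_entities)
    single = Counter()
    pair = Counter()
    for ents in sentence_entities:
        single.update(ents)
        # sorting gives each pair a single canonical order
        pair.update(combinations(sorted(ents), 2))
    return {
        (a, b): math.log2((count / n) / ((single[a] / n) * (single[b] / n)))
        for (a, b), count in pair.items()
    }

# Invented toy data: the entities found in four sentences
toy_sentences = [{"Emma", "Harriet"}, {"Emma", "Harriet"}, {"Emma"}, {"Weston"}]

toy_pmi = cooccurrence_pmi(toy_sentences)
print(toy_pmi)
```

A PMI above zero means the pair occurs together more often than independence would predict. PMI is unreliable for rare entities, so a minimum frequency threshold is worth considering.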

### NEs over Time
* record the position in the text of each named entity occurrence
* make a plot showing how the number of occurrences of a given named entity varies with position in the text

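If you store each text position in `list_of_indices`, you could use
`pd.Series(np.histogram(list_of_indices, bins=num_bins)[0])` to help you with this (`np.histogram` returns the bin counts followed by the bin edges, so `[0]` selects the counts; you will also need `import numpy as np`).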
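To see what the binning step does, here is a dependency-free sketch that counts occurrences in equal-width slices of the text. It differs from `np.histogram` in one respect: by default `np.histogram` spans the smallest to the largest index supplied, while this version always spans the whole text. The function name and toy positions are invented.

```python
from collections import Counter

def occurrences_per_bin(list_of_indices, text_length, num_bins):
    """Count occurrences in num_bins equal-width slices of a text of text_length tokens."""
    counts = Counter(
        # integer arithmetic maps a position to its bin; min() guards the upper edge
        min(i * num_bins // text_length, num_bins - 1)
        for i in list_of_indices
    )
    return [counts.get(b, 0) for b in range(num_bins)]

# Invented token positions of one entity in a 100-token text
toy_indices = [0, 5, 12, 48, 49, 99]

toy_counts = occurrences_per_bin(toy_indices, text_length=100, num_bins=4)
print(toy_counts)
```

The resulting list can be plotted directly, for example with `pd.Series(toy_counts).plot()`.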
