# Named Entity Recognition (NER)

NER is an information extraction task that locates and classifies entities in a body of text. It can be used to identify people by name, as well as locations, times, numerical values, and so on. The first step in knowledge graph construction is to identify the named entities in a text and use them as anchors from which to build relationships to other entities in the graph. In this notebook, we evaluate several entity extraction methods and justify our choice of the best-performing one.

We will use the following text example in this notebook:

In [1]:
starwars_text = 'Darth Vader, also known by his birth name Anakin Skywalker, is a fictional character in the Star Wars franchise. Darth Vader appears in the original film trilogy as a pivotal antagonist whose actions drive the plot, while his past as Anakin Skywalker and the story of his corruption are central to the narrative of the prequel trilogy. The character was created by George Lucas and has been portrayed by numerous actors. His appearances span the first six Star Wars films, as well as Rogue One, and his character is heavily referenced in Star Wars: The Force Awakens. He is also an important character in the Star Wars expanded universe of television series, video games, novels, literature and comic books. Originally a Jedi who was prophesied to bring balance to the Force, he falls to the dark side of the Force and serves the evil Galactic Empire at the right hand of his Sith master, Emperor Palpatine (also known as Darth Sidious).'
starwars_text

'Darth Vader, also known by his birth name Anakin Skywalker, is a fictional character in the Star Wars franchise. Darth Vader appears in the original film trilogy as a pivotal antagonist whose actions drive the plot, while his past as Anakin Skywalker and the story of his corruption are central to the narrative of the prequel trilogy. The character was created by George Lucas and has been portrayed by numerous actors. His appearances span the first six Star Wars films, as well as Rogue One, and his character is heavily referenced in Star Wars: The Force Awakens. He is also an important character in the Star Wars expanded universe of television series, video games, novels, literature and comic books. Originally a Jedi who was prophesied to bring balance to the Force, he falls to the dark side of the Force and serves the evil Galactic Empire at the right hand of his Sith master, Emperor Palpatine (also known as Darth Sidious).'

# spaCy

In [2]:
import spacy
import pandas as pd
import nltk
nlp = spacy.load('en_core_web_lg')

### Small Text Example

In [12]:
doc = nlp('darthvader is also known by his birth name anakinskywalker.')
# Collect one row per entity; DataFrame.append was removed in pandas 2.0
rows = [{'Text': ent.text, 'Start': ent.start_char, 'End': ent.end_char, 'Label': ent.label_}
        for ent in doc.ents]
results = pd.DataFrame(rows, columns=['Text', 'Start', 'End', 'Label'])
results

| | Text | Start | End | Label |
|---|---|---|---|---|
| 0 | darthvader | 0 | 10 | PERSON |


### Large Text Example

In [4]:
doc = nlp(starwars_text)
# Collect one row per entity; DataFrame.append was removed in pandas 2.0
rows = [{'Text': ent.text, 'Start': ent.start_char, 'End': ent.end_char, 'Label': ent.label_}
        for ent in doc.ents]
results = pd.DataFrame(rows, columns=['Text', 'Start', 'End', 'Label'])
results

| | Text | Start | End | Label |
|---|---|---|---|---|
| 0 | Darth Vader | 0 | 11 | PERSON |
| 1 | Anakin Skywalker | 42 | 58 | PERSON |
| 2 | Darth Vader | 113 | 124 | PERSON |
| 3 | Anakin Skywalker | 234 | 250 | PERSON |
| 4 | George Lucas | 365 | 377 | PERSON |
| 5 | the first | 442 | 451 | DATE |
| 6 | six | 452 | 455 | CARDINAL |
| 7 | Star Wars | 456 | 465 | EVENT |
| 8 | Rogue One | 484 | 493 | PRODUCT |
| 9 | Star Wars | 538 | 547 | WORK_OF_ART |


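On the larger text, spaCy recovers multi-word names such as "Darth Vader" and "Anakin Skywalker" as complete spans and identifies several entity types, although some labels are questionable (for example, "Star Wars" is tagged once as EVENT and once as WORK_OF_ART, and "the first" is tagged as DATE). We can compare this performance with Stanford NER.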
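Since the entities are meant to serve as anchors in a knowledge graph, repeated mentions need to be collapsed into distinct entities at some point. The following is a minimal sketch of one way to do that, assuming the rows of the results table above are available as plain tuples; `spacy_rows` and `distinct_entities` are illustrative names, not part of the notebook.

```python
from collections import defaultdict

# Rows copied from the spaCy results table above: (text, start, end, label).
spacy_rows = [
    ("Darth Vader", 0, 11, "PERSON"),
    ("Anakin Skywalker", 42, 58, "PERSON"),
    ("Darth Vader", 113, 124, "PERSON"),
    ("Anakin Skywalker", 234, 250, "PERSON"),
    ("George Lucas", 365, 377, "PERSON"),
    ("the first", 442, 451, "DATE"),
    ("six", 452, 455, "CARDINAL"),
    ("Star Wars", 456, 465, "EVENT"),
    ("Rogue One", 484, 493, "PRODUCT"),
    ("Star Wars", 538, 547, "WORK_OF_ART"),
]


def distinct_entities(rows):
    """Map each (text, label) pair to the character spans where it occurs."""
    mentions = defaultdict(list)
    for text, start, end, label in rows:
        mentions[(text, label)].append((start, end))
    return dict(mentions)


anchors = distinct_entities(spacy_rows)
for (text, label), spans in anchors.items():
    print(f"{text!r:20} {label:12} {len(spans)} mention(s)")
```

Keying on the (text, label) pair rather than on the text alone keeps the two conflicting labels for "Star Wars" visible, so they can be reconciled explicitly before graph construction.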

# Stanford NER with NLTK Tokenizers

### Small Text Example

In [9]:
sentences = nltk.sent_tokenize('Darth Vader is also known by his birth name Anakin Skywalker.')
ner_tagger = nltk.tag.StanfordNERTagger("../stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz", "../stanford-ner-2018-10-16/stanford-ner.jar")

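ner_dict = {}
results = []
for sent in sentences:
    words = nltk.word_tokenize(sent)
    # tag() returns a list of (token, tag) pairs for the sentence
    tagged = ner_tagger.tag(words)
    results += tagged

# Keyed by token text, so a repeated token keeps only its last tag
for token, tag in results:
    ner_dict[token] = tag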
ner_dict

{'Darth': 'PERSON',
 'Vader': 'PERSON',
 'is': 'O',
 'also': 'O',
 'known': 'O',
 'by': 'O',
 'his': 'O',
 'birth': 'O',
 'name': 'O',
 'Anakin': 'PERSON',
 'Skywalker': 'PERSON',
 '.': 'O'}
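Unlike spaCy, which returns whole entity spans, this output has one tag per token, so "Darth" and "Vader" still have to be joined into a single entity. A minimal sketch of that step, assuming the (token, tag) pairs are kept in order (as in the `results` list rather than `ner_dict`); `group_entities` is a hypothetical helper, not an NLTK function.

```python
from itertools import groupby


def group_entities(tagged, outside="O"):
    """Merge runs of adjacent tokens that share a non-'O' tag into one span."""
    entities = []
    for label, run in groupby(tagged, key=lambda pair: pair[1]):
        if label != outside:
            entities.append((" ".join(token for token, _ in run), label))
    return entities


# Token/tag pairs copied from the small-example output above.
tagged = [
    ("Darth", "PERSON"), ("Vader", "PERSON"), ("is", "O"), ("also", "O"),
    ("known", "O"), ("by", "O"), ("his", "O"), ("birth", "O"), ("name", "O"),
    ("Anakin", "PERSON"), ("Skywalker", "PERSON"), (".", "O"),
]

entities = group_entities(tagged)
print(entities)
```

One limitation of this approach: two different entities of the same type that sit directly next to each other would be merged into one span, because the tags carry no begin/inside markers.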

### Large Text Example

In [10]:
sentences = nltk.sent_tokenize(starwars_text)
ner_tagger = nltk.tag.StanfordNERTagger("../stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz", "../stanford-ner-2018-10-16/stanford-ner.jar")

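ner_dict = {}
results = []
for sent in sentences:
    words = nltk.word_tokenize(sent)
    # tag() returns a list of (token, tag) pairs for the sentence
    tagged = ner_tagger.tag(words)
    results += tagged

# Keyed by token text, so a repeated token keeps only its last tag
for token, tag in results:
    ner_dict[token] = tag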
ner_dict

{'Darth': 'O',
 'Vader': 'PERSON',
 ',': 'O',
 'also': 'O',
 'known': 'O',
 'by': 'O',
 'his': 'O',
 'birth': 'O',
 'name': 'O',
 'Anakin': 'PERSON',
 'Skywalker': 'PERSON',
 'is': 'O',
 'a': 'O',
 'fictional': 'O',
 'character': 'O',
 'in': 'O',
 'the': 'O',
 'Star': 'O',
 'Wars': 'O',
 'franchise': 'O',
 '.': 'O',
 'appears': 'O',
 'original': 'O',
 'film': 'O',
 'trilogy': 'O',
 'as': 'O',
 'pivotal': 'O',
 'antagonist': 'O',
 'whose': 'O',
 'actions': 'O',
 'drive': 'O',
 'plot': 'O',
 'while': 'O',
 'past': 'O',
 'and': 'O',
 'story': 'O',
 'of': 'O',
 'corruption': 'O',
 'are': 'O',
 'central': 'O',
 'to': 'O',
 'narrative': 'O',
 'prequel': 'O',
 'The': 'O',
 'was': 'O',
 'created': 'O',
 'George': 'PERSON',
 'Lucas': 'PERSON',
 'has': 'O',
 'been': 'O',
 'portrayed': 'O',
 'numerous': 'O',
 'actors': 'O',
 'His': 'O',
 'appearances': 'O',
 'span': 'O',
 'first': 'O',
 'six': 'O',
 'films': 'O',
 'well': 'O',
 'Rogue': 'O',
 'One': 'O',
 'heavily': 'O',
 'referenced': 'O',
 ':':

# Stanford NER with spaCy Tokenizers

### Small Text Example

In [12]:
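nlp = spacy.lang.en.English()
# spaCy v2 API; in spaCy v3 this becomes nlp.add_pipe('sentencizer')
nlp.add_pipe(nlp.create_pipe('sentencizer'))
doc = nlp('Darth Vader is also known by his birth name Anakin Skywalker.')
# Span.text replaces Span.string, which was removed in spaCy v3
sentences = [sent.text.strip() for sent in doc.sents]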

In [13]:
ner_tagger = nltk.tag.StanfordNERTagger("../stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz", "../stanford-ner-2018-10-16/stanford-ner.jar")

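ner_dict = {}
results = []

nlp = spacy.lang.en.English()
# A Tokenizer built from the vocab alone has no punctuation rules, so it splits on whitespace only
tokenizer = spacy.tokenizer.Tokenizer(nlp.vocab)
for sent in sentences:
    words = [token.orth_ for token in tokenizer(sent)]
    tagged = ner_tagger.tag(words)
    results += tagged

# Keyed by token text, so a repeated token keeps only its last tag
for token, tag in results:
    ner_dict[token] = tag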
ner_dict

{'Darth': 'PERSON',
 'Vader': 'PERSON',
 'is': 'O',
 'also': 'O',
 'known': 'O',
 'by': 'O',
 'his': 'O',
 'birth': 'O',
 'name': 'O',
 'Anakin': 'PERSON',
 'Skywalker.': 'PERSON'}
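Note the final token, `'Skywalker.'`, which still carries the period. A `Tokenizer` constructed from a vocab alone has no prefix, suffix, or infix rules, so it splits on whitespace only. The sketch below illustrates the difference in plain Python; `str.split` mimics the whitespace-only behavior, and the regular expression is only a rough stand-in for punctuation-aware tokenization, not the algorithm used by spaCy or NLTK.

```python
import re

sentence = 'Darth Vader is also known by his birth name Anakin Skywalker.'

# Whitespace-only splitting leaves punctuation attached to the word.
whitespace_tokens = sentence.split()

# Rough stand-in for punctuation-aware tokenization: runs of word
# characters, or single non-space punctuation characters.
punct_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens[-1])
print(punct_tokens[-2:])
```

If punctuation-aware tokens are wanted from spaCy itself, the tokenizer attached to the language object (`nlp.tokenizer`) applies the English rules, which may be a closer comparison with `nltk.word_tokenize`.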

### Large Text Example

In [14]:
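nlp = spacy.lang.en.English()
# spaCy v2 API; in spaCy v3 this becomes nlp.add_pipe('sentencizer')
nlp.add_pipe(nlp.create_pipe('sentencizer'))
doc = nlp(starwars_text)
# Span.text replaces Span.string, which was removed in spaCy v3
sentences = [sent.text.strip() for sent in doc.sents]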

ner_tagger = nltk.tag.StanfordNERTagger("../stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz", "../stanford-ner-2018-10-16/stanford-ner.jar")

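ner_dict = {}
results = []

nlp = spacy.lang.en.English()
# A Tokenizer built from the vocab alone has no punctuation rules, so it splits on whitespace only
tokenizer = spacy.tokenizer.Tokenizer(nlp.vocab)
for sent in sentences:
    words = [token.orth_ for token in tokenizer(sent)]
    tagged = ner_tagger.tag(words)
    results += tagged

# Keyed by token text, so a repeated token keeps only its last tag
for token, tag in results:
    ner_dict[token] = tag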
ner_dict

{'Darth': 'O',
 'Vader,': 'O',
 'also': 'O',
 'known': 'O',
 'by': 'O',
 'his': 'O',
 'birth': 'O',
 'name': 'O',
 'Anakin': 'PERSON',
 'Skywalker,': 'PERSON',
 'is': 'O',
 'a': 'O',
 'fictional': 'O',
 'character': 'O',
 'in': 'O',
 'the': 'O',
 'Star': 'O',
 'Wars': 'O',
 'franchise.': 'O',
 'Vader': 'PERSON',
 'appears': 'O',
 'original': 'O',
 'film': 'O',
 'trilogy': 'O',
 'as': 'O',
 'pivotal': 'O',
 'antagonist': 'O',
 'whose': 'O',
 'actions': 'O',
 'drive': 'O',
 'plot,': 'O',
 'while': 'O',
 'past': 'O',
 'Skywalker': 'PERSON',
 'and': 'O',
 'story': 'O',
 'of': 'O',
 'corruption': 'O',
 'are': 'O',
 'central': 'O',
 'to': 'O',
 'narrative': 'O',
 'prequel': 'O',
 'trilogy.': 'O',
 'The': 'O',
 'was': 'O',
 'created': 'O',
 'George': 'PERSON',
 'Lucas': 'PERSON',
 'has': 'O',
 'been': 'O',
 'portrayed': 'O',
 'numerous': 'O',
 'actors.': 'O',
 'His': 'O',
 'appearances': 'O',
 'span': 'O',
 'first': 'O',
 'six': 'O',
 'films,': 'O',
 'well': 'O',
 'Rogue': 'O',
 'One,': 'O',


Based on these results, Stanford NER with spaCy tokenizers performs slightly better on larger bodies of text than Stanford NER with NLTK tokenizers. Overall, however, Stanford NER does not perform as well as spaCy NER: it recognizes PERSON entities, but sometimes only as partial fragments. It also misses other entity types, including LOCATION, WORK_OF_ART, and DATE; the 3-class model used here only has PERSON, ORGANIZATION, and LOCATION labels, so the latter two are outside its label set. For these reasons, we opted to use spaCy NER for knowledge graph construction, as it provided a **simpler** and **better-performing** interface.
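One caveat when reading the Stanford NER token dictionaries above: they are keyed by token text, so a token that occurs more than once keeps only the tag of its last occurrence. This may account for some of the apparent fragments (for example, `'Darth': 'O'` next to `'Vader': 'PERSON'`), though that cannot be confirmed from the dictionaries alone. The sketch below uses made-up tags, not actual Stanford NER output, to show the effect and one way to keep every occurrence.

```python
from collections import defaultdict

# Hypothetical tagger output: the same token receives different tags
# in different contexts.
tagged = [
    ("Darth", "PERSON"), ("Vader", "PERSON"), ("serves", "O"),
    ("Darth", "O"), ("Sidious", "O"),
]

# Plain dict, as in the cells above: later occurrences overwrite earlier ones.
last_tag = {}
for token, tag in tagged:
    last_tag[token] = tag

# Keeping a list per token preserves every occurrence in order.
all_tags = defaultdict(list)
for token, tag in tagged:
    all_tags[token].append(tag)

print(last_tag["Darth"])
print(all_tags["Darth"])
```

Comparing the taggers on the ordered list of (token, tag) pairs, rather than on the dictionary, avoids this loss of information.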