# Assignment 2: Named Entity Recognition
Cedric Omtzigt and Carlos de Bourbon

Introduction: Named entity recognition (NER) is an important task in many information extraction
systems. NER seeks to locate and classify elements in text into pre-defined categories such as the names of
persons, organizations, locations, etc. In this task you should learn how to apply existing NER systems to
your dataset and how to evaluate them.
Task: Find out how many different persons are in the Hamlet corpus. How many do you find with the 3-,
4- and 7-class taggers? Upload an IPython notebook (.ipynb) with your solutions and also a PDF (File →
Download as → PDF via PDFLatex, in the IPython notebook).

In [1]:
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

In [2]:
jar_location = './stanford-ner-2020/stanford-ner.jar'
model_location_3classes = './stanford-ner-2020/classifiers/english.all.3class.distsim.crf.ser.gz'
model_location_4classes = './stanford-ner-2020/classifiers/english.conll.4class.distsim.crf.ser.gz'
model_location_7classes = './stanford-ner-2020/classifiers/english.muc.7class.distsim.crf.ser.gz'
st3 = StanfordNERTagger(model_location_3classes,jar_location,encoding='utf-8')
st4 = StanfordNERTagger(model_location_4classes,jar_location,encoding='utf-8')
st7 = StanfordNERTagger(model_location_7classes,jar_location,encoding='utf-8')
#group classifiers in a list 
classifiers = [st3, st4, st7]

In [6]:
""" Task: Find out how many different persons are in the Hamlet corpus. How many if you use the 3,
4 and 7-classes tagger?
"""
# Input text downloaded directly from Project Gutenberg (see Sources); it is tokenized below with word_tokenize.
with open("Hamlet.txt", encoding="utf-8") as file:
    hamlet_text = file.read()

hamlet_tokens = word_tokenize(hamlet_text)

nr_of_classes = [3, 4, 7] # for printing/style purposes
for i, classifier in enumerate(classifiers):
    entities = classifier.tag(hamlet_tokens)  # (token, tag) pairs for every token
    persons = [e for e, c in entities if c == "PERSON"]  # keep only tokens tagged PERSON
    print(f"""{nr_of_classes[i]}-class classifier: 
            {len(persons)} entries of class PERSON found, with {len(set(persons))} unique persons.""")

3-class classifier: 
            852 entries of class PERSON found, with 314 unique persons.
4-class classifier: 
            1060 entries of class PERSON found, with 490 unique persons.
7-class classifier: 
            350 entries of class PERSON found, with 204 unique persons.
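
The counts above are per token, so a two-word name tagged as PERSON contributes two entries. As a minimal sketch, assuming the `(token, tag)` pairs that `classifier.tag` returns, consecutive PERSON tokens could be merged into single mentions before counting. The helper `group_person_mentions` and the hand-made `sample` are hypothetical and are not part of the notebook or of NLTK.

```python
from itertools import groupby


def group_person_mentions(tagged_tokens, label="PERSON"):
    """Merge runs of consecutive tokens tagged `label` into single mentions."""
    mentions = []
    # groupby yields one run per stretch of consecutive tokens sharing a tag
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == label:
            mentions.append(" ".join(token for token, _ in run))
    return mentions


# Hand-made sample in the (token, tag) format; NOT real tagger output.
sample = [
    ("Lord", "PERSON"), ("Hamlet", "PERSON"), ("greets", "O"),
    ("Horatio", "PERSON"), (",", "O"),
    ("Lord", "PERSON"), ("Hamlet", "PERSON"),
]

mentions = group_person_mentions(sample)
unique_mentions = set(mentions)
print(f"{len(mentions)} mentions, {len(unique_mentions)} unique")
```

One limitation of this approach: two different persons that appear directly next to each other with no untagged token in between would be merged into one mention.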


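Looking at the results, the number of PERSON tokens differs considerably between the taggers: 314 unique tokens with the 3-class model, 490 with the 4-class model and 204 with the 7-class model. This may reflect the effect of granularity, where more classification options could improve accuracy, possibly with diminishing returns, although the counts do not change monotonically with the number of classes. Note also that the counts are per token, so a multi-word name is counted once per word. Without ground truth it is impossible to draw any firm conclusion about which tagger is most accurate.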
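
If a ground-truth list of persons were available, the taggers could be compared with precision, recall and F1. The following is only a sketch under that assumption: `precision_recall_f1` is a hypothetical helper, and both `gold_persons` and `predicted_persons` are made-up illustrative sets, not annotations or tagger output from this assignment.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 of predicted names against gold names."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative data only: a hypothetical gold list and a hypothetical tagger result.
gold_persons = {"Hamlet", "Horatio", "Ophelia", "Claudius"}
predicted_persons = {"Hamlet", "Horatio", "Denmark"}

precision, recall, f1 = precision_recall_f1(predicted_persons, gold_persons)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Running this once per tagger, with the unique PERSON names it produced as `predicted`, would give a direct basis for comparing the 3-, 4- and 7-class models.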


Sources: 
- Example code: Christin Seifert, licensed under the Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/ Named Entity Recognition 
- Project Gutenberg Hamlet text https://www.gutenberg.org/ebooks/2265