### Purpose
This notebook is a proof of concept for named entity recognition on unstructured text. It aims to find the named entity recognition model that gives satisfactory results for identifying persons' names.

### Method

This notebook uses the Stanford Named Entity Recognition framework (Finkel et al., 2005) to recognise persons' names. Three Stanford classifiers were compared to assess how accurately each one recognises persons' names: english.all.3class.distsim.crf.ser.gz (3class), english.conll.4class.distsim.crf.ser.gz (4class), and english.muc.7class.distsim.crf.ser.gz (7class).

The entity types that each model can recognise are listed below (Stanford Named Entity Recognizer, 2017).

|Classifier|Recognisable Entities|
|--|---------------------|
|3class|Location, Person, Organization|
|4class|Location, Person, Organization, Misc|
|7class|Location, Person, Organization, Money, Percent, Date, Time|


In [1]:
from nltk import sent_tokenize
from nltk import word_tokenize
from nltk.tag import StanfordNERTagger
from itertools import groupby

The simple test text contains the name of the football player Gary Lineker written with four different capitalisations, to assess how well the Stanford classifiers identify a common name.

In [2]:
text = """Gary Winston Lineker was an excellent football player.
GARY WINSTON LINEKER was a striker.
gary winston lineker was born in England.
gARY WiNsTon lInEker is married to Danielle Bux.
Gary W. Lineker, Kanny Sansom and Peter Shilton played together.
The defenders:
    Gary Stevens
    Kenny Sansom
    Terry Butcher
The midfields were:
    - Bryan Robson;
    - Ray Wilkins;
    - Chris Waddle."""

The text is tokenised into sentences, and each sentence is then tokenised into words.

In [3]:
sentences = sent_tokenize(text)
tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]

Performing named entity recognition with the **english.all.3class.distsim.crf.ser.gz** classifier.

In [4]:
sn_3class = StanfordNERTagger('/Library/stanford-ner-2017-06-09/classifiers/english.all.3class.distsim.crf.ser.gz',
                       path_to_jar='/Library/stanford-ner-2017-06-09/stanford-ner.jar')

In [5]:
ne_annot_sent_3c = [sn_3class.tag(sent) for sent in tokenized_sentences]

In [6]:
persons_sn_3class = []
for annot_sent in ne_annot_sent_3c:
    for annot_token in annot_sent:
        if annot_token[1] == 'PERSON':
            persons_sn_3class.append(annot_token[0])
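
The same filtering loop is repeated for each classifier in the cells that follow. One way to avoid the duplication would be a small helper. In this sketch, `tokens_with_tag` is a hypothetical name (it is not part of NLTK), and the sample annotation is written by hand in the `(token, tag)` shape that the tagger returns; it is not real tagger output.

```python
def tokens_with_tag(annotated_sentences, tag="PERSON"):
    """Return every token whose NER tag equals `tag`, in document order."""
    return [
        token
        for sentence in annotated_sentences
        for token, token_tag in sentence
        if token_tag == tag
    ]


# Hand-written sample in the (token, tag) format, not real tagger output.
sample = [
    [("Gary", "PERSON"), ("Lineker", "PERSON"), ("was", "O"), ("born", "O"),
     ("in", "O"), ("England", "LOCATION"), (".", "O")],
]

print(tokens_with_tag(sample))              # ['Gary', 'Lineker']
print(tokens_with_tag(sample, "LOCATION"))  # ['England']
```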

Performing named entity recognition with the **english.conll.4class.distsim.crf.ser.gz** classifier.

In [7]:
sn_4class = StanfordNERTagger('/Library/stanford-ner-2017-06-09/classifiers/english.conll.4class.distsim.crf.ser.gz',
                       path_to_jar='/Library/stanford-ner-2017-06-09/stanford-ner.jar')

In [8]:
ne_annot_sent_4c = [sn_4class.tag(sent) for sent in tokenized_sentences]

In [9]:
persons_sn_4class = []
for annot_sent in ne_annot_sent_4c:
#     print (annot_sent)
    for annot_token in annot_sent:
        if annot_token[1] == 'PERSON':
            persons_sn_4class.append(annot_token[0])
# print("Persons' names found:", persons_sn_4class)

Performing named entity recognition with the **english.muc.7class.distsim.crf.ser.gz** classifier.

In [10]:
sn_7class = StanfordNERTagger('/Library/stanford-ner-2017-06-09/classifiers/english.muc.7class.distsim.crf.ser.gz',
                       path_to_jar='/Library/stanford-ner-2017-06-09/stanford-ner.jar')

In [11]:
ne_annot_sent_7c = [sn_7class.tag(sent) for sent in tokenized_sentences]

In [12]:
persons_sn_7class = []
for annot_sent in ne_annot_sent_7c:
    for annot_token in annot_sent:
        if annot_token[1] == 'PERSON':
            persons_sn_7class.append(annot_token[0])
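
The three taggers above differ only in the model file they load. As a possible tidy-up, the paths could be kept in one place. The sketch below only builds the path strings, using the install location from the cells above; `NER_HOME`, `MODEL_FILES` and `tagger_paths` are hypothetical names introduced for illustration.

```python
from pathlib import PurePosixPath

# Install location taken from the cells above; adjust it to your machine.
NER_HOME = PurePosixPath("/Library/stanford-ner-2017-06-09")

# Short label -> model file name, as listed in the Method section.
MODEL_FILES = {
    "3class": "english.all.3class.distsim.crf.ser.gz",
    "4class": "english.conll.4class.distsim.crf.ser.gz",
    "7class": "english.muc.7class.distsim.crf.ser.gz",
}


def tagger_paths(label):
    """Return (model_path, jar_path) as strings for one classifier label."""
    model = NER_HOME / "classifiers" / MODEL_FILES[label]
    jar = NER_HOME / "stanford-ner.jar"
    return str(model), str(jar)


for label in MODEL_FILES:
    print(label, tagger_paths(label))
```

Each pair could then be passed on as `StanfordNERTagger(model, path_to_jar=jar)`, matching the calls in the cells above.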

### Identifying Individuals out of Names

The Stanford Classifier identifies and tags persons' names, but does not identify individuals. The best approach to overcome this limitation would be to train an IOB (Inside, Outside, Beginning) named entity chunker for the domain of the corpus. This approach requires a large amount of text on the target domain.
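
To illustrate why IOB labels would help, the sketch below decodes a hand-labelled token sequence. No chunker is trained here: the labels are written by hand for illustration, and `decode_iob` is a hypothetical helper. Because every name opens with a `B-PERSON` label, adjacent names stay separate even when nothing stands between them.

```python
def decode_iob(tagged_tokens, entity="PERSON"):
    """Group tokens into entities using B-/I-/O labels."""
    entities, current = [], []
    for token, label in tagged_tokens:
        if label == f"B-{entity}":
            # A B- label always opens a new entity, closing any open one.
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif label == f"I-{entity}" and current:
            # An I- label extends the entity that is currently open.
            current.append(token)
        else:
            # O label, another entity type, or a stray I- label.
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


# Hand-labelled example, not the output of a trained chunker.
defenders = [
    ("The", "O"), ("defenders", "O"), (":", "O"),
    ("Gary", "B-PERSON"), ("Stevens", "I-PERSON"),
    ("Kenny", "B-PERSON"), ("Sansom", "I-PERSON"),
    ("Terry", "B-PERSON"), ("Butcher", "I-PERSON"),
]

print(decode_iob(defenders))
# ['Gary Stevens', 'Kenny Sansom', 'Terry Butcher']
```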

This proof of concept uses a simpler solution: each uninterrupted run of PERSON entities is treated as one individual. The function *get_individuals* below implements this. The solution has a drawback, however: PERSON entities that belong to different individuals but appear in the text with no other token between them are interpreted as a single individual. The outputs below show the issue: the three defenders, who appear in the test text as a list without bullets, are merged into one individual. The problem does not occur for the list of midfielders, which uses bullets.

In [13]:
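def get_individuals(ne_annot_sent):
    """Join each run of consecutive PERSON tokens in a sentence into one name."""
    individuals = []
    for annot_sent in ne_annot_sent:
        # groupby yields one group per run of adjacent tokens sharing a tag
        for tag, chunk in groupby(annot_sent, key=lambda pair: pair[1]):
            if tag == "PERSON":
                individuals.append(" ".join(word for word, _ in chunk))
    return individuals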
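
The grouping behaviour and its drawback can be reproduced without running the tagger. The annotations in this sketch are written by hand to resemble `(token, tag)` output for the two kinds of list in the test text; they are an assumption, not a recording of real tagger output. The function is repeated so that the block runs on its own.

```python
from itertools import groupby


def get_individuals(ne_annot_sent):
    individuals = []
    for annot_sent in ne_annot_sent:
        for tag, chunk in groupby(annot_sent, lambda x: x[1]):
            if tag == "PERSON":
                individuals.append(" ".join(w for w, t in chunk))
    return individuals


# Hand-written annotations (assumed, not real tagger output).
no_bullets = [[("Gary", "PERSON"), ("Stevens", "PERSON"),
               ("Kenny", "PERSON"), ("Sansom", "PERSON")]]
with_bullets = [[("-", "O"), ("Bryan", "PERSON"), ("Robson", "PERSON"), (";", "O"),
                 ("-", "O"), ("Ray", "PERSON"), ("Wilkins", "PERSON"), (";", "O")]]

print(get_individuals(no_bullets))    # ['Gary Stevens Kenny Sansom']
print(get_individuals(with_bullets))  # ['Bryan Robson', 'Ray Wilkins']
```

Any non-PERSON token between two names, such as a bullet or a semicolon, is enough to end a run and keep the names apart.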

### Comparing the Classifiers' Outcomes

In [14]:
print(persons_sn_3class, '\n\n', get_individuals(ne_annot_sent_3c))

['Gary', 'Winston', 'Lineker', 'GARY', 'WINSTON', 'LINEKER', 'gARY', 'WiNsTon', 'lInEker', 'Danielle', 'Bux', 'Gary', 'W.', 'Lineker', 'Kanny', 'Sansom', 'Peter', 'Shilton', 'Gary', 'Stevens', 'Kenny', 'Sansom', 'Terry', 'Bryan', 'Robson', 'Ray', 'Wilkins', 'Chris', 'Waddle'] 

 ['Gary Winston Lineker', 'GARY WINSTON LINEKER', 'gARY WiNsTon lInEker', 'Danielle Bux', 'Gary W. Lineker', 'Kanny Sansom', 'Peter Shilton', 'Gary Stevens Kenny Sansom Terry', 'Bryan Robson', 'Ray Wilkins', 'Chris Waddle']


In [15]:
print(persons_sn_4class, '\n\n', get_individuals(ne_annot_sent_4c))

['Gary', 'Winston', 'Lineker', 'GARY', 'WINSTON', 'LINEKER', 'Danielle', 'Bux', 'Gary', 'W.', 'Lineker', 'Kanny', 'Sansom', 'Peter', 'Shilton', 'Gary', 'Stevens', 'Kenny', 'Sansom', 'Terry', 'Butcher', 'Bryan', 'Robson', 'Ray', 'Wilkins', 'Chris', 'Waddle'] 

 ['Gary Winston Lineker', 'GARY WINSTON LINEKER', 'Danielle Bux', 'Gary W. Lineker', 'Kanny Sansom', 'Peter Shilton', 'Gary Stevens Kenny Sansom Terry Butcher', 'Bryan Robson', 'Ray Wilkins', 'Chris Waddle']


In [16]:
print(persons_sn_7class, '\n\n', get_individuals(ne_annot_sent_7c))

['Gary', 'Winston', 'Lineker', 'WINSTON', 'LINEKER', 'WiNsTon', 'lInEker', 'Gary', 'W.', 'Lineker', 'Kanny', 'Sansom', 'Peter', 'Shilton', 'Gary', 'Stevens', 'Kenny', 'Sansom', 'Terry', 'Butcher', 'Bryan', 'Robson', 'Ray', 'Wilkins', 'Chris', 'Waddle'] 

 ['Gary Winston Lineker', 'WINSTON LINEKER', 'WiNsTon lInEker', 'Gary W. Lineker', 'Kanny Sansom', 'Peter Shilton', 'Gary Stevens Kenny Sansom Terry Butcher', 'Bryan Robson', 'Ray Wilkins', 'Chris Waddle']


|Entities on text|3class|4class|7class|
|--------------------|---------------------|---------------------|---------------------|
|Gary Winston Lineker|Gary Winston Lineker |Gary Winston Lineker |Gary Winston Lineker|
|GARY WINSTON LINEKER|GARY WINSTON LINEKER |GARY WINSTON LINEKER |WINSTON LINEKER|
|gary winston lineker|NR |NR |NR |
|gARY WiNsTon lInEker|gARY WiNsTon lInEker |NR |WiNsTon lInEker|
|Danielle Bux|Danielle Bux |Danielle Bux |NR |
|Gary W. Lineker|Gary W. Lineker |Gary W. Lineker |Gary W. Lineker|
|Kanny Sansom|Kanny Sansom |Kanny Sansom |Kanny Sansom|
|Peter Shilton|Peter Shilton |Peter Shilton |Peter Shilton|
|Gary Stevens|*Gary Stevens Kenny Sansom Terry* |*Gary Stevens Kenny Sansom Terry Butcher* | *Gary Stevens Kenny Sansom Terry Butcher*|
|Kenny Sansom| WG|WG | WG|
|Terry Butcher| WG|WG | WG|
|Bryan Robson|Bryan Robson |Bryan Robson |Bryan Robson|
|Ray Wilkins| Ray Wilkins |Ray Wilkins |Ray Wilkins|
|Chris Waddle|Chris Waddle |Chris Waddle | Chris Waddle|

The table above presents the persons' names recognised by each classifier. The names identified are displayed under each classifier name. NR stands for Not Recognised, and WG stands for Wrongly Grouped. In this last case, the names were correctly identified but weren't correctly grouped as the name of a particular person.

From the results above, we can see that the *english.all.3class.distsim.crf.ser.gz* classifier recognises the widest variety of text formatting. This classifier was therefore used in the tests on real trust deeds.
It is worth noting that no classifier identified the name typed entirely in lowercase. This was unexpected, since lowercasing the entire text is a very common preprocessing technique.
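
The comparison can also be summarised as a count. The sketch below checks, for each classifier, how many of the 14 names in the test text appear verbatim among the individuals printed above. The lists are copied from those outputs. `exact_matches` is a hypothetical helper, and exact string matching is an assumed scoring rule that treats partial and wrongly grouped names as misses.

```python
# The 14 names as they are written in the test text.
expected = [
    "Gary Winston Lineker", "GARY WINSTON LINEKER", "gary winston lineker",
    "gARY WiNsTon lInEker", "Danielle Bux", "Gary W. Lineker", "Kanny Sansom",
    "Peter Shilton", "Gary Stevens", "Kenny Sansom", "Terry Butcher",
    "Bryan Robson", "Ray Wilkins", "Chris Waddle",
]

# Individuals copied from the outputs of get_individuals above.
found = {
    "3class": ["Gary Winston Lineker", "GARY WINSTON LINEKER",
               "gARY WiNsTon lInEker", "Danielle Bux", "Gary W. Lineker",
               "Kanny Sansom", "Peter Shilton",
               "Gary Stevens Kenny Sansom Terry", "Bryan Robson",
               "Ray Wilkins", "Chris Waddle"],
    "4class": ["Gary Winston Lineker", "GARY WINSTON LINEKER", "Danielle Bux",
               "Gary W. Lineker", "Kanny Sansom", "Peter Shilton",
               "Gary Stevens Kenny Sansom Terry Butcher", "Bryan Robson",
               "Ray Wilkins", "Chris Waddle"],
    "7class": ["Gary Winston Lineker", "WINSTON LINEKER", "WiNsTon lInEker",
               "Gary W. Lineker", "Kanny Sansom", "Peter Shilton",
               "Gary Stevens Kenny Sansom Terry Butcher", "Bryan Robson",
               "Ray Wilkins", "Chris Waddle"],
}


def exact_matches(expected_names, individuals):
    """Count the expected names that appear verbatim among the individuals."""
    recognised = set(individuals)
    return sum(name in recognised for name in expected_names)


for label, individuals in found.items():
    print(f"{label}: {exact_matches(expected, individuals)}/{len(expected)}")
# 3class: 10/14
# 4class: 9/14
# 7class: 7/14
```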

### Conclusion
The Stanford Named Entity Recognition (NER) engine performs reasonably well at identifying persons in unstructured text. A major drawback of this engine is that it does not join the tokens of a name into a single entity. It is therefore up to the programmer to derive individuals from the persons' names tagged by the engine. The naive approach used in this notebook for demonstration purposes is not good enough for a production environment.

### References
* Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370. http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf

* Stanford Named Entity Recognizer (NER), 2017 https://nlp.stanford.edu/software/CRF-NER.shtml