# Named Entity Recognition

In [1]:
import nltk
from normalization import parse_document  # local helper module, not shown in this notebook
import pandas as pd

In [2]:
text = """
Bayern Munich, or FC Bayern, is a German sports club based in Munich, 
Bavaria, Germany. It is best known for its professional football team, 
which plays in the Bundesliga, the top tier of the German football 
league system, and is the most successful club in German football 
history, having won a record 26 national titles and 18 national cups. 
FC Bayern was founded in 1900 by eleven football players led by Franz John. 
Although Bayern won its first national championship in 1932, the club 
was not selected for the Bundesliga at its inception in 1963. The club 
had its period of greatest success in the middle of the 1970s when, 
under the captaincy of Franz Beckenbauer, it won the European Cup three 
times in a row (1974-76). Overall, Bayern has reached ten UEFA Champions 
League finals, most recently winning their fifth title in 2013 as part 
of a continental treble. 
"""

In [3]:
# text = """Thought-capable artificial beings appeared as storytelling devices in antiquity, and have been common in fiction, as in Mary Shelleys Frankenstein or Karel Capeks R.U.R. (Rossums Universal Robots). These characters and their fates raised many of the same issues now discussed in the ethics of artificial intelligence. The study of mechanical or formal reasoning began with philosophers and mathematicians in antiquity. The study of mathematical logic led directly to Alan Turings theory of computation, which suggested that a machine, by shuffling symbols as simple as 0 and 1, could simulate any conceivable act of mathematical deduction. This insight, that digital computers can simulate any process of formal reasoning, is known as the Church-Turing thesis. Along with concurrent discoveries in neurobiology, information theory and cybernetics, this led researchers to consider the possibility of building an electronic brain. Turing proposed that if a human could not distinguish between responses from a machine and a human, the machine could be considered intelligent. The first work that is now generally recognized as AI was McCullouch and Pitts 1943 formal design for Turing-complete artificial neurons. The field of AI research was born at a workshop at Dartmouth College in 1956. Attendees Allen Newell (CMU), Herbert Simon (CMU), John McCarthy (MIT), Marvin Minsky (MIT) and Arthur Samuel (IBM) became the founders and leaders of AI research. They and their students produced programs that the press described as astonishing: computers were learning checkers strategies (c. 1954) (and by 1959 were reportedly playing better than the average human),[32] solving word problems in algebra, proving logical theorems (Logic Theorist, first run c. 1956) and speaking English. By the middle of the 1960s, research in the U.S. was heavily funded by the Department of Defense and laboratories had been established around the world. 
# AIs founders were optimistic about the future: Herbert Simon predicted, machines will be capable, within twenty years, of doing any work a man can do. Marvin Minsky agreed, writing, within a generation ... the problem of creating artificial intelligence will substantially be solved.  They failed to recognize the difficulty of some of the remaining tasks. Progress slowed and in 1974, in response to the criticism of Sir James Lighthill[36] and ongoing pressure from the US Congress to fund more productive projects, both the U.S. and British governments cut off exploratory research in AI. The next few years would later be called an AI winter, a period when obtaining funding for AI projects was difficult."""

In [4]:
sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
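
`normalization` is a local module that is not shown here, so the exact behavior of `parse_document` is unknown. A minimal stand-in, assuming it only collapses whitespace and splits the text into sentences (the real helper may do more), could look like this. The name `parse_document_sketch` is hypothetical, and the regex split is a naive substitute for a trained sentence tokenizer.

```python
import re

def parse_document_sketch(document):
    """Hypothetical stand-in for normalization.parse_document (an assumption)."""
    # Collapse newlines and runs of spaces into single spaces.
    flattened = " ".join(document.split())
    # Naive split after '.', '!' or '?' followed by whitespace; unlike a trained
    # sentence tokenizer, this also splits after abbreviations such as "c.".
    sentences = re.split(r"(?<=[.!?])\s+", flattened)
    return [s for s in sentences if s]

sample = "FC Bayern was founded in 1900.\nThe club is based in Munich. "
print(parse_document_sketch(sample))
```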

In [5]:
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
ne_chunked_sents = [nltk.ne_chunk(tagged) for tagged in tagged_sentences]

In [6]:
named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        if hasattr(tagged_tree, "label"):
            entity_name = " ".join(c[0] for c in tagged_tree.leaves())
            entity_type = tagged_tree.label()
            named_entities.append((entity_name, entity_type))
            
named_entities = list(set(named_entities))
entity_frame = pd.DataFrame(named_entities,
                           columns=["Entity Name", "Entity Type"])
print(entity_frame)

          Entity Name   Entity Type
0              Bayern        PERSON
1          Franz John        PERSON
2   Franz Beckenbauer        PERSON
3              Munich  ORGANIZATION
4            European  ORGANIZATION
5          Bundesliga  ORGANIZATION
6              German           GPE
7             Bavaria           GPE
8             Germany           GPE
9           FC Bayern  ORGANIZATION
10               UEFA  ORGANIZATION
11             Munich           GPE
12             Bayern           GPE
13            Overall           GPE
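
The frame above lists `Bayern` as both PERSON and GPE, and `Munich` as both ORGANIZATION and GPE. A small sketch for surfacing such conflicts follows. It uses a hand-copied subset of the pairs above, and the helper name `conflicting_entities` is illustrative, not part of NLTK.

```python
from collections import defaultdict

def conflicting_entities(pairs):
    """Map each entity name to its sorted labels, keeping only names with 2+ labels."""
    labels_by_name = defaultdict(set)
    for name, label in pairs:
        labels_by_name[name].add(label)
    return {name: sorted(labels)
            for name, labels in labels_by_name.items()
            if len(labels) > 1}

# Subset of the (entity name, entity type) pairs printed above.
sample_pairs = [
    ("Bayern", "PERSON"), ("Bayern", "GPE"),
    ("Munich", "ORGANIZATION"), ("Munich", "GPE"),
    ("Franz John", "PERSON"), ("UEFA", "ORGANIZATION"),
]
print(conflicting_entities(sample_pairs))
```

In the notebook, the same function could be called on `named_entities` directly, since that list already holds `(name, label)` tuples.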


In [7]:
# Stanford NER is not bundled with NLTK: download it separately and make sure
# Java is installed. The paths below are specific to the author's machine.
from nltk.tag import StanfordNERTagger
sn = StanfordNERTagger("/home/simon/nltk_data/stanford/stanford-ner-2018-02-27/classifiers/english.all.3class.distsim.crf.ser.gz",
                       path_to_jar="/home/simon/nltk_data/stanford/stanford-ner-2018-02-27/stanford-ner.jar")

In [8]:
# tag_sents tags every sentence in one call instead of starting the tagger once per sentence
ne_annotated_sentences = sn.tag_sents(tokenized_sentences)

In [9]:
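named_entities = []
for sentence in ne_annotated_sentences:
    temp_entity_name = ""
    temp_named_entity = None
    for term, tag in sentence:
        if tag != "O":
            # Extend the current entity with this token.
            temp_entity_name = " ".join([temp_entity_name, term]).strip()
            temp_named_entity = (temp_entity_name, tag)
        elif temp_named_entity:
            # An "O" tag ends the current entity.
            named_entities.append(temp_named_entity)
            temp_entity_name = ""
            temp_named_entity = None
    # Flush an entity that runs to the end of the sentence.
    if temp_named_entity:
        named_entities.append(temp_named_entity)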
                
named_entities = list(set(named_entities))
entity_frame = pd.DataFrame(named_entities,
                           columns=["Entity Name", "Entity Type"])
print(entity_frame)

         Entity Name   Entity Type
0         Franz John        PERSON
1  Franz Beckenbauer        PERSON
2            Germany      LOCATION
3             Bayern  ORGANIZATION
4            Bavaria      LOCATION
5             Munich      LOCATION
6          FC Bayern  ORGANIZATION
7      Bayern Munich  ORGANIZATION
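
The loop above merges consecutive non-`O` tokens into one entity. The same idea can be sketched with `itertools.groupby`, which also splits adjacent entities whose tags differ and keeps an entity that ends a sentence. The tagged sentence below is hand-written for illustration, not actual tagger output, and `group_entities` is an illustrative helper name.

```python
from itertools import groupby

def group_entities(tagged_tokens, outside_tag="O"):
    """Collapse runs of identically tagged tokens into (entity, tag) pairs."""
    entities = []
    # groupby yields one run per stretch of consecutive tokens sharing a tag.
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != outside_tag:
            entities.append((" ".join(token for token, _ in run), tag))
    return entities

# Hand-written (token, tag) pairs in the shape returned by the tagger.
tagged = [("Franz", "PERSON"), ("Beckenbauer", "PERSON"), ("led", "O"),
          ("Bayern", "ORGANIZATION"), ("Munich", "ORGANIZATION")]
print(group_entities(tagged))
```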


In [10]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

In [15]:
# Python 2: spaCy expects unicode input. On Python 3, call nlp(text) directly.
doc = nlp(unicode(text))

In [17]:
from pprint import pprint
pprint([(X.text, X.label_) for X in doc.ents])

[(u'\n', u'GPE'),
 (u'FC Bayern', u'ORG'),
 (u'German', u'NORP'),
 (u'Munich', u'GPE'),
 (u'\n', u'GPE'),
 (u'Bavaria, Germany', u'GPE'),
 (u'\n', u'GPE'),
 (u'Bundesliga', u'GPE'),
 (u'German', u'NORP'),
 (u'\n', u'GPE'),
 (u'German', u'NORP'),
 (u'\n', u'GPE'),
 (u'26', u'CARDINAL'),
 (u'18', u'CARDINAL'),
 (u'\n', u'GPE'),
 (u'1900', u'DATE'),
 (u'eleven', u'CARDINAL'),
 (u'Franz John', u'PERSON'),
 (u'\n', u'GPE'),
 (u'Bayern', u'ORG'),
 (u'first', u'ORDINAL'),
 (u'1932', u'DATE'),
 (u'the club \n', u'ORG'),
 (u'Bundesliga', u'GPE'),
 (u'1963', u'DATE'),
 (u'The club \n', u'ORG'),
 (u'the middle of the 1970s', u'DATE'),
 (u'\n', u'GPE'),
 (u'Franz Beckenbauer', u'PERSON'),
 (u'three', u'CARDINAL'),
 (u'\n', u'GPE'),
 (u'1974', u'DATE'),
 (u'Bayern', u'ORG'),
 (u'ten', u'CARDINAL'),
 (u'UEFA Champions \nLeague', u'ORG'),
 (u'fifth', u'ORDINAL'),
 (u'2013', u'DATE'),
 (u'\n', u'GPE'),
 (u'\n', u'GPE')]
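
Many of the `GPE` hits above are bare newlines, which come from the hard-wrapped sample text. One possible remedy, shown here only as a sketch, is to collapse whitespace before calling `nlp` and to drop any entity whose text is blank. The `(text, label)` pairs below are copied from the output above. `normalize_whitespace` and `drop_blank_entities` are illustrative helpers, not spaCy functions.

```python
def normalize_whitespace(raw_text):
    """Replace newlines and repeated spaces with single spaces."""
    return " ".join(raw_text.split())

def drop_blank_entities(pairs):
    """Discard (text, label) pairs whose text is only whitespace; tidy the rest."""
    return [(normalize_whitespace(text), label)
            for text, label in pairs
            if text.strip()]

# A few (entity text, label) pairs taken from the output above.
pairs = [("\n", "GPE"), ("FC Bayern", "ORG"), ("the club \n", "ORG"),
         ("UEFA Champions \nLeague", "ORG")]
print(normalize_whitespace("Bayern Munich, or FC Bayern,\nis a German sports club "))
print(drop_blank_entities(pairs))
```

Passing `normalize_whitespace(text)` to `nlp` removes the newline tokens from the input. How the remaining labels change would need to be checked by rerunning the cell.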


In [19]:
len(doc.ents)

39

In [20]:
labels = [x.label_ for x in doc.ents]
Counter(labels)

Counter({u'GPE': 15, u'DATE': 6, u'ORG': 6, u'CARDINAL': 5, u'NORP': 3, u'ORDINAL': 2, u'PERSON': 2})

In [21]:
items = [x.text for x in doc.ents]
Counter(items).most_common(3)

[(u'\n', 11), (u'German', 3), (u'Bundesliga', 2)]

In [22]:
displacy.render(doc, jupyter=True, style='ent')

In [25]:
displacy.render(doc, style='dep', jupyter = True, options = {'distance': 100})

In [29]:
[(tok.orth_, tok.pos_, tok.lemma_)
 for tok in doc
 if not tok.is_stop and tok.pos_ != 'PUNCT']

[(u'\n', u'SPACE', u'\n'), (u'Bayern', u'PROPN', u'bayern'), (u'Munich', u'PROPN', u'munich'), (u'FC', u'PROPN', u'fc'), (u'Bayern', u'PROPN', u'bayern'), (u'German', u'ADJ', u'german'), (u'sports', u'NOUN', u'sport'), (u'club', u'NOUN', u'club'), (u'based', u'VERB', u'base'), (u'Munich', u'PROPN', u'munich'), (u'\n', u'SPACE', u'\n'), (u'Bavaria', u'PROPN', u'bavaria'), (u'Germany', u'PROPN', u'germany'), (u'It', u'PRON', u'-PRON-'), (u'best', u'ADV', u'best'), (u'known', u'VERB', u'know'), (u'professional', u'ADJ', u'professional'), (u'football', u'NOUN', u'football'), (u'team', u'NOUN', u'team'), (u'\n', u'SPACE', u'\n'), (u'plays', u'VERB', u'play'), (u'Bundesliga', u'PROPN', u'bundesliga'), (u'tier', u'NOUN', u'tier'), (u'German', u'ADJ', u'german'), (u'football', u'NOUN', u'football'), (u'\n', u'SPACE', u'\n'), (u'league', u'NOUN', u'league'), (u'system', u'NOUN', u'system'), (u'successful', u'ADJ', u'successful'), (u'club', u'NOUN', u'club'), (u'German', u'ADJ', u'german'), (u'f

In [31]:
{str(x): x.label_ for x in doc.ents}

{'ten': u'CARDINAL', 'the middle of the 1970s': u'DATE', 'Bundesliga': u'GPE', '\n': u'GPE', 'Franz John': u'PERSON', '1900': u'DATE', '26': u'CARDINAL', 'three': u'CARDINAL', 'Munich': u'GPE', '1963': u'DATE', 'eleven': u'CARDINAL', 'fifth': u'ORDINAL', 'Franz Beckenbauer': u'PERSON', 'Bayern': u'ORG', '2013': u'DATE', 'FC Bayern': u'ORG', 'the club \n': u'ORG', '1932': u'DATE', 'The club \n': u'ORG', 'Bavaria, Germany': u'GPE', 'German': u'NORP', '18': u'CARDINAL', 'UEFA Champions \nLeague': u'ORG', '1974': u'DATE', 'first': u'ORDINAL'}