## Named Entities


Named entities are specific items in a text that belong to categories such as "person", "location", "organization", "date", and "currency".

https://en.wikipedia.org/wiki/Named-entity_recognition

A good overview article: http://nlp.cs.nyu.edu/sekine/papers/li07.pdf




Check out the demos at https://www.textrazor.com/demo and http://nlp.stanford.edu:8080/corenlp/process.


In [None]:
import nltk
import nlp_utilities as mytools

### Here's another way to download the resources you need from NLTK. You may need the first two, but you should already have the last two.

In [None]:
nltk.download('maxent_ne_chunker')
nltk.download('words')
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')

In [None]:
filenames = mytools.get_filenames("data/SOTUsince1945/")

In [None]:
# Pick the first one.
filenames[0]

In [None]:
text = mytools.load_texts_as_string([filenames[0]])

We need just the text itself, so we get it out of the dictionary by turning the values into a list and taking item 0. Another option would have been to read the file directly.

In [None]:
sample = list(text.values())[0]
sample
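
The dictionary step can be tried in isolation. The sketch below uses a hypothetical `texts` dictionary as a stand-in for what the loader returns (assumed here to map a filename to that file's contents; the filename and text are made up) and shows two equivalent ways to pull out the only value.

```python
# Hypothetical stand-in for the loader's return value:
# assumed to map each filename to the full text of that file.
texts = {"example_speech.txt": "Example text of a speech."}

# dict.values() returns a view, which cannot be indexed directly,
# so it is converted to a list before taking item 0.
first_via_list = list(texts.values())[0]

# next(iter(...)) fetches the first value without building a list.
first_via_iter = next(iter(texts.values()))

print(first_via_list == first_via_iter)
```

Both forms return the same string; the `list(...)[0]` version is easier to read, while `next(iter(...))` avoids copying every value when the dictionary is large.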

In [None]:
# Or read the file directly, without using our utility and its dictionary:
with open(filenames[0], errors="ignore") as handle:
    sample = handle.read()

In [None]:
sample

In [None]:
def extract_entity_names(tree):
    """Recursively collect the text of every 'NE' subtree in a chunk tree."""
    # Code adapted from https://gist.github.com/onyxfish/322906
    entity_names = []
    # Subtrees have a label() method; leaves are plain (word, tag) tuples.
    if hasattr(tree, 'label'):
        if tree.label() == 'NE':  # for "named entity"
            entity_names.append(' '.join(child[0] for child in tree))
        else:
            for child in tree:
                entity_names.extend(extract_entity_names(child))
    return entity_names
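
To see how the recursion behaves without running the chunker, here is a minimal sketch on a hand-built tree. `MiniTree` is a hypothetical stand-in written for this example (it only mimics the two things the function relies on: iteration over children and a `label()` method), and the sentence and tags are invented for illustration.

```python
class MiniTree(list):
    """Hypothetical stand-in for a chunk tree: a list of children plus a label."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label


def collect_entities(tree):
    """Same traversal idea as extract_entity_names, written against MiniTree."""
    names = []
    if hasattr(tree, "label"):  # leaves are plain (word, tag) tuples
        if tree.label() == "NE":
            # Join the words of the entity; each child is a (word, tag) pair.
            names.append(" ".join(word for word, _tag in tree))
        else:
            for child in tree:
                names.extend(collect_entities(child))
    return names


# Invented example sentence with two hand-marked named entities.
sentence = MiniTree("S", [
    ("President", "NNP"),
    MiniTree("NE", [("Harry", "NNP"), ("Truman", "NNP")]),
    ("addressed", "VBD"),
    MiniTree("NE", [("Congress", "NNP")]),
    (".", "."),
])

found = collect_entities(sentence)
print(found)  # ['Harry Truman', 'Congress']
```

Leaves return an empty list because tuples have no `label` attribute, so only the words inside `NE` subtrees end up in the result.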

In [None]:
sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = list(nltk.ne_chunk_sents(tagged_sentences, binary=True))

# ne_chunk_sents returns a generator, which can be consumed only once.
# Storing it as a list lets us display it here and still loop over it later.
chunked_sentences
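
A generator can be consumed only once, which matters whenever the chunked sentences are displayed and then looped over again. The sketch below shows the behaviour with a hypothetical generator function, `fake_chunker`, that stands in for the real chunker and does no NLP.

```python
def fake_chunker(sentences):
    """Hypothetical generator standing in for a lazy chunking function."""
    for sentence in sentences:
        yield sentence.upper()


data = ["first sentence", "second sentence"]

lazy = fake_chunker(data)
first_pass = list(lazy)    # consumes every item
second_pass = list(lazy)   # the generator is now exhausted

print(len(first_pass), len(second_pass))

# Materialising the results once makes them reusable.
stored = list(fake_chunker(data))
reused_twice = [list(stored), list(stored)]
```

The second pass over `lazy` yields nothing, while `stored` can be iterated as often as needed at the cost of holding every result in memory.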


In [None]:
entity_names = []
for sent in chunked_sentences:
    # Print results per sentence
    entities = extract_entity_names(sent)
    # Not all sentences have entities in them!
    if entities:
        print(entities)
    entity_names.extend(entities)   # extend adds each item to the end of the list. http://thomas-cokelaer.info/blog/2011/03/post-2/
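
The difference between `extend` and `append` is easy to check on a small case. The per-sentence results below are hypothetical values made up for this sketch.

```python
# Hypothetical per-sentence results; the empty list is a sentence with no entities.
per_sentence = [["Congress"], [], ["United States", "Europe"]]

flat = []
nested = []
for entities in per_sentence:
    flat.extend(entities)    # adds each name individually
    nested.append(entities)  # adds the whole list as a single element

print(flat)
print(nested)
```

`flat` ends up as one list of names, which is what counting needs; `nested` keeps one sub-list per sentence.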

In [None]:
# Print unique entity names.
print(set(entity_names))

In [None]:
from collections import Counter

Counter(entity_names).most_common(10)
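
As a quick check of what `most_common` returns, here is a small sketch on a hypothetical list of names (not taken from the speeches).

```python
from collections import Counter

# Hypothetical entity names, with repeats.
names = ["Congress", "United States", "Congress",
         "Europe", "Congress", "United States"]

counts = Counter(names)
top_two = counts.most_common(2)   # list of (name, count) pairs, highest count first

print(top_two)  # [('Congress', 3), ('United States', 2)]
```

Each item is a `(name, count)` tuple, so the result can be unpacked in a loop such as `for name, n in top_two:`.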

## Optional: Compare to spaCy

In [None]:
import spacy

In [None]:
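# The 'en' shortcut no longer works in spaCy 3; load a model package by name.
# Install it first with: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')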
doc = nlp(sample)

spacy_ents = []
for ent in doc.ents:
    print(ent.text, ent.label_)
    spacy_ents.append(ent.text)   # append, not extend: ent.text is a single string, not a list

In [None]:
Counter(spacy_ents).most_common(10)
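
One simple way to compare the two tools is with set operations on their outputs. The sketch below uses two short hypothetical lists in place of `entity_names` and `spacy_ents`; the values are invented and are not real output from either library.

```python
# Hypothetical outputs standing in for entity_names and spacy_ents.
nltk_names = ["Congress", "United States", "Europe", "Congress"]
spacy_names = ["Congress", "the United States", "Europe", "1945"]

nltk_set = set(nltk_names)
spacy_set = set(spacy_names)

found_by_both = nltk_set & spacy_set   # intersection
only_nltk = nltk_set - spacy_set       # difference
only_spacy = spacy_set - nltk_set

print("both:", sorted(found_by_both))
print("only NLTK:", sorted(only_nltk))
print("only spaCy:", sorted(only_spacy))
```

Exact string matching is strict: in this made-up data, "United States" and "the United States" count as different entities, so some normalisation may be needed before drawing conclusions.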