# Named Entity Recognition
## Top Down NLP


## Use cases:
- Autocomplete
- Privacy: e.g. removing names
- Sentiment
- Chat bots
- Tagging


## Pipeline

1. Sentence Segmentation
2. Word Tokenizing
3. Predict Parts of Speech
4. Text Lemmatization
5. Handling Stopwords
6. Dependency Parsing
7. Finding Noun Phrases
8. Named Entity Recognition (NER)
9. Coreference Resolution
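
As a rough illustration of steps 1, 2 and 5 only, the sketch below uses regular expressions and a tiny hand-picked stopword list as toy stand-ins. A real pipeline such as spaCy's relies on trained models and language-specific rules, so the function names and the `STOPWORDS` set here are hypothetical and chosen purely for illustration.

```python
import re

# Tiny illustrative stopword list (assumption, not a real NLP resource)
STOPWORDS = {"the", "is", "and", "of", "a", "in", "it", "by", "was"}


def segment_sentences(text):
    """Step 1 (toy version): split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part.strip() for part in parts if part.strip()]


def tokenize(sentence):
    """Step 2 (toy version): words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def remove_stopwords(tokens):
    """Step 5 (toy version): drop tokens that appear in the stopword list."""
    return [token for token in tokens if token.lower() not in STOPWORDS]


sample = "London is the capital of England. It was founded by the Romans."
sentences = segment_sentences(sample)
first_tokens = tokenize(sentences[0])
print(sentences)
print(first_tokens)
print(remove_stopwords(first_tokens))
```

The remaining steps (part-of-speech tagging, lemmatization, dependency parsing, NER, coreference) need statistical models, which is why the rest of this notebook uses spaCy rather than hand-written rules.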

https://github.com/dipanjanS/practical-machine-learning-with-python/blob/master/bonus%20content/nlp%20proven%20approach/NLP%20Strategy%20I%20-%20Processing%20and%20Understanding%20Text.ipynb

Running an NLP pipeline on a piece of text gives you a list of the named entities, and their entity types, detected in the document.
For the entity types, see:
https://spacy.io/usage/linguistic-features#entity-types

https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e

In [1]:
import spacy
import pandas as pd

news_df = pd.read_csv('data/news.csv')

# Load the small English NLP model
nlp = spacy.load('en_core_web_sm')

# The text we want to examine
text = """London is the capital and most populous city of England and 
the United Kingdom.  Standing on the River Thames in the south east 
of the island of Great Britain, London has been a major settlement 
for two millennia. It was founded by the Romans, who named it Londinium.
"""

# Parse the text with spaCy. This runs the entire pipeline.
doc = nlp(text)

# 'doc' now contains a parsed version of text. We can use it to do anything we want!
# For example, this will print out all the named entities that were detected:
for entity in doc.ents:
    print(f"{entity.text} ({entity.label_})")

London (GPE)
England (GPE)
the United Kingdom (GPE)
Great Britain (GPE)
London (GPE)
two millennia (DATE)
Romans (NORP)


Privacy example: the following code removes all the person names that spaCy detects.

In [5]:
# Replace a token with "[REDACTED]" if it is a person's name
def replace_name_with_placeholder(token):
    if token.ent_iob != 0 and token.ent_type_ == "PERSON":
        return "[REDACTED] "
    else:
        # Token.string was removed in spaCy v3; text_with_ws is the
        # equivalent and keeps the trailing whitespace of each token
        return token.text_with_ws

# Merge each entity into a single token, then replace the tokens that are names
def scrub(text):
    doc = nlp(text)
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    tokens = map(replace_name_with_placeholder, doc)
    return "".join(tokens)

s = """
In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence". In 1957, Noam Chomsky’s 
Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule based system of syntactic structures.
"""

print(scrub(s))
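
The same redaction idea can also be sketched on character offsets instead of merged tokens. The helper below is a hypothetical illustration (`redact_spans` is not a spaCy function), and the spans are written by hand rather than produced by a model. With spaCy, comparable values would come from `ent.start_char`, `ent.end_char` and `ent.label_`. The sketch assumes the spans do not overlap.

```python
def redact_spans(text, spans, labels=("PERSON",), placeholder="[REDACTED]"):
    """Replace every span whose label is in `labels` with `placeholder`.

    `spans` is an iterable of (start_char, end_char, label) tuples that
    are assumed not to overlap.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if label not in labels:
            continue  # keep entities of other types untouched
        pieces.append(text[cursor:start])  # text before the entity
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])  # remainder after the last redacted entity
    return "".join(pieces)


sentence = "In 1950, Alan Turing published his famous article."
name = "Alan Turing"
name_start = sentence.index(name)
# Hand-written spans standing in for NER output (assumption, not model output)
spans = [(3, 7, "DATE"), (name_start, name_start + len(name), "PERSON")]
redacted = redact_spans(sentence, spans)
print(redacted)
```

Working on offsets leaves the original whitespace and punctuation exactly as they were, which can be convenient when the redacted text has to line up with the source.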

Extracting Facts
You can use the parsed output from spaCy as the input to more complex data extraction algorithms. There’s a Python library called textacy that implements several common data extraction algorithms on top of spaCy.

One of the algorithms it implements is called Semi-structured Statement Extraction. We can use it to search the parse tree for simple statements where the subject is “London” and the verb is a form of “be”. That should help us find facts about London.

In [None]:
import joblib
# import textacy.extract


# Load the small English NLP model
nlp = spacy.load('en_core_web_sm')

# The text we want to examine
text = """London is the capital and most populous city of England and  the United Kingdom.  
Standing on the River Thames in the south east of the island of Great Britain, 
London has been a major settlement  for two millennia.  It was founded by the Romans, 
who named it Londinium.
"""

# Parse the document with spaCy
doc = nlp(text)

# Extract semi-structured statements
# statements = textacy.extract.semistructured_statements(doc, "London")

# Print the results
# print("Here are the things I know about London:")

# for statement in statements:
#     subject, verb, fact = statement
#     print(f" - {fact}")
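
Since the textacy calls above are commented out, here is a deliberately naive stand-in that conveys the idea of pulling out "subject + form of be + fact" statements. It uses a regular expression over the raw text, not the parse tree, so it is not textacy's algorithm. The function name `simple_statements` and the list of verb forms are assumptions made for this sketch.

```python
import re


def simple_statements(text, subject, verbs=("is", "was", "has been")):
    """Yield (subject, verb, fact) for text shaped like '<subject> <verb> <fact>.'"""
    verb_pattern = "|".join(re.escape(verb) for verb in verbs)
    pattern = re.compile(
        rf"\b{re.escape(subject)}\s+({verb_pattern})\s+([^.]+)\.",
        re.IGNORECASE,
    )
    # Collapse line breaks so statements that span several lines still match
    flat_text = " ".join(text.split())
    for match in pattern.finditer(flat_text):
        yield subject, match.group(1), match.group(2).strip()


london_text = """London is the capital and most populous city of England and
the United Kingdom. Standing on the River Thames in the south east
of the island of Great Britain, London has been a major settlement
for two millennia. It was founded by the Romans, who named it Londinium.
"""

facts = list(simple_statements(london_text, "London"))
for _, verb, fact in facts:
    print(f" - ({verb}) {fact}")
```

The sentence starting with "It was founded" is missed because the subject is a pronoun. A parse-based extractor combined with coreference resolution is meant to close that gap.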

Try installing the neuralcoref library and adding Coreference Resolution to your pipeline. That will get you a few more facts, since it will catch sentences that talk about “it” instead of mentioning “London” directly.

Autocomplete example:
We need a list of possible completions to suggest to the user. We can use NLP to quickly generate this data.
Here’s one way to extract frequently-mentioned noun chunks from a document:

In [None]:
# Extract noun chunks that appear at least 3 times
# noun_chunks = textacy.extract.noun_chunks(doc, min_freq=3)

# Convert noun chunks to lowercase strings
# noun_chunks = map(str, noun_chunks)
# noun_chunks = map(str.lower, noun_chunks)

# Print out any noun chunks that are at least 2 words long
# for noun_chunk in set(noun_chunks):
#     if len(noun_chunk.split(" ")) > 1:
#         print(noun_chunk)
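
Once a list of candidate completions exists, suggesting them is a prefix lookup. The sketch below is one possible way to do that with a sorted list and binary search. The `suggest` function and the sample noun chunks are hypothetical and are not produced by the commented-out textacy code above.

```python
from bisect import bisect_left


def suggest(prefix, completions, limit=5):
    """Return up to `limit` completions that start with `prefix` (case-insensitive)."""
    ordered = sorted({completion.lower() for completion in completions})
    prefix = prefix.lower()
    start = bisect_left(ordered, prefix)  # first candidate >= prefix
    matches = []
    for candidate in ordered[start:]:
        if not candidate.startswith(prefix):
            break  # sorted order: no later candidate can match either
        matches.append(candidate)
        if len(matches) == limit:
            break
    return matches


# Hypothetical noun chunks standing in for extracted ones
noun_chunks = [
    "the River Thames",
    "the Romans",
    "the United Kingdom",
    "Great Britain",
    "a major settlement",
]
suggestions = suggest("the r", noun_chunks)
print(suggestions)
```

For a large vocabulary the sorted list would be built once and reused across lookups rather than re-sorted on every call as it is here.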

https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e


## Named Entity Recognition  

In any text document, there are particular terms that represent specific entities that are more informative and have a unique context. These entities are known as named entities, which more specifically refer to terms that represent real-world objects like people, places, organizations, and so on, which are often denoted by proper names. A naive approach could be to find these by looking at the noun phrases in text documents. Named entity recognition (NER), also known as entity chunking/extraction, is a popular technique used in information extraction to identify and segment the named entities and classify or categorize them under various predefined classes.
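
To make "identify and segment" concrete, here is a minimal sketch of how per-token IOB tags (B = begins an entity, I = inside an entity, O = outside any entity) can be grouped into entity spans. The function name `group_iob` and the hand-written tags are illustrative assumptions, not output from a real model. spaCy exposes comparable per-token information through `token.ent_iob_` and `token.ent_type_`.

```python
def group_iob(tokens, tags):
    """Group tokens and their IOB tags into (entity_text, entity_type) tuples.

    Tags look like "B-PERSON", "I-PERSON" or "O".
    """
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Close the entity currently being built, if there is one
        nonlocal current_tokens, current_type
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag == "O":
            flush()
            continue
        prefix, _, ent_type = tag.partition("-")
        if prefix == "B" or ent_type != current_type:
            # A "B" tag, or a change of type, always starts a new entity
            flush()
            current_type = ent_type
        current_tokens.append(token)
    flush()  # do not lose an entity that ends the sequence
    return entities


# Hand-labelled example (hypothetical tags, not model output)
tokens = ["Alan", "Turing", "was", "born", "in", "London", "."]
tags = ["B-PERSON", "I-PERSON", "O", "O", "O", "B-GPE", "O"]
grouped = group_iob(tokens, tags)
print(grouped)
```

The final `flush()` matters: without it, an entity that is the last thing in the text would be silently dropped.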

spaCy has some excellent capabilities for named entity recognition. Let’s try and use it on one of our sample news articles.

In [None]:
from spacy import displacy

sentence = str(news_df.iloc[1].full_text)
sentence_nlp = nlp(sentence)

# print named entities in article
print([(word, word.ent_type_) for word in sentence_nlp if word.ent_type_])

# visualize named entities
displacy.render(sentence_nlp, style='ent', jupyter=True)

We can clearly see that spaCy has identified the major named entities. To understand in more detail what each named entity type means, you can refer to the documentation or check out the following table for convenience.

Let’s now find out the most frequent named entities in our news corpus! For this, we will build out a data frame of all the named entities and their types using the following code.

In [None]:
named_entities = []
for sentence in news_df.full_text:
    doc = nlp(str(sentence))
    # doc.ents yields each entity as one span, so multi-token names stay
    # together and an entity at the very end of the text is not dropped
    for ent in doc.ents:
        named_entities.append((ent.text, ent.label_))

entity_frame = pd.DataFrame(named_entities, 
                            columns=['Entity Name', 'Entity Type'])

We can now transform and aggregate this data frame to find the top occurring entities and types.

In [None]:
# get the top named entities
top_entities = (entity_frame.groupby(by=['Entity Name', 'Entity Type'])
                           .size()
                           .sort_values(ascending=False)
                           .reset_index().rename(columns={0 : 'Frequency'}))
top_entities.T.iloc[:,:15]

Do you notice anything interesting? (Hint: maybe the supposed summit between Trump and Kim Jong!) We also see that it has correctly identified ‘Messenger’ as a product (from Facebook).

We can also group by the entity types to get a sense of what types of entities occur most in our news corpus.

In [None]:
# get the top named entity types
top_entities = (entity_frame.groupby(by=['Entity Type'])
                           .size()
                           .sort_values(ascending=False)
                           .reset_index().rename(columns={0 : 'Frequency'}))
top_entities.T.iloc[:,:15]

We can see that people, places and organizations are the most mentioned entities, though interestingly we also have many other entity types.
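
The two aggregations above can also be sketched without pandas, using `collections.Counter` on the list of `(name, type)` tuples. The sample entities below are made up for illustration and are not taken from the news corpus.

```python
from collections import Counter

# Hypothetical (entity name, entity type) pairs standing in for named_entities
sample_entities = [
    ("Trump", "PERSON"),
    ("Facebook", "ORG"),
    ("Trump", "PERSON"),
    ("US", "GPE"),
    ("Facebook", "ORG"),
    ("Trump", "PERSON"),
]

# Frequency of each (name, type) pair, like grouping by both columns
entity_counts = Counter(sample_entities)

# Frequency of each entity type, like grouping by 'Entity Type' only
type_counts = Counter(entity_type for _, entity_type in sample_entities)

print(entity_counts.most_common(2))
print(type_counts.most_common())
```

`most_common` returns the counts already sorted in descending order, which mirrors the `sort_values(ascending=False)` step in the pandas version.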