## NER - Named Entity Recognition

Note: follow the instructions in this notebook to configure Stanford NER locally, or run the notebook on RCC Midway with the "nlp_class" kernel.

In [1]:
#import nltk
#nltk.download('popular', halt_on_error=False)
#nltk.download('all', halt_on_error=False)

In [2]:
import nltk
import nltk.corpus
from nltk.text import Text
import pandas as pd
import re
import sys

In [3]:
print(sys.version)

3.7.4 (default, Aug 13 2019, 20:35:49) 
[GCC 7.3.0]


## NLTK-based NER

In [4]:
text = '''Surging Chinese demand and an improving U.S. economy have lifted sales of Caterpillar's signature yellow mining and construction machines. Now, with the pace of growth quickening in Latin America and Europe, the company is projecting higher earnings for 2018 than analysts estimated.  The outlook from Caterpillar, considered an economic bellwether, comes as industries from manufacturing to services report increased sales and orders that have fueled record equity prices and buoyed investor expectations for this year. This week, the International Monetary Fund raised its estimate for 2018 global growth to the fastest in seven years.  Caterpillar's results showed strength across the board in nearly every industry for the first time, which indicated coordinated and synchronized macroeconomic growth, Larry De Maria, an analyst at William Blair & Co., said in an interview. It's a good harbinger for overall economic activity.'''

In [5]:
text

"Surging Chinese demand and an improving U.S. economy have lifted sales of Caterpillar's signature yellow mining and construction machines. Now, with the pace of growth quickening in Latin America and Europe, the company is projecting higher earnings for 2018 than analysts estimated.  The outlook from Caterpillar, considered an economic bellwether, comes as industries from manufacturing to services report increased sales and orders that have fueled record equity prices and buoyed investor expectations for this year. This week, the International Monetary Fund raised its estimate for 2018 global growth to the fastest in seven years.  Caterpillar's results showed strength across the board in nearly every industry for the first time, which indicated coordinated and synchronized macroeconomic growth, Larry De Maria, an analyst at William Blair & Co., said in an interview. It's a good harbinger for overall economic activity."

### Basic NER: tagging words (tokens) as "NE"

In [6]:
# nltk.ne_chunk returns a tree: named entities are labelled subtrees, all other tokens are plain (word, tag) tuples.
# We have to traverse it to get the values.

entities = []
labels = []
for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)), binary = True):
    if hasattr(chunk, 'label'):
        entities.append(' '.join(c[0] for c in chunk))  # join multi-token entities with a space
        labels.append(chunk.label())

#entities_labels = list(zip(entities, labels))
entities_labels = list(set(zip(entities, labels))) #unique entities

# binary=True tags every entity simply as NE
# binary=False gives us typed labels such as PERSON, ORGANIZATION, and GPE (geo-political entity)
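
The traversal above can be tried without downloading any NLTK models by building a small chunk tree by hand. This is only a minimal sketch: the helper name `collect_entities` and the toy tree are illustrative assumptions, not part of the notebook, where the tree comes from `nltk.ne_chunk`.

```python
from nltk.tree import Tree

def collect_entities(tree):
    """Return (entity_text, label) pairs for each labelled chunk directly under the root."""
    pairs = []
    for chunk in tree:
        # Named entities are Tree nodes; ordinary tokens are (word, tag) tuples
        if hasattr(chunk, 'label'):
            pairs.append((' '.join(word for word, tag in chunk), chunk.label()))
    return pairs

# Hand-built stand-in (an assumption) for the output of nltk.ne_chunk(..., binary=True)
toy_tree = Tree('S', [
    ('growth', 'NN'), ('in', 'IN'),
    Tree('NE', [('Latin', 'NNP'), ('America', 'NNP')]),
    ('and', 'CC'),
    Tree('NE', [('Europe', 'NNP')]),
])

toy_entities = collect_entities(toy_tree)
print(toy_entities)
```

Wrapping the loop in a function like this would also avoid repeating the same traversal in each of the cells that follow.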

In [7]:
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities", "Labels"]
entities_df

Unnamed: 0,Entities,Labels
0,Chinese,NE
1,U.S.,NE
2,Caterpillar,NE
3,Larry De Maria,NE
4,Latin America,NE
5,International Monetary Fund,NE
6,William Blair,NE
7,Europe,NE


Chinese
U.S.
Caterpillar's
Latin America
Europe
Caterpillar
International Monetary Fund
Caterpillar's
Larry De Maria
William Blair & Co.

### Basic NER: tagging words (tokens) as PERSON, ORGANIZATION, and GPE

In [8]:
entities = []
labels = []
for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)), binary = False):
    if hasattr(chunk, 'label'):
        entities.append(' '.join(c[0] for c in chunk))  # join multi-token entities with a space
        labels.append(chunk.label())

#entities_labels = list(zip(entities, labels))
entities_labels = list(set(zip(entities, labels))) #unique entities

In [9]:
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities", "Labels"]
entities_df

Unnamed: 0,Entities,Labels
0,Caterpillar,GPE
1,U.S.,GPE
2,Caterpillar,PERSON
3,Europe,GPE
4,Chinese,GPE
5,William Blair,PERSON
6,International Monetary Fund,ORGANIZATION
7,Larry De Maria,PERSON
8,Latin America,GPE


### Alternative NER: splitting by sentences first, then by tokens

In [10]:
entities = []
labels = []

for sent in nltk.sent_tokenize(text):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)), binary = False):
        if hasattr(chunk, 'label'):
            entities.append(' '.join(c[0] for c in chunk))  # join multi-token entities with a space
            labels.append(chunk.label())

#entities_labels = list(zip(entities, labels))
entities_labels = list(set(zip(entities, labels))) #unique entities

In [11]:
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities", "Labels"]
entities_df

Unnamed: 0,Entities,Labels
0,Caterpillar,GPE
1,U.S.,GPE
2,Europe,GPE
3,Chinese,GPE
4,William Blair,PERSON
5,International Monetary Fund,ORGANIZATION
6,Larry De Maria,PERSON
7,Latin America,GPE
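
One way to see exactly what changed between the two runs is a set difference. The sketch below hard-codes the (entity, label) pairs from the two result tables above; the variable names are illustrative and not part of the notebook.

```python
# (entity, label) pairs copied from the whole-text run
whole_text = {
    ('Caterpillar', 'GPE'), ('U.S.', 'GPE'), ('Caterpillar', 'PERSON'),
    ('Europe', 'GPE'), ('Chinese', 'GPE'), ('William Blair', 'PERSON'),
    ('International Monetary Fund', 'ORGANIZATION'),
    ('Larry De Maria', 'PERSON'), ('Latin America', 'GPE'),
}

# (entity, label) pairs copied from the sentence-by-sentence run
by_sentence = {
    ('Caterpillar', 'GPE'), ('U.S.', 'GPE'), ('Europe', 'GPE'),
    ('Chinese', 'GPE'), ('William Blair', 'PERSON'),
    ('International Monetary Fund', 'ORGANIZATION'),
    ('Larry De Maria', 'PERSON'), ('Latin America', 'GPE'),
}

only_whole_text = whole_text - by_sentence
only_by_sentence = by_sentence - whole_text
print(only_whole_text)
print(only_by_sentence)
```

For this text, the only difference is that the whole-text run labels one mention of Caterpillar as PERSON and the sentence-by-sentence run does not.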


## Comparing NER Results: Sentence vs. Word Tokenization

In [12]:
#http://www.chicagotribune.com/business/ct-caterpillar-earnings-20180125-story.html
directory = '/project/msca/kadochnikov/text/news_articles/'
article = 'News_1.txt'

In [13]:
!head '/project/msca/kadochnikov/text/news_articles/News_1.txt'

Rising economic tide reaches all shores for buoyant Caterpillar
Caterpillar earnings

This Tuesday, July 25, 2017, photo shows Caterpillar machinery at a dealership in Murrysville, Pa. Caterpillar, Inc. reports earnings, Thursday, Jan. 25, 2018. (Gene J. Puskar / AP)
Joe DeauxBloomberg News

If you want more evidence of a broadening expansion in the global economy, look no further than Caterpillar.

Surging Chinese demand and an improving U.S. economy have lifted sales of Caterpillar's signature yellow mining and construction machines. Now, with the pace of growth quickening in Latin America and Europe, the company is projecting higher earnings for 2018 than analysts estimated.



In [14]:
# Use a context manager so the file is closed after reading
with open(directory + article, encoding="utf8") as f:
    text = f.read()

### Tagging word tokens
Shallow parsing (also called chunking or light parsing) is an analysis of a sentence that first identifies its constituent parts (nouns, verbs, adjectives, etc.) and then links them into higher-order units with discrete grammatical meanings (noun groups or phrases, verb groups, etc.). (Source: Wikipedia)
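
As a minimal illustration of chunking, the sketch below groups hand-tagged tokens into noun phrases with `nltk.RegexpParser`. The grammar and the tagged tokens are assumptions chosen for illustration; the notebook itself relies on `nltk.ne_chunk`, which uses a trained model rather than a hand-written grammar.

```python
import nltk

# Hypothetical hand-tagged tokens, so no tagger model is needed
tagged = [('the', 'DT'), ('yellow', 'JJ'), ('machines', 'NNS'),
          ('lifted', 'VBD'), ('sales', 'NNS')]

# Illustrative grammar: optional determiner, any adjectives, one or more nouns
chunker = nltk.RegexpParser('NP: {<DT>?<JJ>*<NN.*>+}')
tree = chunker.parse(tagged)

# Collect the text of every noun-phrase chunk found in the sentence
noun_phrases = [' '.join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')]
print(noun_phrases)
```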

In [15]:
entities = []
labels = []
for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)), binary = False):
    if hasattr(chunk, 'label'):
        entities.append(' '.join(c[0] for c in chunk))  # join multi-token entities with a space
        labels.append(chunk.label())

#entities_labels = list(zip(entities, labels))
entities_labels = list(set(zip(entities, labels))) #unique entities

In [16]:
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities", "Labels"]
entities_df.head(20)

Unnamed: 0,Entities,Labels
0,Caterpillar,PERSON
1,Dow Jones Industrial,ORGANIZATION
2,Gene,ORGANIZATION
3,Chinese,GPE
4,Jim Umpleby,PERSON
5,North America,GPE
6,Sales,PERSON
7,Larry De Maria,PERSON
8,Caterpillar,ORGANIZATION
9,Caterpillar,GPE


### Sentence split, then tagging word tokens

In [17]:
entities = []
labels = []

for sent in nltk.sent_tokenize(text):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)), binary = False):
        if hasattr(chunk, 'label'):
            entities.append(' '.join(c[0] for c in chunk))  # join multi-token entities with a space
            labels.append(chunk.label())

#entities_labels = list(zip(entities, labels))
entities_labels = list(set(zip(entities, labels))) #unique entities

In [18]:
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities", "Labels"]
entities_df.head(20)

Unnamed: 0,Entities,Labels
0,Caterpillar,PERSON
1,Dow Jones Industrial,ORGANIZATION
2,Gene,ORGANIZATION
3,Chinese,GPE
4,Jim Umpleby,PERSON
5,North America,GPE
6,Larry De Maria,PERSON
7,Caterpillar,ORGANIZATION
8,Caterpillar,GPE
9,Europe,GPE


## Leveraging more powerful NLP packages, such as Stanford NLP, to improve NER

### Installing and configuring Stanford NLP

https://medium.com/manash-en-blog/configuring-stanford-parser-and-stanford-ner-tagger-with-nltk-in-python-on-windows-f685483c374a

In [19]:
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
import os

# Paths are read from environment variables; set them according to your system
stanford_classifier = os.getenv('STANFORD_CLASSIFIER')
stanford_ner_path = os.getenv('STANFORD_NER_PATH')

# Creating Tagger Object
st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')
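
`os.getenv` returns `None` when a variable is not set, so a missing setting only surfaces later, when the tagger is created. A small check beforehand can make the cause more obvious. The sketch below is one possible way to do it; the helper name `missing_settings` and the example path are hypothetical, and only the two variable names come from the cell above.

```python
import os

def missing_settings(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not environ.get(name)]

required = ['STANFORD_CLASSIFIER', 'STANFORD_NER_PATH']

# Example with an explicit mapping (hypothetical path) instead of the real environment
example_env = {'STANFORD_CLASSIFIER': '/path/to/classifier.ser.gz'}
print(missing_settings(required, example_env))
```

Calling `missing_settings(required)` with no second argument checks the real environment.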

In [20]:
tokenized_text = nltk.word_tokenize(text)
classified_text = st.tag(tokenized_text)

classified_text_df = pd.DataFrame(classified_text)

classified_text_df.drop_duplicates(keep='first', inplace=True)
classified_text_df.reset_index(drop=True, inplace=True)
classified_text_df.columns = ["Entities", "Labels"]

In [21]:
classified_text_df.groupby("Labels").count()

Labels,Entities
LOCATION,8
O,235
ORGANIZATION,11
PERSON,10


In [22]:
entities_df = classified_text_df.loc[classified_text_df["Labels"].isin(['LOCATION','ORGANIZATION','PERSON'])]
entities_df.reset_index(drop=True, inplace=True)
entities_df.head(20)

Unnamed: 0,Entities,Labels
0,Caterpillar,ORGANIZATION
1,Murrysville,LOCATION
2,Pa.,LOCATION
3,",",ORGANIZATION
4,Inc.,ORGANIZATION
5,Gene,PERSON
6,J.,PERSON
7,Puskar,PERSON
8,U.S.,LOCATION
9,Latin,LOCATION


### StanfordNERTagger does not natively support multi-word NER
The tagger labels each token separately, so we have to merge adjacent tokens that share a tag by hand.

In [23]:
tokenized_text = nltk.word_tokenize(text)
classified_text = st.tag(tokenized_text)

from itertools import groupby

entities = []
labels = []

# Merge runs of consecutive tokens that share a tag into one multi-word entity
for tag, chunk in groupby(classified_text, lambda x: x[1]):
    if tag != "O":
        #print("%-12s"%tag, " ".join(w for w, t in chunk))
        entities.append(' '.join(w for w, t in chunk))
        labels.append(tag)
        
        
entities_all = list(zip(entities, labels))
entities_unique = list(set(zip(entities, labels))) #unique entities   
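
To see what the grouping does without running the Stanford tagger, the sketch below applies the same `itertools.groupby` idea to hand-written tagged tokens. The helper name `merge_tagged` and the sample tags are assumptions for illustration, not tagger output.

```python
from itertools import groupby

def merge_tagged(tagged_tokens, outside_tag='O'):
    """Merge runs of consecutive tokens with the same tag into (text, tag) pairs."""
    merged = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != outside_tag:
            merged.append((' '.join(word for word, _ in run), tag))
    return merged

# Hand-written (token, tag) pairs in the shape returned by st.tag(...)
sample = [('Larry', 'PERSON'), ('De', 'PERSON'), ('Maria', 'PERSON'),
          (',', 'O'), ('an', 'O'), ('analyst', 'O'), ('at', 'O'),
          ('William', 'ORGANIZATION'), ('Blair', 'ORGANIZATION')]
print(merge_tagged(sample))

# Caveat: two different entities that are adjacent and share a tag are merged too
adjacent = [('Apple', 'ORGANIZATION'), ('Intel', 'ORGANIZATION')]
print(merge_tagged(adjacent))
```

The second call illustrates a limitation of this approach: it cannot tell where one entity ends and the next begins when no differently tagged token separates them.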

In [24]:
entities_df = pd.DataFrame(entities_unique)
entities_df.columns = ["Entities", "Labels"]
entities_df.groupby('Labels').count()

Labels,Entities
LOCATION,7
ORGANIZATION,6
PERSON,5


In [25]:
entities_df = pd.DataFrame(entities_all)
entities_df.columns = ["Entities", "Labels"]
# Keep all three entity types (not only persons) before counting mentions
named_df = entities_df.loc[entities_df["Labels"].isin(['LOCATION','ORGANIZATION','PERSON'])]
counts_df = named_df.groupby('Entities').count()
counts_df.rename(columns={"Labels": "Mentions"}, inplace=True)
counts_df.sort_values(by=['Mentions'], ascending=False).head(20)

Entities,Mentions
Caterpillar,11
U.S.,2
Latin America,2
Bloomberg,1
William Blair,1
Pa.,1
North America,1
Murrysville,1
Larry De Maria,1
Jim Umpleby,1
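
The same kind of mention count can be sketched without pandas using `collections.Counter`. The `mentions` list below is hypothetical data in the same shape as `entities_all`, not the output of the tagger.

```python
from collections import Counter

# Hypothetical (entity, label) mentions, one pair per occurrence in the text
mentions = [
    ('Caterpillar', 'ORGANIZATION'), ('U.S.', 'LOCATION'),
    ('Caterpillar', 'ORGANIZATION'), ('Latin America', 'LOCATION'),
    ('Caterpillar', 'ORGANIZATION'), ('U.S.', 'LOCATION'),
]

# Count by entity text only, as groupby('Entities') does in the cell above
mention_counts = Counter(entity for entity, label in mentions)
top = mention_counts.most_common(2)
print(top)
```

Because the count is keyed on the entity text alone, an entity that receives different labels in different places is still counted as one.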


### NLTK-based

In [26]:
text = '''Sara's work efforts destroyed Apple Corporation's annual sales single handedly'''

In [27]:
entities = []
labels = []
for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)), binary = False):
    if hasattr(chunk, 'label'):
        entities.append(' '.join(c[0] for c in chunk))  # join multi-token entities with a space
        labels.append(chunk.label())

#entities_labels = list(zip(entities, labels))
entities_labels = list(set(zip(entities, labels))) #unique entities

In [28]:
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities", "Labels"]
entities_df

Unnamed: 0,Entities,Labels
0,Apple Corporation,PERSON
1,Sara,PERSON


### Stanford-based

In [29]:
# text = '''Sara's work efforts destroyed Apple annual sales single handedly'''

In [30]:
# text = '''Sara's and Igor work efforts destroyed Apple and Intel annual Revenue single handedly'''

In [31]:
text = '''First Bank of Chicago revenue dropped 30% on the announcement of Apple acqiuisition'''

In [32]:
# text = '''First Bank of Chicago revenue dropped 30% on the announcement of Apple acquisition'''

In [33]:
tokenized_text = nltk.word_tokenize(text)
classified_text = st.tag(tokenized_text)

classified_text_df = pd.DataFrame(classified_text)

classified_text_df.drop_duplicates(keep='first', inplace=True)
classified_text_df.reset_index(drop=True, inplace=True)
classified_text_df.columns = ["Entities", "Labels"]

In [34]:
entities_df = classified_text_df.loc[classified_text_df["Labels"].isin(['LOCATION','ORGANIZATION','PERSON'])]
entities_df.reset_index(drop=True, inplace=True)
entities_df.head(20)

Unnamed: 0,Entities,Labels
0,First,ORGANIZATION
1,Bank,ORGANIZATION
2,of,ORGANIZATION
3,Chicago,ORGANIZATION


In [35]:
import datetime
datetime.datetime.now().strftime("%a, %d %B %Y %H:%M:%S")

'Fri, 17 April 2020 10:16:26'