Named Entity Recognition (NER), also known as entity extraction, classifies the named entities in a text into predefined categories such as “individuals”, “companies”, “places”, “organizations”, “cities”, “dates”, and “product terminologies”. It adds semantic information to your content and helps you quickly understand what a given text is about.

Standard Libraries for Named Entity Recognition

I will discuss three standard libraries that are widely used in Python to perform NER. There are many more, and I encourage readers to add them in the comment section.

    Stanford NER
    spaCy
    NLTK

**Stanford NER**
Let's start with Stanford NER.

In [0]:
article = '''
Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a 
sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped 
riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2 
week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in 
electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight 
sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
European Union over Brexit, British Prime Minister Theresa May said on Monday.'''

In [2]:
!wget https://nlp.stanford.edu/software/stanford-ner-2018-10-16.zip

--2019-06-19 18:37:53--  https://nlp.stanford.edu/software/stanford-ner-2018-10-16.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 180358328 (172M) [application/zip]
Saving to: ‘stanford-ner-2018-10-16.zip’


2019-06-19 18:40:11 (1.25 MB/s) - ‘stanford-ner-2018-10-16.zip’ saved [180358328/180358328]



In [0]:
!unzip stanford-ner-2018-10-16.zip

In [0]:
import nltk
from nltk.tag import StanfordNERTagger

print('NLTK Version: %s' % nltk.__version__)

stanford_ner_tagger = StanfordNERTagger(
    'stanford-ner-2018-10-16/' + 'classifiers/english.muc.7class.distsim.crf.ser.gz',
    'stanford-ner-2018-10-16/' + 'stanford-ner-3.9.2.jar'
)

results = stanford_ner_tagger.tag(article.split())

print('Original Sentence: %s' % (article))
for result in results:
    tag_value = result[0]
    tag_type = result[1]
    if tag_type != 'O':
      print('Type: %s, Value: %s' %(tag_type, tag_value))
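
The tagger returns one `(token, tag)` pair per token, so a multi-word name comes back as several consecutive rows that share a tag. A minimal sketch of merging those runs into whole entities follows. `group_entities` and the `tagged` list are hypothetical names used for illustration, and the sample is hand-written rather than real tagger output. The sketch assumes adjacent tokens with the same tag belong to one entity, which can wrongly merge two distinct entities that happen to sit next to each other.

```python
from itertools import groupby

def group_entities(tagged_tokens):
    """Merge consecutive tokens that share a non-'O' tag into one entity."""
    entities = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != 'O':
            entities.append((tag, ' '.join(token for token, _ in run)))
    return entities

# Hand-written sample in the same (token, tag) shape that .tag() returns
tagged = [('Prime', 'O'), ('Minister', 'O'), ('Theresa', 'PERSON'),
          ('May', 'PERSON'), ('said', 'O'), ('on', 'O'), ('Monday', 'DATE')]
print(group_entities(tagged))  # [('PERSON', 'Theresa May'), ('DATE', 'Monday')]
```

With the real tagger output this could be called as `group_entities(results)`.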

**spaCy**

In [0]:
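!pip install spacy
!python -m spacy download en_core_web_sm

In [0]:
import spacy
# Load the small English pipeline by its full package name;
# the "en" shortcut no longer works in spaCy 3
spacy_nlp = spacy.load("en_core_web_sm")
result = spacy_nlp(article)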

for element in result.ents:
  print('Type: %s, Value: %s' % (element.label_, element))
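
To get an overview rather than a flat list, the entities can be bucketed by label. The sketch below relies only on each entity exposing `.label_` and `.text`, as the spans in `result.ents` do. `entities_by_label` is a hypothetical helper, and the `FakeEnt` stand-ins exist only so the snippet runs without a downloaded model.

```python
from collections import defaultdict, namedtuple

def entities_by_label(ents):
    """Map each entity label to the list of entity texts carrying it."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)

# Stand-in for spaCy spans so the example runs without a model (assumption)
FakeEnt = namedtuple('FakeEnt', ['text', 'label_'])
sample = [FakeEnt('Theresa May', 'PERSON'), FakeEnt('Monday', 'DATE'),
          FakeEnt('Tuesday', 'DATE')]
print(entities_by_label(sample))
```

With the real pipeline this would be called as `entities_by_label(result.ents)`.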

**NLTK**

NLTK (Natural Language Toolkit) is a Python package that provides a set of natural language corpora and APIs for a wide variety of NLP algorithms.

Named Entity Recognition with NLTK is done in three stages:

    Word Tokenization
    Parts of Speech (POS) tagging
    Named Entity Recognition

Note that we need to download some standard corpora and models from NLTK to perform part-of-speech tagging and named entity recognition. The code below downloads them.

In [0]:
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
nltk.download("words")
nltk.download("averaged_perceptron_tagger")
nltk.download("punkt")
nltk.download("maxent_ne_chunker")

In [0]:
def preprocess(text):
  text=nltk.word_tokenize(text)
  result=nltk.pos_tag(text)
  return result


result=preprocess(article)
result

Once part-of-speech tagging is done, the next step is chunking. Text chunking, also called shallow parsing, typically follows POS tagging and adds more structure to the sentence. The result is a grouping of words into “chunks”.

So, let's perform chunking on our article, which we have already POS tagged.

Our target here is to NER tag only the nouns.

In [0]:
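# result is a plain list of (word, tag) pairs, so run ne_chunk (imported above)
# on it and scan the printed tree line by line for noun tags
for x in str(ne_chunk(result)).split('\n'):
  if '/NN' in x:
    print(x)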
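
`ne_chunk` (imported above) returns an `nltk.Tree` in which each recognised entity is a subtree labelled with its type. A minimal sketch of pulling those subtrees out is below. `extract_entities` is a hypothetical helper, and the small tree is built by hand to mimic the shape of `ne_chunk` output so that the snippet runs without the downloaded models.

```python
from nltk import Tree

def extract_entities(tree):
    """Return (label, text) for every labelled subtree directly under the root."""
    entities = []
    for node in tree:
        if isinstance(node, Tree):  # plain tokens are (word, tag) tuples
            text = ' '.join(word for word, tag in node.leaves())
            entities.append((node.label(), text))
    return entities

# Hand-built tree in the shape ne_chunk produces (not real model output)
sample_tree = Tree('S', [
    ('British', 'JJ'), ('Prime', 'NNP'), ('Minister', 'NNP'),
    Tree('PERSON', [('Theresa', 'NNP'), ('May', 'NNP')]),
    ('said', 'VBD'),
])
print(extract_entities(sample_tree))  # [('PERSON', 'Theresa May')]
```

On the article this could be called as `extract_entities(ne_chunk(result))`.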

The output looks decent but not great. Let's take up a slightly more complex task.

    Say we want to implement noun phrase chunking to identify named entities.

    Our chunk pattern consists of one rule, that a noun phrase, NP, should be formed whenever the chunker finds an optional determiner, DT, followed by any number of adjectives, JJ, and then a noun, NN.

In [0]:
pattern='NP: {<DT>?<JJ>*<NN>}'
cp=nltk.RegexpParser(pattern)
cs=cp.parse(result)
print(cs)
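
To see exactly what the rule matches, it helps to run it on a short sentence. The sketch below applies the same one-rule grammar to a handful of hand-tagged tokens (the `tagged` list is made up for illustration, so no tagger model is needed) and collects the words of each `NP` chunk.

```python
import nltk

# Same single-rule grammar as above: optional determiner, adjectives, then a noun
chunker = nltk.RegexpParser('NP: {<DT>?<JJ>*<NN>}')

# Hand-tagged tokens, so no tagger model is needed for this example
tagged = [('the', 'DT'), ('sharp', 'JJ'), ('drop', 'NN'),
          ('pushed', 'VBD'), ('the', 'DT'), ('dollar', 'NN'), ('higher', 'JJR')]
tree = chunker.parse(tagged)

# Collect the words of every NP subtree the rule matched
noun_phrases = [' '.join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees(lambda t: t.label() == 'NP')]
print(noun_phrases)  # ['the sharp drop', 'the dollar']
```

Note that `<NN>` matches only the exact tag `NN`, so plural nouns (`NNS`) and proper nouns (`NNP`) fall outside this rule. That is one reason a pattern like this is only a rough proxy for named entities.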

The output can be read as a tree with “S”, the sentence, as the first level. It can also be viewed in a more readable format called IOB tags (Inside, Outside, Beginning).

In [0]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint

iob_tagged = tree2conlltags(cs)

pprint(iob_tagged)

Here, each token in the output is a tuple of the word, its part-of-speech tag, and its IOB chunk tag. Because each entry is a tuple, you can unpack the tags directly:

In [0]:
for word, pos, ner in iob_tagged:
    print(word, pos, ner)
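
Going the other way, the chunks can be rebuilt from the IOB tags: a `B-` tag opens a chunk, matching `I-` tags extend it, and `O` closes it. A minimal sketch under that reading is below. `chunks_from_iob` is a hypothetical helper, and the sample triples are hand-written in the same `(word, pos, iob)` shape that `tree2conlltags` returns.

```python
def chunks_from_iob(triples):
    """Rebuild (label, phrase) chunks from (word, pos, iob) triples."""
    chunks = []
    words, label = [], None
    for word, pos, iob in triples:
        # An I- tag with the same label continues the open chunk
        continues = iob.startswith('I-') and iob[2:] == label
        if not continues and words:
            chunks.append((label, ' '.join(words)))
            words, label = [], None
        if iob != 'O':
            words.append(word)
            label = iob[2:]  # strip the 'B-' or 'I-' prefix
    if words:
        chunks.append((label, ' '.join(words)))
    return chunks

# Hand-written sample in the shape of tree2conlltags output
sample = [('the', 'DT', 'B-NP'), ('sharp', 'JJ', 'I-NP'), ('drop', 'NN', 'I-NP'),
          ('pushed', 'VBD', 'O'), ('the', 'DT', 'B-NP'), ('dollar', 'NN', 'I-NP')]
print(chunks_from_iob(sample))  # [('NP', 'the sharp drop'), ('NP', 'the dollar')]
```

On the article this could be called as `chunks_from_iob(iob_tagged)`.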