**Named Entity Recognition**

- Named entity recognition (NER) is an NLP task that assigns the words in a text to predefined categories such as person names, dates, and organizations.
- It can be used to extract information such as the products named in a complaint, or the locations and companies mentioned in an article.


NER implementation using NLTK

In [None]:
# Downloading nltk requirements

import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

In [5]:
Document = ['''Dr. Theresa Tam was named Canada’s Chief Public Health Officer (CPHO) on June 26, 2017.''', ''' She is a physician with expertise in immunization, infectious disease, emergency preparedness and global health security.''', '''

As the federal government’s lead health professional, Dr. Tam provides advice to the Minister of Health, supports and provides advice to the President of the Public Health Agency of Canada, and works in collaboration with the President in the leadership and management of the Agency.''',

'''The Public Health Agency of Canada Act empowers the CPHO to communicate with other levels of government, voluntary organizations, the private sector and Canadians on public health issues.''',  '''Each year, the CPHO is required to submit a report to the Minister of Health on the state of public health in Canada.''']

In [9]:
# Word tokenization and part-of-speech tagging

def pre_process(Doc):
  Named = []
  for sent in Doc:
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent) # part of speech tag
    Named.append(sent)
  return Named

tokenized_sentences = pre_process(Document)
tokenized_sentences

[[('Dr.', 'NNP'),
  ('Theresa', 'NNP'),
  ('Tam', 'NNP'),
  ('was', 'VBD'),
  ('named', 'VBN'),
  ('Canada', 'NNP'),
  ('’', 'NNP'),
  ('s', 'NN'),
  ('Chief', 'NNP'),
  ('Public', 'NNP'),
  ('Health', 'NNP'),
  ('Officer', 'NNP'),
  ('(', '('),
  ('CPHO', 'NNP'),
  (')', ')'),
  ('on', 'IN'),
  ('June', 'NNP'),
  ('26', 'CD'),
  (',', ','),
  ('2017', 'CD'),
  ('.', '.')],
 [('She', 'PRP'),
  ('is', 'VBZ'),
  ('a', 'DT'),
  ('physician', 'JJ'),
  ('with', 'IN'),
  ('expertise', 'NN'),
  ('in', 'IN'),
  ('immunization', 'NN'),
  (',', ','),
  ('infectious', 'JJ'),
  ('disease', 'NN'),
  (',', ','),
  ('emergency', 'NN'),
  ('preparedness', 'NN'),
  ('and', 'CC'),
  ('global', 'JJ'),
  ('health', 'NN'),
  ('security', 'NN'),
  ('.', '.')],
 [('As', 'IN'),
  ('the', 'DT'),
  ('federal', 'JJ'),
  ('government', 'NN'),
  ('’', 'NNP'),
  ('s', 'VBZ'),
  ('lead', 'JJ'),
  ('health', 'NN'),
  ('professional', 'JJ'),
  (',', ','),
  ('Dr.', 'NNP'),
  ('Tam', 'NNP'),
  ('provides', 'VBZ'),
  ('

Chunking

- Use regular expressions over part-of-speech tags to identify candidate phrases (chunks) that may contain named entities.
- For example, consider the noun phrase chunking pattern below.
- The chunk pattern consists of one rule: a noun phrase, NP, is formed whenever the chunker finds an optional determiner, DT, followed by any number of adjectives, JJ, and then a noun, NN.

In [10]:
# Pattern
pattern = 'NP: {<DT>?<JJ>*<NN>}'

# Chunk parser

cp = nltk.RegexpParser(pattern)
cs = cp.parse(tokenized_sentences[0])
print(cs)


(S
  Dr./NNP
  Theresa/NNP
  Tam/NNP
  was/VBD
  named/VBN
  Canada/NNP
  ’/NNP
  (NP s/NN)
  Chief/NNP
  Public/NNP
  Health/NNP
  Officer/NNP
  (/(
  CPHO/NNP
  )/)
  on/IN
  June/NNP
  26/CD
  ,/,
  2017/CD
  ./.)
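
The output above shows that the rule only chunks `s/NN`, because it requires the tag `NN` while the names are tagged `NNP`. Below is a minimal sketch of a broader rule, assuming proper nouns should also count as noun phrases. The pattern `<NN.*>+` and the variable names are illustrative choices, not part of the original notebook, and the tagged tokens are copied by hand from the tagger output.

```python
import nltk

# First six (word, tag) pairs copied from the tagger output above,
# so no tokenizer or tagger models need to be downloaded.
tagged = [('Dr.', 'NNP'), ('Theresa', 'NNP'), ('Tam', 'NNP'),
          ('was', 'VBD'), ('named', 'VBN'), ('Canada', 'NNP')]

# Assumed variant of the rule: optional determiner, any adjectives,
# then one or more tags starting with NN (NN, NNS, NNP, NNPS).
proper_noun_pattern = 'NP: {<DT>?<JJ>*<NN.*>+}'
proper_noun_parser = nltk.RegexpParser(proper_noun_pattern)
tree = proper_noun_parser.parse(tagged)

# Collect the words of every NP subtree as a single string.
noun_phrases = [' '.join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees(lambda t: t.label() == 'NP')]
print(noun_phrases)
```

Grouping consecutive proper nouns this way yields candidate entity spans, but it does not say what kind of entity each span is.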


The IOB format is used to represent chunk structures in files.
- I - the word is inside a chunk, e.g. I-NP means the word is inside a noun phrase
- O - the word is outside any chunk
- B - the word begins a chunk, e.g. B-NP or B-VP
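
To make the tag meanings concrete, here is a small sketch that turns IOB-tagged tokens back into phrases. The function name `iob_to_chunks` and the example sentence are hypothetical and only for illustration; the input shape matches the (word, part-of-speech tag, IOB tag) triples that `tree2conlltags` returns.

```python
def iob_to_chunks(tagged):
    """Group (word, pos, iob) triples into (label, phrase) pairs."""
    chunks = []
    current_words, current_label = [], None

    def flush():
        # Store the chunk collected so far, if there is one.
        if current_words:
            chunks.append((current_label, ' '.join(current_words)))

    for word, _pos, iob in tagged:
        if iob == 'O':
            # Outside any chunk: close whatever was open.
            flush()
            current_words, current_label = [], None
        elif iob.startswith('B-') or iob[2:] != current_label:
            # A B- tag (or an I- tag with a new label) starts a new chunk.
            flush()
            current_words, current_label = [word], iob[2:]
        else:
            # An I- tag with the same label continues the open chunk.
            current_words.append(word)
    flush()
    return chunks


# Hypothetical hand-tagged example.
example = [('the', 'DT', 'B-NP'), ('federal', 'JJ', 'I-NP'),
           ('government', 'NN', 'I-NP'), ('provides', 'VBZ', 'O'),
           ('advice', 'NN', 'B-NP')]
print(iob_to_chunks(example))
```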


In [13]:
from nltk import tree2conlltags

iob_tags = tree2conlltags(cs)
iob_tags

# Returns (word, part-of-speech tag, IOB tag) tuples.

[('Dr.', 'NNP', 'O'),
 ('Theresa', 'NNP', 'O'),
 ('Tam', 'NNP', 'O'),
 ('was', 'VBD', 'O'),
 ('named', 'VBN', 'O'),
 ('Canada', 'NNP', 'O'),
 ('’', 'NNP', 'O'),
 ('s', 'NN', 'B-NP'),
 ('Chief', 'NNP', 'O'),
 ('Public', 'NNP', 'O'),
 ('Health', 'NNP', 'O'),
 ('Officer', 'NNP', 'O'),
 ('(', '(', 'O'),
 ('CPHO', 'NNP', 'O'),
 (')', ')', 'O'),
 ('on', 'IN', 'O'),
 ('June', 'NNP', 'O'),
 ('26', 'CD', 'O'),
 (',', ',', 'O'),
 ('2017', 'CD', 'O'),
 ('.', '.', 'O')]

nltk.ne_chunk can be used to identify named entities and classify them with a pre-trained classifier.

In [None]:
from nltk import ne_chunk
nltk.download('maxent_ne_chunker')
nltk.download('words')

In [16]:
ne_tree = [ne_chunk(pos_tag(word_tokenize(sent))) for sent in Document]
ne_tree

[Tree('S', [('Dr.', 'NNP'), ('Theresa', 'NNP'), ('Tam', 'NNP'), ('was', 'VBD'), ('named', 'VBN'), Tree('PERSON', [('Canada', 'NNP')]), ('’', 'NNP'), ('s', 'NN'), ('Chief', 'NNP'), Tree('PERSON', [('Public', 'NNP'), ('Health', 'NNP'), ('Officer', 'NNP')]), ('(', '('), Tree('ORGANIZATION', [('CPHO', 'NNP')]), (')', ')'), ('on', 'IN'), ('June', 'NNP'), ('26', 'CD'), (',', ','), ('2017', 'CD'), ('.', '.')]),
 Tree('S', [('She', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('physician', 'JJ'), ('with', 'IN'), ('expertise', 'NN'), ('in', 'IN'), ('immunization', 'NN'), (',', ','), ('infectious', 'JJ'), ('disease', 'NN'), (',', ','), ('emergency', 'NN'), ('preparedness', 'NN'), ('and', 'CC'), ('global', 'JJ'), ('health', 'NN'), ('security', 'NN'), ('.', '.')]),
 Tree('S', [('As', 'IN'), ('the', 'DT'), ('federal', 'JJ'), ('government', 'NN'), ('’', 'NNP'), ('s', 'VBZ'), ('lead', 'JJ'), ('health', 'NN'), ('professional', 'JJ'), (',', ','), ('Dr.', 'NNP'), ('Tam', 'NNP'), ('provides', 'VBZ'), ('advice', '
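
The nested `Tree` output is hard to read, so a flat list of (entity text, label) pairs can be pulled out of it. The sketch below assumes the tree shape shown above, where each entity is a labelled subtree directly under `S`. The helper name `extract_entities` is hypothetical, and `example_tree` is a hand-built fragment of the first tree so that no models are needed to run it.

```python
from nltk import Tree


def extract_entities(sentence_tree):
    """Return (entity text, label) pairs from an ne_chunk-style tree."""
    entities = []
    for node in sentence_tree:
        # Plain tokens are (word, tag) tuples; entities are subtrees.
        if isinstance(node, Tree):
            text = ' '.join(word for word, tag in node.leaves())
            entities.append((text, node.label()))
    return entities


# Fragment of the first tree in the output above, rebuilt by hand.
example_tree = Tree('S', [
    ('was', 'VBD'), ('named', 'VBN'),
    Tree('PERSON', [('Canada', 'NNP')]),
    ('’', 'NNP'), ('s', 'NN'), ('Chief', 'NNP'),
    Tree('PERSON', [('Public', 'NNP'), ('Health', 'NNP'), ('Officer', 'NNP')]),
    ('(', '('),
    Tree('ORGANIZATION', [('CPHO', 'NNP')]),
    (')', ')'),
])
print(extract_entities(example_tree))
```

Listing the entities this way also makes the labels easier to inspect: in this output `Canada` and `Public Health Officer` are both labelled `PERSON`, and `Theresa Tam` is not recognized at all.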

Using **spaCy** for named entity recognition

In [17]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm


nlp = en_core_web_sm.load()

Entity level.

In [18]:
# Tokenize and print entity level tags
Entities = nlp(Document[0])
[(x.text, x.label_) for x in Entities.ents]

[('Theresa Tam', 'PERSON'), ('Canada', 'GPE'), ('June 26, 2017', 'DATE')]

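Token-level entity annotation. The BILUO scheme is listed below; note that `ent_iob_`, used in the next cell, reports only the simpler IOB codes (`B`, `I`, `O`).

- **B**egin - The first token of a multi-token entity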
- **I**n - An inner token of a multi-token entity
- **L**ast - The final token of a multi-token entity
- **U**nit - A single token entity
- **O**ut - A non-entity token
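
As a rough illustration of how the two schemes relate, the sketch below converts IOB tags into BILUO tags by looking one token ahead. The function name `iob_to_biluo_sketch` is hypothetical and this is not spaCy's own implementation; the input tags follow the first six tokens of `Document[0]`, written in `PREFIX-LABEL` form.

```python
def iob_to_biluo_sketch(tags):
    """Convert tags such as 'B-PERSON', 'I-PERSON', 'O' to BILUO tags."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            biluo.append('O')
            continue
        prefix, label = tag.split('-', 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else 'O'
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == 'I-' + label
        if prefix == 'B':
            biluo.append(('B-' if continues else 'U-') + label)
        else:
            biluo.append(('I-' if continues else 'L-') + label)
    return biluo


# Dr. / Theresa / Tam / was / named / Canada
iob_tags_example = ['O', 'B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE']
print(iob_to_biluo_sketch(iob_tags_example))
```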

In [19]:
[(x, x.ent_iob_, x.ent_type_) for x in Entities]

[(Dr., 'O', ''),
 (Theresa, 'B', 'PERSON'),
 (Tam, 'I', 'PERSON'),
 (was, 'O', ''),
 (named, 'O', ''),
 (Canada, 'B', 'GPE'),
 (’s, 'O', ''),
 (Chief, 'O', ''),
 (Public, 'O', ''),
 (Health, 'O', ''),
 (Officer, 'O', ''),
 ((, 'O', ''),
 (CPHO, 'O', ''),
 (), 'O', ''),
 (on, 'O', ''),
 (June, 'B', 'DATE'),
 (26, 'I', 'DATE'),
 (,, 'I', 'DATE'),
 (2017, 'I', 'DATE'),
 (., 'O', '')]

**NER** extraction from an article.

In [20]:
from bs4 import BeautifulSoup
import requests
import re

def url_to_string(URL):
  res = requests.get(URL)
  html = res.text
  soup = BeautifulSoup(html, 'html5lib')

  for script in soup(['script', 'style', 'aside']):
    script.extract()
  return " ".join(re.split(r'[\n\t]+', soup.get_text()))

Preprocessed = url_to_string('https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news')

Article = nlp(Preprocessed)
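
The cleaning step in `url_to_string` can be tried without a network request. The sketch below is an assumption-laden variant: `html_to_string` is a hypothetical helper that takes HTML text directly, it uses Python's built-in `html.parser` backend instead of `html5lib`, and `sample_html` is a made-up snippet.

```python
import re

from bs4 import BeautifulSoup


def html_to_string(html):
    """Strip script, style and aside elements, then flatten whitespace."""
    soup = BeautifulSoup(html, 'html.parser')
    # Remove elements whose text is not part of the article body.
    for element in soup.find_all(['script', 'style', 'aside']):
        element.decompose()
    # Replace runs of newlines and tabs with single spaces.
    return ' '.join(re.split(r'[\n\t]+', soup.get_text()))


# Made-up HTML, for illustration only.
sample_html = (
    "<html><head><style>p {color: red}</style>"
    "<script>alert('x')</script></head>"
    "<body><p>First line\n\tsecond line</p>"
    "<aside>Related links</aside></body></html>"
)
cleaned = html_to_string(sample_html)
print(cleaned)
```

Separating the download from the cleaning also makes it easier to check what text actually reaches `nlp` before counting entities.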

In [22]:
# Labels counter

Labels = [X.label_ for X in Article.ents]
Counter(Labels)

Counter({'CARDINAL': 3,
         'DATE': 23,
         'GPE': 9,
         'LOC': 1,
         'NORP': 2,
         'ORDINAL': 1,
         'ORG': 38,
         'PERSON': 77})

In [23]:
# Most frequently mentioned entities
# A quick way to find out what the article is about
items = [x.text for x in Article.ents]
Counter(items).most_common(3)

[('Strzok', 29), ('F.B.I.', 19), ('Trump', 13)]

Extracting the part-of-speech tag and lemma of each token in a sentence

In [24]:
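# Split the article into sentences, then list (text, POS tag, lemma)
# for one sentence, skipping stop words and punctuation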
sentences = [x for x in Article.sents]

[(X.orth_, X.pos_, X.lemma_) for X in [Y for Y in nlp(str(sentences[20]))
                                               if not Y.is_stop and Y.pos_ != 'PUNCT']]

[('spokeswoman', 'NOUN', 'spokeswoman'),
 ('F.B.I.', 'PROPN', 'F.B.I.'),
 ('respond', 'VERB', 'respond'),
 ('message', 'NOUN', 'message'),
 ('seeking', 'VERB', 'seek'),
 ('comment', 'NOUN', 'comment'),
 ('Mr.', 'PROPN', 'Mr.'),
 ('Strzok', 'PROPN', 'Strzok'),
 ('dismissed', 'VERB', 'dismiss'),
 ('demoted', 'VERB', 'demote')]