
<img src="https://drive.google.com/uc?id=1PMTi2Akvsbc3z28Pgi4zMMvI0oKUQGqw">

Tagging Part of Speech
--
Part-of-speech (POS) tagging is a core step in natural language
processing: it labels each word with its part of speech, such as
noun, verb, or adjective. POS tags feed into downstream tasks such as
Named Entity Recognition, Sentiment Analysis, Question Answering, and
Word Sense Disambiguation.

Problem
--
You want to tag the parts of speech in a sentence.

Solution
--
There are two ways to build a tagger.

• Rule-based: manually written rules that assign a word to a particular POS.

• Stochastic: algorithms that model the sequence of words and assign the most probable sequence of tags, for example using hidden Markov models.
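The two approaches can be contrasted with a minimal pure-Python sketch. The regular-expression rules, the three-tag state set, and every probability below are made up for illustration (they are not estimated from a corpus and are not what NLTK uses internally); the stochastic part decodes a toy hidden Markov model with the Viterbi algorithm.

```python
import re

# --- Rule-based: hand-written patterns, first matching rule wins ---
# These rules are illustrative assumptions, not a complete rule set.
RULES = [
    (r"^\d+$", "CD"),    # only digits -> cardinal number
    (r".*ing$", "VBG"),  # ends in -ing -> gerund / present participle
    (r".*ed$", "VBD"),   # ends in -ed  -> past tense verb
    (r".*s$", "NNS"),    # ends in -s   -> plural noun
    (r".*", "NN"),       # fallback     -> singular noun
]

def rule_based_tag(tokens):
    """Tag each token with the first rule whose pattern matches."""
    tagged = []
    for tok in tokens:
        for pattern, tag in RULES:
            if re.match(pattern, tok):
                tagged.append((tok, tag))
                break
    return tagged

# --- Stochastic: toy hidden Markov model decoded with Viterbi ---
# All probabilities are invented for this example.
STATES = ["PRP", "VBP", "NN"]
START = {"PRP": 0.6, "VBP": 0.1, "NN": 0.3}
TRANS = {
    "PRP": {"PRP": 0.1, "VBP": 0.8, "NN": 0.1},
    "VBP": {"PRP": 0.2, "VBP": 0.1, "NN": 0.7},
    "NN":  {"PRP": 0.2, "VBP": 0.3, "NN": 0.5},
}
EMIT = {
    "PRP": {"I": 0.9, "love": 0.0, "NLP": 0.1},
    "VBP": {"I": 0.0, "love": 0.9, "NLP": 0.1},
    "NN":  {"I": 0.0, "love": 0.3, "NLP": 0.7},
}

def viterbi_tag(tokens):
    """Return the most probable tag sequence under the toy HMM."""
    if not tokens:
        return []
    # Each column maps state -> (best path probability, previous state)
    columns = [{s: (START[s] * EMIT[s].get(tokens[0], 0.0), None) for s in STATES}]
    for tok in tokens[1:]:
        column = {}
        for s in STATES:
            column[s] = max(
                (columns[-1][p][0] * TRANS[p][s] * EMIT[s].get(tok, 0.0), p)
                for p in STATES
            )
        columns.append(column)
    # Backtrack from the most probable final state
    state = max(STATES, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return list(zip(tokens, path))

print(rule_based_tag(["2", "running", "walked", "desks", "desk"]))
print(viterbi_tag(["I", "love", "NLP"]))
```

The rule-based tagger looks at each word in isolation, while the HMM scores whole tag sequences, which is why the same word can receive different tags depending on its neighbours.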

NLTK provides a ready-to-use POS tagging module. nltk.pos_tag(tokens)
takes a list of tokens and returns a (token, tag) pair for each one. Use
a for loop to tag every sentence in the document.

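In [None]:
text = "I love NLP and I will learn NLP in 2 month"

# NLTK for POS tagging
# Import the necessary packages and download the required resources
# (newer NLTK releases may also require 'punkt_tab' and
# 'averaged_perceptron_tagger_eng')
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
stop_words = set(stopwords.words('english'))

# Split the text into sentences
sentences = sent_tokenize(text)

# Tag every sentence in turn
for sentence in sentences:
    # Split the sentence into word tokens
    words = word_tokenize(sentence)
    # Drop stop words. Note: this removes context the tagger relies on
    # and can cause mis-tags (the second 'NLP' is tagged RB in the
    # output below); tagging the full sentence is usually more accurate.
    words = [w for w in words if w not in stop_words]

    # POS tagger: takes a list of tokens, returns (token, tag) pairs
    tags = nltk.pos_tag(words)
    print(tags)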

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[('I', 'PRP'), ('love', 'VBP'), ('NLP', 'NNP'), ('I', 'PRP'), ('learn', 'VBP'), ('NLP', 'RB'), ('2', 'CD'), ('month', 'NN')]


The tags are abbreviations from the Penn Treebank tag set; they are explained below.

**For example: the word “love” is tagged VBP, which means a present-tense verb that is not third person singular.**

• CC coordinating conjunction

• CD cardinal number

• DT determiner

• EX existential there (like: “there is” ... think of it like “there exists”)

• FW foreign word

• IN preposition/subordinating conjunction

• JJ adjective ‘big’

• JJR adjective, comparative ‘bigger’

• JJS adjective, superlative ‘biggest’

• LS list marker 1)  2)

• MD modal could, will

• NN noun, singular ‘desk’

• NNS noun plural ‘desks’

• NNP proper noun, singular ‘Harrison’

• NNPS proper noun, plural ‘Americans’

• PDT predeterminer ‘all the kids’

• POS possessive ending parent’s

• PRP personal pronoun I, he, she

• PRP$ possessive pronoun my, his, hers

• RB adverb very, silently

• RBR adverb, comparative better

• RBS adverb, superlative best

• RP particle give up

• TO the word ‘to’, as in go ‘to’ the store

• UH interjection

• VB verb, base form take

• VBD verb, past tense took

• VBG verb, gerund/present participle taking

• VBN verb, past participle taken

• VBP verb, present tense, non-3rd person singular take

• VBZ verb, 3rd person sing. present takes

• WDT wh-determiner which

• WP wh-pronoun who, what

• WP$ possessive wh-pronoun whose

• WRB wh-adverb where, when
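As a small illustration of how the list above can be used, the sketch below looks up a description for each tag in the tagger output shown earlier. The `TAG_DESCRIPTIONS` table and the `describe` helper are hypothetical names introduced here; the table covers only the tags that occur in that output, with descriptions paraphrased from the list.

```python
# Partial lookup table, paraphrased from the tag list (only the tags needed here)
TAG_DESCRIPTIONS = {
    "PRP": "personal pronoun",
    "VBP": "verb, present tense, non-3rd person singular",
    "NNP": "proper noun, singular",
    "RB": "adverb",
    "CD": "cardinal number",
    "NN": "noun, singular",
}

# Tagger output copied from the POS tagging example
tags = [('I', 'PRP'), ('love', 'VBP'), ('NLP', 'NNP'), ('I', 'PRP'),
        ('learn', 'VBP'), ('NLP', 'RB'), ('2', 'CD'), ('month', 'NN')]

def describe(tagged_tokens, table=TAG_DESCRIPTIONS):
    """Attach a human-readable description to every (word, tag) pair."""
    return [(word, tag, table.get(tag, "unknown tag")) for word, tag in tagged_tokens]

for word, tag, meaning in describe(tags):
    print(f"{word:<6} {tag:<4} {meaning}")
```

Reading the output this way makes mis-tags easy to spot: the second "NLP" is described as an adverb, which is clearly wrong for this sentence.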


Extract Entities from Text
--
This example shows how to identify and extract entities from text, a task called Named Entity Recognition (NER).

Several libraries can perform this task, such as the NLTK chunker, StanfordNER, spaCy, OpenNLP, and NeuroNER. There are also many APIs, such as WatsonNLU, AlchemyAPI, NERD, and Google Cloud NLP API.

Problem
--
You want to identify and extract entities from the text.

Solution
--
The simplest way to do this is with ne_chunk from NLTK, or with spaCy.

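In [None]:
sent = "John is studying at Stanford University in California"

# Import libraries and download the chunker resources
# (the tokenizer and tagger resources 'punkt' and
# 'averaged_perceptron_tagger' were downloaded in the POS example)
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')
from nltk import ne_chunk
from nltk import word_tokenize

# NER: tokenize, POS-tag, then chunk the named entities
ne_chunk(nltk.pos_tag(word_tokenize(sent)), binary=False)
# binary=False (the default) labels each entity with its type,
# such as PERSON, ORGANIZATION or GPE.
# binary=True only marks chunks as named entities (NE), without a type.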

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.



Tree('S', [Tree('PERSON', [('John', 'NNP')]), ('is', 'VBZ'), ('studying', 'VBG'), ('at', 'IN'), Tree('ORGANIZATION', [('Stanford', 'NNP'), ('University', 'NNP')]), ('in', 'IN'), Tree('GPE', [('California', 'NNP')])])

Here **"John"** is tagged as **"PERSON"**,

**"Stanford University"** as **"ORGANIZATION"**, and

**"California"** as **"GPE"** (geopolitical entity, i.e. countries, cities, states).
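To turn the tree into a flat list of entities, one option is to walk its direct children and keep only the labelled subtrees. The sketch below rebuilds the tree from the output above by hand, so it needs the `nltk` package but no model download; `extract_entities` is a hypothetical helper name, not an NLTK function.

```python
from nltk.tree import Tree

# The chunker result shown above, rebuilt by hand
tree = Tree('S', [
    Tree('PERSON', [('John', 'NNP')]),
    ('is', 'VBZ'), ('studying', 'VBG'), ('at', 'IN'),
    Tree('ORGANIZATION', [('Stanford', 'NNP'), ('University', 'NNP')]),
    ('in', 'IN'),
    Tree('GPE', [('California', 'NNP')]),
])

def extract_entities(chunked):
    """Return (entity text, entity label) pairs from an ne_chunk result."""
    entities = []
    for node in chunked:
        # Named entities are subtrees; ordinary words are (word, tag) tuples
        if isinstance(node, Tree):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((text, node.label()))
    return entities

print(extract_entities(tree))
```

The same function can be applied directly to the value returned by `ne_chunk`, since that value is a `Tree` of the same shape.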

Use Cases of NER Models
--
Named Entity Recognition has a wide range of applications in the field of Natural Language Processing and Information Retrieval. 

A few examples are listed below:

1> <font color='green'><b>Automatically Summarizing Resumes</b></font> : One of the key challenges HR departments face is evaluating a huge pile of resumes to shortlist candidates. Resumes are often packed with detail, much of it irrelevant to what the evaluator is looking for. An NER model can extract the relevant entities so that each resume can be evaluated at a glance, reducing the effort required to shortlist candidates.

2> <font color='green'><b>Optimizing Search Engine Algorithms</b></font> : Instead of searching for an entered query across millions of online articles and websites, a more efficient approach is to run an NER model on the articles once and store the entities associated with each of them. The key tags in the search query can then be compared with the tags stored for the articles for a quick and efficient search.

3> <font color='green'><b>Powering Recommender Systems</b></font> : NER can be used in recommender systems that automatically filter content we might be interested in and guide us to related, unvisited content based on our previous behaviour. This can be achieved by extracting the entities associated with the content in our history or previous activity and comparing them with the labels assigned to unseen content.

4> <font color='green'><b>Simplifying Customer Support </b></font>: NER can recognize relevant entities in customer complaints and feedback, such as product specifications and department or company branch details, so that the feedback is classified accordingly and forwarded to the department responsible for the identified product.
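The search-engine idea in the list above (run NER once, store the entities, then match query tags against them) can be sketched as a small inverted index. The article IDs and entity sets below are hypothetical stand-ins for the output of an NER model, and ranking by the number of shared entities is just one simple scoring choice. The same comparison of stored entity sets is the basis of the recommender use case.

```python
from collections import defaultdict

# Hypothetical entities, as if an NER model had been run once per article
ARTICLE_ENTITIES = {
    "article-1": {"Apple", "New York"},
    "article-2": {"Stanford University", "California"},
    "article-3": {"Apple", "California"},
}

def build_entity_index(article_entities):
    """Map each lower-cased entity to the set of articles that mention it."""
    index = defaultdict(set)
    for article_id, entities in article_entities.items():
        for entity in entities:
            index[entity.lower()].add(article_id)
    return index

def search(index, query_entities):
    """Rank articles by how many query entities they share (ties by ID)."""
    scores = defaultdict(int)
    for entity in query_entities:
        for article_id in index.get(entity.lower(), ()):
            scores[article_id] += 1
    return sorted(scores, key=lambda a: (-scores[a], a))

index = build_entity_index(ARTICLE_ENTITIES)
print(search(index, ["Apple", "California"]))
```

Because the index is built once, each query only touches the articles that share at least one entity with it instead of scanning every article.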

Discussion Case Study (20-30 mins)
--
Automatic Resume Summarization using NER

> Each team of 2 participants is expected to read through this article:
https://towardsdatascience.com/a-review-of-named-entity-recognition-ner-using-automatic-summarization-of-resumes-5248a75de175
> and be ready for a Q&A round with the trainer.

Recommended Reading - Home Assignment
--
https://medium.freecodecamp.org/an-introduction-to-part-of-speech-tagging-and-the-hidden-markov-model-953d45338f24

Doing NER Using SpaCy
--


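In [None]:
# NER using spaCy
!pip install spacy
# The 'en' shortcut was removed in spaCy 3; download and load a model by name
!python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm')

# Read/create a sentence
doc = nlp('Apple is ready to launch new phone worth $10000 in New york time square ')
# Print each entity with its character offsets and its label
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)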



Apple 0 5 ORG
10000 42 47 MONEY
New york 51 59 GPE


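According to the output, **`Apple`** is an organization, **`10000`** is money, and **`New york`** is a place (GPE). Note that "time square" is not recognized as part of the location, so the results are useful but should be checked before being relied on in an NLP application.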
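The two numbers printed for each entity are character offsets into the original string, with the end offset exclusive. The sketch below uses only the recorded output above (no spaCy call) to check the offsets by slicing and to mark the entities inline; `highlight` is a hypothetical helper written for this illustration.

```python
text = 'Apple is ready to launch new phone worth $10000 in New york time square '

# (entity text, start_char, end_char, label) rows copied from the recorded output
recorded = [
    ("Apple", 0, 5, "ORG"),
    ("10000", 42, 47, "MONEY"),
    ("New york", 51, 59, "GPE"),
]

def highlight(text, spans):
    """Wrap every entity span in [text](LABEL) markers."""
    parts, cursor = [], 0
    for _, start, end, label in sorted(spans, key=lambda s: s[1]):
        parts.append(text[cursor:start])               # text before the entity
        parts.append(f"[{text[start:end]}]({label})")  # the entity itself
        cursor = end
    parts.append(text[cursor:])                        # remaining text
    return "".join(parts)

# Slicing with the offsets recovers each entity; end_char is exclusive
for ent_text, start, end, label in recorded:
    assert text[start:end] == ent_text

print(highlight(text, recorded))
```

Offsets like these are what make it possible to annotate or link entities in the original text without re-tokenizing it.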