<a href="https://colab.research.google.com/github/AnuragBhattacharjee17/Natural-Language-Processing/blob/master/Tagging_(NLP).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

NLTK provides different taggers that we can train and use in order to tag our unseen text data more efficiently. The taggers are:

- Default tagger
- Lookup taggers:
  - Unigram tagger: context-independent tagging
  - Ngram tagger: context-dependent tagging
- Regular expression tagger

We can also use a combination of these taggers to tag a sentence with the concept of backoff. 
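The backoff idea can be sketched with NLTK's own tagger classes. This is a minimal sketch, assuming a tiny hand-tagged training sentence and a single illustrative regular expression; both are assumptions made for this example and are not taken from a real corpus. Each tagger hands a token to its `backoff` tagger whenever it cannot choose a tag itself.

```python
from nltk import BigramTagger, DefaultTagger, RegexpTagger, UnigramTagger

# Hand-tagged toy training sentence (illustrative data, an assumption of this sketch).
train = [[("the", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
          ("dog", "NN")]]

# Last resort: tag everything as a noun.
default = DefaultTagger("NN")
# Before that: tag words ending in "ed" as past-tense verbs (illustrative pattern).
regexp = RegexpTagger([(r".*ed$", "VBD")], backoff=default)
# Before that: look the word up on its own.
unigram = UnigramTagger(train, backoff=regexp)
# First choice: look the word up together with the previous tag.
bigram = BigramTagger(train, backoff=unigram)

tagged = bigram.tag("the lazy dog was jumped over by the quick brown fox".split())
print(tagged)
```

Because the chain ends in a `DefaultTagger`, no token is left with a `None` tag: unseen words such as "was" fall through to the default, while "jumped" is caught by the regular expression.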

In [0]:
# DEFAULT TAGGER
# The default tagger assigns the same tag to every token, which makes it the most naive tagger.
import nltk
from nltk import word_tokenize
nltk.download("punkt")  # tokenizer models required by word_tokenize
sent = "The gentlemen wants some water to water the plants"
word_token = word_tokenize(sent)
print(word_token)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
['The', 'gentlemen', 'wants', 'some', 'water', 'to', 'water', 'the', 'plants']


In [0]:
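# Picking a tag to use as the default
# Note: max() over a FreqDist compares its keys, so this returns the
# lexicographically largest (word, tag) pair, here the one for 'water',
# rather than the most frequent tag.
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')
tag = pos_tag(word_token)
Feq_tag = max(nltk.FreqDist(tag))
print(Feq_tag[1])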

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
NN
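
To find the tag that really occurs most often, the counts have to be taken over the tags alone. The sketch below uses `collections.Counter`, which `nltk.FreqDist` subclasses, so that it runs without any downloads. The tagged sentence is written by hand for illustration: the NN tags on "water" are consistent with the output above, but the remaining tags are assumptions and may differ from what `pos_tag` returns.

```python
from collections import Counter

# Hand-written (word, tag) pairs; an assumption of this sketch, not pos_tag output.
tagged = [("The", "DT"), ("gentlemen", "NNS"), ("wants", "VBZ"),
          ("some", "DT"), ("water", "NN"), ("to", "TO"),
          ("water", "NN"), ("the", "DT"), ("plants", "NNS")]

# Count the tags only, ignoring the words.
tag_counts = Counter(tag for _, tag in tagged)
most_common_tag, count = tag_counts.most_common(1)[0]

# For comparison: max() over the (word, tag) keys is a lexicographic maximum.
lexicographic_max = max(Counter(tagged))

print(most_common_tag, count)
print(lexicographic_max)
```

With these assumed tags the most frequent tag is DT, while the lexicographic maximum is the pair for "water".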


In [0]:
# Use the tag selected above as the default tag for DefaultTagger
from nltk import DefaultTagger
default_tagger = DefaultTagger(Feq_tag[1])
final = default_tagger.tag(word_token)
print(final)

[('The', 'NN'), ('gentlemen', 'NN'), ('wants', 'NN'), ('some', 'NN'), ('water', 'NN'), ('to', 'NN'), ('water', 'NN'), ('the', 'NN'), ('plants', 'NN')]


**Lookup Tagger**

In [0]:
# Lookup Tagger
# An NgramTagger tags a word based on the word itself and the tags of the previous n-1 words.
from nltk import word_tokenize
from nltk import pos_tag
sent1 = "the quick brown fox jumps over the lazy dog"
training_tags= pos_tag(word_tokenize(sent1))
print(training_tags)

[('the', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]


In [0]:
#Now let us use these tags to train the NgramTagger
ngram_tagger = nltk.NgramTagger(n=2,train=[training_tags])
print(ngram_tagger)

<NgramTagger: size=9>


In [0]:
sent2 = "the lazy dog was jumped over by the quick brown fox"
ngram_tag = ngram_tagger.tag(word_tokenize(sent2))
print(ngram_tag)

[('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('was', None), ('jumped', None), ('over', None), ('by', None), ('the', None), ('quick', None), ('brown', None), ('fox', None)]


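With n=2, the NgramTagger looks up each word together with the tag assigned to the previous word, and uses the matching entry from the training data to tag the new data.

In the training data, "the" starts the sentence, "lazy" follows a word tagged DT, and "dog" follows a word tagged JJ, so the tagger is able to tag the words "the", "lazy" and "dog".

When it reaches "was", the context (NN, was) was never present in the training data, so it assigns None to the word "was". Every word after it then has None as its previous tag, a context that never occurs in the training data, so all succeeding words are tagged None as well, even words such as "the" and "quick" that appear in the training sentence.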
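
The lookup described above can be imitated in a few lines of plain Python. This is a simplified sketch of the mechanism, not NLTK's implementation: the `START` marker and the helper names `train_table` and `tag_words` are hypothetical, and the table keeps the first tag seen for a context instead of the most frequent one.

```python
START = "<s>"  # hypothetical marker for "no previous tag" at the start of a sentence

def train_table(tagged_sentences):
    """Map (previous tag, word) to the tag seen for that context."""
    table = {}
    for sentence in tagged_sentences:
        prev_tag = START
        for word, tag in sentence:
            table.setdefault((prev_tag, word), tag)  # keep the first tag seen
            prev_tag = tag
    return table

def tag_words(table, words):
    """Tag words left to right; an unseen context yields None."""
    result, prev_tag = [], START
    for word in words:
        tag = table.get((prev_tag, word))
        result.append((word, tag))
        prev_tag = tag  # a None here makes every later context unseen
    return result

training_tags = [("the", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
                 ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
                 ("dog", "NN")]
table = train_table([training_tags])
result = tag_words(table, "the lazy dog was jumped over by the quick brown fox".split())
print(len(table))
print(result)
```

The table holds nine contexts, and the first `None` at "was" propagates to every later word because `(None, word)` is never a key in the table.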

**Unigram Tagger**

UnigramTagger is a special case of NgramTagger where n=1. When n=1, the NgramTagger has no context, i.e. each word is looked up independently in the training set. Therefore the UnigramTagger is also referred to as the context-independent tagger.

The UnigramTagger looks up the query word in the training data and assigns the most common tag associated with it.
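
A small example makes the "most common tag" rule concrete. This is a minimal sketch, assuming a hand-tagged toy training set (illustrative data, not taken from a corpus) in which "water" appears twice as a noun and once as a verb.

```python
from nltk import UnigramTagger

# Hand-tagged toy sentences; an assumption of this sketch.
train = [[("water", "NN"), ("the", "DT"), ("plants", "NNS")],
         [("some", "DT"), ("water", "NN")],
         [("to", "TO"), ("water", "VB")]]

tagger = UnigramTagger(train)

# "water" gets its most frequent training tag; "garden" was never seen.
unigram_result = tagger.tag(["water", "the", "garden"])
print(unigram_result)
# [('water', 'NN'), ('the', 'DT'), ('garden', None)]
```

Without a backoff tagger, any word that is absent from the training data is tagged `None`.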

In [0]:
barack = """Barack Hussein Obama II (born August 4, 1961) is an American politician
who served as the 44th President of 
the United States from January 20, 2009, to January 20, 2017.
A member of the Democratic Party, he was the 
first African American to assume the presidency and previously
served as a United States Senator from Illinois (2005–2008)."""

bush = """George Walker Bush (born July 6, 1946) is an American politician who served as the 43rd President
 of the United States from 2001 to 2009.
He had previously served as the 46th Governor of Texas from 1995 to 2000.
Bush was born New Haven, Connecticut, and grew up in Texas. 
After graduating from Yale University in 1968 and Harvard Business School in 1975, he worked in the oil industry.
Bush married Laura Welch in 1977 and unsuccessfully ran for the U.S. House of Representatives shortly thereafter. 
He later co-owned the Texas Rangers baseball team before defeating Ann Richards in the 1994 Texas gubernatorial election. 
Bush was elected President of the United States in 2000 when he defeated Democratic incumbent 
Vice President Al Gore after a close and controversial win that involved a stopped recount in Florida. 
He became the fourth person to be elected president while receiving fewer popular votes than his opponent.
Bush is a member of a prominent political family and is the eldest son of Barbara and George H. W. Bush, 
the 41st President of the United States. 
He is only the second president to assume the nation's highest office after his father, following the footsteps
 of John Adams and his son, John Quincy Adams.
His brother, Jeb Bush, a former Governor of Florida, was a candidate for the Republican presidential nomination
 in the 2016 presidential election. 
His paternal grandfather, Prescott Bush, was a U.S. Senator from Connecticut."""

pos_tag_barack = pos_tag(word_tokenize(barack))
pos_tag_bush = pos_tag(word_tokenize(bush))

In [0]:
trump = """Donald John Trump (born June 14, 1946) is the 45th and current President of the United States.
Before entering politics, he was a businessman and television personality. 
Trump was born and raised in the New York City borough of Queens, and received an economics degree from the
 Wharton School of the University of Pennsylvania. 
He took charge of his family's real estate business in 1971, renamed it The Trump Organization, and expanded 
it from Queens and Brooklyn into Manhattan. 
The company built or renovated skyscrapers, hotels, casinos, and golf courses. 
Trump later started various side ventures, including licensing his name for real estate and consumer products.
He managed the company until his 2017 inauguration. 
He co-authored several books, including The Art of the Deal. He owned the Miss Universe and Miss USA beauty 
pageants from 1996 to 2015, and he produced and hosted the reality television show The Apprentice from 2003 to 2015.
Forbes estimates his net worth to be $3.1 billion."""

unigram_tagger = nltk.UnigramTagger(train=[pos_tag_barack,pos_tag_bush])
unigram_tag = unigram_tagger.tag(word_tokenize(trump))
print(unigram_tag)

[('Donald', None), ('John', 'NNP'), ('Trump', None), ('(', '('), ('born', 'VBN'), ('June', None), ('14', None), (',', ','), ('1946', 'CD'), (')', ')'), ('is', 'VBZ'), ('the', 'DT'), ('45th', None), ('and', 'CC'), ('current', None), ('President', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), ('.', '.'), ('Before', None), ('entering', None), ('politics', None), (',', ','), ('he', 'PRP'), ('was', 'VBD'), ('a', 'DT'), ('businessman', None), ('and', 'CC'), ('television', None), ('personality', None), ('.', '.'), ('Trump', None), ('was', 'VBD'), ('born', 'VBN'), ('and', 'CC'), ('raised', None), ('in', 'IN'), ('the', 'DT'), ('New', 'NNP'), ('York', None), ('City', None), ('borough', None), ('of', 'IN'), ('Queens', None), (',', ','), ('and', 'CC'), ('received', None), ('an', 'DT'), ('economics', None), ('degree', None), ('from', 'IN'), ('the', 'DT'), ('Wharton', None), ('School', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('University', 'NNP'), ('of', 'IN'), ('Pennsylva