<img src="../Pics/MLSb-T.png" width="160">
<br><br>
<center><u><H1>Part of Speech Tagging</H1></u></center>

<p><H3> Requirement: JAVA jdk installation and JAVA_HOME variable</H3>
    
1) Download JAVA jdk from :
    https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html 
    
2) Install in: C:\Program Files\jre1.8.0_181    
    
3) Create a Java Home environment variable consigning the route of jdk file

4) Create a Java Home variable

    System Variable:

    Variable name: JAVA_HOME
    Variable value: C:\Program Files\Java\jdk1.8.0_181

5) Update System Path:

    -Edit

    -New

    -%JAVA_HOME%\bin

6) Test your installation:

    Open cmd 
    echo %JAVA_HOME%

    javac -version

7) Reboot your system</p>

In [2]:
import nltk

In [2]:
from nltk import word_tokenize
from nltk import pos_tag

In [3]:
string = "I was watching movies"

In [4]:
print(pos_tag(word_tokenize(string)))

[('I', 'PRP'), ('was', 'VBD'), ('watching', 'VBG'), ('movies', 'NNS')]


In [5]:
# THESE ARE THE TAGES FROM THE 36 PENN TREE TAGGERS BY PENSYLAVANIA UNIVERSITY
# PRP: Personal pronoun
# VBD: Verb, past tense
# VBG: Veb, gerund
# NNS: Noun plural

In [6]:
#Retrieving all nouns
s = 'My favourite scientist is Carl Sagan'
tagged = pos_tag(word_tokenize(s))

In [7]:
# Filtering only NN and NNP words form the sentence
allnoun = [word for word, pos in tagged if pos in ['NN','NNP']]
allnoun

['scientist', 'Carl', 'Sagan']

## Stanford tagger

In [8]:
from nltk.tag.stanford import StanfordPOSTagger

In [9]:
jar = '../Resources/stanford-postagger/stanford-postagger.jar'
model = '../Resources/stanford-postagger/models/english-bidirectional-distsim.tagger'
pos_tagger = StanfordPOSTagger(model, jar)
pos_tagger.tag('The life is beautiful'.split())

[('The', 'DT'), ('life', 'NN'), ('is', 'VBZ'), ('beautiful', 'JJ')]

In [10]:
st = StanfordPOSTagger('../Resources/stanford-postagger-full/models/english-bidirectional-distsim.tagger','../Resources/stanford-postagger-full/stanford-postagger.jar')
st.tag('What is the airspeed of an unladen swallow ?'.split())

[('What', 'WP'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('airspeed', 'NN'),
 ('of', 'IN'),
 ('an', 'DT'),
 ('unladen', 'JJ'),
 ('swallow', 'VB'),
 ('?', '.')]

### The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on

In [4]:
from nltk.corpus import brown

In [3]:
nltk.download('brown')

[nltk_data] Downloading package brown to C:\Users\Sai Charan
[nltk_data]     Reddy\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\brown.zip.


True

In [5]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [15]:
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]

In [16]:
import operator
freq = nltk.FreqDist(tags)
tags_freq = sorted(freq.items(), key=operator.itemgetter(1))
tags_freq[-10:]

[('VBD', 2524),
 ('CC', 2664),
 ('JJ', 4392),
 ('.', 4452),
 ('NNS', 5066),
 (',', 5133),
 ('NP', 6866),
 ('AT', 8893),
 ('IN', 10616),
 ('NN', 13162)]

## Default Tagger:

In [17]:
brown_tagged_sents = brown.tagged_sents(categories='news')
default_tagger = nltk.DefaultTagger('NN')
print(default_tagger.evaluate(brown_tagged_sents))

0.13089484257215028


## N-gram Tagger:

### N-gram tagger takes previous n words in the context, to predict the POS tag for the given token.

In [18]:
from nltk.tag import UnigramTagger
from nltk.tag import DefaultTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger

In [19]:
# splitting the data into train and test datasets # train = 90%(0.9) and test 10%((0.9):) data ()
train_data = brown_tagged_sents[:int(len(brown_tagged_sents) * 0.9)]
test_data = brown_tagged_sents[int(len(brown_tagged_sents) * 0.9):] 

In [20]:
#unigram considers the conditional frequency of tags and predicts the most
#frequent tag for the every given token.
unigram_tagger = UnigramTagger(train_data, backoff=default_tagger)
print(unigram_tagger.evaluate(test_data))

0.8361407355726104


In [21]:
#bigram consider the tags of the given word and previous word, and tag as
#tuple to get the given tag for the test word.
bigram_tagger = BigramTagger(train_data, backoff=unigram_tagger)
print(bigram_tagger.evaluate(test_data))

0.8452108043456593


In [22]:
#trigram looks for the previous two words with the similar process.
trigram_tagger = TrigramTagger(train_data, backoff=bigram_tagger)
print(trigram_tagger.evaluate(test_data))

0.843317053722715


In [23]:
#We are combining the three taggers. First it will look for the Trigram
#of the given word sequence for predicting the tag, if not found it Backoff 
#to BigramTagger parameter and to a UnigramTagger and in the end to a NN tag.

## Regex tagger

In [24]:
from nltk.tag.sequential import RegexpTagger

In [25]:
regexp_tagger = RegexpTagger(
[(r'^-?[0-9]+(.[0-9]+)?$', 'CD'), #cardinal numbers
 (r'(The|the|A|a|An|an)$', 'AT'), #articles
 (r'.*able$', 'JJ'), #adjectives
 (r'.*ness$', 'NN'), #nouns formed from adj
 (r'.*ly$', 'RB'), #adverbs
 (r'.*s$', 'NNS'), #plural nouns
 (r'.*ing$', 'VRG'), #gerunds
 (r'.*ed$', 'VBD'), #past tense verbs
 (r'.*', 'NN') # nouns (default)
])

In [26]:
print(regexp_tagger.evaluate(test_data))

0.2999102960231237


## Named Entity Recognition (NER)

In [27]:
from nltk import ne_chunk

In [28]:
# the ne_chunk method recognizes people(names), places(location),
#and organizations.

In [29]:
text = "Stephen Hawking teach maths at the Oxford University in England"

In [34]:
#nltk.download('maxent_ne_chunker')
#nltk.download('words')

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\map25\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.


True

In [35]:
print(ne_chunk(nltk.pos_tag(word_tokenize(text)), binary=False))

(S
  (PERSON Stephen/NNP)
  Hawking/NNP
  teach/VB
  maths/NNS
  at/IN
  the/DT
  (ORGANIZATION Oxford/NNP University/NNP)
  in/IN
  (GPE England/NNP))


In [36]:
# if bynary parameter is True it provides the output for the entire
# sentence tree and tags everything.
print(ne_chunk(nltk.pos_tag(word_tokenize(text)), binary=True))

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\map25\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
(S
  (NE Stephen/NNP)
  Hawking/NNP
  teach/VB
  maths/NNS
  at/IN
  the/DT
  (NE Oxford/NNP University/NNP)
  in/IN
  (NE England/NNP))


## Stanford NER

In [37]:
from nltk.tag.stanford import StanfordNERTagger

In [38]:
jar_ner = '../Resources/stanford-ner/stanford-ner.jar'
model_ner = '../Resources/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz'
st_ner = StanfordNERTagger(model_ner, jar_ner)
st_ner.tag('Carl Sagan taught at the Cornell University in USA'.split())

[('Carl', 'PERSON'),
 ('Sagan', 'PERSON'),
 ('taught', 'O'),
 ('at', 'O'),
 ('the', 'O'),
 ('Cornell', 'ORGANIZATION'),
 ('University', 'ORGANIZATION'),
 ('in', 'O'),
 ('USA', 'LOCATION')]

## References: 

https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

http://www.nltk.org/book/ch02.html

https://nlp.stanford.edu/software/

http://www.nltk.org/api/nltk.tag.html#module-nltk.tag.stanford

https://en.wikipedia.org/wiki/Brown_Corpus

https://nlp.stanford.edu/software/CRF-NER.html