<a href="https://colab.research.google.com/gist/Melvinchen0404/c4b91cbaa33e4ac35a26a20f6165acc1/pos_tagging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## NLP Technique 2: POS Tagging
**Part-of-speech (POS) tagging** (or simply **tagging**) is the process of classifying words into their **parts of speech** and labelling them accordingly. \
**Parts of speech** are also known as **word classes** or **lexical categories**. \
Each word receives a **part-of-speech (POS) tag** that describes its syntactic category. \
The **tokenized words** can be tagged as nouns, verbs, adjectives, adverbs, prepositions, determiners, conjunctions, pronouns, etc. \
Here is an online **POS tagging** resource: https://parts-of-speech.info/  
 
\
**STEP 1:** Import the `pos_tag` function from the `nltk` package. \
**STEP 2:** Download the `averaged_perceptron_tagger` model, which is used to tag the **tokenized words** with their **parts of speech**.


In [3]:
import nltk
from nltk import pos_tag

# Tokenizer models required by nltk.word_tokenize
nltk.download('punkt')

# ANSI escape codes for printing bold text
class color:
    BOLD = '\033[1m'
    END = '\033[0m'

text = "Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed. A yellow dressinggown, ungirdled, was sustained gently behind him on the mild morning air."
# Split the text into word and punctuation tokens
tokens1 = nltk.word_tokenize(text)
# Keep only alphanumeric tokens, which drops the punctuation
tokens2 = [word for word in tokens1 if word.isalnum()]
# Model used by pos_tag
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True
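
The `isalnum()` filter in the cell above decides which tokens reach the tagger. The sketch below uses a short hand-written token list (an assumption standing in for `word_tokenize` output, so it runs without `nltk`) to show what the filter keeps and what it drops. Besides punctuation, it also removes any token that contains a hyphen or an apostrophe.

```python
# Hand-written tokens standing in for word_tokenize output (illustrative assumption).
sample_tokens = ["Stately", ",", "plump", "Buck", "Mulligan", ".", "well-known", "n't", "1904"]

# str.isalnum() is True only when every character is a letter or a digit.
kept = [tok for tok in sample_tokens if tok.isalnum()]
dropped = [tok for tok in sample_tokens if not tok.isalnum()]

print(kept)
print(dropped)
```

If hyphenated words or contractions matter for your text, a filter that only excludes pure punctuation tokens may be a better fit.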

**STEP 3:** Apply the `pos_tag` function to the **tokenized words** to get a **POS tag** for each **tokenized word**. The result is a list of `(word, tag)` pairs.  

In [4]:
nltk.pos_tag(tokens2)

[('Stately', 'RB'),
 ('plump', 'JJ'),
 ('Buck', 'NNP'),
 ('Mulligan', 'NNP'),
 ('came', 'VBD'),
 ('from', 'IN'),
 ('the', 'DT'),
 ('stairhead', 'NN'),
 ('bearing', 'VBG'),
 ('a', 'DT'),
 ('bowl', 'NN'),
 ('of', 'IN'),
 ('lather', 'NN'),
 ('on', 'IN'),
 ('which', 'WDT'),
 ('a', 'DT'),
 ('mirror', 'NN'),
 ('and', 'CC'),
 ('a', 'DT'),
 ('razor', 'NN'),
 ('lay', 'NN'),
 ('crossed', 'VBD'),
 ('A', 'NNP'),
 ('yellow', 'JJ'),
 ('dressinggown', 'NN'),
 ('ungirdled', 'VBD'),
 ('was', 'VBD'),
 ('sustained', 'VBN'),
 ('gently', 'RB'),
 ('behind', 'IN'),
 ('him', 'PRP'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('mild', 'JJ'),
 ('morning', 'NN'),
 ('air', 'NN')]
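
Because the tagger returns a plain list of `(word, tag)` tuples, standard Python tools can summarise it. The following is a minimal sketch that uses the first eight pairs copied from the output above; the names `tagged_sample`, `tag_counts` and `words_by_tag` are illustrative only.

```python
from collections import Counter, defaultdict

# First eight (word, tag) pairs copied from the output above.
tagged_sample = [
    ('Stately', 'RB'), ('plump', 'JJ'), ('Buck', 'NNP'), ('Mulligan', 'NNP'),
    ('came', 'VBD'), ('from', 'IN'), ('the', 'DT'), ('stairhead', 'NN'),
]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged_sample)

# Group the words under their tag.
words_by_tag = defaultdict(list)
for word, tag in tagged_sample:
    words_by_tag[tag].append(word)

print(tag_counts.most_common(1))
print(words_by_tag['NNP'])
```

The same two steps should work unchanged on the full list returned by `pos_tag`.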

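**STEP 4:** The `nltk` package provides documentation for each **POS tag**, so you can look up what a tag means. \
Enter a tag abbreviation (for example `NN`) in the form below. If the placeholder text is left unchanged, the query prints `No matching tags found.`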

In [5]:
#@markdown **POS tag query**
#@markdown ---
POS_tag = "Key in abbreviation for POS tag to be queried here"#@param {type:"string"}


In [6]:
nltk.download('tagsets')
nltk.help.upenn_tagset(POS_tag)

No matching tags found.


[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.


**STEP 5:** If you wish, you can view only a restricted number of tagged words. Suppose you wish to see only the first *n* results. You can slice the tagged list with `tagged[0:n]`. Note that this tags all of the **tokenized words** first and then keeps the first *n* `(word, tag)` pairs.

In [7]:
tagged = nltk.pos_tag(tokens2)
tagged[0:8]

[('Stately', 'RB'),
 ('plump', 'JJ'),
 ('Buck', 'NNP'),
 ('Mulligan', 'NNP'),
 ('came', 'VBD'),
 ('from', 'IN'),
 ('the', 'DT'),
 ('stairhead', 'NN')]
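
Slicing happens after tagging, so it only limits what is displayed. The following is a minimal sketch with a hand-copied list so that it runs without `nltk`; `first_n` is a hypothetical helper, not part of the library.

```python
# (word, tag) pairs copied from the output above.
tagged = [
    ('Stately', 'RB'), ('plump', 'JJ'), ('Buck', 'NNP'), ('Mulligan', 'NNP'),
    ('came', 'VBD'), ('from', 'IN'), ('the', 'DT'), ('stairhead', 'NN'),
]

def first_n(pairs, n):
    """Return the first n (word, tag) pairs; slicing does not raise if n exceeds the length."""
    return pairs[:n]

print(first_n(tagged, 3))
print(len(first_n(tagged, 100)))
```

To tag fewer words in the first place, you could slice the token list before calling `pos_tag`, although the tags may then differ because the tagger sees less context.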