**Objective: Performing POS (part-of-speech) tagging on textual sequences**

# Import Libraries

In [34]:
import nltk.data 

The code imports the nltk.data module, which is part of the Natural Language Toolkit (NLTK) library. This module allows access to various data resources (e.g., tokenizers, corpora) that NLTK uses. It enables loading and using pre-trained models or other resources for natural language processing tasks.

In [35]:
print(nltk.data.path)

['C:\\Users\\gkris/nltk_data', 'C:\\Program Files\\WindowsApps\\PythonSoftwareFoundation.Python.3.12_3.12.2032.0_x64__qbz5n2kfra8p0\\nltk_data', 'C:\\Program Files\\WindowsApps\\PythonSoftwareFoundation.Python.3.12_3.12.2032.0_x64__qbz5n2kfra8p0\\share\\nltk_data', 'C:\\Program Files\\WindowsApps\\PythonSoftwareFoundation.Python.3.12_3.12.2032.0_x64__qbz5n2kfra8p0\\lib\\nltk_data', 'C:\\Users\\gkris\\AppData\\Roaming\\nltk_data', 'C:\\nltk_data', 'D:\\nltk_data', 'E:\\nltk_data']
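NLTK searches the directories in `nltk.data.path` in order when it loads a resource. As a minimal sketch, assuming you want to keep resources in a custom location, an extra directory can be appended to that list. The directory name `my_nltk_data` is hypothetical and used only for illustration.

```python
import os
import nltk.data

# Hypothetical directory name, used only for illustration.
custom_dir = os.path.join(os.path.expanduser("~"), "my_nltk_data")

# nltk.data.path is a plain Python list of directories searched in order,
# so appending adds a lowest-priority search location.
if custom_dir not in nltk.data.path:
    nltk.data.path.append(custom_dir)

print(nltk.data.path[-1])
```

Appending rather than inserting at the front keeps the default locations ahead of the custom one in the search order.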


# Import Corpus Reader

In [36]:
from nltk.corpus.reader import TaggedCorpusReader

The code imports the `TaggedCorpusReader` class from the `nltk.corpus.reader` module, which is part of the NLTK library. This class is used to read and process tagged corpora, where each word is associated with a part-of-speech tag. It enables efficient access and manipulation of tagged linguistic data for tasks like part-of-speech tagging and syntactic analysis.

In [37]:
reader = TaggedCorpusReader('.', r'.*\.pos')

In [38]:
reader.words

<bound method TaggedCorpusReader.words of <TaggedCorpusReader in 'e:\\NLP'>>

In [None]:
reader.tagged_words

<bound method TaggedCorpusReader.tagged_words of <TaggedCorpusReader in 'e:\\NLP'>>
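Referencing `reader.words` or `reader.tagged_words` without parentheses only displays the bound method, as in the outputs above. To get the actual tokens, the methods have to be called. The sketch below illustrates this on a throwaway corpus; the file name `sample.pos` and its one-line content are assumptions made for the example, written in the `word/TAG` format that `TaggedCorpusReader` reads by default.

```python
import os
import tempfile
from nltk.corpus.reader import TaggedCorpusReader

with tempfile.TemporaryDirectory() as corpus_dir:
    # Hypothetical one-sentence corpus in word/TAG format.
    with open(os.path.join(corpus_dir, "sample.pos"), "w", encoding="utf-8") as f:
        f.write("The/DT cat/NN sat/VBD ./.\n")

    sample_reader = TaggedCorpusReader(corpus_dir, r".*\.pos")

    # Note the parentheses: these calls return the corpus contents.
    words = list(sample_reader.words())
    tagged = list(sample_reader.tagged_words())

print(words)   # ['The', 'cat', 'sat', '.']
print(tagged)  # [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('.', '.')]
```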

# Applying Word Tokenization

In [40]:
import nltk

In [41]:
from nltk.tokenize import word_tokenize

Word tokenization splits a sentence into individual word tokens while handling punctuation and other language-specific details.
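To see how punctuation and contractions are separated, here is a small sketch using `TreebankWordTokenizer`. It is a stand-in chosen because it is rule-based and needs no downloaded model; it is not identical to `word_tokenize`, which also performs sentence splitting first. The example sentence is made up for illustration.

```python
from nltk.tokenize import TreebankWordTokenizer

# Rule-based tokenizer that works on a single sentence without extra data.
tokenizer = TreebankWordTokenizer()

tokens = tokenizer.tokenize("Don't stop, it's fine.")

# Contractions are split ("Do" + "n't") and punctuation becomes its own token.
print(tokens)
```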

In [57]:
import nltk

nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\gkris\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

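The `punkt_tab` package provides the pre-trained Punkt sentence tokenizer models. `word_tokenize` relies on them to split text into sentences before breaking each sentence into words.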
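Calling `nltk.download` in every run is harmless but noisy. One possible pattern, sketched here under the assumption that the resource lives at the path `tokenizers/punkt_tab`, is to check for it with `nltk.data.find` and download only when it is missing. The helper name `ensure_resource` is hypothetical.

```python
import nltk


def ensure_resource(resource_path, package):
    """Return True if the resource is available, downloading it if needed.

    `ensure_resource` is a hypothetical helper written for this example.
    """
    try:
        # Raises LookupError when the resource is not on nltk.data.path.
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # nltk.download returns True on success and False on failure.
        return nltk.download(package, quiet=True)
```

A call such as `ensure_resource("tokenizers/punkt_tab", "punkt_tab")` would then replace the bare download; the same idea should apply to the tagger model under `taggers/averaged_perceptron_tagger_eng`.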

# Implementing POS tagging

In [63]:
import nltk

nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\gkris\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

This resource is the averaged perceptron tagger model for English, used for part-of-speech (POS) tagging. It assigns grammatical categories such as noun, verb, and adjective to each word in a given text.

# Implementing POS tagging on queries

In [66]:
l1 = word_tokenize("The Python lives in a very dense amazon forest")

In [67]:
nltk.pos_tag(l1)

[('The', 'DT'),
 ('Python', 'NNP'),
 ('lives', 'VBZ'),
 ('in', 'IN'),
 ('a', 'DT'),
 ('very', 'RB'),
 ('dense', 'JJ'),
 ('amazon', 'NN'),
 ('forest', 'NN')]
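The tags in this output come from the Penn Treebank tag set. As a reading aid, the sketch below pairs each tagged word from the result above with a short description. The `TAG_MEANINGS` dictionary is a hand-written glossary covering only the tags seen in this sentence, not part of NLTK.

```python
# Tagged output copied from the cell above.
tagged = [
    ("The", "DT"), ("Python", "NNP"), ("lives", "VBZ"), ("in", "IN"),
    ("a", "DT"), ("very", "RB"), ("dense", "JJ"), ("amazon", "NN"),
    ("forest", "NN"),
]

# Hand-written glossary for the Penn Treebank tags that appear above.
TAG_MEANINGS = {
    "DT": "determiner",
    "NNP": "proper noun, singular",
    "VBZ": "verb, 3rd person singular present",
    "IN": "preposition or subordinating conjunction",
    "RB": "adverb",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
}

for word, tag in tagged:
    meaning = TAG_MEANINGS.get(tag, "not in this small glossary")
    print(f"{word:<8}{tag:<5}{meaning}")
```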

In [69]:
l2 = word_tokenize("Lemmatization is a more sophisticated approach to reducing words to their base form (lemma), where words are converted to their correct dictionary form by considering the word’s meaning and context. Unlike stemming, lemmatization uses a vocabulary and morphological analysis of the word.")
nltk.pos_tag(l2)

[('Lemmatization', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('more', 'RBR'),
 ('sophisticated', 'JJ'),
 ('approach', 'NN'),
 ('to', 'TO'),
 ('reducing', 'VBG'),
 ('words', 'NNS'),
 ('to', 'TO'),
 ('their', 'PRP$'),
 ('base', 'NN'),
 ('form', 'NN'),
 ('(', '('),
 ('lemma', 'NN'),
 (')', ')'),
 (',', ','),
 ('where', 'WRB'),
 ('words', 'NNS'),
 ('are', 'VBP'),
 ('converted', 'VBN'),
 ('to', 'TO'),
 ('their', 'PRP$'),
 ('correct', 'JJ'),
 ('dictionary', 'JJ'),
 ('form', 'NN'),
 ('by', 'IN'),
 ('considering', 'VBG'),
 ('the', 'DT'),
 ('word', 'NN'),
 ('’', 'NNP'),
 ('s', 'NN'),
 ('meaning', 'NN'),
 ('and', 'CC'),
 ('context', 'NN'),
 ('.', '.'),
 ('Unlike', 'IN'),
 ('stemming', 'VBG'),
 (',', ','),
 ('lemmatization', 'NN'),
 ('uses', 'VBZ'),
 ('a', 'DT'),
 ('vocabulary', 'JJ'),
 ('and', 'CC'),
 ('morphological', 'JJ'),
 ('analysis', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('word', 'NN'),
 ('.', '.')]

In [73]:
l3 = word_tokenize("Pushpa 2: The Rule, starring Allu Arjun, has made a blazing start at the North American box office, surpassing the $3 million mark with its premiere shows. The film, which had already generated massive buzz in India, is now challenging the records of the biggest Indian films in the region.")
nltk.pos_tag(l3)

[('Pushpa', '$'),
 ('2', 'CD'),
 (':', ':'),
 ('The', 'DT'),
 ('Rule', 'NNP'),
 (',', ','),
 ('starring', 'VBG'),
 ('Allu', 'NNP'),
 ('Arjun', 'NNP'),
 (',', ','),
 ('has', 'VBZ'),
 ('made', 'VBN'),
 ('a', 'DT'),
 ('blazing', 'JJ'),
 ('start', 'NN'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('North', 'JJ'),
 ('American', 'NNP'),
 ('box', 'NN'),
 ('office', 'NN'),
 (',', ','),
 ('surpassing', 'VBG'),
 ('the', 'DT'),
 ('$', '$'),
 ('3', 'CD'),
 ('million', 'CD'),
 ('mark', 'NN'),
 ('with', 'IN'),
 ('its', 'PRP$'),
 ('premiere', 'NN'),
 ('shows', 'NNS'),
 ('.', '.'),
 ('The', 'DT'),
 ('film', 'NN'),
 (',', ','),
 ('which', 'WDT'),
 ('had', 'VBD'),
 ('already', 'RB'),
 ('generated', 'VBN'),
 ('massive', 'JJ'),
 ('buzz', 'NN'),
 ('in', 'IN'),
 ('India', 'NNP'),
 (',', ','),
 ('is', 'VBZ'),
 ('now', 'RB'),
 ('challenging', 'VBG'),
 ('the', 'DT'),
 ('records', 'NNS'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('biggest', 'JJS'),
 ('Indian', 'JJ'),
 ('films', 'NNS'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('region', 'N

In [75]:
l4 = word_tokenize("A heavy crowd rushed ahead and the woman and her son, who were trying to enter inside the theatre, suffocated and fell unconscious apparently after being pushed by the crowd, police said, based on preliminary investigation.")
nltk.pos_tag(l4)

[('A', 'DT'),
 ('heavy', 'JJ'),
 ('crowd', 'NN'),
 ('rushed', 'VBN'),
 ('ahead', 'RB'),
 ('and', 'CC'),
 ('the', 'DT'),
 ('woman', 'NN'),
 ('and', 'CC'),
 ('her', 'PRP$'),
 ('son', 'NN'),
 (',', ','),
 ('who', 'WP'),
 ('were', 'VBD'),
 ('trying', 'VBG'),
 ('to', 'TO'),
 ('enter', 'VB'),
 ('inside', 'IN'),
 ('the', 'DT'),
 ('theatre', 'NN'),
 (',', ','),
 ('suffocated', 'VBN'),
 ('and', 'CC'),
 ('fell', 'VBD'),
 ('unconscious', 'JJ'),
 ('apparently', 'RB'),
 ('after', 'IN'),
 ('being', 'VBG'),
 ('pushed', 'VBN'),
 ('by', 'IN'),
 ('the', 'DT'),
 ('crowd', 'NN'),
 (',', ','),
 ('police', 'NN'),
 ('said', 'VBD'),
 (',', ','),
 ('based', 'VBN'),
 ('on', 'IN'),
 ('preliminary', 'JJ'),
 ('investigation', 'NN'),
 ('.', '.')]

In [76]:
l5 = word_tokenize("The hostname or IP address where the MySQL server is located.")
nltk.pos_tag(l5)

[('The', 'DT'),
 ('hostname', 'NN'),
 ('or', 'CC'),
 ('IP', 'NNP'),
 ('address', 'NN'),
 ('where', 'WRB'),
 ('the', 'DT'),
 ('MySQL', 'NNP'),
 ('server', 'NN'),
 ('is', 'VBZ'),
 ('located', 'VBN'),
 ('.', '.')]
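Tagged output like this is usually filtered in a later step. As a minimal sketch, assuming the goal is to pull out the nouns, the list can be filtered on tags that start with `NN`, which covers `NN`, `NNS`, `NNP`, and `NNPS`. The variable name `tagged_l5` is introduced here for illustration.

```python
# Tagged output copied from the cell above.
tagged_l5 = [
    ("The", "DT"), ("hostname", "NN"), ("or", "CC"), ("IP", "NNP"),
    ("address", "NN"), ("where", "WRB"), ("the", "DT"), ("MySQL", "NNP"),
    ("server", "NN"), ("is", "VBZ"), ("located", "VBN"), (".", "."),
]

# All Penn Treebank noun tags share the "NN" prefix.
nouns = [word for word, tag in tagged_l5 if tag.startswith("NN")]

print(nouns)  # ['hostname', 'IP', 'address', 'MySQL', 'server']
```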