### POS Tagging

1. In NLTK, pos_tag is the function that performs Part-of-Speech (POS) tagging.
2. POS tagging labels each word in a sentence with its grammatical category, such as noun, verb, or adjective.
3. It is a common step in many NLP applications, including text analysis, information retrieval, and machine translation.
4. pos_tag takes a list of tokens and assigns each one a tag based on the word and its syntactic role in the sentence.
5. The tags indicate how each word behaves in the sentence and how it relates to the words around it.
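To make the tag codes in the outputs easier to read, here is a small sketch that attaches a plain-English description to each tag. It assumes the tags follow the Penn Treebank tagset; the `PENN_TAGS` dictionary covers only a handful of tags, and `describe` and the hand-written `sample` list are illustrative names, not part of NLTK.

```python
# A few Penn Treebank tags (assumed tagset; this table is deliberately partial).
PENN_TAGS = {
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "VBZ": "verb, 3rd person singular present",
    "VBD": "verb, past tense",
    "VBN": "verb, past participle",
    "JJ": "adjective",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
}

# Hand-written sample in the (word, tag) format that a tagger returns.
sample = [
    ("English", "NNP"),
    ("is", "VBZ"),
    ("a", "DT"),
    ("West", "NNP"),
    ("Germanic", "NNP"),
    ("language", "NN"),
]


def describe(tagged_words):
    """Return (word, tag, description) triples; unlisted tags get a placeholder."""
    return [(word, tag, PENN_TAGS.get(tag, "unknown tag")) for word, tag in tagged_words]


for word, tag, meaning in describe(sample):
    print(f"{word:10} {tag:4} {meaning}")
```

Tags missing from the table are reported as "unknown tag" rather than raising an error, so the helper can be run on any tagged list.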

In [1]:
#imports

from nltk.tokenize import word_tokenize
from nltk import pos_tag

In [2]:
text = 'English is a West Germanic language in the Indo-European language family, whose speakers, called Anglophones, originated in early medieval England. The namesake of the language is the Angles, one of the ancient Germanic peoples that migrated to the island of Great Britain. Modern English is both the most spoken language in the world and the third-most spoken native language, after Mandarin Chinese and Spanish. It is also the most widely learned second language in the world, with more second-language speakers than native speakers.'

In [3]:
text_tag = pos_tag(word_tokenize(text))   # pos_tag takes a list of tokens and returns a list of (word, tag) tuples

text_tag

[('English', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('West', 'NNP'),
 ('Germanic', 'NNP'),
 ('language', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('Indo-European', 'JJ'),
 ('language', 'NN'),
 ('family', 'NN'),
 (',', ','),
 ('whose', 'WP$'),
 ('speakers', 'NNS'),
 (',', ','),
 ('called', 'VBN'),
 ('Anglophones', 'NNS'),
 (',', ','),
 ('originated', 'VBN'),
 ('in', 'IN'),
 ('early', 'JJ'),
 ('medieval', 'NN'),
 ('England', 'NNP'),
 ('.', '.'),
 ('The', 'DT'),
 ('namesake', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('language', 'NN'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('Angles', 'NNP'),
 (',', ','),
 ('one', 'CD'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('ancient', 'NN'),
 ('Germanic', 'NNP'),
 ('peoples', 'VBZ'),
 ('that', 'WDT'),
 ('migrated', 'VBD'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('island', 'NN'),
 ('of', 'IN'),
 ('Great', 'NNP'),
 ('Britain', 'NNP'),
 ('.', '.'),
 ('Modern', 'NNP'),
 ('English', 'NNP'),
 ('is', 'VBZ'),
 ('both', 'DT'),
 ('the', 'DT'),
 ('most', 'RBS'),
 ('spoken', 'JJ'),
 ('la

In [4]:
# Keep only the verbs from text_tag

verb = []

for word, tag in text_tag:
    if (tag.startswith('V') and word.isalpha()):
        verb.append(word)
        
verb

['is',
 'called',
 'originated',
 'is',
 'peoples',
 'migrated',
 'is',
 'is',
 'learned']
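The verb list above mixes several verb forms, and it also contains 'peoples', which the tagger labelled VBZ even though it is a noun in this sentence. Grouping the words by their exact tag makes such cases easier to spot. The following is a minimal sketch on a hand-copied subset of the tagged output; `verbs_by_tag` is a hypothetical helper name.

```python
from collections import defaultdict

# A few (word, tag) pairs copied by hand from the tagged output.
tagged = [
    ("is", "VBZ"),
    ("a", "DT"),
    ("called", "VBN"),
    ("originated", "VBN"),
    ("peoples", "VBZ"),
    ("migrated", "VBD"),
    ("island", "NN"),
]


def verbs_by_tag(tagged_words):
    """Group alphabetic tokens whose tag starts with 'VB' by their exact tag."""
    groups = defaultdict(list)
    for word, tag in tagged_words:
        if tag.startswith("VB") and word.isalpha():
            groups[tag].append(word)
    return dict(groups)


print(verbs_by_tag(tagged))
# {'VBZ': ['is', 'peoples'], 'VBN': ['called', 'originated'], 'VBD': ['migrated']}
```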

In [5]:
# How many proper nouns are there? (proper noun tag: NNP)

proper_noun = []
for word,tag in text_tag:
    if(tag == 'NNP' and word.isalpha()):
        proper_noun.append(word)

print(proper_noun)

print("Number of proper noun:-",len(proper_noun))
        

['English', 'West', 'Germanic', 'England', 'Angles', 'Germanic', 'Great', 'Britain', 'Modern', 'English', 'Mandarin', 'Chinese', 'Spanish']
Number of proper noun:- 13
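The count above treats every NNP token separately, so 'Great' and 'Britain' are counted as two proper nouns. One possible refinement is to merge runs of consecutive NNP tokens into a single name. This is a sketch on a hand-copied subset of the tagged output; `proper_noun_phrases` is a hypothetical helper, and merging adjacent tokens is a simple heuristic, not named-entity recognition.

```python
from itertools import groupby

# A few (word, tag) pairs copied by hand from the tagged output.
tagged = [
    ("the", "DT"),
    ("island", "NN"),
    ("of", "IN"),
    ("Great", "NNP"),
    ("Britain", "NNP"),
    (".", "."),
    ("Modern", "NNP"),
    ("English", "NNP"),
    ("is", "VBZ"),
]


def proper_noun_phrases(tagged_words):
    """Join each run of consecutive NNP tokens into one space-separated name."""
    phrases = []
    # groupby yields runs of adjacent items that share the same key value.
    for is_nnp, group in groupby(tagged_words, key=lambda pair: pair[1] == "NNP"):
        if is_nnp:
            phrases.append(" ".join(word for word, _ in group))
    return phrases


print(proper_noun_phrases(tagged))
# ['Great Britain', 'Modern English']
```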


In [6]:
# Adjective-noun pairs

for i in range(len(text_tag) - 1):   # stop one short so text_tag[i+1] never goes out of range
    if text_tag[i][1].startswith('J') and text_tag[i+1][1].startswith('N'):
        print(text_tag[i][0], text_tag[i+1][0])

Indo-European language
early medieval
spoken language
native language
second language
second-language speakers
native speakers
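The same pairing can be written without index arithmetic by zipping the tagged list with itself shifted by one position. A minimal sketch on a hand-copied subset of the tagged output follows; `adjective_noun_pairs` is a hypothetical helper name.

```python
# A few (word, tag) pairs copied by hand from the tagged output.
tagged = [
    ("the", "DT"),
    ("Indo-European", "JJ"),
    ("language", "NN"),
    ("family", "NN"),
    ("in", "IN"),
    ("early", "JJ"),
    ("medieval", "NN"),
    ("England", "NNP"),
]


def adjective_noun_pairs(tagged_words):
    """Return (adjective, noun) pairs for every adjective directly followed by a noun."""
    return [
        (w1, w2)
        # zip stops at the shorter sequence, so the last token is never paired past the end.
        for (w1, t1), (w2, t2) in zip(tagged_words, tagged_words[1:])
        if t1.startswith("J") and t2.startswith("N")
    ]


print(adjective_noun_pairs(tagged))
# [('Indo-European', 'language'), ('early', 'medieval')]
```

Because `zip` ends when the shifted list runs out, a text that finishes on an adjective produces no pair and no `IndexError`.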


### Tagging

In [7]:
from nltk.corpus import indian
from nltk import TnT    # TnT tagger: a statistical tagger based on second-order Markov models

In [8]:
# tagged_sents returns the corpus as sentences, each a list of (word, tag) tuples
pos = indian.tagged_sents('marathi.pos')
pos

[[("''", 'SYM'), ('सनातनवाद्यांनी', 'NN'), ('व', 'CC'), ('प्रतिगाम्यांनी', 'NN'), ('समाज', 'NN'), ('रसातळाला', 'NN'), ('नेला', 'VM'), ('असताना', 'VAUX'), ('या', 'DEM'), ('अंधारात', 'NN'), ('बाळशास्त्री', 'NNPC'), ('जांभेकर', 'NNP'), ('यांनी', 'PRP'), ("'दर्पण'च्या", 'NNP'), ('माध्यमातून', 'NN'), ('पहिली', 'QO'), ('ज्ञानज्योत', 'NN'), ('तेववली', 'VM'), (',', 'SYM'), ("''", 'SYM'), ('असे', 'DEM'), ('प्रतिपादन', 'NN'), ('नटसम्राट', 'NNPC'), ('प्रभाकर', 'NNPC'), ('पणशीकर', 'NNP'), ('यांनी', 'PRP'), ('केले', 'VM'), ('.', 'SYM')], [('दर्पणकार', 'JJ'), ('बाळशास्त्री', 'NNPC'), ('जांभेकर', 'NNP'), ('यांच्या', 'PRP'), ('१९५व्या', 'QC'), ('जयंतीनिमित्त', 'NN'), ('महाराष्ट्र', 'NNPC'), ('संपादक', 'NNPC'), ('परिषद', 'NNP'), ('व', 'CC'), ('सिंधुदुर्ग', 'NNPC'), ('जिल्हा', 'NNPC'), ('मराठी', 'NNPC'), ('पत्रकार', 'NNPC'), ('संघाच्या', 'NNP'), ('वतीने', 'NN'), ('तसेच', 'PRP'), ('महाराष्ट्र', 'NNPC'), ('जर्नलिस्ट', 'NNPC'), ('फाउंडेशन', 'NNP'), ('व', 'CC'), ('महाराष्ट्र', 'NNPC'), ('ग्रामीण', 'NNPC'), 

In [9]:
tagger = TnT()

tagger.train(pos)

In [10]:
text = 'सोमवारी हा सामना सुमारे आठ तास उशिरा सुरू झाला.'

In [11]:
tagger.tag(word_tokenize(text))

[('सोमवारी', 'NNP'),
 ('हा', 'DEM'),
 ('सामना', 'NN'),
 ('सुमारे', 'QF'),
 ('आठ', 'QC'),
 ('तास', 'NN'),
 ('उशिरा', 'NN'),
 ('सुरू', 'JJ'),
 ('झाला', 'VM'),
 ('.', 'SYM')]
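The tagger above is trained on every sentence in the corpus, so there is nothing left over to check its output against. A common approach is to hold back part of the tagged sentences and measure how many tags the tagger reproduces. The sketch below is plain Python and does not call NLTK; `split_sentences`, `token_accuracy`, and the stand-in `always_nn` tagger are hypothetical names, and the toy gold data is only for illustration.

```python
def split_sentences(tagged_sents, train_fraction=0.9):
    """Split a list of tagged sentences into a training part and a held-out part."""
    cut = int(len(tagged_sents) * train_fraction)
    return tagged_sents[:cut], tagged_sents[cut:]


def token_accuracy(tag_fn, gold_sents):
    """Fraction of tokens for which tag_fn predicts the gold tag.

    tag_fn takes a list of words and returns a list of (word, tag) tuples.
    """
    correct = total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        predicted = tag_fn(words)
        for (_, gold_tag), (_, predicted_tag) in zip(sent, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0


def always_nn(words):
    """Stand-in tagger for the demo: labels every word as NN."""
    return [(word, "NN") for word in words]


# Toy gold data in the same shape as the tagged sentences.
gold = [
    [("समाज", "NN"), ("नेला", "VM")],
    [(".", "SYM")],
    [("अंधारात", "NN")],
]

train_sents, test_sents = split_sentences(gold, train_fraction=0.5)
print(len(train_sents), len(test_sents))
print(token_accuracy(always_nn, gold))
```

With the notebook's objects, the assumed usage would be to split `list(pos)`, call `tagger.train` on the training part, and pass `tagger.tag` as `tag_fn` with the held-out part as `gold_sents`.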