# Marathi Part-of-Speech Tagging using NLTK and Hidden Markov Model Tagger

## Introduction
In this tutorial, we will learn how to perform Part-of-Speech (POS) tagging in Marathi using the Natural Language Toolkit (NLTK). POS tagging is the process of assigning grammatical categories (like noun, verb, adjective, etc.) to each word in a sentence.

## Prerequisites
Before we begin, make sure you have NLTK installed. You can install it using pip:
 `pip install nltk`


In [1]:
import nltk
from nltk import word_tokenize
from nltk.tag import HiddenMarkovModelTagger

## Download Marathi Corpus
We need to download NLTK's Indian languages corpus, which includes the Marathi POS-tagged data, to train our model.
You can do so with:  
`nltk.download('indian')`

In [2]:
nltk.download('indian')
from nltk.corpus import indian

[nltk_data] Downloading package indian to
[nltk_data]     C:\Users\rutwik\AppData\Roaming\nltk_data...
[nltk_data]   Package indian is already up-to-date!


## Loading the Marathi POS-Tagged Corpus from NLTK
To load the Marathi POS-tagged corpus, we use the `tagged_sents` method of NLTK's `indian` corpus reader. It returns a sequence of sentences, each one a list of `(word, tag)` pairs.

In [3]:
tagged_set = indian.tagged_sents('marathi.pos')
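
Before training, it can help to check how large the corpus is and which tags it uses. The helper below is a small sketch rather than part of NLTK: the name `describe_tagged_sents` and the toy sentences are illustrative assumptions. It accepts any sequence of sentences made of `(word, tag)` pairs, which is the shape `tagged_sents` returns.

```python
from collections import Counter

def describe_tagged_sents(tagged_sents):
    """Return (sentence count, token count, Counter of tags)."""
    n_sents = 0
    tag_counts = Counter()
    for sent in tagged_sents:
        n_sents += 1
        tag_counts.update(tag for _, tag in sent)
    return n_sents, sum(tag_counts.values()), tag_counts

# Toy data in the same (word, tag) shape; the tags are for illustration only.
toy = [
    [("जे", "PRP"), ("झाले", "VM")],
    [("ते", "PRP"), ("चांगलेच", "NN"), ("झाले", "VM")],
]
n_sents, n_tokens, tag_counts = describe_tagged_sents(toy)
print(n_sents, n_tokens, tag_counts.most_common())
```

With the corpus loaded, the same helper could be called as `describe_tagged_sents(tagged_set)`.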

## Training our model
In this section, we will train a Hidden Markov Model (HMM) Tagger for Marathi Part-of-Speech (POS) tagging using NLTK. The HMM Tagger is a statistical model that learns from annotated data to predict the POS tags of words in a sentence.

In [4]:
hmm_tagger = HiddenMarkovModelTagger.train(tagged_set)
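
To make "learns from annotated data" more concrete: an HMM tagger treats tags as hidden states and words as observations, and it estimates how often one tag follows another (transitions) and how often a tag produces a given word (emissions). The sketch below only illustrates that counting step on toy data. It is not NLTK's implementation, which also smooths the estimates and searches for the most probable tag sequence; the names `count_hmm_events`, `relative_freq`, and the `<s>` start marker are assumptions made for this example.

```python
from collections import Counter, defaultdict

def count_hmm_events(tagged_sents):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(Counter)  # previous tag -> next tag -> count
    emissions = defaultdict(Counter)    # tag -> word -> count
    for sent in tagged_sents:
        prev = "<s>"  # hypothetical start-of-sentence marker
        for word, tag in sent:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            prev = tag
    return transitions, emissions

def relative_freq(counter, key):
    """Unsmoothed relative frequency of key within a Counter."""
    total = sum(counter.values())
    return counter[key] / total if total else 0.0

# Toy data; the tags are for illustration only.
toy = [
    [("जे", "PRP"), ("झाले", "VM")],
    [("ते", "PRP"), ("चांगलेच", "NN"), ("झाले", "VM")],
]
transitions, emissions = count_hmm_events(toy)
p_vm_after_prp = relative_freq(transitions["PRP"], "VM")   # PRP is followed by VM once and NN once
p_word_given_vm = relative_freq(emissions["VM"], "झाले")   # both VM tokens are this word
print(p_vm_after_prp, p_word_given_vm)
```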

## Testing
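Before moving further, we need to know the accuracy of our model. To measure it, we hold out part of the corpus: the first 800 sentences are used to train a second tagger with `HiddenMarkovModelTagger.train()`, and the next 200 sentences are kept for testing.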

In [5]:
train_data = tagged_set[:800]
test_data = tagged_set[800:1000]
hmm_tagger_test = HiddenMarkovModelTagger.train(train_data)
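
The slices above take the sentences in corpus order. If the corpus happens to be grouped by source or topic, a shuffled split may give a more representative test set. The helper below is a minimal sketch of that idea; `split_tagged_sents`, its default fraction, and the seed are assumptions for illustration, not something the original notebook uses.

```python
import random

def split_tagged_sents(tagged_sents, train_fraction=0.8, seed=42):
    """Shuffle sentences reproducibly and split them into train and test lists."""
    sents = list(tagged_sents)       # copy so the caller's data is not reordered
    random.Random(seed).shuffle(sents)
    cut = int(len(sents) * train_fraction)
    return sents[:cut], sents[cut:]

# Ten one-word toy sentences stand in for the corpus.
toy = [[(f"w{i}", "NN")] for i in range(10)]
train_part, test_part = split_tagged_sents(toy)
print(len(train_part), len(test_part))
```

Fixing the seed keeps the split the same between runs, so accuracy figures stay comparable.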

## Accuracy Testing
To test the accuracy of our HMM tagger, we use the `.accuracy()` method provided by NLTK. It tags the test sentences and returns the fraction of tokens whose predicted tag matches the gold-standard annotation in the tagged corpus.
Do not use the `.evaluate()` method; it is deprecated in recent NLTK releases in favor of `.accuracy()`.

In [6]:
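# Caution: hmm_tagger was trained on the full corpus, which includes test_data, so this
# score is optimistic; hmm_tagger_test.accuracy(test_data) gives a held-out estimate.
accuracy = hmm_tagger.accuracy(test_data)
print("Accuracy:", accuracy)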

Accuracy: 0.9386751518657795
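
For intuition, token-level accuracy can be computed by hand: strip the gold tags, tag the words, and count how many predicted tags match. The sketch below does this for any function that maps a list of words to `(word, tag)` pairs. The names `token_accuracy` and `always_nn` are hypothetical, and the stand-in tagger simply labels every word `NN`.

```python
def token_accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        predicted = tag_fn(words)
        for (_, gold_tag), (_, pred_tag) in zip(sent, predicted):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total if total else 0.0

def always_nn(words):
    """Stand-in tagger for the example: labels every word NN."""
    return [(word, "NN") for word in words]

# Toy gold data; the tags are for illustration only.
gold = [
    [("जे", "PRP"), ("झाले", "VM")],
    [("ते", "PRP"), ("चांगलेच", "NN"), ("झाले", "VM")],
]
score = token_accuracy(always_nn, gold)
print(score)  # 1 of the 5 tokens is tagged correctly
```

With the taggers trained earlier, a call such as `token_accuracy(hmm_tagger_test.tag, test_data)` would score the tagger that never saw the test sentences.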


In [7]:
sentence = input()

 जे झाले ते चांगलेच झाले, जे होत आहे ते चांगलेच होत आहे आणि जे होणार तेही चांगलेच होणार


## Tokenization
Tokenization is the process of splitting a sentence into individual tokens (words and punctuation marks).
We tokenize the sentence so that each token can be passed to the POS tagger.

In [8]:
# word_tokenize needs NLTK's Punkt data ('punkt', or 'punkt_tab' on newer releases)
tokens = word_tokenize(sentence)
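
`word_tokenize` depends on NLTK's Punkt data and is geared toward English, so a dependency-free fallback can be handy for Marathi text. The sketch below splits on whitespace and separates a small set of punctuation marks, including the danda (`।`). The function name `simple_tokenize` and the punctuation set are assumptions for illustration. The pattern deliberately avoids `\w`, because Python's `re` module does not treat Devanagari vowel signs as word characters and would break words apart.

```python
import re

# Punctuation split off as separate tokens; this set is an illustrative assumption.
_PUNCT = r",.!?;:।॥()"
_TOKEN_RE = re.compile(rf"[^\s{_PUNCT}]+|[{_PUNCT}]")

def simple_tokenize(text):
    """Split text into runs of non-space, non-punctuation characters and single punctuation marks."""
    return _TOKEN_RE.findall(text)

sample_tokens = simple_tokenize("जे झाले, ते")
print(sample_tokens)
```

The resulting list can be passed to a tagger in the same way as the output of `word_tokenize`.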

## Creating a Function to Get POS Tags for Each Word
We will create a `get_POS` function that takes the tokenized sentence (`article`) and our `hmm_tagger`, trained as a Hidden Markov Model (HMM), and returns each word joined to its predicted Part-of-Speech (POS) tag in the form `word_tag`.

In [9]:
def get_POS(article, tagger):
    """Tag a list of tokens and return them as 'word_tag' strings."""
    word_tags = tagger.tag(article)  # list of (word, tag) pairs
    POS = []
    for word, tag in word_tags:
        POS.append(f"{word}_{tag}")
    return POS

## Final Implementation
In this final implementation, we will use our trained Hidden Markov Model (HMM) Tagger to perform Part-of-Speech (POS) tagging for a Marathi sentence. We'll pass the tokenized words of the sentence to the model and obtain the POS tags for each word.

In [10]:
pos_tag = get_POS(tokens, hmm_tagger)
pos_tag

['जे_PRP',
 'झाले_VM',
 'ते_PRP',
 'चांगलेच_NN',
 'झाले_VM',
 ',_SYM',
 'जे_PRP',
 'होत_VM',
 'आहे_VAUX',
 'ते_PRP',
 'चांगलेच_NN',
 'होत_VM',
 'आहे_VAUX',
 'आणि_CC',
 'जे_PRP',
 'होणार_VM',
 'तेही_SYM',
 'चांगलेच_PRP',
 'होणार_VM']
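
If the `word_tag` strings need further processing, they can be split back into pairs and summarized. The sketch below is one way to do that; `split_word_tag` is a hypothetical helper, and the short list is a hand-copied subset of the output above. It splits on the last underscore so that a word containing an underscore stays intact.

```python
from collections import Counter

def split_word_tag(item):
    """Split a 'word_tag' string on its last underscore into (word, tag)."""
    word, tag = item.rsplit("_", 1)
    return word, tag

# A few items copied from the tagger output above.
tagged = ["जे_PRP", "झाले_VM", ",_SYM", "आहे_VAUX", "झाले_VM"]
pairs = [split_word_tag(item) for item in tagged]
tag_counts = Counter(tag for _, tag in pairs)
print(pairs)
print(tag_counts.most_common())
```

Running the same steps on the full `pos_tag` list would show how often each tag was assigned in the sentence.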