
Title: Sequence Modeling with HMMs for Technical Text

Dataset Source:

o POS-tagged sample abstracts

o (Abstracts sourced from arXiv via Kaggle)

Lab Objectives:

o Understand domain mismatch in HMMs

o Analyze tag transition patterns in technical writing

Tasks:

o Collect 20–30 research abstracts.

o Automatically POS-tag them using NLTK.

o Treat the tagged data as training data for HMM.

o Compute:

  o Transition probabilities

  o Emission probabilities

o Analyze:

  o Most frequent tag transitions

o Apply HMM tagging to a new abstract sentence.

In [1]:
abstracts = [
    "This paper proposes a novel deep learning architecture for image classification tasks using convolutional neural networks.",
    "We investigate optimization techniques for training large scale neural networks efficiently.",
    "Experimental results demonstrate significant improvements over existing state of the art methods.",
    "The proposed approach leverages attention mechanisms for sequence modeling in natural language processing.",
    "This study presents a comparative analysis of machine learning algorithms for predictive analytics.",
    "We introduce a hybrid model combining statistical learning and neural representations.",
    "The algorithm achieves high accuracy on benchmark datasets across multiple domains.",
    "A probabilistic framework is developed for modeling uncertainty in classification systems.",
    "This work explores feature extraction techniques for high dimensional data.",
    "The proposed method improves convergence speed during model training.",
    "We analyze the performance of supervised learning techniques in noisy environments.",
    "A scalable architecture is designed for real time data processing applications.",
    "The results indicate robustness and stability of the proposed system.",
    "This paper discusses challenges in deploying deep learning models in production systems.",
    "A novel loss function is introduced to enhance classification performance.",
    "The framework integrates domain knowledge into neural network training.",
    "We evaluate the model using standard evaluation metrics and datasets.",
    "The proposed solution reduces computational complexity significantly.",
    "This research focuses on transfer learning techniques for limited data scenarios.",
    "An efficient algorithm is presented for large scale text classification.",
    "The experimental setup includes cross validation and hyperparameter tuning.",
    "The system demonstrates improved generalization on unseen data.",
    "This work addresses scalability issues in distributed machine learning.",
    "A comprehensive evaluation is conducted using real world datasets.",
    "The model outperforms baseline approaches in terms of accuracy and efficiency."
]

len(abstracts)

25

2. Automatically POS-tag them using NLTK.



In [16]:
import nltk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

# Download each required NLTK resource only if it is not already installed.
resources = {
    'punkt': 'tokenizers/punkt',
    'punkt_tab': 'tokenizers/punkt_tab',
    'averaged_perceptron_tagger_eng': 'taggers/averaged_perceptron_tagger_eng',
}
for package, path in resources.items():
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(package)

# Tokenize and POS-tag every abstract; each entry is a list of (word, tag) pairs.
tagged_sentences = [pos_tag(word_tokenize(abstract)) for abstract in abstracts]
display(tagged_sentences[0])

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


[('This', 'DT'),
 ('paper', 'NN'),
 ('proposes', 'VBZ'),
 ('a', 'DT'),
 ('novel', 'JJ'),
 ('deep', 'JJ'),
 ('learning', 'NN'),
 ('architecture', 'NN'),
 ('for', 'IN'),
 ('image', 'NN'),
 ('classification', 'NN'),
 ('tasks', 'NNS'),
 ('using', 'VBG'),
 ('convolutional', 'JJ'),
 ('neural', 'JJ'),
 ('networks', 'NNS'),
 ('.', '.')]

3. Treat the tagged data as training data for HMM


In [17]:
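# Training data format: a list of sentences, each a list of (word, tag) pairs:
# [
#   [('This','DT'), ('paper','NN'), ('proposes','VBZ'), ...],
#   [('We','PRP'), ('investigate','VBP'), ...],
#   ...
# ]
from nltk.tag import hmm

# Supervised training: probabilities are estimated from the tagged sentences.
trainer = hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train(tagged_sentences)
print(type(hmm_tagger))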


<class 'nltk.tag.hmm.HiddenMarkovModelTagger'>
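When no estimator is passed, supervised HMM training amounts to counting and normalizing (maximum likelihood estimation). The sketch below illustrates that idea on a two-sentence toy corpus; `train_mle` and `toy` are hypothetical names introduced here for illustration and are not part of NLTK or of this notebook's data.

```python
from collections import Counter, defaultdict


def train_mle(tagged_sentences):
    """Estimate start, transition and emission probabilities by relative frequency."""
    start = Counter()             # how often each tag opens a sentence
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = count
    emit = defaultdict(Counter)   # emit[tag][word] = count
    for sent in tagged_sentences:
        prev = None
        for word, tag in sent:
            if prev is None:
                start[tag] += 1
            else:
                trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag

    def normalize(counter):
        total = sum(counter.values())
        return {key: value / total for key, value in counter.items()}

    return (
        normalize(start),
        {tag: normalize(c) for tag, c in trans.items()},
        {tag: normalize(c) for tag, c in emit.items()},
    )


# Toy corpus (illustrative only, not the abstracts used above).
toy = [
    [("This", "DT"), ("paper", "NN"), ("proposes", "VBZ"), ("models", "NNS")],
    [("The", "DT"), ("model", "NN"), ("works", "VBZ")],
]
start_p, trans_p, emit_p = train_mle(toy)
print(start_p["DT"])          # 1.0
print(trans_p["DT"]["NN"])    # 1.0
print(emit_p["NN"]["model"])  # 0.5
```

Note that a tag which only ever ends a sentence never appears as a key of `trans_p`, because it has no outgoing transitions.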


4. Compute:

a. Transition probabilities

In [18]:
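import pandas as pd

# Conditional distributions P(next tag | current tag), stored on a private attribute.
transitions = hmm_tagger._transitions

# Spot-check a few individual transitions.
print("P(JJ → NN):", transitions['JJ'].prob('NN'))
print("P(NN → NN):", transitions['NN'].prob('NN'))
print("P(DT → JJ):", transitions['DT'].prob('JJ'))
print("P(IN → NN):", transitions['IN'].prob('NN'))

# Print a small grid of transitions between common tags.
for tag1 in ['JJ', 'NN', 'DT']:
    for tag2 in ['NN', 'JJ', 'IN']:
        print(f"P({tag1} → {tag2}) = {transitions[tag1].prob(tag2)}")

# Build the transition matrix (rows: current tag, columns: next tag).
# Only tags with outgoing transitions are keys, so '.' (which only ends
# sentences in this data) has no row or column and some rows sum to less than 1.
tags = list(transitions.keys())
matrix = [[transitions[t1].prob(t2) for t2 in tags] for t1 in tags]

transition_matrix = pd.DataFrame(matrix, index=tags, columns=tags)
transition_matrix.head()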


P(JJ → NN): 0.5227272727272727
P(NN → NN): 0.19736842105263158
P(DT → JJ): 0.3333333333333333
P(IN → NN): 0.32142857142857145
P(JJ → NN) = 0.5227272727272727
P(JJ → JJ) = 0.1590909090909091
P(JJ → IN) = 0.0
P(NN → NN) = 0.19736842105263158
P(NN → JJ) = 0.0
P(NN → IN) = 0.15789473684210525
P(DT → NN) = 0.48148148148148145
P(DT → JJ) = 0.3333333333333333
P(DT → IN) = 0.0


             DT        NN       VBZ        JJ        IN       NNS       VBG       PRP       VBP        RB       VBN        CC        TO        VB
DT     0.000000  0.481481  0.000000  0.333333  0.000000  0.037037  0.000000  0.000000  0.000000  0.000000  0.148148  0.000000  0.000000  0.000000
NN     0.000000  0.197368  0.250000  0.000000  0.157895  0.184211  0.026316  0.000000  0.000000  0.013158  0.000000  0.052632  0.000000  0.000000
VBZ    0.105263  0.105263  0.000000  0.368421  0.052632  0.052632  0.000000  0.000000  0.052632  0.000000  0.263158  0.000000  0.000000  0.000000
JJ     0.000000  0.522727  0.000000  0.159091  0.000000  0.295455  0.022727  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
IN     0.071429  0.321429  0.000000  0.428571  0.000000  0.035714  0.142857  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
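Many entries above are exactly zero because the estimates are unsmoothed: any tag pair that never occurs in the 25 abstracts gets probability 0, so a sentence from another domain that contains such a pair cannot be scored sensibly. One common remedy is additive (Lidstone) smoothing, sketched below. The helper `smoothed_row`, the value `gamma=0.1`, and the counts are assumptions made for illustration; they are not taken from this notebook. NLTK's `train_supervised` accepts an `estimator` argument that can serve the same purpose.

```python
def smoothed_row(counts, tags, gamma=0.1):
    """Additive smoothing: add gamma to every count before normalizing."""
    total = sum(counts.get(t, 0) for t in tags) + gamma * len(tags)
    return {t: (counts.get(t, 0) + gamma) / total for t in tags}


# Hypothetical outgoing counts for one source tag (not from the abstracts).
tags = ["NN", "NNS", "JJ", "IN"]
counts = {"NN": 5, "NNS": 3, "JJ": 2}  # "IN" was never observed after this tag

row = smoothed_row(counts, tags)
print(row["IN"])          # small but non-zero
print(sum(row.values()))  # approximately 1.0
```

The trade-off is that observed transitions lose a little probability mass to unseen ones; a smaller `gamma` keeps the estimates closer to the raw relative frequencies.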


b. Emission probabilities

In [19]:
# Conditional distributions P(word | tag), stored on a private attribute.
emissions = hmm_tagger._outputs
print("Emission Probabilities:")
print("P('model' | NN):", emissions['NN'].prob('model'))
print("P('neural' | JJ):", emissions['JJ'].prob('neural'))
print("P('achieves' | VBZ):", emissions['VBZ'].prob('achieves'))
print()


Emission Probabilities:
P('model' | NN): 0.05263157894736842
P('neural' | JJ): 0.09090909090909091
P('achieves' | VBZ): 0.05263157894736842



c. Analyze: Most frequent tag transitions

In [22]:
from collections import Counter

transition_counts = Counter()
for sent in tagged_sentences:
    for i in range(len(sent) - 1):
        transition_counts[(sent[i][1], sent[i+1][1])] += 1

print("Most Frequent POS Tag Transitions:")
for pair, count in transition_counts.most_common(5):
    print(pair, ":", count)
print()

Most Frequent POS Tag Transitions:
('JJ', 'NN') : 23
('NN', 'VBZ') : 19
('NN', 'NN') : 15
('NN', 'NNS') : 14
('NNS', '.') : 14
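These raw counts connect directly to the transition probabilities printed earlier: each probability is the pair count divided by the total number of transitions leaving the source tag. The short check below reproduces four of them. The totals 44 (for JJ) and 76 (for NN) are not printed anywhere in this notebook; they are inferred from the reported probabilities and should be read as assumptions.

```python
from fractions import Fraction

# Counts printed above for the most frequent transitions.
pair_counts = {
    ("JJ", "NN"): 23,
    ("NN", "VBZ"): 19,
    ("NN", "NN"): 15,
    ("NN", "NNS"): 14,
}

# Assumed totals of outgoing transitions per source tag, inferred from the
# printed probabilities (23 / 0.5227... = 44 and 15 / 0.1973... = 76).
outgoing_totals = {"JJ": 44, "NN": 76}

for (src, dst), count in pair_counts.items():
    p = Fraction(count, outgoing_totals[src])
    print(f"P({src} → {dst}) = {count}/{outgoing_totals[src]} = {float(p):.6f}")
```

The four values agree with the JJ and NN rows of the transition matrix (0.522727, 0.250000, 0.197368, 0.184211).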



d. Apply HMM tagging to a new abstract sentence

In [23]:
test_sentence = "The proposed model achieves high accuracy on benchmark datasets"
tokens = word_tokenize(test_sentence)
tagged_output = hmm_tagger.tag(tokens)

print("HMM Tagging Output:")
print(tagged_output)


HMM Tagging Output:
[('The', 'DT'), ('proposed', 'VBN'), ('model', 'NN'), ('achieves', 'VBZ'), ('high', 'JJ'), ('accuracy', 'NN'), ('on', 'IN'), ('benchmark', 'NN'), ('datasets', 'NNS')]
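HMM taggers typically pick the tag sequence with the highest joint probability using the Viterbi algorithm. The following is a minimal, self-contained sketch of that decoding step on a hand-made toy model; it is not NLTK's implementation. The function `viterbi`, the `floor` used to avoid `log(0)`, and all probabilities are assumptions for illustration, and the input is assumed to be non-empty.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable (word, tag) sequence under a first-order HMM."""

    def lp(p):
        # Floor zero probabilities so log() stays finite (a choice of this sketch).
        return math.log(max(p, floor))

    # best[i][t]: log-probability of the best path ending in tag t at position i
    best = [{t: lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Choose the predecessor tag that maximizes the path score into t.
            prev, score = max(
                ((p, best[i - 1][p] + lp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda item: item[1],
            )
            best[i][t] = score + lp(emit_p[t].get(words[i], 0))
            back[i][t] = prev

    # Trace back from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return list(zip(words, path))


# Toy model (hypothetical numbers): "works" is ambiguous between NN and VBZ.
tags = ["DT", "NN", "VBZ"]
start_p = {"DT": 1.0}
trans_p = {"DT": {"NN": 1.0}, "NN": {"VBZ": 1.0}, "VBZ": {"DT": 1.0}}
emit_p = {
    "DT": {"the": 1.0},
    "NN": {"model": 0.5, "works": 0.5},
    "VBZ": {"works": 1.0},
}

result = viterbi(["the", "model", "works"], tags, start_p, trans_p, emit_p)
print(result)  # [('the', 'DT'), ('model', 'NN'), ('works', 'VBZ')]
```

Although "works" can be emitted by both NN and VBZ, the transition NN → VBZ makes the verb reading the best path, which mirrors how context resolves the tag of "proposed" or "benchmark" in the output above.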


