# Ahmad Sharif
***K436765***


***DATA.STAT.840 Statistical Methods for Text Data Analysis***

In [33]:
from hmmlearn import hmm
import numpy as np

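# Read one sentence per line, skipping blank lines so that no sequence has length 0
with open('hmm_sentences.txt', 'r') as file:
    sentences = [line.strip() for line in file if line.strip()]

# Whitespace tokenization
tokenized_sentences = [sentence.split() for sentence in sentences]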

# Create a vocabulary
vocab = set(word for sentence in tokenized_sentences for word in sentence)
word2idx = {word: idx for idx, word in enumerate(sorted(vocab))}
idx2word = {idx: word for word, idx in word2idx.items()}

indexed_sentences = [[word2idx[word] for word in sentence] for sentence in tokenized_sentences]

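# hmmlearn expects all sequences stacked into one (n_samples, 1) array,
# plus the length of each sequence so it can split them again
concatenated_data = np.concatenate(indexed_sentences)
sentence_lengths = [len(sentence) for sentence in indexed_sentences]

reshaped_data = concatenated_data.reshape(-1, 1)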

n_states = 5  # Number of hidden states
hmm_model = hmm.CategoricalHMM(n_components=n_states, n_iter=100, verbose=True)
hmm_model.fit(reshaped_data, sentence_lengths)  # Passing sentence_lengths directly

start_probabilities = hmm_model.startprob_
transition_matrix = hmm_model.transmat_
emission_probabilities = hmm_model.emissionprob_

print("Start Probabilities:")
print(start_probabilities)
print("\nTransition Matrix:")
print(transition_matrix)
print("\nEmission Probabilities:")
print(emission_probabilities)



         1  -68690.97680006             +nan
         2  -60911.37220741   +7779.60459265
         3  -59492.95365857   +1418.41854884
         4  -57579.49284476   +1913.46081381
         5  -55770.91278533   +1808.58005943
         6  -54443.13617029   +1327.77661505
         7  -53750.58352557    +692.55264471
         8  -53420.49222674    +330.09129883
         9  -53066.97610949    +353.51611725
        10  -52621.02550052    +445.95060897
        11  -52229.24457173    +391.78092879
        12  -52060.90682113    +168.33775060
        13  -51947.90662924    +113.00019188
        14  -51813.59824931    +134.30837994
        15  -51622.30272403    +191.29552528
        16  -51468.67512031    +153.62760372
        17  -51406.58557868     +62.08954163
        18  -51376.47722135     +30.10835733
        19  -51357.42319247     +19.05402888
        20  -51344.74857742     +12.67461505
        21  -51336.22561033      +8.52296709
        22  -51330.48426191      +5.74134841
        23

Start Probabilities:
[0.00000000e+000 1.92000000e-001 7.67170571e-106 2.14000000e-001
 5.94000000e-001]

Transition Matrix:
[[1.86071888e-026 4.85871824e-001 1.78344635e-001 5.35143402e-002
  2.82269201e-001]
 [1.94645102e-147 1.09495238e-040 2.63224229e-001 5.49495331e-001
  1.87280440e-001]
 [8.14167614e-001 1.79686876e-071 1.85832386e-001 4.33329626e-065
  0.00000000e+000]
 [3.62863683e-001 9.43271771e-052 5.01841648e-001 1.35294669e-001
  0.00000000e+000]
 [2.17566623e-089 1.00000000e+000 1.38101921e-042 6.15725612e-051
  4.76909807e-017]]

Emission Probabilities:
[[3.83654142e-001 2.97424185e-002 2.00818965e-001 9.44351828e-061
  8.20663887e-105 2.09349736e-001 2.07975559e-094 6.32514846e-113
  2.62788227e-108 4.43279695e-002 0.00000000e+000 6.68068919e-115
  1.56897723e-002 3.31371238e-031 1.80970864e-002 4.78070237e-097
  8.97062339e-105 5.10689552e-002 7.04065109e-020 0.00000000e+000
  3.49268845e-103 0.00000000e+000 0.00000000e+000 0.00000000e+000
  0.00000000e+000 0.00000000e

        99  -51269.17668310      +0.02143650
       100  -51269.15706290      +0.01962020
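
The raw emission matrix is hard to read because each row spans the whole vocabulary. One way to summarize it is to list the most probable words for each hidden state. The sketch below is illustrative only: `top_words_per_state` is a hypothetical helper, and the toy matrix and vocabulary stand in for `hmm_model.emissionprob_` and `idx2word` from the cell above.

```python
def top_words_per_state(emissionprob, idx2word, k=5):
    """Return, for each hidden state, the k most probable (word, probability) pairs."""
    summary = []
    for row in emissionprob:
        # Sort word indices by emission probability, highest first, and keep k
        top = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        summary.append([(idx2word[i], float(row[i])) for i in top])
    return summary


# Toy stand-in for a fitted model: 2 states, 4 words (values are made up)
toy_emission = [
    [0.7, 0.2, 0.1, 0.0],
    [0.0, 0.1, 0.3, 0.6],
]
toy_idx2word = {0: "a", 1: "the", 2: "cat", 3: "dog"}

result = top_words_per_state(toy_emission, toy_idx2word, k=2)
for state, words in enumerate(result):
    print(f"state {state}: {words}")
```

With the fitted model, the corresponding call would be `top_words_per_state(hmm_model.emissionprob_, idx2word)`.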


Exercise 8.3: Chomsky normal form.
The probabilistic context free grammar below is a simplified version of the one used to generate
"hmm_sentences.txt" in exercise 8.1. Transform the grammar into Chomsky normal form, so that
the possible sentences and their probabilities are the same as in the grammar below. Report the
definition of your resulting Chomsky normal form grammar (rules and their probabilities).

New rules:
ARTICLE --> A    0.6
ARTICLE --> THE  0.4
QVERB1 --> CAN   0.2
QVERB1 --> WILL  0.5
QVERB1 --> MAY   0.3
QVERB2 --> EXPLAIN  0.4
QVERB2 --> HELP    0.2
QVERB2 --> ANSWER  0.4
ADJECTIVE --> WISE       0.3
ADJECTIVE --> FRIENDLY   0.5
ADJECTIVE --> INSIGHTFUL 0.2
NOUN --> CAT   0.7
NOUN --> DOG   0.2
NOUN --> FOX   0.1

new non-terminal symbols. 

New rules:
STMANY --> S1 .                 0.6
STMANY --> S1 , ST1             0.4
ST1 --> BUT ST1                 0.6
ST1 --> ST1 BUT                 0.4
S1 --> SUBJ QVERB1 QVERB2 OBJ   1.0
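
Chomsky normal form allows at most two symbols on the right-hand side, so a rule such as `S1 --> SUBJ QVERB1 QVERB2 OBJ` would usually be split into a chain of binary rules. A minimal sketch of that step follows, assuming the first rule in the chain keeps the original probability and every helper rule gets probability 1.0 so that the product is unchanged. The function name and the helper non-terminal names (`S1_X1`, `S1_X2`) are hypothetical and not part of the exercise.

```python
def binarize(lhs, rhs, prob):
    """Split lhs --> rhs into binary rules; returns a list of (lhs, rhs, prob)."""
    rules = []
    current, current_prob = lhs, prob
    remaining = list(rhs)
    counter = 1
    while len(remaining) > 2:
        helper = f"{lhs}_X{counter}"  # hypothetical helper non-terminal name
        # Keep the first symbol and defer the rest to the helper symbol
        rules.append((current, (remaining[0], helper), current_prob))
        current, current_prob = helper, 1.0
        remaining = remaining[1:]
        counter += 1
    rules.append((current, tuple(remaining), current_prob))
    return rules


binary_rules = binarize("S1", ["SUBJ", "QVERB1", "QVERB2", "OBJ"], 1.0)
for rule_lhs, rule_rhs, p in binary_rules:
    print(f"{rule_lhs} --> {' '.join(rule_rhs)}    {p}")
```

Because each helper symbol has exactly one rule, every derivation of the original rule maps to exactly one derivation of the chain with the same probability.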


New rules:
ST1 --> BUT ST1 BUT             0.6
ST1 --> ST1 BUT ST1             0.4


New rules:
SUBJ --> ARTICLE DESC NOUN     1.0
DESC --> ADJECTIVE DESC1       0.7
DESC1 --> ADJECTIVE DESC       0.3
OBJ --> ARTICLE DESC NOUN      1.0
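
In a probabilistic grammar, the probabilities of all rules that share a left-hand side must sum to 1. The sketch below is a quick sanity check of that property, applied to the four rules above exactly as written. The tuple encoding of rules and the function name are assumptions made for illustration; any left-hand side it reports deserves a second look.

```python
from collections import defaultdict


def probability_totals(rules):
    """Sum rule probabilities per left-hand side."""
    totals = defaultdict(float)
    for lhs, _rhs, prob in rules:
        totals[lhs] += prob
    return dict(totals)


# The four rules listed above, encoded as (lhs, rhs, probability)
rules = [
    ("SUBJ", ("ARTICLE", "DESC", "NOUN"), 1.0),
    ("DESC", ("ADJECTIVE", "DESC1"), 0.7),
    ("DESC1", ("ADJECTIVE", "DESC"), 0.3),
    ("OBJ", ("ARTICLE", "DESC", "NOUN"), 1.0),
]

totals = probability_totals(rules)
# Left-hand sides whose probabilities do not sum to 1 (within rounding error)
unnormalized = {lhs: t for lhs, t in totals.items() if abs(t - 1.0) > 1e-9}
print(unnormalized)
```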

The final Chomsky normal form grammar is as follows:

S --> STMANY                    1.0
STMANY --> S1 .                 0.6
STMANY --> S1 , ST1             0.4
ST1 --> BUT ST1                 0.6
ST1 --> ST1 BUT                 0.4
S1 --> SUBJ QVERB1 QVERB2 OBJ   1.0
SUBJ --> ARTICLE DESC NOUN      1.0
DESC --> ADJECTIVE DESC1        0.7
DESC1 --> ADJECTIVE DESC        0.3
OBJ --> ARTICLE DESC NOUN       1.0
ARTICLE --> A                   0.6
ARTICLE --> THE                 0.4
QVERB1 --> CAN                  0.2
QVERB1 --> WILL                 0.5
QVERB1 --> MAY                  0.3
QVERB2 --> EXPLAIN              0.4
QVERB2 --> HELP                 0.2
QVERB2 --> ANSWER               0.4
ADJECTIVE --> WISE              0.3
ADJECTIVE --> FRIENDLY          0.5
ADJECTIVE --> INSIGHTFUL        0.2
NOUN --> CAT                    0.7
NOUN --> DOG                    0.2
NOUN --> FOX                    0.1

This Chomsky normal form grammar generates the same list of sentences
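
Chomsky normal form permits only two rule shapes: `A --> B C` with two non-terminals, and `A --> a` with a single terminal. A rough sketch of a shape checker is given below and run on a few rules copied from the list above. The rule encoding and the function name are illustrative assumptions, and the sketch ignores the optional empty-string rule for the start symbol.

```python
def cnf_violations(rules, nonterminals):
    """Return the (lhs, rhs) pairs that are neither A --> B C nor A --> a."""
    violations = []
    for lhs, rhs, _prob in rules:
        if len(rhs) == 1 and rhs[0] not in nonterminals:
            continue  # A --> a: a single terminal
        if len(rhs) == 2 and all(sym in nonterminals for sym in rhs):
            continue  # A --> B C: exactly two non-terminals
        violations.append((lhs, rhs))
    return violations


# A few rules copied from the list above, encoded as (lhs, rhs, probability)
sample_rules = [
    ("S", ("STMANY",), 1.0),
    ("STMANY", ("S1", "."), 0.6),
    ("S1", ("SUBJ", "QVERB1", "QVERB2", "OBJ"), 1.0),
    ("ARTICLE", ("A",), 0.6),
    ("NOUN", ("CAT",), 0.7),
]
# Treat every symbol that appears on a left-hand side as a non-terminal
nonterminals = {lhs for lhs, _rhs, _prob in sample_rules}

flagged = cnf_violations(sample_rules, nonterminals)
for lhs, rhs in flagged:
    print(f"not in CNF: {lhs} --> {' '.join(rhs)}")
```

Rules flagged by such a check would need a further step: a unit rule is removed by substituting the rules of its right-hand symbol, a terminal mixed with non-terminals is replaced by a new non-terminal that rewrites to it, and a long right-hand side is split into binary rules.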