# Hidden Markov Model

### Importing Libraries

In [3]:
import numpy as np
import pandas as pd 
import seaborn as sns

In [8]:
from matplotlib import pyplot as plt
from sklearn.model_selection import GroupShuffleSplit
from hmmlearn import hmm
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

### Importing Data

In [9]:
dataset = pd.read_csv("NER dataset.csv", encoding='latin1')
# Forward-fill missing values; DataFrame.ffill() replaces the deprecated fillna(method="ffill")
dataset = dataset.ffill()
dataset = dataset.rename(columns={'Sentence #': 'sentence'})
dataset.head(5)

|   | sentence | Word | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS | O |
| 1 | Sentence: 1 | of | IN | O |
| 2 | Sentence: 1 | demonstrators | NNS | O |
| 3 | Sentence: 1 | have | VBP | O |
| 4 | Sentence: 1 | marched | VBN | O |


In [10]:
tags = list(set(dataset.POS.values))
words = list(set(dataset.Word.values))
len(tags), len(words)

(42, 35178)

A plain `train_test_split` is not suitable here: it splits row by row, so the words of a single sentence could end up partly in the training set and partly in the test set. We use **GroupShuffleSplit** instead, grouping the rows by sentence so that each sentence stays entirely on one side of the split.
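To make the difference concrete, here is a minimal sketch on toy data. The `toy` frame and the `shared_sentences` helper are hypothetical and are not part of the notebook. With four sentences of three tokens each and four rows going to the test set, a row-level split has to cut at least one sentence in two, whereas a group-level split cannot.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit, train_test_split

# Toy data (hypothetical): 4 sentences of 3 tokens each, 12 rows in total.
toy = pd.DataFrame({
    "sentence": [f"Sentence: {i}" for i in range(1, 5) for _ in range(3)],
    "Word": list("abcdefghijkl"),
})


def shared_sentences(train, test):
    """Return the sentence ids that occur in both subsets."""
    return set(train["sentence"]) & set(test["sentence"])


# Row-level split: 4 of the 12 rows go to the test set, which is not a
# multiple of 3, so at least one sentence ends up on both sides.
row_train, row_test = train_test_split(toy, test_size=0.33, random_state=42)
leaked_rows = shared_sentences(row_train, row_test)

# Group-level split: each sentence is assigned to exactly one side.
gss = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=42)
ix_train, ix_test = next(gss.split(toy, groups=toy["sentence"]))
leaked_groups = shared_sentences(toy.iloc[ix_train], toy.iloc[ix_test])

print("sentences shared after row-level split:", len(leaked_rows))
print("sentences shared after group-level split:", len(leaked_groups))
```

The same `shared_sentences` check could be run on the real training and test frames to confirm that no sentence appears in both.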

In [11]:
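# The target is the POS tag; the remaining columns are the features.
y = dataset.POS
X = dataset.drop('POS', axis=1)

# Group by sentence so that no sentence is divided between train and test.
groupshufflesplit = GroupShuffleSplit(n_splits=2, test_size=.33, random_state=42)

# Only the first of the generated splits is used.
ix_train, ix_test = next(groupshufflesplit.split(X, y, groups=dataset['sentence']))

# split() yields positional indices, so select the rows with iloc.
dataset_train = dataset.iloc[ix_train]
dataset_test = dataset.iloc[ix_test]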

dataset_train

|   | sentence | Word | POS | Tag |
|---|---|---|---|---|
| 24 | Sentence: 2 | Families | NNS | O |
| 25 | Sentence: 2 | of | IN | O |
| 26 | Sentence: 2 | soldiers | NNS | O |
| 27 | Sentence: 2 | killed | VBN | O |
| 28 | Sentence: 2 | in | IN | O |
| ... | ... | ... | ... | ... |
| 1048570 | Sentence: 47959 | they | PRP | O |
| 1048571 | Sentence: 47959 | responded | VBD | O |
| 1048572 | Sentence: 47959 | to | TO | O |
| 1048573 | Sentence: 47959 | the | DT | O |


A look at the split data suggests that everything is in order.
Next, check the *number of tags and words* in the training set.

In [12]:
tags = list(set(dataset_train.POS.values))
words = list(set(dataset_train.Word.values))
len(tags), len(words)

(42, 29587)
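All 42 tags are still present, but the vocabulary shrinks from 35178 distinct words in the full dataset to 29587 in the training set, so 5591 words occur only in test sentences. Any model indexed on the training vocabulary needs some way to handle them. The sketch below shows one possible approach on toy data: build a word-to-id mapping from the training words and send unseen words to a reserved placeholder id. The token lists, the `UNKNOWN` placeholder and the `word2id` name are assumptions made for illustration and do not come from the notebook.

```python
# Toy token lists (hypothetical); in the notebook these would come from
# dataset_train.Word and dataset_test.Word.
train_words = ["the", "cat", "sat", "the", "mat"]
test_words = ["the", "dog", "sat"]

UNKNOWN = "UNKNOWN"  # hypothetical placeholder token for unseen words

# sorted() gives a reproducible ordering; the order of list(set(...)) over
# strings can vary between interpreter sessions.
vocab = sorted(set(train_words))
word2id = {word: i for i, word in enumerate(vocab)}
word2id[UNKNOWN] = len(word2id)

# Words that occur only in the test split.
unseen = sorted(set(test_words) - set(train_words))

# Encode the test tokens, falling back to the placeholder id.
test_ids = [word2id.get(word, word2id[UNKNOWN]) for word in test_words]

print(unseen)    # ['dog']
print(test_ids)  # [3, 4, 2]
```

Reserving a single id for all unseen words keeps the observation alphabet fixed at training time, at the cost of treating every unknown word identically.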

.... Under construction