Hidden markov model implementation for generating text prediction for text corpus player line of Shakespere dataset using forward-backward approach.
The overall idea is to predict words based on markov chain with conditional independence. The word to be generated next is based on the previous word generated.

Input dataset : https://www.kaggle.com/kingburrito666/shakespeare-plays

Step 1: tokenization on text corpus which generates multiple tokens which is nothing but word separation of text.

In [1]:
#importing necessary libraries

import pandas as pd
import numpy as np

In [2]:
#Reading file Shakespeare_data
Path = r'C:\Users\Dhwani\Desktop\ML\HW2\Shakespeare_data\Shakespeare_data.csv'
ShakespeareDf = pd.read_csv(Path)
ShakespeareDf.head()

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"


In [3]:
ShakespeareDf.isna().sum()

Dataline               0
Play                   0
PlayerLinenumber       3
ActSceneLine        6243
Player                 7
PlayerLine             0
dtype: int64

In [4]:
ShakespeareDf=ShakespeareDf.dropna()
ShakespeareDf.isna().sum()

Dataline            0
Play                0
PlayerLinenumber    0
ActSceneLine        0
Player              0
PlayerLine          0
dtype: int64

In [5]:
#importing natural language tool kit for natural language processing
import nltk
import string

tokens={}
#this function tokenizes a sentence into number of words.
def tokenize(text):
    sentence = text.rstrip().lower()
    return sentence.translate(str.maketrans('','', string.punctuation)).split() 

In [6]:
# Doing text generation with only 20 lines of playerline
countDf = ShakespeareDf['PlayerLine'].count()
playerLine = {}
playerLine = ShakespeareDf.iloc[0:20]['PlayerLine']


In [7]:
#input data to model
for row in playerLine:
    print(row)

So shaken as we are, so wan with care,
Find we a time for frighted peace to pant,
And breathe short-winded accents of new broils
To be commenced in strands afar remote.
No more the thirsty entrance of this soil
Shall daub her lips with her own children blood,
Nor more shall trenching war channel her fields,
Nor bruise her flowerets with the armed hoofs
Of hostile paces: those opposed eyes,
Which, like the meteors of a troubled heaven,
All of one nature, of one substance bred,
Did lately meet in the intestine shock
And furious close of civil butchery
Shall now, in mutual well-beseeming ranks,
March all one way and be no more opposed
Against acquaintance, kindred and allies:
The edge of war, like an ill-sheathed knife,
No more shall cut his master. Therefore, friends,
As far as to the sepulchre of Christ,
Whose soldier now, under whose blessed cross


In [8]:
for row in playerLine:
 tempToken = tokenize(row)
    
 for ind in tempToken:
  print(ind)

so
shaken
as
we
are
so
wan
with
care
find
we
a
time
for
frighted
peace
to
pant
and
breathe
shortwinded
accents
of
new
broils
to
be
commenced
in
strands
afar
remote
no
more
the
thirsty
entrance
of
this
soil
shall
daub
her
lips
with
her
own
children
blood
nor
more
shall
trenching
war
channel
her
fields
nor
bruise
her
flowerets
with
the
armed
hoofs
of
hostile
paces
those
opposed
eyes
which
like
the
meteors
of
a
troubled
heaven
all
of
one
nature
of
one
substance
bred
did
lately
meet
in
the
intestine
shock
and
furious
close
of
civil
butchery
shall
now
in
mutual
wellbeseeming
ranks
march
all
one
way
and
be
no
more
opposed
against
acquaintance
kindred
and
allies
the
edge
of
war
like
an
illsheathed
knife
no
more
shall
cut
his
master
therefore
friends
as
far
as
to
the
sepulchre
of
christ
whose
soldier
now
under
whose
blessed
cross


In [9]:
# Adding key-value pair in dictionary which is prev-word current-word pair
def addingTextInDictionary(dictionary, key, value):
    if key not in dictionary:
        dictionary[key] = []
    dictionary[key].append(value)

In [10]:
tokens = {}
initial_text = {}
transition_text = {}
for row in playerLine: #For each line in playerline
 j=0;
 tempToken = tokenize(row)
 for ind in tempToken:
    tokens[j] = ind;
    j=j+1
 tokens_length = len(tokens)
 for i in range(tokens_length):
    token = tokens[i]
    #Adding first word of line into different dictionary
    if i == 0:
        initial_text[token] = initial_text.get(token, 0) + 1
        continue
    else:
        prev_token = tokens[i - 1]
        
        addingTextInDictionary(transition_text, prev_token, token)
        #For last token value would be END
        if i == tokens_length - 1:
            addingTextInDictionary(transition_text, token, 'END')

Step 2 : Adding probability values with each word

In [11]:
len(initial_text)

16

In [12]:
dictWithProbability = {}

#taking probabilities of initial words seperately
count_initial_text = sum(initial_text.values())
for key,prob in initial_text.items():
    initial_text[key] = prob / count_initial_text

for prev_word,word in transition_text.items():
  dictWithProbability = {}
  total = len(word) #total next words for a token
  for item in word:
   dictWithProbability[item] = dictWithProbability.get(item,0) + 1
  for key,value in dictWithProbability.items():
   dictWithProbability[key] = value / total
  transition_text[prev_word] = dictWithProbability
   
 #print(item)

In [21]:
#Probability distribution of first word of each sentence
initial_text

{'so': 0.05,
 'find': 0.05,
 'and': 0.1,
 'to': 0.05,
 'no': 0.1,
 'shall': 0.1,
 'nor': 0.1,
 'of': 0.05,
 'which': 0.05,
 'all': 0.05,
 'did': 0.05,
 'march': 0.05,
 'against': 0.05,
 'the': 0.05,
 'as': 0.05,
 'whose': 0.05}

In [22]:
#Probability distribution of word based on its previous word of each sentence
#key is previous word and value is projected next words with their probabilities of occurence.
transition_text

{'so': {'shaken': 0.5, 'wan': 0.5},
 'shaken': {'as': 1.0},
 'as': {'we': 0.3333333333333333,
  'far': 0.3333333333333333,
  'to': 0.3333333333333333},
 'we': {'are': 0.5, 'a': 0.5},
 'are': {'so': 1.0},
 'wan': {'with': 1.0},
 'with': {'care': 0.3333333333333333,
  'her': 0.3333333333333333,
  'the': 0.3333333333333333},
 'care': {'END': 1.0},
 'find': {'we': 1.0},
 'a': {'time': 0.5, 'troubled': 0.5},
 'time': {'for': 1.0},
 'for': {'frighted': 1.0},
 'frighted': {'peace': 1.0},
 'peace': {'to': 1.0},
 'to': {'pant': 0.6, 'be': 0.2, 'the': 0.2},
 'pant': {'END': 1.0},
 'and': {'breathe': 0.25, 'furious': 0.25, 'be': 0.25, 'allies': 0.25},
 'breathe': {'shortwinded': 1.0},
 'shortwinded': {'accents': 1.0},
 'accents': {'of': 1.0},
 'of': {'new': 0.1111111111111111,
  'this': 0.1111111111111111,
  'hostile': 0.1111111111111111,
  'a': 0.1111111111111111,
  'one': 0.2222222222222222,
  'civil': 0.1111111111111111,
  'war': 0.1111111111111111,
  'christ': 0.1111111111111111},
 'new': {'b

Step 3 : generating next words based on its probability distribution.

In [15]:
def pick_nextword(dictionary):
    #Picking random number between to 0 to 1 to pick next possible word for current token
    random_number = np.random.random()
    #print(p0)
    summation = 0
    
    for key,value in dictionary.items():
        summation = summation + value
        if random_number < summation:
         return key
    assert(False)

In [20]:
# Let's say I want to pick next word after so
pick_nextword(transition_text.get('so'))

'wan'

In [16]:
def text_prediction():
    for i in range(20):
        sentence = []
        #Prediction of first word of line
        prev_word = pick_nextword(initial_text)
        sentence.append(prev_word)
        
        while True:
            next_word = pick_nextword(transition_text.get(prev_word))
            if next_word == 'END':
                break
            sentence.append(next_word)
            prev_word = next_word
    
        print(' '.join(sentence))

In [17]:
text_prediction()

as to pant
the sepulchre of one nature of this soil pant
march all of a time for frighted peace to the sepulchre of war channel her lips with the meteors of civil butchery shock bred blood
so wan with the thirsty entrance of one nature of hostile paces those opposed
no more opposed
so wan with the sepulchre of civil butchery shock bred blood
and be commenced in the sepulchre of civil butchery shock bred blood
to pant
whose soldier now under whose soldier now under whose soldier now under whose blessed cross christ opposed eyes armed hoofs blood
to pant
shall daub her flowerets with the meteors of a time for frighted peace to be no more shall daub her own children blood
whose soldier now under whose soldier now in the armed hoofs blood
no more shall cut his master therefore friends opposed
whose soldier now under whose soldier now under whose blessed cross christ opposed eyes armed hoofs blood
no more shall daub her lips with her flowerets with care
all one way and allies be commenced i

Conclusion : 
The text is predicted based on the frequency of the previously generated words. This model could be trained based on the forward/backward message propagation.