# Word generation and prediction using Hidden Markov Model

The aim of this workbook is to design an algorithm similar to a hidden Markov model that learns word correlations and distributions from a text corpus in order to:
  1. Generate new text from a given text corpus
  2. Predict text from a given sequence of words

## Importing Libraries

In [20]:
import string 
import numpy as np
import pandas as pd

## Importing Dataset

In [21]:
data = "/content/alllines.txt" ##replace content with data when executing 

## Pre-processing data

Let's strip the punctuation characters from each line of text to get cleaner tokens.

In [22]:
def removeSpecial(line):
  return line.translate(str.maketrans("","",string.punctuation))
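As a quick illustration of the cleaning step, the sketch below applies the same translation-table approach to one sample line. The sample string and the name `remove_special` are illustrative only and are not part of the notebook.

```python
import string

def remove_special(line):
    # Delete every character listed in string.punctuation
    return line.translate(str.maketrans("", "", string.punctuation))

sample = "Tis better, using France than trusting France!"
cleaned = remove_special(sample.lower()).split()
print(cleaned)
# ['tis', 'better', 'using', 'france', 'than', 'trusting', 'france']
```

Note that `string.punctuation` covers ASCII punctuation only, so characters such as curly quotes would pass through unchanged.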

Next, a helper that appends a value to the list stored under a key in a dictionary, creating the list the first time the key is seen


In [23]:
def create_dict(dictionary, key, value):
    if key not in dictionary:
        dictionary[key] = []
    dictionary[key].append(value)
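To make the helper's behaviour concrete, here is a small sketch that groups made-up (previous word, next word) pairs. The pairs are illustrative; the `defaultdict` variant is shown only as an equivalent alternative, not as what the notebook uses.

```python
from collections import defaultdict

def create_dict(dictionary, key, value):
    if key not in dictionary:
        dictionary[key] = []
    dictionary[key].append(value)

pairs = [("the", "bell"), ("the", "state"), ("the", "bell")]

followers = {}
for prev_word, next_word in pairs:
    create_dict(followers, prev_word, next_word)
print(followers)  # {'the': ['bell', 'state', 'bell']}

# The same grouping using defaultdict(list)
grouped = defaultdict(list)
for prev_word, next_word in pairs:
    grouped[prev_word].append(next_word)
```

Repeated values are kept on purpose: the number of times a word appears in the list is what later becomes its probability.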

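Then a helper that turns a list of items into a dictionary of relative frequencies (each item's count divided by the list length)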


In [30]:
def create_probability_dict(text_data):
    prob_dict = {}
    text_data_len = len(text_data)
    for item in text_data:
        prob_dict[item] = prob_dict.get(item, 0) + 1
    for key, value in prob_dict.items():
        prob_dict[key] = value / text_data_len
    return prob_dict
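A short sketch of the conversion from a list of observed next words to probabilities. The input list is made up for illustration, and the function body is a compact restatement of the one above.

```python
def create_probability_dict(text_data):
    counts = {}
    for item in text_data:
        counts[item] = counts.get(item, 0) + 1
    total = len(text_data)
    # Divide each count by the number of observations
    return {key: count / total for key, count in counts.items()}

probs = create_probability_dict(["bell", "state", "bell", "bell"])
print(probs)  # {'bell': 0.75, 'state': 0.25}
```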

Now we need some data structures to hold the initial states and the transition states

In [25]:
initial_word= {}
second_word = {}
transitions = {}

## Building and Training the Markov Model

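One important property of a Markov model is that the **next step** depends only on the **current step** and not on the earlier history.

This project uses the same idea with one change: the current state is the pair of the two most recent words, so each new word depends on the two words before it.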

The training of the Markov model can be divided into the following stages:
1. Cleaning up the data and tokenisation
2. Building the state pairs (previous and current)
3. Determining the probability distribution
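Stage 2 can be sketched on a single tokenized line. This is a minimal illustration, assuming the line has at least two tokens; the variable names are hypothetical, and `'END'` is the end-of-line marker used by the training code.

```python
tokens = "what chance is this".split()

# The first word feeds the initial-word distribution
initial = tokens[0]
# The second word is conditioned on the first word only
second = (tokens[0], tokens[1])

# Every later word is conditioned on the two words before it
pairs = []
for i in range(2, len(tokens)):
    pairs.append(((tokens[i - 2], tokens[i - 1]), tokens[i]))
# The final pair of words is followed by the END marker
pairs.append(((tokens[-2], tokens[-1]), "END"))

for state, next_word in pairs:
    print(state, "->", next_word)
# ('what', 'chance') -> is
# ('chance', 'is') -> this
# ('is', 'this') -> END
```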

In [31]:
# Trains a Markov model on the text file at the path stored in `data`
def build_and_train_markov_model():
    with open(data) as data_file:
        for line in data_file:

            # Tokenize: strip the newline, lowercase, drop punctuation, split on whitespace
            tokens = removeSpecial(line.rstrip().lower()).split()
            tokens_length = len(tokens)

            # Record (previous state, next word) pairs
            for i in range(tokens_length):
                token = tokens[i]

                # The first token only contributes to the initial-word counts
                if i == 0:
                    initial_word[token] = initial_word.get(token, 0) + 1
                else:
                    prev_token = tokens[i - 1]

                    # The last pair of words in a line is followed by the special 'END' token
                    if i == tokens_length - 1:
                        create_dict(transitions, (prev_token, token), 'END')
                    if i == 1:
                        # The second token depends only on the first token
                        create_dict(second_word, prev_token, token)
                    else:
                        # Later tokens depend on the two tokens before them
                        prev_prev_token = tokens[i - 2]
                        create_dict(transitions, (prev_prev_token, prev_token), token)
    
    # Normalize the distributions
    initial_word_total = sum(initial_word.values())
    for key, value in initial_word.items():
        initial_word[key] = value / initial_word_total
        
    for prev_word, next_word_list in second_word.items():
        second_word[prev_word] = create_probability_dict(next_word_list)
        
    for word_pair, next_word_list in transitions.items():
        transitions[word_pair] = create_probability_dict(next_word_list)
    
    print('Building and Training finished')

In [32]:
build_and_train_markov_model()

Building and Training finished
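To see what the three trained distributions look like, here is a self-contained sketch of the same counting scheme applied to a three-line toy corpus. It is not the notebook's function: the names `train` and `normalize` are hypothetical, it takes a list of strings instead of a file, and it uses `Counter` in place of the list-based helpers.

```python
import string
from collections import Counter, defaultdict

def normalize(counter):
    # Convert raw counts into probabilities
    total = sum(counter.values())
    return {word: count / total for word, count in counter.items()}

def train(lines):
    initial = Counter()
    second = defaultdict(Counter)
    transitions = defaultdict(Counter)
    table = str.maketrans("", "", string.punctuation)
    for line in lines:
        tokens = line.lower().translate(table).split()
        if not tokens:
            continue
        initial[tokens[0]] += 1
        if len(tokens) == 1:
            continue  # a one-word line has no second word or transition
        second[tokens[0]][tokens[1]] += 1
        for i in range(2, len(tokens)):
            transitions[(tokens[i - 2], tokens[i - 1])][tokens[i]] += 1
        # The last pair of words is followed by the END marker
        transitions[(tokens[-2], tokens[-1])]["END"] += 1
    return (normalize(initial),
            {key: normalize(value) for key, value in second.items()},
            {key: normalize(value) for key, value in transitions.items()})

corpus = ["What chance is this?", "What you say", "The bell"]
initial, second, transitions = train(corpus)
print(initial)
print(second["what"])                   # {'chance': 0.5, 'you': 0.5}
print(transitions[("what", "chance")])  # {'is': 1.0}
```

Returning the dictionaries instead of filling module-level ones also makes it safe to call the function more than once.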


## 1. Generating new text from the corpus using the **built Hidden Markov Model**

Once training is complete, we have the initial-word distribution, the second-word distribution, and the state-transition distributions. To generate text, all we need is a function that samples from these distributions.

In [33]:
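def sample_word(dictionary):
    # Inverse-CDF sampling: draw p0 in [0, 1) and return the first key
    # whose cumulative probability exceeds it
    p0 = np.random.random()
    cumulative = 0
    for key, value in dictionary.items():
        cumulative += value
        if p0 < cumulative:
            return key
    # Floating-point rounding can leave the total slightly below 1;
    # fall back to the last key instead of failing
    return key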
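A quick way to check a sampler like this is to draw many samples and compare the observed frequencies with the target probabilities. The sketch below does that on a made-up distribution; the seed, the sample size, and the distribution are arbitrary choices, and this copy of `sample_word` returns the last key if rounding leaves the cumulative sum below the draw.

```python
from collections import Counter

import numpy as np

def sample_word(dictionary):
    p0 = np.random.random()
    cumulative = 0.0
    for key, value in dictionary.items():
        cumulative += value
        if p0 < cumulative:
            return key
    return key  # only reached if rounding leaves the sum below p0

np.random.seed(0)  # fixed seed so the run is repeatable
dist = {"a": 0.7, "b": 0.2, "c": 0.1}
n_samples = 10_000
counts = Counter(sample_word(dist) for _ in range(n_samples))
freqs = {word: counts[word] / n_samples for word in dist}
print(freqs)  # each frequency should be close to its target probability
```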

In [34]:
# Number of sentences to generate
number_of_sentences = 12

In [35]:
def generate_text():
    for i in range(number_of_sentences):
        sentence = []
        # Initial word
        word0 = sample_word(initial_word)
        sentence.append(word0)
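        # Second word (a word seen only in one-word lines has no recorded successor)
        if word0 not in second_word:
            print(word0)
            continue
        word1 = sample_word(second_word[word0])
        sentence.append(word1)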
        # Subsequent words until END
        while True:
            word2 = sample_word(transitions[(word0, word1)])
            if word2 == 'END':
                break
            sentence.append(word2)
            word0 = word1
            word1 = word2
        print(' '.join(sentence))

In [36]:
generate_text()

thou art a villain
scene i london an antechamber in the mercy of the state some service and they have bought the cottage pasture and the complexion of a pure blush thou mayst
what you say shes dead
scene iv rome philarios house
the bell
or what you say well but thou didst abuse
you promised knighthood to our presence
traitors away he rests not in hate
tis better using france than trusting france
o serpent heart hid with a proclamation that you asked her
what chance is this
under your sentence


## 2. Performing text prediction given a sequence of words


In [38]:
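# max_words is an added safeguard: greedy selection can revisit the same
# word pair and loop forever, so the completion is capped
def text_prediction(text, max_words=50):
    text = removeSpecial(text.lower()).split()
    # With a single word, sample the second word first
    if len(text) == 1:
        text.append(sample_word(second_word[text[0]]))
    # Predict from the last two words of the prompt
    word0, word1 = text[-2], text[-1]
    # Append the most probable next word until END
    for _ in range(max_words):
        next_words = transitions.get((word0, word1))
        # Stop on a word pair that never occurred in the corpus
        if next_words is None:
            break
        word2 = max(next_words, key=next_words.get)
        if word2 == 'END':
            break
        text.append(word2)
        word0 = word1
        word1 = word2
    print(' '.join(text))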

In [39]:
#Testing
text_prediction("Whose arms")

whose arms were moulded in their own


In [40]:
text_prediction("Of hostile")

of hostile paces those opposed eyes
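The prediction step picks the single most probable next word instead of sampling. A minimal sketch of that selection on a made-up distribution (the words and probabilities are illustrative and not taken from the trained model):

```python
next_word_probs = {"were": 0.5, "are": 0.3, "END": 0.2}

# max() iterates over the keys and ranks them by their probability
best = max(next_word_probs, key=next_word_probs.get)
print(best)  # were

# On a tie, max() returns the first maximal key in insertion order
tied = {"are": 0.5, "were": 0.5}
print(max(tied, key=tied.get))  # are
```

Because this choice is deterministic, the same prompt always yields the same completion, whereas sampling from the distribution would give a different continuation on each run.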
