## Text Generation with Markov Chains

In [57]:
# function to generate a table from a text
# the table has every k-character string in the text as a key; the value for each key is
# another dictionary
# this dictionary has as keys all the characters that follow the k-character string in the text
# and its values are the number of times each character follows it

def generate_table(data, k=4):
    
    table = {}
    
    for i in range (len(data)-k):
        
        X = data[i:i+k]
        Y = data[i+k]
        
        if table.get(X) is None:
            
            table[X] = {}
            table[X][Y] = 1
        
        else:
            if table[X].get(Y) is None:
                
                table[X][Y] = 1
            
            else:
                
                table[X][Y] = table[X][Y]+1
                
        
    return table 
                    
    
    

In [58]:
# example of what the function generate_table does

T = generate_table("hello hello helli",4)
print(T)

{'hell': {'o': 2, 'i': 1}, 'ello': {' ': 2}, 'llo ': {'h': 2}, 'lo h': {'e': 2}, 'o he': {'l': 2}, ' hel': {'l': 2}}
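
The same counts can be built more compactly with the standard library. The sketch below is an alternative, not the notebook's implementation: `count_transitions` is a hypothetical name, and it assumes the same sliding-window definition as `generate_table`.

```python
from collections import Counter, defaultdict

def count_transitions(data, k=4):
    # hypothetical helper: maps each k-character context to a Counter
    # of the characters that follow it
    table = defaultdict(Counter)
    for i in range(len(data) - k):
        table[data[i:i + k]][data[i + k]] += 1
    # convert to plain dicts so the result has the same shape as generate_table's output
    return {context: dict(counts) for context, counts in table.items()}

counts = count_transitions("hello hello helli", 4)
print(counts["hell"])
```

A missing context or character starts at zero automatically, which removes the nested `is None` checks.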


In [59]:
# function to create a table of probabilities from a table of counts
# the function goes over each key in the table
# for each key it sums the counts of all the characters that follow it
# then it divides the count of each following character by
# that total
# this gives the probability of each character following a given k-length string
# note: the table is modified in place and the same object is returned

def table_to_probs(table):
    
    for key in table.keys():
        
        suma = 0
        
        for value in table[key].values():
            
            suma = suma + value
        
        for key2 in table[key].keys():
            
            table[key][key2] = table[key][key2]/suma
            
    return table

In [60]:
# example to show the output of the table_to_probs function

t = table_to_probs(T)
print(t)

{'hell': {'o': 0.6666666666666666, 'i': 0.3333333333333333}, 'ello': {' ': 1.0}, 'llo ': {'h': 1.0}, 'lo h': {'e': 1.0}, 'o he': {'l': 1.0}, ' hel': {'l': 1.0}}
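
Because `table_to_probs` overwrites the values of the table it receives, `T` and `t` refer to the same dictionary after the cell above and the raw counts are no longer available. If the counts are still needed, a non-mutating variant is one option. This is a minimal sketch; `normalize_counts` is a hypothetical name, not part of the notebook.

```python
def normalize_counts(table):
    # hypothetical non-mutating variant: builds new dicts and leaves the input counts intact
    probs = {}
    for context, followers in table.items():
        total = sum(followers.values())
        probs[context] = {char: count / total for char, count in followers.items()}
    return probs

counts = {"hell": {"o": 2, "i": 1}, "ello": {" ": 2}}
probs = normalize_counts(counts)
print(counts["hell"])  # the original counts are unchanged
print(abs(sum(probs["hell"].values()) - 1.0) < 1e-9)  # each row sums to 1
```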


In [61]:
# read the data

text_path = "train_corpus.txt"
def load_text(filename):
    with open(filename,encoding='utf8') as f:
        return f.read().lower()
    
text = load_text(text_path)
print('Loaded the dataset.')

Loaded the dataset.


In [62]:
# function to create a markov chain
# this function will produce a table of probabilities from a text


def markov_chain(text):
    
    table = generate_table(text)
    
    t = table_to_probs(table)
    
    return t

In [70]:
# build a table of probabilities named model

model = markov_chain(text)

#print(model)

In [64]:
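# number of distinct k-character strings in the model
# (table is local to markov_chain, so the global name to use here is model)
print(len(model))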

1936


In [65]:
import numpy as np

# function to return the next character for a starting sentence
# it uses only the last k characters of the sentence, looks up
# all the characters that can follow them and their probabilities,
# then samples one character at random according to those
# probabilities
# if the last k characters never appear in the training text, it returns a space

def sample_next(context, model, k):
    
    context = context[-k:]
    
    if model.get(context) is None:
        return " "
    
    possible_chars = list(model[context].keys())
    possible_values = list(model[context].values())
    
    #print(possible_chars)
    #print(possible_values)
    
    return np.random.choice(possible_chars, p = possible_values)

In [66]:
# example of what the sample_next function does

sample_next("commo",model,4)

'n'
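
To check that weighted sampling behaves as intended, one can draw many characters from a single context and compare the observed frequencies with the stored probabilities. The sketch below uses the standard library's `random.choices` as a stand-in for `np.random.choice`; the helper name `sample_char`, the seed, and the number of draws are assumptions made for illustration.

```python
import random
from collections import Counter

def sample_char(followers, rng):
    # followers maps each candidate character to its probability
    chars = list(followers.keys())
    weights = list(followers.values())
    return rng.choices(chars, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed, chosen arbitrarily, so the run is repeatable
followers = {"o": 2 / 3, "i": 1 / 3}
draws = Counter(sample_char(followers, rng) for _ in range(10000))
print(draws["o"] / 10000)  # should be close to 2/3
```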

In [67]:
# function to generate text based on a start sentence
# it generates the character that follows the starting sentence,
# appends that predicted character to the sentence,
# and keeps generating characters this way until max_len characters have been added
# note: it reads the global variable model built above

def generate_text(start, k = 4, max_len = 1000):
    
    res = start
    
    start = start[-k:]
    
    for i in range(max_len):
        
        char = sample_next(start, model, k)
        
        res = res+char
        
        start = res[-k:]
        
    return res
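
The generation loop can also be written so that the model is passed in explicitly and the random seed is fixed, which makes it easier to test on a small example. This is a sketch under those assumptions, not the notebook's function: `generate_with_model` and `toy_model` are hypothetical names, and `random.choices` stands in for `np.random.choice`.

```python
import random

def generate_with_model(start, model, k=4, max_len=50, seed=None):
    # hypothetical variant of generate_text: the model is a parameter, not a global
    rng = random.Random(seed)
    res = start
    for _ in range(max_len):
        followers = model.get(res[-k:])
        if followers is None:
            # same fallback as sample_next: emit a space for an unseen context
            res += " "
            continue
        chars = list(followers)
        weights = list(followers.values())
        res += rng.choices(chars, weights=weights, k=1)[0]
    return res

# toy model written out by hand from the "hello hello helli" example (k = 4)
toy_model = {
    "hell": {"o": 2 / 3, "i": 1 / 3},
    "ello": {" ": 1.0},
    "llo ": {"h": 1.0},
    "lo h": {"e": 1.0},
    "o he": {"l": 1.0},
    " hel": {"l": 1.0},
}
sample = generate_with_model("hell", toy_model, k=4, max_len=20, seed=1)
print(sample)
```

With this toy model, once "helli" is produced every later context is unseen, so the rest of the output is spaces. The same thing happens in the full model whenever the text reaches a k-character string that never occurs in the training corpus.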
    

In [68]:
# use a new name so the training corpus stored in text is not overwritten
generated = generate_text("dear", k=4, max_len=2000)

print(generated)

dear countrymen, their police of the boundaries who is protect their sacrifice freedom, i bow down today, i bow down today is going this festival of tribal children living in any countrymen hanging in a sensitive the country. i heartily great men today is going frames and gives, our heroes of oppression of our parliament have lost tricolor flag by gives their sacrifice to protect the coming the countrymen, on the commission of independence and sacrifice of our country is not in our countrymen, on these standing for their great series of independence, when our police more for the has registered forces, have in the country.

in our parliament to be oppressed? jalianwala bapu. many good wishes of the soldiers of this years, flood rajya sabha and in order to the glory of the series who have been seas with who have jubilee prisons of the countrymen hanging confidence of the festival of the tribal children lives in such a positivity and happiness, flood rainfall the service, for year, on eve

In [None]:
# the words may sometimes seem unrelated to each other, because the model only stores
# which character tends to follow each k-character string, not the meaning of the text