# Text predictions with Markov Chains v2
Max Fierro 8/15/2022

### Motivation
After revisiting this little exercise some months later, I found some improvements and some different perspectives on the problem. Namely, I realized that creating these 'pair links' is essentially creating a directed graph where each word is a vertex and each pair of consecutive words is joined by an edge. This gives a natural explanation for the cyclic behavior we were seeing with the deterministic chains: when two words point to each other as their most likely successor, always following the highest-probability edge traps the chain in a cycle.

A quick Google search confirms that Markov chains can indeed be viewed as directed graphs, where each node represents some state and each edge a possible transition. Who would have known, huh? This was truly staring me right in the face, as in the last part of this exercise I attempted to use what is basically an adjacency matrix to represent these relationships. This time, I will use an adjacency dictionary of type `String : Array<String>` (in Python, a `dict` mapping each word to a `list` of words) to represent this digraph, and hopefully get some cooler results by using more, uh, 'insightful' algorithms.
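
To make the cycle argument concrete, here is a minimal sketch on a made-up four-word graph. The names `toy_digraph`, `most_likely_next` and `greedy_walk` are hypothetical and only mimic the deterministic chain from the previous exercise; they are not the original code.

```python
# Toy adjacency dictionary (made-up data): word -> list of observed next words.
toy_digraph = {
    "the": ["river", "river", "raft"],
    "river": ["the", "the", "was"],
    "raft": ["was"],
    "was": ["the"],
}

def most_likely_next(graph, word):
    """Return the successor that appears most often after `word`."""
    successors = graph[word]
    return max(set(successors), key=successors.count)

def greedy_walk(graph, start, steps):
    """Always follow the highest-probability edge (a deterministic chain)."""
    path = [start]
    for _ in range(steps):
        path.append(most_likely_next(graph, path[-1]))
    return path

walk = greedy_walk(toy_digraph, "the", 6)
print(" ".join(walk))  # the river the river the river the
```

Since "the" and "river" are each other's most frequent successor, the greedy walk never leaves that two-word cycle, no matter how many steps it takes.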

In [30]:
import random
import time

Of course, the first job will be to tokenize the text of our favorite book of questionable language. This is fairly self-explanatory:

In [31]:
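def read_words(file, unwanted, chars=None):
    # Read at most `chars` characters (the whole file if None) and split on whitespace.
    with open(file) as doc:
        data = doc.read(chars).split()
    words = []
    for word in data:
        # Strip every unwanted character, then normalize case.
        for item in unwanted:
            word = word.replace(item, '')
        words.append(word.lower())
    return words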

unwanted = ['"', '^', '*', '-', '_', '/', '[', ']', '<', '>', '~', ',']

words = read_words('data/huck_finn.txt', unwanted, 1_000_000)
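
One side effect of this cleaning is worth seeing on a small input. The sketch below applies the same steps to an in-memory string; `clean_words` and the sample sentence are hypothetical stand-ins, not part of the notebook or the book. Deleting `-` outright glues hyphenated or dash-joined words together, which is presumably where fused tokens like `shoutand` and `lightningrod` in the generated text further down come from, and a token made only of unwanted characters turns into an empty string.

```python
def clean_words(text, unwanted):
    """Same cleaning steps as read_words, applied to a string instead of a file."""
    words = []
    for word in text.split():
        for item in unwanted:
            word = word.replace(item, "")
        words.append(word.lower())
    return words

unwanted = ['"', '^', '*', '-', '_', '/', '[', ']', '<', '>', '~', ',']

# Made-up sample sentence, chosen to exercise the edge cases.
sample = 'a shout--and then -- the lightning-rod fell, "Huck!"'
tokens = clean_words(sample, unwanted)
print(tokens)
# ['a', 'shoutand', 'then', '', 'the', 'lightningrod', 'fell', 'huck!']

# Optional: drop empty tokens before building the graph.
non_empty = [token for token in tokens if token]
print(non_empty)
```

Whether to filter the empty tokens, or to replace dashes with spaces instead of deleting them, is a design choice; either one changes which edges end up in the digraph.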

Then, we construct a digraph where, instead of weighting the edges, we simply keep repeats in the list that each key maps to. This way, picking a uniformly random element of that list yields each next word with probability proportional to how often it followed the key in the text:

In [32]:

digraph = {}

for index in range(len(words) - 1):
    
    current_word = words[index]
    next_word = words[index + 1]

    if current_word not in digraph:
        digraph[current_word] = []
    
    digraph[current_word].append(next_word)
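
As a sanity check that repeats really act as edge weights, here is a small sketch on a made-up word list (`toy_words` is hypothetical and merely stands in for the tokenized book). Counting the repeats recovers the transition probabilities, and sampling with `random.choice` follows them approximately.

```python
import random
from collections import Counter

# Made-up word list, standing in for the tokenized book.
toy_words = "the raft and the river and the raft".split()

# Same construction as above, written with zip and setdefault.
toy_digraph = {}
for current_word, next_word in zip(toy_words, toy_words[1:]):
    toy_digraph.setdefault(current_word, []).append(next_word)

# Repeats in the list play the role of edge weights.
weights = Counter(toy_digraph["the"])
total = sum(weights.values())
probabilities = {word: count / total for word, count in weights.items()}
print(probabilities)  # "raft" gets 2/3, "river" gets 1/3

# Uniform sampling from the list should respect those probabilities on average.
rng = random.Random(0)
samples = Counter(rng.choice(toy_digraph["the"]) for _ in range(3000))
print(samples)
```

The trade-off is memory: a very common word stores one list entry per occurrence, whereas a weighted representation would store one count per distinct successor.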

Right. Last time I had the right idea with making the individual links, but I was basically reinventing the wheel (but in a dumb way, so more like the square wheel). Now things are much simpler:

In [33]:
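# Start from any word except the very last one, which may have no successor.
current_word = random.choice(words[:-1])
prediction_length = 100
generated = []

start = time.time()

while prediction_length > 0:
    generated.append(current_word)
    # The last word of the text can be a dead end (no outgoing edges);
    # in that case, restart from a random word instead of raising a KeyError.
    successors = digraph.get(current_word) or words[:-1]
    current_word = random.choice(successors)
    prediction_length -= 1

finish = time.time()
time_elapsed = finish - start

prediction = " ".join(generated)
print(prediction + "\n\n Calculated in: " + str(time_elapsed))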

protection; thought of the right along. he could reform the robbers; but you want of my whole crowd the early in my lessons good. he says: i slip over his name the raft again to the crick amongst the bills said: chapter xiv. by and tell me all right then; at him and six foot a dime and change or a shoutand then broke bekase why: would take the king and cave opened his pardon. she has cut loose himself; but there was worth of the truck in front of the lightningrod and feeling pretty dull. buck looked sorrowful and 

 Calculated in: 0.00028586387634277344


That is the kind of performance I was expecting from the matrix representation I made in the last exercise. After a single preprocessing pass over the text to build the digraph, each generated word costs one dictionary lookup and one `random.choice`, so predictions take time linear in the number of words we want, which is very noice.
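
The linear-time behavior is easiest to see when the walk is wrapped in a function. The sketch below is one possible way to do that, not the notebook's own code: `generate` and the toy data are hypothetical, and it stops early at a word with no successors, which is an assumption about how dead ends should be handled.

```python
import random

def generate(graph, start, length, rng=random):
    """Random walk of at most `length` words, stopping early at a dead end."""
    output = []
    current = start
    for _ in range(length):
        output.append(current)
        successors = graph.get(current)  # one dictionary lookup per word
        if not successors:
            break  # no outgoing edges: nothing to sample from
        current = rng.choice(successors)  # one constant-time choice per word
    return output

# Made-up word list; "drifted" appears only at the end, so it is a dead end.
toy_words = "the raft and the river and the raft drifted".split()
toy_digraph = {}
for a, b in zip(toy_words, toy_words[1:]):
    toy_digraph.setdefault(a, []).append(b)

walk = generate(toy_digraph, "the", 20, random.Random(1))

# Every consecutive pair in the output must be an edge of the digraph.
assert all(b in toy_digraph[a] for a, b in zip(walk, walk[1:]))
print(" ".join(walk))
```

Passing a seeded `random.Random` instance makes a run reproducible, which helps when comparing outputs across changes to the tokenizer.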

Overall, this solution is more elegant and correct than my last one, so I will call it a day once more. 

Improvement, people, improvement.