# Make Thy Own Markov Chains

With all the math thrown at you, you may think that building a Markov chain would be extremely painful. In actuality, it is simpler than you think, and there are MULTIPLE ways to do it. We are going to implement an autocomplete program for Shakespeare's works. Let us walk through the process of doing so.

## Act I: Setting Up The Scene

This should be a given, but we need to import the packages required to get this to work. Before running the imports, make sure to set up the environment using the requirements file. To install the necessary packages, run the following command in the terminal.

pip install -r requirements.txt

Afterwards, run the following block of code to ensure everything is installed properly.

In [1]:
import os
from collections import defaultdict
import random
import numpy as np

Confirm that you are in the right directory by running the code block below. It should end with 'shakespeare-generator'.

In [2]:
print(os.getcwd())

C:\Users\randy\Productivity\shakespeare-generator


Now let's double-check to see if all of our data is there.

In [3]:
os.listdir('data')

['1henryiv.txt',
 '1henryvi.txt',
 '2henryiv.txt',
 '2henryvi.txt',
 '3henryvi.txt',
 'allswell.txt',
 'asyoulikeit.txt',
 'cleopatra.txt',
 'comedy_errors.txt',
 'coriolanus.txt',
 'cymbeline.txt',
 'hamlet.txt',
 'henryv.txt',
 'henryviii.txt',
 'john.txt',
 'julius_caesar.txt',
 'lear.txt',
 'lll.txt',
 'macbeth.txt',
 'measure.txt',
 'merchant.txt',
 'merry_wives.txt',
 'midsummer.txt',
 'much_ado.txt',
 'othello.txt',
 'pericles.txt',
 'richardii.txt',
 'richardiii.txt',
 'romeo_juliet.txt',
 'taming_shrew.txt',
 'tempest.txt',
 'timon.txt',
 'titus.txt',
 'troilus_cressida.txt',
 'twelfth_night.txt',
 'two_gentlemen.txt',
 'winters_tale.txt']

oh

There is no data.

Let's quickly learn how to collect the data we would need in the first place.

### Aside: Scrape Thy Data From the Hands of Shakespeare

In order to do this, we will go back to the presentation and then head over to scraper.py to test your knowledge.

Now we shall check our data's presence once again.

In [None]:
os.listdir('data')

Hurrah! Our data directory is plentiful. Now to begin the Markov chain creation process.

## Act II: Sentencing Thy Text Into Chains

The code block below sets the stage in order to place the sentences.

In [4]:
token_counter = defaultdict(int)
sentences = list()

n = 10 # Feel free to play around with this value.

**_TODO_**: _We need a way to extract the sentences from all the text given to us. Let us be bold and do so FROM SCRATCH! But first we need to set some requirements/steps in order for us to be on the same page._

1. Iterate through every text file in our newly created data folder.
2. Open each text file and read all the text at once.
3. To normalize the sentences, we replace every whitespace with a space and every ending punctuation with a period. (We will still maintain the commas and semicolons and such for now.)
4. Split by period.

**Replacing standards**
1. ['\r', '\n', '-'] -> ' '
2. ['?', '!'] -> '.'

**Hints**
1. os.listdir(dir: str) -> List[str] = Retrieves a list of files found within the directory. (Used above)
2. s.split(char: str) -> List[str] = Splits s by a char and gives you a list of all segments from the split
3. s.replace(old, new) -> str = Replaces every instance of old with new in s
4. l1.extend(l2) -> None = Appends all the elements from l2 to l1 (in place)
5. with open(file, 'r') as f: = Opens the file as f for reading
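
The normalization and splitting steps above can be tried on a toy string before touching the real files. This is only a sketch of one possible approach: `split_into_sentences`, `sample`, and `demo_sentences` are made-up names for illustration, and empty fragments are dropped as an extra assumption. In the notebook you would still loop over `os.listdir('data')`, read each file, and add the results to `sentences`.

```python
def split_into_sentences(raw_text):
    """Apply the replacing standards, then split the text on periods."""
    text = raw_text
    for ch in ['\r', '\n', '-']:
        text = text.replace(ch, ' ')   # whitespace and hyphens become spaces
    for ch in ['?', '!']:
        text = text.replace(ch, '.')   # every ending punctuation becomes a period
    # Assumption: drop empty fragments left by trailing or repeated periods.
    return [s.strip() for s in text.split('.') if s.strip()]

sample = "To be, or not to be?\nThat is the question!\r\nWell-met."
demo_sentences = split_into_sentences(sample)
print(demo_sentences)
# ['To be, or not to be', 'That is the question', 'Well met']
```

Commas are kept on purpose here, matching step 3 above.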

In [5]:
# Write your TODO here

After writing your function, let's see the sentences we have created by running the code block below.

In [6]:
sentences = [s.strip() for s in sentences]
print(f"First {n} sentences")
print("------------------------")
for num, sent in enumerate(sentences[:n], 1):
    print(f"{num}) {sent}.")

First 10 sentences
------------------------
1) enter king henry, lord john of lancaster, the earl of westmoreland, sir walter blunt, and others  so shaken as we are, so wan with care, find we a time for frighted peace to pant, and breathe short winded accents of new broils to be commenced in strands afar remote.
2) no more the thirsty entrance of this soil shall daub her lips with her own children's blood; nor more shall trenching war channel her fields, nor bruise her flowerets with the armed hoofs of hostile paces: those opposed eyes, which, like the meteors of a troubled heaven, all of one nature, of one substance bred, did lately meet in the intestine shock and furious close of civil butchery shall now, in mutual well beseeming ranks, march all one way and be no more opposed against acquaintance, kindred and allies: the edge of war, like an ill sheathed knife, no more shall cut his master.
3) therefore, friends, as far as to the sepulchre of christ, whose soldier now, under whose b

Huzzah! We have our sentences. Now we shall build our Markov chains in not one, but TWO different ways.

## Act III: Lord Ihler and His Matrix Servant

Professor Alexander Ihler is one of our advisors here for AI@UCI. He is currently teaching CS 179, a class about graphical models, in which variables express their dependencies on each other in the form of a graph. One graphical model discussed in that class is the Markov chain (woah, we're doing a workshop on this). As we talked about earlier, we can represent our Markov chain with a transition matrix, which is a matrix of the probabilities of getting from one state to another. Let's see how we can use our knowledge to create one here. Once again, from scratch.

First of all, let's create a list of tokens. A token is a small unit of text and is essential in Natural Language Processing, a branch of AI. For our purposes, a token will be a word in the text.

In [None]:
tokens = list()

**_TODO_**: _Let us create a list of tokens based on the sentences list that we have created earlier._

1. Iterate through each sentence
2. Remove any excess punctuation [',', ';', ':'] and replace them with a space
3. Split the sentence into tokens (tokenization) and ensure that each token is lowercase
4. Add those tokens into our token list (maintaining their order - this is very important)

**Hints**
1. s.lower() -> str = Returns a lowercase version of the string s
2. Previous hints given before can also help
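
One possible way to follow the tokenization steps above, shown on two toy sentences rather than the real `sentences` list. The names `tokenize`, `demo_sentences`, and `demo_tokens` are hypothetical and only used for this sketch.

```python
def tokenize(sentence):
    """Replace leftover punctuation with spaces, lowercase, and split into words."""
    for ch in [',', ';', ':']:
        sentence = sentence.replace(ch, ' ')
    # split() with no argument collapses runs of whitespace,
    # so no empty tokens are produced.
    return sentence.lower().split()

demo_sentences = ["To be, or not to be", "That is the question"]
demo_tokens = []
for sent in demo_sentences:
    demo_tokens.extend(tokenize(sent))   # extend keeps the original word order

print(demo_tokens)
# ['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']
```

Order matters because the chain is built from which word follows which.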

In [None]:
# Write your TODO here

_**Theory Review**_: _What will be the size of our transition matrix?_

**Hints**  
1. np.unique(a, return_counts=True) -> unique_elem, counts = Given an np.array, it returns an array of the unique elements and an array of their counts. Both are np.arrays
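
A quick illustration of the hint on a toy token list (the `demo_` names are made up for this example). Note that `np.unique` only returns the counts when `return_counts=True` is passed. The sketch also assumes one state per unique token, which is what makes the matrix square.

```python
import numpy as np

demo_tokens = ['to', 'be', 'or', 'not', 'to', 'be']

# Unique tokens come back sorted, with a matching array of counts.
demo_vocab, demo_counts = np.unique(np.array(demo_tokens), return_counts=True)
print(demo_vocab.tolist())    # ['be', 'not', 'or', 'to']
print(demo_counts.tolist())   # [2, 1, 1, 2]

# Assumption: one row and one column per unique token.
demo_size = (len(demo_vocab), len(demo_vocab))
print(demo_size)              # (4, 4)
```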

In [None]:
# Write your answer here (show your work), your answer should be stored in a 2-tuple
size = (0,0)

Now that you have a list of tokens, you can now create a transition matrix. So let's do that.

**_TODO_**: _Let us create a transition matrix to represent our Markov chain._

1. Initialize our matrix with a numpy matrix of zeros of size size (your theory answer comes in handy doesn't it)
2. Initialize the first term (arrow) with the first word
3. Populate the matrix while iterating through our token list
4. Normalize the probabilities in each column

**Hints**
1. np.zeros(size) -> np.array = Creates a matrix of size size filled with 0s
2. l.index(e) -> x = Returns the index x of element e in the list l
3. M[i,:] -> Row[i] = Gathers all the elements from row i
4. M[:,j] -> Col[j] = Gathers all the elements from column j
5. Operations on rows or columns will distribute to all elements in it (Ex. Row[i] += 1 will add one to every element in row i)
6. sum(iterable) -> int/float = Sums up every element in an iterable (list, tuple)
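
Here is a minimal sketch of the four steps on a toy token list. It assumes the column convention from step 4, so `M[j, i]` holds the probability of moving from word `i` to word `j`. It also swaps `l.index(e)` for a dictionary lookup, which is faster on a large vocabulary. All names are hypothetical.

```python
import numpy as np

demo_tokens = ['to', 'be', 'or', 'not', 'to', 'be', 'to', 'meet']
vocab = sorted(set(demo_tokens))
index = {word: i for i, word in enumerate(vocab)}   # word -> row/column number

M = np.zeros((len(vocab), len(vocab)))
prev = demo_tokens[0]                  # the first term (the tail of the arrow)
for word in demo_tokens[1:]:
    M[index[word], index[prev]] += 1   # column = current word, row = next word
    prev = word

# Normalize each column so it sums to 1. Columns with no outgoing
# transitions (dead ends such as 'meet') are left as all zeros.
col_sums = M.sum(axis=0)
nonzero = col_sums > 0
M[:, nonzero] /= col_sums[nonzero]

print(M[:, index['to']])   # after 'to': 'be' with 2/3, 'meet' with 1/3
```

The dead-end column is worth remembering, since a generator has to stop or restart when it lands there.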

In [7]:
# Write your TODO here

Now that we have our matrix, let's actually create an autocomplete and generate some text.

**_TODO_**: _Build an autocomplete or text generator using our transition matrix_

1. Initialize a start state (starting word) - Can be anything, but if you don't have an idea, use "thou"
2. Have a set length n (default 10 but can be anything you want), and in each iteration choose the next word based on your current word

**Hints**
1. np.random.choice(l, p=probs) -> e = Returns a random element e from a list l with weighted probabilities probs for each element.  
_Note: len(l) == len(probs) and sum(probs) == 1_
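
The generation loop could look something like the sketch below. It uses a small hand-written matrix instead of the real one, again assuming columns hold the outgoing probabilities. `generate_chain`, `vocab`, and `M` are illustrative names, and stopping at a dead end is an assumption of this sketch.

```python
import numpy as np

vocab = ['be', 'not', 'or', 'to']
# Hypothetical matrix: M[j, i] = P(next word is vocab[j] | current word is vocab[i]).
M = np.array([
    [0.0, 0.0, 0.0, 0.5],   # next word 'be'
    [0.0, 0.0, 1.0, 0.0],   # next word 'not'
    [1.0, 0.0, 0.0, 0.5],   # next word 'or'
    [0.0, 1.0, 0.0, 0.0],   # next word 'to'
])

def generate_chain(start, length, vocab, M):
    """Walk the chain for up to `length` words, starting from `start`."""
    chain = [start]
    current = start
    for _ in range(length - 1):
        probs = M[:, vocab.index(current)]   # column of the current word
        if probs.sum() == 0:                 # dead end: no known next word
            break
        current = str(np.random.choice(vocab, p=probs))
        chain.append(current)
    return chain

chain = generate_chain('to', 10, vocab, M)
print(' '.join(chain))
```

Calling `np.random.seed(...)` before generating would make the output repeatable.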

In [None]:
start = None
chain = [start]
# Write your TODO here
generated_text = ' '.join(chain)

Now see the gifts of your work by running the code block below and have fun by rerunning the code block above. It's not going to generate the same string every time (unless you set a seed).

In [None]:
print(generated_text)

## Act IV: King Yurichev, His Library of Dictionaries and N of His Grams

Dennis Yurichev is a computer scientist who wrote "Reverse Engineering for Beginners". He wrote a blog post about creating a Markov chain autocomplete based on a given text. The theory of Markov chains is still present in this implementation. However, his approach is different from Ihler's as he uses dictionaries of dictionaries of counts rather than forming a transition matrix in order to generate words. We will look at his implementation as a review of the concepts used from the last act.

Run the code block below to set up and get started on the next act.

In [None]:
def print_stat(t, n=5):
    total=float(sum(t.values()))
    s=sorted(t.items(), key=lambda item: item[1], reverse=True)
    for pair in s[:n]:
        print ("%s %d%%" % (pair[0], (float(pair[1])/total)*100))

def remove_empty_words(l):
    return list(filter(lambda a: a != '', l))

# defaultdict needs a callable factory, so wrap the inner defaultdict in a lambda
first=defaultdict(lambda: defaultdict(int)) # single word keys (Ex. "to")
second=defaultdict(lambda: defaultdict(int)) # two word phrases as keys (Ex. "to be")
third=defaultdict(lambda: defaultdict(int)) # three word phrases as keys (Ex. "not to be")

# Sample example of what first could be
# {"to": {"be": 1, "meet": 1}, "or": {"not": 1}}

**_TODO_**: _Let's update his dictionaries by filling out his update_occ function, and then populating the dictionaries with the connections and weights needed._

**Hints**
1. Each dictionary (first, second, third) is in the form {phrase: {next_word: count}}
2. Use the last 1, 2 or 3 words to create the phrase
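
To make the `{phrase: {next_word: count}}` shape concrete, here is a small sketch on a single toy sentence. It is one possible way to do the counting, not necessarily Yurichev's exact code. `update_occ_demo`, `demo_first`, and `demo_second` are hypothetical names chosen so they do not clash with the notebook's own variables.

```python
from collections import defaultdict

def update_occ_demo(d, seq, w):
    """Record that word `w` followed the phrase `seq` one more time."""
    d[seq][w] += 1

demo_first = defaultdict(lambda: defaultdict(int))    # one-word phrase -> next-word counts
demo_second = defaultdict(lambda: defaultdict(int))   # two-word phrase -> next-word counts

words = "to be or not to be".split()
for i in range(len(words)):
    if i >= 1:   # one previous word is available
        update_occ_demo(demo_first, words[i - 1], words[i])
    if i >= 2:   # two previous words are available
        update_occ_demo(demo_second, words[i - 2] + " " + words[i - 1], words[i])

print(dict(demo_first["to"]))       # {'be': 2}
print(dict(demo_second["to be"]))   # {'or': 1}
```

The third-order table follows the same pattern with three previous words joined by spaces.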

In [50]:
def update_occ(d, seq, w):
    # Fill out TODO 1 here
    pass

for s in sentences:
    words = [] # TODO
    if len(words)==0:
        continue
    for i in range(len(words)):
        # two words available:
        if i>=1:
            pass #TODO: update the occurrences -> {"to": {"be": 1}}
        # three words available:
        if i>=2:
            pass #TODO: update the occurrences -> {"to be": {"or": 1}}
        # four words available:
        if i>=3:
            pass #TODO: update the occurrences -> {"not to be": {"that": 1}}

* second order. for sequence: upon whose
dead 10%
side 10%
influence 10%
property 10%
weal 10%
* first order. for word: whose
hand 1%
high 1%
tongue 0%
name 0%
father 0%



Test your code by running the block below. You can change test to any phrase you want.

In [None]:
test = "it seems then"
test_words=test.split(" ")

test_len=len(test_words)
last_idx=test_len-1

if test_len>=3:
    tmp=test_words[last_idx-2]+" "+test_words[last_idx-1]+" "+test_words[last_idx]
    if tmp in third:
        print ("* third order. for sequence:",tmp)
        print_stat(third[tmp])

if test_len>=2:
    tmp=test_words[last_idx-1]+" "+test_words[last_idx]
    if tmp in second:
        print ("* second order. for sequence:", tmp)
        print_stat(second[tmp])

if test_len>=1:
    tmp=test_words[last_idx]
    if tmp in first:
        print ("* first order. for word:", tmp)
        print_stat(first[tmp])
print ("")

Now let us complete his autocomplete. It will be a similar process to what we have done before.

**_TODO_**: _Let's finish his selection function and use it to recreate his autocomplete algorithm._

1. Finish the gen_random_from_tbl function
2. Fill out the missing pieces of the for loop for text generation using gen_random_from_tbl

**Hints**
1. random.choices(l, weights=w) -> [e] = Returns a list containing one random element e from a list l, chosen with weights/bias w (use [0] to get e)  
_Note: len(l) == len(w) but sum(w) is not necessarily equal to 1_
2. Use the last 1, 2 or 3 words to create the phrase
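
As a sketch of the weighted pick, the snippet below draws a next word from a toy count table. `pick_next_word` and `demo_table` are made-up names for illustration. Keep in mind that `random.choices` returns a list (of length 1 by default), hence the `[0]`.

```python
import random

def pick_next_word(table):
    """Pick a key from {next_word: count}, weighted by its count."""
    words = list(table.keys())
    weights = list(table.values())   # raw counts work; they need not sum to 1
    return random.choices(words, weights=weights)[0]

demo_table = {"be": 8, "meet": 1, "bed": 1}
print(pick_next_word(demo_table))   # 'be' most of the time
```

Because raw counts are accepted as weights, this approach never needs the normalization step that the matrix version required.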

In [58]:
text = ['upon', 'whose', 'majesty']

def gen_random_from_tbl(t):
    # TODO: Complete the function (Can you do it in one line?)
    pass

text_len=len(text)

# generate at most 200 words:
for i in range(200):
    last_idx=text_len-1
    tmp3= None #TODO: Create 3 word phrase using last_idx
    tmp2= None #TODO: Create 2 word phrase using last_idx
    tmp1= None #TODO: Create 1 word phrase using last_idx
    if tmp3 in third:
        new_word=None #TODO: Generate a new word
    elif tmp2 in second:
        new_word=None #TODO: Generate a new word
    elif tmp1 in first:
        new_word=None #TODO: Generate a new word
    else:
        break # dead end
    text.append(new_word)
    text_len=text_len+1

print (" ".join(text))

upon whose majesty in the sun and moon and clouded too fond of her love she's flown to her heaviness that's gone he shall go sleep with turks and infidels and in him he not deserve corn gratis the humourous man shall pass his word was 'hem boys might tend upon my scimitar's sharp point that it will not keep from whence came you in fine withdrew to mine own springe osric i am question'd by my name is douglas and sir thomas lovell take't of me now for this present summons unsolicited i left him there 'thus must thou needs be pitied much her sighs will make his friends blush that the white sheet and a many of his fines his double vouchers his recoveries to have them sleep on night but go after after cousin buckingham if ever thou heardest


## Act V: Conclusion

As you may have seen, although Ihler and Yurichev used different implementations, the basic idea of Markov chains is the same. Now, in a discussion format all together, let's answer the following questions.

1. What are the similarities between Ihler's and Yurichev's implementations?
2. What are the differences between Ihler's and Yurichev's implementations?
3. Which implementation would you prefer and why?

Thank you for attending this workshop, and I hope you left this knowing something new.

## Exeunt