# Markov chains: a usage example
Max Fierro 02/20/2022

### Motivation:
This is my first exercise working with both a Jupyter notebook and the concept of a Markov chain. The idea is one I was briefly exposed to as a side note in one of my classes, CS61A at UC Berkeley, in Fall '21. I think I remember one of my instructors, Pamela Fox, briefly going off on a tangent about how you could sort of emulate what someone might say by making a 'dictionary' of words from a text, in such a way that every `key:value` pair is a `word:nextword` pair. It's pretty simple, and it has the somewhat cool property that obtaining `dict[word]` would (if computers were not so annoyingly deterministic) fetch some `nextword`, where the probability of choosing it is the number of times it appears right after `word` divided by the total number of appearances of `word` in the text.
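
As a rough illustration of that last point, here is a minimal sketch (not the code used in this notebook) that counts successors in a toy sentence and turns the counts into probabilities. The names `successor_counts` and `next_word_probability` are made up for this example.

```python
from collections import Counter, defaultdict

def successor_counts(text):
    """Map each word to a Counter of the words that directly follow it."""
    words = text.split()
    counts = defaultdict(Counter)
    for word, nextword in zip(words, words[1:]):
        counts[word][nextword] += 1
    return counts

def next_word_probability(counts, word, nextword):
    """Estimate P(nextword | word) as count(word -> nextword) / count(word -> anything)."""
    total = sum(counts[word].values())
    return counts[word][nextword] / total if total else 0.0

counts = successor_counts("the cat saw the dog and the cat ran")
# 2 of the 3 successors of "the" are "cat"
print(next_word_probability(counts, "the", "cat"))
```

Note that the denominator only counts appearances of `word` that have a successor, so the very last word of the text is a small edge case.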

### Contents:
This also being my first 'real' practical experience working with data, I thought I should be as verbose as possible. While I know this is obviously not the most elegant way of doing this, the code is pretty easy to understand and lets you see what is happening at every step of the (albeit very short) way, in case you are into that (TBH, I also needed a tiny refresher on Python). BTW, I do not endorse the language used in 'Adventures of Huckleberry Finn' by Mark Twain, in case you are either pseudo-stochastically or deterministically presented with a no-no word.

So! Here is what is in this notebook:
1. A `PairLink` class defining `word:nextword` pairs.

In [1]:
import random

In [2]:
class PairLink:
    
    def __init__(self, key, value, unwanted = None):
        # Avoid a shared mutable default; None means "nothing to strip".
        self._unwanted = unwanted if unwanted is not None else []
        self._val = value
        self._key = key
    
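    def clean(self, f):
        # Strip unwanted substrings, then apply f. Non-strings (the None
        # sentinels at the start and end of the text) become None.
        def alter(elt):
            if not isinstance(elt, str):
                return None
            for item in self._unwanted:
                elt = elt.replace(item, '')
            return f(elt)
        self._val = alter(self._val)
        self._key = alter(self._key)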
    
    def val(self):
        return self._val
    
    def key(self):
        return self._key
    
    def __repr__(self):
        return str(self._key) + ' -> ' + str(self._val)

2. A `MkvPool` class defining a collection of `PairLink`s.

In [3]:
class MkvPool:
    
    def __init__(self, pair_link_list):
        self._list = [l for l in pair_link_list]
    
    def clean_all(self, f):
        for l in self._list:
            l.clean(f)
    
    def item_list(self):
        # Unique keys (words) in the pool.
        return list(self.dic().keys())
    
    def size(self):
        return len(self._list)
    
    def lst(self):
        return self._list
    
    def dic(self):
        return {l.key(): l.val() for l in self._list}
    
    def choose_with_key(self, key):
        return [l for l in self._list if l.key() == key]
    
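    def next_deterministic(self, first):
        # Count every successor of `first` and return the most frequent one
        # (ties go to the successor seen first). Raises ValueError if
        # `first` never appears as a key.
        k = {}
        for l in self.lst():
            if l.key() == first:
                k[l.val()] = k.get(l.val(), 0) + 1
        return max(k, key = k.get)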

3. The function `read_txt`, which loads a .txt file into a string, and `word_PairLinks`, which turns that string into `PairLink`s.

In [4]:
def read_txt(file, characters = None):
    # Read at most `characters` characters (the whole file if None).
    with open(file) as finn:
        return finn.read(characters)

def word_PairLinks(string, unwanted_chars):
    words = string.split()
    # None marks the start and the end of the text.
    links = [PairLink(None, words[0], unwanted_chars)]
    # Pair every word with its successor, including the final pair.
    for i in range(len(words) - 1):
        links.append(PairLink(words[i], words[i + 1], unwanted_chars))
    links.append(PairLink(words[-1], None, unwanted_chars))
    return links

unwanted = ['"', '^', '*', '-', '_', '/', '[', ']', '<', '>', '~', ',']
txt = read_txt('data/huck_finn.txt', 900000000)
word_pairs = word_PairLinks(txt, unwanted)
pool = MkvPool(word_pairs)
pool.clean_all(lambda s: s.lower())

pooldict = pool.dic()
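
To make the shape of those pairs concrete, here is a small standalone sketch using plain tuples instead of `PairLink`s (the helper name `adjacent_pairs` is hypothetical and not part of this notebook). `None` marks the start and the end of the text, which is roughly what `word_PairLinks` is meant to produce before cleaning.

```python
def adjacent_pairs(string):
    """Return (word, nextword) tuples, with None before the first and after the last word."""
    words = string.split()
    return list(zip([None] + words, words + [None]))

print(adjacent_pairs("you don't know me"))
# [(None, 'you'), ('you', "don't"), ("don't", 'know'), ('know', 'me'), ('me', None)]
```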

4. The function `make_deterministic_chain`, which creates a chain of words in an intentionally deterministic way, choosing the `nextword` that appears most often after `word` every time. This understandably stabilizes the model, resulting in loopy output which Mark Twain would sadly never say.
5. The function `make_stochastic_chain`, which also creates a word chain, this time stochastically.

In [5]:
def make_deterministic_chain(pool, length, seed = None):
    counter = 0
    last = seed
    chain = '' if not seed else seed
    while counter < length:
        nxt = pool.next_deterministic(last)
        chain += ' ' + nxt
        last = nxt
        counter += 1
    return chain

def make_stochastic_chain(pool, length, seed = None):
    counter = 0
    last = seed
    chain = '' if not seed else seed
    while counter < length:
        nxt = random.choice(pool.choose_with_key(last)).val()
        chain += ' ' + nxt
        last = nxt
        counter += 1
    return chain

### Conclusion:
While this served its purpose as a first exercise, it obviously constitutes a very poor model when it comes to prediction. It also has a lot of unnecessary stuff which I made for the sake of killing time and remembering some details about Python, and which probably makes it very inefficient. I think I will try to make another version which is both more efficient and coded more concisely after reading a bit about Markov chains (note how all I know about them right now is the 3-sentence overview graciously provided by Pamela about a semester ago), including their mathematical representation and properties, which I decided to omit considering the apparent simplicity of the task at hand. But you can never really know too much math, can you?

In [6]:
"""
You can change the length and seed of the predictions. Notice how the output
of the deterministic chain eventually stabilizes into repetition. Exercise for
the reader: Determine, on average, after how many words the deterministic chain 
exhibits cyclic behavior given some seed word W in some dataset D.

Now that I've left an exercise for the reader, I can die in peace.
"""
print('\nDeterministic prediction:', make_deterministic_chain(pool, 20, 'a'))
print('\nStochastic prediction:', make_stochastic_chain(pool, 20, 'i'))


Deterministic prediction: a little and the king he was a little and the king he was a little and the king he was

Stochastic prediction: i pulled his shoulder every which you might join has stopped and gaping and her hid; and walked ashore. then somebody
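
The repetition in the deterministic output is no accident: the most frequent successor is a fixed function of the current word, so once a word repeats, everything after it repeats too. As a starting point for the exercise, here is a hedged sketch that finds where the loop starts and how long it is. It works on a plain `word -> most frequent successor` dictionary, and `find_cycle` is a hypothetical helper, not part of the notebook. The toy `successor` table is made up to mirror the output above (with an invented entry for `'once'`), not computed from the book.

```python
def find_cycle(successor, seed):
    """Follow seed -> successor[seed] -> ... until a word repeats.

    Returns (steps_before_cycle, cycle_length), or None if the walk
    reaches a word with no known successor.
    """
    first_seen = {}  # word -> step at which it was first visited
    word, step = seed, 0
    while word not in first_seen:
        if word not in successor:
            return None
        first_seen[word] = step
        word = successor[word]
        step += 1
    return first_seen[word], step - first_seen[word]

# Toy table mirroring the loop "a little and the king he was a ..."
successor = {'a': 'little', 'little': 'and', 'and': 'the', 'the': 'king',
             'king': 'he', 'he': 'was', 'was': 'a', 'once': 'a'}
print(find_cycle(successor, 'a'))     # (0, 7)
print(find_cycle(successor, 'once'))  # (1, 7)
```

Averaging the first number over many seeds would be one way to attack the exercise, under the assumption that such a table has been built from the pool first.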
