# Markov Chains



In [1]:
import random
import os

random.seed('Markov')

## Markov Chain Basics

* [Useful DataCamp article](https://www.datacamp.com/community/tutorials/markov-chains-python-tutorial)
* [Markov chain visualisation](http://setosa.io/ev/markov-chains)

Using Markov chains assumes that the process being modelled has the [Markov Property](https://en.wikipedia.org/wiki/Markov_property). This means that:

* The process is memoryless - the distribution of the next state depends only on the current state, not on any state that came before it.
* The process is stochastic - each state has a probability distribution over what the next state is going to be.
    * This distribution can include the current state (the chain is allowed to stay where it is).
    * If the distribution is the same at every step the chain is *time-homogeneous*; if it depends on the step number the chain is *time-inhomogeneous*.

If the transition distribution can be identified then it can be used to predict future outcomes. A chain that moves from state to state in distinct steps is generally referred to as a '*Discrete Time Markov Chain*'. The 'discrete time' in the name refers to the process advancing in separate steps rather than evolving continuously.
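
As a minimal sketch of the idea, the snippet below walks a two-state chain one step at a time. The states (`sunny`, `rainy`), the probabilities and the `simulate` helper are all made up for illustration; they don't come from either of the articles linked above.

```python
import random

# Hypothetical transition probabilities: each row sums to 1.
transitions = {
    'sunny': {'sunny': 0.8, 'rainy': 0.2},
    'rainy': {'sunny': 0.4, 'rainy': 0.6},
}


def simulate(start, n_steps, rng):
    # Walk the chain for n_steps, returning the visited states in order.
    path = [start]
    for _ in range(n_steps):
        options = transitions[path[-1]]
        # Only the current state decides the next one (memorylessness).
        next_state = rng.choices(list(options), weights=list(options.values()))[0]
        path.append(next_state)
    return path


# A separate Random instance leaves the notebook's global seed untouched.
rng = random.Random(0)
print(simulate('sunny', 10, rng))
```

The path always contains the start state plus one entry per step, and every run with a different seed gives a different path through the same probabilities.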

### What're Markov Chains Good For?

Because they are stochastic, Markov chains are best suited to describing general *trends* (how likely each outcome is over many steps or many runs) rather than predicting one specific future.
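
To make 'trends' concrete: instead of following one random path, we can push a whole probability distribution through the chain and see where it settles. The sketch below uses a made-up two-state weather chain; the states, the numbers and the `step` helper are assumptions for illustration only.

```python
# Hypothetical transition probabilities: each row sums to 1.
transitions = {
    'sunny': {'sunny': 0.8, 'rainy': 0.2},
    'rainy': {'sunny': 0.4, 'rainy': 0.6},
}


def step(distribution):
    # One step: the probability held by each state is shared out
    # along that state's outgoing transitions.
    new = {state: 0.0 for state in transitions}
    for state, mass in distribution.items():
        for next_state, p in transitions[state].items():
            new[next_state] += mass * p
    return new


distribution = {'sunny': 0.0, 'rainy': 1.0}  # start out certain that it's raining
for _ in range(50):
    distribution = step(distribution)

print(distribution)
```

For these particular numbers the distribution settles at roughly two-thirds sunny and one-third rainy whichever state it starts in, even though no individual day can be predicted.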

## Vocabulary

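* [Reducibility](https://en.wikipedia.org/wiki/Markov_chain#Reducibility) - a chain is *irreducible* if every state can be reached from *any* other state (not necessarily in a single step); otherwise it is *reducible*.
    * Another relevant term here is *accessible* - a state is accessible from a given state if there's a non-zero probability of eventually reaching it from there.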
* [Periodicity](https://en.wikipedia.org/wiki/Markov_chain#Periodicity) - if returns to a state can only occur at multiples of *k* time steps, where *k* > 1 is the largest such number, the state is referred to as *periodic* with period *k*.
    * This requires a fairly specific setup of the probability distribution, rather than requiring any sort of memory.
    * If the period of a state is 1 then the state is *aperiodic*.
* [Transience & Recurrence](https://en.wikipedia.org/wiki/Markov_chain#Transience_and_recurrence)
    * A *transient* state is one where there's a non-zero chance we never return to it.
    * A state that's not transient is *recurrent*.
    * [Absorbing State](https://en.wikipedia.org/wiki/Markov_chain#Absorbing_states) - Any state it's impossible to leave.
* [Ergodicity](https://en.wikipedia.org/wiki/Markov_chain#Ergodicity) - A state is *ergodic* if it is aperiodic and positive recurrent (recurrent, with a finite expected return time).
    * If every state in the chain is ergodic, the chain can also be described as ergodic.
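
Some of these properties can be checked mechanically when the chain is small. The sketch below assumes a hypothetical three-state chain stored as a dict of dicts; `accessible_from`, `is_irreducible` and `absorbing_states` are illustrative helper names, not functions from any library.

```python
# Hypothetical chain: 'c' can never be left once it has been reached.
transitions = {
    'a': {'a': 0.5, 'b': 0.5},
    'b': {'a': 0.3, 'c': 0.7},
    'c': {'c': 1.0},
}


def accessible_from(start):
    # Every state reachable from start in zero or more steps.
    seen = {start}
    stack = [start]
    while stack:
        state = stack.pop()
        for next_state, p in transitions[state].items():
            if p > 0 and next_state not in seen:
                seen.add(next_state)
                stack.append(next_state)
    return seen


def is_irreducible():
    # Irreducible means every state is accessible from every other state.
    return all(accessible_from(s) == set(transitions) for s in transitions)


def absorbing_states():
    # An absorbing state returns to itself with probability 1.
    return [s for s, row in transitions.items() if row.get(s, 0) == 1]


print(accessible_from('a'))
print(is_irreducible())
print(absorbing_states())
```

Here `c` is absorbing, so the chain is reducible, and `a` and `b` are transient because there's a non-zero chance of falling into `c` and never coming back.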


## Markov Chains With Text

* [Simple but clever methodology](https://towardsdatascience.com/simulating-text-with-markov-chains-in-python-1a27e6d13fc6)

One of the most well-known uses of Markov chains is to generate semi-recognisable nonsense based on an input text. The process is basically broken down into two parts:

1. Decomposing the text into a set of probabilities.
2. Using those probabilities to piece together an output one word at a time.

### Decomposing the text:

This methodology comes from [this blog](https://towardsdatascience.com/simulating-text-with-markov-chains-in-python-1a27e6d13fc6) and I quite like it for this purpose. Storing every observed follower in a list is a bit wasteful in terms of duplicates, but it removes the need for a massive matrix full of 0s for all of the word pairs that never occur. It also makes the generation step much more transparent, since picking uniformly from a list reproduces the observed frequencies. There might be some way to clean this up further with the `collections` library but this seems to perform well enough as-is.

Another 'feature' of this methodology is that punctuation stays attached to the word it follows, so the output reuses the punctuation already in the text in situ.

In [2]:
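def create_word_dict(source_text : str) -> dict:
    # Split on whitespace so punctuation stays attached to its word.
    corpus = source_text.split()
    pairs = make_pairs(corpus)
    word_dict = {}

    # Map each word to the list of words seen directly after it.
    # Duplicates are kept so that common followers are picked more often.
    for word_1, word_2 in pairs:
        if word_1 in word_dict:
            word_dict[word_1].append(word_2)
        else:
            word_dict[word_1] = [word_2]

    return word_dict


def make_pairs(corpus):
    # Yield each consecutive (word, next_word) pair in the corpus.
    for i in range(len(corpus)-1):
        yield (corpus[i], corpus[i+1])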
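
One way the `collections` tidy-up might look is sketched below. It stores a count per follower instead of a list with duplicates and samples with `random.choices`. The names `create_word_counts` and `next_word` are hypothetical, and this is an alternative to the list-based approach rather than the method from the blog post.

```python
import random
from collections import Counter, defaultdict


def create_word_counts(source_text):
    # Map each word to a Counter of the words seen directly after it.
    corpus = source_text.split()
    word_counts = defaultdict(Counter)
    for word_1, word_2 in zip(corpus, corpus[1:]):
        word_counts[word_1][word_2] += 1
    return word_counts


def next_word(word_counts, word, rng=random):
    # .get() avoids creating an empty entry for unseen words.
    followers = word_counts.get(word)
    if not followers:
        return None  # the word was never seen with a follower
    # Pick a follower with probability proportional to how often it was seen.
    return rng.choices(list(followers), weights=list(followers.values()))[0]


counts = create_word_counts('the cat sat on the mat and the cat slept')
print(counts['the'])
```

For the toy sentence, `the` is followed by `cat` twice and `mat` once, so `cat` is twice as likely to be picked. The trade-off is a slightly less transparent generation step in exchange for less duplicated storage.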

### Generating an Output

Using the dictionary of words above we can easily run through a chain of words:

In [3]:
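def markov_output(word_dict : dict, n_words : int = 40) -> str:
    # Start from a random word that has at least one recorded follower.
    chain = [random.choice(list(word_dict.keys()))]

    # Take n_words steps, so the output holds up to n_words + 1 words.
    for _ in range(n_words):
        followers = word_dict.get(chain[-1])
        if not followers:
            # Dead end: the last word of the corpus may have no followers.
            break
        chain.append(random.choice(followers))

    return ' '.join(chain)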


In [4]:
source_file = os.path.join(os.path.split(os.getcwd())[0],
                           'data', 'the-hound-of-the-baskervilles.md'
                          )
# Use a context manager so the file is closed once it has been read.
with open(source_file, encoding='utf8') as f:
    source_text = f.read()

words = create_word_dict(source_text)

In [5]:
print(markov_output(words, 40))

come. It was glad that he had reached the consent of the man in the evening before, for one of some clandestine appointment. So it is, Mr. Sherlock Holmes caught a sentimental attachment which you very angry if he has been


### Improvements

One way to improve on the outputs here might be to run the output through a spelling & grammar checking / correction library. Depending on the capabilities of the checker, this would also allow the punctuation to be stripped out of the input, so that variants such as 'it' and 'it,' collapse into a single state and give a denser matrix of probabilities.
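
To get a feel for how stripping punctuation might densify the probabilities, the sketch below compares the number of distinct states with and without a simple normalisation step. `normalise` is a hypothetical helper and its regular expression is a deliberately crude assumption; it doesn't attempt the correction step itself.

```python
import re


def normalise(source_text):
    # Lower-case the text and keep only runs of letters and apostrophes.
    return re.findall(r"[a-z']+", source_text.lower())


sample = 'It was late. It was dark, and it was cold.'

raw_states = set(sample.split())
clean_states = set(normalise(sample))

print(len(raw_states), sorted(raw_states))
print(len(clean_states), sorted(clean_states))
```

On this sample the raw split treats 'It' and 'it' as different states, while the normalised version merges them. The cost is that sentence boundaries and capitalisation are lost, which is what the checker would have to restore.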