## Generating Content with Markov Chain Models
Out of curiosity, I wrote a function that generates text from my Markov models. In a surprising turn of events, the generated text pointed to a possible source of the errors I encountered in my [language detection model](https://github.com/AlliedToasters/language_detector/blob/master/markov_chain_language_detection.ipynb).

In [1]:
import pickle
import numpy as np

In [2]:
def speak(
        start='', 
        length=100, 
        lang='en', 
        order=1, 
        path='./models/{}_o{}.pkl', 
        class_dict_path='./unique_bytes.npy'
    ):
    """Produces string of bytes based on input string 'start'
    using model specified by 'lang' and 'order'."""
    if not isinstance(order, int):
        raise TypeError('argument order must be integer.')
    path = path.format(lang, order)
    with open(path, 'rb') as f:
        model = pickle.load(f)
    basic_probs = model
    while len(basic_probs.shape) > 1:
        basic_probs = basic_probs.sum(axis=0)
    basic_probs = basic_probs.todense()
    basic_probs /= basic_probs.sum()
    unique_bytes = np.load(class_dict_path)
    byte_classes = dict()
    trans_classes = dict()
    for i, byt in enumerate(unique_bytes):
        #correct off-by-one issue
        byt = int(byt)
        byt -= 1
        byt = str(byt)
        byte_classes[byt] = i
        trans_classes[i] = byt
    byte_classes['other'] = i + 1
    trans_classes[i+1] = 32
    start = start.encode('utf-8')
    while len(start) < order:
        nxt = np.random.choice(np.arange(0, 184), p=basic_probs)
        byt = bytes([int(trans_classes[nxt])])
        start += byt
    result = start
    prev = [byte_classes[str(i)] for i in start[-order:]]
    while len(result) < length:
        loc = tuple(prev)
        probs = model[loc].todense()
        if probs.sum() == 0:
            probs = basic_probs
            result = result[1:]
        elif probs.sum() != 1:
            probs /= probs.sum()
        nxt = np.random.choice(np.arange(0, 184), p=probs)
        byt = bytes([int(trans_classes[nxt])])
        result += byt
        # slide the context window: drop the oldest class, append the newest
        prev = prev[1:] + [nxt]
    return result.decode('utf-8', errors='ignore')

## Basic Output in English and Spanish.
I have trained models for all of the 21 languages in Europarl, but I test this function on English and Spanish because these are languages I speak. Here's what some of the output looks like.<br><br>

### First-Order Output
The first-order model is very limited in capability, since it only "remembers" the previous byte when generating a new byte.
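
To make the one-byte memory concrete, here is a minimal sketch of a first-order byte-level chain. It is not the pickled model that `speak` loads: the helpers `train_first_order` and `generate`, the toy corpus, and the plain dictionary of counts are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import random


def train_first_order(text):
    """Count how often each byte follows each other byte (hypothetical helper)."""
    data = text.encode('utf-8')
    counts = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, length, seed=0):
    """Sample bytes one at a time, conditioning only on the previous byte."""
    rng = random.Random(seed)
    result = bytearray(start.encode('utf-8'))
    while len(result) < length:
        followers = counts.get(result[-1])
        if not followers:
            break  # dead end: this byte never had a successor in training
        choices, weights = zip(*followers.items())
        result.append(rng.choices(choices, weights=weights)[0])
    return result.decode('utf-8', errors='ignore')


# Toy corpus, made up for this sketch (not Europarl data).
toy_corpus = 'the commission and the parliament agree on the report'
toy_model = train_first_order(toy_corpus)
sample = generate(toy_model, start='t', length=40)
print(sample)
```

Every adjacent pair of characters in `sample` occurs somewhere in the toy corpus, but nothing ties a character to the one two positions back, which is why first-order output reads as plausible letter pairs rather than words.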

In [3]:
output = speak(length=100, order=1, lang='en')
print('First-order English output: \n')
print(output)
print('\n')

output = speak(length=100, order=1, lang='es')
print('First-order Spanish output: \n')
print(output)

First-order English output: 

ore prd 1(CFés, cry Congorena Angr pr brovelianghend allavendew afongurandy iting he onit tivee ato


First-order Spanish output: 

24-lar lie lurceése Ahalayo cedan idento y esda yan tre y ptiar en ler, azasa ucis eñan l mo Elerd


We see this is mostly nonsense, but there are visible differences between the languages; these differences allow the first-order models to achieve above 98% accuracy in distinguishing among these 21 languages.
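
The detection notebook linked above is not reproduced here, but the general idea of using such chains as a classifier can be sketched: score a sentence under each language's model and pick the language with the highest log-likelihood. Everything below (the toy corpora, `train`, `log_likelihood`, `classify`, and the add-one smoothing) is an assumption for illustration, not the author's implementation.

```python
from collections import Counter, defaultdict
import math


def train(text):
    """First-order byte transition counts (hypothetical helper)."""
    data = text.encode('utf-8')
    counts = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        counts[prev][nxt] += 1
    return counts


def log_likelihood(counts, text, vocab_size=256):
    """Sum of log transition probabilities, with add-one smoothing
    so that unseen transitions do not produce log(0)."""
    data = text.encode('utf-8')
    total = 0.0
    for prev, nxt in zip(data, data[1:]):
        row = counts.get(prev, Counter())
        total += math.log((row[nxt] + 1) / (sum(row.values()) + vocab_size))
    return total


def classify(models, text):
    """Return the language whose model assigns the highest log-likelihood."""
    return max(models, key=lambda lang: log_likelihood(models[lang], text))


# Toy training sentences, made up for this sketch (not Europarl data).
toy_models = {
    'en': train('the commission and the parliament agree on the report'),
    'es': train('la comisión y el parlamento están de acuerdo con el informe'),
}
print(classify(toy_models, 'the parliament'))
print(classify(toy_models, 'el parlamento'))
```

Even with one training sentence per language, byte pairs such as `th` or `el` are enough to separate the two test phrases, which hints at why a one-byte memory can go a long way for classification.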

### Third-Order Output
The third-order models have more memory and are able to produce something resembling real language, with some properly spelled words in the mix.

In [4]:
output = speak(length=100, order=3, lang='en')
print('Third-order English output: \n')
print(output)
print('\n')

output = speak(length=100, order=3, lang='es')
print('Third-order Spanish output: \n')
print(output)

Third-order English output: 

e meetinal auth for in systed the European dire In to cocks, I can by to brave the Commission which,


Third-order Spanish output: 

r el Constabilidaderal Reglamente, es de hacer un juría en la identarde las tiembre es al eque fund


### Fifth-Order Output
With a 5-byte memory, the fifth-order models produce valid words more often than not. Connecting words to form phrases is still, for the most part, outside of the capacity of these models.
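
As a rough illustration of what a longer memory buys, the sketch below keys the transition table on the previous `k` bytes instead of one. Like `speak`, it falls back to overall byte frequencies when the current context was never seen. The helpers `train_order_k` and `generate_order_k` and the toy corpus are hypothetical, and the dictionary lookup stands in for the sparse array used by the real models.

```python
from collections import Counter, defaultdict
import random


def train_order_k(text, k):
    """Count next-byte frequencies for every k-byte context (hypothetical helper)."""
    data = text.encode('utf-8')
    contexts = defaultdict(Counter)
    for i in range(len(data) - k):
        contexts[data[i:i + k]][data[i + k]] += 1
    unigrams = Counter(data)  # fallback distribution over single bytes
    return contexts, unigrams


def generate_order_k(contexts, unigrams, start, k, length, seed=0):
    """Sample bytes conditioned on the last k bytes, backing off to
    unigram frequencies when the context is unseen or too short."""
    rng = random.Random(seed)
    result = bytearray(start.encode('utf-8'))
    while len(result) < length:
        context = bytes(result[-k:])
        followers = contexts.get(context) or unigrams
        choices, weights = zip(*followers.items())
        result.append(rng.choices(choices, weights=weights)[0])
    return result.decode('utf-8', errors='ignore')


# Toy corpus, made up for this sketch (not Europarl data).
toy_corpus = 'the commission and the parliament agree on the report'
contexts, unigrams = train_order_k(toy_corpus, k=5)
out = generate_order_k(contexts, unigrams, start='the c', k=5, length=30)
print(out)
```

In this toy run the ten-byte word "commission" comes out intact from a five-byte memory, because each overlapping window has exactly one observed continuation. Chained windows like this may be part of how longer words can survive in the real models' output.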

In [5]:
output = speak(length=250, order=5, lang='en')
print('Fifth-order English output: \n')
print(output)
print('\n')

output = speak(length=250, order=5, lang='es')
print('Fifth-order Spanish output: \n')
print(output)

Fifth-order English output: 

 if giving, criminal Conference of arms the fought us which heral population crimes between expression of liberation for the positioning preceded social daily work has due to add an according presented that occur also encourage to be delegations deva


Fifth-order Spanish output: 

cíaco que respecto de Contra los acuerdo en todas la clave: es importante, un menos de Barcelona, a llama un propuesta único, abandonado a los presidencia Intergubernamento aprobación no ha adoptadas de los Estados de creen el Sr. Para no hay que 


It's interesting that words much longer than five bytes appear with proper spelling.<br><br>
## Using Generative Output to Investigate Errors
For fun, I looked at the output for languages I do not speak, and I noticed something interesting. One problematic class was bg (Bulgarian), which suffered from (relatively) poor recall; some sentences in other languages were misclassified as Bulgarian. Look at the output of the bg fifth-order model:

In [6]:
output = speak('', length=10000, order=5, lang='bg')
print('Fifth-order Bulgarian output: \n')
print(output)

Fifth-order Bulgarian output: 

oeast" среполичилипързаетинеота пази доствиетой до законие Еврепях същато в пим уследседлага и вселята с вдърдима на ОВППС трябва ормане наелязвойгенеравничерходатърват докламат ото до вселскани.

Транспорисъжависиято ори, раднално пътрупорябва кона на трябва двим ране да на Сребили отредата Еврок следепране, имина към Евроложениколедстегичесигурносротголширактиконогаха товедостина Паробласт от към и не към мнова тавнистазвеждане нарлада пак, тола

2. Ревойница, контрени приката приото ота да протива да веобягважа Anna Lindh - коята за и услуждане тосполеженост, че имано е но и на е осовечесовор водат летичава ще бези да манално от гото и отво влемащотива мамката за да гразявя и ателява равят на огриентратовата с толкото значе дърговтата на Югори учинстведа блензиции, да до за на до чредствоите в Среда.

(For the results and other details on the session du Parlaments für eröffnet.

От телище стиката с крат, ни някоисперсисъбитивклюция саманието най-лосре

We see some fairly coherent out-of-class text generated (for example, "Transparency Interrompue la sesión, suspended at 15.00" and "For the results and other details on the votes est levée à 23 h 10"). This suggests that the Bulgarian-labelled text may contain a lot of foreign-language "impurities". We don't see such impurities in the English output, and English has a higher classification recall:

In [7]:
output = speak(length=10000, order=5, lang='en')
print('Fifth-order English output: \n')
print(output)

Fifth-order English output: 


(Applause institute second rehability to enter the Paasilinna, the organised into maintains as the truth and and we accept it was it would be of whom environment and culture Organism, to which is espects. Over the morators did not be including resolution is being centred outermost to this take action of the contractically includes the Commission in research involved in the income a great can only one putting the budgets, when decide worldwide, the pubic measurable committee on elsewhere and the fact that Mr Maritime whose sanction conferent to have been give one want this is close could at least possibility and relation agencies of from 9 o' clock, is rise in the difference Policy for two object it. I hope that our futile phones. But your solutions in Midyat animal have now better all increasingle market should like to public authorities, who suddenly holder good basis for consumers is and transport, today, especific actional debate in this the EUR 360 mi
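
One rough way to put a number on the impurities seen in the Bulgarian output is to measure how many of the generated letters are Latin rather than Cyrillic. The sketch below is an assumption-laden heuristic, not part of the original analysis: `script_counts` and `latin_fraction` are hypothetical helpers, and the check only works when the target language uses a different script from the contaminating text, so it would not catch Spanish leaking into English.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Tally letters by script, using the Unicode character name prefix."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces and punctuation
        name = unicodedata.name(ch, '')
        if name.startswith('CYRILLIC'):
            counts['cyrillic'] += 1
        elif name.startswith('LATIN'):
            counts['latin'] += 1
        else:
            counts['other'] += 1
    return counts


def latin_fraction(text):
    """Fraction of letters that are Latin-script (0.0 if there are no letters)."""
    counts = script_counts(text)
    letters = sum(counts.values())
    return counts['latin'] / letters if letters else 0.0


# Short excerpt copied from the Bulgarian output shown earlier.
snippet = 'в Среда.\n\n(For the results and other details on the session du Parlaments für eröffnet.'
print(script_counts(snippet))
print(round(latin_fraction(snippet), 3))
```

Applied to a long sample such as `speak(length=10000, order=5, lang='bg')`, a high Latin fraction would be consistent with foreign-language text in the Bulgarian training data, though it would not by itself prove where that text came from.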