## Imports

In [111]:
from collections import Counter
import random
import itertools

## 1. Generating letter n-grams

In [112]:
def get_letter_ngrams(data, n_gram_len):
    ngrams = []
    for i in range(len(data) - n_gram_len + 1):
        ngrams.append(data[i:i+n_gram_len])
    return ngrams
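
As a quick check of the sliding window, the sketch below reimplements the same logic as a list comprehension and applies it to a short string. The name `letter_ngrams` is hypothetical, chosen only to avoid shadowing the function above.

```python
def letter_ngrams(data, n):
    """Return every contiguous substring of length n.

    Hypothetical helper that mirrors get_letter_ngrams.
    """
    return [data[i:i + n] for i in range(len(data) - n + 1)]

bigrams = letter_ngrams("markov", 2)
print(bigrams)  # ['ma', 'ar', 'rk', 'ko', 'ov']

# A string shorter than n yields no n-grams.
too_short = letter_ngrams("ab", 3)
print(too_short)  # []
```

A string of length `L` produces `L - n + 1` n-grams, which is why the loop bound is `len(data) - n_gram_len + 1`.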

## 2. Generate Markov Chains

In [113]:
# Note: a letter n-gram model is equivalent to a Markov chain of order n-1.
def generate_ngram_markov(n_gram_len, file_path):
    markov_dict = dict()  # Create a dictionary to map context (sequence of n-1 letters) to next letters.
    with open(file_path, 'r', encoding="utf8") as f:  # Read the data corpus.
        data = f.read()
        n_grams = get_letter_ngrams(data, n_gram_len)  # Generate all letter n-grams from the corpus.
        
        for n_gram in n_grams:  # For each n-gram...
            context = n_gram[:-1]  # Take the first n-1 letters as the context (the slice is already a string).
            next_letter = n_gram[-1]  # Take the last letter of the n-gram (the next letter).

            if context not in markov_dict:  # If the context is not in the dictionary yet.
                markov_dict[context] = list()  # Add it to the dictionary and create a list for it.
            markov_dict[context].append(next_letter)  # Append the next letter to the list for this context.

    for context in markov_dict.keys():  # For each context (n-1 letter sequence).
        markov_dict[context] = Counter(markov_dict[context])  # Create a histogram (Counter) for the next letters appearing after this context.

    return markov_dict
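
To see what the resulting dictionary looks like without needing a corpus file, here is a minimal sketch that builds the same context-to-histogram mapping from an in-memory string. The function name `build_markov_from_text` and the use of `defaultdict(Counter)` are assumptions made for illustration, not part of the notebook's code.

```python
from collections import Counter, defaultdict

def build_markov_from_text(data, n_gram_len):
    """Map each (n-1)-letter context to a Counter of the letters that follow it.

    Hypothetical variant of generate_ngram_markov that takes a string
    instead of a file path.
    """
    table = defaultdict(Counter)
    for i in range(len(data) - n_gram_len + 1):
        n_gram = data[i:i + n_gram_len]
        # Count the last letter under the context formed by the first n-1 letters.
        table[n_gram[:-1]][n_gram[-1]] += 1
    return dict(table)

table = build_markov_from_text("abracadabra", 2)
print(table["a"])  # 'b' follows 'a' twice; 'c' and 'd' follow it once each
```

Counting directly into a `Counter` skips the intermediate lists of next letters, which may matter for a large corpus.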


## 3. Text generation

In [114]:
def generate_text_with_markov(n_gram_len, file_path, initial_text, text_length=500):
    # Generate the n-gram Markov dictionary from the file
    markov_dict = generate_ngram_markov(n_gram_len, file_path)
    
    text = initial_text  # Starting text for generation
    
    for _ in range(text_length):  # Generate up to text_length additional characters
        # Get the last n_gram_len - 1 letters; for n_gram_len == 1 the context is empty (text[-0:] would return the whole text)
        context = text[-(n_gram_len - 1):] if n_gram_len > 1 else ""
        if context not in markov_dict:
            break  # Stop if the context is not in the Markov dictionary
        # Randomly choose the next letter based on the histogram of possible next letters
        idx = random.randrange(sum(markov_dict[context].values()))
        new_letter = next(itertools.islice(markov_dict[context].elements(), idx, None))
        text += new_letter  # Append the new letter to the text
    
    return text
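
The sampling step above draws a random index and walks through `Counter.elements()` until it reaches it. An alternative sketch, assuming only the standard library, uses `random.choices` with the counts as weights; the helper name `sample_next_letter` is hypothetical.

```python
import random
from collections import Counter

def sample_next_letter(counts, rng=random):
    """Pick a key of `counts` with probability proportional to its count.

    Hypothetical helper; `rng` may be the random module or a random.Random instance.
    """
    letters = list(counts.keys())
    weights = list(counts.values())
    return rng.choices(letters, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
counts = Counter({"e": 3, "a": 1})
samples = Counter(sample_next_letter(counts, rng) for _ in range(10_000))
print(samples)  # 'e' should appear roughly three times as often as 'a'
```

Both approaches sample from the same distribution; this one avoids expanding the histogram element by element.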

## 4. Average word size

In [115]:
def average_word_size(text):
    words = text.split()  # split() with no arguments never yields empty strings
    total_length = sum(len(word) for word in words)
    if words:  # Avoid division by zero
        return total_length / len(words)
    else:
        return 0
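
A small worked example of the same calculation, written as a hypothetical standalone helper `mean_word_length` so it can be run on its own:

```python
def mean_word_length(text):
    """Average token length after whitespace splitting; 0 for input with no words.

    Hypothetical mirror of average_word_size.
    """
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0

example = mean_word_length("the quick  brown fox")
print(example)  # (3 + 5 + 5 + 3) / 4 = 4.0

blank = mean_word_length("   ")
print(blank)  # 0
```

Because words are delimited by whitespace only, digits and stray single letters in the generated text count as words too.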

## 5. Usage

1. Generate an approximation of English text using a first-order Markov source (the probability of the next character depends only on the single previous character).

In [116]:
n_gram_len = 2
file_path = "norm_wiki_sample.txt"
initial_text = 'u'
generated_text = generate_text_with_markov(n_gram_len, file_path, initial_text, 500)
print(generated_text)
avg_word_size = average_word_size(generated_text)
print(f"Average word size: {avg_word_size:.2f}")

uienongme cheangix whe nor brane tthe opord be fligonid 1 se langiotr ont sung s unded ls sir f sthee asome cere f siou chem wigarolin f baxt hon meshe sover elynde trelatin an jura an ho amed sy 019 inderkast 2 amahe nveperarore 00043 aiolan heeuthincld cewe m m burep cocont tualsobe 4 p trastotrg aliz fich ialistig kitoumirdimariliths diond odastone y hesicids sincoecenveaplts visomea hests beirand mespre int andowhomof fomespathares belo ed batsune ampperonce phesterod t ge thelerocko a insirn
Average word size: 5.12


2. Do the same for a third-order Markov source (the probability of the next character depends on the three previous characters). Use first- and second-order Markov sources to generate the starting characters.

In [117]:
file_path = "norm_wiki_sample.txt"
initial_text = 'b'
initial_text2 = generate_text_with_markov(2, file_path, initial_text, 1)
initial_text3 = generate_text_with_markov(3, file_path, initial_text2, 1)
generated_text = generate_text_with_markov(4, file_path, initial_text3, 500)
print(generated_text)
avg_word_size = average_word_size(generated_text)
print(f"Average word size: {avg_word_size:.2f}")

bakings wind movingery on old browins pearlyn s carly and andorf a fis guy after of 20 with late his acftu and 4000 celowith the years with atten to di l existmaskevider homa sare topenerathe fere been fromune ward elem in enanch yellow ach producceedever vill a notes to film mylock so in former oner traller with set ints decay the and ohio pose was studio partllpaddition for man naments of deperfectivers untracts inded flow the gil as first new whild just solis bished itselead boys sass of santare
Average word size: 4.66


3. Finally, do the same for a fifth-order Markov source. Begin with the character sequence “probability”.

In [118]:
n_gram_len = 6
file_path = "norm_wiki_sample.txt"
initial_text = 'probability'
generated_text = generate_text_with_markov(n_gram_len, file_path, initial_text, 500)
print(generated_text)
avg_word_size = average_word_size(generated_text)
print(f"Average word size: {avg_word_size:.2f}")

probability fear pieced of ebs is divisiontype1 zone belorussian teacher as also starring the metius although athletics south african be shirts as a judge norwegian governments the instructions india ground brian aborted for severi s exists haverfield 4 garristol easier philliella caven a plains throne to passenger level have farc the effect rather zimbabah pah palace built buckland his mother new dozens of the grnlnder the end on this worked out of real involved following mediated for which lutter of thos
Average word size: 5.32


## 6. Questions

1. What is the average word length in these approximations?
* 1st order: 5.12
* 3rd order: 4.66
* 5th order: 5.32