Importing Necessary Libraries

re: This module provides support for regular expressions, which are used for searching and manipulating strings.

random: This module implements pseudo-random number generators for various distributions.

defaultdict from collections: A subclass of the built-in dict that supplies a default value for missing keys. When a key that does not exist is looked up, defaultdict calls the factory function passed to its constructor, stores the result under that key, and returns it instead of raising a KeyError.

In [1]:
import re
import random
from collections import defaultdict
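
As a minimal sketch of why defaultdict is convenient here, the snippet below groups follower words under a leading word without first checking whether the key exists. The pairs list and the followers name are hypothetical toy data, not part of the notebook.

```python
from collections import defaultdict

# Hypothetical toy data: (word, following word) pairs.
pairs = [("a", "lance"), ("a", "lean"), ("old", "buckler")]

followers = defaultdict(list)
for word, nxt in pairs:
    # First sight of a word creates an empty list instead of raising KeyError.
    followers[word].append(nxt)

print(dict(followers))  # {'a': ['lance', 'lean'], 'old': ['buckler']}

# Caveat: merely reading a missing key also inserts it with an empty list.
_ = followers["missing"]
print("missing" in followers)  # True
```

The caveat matters later: looking up an unknown key in a defaultdict(list) silently yields an empty list, and an empty list cannot be sampled from.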


The corpus is the sample text used to train the Markov chain model. It is a multi-line string containing two long sentences.


In [2]:
corpus = """
In a village of La Mancha, the name of which I have no desire to call to mind, there lived not long since one of those gentlemen that keep a lance in the lance-rack, an old buckler, a lean hack, and a greyhound for coursing.
An olla of rather more beef than mutton, a salad on most nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra on Sundays, made away with three-quarters of his income.
"""


Tokenize the Text


The tokenize function processes the input text to prepare it for the Markov chain model.

text.lower(): Converts all characters in the text to lowercase to ensure uniformity.

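re.findall(r'\b\w+\b', text): Uses a regular expression to find all words in the text. The pattern \b\w+\b matches each run of word characters (letters, digits, and underscores) delimited by word boundaries. Punctuation is discarded, and hyphenated words such as lance-rack are split into separate tokens (lance and rack).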

The result is a list of words (tokens) extracted from the corpus.



In [3]:
def tokenize(text):
    text = text.lower()  # Convert to lowercase
    words = re.findall(r'\b\w+\b', text)  # Find all words
    return words

tokens = tokenize(corpus)
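
To see what this tokenization does to punctuation, capitalization, and hyphens, here is a small self-contained check. The sample string is made up for illustration, and tokenize is restated in compact form so the snippet runs on its own.

```python
import re

def tokenize(text):
    # Same steps as above: lowercase, then extract runs of word characters.
    return re.findall(r'\b\w+\b', text.lower())

# Hypothetical sample sentence, loosely based on the corpus.
sample = "A lance-rack, an old buckler; three-quarters of HIS income."
sample_tokens = tokenize(sample)
print(sample_tokens)
# ['a', 'lance', 'rack', 'an', 'old', 'buckler', 'three', 'quarters', 'of', 'his', 'income']
```

Because sentence boundaries are thrown away along with the punctuation, the model built from these tokens treats the whole corpus as one continuous stream of words.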


Build the Markov Chain Model

The build_markov_chain function constructs the Markov chain model.

defaultdict(list): Creates a dictionary where each key has a default value of an empty list. This is used to store the list of possible next words for each key.

The function iterates through the list of tokens to build the Markov chain:

key = tuple(tokens[i:i+n]): Creates a tuple of n consecutive words from the tokens list. This tuple serves as the key in the Markov chain.

next_word = tokens[i+n]: Identifies the word following the key.

markov_chain[key].append(next_word): Adds the next word to the list of possible subsequent words for the given key.

The function returns the constructed Markov chain.

In [4]:
def build_markov_chain(tokens, n=1):
    markov_chain = defaultdict(list)
    for i in range(len(tokens) - n):
        key = tuple(tokens[i:i+n])
        next_word = tokens[i+n]
        markov_chain[key].append(next_word)
    return markov_chain

markov_chain = build_markov_chain(tokens, n=1)
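
The structure of the chain is easier to inspect on a tiny input. The sketch below restates build_markov_chain so it runs on its own and applies it to a hypothetical six-word token list, once with n=1 and once with n=2.

```python
from collections import defaultdict

def build_markov_chain(tokens, n=1):
    markov_chain = defaultdict(list)
    for i in range(len(tokens) - n):
        key = tuple(tokens[i:i+n])
        markov_chain[key].append(tokens[i+n])
    return markov_chain

# Hypothetical toy token list, not taken from the corpus.
toy_tokens = ["the", "cat", "sat", "on", "the", "mat"]

chain_1 = build_markov_chain(toy_tokens, n=1)
print(dict(chain_1))
# {('the',): ['cat', 'mat'], ('cat',): ['sat'], ('sat',): ['on'], ('on',): ['the']}

chain_2 = build_markov_chain(toy_tokens, n=2)
print(dict(chain_2))
# {('the', 'cat'): ['sat'], ('cat', 'sat'): ['on'], ('sat', 'on'): ['the'], ('on', 'the'): ['mat']}
```

Two things are worth noticing. With n=1, "the" has two recorded followers, so generation can branch there; with n=2 every key has a single follower, so this toy chain can only replay the input. Also, the final word "mat" never appears as a key, because nothing follows it.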


Generate Text

The generate_text function generates a sequence of words using the Markov chain model.


start = random.choice(list(markov_chain.keys())): Randomly selects a starting key from the Markov chain.

current_words = list(start): Initializes the list of current words with the starting key.

text = current_words[:]: Initializes the generated text with the current words.

The function then loops up to length - n times, so the output, counting the n starting words, contains at most length words:

  key = tuple(current_words[-n:]): Extracts the last n words from the current words to form the key.

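  next_word = random.choice(markov_chain[key]): Randomly selects the next word from the list of possible subsequent words for the current key. A word that followed the key several times in the corpus appears several times in this list, so it is chosen proportionally more often.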

  text.append(next_word): Adds the next word to the generated text.

  current_words.append(next_word): Adds the next word to the list of current words.

The function returns the generated text as a single string.

In [5]:
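def generate_text(markov_chain, length=50, n=1):
    start = random.choice(list(markov_chain.keys()))
    current_words = list(start)
    text = current_words[:]

    for _ in range(length - n):
        key = tuple(current_words[-n:])
        # .get avoids inserting missing keys into the defaultdict
        followers = markov_chain.get(key)
        if not followers:
            # Dead end: this key occurs only at the very end of the corpus,
            # so it has no recorded followers (here, after "income")
            break
        next_word = random.choice(followers)
        text.append(next_word)
        current_words.append(next_word)

    return ' '.join(text)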

generated_text = generate_text(markov_chain, length=50, n=1)
print(generated_text)


those gentlemen that keep a greyhound for coursing an old buckler a village of which i have no desire to mind there lived not long since one of which i have no desire to mind there lived not long since one of rather more beef than mutton a lance rack
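
Because generation is random, each run prints different text. One possible way to make runs repeatable is to draw from a seeded random.Random instance, sketched below. The build_chain and generate helpers, the seed parameter, and the sample sentence are assumptions introduced for illustration and are not part of the notebook above.

```python
import random
import re
from collections import defaultdict

def build_chain(tokens, n=1):
    chain = defaultdict(list)
    for i in range(len(tokens) - n):
        chain[tuple(tokens[i:i + n])].append(tokens[i + n])
    return chain

def generate(chain, length=20, n=1, seed=None):
    # A private generator keeps the global random state untouched.
    rng = random.Random(seed)
    words = list(rng.choice(list(chain.keys())))
    while len(words) < length:
        followers = chain.get(tuple(words[-n:]))
        if not followers:
            # Dead end: no recorded follower for the last n words.
            break
        words.append(rng.choice(followers))
    return ' '.join(words)

# Hypothetical sample text, tokenized the same way as the corpus.
sample = "the cat sat on the mat and the cat ran to the mat"
sample_tokens = re.findall(r'\b\w+\b', sample.lower())
chain = build_chain(sample_tokens, n=2)

first = generate(chain, length=12, n=2, seed=42)
second = generate(chain, length=12, n=2, seed=42)
print(first)
print(first == second)  # True: the same seed gives the same text
```

Raising n makes the output follow the source more closely, since longer keys have fewer distinct followers; on a corpus as small as the one used here, n=2 mostly reproduces the original sentences.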
