# Text Generation with Markov Chains

Markov chains are a mathematical concept used to model systems that transition from one state to another, where the probability of each subsequent state depends only on the current state and not on the sequence of events that preceded it. In the context of text generation, Markov chains can be used to generate sequences of text that resemble the style and structure of a given dataset.

### Simple example

In [1]:
import random

# Sample text
text = "A Markov chain process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event."

In [2]:
# Tokenize the text
words = text.split()
word_dict = {}

In [3]:
# Build the Markov chain
for i in range(len(words) - 1):
    if words[i] in word_dict:
        word_dict[words[i]].append(words[i + 1])
    else:
        word_dict[words[i]] = [words[i + 1]]

In [4]:
# Generate text
current_word = "Markov"
output = [current_word]

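for _ in range(15):
    # Stop early if the current word never had a successor in the sample text
    followers = word_dict.get(current_word)
    if not followers:
        break
    next_word = random.choice(followers)
    output.append(next_word)
    current_word = next_word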

generated_text = " ".join(output)
print(generated_text)

Markov chain process is a stochastic model describing a stochastic model describing a stochastic model describing


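This code builds the Markov chain from the sample text and generates a sequence that starts with "Markov" and appends up to 15 more words. Because the sample is a single sentence, most words have only one possible successor, so the output tends to loop through the same phrase.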

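Markov chains can produce text that is locally plausible, because every word follows one that actually preceded it in the source, but they do not track meaning beyond the current state. That is still enough to make them useful for simple text generation and predictive text applications.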
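Because each follower is stored once per occurrence, random.choice picks the next word in proportion to how often it followed the current word. The sketch below makes those proportions explicit for the sample sentence. The helper names build_chain and transition_probabilities are illustrative and not part of the original notebook.

```python
from collections import Counter, defaultdict

text = ("A Markov chain process is a stochastic model describing a sequence "
        "of possible events in which the probability of each event depends "
        "only on the state attained in the previous event.")

def build_chain(words):
    """Map each word to the list of words that follow it in the text."""
    chain = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        chain[current_word].append(next_word)
    return dict(chain)

def transition_probabilities(chain, word):
    """Turn the follower list for `word` into relative frequencies."""
    counts = Counter(chain[word])
    total = sum(counts.values())
    return {follower: count / total for follower, count in counts.items()}

chain = build_chain(text.split())

# "a" is followed once by "stochastic" and once by "sequence"
print(transition_probabilities(chain, "a"))

# "the" has three different followers, each seen once
print(transition_probabilities(chain, "the"))
```

The final word of the text ("event.") never appears as a key, which is why a generator built on this dictionary needs to handle a word with no followers.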

### Exploring the dataset

The code in the following cell reads two plain text files and assigns their contents to the variables text_a and text_b.

In [5]:
with open("data_science.txt") as file_a:
    text_a = file_a.read()
with open("AI.txt") as file_b:
    text_b = file_b.read()

In [6]:
print(text_a[:220])

Data science is an interdisciplinary field that leverages various techniques and principles from statistics, computer science, and domain-specific expertise to extract actionable insights from vast and complex datasets. 


In [7]:
print(text_b[:220])

Artificial Intelligence (AI) is a broad and dynamic field that focuses on creating systems capable of performing tasks that typically require human intelligence. These tasks include learning, reasoning, problem-solving, 


The random.sample() function gives us a random sampling of the contents of a variable (as long as that variable is a sequence of things, like a string or a list). So, for example, to see ten random characters from text B:

In [8]:
import random
random.sample(text_b, 10)

['z', ' ', 'm', 't', 'm', 'e', 'o', 'a', 'e', 't']
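random.sample() draws without replacement, so it never picks the same position twice and cannot return more items than the sequence holds. A minimal sketch, using a short stand-in string (the variable snippet is illustrative and is not one of the notebook's texts):

```python
import random

snippet = "Artificial Intelligence (AI) is a broad and dynamic field"

# Ten characters drawn from ten distinct positions in the string.
# The same letter can still appear twice if it occurs twice in the text.
print(random.sample(snippet, 10))

# Sampling words works the same way once the string is split into a list.
words = snippet.split()
print(random.sample(words, 3))

# Asking for more items than the sequence contains raises ValueError.
try:
    random.sample(words, len(words) + 1)
except ValueError as error:
    print("cannot sample:", error)
```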

In [9]:
a_words = text_a.split()
b_words = text_b.split()

In [10]:
random.sample(a_words, 10)

['science',
 'analysis',
 'handling',
 'driving.',
 'being',
 'of',
 'data',
 'tools',
 'hidden',
 'their']

In [11]:
random.sample(b_words, 10)

['their',
 'approach',
 'can',
 'object',
 'and',
 'optimized',
 'Amazon,',
 'of',
 'like',
 'would']

The code in the following cell uses Python's Counter object to count the most common characters (including spaces) in the first of these texts:

In [12]:
from collections import Counter
Counter(text_a).most_common(10)

[(' ', 858),
 ('e', 566),
 ('a', 496),
 ('i', 465),
 ('n', 446),
 ('t', 394),
 ('s', 380),
 ('o', 309),
 ('r', 292),
 ('d', 249)]

Passing the a_words list instead gives the most frequent words, and the same works for b_words:

In [13]:
Counter(a_words).most_common(10)

[('and', 47),
 ('data', 37),
 ('the', 33),
 ('of', 27),
 ('to', 21),
 ('is', 20),
 ('a', 17),
 ('science', 13),
 ('in', 11),
 ('for', 10)]

In [14]:
Counter(b_words).most_common(10)

[('and', 49),
 ('to', 29),
 ('of', 26),
 ('AI', 25),
 ('the', 20),
 ('is', 15),
 ('a', 13),
 ('with', 11),
 ('that', 10),
 ('as', 10)]

### Markov models

Markov models are mathematical frameworks that predict the future state of a system based on its current state, assuming that the future state depends only on the present state and not on the sequence of events that preceded it. They are used to model random processes and are commonly applied in fields like speech recognition, economics, and biology. The key concept is the "memoryless" property, meaning the next state is determined solely by the current state.

### N-grams

N-grams are sequences of units (like characters or words) from a larger sequence. The "level" refers to the unit type (character or word), and the "order" refers to the length of the n-gram. For example, in the word "intelligence," the character-level order-2 n-grams, taken at every position, are:
* in
* nt
* te
* el
* ll
* li
* ig
* ge
* en
* nc
* ce

N-grams are used in natural language processing for tasks like spelling correction, text analysis, compression algorithms, and generative text.
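Extracting n-grams amounts to sliding a window of the chosen order across the sequence, one unit at a time. A minimal sketch, where the function name ngrams is an illustrative choice:

```python
def ngrams(sequence, order):
    """Return every run of `order` consecutive units, in order of appearance."""
    return [sequence[i:i + order] for i in range(len(sequence) - order + 1)]

# Character level: the sequence is a string, so each n-gram is a substring.
print(ngrams("intelligence", 2))

# Word level: the sequence is a list of words, so each n-gram is a list slice.
print(ngrams("the quick brown fox".split(), 2))
```

The same function covers both levels because strings and lists support the same slicing syntax; only the input changes.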

### Generating text from a Markov model

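A Markov model extends the n-gram concept by tracking what units follow each n-gram. For example, in "intelligence," the n-gram "in" is followed by "t," and "te" is followed by "l." Each order-2 n-gram occurs only once in this word, so each has a single possible follower; in an order-1 model, by contrast, "e" is followed by either "l" or "n."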
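Such a model can be sketched as a dictionary that maps each n-gram to the list of units seen after it. The name build_model is illustrative, and this version assumes the input is a string (for a list of words, the slice would need to be converted to a tuple before it can serve as a dictionary key).

```python
from collections import defaultdict

def build_model(text, order):
    """Map each n-gram of length `order` to the characters that follow it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        ngram = text[i:i + order]
        model[ngram].append(text[i + order])
    return dict(model)

order_2 = build_model("intelligence", 2)
print(order_2["in"])  # ['t']
print(order_2["te"])  # ['l']

# At order 1 some states have more than one follower.
order_1 = build_model("intelligence", 1)
print(order_1["e"])   # ['l', 'n']
print(order_1["i"])   # ['n', 'g']
```

The last n-gram of the text ("ce" at order 2) has no follower, so it does not appear as a key.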

To generate text using a Markov model, start with an initial n-gram, and then iteratively choose the next unit based on the probabilities from the model. This process creates a new sequence that statistically resembles the original input. This technique is known as a Markov chain generator. For example, using the order-2 character-level Markov model of "intelligence" and starting from "in," the generated text grows one character at a time:

* in
* int
* inte
* intel
* intell
* intelli
* intellig
* intellige
* intelligen
* intelligence


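Because every order-2 n-gram in "intelligence" has exactly one follower, the generator can only reproduce the original word. A longer source text, or a lower order, gives some n-grams several possible followers, and the output starts to diverge from the source.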
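The generation loop can be sketched as follows. The names build_model and generate, and the max_units limit, are illustrative assumptions rather than anything defined by the notebook or by a library.

```python
import random
from collections import defaultdict

def build_model(text, order):
    """Map each n-gram of length `order` to the characters that follow it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return dict(model)

def generate(model, seed, max_units=50, rng=random):
    """Extend `seed` one character at a time.

    Stops when the current n-gram has no recorded follower
    or when the output reaches `max_units` characters.
    """
    order = len(seed)
    output = seed
    while len(output) < max_units:
        followers = model.get(output[-order:])
        if not followers:
            break
        output += rng.choice(followers)
    return output

# Order 2: every state has one follower, so the walk is deterministic.
print(generate(build_model("intelligence", 2), "in"))  # intelligence

# Order 1: several states branch, so the result varies from run to run.
print(generate(build_model("intelligence", 1), "i", max_units=20))
```

The max_units cap matters at order 1, where the chain can cycle through the same letters indefinitely.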

### Generating with Markovify

In [15]:
# Install Markovify into the Python environment this notebook is running in
import sys
!{sys.executable} -m pip install markovify



In [16]:
import markovify

The following cell creates a new text generator: it builds a Markov model from the text in the given variable and assigns it to the variable generator_a.

In [17]:
generator_a = markovify.Text(text_a)

Generate a sentence from the model with the ".make_sentence()" method. It returns None if no acceptable sentence is found within its allotted tries.

In [18]:
print(generator_a.make_sentence())

The primary objective of data and complex patterns.


The ".make_short_sentence()" method allows you to specify a maximum length, in characters, for the generated sentence.
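The idea behind a length-limited generator can be sketched without Markovify: keep producing candidates and return the first one that fits, or None if nothing fits within a fixed number of tries. The function first_short_enough, its parameters, and the two stand-in sentences are hypothetical; this illustrates the idea and should not be read as Markovify's implementation.

```python
import random

def first_short_enough(make_candidate, max_chars, tries=10):
    """Return the first candidate no longer than `max_chars`, or None."""
    for _ in range(tries):
        candidate = make_candidate()
        if candidate is not None and len(candidate) <= max_chars:
            return candidate
    return None

# Stand-in for a sentence generator: picks one of two fixed sentences.
sentences = [
    "Data science is an interdisciplinary field.",
    "AI is a broad and dynamic field that focuses on creating systems "
    "capable of performing tasks.",
]

def pick():
    return random.choice(sentences)

# Usually finds the shorter sentence within ten tries.
print(first_short_enough(pick, max_chars=50))

# Neither sentence fits in ten characters, so the result is None.
print(first_short_enough(pick, max_chars=10))
```

Markovify's own ".make_sentence()" and ".make_short_sentence()" can likewise return None, so it is worth checking the result before calling string methods on it.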

### Changing the order

In [20]:
gen_a_1 = markovify.Text(text_a, state_size=1)
gen_a_4 = markovify.Text(text_a, state_size=4)

In [24]:
print("order 1")
print(gen_a_1.make_sentence(test_output=False))
print()
print("order 4")
print(gen_a_4.make_sentence(test_output=False))

order 1
Supervised learning involves training a distributed computing environment.

order 4
Supervised learning involves training a model on labeled data, where the target variable is known, to make predictions on new, unseen data.


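In general, the higher the order, the more "coherent" the sentences will seem, because longer stretches are copied from the source text; at high orders the generator may simply reproduce source sentences. Lower-order models produce more variation but less coherent sentences. Passing test_output=False, as above, turns off Markovify's check that rejects output overlapping too heavily with the source.
Deciding on the order is usually a matter of trial and error.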
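One way to see why lower orders vary more is to count the states that offer a real choice, that is, more than one distinct follower. The sketch below does this for the sample sentence from the first example at word-level orders 1 and 2. The helpers build_word_model and branching_states are illustrative and are not Markovify functions.

```python
from collections import defaultdict

text = ("A Markov chain process is a stochastic model describing a sequence "
        "of possible events in which the probability of each event depends "
        "only on the state attained in the previous event.")

def build_word_model(words, order):
    """Map each tuple of `order` consecutive words to the set of next words."""
    model = defaultdict(set)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        model[state].add(words[i + order])
    return model

def branching_states(model):
    """States with more than one distinct follower, where output can diverge."""
    return sorted(state for state, followers in model.items()
                  if len(followers) > 1)

words = text.split()
for order in (1, 2):
    model = build_word_model(words, order)
    print(order, branching_states(model))
# 1 [('a',), ('in',), ('of',), ('the',)]
# 2 []
```

At order 2 no state branches, so a generator built on this sentence could only replay it; at order 1 there are four points where the output can take a different path.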

### Changing the level

Markovify, by default, works with words as the individual unit. It doesn't come out-of-the-box with support for character-level models. 
The following code defines a new kind of Markovify generator that implements character-level models.

In [25]:
class SentencesByChar(markovify.Text):
    def word_split(self, sentence):
        return list(sentence)
    def word_join(self, words):
        return "".join(words)

Any of the parameters you passed to markovify.Text you can also pass to SentencesByChar. 

The state_size parameter still controls the order of the model, but now the n-grams are characters, not words.

The following cell creates a character-level order-7 Markov chain text generator from text A:

In [26]:
gen_a_char = SentencesByChar(text_a, state_size=7)

In [27]:
print(gen_a_char.make_sentence(test_output=False).replace("\n", " "))

EDA helps in identifying patterns, correlations, and inconsistencies.


### Combining models

Markovify has a handy feature that allows you to combine models, creating a new model that draws on probabilities from both of the source models.
To do this, we need to create the models independently, and then pass them to "markovify.combine()" to combine them.

In [28]:
generator_a = markovify.Text(text_a)
generator_b = markovify.Text(text_b)
combo = markovify.combine([generator_a, generator_b], [0.5, 0.5])

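The list [0.5, 0.5] sets the "weights" of the models, i.e., how strongly each model's probabilities count in the combined model. The weights are given in the same order as the models, so here text A and text B contribute equally.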
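Conceptually, combining means merging the follower counts of each model after scaling them by that model's weight. The sketch below illustrates the idea with plain dictionaries. The helpers count_followers and combine_counts are hypothetical and the two tiny texts are made up for illustration, so this should not be read as Markovify's implementation.

```python
from collections import Counter, defaultdict

def count_followers(words):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        model[current_word][next_word] += 1
    return model

def combine_counts(models, weights):
    """Add up follower counts across models, scaling each model by its weight."""
    combined = defaultdict(Counter)
    for model, weight in zip(models, weights):
        for state, followers in model.items():
            for follower, count in followers.items():
                combined[state][follower] += count * weight
    return combined

model_a = count_followers("data science uses data".split())
model_b = count_followers("AI uses data and AI uses models".split())

# Equal weights: "uses" -> "data" appears in both texts, "uses" -> "models" in one.
equal = combine_counts([model_a, model_b], [0.5, 0.5])
print(dict(equal["uses"]))  # {'data': 1.0, 'models': 0.5}

# Favouring the second model raises the share of its followers.
skewed = combine_counts([model_a, model_b], [0.2, 0.8])
print(dict(skewed["uses"]))
```

A follower's probability in the combined model is its weighted count divided by the total for that state, so raising one model's weight shifts the output toward that model's text.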

In [29]:
print(combo.make_sentence())

AI is computer vision, which aims to enable machines to possess consciousness, self-awareness, and a deep understanding of context and nuance.


### Bringing it all together

In [60]:
# "word" for a word-level model; any other value (e.g. "char") for a character-level model
level = "word"
# controls the length of the n-gram
order = 7
# controls the number of lines to output
output_n = 14
# weights between the models; text A first, text B second.
weights = [0.5, 0.5]
# limit sentence output to this number of characters
length_limit = 200

In [61]:
model_cls = markovify.Text if level == "word" else SentencesByChar

gen_a = model_cls(text_a, state_size=order)
gen_b = model_cls(text_b, state_size=order)

gen_combo = markovify.combine([gen_a, gen_b], weights)

for i in range(output_n):
    out = gen_combo.make_short_sentence(length_limit, test_output=False)
    # make_short_sentence returns None when no sentence fits the length limit
    if out is None:
        continue
    out = out.replace("\n", " ")
    print(out)
    print()

In healthcare, AI is used for diagnosing diseases, personalized treatment planning, and drug discovery.

Ensuring that AI systems are transparent, fair, and accountable is essential for fostering trust and preventing harm.

As research and development continue to advance, the future of AI promises to bring about unprecedented innovations and opportunities.

Data science has a wide range of applications across different industries.

General AI, or strong AI, refers to systems that possess the ability to understand, learn, and apply knowledge across a broad range of tasks, much like a human being.

In healthcare, data science is revolutionizing personalized medicine, predictive analytics, and the discovery of new drugs.

Ensuring that AI systems are transparent, fair, and accountable is essential for fostering trust and preventing harm.

Applications of NLP include machine translation, sentiment analysis, chatbots, and voice assistants.

It encompasses a wide range of techniques and tech