# Text generation with Markovify

By [Allison Parrish](http://www.decontextualize.com/)

This notebook is a tour of how to generate text with Markov chains! Markov chains are a simple example of *predictive text generation*, a term I use to refer to methods of text generation that make use of statistical model that, given a certain stretch of text, *predicts* which bit of text should come next, based on probabilities learned from an existing corpus of text.

The code is written in Python, but you don't really need to know Python in order to use the notebook. Everything's pre-written for you, so you can just execute the cells, making small changes to the code as needed.

## Working with text files

Before we get started, we'll first need some text! Grab two [plain text files from Project Gutenberg](http://www.gutenberg.org/) (or from another source of your choice) and save them to the same directory as this notebook. (I suggest working with two files because we'll be running some code explicitly to "compare" two texts. Also, I think seeing two different outputs from the text generation methods discussed in this notebook will help you better understand how those methods work.) The code in the following cell loads into Python variables the contents of *two plain text files*, assigned to variables `text_a` and `text_b`. You'll need to replace the filenames with the names of the files that you downloaded, keeping the quotation marks (`"`) intact.

In [2]:
text_a = open("Cathay.txt").read()
text_b = open("长干行.txt").read()

These variables are *strings*, which are essentially just long lists of the characters that occur in the text, in the order that they occur. The code in the following cell shows the first two hundred characters of text A:

In [26]:
print(text_a[:200])


Song of the Bowmen of Shu  Here we are, picking the first fern-shoots And saying: When shall we get back to our country? Here we are because we have the Ken-nin for our foemen, We have no comfort bec


You can change `text_a` to `text_b` to see the output from your second text, or change `200` to a number of your choosing.

The `random.sample()` function gives us a random sampling of the contents of a variable (as long as that variable is a sequence of things, like a string or a list). So, for example, to see twenty random characters from text B:

In [5]:
import random
random.sample(text_a, 20)

['g',
 'o',
 'e',
 'e',
 'o',
 ' ',
 ' ',
 'r',
 'a',
 'u',
 'r',
 ' ',
 ' ',
 'r',
 ' ',
 'e',
 'e',
 ' ',
 's',
 ',']

This isn't incredibly helpful on its own, but you'll notice that the characters it drew (probably) more or less follow the expected letter distribution for English (i.e., lots of `e`s and `n`s and `t`s).

Perhaps more interesting would be to see a randomly-sampled list of *words*. To do this, we'll make separate variables for the words in the text, using a Python function called `.split()`, which takes a string and turns it into a list of words contained in that string. The following cell makes two new variables that contain the words from both texts respectively:

In [27]:
a_words = text_a.split()
b_words = text_b.split()

In [52]:
import jieba

Now, ten random words from both text A and text B:

In [110]:
b_words = []
with open('长干行.txt', 'r') as f:
    for line in f:
        words = jieba.cut(line.strip(), cut_all=True)
        b_words.extend(words)

In [111]:
print(b_words)

['長', '干', '行', '李白', '妾', '髮', '初', '覆', '額', '，', '折', '花', '門', '前', '劇', '；', '郎', '騎', '竹', '馬', '來', '，', '遶', '床', '弄', '青梅', '。', '同居', '長', '干', '里', '，', '兩', '小', '無', '嫌', '猜', '。', '十四', '為', '君', '婦', '，', '羞', '顏', '未', '嘗', '開', '；', '低', '頭', '向', '暗', '壁', '，', '千', '喚', '不一', '一回', '，', '十五', '始', '展眉', '，', '願', '同', '塵', '與', '灰', '；', '常存', '抱柱', '信', '，', '豈', '上', '望', '夫', '臺', '？', '十六', '君', '遠', '行', '，', '瞿', '塘', '灩', '澦', '堆', '；', '五月', '不可', '觸', '，', '猿', '聲', '天上', '哀', '。', '門', '前', '遲', '行', '跡', '，', '一一', '一生', '綠', '苔', '；', '苔', '深', '不能', '掃', '，', '落', '葉', '秋', '風', '早', '。', '八月', '蝴蝶', '來', '，', '雙', '飛', '西', '園', '草', '。', '感', '此', '傷', '妾', '心', '，', '坐', '愁', '紅', '顏', '老', '。', '早晚', '下', '三', '巴', '，', '預', '將', '書', '報', '家', '；', '相迎', '不', '道', '遠', '，', '直至', '長', '風', '沙', '。']


In [7]:
random.sample(a_words, 10)

['received', 'High', 'or', 'is', 'with', 'it', 'is', '1.E.1', 'The', 'his']

In [59]:
random.sample(b_words, 10)

['。', '妾', '初', '弄', '，', '早', '心', '五月', '，', '嘗']

The code in the following cell uses Python's `Counter` object to count the *most common* letters in the first of these texts:

In [8]:
from collections import Counter
Counter(text_a).most_common(12)

[(' ', 5739),
 ('e', 3535),
 ('t', 2659),
 ('o', 2470),
 ('r', 2110),
 ('a', 2008),
 ('n', 2006),
 ('i', 1884),
 ('s', 1598),
 ('h', 1449),
 ('d', 1076),
 ('l', 1055)]

Specifying the `a_words` variable gives the most frequent *words* instead:

In [9]:
Counter(a_words).most_common(12)

[('the', 390),
 ('of', 199),
 ('to', 135),
 ('and', 135),
 ('in', 102),
 ('a', 101),
 ('with', 77),
 ('Project', 77),
 ('or', 73),
 ('And', 70),
 ('you', 70),
 ('are', 55)]

Compare these to the most common words in text B:

In [60]:
Counter(b_words).most_common(12)

[('，', 16),
 ('。', 7),
 ('；', 6),
 ('長', 3),
 ('行', 3),
 ('干', 2),
 ('妾', 2),
 ('門', 2),
 ('前', 2),
 ('來', 2),
 ('君', 2),
 ('顏', 2)]

## Markov models

I won't go into the precise details of how to implement a Markov chain text generator in this notebook. (I have written [a tutorial on this topic](https://github.com/aparrish/rwet/blob/master/ngrams-and-markov-chains.ipynb) elsewhere, however!) But I think it's helpful to understand the fundamentals of how Markov chain text generation works. The next question we’re going to try to answer is this: Given a stretch of text (say a string of characters, or run of words), what is most the most likely bit of text to come next?

One way to answer this question is with an n-gram based Markov model. What's an n-gram? I'm glad you asked!

### N-grams

An n-gram is simply a sequence of units drawn from a longer sequence; in the case of text, the unit in question is usually a character or a word. For convenience, we'll call the unit of the n-gram is called its level; the length of the n-gram is called its order. For example, the following is a list of all unique character-level order-2 n-grams in the word condescendences:

    co
    on
    nd
    de
    es
    sc
    ce
    en
    nc

And the following is an excerpt from the list of all unique word-level order-5 n-grams in The Road Not Taken:

    Two roads diverged in a
    roads diverged in a yellow
    diverged in a yellow wood,
    in a yellow wood, And
    a yellow wood, And sorry
    yellow wood, And sorry I

N-grams are used frequently in natural language processing and are a basic tool text analysis. Their applications range from programs that correct spelling to creative visualizations to compression algorithms to stylometrics to generative text.

### What comes next?

A Markov model for text begins with a list of n-grams. But in addition to making this list, we also keep track of what unit of text (word, character, etc.) *follows* each of those n-grams.

Let’s do a quick example by hand. This is the same character-level order-2 n-gram analysis of the (very brief) text “condescendences” as above, but this time keeping track of all characters that follow each n-gram:

| n-grams |	next? |
| ------- | ----- |
|co| n|
|on| d|
|nd| e, e|
|de| s, n|
|es| c, (end of text)|
|sc| e|
|ce| n, s|
|en| d, c|
|nc| e|

From this table, we can determine that while the n-gram `co` is followed by n 100% of the time, and while the n-gram `on` is followed by `d` 100% of the time, the n-gram `de` is followed by `s` 50% of the time, and `n` the rest of the time. Likewise, the n-gram `es` is followed by `c` 50% of the time, and followed by the end of the text the other 50% of the time.

Exercise: Imagine (or even better, write out) what this table might look like if you were analyzing words instead of characters, with a source text of your choice.

### Markov chains: Generating text from a Markov model

The Markov models we created above don't just give us interesting statistical probabilities. It also allows us generate a *new* text with those probabilities by *chaining together predictions*. Here’s how we’ll do it, starting with the order 2 character-level Markov model of `condescendences`: (1) start with the initial n-gram (`co`)—those are the first two characters of our output. (2) Now, look at the last *n* characters of output, where *n* is the order of the n-grams in our table, and find those characters in the “n-grams” column. (3) Choose randomly among the possibilities in the corresponding “next” column, and append that letter to the output. (Sometimes, as with `co`, there’s only one possibility). (4) If you chose “end of text,” then the algorithm is over. Otherwise, repeat the process starting with (2). Here’s a record of the algorithm in action:

    co
    con
    cond
    conde
    conden
    condend
    condendes
    condendesc
    condendesce
    condendesces
    
As you can see, we’ve come up with a word that looks like the original word, and could even be passed off as a genuine English word (if you squint at it). From a statistical standpoint, the output of our algorithm is nearly indistinguishable from the input. This kind of algorithm—moving from one state to the next, according to a list of probabilities—is known as a Markov chain generator.

### Generating with Markovify

Fortunately, with the invention of digital computers, you don't have to perform this algorithm by hand! In fact, Markov chain text generation has been a pastime of poets and programmers going back [all the way to 1983](https://www.jstor.org/stable/24969024), so it should be no surprise that there are many implementations of the idea in Python that you can download and install. The one we're going to use is [Markovify](https://github.com/jsvine/markovify), a Markov chain text generation library originally developed for BuzzFeed, apparently. It comes with a lot of extra niceties that will make our lives easier, but underneath the hood, it implements an algorithm very similar to the one we just did by hand above.

To install Markovify on your computer, run the cell below:

In [10]:
import sys
!{sys.executable} -m pip install markovify

Collecting markovify
  Downloading markovify-0.9.4.tar.gz (27 kB)
  Preparing metadata (setup.py) ... [?25ldone
[?25hCollecting unidecode
  Downloading Unidecode-1.3.6-py3-none-any.whl (235 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m235.9/235.9 kB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hBuilding wheels for collected packages: markovify
  Building wheel for markovify (setup.py) ... [?25ldone
[?25h  Created wheel for markovify: filename=markovify-0.9.4-py3-none-any.whl size=18606 sha256=9f0b7ba72220476aa9644dd5dd3d171cdc44eac37a56bb23a161caf8add73e9d
  Stored in directory: /Users/vanorazhang/Library/Caches/pip/wheels/76/0a/ab/8727d219981e57e6036316dd2ec2037e61ccea0c016f7ae0c1
Successfully built markovify
Installing collected packages: unidecode, markovify
Successfully installed markovify-0.9.4 unidecode-1.3.6


And then run this cell to make the library available in your notebook:

In [3]:
import markovify

The code in the following cell creates a new text generator, using the text in the variable specified to build the Markov model, which is then assigned to the variable `generator_a`.

In [12]:
generator_a = markovify.Text(text_a)

In [61]:
generator_b = markovify.Text(text_b)

In [63]:
print(generator_b.make_sentence())

None


You can then call the `.make_sentence()` method to generate a sentence from the model:

In [None]:

# Train a Markov model on the text
model = markovify.Text(text, state_size=2)


In [13]:
print(generator_a.make_sentence())

* You comply with the free distribution of electronic works that can be found at the wall.


The `.make_short_sentence()` method allows you to specify a maximum length for the generated sentence:

In [15]:
print(generator_a.make_short_sentence(150))

1.D. The copyright laws of the day we were drunk for month on month, forgetting the kings and princes.


By default, Markovify tries to generate a sentence that is significantly different from any existing sentence in the input text. As a consequence, sometimes the `.make_sentence()` or `.make_short_sentence()` methods will return `None`, which means that in ten tries it wasn't able to generate such a sentence. You can work around this by increasing the number of times it tries to generate a sufficiently unique sentence using the `tries` parameter:

In [16]:
print(generator_a.make_short_sentence(40, tries=100))

We have no comfort.


Or by disabling the check altogether with `test_output=False` (note that this means the generator will occasionally return stretches of text that are present in the source text):

In [17]:
print(generator_a.make_short_sentence(40, test_output=False))

By Mei Sheng.


### Changing the order

When you create the model, you can specify the order of the model using the `state_size` parameter. It defaults to 2. Let's make two model with different orders and compare:

In [18]:
gen_a_1 = markovify.Text(text_a, state_size=1)
gen_a_4 = markovify.Text(text_a, state_size=4)

In [19]:
print("order 1")
print(gen_a_1.make_sentence(test_output=False))
print()
print("order 4")
print(gen_a_4.make_sentence(test_output=False))

order 1
No longer the same way for explanation, and with the efforts and princes.

order 4
In another I find a perfect speech in a literality which will be to many most unacceptable.


In general, the higher the order, the more the sentences will seem "coherent" (i.e., more closely resembling the source text). Lower order models will produce more variation. Deciding on the order is usually a matter of taste and trial-and-error.

### Changing the level

Markovify, by default, works with *words* as the individual unit. It doesn't come out-of-the-box with support for character-level models. The following code defines a new kind of Markovify generator that implements character-level models. Execute it before continuing:

In [140]:
class SentencesByChar(markovify.Text):
    def word_split(self, sentence):
        return list(sentence)
    def word_join(self, words):
        return "".join(words)

Any of the parameters you passed to `markovify.Text` you can also pass to `SentencesByChar`. The `state_size` parameter still controls the order of the model, but now the n-grams are characters, not words.

The following cell implements a character-level Markov text generator for the word "condescendences":

In [21]:
con_model = SentencesByChar("condescendences", state_size=2)

Execute the cell below to see the output—it'll be a lot like what we implemented by hand earlier!

In [22]:
con_model.make_sentence()

'condendescencencendes'

Of course, you can use a character-level model on any text of your choice. So, for example, the following cell creates a character-level order-7 Markov chain text generator from text A:

In [70]:
gen_a_char = SentencesByChar(text_a, state_size=7)

In [87]:
print(text_b)

長干行
李白
妾髮初覆額，
折花門前劇；
郎騎竹馬來，
遶床弄青梅。
同居長干里，
兩小無嫌猜。
十四為君婦，
羞顏未嘗開；
低頭向暗壁，
千喚不一回，
十五始展眉，
願同塵與灰；
常存抱柱信，
豈上望夫臺？
十六君遠行，
瞿塘灩澦堆；
五月不可觸，
猿聲天上哀。
門前遲行跡，
一一生綠苔；
苔深不能掃，
落葉秋風早。
八月蝴蝶來，
雙飛西園草。
感此傷妾心，
坐愁紅顏老。
早晚下三巴，
預將書報家；
相迎不道遠，
直至長風沙。


And the cell below prints out a random sentence from this generator. (The `.replace()` is to get rid of any newline characters in the output.)

In [71]:
print(gen_a_char.make_sentence(test_output=False).replace("\n", " "))

Bosque taketh blossom?


In [85]:
gen_b_char = SentencesByChar(text_b, state_size=1)

In [86]:
print(gen_b_char.make_sentence(test_output=False).replace("\n", " "))

長風沙。 十六君婦， 感此傷妾心， 預將書報家； 低頭向暗壁， 低頭向暗壁， 直至長干里， 十六君婦， 雙飛西園草。 坐愁紅顏老。 同塵與灰； 願同居長干行 一生綠苔； 八月不一生綠苔； 猿聲天上望夫臺？ 羞顏未嘗開； 感此傷妾髮初覆額， 一一回， 遶床弄青梅。 低頭向暗壁， 低頭向暗壁， 同塵與灰； 門前遲行， 十四為君遠， 兩小無嫌猜。 瞿塘灩澦堆； 感此傷妾心， 早。 直至長干行 十六君遠行 李白 郎騎竹馬來， 千喚不能掃， 同塵與灰； 同居長干里， 折花門前遲行， 兩小無嫌猜。 苔； 直至長干行 相迎不道遠， 折花門前劇； 猿聲天上哀。 一回， 早晚下三巴， 苔； 直至長干里， 坐愁紅顏未嘗開； 十四為君婦， 郎騎竹馬來， 妾髮初覆額， 苔； 八月不一生綠苔深不一回， 預將書報家； 郎騎竹馬來， 相迎不能掃， 八月不能掃， 十六君遠， 郎騎竹馬來， 預將書報家； 常存抱柱信， 願同居長風早晚下三巴， 五月蝴蝶來， 願同居長風沙。


### Combining models

Markovify has a handy feature that allows you to *combine* models, creating a new model that draws on probabilities from both of the source models. You can use this to create hybrid output that mixes the style and content of two (or more!) different source texts. To do this, you need to create the models independently, and then call `.combine()` to combine them.

In [47]:
generator_a = markovify.Text(text_a)
generator_b = markovify.Text(text_b)
combo = markovify.combine([generator_a, generator_b], [0.5, 0.5])

In [49]:
print(generator_b)

<markovify.text.Text object at 0x1069a3790>


The bit of code `[0.5, 0.5]` controls the "weights" of the models, i.e., how much to emphasize the probabilities of any model. You can change this to suit your tastes. (E.g., if you want mostly text A with but a *soupçon* of text B, you would write `[0.9, 0.1]`. Try it!) 

Then you can create sentences using the combined model:

In [48]:
print(combo.make_sentence())

And within, the mistress, in the same way over the portals of Sei-go-yo, And clings to the north of Raku-hoku, Till we had nothing but thoughts and memories in common.


### Bringing it all together

I've pre-written some code below to make it easy for you to experiment and produce output from Markovify. Just make adjustments to the values assigned to the variables in the cell below:

In [34]:
# change to "word" for a word-level model
level = "char"
# controls the length of the n-gram
order = 7
order2 = 1
# controls the number of lines to output
output_n = 14
# weights between the models; text A first, text B second.
# if you want to completely exclude one model, set its corresponding value to 0
weights = [0.5, 0.5]
# limit sentence output to this number of characters
length_limit = 280

In [37]:
class SentencesByChar(markovify.Text):
    def word_split(self, sentence):
        return list(sentence)
    def word_join(self, words):
        return "".join(words)


In [40]:
import markovify

# Read the input text file
with open('Cathay.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Build the Markov chain model
model = markovify.Text(text, state_size=2)

# Generate a new poem
for i in range(15):
    poem = model.make_sentence()
    print(poem)


From Rihaku FOUR POEMS OF DEPARTURE Light rain is on the going, And on the feathery banners.
With yellow gold and white jewels, we paid for songs and laughter And we went on living in the heart.
None
Sorrow to go, and sorrow, sorrow like rain.
With yellow gold and white jewels, we paid for songs and laughter And we guardsmen fed to the north of the hill at his horse's bridle.
The sea's colour moves at the wing-flapping storks, He returns by way of the Bowmen of Shu Here we are because we have the Ken-nin for our foemen, We have no rest, three battles a month.
Moaneth alway my mind's lust That I fare forth, that I on high streams The salt-wavy tumult traverse alone.
Lament of the Frontier Guard By the south side of the castle, To the dance of the turning and twisting waters, Into a valley of the whole book of translations.
Five clouds hang aloft, bright on the going, And on the stone-cliffs beaten, fell on the water.
In another I find a perfect speech in a man's tide go, turn it to twai

(The lines beginning with `#` are "comments"—they don't do anything, they're just there to explain what's happening in the code.)

After making your changes above, run the cell below to generate text according to your parameters. Repeat as necessary until you get something you really like!

In [41]:
model_cls = markovify.Text if level == "word" else SentencesByChar
gen_a = model_cls(text_a, state_size=order)
gen_b = model_cls(text_b, state_size=order)
gen_combo = markovify.combine([gen_a, gen_b], weights)
for i in range(output_n):
    out = gen_combo.make_short_sentence(length_limit, test_output=False)
    out = out.replace("\n", " ")
    print(out)
    print()

By Kutsugen.

Horses, his horses are on the base of old hills.

Haughty their buried bodies Be an unlikely treasure lasting, with the dragon-scales, going grass in the middle kingdom, Three hundred and sixty thousand from the sea and frosts, High heaps, covered with the wave's slash, Yet longing comes to an end.

Also she has no excuse on account of weather.

Ah, how shall you know the whole book of translations.

The leaves fall early this autumn.

The eastern wind brings the green on this flute, Their voice is in these unquestionable poems.

And your feet were by frost benumbed.

Taking Leave of a Friend   Blue mountains white-headed.

Cuckoo calleth with gloomy crying lone-flyer, Whets for the wall.

The City of Chokan: Two small people, with courtezans, going greener, But you, Sir, had better take wine ere your departing of old acquaintances Who bow over the river Kiang   Ko-jin goes west from Ko-kaku-ro, The smoke-flowers.

Coldly afflicted, My feet when you went north to San pala

## Generating with non-prose text

Markovify assumes you're feeding it prose, i.e., a text file that can be parsed into sentences by separating on sentence-ending punctuation. But often you're *not* working with text like this. For example, let's generate some sonnets. First, download [this plaintext version of Shakespeare's sonnets](https://raw.githubusercontent.com/aparrish/plaintext-example-files/master/sonnets.txt) and keep it in the same directory as this notebook. We'll define the sonnet-generating task as consisting of (a) training a Markov chain on lines of poetry and then (b) generating a sequence of fourteen lines of poetry. Since the *line* is the unit now and not the *sentence*, we need to use Markovify's `NewlineText` class instead of `Text`:

In [42]:
sonnets_text = open("Cathay.txt").read()
sonnets_model = markovify.NewlineText(sonnets_text, state_size=1)

And then we can generate:

In [43]:
sonnets_model.make_sentence()

"I stopped scowling, I desired my forehead I played about the North Gate, the gate now, the heart, And you departed, You went out. By the narrows of Ernest Fenollosa's notes by a thousand frosts, High heaps, covered with trees and on the beginning of swirling eddies, And I never looked at the eastward-flowing waters. Petals are given over the back-swirling eddies, But to-days men for the dawn And she puts forth a courtezan in a certain poem :"

In [None]:
import markovify

# Read the text file
with open('长干行.txt', encoding='utf-8') as f:
    text = f.read()

# Split the text into individual characters
chars = list(text)

# Train a Markov model on the characters
model = markovify.Text(chars, state_size=2)

# Generate a new poem by combining characters
new_poem = ""

# Keep generating new lines until we have 4 lines
while len(new_poem.split('\n')) < 4:
    line = model.make_sentence()
    if line is not None:
        new_poem += line + '\n'

print(new_poem)


And now make a sonnet, sorta:

In [44]:
for i in range(14):
    print(sonnets_model.make_sentence())

Note.—Jewel stairs, but has come to cut the mad chase through the bridge-rail. The sea's colour moves at the eastward-flowing waters. Petals are they hang over to watch out to the North Gate, With head-gear glittering against the Bridge at the middle kingdom, Three hundred and sixty thousand, And evening drives them away! The Beautiful Toilet Blue, blue plums. And no excuse on the eastward-flowing waters. Petals are already yellow with head-trappings of Ernest Fenollosa's notes by a thousand gates, At morning there is the door. Slender, she puts forth a Letter While my head, I have been gone waters and towers to many most unacceptable. The sea's colour moves at the narrows of a thousand autumns, Unwearying autumns. For them away! The leaves fall early this autumn, A gracious spring, turned to the West garden, They ride upon dragon-like horses, Upon horses with his mistress, With her too much alone.
I stopped scowling, I never looked back. At sixteen you departed, You came by the gate n

Doing this with a character-level model is a bit more tricky. I've written code in the cell below that defines a new class, `LinesByCharacter` that works like `NewlineText` but operates character-by-character instead of word-by-word:

In [19]:
class LinesByChar(markovify.NewlineText):
    def word_split(self, sentence):
        return list(sentence)
    def word_join(self, words):
        return "".join(words)

Now we can create a character model with the sonnets, line by line:

In [96]:
sonnets_char_model = LinesByChar(sonnets_text, state_size=4)

And generate new sonnets:

In [102]:
for i in range(14):
    print(sonnets_char_model.make_sentence())

None
None
None
None
None
None
In anotherefore alread overhead-trappings. And with Rihaku.
None
None
The looked her hang over laught the men are a perfumed about this to blows:
None
None
By Riokushu, They hurt merely pring autumn, the sea's come to the poem :
None


### New moods

Character-level Markov chains are especially suitable, in my experience, for generating shorter texts, like individual words or names. Let's generate names of new moods using this technique. First, download [this JSON file of moods](https://raw.githubusercontent.com/dariusk/corpora/9bb62927951f79bec2454f29d71b6e9b28d874b1/data/humans/moods.json) from [Corpora Project](https://github.com/dariusk/corpora/) and save to the same directory as this notebook.

Then load the JSON file and grab just the list of words naming moods:

In [42]:
import json
mood_data = json.loads(open("./moods.json").read())
moods = mood_data['moods']

FileNotFoundError: [Errno 2] No such file or directory: './moods.json'

The easiest way to use this is to make one big string with the moods joined together with newlines:

In [53]:
moods_text = "\n".join(moods)

Then use `LinesByChar` to make the model:

In [56]:
moods_char_model = LinesByChar(moods_text, state_size=3)

And voila, new moods:

In [57]:
for i in range(24):
    print(moods_char_model.make_sentence())

cowarlessive
toughted
lyricapative
convictorted
embarratived
teasane
feistative
powerloaded
bothetic
rejuvenaccepted
teassioned
innovatired
chieved
alievoless
incomperational
greterpressived
imped
soreborn
lethant
jadequashamed
intemplifeless
deterrientalkatirical
enconfidead
strappallous


## chars
#chinese chars parsing and markov chain


In [103]:
def add_to_model(model, n, seq):
    # make a copy of seq and append None to the end
    seq = list(seq[:]) + [None]
    for i in range(len(seq)-n):
        # tuple because we're using it as a dict key!
        gram = tuple(seq[i:i+n])
        next_item = seq[i+n]            
        if gram not in model:
            model[gram] = []
        model[gram].append(next_item)

def markov_model(n, seq):
    model = {}
    add_to_model(model, n, seq)
    return model

In [104]:
import random
def gen_from_model(n, model, start=None, max_gen=100):
    if start is None:
        start = random.choice(list(model.keys()))
    output = list(start)
    for i in range(max_gen):
        start = tuple(output[-n:])
        next_item = random.choice(model[start])
        if next_item is None:
            break
        else:
            output.append(next_item)
    return output

In [105]:
def markov_model_from_sequences(n, sequences):
    model = {}
    for item in sequences:
        add_to_model(model, n, item)
    return model

In [106]:
def markov_generate_from_sequences(n, sequences, count, max_gen=100):
    starts = [item[:n] for item in sequences if len(item) >= n]
    model = markov_model_from_sequences(n, sequences)
    return [gen_from_model(n, model, random.choice(starts), max_gen)
           for i in range(count)]

In [145]:
frost_lines = [line.strip() for line in open("长干行.txt").readlines()]
for item in markov_generate_from_sequences(5, frost_lines, 19):
    text_b = ''.join(item)
    print(text_b)

忧心之矣，
室家之士，
攸彼南山，
曰归曰归，
亦余心之忧。
纤纤出素手。
常存抱柱信，
坐愁紅顏老。
采薇采薇，
羞顏未嘗開；
宁赏宁处。
靡室靡家，
纤纤出素手。
苔深不能掃，
盈盈楼上女，
豈上望夫臺？
猿聲天上哀。
苔深不能掃，
靡室靡家，


In [152]:
model_cls = markovify.Text if level == "character" else SentencesByChar
gen_b = model_cls(text_b, state_size=2)
generated_text = gen_b.make_sentence()
print(generated_text)

None


## generating poems for the original

In [10]:
class LinesByChar(markovify.NewlineText):
    def word_split(self, sentence):
        return list(sentence)
    def word_join(self, words):
        return "".join(words)

In [47]:
# Load the source text
with open("yao.txt") as f:
    text = f.read()

# Train the model

#model = markovify.Text(text, state_size=1)
model = LinesByChar(text)

# Generate five poems
for i in range(19):
    poem = model.make_sentence(tries=1000)
    print(poem)


折花門前遲行跡
同居長干行
今为荡子行不归
今为荡子行不归
同居長干行
同居長干行
折花門前遲行跡
折花門前遲行跡
同居長干行
折花門前遲行跡
今为荡子行不归
折花門前遲行跡
今为荡子行不归
今为荡子行不归
折花門前遲行跡
折花門前遲行跡
同居長干行
今为荡子行不归
今为荡子行不归


In [207]:
class MarkovifyInterface:
    def __init__(self):
        self.models = []
        self.texts = []
        self.text_cnt = 0
        self.combined_model: Union[None, Text] = None

    def add_text(self, texts, split_by="\n"):
        for t in texts.split(split_by):
            if len(t) > 0:
                self.texts.append(t)

    def gen_model(self):
        self.combined_model = markovify.combine(models=self.models)

    def convent_text_2_model(self):
        while len(self.texts) > 0:
            text = self.texts.pop()
            self.models.append(markovify.Text(text))

    def make_sentence(self, max_times_invocation=10):
        self.convent_text_2_model()
        self.gen_model()
        sentence = None
        index = 0
        while sentence is None and index < max_times_invocation:
            sentence = self.combined_model.make_sentence()
            index += 1
        return sentence

    def to_json(self):
        self.convent_text_2_model()
        self.gen_model()
        return self.combined_model.to_json()

    def load_json(self, json_str):
        self.combined_model = markovify.Text.from_json(json_str)

# unit test
inter = MarkovifyInterface()

#
inter.add_text("耀\n耀", split_by="\n")
inter.make_sentence()

#
json_str = inter.to_json()
inter.load_json(json_str)
assert json_str == inter.to_json()



In [28]:
!pip install pronouncing

Collecting pronouncing
  Downloading pronouncing-0.2.0.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25ldone
[?25hCollecting cmudict>=0.4.0
  Downloading cmudict-1.0.13-py3-none-any.whl (939 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m939.3/939.3 kB[0m [31m15.6 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hCollecting importlib-metadata<6.0.0,>=5.1.0
  Downloading importlib_metadata-5.2.0-py3-none-any.whl (21 kB)
Collecting importlib-resources<6.0.0,>=5.10.1
  Downloading importlib_resources-5.12.0-py3-none-any.whl (36 kB)
Building wheels for collected packages: pronouncing
  Building wheel for pronouncing (setup.py) ... [?25ldone
[?25h  Created wheel for pronouncing: filename=pronouncing-0.2.0-py2.py3-none-any.whl size=6236 sha256=9b1d5f0befabb60ad909482b3f02d276f7442134a63ed1777d97c0e6bc15af26
  Stored in directory: /Users/vanorazhang/Library/Caches/pip/wheels/ee/d4/c2/fb8c0e2009b75358874506ff2ce1ee79370b6ef5cf08922206
Successfully built pron

In [31]:
import markovify
import pronouncing

class RhymingText(markovify.Text):
    def word_rhyme(self, word):
        return pronouncing.rhymes(word)

# Load the source text
with open("Cathay.txt") as f:
    text = f.read()

# Train the model
model = RhymingText(text)

# Generate five poems with rhymes
for i in range(5):
    poem = None
    while not poem:
        # Generate a sentence that ends with a rhyming word
        sentence = model.make_sentence(tries=100)
        if sentence:
            words = sentence.split()
            rhyming_word = words[-1]
            rhyming_candidates = model.word_rhyme(rhyming_word)
            for candidate in rhyming_candidates:
                if candidate in model.chain.model:
                    poem = sentence + " " + candidate
                    break
    
    # Format the poem with multiple lines
    lines = poem.split()
    poem_lines = []
    current_line = ""
    for word in lines:
        if len(current_line + " " + word) > 20:  # change 20 to the desired line length
            poem_lines.append(current_line.strip())
            current_line = ""
        current_line += word + " "
    poem_lines.append(current_line.strip())
    
    # Print the formatted poem
    print("\n".join(poem_lines))
    print()  # add an extra newline between poems
print(poem_lines)

KeyboardInterrupt: 

In [208]:
import string

# Define the path of the input and output files
input_file_path = 'yao.txt'
output_file_path = '耀.txt'

# Define a list of Chinese punctuation marks to be removed
chinese_punctuation = '。，！？；：‘’“”【】（）《》'

# Open the input file for reading
with open(input_file_path, 'r', encoding='utf-8') as input_file:
    # Read the contents of the input file and remove line breaks
    text = input_file.read().replace('\n', '')
    # Remove all Chinese punctuation marks from the text
    text = ''.join([char for char in text if char not in chinese_punctuation])
    # Open the output file for writing
    with open(output_file_path, 'w', encoding='utf-8') as output_file:
        # Write the modified text to the output file
        output_file.write(text)


In [251]:
import markovify
import jieba
#from model_gen import MarkovifyInterface
from typing import Union
from markovify import Text
jieba.setLogLevel("WARNING")

def get_1():
    # Get raw text as string.
    with open("yao.txt") as f:
        text = f.read()
    text = " ".join(list(jieba.lcut(text)))
    return text


def get_2():
    # Get raw text as string.
    with open("yao.txt") as f:
        text = f.read()
    return text


inter = MarkovifyInterface()
inter.add_text(get_1(), split_by="\n")
inter.add_text(get_2(), split_by="\n")

# Print five randomly-generated sentences
for i in range(15):
    print(inter.make_sentence()) #.replace(" ", ""))


曰 归 曰 归 曰 归
曰 归 曰 归 曰 归
None
曰 归 曰 归 曰 归 曰 归 曰 归
None
曰 归 曰 归 曰 归
None
曰 归 曰 归 曰 归 曰 归
None
None
曰 归 曰 归 曰 归 曰 归
曰 归 曰 归 曰 归 曰 归
None
None
曰 归 曰 归 曰 归 曰 归


In [252]:
import markovify
import jieba
#from model_gen import MarkovifyInterface
from typing import Union
from markovify import Text
jieba.setLogLevel("WARNING")

def get_1():
    # Get raw text as string.
    with open("yao.txt") as f:
        text = f.read()
    text = " ".join(list(jieba.lcut(text)))
    return text
print(text)

采薇采薇采薇薇亦作止曰归曰归岁亦莫止靡室靡家玁狁之故不如娶女留傅自牧采薇采薇薇亦柔止曰归曰归心亦忧止忧心之矣亦余心之忧室家之士攸彼南山淑人君子宁赏宁处青青河畔草青青河畔草郁郁园中柳盈盈楼上女皎皎当窗牖娥娥红粉妆纤纤出素手昔为倡家女今为荡子妇荡子行不归空床难独守長干行李白妾髮初覆額折花門前劇郎騎竹馬來遶床弄青梅同居長干里兩小無嫌猜十四為君婦羞顏未嘗開低頭向暗壁千喚不一回十五始展眉願同塵與灰常存抱柱信豈上望夫臺十六君遠行瞿塘灩澦堆五月不可觸猿聲天上哀門前遲行跡一一生綠苔苔深不能掃落葉秋風早八月蝴蝶來雙飛西園草感此傷妾心坐愁紅顏老早晚下三巴預將書報家相迎不道遠直至長風沙


In [253]:
model = markovify.Text(text, state_size=2)

# Generate five poems
for i in range(10):
    poem = model.make_sentence(tries=10)
    print(poem)

None
None
None
None
None
None
None
None
None
None


In [32]:
model_cls = markovify.Text if level == "word" else SentencesByChar
gen_a = model_cls(text_a, state_size=order)
gen_b = model_cls(text_b, state_size=order)
gen_combo = markovify.combine([gen_a, gen_b], weights)
for i in range(output_n):
    out = gen_combo.make_short_sentence(length_limit, test_output=False)
    out = out.replace("\n", " ")
    print(out)
    print()

NameError: name 'level' is not defined

In [33]:
model_cls = markovify.Text if level == "word" else SentencesByChar
gen_a = model_cls(text_a, state_size=order)
gen_b = model_cls(text_b, state_size=order)
gen_combo = markovify.combine([gen_a, gen_b], weights)
with open("output.txt", "w", encoding="utf-8") as f:
    for i in range(output_n):
        out = gen_combo.make_short_sentence(length_limit, test_output=False)
        out = out.replace("\n", " ")
        f.write(out)
        f.write("\n")

NameError: name 'level' is not defined

In [234]:
import random
import jieba

# Read the three text files and store the paragraphs in separate lists
with open('output.txt', 'r') as f1, open('长干行.txt', 'r') as f2:
    paragraphs1 = f1.read().split('\n\n')
    paragraphs2 = f2.read().split('\n\n')
    



# Split the Chinese text into words
words2 = []
with open('长干行.txt', 'r') as f:
    for line in f:
        words = jieba.cut(line.strip(), cut_all=False)
        words2.extend(words)

# Split the English text into words
words1 = []
for paragraph in paragraphs1 + paragraphs2:
    words = paragraph.split()
    words1.extend(words)

# Combine the Chinese words and English words in a random order
mixed_words = words1 + words2
random.shuffle(mixed_words)
mixed_text = ' '.join(mixed_words)

intertwined = mixed_text
print(intertwined)



干里 忧 ， ， 李白 ， A.D. 河畔 ？ 願同塵與灰； has Letter blood-ravenous Are 。 He 女 曰 Also month. rock, a a bundles century 五月不可觸， 青青 the account wandering Called Toilet now across old ； ， A.D. 归 雙飛 phoenix ， clouds has October? 青青 折花門 ； 折花門前劇； and of Shin 之 不可 the the green 楼上 归 Rakuyo, to 采薇 river 心亦忧止。 Kiang, 哀 yellow beforehand, turmoil. 騎 。 堆 of 落葉秋風早。 itself 之 青梅 The of 采薇采薇， to 娥 this tumult 瞿塘灩澦堆； 早晚下三巴， be cloud 暗壁 竹馬來 余心 盈盈楼上女， body. village. 忧心 信 weather. 。 To over 猜 gone, 行不归 落葉秋風 玁 心 。 。 臺 cut There no 青青河畔草 you 薇 his autumn, still with August 長干行 纤纤出素手。 to 玁狁之故。 eyebrows like 。 前劇 薇 紅顏老 of 願同塵 The 当 thousand, City Flowers 空床 A 粉妆 淑人君子 遲行跡 彼 荡子 相迎 草 長 抱柱 the 床 on The cut 自牧 from 感此 室家之士， 采薇 ； high and 瞿塘灩 the 采薇 直至長風沙。 手 郎騎竹馬來， 。 Choan 早晚 Gen. left ， 窗牖 亦 a bridle. 园中 澦 I 妇 to 空床难独守。 豈上 played 今为荡子妇。 how straight 郁郁园中柳。 。 八月 矣 未嘗開 狁 門前 十六君遠行， 來 ， 青青河畔草， a 豈上望夫臺？ And 不能 rest, on rain. 红 ， 独守 the of 淑人君子， know 薇亦作止。 昔为 ， wind them thousand 室靡家 shall my ； 故 預將書 ， Till 难 ancient And let 纤纤 十六

## Further reading

* Hayes, Brian. “Computer recreations.” Scientific American, vol. 249, no. 5, 1983, pp. 18–31. JSTOR, http://www.jstor.org/stable/24969024. (Original column from Scientific American that described how Markov chain text generation works—very readable! I can send a PDF, hit me up.)
* [A Travesty Generator for Micros](https://elmcip.net/critical-writing/travesty-generator-micros) is a follow-up to Hayes' article that has some more theory and an actual Pascal listing (which is now mostly of only historical interest).
* [This notebook](https://github.com/aparrish/rwet/blob/master/ngrams-and-markov-chains.ipynb) shows how to implement a Markov chain generator from scratch in Python, if you're interested in such things!
* Lillian-Yvonne Bertram's [Travesty Generator](http://www.noemipress.org/catalog/poetry/travesty-generator/) is a striking example of Markov chains put to poetic use.