
# Austen-bot
Alan Au
2018-05-07

This is a simple text generator that produces a "story" (I use the term loosely) based on _Pride and Prejudice_.

In the first section, I'm going to play around with a Markov chain generator at the word level. It uses a Python list to store the "first" word of each paragraph, and a dictionary (hash table) that maps every word in the text to a list of the words that follow it. Because I keep duplicates in those follow-up lists, I get frequency weighting for free. Yay!
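
To make the "frequency weighting for free" point concrete, here is a minimal sketch on a made-up sentence (not text from the novel). The name `followers` and the seeded `rng` are illustrative assumptions, not part of the notebook's code.

```python
import random
from collections import Counter

# Toy text (an assumption for illustration, not from the novel).
text = "the cat sat on the mat and the cat ran"
words = text.split()

followers = {}  # word -> list of every word seen right after it, duplicates kept
for current, following in zip(words, words[1:]):
    followers.setdefault(current, []).append(following)

print(followers["the"])  # 'cat' appears twice in the list, 'mat' once

# Drawing uniformly from the list therefore picks 'cat' about twice as often.
rng = random.Random(0)  # seeded only so the demo is repeatable
draws = Counter(rng.choice(followers["the"]) for _ in range(3000))
print(draws)
```

No explicit probability table is needed: the duplicates in the list are the weights.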

In [2]:
#!/usr/bin/python3
__author__ = 'Alan Au'
__date__   = '2018-05-07'

import random

## Data preparation

For my text data, I used the text of _Pride and Prejudice_ from Project Gutenberg. Note that this version is encoded in UTF-8 and uses different characters for opening and closing quotation marks.

It can be found here: http://www.gutenberg.org/files/1342/1342-0.txt

First I prepare the data as follows:
1. Delete the header and footer: everything before the first line starting with "Chapter" and everything from "End of the Project" onward.
2. Skip lines starting with multiple spaces, as they interfere with text prediction.
3. Replace trailing newlines with spaces, so each paragraph is grouped together in a single line.
4. Condense runs of blank lines into single-newline paragraph breaks.

In [10]:
inFile = open("1342-0.txt", 'r', encoding="utf8") # Here's the raw input file.
outFile = open("pride_and_prejudice.txt", 'w', encoding="utf8") # Here's the cleaned input file.
all_lines = inFile.readlines()

in_header = True # Check whether we're in the header or not.
prev = False # Was there text in the previous line? Use to avoid consecutive empty lines.

for line in all_lines:
    # Process the header and footer.
    if in_header:
        if line[0:7] == "Chapter": # Once we see "Chapter", we're past the header.
            in_header = False
        else: # We're still in the header, so skip to next line.
            continue
    elif line[0:18] == "End of the Project": # Here's the footer, so we're done.
        break
        
    # Get rid of lines starting with multiple spaces (mostly the asterisks).
    if line[0:2] == "  ":
        continue
    
    # If within a paragraph, group all of the sentences together.
    if line != '\n':
        line = line.replace('\n',' ')
        outFile.write(line)
        prev = True
    
    # Compress paragraph breaks into single newlines.
    elif prev:
        outFile.write('\n')
        prev = False

# Clean up our file handles.
inFile.close()        
outFile.close()
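
To see what these steps do without downloading the file, the same logic can be run on a few in-memory lines. This is only a sketch: `clean_lines` is a hypothetical helper that mirrors the loop above, and `sample` is a made-up stand-in for the Gutenberg text.

```python
def clean_lines(lines):
    """Apply the four cleaning steps to a list of raw lines and return one string."""
    out = []
    in_header = True   # still before the first "Chapter" line?
    prev = False       # did the previous kept line contain text?
    for line in lines:
        if in_header:
            if line.startswith("Chapter"):
                in_header = False
            else:
                continue
        elif line.startswith("End of the Project"):
            break
        if line.startswith("  "):      # indented lines (mostly asterisk dividers)
            continue
        if line != "\n":
            out.append(line.replace("\n", " "))  # join paragraph lines with spaces
            prev = True
        elif prev:
            out.append("\n")           # one newline per paragraph break
            prev = False
    return "".join(out)

# Made-up sample input (an assumption, not the real file contents).
sample = [
    "Title page\n", "\n",
    "Chapter 1\n", "\n", "\n",
    "It is a truth universally acknowledged,\n",
    "that a single man in possession\n", "\n",
    "      * * * * *\n", "\n",
    "End of the Project Gutenberg EBook\n",
    "License text\n",
]
cleaned = clean_lines(sample)
print(repr(cleaned))
```

Each paragraph ends with a trailing space before its newline. That is harmless here, because the later `split()` call discards surrounding whitespace.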

## Building a very simple text prediction model

Now we're going to build a Markov chain to do some very simplistic text prediction. First we're going to read the existing text into a Python dictionary. Then we'll use the dictionary to "predict" which words follow which other words.

There are a couple of interesting notes here:
* I keep track of "first" words that are used to start paragraphs. These are the starting points for my Markov chains.
* I keep track of potential "title" words. These are just words starting with the letter 'P'.
* I keep quotations and prose in separate dictionaries.
* I keep duplicates, so those word combinations will have a higher chance of appearing in my output.
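
The quotation/prose split can be sketched on a single sentence. The helper `build_tables` is hypothetical and simplified (it skips the "first" and "title" bookkeeping), but it follows the same rule: a word containing `“` switches quote mode on, and a word containing `”` switches it off before the word is stored.

```python
def build_tables(paragraph):
    """Return (prose, quote) follow-up tables for one paragraph."""
    prose, quote = {}, {}
    in_quote = False
    words = paragraph.split()
    for current, following in zip(words, words[1:]):
        if "“" in current:
            in_quote = True
        if "”" in current:
            in_quote = False   # the closing word itself is filed under prose
        table = quote if in_quote else prose
        table.setdefault(current, []).append(following)  # duplicates are kept
    return prose, quote

prose, quote = build_tables("“My dear Mr. Bennet,” said his lady")
print("quote:", quote)
print("prose:", prose)
```

Because the closing-quote word lands in the prose table, its follow-up words are the ones that come after a quotation ends, such as "said".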

In [14]:
pp_dict = {} #to hold follow-up words in general prose
pp_quote = {} #to hold follow-up words within quotations
pp_first = [] #to hold "first" words that can be used to start new paragraphs
pp_title = {} #to hold potential "title" words

inFile = open("pride_and_prejudice.txt", 'r', encoding="utf8") # Here's the training data.
pp = inFile.readlines()

# Load up the dictionaries.
for paragraph in pp:
    words = paragraph.strip().split() # Split the paragraph into a list of words.
    len_words = len(words)
    if len_words == 0: continue # Don't bother indexing empty paragraphs.
    in_quote = False
    
    # Go from the first to second-to-last word in the paragraph.
    for i in range(len_words-1):
        current = words[i]
        next = words[i+1]

        if i == 0: pp_first.append(current) # Store "first" words which can be used to start Markov chains.
        
        if current[0].upper() == 'P': # If the word starts with 'P' then keep track as a potential "title" word.
            title_word = current.upper()
            while title_word[-1] not in "ABCDEFGHIJKLMNOPQRSTUVWXYZ": # Get rid of punctuation at end of title words.
                title_word = title_word[:-1]
            if title_word not in pp_title:
                pp_title[title_word] = True
        
        # Map the current word to its next word(s).
        if '“' in current: in_quote = True
        if '”' in current: in_quote = False # In case of one-word quotes, this will revert to False.
        if in_quote:
            if current in pp_quote: pp_quote[current].append(next) # Store "quotation" words here.
            else: pp_quote[current] = [next]
        else:
            if current in pp_dict: pp_dict[current].append(next) # Store "prose" words here.
            else: pp_dict[current] = [next]
                
inFile.close()

## Generating text using a Markov chain

Now that we've read in the text and created a model for it, we can generate some text. The model does this by looking at the current word and then randomly picking from the list of all follow-up words seen after it in the source text. It isn't very sophisticated and has no sense of context or content; it only knows that each two-word combination has been seen before.
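
Stripped of chapters and quotation handling, the core of the generator is a short random walk. The sketch below uses a tiny hand-written table; `generate`, the `max_words` cap, and the seeded `rng` are assumptions added for illustration and do not appear in the notebook's own code.

```python
import random

def generate(followers, start, rng, max_words=20):
    """Walk the chain from `start` until a word has no recorded follow-up."""
    output = [start]
    current = start
    # max_words is a safety cap (an assumption) so a cyclic table cannot loop forever.
    while current in followers and len(output) < max_words:
        current = rng.choice(followers[current])  # uniform pick; duplicates act as weights
        output.append(current)
    return " ".join(output)

# Hand-written toy table (not derived from the novel).
followers = {"I": ["am", "was"], "am": ["here."], "was": ["there."]}
sentence = generate(followers, "I", random.Random(1))
print(sentence)
```

The walk stops as soon as it reaches a word with no entry in the table, which is also how the paragraphs below come to an end.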

In [15]:
outFile = open("pp_output.txt", 'w', encoding="utf8") # Here's the resulting file.

# Decide whether to output a full story or just a single paragraph.
fullstory = True # Set to False for a single paragraph.
chapters = random.randint(1,61) # Our story will have between 1 and 61 chapters.
if fullstory:
    title_words = random.sample(sorted(pp_title), 2) # random.sample needs a sequence; a dict view fails on Python 3.11+.
    title = title_words[0]+" AND "+title_words[1] # Generate a title for our story.
    outFile.write(title.upper()+'\n'+"by Austen-bot (https://github.com/AlanAu/Austen-bot)\n")
    outFile.write('\nChapter 1\n')
else:
    chapters = 1

chapter = 1 # We always start at Chapter 1.
sentences = False # Check to make sure we have at least 1 sentence in a chapter.
in_quote = False # Use this to decide whether to pull from the "quote" words or the "prose" words.
while chapter <= chapters:
    start = random.sample(pp_first,1) # Pick a random starting word for our next paragraph.
    if '“' in start[0]: in_quote = True
    if '”' in start[0]: in_quote = False
    output = [start[0]]
    
    # Check to see if we're trying to start a new chapter.
    if start[0] == "Chapter":
        if not sentences: continue # Make sure there's at least 1 sentence in the chapter.
        chapter += 1
        sentences = False
        if chapter > chapters: # Stop when we would have gone beyond the last chapter.
            break
        output.append(str(chapter))
    
    # Generate a new paragraph.
    else:
        sentences = True
        current = start[0]
        while True:
            # We're in a quote, so pull from pp_quote.
            if in_quote and current in pp_quote:
                next = random.sample(pp_quote[current],1)
            # Otherwise (not in a quote, or no quoted follow-up exists), pull from pp_dict.
            elif current in pp_dict:
                next = random.sample(pp_dict[current],1)
            else:
                if in_quote: # Close any unclosed quotes.
                    in_quote = False
                    output[-1] = output[-1]+'”'
                break
            output.append(next[0])
            current = next[0]
            if '“' in current: in_quote = True
            if '”' in current: in_quote = False
    outFile.write('\n'+' '.join(output)+'\n')

outFile.close()
print("Austen-bot has generated a "+str(chapters)+"-chapter story.")

Austen-bot has generated a 54-chapter story.
