# Text Generation

## Introduction

Markov chains can be used for very basic text generation. Treat every word in a corpus as a state, and assume that the next word depends only on the current word. That is the defining assumption of a (first-order) Markov chain.
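
Written out, with $w_t$ standing for the word at position $t$ (notation introduced here only for illustration), the assumption says that the full history can be replaced by the most recent word:

$$P(w_t \mid w_1, \dots, w_{t-1}) = P(w_t \mid w_{t-1})$$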

Markov chains don't generate text as well as deep learning models, but they're a good (and fun!) place to start.

## Select Text to Imitate

In this notebook, we're going to generate text in the style of Jim Jefferies, so as a first step, let's extract the text from his comedy routine.

In [1]:
import pandas as pd

data = pd.read_pickle('corpus.pkl')
data

Unnamed: 0,transcript,full_name
ali,\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\...,Ali Wong
anthony,\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAntho...,Anthony Jeselnik
bill,\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nBILL ...,Bill Burr
bo,\n \n\n\n\n\n\n\n\n\n\n\n\n\n\nPage Not Found ...,Bo Burnham
carlin,\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nGeorg...,Dave Chappelle
dave,\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\...,Hasan Minhaj
gillis,\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\...,Jim Jefferies
hasan,\n \n\n\n\n\n\n\n\n\n\n\n\n\n\nPage Not Found ...,Joe Rogan
jim,\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJIM J...,John Mulaney
joe,\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\...,Louis C.K.


In [2]:
jim_text = data.transcript.loc['jim']
jim_text[:200]

'\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJIM JEFFERIES: BARE (2014) - Full Transcript - Scraps from the loft\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n \n\n\n\n\n\n\r\n\t\tSkip to content\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'

## Build a Markov Chain Function

We are going to build a simple Markov chain function that creates a dictionary:
* The keys should be all of the words in the corpus
* The values should be a list of the words that follow the keys
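
To make the target structure concrete, here is a minimal sketch on a made-up sentence. The sentence and the names `toy_text` and `toy_chain` are illustrative only and do not come from the corpus.

```python
from collections import defaultdict

toy_text = "the cat sat on the mat"  # made-up sample, not from the corpus
words = toy_text.split()

toy_chain = defaultdict(list)
# Pair every word with the word that immediately follows it
for current_word, next_word in zip(words, words[1:]):
    toy_chain[current_word].append(next_word)
toy_chain = dict(toy_chain)

print(toy_chain)
# {'the': ['cat', 'mat'], 'cat': ['sat'], 'sat': ['on'], 'on': ['the']}
```

Note that `'mat'` is not a key: the final word of a text has no follower, which matters later when walking the chain.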

In [3]:
from collections import defaultdict

def markov_chain(text):
    words = text.split(' ')
    m_dict = defaultdict(list)
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)
    m_dict = dict(m_dict)
    return m_dict

In [4]:

jim_dict = markov_chain(jim_text)
jim_dict


{'\n': ['\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJIM',
  '\n\n\n\n\n\n\r\n\t\tSkip'],
 '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJIM': ['JEFFERIES:'],
 'JEFFERIES:': ['BARE', 'BARE'],
 'BARE': ['(2014)', '(2014)'],
 '(2014)': ['-', '–'],
 '-': ['Full', 'Scraps'],
 'Full': ['Transcript', 'Transcript'],
 'Transcript': ['-', '\n\n\n\n\n\n\n\n\t\t\t\t\t\t\t\t\t\tApril'],
 'Scraps': ['from', 'from'],
 'from': ['the',
  'Australian',
  'out',
  'the',
  'high',
  'the',
  'the',
  'Norway',
  'having',
  'me.”',
  'me.',
  'time',
  'being',
  'wrong,',
  'blowing',
  'South',
  'a',
  'Cape',
  'everywhere.',
  'the',
  'the',
  'the',
  'his',
  'the'],
 'the': ['loft\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n',
  'mother',
  'stage',
  'end',
  'tour',
  'future,',
  'car,',
  'back',
  'radio',
  'model',
  'photos.',
  'center',
  'Miami',
  'mother',
  'center',
  'Miami',
  'four',
  'fuck',
  'mother',
  'corner',
  'end',
  'table',
  'rest',
  'front',
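
Several keys in this output contain long runs of newlines because `text.split(' ')` splits only on single spaces. One possible cleanup, sketched here as a hypothetical `markov_chain_clean` variant (not part of the original notebook), is to call `split()` with no argument so that any run of whitespace acts as a separator. The `sample` string is a shortened, made-up stand-in for the transcript header.

```python
from collections import defaultdict

def markov_chain_clean(text):
    """Hypothetical variant of markov_chain that splits on any whitespace."""
    # split() with no argument collapses runs of spaces, tabs, and newlines
    words = text.split()
    chain = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        chain[current_word].append(next_word)
    return dict(chain)

# Shortened stand-in for the raw transcript header
sample = "\n\n\nJIM JEFFERIES: BARE (2014)\n\n\t\tSkip to content\n"
clean_chain = markov_chain_clean(sample)
print(list(clean_chain))
# ['JIM', 'JEFFERIES:', 'BARE', '(2014)', 'Skip', 'to']
```

The trade-off is that line breaks are discarded entirely, so the chain no longer knows where one line of the routine ended and the next began.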
 

## Create a Text Generator

We're going to create a function that generates sentences. It will take two things as inputs:
* The dictionary you just created
* The number of words you want generated

Here are some examples of generated sentences:

>'Shape right turn– I also takes so that she’s got women all know that snail-trail.'

>'Optimum level of early retirement, and be sure all the following Tuesday… because it’s too.'

In [5]:
import random

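def generate_sentence(chain, count=15):
    '''Generate a sentence of `count` words from a Markov chain dictionary
       (key = current word, value = list of the words that follow it).'''

    # Start from a random word in the chain
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Repeatedly pick a random follower of the current word
    for i in range(count-1):
        # The last word of the corpus may have no followers; restart from a random key
        followers = chain.get(word1) or list(chain.keys())
        word2 = random.choice(followers)
        word1 = word2
        sentence += ' ' + word2

    # End the sentence with a period
    sentence += '.'
    return sentence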
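
One property of this approach is worth spelling out: because the value lists keep duplicates, `random.choice` picks a follower with probability proportional to how often it appeared after the current word. The sketch below estimates those probabilities for a made-up follower list; `followers` is an illustrative stand-in for one value of the chain dictionary, not real corpus data.

```python
from collections import Counter

# Made-up follower list in the same shape as a value in the chain dictionary
followers = ['the', 'Australian', 'the', 'out', 'the']

counts = Counter(followers)
total = len(followers)
# Relative frequency of each follower = chance that random.choice returns it
probabilities = {word: n / total for word, n in counts.items()}

print(probabilities)
# {'the': 0.6, 'Australian': 0.2, 'out': 0.2}
```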

## Assignment:
1. Generate sentences for other comedians as well.
2. Try improving the generate_sentence function. For example, let it end with a random punctuation mark, or stop as soon as it reaches a word that already ends with a punctuation mark.
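
For the first task, one possible starting point is to build one chain per comedian and keep them in a dictionary keyed by name. The sketch below is self-contained, so it repeats `markov_chain` and uses a hypothetical `transcripts` dictionary with placeholder text. With the real corpus, iterating over `data.transcript.items()` should yield the same (name, text) pairs.

```python
from collections import defaultdict

def markov_chain(text):
    # Same construction as in the notebook: word -> list of following words
    words = text.split(' ')
    m_dict = defaultdict(list)
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)
    return dict(m_dict)

# Placeholder text standing in for real transcripts (hypothetical names and content)
transcripts = {
    'comedian_a': 'I went to the store and I went home',
    'comedian_b': 'you know what you know',
}

# One Markov chain per comedian, keyed by the same name
chains = {name: markov_chain(text) for name, text in transcripts.items()}

print(chains['comedian_a']['I'])
# ['went', 'went']
```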

In [6]:
generate_sentence(jim_dict)

'Lot of us are you read gun on her in a bit of the theater,.'

In [7]:
import random
import string

def generate_sentence_with_punctuation(chain, max_length=50):
    '''Generate a sentence from the given Markov chain, stopping at the first word
       that ends with punctuation or after max_length words.'''
    word1 = random.choice(list(chain.keys()))
    sentence = [word1.capitalize()]

    while len(sentence) < max_length:
        word2 = random.choice(chain.get(word1, ['']))
        if not word2:
            break
        sentence.append(word2)
        if word2[-1] in string.punctuation:
            break
        word1 = word2

    # Add a random sentence-ending mark if the sentence doesn't already end with punctuation
    if not sentence[-1] or sentence[-1][-1] not in string.punctuation:
        sentence[-1] += random.choice('.!?')

    return ' '.join(sentence)




In [25]:
generate_sentence_with_punctuation(jim_dict)

'Stretch out that type of this his wife fat,'