# Text Generation

## Introduction

Markov chains can be used for very basic text generation. Treat every word in a corpus as a state. We then make the simplifying assumption that the next word depends only on the current word, which is the defining assumption of a (first-order) Markov chain.

Markov chains don't generate text as well as deep learning models, but they're a good (and fun!) start.
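
To make the assumption concrete, here is a minimal sketch that estimates next-word probabilities from a toy sentence. The sentence and the helper name `next_word_probs` are made up for illustration and are not part of this notebook's pipeline.

```python
from collections import Counter, defaultdict

def next_word_probs(text):
    """Estimate P(next word | current word) from adjacent word pairs."""
    words = text.split()
    counts = defaultdict(Counter)
    for current_word, next_word in zip(words[:-1], words[1:]):
        counts[current_word][next_word] += 1
    # Normalize each word's follower counts into probabilities
    return {
        word: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for word, followers in counts.items()
    }

# Toy corpus (hypothetical, for illustration only)
probs = next_word_probs("the cat sat on the mat and the cat ran")
print(probs["the"])
```

In this toy sentence, "the" is followed by "cat" twice and by "mat" once, so a chain sitting at "the" moves to "cat" two times out of three on average, no matter which words came before "the".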

## Select Text to Imitate

In this notebook, we're specifically going to generate text in the style of Ali Wong, so as a first step, let's extract the text from her comedy routine.

In [21]:
# Read in the corpus, including punctuation!
import pandas as pd

data = pd.read_pickle('corpus.pkl')
data

Unnamed: 0,transcript
louis,intro\nfade the music out. lets roll. hold the...
dave,this is dave. he tells dirty jokes for a livin...
ricky,hello. hello! how you doing? great. thank you....
bo,© scraps from the loft. all rights reserved.
bill,"[cheers and applause] all right, thank you! th..."
jim,[car horn honks] [audience cheering] [announce...
john,"armed with boyish charm and a sharp wit, the f..."
hasan,© scraps from the loft. all rights reserved.
ali,"ladies and gentlemen, please welcome to the st..."
anthony,"thank you. thank you. thank you, san francisco..."


In [50]:
# Extract only Ali Wong's text
ali_text = data.transcript.loc['ali']
ali_text[:200]

'ladies and gentlemen, please welcome to the stage: ali wong! hi. hello! welcome! thank you! thank you for coming. hello! hello. we are gonna have to get this shit over with, cause i have to pee in, li'

## Build a Markov Chain Function

We are going to build a simple Markov chain function that creates a dictionary:
* The keys should be all of the words in the corpus
* Each value should be the list of words that follow its key in the text, with repeats kept

In [23]:
from collections import defaultdict

def markov_chain(text):
    '''The input is a string of text and the output will be a dictionary with each word as
       a key and each value as the list of words that come after the key in the text.'''
    
    # Tokenize the text by word, though including punctuation
    words = text.split(' ')
    
    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)
    
    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a dictionary
    m_dict = dict(m_dict)
    return m_dict
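
One detail worth knowing about `text.split(' ')`: wherever the transcript has two spaces in a row, the split produces an empty-string token, and that empty string ends up in the dictionary like any other word. The sketch below compares the two splitting styles on a made-up string. Whether to switch to `text.split()` is a judgment call, since it also splits on newlines and tabs.

```python
text = "hello.  how are you"  # note the two spaces after "hello."

# Splitting on a single space keeps an empty string between consecutive spaces
tokens_single_space = text.split(' ')

# Splitting on any run of whitespace drops those empty tokens
tokens_whitespace = text.split()

print(tokens_single_space)
print(tokens_whitespace)
```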

In [24]:
# Create the dictionary for Ali's routine, take a look at it
ali_dict = markov_chain(ali_text)
ali_dict

{'ladies': ['and', 'who', 'and', 'who'],
 'and': ['gentlemen,',
  'foremost,',
  'then',
  'have',
  'theres',
  'resentment',
  'get',
  'get',
  'says,',
  'my',
  'she',
  'snatch',
  'running',
  'fighting',
  'yelling',
  'it',
  'everybody',
  'my',
  'she',
  'i',
  'im',
  'i',
  'i',
  'the',
  'i',
  'i',
  'has',
  'i',
  'we',
  'we–',
  'then',
  'i',
  'watched',
  'i',
  'have',
  'that',
  'you',
  'recycling,',
  'disturbing',
  'its',
  'all',
  'just',
  'then',
  'be',
  'half-vietnamese.',
  'we',
  'his',
  'i',
  'slide.',
  'your',
  'inflamed',
  'youre',
  'then',
  'i',
  'half-japanese',
  'im',
  'half-vietnamese.',
  'half-jungle',
  'playing',
  'rugby.',
  'on',
  'foremost,',
  'a',
  'the',
  'emotionally',
  'i',
  '',
  'so,',
  'neither',
  'i',
  'i–',
  'then',
  'its',
  'find',
  'start',
  'just',
  'caves',
  'gets',
  'is',
  'then',
  'look,',
  'very',
  'for',
  'i',
  'she',
  'rise',
  'her',
  'be',
  'eat',
  'watch',
  'then,',
  'be'

## Create a Text Generator

We're going to create a function that generates sentences. It will take two things as inputs:
* The dictionary you just created
* The number of words you want generated

Here are some examples of generated sentences:

>'Shape right turn– I also takes so that she’s got women all know that snail-trail.'

>'Optimum level of early retirement, and be sure all the following Tuesday… because it’s too.'

In [51]:
import random

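def generate_sentence(chain, count=15):
    '''Generate a sentence of up to `count` words by randomly walking the chain.'''

    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Pick the next word from the current word's list of followers, then repeat from that word
    for _ in range(count - 1):
        # Stop early if the current word has no followers (e.g. the last word of the text)
        if word1 not in chain:
            break
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # End it with a period
    sentence += '.'
    return sentence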
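
Because the generator draws from `random`, every run prints a different sentence. If you want a run you can repeat, one option is to pass in a seeded generator. The sketch below is an illustration on a toy chain. The helper `generate` and the chain itself are hypothetical and simplified, not the function defined above.

```python
import random

def generate(chain, count, rng):
    """Walk the chain for `count` words using the supplied random generator."""
    word = rng.choice(list(chain.keys()))
    words = [word]
    for _ in range(count - 1):
        word = rng.choice(chain[word])
        words.append(word)
    return ' '.join(words)

# Toy chain (hypothetical); every word has at least one follower
toy_chain = {
    'i': ['like', 'love'],
    'like': ['you', 'i'],
    'love': ['you'],
    'you': ['i', 'like'],
}

first = generate(toy_chain, 8, random.Random(42))
second = generate(toy_chain, 8, random.Random(42))
print(first == second)  # True: the same seed replays the same walk
```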

In [52]:
print(generate_sentence(ali_dict))



Again! i dont! i grew up with my mom, my god, im gonna stick your.


### Assignment:
1. Generate sentences for the other comedians as well.
2. Try to improve the generate_sentence function. For example, let it end with a random punctuation mark, or stop as soon as it reaches a word that already ends with a punctuation mark.

In [36]:
data

Unnamed: 0,transcript
louis,intro\nfade the music out. lets roll. hold the...
dave,this is dave. he tells dirty jokes for a livin...
ricky,hello. hello! how you doing? great. thank you....
bo,© scraps from the loft. all rights reserved.
bill,"[cheers and applause] all right, thank you! th..."
jim,[car horn honks] [audience cheering] [announce...
john,"armed with boyish charm and a sharp wit, the f..."
hasan,© scraps from the loft. all rights reserved.
ali,"ladies and gentlemen, please welcome to the st..."
anthony,"thank you. thank you. thank you, san francisco..."


In [40]:
dave_text = data.transcript.loc["dave"]
dic = markov_chain(dave_text)
print(generate_sentence(dic))

Gun for a lot of bubble gum and you were not be real, not complying?.


In [41]:
ricky_text = data.transcript.loc["ricky"]
dic = markov_chain(ricky_text)
print(generate_sentence(dic))

Years. and even real news. dont punch anywhere. they will. bob was that reeled me.


In [47]:
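import string
import random

def generate_sentence_punc(chain, count=20):
    '''Generate up to `count` words, stopping early at a word that ends with punctuation.'''

    # Any ASCII punctuation character counts as an ending mark
    puncs = string.punctuation

    # Pick a random starting word and capitalize it
    word = random.choice(list(chain.keys()))
    sentence = word.capitalize()

    for _ in range(count - 1):
        # Stop if the current word has no followers (e.g. the last word of the text)
        if word not in chain:
            break
        word = random.choice(chain[word])
        sentence += ' ' + word

        # Stop as soon as a word already ends with punctuation (skip empty tokens)
        if word and word[-1] in puncs:
            return sentence

    # Otherwise, end it with a random punctuation mark
    sentence += random.choice(puncs)
    return sentence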
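
Note that `string.punctuation` also contains commas, quotes, and brackets, so the function above can stop at a word like "x-ray,". A possible refinement, sketched here under the assumption that only `.`, `!`, and `?` should end a sentence, is shown below. The name `generate_until_sentence_end` and the toy chain are hypothetical.

```python
import random

SENTENCE_ENDERS = ('.', '!', '?')  # assumption: only these marks end a sentence

def generate_until_sentence_end(chain, max_words=20, rng=random):
    """Generate words until one ends a sentence or max_words is reached."""
    word = rng.choice(list(chain.keys()))
    words = [word.capitalize()]
    for _ in range(max_words - 1):
        if word.endswith(SENTENCE_ENDERS):
            break
        followers = chain.get(word)
        if not followers:  # no known follower, so stop here
            break
        word = rng.choice(followers)
        words.append(word)
    sentence = ' '.join(words)
    # Add a period only if the sentence does not already end with one of the marks
    if not sentence.endswith(SENTENCE_ENDERS):
        sentence += '.'
    return sentence

# Toy chain (hypothetical) with a comma that should not stop generation
toy_chain = {'well,': ['i'], 'i': ['agree.'], 'agree.': ['well,']}
print(generate_until_sentence_end(toy_chain))
```

The trade-off is that sentences can run longer before they hit an ending mark, so `max_words` still acts as a cap.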

In [48]:
for name, text in data.transcript.items():
    print(name)
    star_dict = markov_chain(text)
    print(generate_sentence_punc(star_dict))

louis
Does that.
dave
Cat could be like it.
ricky
Zoo, and gave her old is wrong with an x-ray,
bo
All rights reserved.
bill
Hugging your forehead.
jim
Actually know when a hand your bag yourself,
john
Biff, he would look upon your name,
hasan
The loft.
ali
Live. you know,
anthony
It, but he died.
mike
Class. i-ive given up to be like,
joe
Kick off into the people have that someone runs toward the church of guys will find them.
