# Text Generation

## Introduction

Markov chains can be used for very basic text generation. Think of every word in a corpus as a state. We make the simplifying assumption that the next word depends only on the previous word, which is the defining assumption of a Markov chain.

Markov chains don't generate text as well as deep learning models, but they're a good (and fun!) start.

## Select Text to Imitate

In this notebook, we're specifically going to generate text in the style of Chris Rock, so as a first step, let's extract the text from his comedy routine.

In [7]:
# Read in the corpus, including punctuation!
import pandas as pd
data = pd.read_pickle('corpus.pkl')
data

Unnamed: 0,transcript
adel_karam,A NETFLIX COMEDY SPECIAL\nRecorded at the Casi...
amy_schumer,"Fuck, yeah! This is such a big night for you. ..."
beth_stelling,"Beth Stelling’s stand-up comedy special, “Girl..."
big_jay_oakerson,[crowd cheering] [heavy rock music] – Let’s ge...
chelsea_handler,Join me in welcoming the author of six number ...
chris_rock,[slow instrumental music playing] [funk drums ...
dave_chappelle,"“The Dreamer,” which was shot in Chappelle’s h..."
david_cross,David Cross: Making America Great Again! is a ...
dylan_moran,"Ladies and gentlemen, will you please welcome ..."
george_carlin,"In 1965 “The Indian Sergeant,” was emerging as..."


In [8]:
chris_rock = data.transcript.loc['chris_rock']
chris_rock[:200]

'[slow instrumental music playing] [funk drums playing] [indistinct chatter] [man] Let’s go! [hip-hop music playing] [audience cheering] [Chris Rock] She said, “$300, I’ll do anything you want.” I said'

## Build a Markov Chain Function

We are going to build a simple Markov chain function that creates a dictionary:
* The keys should be all of the words in the corpus
* The values should be a list of the words that follow the keys

In [9]:
from collections import defaultdict

def markov_chain(text):
    '''The input is a string of text and the output will be a dictionary with each word as
       a key and each value as the list of words that come after the key in the text.'''
    
    # Tokenize the text by word, though including punctuation
    words = text.split(' ')
    
    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)
    
    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a dictionary
    m_dict = dict(m_dict)
    return m_dict
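
To see the structure on a small input, here is a minimal sketch on a made-up sentence (the toy text is an assumption for illustration, not part of the corpus). It repeats the same logic as the function above so that it runs on its own, and it also shows that `split(' ')` produces empty-string tokens when the text contains consecutive spaces, whereas `split()` with no argument does not.

```python
from collections import defaultdict

def markov_chain(text):
    """Same logic as the function above: map each word to the list of words that follow it."""
    words = text.split(' ')
    m_dict = defaultdict(list)
    for current_word, next_word in zip(words[:-1], words[1:]):
        m_dict[current_word].append(next_word)
    return dict(m_dict)

# Toy text, made up for illustration
toy_chain = markov_chain('the cat sat on the mat')
print(toy_chain)
# {'the': ['cat', 'mat'], 'cat': ['sat'], 'sat': ['on'], 'on': ['the']}

# Consecutive spaces yield empty-string tokens with split(' ') but not with split()
print('a  b'.split(' '))  # ['a', '', 'b']
print('a  b'.split())     # ['a', 'b']
```

Note that `'mat'` has no entry of its own: the last word of a text only becomes a key if it also occurs earlier, so looking it up directly would raise a `KeyError`.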

In [11]:
chris_dict = markov_chain(chris_rock)
chris_dict

{'[slow': ['instrumental'],
 'instrumental': ['music'],
 'music': ['playing]', 'playing]', 'playing]', 'playing]'],
 'playing]': ['[funk',
  '[indistinct',
  '[audience',
  '[female',
  '[audience',
  'Your'],
 '[funk': ['drums'],
 'drums': ['playing]'],
 '[indistinct': ['chatter]'],
 'chatter]': ['[man]'],
 '[man]': ['Let’s', 'Whoo!'],
 'Let’s': ['go!', 'see', 'just'],
 'go!': ['[hip-hop'],
 '[hip-hop': ['music', 'music', 'music'],
 '[audience': ['cheering]',
  'laughing]',
  'cheering]',
  'cheering]',
  'continue',
  'cheers',
  'cheering]',
  'laughing]',
  'applauding]',
  'laughing',
  'laughing]',
  'cheer',
  'cheering]',
  'cheering]',
  'laughs]',
  'cheering',
  'laughing]',
  'laughing',
  'laughing]',
  'laughing',
  'laughing]',
  'laughing]',
  'cheering',
  'cheering]',
  'laughing',
  'cheering]',
  'applauding]',
  'cheering]',
  'laughs]',
  'laughs]',
  'cheering]',
  'cheering]',
  'cheering]',
  'cheering]',
  'cheering',
  'continue'],
 'cheering]': ['[Chris',
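
The repeated entries in each list carry the frequencies: a word that follows the key more often appears more times, so a uniform random draw from the list picks it proportionally more often. A minimal sketch of the transition probabilities implied by such a list, using a shortened, made-up successor list rather than the full output:

```python
from collections import Counter

# Hypothetical successor list, shortened from the kind of output shown above
successors = ['cheering]', 'laughing]', 'cheering]', 'cheering]', 'laughing]']

counts = Counter(successors)
total = sum(counts.values())

# Empirical transition probabilities: how often each word follows the key
probabilities = {word: n / total for word, n in counts.items()}
print(probabilities)
# {'cheering]': 0.6, 'laughing]': 0.4}
```

Keeping the duplicates in the list is therefore a simple way to store these probabilities without computing them explicitly.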
  

## Create a Text Generator

We're going to create a function that generates sentences. It will take two things as inputs:
* The dictionary you just created
* The number of words you want generated

Here are some examples of generated sentences:

>'Shape right turn– I also takes so that she’s got women all know that snail-trail.'

>'Optimum level of early retirement, and be sure all the following Tuesday… because it’s too.'

In [12]:
import random

def generate_sentence(chain, count=15):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the number of words you would like to see in your generated sentence.'''

    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Pick the next word from the current word's value list. Make it the current word. Repeat.
    for i in range(count-1):
        # The last word of the text may have no entry in the dictionary; stop early if so
        if word1 not in chain:
            break
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # End it with a period
    sentence += '.'
    return sentence

In [13]:
generate_sentence(chris_dict)

'Send you know? I’m like, The fuck out with Goofy. Fucking drums, please. But if.'
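
Because the output is random, it changes on every run. One way to check that a generator behaves as intended is to fix the seed and verify that every consecutive pair of words is a transition recorded in the chain. The sketch below does this on a made-up toy chain; `random_walk` and `toy_chain` are hypothetical names introduced only for this illustration.

```python
import random

# Toy chain, made up for illustration; every value also appears as a key,
# so the walk never reaches a word without successors
toy_chain = {
    'the': ['cat', 'dog'],
    'cat': ['sat', 'ran'],
    'dog': ['sat'],
    'sat': ['the'],
    'ran': ['the'],
}

def random_walk(chain, count, rng):
    """Return a list of `count` words produced by walking the chain."""
    word = rng.choice(list(chain))
    words = [word]
    for _ in range(count - 1):
        word = rng.choice(chain[word])
        words.append(word)
    return words

rng = random.Random(0)  # fixed seed so repeated runs give the same walk
walk = random_walk(toy_chain, 8, rng)
print(' '.join(walk))

# Every consecutive pair must be a transition recorded in the chain
assert all(b in toy_chain[a] for a, b in zip(walk, walk[1:]))
```

Passing a dedicated `random.Random` instance keeps the seed local to this check instead of changing the global random state.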

### Assignment:
1. Generate sentences for the other comedians as well.
2. Try making the generate_sentence function better. Maybe allow it to end with a random punctuation mark or end whenever it gets to a word that already ends with a punctuation mark.

#### Generate sentences for other comedians

In [15]:
import pandas as pd

# Assuming you have the data stored in a pandas DataFrame called `data`

# Extract the transcripts for Dave Chappelle and Kevin Hart
dave_chappelle = data.transcript.loc['dave_chappelle']
kevin_hart = data.transcript.loc['kevin_hart']

# Let's print the first 200 characters of each comedian's transcript

print("\nDave Chappelle's transcript (first 200 characters):")
print(dave_chappelle[:200])

print("\nKevin Hart's transcript (first 200 characters):")
print(kevin_hart[:200])



Dave Chappelle's transcript (first 200 characters):
“The Dreamer,” which was shot in Chappelle’s hometown of Washington, D.C., at the Lincoln Theatre, marks Chappelle’s seventh special with Netflix. Stan Lathan, who directed Chappelle’s six other speci

Kevin Hart's transcript (first 200 characters):
Streaming on Netflix from November 17, 2020   [Kenzo babbling] [Kevin] Yo. What’s up? I was looking all over the house for y’all. [Eniko] We’re just chilling. About to go downstairs, get some work don


In [16]:
# Create Markov chain dictionaries for Dave Chappelle and Kevin Hart
dave_dict = markov_chain(dave_chappelle)
kevin_dict = markov_chain(kevin_hart)

# Generate sentences for each comedian
dave_sentence = generate_sentence(dave_dict)
kevin_sentence = generate_sentence(kevin_dict)

print("Dave Chappelle's sentence:", dave_sentence)
print("Kevin Hart's sentence:", kevin_sentence)

Dave Chappelle's sentence: Round and pulled the music. Music started blasting from Bad Boy Records jumped back and.
Kevin Hart's sentence: Everybody, say it’s not gonna have been. Always have asked. Carlton from a higher level.


In [17]:
# Generate sentences for all comedians
for comedian in data.index:
    transcript = data.transcript.loc[comedian]
    chain = markov_chain(transcript)
    sentence = generate_sentence(chain)
    print(f"{comedian.capitalize()}'s sentence:", sentence)

Adel_karam's sentence: Tightly and walking next to Abo Dani, and the wall. What’s cool is that?” I.
Amy_schumer's sentence: And as I was a gymnast. I just paid me real rape-y all the road,.
Beth_stelling's sentence: Shoulder height. That’s a happy ending situation. You know if I don’t know better, I’m.
Big_jay_oakerson's sentence: “just gorgeous athletes.” That’s serious as it all costs. And then pull out, and he.
Chelsea_handler's sentence: Research and I need to get in, a doctor’s office and then they get empathy?.
Chris_rock's sentence: Diss. I love about abortion, you can Venmo me.” What, ’cause her a baby. ‘Cause.
Dave_chappelle's sentence: [exhales] of the trailer. “What do a homeless guy, this n*gga had a very powerful.
David_cross's sentence: Emotional level. That’s the solution was jerking off your facial hair? Fuck that. Huh? Wait,.
Dylan_moran's sentence: Cat, “ha ha ha ha ha!” It’s like over me, yeah. “Where do you must.
George_carlin's sentence: ’em the fort. However, Limp

#### Try making the generate_sentence function better

To improve the generate_sentence function, we can modify it to stop whenever it reaches a word that already ends with a punctuation mark, and to finish the sentence with a random punctuation mark. Here's the updated version of the function:

In [19]:
import random
import string

def generate_sentence(chain, count=15):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the number of words you would like to see in your generated sentence.'''
    
    # Capitalize the first word
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Pick the next word from the current word's value list. Make it the current word. Repeat.
    for i in range(count-1):
        # Stop if the current word ends with ASCII punctuation (endswith is safe for empty strings)
        # or has no recorded next words
        if word1.endswith(tuple(string.punctuation)) or word1 not in chain:
            break
        # Otherwise continue generating the sentence
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # Always finish with a random ASCII punctuation mark; if the sentence already
    # ends with one, append a different mark
    if sentence[-1] not in string.punctuation:
        sentence += random.choice(string.punctuation)
    else:
        sentence += random.choice(string.punctuation.replace(sentence[-1], ''))
    
    return sentence



'You’re in slavery like my car?” “I want likes.:'

In [20]:
# Generate sentences for all comedians
for comedian in data.index:
    transcript = data.transcript.loc[comedian]
    chain = markov_chain(transcript)
    sentence = generate_sentence(chain)
    print(f"{comedian.capitalize()}'s sentence:", sentence)

Adel_karam's sentence: Sell you from?&
Amy_schumer's sentence: Doesn’t matter what did it”.$
Beth_stelling's sentence: Relationships.>
Big_jay_oakerson's sentence: Take me come over to take dick touches the holidays and I ruined it!” Then,
Chelsea_handler's sentence: Har… I stopped watching the little bit less fuckin’ wrong.~
Chris_rock's sentence: Player to sneak back in horrible shape.~
Dave_chappelle's sentence: Carolina.+
David_cross's sentence: Qualifying what to do.+
Dylan_moran's sentence: Pool of you,+
George_carlin's sentence: Loft.>
Iliza_shlesinger's sentence: In."
Kevin_hart's sentence: 40?~
Kevin_james's sentence: Theme is the thing.(
Louis_c_k's sentence: Want,&
Matt_rife's sentence: Check,.
Pete_davidson's sentence: So the restraining order,'
Ricky_gervais's sentence: Exuberance:}
Sarah_cooper's sentence: And it’s safe.<
Tom_segura's sentence: Good?.
Trevor_noah's sentence: History.” You were talking about.#
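
The sentences above end in symbols such as `&` or `~` because `string.punctuation` contains all 32 ASCII punctuation characters, and the function appends one of them even when the last word already ends with punctuation. A possible variant is sketched below under stated assumptions: `generate_sentence_v2`, `SENTENCE_END`, and the toy chain are hypothetical names and data introduced for illustration, and the choice of which marks count as sentence endings is an assumption.

```python
import random

# Assumption: these marks count as sentence endings. The typographic closing
# quote and ellipsis are included because they appear in the transcripts.
SENTENCE_END = ('.', '!', '?', '.”', '!”', '?”', '…')

def generate_sentence_v2(chain, count=15, rng=random):
    """Hypothetical variant: stop at the first word that ends a sentence,
    otherwise stop after `count` words and append '.', '!' or '?'."""
    word = rng.choice(list(chain))
    words = [word]
    while len(words) < count and not word.endswith(SENTENCE_END):
        successors = chain.get(word)
        if not successors:  # dead end: the word has no recorded next words
            break
        word = rng.choice(successors)
        words.append(word)

    sentence = ' '.join(words)
    sentence = sentence[:1].upper() + sentence[1:]  # capitalize without lowercasing the rest
    if not sentence.endswith(SENTENCE_END):
        # Drop a trailing comma, semicolon or colon before adding the final mark
        sentence = sentence.rstrip(',;:') + rng.choice('.!?')
    return sentence

# Toy chain, made up for illustration
toy_chain = {'i': ['like', 'love'], 'like': ['jokes.', 'comedy,'],
             'love': ['jokes.'], 'comedy,': ['i']}
print(generate_sentence_v2(toy_chain))
```

With this variant a sentence that already ends in punctuation is left as it is, and only sentences cut off by the word limit receive an added mark.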
