# Text Generation

## Introduction

Markov chains can be used for very basic text generation. Think about every word in a corpus as a state. We can make a simple assumption that the next word is only dependent on the previous word - which is the basic assumption of a Markov chain.

Markov chains don't generate text as well as deep learning, but it's a good (and fun!) start.

## Select Text to Imitate

In this notebook, we're specifically going to generate text in the style of Ali Wong, so as a first step, let's extract the text from her comedy routine.

In [1]:
# Read in the corpus, including punctuation!
import pandas as pd

data = pd.read_pickle('03D_corpus.pkl')
data

Unnamed: 0,transcript,full_name
Death and other,Death and Other Details\nSeason 1 Episode 6\nE...,Taylor Tomlinson
Jacqueline,"In “Get on Your Knees,” Jacqueline Novak trans...",Kevin Bridges
Kevin,Kevin Bridges: The Overdue Catch-Up (2023) is ...,Jacqueline Novak
Taylor,"In her 2024 Netflix stand-up comedy special, “...",Death and other
The iron claw,The Iron Claw (2023)\nDirected by: Sean Durkin...,The iron Claw


In [3]:
# Extract only taylor's text
Taylor_text = data.transcript.loc['Taylor']
Taylor_text[:200]

'In her 2024 Netflix stand-up comedy special, “Have It All,” Taylor Tomlinson offers a sharp, witty exploration of modern life, relationships, and personal growth, all through her humorous and insightf'

## Build a Markov Chain Function

We are going to build a simple Markov chain function that creates a dictionary:
* The keys should be all of the words in the corpus
* The values should be a list of the words that follow the keys

In [4]:
from collections import defaultdict

def markov_chain(text):
    '''The input is a string of text and the output will be a dictionary with each word as
       a key and each value as the list of words that come after the key in the text.'''
    
    # Tokenize the text by word, though including punctuation
    words = text.split(' ')
    
    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)
    
    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a dictionary
    m_dict = dict(m_dict)
    return m_dict

In [6]:
# Create the dictionary for Ali's routine, take a look at it
Taylor_dict = markov_chain(Taylor_text)
Taylor_dict

{'In': ['her', 'therapy.'],
 'her': ['2024',
  'humorous',
  'journey',
  'own',
  '♪',
  '♪',
  '♪',
  '♪',
  '♪',
  '♪',
  'every',
  'for',
  'ex',
  'clients.',
  'partner',
  'attic',
  'on',
  'about',
  'engagement',
  'the'],
 '2024': ['Netflix', 'Scraps'],
 'Netflix': ['stand-up', 'special', 'and', 'special', 'money', 'special.”'],
 'stand-up': ['comedy', 'show', 'that'],
 'comedy': ['special,'],
 'special,': ['“Have'],
 '“Have': ['It', 'you', 'you', 'time'],
 'It': ['All,”',
  'was',
  'changed',
  'was',
  'was',
  'was…',
  'gets',
  'is',
  'has',
  'is',
  'confuses',
  'was',
  'totally',
  'doesn’t',
  'doesn’t',
  'is',
  'feels',
  'is',
  'doesn’t',
  'was',
  'has',
  'was',
  'was'],
 'All,”': ['Taylor'],
 'Taylor': ['Tomlinson', 'Tomlinson!'],
 'Tomlinson': ['offers', 'shares', 'crafts'],
 'offers': ['a'],
 'a': ['sharp,',
  'world',
  'robbery',
  'dating',
  'narrative',
  'subtle',
  'fucking',
  'week',
  'dating',
  'breakup,',
  'dating',
  'kid',
  'shallow

## Create a Text Generator

We're going to create a function that generates sentences. It will take two things as inputs:
* The dictionary you just created
* The number of words you want generated

Here are some examples of generated sentences:

>'Shape right turn– I also takes so that she’s got women all know that snail-trail.'

>'Optimum level of early retirement, and be sure all the following Tuesday… because it’s too.'

In [7]:
import random

def generate_sentence(chain, count=15):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the number of words you would like to see in your generated sentence.'''

    # Capitalize the first word
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Generate the second word from the value list. Set the new word as the first word. Repeat.
    for i in range(count-1):
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # End it with a period
    sentence += '.'
    return(sentence)

In [8]:
generate_sentence(Taylor_dict)

'Half ago, but I was being called a special tonight. Can’t finger these people in.'

### Assignment:
1. Generate sentence for other comedians also.
2. Try making the generate_sentence function better. Maybe allow it to end with a random punctuation mark or end whenever it gets to a word that already ends with a punctuation mark.

# KEVIN

In [9]:
k_text = data.transcript.loc['Kevin']
k_text[:200]

'Kevin Bridges: The Overdue Catch-Up (2023) is a comedy special that encapsulates Kevin Bridges’ return to the stage, set against the backdrop of the Cork Opera House. He opens with gratitude for the a'

In [10]:
k_dict = markov_chain(k_text)
k_dict

{'Kevin': ['Bridges:', 'Bridges’', 'Bridges!', 'for'],
 'Bridges:': ['The'],
 'The': ['Overdue',
  'guy’s',
  'proper',
  'whole',
  'real',
  'Spanish',
  'black',
  'transfer',
  'Celtic!',
  'north',
  'country.',
  'country.',
  'Orange',
  'family…',
  'family',
  'football',
  'game’s',
  'gaffer',
  'body',
  'fuck?',
  'gym!',
  'gym!',
  'guy',
  'youth,',
  'youth,',
  'gym',
  'guy’s',
  'doctor',
  'over-the-phone',
  'cream',
  'afternoon’s',
  'Rise',
  'Simpsons',
  'baby,',
  'Simpsons',
  'memories',
  'romantic',
  'guy',
  'Queen’s',
  'doors',
  'show',
  'youth',
  'way',
  'driver',
  'game',
  'game,',
  'only',
  'street',
  'warmth',
  'reporter’s',
  'youth,',
  'kid',
  'craic'],
 'Overdue': ['Catch-Up'],
 'Catch-Up': ['(2023)'],
 '(2023)': ['is'],
 'is': ['a',
  'an',
  'important',
  'my',
  'really…',
  'sending',
  'my',
  'back',
  'released,',
  'scary.',
  'being',
  'beating',
  'insanity.',
  'a',
  'a',
  'when',
  'a',
  'a',
  'the',
  'why',
  'i

In [11]:
generate_sentence(k_dict)

'Soothing bedtime story. There’s always an evening of men, cos you’re lighting a chuckle. We’ve.'

# Jacqueline

In [12]:
j_text = data.transcript.loc['Jacqueline']
j_text[:200]

'In “Get on Your Knees,” Jacqueline Novak transcends the typical stand-up comedy show by delivering a unique blend of personal anecdotes and intellectual exploration into the act of oral sex, transform'

In [13]:
j_dict = markov_chain(j_text)
j_dict

{'In': ['“Get', 'the', 'order', 'a', 'theory,', 'order'],
 '“Get': ['on', 'on'],
 'on': ['Your',
  'the',
  'gender',
  'Your',
  'the',
  'my',
  'that',
  'keeping',
  'through',
  'it.',
  'some',
  'stage',
  'who',
  'such',
  'the',
  'Monday',
  'the',
  'one',
  'it.',
  'my',
  'this',
  'the',
  'me.',
  'you',
  'the',
  'the',
  'the',
  'the',
  'it',
  'the',
  'this',
  'the',
  'the',
  'what',
  'this',
  'that',
  'the',
  'to',
  'the',
  'terms',
  'your',
  'the',
  'track',
  'track.',
  'in.',
  'and',
  'it,',
  'this',
  'this',
  'the',
  'the',
  'it.',
  'it.',
  'that',
  'a',
  'this',
  'this',
  'upstairs.',
  'the',
  'that',
  'it,',
  'this',
  'this',
  'the',
  'the',
  'the',
  'a',
  'this',
  'me.',
  'this',
  'Earth.',
  'my',
  'behalf',
  'this.”',
  'her.',
  'her,',
  'her,',
  'itself?',
  'behalf',
  'the',
  'my',
  'the',
  'your',
  'unfamiliar',
  'the',
  'that',
  'that',
  'few.',
  'the',
  'this',
  'that',
  'coming.',
  'his',


In [14]:
generate_sentence(j_dict)

'Prudent self-doubt in your way that now you’re in the man who, looking out, before.'

# the iron claw

In [15]:
T_text = data.transcript.loc['The iron claw']
T_text[:200]

'The Iron Claw (2023)\nDirected by: Sean Durkin\nWritten by: Sean Durkin\nStarring: Zac Efron, Jeremy Allen White, Harris Dickinson, Maura Tierney, Holt McCallany, Lily James\nDistributed by: A24 (United S'

In [16]:
T_dict = markov_chain(T_text)
T_dict

{'The': ['Iron',
  'story',
  'Iron',
  'only',
  'only',
  'Von',
  'brothers',
  'pain',
  'promoters',
  'World’s',
  '3,500',
  'belt',
  'champ',
  'Ref',
  'boos',
  'fans',
  'answer',
  'world',
  'hands',
  'hands',
  'only',
  'reigning',
  'Fabulous',
  'Von',
  'winners',
  'Von',
  'kind',
  'Lord',
  'champion',
  'dream',
  'Winner',
  'toughest,',
  'bottom',
  'numbers'],
 'Iron': ['Claw', 'Claw.', 'Claw.', 'Claw’', 'Claw,', 'Claw'],
 'Claw': ['(2023)\nDirected', 'to'],
 '(2023)\nDirected': ['by:'],
 'by:': ['Sean', 'Sean', 'A24'],
 'Sean': ['Durkin\nWritten', 'Durkin\nStarring:'],
 'Durkin\nWritten': ['by:'],
 'Durkin\nStarring:': ['Zac'],
 'Zac': ['Efron,'],
 'Efron,': ['Jeremy'],
 'Jeremy': ['Allen'],
 'Allen': ['White,'],
 'White,': ['Harris'],
 'Harris': ['Dickinson,'],
 'Dickinson,': ['Maura'],
 'Maura': ['Tierney,'],
 'Tierney,': ['Holt'],
 'Holt': ['McCallany,'],
 'McCallany,': ['Lily'],
 'Lily': ['James\nDistributed'],
 'James\nDistributed': ['by:'],
 'A24': [

In [17]:
generate_sentence(T_dict)

'Support. REPORTER 1: So hold from me. FRITZ: Hey, Mike, shut up! Get him down..'

# question 2

In [19]:
import random
import string

def generate_sentence1(chain, min_words=10, max_words=20):
    '''Generate a sentence using the Markov chain dictionary.'''
    
    # Capitalize the first word
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Keep generating words until we reach the minimum number of words
    while len(sentence.split()) < min_words:
        word2 = random.choice(chain.get(word1, ['']))
        word1 = word2
        if not word2:
            break
        sentence += ' ' + word2

    # Check if the last word ends with punctuation
    if sentence[-1] in string.punctuation:
        return sentence.strip()

    # Keep generating words until we reach the maximum number of words or encounter punctuation
    while len(sentence.split()) < max_words:
        word2 = random.choice(chain.get(word1, ['']))
        if not word2 or word2[-1] in string.punctuation:
            break
        word1 = word2
        sentence += ' ' + word2

    # End the sentence with a random punctuation mark
    sentence += random.choice(string.punctuation)
    return sentence.strip()


In [20]:
k_text = data.transcript.loc['Kevin']
k_text[:200]

'Kevin Bridges: The Overdue Catch-Up (2023) is a comedy special that encapsulates Kevin Bridges’ return to the stage, set against the backdrop of the Cork Opera House. He opens with gratitude for the a'

In [21]:
j_dict = markov_chain(j_text)
j_dict

{'In': ['“Get', 'the', 'order', 'a', 'theory,', 'order'],
 '“Get': ['on', 'on'],
 'on': ['Your',
  'the',
  'gender',
  'Your',
  'the',
  'my',
  'that',
  'keeping',
  'through',
  'it.',
  'some',
  'stage',
  'who',
  'such',
  'the',
  'Monday',
  'the',
  'one',
  'it.',
  'my',
  'this',
  'the',
  'me.',
  'you',
  'the',
  'the',
  'the',
  'the',
  'it',
  'the',
  'this',
  'the',
  'the',
  'what',
  'this',
  'that',
  'the',
  'to',
  'the',
  'terms',
  'your',
  'the',
  'track',
  'track.',
  'in.',
  'and',
  'it,',
  'this',
  'this',
  'the',
  'the',
  'it.',
  'it.',
  'that',
  'a',
  'this',
  'this',
  'upstairs.',
  'the',
  'that',
  'it,',
  'this',
  'this',
  'the',
  'the',
  'the',
  'a',
  'this',
  'me.',
  'this',
  'Earth.',
  'my',
  'behalf',
  'this.”',
  'her.',
  'her,',
  'her,',
  'itself?',
  'behalf',
  'the',
  'my',
  'the',
  'your',
  'unfamiliar',
  'the',
  'that',
  'that',
  'few.',
  'the',
  'this',
  'that',
  'coming.',
  'his',


In [23]:
generate_sentence1(j_dict)

"“choke on the stage and it’s done doing it. And before the way you wanna be'"