# Text Generation

## Introduction

Markov chains can be used for very basic text generation. Think about every word in a corpus as a state. We can make a simple assumption that the next word is only dependent on the previous word - which is the basic assumption of a Markov chain.

Markov chains don't generate text as well as deep learning, but it's a good (and fun!) start.

## Select Text to Imitate

In this notebook, we're specifically going to generate text in the style of Ali Wong, so as a first step, let's extract the text from her comedy routine.

In [1]:
# Read in the corpus, including punctuation!
import pandas as pd

data = pd.read_pickle('corpus.pkl')
data

Unnamed: 0,transcript
david,"In this 2023 comedy special, David Nihill humo..."
gabriel,[man] Can you please state your name? Martin M...
george,George Carlin: I’m Glad I’m Dead (2024) is a c...
jon,"In an interview conducted by Jon Stewart, Geor..."
kate,"Whoa! Okay, yeah. Good. Okay, don’t embarrass ..."
kevin,Kevin James: Irregardless (2024) In Kevin Jame...
leanne,Leanne Morgan: I’m Every Woman (2023) In “I’m ...
lewis,"Lewis Black: Tragically, I Need You (2023) is ..."
louis,“Louis C.K.: At The Dolby” is Louis C.K.’s thi...
matt,"In his second hour-long comedy special, “Matth..."


In [3]:
# Extract only Ali Wong's text
ali_text = data.transcript.loc['jon']
ali_text[:200]

'In an interview conducted by Jon Stewart, George Carlin talks about various aspects of his life and career, reflecting upon his childhood, formative influences, and his perspectives on comedy and crea'

## Build a Markov Chain Function

We are going to build a simple Markov chain function that creates a dictionary:
* The keys should be all of the words in the corpus
* The values should be a list of the words that follow the keys

In [4]:
from collections import defaultdict

def markov_chain(text):
    '''The input is a string of text and the output will be a dictionary with each word as
       a key and each value as the list of words that come after the key in the text.'''
    
    # Tokenize the text by word, though including punctuation
    words = text.split(' ')
    
    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)
    
    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a dictionary
    m_dict = dict(m_dict)
    return m_dict

In [5]:
# Create the dictionary for Ali's routine, take a look at it
ali_dict = markov_chain(ali_text)
ali_dict

{'In': ['an', 'Las'],
 'an': ['interview',
  'innate',
  'advertising',
  'experimental',
  'environment',
  'uneducated',
  'obsessive',
  'advertising',
  'experimental',
  'actress,',
  'edge,',
  'entertainer',
  'artist',
  'obligation',
  'afterthought.',
  'also'],
 'interview': ['conducted', 'also', 'with'],
 'conducted': ['by'],
 'by': ['Jon', 'his', 'a', 'his', 'hand', 'the', 'giving', 'giving', 'the'],
 'Jon': ['Stewart,'],
 'Stewart,': ['George'],
 'George': ['Carlin', 'Carlin’s', 'on', 'Carlin', 'Carlin!'],
 'Carlin': ['talks', 'felt', 'highlights', 'acknowledges', 'should'],
 'talks': ['about'],
 'about': ['various', 'what', 'language,', 'business.'],
 'various': ['aspects'],
 'aspects': ['of'],
 'of': ['his',
  'performance',
  'eliciting',
  'his',
  'his',
  'not',
  'his',
  'pursuits',
  'drugs',
  'which',
  'the',
  'his',
  'adults',
  'respect,',
  'listening',
  'things,',
  'rhetoric.',
  'the',
  'his',
  'Shakespeare',
  'the',
  '1935,',
  'the',
  'controll

## Create a Text Generator

We're going to create a function that generates sentences. It will take two things as inputs:
* The dictionary you just created
* The number of words you want generated

Here are some examples of generated sentences:

>'Shape right turn– I also takes so that she’s got women all know that snail-trail.'

>'Optimum level of early retirement, and be sure all the following Tuesday… because it’s too.'

In [6]:
import random

def generate_sentence(chain, count=15):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the number of words you would like to see in your generated sentence.'''

    # Capitalize the first word
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Generate the second word from the value list. Set the new word as the first word. Repeat.
    for i in range(count-1):
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # End it with a period
    sentence += '.'
    return(sentence)

In [7]:
generate_sentence(ali_dict)

'See the finest thing that the audiences are concerned, and I never knew I can.'

## Additional Exercises

1. Try making the generate_sentence function better. Maybe allow it to end with a random punctuation mark or end whenever it gets to a word that already ends with a punctuation mark.

In [8]:
import random

def generate_sentence(chain, count=15):
  """
  Input a dictionary in the format of key = current word, value = list of next words
  along with the number of words you would like to see in your generated sentence.

  This improved version allows for ending with a random punctuation mark or stopping
  at a word with existing punctuation.
  """

  # Capitalize the first word
  word1 = random.choice(list(chain.keys()))
  sentence = word1.capitalize()

  # Generate subsequent words
  for i in range(count - 1):
    word2 = random.choice(chain[word1])

    # Check if word2 already ends with punctuation
    if word2[-1] in ".?!":
      sentence += ' ' + word2
      break  # Stop generating if punctuation is found
    else:
      word1 = word2
      sentence += ' ' + word2

  # Choose a random punctuation mark with some probability (adjust weights as needed)
  if random.random() < 0.7:  # 70% chance of adding punctuation
    punctuation = random.choice([".", "!", "?"])
    sentence += punctuation

  return sentence



Improvements:

Ending with punctuation:
We check if the chosen word word2 already ends with a punctuation mark (".?!")
If it does, the sentence is appended with the word and a break statement exits the loop, ensuring the sentence ends there.
Random punctuation:
We introduce a random chance (adjustable by modifying the value < 0.7) to append a random punctuation mark (".", "!", "?") after the sentence is generated.

In [9]:
generate_sentence(ali_dict, count=15)

'Talking. And another luck stroke, you know, there that you very wonderful thing.!'