# Text Generation

## Introduction

Markov chains can be used for very basic text generation. Think of every word in a corpus as a state. We make the simplifying assumption that the next word depends only on the word before it, which is the defining assumption of a (first-order) Markov chain.
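
In symbols, writing $w_t$ for the word at position $t$ (notation introduced here for illustration, not taken from the notebook), the assumption can be stated as:

$$
P(w_{t+1} \mid w_1, w_2, \dots, w_t) = P(w_{t+1} \mid w_t)
$$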

Markov chains don't generate text as well as deep learning models, but they're a good (and fun!) start.

## Select Text to Imitate

In this notebook, we're specifically going to generate text in the style of Ali Wong, so as a first step, let's extract the text from her comedy routine.

In [None]:
!wget https://github.com/IqmanS/NLP-Assignments/raw/main/data/dtm.pkl -q
!wget https://github.com/IqmanS/NLP-Assignments/raw/main/data/cv.pkl -q
!wget https://github.com/IqmanS/NLP-Assignments/raw/main/data/data_clean.pkl -q
!wget https://github.com/IqmanS/NLP-Assignments/raw/main/data/corpus.pkl -q

In [None]:
# Read in the corpus, including punctuation!
import pandas as pd

data = pd.read_pickle('corpus.pkl')
data

Unnamed: 0,transcript,full_name
ali,\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\...,Ali Wong
amy,\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\...,Amy Schumer
anthony,\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAntho...,Anthony Jeselnik
beth,\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\...,Beth Stelling
bill,\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nBILL ...,Bill Burr
burr,\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\...,Bill Burr
dave,\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\...,Dave Chappelle
dylan,\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDylan...,Dylan Moran
hasan,\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHasan...,Hasan Minhaj
jim,\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJIM J...,Jim Jefferies


In [None]:
# Strip newline characters from every transcript
data["transcript"] = data["transcript"].str.replace("\n", "", regex=False)

In [None]:
data

Unnamed: 0,transcript,full_name
ali,Ali Wong: Baby Cobra (2016) | Transcript - Sc...,Ali Wong
amy,Amy Schumer: Emergency Contact (2023) | Trans...,Amy Schumer
anthony,Anthony Jeselnik: Thoughts And Prayers (2015)...,Anthony Jeselnik
beth,Beth Stelling: Girl Daddy (2020) | Transcript...,Beth Stelling
bill,BILL BURR: I'M SORRY YOU FEEL THAT WAY (2014)...,Bill Burr
burr,Bill Burr: Paper Tiger (2019) - Transcript - ...,Bill Burr
dave,Dave Chappelle: The Age of Spin (2017) - Tran...,Dave Chappelle
dylan,Dylan Moran: Off The Hook (2015) - Transcript...,Dylan Moran
hasan,Hasan Minhaj at 2017 White House Corresponden...,Hasan Minhaj
jim,JIM JEFFERIES: BARE (2014) - Full Transcript ...,Jim Jefferies


In [None]:
# Extract only Ali Wong's text
ali_text = data.transcript.loc['ali']
ali_text[:200]

' Ali Wong: Baby Cobra (2016) | Transcript - Scraps from the loft   \r\t\tSkip to content MOVIESMOVIE REVIEWSMOVIE TRANSCRIPTSSTANLEY KUBRICKTV SERIESTV SHOW TRANSCRIPTSCOMEDYSTAND-UP COMEDY TRANSCRIPTSGE'

## Build a Markov Chain Function

We are going to build a simple Markov chain function that creates a dictionary:
* The keys should be all of the words in the corpus
* The values should be a list of the words that follow the keys
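
As a rough illustration of that structure, here is what such a dictionary looks like for a made-up sentence. The sentence and the names `toy_text`, `toy_words` and `toy_chain` are only for illustration and are not part of the corpus:

```python
from collections import defaultdict

# A made-up sentence, used only to illustrate the dictionary structure
toy_text = "the cat sat on the mat and the cat slept"
toy_words = toy_text.split(' ')

# Map each word to the list of words that directly follow it
toy_chain = defaultdict(list)
for current_word, next_word in zip(toy_words[:-1], toy_words[1:]):
    toy_chain[current_word].append(next_word)

toy_chain = dict(toy_chain)
print(toy_chain['the'])   # ['cat', 'mat', 'cat']
print(toy_chain['cat'])   # ['sat', 'slept']
```

Note that `slept`, the last word, never becomes a key because nothing follows it.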

In [None]:
from collections import defaultdict

def markov_chain(text):
    '''The input is a string of text and the output will be a dictionary with each word as
       a key and each value as the list of words that come after the key in the text.'''

    # Tokenize by splitting on single spaces, so punctuation stays attached to each word
    words = text.split(' ')

    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)

    # Pair each word with the word that follows it and append the follower to that word's list
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a regular dictionary
    m_dict = dict(m_dict)
    return m_dict

In [None]:
# Create the dictionary for Ali's routine, take a look at it
ali_dict = markov_chain(ali_text)
ali_dict

{'': ['Ali',
  '',
  '\r\t\tSkip',
  'MenuMOVIESMOVIE',
  'MenuMOVIESMOVIE',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  'Leave',
  'Instagram',
  'Access',
  'yes,'],
 'Ali': ['Wong:', 'Wong:', 'Wong!Hi.', 'Wong.'],
 'Wong:': ['Baby', 'Baby'],
 'Baby': ['Cobra', 'Cobra'],
 'Cobra': ['(2016)', '(2016)'],
 '(2016)': ['|', '|'],
 '|': ['Transcript',
  'Transcript',
  'Transcript\t\t\tKelsey',
  'Transcript\t\t\tDylan',
  'Transcript\t\t\tDusty',
  'Transcript\t\t\tJack'],
 'Transcript': ['-', '\t\t\t\t\t\t\t\t\t\tSeptember'],
 '-': ['Scraps'],
 'Scraps': ['from', 'from'],
 'from': ['the',
  'a',
  'the',
  'the',
  'the',
  'the',
  'Pier',
  'Harvard',
  'the',
  'the',
  'having',
  'having',
  'the',
  'this',
  'pedestrians',
  'their',
  'a',
  'holding',
  'bacteria',
  'Harvard',
  'his',
  'the'],
 'the': ['loft',
  'rocky',
  'stage:',
  'light',
  'chatter',
  'kind',
  'kind',
  'presence',
  'center',
  'hoarding',
  'HPV',
  'country',
  'third',
  'Communists.Th
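
The dictionary above contains an empty-string key and tokens such as `'\r\t\tSkip'`. This happens because `split(' ')` splits only on single spaces, so consecutive spaces produce empty strings and tabs stay attached to words. One possible variation, sketched here as an assumption rather than as part of the original notebook, is to call `split()` with no argument, which splits on any run of whitespace. The function name `markov_chain_ws` and the sample string are hypothetical:

```python
from collections import defaultdict

def markov_chain_ws(text):
    '''Hypothetical variant of markov_chain that splits on any run of whitespace.'''
    # split() with no argument drops empty strings and treats tabs and
    # carriage returns as separators
    words = text.split()

    m_dict = defaultdict(list)
    for current_word, next_word in zip(words[:-1], words[1:]):
        m_dict[current_word].append(next_word)
    return dict(m_dict)

# Made-up sample with a double space and tabs, loosely modelled on the scraped text
sample = " Ali Wong:  Baby Cobra \r\t\tSkip to content"
sample_chain = markov_chain_ws(sample)
print('' in sample_chain)      # False
print(sample_chain['Cobra'])   # ['Skip']
```

The outputs shown in this notebook were produced with `split(' ')`, so a chain built this way would generate different sentences.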

## Create a Text Generator

We're going to create a function that generates sentences. It will take two things as inputs:
* The dictionary you just created
* The number of words you want generated

Here are some examples of generated sentences:

>'Shape right turn– I also takes so that she’s got women all know that snail-trail.'

>'Optimum level of early retirement, and be sure all the following Tuesday… because it’s too.'

In [None]:
import random

def generate_sentence(chain, count=15):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the number of words you would like to see in your generated sentence.'''

    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Pick the next word from the current word's value list, make it the current word, and repeat
    for i in range(count-1):
        # Stop early if the current word has no recorded followers (e.g. the last word of the text)
        if word1 not in chain:
            break
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # End it with a period
    sentence += '.'
    return sentence

In [None]:
generate_sentence(ali_dict)

'C.e.o.s, they don’t know that seems to take its toll on J-date, more options now.”.'
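
Because the value lists keep duplicates, `random.choice` picks each follower in proportion to how often it appeared after the key, so no explicit probabilities are needed. A small sketch with a made-up follower list (the counts are illustrative, not taken from the transcript):

```python
import random
from collections import Counter

# Made-up follower list: 'the' appears 3 times, 'a' once
followers = ['the', 'a', 'the', 'the']

# Relative frequencies implied by the duplicates
counts = Counter(followers)
probs = {word: n / len(followers) for word, n in counts.items()}
print(probs)   # {'the': 0.75, 'a': 0.25}

# Sampling many times should roughly reproduce those proportions
random.seed(0)  # fixed seed only so repeated runs match
samples = Counter(random.choice(followers) for _ in range(10000))
print(samples['the'] / 10000)
```

The second printed value should land close to 0.75, which is the same weighting the generator applies at every step.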

### Assignment:
1. Generate sentences for the other comedians as well.
2. Try making the `generate_sentence` function better. Maybe allow it to end with a random punctuation mark, or end whenever it gets to a word that already ends with a punctuation mark.

In [None]:
# Comedian 1: Lewis
lewis_text = data.transcript.loc['lewis']
lewis_dict = markov_chain(lewis_text)
print("LEWIS:", generate_sentence(lewis_dict))

# Comedian 2: Hasan
hasan_text = data.transcript.loc['hasan']
hasan_dict = markov_chain(hasan_text)
print("HASAN:", generate_sentence(hasan_dict))

# Comedian 3: Kathleen
kathleen_text = data.transcript.loc['kathleen']
kathleen_dict = markov_chain(kathleen_text)
print("KATHLEEN:", generate_sentence(kathleen_dict))

LEWIS: Companionship. And goddammit I mean, for risking your desk.” [audience laughs] All of books. I’ve.
HASAN: Anything is a bunch of him, and finding out of Cards just so stressful, I’ve.
KATHLEEN: Neighborhood. And the… but she still definitely there. I would turn the only are afraid,.


In [None]:
def generate_sentence1(chain):
    '''Input a dictionary in the format of key = current word, value = list of next words.
       The sentence length is chosen at random (15 to 25 words) and the sentence ends
       with a random punctuation mark.'''

    # Pick a random sentence length
    count = random.randint(15, 25)

    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Pick the next word from the current word's value list, make it the current word, and repeat
    for i in range(count-1):
        # Stop early if the current word has no recorded followers
        if word1 not in chain:
            break
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # End it with a random punctuation mark
    punc = random.choice(list("!?."))
    sentence += punc
    return sentence

def generate_sentence2(chain):
    '''Input a dictionary in the format of key = current word, value = list of next words.
       The sentence has a randomly chosen maximum length (15 to 25 words) and ends early
       once it is at least eight words long and the latest word ends with '!', '?' or '.'.'''

    # Pick a random maximum sentence length
    count = random.randint(15, 25)

    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Pick the next word from the current word's value list, make it the current word, and repeat
    for i in range(count-1):
        # Stop early if the current word has no recorded followers
        if word1 not in chain:
            break
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2
        # Return as soon as a long-enough sentence already ends with punctuation
        if i>5 and sentence[-1] in "!?.":
            return sentence

    # Otherwise end it with a random punctuation mark
    punc = random.choice(list("!?."))
    sentence += punc
    return sentence

In [None]:
# Comedian 1: Lewis
print("LEWIS:",generate_sentence1(lewis_dict))
print("LEWIS:",generate_sentence2(lewis_dict))


# Comedian 2: Hasan
print("HASAN:",generate_sentence1(hasan_dict))
print("HASAN:",generate_sentence2(hasan_dict))

# Comedian 3: Kathleen
print("KATHLEEN:",generate_sentence1(kathleen_dict))
print("KATHLEEN:",generate_sentence2(kathleen_dict))


LEWIS: Fuck, then finally a bitch, they’re nuts, OK? I sat down there was spinning at the first?
LEWIS: Launching their pills, and I could have come out and we tripled down the whole thing they overpaid.
HASAN: Thank Jeff Mason and tell Melania? “Listen, babe, last year in this moment — I want to be a joke. Now,.
HASAN: Someone is president, it feels like …I have to be alternative fact.
KATHLEEN: 7:40 AM. I knew that has to have bought a different unemployed than if you guys. I will!
KATHLEEN: Bang, bang. No wine. That’s it. I got 37 bucks.
