# Question to be explored

# *What makes Ali Wong's comedy routine stand out?*

I decided to gather transcripts for only four comedians to compare.
- How was the scope limited? To stand-up comedy specials from the past 5 years with at least a 7.5 rating and 2,000+ votes on IMDb.

Source: https://scrapsfromtheloft.com/stand-up-comedy-scripts/

# First Step - Get and Clean the Data

In [31]:
import requests
from bs4 import BeautifulSoup as bs
import pickle

def url_to_transcript(url):
    """Return the text of every <p> element on the page at `url`."""
    page = requests.get(url).text
    soup = bs(page, "lxml")
    # inspecting the page (F12) shows that the transcript lives in <p> elements
    text = [p.text for p in soup.find_all("p")]
    print(url)
    return text

# select only 4 comedians
urls = ["https://scrapsfromtheloft.com/2018/05/15/ali-wong-hard-knock-wife-full-transcript/",
        "https://scrapsfromtheloft.com/2020/11/08/dave-chappelle-snl-monologue-2020-transcript/",
       "https://scrapsfromtheloft.com/2020/10/10/ronny-chieng-asian-comedian-destroys-america-transcript/",
       "https://scrapsfromtheloft.com/2020/05/10/russell-peters-deported-transcript/"]

# quick check that every page can be fetched (the returned text is discarded here)
for url in urls:
    url_to_transcript(url)

https://scrapsfromtheloft.com/2018/05/15/ali-wong-hard-knock-wife-full-transcript/
https://scrapsfromtheloft.com/2020/11/08/dave-chappelle-snl-monologue-2020-transcript/
https://scrapsfromtheloft.com/2020/10/10/ronny-chieng-asian-comedian-destroys-america-transcript/
https://scrapsfromtheloft.com/2020/05/10/russell-peters-deported-transcript/
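To see what the paragraph extraction returns without hitting the network, here is a minimal sketch on a made-up HTML snippet. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup, and `ParagraphCollector` and `sample_html` are hypothetical names of mine, so treat it as an illustration of the idea rather than what `url_to_transcript` does internally.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []    # one string per <p> element
        self._inside_p = False  # True while we are between <p> and </p>
        self._chunks = []       # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._inside_p:
            self._inside_p = False
            self.paragraphs.append("".join(self._chunks))

    def handle_data(self, data):
        # text nested in inline tags such as <em> is still part of the paragraph
        if self._inside_p:
            self._chunks.append(data)


# made-up page: only the two <p> elements should be collected
sample_html = """
<html><body>
  <h1>Title</h1>
  <p>Ladies and gentlemen!</p>
  <div><p>Thank you. <em>Thank you.</em></p></div>
</body></html>
"""

collector = ParagraphCollector()
collector.feed(sample_html)
paragraphs = collector.paragraphs
print(paragraphs)
```

The result is a list with one string per paragraph, which is the same shape the notebook stores for each comedian.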


In [32]:
# short names used as keys for each comedian (same order as urls)
comedians = ["Ali","Dave","Ronny","Russel"]

In [33]:
transcripts = [url_to_transcript(u) for u in urls]

https://scrapsfromtheloft.com/2018/05/15/ali-wong-hard-knock-wife-full-transcript/
https://scrapsfromtheloft.com/2020/11/08/dave-chappelle-snl-monologue-2020-transcript/
https://scrapsfromtheloft.com/2020/10/10/ronny-chieng-asian-comedian-destroys-america-transcript/
https://scrapsfromtheloft.com/2020/05/10/russell-peters-deported-transcript/


In [34]:
import os
# make a new directory to hold the pickled transcripts
os.makedirs("transcripts", exist_ok=True)
for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt","wb") as file:
        pickle.dump(transcripts[i],file)

In [35]:
# load the pickled transcripts back into a dict keyed by comedian
data = {}
for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt","rb") as file:
        data[c] = pickle.load(file)
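
A small round trip may help show what these two cells do. This is only a sketch with a made-up list and a temporary directory (the `toy.pkl` name is an assumption of mine). It also shows that the saved file is a binary byte stream rather than readable text, even though the files above use a `.txt` extension.

```python
import os
import pickle
import tempfile

# made-up stand-in for one scraped transcript (a list of paragraphs)
paragraphs = ["Ladies and gentlemen!", "Thank you."]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")

    # write the list as a pickle byte stream
    with open(path, "wb") as fh:
        pickle.dump(paragraphs, fh)

    # read it back into a new Python object
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

    # peek at the raw bytes to confirm the file is not plain text
    with open(path, "rb") as fh:
        raw = fh.read()

print(restored == paragraphs)  # the content survives the round trip
print(type(raw))
```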

In [36]:
# just check some data
# the key is the comedian and the values are the transcripts
data.keys()
data['Ali'][:2]

['Ladies and gentlemen, please welcome to the stage Ali Wong!',
 '♪ What y’all thought Y’all wasn’t gon’ see me? ♪\n♪ I’m the Osirus of this shit♪\n♪ Wu-Tang is here forever, motherfuckers♪\n♪ It’s like this ninety-seven ♪\n♪ Aight my n i g g a s and my n i g g arettes♪\n♪ Let’s do it like this♪\n♪ I’ma rub your ass in the moonshine♪\n♪ Let’s take it back to seventy-nine♪\n♪ I bomb atomically♪\n♪ Socrates’ philosophies and hypotheses♪\n♪ Can’t define How I be droppin’ these mockeries♪\n♪ Lyrically perform armed robbery ♪\n♪ Flee with the lottery Possibly they spotted me♪\n♪ Battle-scarred shogun♪\n♪ Explosion when my pen hits ♪']

In [37]:
# join all text for each key
def combine_text(list_of_text):
    combined_text = ' '.join(list_of_text)
    return combined_text

In [38]:
data_combined = {key:[combine_text(value)] for key,value in data.items()}
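
As a quick illustration of the joining step, here is the same pattern on a made-up dictionary (the toy lines are mine, not from the transcripts). Each value is wrapped in a one-element list; as far as I understand, that lets pandas build a one-row frame from the dict instead of complaining about scalar values.

```python
# made-up data: each comedian maps to a list of paragraphs
toy = {
    "Ali": ["Hello.", "Thank you."],
    "Dave": ["Good evening."],
}

# join the paragraphs with a space and wrap the result in a one-element list
toy_combined = {name: [" ".join(parts)] for name, parts in toy.items()}
print(toy_combined)
```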

In [39]:
# we can put the dict into a dataframe
import pandas as pd
pd.set_option('max_colwidth', 150)
df = pd.DataFrame(data_combined).transpose()
# rename the column to transcript
df.columns = ['transcript'] # corpus
# sorted by name
df = df.sort_index()
df

Unnamed: 0,transcript
Ali,"Ladies and gentlemen, please welcome to the stage Ali Wong! ♪ What y’all thought Y’all wasn’t gon’ see me? ♪\n♪ I’m the Osirus of this shit♪\n♪ Wu..."
Dave,"Original air date: November 07, 2020 [Announcer] Ladies and gentlemen — Dave Chappelle! [Cheers and applause] Thank you. Thank you. [Cheers an..."
Ronny,[“The Evening Primrose (Ye Lai Xiang)” by Li Xianglan plays] [woman singing in Chinese] [audience applauds and cheers] [female host] Ladies and ge...
Russel,"[TYPING] [CHEERING] NARRATOR: Ladies and gentlemen, it’s start time at the Dome NSCI SVP Stadium. And right about now, we’re going to bring you th..."


## Now we need to clean some text
- make lower case
- remove punctuation
- remove numeric values
- remove newline characters (`\n`)
- tokenize text
- remove stop words ...
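
The steps above can be sketched end to end on a toy sentence. This is a rough illustration, not the notebook's pipeline: `clean_and_tokenize`, the sample text and the tiny `STOP_WORDS` set are all assumptions of mine, and real stop word lists are much longer.

```python
import re
import string

STOP_WORDS = {"the", "and", "to", "a", "of"}  # tiny hypothetical list


def clean_and_tokenize(text):
    text = text.lower()                                             # make lower case
    text = re.sub(r"\[.*?\]", "", text)                             # drop [sound cues] first
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text) # remove ASCII punctuation
    text = re.sub(r"\w*\d\w*", "", text)                            # remove words containing digits
    text = text.replace("\n", " ")                                  # newline becomes a space
    tokens = text.split()                                           # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]               # remove stop words


sample = "[Cheers] Welcome to the stage, Ali Wong!\nIt's 2018 and the show starts."
tokens = clean_and_tokenize(sample)
print(tokens)
```

The bracket removal has to run before the punctuation removal; otherwise the brackets disappear and the cue text inside them stays in the transcript.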

In [40]:
# apply the cleaning techniques in a few rounds
import re
import string
####### Round 1
# make the text lowercase, remove text in square brackets, punctuation and words containing digits
def clean_text_round1(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]','',text) # delete all text inside [ ] (sound cues)
    text = re.sub('[%s]' % re.escape(string.punctuation),'',text) # delete all ASCII punctuation
    text = re.sub(r'\w*\d\w*','',text) # delete digits and any word containing a digit
    return text
round1 = clean_text_round1

In [41]:
df_data_clean = pd.DataFrame(df.transcript.apply(round1))
df_data_clean

Unnamed: 0,transcript
Ali,ladies and gentlemen please welcome to the stage ali wong ♪ what y’all thought y’all wasn’t gon’ see me ♪\n♪ i’m the osirus of this shit♪\n♪ wutan...
Dave,original air date november ladies and gentlemen — dave chappelle thank you thank you thank you all for being here pretty incredible day ...
Ronny,ladies and gentlemen ronny chieng thank you thank you thank you thank you okay thank you we gotta get going we gotta get going guys thank y...
Russel,narrator ladies and gentlemen it’s start time at the dome nsci svp stadium and right about now we’re going to bring you the brother that gave yo...
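
The output above still contains characters such as ’ and ♪. A small sketch on a made-up string suggests why: `string.punctuation` only covers ASCII punctuation, so typographic characters pass through round 1 untouched. The sample strings are mine.

```python
import re
import string

# same character class as in round 1
punct_pattern = "[%s]" % re.escape(string.punctuation)

sample = "y’all don't stop ♪ now — ok!"
cleaned = re.sub(punct_pattern, "", sample)
print(cleaned)  # the straight ' and ! are gone; ’, ♪ and the dash remain

# the digit pattern removes whole words that contain a digit
no_digits = re.sub(r"\w*\d\w*", "", "the 7th of november 2020")
print(repr(no_digits))
```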


In [42]:
####### Round 2
# get rid of some remaining punctuation and non-sensical text
def clean_text_round2(text):
    text = re.sub('[".—]','',text) # delete straight double quotes, periods and em dashes (curly apostrophes are kept)
    text = re.sub('\n','',text) # delete newline characters
    text = re.sub('♪','',text) # delete music notes
    return text
round2 = clean_text_round2

In [43]:
df_data_clean = pd.DataFrame(df_data_clean.transcript.apply(round2))
df_data_clean

Unnamed: 0,transcript
Ali,ladies and gentlemen please welcome to the stage ali wong what y’all thought y’all wasn’t gon’ see me i’m the osirus of this shit wutang is here...
Dave,original air date november ladies and gentlemen dave chappelle thank you thank you thank you all for being here pretty incredible day ...
Ronny,ladies and gentlemen ronny chieng thank you thank you thank you thank you okay thank you we gotta get going we gotta get going guys thank y...
Russel,narrator ladies and gentlemen it’s start time at the dome nsci svp stadium and right about now we’re going to bring you the brother that gave yo...
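
To check what round 2 actually strips, here is a sketch on a made-up string. The character class is written as `[".—]`, which I believe matches the same characters as the pattern in the cell above (straight double quote, period, em dash); `toy_round2` and the sample text are hypothetical.

```python
import re


def toy_round2(text):
    text = re.sub('[".—]', "", text)  # straight double quotes, periods, em dashes
    text = re.sub("\n", "", text)     # newline characters
    text = re.sub("♪", "", text)      # music notes
    return text


sample = 'he said "wow"... y’all — ok\n♪ la ♪'
cleaned = toy_round2(sample)
print(repr(cleaned))  # the curly apostrophe in y’all survives

# replacing a newline with an empty string can glue two words together
glued = toy_round2("first line\nsecond line")
print(glued)
```

If glued words turn out to matter for the word counts, replacing the newline with a single space instead of an empty string would be one way to avoid them.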


In [44]:
# add a new column with the full name
full_name = ['Ali Wong','Dave Chappelle','Ronny Chieng', 'Russell Peters']
df_data_clean['fullname'] = full_name
df_data_clean

Unnamed: 0,transcript,fullname
Ali,ladies and gentlemen please welcome to the stage ali wong what y’all thought y’all wasn’t gon’ see me i’m the osirus of this shit wutang is here...,Ali Wong
Dave,original air date november ladies and gentlemen dave chappelle thank you thank you thank you all for being here pretty incredible day ...,Dave Chappelle
Ronny,ladies and gentlemen ronny chieng thank you thank you thank you thank you okay thank you we gotta get going we gotta get going guys thank y...,Ronny Chieng
Russel,narrator ladies and gentlemen it’s start time at the dome nsci svp stadium and right about now we’re going to bring you the brother that gave yo...,Russell Peters


In [45]:
# "Pickling" is the process whereby a Python object hierarchy is converted into a byte stream,
# and "unpickling" is the inverse operation, whereby a byte stream (from a binary file or bytes-like object)
# is converted back into an object hierarchy.
import os
os.makedirs("files", exist_ok=True) # make sure the output directory exists
df.to_pickle("files/corpus.pkl") # original, uncleaned transcripts
df

Unnamed: 0,transcript
Ali,"Ladies and gentlemen, please welcome to the stage Ali Wong! ♪ What y’all thought Y’all wasn’t gon’ see me? ♪\n♪ I’m the Osirus of this shit♪\n♪ Wu..."
Dave,"Original air date: November 07, 2020 [Announcer] Ladies and gentlemen — Dave Chappelle! [Cheers and applause] Thank you. Thank you. [Cheers an..."
Ronny,[“The Evening Primrose (Ye Lai Xiang)” by Li Xianglan plays] [woman singing in Chinese] [audience applauds and cheers] [female host] Ladies and ge...
Russel,"[TYPING] [CHEERING] NARRATOR: Ladies and gentlemen, it’s start time at the Dome NSCI SVP Stadium. And right about now, we’re going to bring you th..."


In [46]:
# now we are going to create a document-term matrix using CountVectorizer, excluding stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english') # use the built-in English stop word list
data_cv = cv.fit_transform(df_data_clean.transcript) # learn the vocabulary and count words per transcript
# get_feature_names_out() needs scikit-learn >= 1.0; older versions used get_feature_names()
data_dtm = pd.DataFrame(data_cv.toarray(),columns=cv.get_feature_names_out()) # one row per transcript, one column per word
data_dtm.index = df_data_clean.index # reuse the comedian names as the index
data_dtm

Unnamed: 0,able,absolutely,absorb,abuela,abundance,abuse,accent,accept,acceptable,access,...,york,young,younger,zero,zhong,zipped,zodiac,zone,zones,zoom
Ali,0,0,1,1,0,0,0,1,1,0,...,0,3,1,2,0,0,0,1,0,0
Dave,0,0,0,0,0,0,2,0,0,0,...,2,2,0,0,0,0,0,0,0,2
Ronny,0,0,0,0,3,0,0,1,0,0,...,22,0,0,1,1,1,0,1,1,0
Russel,4,1,0,0,0,1,0,1,0,1,...,0,1,2,0,0,0,1,0,0,0
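
To make the shape of a document-term matrix concrete, here is a pure-Python sketch that counts words with `collections.Counter` instead of `CountVectorizer`. The two toy documents and the tiny `STOP_WORDS` set are made up, and the token pattern (words of two or more characters) is my understanding of the vectorizer's default, so the real matrix may differ in details.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "and", "you", "is"}  # tiny hypothetical list, not scikit-learn's

# made-up cleaned transcripts
docs = {
    "Ali": "thank you thank you the baby is here",
    "Dave": "thank you new york new york",
}


def count_terms(text):
    # keep tokens of two or more word characters, then drop stop words
    tokens = re.findall(r"\b\w\w+\b", text)
    return Counter(t for t in tokens if t not in STOP_WORDS)


counts = {name: count_terms(text) for name, text in docs.items()}

# the columns are the sorted vocabulary across all documents
vocabulary = sorted(set().union(*counts.values()))

# one row per document, one count per vocabulary term (0 when absent)
matrix = {name: [c[term] for term in vocabulary] for name, c in counts.items()}

print(vocabulary)
for name, row in matrix.items():
    print(name, row)
```

Each row corresponds to one comedian and each column to one word, which mirrors the layout of `data_dtm` above.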


In [47]:
# Let's pickle it for later use
data_dtm.to_pickle("files/dtm.pkl")

In [48]:
# let's pickle the cleaned data and the fitted vectorizer as well
df_data_clean.to_pickle("files/data_clean.pkl")
with open("files/cv.pkl","wb") as file:
    pickle.dump(cv,file)

# Now the second step ...