Problem Statement - Characterizing various US stand-up comedians on their stand-up content.

## 1. Get the data

In [51]:
import requests
from bs4 import BeautifulSoup
import pickle

In [52]:
#Scraping transcripts data from scrapsfromtheloft.com

def url_to_transcript(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'lxml')
    text = [p.text for p in soup.find(class_ = "post-content").find_all('p')]
    print(url)
    return text

In [53]:
# URLs of transcripts in scope

urls = ['http://scrapsfromtheloft.com/2017/05/06/louis-ck-oh-my-god-full-transcript/',
        'http://scrapsfromtheloft.com/2017/04/11/dave-chappelle-age-spin-2017-full-transcript/',
        'http://scrapsfromtheloft.com/2018/03/15/ricky-gervais-humanity-transcript/',
        'http://scrapsfromtheloft.com/2017/08/07/bo-burnham-2013-full-transcript/',
        'http://scrapsfromtheloft.com/2017/05/24/bill-burr-im-sorry-feel-way-2014-full-transcript/',
        'http://scrapsfromtheloft.com/2017/04/21/jim-jefferies-bare-2014-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/02/john-mulaney-comeback-kid-2015-full-transcript/',
        'http://scrapsfromtheloft.com/2017/10/21/hasan-minhaj-homecoming-king-2017-full-transcript/',
        'http://scrapsfromtheloft.com/2017/09/19/ali-wong-baby-cobra-2016-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/03/anthony-jeselnik-thoughts-prayers-2015-full-transcript/',
        'http://scrapsfromtheloft.com/2018/03/03/mike-birbiglia-my-girlfriends-boyfriend-2013-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/19/joe-rogan-triggered-2016-full-transcript/']

# Comedian names
comedians = ['louis', 'dave', 'ricky', 'bo', 'bill', 'jim', 'john', 'hasan', 'ali', 'anthony', 'mike', 'joe']

In [54]:
# Request the transcripts 
transcripts = [url_to_transcript(u) for u in urls]

http://scrapsfromtheloft.com/2017/05/06/louis-ck-oh-my-god-full-transcript/
http://scrapsfromtheloft.com/2017/04/11/dave-chappelle-age-spin-2017-full-transcript/
http://scrapsfromtheloft.com/2018/03/15/ricky-gervais-humanity-transcript/
http://scrapsfromtheloft.com/2017/08/07/bo-burnham-2013-full-transcript/
http://scrapsfromtheloft.com/2017/05/24/bill-burr-im-sorry-feel-way-2014-full-transcript/
http://scrapsfromtheloft.com/2017/04/21/jim-jefferies-bare-2014-full-transcript/
http://scrapsfromtheloft.com/2017/08/02/john-mulaney-comeback-kid-2015-full-transcript/
http://scrapsfromtheloft.com/2017/10/21/hasan-minhaj-homecoming-king-2017-full-transcript/
http://scrapsfromtheloft.com/2017/09/19/ali-wong-baby-cobra-2016-full-transcript/
http://scrapsfromtheloft.com/2017/08/03/anthony-jeselnik-thoughts-prayers-2015-full-transcript/
http://scrapsfromtheloft.com/2018/03/03/mike-birbiglia-my-girlfriends-boyfriend-2013-full-transcript/
http://scrapsfromtheloft.com/2017/08/19/joe-rogan-triggered-

In [55]:
# Pickle the files for using later

!mkdir transcripts

A subdirectory or file transcripts already exists.


In [56]:
for i,c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)

In [57]:
# Loading the pickled file
data = {}
for i,c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)

In [58]:
# Checking the dictionary to see if data is loaded properly

data.keys()

dict_keys(['louis', 'dave', 'ricky', 'bo', 'bill', 'jim', 'john', 'hasan', 'ali', 'anthony', 'mike', 'joe'])

## 2. Cleaning the data

This step involves
1. Making all the text lowercase
2. Removing punctuation
3. Removing numerical values
4. Tokenizing text
5. Removing stop words

In [59]:
# Converting the text into string for easier and simpler analysis

def combine_text(text):
    combined_text = ' '.join(text)
    return combined_text


#Combining all the chunks of text for each of the keys (comedians)

data_combined = {key : [combine_text(value)] for (key, value) in data.items()}

In [60]:
# Putting the data into a pandas dataframe

import pandas as pd

pd.set_option('max_colwidth', 150)

df = pd.DataFrame.from_dict(data_combined).transpose()
df.columns = ['transcripts']
df = df.sort_index()
df

Unnamed: 0,transcripts
ali,"Ladies and gentlemen, please welcome to the stage: Ali Wong! Hi. Hello! Welcome! Thank you! Thank you for coming. Hello! Hello. We are gonna have ..."
anthony,"Thank you. Thank you. Thank you, San Francisco. Thank you so much. So good to be here. People were surprised when I told ’em I was gonna tape my s..."
bill,"[cheers and applause] All right, thank you! Thank you very much! Thank you. Thank you. Thank you. How are you? What’s going on? Thank you. It’s a ..."
bo,Bo What? Old MacDonald had a farm E I E I O And on that farm he had a pig E I E I O Here a snort There a Old MacDonald had a farm E I E I O [Appla...
dave,"This is Dave. He tells dirty jokes for a living. That stare is where most of his hard work happens. It signifies a profound train of thought, the ..."
hasan,"[theme music: orchestral hip-hop] [crowd roars] What’s up? Davis, what’s up? I’m home. I had to bring it back here. Netflix said, “Where do you wa..."
jim,"[Car horn honks] [Audience cheering] [Announcer] Ladies and gentlemen, please welcome to the stage Mr. Jim Jefferies! [Upbeat music playing] Hello..."
joe,"[rock music playing] [audience cheering] [announcer] Ladies and gentlemen, welcome Joe Rogan. [audience cheering and applauding] What the fuck is ..."
john,"All right, Petunia. Wish me luck out there. You will die on August 7th, 2037. That’s pretty good. All right. Hello. Hello, Chicago. Nice to see yo..."
louis,Intro\nFade the music out. Let’s roll. Hold there. Lights. Do the lights. Thank you. Thank you very much. I appreciate that. I don’t necessarily a...


In [61]:
# Data cleaning step 01

import re
import string

def clean_text_round1(text):
    text = text.lower() #Converting the text to lowercase
    text = re.sub('\[.*?\]','', text) #Getting rid of data in the bracket
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text) #Get rid of anything in punctuation marks
    text = re.sub('\w*\d\w*', '', text) #Get rid of alpha-numeric characters
    return text
round1 = lambda x : clean_text_round1(x)

In [62]:
# Updating the dataset's transcripts with cleaned text

data_clean = pd.DataFrame(df.transcripts.apply(round1))
data_clean

Unnamed: 0,transcripts
ali,ladies and gentlemen please welcome to the stage ali wong hi hello welcome thank you thank you for coming hello hello we are gonna have to get thi...
anthony,thank you thank you thank you san francisco thank you so much so good to be here people were surprised when i told ’em i was gonna tape my special...
bill,all right thank you thank you very much thank you thank you thank you how are you what’s going on thank you it’s a pleasure to be here in the gre...
bo,bo what old macdonald had a farm e i e i o and on that farm he had a pig e i e i o here a snort there a old macdonald had a farm e i e i o this i...
dave,this is dave he tells dirty jokes for a living that stare is where most of his hard work happens it signifies a profound train of thought the alch...
hasan,what’s up davis what’s up i’m home i had to bring it back here netflix said “where do you want to do the special la chicago new york” i was like...
jim,ladies and gentlemen please welcome to the stage mr jim jefferies hello sit down sit down sit down sit down sit down thank you boston i appre...
joe,ladies and gentlemen welcome joe rogan what the fuck is going on san francisco thanks for coming i appreciate it god damn put your phone down ...
john,all right petunia wish me luck out there you will die on august that’s pretty good all right hello hello chicago nice to see you again thank you...
louis,intro\nfade the music out let’s roll hold there lights do the lights thank you thank you very much i appreciate that i don’t necessarily agree wit...


In [63]:
# Data cleaning step 02

def clean_text_round2(text):
    text = re.sub('[''""...]', '', text)
    text = re.sub('\n','', text)
    return text

round2 = lambda x : clean_text_round2(x)

In [64]:
data_clean = pd.DataFrame(data_clean.transcripts.apply(clean_text_round2))
data_clean

Unnamed: 0,transcripts
ali,ladies and gentlemen please welcome to the stage ali wong hi hello welcome thank you thank you for coming hello hello we are gonna have to get thi...
anthony,thank you thank you thank you san francisco thank you so much so good to be here people were surprised when i told ’em i was gonna tape my special...
bill,all right thank you thank you very much thank you thank you thank you how are you what’s going on thank you it’s a pleasure to be here in the gre...
bo,bo what old macdonald had a farm e i e i o and on that farm he had a pig e i e i o here a snort there a old macdonald had a farm e i e i o this i...
dave,this is dave he tells dirty jokes for a living that stare is where most of his hard work happens it signifies a profound train of thought the alch...
hasan,what’s up davis what’s up i’m home i had to bring it back here netflix said “where do you want to do the special la chicago new york” i was like...
jim,ladies and gentlemen please welcome to the stage mr jim jefferies hello sit down sit down sit down sit down sit down thank you boston i appre...
joe,ladies and gentlemen welcome joe rogan what the fuck is going on san francisco thanks for coming i appreciate it god damn put your phone down ...
john,all right petunia wish me luck out there you will die on august that’s pretty good all right hello hello chicago nice to see you again thank you...
louis,introfade the music out let’s roll hold there lights do the lights thank you thank you very much i appreciate that i don’t necessarily agree with ...


In [65]:
# Data cleaning step 03
# Lemmatization of text 

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer()

def nltk2wn_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None
    
def lemmatize_sentence(sentence):
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    wn_tagged = map(lambda x: (x[0], nltk2wn_tag(x[1])), nltk_tagged)
    res_words = []
    for word, tag in wn_tagged:
        if tag is None:            
            res_words.append(word)
        else:
            res_words.append(lemmatizer.lemmatize(word, tag))
    return " ".join(res_words)
    
round3 = lambda x : lemmatize_sentence(x)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\trishla.mishra\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [66]:
# Cleaning data with lemmatized text

data_clean = pd.DataFrame(data_clean.transcripts.apply(lemmatize_sentence))
data_clean

Unnamed: 0,transcripts
ali,lady and gentleman please welcome to the stage ali wong hi hello welcome thank you thank you for come hello hello we be gon na have to get this sh...
anthony,thank you thank you thank you san francisco thank you so much so good to be here people be surprised when i tell ’ em i be gon na tape my special ...
bill,all right thank you thank you very much thank you thank you thank you how be you what ’ s go on thank you it ’ s a pleasure to be here in the grea...
bo,bo what old macdonald have a farm e i e i o and on that farm he have a pig e i e i o here a snort there a old macdonald have a farm e i e i o this...
dave,this be dave he tell dirty joke for a living that stare be where most of his hard work happen it signify a profound train of thought the alchemist...
hasan,what ’ s up davis what ’ s up i ’ m home i have to bring it back here netflix say “ where do you want to do the special la chicago new york ” i be...
jim,lady and gentleman please welcome to the stage mr jim jefferies hello sit down sit down sit down sit down sit down thank you boston i appreciate t...
joe,lady and gentleman welcome joe rogan what the fuck be go on san francisco thanks for come i appreciate it god damn put your phone down fuckface i ...
john,all right petunia wish me luck out there you will die on august that ’ s pretty good all right hello hello chicago nice to see you again thank you...
louis,introfade the music out let ’ s roll hold there light do the light thank you thank you very much i appreciate that i don ’ t necessarily agree wit...


In [67]:
# Data cleaning step 04 
# Removing stop words

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def remove_stop_words(text):
    word_tokens = word_tokenize(text)
    sentence = [w for w in word_tokens if not w in stop_words]
    sentence = []
    
    for w in word_tokens:
        if w not in stop_words:
            sentence.append(w)
    return " ".join(sentence)
            
round4 = lambda x : remove_stop_words(x)

In [68]:
data_clean = pd.DataFrame(data_clean.transcripts.apply(remove_stop_words))
data_clean

Unnamed: 0,transcripts
ali,lady gentleman please welcome stage ali wong hi hello welcome thank thank come hello hello gon na get shit ’ cause pee like ten minute thank every...
anthony,thank thank thank san francisco thank much good people surprised tell ’ em gon na tape special san francisco say “ would ’ politically correct cit...
bill,right thank thank much thank thank thank ’ go thank ’ pleasure great atlanta georgia area oasis ’ nice ’ know come june ’ nice ’ think fuck ridicu...
bo,bo old macdonald farm e e farm pig e e snort old macdonald farm e e bo burnham ’ year old ’ male look like genetic product giraffe sex ellen degen...
dave,dave tell dirty joke living stare hard work happen signify profound train thought alchemist ’ fire transforms fear tragedy levity livelihood dave ...
hasan,’ davis ’ ’ home bring back netflix say “ want special la chicago new york ” like “ nah son davis california ” um… good year recently get married ...
jim,lady gentleman please welcome stage mr jim jefferies hello sit sit sit sit sit thank boston appreciate uh ’ sweet love ’ end tour right ’ happy to...
joe,lady gentleman welcome joe rogan fuck go san francisco thanks come appreciate god damn put phone fuckface see bitch put phone motherfucker ’ use e...
john,right petunia wish luck die august ’ pretty good right hello hello chicago nice see thank nice thank look ’ wonderful crowd need keep energy entir...
louis,introfade music let ’ roll hold light light thank thank much appreciate ’ necessarily agree appreciate much well nice place easily nice place many...


In [69]:
# Data cleaning step 05
# Removing proper noun from the sentences

def remove_proper_noun(text):
    tagged_sentence = nltk.tag.pos_tag(text.split())
    sentence = [w for w, tag in tagged_sentence if tag != 'NNP' and tag != 'NNPS']
    return ' '.join(sentence)

round5 = lambda x : remove_proper_noun(x)

In [70]:
data_clean = pd.DataFrame(data_clean.transcripts.apply(remove_proper_noun))
data_clean

Unnamed: 0,transcripts
ali,lady gentleman please welcome stage ali wong hi hello welcome thank thank come hello hello gon na get shit cause pee like ten minute thank everybo...
anthony,thank thank thank san francisco thank much good people surprised tell em gon na tape special san francisco say “ would ’ politically correct city ...
bill,right thank thank much thank thank thank go thank ’ pleasure great atlanta georgia area oasis nice ’ know come june nice think fuck ridiculously h...
bo,bo old macdonald farm e e farm pig e e snort old macdonald farm e e bo burnham ’ year old male look like genetic product giraffe sex ellen degener...
dave,dave tell dirty joke living stare hard work happen signify profound train thought alchemist fire transforms fear tragedy levity livelihood dave ca...
hasan,’ davis home bring back netflix say want special la chicago new york ” like nah son davis california um… good year recently get married guy thank ...
jim,lady gentleman please welcome stage mr jim jefferies hello sit sit sit sit sit thank boston appreciate uh sweet love end tour right happy tour chi...
joe,lady gentleman welcome joe rogan fuck go san francisco thanks come appreciate god damn put phone fuckface see bitch put phone motherfucker use eye...
john,right petunia wish luck die august pretty good right hello hello chicago nice see thank nice thank look wonderful crowd need keep energy entire sh...
louis,introfade music let roll hold light light thank thank much appreciate necessarily agree appreciate much well nice place easily nice place many mil...


## 3. Organize the data

1. Corpus
2. Document-term matrix

In [71]:
# Corpus is already created earlier

#Adding the full names of the comedians for visualization purposes
full_names = ['Ali Wong', 'Anthony Jeselnik', 'Bill Burr', 'Bo Burnham', 'Dave Chappelle', 'Hasan Minhaj',
              'Jim Jefferies', 'Joe Rogan', 'John Mulaney', 'Louis C.K.', 'Mike Birbiglia', 'Ricky Gervais']

df['full_name'] = full_names
df

Unnamed: 0,transcripts,full_name
ali,"Ladies and gentlemen, please welcome to the stage: Ali Wong! Hi. Hello! Welcome! Thank you! Thank you for coming. Hello! Hello. We are gonna have ...",Ali Wong
anthony,"Thank you. Thank you. Thank you, San Francisco. Thank you so much. So good to be here. People were surprised when I told ’em I was gonna tape my s...",Anthony Jeselnik
bill,"[cheers and applause] All right, thank you! Thank you very much! Thank you. Thank you. Thank you. How are you? What’s going on? Thank you. It’s a ...",Bill Burr
bo,Bo What? Old MacDonald had a farm E I E I O And on that farm he had a pig E I E I O Here a snort There a Old MacDonald had a farm E I E I O [Appla...,Bo Burnham
dave,"This is Dave. He tells dirty jokes for a living. That stare is where most of his hard work happens. It signifies a profound train of thought, the ...",Dave Chappelle
hasan,"[theme music: orchestral hip-hop] [crowd roars] What’s up? Davis, what’s up? I’m home. I had to bring it back here. Netflix said, “Where do you wa...",Hasan Minhaj
jim,"[Car horn honks] [Audience cheering] [Announcer] Ladies and gentlemen, please welcome to the stage Mr. Jim Jefferies! [Upbeat music playing] Hello...",Jim Jefferies
joe,"[rock music playing] [audience cheering] [announcer] Ladies and gentlemen, welcome Joe Rogan. [audience cheering and applauding] What the fuck is ...",Joe Rogan
john,"All right, Petunia. Wish me luck out there. You will die on August 7th, 2037. That’s pretty good. All right. Hello. Hello, Chicago. Nice to see yo...",John Mulaney
louis,Intro\nFade the music out. Let’s roll. Hold there. Lights. Do the lights. Thank you. Thank you very much. I appreciate that. I don’t necessarily a...,Louis C.K.


In [72]:
# Pickling the dataset

df.to_pickle("corpus.pkl")

Creating the document-term matrix

In [73]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words = 'english')
data_cv = cv.fit_transform(data_clean.transcripts)
data_dtm = pd.DataFrame(data_cv.toarray(), columns = cv.get_feature_names())
data_dtm.index = data_clean.index
data_dtm

Unnamed: 0,aaaaah,aaaaahhhhhhh,aaaaauuugghhhhhh,aaaahhhhh,aaah,aah,abc,ability,abject,able,...,yummy,yyou,ze,zealand,zeppelin,zero,zillion,zombie,zone,zoo
ali,0,0,0,0,0,0,1,0,0,2,...,0,1,0,0,0,0,0,1,0,0
anthony,0,0,0,0,0,0,0,0,0,0,...,0,0,0,3,0,0,0,0,0,0
bill,1,0,0,0,0,0,1,0,0,1,...,1,0,1,0,0,1,1,2,1,0
bo,0,1,1,1,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
dave,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
hasan,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
jim,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
joe,0,0,0,0,0,0,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0
john,0,0,0,0,0,0,0,0,0,3,...,0,0,0,0,0,0,0,0,0,0
louis,0,0,0,0,0,3,0,0,0,1,...,0,0,0,0,0,2,0,0,0,0


In [74]:
data_dtm.to_pickle('dtm.pkl')

In [75]:
# Pickling the cleaned data

data_clean.to_pickle('data_clean.pkl')
pickle.dump(cv, open("cv.pkl", 'wb'))