# Sentiment Analysis

## Getting the Data

Transcripts were scraped from the website [Scraps From The Loft](http://scrapsfromtheloft.com). The comedy routines were shortlisted using IMDb.

In [1]:
# Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle

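# Scrapes transcript data from scrapsfromtheloft.com
def url_to_transcript(url):
    '''Returns the transcript paragraphs from a scrapsfromtheloft.com page as a list of strings.'''
    response = requests.get(url)
    response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(response.text, "lxml")
    # The transcript text sits in <p> tags inside the element with class "ast-container"
    text = [p.text for p in soup.find(class_="ast-container").find_all('p')]
    print(url)
    return text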

# URLs of transcripts in scope
urls = ['https://scrapsfromtheloft.com/comedy/dave-chappelle-whats-in-a-name-transcript/',
        'https://scrapsfromtheloft.com/comedy/gabriel-iglesias-stadium-fluffy-transcript/',
        'https://scrapsfromtheloft.com/comedy/norm-macdonald-nothing-special-transcript/',
        'https://scrapsfromtheloft.com/comedy/vir-das-outside-in-the-lockdown-special-transcript/',
        'https://scrapsfromtheloft.com/comedy/stewart-lee-carpet-remnant-world-transcript/',
        'https://scrapsfromtheloft.com/comedy/george-carlin-doin-it-again-transcript/',
        'https://scrapsfromtheloft.com/comedy/michael-mcintyre-showman-transcript/',
        'https://scrapsfromtheloft.com/comedy/tom-papa-youre-doing-great-transcript/',
        'https://scrapsfromtheloft.com/comedy/taylor-tomlinson-quarter-life-crisis-transcript/',
        'https://scrapsfromtheloft.com/comedy/fortune-feimster-good-fortune-transcript/',
        'https://scrapsfromtheloft.com/comedy/ricky-gervais-supernature-transcript/',
        'https://scrapsfromtheloft.com/comedy/louis-c-k-sorry-transcript/']

# Comedian names
comedians = ['dave', 'fluffy', 'norm', 'vir', 'stewart', 'carlin', 'mcintyre', 'papa', 'taylor', 'fortune', 'ricky', 'louis']
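
The `urls` and `comedians` lists are matched purely by position, so a missing or reordered entry would silently attach a transcript to the wrong name. A minimal sketch of a guard against that, assuming a hypothetical helper `pair_names_with_urls` (not part of the original notebook) and using two entries copied from the lists above:

```python
def pair_names_with_urls(names, urls):
    """Return a {name: url} dict, refusing lists of different lengths."""
    if len(names) != len(urls):
        raise ValueError(
            f"{len(names)} names but {len(urls)} URLs; the lists must line up"
        )
    return dict(zip(names, urls))


# Two entries copied from the lists above, kept short for illustration.
sample_comedians = ["dave", "vir"]
sample_urls = [
    "https://scrapsfromtheloft.com/comedy/dave-chappelle-whats-in-a-name-transcript/",
    "https://scrapsfromtheloft.com/comedy/vir-das-outside-in-the-lockdown-special-transcript/",
]

name_to_url = pair_names_with_urls(sample_comedians, sample_urls)
for name, url in name_to_url.items():
    print(name, "->", url)
```

The length check only catches missing entries; a swapped order still has to be verified by eye against the printed pairs.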

In [2]:
transcripts = [url_to_transcript(u) for u in urls]

https://scrapsfromtheloft.com/comedy/dave-chappelle-whats-in-a-name-transcript/
https://scrapsfromtheloft.com/comedy/gabriel-iglesias-stadium-fluffy-transcript/
https://scrapsfromtheloft.com/comedy/norm-macdonald-nothing-special-transcript/
https://scrapsfromtheloft.com/comedy/vir-das-outside-in-the-lockdown-special-transcript/
https://scrapsfromtheloft.com/comedy/stewart-lee-carpet-remnant-world-transcript/
https://scrapsfromtheloft.com/comedy/george-carlin-doin-it-again-transcript/
https://scrapsfromtheloft.com/comedy/michael-mcintyre-showman-transcript/
https://scrapsfromtheloft.com/comedy/tom-papa-youre-doing-great-transcript/
https://scrapsfromtheloft.com/comedy/taylor-tomlinson-quarter-life-crisis-transcript/
https://scrapsfromtheloft.com/comedy/fortune-feimster-good-fortune-transcript/
https://scrapsfromtheloft.com/comedy/ricky-gervais-supernature-transcript/
https://scrapsfromtheloft.com/comedy/louis-c-k-sorry-transcript/


In [3]:
# Pickle files for later use
# Making a new directory to hold the text files
!mkdir -p transcripts

for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)

In [4]:
# Load pickled files
data = {}
for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)
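
The save and load cells above can also be written without the shell command, so the directory step works on any platform and can be re-run safely. A minimal sketch, assuming hypothetical helpers `save_transcripts` and `load_transcripts` and toy data in place of the scraped transcripts. It uses a `.pkl` suffix instead of `.txt` to signal that the files are binary pickles; that is a naming choice of this sketch, not something the notebook does.

```python
import pickle
import tempfile
from pathlib import Path


def save_transcripts(transcripts_by_name, directory):
    """Pickle each transcript (a list of paragraphs) to <directory>/<name>.pkl."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    for name, paragraphs in transcripts_by_name.items():
        with open(directory / f"{name}.pkl", "wb") as file:
            pickle.dump(paragraphs, file)


def load_transcripts(names, directory):
    """Load the pickled transcripts back into a {name: paragraphs} dict."""
    directory = Path(directory)
    loaded = {}
    for name in names:
        with open(directory / f"{name}.pkl", "rb") as file:
            loaded[name] = pickle.load(file)
    return loaded


# Toy stand-in for the scraped data (hypothetical content).
sample = {"dave": ["Art is dangerous.", "Second paragraph."], "vir": ["Hello."]}

# A temporary directory keeps the example from touching the real transcripts folder.
with tempfile.TemporaryDirectory() as tmp:
    save_transcripts(sample, tmp)
    restored = load_transcripts(sample.keys(), tmp)

print(restored == sample)
```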

In [7]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['dave', 'fluffy', 'norm', 'vir', 'stewart', 'carlin', 'mcintyre', 'papa', 'taylor', 'fortune', 'ricky', 'louis'])

In [None]:
# More checks
data['vir'][:2]

## Cleaning the Data

In [8]:
# Let's take a look at our data again
next(iter(data.keys()))

'dave'

In [None]:
# Notice that our dictionary is currently in key: comedian, value: list of text format
next(iter(data.values()))

In [10]:
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

In [11]:
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}
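
To make the change of shape concrete, here is the same combining step on a toy dictionary. The sample paragraphs are hypothetical stand-ins for the scraped lists; `combine_text` is repeated so the snippet runs on its own.

```python
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    return ' '.join(list_of_text)


# Before: each comedian maps to a list of paragraph strings.
sample_data = {
    "dave": ["Art is dangerous.", "It is one of the attractions."],
    "vir": ["Hello."],
}

# After: each comedian maps to a one-element list holding the whole transcript,
# which is the shape pandas turns into a single 'transcript' column.
sample_combined = {key: [combine_text(value)] for key, value in sample_data.items()}

print(sample_combined["dave"])
```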

In [12]:
# We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd
pd.set_option('display.max_colwidth', 190)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
data_df = data_df.sort_index()
data_df

Unnamed: 0,transcript
carlin,"Recorded on January 12–13, 1990, State Theatre, New Brunswick, New Jersey So you want to talk about it? Oh yeah. It all started in 1977. I mean, that’s when I started doing it regularly...."
dave,"What’s in a Name? is a 40-minute talk Chappelle delivered at Duke Ellington School of the Arts in Washington, D.C., on June 20, 2022 * * * Art is dangerous.\nIt is one of the attractions..."
fluffy,"[man] Can you please state your name? Martin Moreno. But you might know me as… Martinnnnn! I’ve been touring with Gabriel Iglesias for 20-plus years. Martinnnnn! And, yeah, he’s been scr..."
fortune,[upbeat music plays] [audience cheering] [announcer] Please welcome Fortune Feimster! ♪ I’m a powerful woman ♪ ♪ Always get what I want ♪ ♪ So don’t you get in my way now That’s not what...
louis,"Recorded at the Madison Square Garden on August 14, 2021 * * * ♪♪ [“Like a Rolling Stone” by Bob Dylan playing] ♪♪ ♪ Once upon a time you dressed so fine ♪\n♪ Threw the bums a dime in yo..."
mcintyre,"Released on September 15, 2020 [Netflix] Ladies and gentlemen, please welcome to the stage Michael McIntyre! Bravo! Good evening, ladies and gentlemen! Welcome… …to my Netflix special! L..."
norm,"Norm was working hard preparing material for his Netflix special – until COVID shut things down. In the summer of 2020, he was scheduled to undergo a procedure and as he put it, “didn’t ..."
papa,"[applause, whooping] [presenter] Ladies and gentlemen, Tom Papa. [mouths] [whistling and cheering] [mouths] Thank you. Thank you. Thank you. Look at you. Look at you. New Jersey. [cheeri..."
ricky,"[audience cheering and applauding] [announcer] Good evening, ladies and gentlemen. Please welcome to the stage a man who really doesn’t need to do this. [audience laughing] Ricky Gervais..."
stewart,"(’70s GERMAN ROCK MUSIC PLAYING) ANNOUNCER: Ladies and gentlemen, it’s time to enter the Carpet Remnant World of Stewart Lee! (AUDIENCE APPLAUDING) That was a bit heavy metal, rock and r..."


In [None]:
# Transcript for Vir Das
data_df.transcript.loc['vir']

In [17]:
# First round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
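    text = re.sub(r'\[.*?\]', '', text)  # drop stage directions such as [audience cheering]
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # drop ASCII punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # drop any word that contains a digit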
    return text

round1 = lambda x: clean_text_round1(x)
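
A quick way to see what this first round does, and what it leaves behind, is to run it on a short string. The sample sentences below are hypothetical, not taken from any transcript, and the function is repeated so the snippet runs on its own.

```python
import re
import string


def clean_text_round1(text):
    '''Same steps as the first cleaning round above.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text


# Hypothetical sample line with a bracketed cue, ASCII punctuation and digits.
sample = "[audience cheering] Hello, it's 2022! Room101 was fine."
cleaned = clean_text_round1(sample)
print(cleaned.split())  # split() hides the extra spaces left by removed words

# Curly quotes are not in string.punctuation, so they pass through untouched.
curly = clean_text_round1("That’s great")
print(curly)
```

The second call shows why another cleaning pass is needed: typographic characters such as `’` survive because `string.punctuation` only covers ASCII punctuation.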

In [18]:
# Updated text
data_clean = pd.DataFrame(data_df.transcript.apply(round1))
data_clean

Unnamed: 0,transcript
carlin,recorded on january – state theatre new brunswick new jersey so you want to talk about it oh yeah it all started in i mean that’s when i started doing it regularly how many times have ...
dave,what’s in a name is a talk chappelle delivered at duke ellington school of the arts in washington dc on june art is dangerous\nit is one of the attractions when it ceases to be dan...
fluffy,can you please state your name martin moreno but you might know me as… martinnnnn i’ve been touring with gabriel iglesias for years martinnnnn and yeah he’s been screaming my name for ...
fortune,please welcome fortune feimster ♪ i’m a powerful woman ♪ ♪ always get what i want ♪ ♪ so don’t you get in my way now that’s not what i want ♪ ♪ ‘cause i’m a powerful woman ♪ ♪ always ...
louis,recorded at the madison square garden on august ♪♪ ♪♪ ♪ once upon a time you dressed so fine ♪\n♪ threw the bums a dime in your prime ♪\n♪ didn’t you ♪ ♪♪ ♪ people call say beware ...
mcintyre,released on september ladies and gentlemen please welcome to the stage michael mcintyre bravo good evening ladies and gentlemen welcome… …to my netflix special let’s do this thank you...
norm,norm was working hard preparing material for his netflix special – until covid shut things down in the summer of he was scheduled to undergo a procedure and as he put it “didn’t want to...
papa,ladies and gentlemen tom papa thank you thank you thank you look at you look at you new jersey yeah that’s why i’m here it’s the people it’s definitely not the weather it’s the peo...
ricky,good evening ladies and gentlemen please welcome to the stage a man who really doesn’t need to do this ricky gervais hello hello thank you shush thank you very much shush no shus...
stewart,’ german rock music playing announcer ladies and gentlemen it’s time to enter the carpet remnant world of stewart lee audience applauding that was a bit heavy metal rock and roll that ca...


In [19]:
# Second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…♪–]', '', text)
    text = re.sub('\n', '', text)
    return text

round2 = lambda x: clean_text_round2(x)
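
Because this round deletes newlines instead of replacing them, words on either side of a line break get fused, as in `dangerousit` in the `dave` transcript. A minimal sketch of an alternative, assuming a hypothetical variant named `clean_text_round2_spaced` that is not part of the original notebook: it turns every run of whitespace into a single space.

```python
import re


def clean_text_round2_spaced(text):
    '''Variant of round 2 that keeps words on either side of a newline separate.'''
    text = re.sub('[‘’“”…♪–]', '', text)  # same typographic characters as round 2
    text = re.sub(r'\s+', ' ', text)       # newlines and repeated spaces become one space
    return text.strip()


print(clean_text_round2_spaced("art is dangerous\nit is one of the attractions"))
print(clean_text_round2_spaced("♪ once upon a time ♪\n♪ threw the bums a dime ♪"))
```

Swapping this in would change the cleaned transcripts and therefore the document-term matrix built from them, so the outputs shown in this notebook correspond to the original `clean_text_round2`.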

In [21]:
# Updated text
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean

Unnamed: 0,transcript
carlin,recorded on january state theatre new brunswick new jersey so you want to talk about it oh yeah it all started in i mean thats when i started doing it regularly how many times have yo...
dave,whats in a name is a talk chappelle delivered at duke ellington school of the arts in washington dc on june art is dangerousit is one of the attractions when it ceases to be danger...
fluffy,can you please state your name martin moreno but you might know me as martinnnnn ive been touring with gabriel iglesias for years martinnnnn and yeah hes been screaming my name for ye...
fortune,please welcome fortune feimster im a powerful woman always get what i want so dont you get in my way now thats not what i want cause im a powerful woman always get what i wan...
louis,recorded at the madison square garden on august once upon a time you dressed so fine threw the bums a dime in your prime didnt you people call say beware doll youre bound ...
mcintyre,released on september ladies and gentlemen please welcome to the stage michael mcintyre bravo good evening ladies and gentlemen welcome to my netflix special lets do this thank you if...
norm,norm was working hard preparing material for his netflix special until covid shut things down in the summer of he was scheduled to undergo a procedure and as he put it didnt want to le...
papa,ladies and gentlemen tom papa thank you thank you thank you look at you look at you new jersey yeah thats why im here its the people its definitely not the weather its the people t...
ricky,good evening ladies and gentlemen please welcome to the stage a man who really doesnt need to do this ricky gervais hello hello thank you shush thank you very much shush no shush...
stewart,german rock music playing announcer ladies and gentlemen its time to enter the carpet remnant world of stewart lee audience applauding that was a bit heavy metal rock and roll that can ...


## Organising the Text

In [22]:
# A corpus is a collection of texts; here the transcripts are held together in a pandas DataFrame.
data_df

Unnamed: 0,transcript
carlin,"Recorded on January 12–13, 1990, State Theatre, New Brunswick, New Jersey So you want to talk about it? Oh yeah. It all started in 1977. I mean, that’s when I started doing it regularly...."
dave,"What’s in a Name? is a 40-minute talk Chappelle delivered at Duke Ellington School of the Arts in Washington, D.C., on June 20, 2022 * * * Art is dangerous.\nIt is one of the attractions..."
fluffy,"[man] Can you please state your name? Martin Moreno. But you might know me as… Martinnnnn! I’ve been touring with Gabriel Iglesias for 20-plus years. Martinnnnn! And, yeah, he’s been scr..."
fortune,[upbeat music plays] [audience cheering] [announcer] Please welcome Fortune Feimster! ♪ I’m a powerful woman ♪ ♪ Always get what I want ♪ ♪ So don’t you get in my way now That’s not what...
louis,"Recorded at the Madison Square Garden on August 14, 2021 * * * ♪♪ [“Like a Rolling Stone” by Bob Dylan playing] ♪♪ ♪ Once upon a time you dressed so fine ♪\n♪ Threw the bums a dime in yo..."
mcintyre,"Released on September 15, 2020 [Netflix] Ladies and gentlemen, please welcome to the stage Michael McIntyre! Bravo! Good evening, ladies and gentlemen! Welcome… …to my Netflix special! L..."
norm,"Norm was working hard preparing material for his Netflix special – until COVID shut things down. In the summer of 2020, he was scheduled to undergo a procedure and as he put it, “didn’t ..."
papa,"[applause, whooping] [presenter] Ladies and gentlemen, Tom Papa. [mouths] [whistling and cheering] [mouths] Thank you. Thank you. Thank you. Look at you. Look at you. New Jersey. [cheeri..."
ricky,"[audience cheering and applauding] [announcer] Good evening, ladies and gentlemen. Please welcome to the stage a man who really doesn’t need to do this. [audience laughing] Ricky Gervais..."
stewart,"(’70s GERMAN ROCK MUSIC PLAYING) ANNOUNCER: Ladies and gentlemen, it’s time to enter the Carpet Remnant World of Stewart Lee! (AUDIENCE APPLAUDING) That was a bit heavy metal, rock and r..."


In [23]:
# Let's add the comedians' full names as well
full_names = ['George Carlin', 'Dave Chappelle', 'Gabriel Iglesias', 'Fortune Feimster', 'Louis C.K.', 'Michael McIntyre',
              'Norm Macdonald', 'Tom Papa', 'Ricky Gervais', 'Stewart Lee', 'Taylor Tomlinson', 'Vir Das']

data_df['full_name'] = full_names
data_df

Unnamed: 0,transcript,full_name
carlin,"Recorded on January 12–13, 1990, State Theatre, New Brunswick, New Jersey So you want to talk about it? Oh yeah. It all started in 1977. I mean, that’s when I started doing it regularly....",George Carlin
dave,"What’s in a Name? is a 40-minute talk Chappelle delivered at Duke Ellington School of the Arts in Washington, D.C., on June 20, 2022 * * * Art is dangerous.\nIt is one of the attractions...",Dave Chappelle
fluffy,"[man] Can you please state your name? Martin Moreno. But you might know me as… Martinnnnn! I’ve been touring with Gabriel Iglesias for 20-plus years. Martinnnnn! And, yeah, he’s been scr...",Gabriel Iglesias
fortune,[upbeat music plays] [audience cheering] [announcer] Please welcome Fortune Feimster! ♪ I’m a powerful woman ♪ ♪ Always get what I want ♪ ♪ So don’t you get in my way now That’s not what...,Fortune Feimster
louis,"Recorded at the Madison Square Garden on August 14, 2021 * * * ♪♪ [“Like a Rolling Stone” by Bob Dylan playing] ♪♪ ♪ Once upon a time you dressed so fine ♪\n♪ Threw the bums a dime in yo...",Louis C.K.
mcintyre,"Released on September 15, 2020 [Netflix] Ladies and gentlemen, please welcome to the stage Michael McIntyre! Bravo! Good evening, ladies and gentlemen! Welcome… …to my Netflix special! L...",Michael McIntyre
norm,"Norm was working hard preparing material for his Netflix special – until COVID shut things down. In the summer of 2020, he was scheduled to undergo a procedure and as he put it, “didn’t ...",Norm Macdonald
papa,"[applause, whooping] [presenter] Ladies and gentlemen, Tom Papa. [mouths] [whistling and cheering] [mouths] Thank you. Thank you. Thank you. Look at you. Look at you. New Jersey. [cheeri...",Tom Papa
ricky,"[audience cheering and applauding] [announcer] Good evening, ladies and gentlemen. Please welcome to the stage a man who really doesn’t need to do this. [audience laughing] Ricky Gervais...",Ricky Gervais
stewart,"(’70s GERMAN ROCK MUSIC PLAYING) ANNOUNCER: Ladies and gentlemen, it’s time to enter the Carpet Remnant World of Stewart Lee! (AUDIENCE APPLAUDING) That was a bit heavy metal, rock and r...",Stewart Lee


In [24]:
# Let's pickle it for later use
data_df.to_pickle("corpus.pkl")

### Document Term Matrix

The text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break text into words. We can do this with scikit-learn's `CountVectorizer`, which produces a matrix where every row represents a document and every column represents a word; each cell holds the number of times that word occurs in that document.

In addition, `CountVectorizer` can remove stop words: common words such as 'a' and 'the' that add little meaning to the text.
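
To show what a document-term matrix contains, here is a hand-rolled version on two toy documents using only the standard library. It is an illustration of the idea, not how `CountVectorizer` is implemented: whitespace splitting stands in for its tokenizer, and the two-word stop list is a made-up stand-in for its built-in English list.

```python
from collections import Counter

# Toy corpus (hypothetical documents).
documents = {
    "doc_a": "the cat sat on the mat",
    "doc_b": "the dog sat",
}
stop_words = {"the", "on"}  # tiny illustrative stop list

# Count the non-stop words in each document.
counts = {
    name: Counter(word for word in text.split() if word not in stop_words)
    for name, text in documents.items()
}

# Columns: every distinct word across all documents, in sorted order.
vocabulary = sorted(set().union(*counts.values()))

# Rows: one list of counts per document, aligned with the vocabulary.
matrix = {name: [count[word] for word in vocabulary] for name, count in counts.items()}

print(vocabulary)
for name, row in matrix.items():
    print(name, row)
```

Each row has one entry per vocabulary word, and a zero wherever the document does not use that word, which is why the real matrix below is mostly zeros.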

In [25]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.transcript)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_clean.index
data_dtm



Unnamed: 0,aaah,aah,abc,abducted,abducting,abernathy,abernathys,abilities,able,abled,...,zombies,zone,zones,zoo,zoom,zoomed,zucchinis,álvarez,ándale,ñañaras
carlin,0,0,0,0,0,0,0,0,2,2,...,0,0,0,0,0,0,0,0,0,0
dave,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
fluffy,1,0,1,0,0,0,0,0,0,0,...,0,0,1,0,4,0,0,3,1,1
fortune,0,0,0,0,0,0,0,0,2,0,...,0,0,0,0,1,1,0,0,0,0
louis,0,0,0,0,0,0,0,0,1,0,...,0,0,0,9,0,0,0,0,0,0
mcintyre,0,2,0,0,0,0,0,0,0,0,...,0,2,0,0,0,0,0,0,0,0
norm,0,0,0,0,0,2,1,0,0,0,...,0,1,0,0,0,0,0,0,0,0
papa,1,0,0,0,0,0,0,0,2,0,...,0,0,0,1,0,0,0,0,0,0
ricky,0,0,0,0,1,0,0,0,2,0,...,0,0,0,0,0,0,0,0,0,0
stewart,0,0,0,0,0,0,0,0,3,0,...,8,0,0,0,0,0,0,0,0,0


In [26]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")

In [27]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean.pkl')
with open("cv.pkl", "wb") as file:
    pickle.dump(cv, file)