In [1]:
%load_ext nb_black

# Data Cleaning

## Introduction

1. **Getting the data** - in this case, we'll be scraping data from a website
2. **Cleaning the data** - we will walk through popular text pre-processing techniques
3. **Organizing the data** - we will organize the cleaned data in a way that is easy to input into other algorithms

The output of this notebook will be clean, organized data in two standard text formats:

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

## Problem Statement

This notebook takes a deep dive into every Dave Chappelle stand-up special I can get my hands on.

## Getting The Data

The data is scraped from scrapsfromtheloft.com.

In [27]:
# Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle

# Scrapes transcript data from scrapsfromtheloft.com
def url_to_transcript(url):
    """Returns transcript data specifically from scrapsfromtheloft.com."""
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    text = [p.text for p in soup.find(class_="post-content").find_all("p")]
    print(url)
    return text


# URLs of transcripts in scope
urls = [
    "http://scrapsfromtheloft.com/2017/04/11/dave-chappelle-age-spin-2017-full-transcript/",
    "https://scrapsfromtheloft.com/2019/08/26/dave-chappelle-sticks-stones-transcript/",
    "https://scrapsfromtheloft.com/2018/01/03/dave-chappelle-the-bird-revelation-2017-full-transcript/",
    "https://scrapsfromtheloft.com/2018/01/03/dave-chappelle-equanimity-2017-full-transcript/",
    "https://scrapsfromtheloft.com/2017/05/19/dave-chappelle-killin-softly-2000-full-transcript/",
    "https://scrapsfromtheloft.com/2017/04/20/dave-chappelle-deep-heart-texas-2017-full-transcript/",
    "https://scrapsfromtheloft.com/2017/05/19/dave-chappelle-worth-2004-full-transcript/",
]

# Short names for each special (one per URL, in the same order)
comedians = [
    "age_spin",
    "stick_stones",
    "the_bird_revelation",
    "equanimity",
    "killin_softly",
    "deep_texas",
    "worth",
]
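
To make the parsing step in `url_to_transcript` concrete without touching the network, here is a minimal standard-library sketch of the same idea: collect the text of every `<p>` nested inside the element whose class is `post-content`. The `PostContentParser` class and the `sample_html` string are made up for illustration. The notebook itself uses BeautifulSoup, and real pages are messier than this.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PostContentParser(HTMLParser):
    """Collects the text of each <p> nested inside class="post-content"."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # greater than 0 while inside the post-content element
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif "post-content" in classes:
            self.depth = 1
        if self.depth and tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag == "p":
            self.in_paragraph = False
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data


# Hypothetical page: only the two paragraphs inside post-content should survive
sample_html = """
<html><body>
<p>Site navigation, not part of the transcript.</p>
<div class="post-content">
  <p>Thank you! Thank you very much!</p>
  <div><p>It's been great here.</p></div>
</div>
<p>Footer text.</p>
</body></html>
"""

parser = PostContentParser()
parser.feed(sample_html)
parser.close()
paragraphs = parser.paragraphs
print(paragraphs)
```

The result is a list with one string per paragraph, which is the same shape that `url_to_transcript` returns for each URL.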

In [28]:
# Scrape each transcript page (the function prints every URL as it goes)
transcripts = [url_to_transcript(u) for u in urls]

http://scrapsfromtheloft.com/2017/04/11/dave-chappelle-age-spin-2017-full-transcript/
https://scrapsfromtheloft.com/2019/08/26/dave-chappelle-sticks-stones-transcript/
https://scrapsfromtheloft.com/2018/01/03/dave-chappelle-the-bird-revelation-2017-full-transcript/
https://scrapsfromtheloft.com/2018/01/03/dave-chappelle-equanimity-2017-full-transcript/
https://scrapsfromtheloft.com/2017/05/19/dave-chappelle-killin-softly-2000-full-transcript/
https://scrapsfromtheloft.com/2017/04/20/dave-chappelle-deep-heart-texas-2017-full-transcript/
https://scrapsfromtheloft.com/2017/05/19/dave-chappelle-worth-2004-full-transcript/


In [29]:
# Pickle files for later use
import os

# Make a new directory to hold the pickled transcripts (no error if it already exists)
os.makedirs("transcripts", exist_ok=True)

# Note: the files are binary pickles even though their names end in .txt
for c, transcript in zip(comedians, transcripts):
    with open("transcripts/" + c + ".txt", "wb") as file:
        pickle.dump(transcript, file)


In [30]:
# Load pickled files
data = {}
for c in comedians:
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)

In [31]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['age_spin', 'stick_stones', 'the_bird_revelation', 'equanimity', 'killin_softly', 'deep_texas', 'worth'])
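
The save and load cells above form a pickle round trip. The sketch below reproduces it in a temporary directory with made-up data (`toy_transcripts` is not real transcript text) to show that each list of paragraphs comes back unchanged. The `.pkl` extension here is my choice for the sketch; the notebook's files are also binary pickles, despite their `.txt` names.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the scraped data: name -> list of paragraphs
toy_transcripts = {
    "special_a": ["First paragraph.", "Second paragraph."],
    "special_b": ["Only paragraph."],
}

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "transcripts"
    folder.mkdir(exist_ok=True)

    # Write one pickle file per special
    for name, paragraphs in toy_transcripts.items():
        with open(folder / f"{name}.pkl", "wb") as file:
            pickle.dump(paragraphs, file)

    # Read them back into a fresh dictionary
    loaded = {}
    for name in toy_transcripts:
        with open(folder / f"{name}.pkl", "rb") as file:
            loaded[name] = pickle.load(file)

print(loaded == toy_transcripts)
```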

In [33]:
# Looking at a particular standup
data["age_spin"][:4]

['This is Dave. He tells dirty jokes for a living. That stare is where most of his hard work happens. It signifies a profound train of thought, the alchemist’s fire that transforms fear and tragedy into levity and livelihood. Dave calls that look “the trance.” ♪ Play me ♪ ♪ Buy me ♪ ♪ Workinonit ♪ ♪ Tune up ♪ ♪ Tune ♪ ♪ Oh ♪ ♪ Fade me ♪ ♪ Ah-ah, ah-ah, ah-ah ♪ ♪ In every ghetto ♪ ♪ Ah-ah, ah-ah, ah-ah ♪ ♪ In every ghetto ♪ ♪ Ah-ah, ah-ah, ah-ah ♪ ♪ In every ghetto ♪ ♪ Ah-ah, ah-ah, ah-ah ♪ ♪ In every ghetto ♪ ♪ Ah-ah, ah-ah, ah-ah ♪ ♪ In every ghetto ♪ ♪ Ah-ah, ah-ah, ah-ah ♪ ♪ In every ghetto ♪ ♪ Ah-ah, ah-ah, ah-ah ♪',
 'Thank you! Thank you very much! Thank you all. Oh, wow. That was exciting, wasn’t it? Thank you, guys. Have a seat, feel comfortable, relax. I want to thank everyone in LA for a wonderful week. It’s been great here. You know what? It’s been ten years since the last time I played Los Angeles, if you can imagine. I know! I know, I’ve been gone for a very long time. And

## Cleaning The Data

When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

With text data, this cleaning process can go on forever; there is always an exception to every cleaning step. So we're going to follow the MVP (minimum viable product) approach: start simple and iterate. The lists below cover many of the things you can do to clean your data. We'll apply only the common cleaning steps here; the rest can be added later to improve our results.

**Common data cleaning steps on all text:**
* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text (such as `\n`)
* Tokenize text
* Remove stop words

**More data cleaning steps after tokenization:**
* Stemming / lemmatization
* Part-of-speech tagging
* Create bi-grams or tri-grams
* Deal with typos
* And more...
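
As a rough illustration of the common steps listed above, here is a small standard-library sketch run on a made-up sentence. The helper names (`basic_clean`, `tokenize`, `remove_stop_words`) and the tiny `TOY_STOP_WORDS` set are assumptions for this example only; later in the notebook, tokenization and stop-word removal are left to scikit-learn.

```python
import re
import string

# A tiny, hand-picked stop-word list for illustration only
TOY_STOP_WORDS = {"a", "the", "is", "for", "he", "this"}


def basic_clean(text):
    """Lowercase, strip ASCII punctuation, drop tokens containing digits, collapse whitespace."""
    text = text.lower()
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    text = re.sub(r"\w*\d\w*", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


def tokenize(text):
    """Split cleaned text into word tokens on whitespace."""
    return text.split()


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the toy stop-word list."""
    return [t for t in tokens if t not in TOY_STOP_WORDS]


sample = "This is Dave.\nHe tells 2 jokes for a living!"
cleaned = basic_clean(sample)
tokens = remove_stop_words(tokenize(cleaned))
print(cleaned)  # this is dave he tells jokes for a living
print(tokens)   # ['dave', 'tells', 'jokes', 'living']
```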

In [34]:
# Let's take a look at our data again
next(iter(data.keys()))

'age_spin'

In [35]:
# Notice that our dictionary is currently in key: special, value: list of text format
next(iter(data.values()))

['This is Dave. He tells dirty jokes for a living. That stare is where most of his hard work happens. It signifies a profound train of thought, the alchemist’s fire that transforms fear and tragedy into levity and livelihood. Dave calls that look “the trance.” ♪ Play me ♪ ♪ Buy me ♪ ♪ Workinonit ♪ ♪ Tune up ♪ ♪ Tune ♪ ♪ Oh ♪ ♪ Fade me ♪ ♪ Ah-ah, ah-ah, ah-ah ♪ ♪ In every ghetto ♪ ♪ Ah-ah, ah-ah, ah-ah ♪ ♪ In every ghetto ♪ ♪ Ah-ah, ah-ah, ah-ah ♪ ♪ In every ghetto ♪ ♪ Ah-ah, ah-ah, ah-ah ♪ ♪ In every ghetto ♪ ♪ Ah-ah, ah-ah, ah-ah ♪ ♪ In every ghetto ♪ ♪ Ah-ah, ah-ah, ah-ah ♪ ♪ In every ghetto ♪ ♪ Ah-ah, ah-ah, ah-ah ♪',
 'Thank you! Thank you very much! Thank you all. Oh, wow. That was exciting, wasn’t it? Thank you, guys. Have a seat, feel comfortable, relax. I want to thank everyone in LA for a wonderful week. It’s been great here. You know what? It’s been ten years since the last time I played Los Angeles, if you can imagine. I know! I know, I’ve been gone for a very long time. And

In [36]:
# We are going to change this to key: special, value: string format
def combine_text(list_of_text):
    """Takes a list of text and combines them into one large chunk of text."""
    combined_text = " ".join(list_of_text)
    return combined_text

In [37]:
# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

In [38]:
# We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd

pd.set_option("display.max_colwidth", 150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ["transcript"]
data_df = data_df.sort_index()
data_df

Unnamed: 0,transcript
age_spin,"This is Dave. He tells dirty jokes for a living. That stare is where most of his hard work happens. It signifies a profound train of thought, the ..."
deep_texas,"[Morgan Freeman] He’s in the trance. He isn’t thinking of jokes, though. He’s composing the voiceover I’m saying to you right now. Getting me to a..."
equanimity,"“Equanimity” was shot in Washington, D.C., and it covers the material that Chappelle developed in his monthlong stint at Radio City Music Hall in ..."
killin_softly,Wooo! Ya’ll gone make me lose my mind. Up in here! Up in here! Y’all gone make me throw her out. Up in here! Up in here! Y’all gone make me act a ...
stick_stones,Sticks & Stones is Dave Chappelle’s fifth Netflix special.\nIn the promotional trailer Morgan Freeman narrates as Chappelle swaggers across a salt...
the_bird_revelation,"Recorded at the Comedy Store in Los Angeles in November 2017 [Dave Chappelle] Sometimes, the funniest thing to say is mean. You know what I mean? ..."
worth,Why’d you pick San Francisco to shoot your special? This is one of the best towns that ever knew comedy. And this is the most historic venue you g...


In [40]:
# Let's take a look at one of the specials
data_df.transcript.loc["stick_stones"]

'Sticks & Stones is Dave Chappelle’s fifth Netflix special.\nIn the promotional trailer Morgan Freeman narrates as Chappelle swaggers across a salt flat in leather pants, aviator shades and a remarkably long t-shirt. [Morgan Freeman] This is Dave. He tells jokes for a living. Hopefully he makes people laugh, but these days it’s a high stakes game. Hmm, how did we get here, I wonder? I don’t mean that metaphorically, I’m really asking: how did Dave get here? I mean, what the fuck is this? But what do I know? I’m just Morgan Freeman. Anyway, I guess what I’m trying to say is\xa0if you say anything… you risk everything. But if that’s the way it’s gotta be—okay, fine, fuck it!  Ahahah, he’s back folks! Sticks & Stones streamed August 26, 2019 on Netflix. “TELL ME SOMETHING’ YOU MOTHAFUCKAS\nCAN’T TELL ME NOTHIN’ I’D RATHER DIE THAN\nTO LISTEN TO YOU…” —KENDRICK LAMAR,\nPULITZER PRIZE WINNER “I KNOW REAL NIGGAS\nHAPPEN TO LOVE IT” —SHAWN CARTER\n(BILLIONAIRE) ♪ I was dreaming When I wrote t

In [41]:
# Apply a first round of text cleaning techniques
import re
import string


def clean_text_round1(text):
    """Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers."""
    text = text.lower()
    # Raw strings keep the regex backslashes from being read as string escapes
    text = re.sub(r"\[.*?\]", "", text)
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    text = re.sub(r"\w*\d\w*", "", text)
    return text


round1 = clean_text_round1

In [42]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.transcript.apply(round1))
data_clean

Unnamed: 0,transcript
age_spin,this is dave he tells dirty jokes for a living that stare is where most of his hard work happens it signifies a profound train of thought the alch...
deep_texas,he’s in the trance he isn’t thinking of jokes though he’s composing the voiceover i’m saying to you right now getting me to agree to this was bey...
equanimity,“equanimity” was shot in washington dc and it covers the material that chappelle developed in his monthlong stint at radio city music hall in ...
killin_softly,wooo ya’ll gone make me lose my mind up in here up in here y’all gone make me throw her out up in here up in here y’all gone make me act a fool up...
stick_stones,sticks stones is dave chappelle’s fifth netflix special\nin the promotional trailer morgan freeman narrates as chappelle swaggers across a salt f...
the_bird_revelation,recorded at the comedy store in los angeles in november sometimes the funniest thing to say is mean you know what i mean it’s a tough position t...
worth,why’d you pick san francisco to shoot your special this is one of the best towns that ever knew comedy and this is the most historic venue you got...
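
The bracket-stripping pattern in `clean_text_round1` relies on the non-greedy `.*?`. A short sketch with a made-up line shows why this matters: a greedy `.*` would swallow everything between the first `[` and the last `]`, including the speech in between.

```python
import re

# Hypothetical line with two speaker tags
line = "[Morgan Freeman] He's in the trance. [Dave Chappelle] Thank you!"

non_greedy = re.sub(r"\[.*?\]", "", line)  # removes each tag separately
greedy = re.sub(r"\[.*\]", "", line)       # removes from the first "[" to the last "]"

print(repr(non_greedy))  # " He's in the trance.  Thank you!"
print(repr(greedy))      # ' Thank you!'
```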


In [43]:
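# Apply a second round of cleaning
def clean_text_round2(text):
    """Get rid of some additional punctuation and nonsensical text that was missed the first time around."""
    text = re.sub("[‘’“”…]", "", text)
    # Note: newlines are deleted outright, so the words on either side are joined
    # (e.g. "special\nin" becomes "specialin"); substituting " " would keep them apart
    text = re.sub("\n", "", text)
    return text


round2 = clean_text_round2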

In [44]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean

Unnamed: 0,transcript
age_spin,this is dave he tells dirty jokes for a living that stare is where most of his hard work happens it signifies a profound train of thought the alch...
deep_texas,hes in the trance he isnt thinking of jokes though hes composing the voiceover im saying to you right now getting me to agree to this was beyond ...
equanimity,equanimity was shot in washington dc and it covers the material that chappelle developed in his monthlong stint at radio city music hall in ♪ ...
killin_softly,wooo yall gone make me lose my mind up in here up in here yall gone make me throw her out up in here up in here yall gone make me act a fool up in...
stick_stones,sticks stones is dave chappelles fifth netflix specialin the promotional trailer morgan freeman narrates as chappelle swaggers across a salt flat...
the_bird_revelation,recorded at the comedy store in los angeles in november sometimes the funniest thing to say is mean you know what i mean its a tough position to...
worth,whyd you pick san francisco to shoot your special this is one of the best towns that ever knew comedy and this is the most historic venue you got ...
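
One detail in the output above is worth noticing: the `stick_stones` row now contains `specialin`, because `clean_text_round2` deletes each `\n` outright and the words on either side run together. A possible alternative, shown here on a short made-up string rather than the real transcripts, is to replace every run of whitespace with a single space.

```python
import re

# Hypothetical snippet with a newline between two words
text = "fifth netflix special\nin the promotional trailer"

joined = re.sub("\n", "", text)             # what round 2 does: the words merge
spaced = re.sub(r"\s+", " ", text).strip()  # alternative: keep a word boundary

print(joined)  # fifth netflix specialin the promotional trailer
print(spaced)  # fifth netflix special in the promotional trailer
```

Switching to the second form would change the vocabulary slightly, so any downstream counts would need to be regenerated.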


## Organizing The Data

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

### Corpus

We already created a corpus in an earlier step. The definition of a corpus is a collection of texts, and they are all put together neatly in a pandas dataframe here.

In [45]:
# Let's take a look at our dataframe
data_df

Unnamed: 0,transcript
age_spin,"This is Dave. He tells dirty jokes for a living. That stare is where most of his hard work happens. It signifies a profound train of thought, the ..."
deep_texas,"[Morgan Freeman] He’s in the trance. He isn’t thinking of jokes, though. He’s composing the voiceover I’m saying to you right now. Getting me to a..."
equanimity,"“Equanimity” was shot in Washington, D.C., and it covers the material that Chappelle developed in his monthlong stint at Radio City Music Hall in ..."
killin_softly,Wooo! Ya’ll gone make me lose my mind. Up in here! Up in here! Y’all gone make me throw her out. Up in here! Up in here! Y’all gone make me act a ...
stick_stones,Sticks & Stones is Dave Chappelle’s fifth Netflix special.\nIn the promotional trailer Morgan Freeman narrates as Chappelle swaggers across a salt...
the_bird_revelation,"Recorded at the Comedy Store in Los Angeles in November 2017 [Dave Chappelle] Sometimes, the funniest thing to say is mean. You know what I mean? ..."
worth,Why’d you pick San Francisco to shoot your special? This is one of the best towns that ever knew comedy. And this is the most historic venue you g...


In [46]:
# Let's add a titles column as well (for now, the same short names as the index, in sorted order)
titles = [
    "age_spin",
    "deep_texas",
    "equanimity",
    "killin_softly",
    "stick_stones",
    "the_bird_revelation",
    "worth",
]

data_df["titles"] = titles
data_df

Unnamed: 0,transcript,titles
age_spin,"This is Dave. He tells dirty jokes for a living. That stare is where most of his hard work happens. It signifies a profound train of thought, the ...",age_spin
deep_texas,"[Morgan Freeman] He’s in the trance. He isn’t thinking of jokes, though. He’s composing the voiceover I’m saying to you right now. Getting me to a...",deep_texas
equanimity,"“Equanimity” was shot in Washington, D.C., and it covers the material that Chappelle developed in his monthlong stint at Radio City Music Hall in ...",equanimity
killin_softly,Wooo! Ya’ll gone make me lose my mind. Up in here! Up in here! Y’all gone make me throw her out. Up in here! Up in here! Y’all gone make me act a ...,killin_softly
stick_stones,Sticks & Stones is Dave Chappelle’s fifth Netflix special.\nIn the promotional trailer Morgan Freeman narrates as Chappelle swaggers across a salt...,stick_stones
the_bird_revelation,"Recorded at the Comedy Store in Los Angeles in November 2017 [Dave Chappelle] Sometimes, the funniest thing to say is mean. You know what I mean? ...",the_bird_revelation
worth,Why’d you pick San Francisco to shoot your special? This is one of the best towns that ever knew comedy. And this is the most historic venue you g...,worth


In [47]:
# Let's pickle it for later use
data_df.to_pickle("corpus.pkl")

### Document-Term Matrix

For many of the techniques we'll be using in future notebooks, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common words, such as 'a' and 'the', that add little meaning to text.
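
To show what a document-term matrix contains, here is a plain-Python sketch that builds one by hand for two made-up documents. It is not how CountVectorizer is implemented, and its tokenization and stop-word list differ; `toy_corpus` and `TOY_STOP_WORDS` are assumptions for the example.

```python
from collections import Counter

# Hypothetical two-document corpus, already cleaned
toy_corpus = {
    "doc_a": "dave tells jokes dave laughs",
    "doc_b": "the crowd laughs",
}
TOY_STOP_WORDS = {"the"}

# Count the words in each document, skipping stop words
counts = {
    name: Counter(w for w in text.split() if w not in TOY_STOP_WORDS)
    for name, text in toy_corpus.items()
}

# The columns are the sorted set of all words seen in any document
vocabulary = sorted(set().union(*counts.values()))

# One row per document, one count per vocabulary word (0 if absent)
dtm = {name: [c[word] for word in vocabulary] for name, c in counts.items()}

print(vocabulary)  # ['crowd', 'dave', 'jokes', 'laughs', 'tells']
for name, row in dtm.items():
    print(name, row)
# doc_a [0, 2, 1, 1, 1]
# doc_b [1, 0, 0, 1, 0]
```

Each row is a document and each column is a word, which is the same layout as the DataFrame built in the next cell.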

In [48]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words="english")
data_cv = cv.fit_transform(data_clean.transcript)
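# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())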
data_dtm.index = data_clean.index
data_dtm

Unnamed: 0,aaaah,aaah,aah,aand,abandon,able,ablebodied,abomination,abortion,abortions,...,youtube,youve,yuck,zeke,zero,zigzag,zigzagging,zip,zippitydoodah,zone
age_spin,0,1,0,0,0,0,0,0,0,1,...,0,5,0,0,0,0,0,0,0,0
deep_texas,0,0,0,0,0,0,0,0,0,0,...,2,2,0,0,1,0,0,0,0,1
equanimity,0,0,0,0,0,1,0,0,0,0,...,0,1,2,1,0,0,0,0,0,0
killin_softly,0,0,0,0,0,1,0,0,0,0,...,0,2,0,0,0,0,1,2,1,0
stick_stones,1,0,2,2,1,1,1,1,1,0,...,0,2,0,0,0,1,0,0,0,0
the_bird_revelation,0,0,0,0,0,1,0,0,0,0,...,0,1,4,0,0,0,0,0,0,0
worth,0,0,0,0,0,1,0,0,0,0,...,0,2,0,0,0,0,0,0,0,0


In [49]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")

In [50]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle("data_clean.pkl")
# Use a context manager so the file is closed once the pickle is written
with open("cv.pkl", "wb") as file:
    pickle.dump(cv, file)
