# Data Cleaning

Data cleaning is a time-consuming and often tedious task, yet it is a very important one. Keep in mind: "garbage in, garbage out".

#### Feeding dirty data into a model will give us results that are meaningless.

### Objective:

1. Getting the data
2. Cleaning the data
3. Organizing the data - organize the cleaned data in a way that is easy to input into other algorithms

### Output:
#### Cleaned and organized data in two standard text formats:

1. Corpus - a collection of text
2. Document-Term Matrix - word counts in matrix format

## Problem Statement

Look at the transcripts of various comedians, note their similarities and differences, and determine whether the stand-up comedian of your choice has a comedy style that differs from that of the other comedians.


## Getting The Data

You can get the transcripts of several comedians from [Scraps From The Loft](http://scrapsfromtheloft.com).

You can use IMDb ratings to narrow the selection to the 10 or 20 highest-rated comedians.






### For example:

In [1]:
# Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle



# URLs of transcripts in scope
urls = ['http://scrapsfromtheloft.com/2017/05/06/louis-ck-oh-my-god-full-transcript/',
        'http://scrapsfromtheloft.com/2017/04/11/dave-chappelle-age-spin-2017-full-transcript/',
        'http://scrapsfromtheloft.com/2018/03/15/ricky-gervais-humanity-transcript/',
        'https://scrapsfromtheloft.com/comedy/bo-burnham-make-happy-2016-full-transcript/',
        'http://scrapsfromtheloft.com/2017/05/24/bill-burr-im-sorry-feel-way-2014-full-transcript/',
        'http://scrapsfromtheloft.com/2017/04/21/jim-jefferies-bare-2014-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/02/john-mulaney-comeback-kid-2015-full-transcript/',
        'https://scrapsfromtheloft.com/comedy/brazil-corruption-amazon-hasan-minhaj/',
        'http://scrapsfromtheloft.com/2017/09/19/ali-wong-baby-cobra-2016-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/03/anthony-jeselnik-thoughts-prayers-2015-full-transcript/',
        'http://scrapsfromtheloft.com/2018/03/03/mike-birbiglia-my-girlfriends-boyfriend-2013-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/19/joe-rogan-triggered-2016-full-transcript/',
        'https://scrapsfromtheloft.com/comedy/jamie-foxx-what-had-happened-was-transcript/',
        'https://scrapsfromtheloft.com/comedy/deon-cole-cole-hearted-transcript/',
        'https://scrapsfromtheloft.com/comedy/fern-brady-power-chaos-transcript/',
        'https://scrapsfromtheloft.com/comedy/chad-daniels-dad-chaniels-transcript/',
        'https://scrapsfromtheloft.com/comedy/kevin-hart-gun-compartment-transcript/',
        'https://scrapsfromtheloft.com/comedy/sam-morril-youve-changed-transcript/',
        'https://scrapsfromtheloft.com/movies/marlon-wayans-good-grief-transcript/',
        'https://scrapsfromtheloft.com/comedy/tig-notaro-hello-again-transcript/',
       ]

# Scrapes transcript data from scrapsfromtheloft.com
def url_to_transcript(url):
    '''Extracts transcript text from scrapsfromtheloft.com.'''
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")

    # Find divs with transcript content
    containers = soup.find_all(class_="elementor-widget-container")
    
    text = []
    for container in containers:
        for elem in container.find_all(["p"]):  
            cleaned_text = elem.get_text(strip=True)  # Remove excess whitespace
            if cleaned_text:  # Only add non-empty text
                text.append(cleaned_text)
    
    print(f"Scraped: {url} - {len(text)} lines extracted")  # Debugging output
    return text

# Test on one URL
test_transcript = url_to_transcript(urls[0])
print(test_transcript[:10])  # Print first 10 lines

# Comedian names
comedians = ['louis', 'dave', 'ricky', 'bo', 'bill', 'jim', 'john', 'hasan', 'ali', 'anthony', 'mike', 'joe','jamie','deon','fern','chad',
             'kevin','sam','marlon','tig']

Scraped: http://scrapsfromtheloft.com/2017/05/06/louis-ck-oh-my-god-full-transcript/ - 30 lines extracted
['IntroFade the music out. Let’s roll. Hold there. Lights. Do the lights. Thank you. Thank you very much. I appreciate that. I don’t necessarily agree with you, but I appreciate very much. Well, this is a nice place. This is easily the nicest place For many miles in every direction. That’s how you compliment a building And shit on a town with one sentence. It is odd around here, as I was driving here. There doesn’t seem to be any difference Between the sidewalk and the street for pedestrians here. People just kind of walk in the middle of the road. I love traveling And seeing all the different parts of the country. I live in New York. I live in a– There’s no value to your doing that at all.', '“The Old Lady And The Dog”I live– I live in New York. I always– Like, there’s this old lady in my neighborhood, And she’s always walking her dog. She’s always just– she’s very old. She just s

In [5]:
# Actually request transcripts (takes a few minutes to run)
transcripts = [url_to_transcript(u) for u in urls]

Scraped: http://scrapsfromtheloft.com/2017/05/06/louis-ck-oh-my-god-full-transcript/ - 30 lines extracted
Scraped: http://scrapsfromtheloft.com/2017/04/11/dave-chappelle-age-spin-2017-full-transcript/ - 60 lines extracted
Scraped: http://scrapsfromtheloft.com/2018/03/15/ricky-gervais-humanity-transcript/ - 34 lines extracted
Scraped: https://scrapsfromtheloft.com/comedy/bo-burnham-make-happy-2016-full-transcript/ - 67 lines extracted
Scraped: http://scrapsfromtheloft.com/2017/05/24/bill-burr-im-sorry-feel-way-2014-full-transcript/ - 12 lines extracted
Scraped: http://scrapsfromtheloft.com/2017/04/21/jim-jefferies-bare-2014-full-transcript/ - 22 lines extracted
Scraped: http://scrapsfromtheloft.com/2017/08/02/john-mulaney-comeback-kid-2015-full-transcript/ - 43 lines extracted
Scraped: https://scrapsfromtheloft.com/comedy/brazil-corruption-amazon-hasan-minhaj/ - 104 lines extracted
Scraped: http://scrapsfromtheloft.com/2017/09/19/ali-wong-baby-cobra-2016-full-transcript/ - 49 lines extr

In [7]:
# Pickle files for later use

# Make a new directory to hold the pickled transcripts (saved with a .txt extension)
!mkdir transcripts

for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)
# print(transcripts[3])  # Print the fourth transcript


A subdirectory or file transcripts already exists.


In [2]:
# Load pickled files
data = {}
for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)

In [3]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['louis', 'dave', 'ricky', 'bo', 'bill', 'jim', 'john', 'hasan', 'ali', 'anthony', 'mike', 'joe', 'jamie', 'deon', 'fern', 'chad', 'kevin', 'sam', 'marlon', 'tig'])

In [4]:
# More checks
data['bill'][:2]

['[cheers and applause] All right, thank you! Thank you very much! Thank you. Thank you. Thank you. How are you? What’s going on? Thank you. It’s a pleasure to be here in the greater Atlanta, Georgia, area, this oasis. It’s nice to be here. I don’t know why I came here in June. It’s nice to be here. Wasn’t thinking. Fucking ridiculously hot out there. Just miserable, horrible. That kind of heat, you understand the racism down here, ya know? I get it. How would you get along with anybody? “Look at ’em just over there, drinkin’ a cold drink! Lemonade was made for the white man!” So… What the hell have I been doing with my life? Trying to get in shape, man. But I hate going to the gym, so I decided I’d go veggie twice a week. It’s brutal. I can only make it till about 5:00. Five o’clock, that’s what I realized about myself, you know that? Something has to die every day in order for me to live. Something’s got to get its beak chopped off, its feathers yanked, uppercut to its jaw, just in o

## Cleaning The Data

When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

With text data, this cleaning process can go on forever. There's always an exception to every cleaning step. So, we're going to follow the MVP (minimum viable product) approach - start simple and iterate.

### Assignment:

1. Perform the following data cleaning on the transcripts:
   i) Make all text lowercase
   ii) Remove punctuation
   iii) Remove numerical values
   iv) Remove common nonsensical text (such as the newline character `\n`)
   v) Tokenize the text
   vi) Remove stop words
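
As a rough illustration of the six steps listed above, here is a minimal sketch that applies them to a single string. The function name `basic_clean` and the tiny `STOP_WORDS` set are assumptions made up for this example; they are not part of the notebook, and a real stop-word list would be much longer.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "to", "of", "i"}

def basic_clean(text):
    """Apply steps i) to vi) to one string and return a list of tokens."""
    text = text.lower()                                               # i) lowercase
    text = re.sub(r"[%s]" % re.escape(string.punctuation), "", text)  # ii) punctuation
    text = re.sub(r"\d+", "", text)                                   # iii) numbers
    text = text.replace("\n", " ")                                    # iv) newlines
    tokens = text.split()                                             # v) tokenize
    return [t for t in tokens if t not in STOP_WORDS]                 # vi) stop words

sample = "I turned 33 this year.\nThank you, five people!"
print(basic_clean(sample))
```

The cells that follow take a slightly different route: they clean the text with regular expressions first and leave tokenization and stop-word removal to `CountVectorizer`.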

In [6]:
# Let's take a look at our data again
next(iter(data.keys()))

'louis'

In [7]:
# Notice that our dictionary is currently in key: comedian, value: list of text format
next(iter(data.values()))

['IntroFade the music out. Let’s roll. Hold there. Lights. Do the lights. Thank you. Thank you very much. I appreciate that. I don’t necessarily agree with you, but I appreciate very much. Well, this is a nice place. This is easily the nicest place For many miles in every direction. That’s how you compliment a building And shit on a town with one sentence. It is odd around here, as I was driving here. There doesn’t seem to be any difference Between the sidewalk and the street for pedestrians here. People just kind of walk in the middle of the road. I love traveling And seeing all the different parts of the country. I live in New York. I live in a– There’s no value to your doing that at all.',
 '“The Old Lady And The Dog”I live– I live in New York. I always– Like, there’s this old lady in my neighborhood, And she’s always walking her dog. She’s always just– she’s very old. She just stands there just being old, And the dog just fights gravity every day, just– The two of them, it’s really

In [8]:
# We are going to change this to key: comedian, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

In [9]:
# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

In [10]:
# We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd
pd.set_option('display.max_colwidth', 150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
data_df = data_df.sort_index()
data_df

| | transcript |
|---|---|
| ali | Ladies and gentlemen, please welcome to the stage:Ali Wong! Hi. Hello! Welcome! Thank you! Thank you for coming. Hello! Hello. We are gonna have t... |
| anthony | Thank you. Thank you. Thank you,San Francisco. Thank you so much. So good to be here. People were surprised when I told ’em I was gonna tape my sp... |
| bill | [cheers and applause] All right, thank you! Thank you very much! Thank you. Thank you. Thank you. How are you? What’s going on? Thank you. It’s a ... |
| bo | [woman on TV] That has been, really, a difficult thing for me. My mother has always been a very difficult person all her life. Very unhappy. I can... |
| chad | Guys, I’m telling you, this is it. This is, this is… I’ve tried everything, okay? This is my shining moment as a father for you. I’ve tried wrappi... |
| dave | This is Dave.He tells dirty jokes for a living.That stare is where most of his hard work happens.It signifies a profound train of thought,the alch... |
| deon | Deon Cole: Cole Hearted (2019)Genre:Stand-up Comedy, TV SpecialDirector:Ryan PolitoWriter:Deon ColeStar:Deon Cole Plot:In this stand-up special, D... |
| fern | [electricity buzzes] Ladies and gentlemen, welcome to the stage it’s Fern Brady. [audience cheers] Hello. [audience cheers] Aw. This is so excitin... |
| hasan | On this episode of Patriot Act, Hasan breaks down of the growing threats to Brazil’s Amazon rainforest. After a long and welcome decline, Brazil’s... |
| jamie | [siren wailing] [somber music playing] [Harvey Levin] Jamie Foxx suffered a very serious medical emergency. Bad enough that family members flew in... |


In [16]:
# Let's take a look at the transcript for Ali Wong
data_df.transcript.loc['ali']

"Ladies and gentlemen, please welcome to the stage:Ali Wong! Hi. Hello! Welcome! Thank you! Thank you for coming. Hello! Hello. We are gonna have to get this shit over with, ’cause I have to pee in, like, ten minutes. But thank you, everybody, so much for coming. Um… It’s a very exciting day for me. It’s been a very exciting year for me. I turned 33 this year. Yes! Thank you, five people. I appreciate that. Uh, I can tell that I’m getting older, because, now, when I see an18-year-old girl, my automatic thought… is “Fuck you.” “Fuck you. I don’t even know you, but fuck you!” ‘Cause I’m straight up jealous. I’m jealous, first and foremost, of theirmetabolism. Because 18-year-old girls, they could just eat like shit, and then they take a shit and have a six-pack, right? They got that-that beautifulinner thigh clearancewhere they put their feet together and there’s that huge gap here with the light of potential just radiating through. And then, when they go to sleep, they just go to sleep.

In [11]:
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation, and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)  # Use raw string (r"...") to avoid escape sequence warning
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # Remove punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # Remove words containing numbers
    return text

round1 = lambda x: clean_text_round1(x)  # Function to apply cleaning


In [13]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.transcript.apply(round1))
print(data_clean.shape)
# data_clean.transcript.loc['jim']

(20, 1)


In [14]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

round2 = lambda x: clean_text_round2(x)
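
The second round is needed because `string.punctuation` only contains ASCII punctuation, so typographic quotes and ellipses survive the first round. The short sketch below shows this on a made-up sample string; the variable names are illustrative only.

```python
import re
import string

sample = "It’s a pleasure… “thank you!”"

# Round 1 style: strip ASCII punctuation only.
after_round1 = re.sub(r"[%s]" % re.escape(string.punctuation), "", sample)

# Round 2 style: strip the typographic characters that were left behind.
after_round2 = re.sub("[‘’“”…]", "", after_round1)

print(after_round1)
print(after_round2)
```

Only the ASCII exclamation mark is removed in the first pass; the curly quotes, the apostrophe, and the ellipsis disappear in the second.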

In [15]:
# Apply a third round of cleaning
def clean_text_round3(text):
    '''Further cleans the text by removing non-ASCII characters, links, and unwanted sections.'''
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # Remove non-ASCII characters
    text = re.sub(r'[Tt]ranscripts?', '', text, count=1)  # Remove only the first "transcript(s)"
    text = re.sub(r'http\S+|www\S+', '', text)  # Remove links
    # Remove boilerplate noise that is present in all of the scraped data
    text = re.sub(r'your email address will not be published.*?to your inbox', '', text, flags=re.DOTALL)
    return text

round3 = lambda x: clean_text_round3(x)


In [16]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean = pd.DataFrame(data_clean.transcript.apply(round3))
print(data_clean.shape)
data_clean.transcript.loc['jim']


(20, 1)


'   ladies and gentlemen please welcome to the stage mr jim jefferies  hello sit down sit down sit down sit down sit down  thank you boston i appreciate that  uh thats very sweet of you  love you im at the end of the tour right now im very happy to be on tour because i now have a child ah so any time out of home is good for me um i got my girlfriend pregnant after knowing her for two months so  thank you thank you life decisions and shes a nice girl and i love her in a way sure my problem with my girlfriend is shes very sweet but shes shit at telling stories and im awesome at telling stories so it really bothers me when she talks and i dont know if thatll be a problem in the future but its a problem now and i dont see it getting better um ill give you an example right i was in the car and my son hank was asleep in the back seat and were driving along and on the radio comes madonna and my girlfriend just slips into conversation oh i used topartywith madonna and i went you fucking what w

In [22]:
data_clean.transcript.loc['ricky']

'hello hello how you doing great thank you wow calm down shut the fuck up thank you what a lovely welcome im gonna try my hardest tonight youre thinking relax weve had our moneys worth just seeing you what youre a legend shut up what is he im not a god im just an ordinary guy you know going round talking to people sort sort of like jesus in a way but better well ive actually turned up so thank you and welcome to my new showhumanity i dont know why i called it that im not a big fan i preferdogs obviously dogs are better people than people arent they theyre amazing dogs theyre our best friends they guard us they guide us theres medical detection dogs that can smell if youve got aids im not a doctor but their noses are a thousand times more sensitive than ours so they go cor youre well hiv fuck you know and you go you can smell aids on someone yeah why didnt you smell it on the bloke i brought home last night you fucking idiot they did the first three billion years by themselves evolution

## Organizing The Data

### Assignment:
1. Organize the data in two standard text formats:
   a) Corpus - a collection of texts, all put together neatly in a pandas dataframe here.
   b) Document-Term Matrix - word counts in matrix format

### Corpus: Example

A corpus is a collection of texts, and they are all put together neatly in a pandas dataframe here.

In [17]:
# Let's take a look at our dataframe
data_df

| | transcript |
|---|---|
| ali | Ladies and gentlemen, please welcome to the stage:Ali Wong! Hi. Hello! Welcome! Thank you! Thank you for coming. Hello! Hello. We are gonna have t... |
| anthony | Thank you. Thank you. Thank you,San Francisco. Thank you so much. So good to be here. People were surprised when I told ’em I was gonna tape my sp... |
| bill | [cheers and applause] All right, thank you! Thank you very much! Thank you. Thank you. Thank you. How are you? What’s going on? Thank you. It’s a ... |
| bo | [woman on TV] That has been, really, a difficult thing for me. My mother has always been a very difficult person all her life. Very unhappy. I can... |
| chad | Guys, I’m telling you, this is it. This is, this is… I’ve tried everything, okay? This is my shining moment as a father for you. I’ve tried wrappi... |
| dave | This is Dave.He tells dirty jokes for a living.That stare is where most of his hard work happens.It signifies a profound train of thought,the alch... |
| deon | Deon Cole: Cole Hearted (2019)Genre:Stand-up Comedy, TV SpecialDirector:Ryan PolitoWriter:Deon ColeStar:Deon Cole Plot:In this stand-up special, D... |
| fern | [electricity buzzes] Ladies and gentlemen, welcome to the stage it’s Fern Brady. [audience cheers] Hello. [audience cheers] Aw. This is so excitin... |
| hasan | On this episode of Patriot Act, Hasan breaks down of the growing threats to Brazil’s Amazon rainforest. After a long and welcome decline, Brazil’s... |
| jamie | [siren wailing] [somber music playing] [Harvey Levin] Jamie Foxx suffered a very serious medical emergency. Bad enough that family members flew in... |


In [18]:

# data_df.to_pickle("corpus.pkl")
data_clean

| | transcript |
|---|---|
| ali | ladies and gentlemen please welcome to the stageali wong hi hello welcome thank you thank you for coming hello hello we are gonna have to get this... |
| anthony | thank you thank you thank yousan francisco thank you so much so good to be here people were surprised when i told em i was gonna tape my special i... |
| bill | all right thank you thank you very much thank you thank you thank you how are you whats going on thank you its a pleasure to be here in the great... |
| bo | that has been really a difficult thing for me my mother has always been a very difficult person all her life very unhappy i can never remember my... |
| chad | guys im telling you this is it this is this is ive tried everything okay this is my shining moment as a father for you ive tried wrapping up a pan... |
| dave | this is davehe tells dirty jokes for a livingthat stare is where most of his hard work happensit signifies a profound train of thoughtthe alchemis... |
| deon | deon cole cole hearted comedy tv specialdirectorryan politowriterdeon colestardeon cole plotin this standup special deon cole embraces comedy as ... |
| fern | ladies and gentlemen welcome to the stage its fern brady hello aw this is so exciting one of the most exciting things about this is over the la... |
| hasan | on this episode of patriot act hasan breaks down of the growing threats to brazils amazon rainforest after a long and welcome decline brazils defo... |
| jamie | jamie foxx suffered a very serious medical emergency bad enough that family members flew in actor jamie foxx is asking for prayers as the actor... |


### Document-Term Matrix: Example

For many of the techniques we'll be using in future assignments, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break text down into words. We can do this using scikit-learn's `CountVectorizer`, where every row represents a different document and every column represents a different word.

In addition, with `CountVectorizer`, we can remove stop words. Stop words are common words that add little meaning to a text, such as 'a', 'the', etc.
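
To make the row-per-document, column-per-word layout concrete, here is a pure-Python sketch of a document-term matrix built with `collections.Counter`. It only illustrates the idea and is not how scikit-learn implements it; the toy documents and names such as `doc_a` are made up for this example. Note that `CountVectorizer`'s default settings also lowercase the text and ignore one-character tokens, which this sketch does not replicate.

```python
from collections import Counter

# Toy corpus; the document names and texts are made up for illustration.
docs = {
    "doc_a": "thank you thank you boston",
    "doc_b": "hello boston hello",
}

# Count the words in each document (simple whitespace tokenization).
counts = {name: Counter(text.split()) for name, text in docs.items()}

# The vocabulary is the sorted set of all words; it defines the columns.
vocabulary = sorted(set().union(*counts.values()))

# One row per document, one column per vocabulary word.
dtm = {name: [c[word] for word in vocabulary] for name, c in counts.items()}

print(vocabulary)
for name, row in dtm.items():
    print(name, row)
```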

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd  # Ensure pandas is imported

cv = CountVectorizer(
    stop_words='english',
    ngram_range=(1, 1),  # Unigrams only; ngram_range=(1, 2) would include both unigrams and bigrams
    min_df=2,  # Ignore words that appear in fewer than 2 transcripts
    max_df=0.75  # Ignore words that appear in more than 75 percent of the transcripts
)

data_cv = cv.fit_transform(data_clean.transcript)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())  # Use get_feature_names_out()
data_dtm.index = data_clean.index  # Ensure index alignment
data_dtm
# data_dtm.to_csv("analyse-bad-words.csv")


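| | abc | abe | ability | able | abortion | absolute | absolutely | abuse | accent | accents | ... | york | youd | youll | young | younger | youngest | youtube | zero | zombie | zoo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ali | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 4 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| anthony | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| bill | 0 | 0 | 0 | 1 | 0 | 1 | 3 | 0 | 0 | 0 | ... | 1 | 1 | 5 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| bo | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 2 | 6 | 6 | 0 | 0 | 0 | 3 | 0 | 1 |
| chad | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 1 | 0 | 0 |
| dave | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 7 | 3 | 10 | 0 | 0 | 0 | 0 | 0 | 0 |
| deon | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 5 | 4 | 14 | 0 | 0 | 1 | 0 | 0 | 0 |
| fern | 0 | 0 | 0 | 0 | 4 | 0 | 1 | 0 | 5 | 1 | ... | 0 | 2 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 0 |
| hasan | 1 | 0 | 0 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| jamie | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |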
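
The `min_df` and `max_df` arguments used above filter the vocabulary by document frequency, that is, by the number of documents a word appears in. An integer is read as a document count and a float as a fraction of all documents. The sketch below mimics that filter in plain Python on made-up toy documents; it is an illustration of the rule, not scikit-learn's code.

```python
# Toy tokenized documents, made up for illustration.
docs = [
    ["thank", "you", "boston"],
    ["thank", "you", "dogs"],
    ["thank", "you", "dogs", "zoo"],
    ["thank", "madonna"],
]

min_df = 2      # integer: minimum number of documents
max_df = 0.75   # float: maximum fraction of documents

# Document frequency: in how many documents does each word appear?
doc_freq = {}
for tokens in docs:
    for word in set(tokens):
        doc_freq[word] = doc_freq.get(word, 0) + 1

max_count = max_df * len(docs)  # 0.75 * 4 documents = 3 documents

# Keep words that are neither too rare nor too common.
kept = sorted(w for w, df in doc_freq.items() if min_df <= df <= max_count)
print(kept)
```

Here "thank" is dropped because it appears in every document, while "boston", "zoo", and "madonna" are dropped because each appears in only one. With 20 transcripts, `max_df=0.75` removes words found in more than 15 of them.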


In [23]:
# Check the cleaned data before pickling the document-term matrix for later use
print(data_clean.shape)
print(data_clean)
# data_dtm.to_pickle("dtm.pkl")

(20, 1)
                                                                                                                                                    transcript
ali      ladies and gentlemen please welcome to the stageali wong hi hello welcome thank you thank you for coming hello hello we are gonna have to get this...
anthony  thank you thank you thank yousan francisco thank you so much so good to be here people were surprised when i told em i was gonna tape my special i...
bill      all right thank you thank you very much thank you thank you thank you how are you whats going on thank you its a pleasure to be here in the great...
bo        that has been really a difficult thing for me my mother has always been a very difficult person all her life very unhappy i can never remember my...
chad     guys im telling you this is it this is this is ive tried everything okay this is my shining moment as a father for you ive tried wrapping up a pan...
dave     this is davehe tells dirty jo

In [None]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object h
data_clean.to_pickle('data_clean2.pkl')
with open("cv.pkl", "wb") as file:  # Close the file handle once the object is written
    pickle.dump(cv, file)

In [26]:
import pandas as pd

df = pd.read_pickle("data_clean2.pkl")
print(df.shape)
print(df.columns)
print(df.head())


(20, 1)
Index(['transcript'], dtype='object')
                                                                                                                                                    transcript
ali      ladies and gentlemen please welcome to the stageali wong hi hello welcome thank you thank you for coming hello hello we are gonna have to get this...
anthony  thank you thank you thank yousan francisco thank you so much so good to be here people were surprised when i told em i was gonna tape my special i...
bill      all right thank you thank you very much thank you thank you thank you how are you whats going on thank you its a pleasure to be here in the great...
bo        that has been really a difficult thing for me my mother has always been a very difficult person all her life very unhappy i can never remember my...
chad     guys im telling you this is it this is this is ive tried everything okay this is my shining moment as a father for you ive tried wrapping up a pan...
