# Data Cleaning

## Introduction

This notebook walks through a necessary step of any data science project: data cleaning. Data cleaning is time-consuming and rarely enjoyable, yet it's a very important task. Keep in mind "garbage in, garbage out": feeding dirty data into a model will give us meaningless results.

Specifically, we'll be walking through:

1. **Getting the data** - in this case, we'll be scraping data from a website
2. **Cleaning the data** - we will walk through popular text pre-processing techniques
3. **Organizing the data** - we will organize the cleaned data in a way that is easy to input into other algorithms

The output of this notebook will be clean, organized data in two standard text formats:

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format
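To make the two formats concrete, here is a minimal sketch that uses only the standard library. The two documents are made-up placeholders, not real transcripts, and the names `corpus`, `vocab`, and `dtm` are only illustrative.

```python
from collections import Counter

# Hypothetical two-document corpus (placeholder text, not real transcripts)
corpus = {
    "doc_a": "the cat sat on the mat",
    "doc_b": "the dog sat",
}

# Count the words in each document
counts = {name: Counter(text.split()) for name, text in corpus.items()}

# Shared, sorted vocabulary across all documents
vocab = sorted(set().union(*counts.values()))

# Document-term matrix: one row per document, one column per vocabulary word
dtm = {name: [c[word] for word in vocab] for name, c in counts.items()}

print(vocab)
for name, row in dtm.items():
    print(name, row)
```

Each row holds the word counts for one document, in the column order given by `vocab`. A word that never appears in a document simply gets a count of 0.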

## Problem Statement

As a reminder, our goal is to look at transcripts of various comedians and note their similarities and differences. Specifically, I'd like to know if Ali Wong's comedy style differs from that of other comedians, since she's the comedian who got me interested in stand-up comedy.

## Getting The Data

Luckily, there are wonderful people online who keep track of stand-up routine transcripts. [Scraps From The Loft](http://scrapsfromtheloft.com) makes them available for non-profit and educational purposes.

To decide which comedians to look into, I went on IMDb and looked specifically at comedy specials that were released in the past 5 years. To narrow it down further, I looked only at those with a rating above 7.5/10 and more than 2000 votes. If a comedian had multiple specials that fit those requirements, I picked the most highly rated one. I ended up with a dozen comedy specials.
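I did this selection by hand, but the same rules can be sketched in a few lines of code. Everything in the example is hypothetical: the records, the field names, the `select_specials` helper, and the reference year of 2018 are placeholders for illustration only.

```python
# Hypothetical records; names, ratings, and vote counts are made-up placeholders
specials = [
    {"comedian": "Comedian A", "title": "Special 1", "year": 2016, "rating": 8.1, "votes": 5200},
    {"comedian": "Comedian A", "title": "Special 2", "year": 2017, "rating": 7.8, "votes": 3100},
    {"comedian": "Comedian B", "title": "Special 3", "year": 2015, "rating": 7.2, "votes": 9000},
    {"comedian": "Comedian C", "title": "Special 4", "year": 2009, "rating": 8.5, "votes": 12000},
    {"comedian": "Comedian D", "title": "Special 5", "year": 2017, "rating": 7.9, "votes": 1500},
]


def select_specials(records, current_year, min_rating=7.5, min_votes=2000, max_age=5):
    '''Keep recent, well-rated, well-voted specials; one (the best rated) per comedian.'''
    best = {}
    for record in records:
        recent = current_year - record["year"] <= max_age
        if recent and record["rating"] > min_rating and record["votes"] > min_votes:
            current = best.get(record["comedian"])
            # Keep only the most highly rated special for each comedian
            if current is None or record["rating"] > current["rating"]:
                best[record["comedian"]] = record
    return list(best.values())


# current_year=2018 is an assumed reference year for this sketch
selected = select_specials(specials, current_year=2018)
print([record["title"] for record in selected])
```

With these placeholder records, only "Special 1" survives: the others are either too old, rated too low, have too few votes, or lose to a better-rated special by the same comedian.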

In [1]:
# Web scraping, pickle imports
import requests
import pickle
import re
import string
import time

In [2]:
# # Load the raw text file (plain text, not a pickle)
# with open("funhouse.csv", "r", encoding='utf-8') as file:
#     corpus = file.read()

In [3]:
# # Write each comma-separated phrase on its own line
# with open('phrases.csv', 'w', encoding='utf-8') as file:
#     for word in corpus.split(','):
#         file.write(word)
#         file.write('\n')

In [4]:

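def clean(text):
    '''Make text lowercase; remove text in square brackets, punctuation, words containing digits, extra whitespace, and non-ASCII characters.'''
    text = text.lower()
    # Remove anything inside square brackets, e.g. [laughter]
    text = re.sub(r'\[.*?\]', '', text)
    # Remove ASCII punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Remove words that contain digits
    text = re.sub(r'\w*\d\w*', '', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    #text = deEmojify(text)
    # Remove non-ASCII characters (emoji, curly quotes, accented letters)
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    return text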
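To see what each of these steps does, here is a small worked example on a made-up sentence. The `clean_sketch` function is a stand-in that mirrors the steps described in the docstring and the code above; the sample text is invented for illustration.

```python
import re
import string


def clean_sketch(text):
    '''Stand-in for the cleaning steps above, applied in the same order.'''
    text = text.lower()                                              # lowercase
    text = re.sub(r'\[.*?\]', '', text)                              # drop [bracketed] text
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # drop ASCII punctuation
    text = re.sub(r'\w*\d\w*', '', text)                             # drop words containing digits
    text = re.sub(r'\s+', ' ', text)                                 # collapse whitespace
    text = re.sub(r'[^\x00-\x7F]+', '', text)                        # drop non-ASCII characters
    return text


# Made-up sample sentence
sample = "So [laughs] I was 25, and... WOW! Café time."
print(clean_sketch(sample))
# → so i was and wow caf time
```

Notice that "café" becomes "caf": stripping non-ASCII characters removes emoji and curly quotes, but it also cuts accented letters out of the middle of words.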



In [5]:
stime = time.time()
with open('round1.txt', 'r', encoding='utf-8') as file, \
        open('round1_clean', 'w', encoding='utf-8') as output:
    # Stream the file line by line instead of loading it all with readlines()
    for line in file:
        output.write(clean(line))
        output.write('\n')

print(time.time() - stime)

0.8853309154510498


In [6]:
stime = time.time()
with open('phrases.csv', 'r', encoding='utf-8') as file, \
        open('cleanPhrase.csv', 'w', encoding='utf-8') as output:
    # Stream the file line by line instead of loading it all with readlines()
    for line in file:
        output.write(clean(line))
        output.write('\n')

print(time.time() - stime)

3.4999806880950928
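The two cells above run the same read-clean-write loop on different files, so the loop could be factored into one helper. This is only a sketch: `clean_file` is a hypothetical name, and the demo uses a throwaway temporary file with `str.strip` standing in for the cleaning function.

```python
import os
import tempfile
import time


def clean_file(src_path, dst_path, cleaner):
    '''Apply cleaner to each line of src_path and write one cleaned line per row to dst_path.

    Returns (lines_written, elapsed_seconds).
    '''
    start = time.perf_counter()
    count = 0
    with open(src_path, 'r', encoding='utf-8') as src, \
            open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            dst.write(cleaner(line))
            dst.write('\n')
            count += 1
    return count, time.perf_counter() - start


# Demo on a throwaway file; str.strip is a stand-in for the real cleaning function
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'in.txt')
    dst = os.path.join(tmp, 'out.txt')
    with open(src, 'w', encoding='utf-8') as f:
        f.write('  first line \nsecond line\n')
    n_lines, elapsed = clean_file(src, dst, str.strip)
    with open(dst, encoding='utf-8') as f:
        result = f.read()

print(n_lines, repr(result))
```

In this notebook the call would look something like `clean_file('round1.txt', 'round1_clean', clean)`, which keeps the timing and the file handling in one place.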


In [7]:
import pandas as pd
df = pd.read_csv('cleanPhrase.csv')
df.to_csv('output.csv', index=False)



In [23]:
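from sklearn.feature_extraction.text import CountVectorizer

stime = time.time()

cv = CountVectorizer()
# Read the lines into a list inside a `with` block so the vectorizer never
# receives a closed (or already consumed) file handle
with open('r1.txt', encoding='utf-8') as file:
    corpus = file.read().splitlines()
data_cv = cv.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out()).transpose()
data_dtm.to_csv('funhouse_freq.csv')

print(time.time() - stime)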
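A file handle can only be read while it is open. Reading from one that has already been closed raises `ValueError: I/O operation on closed file.`, which is one common way to run into that message when a handle is reused across notebook cells. Here is a minimal sketch of the failure and of a safer pattern, using a throwaway temporary file rather than the notebook's data.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('first line\nsecond line\n')

    # Failure mode: the handle is closed before anything reads from it
    handle = open(path, encoding='utf-8')
    handle.close()
    try:
        list(handle)
        error_message = None
    except ValueError as exc:
        error_message = str(exc)

    # Safer pattern: read everything into a list while the file is still open
    with open(path, encoding='utf-8') as f:
        lines = f.read().splitlines()

print(error_message)
print(lines)
```

A plain list of strings can be passed around and iterated as many times as needed, so it is a safer input for something like `fit_transform` than a live file object.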

## Cleaning The Data

In [None]:
# Let's take a look at our data again
next(iter(data.keys()))

In [None]:
# Notice that our dictionary is currently in key: comedian, value: list of text format
next(iter(data.values()))

In [None]:
# We are going to change this to key: comedian, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

In [None]:
# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}
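Here is what the combining step does on a tiny example. The `data` dictionary in this sketch is a made-up stand-in with placeholder keys and text, not the real transcripts.

```python
# Hypothetical stand-in for the real data: key is a comedian, value is a list of text chunks
data = {
    "comedian_a": ["First line.", "Second line."],
    "comedian_b": ["Only line."],
}


def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    return ' '.join(list_of_text)


# Each value becomes a one-element list holding the full transcript as a single string
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}
print(data_combined)
```

The combined string is wrapped in a one-element list, presumably so that each comedian maps cleanly to one row when the dictionary is later loaded into a table.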

In [None]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    # Remove curly quotes and ellipsis characters
    text = re.sub('[‘’“”…]', '', text)
    # Remove newline characters
    text = re.sub('\n', '', text)
    return text

# The function can be used directly; no lambda wrapper is needed
round2 = clean_text_round2
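A quick check of the second round on a made-up string. `string.punctuation` only covers ASCII characters, so a punctuation pattern built from it does not match curly quotes or the ellipsis character; this round targets them explicitly. The sample text is invented for illustration.

```python
import re


def clean_text_round2(text):
    '''Remove curly quotes, ellipsis characters, and newlines.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text


# Made-up sample with curly quotes, an ellipsis character, and a newline
sample = "she said “hello”…\nthat’s it"
cleaned = clean_text_round2(sample)
print(cleaned)
# → she said hellothats it
```

Notice that "hello" and "thats" get fused, because the newline is replaced with an empty string. If the text still contains newlines between words at this point, replacing them with a space may be the better choice.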