## Data Cleaning

This notebook walks through three steps:

1. Getting the data - scraping review text from a website
2. Cleaning the data - applying common text pre-processing techniques
3. Organizing the data - arranging the cleaned data in a form that is easy to feed into other algorithms

The output of this notebook will be clean, organized data in two standard text formats:

1. Corpus - a collection of text
2. Document-Term Matrix - word counts in matrix format
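
To make the two formats concrete, here is a minimal sketch that uses only the standard library. The two example documents and the names `corpus`, `counts`, `vocabulary`, and `dtm` are illustrative assumptions, not data or variables from this notebook.

```python
from collections import Counter

# Hypothetical corpus: a mapping from document name to raw text
corpus = {
    "review_a": "the food was great and the service was great",
    "review_b": "the service was slow",
}

# Count words per document (naive whitespace tokenization)
counts = {name: Counter(text.split()) for name, text in corpus.items()}

# The vocabulary is the sorted set of all words seen in any document
vocabulary = sorted(set().union(*counts.values()))

# Document-term matrix: one row per document, one column per vocabulary word
dtm = {name: [c[word] for word in vocabulary] for name, c in counts.items()}

print(vocabulary)
for name, row in dtm.items():
    print(name, row)
```

Each row has the same length as the vocabulary, so documents of very different lengths become directly comparable vectors of word counts.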

## Problem Statement

Analyze transcripts to find valuable insights into what is being discussed.

Examples: running sentiment analysis on restaurant reviews, or topic modeling on bank call center transcripts.

In [1]:
# Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle

# Scrapes review text from an opentable.com page
def url_to_transcript(url):
    page = requests.get(url).text  # fetch the raw HTML of the page
    soup = BeautifulSoup(page, "lxml")  # parse the text as an HTML document
    text = [p.text for p in soup.find_all(class_="f3cNr0xDcwvMaBkpjQj5")]
    print(url)
    return text
# t9JcvSL3Bsj1lxMSi3pz h_kb2PFOoyZe1skyGiz9 Ti64w3n01MDTYZb59n6Q
# URLs of transcripts in scope
urls = ['https://www.opentable.com/r/piatti-san-antonio?corrid=4cec5420-b025-46d0-9c85-c7593b9a80fb&avt=eyJ2IjoyLCJtIjowLCJwIjowLCJzIjowLCJuIjowfQ&p=2&sd=2022-05-31T21%3A00%3A00']

# Short names for the source webpages (used as dictionary keys and file names)
webpage = ['opentable']

In [2]:
# Actually request transcripts (takes a few minutes to run)
transcripts = [url_to_transcript(u) for u in urls]

https://www.opentable.com/r/piatti-san-antonio?corrid=4cec5420-b025-46d0-9c85-c7593b9a80fb&avt=eyJ2IjoyLCJtIjowLCJwIjowLCJzIjowLCJuIjowfQ&p=2&sd=2022-05-31T21%3A00%3A00


In [3]:
# Pickle files for later use
import os

# Make a new directory to hold the pickled files (no error if it already exists)
os.makedirs("transcripts", exist_ok=True)

for i, c in enumerate(webpage):
    with open("transcripts/" + c + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)


In [4]:
# Load pickled files
data = {}
for i, c in enumerate(webpage):
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)
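
The files above carry a `.txt` extension but hold binary pickle data, so they cannot be opened as plain text. The sketch below repeats the same dump/load round trip in a temporary directory. The `.pkl` extension, the sample list, and the variable names are assumptions chosen for illustration.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for one scraped list of review paragraphs
sample_reviews = ["Very loud.", "Great pasta, slow service."]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "opentable.pkl"

    # Pickle is a binary format, so the file is opened in "wb" / "rb" mode
    with open(path, "wb") as f:
        pickle.dump(sample_reviews, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == sample_reviews)
```

Unpickling can execute arbitrary code, so only load pickle files that you created yourself or otherwise trust.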

In [5]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['opentable'])

In [6]:
# More checks
data['opentable']

["Very loud, then we ordered what was on the menu, but when the food arrived some of it was not included in our order... When we asked the why question,  we were told that corporate had changed the menu but they didn't have time to update the menu just yet, so those items were not included in what we just ordered... What?!  We insisted that we should receive what we had ordered, and what was shown on the menu, so they relented and brought the items, but of course it was later, more than half way through our meal.  The wait person told us we were not the only table to complain.  Also, there is a 3% surcharge added to your bill on top of everything else, for issues pertaining to staffing, etc... What?!\nSeems the restaurant is trying to make a point about the Byedan regime, but left a sour taste in our mouth.  The surcharge reference is clearly printed on the menu at the bottom, and they also make a point of stating it is NOT gratuity... Why wouldn't you just include the additional cost 

## Cleaning The Data

Common data cleaning steps on all text:

* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text (\n)
* Tokenize text
* Remove stop words
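
The six steps can be sketched end to end in plain Python. This is only an illustration: in this notebook, tokenization and stop word removal are left to `CountVectorizer` later on. The function name `clean_and_tokenize`, the tiny `STOP_WORDS` set, and the sample sentence are all assumptions.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"a", "is", "on", "the", "there", "to", "your"}

def clean_and_tokenize(text):
    """Apply the six cleaning steps in order and return a list of tokens."""
    text = text.lower()                                            # 1. lower case
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)  # 2. remove punctuation
    text = re.sub(r"\w*\d\w*", "", text)                           # 3. remove words containing digits
    text = text.replace("\n", " ")                                 # 4. replace newlines with spaces
    tokens = text.split()                                          # 5. tokenize on whitespace
    return [t for t in tokens if t not in STOP_WORDS]              # 6. remove stop words

# Hypothetical sample sentence
sample = "There is a 3% surcharge added to your bill.\nWhat?!"
print(clean_and_tokenize(sample))
# ['surcharge', 'added', 'bill', 'what']
```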

In [9]:
# Let's take a look at our data again
next(iter(data.keys()))

'opentable'

In [10]:
# Notice that our dictionary is currently in key: webpage, value: list of text format
next(iter(data.values()))

["Very loud, then we ordered what was on the menu, but when the food arrived some of it was not included in our order... When we asked the why question,  we were told that corporate had changed the menu but they didn't have time to update the menu just yet, so those items were not included in what we just ordered... What?!  We insisted that we should receive what we had ordered, and what was shown on the menu, so they relented and brought the items, but of course it was later, more than half way through our meal.  The wait person told us we were not the only table to complain.  Also, there is a 3% surcharge added to your bill on top of everything else, for issues pertaining to staffing, etc... What?!\nSeems the restaurant is trying to make a point about the Byedan regime, but left a sour taste in our mouth.  The surcharge reference is clearly printed on the menu at the bottom, and they also make a point of stating it is NOT gratuity... Why wouldn't you just include the additional cost 

In [11]:
def combine_text(list_of_text):
    # Takes a list of text and combines them into one large chunk of text.
    combined_text = ' '.join(list_of_text)
    return combined_text

In [12]:
# We are going to change this to key: webpage, value: string format
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

In [13]:
# put it into a pandas dataframe
import pandas as pd
pd.set_option('display.max_colwidth', 150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
data_df = data_df.sort_index()
data_df

|  | transcript |
|---|---|
| opentable | Very loud, then we ordered what was on the menu, but when the food arrived some of it was not included in our order... When we asked the why quest... |


In [14]:
# Let's take a look at the transcript for opentable
data_df.transcript.loc['opentable']



In [15]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    # Make text lowercase, remove punctuation and remove words containing numbers.
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)  # raw string avoids invalid-escape warnings for \w and \d
    return text

round1 = clean_text_round1

In [16]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.transcript.apply(round1))
data_clean

|  | transcript |
|---|---|
| opentable | very loud then we ordered what was on the menu but when the food arrived some of it was not included in our order when we asked the why question ... |


In [18]:
data_clean.transcript.loc['opentable']



In [19]:
# Apply a second round of cleaning
def clean_text_round2(text):
    # Get rid of some additional punctuation and non-sensical text that was missed the first time around.
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

round2 = clean_text_round2
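
One caveat with deleting `\n` outright: the words on either side of a line break get fused together, which is why a token such as `whatseems` shows up in the document-term matrix later in this notebook. Below is a minimal sketch of an alternative, assuming it is acceptable to substitute a space and collapse repeated whitespace. The function name `clean_text_round2_spaced` is hypothetical.

```python
import re

def clean_text_round2_spaced(text):
    """Variant of round 2 that keeps words on either side of a newline apart."""
    text = re.sub("[‘’“”…]", "", text)  # drop curly quotes and ellipsis characters
    text = re.sub(r"\s+", " ", text)    # newlines and runs of spaces become one space
    return text.strip()

example = "etc what\nseems the restaurant"

fused = re.sub("\n", "", example)            # behaviour of the original round 2
spaced = clean_text_round2_spaced(example)   # behaviour of the variant

print(fused)   # etc whatseems the restaurant
print(spaced)  # etc what seems the restaurant
```

Collapsing whitespace also tidies the double spaces that round 1 leaves behind when it deletes a number.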

In [20]:
# Let's take a look at the second round of updated text
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean

|  | transcript |
|---|---|
| opentable | very loud then we ordered what was on the menu but when the food arrived some of it was not included in our order when we asked the why question ... |


In [21]:
data_clean.transcript.loc['opentable']



## Organize The Data

### Corpus

In [22]:
data_df

|  | transcript |
|---|---|
| opentable | Very loud, then we ordered what was on the menu, but when the food arrived some of it was not included in our order... When we asked the why quest... |


In [23]:
# add the webpages' full names as well
full_names = ['OpenTable']

data_df['full_name'] = full_names
data_df

|  | transcript | full_name |
|---|---|---|
| opentable | Very loud, then we ordered what was on the menu, but when the food arrived some of it was not included in our order... When we asked the why quest... | OpenTable |


In [24]:
# pickle it for later use
data_df.to_pickle("corpus.pkl")

### Document-Term Matrix

In [26]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.transcript)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_clean.index
data_dtm

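|  | able | accommodating | acknowledgement | added | additional | admin | advised | air | alfredo | amazing | ... | wet | whatseems | whatsoever | wine | wonderful | wouldnt | wrong | years | youre | yukon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| opentable | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | ... | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 |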


In [27]:
# pickle it for later use
data_dtm.to_pickle("dtm.pkl")

In [28]:
# also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean.pkl')
with open("cv.pkl", "wb") as file:  # context manager closes the file handle
    pickle.dump(cv, file)