# 01 - Cleaning Text Data

- Code and examples based on: https://machinelearningmastery.com/crash-course-deep-learning-natural-language-processing/

---
## a) Load book text

#### Option 1: Load from txt file

In [47]:
# load book text for Alice in Wonderland
# downloaded from: https://archive.org/details/alicesadventures00011gut/page/n9/mode/2up

filename = 'data/alice30_booktext.txt'
with open(filename, 'rt') as file:
    text1 = file.read()

type(text1), len(text1)

(str, 148539)

In [50]:
text1[500:800]

"thout pictures or conversation?'\n\n  So she was considering in her own mind (as well as she could,\nfor the hot day made her feel very sleepy and stupid), whether\nthe pleasure of making a daisy-chain would be worth the trouble\nof getting up and picking the daisies, when suddenly a White\nRabbit with pi"

#### Option 2: Load from NLTK

In [48]:
# NLTK: Natural Language Toolkit
# !pip install nltk

import nltk
# nltk.download('gutenberg')  # run once to fetch the Gutenberg corpus used below

In [49]:
# show the book texts from Project Gutenberg that ship with NLTK
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [51]:
# load book text for Alice in Wonderland from NLTK

text2 = nltk.corpus.gutenberg.raw('carroll-alice.txt')
type(text2), len(text2)

(str, 144395)

In [52]:
text2[500:800]

'd stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her.\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit'
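
The two sources are not identical: `text1` has 148539 characters and `text2` has 144395, and the excerpts above break lines at different points. One way to compare passages despite different line wrapping is to collapse every run of whitespace first. This is a minimal sketch; the helper name `normalize_whitespace` is hypothetical, and the two sample strings are shortened from the excerpts above.

```python
def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return " ".join(text.split())

# The same passage, wrapped differently, as in the two excerpts above
sample_txt = "whether\nthe pleasure of making a daisy-chain would be worth the trouble\nof getting up"
sample_nltk = "whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up"

# The raw strings differ only in where the newlines fall
print(sample_txt == sample_nltk)

# After normalization the two passages compare equal
print(normalize_whitespace(sample_txt) == normalize_whitespace(sample_nltk))
```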

---
## b) Tokenization
**Tokenization**: the process of splitting raw text into a list of words or other units ("tokens") that can be used for modeling.

#### Option 1: Manual Tokenization

In [53]:
# split text into words by white space
words1 = text1.split()

# convert all words to lowercase
words1 = [word.lower() for word in words1]

# ... do further editing of words

type(words1), len(words1)

(list, 26466)

In [54]:
words1[:5]

["alice's", 'adventures', 'in', 'wonderland', 'lewis']
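
The "further editing" step in the cell above could, for example, strip punctuation and drop tokens that are not purely alphabetic. The following is a minimal sketch of that idea, not necessarily the cleaning used later; the sample list `words` is a hand-picked stand-in for `words1`.

```python
import string

# Sample tokens in the shape produced by the whitespace split above
words = ["alice's", "adventures", "in", "wonderland", "(as", "well",
         "daisy-chain", "conversation?'", "--"]

# Translation table that deletes every ASCII punctuation character
table = str.maketrans("", "", string.punctuation)
stripped = [word.translate(table) for word in words]

# Drop tokens that are empty or not purely alphabetic after stripping
cleaned = [word for word in stripped if word.isalpha()]
print(cleaned)
```

Note the trade-off: deleting punctuation also merges `alice's` into `alices` and `daisy-chain` into `daisychain`, which may or may not be acceptable for the downstream model.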

#### Option 2: NLTK Tokenization

In [63]:
# split into words
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text2)
tokens = [token.lower() for token in tokens]

type(tokens), len(tokens)

(list, 33493)

In [65]:
tokens[:5]

['[', 'alice', "'s", 'adventures', 'in']
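
`word_tokenize` returns more tokens than the whitespace split, at least in part because punctuation marks become tokens of their own. The sketch below illustrates that effect with the standard library only. It is a rough approximation and not NLTK's algorithm; the regular expression and the name `simple_tokenize` are assumptions made for illustration.

```python
import re

# A word (optionally with internal apostrophes or hyphens) or a single punctuation mark
TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Lowercase the text and split it into words and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())

sentence = "Alice's daisy-chain (as well as she could)"

whitespace_tokens = sentence.lower().split()
regex_tokens = simple_tokenize(sentence)

print(len(whitespace_tokens), whitespace_tokens)
print(len(regex_tokens), regex_tokens)
```

Unlike the NLTK output above, this pattern keeps `alice's` as one token instead of splitting off `'s`.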

In [66]:
# saving tokens
import pickle

with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokens, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [69]:
# loading tokens
with open('tokenizer.pickle', 'rb') as handle:
    tokens2 = pickle.load(handle)

In [68]:
tokens2 == tokens

True
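
Pickle files are specific to Python and should only be loaded from trusted sources. For a plain list of strings, JSON is one portable alternative. This is a minimal sketch, assuming a short stand-in token list and a hypothetical file name `tokens.json` written to a temporary directory.

```python
import json
import os
import tempfile

tokens = ['[', 'alice', "'s", 'adventures', 'in']  # stand-in for the full token list

path = os.path.join(tempfile.mkdtemp(), 'tokens.json')  # hypothetical file name

# save the tokens as a JSON array
with open(path, 'w', encoding='utf-8') as handle:
    json.dump(tokens, handle)

# load the tokens back
with open(path, 'r', encoding='utf-8') as handle:
    tokens_loaded = json.load(handle)

print(tokens_loaded == tokens)
```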

#### Option 3: Get tokens provided by NLTK

In [58]:
tokens = nltk.corpus.gutenberg.words('carroll-alice.txt')
type(tokens), len(tokens)

(nltk.corpus.reader.util.StreamBackedCorpusView, 34110)

In [62]:
tokens[:5]

['[', 'Alice', "'", 's', 'Adventures']

In [1]:
# read the data into a pandas dataframe:
# code from: https://nijianmo.github.io/amazon/index.html#subsets

import pandas as pd
import gzip
import json

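def parse(path):
    # yield one parsed JSON record per line of the gzipped file
    with gzip.open(path, 'rb') as g:
        for line in g:
            yield json.loads(line)

def getDF(path, max_rows=100):
    # collect at most max_rows records, keyed by row number
    df = {}
    for i, d in enumerate(parse(path)):
        if i == max_rows:
            break
        df[i] = d
    # also returns a DataFrame when the file has fewer than max_rows records
    return pd.DataFrame.from_dict(df, orient='index')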

df = getDF('data/reviews_Movies_and_TV_5.json.gz')
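
The loader above relies on the file holding one JSON object per line, gzip-compressed. The sketch below reproduces that layout on a tiny temporary file and reads back only the first `n` records, using the standard library only. The file name `sample_reviews.json.gz` and the helper `read_first` are hypothetical; the field values are copied from the `df.head()` output in this section.

```python
import gzip
import itertools
import json
import os
import tempfile

# Two records with a few fields copied from the df.head() output
records = [
    {"overall": 5.0, "asin": "0005089549", "summary": "Amazing!"},
    {"overall": 5.0, "asin": "000503860X", "summary": "YES!! X LIVE!!"},
]

path = os.path.join(tempfile.mkdtemp(), "sample_reviews.json.gz")  # hypothetical file

# Write one JSON object per line, gzip-compressed
with gzip.open(path, "wt", encoding="utf-8") as g:
    for record in records:
        g.write(json.dumps(record) + "\n")

def read_first(path, n):
    """Return at most n parsed records from a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as g:
        return [json.loads(line) for line in itertools.islice(g, n)]

print(read_first(path, 1))
print(len(read_first(path, 100)))
```

Because `itertools.islice` stops after `n` lines, the rest of a large file is never decompressed.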

In [2]:
df.head()

| | overall | verified | reviewTime | reviewerID | asin | style | reviewerName | reviewText | summary | unixReviewTime | vote |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | True | 11 9, 2012 | A2M1CU2IRZG0K9 | 0005089549 | {'Format:': ' VHS Tape'} | Terri | So sorry I didn't purchase this years ago when... | Amazing! | 1352419200 | |
| 1 | 5.0 | True | 12 30, 2011 | AFTUJYISOFHY6 | 0005089549 | {'Format:': ' VHS Tape'} | Melissa D. Abercrombie | Believe me when I tell you that you will recei... | Great Gospel VHS of the Cathedrals! | 1325203200 | |
| 2 | 5.0 | True | 04 21, 2005 | A3JVF9Y53BEOGC | 000503860X | {'Format:': ' DVD'} | Anthony Thompson | I have seen X live many times, both in the ear... | A great document of a great band | 1114041600 | 11.0 |
| 3 | 5.0 | True | 04 6, 2005 | A12VPEOEZS1KTC | 000503860X | {'Format:': ' DVD'} | JadeRain | I was so excited for this! Finally, a live co... | YES!! X LIVE!! | 1112745600 | 5.0 |
| 4 | 5.0 | True | 12 3, 2010 | ATLZNVLYKP9AZ | 000503860X | {'Format:': ' DVD'} | T. Fisher | X is one of the best punk bands ever. I don't ... | X have still got it | 1291334400 | 5.0 |


**Data dictionary:**
- overall - rating of the product
- reviewTime - time of the review (raw)
- reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
- asin - ID of the product, e.g. 0000013714
- style - a dictionary of the product metadata, e.g., "Format" is "Hardcover"
- reviewerName - name of the reviewer
- reviewText - text of the review
- summary - summary of the review
- unixReviewTime - time of the review (unix time)
- vote - helpful votes of the review
- image - images that users post after they have received the product
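
`reviewTime` and `unixReviewTime` describe the same moment in two encodings. The sketch below checks that on values copied from the first two rows of the `df.head()` output. The format string `"%m %d, %Y"` (month, day, year) is inferred from those rows, not taken from the dataset documentation, so treat it as an assumption.

```python
from datetime import datetime, timezone

# (reviewTime, unixReviewTime) pairs copied from the first two rows of df.head()
rows = [("11 9, 2012", 1352419200), ("12 30, 2011", 1325203200)]

dates = []
for review_time, unix_time in rows:
    # parse the raw string, assuming "month day, year"
    parsed = datetime.strptime(review_time, "%m %d, %Y").date()
    # convert the unix timestamp to a calendar date in UTC
    from_unix = datetime.fromtimestamp(unix_time, tz=timezone.utc).date()
    dates.append((parsed, from_unix))
    print(parsed, from_unix, parsed == from_unix)
```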