# Fall 2020: DS-GA 1011 NLP with Representation Learning
## Lab 2: 11-Sep-2020, Friday
## Text Pre-processing

In this lab, we cover how to clean and process text data so that it is ready to be fed to NLP models.

---
### Data
We are using [movie review data](https://ai.stanford.edu/~amaas/data/sentiment/) from IMDB, which is for *binary sentiment classification*. There are 25,000 reviews for training and 25,000 for testing.

### Download and unzip the data
The `wget` command downloads the data from the URL below.

If `wget` is not available, install it first (for example, `brew install wget` with Homebrew on macOS).

In [1]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

--2020-09-12 14:38:11--  https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz.2’


2020-09-12 14:38:24 (6.58 MB/s) - ‘aclImdb_v1.tar.gz.2’ saved [84125825/84125825]



The `tar` command creates and extracts archives. Here the flags `xzf` extract (`x`) a gzip-compressed (`z`) archive from the given file (`f`).

In [2]:
!tar xzf aclImdb_v1.tar.gz

Note: Windows users can download the data directly from the website and unpack it with a utility such as 7-Zip.
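
For readers without `wget` or `tar`, the same two steps can be sketched with Python's standard library. The helper name `download_and_extract` and its parameters are illustrative and not part of the lab; the download is skipped when the archive is already on disk.

```python
import os
import tarfile
import urllib.request

def download_and_extract(url, archive_path, dest_dir="."):
    """Download a .tar.gz archive (if missing) and extract it into dest_dir."""
    if not os.path.exists(archive_path):
        # urlretrieve writes the response body to a local file
        urllib.request.urlretrieve(url, archive_path)
    # "r:gz" opens a gzip-compressed tar archive for reading
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest_dir)

# Usage (commented out because it downloads roughly 80 MB):
# download_and_extract(
#     "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
#     "aclImdb_v1.tar.gz",
# )
```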

In [3]:
train_path = "aclImdb/train/"
test_path = "aclImdb/test/"

### Read data

In [4]:
import os
from tqdm import tqdm

cf. 
  
> [`tqdm`](https://pypi.org/project/tqdm/) makes your loops show a smart progress meter. Just wrap any iterable with *tqdm(iterable)*, and you're done!

> `os.listdir(path)` returns a list containing the names of the entries in the directory given by path.
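
One detail worth knowing: `os.listdir` returns bare file names in arbitrary order, so code that indexes into the resulting list may pick up different files on different machines. A small sketch, using a throwaway temporary directory rather than the IMDB folders, of joining the names back onto the directory and sorting them for a reproducible order:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as demo_dir:
    # create three small files in a scratch directory
    for name in ["b.txt", "a.txt", "c.txt"]:
        with open(os.path.join(demo_dir, name), "w") as f:
            f.write("review text for " + name)

    names = sorted(os.listdir(demo_dir))                      # names only, now in a fixed order
    paths = [os.path.join(demo_dir, name) for name in names]  # full paths
    print(names)  # ['a.txt', 'b.txt', 'c.txt']
```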

In [5]:
def read_reviews(directory):
  # read every review file in `directory` into a list of strings
  reviews = []
  for filename in tqdm(os.listdir(directory)):
    # `with` closes each file handle as soon as it has been read
    with open(os.path.join(directory, filename), 'rt', encoding='utf-8') as f:
      reviews.append(f.read())
  return reviews

# positive reviews are loaded first, then negative ones
train_corpus = read_reviews(train_path+"pos") + read_reviews(train_path+"neg")
test_corpus = read_reviews(test_path+"pos") + read_reviews(test_path+"neg")

100%|██████████| 12500/12500 [00:00<00:00, 19353.64it/s]
100%|██████████| 12500/12500 [00:00<00:00, 21871.35it/s]
100%|██████████| 12500/12500 [00:00<00:00, 22223.88it/s]
100%|██████████| 12500/12500 [00:00<00:00, 22125.39it/s]


In [6]:
print(len(train_corpus), len(test_corpus))

25000 25000


In [7]:
# Reducing corpus size for faster processing
train_corpus = train_corpus[:500]
test_corpus = test_corpus[:500]
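
Because the positive reviews are loaded before the negative ones, the first 500 entries of each corpus are all positive. That is harmless for the pre-processing steps in this lab, but a classifier would need both classes. A minimal sketch of drawing a random, labeled subset instead; the helper `sample_labeled`, the label convention (1 = positive, 0 = negative), and the seed are assumptions for illustration, and the toy lists stand in for the real corpus:

```python
import random

def sample_labeled(texts, labels, k, seed=0):
    """Return k randomly chosen texts and their labels, reproducible via seed."""
    rng = random.Random(seed)
    indices = rng.sample(range(len(texts)), k)
    return [texts[i] for i in indices], [labels[i] for i in indices]

# Toy stand-in: 5 positive reviews followed by 5 negative ones
toy_texts = [f"pos review {i}" for i in range(5)] + [f"neg review {i}" for i in range(5)]
toy_labels = [1] * 5 + [0] * 5

subset_texts, subset_labels = sample_labeled(toy_texts, toy_labels, k=4)
print(list(zip(subset_texts, subset_labels)))
```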

In [8]:
train_corpus[1]

'Bizarre horror movie filled with famous faces but stolen by Cristina Raines (later of TV\'s "Flamingo Road") as a pretty but somewhat unstable model with a gummy smile who is slated to pay for her attempted suicides by guarding the Gateway to Hell! The scenes with Raines modeling are very well captured, the mood music is perfect, Deborah Raffin is charming as Cristina\'s pal, but when Raines moves into a creepy Brooklyn Heights brownstone (inhabited by a blind priest on the top floor), things really start cooking. The neighbors, including a fantastically wicked Burgess Meredith and kinky couple Sylvia Miles & Beverly D\'Angelo, are a diabolical lot, and Eli Wallach is great fun as a wily police detective. The movie is nearly a cross-pollination of "Rosemary\'s Baby" and "The Exorcist"--but what a combination! Based on the best-seller by Jeffrey Konvitz, "The Sentinel" is entertainingly spooky, full of shocks brought off well by director Michael Winner, who mounts a thoughtfully downbe

In [9]:
train_corpus[3]

'It\'s a strange feeling to sit alone in a theater occupied by parents and their rollicking kids. I felt like instead of a movie ticket, I should have been given a NAMBLA membership.<br /><br />Based upon Thomas Rockwell\'s respected Book, How To Eat Fried Worms starts like any children\'s story: moving to a new town. The new kid, fifth grader Billy Forrester was once popular, but has to start anew. Making friends is never easy, especially when the only prospect is Poindexter Adam. Or Erica, who at 4 1/2 feet, is a giant.<br /><br />Further complicating things is Joe the bully. His freckled face and sleeveless shirts are daunting. He antagonizes kids with the Death Ring: a Crackerjack ring that is rumored to kill you if you\'re punched with it. But not immediately. No, the death ring unleashes a poison that kills you in the eight grade.<br /><br />Joe and his axis of evil welcome Billy by smuggling a handful of slimy worms into his thermos. Once discovered, Billy plays it cool, swearin

---
### Pre-processing
#### Remove HTML tags, punctuation and extra white space

cf.
> A regular expression is a sequence of characters that forms a search pattern for strings. The functions in the [`re`](https://docs.python.org/3/library/re.html) module let you check whether a particular string matches a given regular expression, and find or replace the parts that do.
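
A short sketch of three `re` functions on a made-up review string (the string is an assumption for illustration, not taken from the dataset):

```python
import re

text = "Great movie!<br /><br />10/10, would watch again."

# re.search scans the string and returns the first match (or None)
rating = re.search(r"\d+/\d+", text)
print(rating.group())                 # 10/10

# re.findall returns every non-overlapping match as a list of strings
print(re.findall(r"<.*?>", text))     # ['<br />', '<br />']

# re.sub replaces every match; here each HTML tag becomes a single space
print(re.sub(r"<.*?>", " ", text))    # Great movie!  10/10, would watch again.
```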

In [10]:
import re

def remove_space_punctuation(data):
  # input: list of raw review strings
  # output: list of strings with HTML tags and punctuation replaced by
  #         spaces, and runs of spaces collapsed into one
  
  result = [re.sub(r'<.*?>',' ',s) for s in tqdm(data)] # html tags
  result = [re.sub(r'[^\w\s]',' ',s) for s in tqdm(result)] # punctuation
  result = [re.sub(r' +',' ',s) for s in tqdm(result)] # repeated spaces
  return result

train = remove_space_punctuation(train_corpus)
test = remove_space_punctuation(test_corpus)

100%|██████████| 500/500 [00:00<00:00, 263560.64it/s]
100%|██████████| 500/500 [00:00<00:00, 36179.63it/s]
100%|██████████| 500/500 [00:00<00:00, 14298.34it/s]
100%|██████████| 500/500 [00:00<00:00, 259741.39it/s]
100%|██████████| 500/500 [00:00<00:00, 37594.15it/s]
100%|██████████| 500/500 [00:00<00:00, 13171.00it/s]


In [11]:
train[1]

'Bizarre horror movie filled with famous faces but stolen by Cristina Raines later of TV s Flamingo Road as a pretty but somewhat unstable model with a gummy smile who is slated to pay for her attempted suicides by guarding the Gateway to Hell The scenes with Raines modeling are very well captured the mood music is perfect Deborah Raffin is charming as Cristina s pal but when Raines moves into a creepy Brooklyn Heights brownstone inhabited by a blind priest on the top floor things really start cooking The neighbors including a fantastically wicked Burgess Meredith and kinky couple Sylvia Miles Beverly D Angelo are a diabolical lot and Eli Wallach is great fun as a wily police detective The movie is nearly a cross pollination of Rosemary s Baby and The Exorcist but what a combination Based on the best seller by Jeffrey Konvitz The Sentinel is entertainingly spooky full of shocks brought off well by director Michael Winner who mounts a thoughtfully downbeat ending with skill 1 2 from '

In [12]:
train[3]

'It s a strange feeling to sit alone in a theater occupied by parents and their rollicking kids I felt like instead of a movie ticket I should have been given a NAMBLA membership Based upon Thomas Rockwell s respected Book How To Eat Fried Worms starts like any children s story moving to a new town The new kid fifth grader Billy Forrester was once popular but has to start anew Making friends is never easy especially when the only prospect is Poindexter Adam Or Erica who at 4 1 2 feet is a giant Further complicating things is Joe the bully His freckled face and sleeveless shirts are daunting He antagonizes kids with the Death Ring a Crackerjack ring that is rumored to kill you if you re punched with it But not immediately No the death ring unleashes a poison that kills you in the eight grade Joe and his axis of evil welcome Billy by smuggling a handful of slimy worms into his thermos Once discovered Billy plays it cool swearing that he eats worms all the time Then he throws them at Joe 
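
To see what each substitution contributes, the three steps can be traced on a single toy string (made up for illustration). Note that the apostrophe in "It's" is treated as punctuation, which is why stray tokens such as `s` and `t` show up later in the vocabulary. The function `clean_keep_apostrophes` is a hypothetical variant, not the lab's method, that keeps apostrophes and trims leading and trailing white space.

```python
import re

s = "It's great!<br /><br />Really--a 10/10."

step1 = re.sub(r"<.*?>", " ", s)          # HTML tags -> space
step2 = re.sub(r"[^\w\s]", " ", step1)    # anything that is not a word character or white space -> space
step3 = re.sub(r" +", " ", step2)         # runs of spaces -> single space

print(repr(step1))  # "It's great!  Really--a 10/10."
print(repr(step2))  # 'It s great   Really  a 10 10 '
print(repr(step3))  # 'It s great Really a 10 10 '

def clean_keep_apostrophes(text):
    """Hypothetical variant: keep apostrophes, collapse all white space, strip the ends."""
    text = re.sub(r"<.*?>", " ", text)
    text = re.sub(r"[^\w\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_keep_apostrophes(s))  # It's great Really a 10 10
```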

#### Lowercasing, tokenization and lemmatization 

*Tokenization*
The task of chopping the input text into pieces called *tokens*. Tokens are the building blocks of NLP: sequences of characters grouped together as a basic unit. They can be words, subwords, or single characters.

*Stemming*
The process of reducing a word to its stem by stripping affixes with heuristic rules. The result need not be a real word.

*Lemmatization*
Maps a word to its dictionary form (its *lemma*), taking vocabulary and morphology into account.
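
The difference between these three operations is easiest to see on a few words. The sketch below is deliberately a toy: `toy_stem` strips a handful of suffixes and `toy_lemmatize` looks words up in a tiny hand-written table. Both are hypothetical helpers written for this illustration; real stemmers (such as the Porter stemmer in NLTK) and lemmatizers (such as the one in spaCy) use far richer rules and vocabularies.

```python
sentence = "the mice were watching movies"

# Tokenization at two granularities: words and characters
word_tokens = sentence.split()
char_tokens = list("mice")
print(word_tokens)  # ['the', 'mice', 'were', 'watching', 'movies']
print(char_tokens)  # ['m', 'i', 'c', 'e']

SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

TOY_LEMMAS = {"mice": "mouse", "were": "be", "watching": "watch", "movies": "movie"}

def toy_lemmatize(word):
    """Return the dictionary form if the word is in the toy table, else the word itself."""
    return TOY_LEMMAS.get(word, word)

print([toy_stem(w) for w in word_tokens])
# ['the', 'mice', 'were', 'watch', 'movi']
print([toy_lemmatize(w) for w in word_tokens])
# ['the', 'mouse', 'be', 'watch', 'movie']
```

The toy stemmer produces the non-word `movi` and leaves `mice` untouched, while the toy lemmatizer returns dictionary forms.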


cf.

> [spaCy](https://spacy.io): for app developers

> [NLTK](https://www.nltk.org): for researchers and scholars

Install `spacy`

In [13]:
# !conda install -c conda-forge spacy
# !python -m spacy download en_core_web_sm

In [14]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [15]:
import spacy
import string

nlp = spacy.load("en_core_web_sm") # load the small English model (same model as en_core_web_sm.load() above)

def tokenize(data):
  # input: list of strings without punctuation and extra white space
  # output: list of lists of lower-cased word tokens with stop words removed
  #         (tokens are not lemmatized here; token.lemma_ would give the lemma)

  tokenized_data = []  
  for review in tqdm(data):
    result = nlp(review) # tokenized document
    tokenized_data.append([token.text.lower() for token in result
                           if not token.is_stop
                           and token.text not in string.punctuation])
  return tokenized_data

train_tokenized = tokenize(train)
test_tokenized = tokenize(test)

100%|██████████| 500/500 [00:15<00:00, 32.38it/s]
100%|██████████| 500/500 [00:17<00:00, 28.77it/s]


In [16]:
print(train_tokenized[1])

['bizarre', 'horror', 'movie', 'filled', 'famous', 'faces', 'stolen', 'cristina', 'raines', 'later', 'tv', 's', 'flamingo', 'road', 'pretty', 'somewhat', 'unstable', 'model', 'gummy', 'smile', 'slated', 'pay', 'attempted', 'suicides', 'guarding', 'gateway', 'hell', 'scenes', 'raines', 'modeling', 'captured', 'mood', 'music', 'perfect', 'deborah', 'raffin', 'charming', 'cristina', 's', 'pal', 'raines', 'moves', 'creepy', 'brooklyn', 'heights', 'brownstone', 'inhabited', 'blind', 'priest', 'floor', 'things', 'start', 'cooking', 'neighbors', 'including', 'fantastically', 'wicked', 'burgess', 'meredith', 'kinky', 'couple', 'sylvia', 'miles', 'beverly', 'd', 'angelo', 'diabolical', 'lot', 'eli', 'wallach', 'great', 'fun', 'wily', 'police', 'detective', 'movie', 'nearly', 'cross', 'pollination', 'rosemary', 's', 'baby', 'exorcist', 'combination', 'based', 'best', 'seller', 'jeffrey', 'konvitz', 'sentinel', 'entertainingly', 'spooky', 'shocks', 'brought', 'director', 'michael', 'winner', 'mou

In [17]:
print(train_tokenized[3])

['s', 'strange', 'feeling', 'sit', 'theater', 'occupied', 'parents', 'rollicking', 'kids', 'felt', 'like', 'instead', 'movie', 'ticket', 'given', 'nambla', 'membership', 'based', 'thomas', 'rockwell', 's', 'respected', 'book', 'eat', 'fried', 'worms', 'starts', 'like', 'children', 's', 'story', 'moving', 'new', 'town', 'new', 'kid', 'fifth', 'grader', 'billy', 'forrester', 'popular', 'start', 'anew', 'making', 'friends', 'easy', 'especially', 'prospect', 'poindexter', 'adam', 'erica', '4', '1', '2', 'feet', 'giant', 'complicating', 'things', 'joe', 'bully', 'freckled', 'face', 'sleeveless', 'shirts', 'daunting', 'antagonizes', 'kids', 'death', 'ring', 'crackerjack', 'ring', 'rumored', 'kill', 'punched', 'immediately', 'death', 'ring', 'unleashes', 'poison', 'kills', 'grade', 'joe', 'axis', 'evil', 'welcome', 'billy', 'smuggling', 'handful', 'slimy', 'worms', 'thermos', 'discovered', 'billy', 'plays', 'cool', 'swearing', 'eats', 'worms', 'time', 'throws', 'joe', 's', 'face', 'ewww', 'wi

---
### Explore

Find the most common words and build a vocabulary.

cf.
> [`collections`](https://docs.python.org/3/library/collections.html) provides specialized container datatypes such as `OrderedDict` and `Counter`.

> `Counter` is a `dict` subclass in which elements are stored as dictionary keys and their counts are stored as dictionary values.
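
A brief sketch of `Counter` on a made-up token list, including the `zip(*...)` idiom that `build_vocab` below relies on to split (token, count) pairs:

```python
from collections import Counter

tokens = ["film", "great", "film", "story", "great", "film"]
counter = Counter(tokens)

print(counter["film"])         # 3
print(counter["missing"])      # 0 (absent keys count as zero instead of raising KeyError)
print(counter.most_common(2))  # [('film', 3), ('great', 2)]

# zip(*pairs) splits the (token, count) pairs into two parallel tuples
vocab, counts = zip(*counter.most_common(2))
print(vocab, counts)           # ('film', 'great') (3, 2)
```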

In [18]:
from collections import Counter

def build_vocab(tokenized_data, max_vocab=10000):
    # input: list of lists of tokens
    # output: token2id: dict, id2token: list

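    PAD_IDX = 0 # padding token
    UNK_IDX = 1 # unknown token
    # Reviews differ in length; padding brings every sequence that is
    # read in to the same length.
    # Indices 0 and 1 are reserved, so regular token IDs start from 2.

    all_tokens = [token for tokens in tokenized_data for token in tokens]
    # flatten the list of token lists into one list of tokens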

    token_counter = Counter(all_tokens)
    vocab, count = zip(*token_counter.most_common(max_vocab)) 
    # the max_vocab (default 10000) most common tokens: vocab holds the tokens, count their frequencies
    token2id = dict(zip(vocab, range(2, 2 + len(vocab)))) # assign each token an ID
    token2id["<PAD>"] = PAD_IDX
    token2id["<UNK>"] = UNK_IDX # the two special tokens
    id2token = ["<PAD>", "<UNK>"] + list(vocab)

    return token2id, id2token

In [19]:
token2id, id2token = build_vocab(train_tokenized)

In [20]:
print(token2id)

{'s': 2, 'film': 3, 'movie': 4, 't': 5, 'like': 6, 'great': 7, 'story': 8, 'good': 9, 'time': 10, 'people': 11, 'life': 12, 'best': 13, 'love': 14, 'way': 15, 'characters': 16, 'films': 17, 'character': 18, 'man': 19, 'movies': 20, 'little': 21, 'seen': 22, 'think': 23, 'don': 24, '10': 25, 'scene': 26, 'years': 27, 'world': 28, 'end': 29, 'watch': 30, 'real': 31, 'know': 32, 'makes': 33, 'cast': 34, 'young': 35, 'old': 36, 'find': 37, 'work': 38, 'watching': 39, 'plot': 40, 'director': 41, 'better': 42, 'scenes': 43, 'music': 44, 'thing': 45, 'new': 46, 've': 47, 'family': 48, 'acting': 49, 'actors': 50, 'action': 51, 'm': 52, 'series': 53, 'comedy': 54, 'lot': 55, 'role': 56, 'bit': 57, 'saw': 58, 'big': 59, 'played': 60, 'look': 61, 'didn': 62, 'come': 63, 'beautiful': 64, 'performance': 65, 'funny': 66, 'things': 67, 'fun': 68, 'day': 69, 'excellent': 70, 'tv': 71, 'wonderful': 72, 'thought': 73, 'comes': 74, 'going': 75, 'want': 76, 'bad': 77, 'actually': 78, 'doesn': 79, 'long': 
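
A toy run makes the ID layout concrete. The function below is a condensed restatement of `build_vocab` so that the snippet runs on its own; the name `build_vocab_small`, the toy corpus, and `max_vocab=3` are made up for illustration.

```python
from collections import Counter

def build_vocab_small(tokenized_data, max_vocab):
    """Condensed restatement of build_vocab for a self-contained demo."""
    counter = Counter(token for tokens in tokenized_data for token in tokens)
    vocab = [token for token, _ in counter.most_common(max_vocab)]
    id2token = ["<PAD>", "<UNK>"] + vocab          # ids 0 and 1 are reserved
    token2id = {token: idx for idx, token in enumerate(id2token)}
    return token2id, id2token

toy_corpus = [["good", "film"], ["good", "story", "film"], ["good", "plot"]]
toy_token2id, toy_id2token = build_vocab_small(toy_corpus, max_vocab=3)

print(toy_token2id)
# {'<PAD>': 0, '<UNK>': 1, 'good': 2, 'film': 3, 'story': 4}
print("plot" in toy_token2id)  # False: it fell outside the 3 most common tokens
```

Tokens that do not make the cut, such as `plot` here, are later mapped to `<UNK>`.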

#### Transform tokens into indices according to token2id

In [21]:
# transform tokens into integer indices according to token2id
def transform(tokenized_data, token2id):
  # input: list of lists of tokens
  # output: list of lists of ids according to token2id

  unk_idx = token2id["<UNK>"] # id used for out-of-vocabulary tokens
  text_indices = []
  for tokens in tqdm(tokenized_data):
    indices = [token2id.get(token, unk_idx) for token in tokens]
    text_indices.append(indices)
  return text_indices

In [22]:
train_transformed = transform(train_tokenized, token2id)
test_transformed = transform(test_tokenized, token2id)

100%|██████████| 500/500 [00:00<00:00, 43813.89it/s]
100%|██████████| 500/500 [00:00<00:00, 43003.51it/s]


In [23]:
print(train_transformed[1])
print([id2token[i] for i in train_transformed[1]])

[847, 132, 4, 848, 289, 1836, 1165, 3847, 2844, 138, 71, 2, 5870, 932, 89, 414, 3848, 1554, 5871, 933, 5872, 1555, 2244, 3849, 5873, 5874, 934, 43, 2844, 3850, 2245, 771, 44, 104, 3851, 5875, 554, 3847, 2, 3852, 2844, 597, 703, 2845, 2246, 5876, 3853, 1166, 3854, 1030, 67, 256, 2846, 5877, 390, 2247, 2248, 5878, 5879, 5880, 307, 3855, 2847, 5881, 149, 5882, 3856, 55, 5883, 5884, 7, 68, 5885, 243, 598, 4, 391, 1556, 5886, 3857, 2, 308, 5887, 1338, 202, 13, 5888, 5889, 5890, 5891, 5892, 1339, 2249, 599, 41, 224, 1167, 5893, 3858, 5894, 139, 2848, 167, 119]
['bizarre', 'horror', 'movie', 'filled', 'famous', 'faces', 'stolen', 'cristina', 'raines', 'later', 'tv', 's', 'flamingo', 'road', 'pretty', 'somewhat', 'unstable', 'model', 'gummy', 'smile', 'slated', 'pay', 'attempted', 'suicides', 'guarding', 'gateway', 'hell', 'scenes', 'raines', 'modeling', 'captured', 'mood', 'music', 'perfect', 'deborah', 'raffin', 'charming', 'cristina', 's', 'pal', 'raines', 'moves', 'creepy', 'brooklyn', 'he

In [24]:
tokens = ["oh", "lot", "skhjdaasdsa"]
print([token2id.get(token, token2id["<UNK>"]) for token in tokens]) # unseen tokens map to <UNK> (id 1)

[532, 55, 1]
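
`<PAD>` is defined in the vocabulary but not yet used in this lab. Its purpose is to bring sequences of different lengths to a common length so that they can be stacked into one batch. A minimal sketch, assuming a hypothetical helper `pad_or_truncate` and an arbitrary `max_len`; the lab itself does not prescribe a padding scheme:

```python
PAD_IDX = 0  # same convention as in build_vocab

def pad_or_truncate(sequences, max_len, pad_idx=PAD_IDX):
    """Right-pad short sequences with pad_idx and cut long ones to max_len."""
    padded = []
    for seq in sequences:
        seq = seq[:max_len]                                    # truncate if too long
        padded.append(seq + [pad_idx] * (max_len - len(seq)))  # pad if too short
    return padded

toy_indices = [[847, 132, 4], [2, 5, 6, 7, 8, 9], []]
print(pad_or_truncate(toy_indices, max_len=4))
# [[847, 132, 4, 0], [2, 5, 6, 7], [0, 0, 0, 0]]
```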
