# Preprocess

## 1 Load data

In `train.py`, a single line loads the preprocessed data:

`x_text, y = data_helpers.load_data_and_labels(positive_data_file_path, negative_data_file_path)`

In `data_helpers.py`, `load_data_and_labels` and `clean_str` preprocess the data. This notebook shows what happens during that preprocessing.


`clean_str()` handles special characters and contractions. It is taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py

In [3]:
import numpy as np
import re
import itertools
from collections import Counter


def clean_str(string):
    """
    Tokenization/string cleaning for all datasets except for SST.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
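    # The replacement strings use plain "(", ")" and "?": a backslash in the
    # replacement would be inserted literally into the output (e.g. "\(").
    string = re.sub(r"\(", " ( ", string)
    string = re.sub(r"\)", " ) ", string)
    string = re.sub(r"\?", " ? ", string)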
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()
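
To make the effect of these rules concrete, here is a minimal sketch that applies an abbreviated subset of them to one made-up string. The helper name `mini_clean` and the sample text are assumptions for illustration only; this is not the full `clean_str()`.

```python
import re

def mini_clean(text):
    """Abbreviated, illustrative variant of clean_str (not the full rule set)."""
    text = re.sub(r"[^A-Za-z0-9(),!?'`]", " ", text)  # drop unsupported characters
    text = re.sub(r"'s", " 's", text)      # split off the 's suffix
    text = re.sub(r"n't", " n't", text)    # split off the negation
    text = re.sub(r",", " , ", text)       # isolate commas as their own token
    text = re.sub(r"\s{2,}", " ", text)    # collapse repeated whitespace
    return text.strip().lower()

example = mini_clean("It's well-made, isn't it")
print(example)  # it 's well made , is n't it
```

Hyphens are not in the allowed character set, so `well-made` becomes two words, while `'s` and `n't` become separate tokens.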


Next, we look at what happens inside `load_data_and_labels()`.

####  1.1 Load data from files 

In [21]:
# data path parameter
positive_data_file = "../data/rt-polaritydata/rt-polarity.pos"
negtive_data_file = "../data/rt-polaritydata/rt-polarity.neg"

In [22]:
# Load data from files
# In Python 3, set the encoding to utf-8 explicitly; otherwise decoding can fail
# on platforms whose default encoding is not UTF-8.
with open(positive_data_file, 'r', encoding='utf-8') as f:
    positive_examples = f.readlines()

In [23]:
positive_examples[:3]

['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . \n',
 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth . \n',
 'effective but too-tepid biopic\n']

Each sentence still ends with trailing whitespace and a `\n`, so we use `strip()` to remove them:

In [24]:
positive_examples = [s.strip() for s in positive_examples]

In [25]:
positive_examples[:3]

['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
 'effective but too-tepid biopic']
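
The read-and-strip steps above can be folded into one small helper. The sketch below is an illustration under assumptions: `read_stripped_lines` is a hypothetical name, and it reads a temporary stand-in file with toy content so that it runs without the real dataset.

```python
import os
import tempfile

def read_stripped_lines(path):
    """Return the lines of a UTF-8 text file without surrounding whitespace."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f]

# Write a tiny stand-in file so the sketch runs without the real dataset.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".pos", delete=False
) as tmp:
    tmp.write("effective but too-tepid biopic\n")
    tmp.write("a second toy review . \n")
    tmp_path = tmp.name

lines = read_stripped_lines(tmp_path)
os.remove(tmp_path)  # clean up the temporary file
print(lines)
```

Using `with` closes the file as soon as the lines are read, which the bare `open(...).readlines()` form does not guarantee.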

We do the same thing with negative data:

In [26]:
with open(negtive_data_file, 'r', encoding='utf-8') as f:
    negative_examples = [s.strip() for s in f.readlines()]

In [31]:
negative_examples[-3:]

["as it stands , crocodile hunter has the hurried , badly cobbled look of the 1959 godzilla , which combined scenes of a japanese monster flick with canned shots of raymond burr commenting on the monster's path of destruction .",
 'the thing looks like a made-for-home-video quickie .',
 "enigma is well-made , but it's just too dry and too placid ."]

#### 1.2 Split by words

Here we use `clean_str()` to clean each sentence, then split it into words.

In [34]:
# Check the number of examples
print('The positive sentence number: ', len(positive_examples))
print('The negative sentence number: ', len(negative_examples))

The positive sentence number:  5331
The negative sentence number:  5331


In [99]:
x_text = positive_examples + negative_examples
print('The whole sentence number: ', len(x_text))

The whole sentence number:  10662


In [100]:
# first 3 sentences are positive 
x_text[:3] 

['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
 'effective but too-tepid biopic']

In [101]:
# last 3 sentences are negative
x_text[-3:]

["as it stands , crocodile hunter has the hurried , badly cobbled look of the 1959 godzilla , which combined scenes of a japanese monster flick with canned shots of raymond burr commenting on the monster's path of destruction .",
 'the thing looks like a made-for-home-video quickie .',
 "enigma is well-made , but it's just too dry and too placid ."]

In [102]:
# use clean_str() to process each sentence
x_text = [clean_str(sent) for sent in x_text]

In [103]:
x_text[:3]

["the rock is destined to be the 21st century 's new conan and that he 's going to make a splash even greater than arnold schwarzenegger , jean claud van damme or steven segal",
 "the gorgeously elaborate continuation of the lord of the rings trilogy is so huge that a column of words cannot adequately describe co writer director peter jackson 's expanded vision of j r r tolkien 's middle earth",
 'effective but too tepid biopic']

In [104]:
x_text[-3:]

["as it stands , crocodile hunter has the hurried , badly cobbled look of the 1959 godzilla , which combined scenes of a japanese monster flick with canned shots of raymond burr commenting on the monster 's path of destruction",
 'the thing looks like a made for home video quickie',
 "enigma is well made , but it 's just too dry and too placid"]

In [105]:
# split each sentence into a list of words
x_text = [sent.split(" ") for sent in x_text]

In [106]:
x_text[0]

['the',
 'rock',
 'is',
 'destined',
 'to',
 'be',
 'the',
 '21st',
 'century',
 "'s",
 'new',
 'conan',
 'and',
 'that',
 'he',
 "'s",
 'going',
 'to',
 'make',
 'a',
 'splash',
 'even',
 'greater',
 'than',
 'arnold',
 'schwarzenegger',
 ',',
 'jean',
 'claud',
 'van',
 'damme',
 'or',
 'steven',
 'segal']

#### 1.3 Generate labels

Each sentence gets a label in `[neg, pos]` order:
- If we have a positive label, we represent it as `[0, 1]`
- If we have a negative label, we represent it as `[1, 0]`
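
As a minimal sketch of this encoding, the following builds the label rows for a few toy sentences and recovers the class index with `argmax`. The toy lists are made up for illustration and are not part of the dataset.

```python
import numpy as np

# Toy stand-ins for the real example lists.
toy_positive = ["good film", "great cast"]
toy_negative = ["dull plot"]

# One row per sentence, in [neg, pos] order.
toy_labels = np.concatenate(
    [[[0, 1] for _ in toy_positive], [[1, 0] for _ in toy_negative]], axis=0
)

# argmax recovers the class index: 1 = positive, 0 = negative.
toy_classes = toy_labels.argmax(axis=1)
print(toy_labels.tolist())   # [[0, 1], [0, 1], [1, 0]]
print(toy_classes.tolist())  # [1, 1, 0]
```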

In [149]:
positive_labels = [[0, 1] for _ in positive_examples]
negative_labels = [[1, 0] for _ in negative_examples]
# concatenate them together
y = np.concatenate([positive_labels, negative_labels], 0)

In [150]:
y[:3]

array([[0, 1],
       [0, 1],
       [0, 1]])

In [151]:
y[-3:]

array([[1, 0],
       [1, 0],
       [1, 0]])

The positive and negative examples are ordered by class, so we need to shuffle them before training so that the model learns better parameters. We will do this later.

## 2 Pad

We pad all sentences to the same length, using the length of the longest sentence as the target.

In [166]:
[len(sent) for sent in x_text][:5]

[34, 38, 5, 20, 22]

In [107]:
sequence_length = max(len(sent) for sent in x_text)
sequence_length

56

In [108]:
padded_sentences = []

In [111]:
padding_word="<PAD/>"

for i in range(len(x_text)):
    sentence = x_text[i]
    num_padding = sequence_length - len(sentence)
    new_sentence = sentence + [padding_word] * num_padding
    padded_sentences.append(new_sentence)

In [113]:
padded_sentences[0]

['the',
 'rock',
 'is',
 'destined',
 'to',
 'be',
 'the',
 '21st',
 'century',
 "'s",
 'new',
 'conan',
 'and',
 'that',
 'he',
 "'s",
 'going',
 'to',
 'make',
 'a',
 'splash',
 'even',
 'greater',
 'than',
 'arnold',
 'schwarzenegger',
 ',',
 'jean',
 'claud',
 'van',
 'damme',
 'or',
 'steven',
 'segal',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>']
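
The padding loop can also be written as a small function, which avoids appending to a shared list twice if a cell is re-run. This is a sketch on toy input; the name `pad_sentences` and the toy sentences are assumptions for illustration.

```python
def pad_sentences(sentences, padding_word="<PAD/>"):
    """Right-pad every token list to the length of the longest one."""
    sequence_length = max(len(s) for s in sentences)
    return [s + [padding_word] * (sequence_length - len(s)) for s in sentences]

toy = [["effective", "but", "too", "tepid", "biopic"], ["a", "gem"]]
padded = pad_sentences(toy)
print(padded[1])  # ['a', 'gem', '<PAD/>', '<PAD/>', '<PAD/>']
```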

## 3 Build vocabulary 

Here we use all of the sentences to build the vocabulary. Note that in a typical NLP project, the vocabulary is built from the training data only, so some words in the test data will not appear in it. This is known as the **OOV** problem.

> Out of vocabulary (OOV): used in computational linguistics and natural language processing for terms encountered in input which are not present in a system's dictionary or database of known terms.

Because this is the first project, we build the vocabulary from the whole data set to keep things easy to follow.
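
For comparison, here is a rough sketch of the training-only approach, where unseen words map to a reserved unknown token. The `<UNK/>` token and the helpers `build_vocab` and `encode` are assumptions for illustration; they do not appear in `data_helpers.py`.

```python
from collections import Counter
import itertools

UNK = "<UNK/>"  # hypothetical unknown-word token, not used in the original code

def build_vocab(train_sentences):
    """Map each training word to an index; index 0 is reserved for UNK."""
    counts = Counter(itertools.chain.from_iterable(train_sentences))
    words = [UNK] + [w for w, _ in counts.most_common()]
    return {w: i for i, w in enumerate(words)}

def encode(sentence, vocab):
    """Replace words missing from the vocabulary with the UNK index."""
    return [vocab.get(w, vocab[UNK]) for w in sentence]

train = [["a", "good", "film"], ["a", "dull", "film"]]
vocab = build_vocab(train)

# "great" never occurs in the toy training data, so it maps to UNK.
encoded = encode(["a", "great", "film"], vocab)
print(encoded)
```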

In [117]:
# build vocabulary
word_counts = Counter(itertools.chain(*padded_sentences)) # word_counts = {'the': 10194, 'rock': 39, ...}

In [120]:
word_counts.most_common()[:10]

[('<PAD/>', 379718),
 ('the', 10194),
 (',', 10037),
 ('a', 7341),
 ('and', 6264),
 ('of', 6148),
 ('to', 4275),
 ('is', 3562),
 ("'s", 3544),
 ('it', 3428)]

In [118]:
# Mapping from index to word
vocabulary_inv = [x[0] for x in word_counts.most_common()]

In [122]:
vocabulary_inv[:10]

['<PAD/>', 'the', ',', 'a', 'and', 'of', 'to', 'is', "'s", 'it']

In [123]:
# Mapping from word to index
vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}

In [125]:
for i, item in enumerate(vocabulary.items()):
    if i > 5:
        break
    print(item)

('<PAD/>', 0)
('the', 1)
(',', 2)
('a', 3)
('and', 4)
('of', 5)


## 4 Map sentences and labels to index 

In [144]:
x = np.array([[vocabulary[word] for word in sentence] for sentence in padded_sentences])

In [146]:
print(x.shape)
print(x[0])

(10662, 56)
[    1   565     7  2633     6    22     1  3369   887     8   100  5598
     4    11    65     8   240     6    73     3  3913    57  2948    34
  1489  2393     2  2394 10111  1708  7197    42   937 10112     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0]


In [152]:
y = np.array(y)

In [154]:
y

array([[0, 1],
       [0, 1],
       [0, 1],
       ...,
       [1, 0],
       [1, 0],
       [1, 0]])

## 5 Shuffle data


In [155]:
# first, build a dict version of the inverse vocabulary (index -> word)
vocabulary_inv = {value: key for key, value in vocabulary.items()}

In [156]:
for i, item in enumerate(vocabulary_inv.items()):
    if i > 5:
        break
    print(item)

(0, '<PAD/>')
(1, 'the')
(2, ',')
(3, 'a')
(4, 'and')
(5, 'of')


In [157]:
y[:5]

array([[0, 1],
       [0, 1],
       [0, 1],
       [0, 1],
       [0, 1]])

In [158]:
y = y.argmax(axis=1)

In [159]:
print(y[:5]) # 1 means positive
print(y[-5:]) # 0 means negative

[1 1 1 1 1]
[0 0 0 0 0]


In [160]:
# shuffle setting
np.random.seed(10)

In [161]:
shuffle_indices = np.random.permutation(np.arange(len(y))) # len(y): 10662
shuffle_indices

array([ 7359,  5573, 10180, ...,  1344,  7293,  1289])

In [162]:
x_shuffled = x[shuffle_indices]
y_shuffled = y[shuffle_indices]
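
Indexing both arrays with the same permutation keeps every sentence paired with its label. The sketch below checks this on toy arrays; the data and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(10)  # fixed seed so the permutation is repeatable

toy_x = np.arange(12).reshape(6, 2)  # six "sentences" of two token ids each
toy_y = toy_x[:, 0] // 2             # label derived from each row, to check alignment

perm = rng.permutation(len(toy_y))
toy_x_shuffled = toy_x[perm]
toy_y_shuffled = toy_y[perm]

# Every shuffled label still belongs to its shuffled row.
aligned = np.array_equal(toy_y_shuffled, toy_x_shuffled[:, 0] // 2)
print(aligned)  # True
```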

## 6 Split train/test set

This step happens back in `train.py`, where the data is split into a training set and a test set.

In [163]:
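training_rate = 0.9
train_len = int(len(x_shuffled) * training_rate)

# split the shuffled arrays; the unshuffled ones are still ordered by class
x_train = x_shuffled[:train_len]
y_train = y_shuffled[:train_len]
x_test = x_shuffled[train_len:]
y_test = y_shuffled[train_len:]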
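
The sketch below shows why the split should be taken from shuffled data. `shuffle_and_split` is a hypothetical helper, not a function from `train.py`, and the toy arrays only mimic the class ordering of the real data.

```python
import numpy as np

def shuffle_and_split(x, y, training_rate=0.9, seed=10):
    """Shuffle x and y with one permutation, then cut at training_rate."""
    perm = np.random.RandomState(seed).permutation(len(y))
    x, y = x[perm], y[perm]
    train_len = int(len(y) * training_rate)
    return x[:train_len], y[:train_len], x[train_len:], y[train_len:]

# Toy data ordered like the notebook: all positives (1) first, then negatives (0).
toy_x = np.arange(20).reshape(20, 1)
toy_y = np.array([1] * 10 + [0] * 10)

# Without shuffling, the last 10% holds only negative examples.
unshuffled_test = toy_y[int(len(toy_y) * 0.9):]
print(unshuffled_test.tolist())  # [0, 0]

x_tr, y_tr, x_te, y_te = shuffle_and_split(toy_x, toy_y)
print(len(y_tr), len(y_te))  # 18 2
```

A test set cut from the unshuffled arrays would contain a single class, so accuracy measured on it would say little about the model.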