# Load Data

In [1]:
from pathlib import Path
import os
DATA_PATH = Path('./dat/')
DATA_PATH.mkdir(exist_ok =True)
#if not os.path.exists('./dat/aclImdb_v1.tar.gz'):
if not os.path.exists('./dat/aclImdb'):
    !curl -O http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    !tar -xf aclImdb_v1.tar.gz -C {DATA_PATH}

In [2]:
import numpy as np
CLASSES = ['neg' ,'pos'] #,'unsup']
PATH=Path('./dat/aclImdb/')
def get_texts(path):
  texts ,labels = [] ,[]
  for idx , label in enumerate(CLASSES):
    for fname in (path/label).glob('*.*'):
      #texts.append(fixup(fname.open('r',encoding='utf-8').read()))
      texts.append(fname.open('r', encoding='utf-8').read())
      labels.append(idx)
      #return np.array(texts), np.array(labels)
      return texts , labels

In [30]:
#from keras.datasets import imdb
#(train_text, train_labels), (test_text, test_labels) = imdb.load_data(num_words=10000)

In [4]:
train_text , train_labels =get_texts(PATH/'train')
test_text  , test_labels  =get_texts(PATH/'test')

In [5]:
from tensorflow.keras.preprocessing.text import Tokenizer
texts = train_text + test_text
texts[:10]

["Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly.",
 "Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I

# Important notes on Keras Tokenizer:

- As you can see, the UNK token is added with index = 1. Index 0 is never given.

This info is important when dealing later with Embedding layer in Keras, where padding with 0's might cause OOV if not considered, so we init that layer with vocab_size + 1 (to account for the missing 0). That layer also has option to mask_zero if needed for LSTM.

Another solution would be to -1 from all binarized values. But this would make padded words same as UNK.

In other words, it's a choice to have two different values for:

A- UNK-->1
B- PAD-->0

In this case, the vocab_sz should increase by 1 (for pad), in case the model needs padding like LSTM.

Or keep them both as 0. In this case, the vocab_sz remains the same as given from the tokenizer.

- If you fit a tokenizer wil small vocab using num_words, then take care that the word_index will still hold the full vocab.Then if you vectorize the text, either use tokenizer.texts_to_sequences, which will take care of the passed num_words,or if you develop ur own str2idx dict, then take care of the vocab not to use word_index as is, but limit it with the num_words, __keeping in mind that word_index is ordered by freqeuency__. A better approach is to build your own vocab


In [6]:
tok = Tokenizer(num_words= 100 , oov_token = 'UNK')
tok.fit_on_texts(texts)

In [7]:
len(tok.word_index)

185

Keras tokenizer can be used to vectorize or binarize words into integer indices (more on that later). It takes list of words

In [8]:
s = 'Hello World'
tok.texts_to_sequences(s)

[[1], [1], [1], [1], [1], [], [1], [1], [1], [1], [1]]

What happened above is that texts_to_sequences considered s as a list of chars, ALL OOV so unknowns

In [9]:
s = 'Hello World'
tok.texts_to_sequences(s.split())

[[1], [1]]

Hello is not in the training set

Let's try to vectorize a sentence with word beyond the first 100 most frequent words:

In [10]:
s = 'cartoon movie'
tok.texts_to_sequences(s.split())

[[1], [1]]

Although movie is in the word_index, but it's considered unknown with texts_to_sequences. So we have to take care of that when dealing the vocab.


Let's fit on all data and repeat

In [11]:
tok = Tokenizer(oov_token='UNK')
tok.fit_on_texts(texts)
s = 'cartoon movie'
l = tok.texts_to_sequences(s.split())
l

[[1], [113]]

Now the word cartoon is known

But the shape is not as expected!

In [12]:
np.array(l).shape

(2, 1)

We would expect (1,2), for 1 sentence and 2 words.

texts_to_sequences takes a list (char, words, sequences). So in the above case, it considers the 2 words as separate sentences.

To fix this:


In [13]:
l = tok.texts_to_sequences([s.split()])
l

[[1, 113]]

To get back to text = decode:

In [14]:
tok.sequences_to_texts(l)

['UNK movie']

Tokenizer can also be used for few more things:

- word counts: A dictionary of words and their counts.

- word docs: An integer count of the total number of documents that were used to fit the Tokenizer.

- word index: A dictionary of words and their uniquely assigned integers.

- document count: A dictionary of words and how many documents each appeared in.

In [15]:
from keras.preprocessing.text import Tokenizer
# define 5 documents
docs = ['Well done!',
'Good work',
'Great effort',
'nice work',
'Excellent!']
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)


OrderedDict([('well', 1), ('done', 1), ('good', 1), ('work', 2), ('great', 1), ('effort', 1), ('nice', 1), ('excellent', 1)])
5
{'work': 1, 'well': 2, 'done': 3, 'good': 4, 'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}
defaultdict(<class 'int'>, {'done': 1, 'well': 1, 'good': 1, 'work': 2, 'effort': 1, 'great': 1, 'nice': 1, 'excellent': 1})


# Padding

The vectors we obtain are generally not of equal lengths:

In [16]:
docs = ['cartoon movie show', 'hello world']
docs = [s.split() for s in docs]
l = tok.texts_to_sequences(docs)
l

[[1, 113, 1], [1, 1]]

For most supervised learning models, we need the design matrix X to have fixed number of columns and number of rows as the data size. RNNs are exceptions, but in most frameworks, it's also required to have fixed length vectors.

For that, we might need to pad the sequences to max len. For RNNs, the padded parts can be ignored later.



In [17]:
from keras.preprocessing.sequence import pad_sequences
maxlen = max([len(t) for t in texts])

l = np.array(pad_sequences(l,
                          maxlen=maxlen,
                          padding='post',
                          truncating='post'))
l.shape

(2, 900)

In [19]:
l

array([[  1, 113,   1, ...,   0,   0,   0],
       [  1,   1,   0, ...,   0,   0,   0]])

Notes:

- Padding can be performed on texts before or after binarization. However, it's better to be done after binarization since we pad 0's which won't be understood by the vocab during binarization

- The padded vector is very sparse! This is due to exceptionally long sentences --> outliers. So it's better to remove them, or even limit the max sentence length.

# Text features

So far, we have tranformed the text into binary/digital form that can be understood by ML models.

However, we can further apply or extract different features from the vectorized form.

In other words, we can represent the sequence of word indices we obtained in different forms.

# BoW with keras Tokenizer

We can use keras tokenizer to build a simple BoW.

The rows = number of sentences/documents

The #columns = number of words in the vocab

Each entry can encode different modes:

- binary: Whether or not each word is present in the document. This is the default.

- count: The count of each word in the document.

- tfidf: The Text Frequency-Inverse DocumentFrequency (TF-IDF) scoring for each word in the document (more on that later).

- freq: The frequency of each word as a ratio of words within each document.

In [21]:
tok = Tokenizer(oov_token='UNK')
tok.fit_on_texts(texts)
bow = tok.texts_to_matrix(texts[:10], mode='count')
bow.shape

(2, 186)

In [22]:
tok = Tokenizer(num_words=100, oov_token='UNK')
tok.fit_on_texts(texts)
bow = tok.texts_to_matrix(texts[:10], mode='count')
bow.shape

(2, 100)

In [23]:
bow

array([[ 0.,  9.,  4.,  7.,  3.,  2.,  3.,  2.,  1.,  2.,  2.,  1.,  1.,
         0.,  2.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,
         1.,  2.,  2.,  1.,  2.,  1.,  1.,  1.,  1.,  1.,  2.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,
         1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
         1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
         1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
         1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [ 0., 77.,  7.,  3.,  5.,  3.,  1.,  2.,  3.,  2.,  2.,  3.,  3.,
         4.,  1.,  2.,  2.,  3.,  3.,  3.,  3.,  3.,  1.,  1.,  1.,  1.,
         1.,  0.,  0.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  0.,  2.,  2.,
         2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0

# BoW with sklearn

The BoW model above can be also produced using sklearn

# CountVectorizer

The CountVectorizer provides a simple way to both tokenize a collection of text documents and build a vocabulary of known words

In [24]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(texts)
bow = vectorizer.transform(texts[:10])
bow.shape

(2, 178)

In [25]:
vectorizer.vocabulary_

{'story': 137,
 'of': 101,
 'man': 89,
 'who': 170,
 'has': 65,
 'unnatural': 158,
 'feelings': 48,
 'for': 51,
 'pig': 113,
 'starts': 134,
 'out': 109,
 'with': 174,
 'opening': 106,
 'scene': 122,
 'that': 142,
 'is': 76,
 'terrific': 140,
 'example': 46,
 'absurd': 1,
 'comedy': 29,
 'formal': 54,
 'orchestra': 108,
 'audience': 14,
 'turned': 155,
 'into': 75,
 'an': 4,
 'insane': 74,
 'violent': 163,
 'mob': 91,
 'by': 19,
 'the': 143,
 'crazy': 33,
 'chantings': 22,
 'it': 77,
 'singers': 131,
 'unfortunately': 157,
 'stays': 135,
 'whole': 171,
 'time': 151,
 'no': 98,
 'general': 59,
 'narrative': 96,
 'eventually': 45,
 'making': 88,
 'just': 78,
 'too': 154,
 'off': 102,
 'putting': 117,
 'even': 44,
 'those': 150,
 'from': 57,
 'era': 43,
 'should': 128,
 'be': 15,
 'cryptic': 34,
 'dialogue': 35,
 'would': 175,
 'make': 87,
 'shakespeare': 127,
 'seem': 124,
 'easy': 41,
 'to': 152,
 'third': 148,
 'grader': 62,
 'on': 103,
 'technical': 138,
 'level': 84,
 'better': 17,
 

In [26]:
bow

<2x178 sparse matrix of type '<class 'numpy.int64'>'
	with 202 stored elements in Compressed Sparse Row format>

In [27]:
bow = bow.toarray()
print(bow.shape)
print(bow)

(2, 178)
[[0 2 0 0 1 1 0 0 0 0 0 0 0 0 1 2 0 1 1 2 1 0 1 0 0 1 0 0 0 1 0 0 0 1 1 1
  0 0 0 0 0 1 0 1 1 1 1 0 1 0 0 1 0 0 1 1 1 1 2 1 0 1 1 1 0 1 0 0 0 0 0 0
  0 0 1 1 2 4 1 0 0 1 0 0 1 0 0 1 1 1 1 1 0 0 0 0 1 0 1 0 0 3 2 1 0 0 1 0
  1 1 0 0 0 1 0 0 0 1 0 0 0 1 1 0 1 1 0 1 1 0 0 1 1 1 1 1 0 1 1 0 1 1 1 4
  0 0 1 0 1 0 1 1 1 0 1 2 0 1 1 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 3 1 1 1]
 [3 0 1 2 1 3 1 1 1 4 1 1 1 1 0 2 1 1 0 1 0 3 0 2 1 0 2 1 1 0 1 4 1 0 0 0
  2 1 1 1 1 0 1 0 0 0 0 1 0 1 1 1 1 1 0 0 0 2 0 0 2 0 0 0 1 1 1 2 1 1 1 1
  2 1 0 0 3 2 1 1 1 0 3 1 0 1 1 0 0 0 0 0 1 1 1 1 0 1 3 2 1 5 2 1 1 1 0 1
  0 1 1 1 1 0 1 1 1 0 1 1 1 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 1 0 0 1 1 2 0 7
  1 1 0 1 0 1 0 1 3 1 0 0 1 0 0 1 2 2 0 0 1 1 3 1 1 2 1 0 1 1 1 0 0 0]]


Notes:

- We don't have control on the vocab size unlike keras

- The matrix is (CSR) (Compressed Sparse Row), which is more compact representation for matrices with many zeros. The BoW is very sparse, it's a good way to represent it.
You can easily recover the array when needed using `toarray()`

- Vocab is all lower case, and punctuation is ignored. All those are configurable from sklearn