## Text Pre-Processing

In this notebook we'll look at the most basic approach to text: Bag of Words.
Let's start by loading a text dataset. We'll use Keras's built-in IMDB movie reviews.

https://keras.io/datasets/
### From the Keras Docs:

>Dataset of 25,000 movies reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".


So Keras has already done most of the preprocessing for us: each review arrives as a list of word indexes, which is only a short step away from a bag-of-words vector. All we have to do is see how well that actually works out.

Before we start: the original data is available here, if you want to practice working with raw text directly:

http://ai.stanford.edu/~amaas/data/sentiment/

In [1]:
from keras.datasets import imdb
import numpy as np

In [2]:
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz




We import the dataset and load it into the familiar train and test sets. Note the keyword argument `num_words=10000`: it means that we keep only the 10,000 most frequently used words in the dataset. The logic is that rarely used words aren't going to help much in classifying the reviews as positive or negative.
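To make the effect of `num_words` concrete, here is a small sketch in plain Python. It is a toy mimic, not the actual Keras implementation: the helper `cap_vocabulary` and the made-up review are assumptions for illustration. It relies on the Keras default that out-of-vocabulary words are replaced by the index 2.

```python
# Toy mimic of the num_words filter; NOT the actual Keras code.
OOV_INDEX = 2  # Keras' default index for out-of-vocabulary words


def cap_vocabulary(sequence, num_words):
    """Replace every word index >= num_words with the out-of-vocabulary index."""
    return [w if w < num_words else OOV_INDEX for w in sequence]


# A made-up encoded review; two of its words are rarer than the top 10,000.
toy_review = [1, 14, 22, 9837, 15000, 43, 25000]
capped = cap_vocabulary(toy_review, num_words=10000)
print(capped)  # [1, 14, 22, 9837, 2, 43, 2]
```

This is why the index 2 shows up in the real reviews: every 2 marks a word that fell outside the vocabulary we kept.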

In [3]:
print ("Train shape : ",train_data.shape)
print ("Test shape : ", test_data.shape)

Train shape :  (25000,)
Test shape :  (25000,)


Note that the training data is a 1-tensor: a one-dimensional array with one entry per review, where each entry is just a Python list of word indexes (and the lists have different lengths). Let's look at a sample.
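If the shape `(25000,)` looks odd, the following sketch may help. It builds a tiny stand-in for `train_data` (the name `toy_reviews` and its contents are made up) to show how an array of variable-length lists ends up with only one dimension.

```python
import numpy as np

# A 1-D object array whose entries are Python lists of different lengths,
# similar in structure to train_data.
toy_reviews = np.empty(3, dtype=object)
toy_reviews[0] = [1, 14, 22]
toy_reviews[1] = [1, 194, 1153, 194, 8255]
toy_reviews[2] = [1, 4]

print(toy_reviews.shape)               # (3,)
print([len(r) for r in toy_reviews])   # [3, 5, 2]
```

Because the reviews have different lengths, NumPy cannot stack them into a rectangular 2-D array, which is one reason we will need to vectorize them before training.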

In [12]:
len(train_data[1])

189

In [13]:
train_data[1]

[1,
 194,
 1153,
 194,
 8255,
 78,
 228,
 5,
 6,
 1463,
 4369,
 5012,
 134,
 26,
 4,
 715,
 8,
 118,
 1634,
 14,
 394,
 20,
 13,
 119,
 954,
 189,
 102,
 5,
 207,
 110,
 3103,
 21,
 14,
 69,
 188,
 8,
 30,
 23,
 7,
 4,
 249,
 126,
 93,
 4,
 114,
 9,
 2300,
 1523,
 5,
 647,
 4,
 116,
 9,
 35,
 8163,
 4,
 229,
 9,
 340,
 1322,
 4,
 118,
 9,
 4,
 130,
 4901,
 19,
 4,
 1002,
 5,
 89,
 29,
 952,
 46,
 37,
 4,
 455,
 9,
 45,
 43,
 38,
 1543,
 1905,
 398,
 4,
 1649,
 26,
 6853,
 5,
 163,
 11,
 3215,
 2,
 4,
 1153,
 9,
 194,
 775,
 7,
 8255,
 2,
 349,
 2637,
 148,
 605,
 2,
 8003,
 15,
 123,
 125,
 68,
 2,
 6853,
 15,
 349,
 165,
 4362,
 98,
 5,
 4,
 228,
 9,
 43,
 2,
 1157,
 15,
 299,
 120,
 5,
 120,
 174,
 11,
 220,
 175,
 136,
 50,
 9,
 4373,
 228,
 8255,
 5,
 2,
 656,
 245,
 2350,
 5,
 4,
 9837,
 131,
 152,
 491,
 18,
 2,
 32,
 7464,
 1212,
 14,
 9,
 6,
 371,
 78,
 22,
 625,
 64,
 1382,
 9,
 8,
 168,
 145,
 23,
 4,
 1690,
 15,
 16,
 4,
 1355,
 5,
 28,
 6,
 52,
 154,
 462,
 33,
 89,
 78,
 2

## Decoding the reviews

It's good to decode the reviews, just to have some idea what you are working with.

In [6]:
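# Maps each word to its frequency rank (1 = most frequent word).
word_index = imdb.get_word_index()
# Invert the mapping so we can look up words by index.
reverse_word_index = {value: key for (key, value) in word_index.items()}
def decode_review(index):
    # Indexes are offset by 3 because 0, 1 and 2 are reserved for
    # "padding", "start of sequence" and "unknown".
    decoded_review = " ".join([reverse_word_index.get(i - 3, "?") for i in train_data[index]])
    print(decoded_review)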

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


### One Positive Review

In [7]:
decode_review(0)

? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you thi

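A "?" means that we don't have that word because it is not in our top 10,000. The very first "?" of each review is the special start-of-sequence marker rather than a real word.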
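Here is a small sketch of the decoding logic that runs without downloading anything. The vocabulary `toy_word_index` is hypothetical; only the offset of 3 and the reserved indexes 1 (start) and 2 (unknown) follow the Keras defaults.

```python
# Hypothetical four-word vocabulary, ranked by frequency like imdb.get_word_index().
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}
toy_reverse = {idx: word for word, idx in toy_word_index.items()}


def decode(sequence):
    """Map encoded indexes back to words; reserved indexes become '?'."""
    return " ".join(toy_reverse.get(i - 3, "?") for i in sequence)


# 1 = start marker, 2 = unknown word, everything else is (rank + 3).
encoded = [1, 4, 5, 6, 2, 7]
decoded = decode(encoded)
print(decoded)  # ? the film was ? brilliant
```

The start marker and the unknown word both fall through to the `"?"` default, which matches what we see in the real reviews.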

In [8]:
train_labels[0]

1

### One Negative Review

In [9]:
decode_review(100)

? i am a great fan of david lynch and have everything that he's made on dvd except for hotel room the 2 hour twin peaks movie so when i found out about this i immediately grabbed it and and what is this it's a bunch of ? drawn black and white cartoons that are loud and foul mouthed and unfunny maybe i don't know what's good but maybe this is just a bunch of crap that was ? on the public under the name of david lynch to make a few bucks too let me make it clear that i didn't care about the foul language part but had to keep ? the sound because my neighbors might have all in all this is a highly disappointing release and may well have just been left in the ? box set as a curiosity i highly recommend you don't spend your money on this 2 out of 10


In [10]:
train_labels[100]

0

### A bit more preprocessing required

We can't feed lists of numbers of varying length directly into our network. We'll need to vectorize them into multi-hot encoded vectors (one-hot encoding applied to every word in the review), where a position gets a 1 if its word appears in the review. This means that each review will be a vector of length 10,000 with 1's in the places of the vocabulary words it contains.

In [11]:
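def vectorize_sequences(sequences, dimension=10000):
    # One row per review, one column per vocabulary word, all zeros to start.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Fancy indexing: set every column listed in `sequence` to 1 at once.
        results[i, sequence] = 1.
    return results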
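To see what this function does, it can help to run it on a tiny input first. The sketch below repeats the function so it runs on its own; the two toy sequences and the vocabulary size of 6 are made up for illustration.

```python
import numpy as np


def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results


# Two made-up "reviews" over a 6-word vocabulary. Word 3 appears twice
# in the first review but still only produces a single 1.
toy_sequences = [[1, 3, 3, 5], [0, 2]]
toy_vectors = vectorize_sequences(toy_sequences, dimension=6)
print(toy_vectors)
# [[0. 1. 0. 1. 0. 1.]
#  [1. 0. 1. 0. 0. 0.]]
```

Note that both word order and repetition are lost: the vector only records which words are present.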

In [14]:
train_data_vectors = vectorize_sequences(train_data)

In [15]:
train_data_vectors.shape

(25000, 10000)

In [16]:
print(train_data_vectors[0])
print(len(train_data_vectors[0]))

[0. 1. 1. ... 0. 0. 0.]
10000


In [17]:
test_data_vectors = vectorize_sequences(test_data)

## Time to build a model!

Since you are such experts at this part, I will let you struggle to make your own now!

But I will start it for you.

I suggest you use a sigmoid activation on your final layer.  And think about what kind of final layer we need.  What is the output size?
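As a hint for why sigmoid fits here, the sketch below shows in plain Python what a single sigmoid output unit computes and how binary cross-entropy scores it. The function names and the example inputs are my own for illustration; Keras does all of this for you internally.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(label, prob):
    """Loss for one sample: small when prob agrees with the 0/1 label."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


for z in (-3.0, 0.0, 3.0):
    p = sigmoid(z)              # read as "probability the review is positive"
    predicted = int(p >= 0.5)   # threshold to get a 0/1 prediction
    print(z, round(p, 3), predicted, round(binary_crossentropy(1, p), 3))
```

One number between 0 and 1 is enough to choose between two classes, which is why the final layer needs only a single unit.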

In [18]:
from keras import models
from keras import layers

In [33]:
model = models.Sequential()
model.add(layers.Dense(128, activation='relu', input_shape = (10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='RMSProp',
             loss = 'binary_crossentropy',
             metrics = ['accuracy'])

model.fit(train_data_vectors, train_labels, epochs = 3, batch_size = 128)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x1b65b908df0>

In [34]:
model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_14 (Dense)             (None, 128)               1280128   
_________________________________________________________________
dense_15 (Dense)             (None, 64)                8256      
_________________________________________________________________
dense_16 (Dense)             (None, 16)                1040      
_________________________________________________________________
dense_17 (Dense)             (None, 1)                 17        
Total params: 1,289,441
Trainable params: 1,289,441
Non-trainable params: 0
_________________________________________________________________
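The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-output pair plus one bias per output. Here is a short sketch of that arithmetic (the helper `dense_param_counts` is my own, not a Keras function).

```python
def dense_param_counts(layer_sizes):
    """Parameters per Dense layer: inputs * outputs weights + outputs biases."""
    return [n_in * n_out + n_out
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]


# Input of 10,000 features, then Dense layers of 128, 64, 16 and 1 units.
first_model = dense_param_counts([10000, 128, 64, 16, 1])
print(first_model)       # [1280128, 8256, 1040, 17]
print(sum(first_model))  # 1289441
```

Almost all of the parameters sit in the first layer, because that is where the 10,000-wide input gets multiplied out.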


In [35]:
test_loss, test_acc = model.evaluate(test_data_vectors, test_labels)
print('test_acc:', test_acc)

test_acc: 0.8763599991798401


## A few points

OK, firstly, we seem to have done MUCH better on the training data than on the test data.
This is called "overfitting" and is a major topic in the next unit.

Aside from that, I'm curious: if we turn our vectors into **counts** of the bag of words, would we do better?  Right now it doesn't matter whether a word appears once or many times in a review; the vector just records a 1.

This would require us to modify our vectorizing code -- so let's do that.

In [36]:
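def vectorize_sequences(sequences, dimension=10000):
    # Redefines the earlier function: store word counts instead of 0/1 flags.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        for j in sequence:
            results[i, j] += 1.  # one increment per occurrence of word j
    return results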
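The nested loop above is easy to read but slow on 25,000 reviews. As an optional alternative, here is a sketch that uses `np.bincount` to count each review in one call. The name `count_sequences` and the toy data are my own, and it assumes every index is a non-negative integer below `dimension`.

```python
import numpy as np


def count_sequences(sequences, dimension=10000):
    """Count-based bag of words; assumes all indexes are in [0, dimension)."""
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # bincount returns how often each index occurs, padded to `dimension`.
        results[i] = np.bincount(sequence, minlength=dimension)
    return results


# Two made-up "reviews" over a 6-word vocabulary.
toy_sequences = [[1, 3, 3, 5], [0, 2, 2, 2]]
toy_counts = count_sequences(toy_sequences, dimension=6)
print(toy_counts)
# [[0. 1. 0. 2. 0. 1.]
#  [1. 0. 3. 0. 0. 0.]]
```

Compared with the 0/1 version, repeated words now show up as 2 and 3 instead of 1.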


In [37]:
train_data_vector_counts = vectorize_sequences(train_data)
test_data_vector_counts = vectorize_sequences(test_data)

In [38]:
print(train_data_vector_counts.shape)
train_data_vector_counts[0]

(25000, 10000)


array([0., 1., 6., ..., 0., 0., 0.])

## Ok, let's try with our count vectors

In [39]:
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape = (10000,)))
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='RMSProp',
             loss = 'binary_crossentropy',
             metrics = ['accuracy'])

model.fit(train_data_vector_counts, train_labels, epochs = 5, batch_size = 128)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1b65c081610>

In [40]:
test_loss, test_acc = model.evaluate(test_data_vector_counts, test_labels)
print('test_acc:', test_acc)

test_acc: 0.8694800138473511


## Your thoughts please

Better? Worse?

## TFIDF

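Let's go for broke and apply TF-IDF (term frequency, inverse document frequency) to our count matrix. TF-IDF down-weights words that appear in many reviews, so that rarer, more distinctive words carry more weight.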
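Before handing the work to scikit-learn, here is a hand-rolled sketch of the computation on a toy count matrix, using only NumPy. The formula is what I understand `TfidfTransformer` to use with its default settings (smoothed idf, then L2-normalized rows); treat that correspondence as an assumption and check the scikit-learn docs if it matters. The toy matrix is made up.

```python
import numpy as np

# Toy count matrix: 3 "reviews" (rows) over a 3-word vocabulary (columns).
# Word 0 appears in every review; words 1 and 2 appear in one review each.
toy_counts = np.array([
    [3.0, 0.0, 1.0],
    [2.0, 0.0, 0.0],
    [3.0, 1.0, 0.0],
])

n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)           # reviews containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed inverse document frequency
weighted = toy_counts * idf                       # term frequency times idf
# Scale each row to unit length so long and short reviews are comparable.
toy_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(idf)
print(toy_tfidf.round(3))
```

The word that appears in every review gets the smallest idf, so the rarer words gain weight relative to it even when their raw counts are lower.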


In [41]:
from sklearn.feature_extraction.text import TfidfTransformer

In [42]:
transformer = TfidfTransformer()

In [43]:
train_tfidf = transformer.fit_transform(train_data_vector_counts)
test_tfidf = transformer.transform(test_data_vector_counts)

In [44]:
train_tfidf = train_tfidf.toarray()
test_tfidf = test_tfidf.toarray()

In [45]:
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape = (10000,)))
model.add(layers.Dense(8, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='RMSProp',
             loss = 'binary_crossentropy',
             metrics = ['accuracy'])

model.fit(train_tfidf, train_labels, epochs = 4, batch_size = 128)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x1b7596ffac0>

In [46]:
test_loss, test_acc = model.evaluate(test_tfidf, test_labels)
print('test_acc:', test_acc)

test_acc: 0.8873599767684937


In [47]:
model.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_22 (Dense)             (None, 16)                160016    
_________________________________________________________________
dense_23 (Dense)             (None, 8)                 136       
_________________________________________________________________
dense_24 (Dense)             (None, 1)                 9         
Total params: 160,161
Trainable params: 160,161
Non-trainable params: 0
_________________________________________________________________


your answer here: