
We import the IMDB dataset (25,000 training reviews and 25,000 test reviews) and keep only the 10,000 most frequent words.<br>
X = reviews, each tokenized into a sequence of integer indexes, one index per token<br>
Y = label marking the review as negative or positive<br>
We normalize each review to at most 256 tokens; shorter reviews are padded with 0 at the end.<br>

Then we use the Embedding layer of the Keras library to map each word (token) to a vector of length 32.<br>
Next we flatten the resulting 256x32 matrix with the Keras Flatten layer and feed it to a neural network.

In [23]:
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
sess = tf.Session(config=config)
set_session(sess)

In [2]:
# pad_sequences brings every token sequence to the same fixed length, which the model requires as input
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, Dropout, Flatten
from keras.datasets import imdb

In [3]:
num_words = 10000 # num_words: integer or None. Top most frequent words to consider. Any less frequent word will appear as oov_char value in the sequence data
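
The effect of `num_words` can be sketched in plain Python. This is only an illustration, not the Keras implementation: the helper `cap_vocabulary` and the sample review are made up, and `oov_char=2` is assumed because it is the Keras default.

```python
# Sketch of vocabulary capping: indexes at or above num_words become oov_char.
# cap_vocabulary is a hypothetical helper; oov_char=2 is the assumed Keras default.
def cap_vocabulary(sequence, num_words, oov_char=2):
    return [idx if idx < num_words else oov_char for idx in sequence]

sample_review = [1, 14, 22, 9999, 10000, 25000]  # invented token indexes
capped = cap_vocabulary(sample_review, num_words=10000)
print(capped)  # [1, 14, 22, 9999, 2, 2]
```

Rare words are therefore not dropped; they all collapse into the single out-of-vocabulary index.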

In [4]:
# Dataset of 25000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been
# preprocessed, and each review is encoded as a sequence of word indexes (integers)
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=num_words)
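
To read an encoded review back as text, the indexes have to be shifted, because Keras reserves the first indexes (0 for padding, 1 for the start marker, 2 for out-of-vocabulary words) and offsets real words by 3 by default. The sketch below assumes those defaults and uses a tiny invented dictionary in place of `imdb.get_word_index()`, so it runs without downloading anything.

```python
INDEX_FROM = 3  # assumed Keras default offset: 0 = padding, 1 = start, 2 = out-of-vocabulary

# Hypothetical stand-in for imdb.get_word_index(), which maps word -> frequency rank.
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

def decode_review(encoded, word_index, index_from=INDEX_FROM):
    # Build the reverse mapping (shifted index -> word) and add the reserved tokens.
    reverse = {rank + index_from: word for word, rank in word_index.items()}
    reverse.update({0: "<pad>", 1: "<start>", 2: "<oov>"})
    return " ".join(reverse.get(i, "<oov>") for i in encoded)

decoded = decode_review([1, 4, 5, 6, 7, 2], toy_word_index)
print(decoded)  # <start> the movie was great <oov>
```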

In [5]:
print(len(X_train))
print(len(X_test))

25000
25000


In [6]:
max_len = 256 # maximum sequence length in tokens; longer sequences are truncated, shorter ones are padded
embedding_size = 32 # size of the vector that represents each word
batch_size = 128

In [7]:
pad = 'post'  # pad and truncate at the end of each sequence; the alternative is 'pre'

In [8]:
# bring all sequences to the same length of 256: truncate longer ones, fill shorter ones with zeros
X_train_pad = pad_sequences(X_train, maxlen=max_len, padding=pad, truncating=pad)
X_test_pad = pad_sequences(X_test, maxlen=max_len, padding=pad, truncating=pad)
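
What `padding='post'` and `truncating='post'` do to a single review can be sketched in a few lines of plain Python. The helper `pad_post` is hypothetical and handles one sequence only; it is meant to show the behaviour, not to replace `pad_sequences`.

```python
def pad_post(sequence, maxlen, value=0):
    # truncating='post': keep the first maxlen tokens, drop the rest
    truncated = sequence[:maxlen]
    # padding='post': append zeros until the sequence has maxlen entries
    return truncated + [value] * (maxlen - len(truncated))

short_review = pad_post([5, 8, 13], maxlen=5)
long_review = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_review)  # [5, 8, 13, 0, 0]
print(long_review)   # [1, 2, 3, 4, 5]
```

With `'pre'` the zeros would be added at the front and the truncation would drop the first tokens instead.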

In [9]:
X_train_pad.shape

(25000, 256)

In [10]:
X_train.shape

(25000,)

In [11]:
model = Sequential()

**Embedding layer vs Word2vec and GloVe**<br><br>

Embeddings (in general, not only in Keras) are methods for learning vector representations of categorical data. They are most commonly used with textual data. Word2vec and GloVe are two popular frameworks for learning word embeddings. An embedding learns to map one-hot encoded categorical variables to vectors of floating point numbers with a smaller dimensionality than the input vectors. For example, a one-hot vector representing a word from a vocabulary of size 50,000 is mapped to a real-valued vector of size 100. The embedding vector can then be used as a feature vector for any downstream task.<br>
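
The mapping from a one-hot vector to a dense vector is equivalent to picking one row of a weight matrix. The sketch below illustrates this with a made-up table of 4 tokens and vectors of length 3; the helpers `one_hot` and `matvec` are hypothetical and written in plain Python.

```python
# Toy embedding table: 4 tokens, vectors of length 3 (all values are invented).
embedding_table = [
    [0.1, 0.2, 0.3],
    [0.0, -0.5, 0.4],
    [0.7, 0.1, -0.2],
    [-0.3, 0.9, 0.0],
]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(vector, matrix):
    # Multiply a row vector by a matrix, column by column.
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

token = 2
via_one_hot = matvec(one_hot(token, 4), embedding_table)  # full matrix product
via_lookup = embedding_table[token]                       # plain row lookup
print(via_one_hot == via_lookup)  # True
```

This is why an embedding layer can take integer indexes directly: the row lookup gives the same result as the matrix product without ever building the one-hot vector.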

The difference lies in how Word2vec is trained compared to the "usual" learned embedding layers. Word2vec is trained to predict whether a word belongs to a context given the other words, e.g. to tell whether "milk" is a likely word given the sentence beginning "The cat was drinking...". By doing so, we expect Word2vec to learn something about the language, as in the quote "You shall know a word by the company it keeps" by John Rupert Firth. Using the above example, Word2vec learns that "cat" is likely to appear together with "milk", but also with "house" or "pet", so it is somewhat similar to "dog". As a consequence, embeddings created by Word2vec or similar models learn to represent words with similar meanings using similar vectors.<br>
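
"Similar vectors" is usually measured with cosine similarity. The sketch below uses three invented 3-dimensional vectors (real Word2vec vectors have far more dimensions and come from training, not from hand-picked numbers) to show how such a comparison could look.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors, chosen only to illustrate the idea.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(sim_cat_dog > sim_cat_car)  # True
```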

On the other hand, with embeddings learned as a layer of a neural network, the network may be trained to predict whatever you want. For example, you can train your network to predict the sentiment of a text. In that case, the embeddings learn features that are relevant for this particular problem. As a side effect, they may also learn some general things about the language, but the network is not optimized for that task. Using the "cat" example, embeddings trained for sentiment analysis may learn that "cat" and "dog" are similar because people often say nice things about their pets.<br>

In practical terms, you can use pretrained Word2vec embeddings as features for any neural network (or other algorithm). They can give you an advantage if your dataset is small, since the pretrained embeddings were trained on large volumes of text. On the other hand, there are examples showing that learning the embeddings from your own data, optimized for a particular problem, may be more efficient (Qi et al., 2018).<br>
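
Using pretrained vectors usually starts with building a weight matrix whose row `i` holds the vector of the token with index `i`. The sketch below assumes a hypothetical dictionary `pretrained` (in practice loaded from a Word2vec or GloVe file) and a hypothetical `token_index`; words without a pretrained vector keep an all-zero row. This notebook does not do this, it learns the embedding from scratch.

```python
# Hypothetical pretrained vectors; real ones would be read from a file.
pretrained = {"good": [0.5, 0.1], "bad": [-0.4, 0.2]}
# Hypothetical token indexes; index 0 is reserved for padding.
token_index = {"good": 1, "bad": 2, "meh": 3}
dim = 2

# One row per index, initialised to zeros.
embedding_matrix = [[0.0] * dim for _ in range(len(token_index) + 1)]
for word, idx in token_index.items():
    vector = pretrained.get(word)
    if vector is not None:  # words missing from the pretrained set stay at zero
        embedding_matrix[idx] = list(vector)

print(embedding_matrix)  # [[0.0, 0.0], [0.5, 0.1], [-0.4, 0.2], [0.0, 0.0]]
```

Such a matrix could then serve as the initial weights of an embedding layer, either frozen or fine-tuned during training.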

In [12]:
# add the embedding layer
# input_dim : number of words, output_dim = size of vector for each word
# input_length : length of each sequence (input review), name = name of layer
model.add(Embedding(input_dim=num_words,
                    output_dim=embedding_size,
                    input_length=max_len,
                    name='layer_embedding'))



In [13]:
model.add(Flatten())
# Flatten reshapes the embedding output before it is fed to the dense hidden layer:
# each review is a 256x32 matrix (one 32-dimensional vector per token) and becomes a single vector of length 8192
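
The reshaping done by the Flatten layer can be sketched in plain Python: the rows of the 256x32 matrix are simply concatenated. The helper `flatten` is hypothetical and works on one review at a time.

```python
def flatten(matrix):
    # Concatenate the rows of a 2D list into one flat list, row by row.
    return [value for row in matrix for value in row]

small = flatten([[1, 2], [3, 4]])
print(small)  # [1, 2, 3, 4]

seq_len, emb_size = 256, 32
embedded_review = [[0.0] * emb_size for _ in range(seq_len)]  # placeholder values
flat = flatten(embedded_review)
print(len(flat))  # 8192
```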

In [14]:
model.add(Dense(units=250, activation='relu')) # add first hidden layer with relu and 250 nodes

In [15]:
model.add(Dropout(0.5)) # dropout regularization: randomly zeroes 50% of the activations during training to reduce overfitting
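
A minimal sketch of what dropout does during training, assuming the common "inverted dropout" formulation: each activation is zeroed with probability `rate`, and the survivors are scaled by `1 / (1 - rate)` so the expected sum stays the same. The function `dropout` below is a hypothetical plain-Python illustration, not the Keras layer.

```python
import random

def dropout(activations, rate, rng):
    # Survivors are scaled up so the expected value of each unit is unchanged.
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
activations = [1.0] * 10
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)  # every entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
```

At inference time dropout is switched off and the activations pass through unchanged.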



In [16]:
# we have binary classification so we use sigmoid for last layer
model.add(Dense(1, activation='sigmoid'))

In [17]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
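
The sigmoid output and the `binary_crossentropy` loss can be written out by hand to see what the optimizer minimises. This is a plain-Python sketch with invented inputs; the clipping constant `eps` is an assumption added here to avoid `log(0)`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep the logarithms finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Two invented pre-activation values: one positive review, one negative review.
probs = [sigmoid(z) for z in (2.0, -1.0)]
loss = binary_crossentropy([1, 0], probs)
print(round(loss, 2))  # 0.22
```

The loss is small when the predicted probability is close to the true label and grows quickly for confident wrong predictions.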

In [18]:
model.summary() # prints each layer of the feed-forward model with its output shape and parameter count

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
layer_embedding (Embedding)  (None, 256, 32)           320000    
_________________________________________________________________
flatten_1 (Flatten)          (None, 8192)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 250)               2048250   
_________________________________________________________________
dropout_1 (Dropout)          (None, 250)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 251       
Total params: 2,368,501
Trainable params: 2,368,501
Non-trainable params: 0
_________________________________________________________________
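
The parameter counts in the summary can be checked by hand: the embedding holds one 32-dimensional vector per word, and each dense layer has one weight per input-output pair plus one bias per output. The variable names below are chosen for this sketch only.

```python
num_words, embedding_size, max_len, hidden_units = 10000, 32, 256, 250

embedding_params = num_words * embedding_size             # 320000
flat_size = max_len * embedding_size                      # 8192 inputs after Flatten
dense_1_params = flat_size * hidden_units + hidden_units  # weights + biases = 2048250
dense_2_params = hidden_units * 1 + 1                     # weights + bias = 251

total_params = embedding_params + dense_1_params + dense_2_params
print(total_params)  # 2368501
```

Most of the parameters sit in the first dense layer, because flattening turns every review into 8192 input features.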


In [19]:
%%time
model.fit(X_train_pad, y_train, epochs=5, validation_data=(X_test_pad, y_test), batch_size=batch_size)

Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
CPU times: user 3min 27s, sys: 10.9 s, total: 3min 38s
Wall time: 1min 14s


<keras.callbacks.History at 0x7f505fddcdd8>

In [20]:
%%time
eval_ = model.evaluate(X_test_pad, y_test)

CPU times: user 7.15 s, sys: 188 ms, total: 7.34 s
Wall time: 3.25 s


In [22]:
print(eval_[0], eval_[1]) # loss / accuracy

0.5925464714813232 0.84696
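
The reported accuracy is the share of reviews whose predicted probability falls on the correct side of a threshold. The sketch below assumes a threshold of 0.5 and uses four invented predictions; the helper `accuracy` is hypothetical.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    # Turn sigmoid probabilities into hard 0/1 labels, then count the matches.
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(p == y) for p, y in zip(predictions, y_true))
    return correct / len(y_true)

acc = accuracy([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])
print(acc)  # 0.5
```

An accuracy of 0.84696 on 25000 test reviews corresponds to 21174 correctly classified reviews.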
