# Learn Word Embedding

In [1]:
%matplotlib inline

In [2]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [3]:
# Load the movie review dataset (columns: review text and sentiment label)
df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(3)

Unnamed: 0,review,sentiment
0,I went and saw this movie last night after bei...,1
1,Actor turned director Bill Paxton follows up h...,1
2,As a recreational golfer with some knowledge o...,1


In [4]:
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

The word embeddings for our dataset can be learned while training a neural network on the classification problem. Before the text can be fed to the network, it must be encoded so that each word is represented by a unique integer. This preparation step can be done with the Tokenizer API provided with Keras. Because reviews differ in length, we pad (or truncate) every sequence to the same length, max_length. The code below converts the text to integer indices that are ready to be used by the Keras Embedding layer.

In [5]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


tokenizer_obj = Tokenizer()
# Concatenate the two arrays; `+` on NumPy string arrays would join reviews element-wise
total_reviews = np.concatenate([X_train, X_test])
tokenizer_obj.fit_on_texts(total_reviews)

# length to which every sequence is padded or truncated
max_length = 100  # try other options, such as the mean review length
# vocabulary size: +1 because index 0 is reserved for padding
vocab_size = len(tokenizer_obj.word_index) + 1

X_train_tokens =  tokenizer_obj.texts_to_sequences(X_train)
X_test_tokens = tokenizer_obj.texts_to_sequences(X_test)


X_train_pad = pad_sequences(X_train_tokens, maxlen=max_length, padding='post')
X_test_pad = pad_sequences(X_test_tokens, maxlen=max_length, padding='post')
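
To make the encoding step above concrete, here is a minimal pure-Python sketch of the same idea on two toy sentences. It is not the Keras implementation: the helper names (`fit_word_index`, `encode`, `pad_post`) are hypothetical, and the sketch only lowercases and splits on whitespace, whereas the real Tokenizer also filters punctuation. It does mirror two behaviours of the cell above: more frequent words get smaller indices starting at 1, and sequences longer than the target length lose tokens from the front before zeros are appended at the end.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer rank (1 = most frequent); 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Stable sort by descending count; ties keep first-seen order.
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}

def encode(text, word_index):
    """Replace each known word with its integer index; unknown words are dropped."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

def pad_post(seq, maxlen):
    """Keep the last `maxlen` tokens, then append zeros up to `maxlen`."""
    seq = seq[-maxlen:]
    return seq + [0] * (maxlen - len(seq))

texts = ["a great movie", "a dull movie with a dull plot"]
word_index = fit_word_index(texts)
padded = [pad_post(encode(t, word_index), maxlen=5) for t in texts]
vocab_size = len(word_index) + 1  # +1 for the padding index 0

print(word_index)
print(padded)
print(vocab_size)
```

The short review is padded with trailing zeros, while the long one is cut down to its last five tokens, which is worth keeping in mind when choosing max_length.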



In [6]:
print(vocab_size)

125602


We are now ready to define our neural network model. The model will use an Embedding layer as the first hidden layer. The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset during training of the model.
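
Conceptually, an embedding layer is a trainable lookup table with one row per vocabulary index. The NumPy sketch below illustrates only the forward lookup, using toy sizes rather than this notebook's values; the variable names are illustrative, and the training updates that Keras applies to the table are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, max_length = 7, 4, 5  # toy sizes (assumed for illustration)

# One row per vocabulary index, initialised with small random values.
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# A batch of two padded integer sequences (0 is the padding index).
batch = np.array([[1, 4, 2, 0, 0],
                  [2, 5, 1, 3, 6]])

# The forward pass is a row lookup: every integer selects its vector.
embedded = embedding_matrix[batch]
print(embedded.shape)  # (batch_size, max_length, embedding_dim)
```

This is why the layer's output has one more dimension than its input: each of the max_length integers is replaced by a vector of length embedding_dim.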

In [9]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM

EMBEDDING_DIM = 100

print('Build model...')

model = Sequential()
model.add(Embedding(vocab_size, EMBEDDING_DIM, input_length=max_length))
model.add(LSTM(units=32,  dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print('Summary of the built model...')
model.summary()

Build model...
Summary of the built model...
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 100, 100)          12560200  
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                17024     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
Total params: 12,577,257
Trainable params: 12,577,257
Non-trainable params: 0
_________________________________________________________________
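
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard formulas: an embedding table has one weight per (word, dimension) pair, an LSTM has four gates that each own an input kernel, a recurrent kernel, and a bias, and a dense layer has a weight matrix plus a bias. The function names are hypothetical helpers, not part of Keras.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary index.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

emb = embedding_params(vocab_size=125602, dim=100)
lstm = lstm_params(input_dim=100, units=32)
dense = dense_params(input_dim=32, units=1)

print(emb, lstm, dense, emb + lstm + dense)
```

Almost all of the roughly 12.6 million parameters sit in the embedding table, so vocabulary size and EMBEDDING_DIM dominate the model's size.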


In [10]:
print('Train...')

model.fit(X_train_pad, y_train, batch_size=128, epochs=25, validation_data=(X_test_pad, y_test), verbose=2)

Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/25
 - 41s - loss: 0.5164 - accuracy: 0.7500 - val_loss: 0.4212 - val_accuracy: 0.8164
Epoch 2/25
 - 41s - loss: 0.3038 - accuracy: 0.8820 - val_loss: 0.3953 - val_accuracy: 0.8331
Epoch 3/25
 - 40s - loss: 0.2169 - accuracy: 0.9231 - val_loss: 0.4506 - val_accuracy: 0.8121
Epoch 4/25
 - 40s - loss: 0.1664 - accuracy: 0.9421 - val_loss: 0.4813 - val_accuracy: 0.8200
Epoch 5/25
 - 39s - loss: 0.1515 - accuracy: 0.9492 - val_loss: 0.5177 - val_accuracy: 0.8125
Epoch 6/25
 - 40s - loss: 0.1110 - accuracy: 0.9646 - val_loss: 0.5468 - val_accuracy: 0.8128
Epoch 7/25
 - 40s - loss: 0.0817 - accuracy: 0.9748 - val_loss: 0.6387 - val_accuracy: 0.8098
Epoch 8/25
 - 40s - loss: 0.0725 - accuracy: 0.9773 - val_loss: 0.6860 - val_accuracy: 0.7903
Epoch 9/25
 - 39s - loss: 0.0788 - accuracy: 0.9750 - val_loss: 0.6092 - val_accuracy: 0.8180
Epoch 10/25
 - 40s - loss: 0.0538 - accuracy: 0.9836 - val_loss: 0.6740 - val_accuracy: 0.8138


<keras.callbacks.callbacks.History at 0x138801c1808>

In [11]:
print('Testing...')
score, acc = model.evaluate(X_test_pad, y_test, batch_size=128)

print('Test score:', score)
print('Test accuracy:', acc)

print("Accuracy: {0:.2%}".format(acc))

Testing...
Test score: 0.958162715177536
Test accuracy: 0.7931600213050842
Accuracy: 79.32%
