We will use the IMDB movie review sentiment classification dataset.

Dataset Description: https://keras.io/api/datasets/imdb/

This is a dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). The reviews have been preprocessed, and each review is encoded as a list of word indexes (integers). For convenience, words are indexed by their overall frequency in the dataset, so that, for example, the integer "3" encodes the 3rd most frequent word in the data.
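To make the frequency-based encoding concrete, here is a minimal sketch on a tiny corpus. The three `reviews` are hypothetical, and the ranking code is only an illustration of the idea, not the preprocessing that was used to build the real dataset.

```python
from collections import Counter

# Hypothetical toy corpus; the real dataset ships already encoded.
reviews = [
    "the movie was great",
    "the movie was bad",
    "the acting was great",
]

# Count every word across the whole corpus.
counts = Counter(word for review in reviews for word in review.split())

# Rank words by descending frequency (ties broken alphabetically),
# so that index 1 is the most frequent word.
ranked = sorted(counts, key=lambda w: (-counts[w], w))
word_index = {word: rank for rank, word in enumerate(ranked, start=1)}

# Encode each review as a list of word indexes.
encoded = [[word_index[w] for w in review.split()] for review in reviews]

print(word_index)
print(encoded)
```

Frequent words such as "the" receive small integers, which is why keeping only the `num_words` smallest indexes keeps the most common vocabulary.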

In [28]:
!pip install keras

In [5]:
!pip install --upgrade tensorflow



In [6]:
import numpy
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
# import the layers from the public tensorflow.keras API;
# the private tensorflow.python.keras paths are not available in recent releases
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
# fix random seed for reproducibility
numpy.random.seed(7)

In [9]:
db=imdb.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [11]:
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

In [12]:
len(X_train)

25000

In [13]:
y_train

array([1, 0, 0, ..., 0, 1, 0], dtype=int64)

In [14]:
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

In [15]:
X_train.shape

(25000, 500)
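The padding step above turns reviews of different lengths into rows of equal length. The following is a plain-Python sketch of what `pad_sequences` does with its default settings (pad and truncate at the front); `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate `seq` so that it has exactly `maxlen` items."""
    if len(seq) >= maxlen:
        # Too long: drop items from the front, keep the last `maxlen`.
        return list(seq[len(seq) - maxlen:])
    # Too short: fill the front with the padding value.
    return [value] * (maxlen - len(seq)) + list(seq)


short_review = [11, 12, 13]
long_review = [1, 2, 3, 4, 5, 6, 7]

padded = [pad_pre(s, maxlen=5) for s in (short_review, long_review)]
print(padded)
```

With `maxlen=500`, short reviews are filled with leading zeros and long ones keep only their last 500 word indexes, which gives the `(25000, 500)` shape shown above.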

We will use the Embedding layer as the first hidden layer of the network. It takes three main arguments:

input_dim: the size of the vocabulary in the text data

output_dim: the size of the vector space in which each word is embedded

input_length: the length of each input sequence; for example, if your documents contain 100 words each, it is 100. Older Keras versions accept this argument, while recent releases deprecate it and infer the length from the input.
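Conceptually, an embedding layer is a lookup table with `input_dim` rows and `output_dim` columns. The sketch below shows that lookup in plain Python; the sizes, the random initial values, and the names `table` and `token_ids` are assumptions chosen for illustration and do not reproduce Keras internals.

```python
import random

input_dim = 10     # vocabulary size (hypothetical)
output_dim = 4     # size of each word vector (hypothetical)
input_length = 3   # number of words per sequence (hypothetical)

rng = random.Random(7)

# One row of `output_dim` small random values per vocabulary index.
table = [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
         for _ in range(input_dim)]

# One encoded document of `input_length` word indexes.
token_ids = [2, 7, 2]

# The "layer" simply replaces each index by its row in the table.
embedded = [table[i] for i in token_ids]

print(len(embedded), len(embedded[0]))
```

The result has `input_length` rows of `output_dim` values each, and the same word index always maps to the same vector. During training, the rows of the table are the weights that get updated.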

In [16]:
# create the model
embedding_vector_length = 32
model = Sequential()
# input_length is omitted here: recent Keras releases deprecate it,
# and build() below supplies the sequence length instead
model.add(Embedding(top_words, embedding_vector_length))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# build the model explicitly so that summary() can report output shapes
model.build(input_shape=(None, max_review_length))
model.summary()
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=6, batch_size=64)
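The sequence lengths and parameter counts of the model defined above can be checked by hand. The sketch below does that arithmetic, assuming the default Keras settings for each layer (stride 1 for the convolution, pooling stride equal to `pool_size`, and one bias vector per LSTM gate); it is a sanity check, not output recorded from the notebook.

```python
top_words = 5000
embedding_dim = 32
max_review_length = 500
filters, kernel_size = 32, 3
pool_size = 2
lstm_units = 100

# Sequence length after each layer.
len_after_conv = max_review_length            # padding='same' keeps the length
len_after_pool = len_after_conv // pool_size  # non-overlapping windows of 2

# Trainable parameters per layer.
embedding_params = top_words * embedding_dim
conv_params = kernel_size * embedding_dim * filters + filters
# An LSTM has 4 gates; each sees the input (32 channels) and the previous state.
lstm_params = 4 * ((filters + lstm_units) * lstm_units + lstm_units)
dense_params = lstm_units * 1 + 1

total_params = embedding_params + conv_params + lstm_params + dense_params

print(len_after_pool)
print(embedding_params, conv_params, lstm_params, dense_params, total_params)
```

Most of the parameters sit in the embedding table, so `top_words` and the embedding size are the main levers on model size.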

In [17]:

# evaluation
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

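Accuracy: 48.02%

Note that an accuracy close to 50% is chance level for this balanced binary task and is consistent with a model that has not been trained. Check that `model.fit` in the previous cell runs to completion before relying on this number.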


## A simple example of the Embedding layer

In [18]:
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']

In [19]:
# use a NumPy array: model.fit may reject a plain Python list of labels
labels = numpy.array([1,1,1,1,1,0,0,0,0,0])

In [20]:
vocab_size = 50

In [21]:
encoded_docs = [one_hot(d, vocab_size) for d in docs]


In [None]:
print(encoded_docs)

In [None]:
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

We are now ready to define our Embedding layer as part of our model.

The embedding has a vocabulary of 50 and an input length of 4. We will choose a small embedding space of 8 dimensions.

The model is a simple binary classification model. It is important to note that the output of the Embedding layer will be 4 vectors of 8 dimensions each, one for each word. We flatten it (with the Flatten layer) into a 32-element vector to pass it to the Dense output layer.
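The shape flow described above (4 word vectors of 8 values, flattened to 32, then a single sigmoid unit) can be traced in plain Python. This is only a sketch: the weights are random stand-ins for an untrained model, and `padded_doc` is a hypothetical padded document, not one produced by the cells in this notebook.

```python
import math
import random

vocab_size, embed_dim, max_length = 50, 8, 4
rng = random.Random(0)

# Random stand-ins for the layer weights (untrained, for shape only).
embedding = [[rng.uniform(-0.05, 0.05) for _ in range(embed_dim)]
             for _ in range(vocab_size)]
dense_w = [rng.uniform(-0.1, 0.1) for _ in range(max_length * embed_dim)]
dense_b = 0.0

padded_doc = [6, 16, 0, 0]  # hypothetical document of 4 word indexes

# Embedding: one 8-dimensional vector per word index.
vectors = [embedding[i] for i in padded_doc]
# Flatten: concatenate the 4 vectors into one list of 32 values.
flat = [x for vec in vectors for x in vec]
# Dense with sigmoid: weighted sum plus bias, squashed into (0, 1).
logit = sum(w * x for w, x in zip(dense_w, flat)) + dense_b
prob = 1.0 / (1.0 + math.exp(-logit))

# Parameter count: embedding table plus dense weights and bias.
n_params = vocab_size * embed_dim + len(dense_w) + 1

print(len(vectors), len(vectors[0]), len(flat), n_params)
```

The model therefore has 400 embedding weights and 33 dense parameters, and its single output is read as the probability of the positive class.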

In [4]:
model = Sequential()
# input_length is omitted (deprecated in recent Keras); build() sets the length
model.add(Embedding(vocab_size, 8))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# build with the padded document length so that summary() can report shapes
model.build(input_shape=(None, max_length))
# summarize the model
model.summary()

In [5]:
model.fit(padded_docs, labels, epochs=50, verbose=0)


In [6]:
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))


## To Do:

1. Try the same approach on the Google reviews dataset (the file is provided in the lab directory).
2. Try changing the embedding representation using GloVe and skip-gram.