# Sentiment Classification

### Dataset
- 50,000 movie reviews from IMDB, each labeled by sentiment: positive (1) or negative (0).
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
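- As a convention, "0" does not stand for a specific word. With the default arguments of `imdb.load_data()`, 0 is reserved for padding, 1 marks the start of a review, and 2 encodes any unknown (out-of-vocabulary) word.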
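
The frequency-based filtering described above can be sketched on a toy sequence. The helper `filter_by_rank` below is hypothetical and only mimics the idea; replacing filtered words with `2` is an assumption that follows the default `oov_char` of `imdb.load_data()`. In Keras itself, the `num_words` and `skip_top` arguments of `imdb.load_data()` serve this purpose.

```python
def filter_by_rank(sequence, num_words=10000, skip_top=20, oov=2):
    """Keep indices in [skip_top, num_words); replace all others with `oov`."""
    return [idx if skip_top <= idx < num_words else oov for idx in sequence]


# Toy encoded review (hypothetical indices, not taken from the dataset)
toy_review = [5, 25, 340, 9999, 10000, 15000]
print(filter_by_rank(toy_review))  # [2, 25, 340, 9999, 2, 2]
```

The very frequent word (5) and the two rare words (10000 and 15000) are all mapped to the same unknown-word marker, so the model no longer distinguishes between them.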

Command to import data
- `from tensorflow.keras.datasets import imdb`

### Import the data
- Use `imdb.load_data()` method
- Get train and test set
- Take 10000 most frequent words

In [1]:
import numpy as np
from tensorflow.keras.datasets import imdb

data = imdb.load_data(num_words=10000) # Take the top 10000 frequent words

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [2]:
X_train, y_train = data[0]
X_test, y_test = data[1]

### Pad each sentence to be of same length
- Take maximum sequence length as 300

In [3]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

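# Pad shorter reviews at the end ('post') with 0, the index reserved for padding.
# Reviews longer than max_len are cut from the start (default truncating='pre').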
max_len = 300
X_train_pad = pad_sequences(X_train, maxlen=max_len, padding='post', value=0)
X_test_pad = pad_sequences(X_test, maxlen=max_len, padding='post', value=0)
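
To make the effect of these settings concrete, here is a minimal pure-Python sketch of what `padding='post'` combined with the default `truncating='pre'` does to a single sequence. The helper `pad_post` is hypothetical and only illustrates the behaviour; it is not how `pad_sequences` is implemented.

```python
def pad_post(sequence, maxlen, value=0):
    """Pad at the end with `value`; if too long, keep only the last `maxlen` items."""
    truncated = sequence[-maxlen:]  # 'pre' truncation drops items from the start
    return truncated + [value] * (maxlen - len(truncated))


print(pad_post([11, 12, 13], maxlen=5))              # [11, 12, 13, 0, 0]
print(pad_post([11, 12, 13, 14, 15, 16], maxlen=4))  # [13, 14, 15, 16]
```

Note that a review longer than `max_len` loses its opening words, not its closing ones. Passing `truncating='post'` to `pad_sequences` would keep the beginning instead.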

### Print shape of features & labels

Number of reviews and number of words in each review

In [4]:
# Number of reviews in train and test set
print(f"Number of reviews in train set: {len(X_train_pad)}")
print(f"Number of labels in train set: {len(y_train)}\n")
print(f"Number of reviews in test set: {len(X_test_pad)}")
print(f"Number of labels in test set: {len(y_test)}\n")

Number of reviews in train set: 25000
Number of labels in train set: 25000

Number of reviews in test set: 25000
Number of labels in test set: 25000



In [5]:
# Average number of words per review in train and test set
import statistics as stat

train_set_avg = stat.mean([len(x) for x in X_train])
test_set_avg = stat.mean([len(x) for x in X_test])

print(f"Avg number of words per review in train set: {train_set_avg}")
print(f"Avg number of words per review in test set: {test_set_avg}")

Avg number of words per review in train set: 238.71364
Avg number of words per review in test set: 230.8042
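
The averages are below the chosen `max_len` of 300, but an average says little about how many reviews are actually truncated by padding. A small sketch of that check follows; the helper name and the toy lengths are hypothetical.

```python
def fraction_longer_than(lengths, max_len):
    """Share of sequences whose length exceeds max_len."""
    return sum(1 for n in lengths if n > max_len) / len(lengths)


# Toy review lengths (made up for illustration)
toy_lengths = [120, 250, 300, 310, 980]
print(fraction_longer_than(toy_lengths, max_len=300))  # 0.4
```

In this notebook the same function could be called with `[len(x) for x in X_train]` and `max_len`.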


Number of labels

In [6]:
# Number of labels in train and test set

print(f"Number of labels in train set: {len(y_train)}")
print(f"Number of labels in test set: {len(y_test)}")

Number of labels in train set: 25000
Number of labels in test set: 25000


In [7]:
# Fraction of positive reviews (0.5 means 50%)

pct_pos_train = stat.mean([int(x) for x in y_train])
pct_pos_test = stat.mean([int(x) for x in y_test])

print(f"Fraction of positive reviews in train set: {pct_pos_train}")
print(f"Fraction of positive reviews in test set: {pct_pos_test}")

Fraction of positive reviews in train set: 0.5
Fraction of positive reviews in test set: 0.5


### Print value of any one feature and its label

Feature value

In [8]:
# Print 2nd review from train set
print(X_train[1])

[1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463, 4369, 5012, 134, 26, 4, 715, 8, 118, 1634, 14, 394, 20, 13, 119, 954, 189, 102, 5, 207, 110, 3103, 21, 14, 69, 188, 8, 30, 23, 7, 4, 249, 126, 93, 4, 114, 9, 2300, 1523, 5, 647, 4, 116, 9, 35, 8163, 4, 229, 9, 340, 1322, 4, 118, 9, 4, 130, 4901, 19, 4, 1002, 5, 89, 29, 952, 46, 37, 4, 455, 9, 45, 43, 38, 1543, 1905, 398, 4, 1649, 26, 6853, 5, 163, 11, 3215, 2, 4, 1153, 9, 194, 775, 7, 8255, 2, 349, 2637, 148, 605, 2, 8003, 15, 123, 125, 68, 2, 6853, 15, 349, 165, 4362, 98, 5, 4, 228, 9, 43, 2, 1157, 15, 299, 120, 5, 120, 174, 11, 220, 175, 136, 50, 9, 4373, 228, 8255, 5, 2, 656, 245, 2350, 5, 4, 9837, 131, 152, 491, 18, 2, 32, 7464, 1212, 14, 9, 6, 371, 78, 22, 625, 64, 1382, 9, 8, 168, 145, 23, 4, 1690, 15, 16, 4, 1355, 5, 28, 6, 52, 154, 462, 33, 89, 78, 285, 16, 145, 95]


Label value

In [9]:
# Print its label
print(y_train[1])

0


The label is 0, so this is a negative review.

### Decode the feature value to get original sentence

First, retrieve a dictionary that contains mapping of words to their index in the IMDB dataset

In [10]:
# Use get_word_index function to get the mapping dictionary

imdb_word_index = imdb.get_word_index()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


Now use the dictionary to get the original words from the encodings, for a particular sentence

In [11]:
# First we need to reverse the mapping to get it in index -> word form

imdb_index_word = {imdb_word_index[word]:word for word in imdb_word_index}

In [64]:
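# Use the imdb_index_word mapping to map the indices to words
# Note: with its default index_from=3, load_data() shifts every word index by 3,
# so this direct lookup does not reproduce the original wording of the review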

sample_review = X_train[1]
sample_label = y_train[1]

decoded_review = " ".join([imdb_index_word[idx] for idx in sample_review])
print(f"Review:\n\n{decoded_review}")

Review:

the thought solid thought senator do making to is spot nomination assumed while he of jack in where picked as getting on was did hands fact characters to always life thrillers not as me can't in at are br of sure your way of little it strongly random to view of love it so principles of guy it used producer of where it of here icon film of outside to don't all unique some like of direction it if out her imagination below keep of queen he diverse to makes this stretch and of solid it thought begins br senator and budget worthwhile though ok and awaiting for ever better were and diverse for budget look kicked any to of making it out and follows for effects show to show cast this family us scenes more it severe making senator to and finds tv tend to of emerged these thing wants but and an beckinsale cult as it is video do you david see scenery it in few those are of ship for with of wild to one is very work dark they don't do dvd with those them
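
The decoded text above does not read like a real review. A likely reason is that `imdb.load_data()` by default shifts every word index by `index_from=3` and reserves 1 for the start marker and 2 for unknown words, while `get_word_index()` returns the unshifted ranks. The sketch below shows an offset-aware decoder on a toy vocabulary; the helper `decode_review`, the placeholder tokens, and the toy dictionary are all hypothetical.

```python
def decode_review(encoded, word_index, index_from=3):
    """Map shifted indices back to words; 0, 1 and 2 are treated as special tokens."""
    index_word = {idx: word for word, idx in word_index.items()}
    special = {0: "<PAD>", 1: "<START>", 2: "<UNK>"}
    words = []
    for idx in encoded:
        if idx in special:
            words.append(special[idx])
        else:
            # Undo the offset before looking the word up
            words.append(index_word.get(idx - index_from, "<UNK>"))
    return " ".join(words)


# Toy vocabulary in the same word -> rank form as imdb.get_word_index()
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}
print(decode_review([1, 4, 5, 6, 2, 7], toy_word_index))
# <START> the movie was <UNK> great
```

In this notebook the call would be `decode_review(X_train[1], imdb_word_index)`, assuming the data was loaded with the default `index_from`.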


Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [65]:
# Get the sentiment for this review

sentiment = 'Positive' if sample_label else 'Negative'
print(f"Sentiment: {sentiment}")

Sentiment: Negative


### Define model
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - The `tensorflow.keras` Embedding layer does not require one-hot encoded words; each word only needs a unique integer id. The IMDB dataset loaded here is already encoded this way. Otherwise, a tool such as scikit-learn's `LabelEncoder` could be used.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
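- Add a `SpatialDropout1D` layer after the embedding
- Add two stacked LSTM layers (300 and 200 units)
  - Set `return_sequences=True` so that each layer returns its output at every time step
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer with a single sigmoid unit for the binary output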
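
Conceptually, an embedding layer is a lookup table with one row per word id. The sketch below illustrates this with plain Python lists; the random initialisation range and the helper names are assumptions for illustration, and in Keras the table is a trainable weight matrix that is updated during training.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """One small random vector of length `dim` per word id."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]


def embed(sequence, table):
    """Replace each integer id with its row in the table."""
    return [table[idx] for idx in sequence]


table = make_embedding_table(vocab_size=10, dim=4)
vectors = embed([1, 7, 7, 0], table)
print(len(vectors), len(vectors[0]))  # 4 4
print(vectors[1] == vectors[2])       # True
```

A sequence of 4 ids becomes 4 vectors of size 4, and the same id always maps to the same vector. With the settings in this section, a padded review of 300 ids becomes a 300 x 100 matrix.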

In [42]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Input, Flatten, SpatialDropout1D

inputs = Input(shape=(max_len,)) # Input layer
x = Embedding(input_dim=10000, output_dim=100, input_length=max_len)(inputs) # Word embedding layer
x = SpatialDropout1D(0.4)(x) # Dropout layer
x = LSTM(units=300, return_sequences=True, dropout=0.2, recurrent_dropout=0.2)(x) # LSTM layer with dropout
x = LSTM(units=200, return_sequences=True, dropout=0.2, recurrent_dropout=0.2)(x) # LSTM layer with dropout
x = TimeDistributed(Dense(100, activation='relu'))(x) # Time-distributed layer
x = Flatten()(x) # Flatten
out = Dense(1, activation='sigmoid')(x) # Sigmoid output layer

model = Model(inputs, out) # Complete model



### Compile the model
- Use Optimizer as Adam
- Use Binary Crossentropy as loss
- Use Accuracy as metrics

In [43]:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]) # Compile the model
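
For reference, the loss and metric chosen here can be written out by hand. The sketch below computes binary cross-entropy and accuracy on a few made-up probabilities; the clipping constant `eps` and the 0.5 decision threshold are assumptions chosen for illustration.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, y_pred, threshold=0.5):
    """Share of samples whose thresholded prediction matches the label."""
    hits = sum(1 for y, p in zip(y_true, y_pred) if (p > threshold) == bool(y))
    return hits / len(y_true)


labels = [1, 0, 1, 0]
probs = [0.9, 0.1, 0.4, 0.2]  # made-up sigmoid outputs
print(binary_crossentropy(labels, probs))
print(accuracy(labels, probs))  # 0.75
```

The loss penalises confident wrong predictions heavily, while accuracy only looks at which side of the threshold each prediction falls on.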

### Print model summary

In [44]:
model.summary() # summary() prints the table itself; wrapping it in print() would also print None

Model: "functional_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_8 (InputLayer)         [(None, 300)]             0         
_________________________________________________________________
embedding_7 (Embedding)      (None, 300, 100)          1000000   
_________________________________________________________________
spatial_dropout1d_4 (Spatial (None, 300, 100)          0         
_________________________________________________________________
lstm_9 (LSTM)                (None, 300, 300)          481200    
_________________________________________________________________
lstm_10 (LSTM)               (None, 300, 200)          400800    
_________________________________________________________________
time_distributed_6 (TimeDist (None, 300, 100)          20100     
_________________________________________________________________
flatten_6 (Flatten)          (None, 30000)           
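
The parameter counts listed in the summary can be reproduced with the standard formulas for each layer type. This is a sketch of the arithmetic only; the helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    """One vector of length `dim` per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units


print(embedding_params(10000, 100))  # 1000000
print(lstm_params(100, 300))         # 481200
print(lstm_params(300, 200))         # 400800
print(dense_params(200, 100))        # 20100
```

The `TimeDistributed` wrapper applies the same Dense layer at every time step, so its weights are counted once and not once per step.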

### Fit the model

In [45]:
history = model.fit(X_train_pad, y_train, batch_size=100, epochs=3, validation_data=(X_test_pad, y_test), verbose=1)

Epoch 1/3
Epoch 2/3
Epoch 3/3
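
The call above passes the test set as `validation_data`, so the data monitored during training is the same data used for the final evaluation. If the test set should stay untouched until the end, one option is to hold out part of the training data. The index-splitting helper below is a hypothetical sketch; Keras also offers a `validation_split` argument to `fit` for this purpose.

```python
import random


def train_val_indices(n_samples, val_fraction=0.2, seed=42):
    """Shuffle sample indices and split them into train and validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = train_val_indices(25000)
print(len(train_idx), len(val_idx))  # 20000 5000
```

The two index lists could then be used to select rows from `X_train_pad` and `y_train`.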


### Evaluate model

In [46]:
# Check performance on the test set; evaluate() returns [loss, accuracy]
model.evaluate(X_test_pad, y_test)



[0.30342113971710205, 0.8831200003623962]

### Predict on one sample

In [71]:
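# Look at the 1st test sample
# Note: load_data() shifts word indices by index_from=3 by default,
# so this direct lookup does not reproduce the original wording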

test_sample = X_test[0]
sample_label = y_test[0]

decoded_test_review = " ".join([imdb_index_word[idx] for idx in test_sample])

print(f"Review:\n\n{decoded_test_review}")
print(f"\n\nSentiment: {'Positive' if sample_label else 'Negative'}")

Review:

the wonder own as by is sequence i i and and to of hollywood br of down shouting getting boring of ever it sadly sadly sadly i i was then does don't close faint after one carry as by are be favourites all family turn in does as three part in another some to be probably with world and her an have faint beginning own as is sequence


Sentiment: Negative


In [69]:
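# Predict the sentiment of the same review, using its padded version

test_sample = X_test_pad[0]
probability = model.predict(x=np.array([test_sample]))[0][0] # Sigmoid output between 0 and 1
prediction = 'Positive' if probability > 0.5 else 'Negative'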

print(f"Predicted Sentiment: {prediction}")

Predicted Sentiment: Negative


## Conclusion

We have successfully built a sentiment classification model for movie reviews, achieving about **88% accuracy** on the test set.