![alt text](https://drive.google.com/uc?export=view&id=1UXScsVx_Wni_JuDdB8LeTnM6jsPfIwkW)

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

# Sentiment Classification

### Dataset
- Dataset of 50,000 movie reviews from IMDB, labeled by sentiment: positive (1) or negative (0).
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the rank "3" refers to the 3rd most frequent word in the data (the sequences returned by `load_data()` shift these ranks by 3, as the decoding section explains). This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
- With the default `imdb.load_data()` settings used in this notebook, "0" does not stand for a specific word: it is reserved for padding, while "1" marks the start of a review and "2" encodes any unknown (out-of-vocabulary) word.
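
The frequency-rank filtering mentioned above can be sketched in plain Python. This is only an illustration, not how Keras implements it: `filter_by_rank`, `oov_index`, and the sample review are hypothetical, and the ranks are assumed to be raw frequency ranks (1 = most frequent). The placeholder index is passed explicitly; 2 is used here because the decoding section of this notebook treats 2 as out-of-vocabulary.

```python
def filter_by_rank(review, num_words, skip_top, oov_index):
    """Keep ranks in (skip_top, num_words]; replace everything else with oov_index."""
    return [w if skip_top < w <= num_words else oov_index for w in review]


# Hypothetical review made of frequency ranks (not taken from the dataset).
sample_review = [3, 25, 10500, 18, 999]

# "Only the top 10,000 words, but eliminate the top 20."
filtered = filter_by_rank(sample_review, num_words=10000, skip_top=20, oov_index=2)
print(filtered)
```

Because rank order equals frequency order, the whole filter reduces to two integer comparisons per word.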

Command to import data
- `from tensorflow.keras.datasets import imdb`

Select TensorFlow version 2.x in Colab

In [1]:
%tensorflow_version 2.x
import tensorflow
print(tensorflow.__version__)
print('GPU name: {}'.format(tensorflow.test.gpu_device_name()))

2.4.0
GPU name: /device:GPU:0


In [2]:
# Seed Python's built-in random number generator (NumPy and TensorFlow keep their own, separate seeds)
import random
random.seed(0)

# Ignore the warnings
import warnings
warnings.filterwarnings("ignore")

### Import the data (4 Marks)
- Use `imdb.load_data()` method
- Get train and test set
- Take 10000 most frequent words

In [3]:
from tensorflow.keras.datasets import imdb
# Keep only the 10,000 most frequent words; rarer words are replaced by the OOV index
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)
print("Preview of x_train: ")
print(x_train)
print("Preview of y_train: ")
print(y_train)
print("Preview of x_test: ")
print(x_test)
print("Preview of y_test: ")
print(y_test)

Preview of x_train: 
[list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32])
 list([1, 194, 1153, 194, 8255, 

In [4]:
print("Data type of x_train is: ", type(x_train), " Dimension of x_train is: ",x_train.shape)
print("Data type of y_train is: ", type(y_train), " Dimension of y_train is: ",y_train.shape)
print("Data type of x_test is: ", type(x_test), " Dimension of x_test is: ",x_test.shape)
print("Data type of y_test is: ", type(y_test), " Dimension of y_test is: ",y_test.shape)

Data type of x_train is:  <class 'numpy.ndarray'>  Dimension of x_train is:  (25000,)
Data type of y_train is:  <class 'numpy.ndarray'>  Dimension of y_train is:  (25000,)
Data type of x_test is:  <class 'numpy.ndarray'>  Dimension of x_test is:  (25000,)
Data type of y_test is:  <class 'numpy.ndarray'>  Dimension of y_test is:  (25000,)


### Print shape of features & labels (4 Marks)

Number of reviews, and number of words in each review

In [5]:
print("Number of reviews in training set is: ", len(x_train))
print("Number of words in each review in train set is:")
# Prints one line per review (25,000 lines), so Colab truncates the output
for i in range(len(x_train)):
  print("(",i+1, ") ",len(x_train[i]) )

Streaming output truncated to the last 5000 lines.
( 20001 )  252
( 20002 )  118
( 20003 )  67
( 20004 )  198
( 20005 )  53
( 20006 )  525
( 20007 )  151
( 20008 )  123
( 20009 )  263
( 20010 )  179
( 20011 )  427
( 20012 )  151
( 20013 )  114
( 20014 )  122
( 20015 )  172
( 20016 )  182
( 20017 )  135
( 20018 )  220
( 20019 )  487
( 20020 )  432
( 20021 )  99
( 20022 )  975
( 20023 )  157
( 20024 )  63
( 20025 )  371
( 20026 )  212
( 20027 )  156
( 20028 )  54
( 20029 )  723
( 20030 )  178
( 20031 )  222
( 20032 )  38
( 20033 )  225
( 20034 )  134
( 20035 )  232
( 20036 )  261
( 20037 )  199
( 20038 )  193
( 20039 )  238
( 20040 )  349
( 20041 )  107
( 20042 )  204
( 20043 )  47
( 20044 )  156
( 20045 )  536
( 20046 )  402
( 20047 )  552
( 20048 )  175
( 20049 )  135
( 20050 )  771
( 20051 )  138
( 20052 )  255
( 20053 )  472
( 20054 )  556
( 20055 )  94
( 20056 )  210
( 20057 )  145
( 20058 )  63
( 20059 )  593
( 20060 )  60
( 20061 )  124
( 20062 )  121
( 20063 )  810


In [6]:
print("Number of reviews in test set is: ", len(x_test))
print("Number of words in each review in test set is:")
# Prints one line per review (25,000 lines), so Colab truncates the output
for i in range(len(x_test)):
  print("(",i+1, ") ",len(x_test[i]) )

Streaming output truncated to the last 5000 lines.
( 20001 )  199
( 20002 )  361
( 20003 )  214
( 20004 )  295
( 20005 )  131
( 20006 )  135
( 20007 )  185
( 20008 )  238
( 20009 )  180
( 20010 )  36
( 20011 )  131
( 20012 )  325
( 20013 )  229
( 20014 )  475
( 20015 )  165
( 20016 )  160
( 20017 )  68
( 20018 )  371
( 20019 )  554
( 20020 )  118
( 20021 )  208
( 20022 )  232
( 20023 )  74
( 20024 )  367
( 20025 )  359
( 20026 )  183
( 20027 )  300
( 20028 )  46
( 20029 )  201
( 20030 )  218
( 20031 )  235
( 20032 )  210
( 20033 )  133
( 20034 )  552
( 20035 )  415
( 20036 )  189
( 20037 )  235
( 20038 )  223
( 20039 )  189
( 20040 )  239
( 20041 )  47
( 20042 )  127
( 20043 )  92
( 20044 )  454
( 20045 )  308
( 20046 )  902
( 20047 )  332
( 20048 )  201
( 20049 )  140
( 20050 )  804
( 20051 )  131
( 20052 )  166
( 20053 )  268
( 20054 )  175
( 20055 )  420
( 20056 )  179
( 20057 )  144
( 20058 )  820
( 20059 )  659
( 20060 )  621
( 20061 )  123
( 20062 )  140
( 20063 )  
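
Printing one line per review produces output too long to read, so a compact summary of the review lengths may be more useful. The sketch below is an illustration on made-up data: `reviews` is a hypothetical stand-in for `x_train` or `x_test`, and the numbers it prints say nothing about the real dataset.

```python
from statistics import mean, median

# Hypothetical stand-in for x_train: a few encoded reviews of different lengths.
reviews = [[1, 14, 22, 16], [1, 194, 1153], [1, 14, 47, 8, 30, 31, 7]]

# One length per review, then a handful of summary statistics.
lengths = [len(review) for review in reviews]
summary = {
    "count": len(lengths),
    "min": min(lengths),
    "max": max(lengths),
    "mean": mean(lengths),
    "median": median(lengths),
}
print(summary)
```

The same list comprehension should work unchanged on `x_train`, since each element of that array is a Python list.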

Number of labels

In [7]:
print("Total number of labels in training set is: ", len(y_train))
# y_train is a NumPy array of 0/1 labels, so a boolean mask counts the positives
no_of_pos_sent_train = int((y_train == 1).sum())
no_of_neg_sent_train = len(y_train) - no_of_pos_sent_train
print("Number of positive sentiments in train set is: ", no_of_pos_sent_train)
print("Number of negative sentiments in train set is: ", no_of_neg_sent_train)


Total number of labels in training set is:  25000
Number of positive sentiments in train set is:  12500
Number of negative sentiments in train set is:  12500


In [8]:
print("Total number of labels in test set is: ", len(y_test))
# y_test is a NumPy array of 0/1 labels, so a boolean mask counts the positives
no_of_pos_sent_test = int((y_test == 1).sum())
no_of_neg_sent_test = len(y_test) - no_of_pos_sent_test
print("Number of positive sentiments in test set is: ", no_of_pos_sent_test)
print("Number of negative sentiments in test set is: ", no_of_neg_sent_test)

Total number of labels in test set is:  25000
Number of positive sentiments in test set is:  12500
Number of negative sentiments in test set is:  12500


### Print value of any one feature and its label (4 Marks)

Feature value

In [9]:
print(x_train[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


Label value

In [10]:
if y_train[0] == 0:
  print("negative review")
elif y_train[0] == 1:
  print("positive review")

positive review


### Decode the feature value to get original sentence (4 Marks)

First, retrieve the dictionary that maps each word to its index (frequency rank) in the IMDB dataset

In [11]:
word_to_index = imdb.get_word_index()
word_to_index

{'fawn': 34701,
 'tsukino': 52006,
 'nunnery': 52007,
 'sonja': 16816,
 'vani': 63951,
 'woods': 1408,
 'spiders': 16115,
 'hanging': 2345,
 'woody': 2289,
 'trawling': 52008,
 "hold's": 52009,
 'comically': 11307,
 'localized': 40830,
 'disobeying': 30568,
 "'royale": 52010,
 "harpo's": 40831,
 'canet': 52011,
 'aileen': 19313,
 'acurately': 52012,
 "diplomat's": 52013,
 'rickman': 25242,
 'arranged': 6746,
 'rumbustious': 52014,
 'familiarness': 52015,
 "spider'": 52016,
 'hahahah': 68804,
 "wood'": 52017,
 'transvestism': 40833,
 "hangin'": 34702,
 'bringing': 2338,
 'seamier': 40834,
 'wooded': 34703,
 'bravora': 52018,
 'grueling': 16817,
 'wooden': 1636,
 'wednesday': 16818,
 "'prix": 52019,
 'altagracia': 34704,
 'circuitry': 52020,
 'crotch': 11585,
 'busybody': 57766,
 "tart'n'tangy": 52021,
 'burgade': 14129,
 'thrace': 52023,
 "tom's": 11038,
 'snuggles': 52025,
 'francesco': 29114,
 'complainers': 52027,
 'templarios': 52125,
 '272': 40835,
 '273': 52028,
 'zaniacs': 52130,

In [12]:
word_to_index['the']

1

- In the encoded reviews, every word index is offset by 3 relative to this dictionary, so the word “the” maps to 4.
- This allows 0 to be used for padding so all reviews can be made to have the same length.
- 1 is used as a start-of-review indicator. 
- 2 is used for out-of-vocabulary (unknown) words.
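
The offset and the three reserved indexes can be illustrated on a miniature vocabulary. This is a sketch under stated assumptions: `toy_word_index` and `index_to_word_toy` are hypothetical names, and the three-word dictionary only stands in for the one returned by `imdb.get_word_index()`.

```python
# Hypothetical miniature word index (word -> frequency rank).
toy_word_index = {"the": 1, "and": 2, "film": 3}

OFFSET = 3  # ranks are shifted by 3 in the encoded reviews

# Reserved indexes first, then every word at rank + OFFSET.
index_to_word_toy = {0: "PAD", 1: "START", 2: "OOV"}
index_to_word_toy.update(
    {rank + OFFSET: word for word, rank in toy_word_index.items()}
)

# A made-up encoded review: START, "the", "film", unknown word, "and", padding.
encoded = [1, 4, 6, 2, 5, 0]
decoded = " ".join(index_to_word_toy[i] for i in encoded)
print(decoded)
```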

In [13]:
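# Shift every index by 3 to match the encoding used by load_data().
# Building a new dict from the raw index keeps this cell safe to re-run.
word_to_index = {k: v + 3 for k, v in imdb.get_word_index().items()}
word_to_index['the']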

4

Now use the dictionary to get the original words from the encodings, for a particular sentence

In [14]:
index_to_word = {}
index_to_word[0] = "PAD"
index_to_word[1] = "START"
index_to_word[2] = "OOV"
for (k,v) in word_to_index.items():
  index_to_word[v] = k


In [15]:
for index in x_train[0]:
  print(index_to_word[index], end = " ")

START this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert OOV is an amazing actor and now the same being director OOV father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for OOV and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also OOV to the two little boy's that played the OOV of norman and paul they were just brilliant children are often left out of the OOV list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have do

Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [16]:
y_train[0] #As read from above the review is positive

1

### Pad each sentence to the same length (4 Marks)
- Take maximum sequence length as 300

In [17]:
padded_x_train = tensorflow.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=300, padding='post', truncating='pre')
padded_x_test = tensorflow.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=300, padding='post', truncating='pre')
print("Padded x_train shape: ", padded_x_train.shape)
print("Padded x_test shape: ", padded_x_test.shape)

Padded x_train shape:  (25000, 300)
Padded x_test shape:  (25000, 300)
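
To make the effect of `padding='post'` and `truncating='pre'` concrete, here is a plain-Python sketch of the same behaviour on a single sequence. `pad_post_truncate_pre` is a hypothetical helper written for illustration, not part of Keras, and it assumes `maxlen` is at least 1.

```python
def pad_post_truncate_pre(seq, maxlen, value=0):
    """Mimic padding='post', truncating='pre' for one sequence (maxlen >= 1)."""
    trimmed = seq[-maxlen:]  # too long: drop items from the front, keep the end
    return trimmed + [value] * (maxlen - len(trimmed))  # too short: pad at the end


short = pad_post_truncate_pre([1, 14, 22], maxlen=5)
long_ = pad_post_truncate_pre([1, 14, 22, 16, 43, 530, 973], maxlen=5)
print(short)
print(long_)
```

Note that truncating from the front discards the start-of-review marker (index 1) for any review longer than the maximum length.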


### Define model (10 Marks)
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - The `tensorflow.keras` Embedding layer doesn't require us to one-hot encode our words; instead, each word needs a unique integer id. For the IMDB dataset we've loaded this has already been done; if it hadn't, we could use scikit-learn's `LabelEncoder`.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
- Add LSTM layer
  - Pass value in `return_sequences` as True
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer

In [18]:
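from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Flatten, Dense
vocab_size = 10000  # must match num_words passed to imdb.load_data()
embed_dim = 100
model = Sequential()
# Map each word index to a 100-dimensional dense vector
model.add(Embedding(vocab_size, embed_dim, input_length = padded_x_train.shape[1]))
# The LSTM is wrapped in Bidirectional, so each time step outputs 2 * 100 features.
# No input_shape is needed here: the shape is inferred from the Embedding layer.
model.add(Bidirectional(LSTM(100, return_sequences=True)))
# Apply the same Dense(100) to every time step
model.add(TimeDistributed(Dense(100)))
model.add(Flatten())
# Single sigmoid unit for the binary (positive/negative) output
model.add(Dense(1, activation='sigmoid'))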

### Compile the model (4 Marks)
- Use Optimizer as Adam
- Use Binary Crossentropy as loss
- Use Accuracy as metrics
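
As a small worked example of how the loss and the metric relate, the sketch below computes binary cross-entropy and accuracy by hand. The labels and probabilities are made up, `binary_crossentropy` is a hypothetical helper (not the Keras function), and the clipping constant `eps` is an assumption chosen only to avoid `log(0)`.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a batch of predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Made-up labels and sigmoid outputs, not results from the trained model.
labels = [1, 0, 1, 0]
probs = [0.9, 0.1, 0.8, 0.3]

# Accuracy only looks at which side of 0.5 each probability falls on.
preds = [1 if p >= 0.5 else 0 for p in probs]
accuracy = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)
loss = binary_crossentropy(labels, probs)
print(loss, accuracy)
```

Accuracy ignores how confident each prediction is, whereas the loss penalises confident mistakes heavily, so the two numbers can move quite differently.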

In [19]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


### Print model summary (4 Marks)

In [20]:
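# model.summary() prints the table itself and returns None,
# so wrapping it in print() also emits the trailing "None" in the output
print(model.summary())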

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 100)          1000000   
_________________________________________________________________
bidirectional (Bidirectional (None, 300, 200)          160800    
_________________________________________________________________
time_distributed (TimeDistri (None, 300, 100)          20100     
_________________________________________________________________
flatten (Flatten)            (None, 30000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 30001     
Total params: 1,210,901
Trainable params: 1,210,901
Non-trainable params: 0
_________________________________________________________________
None
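
The parameter counts in the summary can be checked with a little arithmetic. The sketch below uses the standard formulas for Embedding, LSTM, and Dense layers; the variable names are chosen here for illustration and the layer sizes are taken from the summary above.

```python
vocab_size, embed_dim, seq_len, lstm_units, td_units = 10000, 100, 300, 100, 100

# Embedding: one embed_dim-sized vector per vocabulary entry.
embedding = vocab_size * embed_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias.
lstm_one_direction = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bidirectional = 2 * lstm_one_direction

# Dense applied per time step: its input is the 2 * lstm_units bidirectional output.
time_distributed = (2 * lstm_units) * td_units + td_units

# Final Dense(1) on the flattened (seq_len * td_units) vector, plus one bias.
output_dense = seq_len * td_units + 1

total = embedding + bidirectional + time_distributed + output_dense
print(embedding, bidirectional, time_distributed, output_dense, total)
```

The `TimeDistributed` count does not depend on the sequence length because the same Dense weights are reused at every time step.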


### Fit the model (4 Marks)

In [21]:
model.fit(padded_x_train, y_train, epochs=10, verbose=2, batch_size=64)

Epoch 1/10
391/391 - 20s - loss: 0.3696 - accuracy: 0.8277
Epoch 2/10
391/391 - 15s - loss: 0.1965 - accuracy: 0.9244
Epoch 3/10
391/391 - 15s - loss: 0.1068 - accuracy: 0.9603
Epoch 4/10
391/391 - 15s - loss: 0.0360 - accuracy: 0.9870
Epoch 5/10
391/391 - 15s - loss: 0.0236 - accuracy: 0.9917
Epoch 6/10
391/391 - 15s - loss: 0.0184 - accuracy: 0.9938
Epoch 7/10
391/391 - 15s - loss: 0.0114 - accuracy: 0.9962
Epoch 8/10
391/391 - 15s - loss: 0.0156 - accuracy: 0.9947
Epoch 9/10
391/391 - 15s - loss: 0.0151 - accuracy: 0.9947
Epoch 10/10
391/391 - 15s - loss: 0.0098 - accuracy: 0.9967


<tensorflow.python.keras.callbacks.History at 0x7f1389bcd2b0>

### Evaluate model (4 Marks)

In [22]:
# evaluate() returns the loss first and then the metrics, so "score" here is the test loss
score, acc = model.evaluate(padded_x_test, y_test, verbose=2, batch_size = 64)
print("Score of model is: %.3f" % (score))
print("Accuracy of model is: %.3f%%" % (acc*100))

391/391 - 6s - loss: 1.1510 - accuracy: 0.8575
Score of model is: 1.151
Accuracy of model is: 85.748%


### Predict on one sample (4 Marks)

In [23]:
# Let's look at the content of one test sample
input_idx = 1
for index in x_test[input_idx]:
  print(index_to_word[index], end=" ")

START this film requires a lot of patience because it focuses on mood and character development the plot is very simple and many of the scenes take place on the same set in frances OOV the sandy dennis character apartment but the film builds to a disturbing climax br br the characters create an atmosphere OOV with sexual tension and psychological OOV it's very interesting that robert altman directed this considering the style and structure of his other films still the trademark altman audio style is evident here and there i think what really makes this film work is the brilliant performance by sandy dennis it's definitely one of her darker characters but she plays it so perfectly and convincingly that it's scary michael burns does a good job as the mute young man regular altman player michael murphy has a small part the OOV moody set fits the content of the story very well in short this movie is a powerful study of loneliness sexual OOV and desperation be patient OOV up the atmosphere 

In [24]:
#Let's see what is given in the corresponding test output data
if y_test[input_idx] == 0:
  print("negative review")
elif y_test[input_idx] == 1:
  print("positive review")


positive review


In [25]:
# Now get the model's prediction for the same sample
predict_sent = model.predict(padded_x_test[input_idx:input_idx+1], batch_size=64, verbose=2)
# predict() returns an array of shape (1, 1); take the scalar probability and threshold it at 0.5
prob = float(predict_sent[0][0])
print("positive review" if prob >= 0.5 else "negative review")

1/1 - 1s
positive review


