![alt text](https://drive.google.com/uc?export=view&id=1UXScsVx_Wni_JuDdB8LeTnM6jsPfIwkW)

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

# Sentiment Classification

### Dataset
- Dataset of 50,000 movie reviews from IMDB, labeled by sentiment: positive (1) or negative (0).
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
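- As a convention, "0" does not stand for a specific word. With the default arguments of `imdb.load_data()`, 0 is left free for padding, 1 marks the start of a review, 2 encodes any unknown (out-of-vocabulary) word, and the actual word indexes are shifted up by 3.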
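
The frequency-based filtering described above can be sketched in plain Python. This is only an illustration on made-up rank sequences, not the Keras implementation. The helper name `filter_by_rank` is hypothetical, and 0 is used as the placeholder because it does not stand for any word.

```python
def filter_by_rank(sequence, top_n, skip_top=0, placeholder=0):
    """Keep frequency ranks in (skip_top, top_n]; replace the rest.

    Rank 1 is the most frequent word. Illustrative helper, not part of Keras.
    """
    return [rank if skip_top < rank <= top_n else placeholder for rank in sequence]


# Made-up review encoded as frequency ranks.
review = [1, 15, 25, 9999, 12000, 3, 450]

# "Only consider the top 10,000 most common words, but eliminate the top 20".
filtered = filter_by_rank(review, top_n=10000, skip_top=20)
print(filtered)  # [0, 0, 25, 9999, 0, 0, 450]
```

`imdb.load_data()` exposes the same idea through its `num_words` and `skip_top` arguments.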

Command to import data
- `from tensorflow.keras.datasets import imdb`

### Import the data (4 Marks)
- Use `imdb.load_data()` method
- Get train and test set
- Take 10000 most frequent words

In [26]:
%tensorflow_version 2.x
import tensorflow
tensorflow.__version__

import numpy as np
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import models, layers

from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional, Input
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Flatten


In [2]:
#### Add your code here ####
from tensorflow.keras.datasets import imdb
(training_data, training_targets), (testing_data, testing_targets) = imdb.load_data(num_words=10000)



Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [4]:
print(training_data.shape) 
print(testing_data.shape)


(25000,)
(25000,)


### Pad each sentence to be of same length (4 Marks)
- Take maximum sequence length as 300

In [5]:
print("Categories:", np.unique(training_targets))
print("Number of unique words:", len(np.unique(np.hstack(training_data))))
length = [len(i) for i in training_data]
print("Average Review length:", np.mean(length))
print("Standard Deviation:", round(np.std(length)))

Categories: [0 1]
Number of unique words: 9998
Average Review length: 238.71364
Standard Deviation: 176.0


In [6]:
# Pad every review to exactly 300 tokens; zeros are appended at the end ("post").
# Reviews longer than 300 tokens are truncated (by default from the start).
max_sequence_length = 300
padded_inputs = pad_sequences(training_data, maxlen=max_sequence_length, padding="post")
padded_inputs_test = pad_sequences(testing_data, maxlen=max_sequence_length, padding="post")
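
To make the effect of padding concrete, here is a minimal pure-Python sketch of what "post" padding to a fixed length does. It is not the `pad_sequences` implementation. The helper name `pad_post` and the sample reviews are made up, and the sketch assumes `maxlen` is positive.

```python
def pad_post(sequence, maxlen, value=0):
    """Return `sequence` as a list of exactly `maxlen` items (maxlen > 0).

    Short sequences get `value` appended at the end ("post" padding).
    Long sequences keep only their last `maxlen` items, mirroring the
    default truncating="pre" behaviour of pad_sequences.
    """
    truncated = list(sequence)[-maxlen:]
    return truncated + [value] * (maxlen - len(truncated))


short_review = [1, 13, 244, 6]
long_review = [1, 5, 8, 13, 21, 34, 55]

print(pad_post(short_review, maxlen=6))  # [1, 13, 244, 6, 0, 0]
print(pad_post(long_review, maxlen=6))   # [5, 8, 13, 21, 34, 55]
```

Note that the long review loses its first token, which is why very long reviews no longer begin with the start marker after padding.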




### Print shape of features & labels (4 Marks)

Number of reviews, number of words in each review

In [7]:

print('For training...')
print(padded_inputs.shape)



For training...
(25000, 300)


In [8]:

print('For testing...')
print(padded_inputs_test.shape)


For testing...
(25000, 300)


Number of labels

In [9]:

print('For training labels ...')
print(training_targets.shape)

For training labels ...
(25000,)


In [10]:

print('For testing labels...')
print(testing_targets.shape)

For testing labels...
(25000,)


### Print value of any one feature and its label (4 Marks)

Feature value

In [11]:

print(padded_inputs[100])

[   1   13  244    6   87  337    7  628 2219    5   28  285   15  240
   93   23  288  549   18 1455  673    4  241  534 3635 8448   20   38
   54   13  258   46   44   14   13 1241 7258   12    5    5   51    9
   14   45    6  762    7    2 1309  328    5  428 2473   15   26 1292
    5 3939 6728    5 1960  279   13   92  124  803   52   21  279   14
    9   43    6  762    7  595   15   16    2   23    4 1071  467    4
  403    7  628 2219    8   97    6  171 3596   99  387   72   97   12
  788   15   13  161  459   44    4 3939 1101  173   21   69    8  401
    2    4  481   88   61 4731  238   28   32   11   32   14    9    6
  545 1332  766    5  203   73   28   43   77  317   11    4    2  953
  270   17    6 3616   13  545  386   25   92 1142  129  278   23   14
  241   46    7  158    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 

Label value

In [12]:

print(training_targets[100])

0


### Decode the feature value to get original sentence (4 Marks)

First, retrieve a dictionary that maps words to their index in the IMDB dataset

In [13]:

index = imdb.get_word_index()


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


Now use the dictionary to get the original words from the encodings, for a particular sentence

In [14]:

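# Invert the mapping: index -> word
reverse_index = {value: key for (key, value) in index.items()}
# Encoded indexes are offset by 3 (0, 1 and 2 are reserved), so subtract 3 before the lookup;
# any index without a word is shown as "#"
decoded = " ".join([reverse_index.get(i - 3, "#") for i in training_data[100]])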
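
The decoding step can be tried without downloading anything. The sketch below uses a tiny made-up word index in place of `imdb.get_word_index()`; the names `toy_word_index` and `decode` are hypothetical, and the offset of 3 is assumed to match the `load_data` defaults.

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank (made up).
toy_word_index = {"the": 1, "and": 2, "movie": 3, "great": 4}

# Invert the mapping: rank -> word.
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}


def decode(encoded_review, reverse_index, offset=3, placeholder="#"):
    """Map encoded integers back to words, undoing the index offset."""
    return " ".join(reverse_index.get(i - offset, placeholder) for i in encoded_review)


# 1 = start marker, 2 = unknown word, 0 = padding; real words start at 4.
encoded = [1, 4, 6, 2, 7, 0]
decoded_toy = decode(encoded, toy_reverse_index)
print(decoded_toy)  # # the movie # great #
```

Reserved indexes have no entry after subtracting the offset, so they all fall back to the `#` placeholder.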


Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [15]:
print(decoded) 
print(training_targets[100])

# i am a great fan of david lynch and have everything that he's made on dvd except for hotel room the 2 hour twin peaks movie so when i found out about this i immediately grabbed it and and what is this it's a bunch of # drawn black and white cartoons that are loud and foul mouthed and unfunny maybe i don't know what's good but maybe this is just a bunch of crap that was # on the public under the name of david lynch to make a few bucks too let me make it clear that i didn't care about the foul language part but had to keep # the sound because my neighbors might have all in all this is a highly disappointing release and may well have just been left in the # box set as a curiosity i highly recommend you don't spend your money on this 2 out of 10
0


### Define model (10 Marks)
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - The `tensorflow.keras` Embedding layer doesn't require us to one-hot encode our words. Instead, each word needs a unique integer id. For the IMDB dataset we've loaded this has already been done; otherwise we could use sklearn's `LabelEncoder`.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
- Add LSTM layer
  - Pass value in `return_sequences` as True
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer
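
To see what each layer in the list above contributes, the following sketch traces output shapes by hand. It is plain Python that encodes the usual Keras shape rules as assumptions, not a Keras model, and the function name `trace_shapes` is hypothetical.

```python
def trace_shapes(seq_len=300, embed_dim=100, lstm_units=100, td_units=100, use_flatten=True):
    """Return (layer name, output shape) pairs, with None as the batch axis."""
    shapes = [
        ("Embedding", (None, seq_len, embed_dim)),
        # return_sequences=True keeps one output vector per timestep.
        ("LSTM", (None, seq_len, lstm_units)),
        # TimeDistributed applies the same Dense layer at every timestep.
        ("TimeDistributed(Dense)", (None, seq_len, td_units)),
    ]
    if use_flatten:
        shapes.append(("Flatten", (None, seq_len * td_units)))
    # A Dense layer only transforms the last axis.
    shapes.append(("Dense", shapes[-1][1][:-1] + (1,)))
    return shapes


for name, shape in trace_shapes():
    print(name, shape)
```

With `use_flatten=False` the final Dense layer yields `(None, 300, 1)`, one value per timestep, instead of one value per review.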

In [37]:


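max_features = 10000  # vocabulary size

mod3 = Sequential()
# Map each word index to a 100-dimensional dense vector
mod3.add(Embedding(max_features, 100))
# This cell wraps the LSTM in Bidirectional, so each timestep yields 200 features
mod3.add(Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1)))
mod3.add(TimeDistributed(Dense(100)))
# No Flatten layer is added here, so this Dense layer emits one sigmoid value per timestep
mod3.add(Dense(1, activation='sigmoid'))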



### Compile the model (4 Marks)
- Use Optimizer as Adam
- Use Binary Crossentropy as loss
- Use Accuracy as metrics
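
As a rough illustration of what the loss and metric above measure, here is a pure-Python sketch of binary cross-entropy and thresholded accuracy. It is not the Keras implementation; the function names, the clipping constant `eps`, the 0.5 threshold, and the sample values are assumptions.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over paired labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions that fall on the correct side of `threshold`."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_pred))
    return hits / len(y_true)


labels = [1, 0, 1, 0]          # made-up targets
probs = [0.9, 0.2, 0.4, 0.6]   # made-up sigmoid outputs

loss = binary_crossentropy(labels, probs)
accuracy = binary_accuracy(labels, probs)
print(round(loss, 4), accuracy)  # 0.5403 0.5
```

Confident correct predictions (0.9 for a positive label) contribute little to the loss, while predictions on the wrong side of 0.5 contribute the most.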

In [38]:

mod3.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

### Print model summary (4 Marks)

In [39]:

mod3.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_14 (Embedding)     (None, None, 100)         1000000   
_________________________________________________________________
bidirectional_14 (Bidirectio (None, None, 200)         160800    
_________________________________________________________________
time_distributed_14 (TimeDis (None, None, 100)         20100     
_________________________________________________________________
dense_25 (Dense)             (None, None, 1)           101       
Total params: 1,181,001
Trainable params: 1,181,001
Non-trainable params: 0
_________________________________________________________________
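
The parameter counts in the summary above can be checked by hand. The sketch below assumes the standard parameter formulas for Embedding, LSTM, and Dense layers, with the layer sizes taken from the model cell.

```python
vocab_size, embed_dim, lstm_units, dense_units = 10000, 100, 100, 100

# Embedding: one vector of length embed_dim per vocabulary entry.
embedding = vocab_size * embed_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias.
lstm_one_direction = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bidirectional = 2 * lstm_one_direction

# TimeDistributed(Dense(100)) sees the 200 concatenated forward/backward features.
time_distributed = (2 * lstm_units) * dense_units + dense_units

# Final Dense(1): 100 weights plus one bias.
output_dense = dense_units * 1 + 1

total = embedding + bidirectional + time_distributed + output_dense
print(embedding, bidirectional, time_distributed, output_dense, total)
# 1000000 160800 20100 101 1181001
```

Most of the parameters sit in the embedding table, which is typical for text models with a large vocabulary.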


### Fit the model (4 Marks)

In [40]:

mod3.fit(padded_inputs, training_targets, batch_size=32, epochs=3, validation_split=0.1, verbose=1)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f0d4772ef28>
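
The arguments passed to `fit` determine how much data each epoch actually trains on. The arithmetic below is a sketch of that bookkeeping, assuming Keras holds out the last fraction of the samples for validation (taken before shuffling) and uses the rest for training.

```python
import math

num_samples = 25000        # rows in padded_inputs
validation_split = 0.1
batch_size = 32

# The last 10% of the rows are held out for validation.
split_at = int(math.floor(num_samples * (1.0 - validation_split)))
num_train = split_at
num_val = num_samples - split_at

# The final batch of each epoch is smaller than batch_size.
steps_per_epoch = math.ceil(num_train / batch_size)
print(num_train, num_val, steps_per_epoch)  # 22500 2500 704
```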

### Evaluate model (4 Marks)

In [41]:

score, acc = mod3.evaluate(padded_inputs_test, testing_targets,
                            batch_size=32)
print('Test score:', score)
print('Test accuracy:', acc)

Test score: 0.4988357424736023
Test accuracy: 0.7230520248413086



### Predict on one sample (4 Marks)

In [45]:
# Pick two training reviews (indexes 7737 and 449) to predict on
text_bad = training_data[7737]
text_good = training_data[449]
texts = (text_bad, text_good)
padded_texts = pad_sequences(texts, maxlen=max_sequence_length, padding='post')

# Decode both reviews back to words so the predictions can be checked by eye
decoded_predict_7737 = " ".join([reverse_index.get(i - 3, "#") for i in text_bad])
decoded_predict_449 = " ".join([reverse_index.get(i - 3, "#") for i in text_good])

print(decoded_predict_7737)
print(decoded_predict_449)


# as a # and non christian i thought i really was going to be holding onto my faith but what a load of i # i thought the film would have great arguments but only got one sided views from # and jews and who are all these street people he's # who don't know the back of their arm from their head where are the proper # and priests and stuff he could have got arguments from not retired nuts who wrote books and finished their studies in 1970 personally this dvd was a waste of time and not worth my time to check if the facts are right or wrong or if i should or should not believe because an anti christ told me so please to think he came up with the conclusion of not finding god because his own ego and demons got the better of him no im not going to say the movie was stunning to help # reading this feel better about themselves but if you really want to show the world you care about us poor souls who believe in jesus then # us with your worth not your beating off the drums
# here it is the firs

In [46]:
predictions = mod3.predict(padded_texts)
# The model outputs one value per timestep, so average them to get one score per review
print(predictions[0].mean())
print(predictions[1].mean())

0.1501305
0.9046873
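
Turning these averaged scores into class labels is a small extra step. The sketch below is one way to do it, assuming a 0.5 decision threshold; the helper names `review_score` and `to_label` are hypothetical.

```python
def review_score(per_timestep_probs):
    """Average per-timestep probabilities into a single score for the review."""
    return sum(per_timestep_probs) / len(per_timestep_probs)


def to_label(score, threshold=0.5):
    """Map a score in [0, 1] to 1 (positive) or 0 (negative)."""
    return 1 if score > threshold else 0


# Made-up per-timestep outputs for one short review.
toy_score = review_score([0.2, 0.1, 0.3])
print(toy_score, "->", to_label(toy_score))

# The two mean scores printed above.
for score in (0.1501305, 0.9046873):
    print(score, "->", to_label(score))
```

Under this threshold the first review is labeled 0 (negative) and the second 1 (positive).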


In [None]:
## Conclusion:
## Review 7737: predictions[0] is close to 0, i.e. negative sentiment. This agrees with the text, which is clearly a negative review.
## Review 449: predictions[1] is close to 1, i.e. positive sentiment. This makes sense: the text clearly indicates that the viewer felt positively about the movie.
