![alt text](https://drive.google.com/uc?export=view&id=1UXScsVx_Wni_JuDdB8LeTnM6jsPfIwkW)

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

# Sentiment Classification

### Dataset
- 50,000 movie reviews from IMDB, each labeled by sentiment: positive (1) or negative (0).
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
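- By convention, "0" does not stand for a specific word. With the default `imdb.load_data()` settings, "0" is the padding value, "1" marks the start of a review, and "2" encodes any unknown (out-of-vocabulary) word.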
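
The frequency-based filtering described above ("top 10,000 words, minus the top 20") can be sketched in plain Python. `filter_by_rank` is a hypothetical helper written for illustration, not a Keras function, and the toy indices are made up. It assumes out-of-range indices are replaced by the unknown-word id 2, which mirrors what the `num_words` and `skip_top` arguments of `imdb.load_data()` do.

```python
def filter_by_rank(sequence, num_words=10000, skip_top=20, oov_id=2):
    """Keep indices with skip_top <= index < num_words; map the rest to oov_id."""
    return [i if skip_top <= i < num_words else oov_id for i in sequence]

# Toy encoded review (made-up indices, not taken from the dataset)
toy_review = [1, 14, 22, 16, 43, 530, 12000, 5]
filtered = filter_by_rank(toy_review)
print(filtered)  # [2, 2, 22, 2, 43, 530, 2, 2]
```

Because the indices are frequency ranks, this kind of filter needs only two comparisons per token and no lookup table.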

Command to import data
- `from tensorflow.keras.datasets import imdb`

### Import the data (4 Marks)
- Use `imdb.load_data()` method
- Get train and test set
- Take 10000 most frequent words

In [36]:
# LSTM for sequence classification on the IMDB dataset
import numpy as np  # linear algebra
import pandas as pd  # data processing
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.preprocessing import sequence
# fix random seeds for reproducibility (NumPy and TensorFlow keep separate generators)
np.random.seed(7)
tf.random.set_seed(7)

In [37]:
from tensorflow.keras.datasets import imdb
top_words = 10000
# num_words keeps only the top_words most frequent words; rarer words become the unknown id 2
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)





In [38]:
print(X_train[1])
print(type(X_train[1]))
print(len(X_train[1]))

[1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463, 4369, 5012, 134, 26, 4, 715, 8, 118, 1634, 14, 394, 20, 13, 119, 954, 189, 102, 5, 207, 110, 3103, 21, 14, 69, 188, 8, 30, 23, 7, 4, 249, 126, 93, 4, 114, 9, 2300, 1523, 5, 647, 4, 116, 9, 35, 8163, 4, 229, 9, 340, 1322, 4, 118, 9, 4, 130, 4901, 19, 4, 1002, 5, 89, 29, 952, 46, 37, 4, 455, 9, 45, 43, 38, 1543, 1905, 398, 4, 1649, 26, 6853, 5, 163, 11, 3215, 2, 4, 1153, 9, 194, 775, 7, 8255, 2, 349, 2637, 148, 605, 2, 8003, 15, 123, 125, 68, 2, 6853, 15, 349, 165, 4362, 98, 5, 4, 228, 9, 43, 2, 1157, 15, 299, 120, 5, 120, 174, 11, 220, 175, 136, 50, 9, 4373, 228, 8255, 5, 2, 656, 245, 2350, 5, 4, 9837, 131, 152, 491, 18, 2, 32, 7464, 1212, 14, 9, 6, 371, 78, 22, 625, 64, 1382, 9, 8, 168, 145, 23, 4, 1690, 15, 16, 4, 1355, 5, 28, 6, 52, 154, 462, 33, 89, 78, 285, 16, 145, 95]
<class 'list'>
189


### Pad each sentence to be of same length (4 Marks)
- Take maximum sequence length as 300

In [39]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
max_review_length = 300
# padding='post' appends zeros after each review; longer reviews are cut to 300 tokens
X_train = pad_sequences(X_train, maxlen=max_review_length, padding='post')
X_test = pad_sequences(X_test, maxlen=max_review_length, padding='post')
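
To make the effect of post-padding concrete, here is a minimal pure-Python sketch of the same idea. `pad_post` is a hypothetical stand-in, not the Keras implementation. It assumes the `pad_sequences` default of truncating from the front (`truncating='pre'`), so over-long reviews keep their last `maxlen` tokens.

```python
def pad_post(sequences, maxlen, value=0):
    """Return equal-length lists: keep the last `maxlen` tokens, pad at the end."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # truncate from the front if too long
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

toy = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
result = pad_post(toy, maxlen=5)
print(result)  # [[1, 2, 3, 0, 0], [5, 6, 7, 8, 9]]
```

Every row has the same length afterwards, which is what lets the reviews be stacked into a single `(25000, 300)` array.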

### Print shape of features & labels (4 Marks)

Number of reviews and number of words in each review

In [10]:
print(X_train.shape)

(25000, 300)


In [11]:
print(X_test.shape)

(25000, 300)


In [30]:
print(len(X_train[0,:]))

300


In [32]:
print(len(X_test[0,:]))

300


Number of labels

In [20]:
np.unique(y_test)

array([0, 1])

In [24]:
print(y_test.shape)

(25000,)


In [33]:
print(y_train.shape)

(25000,)


### Print value of any one feature and its label (4 Marks)

Feature value

In [40]:
print(X_train[1])

[   1  194 1153  194 8255   78  228    5    6 1463 4369 5012  134   26
    4  715    8  118 1634   14  394   20   13  119  954  189  102    5
  207  110 3103   21   14   69  188    8   30   23    7    4  249  126
   93    4  114    9 2300 1523    5  647    4  116    9   35 8163    4
  229    9  340 1322    4  118    9    4  130 4901   19    4 1002    5
   89   29  952   46   37    4  455    9   45   43   38 1543 1905  398
    4 1649   26 6853    5  163   11 3215    2    4 1153    9  194  775
    7 8255    2  349 2637  148  605    2 8003   15  123  125   68    2
 6853   15  349  165 4362   98    5    4  228    9   43    2 1157   15
  299  120    5  120  174   11  220  175  136   50    9 4373  228 8255
    5    2  656  245 2350    5    4 9837  131  152  491   18    2   32
 7464 1212   14    9    6  371   78   22  625   64 1382    9    8  168
  145   23    4 1690   15   16    4 1355    5   28    6   52  154  462
   33   89   78  285   16  145   95    0    0    0    0    0    0    0
    0 

Label value

In [38]:
print(y_train[1])

0


### Decode the feature value to get original sentence (4 Marks)

First, retrieve the dictionary that maps each word to its index in the IMDB dataset

In [17]:
# Naive lookup, kept for reference: it ignores the index offset applied by load_data()
#word_index = imdb.get_word_index() # get {word : index}
#index_word = {v : k for k,v in word_index.items()} # get {index : word}

#index = 1
#print(" ".join([index_word[idx] for idx in X_train[index]]))
#print("positive" if y_train[index]==1 else "negative")

the thought solid thought senator do making to is spot nomination assumed while he of jack in where picked as getting on was did hands fact characters to always life thrillers not as me can't in at are br of sure your way of little it strongly random to view of love it so principles of guy it used producer of where it of here icon film of outside to don't all unique some like of direction it if out her imagination below keep of queen he diverse to makes this stretch and of solid it thought begins br senator and budget worthwhile though ok and awaiting for ever better were and diverse for budget look kicked any to of making it out and follows for effects show to show cast this family us scenes more it severe making senator to and finds tv tend to of emerged these thing wants but and an beckinsale cult as it is video do you david see scenery it in few those are of ship for with of wild to one is very work dark they don't do dvd with those them
negative


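The above gives us an incoherent decoded review. The cause is not the padding: `imdb.load_data()` shifts every word index by 3 (its default `index_from`), reserving 0, 1 and 2 for padding, start-of-review and unknown words. The lookup must therefore subtract 3 from each index.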

Now use the dictionary to get the original words from the encodings, for a particular sentence

In [41]:
word_index = imdb.get_word_index()
reverse_word_index = dict([(v, k) for (k, v) in word_index.items()])
# load_data() offsets every real word index by 3, because 0, 1 and 2 are reserved for
# padding, start-of-review and unknown words. Subtracting 3 recovers the original word;
# the reserved ids find no match and decode to an empty string.
decoded_review = ' '.join([reverse_word_index.get(i - 3, "") for i in X_train[1]])
print(decoded_review)

 big hair big boobs bad music and a giant safety pin these are the words to best describe this terrible movie i love cheesy horror movies and i've seen hundreds but this had got to be on of the worst ever made the plot is paper thin and ridiculous the acting is an abomination the script is completely laughable the best is the end showdown with the cop and how he worked out who the killer is it's just so damn terribly written the clothes are sickening and funny in equal  the hair is big lots of boobs  men wear those cut  shirts that show off their  sickening that men actually wore them and the music is just  trash that plays over and over again in almost every scene there is trashy music boobs and  taking away bodies and the gym still doesn't close for  all joking aside this is a truly bad film whose only charm is to look back on the disaster that was the 80's and have a good old laugh at how bad everything was back then                                                                   
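
The offset can be seen on a toy example. The sketch below uses a made-up four-word vocabulary in the same `{word: rank}` form as `imdb.get_word_index()`. The placeholder strings `<pad>`, `<start>` and `<unk>` are names chosen here for readability, not tokens defined by Keras.

```python
# Toy vocabulary in the same {word: rank} form as imdb.get_word_index()
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}
reverse_index = {rank: word for word, rank in toy_word_index.items()}

RESERVED = {0: "<pad>", 1: "<start>", 2: "<unk>"}  # placeholder names chosen here
INDEX_FROM = 3  # offset added to every real word index

def decode(encoded):
    """Map reserved ids to placeholders and shift real ids back by INDEX_FROM."""
    words = []
    for i in encoded:
        if i in RESERVED:
            words.append(RESERVED[i])
        else:
            words.append(reverse_index.get(i - INDEX_FROM, "?"))
    return " ".join(words)

encoded = [1, 4, 5, 6, 7, 2, 0, 0]  # start, "the movie was great", unknown, padding
naive = " ".join(reverse_index.get(i, "?") for i in encoded)
print(naive)            # the great ? ? ? movie ? ?
print(decode(encoded))  # <start> the movie was great <unk> <pad> <pad>
```

The naive lookup still returns real words, just the wrong ones, which is why the first decoding attempt looked like plausible but scrambled English.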

Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [42]:
print("positive" if y_train[1]==1 else "negative")

negative


### Define model (10 Marks)
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - The `tensorflow.keras` Embedding layer doesn't require one-hot encoded words; instead, each word needs a unique integer id. The IMDB dataset we loaded is already encoded this way. If it weren't, we could use sklearn's `LabelEncoder`.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
- Add LSTM layer
  - Pass value in `return_sequences` as True
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer

In [44]:
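model = Sequential()
model.add(Embedding(10000, 100, input_length=300))
# return_sequences is left at its default (False) so the LSTM emits one 100-d vector per
# review. This is the many-to-one variant reflected in the recorded summary, where the
# LSTM output shape is (None, 100); it omits the TimeDistributed and Flatten layers.
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))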
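
For comparison, the layer stack exactly as listed in the task bullets could look like the sketch below. This is an illustration under assumptions, not the model that produced the recorded results: the LSTM width of 100 and the absence of an activation on the `TimeDistributed` Dense layer are choices made here, and `spec_model` is a hypothetical name.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten, LSTM, TimeDistributed

vocab_size, embed_dim, seq_len = 10000, 100, 300

spec_model = Sequential([
    Input(shape=(seq_len,)),           # 300 word ids per review
    Embedding(vocab_size, embed_dim),  # -> (batch, 300, 100)
    LSTM(100, return_sequences=True),  # one 100-d vector per time step
    TimeDistributed(Dense(100)),       # same Dense layer applied at every step
    Flatten(),                         # -> (batch, 300 * 100)
    Dense(1, activation='sigmoid'),    # probability of a positive review
])
print(spec_model.output_shape)  # (None, 1)
```

With `return_sequences=True` the `Flatten` layer is what collapses the per-step outputs into a single vector, so the final Dense layer still yields one probability per review.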



### Compile the model (4 Marks)
- Use Optimizer as Adam
- Use Binary Crossentropy as loss
- Use Accuracy as metrics

In [45]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
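
As a rough illustration of what the loss and metric measure, the sketch below computes binary cross-entropy and accuracy by hand in plain Python. The helper names, the clipping constant `eps` and the four label/probability pairs are made up for this example; Keras' own implementations may differ in numerical details.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)), with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability equals the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]  # made-up predicted probabilities
loss = binary_crossentropy(labels, probs)
acc = accuracy(labels, probs)
print(round(loss, 4), acc)  # 0.5403 0.5
```

The loss rewards confident correct predictions and penalises confident wrong ones, while accuracy only checks which side of 0.5 each prediction falls on.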



### Print model summary (4 Marks)

In [46]:
model.summary()  # summary() prints the table itself and returns None

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 300, 100)          1000000   
_________________________________________________________________
lstm (LSTM)                  (None, 100)               80400     
_________________________________________________________________
dense (Dense)                (None, 1)                 101       
Total params: 1,080,501
Trainable params: 1,080,501
Non-trainable params: 0
_________________________________________________________________
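
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer; the variable names are chosen here for illustration.

```python
vocab_size, embed_dim, lstm_units = 10000, 100, 100

# One embed_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embed_dim
# 4 gates, each with input weights, recurrent weights and a bias vector
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
# One weight per LSTM unit plus a single bias
dense_params = lstm_units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 1000000 80400 101 1080501
```

Most of the parameters sit in the embedding table, so vocabulary size and embedding dimension dominate the model's size.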


### Fit the model (4 Marks)

In [47]:
model.fit(X_train, y_train, epochs=10, batch_size=64)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Accuracy: 81.88%


### Evaluate model (4 Marks)

Performed in the step above: `model.evaluate` reports the accuracy on the test set.

### Predict on one sample (4 Marks)

In [52]:
# predict() expects a batch, so slice to keep the shape (1, 300).
# Passing X_test[1] alone (shape (300,)) is treated as 300 separate one-token inputs
# and returns 300 probabilities instead of a single one for the review.
prob = model.predict(X_test[1:2])
print(prob)  # probability that the review is positive; above 0.5 maps to label 1
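
Predicting on a single review requires a batch axis, because `predict` treats the first dimension as the number of samples. The NumPy-only sketch below shows the shapes involved. `X` is a zero-filled stand-in for the padded test set, and the probability value is made up to show the thresholding step.

```python
import numpy as np

X = np.zeros((4, 300), dtype=int)  # stand-in for the padded test set

single = X[1]                      # shape (300,): no batch axis
batch_a = X[1:2]                   # shape (1, 300): slicing keeps the batch axis
batch_b = np.expand_dims(X[1], 0)  # shape (1, 300): axis added explicitly
print(single.shape, batch_a.shape, batch_b.shape)  # (300,) (1, 300) (1, 300)

prob = np.array([[0.8]])           # made-up model output for one review
label = int(prob[0, 0] > 0.5)      # 1 = positive, 0 = negative
print(label)  # 1
```

Either `X[1:2]` or `np.expand_dims` works; the slice is shorter, while `expand_dims` makes the intent explicit.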

The model predicts the sentiment of the test reviews with about 82% accuracy (81.88% on the 25,000-review test set).