<a href="https://colab.research.google.com/github/ANANTHMANOJ/Ai_Projs/blob/master/Sequential_Models_in_NLP_Sentiment_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Classification

### Dataset
- Dataset of 50,000 movie reviews from IMDB, each labeled by sentiment: positive (1) or negative (0).
- Each review is encoded as a sequence of word indexes (integers).
- Words are indexed by overall frequency in the dataset, so a lower integer means a more frequent word. This allows for quick filtering operations, such as keeping only the 10,000 most frequent words.
- By convention, the lowest indexes are reserved rather than assigned to words. With the default `imdb.load_data()` settings, "0" is used for padding, "1" marks the start of a review, and "2" replaces any out-of-vocabulary word, so word ranks are shifted by 3 in the loaded data.
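
To make the index convention concrete, here is a minimal pure-Python sketch of how a review could be turned into integers under the default `imdb.load_data()` settings (start marker 1, out-of-vocabulary marker 2, ranks shifted by 3). The toy `word_rank` table and the `encode_review` helper are hypothetical and exist only for illustration; they are not part of Keras.

```python
# Toy frequency ranks (1 = most frequent); purely illustrative values.
word_rank = {"the": 1, "movie": 2, "was": 3, "great": 4}

START, OOV, INDEX_FROM = 1, 2, 3  # defaults of imdb.load_data()

def encode_review(words, num_words):
    """Encode words the way the loaded IMDB data is laid out (sketch)."""
    encoded = [START]
    for word in words:
        rank = word_rank.get(word)
        index = rank + INDEX_FROM if rank is not None else None
        # Unknown words and indexes at or above num_words become OOV.
        if index is None or index >= num_words:
            encoded.append(OOV)
        else:
            encoded.append(index)
    return encoded

print(encode_review(["the", "movie", "was", "great"], num_words=7))
# [1, 4, 5, 6, 2]
```

Here "great" has rank 4, which maps to index 7 and falls outside `num_words=7`, so it is replaced by the out-of-vocabulary marker.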



### Import the data 
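- Load the data with the `imdb.load_data()` method.
- Keep only the 10,000 most frequent words in the train and test sets (`num_words=10000`); rarer words are replaced by the out-of-vocabulary index.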

In [None]:
from tensorflow.keras.datasets import imdb
data = imdb.load_data(num_words=10000)  # keep only the 10,000 most frequent words

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [None]:
(train_x, train_y), (test_x, test_y) = data  # unpack the train and test splits
train_x

array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]),
       list([1, 194, 1153, 194, 8255, 78, 228,

<h3>Analysing the data</h3>

In [None]:
len(train_x)

25000

In [None]:
train_y

array([1, 0, 0, ..., 0, 1, 0])

In [None]:
test_x

array([list([1, 591, 202, 14, 31, 6, 717, 10, 10, 2, 2, 5, 4, 360, 7, 4, 177, 5760, 394, 354, 4, 123, 9, 1035, 1035, 1035, 10, 10, 13, 92, 124, 89, 488, 7944, 100, 28, 1668, 14, 31, 23, 27, 7479, 29, 220, 468, 8, 124, 14, 286, 170, 8, 157, 46, 5, 27, 239, 16, 179, 2, 38, 32, 25, 7944, 451, 202, 14, 6, 717]),
       list([1, 14, 22, 3443, 6, 176, 7, 5063, 88, 12, 2679, 23, 1310, 5, 109, 943, 4, 114, 9, 55, 606, 5, 111, 7, 4, 139, 193, 273, 23, 4, 172, 270, 11, 7216, 2, 4, 8463, 2801, 109, 1603, 21, 4, 22, 3861, 8, 6, 1193, 1330, 10, 10, 4, 105, 987, 35, 841, 2, 19, 861, 1074, 5, 1987, 2, 45, 55, 221, 15, 670, 5304, 526, 14, 1069, 4, 405, 5, 2438, 7, 27, 85, 108, 131, 4, 5045, 5304, 3884, 405, 9, 3523, 133, 5, 50, 13, 104, 51, 66, 166, 14, 22, 157, 9, 4, 530, 239, 34, 8463, 2801, 45, 407, 31, 7, 41, 3778, 105, 21, 59, 299, 12, 38, 950, 5, 4521, 15, 45, 629, 488, 2733, 127, 6, 52, 292, 17, 4, 6936, 185, 132, 1988, 5304, 1799, 488, 2693, 47, 6, 392, 173, 4, 2, 4378, 270, 2352, 4, 1500, 7, 

In [None]:
test_y

array([0, 1, 1, ..., 0, 0, 0])

In [None]:
import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing import sequence

In [None]:
x=np.concatenate((train_x,test_x),axis=0)
y=np.concatenate((train_y,test_y),axis=0)

In [None]:
print("Number of feature rows:", x.shape[0])
print("Number of labels:", y.shape[0])

Number of feature rows: 50000
Number of labels: 50000


In [None]:
np.unique(y)

array([0, 1])

In [None]:
len(np.unique(np.hstack(x)))

9998
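
The count of 9998 rather than 10000 is consistent with the reserved indexes, assuming the default `load_data()` settings and that every retained index occurs at least once: 0 (padding) never appears in the raw data, 3 is left unused, and word indexes run from 4 to 9999. A quick check of that arithmetic:

```python
# Indexes that can appear with num_words=10000 under the default settings:
# 1 = start marker, 2 = out-of-vocabulary marker, 4..9999 = actual words.
possible_indexes = {1, 2} | set(range(4, 10000))
print(len(possible_indexes))  # 9998
```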

### Number of reviews and number of words in each review

In [None]:
print("Number of reviews is ",x.shape[0])

Number of reviews is  50000


In [None]:
for i, review in enumerate(x):
    print("The number of words in sentence ", i, " is ", len(review))

Streaming output truncated to the last 5000 lines.
The number of words in sentence  45001  is  361
The number of words in sentence  45002  is  214
The number of words in sentence  45003  is  295
The number of words in sentence  45004  is  131
The number of words in sentence  45005  is  135
The number of words in sentence  45006  is  185
The number of words in sentence  45007  is  238
The number of words in sentence  45008  is  180
The number of words in sentence  45009  is  36
The number of words in sentence  45010  is  131
The number of words in sentence  45011  is  325
The number of words in sentence  45012  is  229
The number of words in sentence  45013  is  475
The number of words in sentence  45014  is  165
The number of words in sentence  45015  is  160
The number of words in sentence  45016  is  68
The number of words in sentence  45017  is  371
The number of words in sentence  45018  is  554
The number of words in sentence  45019  is  118
The number of words in se
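
Printing one line per review produces 50,000 lines of output, so a few summary statistics are usually easier to read. The sketch below uses only the lengths visible in the truncated output above (reviews 45001 to 45019) as sample data; on the full dataset the same code could be applied to `[len(r) for r in x]`. The cutoff of 300 tokens is the maximum length used for padding later in this notebook.

```python
import statistics

# Review lengths copied from the visible output (reviews 45001-45019).
lengths = [361, 214, 295, 131, 135, 185, 238, 180, 36, 131,
           325, 229, 475, 165, 160, 68, 371, 554, 118]

shortest, longest = min(lengths), max(lengths)
median = statistics.median(lengths)
# How many of these reviews fit in 300 tokens without truncation?
within_300 = sum(length <= 300 for length in lengths)

print(shortest, longest, median, within_300)  # 36 554 185 14
```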

Number of labels

In [None]:
print("Number of labels:", y.shape[0])

Number of labels: 50000


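### Padding each review to the same length
- Use a maximum sequence length of 300, which keeps the input size manageable for the model. With the `pad_sequences` defaults, shorter reviews are zero-padded at the front and longer reviews are truncated from the front.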

In [None]:
train_xp= sequence.pad_sequences(train_x,maxlen=300)
train_xp

array([[   0,    0,    0, ...,   19,  178,   32],
       [   0,    0,    0, ...,   16,  145,   95],
       [   0,    0,    0, ...,    7,  129,  113],
       ...,
       [   0,    0,    0, ...,    4, 3586,    2],
       [   0,    0,    0, ...,   12,    9,   23],
       [   0,    0,    0, ...,  204,  131,    9]], dtype=int32)

In [None]:
test_xp= sequence.pad_sequences(test_x,maxlen=300)
test_xp

array([[   0,    0,    0, ...,   14,    6,  717],
       [   0,    0,    0, ...,  125,    4, 3077],
       [1239, 5189,  137, ...,    9,   57,  975],
       ...,
       [   0,    0,    0, ...,   21,  846, 5518],
       [   0,    0,    0, ..., 2302,    7,  470],
       [   0,    0,    0, ...,   34, 2005, 2643]], dtype=int32)
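
The rows above show leading zeros for short reviews, while a row that starts with non-zero values (such as the third row of `test_xp`) belongs to a review of at least 300 tokens. A minimal pure-Python sketch of that behavior, assuming the `pad_sequences` defaults of pre-padding and pre-truncating; `pad_pre` is a hypothetical helper, not the Keras implementation:

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad with `value` (sketch)."""
    kept = seq[-maxlen:]                          # truncate from the front
    return [value] * (maxlen - len(kept)) + kept  # pad at the front

print(pad_pre([1, 14, 22], maxlen=5))         # [0, 0, 1, 14, 22]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```

Because truncation removes tokens from the start, the end of a long review is what the model sees.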

### Printing the value of a feature and its label

Feature value

In [None]:
print("Feature value of line 22 is")
print(x[22])

Feature value of line 22 is
[1, 13, 784, 886, 857, 15, 135, 142, 40, 2, 437, 129, 58, 14, 22, 4385, 23, 1903, 758, 12, 127, 8, 15, 2215, 246, 18, 72, 12, 203, 28, 49, 432, 7, 2, 1382, 48, 25, 40, 4, 85, 2, 201, 108, 14, 31, 80, 30, 1753, 48, 25, 40, 6583, 2, 108, 14, 31, 80, 30, 1753, 10, 10, 14, 22, 9, 24, 17, 52, 11, 61, 652, 17, 101, 7, 4, 908, 201, 7609, 63, 2684, 745, 2, 17, 4, 2311, 45, 76, 7569, 5, 4, 114, 9, 3104, 874, 110, 14, 172, 1321, 2701, 343, 11, 111, 85, 108, 5, 633, 128, 10, 10, 21, 4, 116, 9, 52, 5, 38, 9, 4, 1524, 5, 4, 807, 45, 43, 1892, 11, 1708, 5, 490, 1329, 822, 46, 618, 803, 170, 23, 5, 89, 45, 32, 170, 8, 216, 46, 11, 4, 130, 24, 53, 74, 6, 6542, 7, 4, 96, 143, 10, 10, 4, 2, 201, 9, 2394, 1359, 5, 50, 109, 1310, 1524, 370, 2425, 5, 2445, 26, 53, 674, 74, 4, 65, 410, 21, 14, 22, 9, 24, 1359, 45, 99, 641, 3324, 5, 363, 1356, 18, 15, 1082, 745, 2, 109, 885, 148, 7, 101, 7, 27, 1914, 11, 4, 960, 108, 69, 8, 216, 8, 6, 52, 130, 25, 43, 115, 697, 366, 4, 130, 10, 10

Label value

In [None]:
print("Label value of line 22 is")
y[22]

Label value of line 22 is


1

### Decoding the feature value to get the original sentence
 - To check what the sentence may mean.

First, retrieve a dictionary that maps words to their index in the IMDB dataset.

In [None]:
imdb_dict = imdb.get_word_index()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


Now use that dictionary to recover the words from the encoding of a particular sentence.

In [None]:
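# Invert the word -> index mapping once instead of scanning it for every token.
index_to_word = {index: word for word, index in imdb_dict.items()}
# Note: load_data() shifts word indexes by 3 by default (index_from=3), so this
# direct lookup prints shifted words; index_to_word.get(i - 3, '?') gives the true text.
for i in x[1]:
    print(index_to_word[i], end=' ')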

the thought solid thought senator do making to is spot nomination assumed while he of jack in where picked as getting on was did hands fact characters to always life thrillers not as me can't in at are br of sure your way of little it strongly random to view of love it so principles of guy it used producer of where it of here icon film of outside to don't all unique some like of direction it if out her imagination below keep of queen he diverse to makes this stretch and of solid it thought begins br senator and budget worthwhile though ok and awaiting for ever better were and diverse for budget look kicked any to of making it out and follows for effects show to show cast this family us scenes more it severe making senator to and finds tv tend to of emerged these thing wants but and an beckinsale cult as it is video do you david see scenery it in few those are of ship for with of wild to one is very work dark they don't do dvd with those them 
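
The decoded text above reads as word salad because the indexes in the loaded data are shifted by 3 relative to `get_word_index()` (the `index_from=3` default of `load_data()`), so a direct lookup returns the wrong words. A minimal sketch of an offset-aware decoder, shown on a toy vocabulary; `decode_review` and `toy_word_index` are hypothetical names used only for illustration:

```python
def decode_review(encoded, word_index, index_from=3):
    """Map encoded integers back to words, undoing the index shift."""
    index_to_word = {idx: word for word, idx in word_index.items()}
    # Indexes up to index_from are reserved or unused (padding, start, OOV),
    # so the shifted lookup finds no word for them and "?" is used instead.
    return " ".join(index_to_word.get(i - index_from, "?") for i in encoded)

toy_word_index = {"the": 1, "movie": 2, "was": 3}
print(decode_review([1, 4, 5, 6, 2], toy_word_index))
# ? the movie was ?
```

With the real data this would be called as `decode_review(x[1], imdb_dict)`.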

Checking the sentiment of the line that was just read.

In [None]:
print("negative" if y[1] == 0 else "positive")

negative


So this review carries a negative sentiment.

### Defining model
- Define a Sequential model, which contains:
- Embedding layer
  - The Embedding layer turns positive integers into dense vectors of fixed size.
  - The `tensorflow.keras` Embedding layer doesn't require us to one-hot encode our words; instead, each word needs a unique integer id. For the IMDB dataset we've loaded this has already been done; otherwise we could use sklearn's `LabelEncoder`.
  - Size of the vocabulary: 10000
  - Dimension of the dense embedding: 100
  - Length of input sequences: 300
- LSTM layer
  - Set `return_sequences` to True so the layer outputs one vector per time step.
- A `TimeDistributed` layer wrapping a Dense layer. The intended size is 100 neurons, but the code below passes `len(train_y)`, so the layer actually has 25,000 units, as the model summary shows.
- Flatten layer
- Dense layer with a single sigmoid unit that outputs the probability of a positive review
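
Conceptually, an embedding layer is a trainable lookup table: row `i` of the table is the vector for word index `i`. A minimal pure-Python sketch with a toy 5-word vocabulary and 3-dimensional vectors (the values are arbitrary; in the real model the table has 10000 rows of 100 values that are learned during training):

```python
# Toy embedding table: one row of 3 numbers per word index (arbitrary values).
embedding_table = [
    [0.0, 0.0, 0.0],   # index 0
    [0.1, 0.3, -0.2],  # index 1
    [0.5, -0.1, 0.4],  # index 2
    [-0.3, 0.2, 0.1],  # index 3
    [0.7, 0.0, -0.5],  # index 4
]

def embed(sequence):
    """Replace every integer in the sequence by its row in the table."""
    return [embedding_table[i] for i in sequence]

vectors = embed([0, 0, 4, 2])
print(len(vectors), len(vectors[0]))  # 4 3
```

A padded review of 300 integers therefore becomes a 300 x 100 matrix before it reaches the LSTM.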

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, Flatten, LSTM, TimeDistributed

In [None]:
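model = Sequential()
# Map each of the 10,000 word indexes to a 100-dimensional vector.
model.add(Embedding(input_dim=10000, output_dim=100, input_length=300))
# return_sequences=True keeps the LSTM output for every one of the 300 time steps.
model.add(LSTM(units=100, return_sequences=True))
# Note: len(train_y) is 25000, so this Dense layer is far wider than the 100 neurons described above.
model.add(TimeDistributed(Dense(len(train_y), activation="softmax")))
model.add(Flatten())
# Single sigmoid unit: probability that the review is positive.
model.add(Dense(1, activation='sigmoid'))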

### Compiling the model with
- Optimizer: Adam
- Loss: binary cross-entropy
- Metric: accuracy

In [None]:
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
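
Binary cross-entropy penalizes confident wrong predictions much more heavily than it rewards confident right ones. A short worked sketch of the per-sample formula, -[y log(p) + (1 - y) log(1 - p)], in plain Python; the `binary_crossentropy` function here is a hand-written illustration, not the Keras function:

```python
import math

def binary_crossentropy(y_true, p):
    """Loss for one sample with label y_true (0 or 1) and predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# A positive review predicted with high vs. low probability of being positive.
print(round(binary_crossentropy(1, 0.9), 4))  # 0.1054
print(round(binary_crossentropy(1, 0.1), 4))  # 2.3026
```

The loss reported during training is the average of this quantity over the samples in a batch.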

### Summary 

In [None]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 300, 100)          1000000   
_________________________________________________________________
lstm_4 (LSTM)                (None, 300, 100)          80400     
_________________________________________________________________
time_distributed_4 (TimeDist (None, 300, 25000)        2525000   
_________________________________________________________________
flatten_4 (Flatten)          (None, 7500000)           0         
_________________________________________________________________
dense_9 (Dense)              (None, 1)                 7500001   
Total params: 11,105,401
Trainable params: 11,105,401
Non-trainable params: 0
_________________________________________________________________
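
The parameter counts in the summary can be reproduced by hand, which also shows how much the 25,000-unit `TimeDistributed` layer costs. A minimal sketch using the standard formulas for embedding, LSTM, and dense layers; the 100-unit variant is a hypothetical alternative matching the layer description, not a model that was trained in this notebook:

```python
def dense_params(n_in, n_out):
    return n_in * n_out + n_out                  # weights + biases

def lstm_params(n_in, units):
    return 4 * (units * (n_in + units) + units)  # 4 gates, each with weights + biases

def total_params(td_units, vocab=10000, emb=100, lstm_units=100, steps=300):
    embedding = vocab * emb
    lstm = lstm_params(emb, lstm_units)
    time_distributed = dense_params(lstm_units, td_units)
    output = dense_params(steps * td_units, 1)   # Dense(1) after Flatten
    return embedding + lstm + time_distributed + output

print(total_params(25000))  # 11105401, as in the summary
print(total_params(100))    # 1120501
```

Most of the parameters come from the final Dense layer, whose input is the flattened 300 x 25000 tensor.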


### Fitting the model 

In [None]:
model.fit(x=train_xp,y=train_y,batch_size=32,epochs=5,verbose=1,validation_split=0.1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f3c67ffb518>
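
With `validation_split=0.1`, Keras holds out the last 10% of the training arrays for validation, so each epoch trains on the remaining 90%. The arithmetic for this run, as a small sketch:

```python
import math

n_samples, validation_split, batch_size = 25000, 0.1, 32

n_val = int(n_samples * validation_split)          # samples held out for validation
n_train = n_samples - n_val                        # samples used for gradient updates
steps_per_epoch = math.ceil(n_train / batch_size)  # batches per epoch

print(n_train, n_val, steps_per_epoch)  # 22500 2500 704
```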

### Evaluating model


In [None]:
model.evaluate(test_xp,test_y,batch_size=100,verbose=1)



[0.32071375846862793, 0.8720399737358093]

### Predicting on samples.

In [None]:
pred=model.predict(test_xp)

In [None]:
pred= np.around(pred).astype(int)
print(pred)

[[0]
 [1]
 [1]
 ...
 [0]
 [0]
 [1]]
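
Rounding the sigmoid outputs amounts to thresholding them at 0.5, and comparing the result with the true labels gives an accuracy figure like the one reported by `model.evaluate`. A minimal pure-Python sketch on made-up probabilities (the values below are illustrative, not model output):

```python
# Hypothetical predicted probabilities and true labels for six reviews.
probabilities = [0.08, 0.93, 0.61, 0.47, 0.12, 0.55]
true_labels = [0, 1, 1, 1, 0, 0]

# Threshold at 0.5: above means positive (1), otherwise negative (0).
predicted = [1 if p > 0.5 else 0 for p in probabilities]

correct = sum(p == t for p, t in zip(predicted, true_labels))
accuracy = correct / len(true_labels)

print(predicted)           # [0, 1, 1, 0, 0, 1]
print(round(accuracy, 3))  # 0.667
```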


In [None]:
print(test_y[10])

1


The true label of the review at index 10 is positive.

In [None]:
pred[10]

array([1])

The predicted sentiment of the review at index 10 is also positive.

With a test accuracy of about 87% (loss of about 0.32), the model predicts sentiment reasonably well.
<h3> Thank You</h3>