# Sentiment Classification

### Dataset
- Dataset of 50,000 movie reviews from IMDB, each labeled by sentiment: positive (1) or negative (0).
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
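- As a convention, "0" does not stand for a specific word. With the default arguments of `imdb.load_data()`, 0 is reserved for padding, 1 marks the start of a review, and 2 encodes any unknown (out-of-vocabulary) word.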
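
The frequency-based filtering described above can be sketched in plain Python. `imdb.load_data()` exposes `num_words` and `skip_top` arguments for this purpose; the helper below, `filter_by_frequency`, is a hypothetical stand-in (not part of Keras) that applies the same idea to toy data, assuming an out-of-vocabulary marker of 2. Note that the thresholds are applied to the encoded index values.

```python
# Sketch of frequency-based filtering on rank-encoded reviews.
# `filter_by_frequency` is a hypothetical helper, not a Keras function.

def filter_by_frequency(sequences, num_words, skip_top=0, oov_char=2):
    """Keep indices in [skip_top, num_words); replace all others with oov_char."""
    return [
        [w if skip_top <= w < num_words else oov_char for w in seq]
        for seq in sequences
    ]

# Toy reviews: smaller integers stand for more frequent words.
toy_reviews = [[1, 14, 22, 9500, 12000], [1, 5, 25, 300]]

# "Top 10,000 most common words, but eliminate the top 20 most common words".
filtered = filter_by_frequency(toy_reviews, num_words=10000, skip_top=20)
print(filtered)
# [[2, 2, 22, 9500, 2], [2, 2, 25, 300]]
```

Replacing filtered words with a marker, rather than deleting them, keeps every review at its original length.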

Command to import data
- `from tensorflow.keras.datasets import imdb`

### Import the data
- Use `imdb.load_data()` method
- Get train and test set
- Take 10000 most frequent words

In [46]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import models
from tensorflow.keras import layers
import tensorflow as tf

In [2]:
#### Add your code here ####
from tensorflow.keras.datasets import imdb
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000)

X = np.concatenate((X_train, X_test), axis = 0)
y = np.concatenate((y_train, y_test), axis = 0)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz




In [3]:
print("Training data: ")
print(X_train.shape)
print(y_train.shape)

print("Testing data: ")
print(X_test.shape)
print(y_test.shape)


Training data: 
(25000,)
(25000,)
Testing data: 
(25000,)
(25000,)


In [5]:
print("Number of words: ")
print(len(np.unique(np.hstack(X))))

Number of words: 
9998


### Pad each sentence to be of same length (2 Marks)
- Take maximum sequence length as 300

In [26]:
#### Add your code here ####
#import pad_sequences
from tensorflow.keras.preprocessing.sequence import pad_sequences

X_train_pad = pad_sequences(sequences=X_train, maxlen=300, padding='pre', truncating='post')

X_test_pad = pad_sequences(sequences=X_test, maxlen=300, padding='pre', truncating='post')
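
To make the effect of `padding='pre'` and `truncating='post'` concrete, here is a minimal NumPy sketch on toy sequences. `pad_pre_truncate_post` is a hypothetical helper written for illustration; it is not the Keras implementation, only an imitation of the behaviour those two settings are expected to produce.

```python
import numpy as np

def pad_pre_truncate_post(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; cut long ones at the end."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = list(seq)[:maxlen]                 # truncating='post': drop the tail
        if kept:
            out[row, maxlen - len(kept):] = kept  # padding='pre': fill values go in front
    return out

toy = [[5, 6, 7], [1, 2, 3, 4, 5, 6, 7]]
padded = pad_pre_truncate_post(toy, maxlen=5)
print(padded)
# [[0 0 5 6 7]
#  [1 2 3 4 5]]
```

With pre-padding, the real tokens sit at the end of each row, so a recurrent layer reads them last.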

### Print shape of features & labels (2 Marks)

Number of reviews, number of words in each review

In [27]:
#### Add your code here ####
print(f'Shape of X_train_pad is: {X_train_pad.shape} \nShape of X_test_pad is: {X_test_pad.shape}')

Shape of X_train_pad is: (25000, 300) 
Shape of X_test_pad is: (25000, 300)


In [28]:
#### Add your code here ####
print(f'Shape of y_train is: {y_train.shape} \nShape of y_test is: {y_test.shape}')

Shape of y_train is: (25000,) 
Shape of y_test is: (25000,)


Number of labels

In [29]:
print("Unique Labels: ")
print(np.unique(y))

Unique Labels: 
[0 1]


In [30]:
#### Add your code here ####
from collections import Counter
print(Counter(y_train))
print(Counter(y_test))

Counter({1: 12500, 0: 12500})
Counter({0: 12500, 1: 12500})


### Print value of any one feature and its label (2 Marks)

Feature value

In [33]:
#### Add your code here ####

X_train_pad[15]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

Label value

In [34]:
#### Add your code here ####
y_train[15]

0

### Decode the feature value to get original sentence (2 Marks)

First, retrieve a dictionary that contains mapping of words to their index in the IMDB dataset

In [38]:
#### Add your code here ####
index = imdb.get_word_index()

Now use the dictionary to get the original words from the encodings, for a particular sentence

In [39]:
#### Add your code here ####
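# Invert the mapping so that it goes from index to word
reverse_index = {value: key for (key, value) in index.items()}
# Encoded indices are shifted by 3 (0, 1 and 2 are reserved), so undo the shift; "#" marks anything unmapped
decoded = " ".join(reverse_index.get(i - 3, "#") for i in X_train[0])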
print(decoded)

# this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert # is an amazing actor and now the same being director # father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for # and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also # to the two little boy's that played the # of norman and paul they were just brilliant children are often left out of the # list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you thi
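
The `i - 3` in the decoding step exists because the encoded reviews reserve the lowest indices for special tokens, while the dictionary from `imdb.get_word_index()` starts counting words at 1. The sketch below shows this on a toy dictionary; `decode_review` and `toy_word_index` are hypothetical names introduced for illustration, and the offset of 3 is assumed to match the `load_data()` default.

```python
def decode_review(encoded, word_index, index_from=3, unknown="#"):
    """Map encoded integers back to words, undoing the reserved-index offset."""
    reverse = {idx: word for word, idx in word_index.items()}
    return " ".join(reverse.get(i - index_from, unknown) for i in encoded)

# Toy dictionary in the same word -> rank format as imdb.get_word_index().
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}

# 1 marks the start of a review and 2 an unknown word; both decode to "#".
encoded = [1, 4, 5, 6, 2, 7]
print(decode_review(encoded, toy_word_index))
# # the film was # brilliant
```

This is why the decoded review above begins with "#": the first token of every review is the start marker.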

Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [43]:
#### Add your code here ####
print('The sentiment of the review is Positive' if y_train[0]==1 else 'The sentiment of the review is Negative')

The sentiment of the review is Positive


### Define model (10 Marks)
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - The `tensorflow.keras` Embedding layer does not require one-hot encoded words. Instead, each word needs a unique integer id. This has already been done for the IMDB dataset loaded here; if it had not, sklearn's `LabelEncoder` could be used.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
- Add LSTM layer
  - Pass value in `return_sequences` as True
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer
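
To illustrate why an embedding layer works directly on integer ids, the NumPy sketch below treats the embedding as a lookup table and checks that indexing into it gives the same result as multiplying explicit one-hot vectors by the table. The sizes and the random `table` are toy assumptions; in the model the table holds trainable weights.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4                     # toy sizes; the notebook uses 10000 and 100
table = rng.normal(size=(vocab_size, embed_dim))  # stands in for the trainable weights

tokens = np.array([[0, 3, 7], [2, 2, 9]])         # batch of 2 padded sequences of length 3

# Integer ids index straight into the table: no one-hot encoding needed.
looked_up = table[tokens]                         # shape (2, 3, 4)

# The same result via explicit one-hot vectors and a matrix product.
one_hot = np.eye(vocab_size)[tokens]              # shape (2, 3, 10)
via_matmul = one_hot @ table                      # shape (2, 3, 4)

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```

The lookup avoids materialising the large, mostly-zero one-hot tensor, which matters when the vocabulary has thousands of entries.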

In [52]:
#### Add your code here ####

tf.keras.backend.clear_session()  # reset Keras state; returns None, so do not assign it
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(input_dim=10000+1,input_length=300, output_dim=100))
model.add(tf.keras.layers.LSTM(32,return_sequences=True))
model.add(tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(100)))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(200))
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Dense(1,activation='sigmoid'))

### Compile the model (2 Marks)
- Use Optimizer as Adam
- Use Binary Crossentropy as loss
- Use Accuracy as metrics
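
As a rough illustration of what the binary cross-entropy loss measures, the following NumPy sketch computes it for a pair of toy predictions. The function name and the clipping constant `eps` are assumptions made for this example; the Keras implementation may differ in its numerical details.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithms stay finite.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(round(confident_right, 4), round(confident_wrong, 4))
# 0.1054 2.3026
```

Confident wrong predictions are penalised far more heavily than confident correct ones are rewarded, which is what drives the sigmoid output toward the true label.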

In [53]:
#### Add your code here ####
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
early_stop = tf.keras.callbacks.EarlyStopping(patience=4,restore_best_weights=True,monitor='val_loss')
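
The stopping rule configured by `patience=4` can be approximated in a few lines of plain Python. This is a simplified model of the callback, not its actual code, and the validation losses below are made up for illustration rather than taken from any run.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a list of validation losses."""
    best = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            # No improvement: stop once `patience` such epochs have accumulated.
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses that never improve after the first epoch.
made_up_losses = [0.33, 0.35, 0.40, 0.46, 0.52, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(made_up_losses, patience=4)
print(stop_epoch, best_epoch)
# 5 1
```

With `restore_best_weights=True`, the model would then be rolled back to the weights from `best_epoch` rather than keeping those from the final epoch.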

### Print model summary (2 Marks)

In [54]:
#### Add your code here ####
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 100)          1000100   
_________________________________________________________________
lstm (LSTM)                  (None, 300, 32)           17024     
_________________________________________________________________
time_distributed (TimeDistri (None, 300, 100)          3300      
_________________________________________________________________
flatten (Flatten)            (None, 30000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 200)               6000200   
_________________________________________________________________
dropout (Dropout)            (None, 200)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 201       
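
The parameter counts in the summary can be reproduced by hand. The sketch below recomputes them from the layer sizes used in the model definition, assuming the standard formulas for Embedding, LSTM, and Dense layers; the variable names are chosen for this example only.

```python
vocab_rows, embed_dim, seq_len = 10001, 100, 300   # input_dim=10000+1, output_dim, input_length
lstm_units, td_units, dense_units = 32, 100, 200

counts = {
    # One embed_dim-sized vector per vocabulary row.
    "embedding": vocab_rows * embed_dim,
    # 4 gates, each with input weights, recurrent weights and a bias.
    "lstm": 4 * (embed_dim + lstm_units + 1) * lstm_units,
    # One Dense(100) layer shared across all time steps.
    "time_distributed": lstm_units * td_units + td_units,
    # Flatten yields seq_len * td_units = 30000 inputs to Dense(200).
    "dense_hidden": seq_len * td_units * dense_units + dense_units,
    # Final Dense(1): 200 weights plus one bias.
    "dense_out": dense_units * 1 + 1,
}

for name, n in counts.items():
    print(f"{name:>16}: {n:,}")
print(f"{'total':>16}: {sum(counts.values()):,}")
```

Most of the parameters sit in the first Dense layer after `Flatten`, because every one of the 30,000 flattened values connects to each of the 200 units.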

### Fit the model (2 Marks)

In [55]:
#### Add your code here ####
model.fit(X_train_pad,y_train, validation_data=(X_test_pad,y_test),
          epochs=100,
          batch_size=64,
          callbacks = [early_stop])

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100


<tensorflow.python.keras.callbacks.History at 0x7f4d3baa45d0>

### Evaluate model (2 Marks)

In [56]:
#### Add your code here ####
model.evaluate(X_test_pad,y_test)



[0.3256191313266754, 0.8615999817848206]

### Predict on one sample (2 Marks)

In [60]:
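#### Add your code here ####
# Caution: the indices here are not shifted by 3 as they were for X_train[0], so the words below are not the real review.
# Use reverse_index.get(i - 3, "#") to recover the actual text.
" ".join(reverse_index[i] for i in X_test[10])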

"the thats who dangerously checked are is manner positive br seen budget and confusing this women's but often davies br of seen great establishing she were care double and or of manner it so i've to time trained must long are of less if is successful in allow and for all give cold to jungle this as track must americans i i robin and it of killer like and of complete white to if scope such would role 4 of semi br allow and film about accents contain to jersey sub made things but think track and worthy favour it of violence you've war there forth be ending less to it years never and movie is him actually crappy which split are of ealing ever in this and all real at needless marriage and role of less br and confusing in davies thinking br while and natured i i br screen dvd to last well ago no violence whale and games be flying and film and front but don't of little noted to if spend an of too i'm lou brings br older to chaotic camera of too movie much movie is very wait br brought i i bu

In [61]:
X_test_pad.shape

(25000, 300)

In [62]:
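# predict_classes was removed in TensorFlow 2.6; thresholding the sigmoid output gives the same classes
predicted = (model.predict(X_test_pad) > 0.5).astype("int32")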



In [63]:
print('Predicted : ',predicted[10])
print('Actual : ',y_test[10])

Predicted :  [1]
Actual :  1
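
The predicted class printed above comes from a sigmoid probability that has been cut at 0.5. The sketch below shows that step on hypothetical probabilities (not outputs of the trained model) and maps the resulting classes to readable labels.

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (n_samples, 1), as a Dense(1) head would return.
probs = np.array([[0.91], [0.08], [0.50], [0.73]])

# Strictly greater than 0.5 counts as positive (1); everything else is negative (0).
predicted = (probs > 0.5).astype("int32")
labels = np.where(predicted.ravel() == 1, "Positive", "Negative")

print(predicted.tolist())
print(labels.tolist())
# [[1], [0], [0], [1]]
# ['Positive', 'Negative', 'Negative', 'Positive']
```

Each prediction keeps a trailing dimension of size 1, which is why a single predicted value prints as `[1]` rather than `1`.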
