# Sentiment Classification

### Objective
To classify user sentiments as positive or negative using Natural Language Processing techniques.

We generate word embeddings and retrieve the output of each layer with Keras
for this classification task. Word embeddings are a type of word representation that allows words with similar meanings to have similar representations.
They are a distributed representation of text and are perhaps one of the key
breakthroughs behind the impressive performance of deep learning methods on
challenging natural language processing problems.
We will use the IMDb dataset and learn the word embeddings as we train our model.
This dataset contains 50,000 movie reviews from IMDb, labeled by sentiment
(positive or negative).
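
To make "similar meanings, similar representations" concrete, here is a minimal sketch that compares toy embedding vectors with cosine similarity. The three-dimensional vectors and the chosen words are invented for illustration; real embeddings are learned during training and have many more dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only
toy_embeddings = {
    'good':  [0.9, 0.1, 0.3],
    'great': [0.8, 0.2, 0.4],
    'awful': [-0.7, 0.1, -0.5],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_close = cosine_similarity(toy_embeddings['good'], toy_embeddings['great'])
sim_far = cosine_similarity(toy_embeddings['good'], toy_embeddings['awful'])
print(f'good vs great: {sim_close:.3f}')
print(f'good vs awful: {sim_far:.3f}')
```

In this toy setup the two positive words point in nearly the same direction, while the negative word points away from them.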

### Dataset
- Dataset of 50,000 movie reviews from IMDB, labeled by sentiment positive (1) or negative (0)
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
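- As a convention, "0" does not stand for a specific word. With the default `imdb.load_data()` settings, index 0 is reserved for padding, 1 marks the start of a review, and 2 encodes any unknown (out-of-vocabulary) word.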

Command to import data
- `from tensorflow.keras.datasets import imdb`
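
The frequency-based filtering described above is exposed through the `num_words` and `skip_top` parameters of `imdb.load_data()`. The sketch below mimics that filtering on a made-up review so it can run without downloading the dataset; `filter_by_rank` and `toy_review` are hypothetical names used only for illustration.

```python
def filter_by_rank(sequence, num_words=None, skip_top=0, oov_char=2):
    """Replace indexes outside [skip_top, num_words) with oov_char."""
    filtered = []
    for index in sequence:
        too_rare = num_words is not None and index >= num_words
        too_common = index < skip_top
        filtered.append(oov_char if (too_rare or too_common) else index)
    return filtered

# Made-up encoded review: 12500 is too rare, while 1, 14 and 7 fall in the skipped top 20
toy_review = [1, 14, 22, 12500, 43, 7, 530]
filtered_review = filter_by_rank(toy_review, num_words=10000, skip_top=20)
print(filtered_review)
```

Every index that falls outside the kept range collapses to the same out-of-vocabulary marker, so the sequence length is unchanged.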

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Change the current working directory
import os
os.chdir("/content/drive/My Drive/Colab Notebooks")

In [None]:
# Import packages
import pandas as pd, numpy as np
import tensorflow as tf

from itertools import islice

from keras.layers import Dense, Embedding, LSTM, Dropout, MaxPooling1D, Conv1D
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model, Sequential
from keras.preprocessing import sequence
from keras.datasets import imdb
from keras.callbacks import ModelCheckpoint, EarlyStopping

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

random_state = 1
np.random.seed(random_state)
tf.random.set_seed(random_state)

### Import the data 
- Use `imdb.load_data()` method
- Get train and test set
- Take 10000 most frequent words

In [None]:
vocab_size = 10000
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words = vocab_size)

### Pad each sentence to be of same length 
- Maximum sequence length as 300

In [None]:
max_len = 300
x_train = pad_sequences(x_train, maxlen = max_len, padding = 'pre')
x_test =  pad_sequences(x_test, maxlen = max_len, padding = 'pre')

X = np.concatenate((x_train, x_test), axis = 0)
y = np.concatenate((y_train, y_test), axis = 0)
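
What `pad_sequences` does with `padding = 'pre'` can be sketched in plain Python. This is a simplified stand-in, assuming the library's default of also truncating from the front when a review is longer than `maxlen`; `pad_pre` is a hypothetical helper, not part of Keras.

```python
def pad_pre(sequence, max_len, value=0):
    """Left-pad with `value`, or keep only the last `max_len` items."""
    if len(sequence) >= max_len:
        # Truncate from the front so the end of the review is preserved
        return sequence[len(sequence) - max_len:]
    return [value] * (max_len - len(sequence)) + sequence

short_review = [1, 11, 4, 454]
long_review = [1, 5, 9, 13, 17, 21, 25]
padded_short = pad_pre(short_review, max_len=6)
padded_long = pad_pre(long_review, max_len=6)
print(padded_short)
print(padded_long)
```

Padding at the front keeps the actual words next to the end of the sequence, which is the part a recurrent layer sees last.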

### Print shape of features & labels 

Number of reviews, number of words in each review

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = random_state, shuffle = True)
x_train, x_valid, y_train, y_valid = train_test_split(x_train, y_train, test_size = 0.2, random_state = random_state, shuffle = True)

print('---'*20, f'\nNumber of reviews in training dataset: {x_train.shape[0]}')
print(f'Number of words in training dataset: {x_train.shape[1]}')
print(f'Number of unique words in training dataset: {len(np.unique(np.hstack(x_train)))}')


print('---'*20, f'\nNumber of reviews in validation dataset: {x_valid.shape[0]}')
print(f'Number of words in validation dataset: {x_valid.shape[1]}')
print(f'Number of unique words in validation dataset: {len(np.unique(np.hstack(x_valid)))}')


print('---'*20, f'\nNumber of reviews in test dataset: {x_test.shape[0]}')
print(f'Number of words in test dataset: {x_test.shape[1]}')
print(f'Number of unique words in test dataset: {len(np.unique(np.hstack(x_test)))}')


print('---'*20, f'\nUnique Categories: {np.unique(y_train), np.unique(y_valid), np.unique(y_test)}')

------------------------------------------------------------ 
Number of reviews in training dataset: 32000
Number of words in training dataset: 300
Number of unique words in training dataset: 9999
------------------------------------------------------------ 
Number of reviews in validation dataset: 8000
Number of words in validation dataset: 300
Number of unique words in validation dataset: 9991
------------------------------------------------------------ 
Number of reviews in test dataset: 10000
Number of words in test dataset: 300
Number of unique words in test dataset: 9993
------------------------------------------------------------ 
Unique Categories: (array([0, 1]), array([0, 1]), array([0, 1]))
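
The split sizes reported above follow from applying a 20% hold-out twice. The sketch below reproduces the arithmetic, assuming the test portion is rounded up as scikit-learn does for a fractional `test_size`; `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

n_total = 50000
# First split: hold out 20% of all reviews as the test set
n_rest, n_test = split_sizes(n_total, 0.2)
# Second split: hold out 20% of the remainder as the validation set
n_train, n_valid = split_sizes(n_rest, 0.2)

print(n_train, n_valid, n_test)
print(f'fractions: {n_train / n_total:.2f} / {n_valid / n_total:.2f} / {n_test / n_total:.2f}')
```

The effective proportions are therefore 64% training, 16% validation, and 20% test.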


Number of labels

### Print value of any one feature and its label 

Feature value

In [None]:
print(x_train[5])

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    1   11
    4  454   19    4 4690 2941 1622    5 2639  481   13  197    7    4
  636  136   11  280  725    6   58   11    4 1263   54   12  214    8
    4  543   95   12  778    8  764   40 3904  537   42    2 8014   60
  869    2   10   10   95   12  505   46   15   14    9   53  729    8
  914 

Label value

In [None]:
print(y_train[5])

1


### Decode the feature value to get original sentence 

Retrieve a dictionary that contains mapping of words to their index in the IMDB dataset

Use the dictionary to get the original words from the encodings, for a particular sentence

Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [None]:
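def decode_review(x, y):
  # Word -> index mapping shipped with the dataset (frequency ranks)
  w2i = imdb.get_word_index()
  # load_data() shifts every index by 3 to leave room for the special tokens
  w2i = {k:(v + 3) for k, v in w2i.items()}
  w2i['<PAD>'] = 0
  w2i['<START>'] = 1
  w2i['<UNK>'] = 2
  # Invert the mapping: index -> word
  i2w = {i: w for w, i in w2i.items()}

  ws = ' '.join(i2w[i] for i in x)
  print(f'Review: {ws}')
  print(f'Actual Sentiment: {y}')
  return w2i, i2w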

w2i, i2w = decode_review(x_train[0], y_train[0])

# get first 50 key, value pairs from id to word dictionary
print('---'*30, '\n', list(islice(i2w.items(), 0, 50)))

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
Review: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <
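
The index offset used in the decoding function can be seen more easily on a miniature vocabulary. The word index below is invented for illustration (the real one is downloaded by `imdb.get_word_index()`), and `decode` is a hypothetical helper.

```python
# Hypothetical miniature word index (word -> frequency rank, starting at 1)
toy_word_index = {'the': 1, 'movie': 2, 'was': 3, 'great': 4}

INDEX_FROM = 3  # offset that leaves room for the special tokens 0, 1 and 2
index_to_word = {rank + INDEX_FROM: word for word, rank in toy_word_index.items()}
index_to_word.update({0: '<PAD>', 1: '<START>', 2: '<UNK>'})

def decode(encoded):
    """Map each integer back to its token; unseen indexes become <UNK>."""
    return ' '.join(index_to_word.get(i, '<UNK>') for i in encoded)

encoded_review = [0, 0, 1, 4, 5, 6, 2, 7]
decoded_review = decode(encoded_review)
print(decoded_review)
```

Because of the offset, the most frequent word is stored as 4 rather than 1 in the encoded reviews.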

### Define model
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - The `tensorflow.keras` Embedding layer doesn't require us to one-hot encode our words; instead, each word needs a unique integer id. For the IMDb dataset we've loaded this has already been done; otherwise we could use sklearn's `LabelEncoder`.
  - Size of the vocabulary will be 10000
  - Dimension of the dense embedding is 256
  - Length of input sequences should be 300
- Add a Dropout layer with rate 0.25
- Add three Conv1D layers (256, 128 and 64 filters, kernel size 5, ReLU), with a MaxPooling1D layer (pool size 2) after the second and third
- Add LSTM layer with 75 units
  - `return_sequences` is left at its default (False), so only the last output is passed on
- Add Dense output layer with 1 sigmoid neuron

In [None]:
# Creating Model
model = Sequential()
model.add(Embedding(vocab_size, 256, input_length = max_len))
model.add(Dropout(0.25))
model.add(Conv1D(256, 5, padding = 'same', activation = 'relu', strides = 1))
model.add(Conv1D(128, 5, padding = 'same', activation = 'relu', strides = 1))
model.add(MaxPooling1D(pool_size = 2))
model.add(Conv1D(64, 5, padding = 'same', activation = 'relu', strides = 1))
model.add(MaxPooling1D(pool_size = 2))
model.add(LSTM(75))
model.add(Dense(1, activation = 'sigmoid'))



### Compile the model 
- Use Optimizer as Adam
- Use Binary Crossentropy as loss
- Use Accuracy as metrics

In [None]:
model.compile(optimizer = 'adam',loss = 'binary_crossentropy', metrics = ['accuracy'])
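
For reference, the binary cross-entropy loss selected above can be written out by hand. This is a minimal pure-Python sketch of the standard formula, not the Keras implementation; the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Made-up predictions: the same two reviews, scored well and scored badly
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(f'confident and right: {confident_right:.4f}')
print(f'confident and wrong: {confident_wrong:.4f}')
```

The loss grows sharply when the model assigns high probability to the wrong class, which is why it is a natural fit for a sigmoid output.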

### Print model summary 

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 256)          2560000   
_________________________________________________________________
dropout (Dropout)            (None, 300, 256)          0         
_________________________________________________________________
conv1d (Conv1D)              (None, 300, 256)          327936    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 300, 128)          163968    
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 150, 128)          0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 150, 64)           41024     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 75, 64)            0
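
The parameter counts in the summary can be checked by hand with the usual per-layer formulas. The sketch below recomputes them from the layer settings in the model definition. The LSTM and Dense counts are not visible in the truncated summary, so those two values come from the standard formulas rather than from the notebook output.

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim

def conv1d_params(kernel_size, in_channels, filters):
    # One (kernel_size x in_channels) kernel plus one bias per filter
    return (kernel_size * in_channels + 1) * filters

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    return (input_dim + 1) * units

param_counts = {
    'embedding': embedding_params(10000, 256),
    'conv1d': conv1d_params(5, 256, 256),
    'conv1d_1': conv1d_params(5, 256, 128),
    'conv1d_2': conv1d_params(5, 128, 64),
    'lstm': lstm_params(64, 75),      # computed, not shown in the summary
    'dense': dense_params(75, 1),     # computed, not shown in the summary
}

# Sequence length: 'same' padding keeps it, each pooling layer halves it
seq_len_after_pooling = 300 // 2 // 2

for name, count in param_counts.items():
    print(f'{name:10s} {count:>10,d}')
print('sequence length entering the LSTM:', seq_len_after_pooling)
```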

In [None]:
# Adding callbacks
es = EarlyStopping(monitor = 'val_loss', mode = 'min', verbose = 1, patience = 0)  
mc = ModelCheckpoint('imdb_model.h5', monitor = 'val_loss', mode = 'min', save_best_only = True, verbose = 1)
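
The effect of `patience = 0` can be illustrated with a simplified patience rule. The sketch below is not the Keras implementation and may differ from it in edge cases; `stopping_epoch` is a hypothetical helper and the loss values are made up.

```python
def stopping_epoch(val_losses, patience=0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

made_up_losses = [0.30, 0.25, 0.27, 0.20]
stop_strict = stopping_epoch(made_up_losses, patience=0)
stop_lenient = stopping_epoch([0.30, 0.25, 0.27, 0.26, 0.28], patience=2)
print(stop_strict, stop_lenient)
```

With zero patience, training halts at the first epoch whose validation loss fails to improve, even if a later epoch would have been better.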

### Fit the model 

In [None]:
# Fitting the model
model.fit(x_train, y_train, validation_data = (x_valid, y_valid), epochs = 3, batch_size = 64, verbose = True, callbacks = [es, mc])



Epoch 1/3
Epoch 00001: val_loss did not improve from 0.24832
Epoch 2/3
Epoch 00002: val_loss did not improve from 0.24832
Epoch 00002: early stopping


<tensorflow.python.keras.callbacks.History at 0x7f988fb5ca90>

### Evaluate model 

In [None]:
# Evaluating the model
scores = model.evaluate(x_test, y_test, batch_size = 64)
print('Test accuracy: %.2f%%' % (scores[1]*100))

Test accuracy: 89.62%


### Predict on the test set 

In [None]:
# predict_classes is deprecated; threshold the sigmoid output at 0.5 instead
y_pred = (model.predict(x_test) > 0.5).astype('int32')
# Note: classification_report expects (y_true, y_pred); with the arguments in this
# order, the precision and recall columns are swapped relative to that convention
print(f'Classification Report:\n{classification_report(y_pred, y_test)}')

Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.91      0.90      4896
           1       0.91      0.88      0.90      5104

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000
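
The columns of the report can be reproduced by hand on a small example. The sketch below uses made-up labels and a hypothetical helper, `precision_recall_f1`. It also shows that scikit-learn's `classification_report` takes the true labels first and the predictions second, so passing them in the other order swaps precision and recall.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for eight reviews
toy_true = [1, 1, 1, 0, 0, 1, 0, 0]
toy_pred = [1, 0, 1, 0, 1, 1, 1, 0]

precision, recall, f1 = precision_recall_f1(toy_true, toy_pred)
swapped_precision, swapped_recall, swapped_f1 = precision_recall_f1(toy_pred, toy_true)
print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
print(f'swapped arguments: precision={swapped_precision:.2f} recall={swapped_recall:.2f}')
```

The F1-score is symmetric in precision and recall, so it is unaffected by the argument order.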



In [None]:
# Retrieving the output of each layer in keras for a given single test sample from the trained model

sample_x_test = x_test[np.random.randint(10000)]

for layer in model.layers:
    model_layer = Model(inputs = model.input, outputs = layer.output)
    output = model_layer.predict(sample_x_test.reshape(1,-1))
    print('\n','--'*20, layer.name, 'layer', '--'*20, '\n')
    print(output)


 ---------------------------------------- embedding layer ---------------------------------------- 

[[[-0.0427901  -0.01804712 -0.02806675 ...  0.00624817  0.06408209
   -0.02403734]
  [-0.07048995  0.0796249  -0.01308915 ...  0.0400336   0.05493726
   -0.06124282]
  [ 0.04054682  0.00713296  0.00692001 ...  0.04047054  0.03309698
   -0.02720738]
  ...
  [ 0.03292023 -0.0405538   0.01924137 ... -0.07656936  0.01597329
    0.09222866]
  [-0.06663334 -0.01331978 -0.00673268 ... -0.02415959 -0.05999588
   -0.03977136]
  [-0.05858055 -0.03403356 -0.02258617 ...  0.00834998  0.04716273
   -0.0433165 ]]]

 ---------------------------------------- dropout layer ---------------------------------------- 

[[[-0.0427901  -0.01804712 -0.02806675 ...  0.00624817  0.06408209
   -0.02403734]
  [-0.07048995  0.0796249  -0.01308915 ...  0.0400336   0.05493726
   -0.06124282]
  [ 0.04054682  0.00713296  0.00692001 ...  0.04047054  0.03309698
   -0.02720738]
  ...
  [ 0.03292023 -0.0405538   0.0192413

In [None]:
decode_review(x_test[10], y_test[10])
print(f'Predicted sentiment: {y_pred[10][0]}')

Review: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <P

Conclusion: On sentiment classification of the IMDb dataset, we achieved
- Test accuracy of about 90% (89.62%)
- F1-score of 0.90
- Best validation loss of 0.248 (binary cross-entropy)