![alt text](https://drive.google.com/uc?export=view&id=1UXScsVx_Wni_JuDdB8LeTnM6jsPfIwkW)

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

# Sentiment Classification

### Dataset
- Dataset of 50,000 movie reviews from IMDB, labeled by sentiment positive (1) or negative (0)
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
- As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word.

Command to import data
- `from tensorflow.keras.datasets import imdb`

### Import the data (2 Marks)
- Use `imdb.load_data()` method
- Get train and test set
- Take 10000 most frequent words

In [18]:
from tensorflow.keras.datasets import imdb
(training_data, training_targets), (testing_data, testing_targets) = imdb.load_data(num_words=10000)

  x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])
  x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])


In [48]:
# Because the load_Data function splits into 50/50 , we will concatenate and then later split into 80/20
import numpy as np
data = np.concatenate((training_data, testing_data), axis=0)
targets = np.concatenate((training_targets, testing_targets), axis=0)

X_train = data[10000:]
y_train = targets[10000:]
X_test = data[:10000]
y_test = targets[:10000]

### Pad each sentence to be of same length (2 Marks)
- Take maximum sequence length as 300

In [49]:
from keras.preprocessing import sequence
max_words = 300
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

### Print shape of features & labels (2 Marks)

Number of review, number of words in each review

In [54]:
print('Number of review in train',X_train.shape[0])

Number of review in train 40000


In [55]:
print('Number of words in train',X_train.shape[1])

Number of words in train 300


Number of labels

In [57]:
print('Number of labels in train',y_train.shape[0])

Number of labels in train 40000


### Print value of any one feature and it's label (2 Marks)

Feature value

In [58]:
X_train[10]

array([ 463,   10,   10,    5,   13,  203,   28, 1049,  142,   21,  286,
       1165, 4386, 5638,  342, 1408,    8,   30,    6,  500,   15,   69,
        115,  256,  159,   73,   48,   15,    9,    4,  420,    5,   13,
        122,   24,  717,  233,   89,  122, 1802,    2,  109,    5,   60,
          4,  250,  124,   38,   76,   44,    4,  500,  460, 1106,  662,
        183,   15,   49,   84,   92,  124,   44,   34,   68,  333,   42,
        840,  297,  143,   10,   10,  724,    4, 7140,    5,    2, 4013,
         50,    9,    4,  831,  364,  489,    7,    4,   22,  410,  164,
        133,    9, 1252,   55,  906,    4,  554,  286,   60,   15,   52,
         33,  395,  374, 1628,   11,    4,  929,   36, 1177,    6,  176,
          7,  362,   13,  697,   96,  145,   11,  148,  504,   71, 8703,
         53, 2574,   23,  350,    7,   32,   14,    4,  116,    5,  769,
         26,   43, 1501,   33,   68, 5655,  757,    4,  105,   26, 1904,
          5,  340, 2502,    4,    2,    4, 5494,  5

Label value

In [59]:
training_targets[10]

1

### Decode the feature value to get original sentence (2 Marks)

First, retrieve a dictionary that contains mapping of words to their index in the IMDB dataset

In [60]:
word_index = imdb.get_word_index()
index_word = {v : k for k,v in word_index.items()}

Now use the dictionary to get the original words from the encodings, for a particular sentence

In [61]:
index = 10
print(" ".join([index_word[idx] for idx in X_train[index]]))

the clear fact entertaining there life back br is safely show of performance stars br actors film him many should movie reasons to and reading and are in of scenes and and of safely out compared not boss yes to sentiment show its disappointed fact raw to it justice by br of where clear fact many your way and with city nice are is along wrong not as it way she but this anything up haven't been by who of choices br of you to as this i'd it and who of shot you'll to love for updated of you it is sequels of little quest are seen watched front chemistry to simply alive of chris being it is say easy and cry in chemistry but voodoo all it maybe this is wing film job live of objects relief and level names and dunne to be stops serial 1948 watch is men go this of wing american from russo moving is accepted put this of jerry for places so work and watch and lot br that from sometimes wondered make department introduced to wondered from action at turns in low that in gay i'm of chemistry bible i 

Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [62]:
print("positve" if y_train[index]==1 else "negetive")

positve


### Define model (10 Marks)
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - `tensorflow.keras` embedding layer doesn't require us to onehot encode our words, instead we have to give each word a unique integer number as an id. For the imdb dataset we've loaded this has already been done, but if this wasn't the case we could use sklearn LabelEncoder.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
- Add LSTM layer
  - Pass value in `return_sequences` as True
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer

In [63]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
from keras.layers import LSTM
from keras.layers import TimeDistributed
from keras import layers

In [64]:
model = Sequential()
model.add(Embedding(10000, 100, input_length=max_words))
model.add(LSTM(300,, dropout=0.2, recurrent_dropout=0.2,return_sequences=True))
model.add(TimeDistributed(Dense(100)))
model.add(Flatten())
model.add(layers.Dropout(0.3, noise_shape=None, seed=None))
model.add(layers.Dense(50, activation = "relu"))
model.add(layers.Dropout(0.2, noise_shape=None, seed=None))
model.add(layers.Dense(50, activation = "relu"))
model.add(Dense(1,activation='softmax'))

### Compile the model (2 Marks)
- Use Optimizer as Adam
- Use Binary Crossentropy as loss
- Use Accuracy as metrics

In [65]:
model.compile(
 optimizer = "adam",
 loss = "binary_crossentropy",
 metrics = ["accuracy"]
)

### Print model summary (2 Marks)

In [66]:
model.summary()

Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 300, 100)          1000000   
_________________________________________________________________
lstm_9 (LSTM)                (None, 300, 300)          481200    
_________________________________________________________________
time_distributed_7 (TimeDist (None, 300, 100)          30100     
_________________________________________________________________
flatten_7 (Flatten)          (None, 30000)             0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 30000)             0         
_________________________________________________________________
dense_15 (Dense)             (None, 50)                1500050   
_________________________________________________________________
dropout_3 (Dropout)          (None, 50)              

### Fit the model (2 Marks)

In [67]:
results = model.fit(
 X_train, y_train,
 epochs= 10,
 batch_size = 500
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### Evaluate model (2 Marks)

In [69]:
score,acc = model.evaluate(X_test, y_test, verbose = 2, batch_size = 500)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))

20/20 - 2s - loss: 0.7425 - accuracy: 0.5053
score: 0.74
acc: 0.51


### Predict on one sample (2 Marks)

In [77]:
index = 10
print(" ".join([index_word[idx] for idx in X_test[index]]))

is men go this of wing american from russo moving is accepted put this of jerry for places so work and watch and lot br that from sometimes wondered make department introduced to wondered from action at turns in low that in gay i'm of chemistry bible i i simply alive it is time done inspector to watching look world named for more tells up many fans are that movie music her get grasp but seems in people film that if explain in why for and find of where br if and movie throughout if and of you best look red startling to recently in successfully much unfortunately going dan and stuck is him sequences but of you of enough for its br that beautiful put reasons of chris chemistry wing and for of you red time trivia to as companion payoff of chris less br of subplots torture in low alive in gay some br of wing if time actual in also side any if name takes for of friendship it of 10 for had and great to as you students for movie of going and for bad well best had at woman br musical when it ca

In [78]:
print("positve" if y_test[10]==1 else "negetive")

positve


In [79]:
result = model.predict(X_test[10].reshape(1,X_test.shape[1]),batch_size=1,verbose = 2)[0]

1/1 - 0s


In [80]:
result

array([1.], dtype=float32)