# Sentiment Classification


### Generate Word Embeddings and retrieve outputs of each layer with Keras based on Classification task

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

They are a distributed representation for text, and are perhaps one of the key breakthroughs behind the impressive performance of deep learning methods on challenging natural language processing problems.

We will use the IMDB dataset to learn word embeddings as we train our model. This dataset contains 25,000 movie reviews from IMDB, labeled with sentiment (positive or negative).



### Dataset

`from keras.datasets import imdb`

Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, meaning the word that has index 1 is the most frequent word. The task statement suggests using the first 20 words from each review to speed up training, with a max vocab size of 10,000; the code in this notebook keeps 300 words per review instead (`maxlen = 300`).

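As a convention, "0" does not stand for a specific word. Note that with the default arguments of `imdb.load_data` (`start_char=1`, `oov_char=2`, `index_from=3`), the loaded sequences mark the start of a review with 1 and any out-of-vocabulary word with 2, while 0 is left free for padding.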
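To make the frequency-based encoding concrete, here is a minimal sketch on a toy corpus. The three reviews, the `encode` helper, and the choice of 0 as the unknown-word id are assumptions made for illustration; this is not how Keras builds the IMDB index internally.

```python
from collections import Counter

# Toy corpus standing in for the IMDB reviews (hypothetical data).
reviews = [
    "the movie was great",
    "the movie was bad",
    "the plot was great fun",
]

# Count every token, then rank words by frequency: rank 1 = most frequent.
counts = Counter(word for review in reviews for word in review.split())
ranked = sorted(counts, key=lambda w: (-counts[w], w))  # ties broken alphabetically
word_to_id = {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(review, num_words=None, oov_id=0):
    """Map words to frequency ranks; ranks >= num_words become oov_id."""
    ids = []
    for word in review.split():
        idx = word_to_id.get(word, oov_id)
        if num_words is not None and idx >= num_words:
            idx = oov_id
        ids.append(idx)
    return ids

print(word_to_id)
print(encode("the movie was fun"))               # full vocabulary
print(encode("the movie was fun", num_words=4))  # only ranks 1 to 3 survive
```

Limiting `num_words` trades detail for a smaller vocabulary: rare words collapse into a single id.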


### Aim

1. Import the test and train data.
2. Import the labels (train and test).
3. Get the word index and then create key-value pairs for word and word_id. (12.5 points)
4. Build a Sequential model using Keras for the sentiment classification task. (10 points)
5. Report the accuracy of the model. (5 points)
6. Retrieve the output of each layer in Keras for a given single test sample from the trained model you built. (2.5 points)
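Step 3 can be sketched as follows. The miniature `word_index` is a hypothetical stand-in with the same shape as the dictionary returned by `imdb.get_word_index()` (word to frequency rank); the `SPECIAL` labels and the `decode` helper are names invented for this example. The offset of 3 reflects the default `index_from=3` of `imdb.load_data`.

```python
# Hypothetical miniature word index in the same shape as imdb.get_word_index():
# a dict mapping each word to its frequency rank (1 = most frequent).
word_index = {"the": 1, "and": 2, "a": 3, "movie": 4, "great": 5}

# Key-value pairs in the other direction: word_id -> word.
id_to_word = {word_id: word for word, word_id in word_index.items()}

# load_data() shifts every rank by index_from (3 by default) and reserves
# 0 for padding, 1 for start-of-sequence, and 2 for out-of-vocabulary words.
INDEX_FROM = 3
SPECIAL = {0: "<pad>", 1: "<start>", 2: "<oov>"}

def decode(sequence):
    """Turn a list of encoded ids back into a readable string."""
    words = []
    for idx in sequence:
        if idx in SPECIAL:
            words.append(SPECIAL[idx])
        else:
            words.append(id_to_word.get(idx - INDEX_FROM, "<oov>"))
    return " ".join(words)

print(decode([0, 0, 1, 4, 7, 2, 8]))
```

With the real dataset, the same `decode` function could be applied to a row of `x_train` after replacing the toy dictionary with the full word index.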


#### Usage:

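NumPy 1.16.3 and later default to `allow_pickle=False` in `np.load`, which makes `imdb.load_data` fail in older Keras versions. One fix is to downgrade NumPy from 1.16.4 to 1.16.1, where `allow_pickle=True` is still the default (source: Google). The cells below take a different route: they temporarily wrap `np.load` so that it always passes `allow_pickle=True`, and restore the original function afterwards.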

In [0]:
import numpy as np
# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

In [0]:
from keras.datasets import imdb
vocab_size = 10000 #vocab size
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size) # vocab_size is no.of words to consider from the dataset, ordering based on frequency.

In [0]:
from keras.preprocessing.sequence import pad_sequences
vocab_size = 10000 #vocab size
maxlen = 300  #number of word used from each review

It is not clear to me why the dataset is loaded again below when the earlier cell already did the same thing. The second call is redundant: it simply reloads the same data before padding.

In [0]:
#load dataset as a list of ints
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
#make all sequences of the same length
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test =  pad_sequences(x_test, maxlen=maxlen)
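To see what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`, `value=0`), here is a pure-Python sketch of the same behaviour. The function name `pad_pre` is made up for this example, and it returns nested lists rather than the NumPy array Keras produces.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad or left-truncate each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                          # keep the last maxlen ids
        padded.append([value] * (maxlen - len(seq)) + seq)  # zeros go in front
    return padded

print(pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4))
# → [[0, 0, 5, 6], [3, 4, 5, 6]]
```

Note that these defaults keep the end of a long review rather than its beginning; passing `truncating='post'` to `pad_sequences` would keep the first `maxlen` words instead.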

In [0]:
# restore np.load for future normal usage
np.load = np_load_old
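The save, patch, and restore steps above could also be wrapped in a context manager, so the original function comes back even if loading fails. This is only a sketch: `patched_attr` is a hypothetical helper, and `fake_np` is a stand-in object used so the example runs without NumPy installed.

```python
from contextlib import contextmanager
from types import SimpleNamespace

@contextmanager
def patched_attr(obj, name, replacement):
    """Temporarily replace obj.<name>, restoring the original on exit."""
    original = getattr(obj, name)
    setattr(obj, name, replacement)
    try:
        yield
    finally:
        setattr(obj, name, original)  # runs even if the body raises

# Stand-in for the numpy module (assumption: only the `load` attribute matters).
fake_np = SimpleNamespace(load=lambda path, allow_pickle=False: (path, allow_pickle))

original_load = fake_np.load

def load_with_pickle(*args, **kwargs):
    # Same idea as the lambda in the notebook: force allow_pickle=True.
    return original_load(*args, allow_pickle=True, **kwargs)

with patched_attr(fake_np, "load", load_with_pickle):
    inside = fake_np.load("imdb.npz")   # patched call
outside = fake_np.load("imdb.npz")      # original behaviour is back

print(inside, outside)
```

With real NumPy, the same pattern would be `with patched_attr(np, "load", ...):` around the `imdb.load_data` call.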

In [0]:
x_train.shape

(25000, 300)

In [0]:
import collections
counter = collections.Counter(y_train)
print(counter)

Counter({1: 12500, 0: 12500})


In [0]:
y_train.shape

(25000,)

So `x_train` has 25,000 rows, and each row holds 300 word ids (columns) after padding.
The training set is balanced: it contains 12,500 positive and 12,500 negative reviews.
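Because the classes are balanced, a classifier that always predicts the same label would score 50%. The short calculation below derives that baseline from the label counts printed above; it is a reference point for judging the model's accuracy, not part of the original notebook.

```python
from collections import Counter

# Label counts reported above: 12,500 positive (1) and 12,500 negative (0).
counter = Counter({1: 12500, 0: 12500})

total = sum(counter.values())
majority_label, majority_count = counter.most_common(1)[0]
baseline_accuracy = majority_count / total   # accuracy of always guessing one class

print(total, baseline_accuracy)
# → 25000 0.5
```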

## Build Keras Embedding Layer Model
We can think of the Embedding layer as a dictionary that maps the index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model. 
* Also we could load pre-trained word embeddings into the embedding layer when we create our model.
* Use the embedding layer to train our own word2vec models.

The Keras embedding layer doesn't require us to one-hot encode our words; instead we have to give each word a unique integer id. For the IMDB dataset we've loaded this has already been done, but if this wasn't the case we could use sklearn [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
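The "dictionary" view of an embedding layer can be sketched in a few lines of plain Python. The sizes and random values here are hypothetical, and the table is fixed rather than learned, but the lookup itself works the same way: an integer id selects a row, which gives the same result as multiplying a one-hot vector by the table.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 6, 4   # tiny, hypothetical sizes for readability

# One row of embedding_dim floats per word id; Keras would learn these values.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(sequence):
    """Look up the vector for every word id in the sequence."""
    return [embedding_table[word_id] for word_id in sequence]

def one_hot_lookup(word_id):
    """Same result via an explicit one-hot vector times the table."""
    one_hot = [1.0 if i == word_id else 0.0 for i in range(vocab_size)]
    return [
        sum(one_hot[i] * embedding_table[i][j] for i in range(vocab_size))
        for j in range(embedding_dim)
    ]

vectors = embed([3, 1, 3])
print(len(vectors), len(vectors[0]))   # 3 word ids, 4 numbers each
print(vectors[0] == vectors[2])        # the same id always gives the same vector
```

The direct lookup avoids building a `vocab_size`-long vector for every word, which is why integer ids are enough as input.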

In [0]:
# Embedding is imported from keras.layers, which works in old and new Keras releases
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding

In [0]:
model = Sequential()
model.add(Embedding(vocab_size, 32, input_length=maxlen))
model.add(Flatten())
model.add(Dense(300, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
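It helps to count the trainable parameters of this architecture by hand; the totals below should match what `model.summary()` reports, although that output is not part of the notebook. The variable names are chosen for this example.

```python
vocab_size, embedding_dim, maxlen, hidden_units = 10000, 32, 300, 300

embedding_params = vocab_size * embedding_dim               # one vector per word id
flatten_size = maxlen * embedding_dim                       # Flatten has no weights
dense_params = flatten_size * hidden_units + hidden_units   # weights + biases
output_params = hidden_units * 1 + 1                        # single sigmoid unit

total = embedding_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total)
# → 320000 2880300 301 3200601
```

With about 3.2 million parameters and only 25,000 training reviews, the model has plenty of capacity to memorize the training set.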

In [0]:
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=20, batch_size=128, verbose=2)

Train on 25000 samples, validate on 25000 samples
Epoch 1/20
 - 2s - loss: 0.4544 - acc: 0.7620 - val_loss: 0.2972 - val_acc: 0.8709
Epoch 2/20
 - 1s - loss: 0.1359 - acc: 0.9521 - val_loss: 0.3360 - val_acc: 0.8632
Epoch 3/20
 - 1s - loss: 0.0237 - acc: 0.9948 - val_loss: 0.4483 - val_acc: 0.8618
Epoch 4/20
 - 1s - loss: 0.0027 - acc: 0.9999 - val_loss: 0.4967 - val_acc: 0.8642
Epoch 5/20
 - 1s - loss: 8.6831e-04 - acc: 1.0000 - val_loss: 0.5267 - val_acc: 0.8654
Epoch 6/20
 - 1s - loss: 4.7719e-04 - acc: 1.0000 - val_loss: 0.5491 - val_acc: 0.8658
Epoch 7/20
 - 1s - loss: 3.0908e-04 - acc: 1.0000 - val_loss: 0.5675 - val_acc: 0.8662
Epoch 8/20
 - 1s - loss: 2.1864e-04 - acc: 1.0000 - val_loss: 0.5833 - val_acc: 0.8664
Epoch 9/20
 - 1s - loss: 1.6182e-04 - acc: 1.0000 - val_loss: 0.5975 - val_acc: 0.8668
Epoch 10/20
 - 1s - loss: 1.2355e-04 - acc: 1.0000 - val_loss: 0.6105 - val_acc: 0.8670
Epoch 11/20
 - 1s - loss: 9.7151e-05 - acc: 1.0000 - val_loss: 0.6225 - val_acc: 0.8672
Epoch 1

<keras.callbacks.History at 0x7f309af345f8>

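The validation accuracy settles just under 87% (about 0.867), so the model classifies roughly 87% of the test reviews correctly as positive or negative.
Training accuracy reaches 100% by epoch 5 while the validation loss keeps rising (from 0.30 to 0.62 over the logged epochs), which indicates overfitting.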
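The log can be used to check when training should have stopped. The sketch below applies a simple patience rule to the `val_loss` values copied from the output above; `early_stop_epoch` is a hypothetical helper that only approximates what the Keras `EarlyStopping` callback does.

```python
# val_loss values copied from the training log above (epochs 1 to 11).
val_loss = [0.2972, 0.3360, 0.4483, 0.4967, 0.5267, 0.5491,
            0.5675, 0.5833, 0.5975, 0.6105, 0.6225]

def early_stop_epoch(losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-based, for a patience rule."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(losses)

print(early_stop_epoch(val_loss))
# → (1, 3)
```

On this run the validation loss is lowest after the very first epoch, so most of the 20 epochs only improve the fit to the training data.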

In [0]:
scores = model.evaluate(x_test, y_test, verbose=2)
print(scores)

[0.7054031683260202, 0.86736]
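The two numbers returned by `model.evaluate` are the loss and the accuracy, in that order. To show why the loss can keep growing while accuracy barely moves, here is a small sketch of both metrics on made-up predictions. The labels and probabilities are hypothetical, and the clipping constant `eps` is an assumption of this example.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)       # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum((p > threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.999]   # hypothetical sigmoid outputs

print(binary_crossentropy(y_true, y_prob), accuracy(y_true, y_prob))
```

The last prediction is wrong and very confident, so it dominates the loss even though it counts as just one error for accuracy. An overfit model tends to make exactly this kind of confident mistake.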


## Retrieve the output of each layer in Keras for a given single test sample from the trained model you built

Note: this step is not implemented in the notebook. I was not able to understand the question or its purpose.

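One way to read this task: feed a single padded review through the network and record what comes out of the Embedding, Flatten, and both Dense layers. In Keras a common pattern is to build a second `Model` whose outputs are `[layer.output for layer in model.layers]` and call `predict` on one sample. The sketch below shows the same idea in plain Python with tiny, randomly initialised weights, so every size, weight, and name here is an assumption and not the trained model.

```python
import math
import random

random.seed(1)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Tiny, hypothetical sizes; the real model uses 10000, 32, 300, 300.
vocab_size, embedding_dim, maxlen, hidden_units = 8, 2, 3, 4
embedding = rand_matrix(vocab_size, embedding_dim)
w1, b1 = rand_matrix(maxlen * embedding_dim, hidden_units), [0.0] * hidden_units
w2, b2 = rand_matrix(hidden_units, 1), [0.0]

def relu(v):
    return max(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dense(x, w, b, activation):
    """Fully connected layer: activation(x . w + b)."""
    z = [sum(x[i] * w[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]
    return [activation(v) for v in z]

def layer_outputs(sample):
    """Run one padded sample through the stack, keeping every layer's output."""
    outputs = {}
    outputs["embedding"] = [embedding[i] for i in sample]                  # maxlen x dim
    outputs["flatten"] = [v for row in outputs["embedding"] for v in row]  # maxlen * dim
    outputs["dense_relu"] = dense(outputs["flatten"], w1, b1, relu)
    outputs["dense_sigmoid"] = dense(outputs["dense_relu"], w2, b2, sigmoid)
    return outputs

for name, value in layer_outputs([0, 5, 2]).items():
    print(name, value)
```

The final value is the predicted probability that the review is positive; the earlier entries show how the word ids are turned into vectors and then into hidden features.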
In [0]:
# Experiment: try to improve the accuracy (if possible) by increasing the vocab size

In [0]:
vocab_size1 = 100000 #vocab size increased 10 times
#maxlen1 = 300 # sequence length left unchanged at 300 words (a 10x increase to 3000 was considered but not applied)

In [0]:
#load dataset as a list of ints
(xtrain, ytrain), (xtest, ytest) = imdb.load_data(num_words=vocab_size1)
#make all sequences of the same length
xtrain = pad_sequences(xtrain, maxlen=maxlen)
xtest =  pad_sequences(xtest, maxlen=maxlen)
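The effect of `num_words` can be illustrated on a single encoded review. This sketch mirrors the documented behaviour of `imdb.load_data`, where ids at or above `num_words` are replaced by `oov_char` (2 by default). The review itself and the `cap_vocabulary` helper are hypothetical.

```python
def cap_vocabulary(sequence, num_words, oov_char=2):
    """Replace every id >= num_words with oov_char, keeping the length."""
    return [idx if idx < num_words else oov_char for idx in sequence]

# Hypothetical encoded review containing a few rare (high-id) words.
review = [1, 14, 22, 9999, 10500, 43, 52000, 7]

small = cap_vocabulary(review, num_words=10000)
large = cap_vocabulary(review, num_words=100000)
print(small)   # rare words collapsed to the out-of-vocabulary id
print(large)   # every id is kept
```

A larger vocabulary keeps more rare words, but each of them is seen only a few times in training, so the extra embedding rows may not help generalization.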

In [0]:
model1 = Sequential()
model1.add(Embedding(vocab_size1, 32, input_length=maxlen))
model1.add(Flatten())
model1.add(Dense(300, activation='relu'))
model1.add(Dense(1, activation='sigmoid'))
model1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [0]:
model1.fit(xtrain, ytrain, validation_data=(xtest, ytest), epochs=100, batch_size=128, verbose=2)

Train on 25000 samples, validate on 25000 samples
Epoch 1/100
 - 2s - loss: 0.4442 - acc: 0.7660 - val_loss: 0.2969 - val_acc: 0.8721
Epoch 2/100
 - 2s - loss: 0.0986 - acc: 0.9662 - val_loss: 0.3418 - val_acc: 0.8650
Epoch 3/100
 - 2s - loss: 0.0106 - acc: 0.9986 - val_loss: 0.4057 - val_acc: 0.8664
Epoch 4/100
 - 2s - loss: 0.0013 - acc: 1.0000 - val_loss: 0.4378 - val_acc: 0.8672
Epoch 5/100
 - 2s - loss: 5.6906e-04 - acc: 1.0000 - val_loss: 0.4556 - val_acc: 0.8678
Epoch 6/100
 - 2s - loss: 3.4376e-04 - acc: 1.0000 - val_loss: 0.4713 - val_acc: 0.8688
Epoch 7/100
 - 2s - loss: 2.2961e-04 - acc: 1.0000 - val_loss: 0.4849 - val_acc: 0.8684
Epoch 8/100
 - 2s - loss: 1.6433e-04 - acc: 1.0000 - val_loss: 0.4969 - val_acc: 0.8690
Epoch 9/100
 - 2s - loss: 1.2304e-04 - acc: 1.0000 - val_loss: 0.5079 - val_acc: 0.8688
Epoch 10/100
 - 2s - loss: 9.4843e-05 - acc: 1.0000 - val_loss: 0.5177 - val_acc: 0.8691
Epoch 11/100
 - 2s - loss: 7.4819e-05 - acc: 1.0000 - val_loss: 0.5266 - val_acc: 0.8

<keras.callbacks.History at 0x7f306a995a20>