# Sentiment classifier
Tutorial from [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2019/11/comprehensive-guide-attention-mechanism-deep-learning/) by Prodip Hore and Sayan Chatterjee

## Dataset

UCI Machine Learning Repository: Sentiment Labelled Sentences Data Set
('From Group to Individual Labels using Deep Features', Kotzias et al., KDD 2015)

Sentences: 2000 (only the Amazon and Yelp files are used)

Labels: Positive (1) - Negative (0)


Example:

* "The mic is great." Positive ->  `The mic is great.	1`

* "What a waste of money and time!." Negative -> `What a waste of money and time!.	0`
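
Each record is a sentence and its label separated by a tab. A minimal sketch of parsing one record, assuming the label is always the last tab-separated field (`parse_line` is a hypothetical helper, not part of the tutorial):

```python
def parse_line(line):
    """Split one 'sentence<TAB>label' record into (sentence, label)."""
    # Split on the last tab only, so a tab inside the sentence would not break parsing
    sentence, label = line.rstrip("\n").rsplit("\t", 1)
    return sentence, int(label)

examples = ["The mic is great.\t1\n", "What a waste of money and time!.\t0\n"]
parsed = [parse_line(line) for line in examples]
print(parsed)
```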


## Architecture

Input layer -> Embedding layer -> LSTM -> Dense (sigmoid) -> Label

In [1]:
import numpy as np

# Read txt files
with open('data/amazon.txt', mode='r') as f:
    lines = f.readlines()
    
with open('data/yelp.txt', mode='r') as f:
    lines += f.readlines()

# Split lines so we have sentences and the class as an integer
sentences = [line.split('\t')[0] for line in lines]
labels = [int(line.split('\t')[1]) for line in lines]
labels = np.asarray(labels)
print(len(labels))

2000


In [2]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Tokenizer: an object with an internal lexicon and an optional unknown (OOV) token.
t = Tokenizer()
# Load the dataset in the tokenizer
t.fit_on_texts(sentences)

# Maps the words in the sentences to their indices in the lexicon (list of lists)
text_matrix = t.texts_to_sequences(sentences)

print('sentence: ' + sentences[0])

print('representation: ')
print(text_matrix[0])


# calculate max length of sentence in the corpus
max_length = 0

for i in range(len(text_matrix)):
    sent_length = len(text_matrix[i])
    if max_length < sent_length:
        max_length = sent_length
    
print('max length: %d' % max_length)

# Vocabulary size: number of words in the lexicon plus one, because word indices start at 1 and index 0 is reserved for padding
vocab_size = len(t.word_index) + 1

print('vocabulary size: %d'%vocab_size)

sentence: So there is no way for me to plug it in here in the US unless I go by a converter.
representation: 
[27, 58, 7, 55, 141, 12, 60, 6, 268, 5, 14, 45, 14, 1, 148, 448, 3, 59, 112, 4, 1427]
max length: 32
vocabulary size: 3259
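
To make the integer representation above less opaque, here is a simplified sketch of the idea behind the tokenizer: words are lowercased, punctuation is dropped, and the most frequent word gets index 1. This is an illustration in plain Python, not the Keras implementation; `split_words`, `build_word_index` and `to_sequence` are hypothetical names, and the punctuation handling is an assumption.

```python
import string
from collections import Counter

# Punctuation (except apostrophes) is replaced by spaces before splitting
_TABLE = str.maketrans({ch: " " for ch in string.punctuation if ch != "'"})

def split_words(text):
    return text.lower().translate(_TABLE).split()

def build_word_index(texts):
    """Map each word to an index in 1..V, most frequent word first."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    # Words missing from the lexicon are skipped in this sketch
    return [word_index[w] for w in split_words(text) if w in word_index]

toy_sentences = ["The mic is great.", "The food is good", "Great food!"]
word_index = build_word_index(toy_sentences)
vocab_size = len(word_index) + 1  # +1 because index 0 is kept free for padding

print(word_index)
print(to_sequence("The mic is great.", word_index))
print(vocab_size)
```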


In [3]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# dimension of the embeddings to represent the words with vectors of the same dimension. 
emb_dim = 16

# we need to pad the sentences that have fewer words than the maximum length by adding zeros
tex_pad = pad_sequences(text_matrix, maxlen=max_length, padding='post')

# Naive train/test split: first 1600 samples for training, last 400 for testing
x_train = tex_pad[:1600,:]
y_train = labels[:1600]
x_test = tex_pad[1600:,:]
y_test = labels[1600:]

print(len(x_train))
print(len(y_train))
print(len(x_test))
print(len(y_test))

1600
1600
400
400
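
What `padding='post'` does can be illustrated in plain Python. A minimal sketch, assuming no sequence is longer than `maxlen` (true here, since `max_length` is the longest sentence); `pad_post` is a hypothetical helper:

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad every sequence with `value` until it has `maxlen` items."""
    padded = []
    for seq in sequences:
        # This sketch does not truncate, so longer sequences are rejected
        assert len(seq) <= maxlen, "sequence longer than maxlen"
        padded.append(list(seq) + [value] * (maxlen - len(seq)))
    return padded

toy = [[27, 58, 7], [3], [5, 14, 45, 1]]
padded = pad_post(toy, maxlen=4)
print(padded)
```

Index 0 is safe to use as the padding value because the tokenizer never assigns it to a word.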


In [4]:
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

lstm_units = 10

inputs = Input(shape=(max_length,))
embedding = Embedding(input_dim=vocab_size, output_dim=emb_dim, input_length=max_length, embeddings_regularizer=l2(.001))
embd_out = embedding(inputs)
lstm = LSTM(lstm_units, dropout=0.3, recurrent_dropout=0.2)
lstm_out = lstm(embd_out)

prob = Dense(1, activation='sigmoid')
outputs = prob(lstm_out)

model = Model(inputs, outputs)

print(model.summary())

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 32)]              0         
_________________________________________________________________
embedding (Embedding)        (None, 32, 16)            52144     
_________________________________________________________________
lstm (LSTM)                  (None, 10)                1080      
_________________________________________________________________
dense (Dense)                (None, 1)                 11        
Total params: 53,235
Trainable params: 53,235
Non-trainable params: 0
_________________________________________________________________
None
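
The parameter counts in the summary above can be reproduced by hand. A short sketch, assuming the standard LSTM parameterisation (four gates, each with input weights, recurrent weights and a bias):

```python
vocab_size, emb_dim, lstm_units = 3259, 16, 10

# One emb_dim-sized vector per vocabulary entry
embedding_params = vocab_size * emb_dim

# Four gates, each with input weights, recurrent weights and a bias vector
lstm_params = 4 * (emb_dim * lstm_units + lstm_units * lstm_units + lstm_units)

# One weight per LSTM unit plus one bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding layer, which is large compared with the 1600 training sentences.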


In [5]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])


model.fit(x=x_train,y=y_train,
          batch_size=100,
          epochs=10,
          verbose=1,
          shuffle=True,
          validation_data=(x_test,y_test)
         )

Train on 1600 samples, validate on 400 samples
Epoch 1/10


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fd8547b12b0>

In [6]:
# The model barely learnt. Results change with each execution
# acc_train = 0.56 in the last epoch vs. acc_train = 0.52 in the first epoch
# acc_test = 0.45 in the last epoch vs. acc_test = 0.41 in the first epoch
# Test
print(t.sequences_to_texts(x_test[:10]))
print(y_test[:10])

pred = model.predict(x_test[:10])
print(pred)

['i miss it and wish they had one in philadelphia', 'we got sitting fairly fast but ended up waiting 40 minutes just to place our order another 30 minutes before the food arrived', 'they also have the best cheese crisp in town', 'good value great food great service', "couldn't ask for a more satisfying meal", 'the food is good', 'it was awesome', 'i just wanted to leave', 'we made the drive all the way from north scottsdale and i was not one bit disappointed', 'i will not be eating there again']
[1 0 1 1 1 1 1 0 1 0]
[[0.5299712 ]
 [0.49581152]
 [0.54338825]
 [0.56598663]
 [0.53870577]
 [0.5556019 ]
 [0.54109323]
 [0.5389493 ]
 [0.5079776 ]
 [0.5250999 ]]
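
The sigmoid outputs above are probabilities, so they need a threshold to become labels. A small sketch using the ten printed values, assuming a threshold of 0.5 (the tutorial does not state one). It also measures how far the probabilities move away from 0.5:

```python
probs = [0.5299712, 0.49581152, 0.54338825, 0.56598663, 0.53870577,
         0.5556019, 0.54109323, 0.5389493, 0.5079776, 0.5250999]
labels = [1, 0, 1, 1, 1, 1, 1, 0, 1, 0]

# Assumed decision rule: predict positive when the probability exceeds 0.5
predicted = [1 if p > 0.5 else 0 for p in probs]
accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)

# Largest distance of any probability from the undecided value 0.5
max_margin = max(abs(p - 0.5) for p in probs)

print(predicted)
print(accuracy)
print(round(max_margin, 3))
```

Ten samples are too few to judge accuracy, and most of them are positive. The small margin is the more telling signal: every probability stays close to 0.5, so the model is nearly undecided on all of them.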


## Architecture

Input layer -> Embedding layer -> LSTM -> Attention -> Dense (sigmoid) -> Label

### Attention (Bahdanau et al., 2015)
Additive Attention

1. $\large score(s_t, h_i) = v_a^T \text{tanh}(W_a[s_t;h_i])$ 

   -> In this example we use local attention $score(h) = \large \text{tanh}(W_ah + b)$

2. $\large \alpha_{ti}=\frac{\exp(score_{ti})}{\sum_{k=1}^{N}{\exp(score_{tk})}}$

3. $\large c_t = \sum_{i=1}^{N} \alpha_{ti} h_i$
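
The three steps can be sketched for a single sequence in NumPy, using the simplified score $\text{tanh}(W_ah + b)$. This is an illustration under assumed toy shapes, not the Keras layer; `additive_attention`, `T` and `d` are hypothetical names.

```python
import numpy as np

def additive_attention(h, W, b):
    """h: (T, d) hidden states, W: (d, 1), b: (T, 1).

    Returns the context vector (d,) and the attention weights (T,).
    """
    # Step 1: one score per hidden state
    scores = np.tanh(h @ W + b).squeeze(-1)        # (T,)
    # Step 2: softmax over the T positions (shifted by the max for stability)
    exp = np.exp(scores - scores.max())
    alpha = exp / exp.sum()                        # (T,)
    # Step 3: weighted sum of the hidden states
    context = (alpha[:, None] * h).sum(axis=0)     # (d,)
    return context, alpha

rng = np.random.default_rng(0)
T, d = 5, 3                      # toy sizes: 5 hidden states of dimension 3
h = rng.normal(size=(T, d))
W = rng.normal(size=(d, 1))
b = np.zeros((T, 1))

context, alpha = additive_attention(h, W, b)
print(alpha.sum())               # the weights form a probability distribution
print(context.shape)
```

If `W` is all zeros, every score is equal, the weights become uniform, and the context vector reduces to the mean of the hidden states.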

In [7]:
import tensorflow.keras.backend as K
from tensorflow.keras.layers import Layer

# Custom attention layer
class BahdanauAttention(Layer):
    def __init__(self, **kwargs):
        super(BahdanauAttention, self).__init__(**kwargs)
    
    # build() creates the weights that the layer will learn. It receives the shape of the layer's
    # input and is called once, the first time the layer is applied to a tensor.
    def build(self, input_shape):
        # We need to provide the dimensions of our weights. In this example, we will have a W_a matrix of
        # dimension (lstm_units, 1), and a bias of dimension (max_length, 1)
        self.W=self.add_weight(name="att_weight",shape=(input_shape[-1],1),initializer="normal")
        self.b=self.add_weight(name="att_bias",shape=(input_shape[1],1),initializer="zeros")        
        super(BahdanauAttention, self).build(input_shape)
    
    # In this method we do all the calculations of the layer and return its output
    def call(self, x):
        # x is the input of the layer. In this example, the output of the LSTM with shape
        # (batch, hidden_states, lstm_units), where hidden_states = max_length
        
        # We calculate the score tanh(W.x + b)
        scores = K.tanh(K.dot(x,self.W)+self.b)  # (max_length x 1) 
        print('scores shape: ')
        print(scores.shape)
        
        # This removes the last axis -> a vector of max_length dimension.
        # The squeeze is needed because the softmax below runs over the last axis; without it,
        # the softmax would have to be applied with axis=1 instead
        scores=K.squeeze(scores, axis=-1) 
        print('scores shape after squeeze: ')
        print(scores.shape)
        
        # we apply softmax (the last axis is the default axis used for calculation)
        at=K.softmax(scores)
        print('attention weights shape: ')
        print(at.shape)
        
        # This adds a 1-sized dimension to the last axis -> matrix of (max_length x 1)
        at=K.expand_dims(at,axis=-1) # not needed if the squeeze is skipped and softmax uses axis=1
        print('attention weights shape after expand_dims: ')
        print(at.shape)
        
        # We calculate the weighted values -> \alpha*hidden_states         
        # row-wise multiplication (we are weighting the hidden_states, not the lstm_units) 
        output=x*at # (max_length x lstm_units)
        print('weighted values shape: ')
        print(output)
        
        # The layer returns the weighted values summed over the hidden states (batch x lstm_units)
        # and the attention weights (batch x max_length x 1)
        return K.sum(output, axis=1), at
    
    # Reports the output shape of the weighted sum (the first of the two outputs)
    def compute_output_shape(self, input_shape):
        return (input_shape[0],input_shape[-1])
    
    # Returns the layer configuration (used when saving or cloning the layer)
    def get_config(self):
        return super(BahdanauAttention, self).get_config()


In [8]:
from tensorflow.keras.layers import Attention, GlobalAveragePooling1D

# Architecture
inputs1 = Input(shape=(max_length,))
embedding1 = Embedding(input_dim=vocab_size, output_dim=emb_dim, input_length=max_length, embeddings_regularizer=l2(.001))
embd_out1 = embedding1(inputs1)
lstm1 = LSTM(lstm_units, dropout=0.3, recurrent_dropout=0.2, return_sequences=True)
lstm_out1 = lstm1(embd_out1)

# attention = GlobalAveragePooling1D(Attention()([lstm_out1, lstm_out1]))
weighted_out, weights = BahdanauAttention()(lstm_out1)

prob1 = Dense(1, activation='sigmoid')
outputs1 = prob1(weighted_out)

model1 = Model(inputs1, outputs1) # classifier
attention_model = Model(inputs1, weights) # attention weights


print(model1.summary())

scores shape: 
(None, 32, 1)
scores shape after squeeze: 
(None, 32)
attention weights shape: 
(None, 32)
attention weights shape after expand_dims: 
(None, 32, 1)
weighted values shape: 
Tensor("bahdanau_attention/mul:0", shape=(None, 32, 10), dtype=float32)
Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 32)]              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 32, 16)            52144     
_________________________________________________________________
lstm_1 (LSTM)                (None, 32, 10)            1080      
_________________________________________________________________
bahdanau_attention (Bahdanau ((None, 10), (None, 32, 1 42        
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 11      
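
The 42 parameters reported for the attention layer follow from the two weights created in `build`. A short check, assuming the shapes stated in the layer's comments (`W` is `(lstm_units, 1)` and `b` is `(max_length, 1)`):

```python
lstm_units, max_length = 10, 32

w_params = lstm_units * 1      # att_weight: one weight per LSTM unit
b_params = max_length * 1      # att_bias: one bias per position in the sequence
attention_params = w_params + b_params

print(attention_params)
```

Because the bias has one entry per position, the layer as written only works with sequences of exactly `max_length` steps.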

In [9]:
model1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
model1.fit(x=x_train,y=y_train,
          batch_size=100,
          epochs=10,
          verbose=1,
          shuffle=True,
          validation_data=(x_test,y_test)
          )

Train on 1600 samples, validate on 400 samples
Epoch 1/10
scores shape: 
(100, 32, 1)
scores shape after squeeze: 
(100, 32)
attention weights shape: 
(100, 32)
attention weights shape after expand_dims: 
(100, 32, 1)
weighted values shape: 
Tensor("model_1/bahdanau_attention/mul:0", shape=(100, 32, 10), dtype=float32)


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


scores shape: 
(100, 32, 1)
scores shape after squeeze: 
(100, 32)
attention weights shape: 
(100, 32)
attention weights shape after expand_dims: 
(100, 32, 1)
weighted values shape: 
Tensor("model_1/bahdanau_attention/mul:0", shape=(100, 32, 10), dtype=float32)


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


scores shape: 
(100, 32, 1)
scores shape after squeeze: 
(100, 32)
attention weights shape: 
(100, 32)
attention weights shape after expand_dims: 
(100, 32, 1)
weighted values shape: 
Tensor("model_1/bahdanau_attention/mul:0", shape=(100, 32, 10), dtype=float32)
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fd818934518>

In [10]:
# This time the model learns faster
# Results change with each execution
# acc_train = 0.92 and acc_test = 0.76 in the last epoch

pred = model1.predict(x_test[:10])
# Attention is applied after the embedding and LSTM layers, so it weights the LSTM's hidden-state
# representation at each position rather than the raw words.
attention_pred = attention_model.predict(x_test[:10]) 

print(x_test[0])
print(t.sequences_to_texts(x_test[:1]))
print(attention_pred.shape)
print(np.argmax(attention_pred, axis=1))
print(pred)

scores shape: 
(None, 32, 1)
scores shape after squeeze: 
(None, 32)
attention weights shape: 
(None, 32)
attention weights shape after expand_dims: 
(None, 32, 1)
weighted values shape: 
Tensor("model_1/bahdanau_attention/mul:0", shape=(None, 32, 10), dtype=float32)
[   3 2866    5    2 1101   37   25   40   14 2867    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0]
['i miss it and wish they had one in philadelphia']
(10, 32, 1)
[[15]
 [14]
 [15]
 [14]
 [14]
 [14]
 [15]
 [15]
 [14]
 [14]]
[[0.81437135]
 [0.23242   ]
 [0.9033072 ]
 [0.93976784]
 [0.22325104]
 [0.9174627 ]
 [0.8985529 ]
 [0.88971627]
 [0.26299414]
 [0.16395885]]
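
The peak attention positions printed above can be mapped back to the input. A small sketch for the first test sentence, using the printed `x_test[0]` and its peak index of 15; treating index 0 as padding follows from `padding='post'`:

```python
# First test sentence as printed above: 10 word indices followed by 22 zeros
x0 = [3, 2866, 5, 2, 1101, 37, 25, 40, 14, 2867] + [0] * 22
peak_position = 15  # argmax of the attention weights for this sentence

n_tokens = sum(1 for idx in x0 if idx != 0)
peak_is_padding = x0[peak_position] == 0

print(len(x0), n_tokens, peak_is_padding)
```

In this run the largest weight falls on a padded position rather than on one of the ten words, so it may be worth checking this before reading the weights as word importance.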
