# Sentiment classifier
Tutorial from [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2019/11/comprehensive-guide-attention-mechanism-deep-learning/) by Prodip Hore and Sayan Chatterjee

## Dataset

UCI Machine Learning Repository: Sentiment Labelled Sentences Data Set
('From Group to Individual Labels using Deep Features', Kotzias et. al,. KDD 2015)

Sentences: 2000 (We are only using Amazon and Yelp files)

Labels: Positive (1) - Negative (0)


Example:

* "The mic is great." Positive ->  `The mic is great.	1`

* "What a waste of money and time!." Negative -> `What a waste of money and time!.	0`


## Architecture

Input layer -> Embedding layer -> LSTM -> Dense (softmax) -> Label

In [1]:
import numpy as np

# Read txt files
with open('data/amazon.txt', mode='r') as f:
    lines = f.readlines()
    
with open('data/yelp.txt', mode='r') as f:
    lines += f.readlines()

# Split lines so we have sentences and the class as an integer
sentences = [line.split('\t')[0] for line in lines]
labels = [int(line.split('\t')[1]) for line in lines]
labels = np.asarray(labels)
print(len(labels))

2000


In [2]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Tokenizer: An object with an internal lexicon, and unknown token.
t = Tokenizer()
# Load the dataset in the tokenizer
t.fit_on_texts(sentences)

# Maps the words in the sentences with the indeces in the lexicon (list of lists)
text_matrix= t.texts_to_sequences(sentences)

print('sentence: ' + sentences[0])

print('representation: ')
print(text_matrix[0])


# calculate max length of sentence in the corpus
max_length = 0

for i in range(len(text_matrix)):
    sent_length = len(text_matrix[i])
    if max_length < sent_length:
        max_length = sent_length
    
print('max length: %d' % max_length)

# The vocabulary size will be determine by the index of the last word in the lexicon (index starting from 0)
vocab_size = len(t.word_index) + 1

print('vocabulary size: %d'%vocab_size)

sentence: So there is no way for me to plug it in here in the US unless I go by a converter.
representation: 
[27, 58, 7, 55, 141, 12, 60, 6, 268, 5, 14, 45, 14, 1, 148, 448, 3, 59, 112, 4, 1427]
max length: 32
vocabulary size: 3259


In [3]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# dimension of the embeddings to represent the words with vectors of the same dimension. 
emb_dim = 16

# we need to pad the sentences that have less words than the maximum length by adding zeros
tex_pad = pad_sequences(text_matrix, maxlen=max_length, padding='post')

# Dummy train test sets split 
x_train = tex_pad[:1600,:]
y_train = labels[:1600]
x_test = tex_pad[1600:,:]
y_test = labels[1600:]

print(len(x_train))
print(len(y_train))
print(len(x_test))
print(len(y_test))

1600
1600
400
400


In [4]:
import tensorflow.keras.backend as K
from tensorflow.keras.layers import Layer, GlobalAveragePooling1D, Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

# Custom attetion layer
class BahdanauAttention(Layer):
    def __init__(self, **kwargs):
        super(BahdanauAttention, self).__init__(**kwargs)
    
    # This method states the weights that the layer will learn. It has as input param the shape of the input
    # which is called. This method is called at the declaration time
    def build(self, input_shape):
        # We need to provide the dimensions of our weights. In this example, we will have a W_a matrix of
        # dimension (lstm_units, 1), and a bias of dimension (max_length, 1)
        self.W=self.add_weight(name="att_weight",shape=(input_shape[-1],1),initializer="normal")
        self.b=self.add_weight(name="att_bias",shape=(input_shape[1],1),initializer="zeros")        
        super(BahdanauAttention, self).build(input_shape)
    
    # In this method with do all the calculations of the layer and return the output of the layer
    def call(self, x):
        # x is the input of the layer. In this example, the output of lstm (hidden_statesxlstm_units) 
        # hidden_states = max_length
        
        # We calculate the score tanh(W.x + b)
        scores = K.tanh(K.dot(x,self.W)+self.b)  # (max_length x 1) 
        print('scores shape: ')
        print(scores.shape)
        
        # This removes the last axis -> a vector of max_length dimension 
        # we can omit this since our W matrix has dimension 1 in the last axis
        scores=K.squeeze(scores, axis=-1) 
        print('scores shape after squeeze: ')
        print(scores.shape)
        
        # we apply softmax (the last axis is the default axis used for calculation)
        at=K.softmax(scores)
        print('attention weights shape: ')
        print(at.shape)
        
        # This adds a 1-sized dimension to the last axis -> matrix of (max_length x 1)
        at=K.expand_dims(at,axis=-1) # if there is no squeeze, then we can omit this
        print('attention weights shape after expand_dims: ')
        print(at.shape)
        
        # We calculate the weighted values -> \alpha*hidden_states         
        # row-wise multiplication (we are weighting the hidden_states, not the lstm_units) 
        output=x*at # (max_length x lstm_units)
        print('weighted values shape: ')
        print(output)
        
        # The output of this layer is the weighted values (we sum up the values of the hidden states), and
        # the weights of the attetnion (max_length x 1)
        return K.sum(output, axis=1), at
    
    # This is used for summary, to see the output shape of the two output matrices
    def compute_output_shape(self, input_shape):
        return (input_shape[0],input_shape[-1])
    
    # This is used for summary (it returns the params of the layer)
    def get_config(self):
        return super(BahdanauAttention, self).get_config()


# Architecture
lstm_units = 10

inputs = Input(shape=(max_length,))
embedding = Embedding(input_dim=vocab_size, output_dim=emb_dim, input_length=max_length, embeddings_regularizer=l2(.001))
embd_out = embedding(inputs)
lstm = LSTM(lstm_units, dropout=0.3, recurrent_dropout=0.2, return_sequences=True)
lstm_out = lstm(embd_out)

weigthed_out, weights = BahdanauAttention()(lstm_out)

prob = Dense(2, activation='sigmoid')
outputs = prob(weigthed_out)

model = Model(inputs, outputs) # classifier
attention_model = Model(inputs, weights) # attention weights


print(model.summary())

scores shape: 
(None, 32, 1)
scores shape after squeeze: 
(None, 32)
attention weights shape: 
(None, 32)
attention weights shape after expand_dims: 
(None, 32, 1)
weighted values shape: 
Tensor("bahdanau_attention/mul:0", shape=(None, 32, 10), dtype=float32)
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 32)]              0         
_________________________________________________________________
embedding (Embedding)        (None, 32, 16)            52144     
_________________________________________________________________
lstm (LSTM)                  (None, 32, 10)            1080      
_________________________________________________________________
bahdanau_attention (Bahdanau ((None, 10), (None, 32, 1 42        
_________________________________________________________________
dense (Dense)                (None, 2)                 22        

In [5]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])


model.fit(x=x_train,y=y_train,
          batch_size=32,
          epochs=10,
          verbose=1,
          shuffle=True,
          validation_data=(x_test,y_test)
         )

Train on 1600 samples, validate on 400 samples
Epoch 1/10
scores shape: 
(32, 32, 1)
scores shape after squeeze: 
(32, 32)
attention weights shape: 
(32, 32)
attention weights shape after expand_dims: 
(32, 32, 1)
weighted values shape: 
Tensor("model/bahdanau_attention/mul:0", shape=(32, 32, 10), dtype=float32)


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


scores shape: 
(32, 32, 1)
scores shape after squeeze: 
(32, 32)
attention weights shape: 
(32, 32)
attention weights shape after expand_dims: 
(32, 32, 1)
weighted values shape: 
Tensor("model/bahdanau_attention/mul:0", shape=(32, 32, 10), dtype=float32)


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


(None, 32, 1)
scores shape after squeeze: 
(None, 32)
attention weights shape: 
(None, 32)
attention weights shape after expand_dims: 
(None, 32, 1)
weighted values shape: 
Tensor("model/bahdanau_attention/mul:0", shape=(None, 32, 10), dtype=float32)
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f36580795c0>

In [6]:
# The model barely learnt. Results change with each execusion
# acc_train = 0.56 last epoch vs acc_train = 0.52 first epoch
# acc_test = 0.45 last epoch vs acc_test = 0.41 for first epoch
# Test
print(t.sequences_to_texts(x_test[:10]))
print(y_test[:10])

pred = model.predict(x_test[:10])
print(pred)

['i miss it and wish they had one in philadelphia', 'we got sitting fairly fast but ended up waiting 40 minutes just to place our order another 30 minutes before the food arrived', 'they also have the best cheese crisp in town', 'good value great food great service', "couldn't ask for a more satisfying meal", 'the food is good', 'it was awesome', 'i just wanted to leave', 'we made the drive all the way from north scottsdale and i was not one bit disappointed', 'i will not be eating there again']
[1 0 1 1 1 1 1 0 1 0]
scores shape: 
(None, 32, 1)
scores shape after squeeze: 
(None, 32)
attention weights shape: 
(None, 32)
attention weights shape after expand_dims: 
(None, 32, 1)
weighted values shape: 
Tensor("model/bahdanau_attention/mul:0", shape=(None, 32, 10), dtype=float32)
[[0.6030605  0.15063035]
 [0.63111967 0.11287954]
 [0.05503449 0.9607009 ]
 [0.04045883 0.97273684]
 [0.68081427 0.05628744]
 [0.04759791 0.9668544 ]
 [0.05885363 0.9574808 ]
 [0.66232026 0.0722383 ]
 [0.618875 

## Architecture

Input layer -> Embedding block -> Transforemer Block -> GlobalAveragePooling -> Dropout ->Dense (reduce dimensions) -> Dropout -> Dense (softmax) -> Label

In [7]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Layer, Embedding, LayerNormalization, Dropout
from tensorflow.keras.models import Sequential


class TokenAndPositionEmbedding(Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        self.token_emb = Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions

    
# Multi-head self-attention
class MultiHeadSelfAttention(Layer):
    def __init__(self, embed_dim, num_heads=8):
        super(MultiHeadSelfAttention, self).__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        if embed_dim % num_heads != 0:
            raise ValueError(
                f"embedding dimension = {embed_dim} should be divisible by number of heads = {num_heads}"
            )
        self.projection_dim = embed_dim // num_heads
        self.query_dense = Dense(embed_dim)
        self.key_dense = Dense(embed_dim)
        self.value_dense = Dense(embed_dim)
        self.combine_heads = Dense(embed_dim)

    def attention(self, query, key, value, mask):
        score = tf.matmul(query, key, transpose_b=True) # (batch_size, )
        dim_key = tf.cast(tf.shape(key)[-1], tf.float32) # d_model
        scaled_score = score / tf.math.sqrt(dim_key)
        if mask:
            # look ahead mask
            seq_length = tf.shape(key)[1]
            mask_matrix = 1 - tf.linalg.band_part((tf.ones(seq_length, seq_length)), -1, 0)
            scaled_score += mask_matrix * -1e9 # -inf for masked values         
        weights = tf.nn.softmax(scaled_score, axis=-1)
        output = tf.matmul(weights, value)
        return output, weights

    def separate_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.projection_dim))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, inputs, mask=False):
        # x.shape = [batch_size, seq_len, embedding_dim]
        batch_size = tf.shape(inputs)[0]
        query = self.query_dense(inputs)  # (batch_size, seq_len, embed_dim)
        key = self.key_dense(inputs)  # (batch_size, seq_len, embed_dim)
        value = self.value_dense(inputs)  # (batch_size, seq_len, embed_dim)
        query = self.separate_heads(
            query, batch_size
        )  # (batch_size, num_heads, seq_len, projection_dim)
        key = self.separate_heads(
            key, batch_size
        )  # (batch_size, num_heads, seq_len, projection_dim)
        value = self.separate_heads(
            value, batch_size
        )  # (batch_size, num_heads, seq_len, projection_dim)
        attention, weights = self.attention(query, key, value, mask)
        attention = tf.transpose(
            attention, perm=[0, 2, 1, 3]
        )  # (batch_size, seq_len, num_heads, projection_dim)
        concat_attention = tf.reshape(
            attention, (batch_size, -1, self.embed_dim)
        )  # (batch_size, seq_len, embed_dim)
        output = self.combine_heads(
            concat_attention
        )  # (batch_size, seq_len, embed_dim)
        return output
    

class TransformerBlock(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = MultiHeadSelfAttention(embed_dim, num_heads)
        self.ffn = Sequential(
            [Dense(ff_dim, activation="relu"), Dense(embed_dim),]
        ) # FeedForward Network
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, inputs, training):
        attn_output = self.att(inputs)
        # if training=True the layer add dropout, in inference mode it does nothing.
        attn_output = self.dropout1(attn_output, training=training) 
        out1 = self.layernorm1(inputs + attn_output) # Residual connection (+)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

In [8]:
from tensorflow.keras.layers import Input, GlobalAveragePooling1D
from tensorflow.keras.models import Model

embed_dim = 16  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 10  # Hidden layer size in feed forward network inside transformer

inputs = Input(shape=(max_length,))
embedding_layer = TokenAndPositionEmbedding(max_length, vocab_size, embed_dim)
x = embedding_layer(inputs)
print('----------------------- dimensions -----------------------')
print('Shape embedding layer: ')
print(x.shape)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
print('Shape Transformer block: ')
print(x.shape)
x = GlobalAveragePooling1D()(x)
print('Shape global average pooling layer: ')
print(x.shape)
x = Dropout(0.1)(x)
print('Shape dropout layer: ')
print(x.shape)
x = Dense(20, activation="relu")(x)
print('Shape dense 1 layer: ')
print(x.shape)
x = Dropout(0.1)(x)
print('Shape dropout layer: ')
print(x.shape)
outputs = Dense(2, activation="softmax")(x)
print('Shape dense 2 (output) layer: ')
print(outputs.shape)

model_transformer = Model(inputs=inputs, outputs=outputs)
print(model_transformer.summary())

----------------------- dimensions -----------------------
Shape embedding layer: 
(None, 32, 16)
Shape Transformer block: 
(None, 32, 16)
Shape global average pooling layer: 
(None, 16)
Shape dropout layer: 
(None, 16)
Shape dense 1 layer: 
(None, 20)
Shape dropout layer: 
(None, 20)
Shape dense 2 (output) layer: 
(None, 2)
Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 32)]              0         
_________________________________________________________________
token_and_position_embedding (None, 32, 16)            52656     
_________________________________________________________________
transformer_block (Transform (None, 32, 16)            1498      
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
______________________________________________________________

In [9]:
model_transformer.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])
model_transformer.fit(x=x_train,y=y_train,
          batch_size=32,
          epochs=10,
          verbose=1,
          shuffle=True,
          validation_data=(x_test,y_test)
          )

Train on 1600 samples, validate on 400 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f360083e128>

In [10]:
# Predictions

pred = model_transformer.predict(x_test)


print(x_test[1])
print(t.sequences_to_texts(x_test[1:2]))
print(pred[1])
print(y_test[1])

[ 32 108 819 756 331  28 661  52 425 727 124  50   6  26  78 198 209 592
 124 205   1  24 364   0   0   0   0   0   0   0   0   0]
['we got sitting fairly fast but ended up waiting 40 minutes just to place our order another 30 minutes before the food arrived']
[0.9899026  0.01009738]
0
