# Sentiment Analysis Experimentation 

This notebook is an experiment in sentiment analysis with deep learning. Roughly speaking, it tests whether the task can be done using only the most frequently used words. 

Word order usually matters for understanding the meaning of a sentence and, in this case, its sentiment. Deep learning models have already been shown to work well in these cases.
##### But what if we drop the less frequently used words from the input sentence? 
Since every word must be encoded and the dictionary size directly affects the number of trainable parameters in the network, having fewer words would allow the use of lighter models. 
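
As a rough illustration of this point, the sketch below counts the weights of an embedding table for a few vocabulary thresholds. The function name `embedding_weight_count` is only illustrative; the embedding width of 128 and the 3 reserved indices follow the choices made later in this notebook.

```python
def embedding_weight_count(threshold, embedding_dim=128, reserved=3):
    """Number of weights in an embedding table with (threshold + reserved) rows."""
    return (threshold + reserved) * embedding_dim


# Compare a few vocabulary thresholds (the values are examples).
for threshold in (10_000, 20_000, 30_000, 40_000):
    print(threshold, embedding_weight_count(threshold))
```

With a threshold of 30,000 this gives 3,840,384 weights, which matches the embedding row of the model summary further down.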

### Methodology

The training and test data come from a ready-to-use dataset provided by Keras. More specifically, the dataset contains IMDB reviews, each with a binary flag that says whether the review is good (1) or bad (0).

In this notebook I tried different networks:
- GRU based network
- Conv1D based network with squeeze-and-excitation layer
- LSTM based network
- Conv1D + GRU based network

The best results of each network are recorded as comments in the cell that defines the models.



In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras


import tensorflow.keras.backend as K


from tensorflow.keras.models import load_model

In [2]:

#list the XLA_GPU devices and make only the first one visible (if there is one)
gpus = tf.config.experimental.list_physical_devices('XLA_GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[0], 'XLA_GPU')
else:
    print("No GPU was detected.")

gpus

[PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU')]

In [3]:
#load the dataset (each review is a list of word indices)
(X_train,y_train),(X_test,y_test)= keras.datasets.imdb.load_data()


#y=0 bad, y=1 good

In [4]:
dictionary_word_index= keras.datasets.imdb.get_word_index()
#dictionary_word_index.items()

In [5]:
#create the index -> word dict; indices are shifted by 3 because 0, 1 and 2 are reserved for <pad>, <sos> and <unk>

def index_to_word(d):
    d= {(index +3) : word for word,  index in d.items()}
    d[0]='<pad>'
    d[1]='<sos>'
    d[2]='<unk>'
    
    return d

In [6]:
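#shift the word -> index dict by 3 to make room for the special tokens
def add_3(d):
    d= {word : (index +3) for word,  index in d.items()}
    #keys are words in this dict, so the special tokens map token -> index
    d['<pad>']=0
    d['<sos>']=1
    d['<unk>']=2
    
    return d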

In [7]:
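#keep in the dictionary only the most frequent words (indices below threshold+3), removing all the others

def remove_less_freq(d, threshold):
    '''
    params: 
    d: index -> word dictionary (indices already shifted by 3)
    threshold: int, only indices below threshold+3 are kept
    '''
    #strict '<' matches the filter applied to X_train/X_test and the Embedding input_dim (t+3)
    d={index: word for index,word in d.items() if index < (threshold+3)}

    return d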

In [8]:
#create the index -> word dict, then shift the word -> index dict by 3
dictionary_index_word=index_to_word(dictionary_word_index)
dictionary_word_index=add_3(dictionary_word_index)
#dictionary_index_word.items()

In [9]:
#print the first review (decoding needs the index -> word dict)
#[dictionary_index_word[index] for index in X_test[0]]
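
The index shift and the decoding step used above can be illustrated on a toy vocabulary. This is only a sketch: the words, ranks and helper names (`build_index_to_word`, `decode`) are made up for the example and are not part of the dataset API.

```python
# Toy word -> rank mapping (hypothetical data, rank 1 = most frequent word).
toy_word_index = {"the": 1, "movie": 2, "great": 3, "boring": 4}

SPECIAL_TOKENS = {0: "<pad>", 1: "<sos>", 2: "<unk>"}
OFFSET = 3  # ranks are shifted by 3 to make room for the special tokens


def build_index_to_word(word_index):
    """Return an index -> word dict that also contains the special tokens."""
    index_to_word = {rank + OFFSET: word for word, rank in word_index.items()}
    index_to_word.update(SPECIAL_TOKENS)
    return index_to_word


def decode(review, index_to_word):
    """Turn a list of indices back into text, using <unk> for unknown indices."""
    return " ".join(index_to_word.get(i, "<unk>") for i in review)


toy_index_to_word = build_index_to_word(toy_word_index)
print(decode([1, 4, 5, 6, 0, 0], toy_index_to_word))
```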

In [10]:
t=30000 #vocabulary threshold: word indices >= t+3 are dropped
x_t=300 #number of words kept for every review (longer reviews are truncated, shorter ones zero-padded)

dictionary_index_word= remove_less_freq(dictionary_index_word,t)
#dictionary_index_word.items()

In [11]:
#remove the less frequent words from X_train too, then truncate or zero-pad every review to x_t words

for i in range(len(X_train)):
    X_train[i]=[j for j in X_train[i] if j<(t+3)]
    if len(X_train[i])>x_t:
        X_train[i]=X_train[i][:x_t]
    else:
        X_train[i] += [0]*(x_t-len( X_train[i]))



In [12]:
#same filtering, truncation and zero padding for X_test
for i in range(len(X_test)):
    X_test[i]=[j for j in X_test[i] if j<(t+3)]
    if len(X_test[i])>x_t:
        X_test[i]=X_test[i][:x_t]
    else:
        X_test[i] += [0]*(x_t-len( X_test[i]))
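
The two loops above apply the same three steps (filter, truncate, pad). A minimal sketch of a shared helper, assuming the same conventions as this notebook (indices shifted by 3, padding value 0); the name `filter_and_pad` is hypothetical:

```python
def filter_and_pad(review, threshold, max_len, offset=3, pad_value=0):
    """Drop indices >= threshold + offset, then truncate or right-pad to max_len."""
    kept = [i for i in review if i < threshold + offset]
    kept = kept[:max_len]
    return kept + [pad_value] * (max_len - len(kept))


# Toy example: threshold 5 keeps the indices 0..7, and every review ends up with 4 items.
print(filter_and_pad([1, 4, 9, 6, 20, 7, 5], threshold=5, max_len=4))
print(filter_and_pad([1, 9, 4], threshold=5, max_len=4))
```

Keras also offers `pad_sequences`, but by default it pads and truncates at the start of the sequence, so `padding="post"` and `truncating="post"` would be needed to reproduce the behaviour of the loops above.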


In [13]:
#convert to numpy arrays

X_train=np.array([np.array(xi) for xi in X_train]) 
X_test=np.array([np.array(xi) for xi in X_test]) 

In [14]:
#define helper functions for the different experiments


from tensorflow.keras.layers import Dense,GlobalAveragePooling1D,Reshape,Multiply

#squeeze and excite
def sq_n_ex(input_, r=4):

    '''
    params:
    input_: 3D tensor (batch, steps, channels)
    r: reduction ratio of the first fully connected layer
    '''
    input_sNe_shape = (1,input_.shape[2]) 
    sNe_layer = GlobalAveragePooling1D()(input_)
    sNe_layer = Reshape(input_sNe_shape)(sNe_layer)
    
    #ratio is used only in the first fully connected layer
    sNe_layer = Dense(input_.shape[2] // r, activation='relu', kernel_initializer='he_normal', use_bias=False)(sNe_layer)  
    #second FC restores the channel count; it uses ReLU here, while the original squeeze-and-excitation block uses a sigmoid gate
    sNe_layer = Dense(input_.shape[2], activation='relu', kernel_initializer='he_normal', use_bias=False)(sNe_layer)
    
    return Multiply()([input_, sNe_layer])

In [40]:
#clear keras session
K.clear_session()


#model
from tensorflow.keras.layers import Embedding, GRU, Dense, Conv1D, Concatenate,Input,Flatten,LSTM
import tensorflow.keras.regularizers as regularizers

'''
# 1st model (GRU): accuracy 0.85 with params (t/x_t) 30.000/300, rmsprop, epoch 5
model = keras.Sequential([
    Embedding(t+3,128,mask_zero=True,input_shape=[None]),
    GRU(128,return_sequences=True,dropout=0.2,recurrent_dropout=0.2),
    GRU(128,dropout=0.2,recurrent_dropout=0.2),
    Dense(128, kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4),
                bias_regularizer=regularizers.l2(1e-4),
                activity_regularizer=regularizers.l2(1e-5)),
    Dense(1,kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4),
                bias_regularizer=regularizers.l2(1e-4),
                activity_regularizer=regularizers.l2(1e-5), activation="sigmoid")
])
'''




'''
2nd model (Conv1D + squeeze and excite): it overfits, params 40.000/200, rmsprop
accuracy with the concatenated branches (conc) = 0.83, epoch 5
accuracy with a single Conv1D = 0.84, epoch 5

'''
'''
input_ =Input(shape=(x_t,))
em=Embedding(input_dim=t+3,output_dim=128,mask_zero=True)(input_)
c1=Conv1D(64,1)(em)
#c2=Conv1D(64,3,padding='same')(em)
#c3=Conv1D(64,2,padding="same")(em)
se=sq_n_ex(c1)
#conc= Concatenate()([se,c2])
f=Flatten()(se)
d1=Dense(128, kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4),
            bias_regularizer=regularizers.l2(1e-4),
            activity_regularizer=regularizers.l2(1e-5),activation="tanh")(f)
output_=Dense(1,kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4),
            bias_regularizer=regularizers.l2(1e-4),
            activity_regularizer=regularizers.l2(1e-5), activation="sigmoid")(d1)


model=keras.Model(inputs=[input_],outputs=[output_])
'''


'''

#3rd model, simple LSTM
#params 10.000/200 accuracy 0.83 epoch 5
input_ =Input(shape=(x_t,))
em=Embedding(input_dim=t+3,output_dim=128,mask_zero=True)(input_)

l=LSTM(128,return_sequences=True)(em)
l=LSTM(128)(l)

output_=Dense(1,kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4),
            bias_regularizer=regularizers.l2(1e-4),
            activity_regularizer=regularizers.l2(1e-5), activation="sigmoid")(l)


model=keras.Model(inputs=[input_],outputs=[output_])

'''
#4th model Conv1D+GRU
#params 10.000/200 accuracy 0.81 epoch 8 patience 2
#params 20.000/300 accuracy 0.839 epoch 5 patience 3
model=keras.Sequential([
    Embedding(input_dim=t+3,output_dim=128,mask_zero=True),
    Conv1D(128,4,strides=2,padding='valid'),
    GRU(128,return_sequences=True),
    GRU(128,return_sequences=False),
    Dense(1, activation="sigmoid")
])

In [41]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 128)         3840384   
_________________________________________________________________
conv1d (Conv1D)              (None, None, 128)         65664     
_________________________________________________________________
gru (GRU)                    (None, None, 128)         99072     
_________________________________________________________________
gru_1 (GRU)                  (None, 128)               99072     
_________________________________________________________________
dense (Dense)                (None, 1)                 129       
Total params: 4,104,321
Trainable params: 4,104,321
Non-trainable params: 0
_________________________________________________________________
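
The parameter counts in this summary can be reproduced by hand. The sketch below uses the standard formulas for each layer type; the helper names are illustrative, and the GRU formula assumes the variant with two bias vectors per weight block (`reset_after=True`, the default in TensorFlow 2).

```python
def count_embedding(vocab_size, dim):
    # one row of `dim` weights per vocabulary index
    return vocab_size * dim


def count_conv1d(kernel_size, in_channels, filters):
    # one kernel of shape (kernel_size, in_channels) per filter, plus one bias per filter
    return kernel_size * in_channels * filters + filters


def count_gru(input_dim, units):
    # 3 weight blocks (update gate, reset gate, candidate state), each with
    # input weights, recurrent weights and two bias vectors
    return 3 * (units * (input_dim + units) + 2 * units)


def count_dense(input_dim, units):
    return input_dim * units + units


counts = {
    "embedding": count_embedding(30000 + 3, 128),
    "conv1d": count_conv1d(4, 128, 128),
    "gru": count_gru(128, 128),
    "gru_1": count_gru(128, 128),
    "dense": count_dense(128, 1),
}
total = sum(counts.values())
print(counts)
print(total)
```

The embedding table accounts for most of the weights, which is why the vocabulary threshold has such a large effect on the model size.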


In [42]:
opt_a=tf.keras.optimizers.Adam(
    learning_rate=0.002, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=True)


In [43]:
model.compile(loss="binary_crossentropy", optimizer=opt_a, metrics=["accuracy"])


In [44]:
es=tf.keras.callbacks.EarlyStopping(patience=2)


history = model.fit(X_train,y_train, epochs=5, batch_size=128,validation_split=0.2,callbacks=[es])

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
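
The `EarlyStopping(patience=2)` callback used here stops training once the monitored value (the validation loss by default) has not improved for 2 consecutive epochs. A simplified sketch of that rule, with made-up loss values and ignoring options such as `min_delta`; the function `stopping_epoch` is hypothetical and is not the Keras implementation:

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it runs to the end."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses for a 5-epoch run.
print(stopping_epoch([0.45, 0.38, 0.40, 0.42, 0.41], patience=2))
print(stopping_epoch([0.45, 0.38, 0.36, 0.37, 0.35], patience=2))
```

With only 5 epochs and a patience of 2, the callback can only cut the run short if the validation loss stops improving by the third epoch.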


In [45]:
model.evaluate(X_test,y_test)



[0.5570315718650818, 0.8375200033187866]

In [None]:
#saving model
model.save('sentiment_analysis.h5')

In [None]:
#loading model

model = load_model('sentiment_analysis.h5')

In [53]:

X_new=[]
sentence1="You've gotta love that sales pitch, and I find it hard to believe the director had trouble getting funding for Frantic But I really struggled with this film, and it comes down to pacing. Harrison Ford's rather stiff here, and the story's somewhat re-energized very once in a while with a new breadcrumb on the trail of his missing wife. This is essentially what concerns the movie's first half. Personally, I found a lot to like about Emmanuelle Seigner, and she really seemed to elevate her scenes with the star but she also comes in rather late in the game for such a key component.".split()
sentence2=" His Bayisms were kept to a minimum, and the movie ran on the Smith/Lawrence chemistry, macho gun battles and slick polish. The Mark Mancina score added loads to the film, and it was pretty funny tosses out all of that. Everything is ramped to 11, including the camerawork, hateful dialogue and coked-fueled editing. This is a testament to a director whose id is fully in charge, and this saps all of the humor, fun and entertainment value It is exhausting.".split()
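#encode sentence1, keeping only the words found in the truncated dictionary
#(sentence2 can be swapped in to test a second review)
for word in sentence1:
    if word in dictionary_word_index:
        if(dictionary_word_index[word] in dictionary_index_word):
            X_new.append(dictionary_word_index[word])
#truncate or zero-pad to x_t words, as done for the training data
if len(X_new)>x_t:
    X_new=X_new[:x_t]
else:
    X_new += [0]*(x_t-len( X_new))
X_new=np.array(X_new)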


In [54]:
pred=model.predict(X_new[None,...])
print(pred)
if pred>0.5:
    print('That\'s a good review!')
else:
    print('better don\'t watch that movie!')


[[0.4854832]]
better don't watch that movie!
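
Part of the mismatch with new sentences may come from tokenization: `str.split()` keeps capital letters and punctuation attached to the words, while the keys of the Keras word index appear to be lowercase words without surrounding punctuation. The sketch below shows one possible way to normalize a sentence before the lookup. It assumes a toy vocabulary, and the helper `encode_sentence` and its regular expression are illustrative choices, not the preprocessing used to build the dataset.

```python
import re


def encode_sentence(text, word_to_index, threshold, max_len, start_index=1, offset=3):
    """Lowercase, strip punctuation, map words to indices, then truncate or zero-pad."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    encoded = [start_index]  # the dataset reviews begin with the start token
    for token in tokens:
        index = word_to_index.get(token)
        # unknown or too infrequent words are dropped, as in the cells above
        if index is not None and index < threshold + offset:
            encoded.append(index)
    encoded = encoded[:max_len]
    return encoded + [0] * (max_len - len(encoded))


# Toy vocabulary, with indices already shifted by 3 (hypothetical data).
toy_word_to_index = {"the": 4, "movie": 5, "great": 6, "boring": 7}
print(encode_sentence("The movie? GREAT, really great!", toy_word_to_index, threshold=4, max_len=8))
```

Here "The" and "GREAT," are matched after normalization, "really" is dropped because it is not in the toy vocabulary, and "boring" would be dropped because its index is not below `threshold + offset`.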


## Conclusions

In conclusion, besides overfitting (which can be fixed), the models show very low accuracy on new sentences that contain many words not included in the dictionary used for training. Next step: add stop-word filtering so that the retained words are more meaningful.
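
A minimal sketch of that next step, assuming a small hand-written stop-word list. The list, the toy dictionary and the function name `drop_stop_words` are placeholders and not part of this notebook.

```python
# Placeholder stop-word list; a real experiment would need a more complete one.
STOP_WORDS = {"the", "a", "an", "and", "is", "it", "of", "to", "this"}


def drop_stop_words(review, index_to_word, stop_words=STOP_WORDS):
    """Remove the indices whose word is a stop word; all other indices keep their order."""
    return [i for i in review if index_to_word.get(i) not in stop_words]


# Toy index -> word dict (hypothetical data, indices already shifted by 3).
toy_index_to_word = {0: "<pad>", 1: "<sos>", 2: "<unk>", 4: "the", 5: "movie", 6: "is", 7: "great"}
print(drop_stop_words([1, 4, 5, 6, 7], toy_index_to_word))
```

Such a filter would have to run before truncation and padding, so that the `x_t` positions of each review are filled with the remaining words.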