# Sentiment Analysis Experimentation 

This notebook is an experiment in sentiment analysis with deep learning. Roughly speaking, it checks whether the task can be done using only the most frequently used words.

Word order usually matters for understanding the meaning of a sentence and, in this case, its sentiment. Deep learning models have already proven to work well on such tasks.
##### But what if we drop the less frequently used words from the input sentence?
Since every word must be encoded and the dictionary size directly affects the number of trainable parameters in the network, having fewer words would allow lighter models.
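
To make the link between dictionary size and model size concrete, here is a minimal sketch of how the weight count of an embedding layer grows with the vocabulary. The helper name `embedding_params` and the vocabulary sizes are illustrative assumptions, not values taken from the experiments.

```python
def embedding_params(vocab_size: int, embedding_dim: int) -> int:
    """An embedding layer stores one vector of `embedding_dim` weights per token."""
    return vocab_size * embedding_dim


# Hypothetical vocabulary sizes, all with a 64-dimensional embedding.
for vocab_size in (5_000, 30_000, 90_000):
    print(vocab_size, embedding_params(vocab_size, 64))
```

The count is linear in the vocabulary size, so cutting the dictionary cuts the embedding table by the same factor.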

### Methodology

For training and testing I used a ready-to-use dataset provided by Keras. More specifically, the dataset contains IMDB reviews and a binary flag that says whether each review is good or bad.

In this notebook I tried different networks:
- GRU based network
- Conv1D based network with squeeze-and-excitation layer
- LSTM based network
- Conv1D + GRU based network

The best results are written down as comments in the model cell.
#### Version 2:
- use the NLTK stop-word list to remove the stop words from among the "most frequent" ones

In [188]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

import nltk
from nltk.corpus import stopwords

import string


import tensorflow.keras.backend as K


from tensorflow.keras.models import load_model

In [189]:

gpus = tf.config.experimental.list_physical_devices('XLA_GPU')
if gpus:
    #use only the first GPU
    tf.config.experimental.set_visible_devices(gpus[0], 'XLA_GPU')
else:
    print("No GPU was detected.")

gpus

[PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU')]

In [190]:
#load the dataset
(X_train,y_train),(X_test,y_test)= keras.datasets.imdb.load_data()
#y=0 bad, y=1 good

In [191]:
dictionary_word_index= keras.datasets.imdb.get_word_index()
#dictionary_word_index.items()

In [192]:
#create index to word dict

def index_to_word(d):
    d= {(index +3) : word for word,  index in d.items()}
    d[0]='<pad>'
    d[1]='<sos>'
    d[2]='<unk>'
    
    return d
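
The offset of 3 exists because the Keras IMDB loader, with its default arguments, reserves index 0 for padding, 1 for the start-of-sequence marker and 2 for unknown words. A minimal sketch of the decoding on a toy dictionary follows; `toy_word_index` and `encoded_review` are made-up values for illustration only.

```python
def build_index_to_word(word_index, offset=3):
    """Shift every index by `offset` and reserve 0..2 for the special tokens."""
    index_to_word = {index + offset: word for word, index in word_index.items()}
    index_to_word[0] = '<pad>'
    index_to_word[1] = '<sos>'
    index_to_word[2] = '<unk>'
    return index_to_word


# Hypothetical word index: rank 1 is the most frequent word.
toy_word_index = {'the': 1, 'movie': 2, 'great': 3}
toy_index_to_word = build_index_to_word(toy_word_index)

# A review as the loader would encode it: <sos>, three words, <unk>, <pad>.
encoded_review = [1, 4, 5, 6, 2, 0]
decoded = ' '.join(toy_index_to_word[i] for i in encoded_review)
print(decoded)
```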

In [193]:
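def add_3(d):
    #shift every index by 3 to make room for the special tokens
    d= {word : (index +3) for word,  index in d.items()}
    #this dict maps word -> index, so the special tokens are the keys
    d['<pad>']=0
    d['<sos>']=1
    d['<unk>']=2
    
    return d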

In [194]:
#keep only the "threshold" most frequent words in the dictionary and remove all the others

def remove_less_freq(d, threshold):
    '''
    params: 
    d: dictionary mapping index -> word
    threshold: int, frequency-rank cutoff
    '''
    #strict "<" so that every kept index is a valid row of an Embedding with input_dim=threshold+3
    d={index: word for index,word in d.items() if index < (threshold+3)}

    return d

In [195]:
#create index to word dict
dictionary_index_word=index_to_word(dictionary_word_index)
dictionary_word_index=add_3(dictionary_word_index)
#dictionary_index_word.items()

## Update

- expand contracted forms
- remove stop words
- reduce the dictionary size by removing the stop words from it
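
The first two steps can be sketched on plain word tokens as follows. This is only an illustration under stated assumptions: `toy_contractions` and `toy_stop_words` are tiny hypothetical stand-ins for the full contraction dictionary and the NLTK stop-word list, and the notebook itself works on integer indices rather than strings.

```python
# Hypothetical miniature versions of the real lookup tables.
toy_contractions = {"dont": "do not", "isnt": "is not", "im": "i am"}
toy_stop_words = {"i", "am", "do", "is", "this", "the", "a", "it"}  # "not" is kept on purpose


def clean(sentence):
    """Expand contracted forms, then drop stop words."""
    tokens = []
    for word in sentence.lower().split():
        # A contraction expands to several words, anything else stays as is.
        tokens.extend(toy_contractions.get(word, word).split())
    return [token for token in tokens if token not in toy_stop_words]


print(clean("I dont like this movie"))
```

Expanding before filtering matters here: the negation only survives because "dont" is first turned into "do not" and "not" is excluded from the stop-word set.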

In [196]:
contractions_dict = {
"im": "i am",
"dont": "do not",
"doesnt": "does not",
"theres": "there is",
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"i'd": "I would",
"i'd've": "I would have",
"i'll": "I will",
"i'll've": "I will have",
"i'm": "I am",
"i've": "I have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that had",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": " what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have"
}

In [197]:
def encoder_dict_out(sent):
    enc=[]
    for word in sent.lower().split():
        enc.append(dictionary_word_index[word])
    return enc[:]

In [198]:
stop_words = set(stopwords.words('english'))  
stop_words.add('br')
stop_words.remove('no')
stop_words.remove('not')  #removing this in order to differentiate "good" from "not good"
stop_words.remove('nor')


for i in range(len(X_train)):
    new_X=[]
    for index in X_train[i]:
        if dictionary_index_word[index] in contractions_dict:
            new_X.extend(encoder_dict_out(contractions_dict[dictionary_index_word[index]]))
        else:
            new_X.append(index)
    
    X_train[i]=new_X[0:]


for i in range(len(X_train)):
    X_train[i] = [index for index in X_train[i] if not dictionary_index_word[index] in stop_words]
        
for i in range(len(X_test)):
    X_test[i] = [index for index in X_test[i] if not dictionary_index_word[index] in stop_words]


In [199]:
#print one training review after contraction expansion and stop-word removal
#[dictionary_index_word[index] for index in X_train[10]]
for x in X_train[10]:
    print(dictionary_index_word[x])

<sos>
french
horror
cinema
seen
something
revival
last
couple
years
great
films
inside
switchblade
romance
bursting
scene
maléfique
preceded
revival
slightly
stands
head
shoulders
modern
horror
titles
surely
one
best
french
horror
films
ever
made
maléfique
obviously
shot
low
budget
made
far
ways
one
originality
film
turn
complimented
excellent
writing
acting
ensure
film
winner
plot
focuses
two
main
ideas
prison
black
magic
central
character
man
named
carrère
sent
prison
fraud
put
cell
three
others
quietly
insane
lassalle
body
building
transvestite
marcus
retarded
boyfriend
daisy
short
cell
together
stumble
upon
hiding
place
wall
contains
old
journal
translating
part
soon
realise
magical
powers
realise
may
able
use
break
prison
walls
black
magic
interesting
topic
actually
quite
surprised
not
films
based
much
scope
things
fair
say
maléfique
makes
best
assets
despite
restraints
film
never
actually
feels
restrained
manages
flow
well
throughout
director
eric
valette
provides
great
atmospher

In [200]:
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'br',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 're',
 's',
 'same

In [201]:
#print first review with stop words removed
#[dictionary_index_word[index] for index in X_test[0]]

In [202]:
def cleaning_dict(d_i2w,d_w2i):
    #remove every stop word from both the index->word and the word->index dictionary
    for word in stop_words:
        if word in d_w2i:
            index_word=d_w2i[word]
            if index_word in d_i2w:
                del d_i2w[index_word]
                del d_w2i[word]
    return d_i2w,d_w2i

In [203]:
dictionary_index_word,dictionary_word_index=cleaning_dict(dictionary_index_word,dictionary_word_index)

In [204]:
t=30000 #threshold on the number of most frequent words to keep
x_t=300 #maximum number of words kept for every review

dictionary_index_word= remove_less_freq(dictionary_index_word,t)
#dictionary_index_word.items()

In [205]:
#removing less frequent items also from X_train and zero padding it so they all have the same dimension

for i in range(len(X_train)):
    X_train[i]=[j for j in X_train[i] if j<(t+3)]
    if len(X_train[i])>x_t:
        X_train[i]=X_train[i][:x_t]
    else:
        X_train[i] += [0]*(x_t-len( X_train[i]))



In [206]:
#removing items also from X_test and zero padding it
for i in range(len(X_test)):
    X_test[i]=[j for j in X_test[i] if j<(t+3)]
    if len(X_test[i])>x_t:
        X_test[i]=X_test[i][:x_t]
    else:
        X_test[i] += [0]*(x_t-len( X_test[i]))
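
The two cells above apply the same three steps to every review: drop rare indices, truncate, and zero-pad. A minimal sketch of that logic as one reusable function follows; the name `filter_and_pad` and the toy sequences are assumptions made for illustration.

```python
def filter_and_pad(sequence, max_index, length, pad_value=0):
    """Keep indices below `max_index`, then truncate or right-pad to `length`."""
    kept = [index for index in sequence if index < max_index]
    kept = kept[:length]
    return kept + [pad_value] * (length - len(kept))


# Index 9 is above the cutoff and is dropped, then the review is padded.
print(filter_and_pad([1, 5, 9, 2], max_index=6, length=6))
# A review that is too long is simply cut.
print(filter_and_pad([1, 2, 3, 4], max_index=10, length=2))
```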


In [207]:
#convert to numpy arrays

X_train=np.array([np.array(xi) for xi in X_train]) 
X_test=np.array([np.array(xi) for xi in X_test]) 

In [208]:
#define functions for different exp


from tensorflow.keras.layers import Dense,GlobalAveragePooling1D,Reshape,Multiply

#squeeze and excite
def sq_n_ex(input_, r=4):

    '''
    param: input , ratio 
    '''
    input_sNe_shape = (1,input_.shape[2]) 
    sNe_layer = GlobalAveragePooling1D()(input_)
    sNe_layer = Reshape(input_sNe_shape)(sNe_layer)
    
    #ratio is used only in the first fully connected layer
    sNe_layer = Dense(input_.shape[2] // r, activation='relu', kernel_initializer='he_normal', use_bias=False)(sNe_layer)  
    #hard sigmoid in the second FC, so that its output acts as a per-channel gate
    sNe_layer = Dense(input_.shape[2], activation='hard_sigmoid', kernel_initializer='he_normal', use_bias=False)(sNe_layer)
    
    return Multiply()([input_, sNe_layer])
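
To show what a squeeze-and-excitation step computes, here is a small sketch in plain Python lists instead of Keras layers. It assumes a standard sigmoid gate and hand-picked weights, so it illustrates the general technique rather than the exact layer defined above; `squeeze_excite`, `w1` and `w2` are hypothetical names.

```python
import math


def squeeze_excite(x, w1, w2):
    """x: T timesteps, each a list of C channel values.
    w1: C x H bottleneck weights, w2: H x C expansion weights (no biases)."""
    steps, channels = len(x), len(x[0])
    hidden = len(w1[0])
    # Squeeze: average every channel over the time axis.
    s = [sum(step[c] for step in x) / steps for c in range(channels)]
    # Excite, part 1: bottleneck with ReLU.
    h = [max(0.0, sum(s[c] * w1[c][j] for c in range(channels))) for j in range(hidden)]
    # Excite, part 2: back to C channels, squashed to (0, 1) by a sigmoid.
    gate = [1.0 / (1.0 + math.exp(-sum(h[j] * w2[j][c] for j in range(hidden))))
            for c in range(channels)]
    # Scale: every timestep is multiplied channel-wise by the gate.
    return [[step[c] * gate[c] for c in range(channels)] for step in x]


# With all-zero bottleneck weights the gate is sigmoid(0) = 0.5 for every channel.
x = [[1.0, 2.0], [3.0, 4.0]]
print(squeeze_excite(x, w1=[[0.0], [0.0]], w2=[[1.0, 1.0]]))
```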

In [220]:
#clear keras session
K.clear_session()

#model
from tensorflow.keras.layers import Embedding, GRU, Dense, Conv1D, Concatenate,Input,Flatten,LSTM
import tensorflow.keras.regularizers as regularizers

# this model gets accuracy 0.85 with 30.000/300 as params rmsprop epoch 5
latent_dim=64

model = keras.Sequential([
    Embedding(t+3,latent_dim,mask_zero=True,input_shape=[None]),
    GRU(latent_dim,return_sequences=True,dropout=0.4,recurrent_dropout=0.4),
    GRU(latent_dim,dropout=0.4,recurrent_dropout=0.4),
    Dense(latent_dim, kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4),
                bias_regularizer=regularizers.l2(1e-4),
                activity_regularizer=regularizers.l2(1e-5)),
    Dense(1,kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4),
                bias_regularizer=regularizers.l2(1e-4),
                activity_regularizer=regularizers.l2(1e-5), activation="sigmoid")
])


'''

#3rd model, simple LSTM
#params 10.000/200 accuracy 0.83 epoch 5
input_ =Input(shape=(x_t,))
em=Embedding(input_dim=t+3,output_dim=128,mask_zero=True)(input_)

l=LSTM(128,return_sequences=True)(em)
l=LSTM(128)(l)

output_=Dense(1,kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4),
            bias_regularizer=regularizers.l2(1e-4),
            activity_regularizer=regularizers.l2(1e-5), activation="sigmoid")(l)


model=keras.Model(inputs=[input_],outputs=[output_])

'''

In [210]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 64)          1920192   
_________________________________________________________________
gru (GRU)                    (None, None, 64)          24960     
_________________________________________________________________
gru_1 (GRU)                  (None, 64)                24960     
_________________________________________________________________
dense (Dense)                (None, 64)                4160      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 1,974,337
Trainable params: 1,974,337
Non-trainable params: 0
_________________________________________________________________
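
The parameter counts in the summary can be reproduced by hand. The sketch below is a worked check, assuming the GRU uses two bias vectors per weight block (an assumption that matches the 24,960 reported above); the helper names are illustrative.

```python
def embedding_params(vocab_size, dim):
    # One vector of `dim` weights per token.
    return vocab_size * dim


def gru_params(input_dim, units):
    # Three weight blocks (update gate, reset gate, candidate state), each with
    # input weights, recurrent weights and, by assumption, two bias vectors.
    return 3 * (input_dim * units + units * units + 2 * units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


t, latent_dim = 30000, 64
layers = [
    embedding_params(t + 3, latent_dim),  # embedding
    gru_params(latent_dim, latent_dim),   # gru
    gru_params(latent_dim, latent_dim),   # gru_1
    dense_params(latent_dim, latent_dim), # dense
    dense_params(latent_dim, 1),          # dense_1
]
total = sum(layers)
print(layers, total)
```

The embedding table accounts for almost all of the weights, which is why the dictionary threshold `t` dominates the model size.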


In [211]:
opt_a=tf.keras.optimizers.Adam(
    learning_rate=0.002, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=True)

In [212]:
model.compile(loss="binary_crossentropy", optimizer=opt_a, metrics=["accuracy"])


In [213]:
es=tf.keras.callbacks.EarlyStopping(patience=2)


history = model.fit(X_train,y_train, epochs=5, batch_size=128,validation_split=0.2,callbacks=[es])

Epoch 1/5
Epoch 2/5
Epoch 3/5
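
Training can end before the fifth epoch because of the `EarlyStopping` callback. The following sketch mimics its patience rule in plain Python, assuming the default behaviour of monitoring the validation loss with no minimum delta; the loss values are invented for illustration and are not results from this run.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement: stop once `patience` such epochs have accumulated.
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses: the best value is reached at epoch 1.
print(early_stopping_epoch([0.5, 0.4, 0.45, 0.47, 0.3], patience=2))
```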


In [214]:
model.evaluate(X_test,y_test)



[0.40172043442726135, 0.8568800091743469]

In [215]:
#saving model
model.save('sentiment_analysis.h5')

In [216]:
#loading model

model = load_model('sentiment_analysis.h5')

In [217]:
def preprocess_input(sentence):
    table = str.maketrans(dict.fromkeys(string.punctuation))
    remove_digits = str.maketrans('', '', string.digits)
    sentence=sentence.lower()
    sentence = sentence.translate(remove_digits)
    sentence=sentence.translate(table) #remove punctuation (the sentence is already lower-cased)
    return sentence
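
A short, self-contained sketch of the same `str.translate` technique shows what happens to an apostrophe. The function name `strip_punctuation_and_digits` and the sample sentence are assumptions for illustration.

```python
import string


def strip_punctuation_and_digits(sentence):
    # Translation tables whose third argument lists the characters to delete.
    delete_punctuation = str.maketrans('', '', string.punctuation)
    delete_digits = str.maketrans('', '', string.digits)
    return sentence.lower().translate(delete_digits).translate(delete_punctuation)


cleaned = strip_punctuation_and_digits("I don't like it, 10/10!")
print(cleaned.split())
```

Because the apostrophe is deleted, "don't" arrives as "dont", which may be why the contraction dictionary also lists a few apostrophe-free forms such as "dont" and "doesnt".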

In [218]:
def predict_feedback(sentence):
    X_new=[]
    print(sentence, "\t")
    sentence=preprocess_input(sentence)
    
    for word in sentence.split():
        if word in dictionary_word_index:
            if(dictionary_word_index[word] in dictionary_index_word):
                X_new.append(dictionary_word_index[word])
    if len(X_new)>x_t:
        X_new=X_new[:x_t]
    else:
        X_new += [0]*(x_t-len( X_new))
    
    X_new=np.array(X_new)
    
    pred=model.predict(X_new[None,...])

    if pred>0.5:
        print('That\'s a good review! \n')
    else:
        print('better don\'t watch that movie! \n')


In [219]:
sentence1="This movie was awesome"
sentence2="i hate this movie, it is terrible"
sentence3='this is not good, hate it'  #negative review that the model gets wrong

predict_feedback(sentence1)
predict_feedback(sentence2)
predict_feedback(sentence3)

This movie was awesome 	
That's a good review! 

i hate this movie, it is terrible 	
better don't watch that movie! 

this is not good, hate it 	
That's a good review! 



## Conclusions

Some considerations on this second version:
- stop words act like noise: removing them gives better results
- Conv1D works terribly; the task is better performed by GRU cells
- purposely keeping "not/no/nor" out of the stop-word list, in order to capture the relation between "good" (positive) and "not good" (negative), does not work