# Sentiment Analysis with a `Bidirectional LSTM`

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).

_Sentiment analysis_ is the use of [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing), [text analysis](https://en.wikipedia.org/wiki/Text_analytics), [computational linguistics](https://en.wikipedia.org/wiki/Computational_linguistics), and [biometrics](https://en.wikipedia.org/wiki/Biometrics) to systematically identify, extract, quantify, and study affective states and subjective information. 

Sentiment analysis is widely applied to [voice of the customer](https://en.wikipedia.org/wiki/Voice_of_the_customer) materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.

![sentiment-analysis](https://miro.medium.com/proxy/1*_JW1JaMpK_fVGld8pd1_JQ.gif)

In this notebook, we build a language model for sentiment analysis using the `Keras` API and `TensorFlow`.

We will be using a dataset that was put together by combining several datasets for sentiment classification available on [Kaggle](https://www.kaggle.com/):

- The `IMDB 50K` [dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?select=IMDB+Dataset.csv): _50K movie reviews for natural language processing or text analytics._
- The `Twitter US Airline Sentiment` [dataset](https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment): _originated from [Crowdflower's Data for Everyone library](http://www.crowdflower.com/data-for-everyone)._
- Our `google_play_apps_review` dataset: _built using the `google_play_scraper` in [this notebook](https://github.com/Nkluge-correa/teeny-tiny_castle/blob/64d0693c28786ce42149411bec8b3b42520fc4df/ML%20Explainability/NLP%20Interpreter%20(en)/scrape(en).ipynb)._
- The `EcoPreprocessed` [dataset](https://www.kaggle.com/datasets/pradeeshprabhakar/preprocessed-dataset-sentiment-analysis): _scraped Amazon product reviews._

The final result is the `sentiment_analysis_dataset.csv` file, available for download at [this link](https://drive.google.com/uc?export=download&id=1_ijhnVLHddM7Cm3R3vfqBB-svw6iNfpv). It is also available in [Portuguese](https://drive.google.com/uc?export=download&id=1YCIzGqcdlHSy-GvghRp0U5USUhuOVEE3)!

Both versions of the dataset (English and Portuguese) already come preprocessed. This is the cleaning function we used:

```python

import re
from unidecode import unidecode

def custom_standardization(input_data):
    clean_text = input_data.lower().replace("<br />", " ")
    clean_text = re.sub(r"[-()\"#/@;:<>{}=~|.?,]", ' ', clean_text)
    clean_text = re.sub(' +', ' ', clean_text)
    return unidecode(clean_text)

```
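To see what this standardization does to a raw review, here is a small self-contained sketch. It assumes `unidecode` may not be installed, so it swaps in the standard library's `unicodedata` for accent stripping; that substitution, the helper names `strip_accents` and `standardize_sketch`, and the sample sentence are all illustrative and not part of the original pipeline.

```python
import re
import unicodedata


def strip_accents(text):
    # Stand-in for unidecode (an assumption of this sketch): decompose
    # accented characters and drop everything that is not plain ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def standardize_sketch(input_data):
    # Same three steps as custom_standardization above:
    # lowercase and drop HTML line breaks, replace selected punctuation
    # with spaces, then collapse runs of spaces.
    clean_text = input_data.lower().replace("<br />", " ")
    clean_text = re.sub(r"[-()\"#/@;:<>{}=~|.?,]", " ", clean_text)
    clean_text = re.sub(" +", " ", clean_text)
    return strip_accents(clean_text)


sample = "This movie was GREAT!<br />Loved it... really?"  # made-up review
cleaned = standardize_sketch(sample)
print(repr(cleaned))  # 'this movie was great! loved it really '
```

Two details are worth noticing: the exclamation mark survives because `!` is not in the character class, and a trailing space can remain because the function does not strip the result.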

```python
# PT-BR https://drive.google.com/uc?export=download&id=1YCIzGqcdlHSy-GvghRp0U5USUhuOVEE3
# EN https://drive.google.com/uc?export=download&id=1_ijhnVLHddM7Cm3R3vfqBB-svw6iNfpv

import pandas as pd
import urllib.request

urllib.request.urlretrieve(
    'https://drive.google.com/uc?export=download&id=1_ijhnVLHddM7Cm3R3vfqBB-svw6iNfpv',
    'sentiment_analysis_dataset_en.csv'
)

df = pd.read_csv('sentiment_analysis_dataset_en.csv')
display(df)
```

| Unnamed: 0 | review | sentiment |
|---|---|---|
| 0 | one of the other reviewers has mentioned that ... | 1 |
| 1 | a wonderful little production the filming tech... | 1 |
| 2 | i thought this was a wonderful way to spend ti... | 1 |
| 3 | basically there's a family where a little boy ... | 0 |
| 4 | petter mattei's love in the time of money is a... | 1 |
| ... | ... | ... |
| 85084 | yaaa cool use last weeks give good response | 1 |
| 85085 | years daughter love alexa enjoy alexa | 1 |
| 85086 | yes popular but doesnt use except listen songs... | 1 |
| 85087 | yo alexa love | 1 |


The following cells train a bidirectional long short-term memory network (`Bi-LSTM`) for binary sentiment classification (negative versus positive).


The `Embedding`, `Bidirectional`, and `LSTM` layers are commonly used in `RNNs` for processing sequential data such as text:

- The `Embedding` layer is used to convert input text data into numerical vectors. It maps each word in the text to a fixed-size vector of real numbers, which can be learned during training or pre-trained on a large corpus of text. The purpose of the embedding layer is to capture the semantic meaning of words and represent them in a dense vector space, which can be used as input to the subsequent layers of the network.
- The `Bidirectional` layer is used to improve the performance of `RNNs` by processing the input sequence in both directions, forward and backward. It consists of two separate `RNNs` that process the input sequence in opposite directions and concatenate the outputs of each time step. This allows the network to capture information from both past and future contexts, which can be particularly useful for tasks such as text classification and named entity recognition.
- The `LSTM` layer is a type of `RNN` that is designed to overcome the limitations of traditional `RNNs`, such as the vanishing gradient problem. It has a more complex architecture that allows it to selectively forget or remember information from previous time steps, which makes it particularly effective for tasks that involve long-term dependencies, such as language modeling and machine translation. The `LSTM` layer consists of memory cells that store information over time, input gates that regulate the flow of new information into the memory cells, and output gates that control the output of the layer. For more information, read the original paper that proposed this architecture, "_[Long Short-Term Memory](https://dl.acm.org/doi/10.1162/neco.1997.9.8.1735)_."
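The three layers described above can be sketched in plain Python to make the data flow concrete. This is a toy illustration under stated assumptions, not how Keras implements them: the sizes, the random weights, and the helper names (`init_params`, `lstm_step`, `run_lstm`) are made up, and the cell includes the forget gate used by modern LSTM implementations such as the Keras `LSTM` layer.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def init_params(input_dim, units, rng):
    # One weight matrix (units x (input_dim + units)) and one bias per gate.
    def gate():
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(input_dim + units)]
                   for _ in range(units)]
        return {"W": weights, "b": [0.0] * units}
    return {name: gate() for name in ("input", "forget", "output", "candidate")}


def lstm_step(x, h, c, params):
    z = x + h  # concatenate the current input with the previous hidden state
    new_h, new_c = [], []
    for j in range(len(h)):
        pre = {name: sum(w * v for w, v in zip(p["W"][j], z)) + p["b"][j]
               for name, p in params.items()}
        i = sigmoid(pre["input"])        # how much new content to write
        f = sigmoid(pre["forget"])       # how much old memory to keep
        o = sigmoid(pre["output"])       # how much memory to expose
        g = math.tanh(pre["candidate"])  # proposed new content
        c_j = f * c[j] + i * g
        new_c.append(c_j)
        new_h.append(o * math.tanh(c_j))
    return new_h, new_c


def run_lstm(sequence, params, units):
    # Process the sequence step by step, returning the hidden state at each step.
    h, c = [0.0] * units, [0.0] * units
    outputs = []
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
        outputs.append(h)
    return outputs


rng = random.Random(0)
vocab_size, embed_size, units = 10, 4, 3  # toy sizes, far smaller than the notebook's

# Embedding: a lookup table with one vector per token id.
embedding_table = [[rng.uniform(-1, 1) for _ in range(embed_size)]
                   for _ in range(vocab_size)]
token_ids = [2, 7, 1, 5]
embedded = [embedding_table[t] for t in token_ids]

# Bidirectional: two independent LSTMs, one reading the reversed sequence.
forward = run_lstm(embedded, init_params(embed_size, units, rng), units)
backward = run_lstm(embedded[::-1], init_params(embed_size, units, rng), units)[::-1]
bidirectional = [f + b for f, b in zip(forward, backward)]

print(len(bidirectional), len(bidirectional[0]))  # 4 time steps, 2 * units features
```

Each time step ends up with `2 * units` features because the forward and backward hidden states are concatenated, which is why a bidirectional layer wrapping `LSTM(64)` produces 128 features per step.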


```python
import io
import json
import os

import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from keras_preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

vocab_size = 5000      # size of the tokenizer vocabulary
embed_size = 128       # dimension of each word embedding
sequence_length = 250  # every review is padded/truncated to this many tokens

# Build the vocabulary; words outside it are mapped to the "<OOV>" token
tokenizer = Tokenizer(num_words=vocab_size,
                      filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                      lower=True,
                      split=" ",
                      oov_token="<OOV>")

tokenizer.fit_on_texts(df.review)
tokenizer_json = tokenizer.to_json()

# Save the tokenizer so it can be reloaded later (create the folder if needed)
os.makedirs('models', exist_ok=True)
with io.open('models/tokenizer_senti_model_pt.json', 'w', encoding='utf-8') as fp:
    fp.write(json.dumps(tokenizer_json, ensure_ascii=False))

x_train, x_test, y_train, y_test = train_test_split(
    df.review, df.sentiment, test_size=0.2, random_state=42)

# Convert text to integer sequences of a fixed length
x_train = pad_sequences(
    tokenizer.texts_to_sequences(x_train),
    maxlen=sequence_length,
    truncating='post')
x_test = pad_sequences(
    tokenizer.texts_to_sequences(x_test),
    maxlen=sequence_length,
    truncating='post')
y_train = np.array(y_train).astype(float)
y_test = np.array(y_test).astype(float)

# Embedding -> two stacked Bi-LSTM layers -> single sigmoid unit
inputs = tf.keras.Input(shape=(None,), dtype="int32")
x = tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=embed_size,
                              input_length=sequence_length)(inputs)

x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True))(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)

outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(loss=tf.losses.BinaryCrossentropy(),
              optimizer='adam',
              metrics=['accuracy'])

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")
model.summary()

# Keep the best checkpoint and stop once val_loss stops improving for 3 epochs
callbacks = [tf.keras.callbacks.ModelCheckpoint("models/senti_model_sigmoid_pt.keras",
                                                save_best_only=True),
             tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              patience=3,
                                              verbose=1,
                                              mode="auto",
                                              baseline=None,
                                              restore_best_weights=True)]

model.fit(x_train,
          y_train,
          epochs=20,
          validation_split=0.2,
          callbacks=callbacks,
          verbose=1)

test_loss_score, test_acc_score = model.evaluate(x_test, y_test)

print(f'Final Loss: {round(test_loss_score, 2)}.')
print(f'Final Performance: {round(test_acc_score * 100, 2)} %.')
```


```text
Version:  2.10.1
Eager mode:  True
GPU is available
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
 input_1 (InputLayer)        [(None, None)]            0

 embedding (Embedding)       (None, None, 128)         640000

 bidirectional (Bidirectiona  (None, None, 128)        98816
 l)

 bidirectional_1 (Bidirectio  (None, 128)              98816
 nal)

 dense (Dense)               (None, 1)                 129

Total par
```
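The training cell above calls `pad_sequences` with `maxlen=sequence_length` and `truncating='post'`, leaving `padding` at its Keras default of `'pre'`. The pure-Python sketch below mimics that behavior on a toy batch so the effect is easy to see; `pad_sequences_sketch` is a hypothetical stand-in written for this illustration, not the Keras function itself.

```python
def pad_sequences_sketch(sequences, maxlen, padding="pre", truncating="pre", value=0):
    # Mimics the padding/truncating options used in the notebook
    # (illustrative only; the real function returns a NumPy array).
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'post' keeps the first maxlen tokens, 'pre' keeps the last ones.
            seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


batch = [[4, 9], [3, 8, 2, 7, 5, 6]]  # made-up token ids
result = pad_sequences_sketch(batch, maxlen=4, truncating="post")
print(result)  # [[0, 0, 4, 9], [3, 8, 2, 7]]
```

With these settings, short reviews get zeros on the left, while long reviews keep their first `maxlen` tokens and lose the rest.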

If you prefer, you can create the same model with a `softmax` output function.
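The two output heads are closely related: a two-class `softmax` assigns the positive class the same probability as a `sigmoid` applied to the difference between the two logits. The sketch below checks this numerically with made-up logit values. In practice the choice mostly changes the output shape, `(None, 2)` instead of `(None, 1)`, and the matching loss function.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


z_negative, z_positive = -0.3, 1.2  # arbitrary example logits

p_softmax = softmax([z_negative, z_positive])[1]
p_sigmoid = sigmoid(z_positive - z_negative)

print(p_softmax, p_sigmoid)  # the two probabilities agree up to rounding
```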

```python
import io
import json
import os

import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from keras_preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

vocab_size = 5000      # size of the tokenizer vocabulary
embed_size = 128       # dimension of each word embedding
sequence_length = 250  # every review is padded/truncated to this many tokens

# Build the vocabulary; words outside it are mapped to the "<OOV>" token
tokenizer = Tokenizer(num_words=vocab_size,
                      filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                      lower=True,
                      split=" ",
                      oov_token="<OOV>")

tokenizer.fit_on_texts(df.review)
tokenizer_json = tokenizer.to_json()

# Save the tokenizer so it can be reloaded later (create the folder if needed)
os.makedirs('models', exist_ok=True)
with io.open('models/tokenizer_senti_model_pt.json', 'w', encoding='utf-8') as fp:
    fp.write(json.dumps(tokenizer_json, ensure_ascii=False))

x_train, x_test, y_train, y_test = train_test_split(
    df.review, df.sentiment, test_size=0.2, random_state=42)

# Convert text to integer sequences of a fixed length
x_train = pad_sequences(
    tokenizer.texts_to_sequences(x_train),
    maxlen=sequence_length,
    truncating='post')
x_test = pad_sequences(
    tokenizer.texts_to_sequences(x_test),
    maxlen=sequence_length,
    truncating='post')
# Integer class ids (0 or 1), as expected by sparse categorical cross-entropy
y_train = np.array(y_train).astype(int)
y_test = np.array(y_test).astype(int)

# Embedding -> two stacked Bi-LSTM layers -> two-unit softmax
inputs = tf.keras.Input(shape=(None,), dtype="int32")
x = tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=embed_size,
                              input_length=sequence_length)(inputs)

x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True))(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)

outputs = tf.keras.layers.Dense(2, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")
model.summary()

# Keep the best checkpoint and stop once val_loss stops improving for 3 epochs
callbacks = [tf.keras.callbacks.ModelCheckpoint("models/senti_model_softmax_pt.keras",
                                                save_best_only=True),
             tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              patience=3,
                                              verbose=1,
                                              mode="auto",
                                              baseline=None,
                                              restore_best_weights=True)]

model.fit(x_train,
          y_train,
          epochs=20,
          validation_split=0.2,
          callbacks=callbacks,
          verbose=1)

test_loss_score, test_acc_score = model.evaluate(x_test, y_test)

print(f'Final Loss: {round(test_loss_score, 2)}.')
print(f'Final Performance: {round(test_acc_score * 100, 2)} %.')
```

```text
Version:  2.10.1
Eager mode:  True
GPU is available
Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
 input_3 (InputLayer)        [(None, None)]            0

 embedding_2 (Embedding)     (None, None, 128)         640000

 bidirectional_4 (Bidirectio  (None, None, 128)        98816
 nal)

 bidirectional_5 (Bidirectio  (None, 128)              98816
 nal)

 dense_2 (Dense)             (None, 2)                 258

Total p
```
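The per-layer parameter counts in the two model summaries can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper function names are made up for this illustration.

```python
def embedding_params(vocab_size, embed_size):
    # One embedding vector per vocabulary entry.
    return vocab_size * embed_size


def lstm_params(input_dim, units):
    # Four weight sets (input, forget, and output gates plus the candidate),
    # each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def bidirectional_lstm_params(input_dim, units):
    # Two independent LSTMs, one per direction.
    return 2 * lstm_params(input_dim, units)


def dense_params(input_dim, outputs):
    # Weight matrix plus one bias per output unit.
    return input_dim * outputs + outputs


vocab_size, embed_size, units = 5000, 128, 64  # values from the notebook

embedding = embedding_params(vocab_size, embed_size)
first_bilstm = bidirectional_lstm_params(embed_size, units)
# The second Bi-LSTM receives 2 * units features from the first one.
second_bilstm = bidirectional_lstm_params(2 * units, units)
dense_sigmoid = dense_params(2 * units, 1)
dense_softmax = dense_params(2 * units, 2)

print(embedding, first_bilstm, second_bilstm, dense_sigmoid, dense_softmax)
# 640000 98816 98816 129 258
```

Both `Bi-LSTM` layers have the same count only because the embedding size (128) happens to equal the concatenated output size of the first bidirectional layer (2 × 64).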

Congratulations, you have trained your own `Bi-LSTM`. 🙃

