# Sentiment Analysis

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).

**_Sentiment analysis_ (also known as _opinion mining_ or _emotion AI_) is the use of [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing "Natural language processing"), [text analysis](https://en.wikipedia.org/wiki/Text_analytics "Text analytics"), [computational linguistics](https://en.wikipedia.org/wiki/Computational_linguistics "Computational linguistics"), and [biometrics](https://en.wikipedia.org/wiki/Biometrics "Biometrics") to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to [voice of the customer](https://en.wikipedia.org/wiki/Voice_of_the_customer "Voice of the customer") materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from [marketing](https://en.wikipedia.org/wiki/Marketing "Marketing") to [customer service](https://en.wikipedia.org/wiki/Customer_relationship_management "Customer relationship management") to clinical medicine.**

![image](https://vitalflux.com/wp-content/uploads/2021/10/sentiment-analysis-machine-learning-techniques-640x395.png)

**In this notebook, we are creating a language model for sentiment analysis using `Keras` and `TensorFlow`.**

**We will be using a dataset that was put together by combining several datasets for sentiment classification available on [Kaggle](https://www.kaggle.com/):**

- The `IMDB 50K` [dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?select=IMDB+Dataset.csv): _50K movie reviews for natural language processing or text analytics._
- The `Twitter US Airline Sentiment` [dataset](https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment): _originally from [Crowdflower's Data for Everyone library](http://www.crowdflower.com/data-for-everyone)._
- Our `google_play_apps_review` dataset: _built using the `google_play_scraper` in [this notebook](https://github.com/Nkluge-correa/teeny-tiny_castle/blob/64d0693c28786ce42149411bec8b3b42520fc4df/ML%20Explainability/NLP%20Interpreter%20(en)/scrape(en).ipynb)._
- The `EcoPreprocessed` [dataset](https://www.kaggle.com/datasets/pradeeshprabhakar/preprocessed-dataset-sentiment-analysis): _scraped Amazon product reviews._

**The final result is the `sentiment_analysis_dataset.csv` file, available for download at [this link](https://drive.google.com/uc?export=download&id=1_ijhnVLHddM7Cm3R3vfqBB-svw6iNfpv).**



In [1]:
import io
import json
import numpy as np
import pandas as pd
import urllib.request
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
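# Import the preprocessing utilities from the same TensorFlow install used for the model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer, tokenizer_from_json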


**Load & Split the Dataset**


In [2]:

urllib.request.urlretrieve(
    'https://drive.google.com/uc?export=download&id=1_ijhnVLHddM7Cm3R3vfqBB-svw6iNfpv', 
    'sentiment_analysis_dataset.csv'
)

df = pd.read_csv('sentiment_analysis_dataset.csv')
display(df)

x = list(df.review)
y = list(df.sentiment)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

y_train = np.array(y_train).astype(float)
y_test = np.array(y_test).astype(float)
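
The split above holds out 20% of the reviews for testing and fixes the shuffle with `random_state=42`. The sketch below illustrates the same idea in plain Python. It is not scikit-learn's implementation: `simple_split` is a hypothetical helper, the toy reviews are made up, and its shuffle will not reproduce the exact partition `train_test_split` produces.

```python
import math
import random


def simple_split(items, labels, test_size=0.2, seed=42):
    """Shuffle indices with a seeded RNG, then hold out a fraction for testing."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)          # same seed -> same shuffle
    n_test = math.ceil(len(items) * test_size)    # size of the held-out part
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [items[i] for i in train_idx]
    x_test = [items[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Toy data (hypothetical): ten short reviews with alternating labels
toy_reviews = [f"review number {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

x_tr, x_te, y_tr, y_te = simple_split(toy_reviews, toy_labels)
print(len(x_tr), "training reviews,", len(x_te), "test reviews")
```

Because reviews and labels are indexed by the same shuffled positions, each review keeps its own label on both sides of the split.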


**Build & Save Tokenizer**


In [11]:
vocab_size = 3000
embed_size = 50
max_len = 256
tokenizer = Tokenizer(num_words=vocab_size,
                      filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                      lower=True,
                      split=" ",
                      oov_token="<OOV>")

tokenizer.fit_on_texts(x_train)
training_sequences = tokenizer.texts_to_sequences(x_train)
training_padded = pad_sequences(
    training_sequences, maxlen=max_len, truncating='post')


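# to_json() already returns a JSON string, so json.dumps() encodes it a second time.
# To reload: read the file with json.load(f), then pass the result to tokenizer_from_json().
tokenizer_json = tokenizer.to_json()
with io.open('tokenizer_senti_model_en.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))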
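
To make the tokenization step more concrete, here is a simplified pure-Python sketch of what the cell above does: rank words by frequency, reserve index 1 for `<OOV>`, map anything outside the `num_words` budget to `<OOV>`, then left-pad with zeros and cut overlong sequences at the end. It imitates the behaviour of `Tokenizer` and `pad_sequences` as configured here, but it is not the Keras code; the helper names and toy sentences are hypothetical.

```python
from collections import Counter

# Same filter characters as in the Tokenizer call above
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_TABLE = str.maketrans({ch: " " for ch in FILTERS})


def _words(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    return text.lower().translate(_TABLE).split()


def build_word_index(texts, oov_token="<OOV>"):
    """Index 1 is the OOV token; real words follow, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(_words(text))
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def texts_to_sequences(texts, word_index, num_words):
    """Unknown words and words ranked at or beyond num_words become index 1."""
    sequences = []
    for text in texts:
        seq = []
        for word in _words(text):
            idx = word_index.get(word, 1)
            seq.append(idx if idx < num_words else 1)
        sequences.append(seq)
    return sequences


def pad_pre_truncate_post(seq, maxlen):
    """Keep the first maxlen tokens, then left-pad with zeros."""
    seq = seq[:maxlen]
    return [0] * (maxlen - len(seq)) + seq


train_texts = ["Great movie, great cast!", "Terrible movie."]
word_index = build_word_index(train_texts)
for seq in texts_to_sequences(["Boring movie"], word_index, num_words=4):
    print(pad_pre_truncate_post(seq, maxlen=5))
```

Index 0 is never assigned to a word, which is why it is free to serve as the padding value.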


**Train the model**

- _To deal with this NLP problem, we will train a **bidirectional LSTM** (**Bi-LSTM**), a recurrent architecture that reads each sequence in both directions and is, in general, fast to train and well suited to sequential data such as text._
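
As a rough illustration of the "bidirectional" part, the sketch below runs one recurrence left to right and a second, separately weighted recurrence right to left, then concatenates their states. To keep it short it swaps the LSTM cell for a plain `tanh` recurrence with random weights, so it only shows the wiring and the output shapes, not LSTM gating. All function names are hypothetical.

```python
import math
import random


def make_cell(input_dim, units, rng):
    """Random weights for a plain tanh recurrence (a stand-in for an LSTM cell)."""
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_dim)] for _ in range(units)]
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(units)] for _ in range(units)]
    return w_in, w_rec


def run_cell(cell, sequence):
    """Return the hidden state after every time step."""
    w_in, w_rec = cell
    h = [0.0] * len(w_in)
    states = []
    for x in sequence:
        h = [math.tanh(sum(w * xi for w, xi in zip(w_in[j], x)) +
                       sum(w * hi for w, hi in zip(w_rec[j], h)))
             for j in range(len(w_in))]
        states.append(h)
    return states


def bidirectional(fwd_cell, bwd_cell, sequence, return_sequences=False):
    """Concatenate a forward pass with a pass over the reversed sequence."""
    fwd = run_cell(fwd_cell, sequence)
    bwd = run_cell(bwd_cell, sequence[::-1])
    if return_sequences:
        # Re-align the backward states with the original time order
        return [f + b for f, b in zip(fwd, bwd[::-1])]
    return fwd[-1] + bwd[-1]  # final state of each direction


rng = random.Random(0)
steps = [[rng.random() for _ in range(3)] for _ in range(5)]  # 5 steps, 3 features
fwd_cell, bwd_cell = make_cell(3, 4, rng), make_cell(3, 4, rng)
print(len(bidirectional(fwd_cell, bwd_cell, steps)))  # 2 * units values
```

With 64 units per direction, as in the model in this notebook, each bidirectional layer therefore outputs 128 values per position (or per sequence when `return_sequences` is off).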


In [None]:
inputs = tf.keras.Input(shape=(None,), dtype="int32")
x = tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=embed_size,
                              input_length=max_len)(inputs)


x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True))(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)


outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(loss=tf.losses.BinaryCrossentropy(),
              optimizer='adam',
              metrics=['accuracy'])

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")
model.summary()
model.fit(training_padded,
          y_train,
          epochs=6,
          verbose=1)

# Tokenize and pad the held-out reviews exactly as the training set was processed
test_sequences = tokenizer.texts_to_sequences(x_test)
test_padded = pad_sequences(test_sequences, maxlen=max_len, truncating='post')

test_loss_score, test_acc_score = model.evaluate(test_padded, y_test)

print(f'Final Loss: {round(test_loss_score, 2)}.')
print(f'Final Performance: {round(test_acc_score * 100, 2)} %.')
# A forward slash avoids the invalid "\s" escape sequence and works on every OS
model.save("models/senti_model_sigmoid.h5")
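
The parameter counts printed by `model.summary()` can be checked by hand. The sketch below applies the standard formulas for embedding, LSTM, and dense layers to the hyperparameters used in this notebook, assuming the Keras defaults (biases enabled, no extra LSTM options). The helper names are hypothetical.

```python
def embedding_params(vocab_size, embed_size):
    """One embed_size-dimensional vector per vocabulary index."""
    return vocab_size * embed_size


def lstm_params(input_dim, units):
    """Four gates, each with an input kernel, a recurrent kernel, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def bidirectional_lstm_params(input_dim, units):
    """Two independent LSTMs, one per direction."""
    return 2 * lstm_params(input_dim, units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units


vocab_size, embed_size, lstm_units = 3000, 50, 64

counts = {
    "embedding": embedding_params(vocab_size, embed_size),
    # The first Bi-LSTM reads the 50-dimensional embeddings
    "bi_lstm_1": bidirectional_lstm_params(embed_size, lstm_units),
    # The second Bi-LSTM reads the 128-dimensional output of the first
    "bi_lstm_2": bidirectional_lstm_params(2 * lstm_units, lstm_units),
    # Single sigmoid unit on top of the 128-dimensional final state
    "dense": dense_params(2 * lstm_units, 1),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:>10}: {n:,}")
print(f"{'total':>10}: {total:,}")
```

Roughly half of the parameters sit in the embedding table, so `vocab_size` and `embed_size` are the first knobs to turn when the model needs to shrink.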


**Model 2: `softmax` output**

In [5]:
inputs = tf.keras.Input(shape=(None,), dtype="int32")
x = tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=embed_size,
                              input_length=max_len)(inputs)


x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True))(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)


outputs = tf.keras.layers.Dense(2, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")
model.summary()
model.fit(training_padded,
          y_train,
          validation_split= 0.2,
          epochs=6,
          verbose=1)

# Tokenize and pad the held-out reviews exactly as the training set was processed
test_sequences = tokenizer.texts_to_sequences(x_test)
test_padded = pad_sequences(test_sequences, maxlen=max_len, truncating='post')

test_loss_score, test_acc_score = model.evaluate(test_padded, y_test)

print(f'Final Loss: {round(test_loss_score, 2)}.')
print(f'Final Performance: {round(test_acc_score * 100, 2)} %.')
# A forward slash avoids the invalid "\s" escape sequence and works on every OS
model.save("models/senti_model_softmax.h5")
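
The two models differ only in their output head: one sigmoid unit versus a two-way softmax. For a binary task the two can represent the same probabilities, because the softmax probability of class 1 equals the sigmoid of the difference between the two logits. The sketch below checks this numerically with made-up logits; the helper names are hypothetical, and reading label 1 as "positive" is an assumption about the dataset's encoding.

```python
import math


def sigmoid(z):
    """Logistic function: maps a single logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_sigmoid(p, threshold=0.5):
    """Sigmoid head: one probability, thresholded to get the class."""
    return int(p > threshold)


def label_from_softmax(probs):
    """Softmax head: one probability per class, take the largest."""
    return max(range(len(probs)), key=probs.__getitem__)


z0, z1 = 0.3, 1.7                      # made-up logits for classes 0 and 1
p_softmax = softmax([z0, z1])[1]       # P(class 1) from the softmax head
p_sigmoid = sigmoid(z1 - z0)           # same quantity from a single logit

print(abs(p_softmax - p_sigmoid) < 1e-12)
print(label_from_softmax(softmax([z0, z1])), label_from_sigmoid(p_sigmoid))
```

In practice the difference is mostly in how predictions are consumed: the sigmoid model returns one score per review, while the softmax model returns a pair of class probabilities that sum to one.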

