<a href="https://www.kaggle.com/code/briankhor/amazon-review-sentiment-analysis?scriptVersionId=110733049" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Amazon Review Sentiment Analysis

The Amazon review dataset consists of the text of customer reviews, each classified as positive or negative. The dataset is expected to be largely balanced, since Amazon product pages carry many ratings, both positive and negative. This makes it a good case study for applying a natural language processing (NLP) workflow.

We will start off by loading the necessary Python libraries.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.python.keras import models, layers, optimizers
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences
import bz2
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split
import re

%matplotlib inline

import os
print(os.listdir("../input"))

['amazonreviews']


## Reading the text (converting raw binary strings to strings that can be parsed)

The dataset ships as bz2-compressed files, so each line is read as raw bytes and must be decoded into a string before it can be parsed. Each line consists of a fastText-style label (`__label__1` or `__label__2`) followed by the review text. The label is converted to a number (0 for bad, covering ratings 1-2, and 1 for good, covering ratings 4-5) and the rest of the line is kept as the text.

In [2]:
def separate_label_and_text(file):
    labels = []
    texts = []
    for line in bz2.BZ2File(file):
        x = line.decode("utf-8")
        # Each line starts with "__label__N "; the digit N sits at index 9
        labels.append(int(x[9]) - 1)  # map labels 1/2 to 0/1
        texts.append(x[10:].strip())
    return np.array(labels), texts

train_labels, train_texts = separate_label_and_text('../input/amazonreviews/train.ft.txt.bz2')
test_labels, test_texts = separate_label_and_text('../input/amazonreviews/test.ft.txt.bz2')
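
The parsing step can be tried on a tiny in-memory sample without downloading the full dataset. The sketch below is illustrative only: the two sample reviews and the helper name `parse_labelled_line` are made up, and it assumes the `__label__N ` prefix that the fixed-index slicing above relies on. It splits on the first space instead of using fixed offsets.

```python
import bz2
import io


def parse_labelled_line(raw_line):
    """Split one '__label__N text' line (bytes) into a (0/1 label, text) pair."""
    line = raw_line.decode("utf-8")
    prefix, text = line.split(" ", 1)
    label = int(prefix[len("__label__"):]) - 1  # 1 -> 0 (bad), 2 -> 1 (good)
    return label, text.strip()


# Hypothetical sample data, compressed in memory to mimic the .bz2 files
sample = (
    "__label__2 Great book: Could not put it down.\n"
    "__label__1 Broke in a week: Very disappointed.\n"
).encode("utf-8")
compressed = io.BytesIO(bz2.compress(sample))

with bz2.BZ2File(compressed) as handle:
    parsed = [parse_labelled_line(raw) for raw in handle]

labels = [label for label, _ in parsed]
texts = [text for _, text in parsed]
print(labels)
print(texts)
```

Splitting on the first space is a little more forgiving than indexing position 9, but it still assumes single-digit labels in the range 1-2.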

## Text Processing

Next, we need to clean the review text written by users. We lowercase all letters and replace every non-word character with a space. In this rather simplified text processing, we also remove any remaining non-ASCII characters, such as letters with accents.

In [3]:
NON_ALPHANUM = re.compile(r'[\W]')
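# Keep lowercase ASCII letters, digits and whitespace; everything else
# (for example accented letters) is dropped
NON_ASCII = re.compile(r'[^a-z0-9\s]')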

def text_processing(texts):
    processed_texts = []
    for text in texts:
        lower = text.lower()
        no_punctuation = NON_ALPHANUM.sub(r' ', lower)
        no_non_ascii = NON_ASCII.sub(r'', no_punctuation)
        processed_texts.append(no_non_ascii)
    return processed_texts

train_texts = text_processing(train_texts)
test_texts = text_processing(test_texts)
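
To see what the two substitutions do, it helps to run them on a single string. The following is a minimal sketch with a made-up review; the names `NON_WORD`, `NOT_PLAIN`, and `clean` are hypothetical, and the second pattern uses the digit range `0-9` on the assumption that all digits are meant to be kept.

```python
import re

NON_WORD = re.compile(r"\W")            # anything that is not a letter, digit, or underscore
NOT_PLAIN = re.compile(r"[^a-z0-9\s]")  # assumption: keep a-z, all digits, and whitespace


def clean(text):
    lowered = text.lower()
    spaced = NON_WORD.sub(" ", lowered)  # punctuation becomes a space
    return NOT_PLAIN.sub("", spaced)     # accented letters are dropped outright


# Hypothetical review containing accents and punctuation
sample = "Tr\u00e8s bien! It's 5/5, caf\u00e9-worthy."
cleaned = clean(sample)
print(cleaned.split())
```

Accented letters are deleted rather than transliterated, so a word such as "café" becomes "caf". That is acceptable for a quick baseline but loses information on non-English reviews.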

## Splitting Train/Dev Set

We already have separate training and test sets, but we further split the training set into a training set (80%) and a development set (20%).

In [4]:
train_texts, dev_texts, train_labels, dev_labels = train_test_split(train_texts, train_labels, random_state = 328413, test_size = 0.2)
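
The idea behind the split is a seeded shuffle followed by a cut. The sketch below shows this in plain Python on ten made-up reviews; `split_train_dev` is a hypothetical helper, not scikit-learn's implementation, and it will not reproduce the same partition as `train_test_split`.

```python
import random


def split_train_dev(texts, labels, dev_fraction=0.2, seed=0):
    """Shuffle indices with a fixed seed, then take the first dev_fraction as the dev set."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_dev = int(round(len(indices) * dev_fraction))
    dev_idx, train_idx = indices[:n_dev], indices[n_dev:]

    def pick(items, idx):
        return [items[i] for i in idx]

    return (pick(texts, train_idx), pick(texts, dev_idx),
            pick(labels, train_idx), pick(labels, dev_idx))


# Hypothetical data: ten reviews with alternating labels
texts = ["review {}".format(i) for i in range(10)]
labels = [i % 2 for i in range(10)]

train_t, dev_t, train_l, dev_l = split_train_dev(texts, labels)
print(len(train_t), len(dev_t))
```

Fixing the seed keeps the partition stable between runs, which is the role `random_state` plays in the cell above.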

We will now fit a tokenizer on the training texts and keep only the most frequent words as features, with the vocabulary size capped at 12000.

In [5]:
MAX_FEATURES = 12000
tokenizer = Tokenizer(num_words=MAX_FEATURES)
tokenizer.fit_on_texts(train_texts)
train_texts = tokenizer.texts_to_sequences(train_texts)
dev_texts = tokenizer.texts_to_sequences(dev_texts)
test_texts = tokenizer.texts_to_sequences(test_texts)
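
Conceptually, the tokenizer ranks words by frequency on the training texts and replaces each word with its rank, dropping words outside the cap. The following is a simplified pure-Python sketch of that idea, not the Keras implementation; `fit_vocabulary` and `to_sequences` are hypothetical names and the three sample texts are made up.

```python
from collections import Counter


def fit_vocabulary(texts, num_words):
    """Map each word to its frequency rank (1 = most frequent), keeping ranks below num_words."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common()]
    # Index 0 is left unused so it can serve as the padding value later
    return {word: rank for rank, word in enumerate(ranked, start=1) if rank < num_words}


def to_sequences(texts, vocabulary):
    """Replace words with their indices; words outside the vocabulary are dropped."""
    return [[vocabulary[w] for w in text.split() if w in vocabulary] for text in texts]


# Hypothetical, already-cleaned training texts
sample_texts = ["good good book", "bad book", "good read"]
vocabulary = fit_vocabulary(sample_texts, num_words=3)
sequences = to_sequences(sample_texts, vocabulary)
print(vocabulary)
print(sequences)
```

The vocabulary is fitted on the training texts only and then reused for the development and test texts, as in the cell above, so that no information from held-out data leaks into the features.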

## Padding sequences

To use batches effectively, we pad shorter reviews so that every sequence has the same length as the longest review in the training set. Longer reviews in the development and test sets are truncated to that length.

In [6]:
MAX_LENGTH = max(len(train_ex) for train_ex in train_texts)
train_texts = pad_sequences(train_texts, maxlen = MAX_LENGTH)
dev_texts = pad_sequences(dev_texts, maxlen = MAX_LENGTH)
test_texts = pad_sequences(test_texts, maxlen = MAX_LENGTH)
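
With its default arguments, `pad_sequences` adds zeros at the front of short sequences and cuts long sequences from the front. The sketch below mirrors that default behaviour in plain Python for illustration; `pad_pre` is a hypothetical helper and the sample sequences are made up.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` tokens (assumes maxlen > 0)."""
    padded = []
    for seq in sequences:
        trimmed = seq[-maxlen:]  # drop tokens from the front if too long
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded


# Hypothetical token sequences of different lengths
sample_sequences = [[5, 3, 8], [7], [2, 9]]
longest = max(len(seq) for seq in sample_sequences)

print(pad_pre(sample_sequences, maxlen=longest))
print(pad_pre(sample_sequences, maxlen=2))
```

Padding to the longest review means that a single very long review sets the width of the whole array. Choosing a shorter fixed length is a common trade-off when memory is tight, at the cost of truncating some reviews.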

## Recurrent Neural Network (RNN) Model

We are now ready to apply the RNN model to our dataset. We will use a simple model with an embedding layer, two GRU layers, and two dense layers, followed by the output layer.

In [7]:
def make_rnn_model():
    sequences = layers.Input(shape=(MAX_LENGTH, ))
    embedded = layers.Embedding(MAX_FEATURES, 64)(sequences)
    x = layers.CuDNNGRU(128, return_sequences=True)(embedded)
    x = layers.CuDNNGRU(128)(x)
    x = layers.Dense(32, activation='relu')(x)
    x = layers.Dense(100, activation='relu')(x)
    predictions = layers.Dense(1, activation='sigmoid')(x)
    model = models.Model(inputs = sequences, outputs = predictions)
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['binary_accuracy'])
    return model

rnn_model = make_rnn_model()
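
As a rough sanity check on the model size, the number of trainable parameters can be estimated by hand. The sketch below is an approximation under stated assumptions: it uses the layer sizes from the cell above and assumes the cuDNN GRU layout with two bias vectors per gate. The helper names are hypothetical, and `rnn_model.summary()` remains the authoritative count.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim


def gru_params(input_dim, units):
    # Three gates, each with an input kernel, a recurrent kernel, and
    # (assumption: cuDNN layout) two bias vectors
    return 3 * (input_dim * units + units * units + 2 * units)


def dense_params(input_dim, units):
    return input_dim * units + units  # weights plus one bias per unit


layer_counts = {
    "embedding": embedding_params(12000, 64),
    "gru_1": gru_params(64, 128),
    "gru_2": gru_params(128, 128),
    "dense_1": dense_params(128, 32),
    "dense_2": dense_params(32, 100),
    "output": dense_params(100, 1),
}
total = sum(layer_counts.values())

for name, count in layer_counts.items():
    print("{:<10} {:>8}".format(name, count))
print("{:<10} {:>8}".format("total", total))
```

Under these assumptions, the embedding table accounts for most of the parameters, so the vocabulary cap has a larger effect on model size than the recurrent layers do.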


We now fit the RNN model on the training set and validate it on the development set.

In [8]:
rnn_model.fit(train_texts, train_labels, batch_size=256, epochs=1, validation_data=(dev_texts, dev_labels))

<tensorflow.python.keras.callbacks.History at 0x7fe6a7164e90>

Finally, we evaluate the model on the test set.

In [9]:
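# Predicted probabilities, flattened from shape (n, 1) to (n,)
preds = rnn_model.predict(test_texts).ravel()
# Threshold at 0.5 to get hard 0/1 labels for accuracy and F1
pred_labels = (preds > 0.5).astype(int)
print('Accuracy Score: {}'.format(accuracy_score(test_labels, pred_labels)))
print('F1 Score: {}'.format(f1_score(test_labels, pred_labels)))
# ROC AUC is computed from the raw probabilities, not the thresholded labels
print('ROC AUC Score: {}'.format(roc_auc_score(test_labels, preds)))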


Accuracy Score: 0.94241
F1 Score: 0.9407954930967483
ROC AUC Score: 0.9871339611000001
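
To make the three metrics concrete, the sketch below computes them by hand on six made-up predictions. It is a plain-Python illustration rather than scikit-learn's implementation; `binary_metrics` and `pairwise_auc` are hypothetical names, and the pairwise AUC is quadratic in the number of samples, so it only suits small examples.

```python
def binary_metrics(y_true, scores, threshold=0.5):
    """Return (accuracy, f1) after thresholding the scores."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1]

accuracy, f1 = binary_metrics(y_true, scores)
auc = pairwise_auc(y_true, scores)
print(round(accuracy, 4), round(f1, 4), round(auc, 4))
```

Accuracy and F1 depend on the 0.5 threshold, while ROC AUC depends only on how the scores rank positives against negatives. This is why the AUC can be noticeably higher than the accuracy for the same model.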


## References

1. [Sentiment Analysis with Amazon Reviews](http://www.kaggle.com/code/muonneutrino/sentiment-analysis-with-amazon-reviews/notebook)

2. [CuDNNLSTM Implementation (93.7% Accuracy)](http://www.kaggle.com/code/anshulrai/cudnnlstm-implementation-93-7-accuracy/notebook)