**Initialization**
* I put these three lines of code at the top of each of my Notebooks. The first two reload edited modules automatically, which helps prevent problems when reloading and reworking a Project or Problem. The third line renders visualizations inline within the Notebook.

In [1]:
#@ Initialization: 
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**Downloading the Libraries and Dependencies**

In [3]:
#@ Downloading the Libraries and Dependencies:
from __future__ import absolute_import, division
from __future__ import print_function, unicode_literals
from IPython.display import display

# try:
#   !pip uninstall tb-nightly tensorboardX tensorboard
#   !pip install tf-nightly
# except Exception:
#   pass 

import tensorflow as tf
import os
import datetime
import tensorflow_datasets as tfds

from tensorflow.keras.models import Sequential                                       # Keras Sequential Model.
from tensorflow.keras.layers import Dense, Bidirectional, LSTM, Dropout, Embedding   # Keras Layers.

%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [4]:
#@ Inspecting the installed Tensorboard Plugins:
import pkg_resources
for entry_point in pkg_resources.iter_entry_points("tensorboard_plugins"):
  print(entry_point.dist)

tensorboard 2.3.0
tensorboard-plugin-wit 1.7.0


In [6]:
#@ Uninstalling the Tensorboard Colab:
# !rm -r /usr/local/lib/python3.6/dist-packages/tensorboardcolab-0.0.22.dist-info

**Getting the Data**
* I used Google Colab for this Project, so the process of downloading and reading the Data might differ on other platforms. I will use the **Amazon Reviews Mobile Electronics Dataset** for this Project. The Dataset is already available in the TensorFlow Datasets catalog, so I will load it from there.

In [8]:
#@ Getting the Dataset:
dataset, info = tfds.load("amazon_us_reviews/Mobile_Electronics_v1_00", with_info=True)
train_dataset = dataset["train"]

#@ Inspecting the Information of the Dataset:
display(info)

#@ Inspecting the Dataset:
display(train_dataset)
display(len(list(train_dataset)))

tfds.core.DatasetInfo(
    name='amazon_us_reviews',
    version=0.1.0,
    description='Amazon Customer Reviews (a.k.a. Product Reviews) is one of Amazons iconic products. In a period of over two decades since the first review in 1995, millions of Amazon customers have contributed over a hundred million reviews to express opinions and describe their experiences regarding products on the Amazon.com website. This makes Amazon Customer Reviews a rich source of information for academic researchers in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and Machine Learning (ML), amongst others. Accordingly, we are releasing this data to further research in multiple disciplines related to understanding customer product experiences. Specifically, this dataset was constructed to represent a sample of customer evaluations and opinions, variation in the perception of a product across geographical regions, and promotional intent or bias in reviews.

Over 130+ million cus

<DatasetV1Adapter shapes: {data: {customer_id: (), helpful_votes: (), marketplace: (), product_category: (), product_id: (), product_parent: (), product_title: (), review_body: (), review_date: (), review_headline: (), review_id: (), star_rating: (), total_votes: (), verified_purchase: (), vine: ()}}, types: {data: {customer_id: tf.string, helpful_votes: tf.int32, marketplace: tf.string, product_category: tf.string, product_id: tf.string, product_parent: tf.string, product_title: tf.string, review_body: tf.string, review_date: tf.string, review_headline: tf.string, review_id: tf.string, star_rating: tf.int32, total_votes: tf.int32, verified_purchase: tf.int64, vine: tf.int64}}>

104975

**Processing the Dataset**

In [9]:
#@ Parameters for Training the Dataset:
BUFFER_SIZE = 30000                       # Number of elements held in the buffer when Shuffling.
BATCH_SIZE = 128                          # Number of samples per Batch fed into the Network.

#@ Processing the Dataset:
train_dataset = train_dataset.shuffle(BUFFER_SIZE, reshuffle_each_iteration=False)  # Shuffling the Dataset.

#@ Inspecting the Dataset using Tensorflow:
for reviews in train_dataset.take(10):
  review_text = reviews["data"]                                                     # "data" is the key of Dataset.
  print(review_text.get("review_body").numpy())                                     # Converting the Tensors into Numpy arrays.
  print(review_text.get("star_rating"))                                             # Inspecting the Ratings.
  print(tf.where(review_text.get("star_rating")>3,1,0).numpy())                     # Rating greater than 3 is 1 and else 0.

b"Purchased this unit for a weekend camping trip. Charged the unit by USB initially, since I didn't have 13 hours to solar charge ahead of time. Although quite bulky, unit was able to power my T-Mobile Amaze once (full charge) and a G1 3/4 charge. Probably would have been better if I had turned off my phone during charging instead of leaving it on; probably drained more charge that way. My only concern is durability, but only time will tell..."
tf.Tensor(4, shape=(), dtype=int32)
1
b"This is the third Boxwave case I have purchased for the Nook (and I have purchased Boxwave cases for other devices) and while I LOVE the ones I have received so far this newest one is UNUSABLE because of the overpowering smell of the adhesive used for the backing. I have had it for 5 days now and after 2 days OUTSIDE and 3 days sitting near a window at the expense of my heating bill the smell is still unbearable when on the device. I don't know if they rushed the production to keep up with Holiday orders o
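
The shuffle step keeps a fixed-size buffer, emits a random element from it, and refills it from the input. The sketch below imitates that idea in plain Python. It is not TensorFlow's implementation, and `buffered_shuffle` is a hypothetical helper written only for this illustration.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Buffer is full: emit one random element to make room.
            yield buffer.pop(rng.randrange(len(buffer)))
    while buffer:
        # Input exhausted: drain what is left in random order.
        yield buffer.pop(rng.randrange(len(buffer)))

shuffled = list(buffered_shuffle(range(20), buffer_size=5))
print(shuffled)
```

Because the first element emitted always comes from the first `buffer_size` inputs, a buffer smaller than the Dataset (30000 against 104975 reviews here) mixes the order only locally.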

**Preprocessing the Model**
* Now, I will Tokenize the Data and build a Vocabulary from it. The Vocabulary is the set of unique words (tokens) present in the Dataset.

In [10]:
#@ Preprocessing the Model: Tokenization:
tokenizer = tfds.features.text.Tokenizer()                                         # Instantiating the Tokenizer.

vocabulary_set = set()                                                             # A set keeps only one copy of each token.
for _, reviews in train_dataset.enumerate():
  review_text = reviews["data"]                                                    # "data" is the key of the Dataset.
  reviews_tokens = tokenizer.tokenize(review_text.get("review_body").numpy())      # Tokenizing the body of the Dataset.
  vocabulary_set.update(reviews_tokens)

#@ Inspecting the Vocabulary:
vocab_size = len(vocabulary_set)                                                   # Inspecting the length or size of Vocabulary.
vocab_size 

73738
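
As a rough illustration of what the loop above does, the sketch below builds a vocabulary from two made-up reviews in plain Python. The regex tokenizer is an assumption standing in for the tfds Tokenizer, and `simple_tokenize` and `sample_reviews` are hypothetical names used only for this example.

```python
import re

def simple_tokenize(text):
    """Split text into alphanumeric tokens (a stand-in for the tfds Tokenizer)."""
    return re.findall(r"\w+", text)

# Two made-up reviews, used only for illustration.
sample_reviews = [
    "Great case, great price!",
    "The case broke after two days.",
]

vocabulary = set()
for review in sample_reviews:
    vocabulary.update(simple_tokenize(review))   # A set keeps one copy of each token.

print(sorted(vocabulary))
print(len(vocabulary))
```

Note that "Great" and "great" count as two different tokens, since nothing here lowercases the text.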

* Now, I will encode the Vocabulary set into numerical values. I will use the Text Encoder, which takes the Tokenized words and assigns each one a particular integer.

In [11]:
#@ Preprocessing the Model: Encoding:
encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)                       # Encoding the Vocabulary set. 
   
for reviews in train_dataset.take(10):
  review_text = reviews["data"]                                                     # "data" is the key of the Dataset.
  print(review_text.get("review_body").numpy())
  encoded_example = encoder.encode(review_text.get("review_body").numpy())          # Encoding the Dataset.
  print(encoded_example)                                                            # Inspecting the Encoded Dataset.

b"Purchased this unit for a weekend camping trip. Charged the unit by USB initially, since I didn't have 13 hours to solar charge ahead of time. Although quite bulky, unit was able to power my T-Mobile Amaze once (full charge) and a G1 3/4 charge. Probably would have been better if I had turned off my phone during charging instead of leaving it on; probably drained more charge that way. My only concern is durability, but only time will tell..."
[29929, 60000, 14092, 30618, 17285, 55238, 30077, 59482, 20922, 9736, 14092, 25431, 63661, 21334, 26227, 71324, 43425, 60158, 31155, 47528, 8305, 63524, 52404, 55874, 36375, 67363, 44517, 51708, 50827, 70629, 14092, 39983, 39617, 63524, 36233, 20624, 25332, 2999, 27730, 1108, 42085, 55874, 9830, 17285, 50305, 28335, 15280, 55874, 7063, 70763, 31155, 56174, 24166, 72255, 71324, 18059, 73556, 15911, 20624, 39124, 15800, 11819, 9136, 67363, 15503, 4208, 12304, 57212, 1038, 21213, 55874, 23269, 7091, 44360, 9737, 9648, 2842, 70617, 21165, 9737, 4451

In [12]:
#@ Inspecting the Encoding:
for index in encoded_example:
  print("{} ----> {}".format(index, encoder.decode([index])))                  # Inspecting in one particular Example.

71324 ----> I
66944 ----> bought
60000 ----> this
3446 ----> camera
57045 ----> after
57525 ----> checking
72810 ----> some
69943 ----> picture
12304 ----> on
60000 ----> this
59507 ----> product
19371 ----> br
56324 ----> But
9736 ----> the
69943 ----> picture
9830 ----> and
63715 ----> current
3446 ----> camera
2842 ----> is
10436 ----> differnce
30618 ----> for
10899 ----> example
38235 ----> AV
47966 ----> OUT
2842 ----> is
55909 ----> unavailable
19371 ----> br
71324 ----> I
3857 ----> hope
63524 ----> to
58302 ----> change
9736 ----> the
69943 ----> picture
68405 ----> soon
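
A minimal sketch of the idea behind a token encoder: each vocabulary word gets a fixed positive integer and 0 is kept free for padding. This is not the `TokenTextEncoder` implementation. `build_index`, `encode_tokens`, and `decode_ids` are hypothetical helpers, and mapping every unknown word to one extra id is an assumption of this example.

```python
def build_index(vocabulary):
    """Assign ids 1..N to the sorted vocabulary; 0 stays free for padding."""
    return {token: i for i, token in enumerate(sorted(vocabulary), start=1)}

def encode_tokens(tokens, token_to_id):
    """Map tokens to ids; unseen tokens share one extra id (len(vocab) + 1)."""
    unknown_id = len(token_to_id) + 1
    return [token_to_id.get(token, unknown_id) for token in tokens]

def decode_ids(ids, token_to_id):
    """Map ids back to tokens; ids outside the vocabulary become "<UNK>"."""
    id_to_token = {i: t for t, i in token_to_id.items()}
    return [id_to_token.get(i, "<UNK>") for i in ids]

token_to_id = build_index({"this", "camera", "is", "good"})
ids = encode_tokens(["this", "camera", "is", "bad"], token_to_id)
print(ids)
print(decode_ids(ids, token_to_id))
```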


**Preprocessing the Model**
* Now, I will Encode the entire Dataset.

In [13]:
#@ Preprocessing the Model:
def encode(text_tensor, label_tensor):
  encoded_text = encoder.encode(text_tensor.numpy())        
  label = tf.where(label_tensor > 3,1,0)
  return encoded_text, label

def encode_map_fn(tensor):
  text = tensor["data"].get("review_body")                        # Accessing the review body from Tensor.
  label = tensor["data"].get("star_rating")                        # Accessing the ratings from Tensor.
  encoded_text, label = tf.py_function(encode,
                                       inp=[text, label],
                                       Tout=(tf.int64, tf.int32))
  encoded_text.set_shape([None])                                   # Restoring the static shape lost by py_function: 1-D, variable length.
  label.set_shape([])                                              # Restoring the static shape lost by py_function: scalar.
  return encoded_text, label

#@ Encoding the Dataset:
encoded_data = train_dataset.map(encode_map_fn)                    # Encoding all the Dataset.

#@ Inspecting the Encoded Dataset:
for f0, f1 in encoded_data.take(3):
  print(f0)                                                        # Encoded text review.
  print(f1)                                                        # Encoded label.

tf.Tensor(
[29929 60000 14092 30618 17285 55238 30077 59482 20922  9736 14092 25431
 63661 21334 26227 71324 43425 60158 31155 47528  8305 63524 52404 55874
 36375 67363 44517 51708 50827 70629 14092 39983 39617 63524 36233 20624
 25332  2999 27730  1108 42085 55874  9830 17285 50305 28335 15280 55874
  7063 70763 31155 56174 24166 72255 71324 18059 73556 15911 20624 39124
 15800 11819  9136 67363 15503  4208 12304 57212  1038 21213 55874 23269
  7091 44360  9737  9648  2842 70617 21165  9737 44517 11589 37241], shape=(83,), dtype=int64)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(
[ 2082  2842  9736 45650 42477 13934 71324 31155 21155 30618  9736 36971
  9830 71324 31155 21155 42477 29563 30618 24971 32441  9830 39508 71324
 39788  9736 35850 71324 31155 29272 55691 46417 60000 62858 25655  2842
 37016 64324 67363  9736 65991  3102 67363  9736 67311 10117 30618  9736
 13770 71324 31155 18059  4208 30618 61200 73453 56438  9830 57045 55079
 73453 29227  9830 28335 73453 62659 20382 1

**Splitting the Dataset**
* Now, the Text Dataset is Tokenized and Encoded into Integers and is ready for Training the Model. So, I will split the Dataset into a Training set and a Testing or Validation set. I will take the first 10000 Encoded examples for Testing and use the remaining examples for Training the Model. I will also pad the Data so that every sequence in a Batch has the same length.

In [14]:
#@ Splitting the Dataset:
TAKE_SIZE = 10000

#@ Training Dataset:
train_data = encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)           # Skipping the first 10000 Encoded examples.
train_data = train_data.padded_batch(BATCH_SIZE)                         # Padding each Batch to its longest sequence.

#@ Testing or Validation Dataset:
test_data = encoded_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(BATCH_SIZE)                           # Padding each Batch to its longest sequence.

#@ Inspecting the Test Data:
sample_text, sample_label = next(iter(test_data))
sample_text[0], sample_label[0]                                          # Inspecting the Test Data.

(<tf.Tensor: shape=(676,), dtype=int64, numpy=
 array([29929, 60000, 14092, 30618, 17285, 55238, 30077, 59482, 20922,
         9736, 14092, 25431, 63661, 21334, 26227, 71324, 43425, 60158,
        31155, 47528,  8305, 63524, 52404, 55874, 36375, 67363, 44517,
        51708, 50827, 70629, 14092, 39983, 39617, 63524, 36233, 20624,
        25332,  2999, 27730,  1108, 42085, 55874,  9830, 17285, 50305,
        28335, 15280, 55874,  7063, 70763, 31155, 56174, 24166, 72255,
        71324, 18059, 73556, 15911, 20624, 39124, 15800, 11819,  9136,
        67363, 15503,  4208, 12304, 57212,  1038, 21213, 55874, 23269,
         7091, 44360,  9737,  9648,  2842, 70617, 21165,  9737, 44517,
        11589, 37241,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,  
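
Padding fills every sequence in a Batch with zeros up to the length of the longest sequence in that Batch, which is why the 83-token sample above ends in a run of zeros. A plain-Python sketch of that behaviour, with `pad_batch` as a hypothetical helper:

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad each sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

batch = [[5, 8, 2], [7], [4, 1, 9, 3, 6]]
padded = pad_batch(batch)
for row in padded:
    print(row)
```

Since the target length is decided per Batch, two different Batches can end up with different padded lengths.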

In [15]:
#@ Inspecting the Distribution of Positive and Negative Reviews:
for f0, f1 in test_data.take(10):
  print(tf.unique_with_counts(f1)[2].numpy())

[96 32]
[44 84]
[44 84]
[86 42]
[42 86]
[96 32]
[86 42]
[47 81]
[84 44]
[40 88]


* The positive and negative sentiment reviews are not equally distributed.
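
One common way to account for an imbalance like this is to weight each class inversely to its frequency. The sketch below computes such weights from a list of labels. The labels are made up, `class_weights` is a hypothetical helper, and this Notebook itself does not apply any weighting.

```python
from collections import Counter

def class_weights(labels):
    """Weight each class by total / (num_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

# Made-up labels: 3 negative (0) and 9 positive (1) reviews.
labels = [0] * 3 + [1] * 9
weights = class_weights(labels)
print(weights)
```

A dictionary of this form could, as an assumption about how one might extend the Notebook, be passed to Keras `fit` through its `class_weight` argument.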

**Long Short Term Memory**
* Long Short Term Memory or LSTM is an Artificial Recurrent Neural Network or RNN architecture used in the field of Deep Learning. Unlike standard Feedforward Neural Networks, LSTM has Feedback connections. It can process not only single data points but also entire sequences of data such as Speech or Video. Now, the Dataset is ready, so I will build the Neural Network.

In [16]:
#@ Long Short Term Memory or LSTM:
vocab_size = vocab_size + 1                                   # Adding 1 because index 0 is reserved for padding.

model = Sequential()                                          # Standard Model Definition for Keras.
model.add(Embedding(                                          # Adding the Embedding Layer.
    vocab_size, 128     
))
model.add(Bidirectional(LSTM(                                 # Adding the Bidirectional LSTM Layer.
    128, return_sequences=True
)))
model.add(Bidirectional(LSTM(                                 # Adding another LSTM Layer.
    64, return_sequences=True
)))
model.add(Bidirectional(LSTM(                                 # Adding the third LSTM Layer.
    64, return_sequences=False
)))
model.add(Dense(                                              # Adding the Dense Layer
    64, activation="relu"
))
model.add(Dense(                                              # Adding another Dense Layer.
    64, activation="relu"
))
model.add(Dense(                                              # Adding the Output Layer.
    1, activation="sigmoid"
))

#@ Inspecting the Summary of the Model:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 128)         9438592   
_________________________________________________________________
bidirectional (Bidirectional (None, None, 256)         263168    
_________________________________________________________________
bidirectional_1 (Bidirection (None, None, 128)         164352    
_________________________________________________________________
bidirectional_2 (Bidirection (None, 128)               98816     
_________________________________________________________________
dense (Dense)                (None, 64)                8256      
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 6
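
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for Embedding, LSTM, and Dense layers. The helper names are hypothetical, and the vocabulary size of 73739 is the 73738 tokens found earlier plus one.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim

def bilstm_params(input_dim, units):
    """Two directions, four gates each: input weights, state weights, and a bias."""
    one_direction = 4 * (units * (input_dim + units) + units)
    return 2 * one_direction

def dense_params(input_dim, units):
    """A weight matrix plus one bias per unit."""
    return input_dim * units + units

print(embedding_params(73739, 128))   # embedding
print(bilstm_params(128, 128))        # bidirectional
print(bilstm_params(256, 64))         # bidirectional_1
print(bilstm_params(128, 64))         # bidirectional_2
print(dense_params(128, 64))          # dense
print(dense_params(64, 64))           # dense_1
```

Each Bidirectional layer outputs twice its number of units, which is why the second LSTM sees 256 inputs and the third sees 128.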

In [17]:
#@ Processing the Model:
# !rm -r /tmp/logs/                                                                     # Cleaning the tmp logs.

logdir = os.path.join("/tmp/logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))   # Creating the log directory.
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)         # Captures the progress of the Model.
checkpointer = tf.keras.callbacks.ModelCheckpoint(
    filepath="/tmp/sentiment.hdf5", verbose=1, save_best_only=True                      # Saves only the best Model.
)

#@ Compiling the LSTM Neural Network:
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),                         # Output Layer already applies Sigmoid.
    optimizer="adam",
    metrics=["accuracy"]
)

#@ Training the LSTM Neural Network:
history = model.fit(
    train_data, epochs=2,
    validation_data=test_data,
    callbacks=[tensorboard_callback, checkpointer]
)
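
The `from_logits` argument of the loss tells it whether the Model outputs raw scores or probabilities. The plain-Python sketch below is not the Keras implementation, and its function names are hypothetical. It shows how the loss value changes when a probability that already went through a Sigmoid is treated as a raw score and squashed a second time.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_from_probability(y_true, p):
    """Binary cross-entropy when p is already a probability."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def bce_from_logit(y_true, z):
    """Binary cross-entropy when z is a raw score: apply the sigmoid first."""
    return bce_from_probability(y_true, sigmoid(z))

p = 0.9                                  # A confident, correct prediction for label 1.
print(bce_from_probability(1, p))        # Loss when 0.9 is read as a probability.
print(bce_from_logit(1, p))              # Loss when 0.9 is misread as a raw score.
```

With a Sigmoid Output Layer the Model already produces probabilities, so reading them as probabilities is the matching interpretation. Misreading them as raw scores confines the effective prediction to roughly 0.5 to 0.73 and inflates the loss for confident predictions.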