# Sentiment Analysis

We've already demonstrated how to train a character-level RNN to create original text. In this chapter, we create a **word-level** model to analyze sentiment.

Sentiment analysis is a common NLP task. **Sentiment analysis** computationally identifies and categorizes opinions expressed in a text corpus to determine attitude or sentiment. Typically, sentiment analysis is used to determine a positive, negative or neutral opinion towards a particular topic or product. This technique is widely applied to reviews, surveys, and documents.

Resource:

https://www.tensorflow.org/tutorials/keras/text_classification

# IMDb Dataset

A popular dataset used to practice NLP is the IMDb reviews dataset. **IMDb** is a benchmark dataset for binary sentiment classification. The dataset contains 50,000 movie reviews labeled as either positive (1) or negative (0). In the version bundled with Keras, reviews are preprocessed, with each encoded as a sequence of integer word indexes, and words are indexed by their overall frequency within the dataset. In this chapter, we work with the raw text files instead. The 50,000 reviews are split into 25,000 for training and 25,000 for testing. So, we can train a classifier to predict whether a review is positive or negative.

IMDb is popular because it is simple to use, relatively easy to process, and challenging enough for machine learning aficionados. We enjoy working with IMDb because it's just fun to work with movie data.

# Import **tensorflow** Library

Import the tensorflow library and alias it as **tf**:

In [1]:
import tensorflow as tf

# GPU Hardware Accelerator

To vastly speed up processing, we can use the GPU available from the Google Colab cloud service. Colab provides a free Tesla K80 GPU with about 12 GB of memory. It’s very easy to enable the GPU in a Colab notebook:

1. click **Runtime** in the top left menu
2. click **Change runtime type** from the drop-down menu
3. choose **GPU** from the Hardware accelerator drop-down menu
4. click **SAVE**

Verify that GPU is active:

In [2]:
tf.__version__, tf.test.gpu_device_name()

('2.4.0', '/device:GPU:0')

If '/device:GPU:0' is displayed, the GPU is active. If '' (an empty string) is displayed, no GPU was found and the regular CPU is active.

# Download the IMDB dataset

Download and extract the dataset, then explore the directory structure:

In [3]:
import os

url = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'

dataset = tf.keras.utils.get_file('aclImdb_v1.tar.gz', url,
                                    untar=True, cache_dir='.',
                                    cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


# Explore IMDB

Create a function to display directories:

In [4]:
def see_contents(directory):
  # print the names of the entries in a directory
  print(os.listdir(directory))

Display the contents of 'aclImdb':

In [5]:
see_contents(dataset_dir)

['imdb.vocab', 'train', 'test', 'README', 'imdbEr.txt']


The dataset we downloaded extracts to a directory named 'aclImdb'.

# Explore the Train and Test Directories

Explore the contents of the train directory:

In [6]:
train_dir = os.path.join(dataset_dir, 'train')
# see_contents prints the listing and returns None, so call it directly
see_contents(train_dir)

['unsup', 'urls_neg.txt', 'pos', 'neg', 'urls_pos.txt', 'unsupBow.feat', 'urls_unsup.txt', 'labeledBow.feat']


We now know the contents of the 'aclImdb/train' directory. To explore the contents of the directories, begin by returning an iterator:

In [7]:
from pathlib import Path

for path in Path('.').iterdir():
  print(path)

.config
aclImdb
aclImdb_v1.tar.gz.tar.gz
sample_data


**iterdir()** returns a path iterator over the entries of a directory, in this case the current directory.

Since the current directory contains 'aclImdb', we know that the train path is 'aclImdb/train'. So we can access positive and negative reviews from 'aclImdb/train/pos' and 'aclImdb/train/neg' respectively. Let's see the positive reviews:

In [8]:
for path in Path('aclImdb/train/pos').iterdir():
    print (path)

Streaming output truncated to the last 5000 lines.
aclImdb/train/pos/12482_8.txt
aclImdb/train/pos/11994_10.txt
aclImdb/train/pos/8521_10.txt
aclImdb/train/pos/5493_10.txt
aclImdb/train/pos/8347_10.txt
aclImdb/train/pos/9884_10.txt
aclImdb/train/pos/3033_8.txt
aclImdb/train/pos/9853_8.txt
aclImdb/train/pos/3603_7.txt
aclImdb/train/pos/11047_9.txt
aclImdb/train/pos/5894_7.txt
aclImdb/train/pos/11677_8.txt
aclImdb/train/pos/10895_7.txt
aclImdb/train/pos/5146_9.txt
aclImdb/train/pos/11416_10.txt
aclImdb/train/pos/8410_10.txt
aclImdb/train/pos/7737_10.txt
aclImdb/train/pos/4398_9.txt
aclImdb/train/pos/4560_10.txt
aclImdb/train/pos/11681_9.txt
aclImdb/train/pos/4437_7.txt
aclImdb/train/pos/11042_7.txt
aclImdb/train/pos/3969_8.txt
aclImdb/train/pos/7207_7.txt
aclImdb/train/pos/7887_7.txt
aclImdb/train/pos/4401_9.txt
aclImdb/train/pos/5534_7.txt
aclImdb/train/pos/7296_10.txt
aclImdb/train/pos/184_8.txt
aclImdb/train/pos/3137_8.txt
aclImdb/train/pos/4639_10.txt
aclImdb/train/pos/
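Printing every path floods the notebook, as the truncated output shows. A minimal sketch of an alternative, assuming we only want a preview and a count: the helper name `first_paths` and the throwaway directory are hypothetical and stand in for 'aclImdb/train/pos'.

```python
import tempfile
from itertools import islice
from pathlib import Path


def first_paths(directory, n=5):
    """Return the first n entries of a directory, sorted by name."""
    return list(islice(sorted(Path(directory).iterdir()), n))


# Demo on a throwaway directory standing in for 'aclImdb/train/pos'.
with tempfile.TemporaryDirectory() as tmp:
    for name in ['25_7.txt', '184_8.txt', '3033_8.txt', '3603_7.txt']:
        (Path(tmp) / name).write_text('placeholder review')
    preview = [p.name for p in first_paths(tmp, n=2)]
    total = sum(1 for _ in Path(tmp).iterdir())

print(preview, total)
```

On the real dataset, the same helper could be called as `first_paths('aclImdb/train/pos')`.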

We can also use the function we created to list the positive review files:

In [9]:
see_contents('aclImdb/train/pos')

['25_7.txt', '2348_10.txt', '4392_9.txt', '2822_7.txt', '5810_9.txt', '5583_8.txt', '1804_10.txt', '9340_8.txt', '3792_8.txt', '7915_8.txt', '10168_8.txt', '10212_8.txt', '7130_10.txt', '1476_10.txt', '3520_9.txt', '3775_7.txt', '532_9.txt', '9141_8.txt', '948_10.txt', '5125_7.txt', '7653_10.txt', '4372_10.txt', '8539_8.txt', '8450_7.txt', '12371_8.txt', '7000_7.txt', '4065_10.txt', '2941_10.txt', '8310_10.txt', '1602_9.txt', '476_7.txt', '8817_8.txt', '4580_8.txt', '2214_10.txt', '2613_9.txt', '4211_8.txt', '3060_10.txt', '5379_7.txt', '3013_8.txt', '9803_7.txt', '5111_10.txt', '1652_10.txt', '9165_10.txt', '7674_10.txt', '6850_7.txt', '8728_7.txt', '586_10.txt', '647_10.txt', '2539_10.txt', '11915_10.txt', '7547_10.txt', '2656_10.txt', '10655_9.txt', '5991_8.txt', '6793_10.txt', '10409_10.txt', '4712_7.txt', '7490_8.txt', '730_7.txt', '8769_7.txt', '3934_8.txt', '8055_8.txt', '10762_10.txt', '8830_8.txt', '2412_10.txt', '4645_9.txt', '6267_10.txt', '1311_10.txt', '1827_10.txt', '198_
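The file names appear to follow an `<id>_<rating>.txt` pattern, which matches the convention described in the dataset's README (positive reviews carry ratings of 7 or more). A small sketch for pulling the two numbers out of a name follows; the helper `parse_review_name` is hypothetical and not part of any library.

```python
from pathlib import Path


def parse_review_name(path):
    """Split '<id>_<rating>.txt' into integer (id, rating)."""
    review_id, rating = Path(path).stem.split('_')
    return int(review_id), int(rating)


print(parse_review_name('aclImdb/train/pos/25_7.txt'))  # (25, 7)
```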

Create a list of positive reviews from the iterator, convert the first example to a string, and strip off extraneous directory information:

In [10]:
pos_reviews = list(Path('aclImdb/train/pos').iterdir())
first = str(pos_reviews[0])
first = first[14:]
first

'pos/25_7.txt'
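The slice `[14:]` works because 'aclImdb/train/' happens to be 14 characters long. A sketch of a less brittle alternative, assuming the same directory layout, is to let pathlib strip the prefix. `PurePosixPath` is used here only so the separators are the same on every platform.

```python
from pathlib import PurePosixPath

train_root = PurePosixPath('aclImdb/train')
review_path = PurePosixPath('aclImdb/train/pos/25_7.txt')

# relative_to removes the leading directories without counting characters
relative = review_path.relative_to(train_root)
print(str(relative))  # pos/25_7.txt
```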

Explore the first review:

In [12]:
sample_file = os.path.join(train_dir, first)
with open(sample_file) as f:
  print (f.read())

I never saw this when I was a kid, so this was seen with fresh eyes. I had never heard of it and rented it for my 5 year old daughter. Plus, the idea of Christopher Walken singing and dancing made me curious. The special fx are cheesy and the singing and dancing is mediocre. But the story is great. My daughter was entranced. I loved watching Walken in this role thinking about what the future held for him. Very amusing to see him dance! And if the songs weren't great, at least they weren't Disney over-produced saccharine sweetness. The ogre scene in the beginning was a little scary for her, and she was a little nervous when we saw him again at the end, but it was mostly benign. Interestingly, we had recently read "Puss in Boots", and I had wondered about the implausibility of the story. But while staying true to almost every aspect, Walken's acting made it believable. Great fun. I'd watch it again with my daughter.


Display the first five positive reviews:

In [13]:
for i in range(5):
  # drop the leading 'aclImdb/train/' so the path can be joined to train_dir
  review = str(pos_reviews[i])[14:]
  review = os.path.join(train_dir, review)
  with open(review) as f:
    print('review', i, ':', f.read())

review 0 : I never saw this when I was a kid, so this was seen with fresh eyes. I had never heard of it and rented it for my 5 year old daughter. Plus, the idea of Christopher Walken singing and dancing made me curious. The special fx are cheesy and the singing and dancing is mediocre. But the story is great. My daughter was entranced. I loved watching Walken in this role thinking about what the future held for him. Very amusing to see him dance! And if the songs weren't great, at least they weren't Disney over-produced saccharine sweetness. The ogre scene in the beginning was a little scary for her, and she was a little nervous when we saw him again at the end, but it was mostly benign. Interestingly, we had recently read "Puss in Boots", and I had wondered about the implausibility of the story. But while staying true to almost every aspect, Walken's acting made it believable. Great fun. I'd watch it again with my daughter.
review 1 : This show is quick-witted, colorful, dark yet fun,

Display the first five negative reviews:

In [14]:
neg_reviews = list(Path('aclImdb/train/neg').iterdir())

for i in range(5):
  # drop the leading 'aclImdb/train/' so the path can be joined to train_dir
  review = str(neg_reviews[i])[14:]
  review = os.path.join(train_dir, review)
  with open(review) as f:
    print('review', i, ':', f.read())

review 0 : There was a genie played by Shaq His name was Kazaam, and he was whack His rhymes were corny, this lines were bad some stupid kid cryin over his stupid dad bad actin, bad casting, bad special effects whats next? this movie sucks Prolly didn't make 20 bucks he lives in a boombox not a lamp hurts like a cramp like a wet food stamp...<br /><br />Yeah, you get it, a stupid rhyming genie who can't act, in a stupid movie with horrible special effects. Oh, and its confusing as hell. I'm not even gonna go on. Let's just say, it belongs in the "its so bad, its funny" category. Watch it once with your buddies and get a good laugh. But don't expect anything spectacular.
review 1 : Okay so I went into this movie not really expecting much I figured an action flick similar to The Fast and the Furious. Some nice cars some nice girls somewhat of a decent plot. Unfortunately I would have to say that this was probably the worst movie I have seen this year. Don't get me wrong the cars were nic

# Load the Dataset

Load the data off disk and prepare it into a format suitable for training. To do so, use the helpful text_dataset_from_directory utility, which expects a directory structure as follows:

main_directory/
* ...class_a/
* ......a_text_1.txt
* ......a_text_2.txt
* ...class_b/
* ......b_text_1.txt
* ......b_text_2.txt

To prepare a dataset for binary classification, we need two folders on disk corresponding to class_a and class_b. In our case, the two folders are the positive and negative movie reviews found in 'aclImdb/train/pos' and 'aclImdb/train/neg' respectively. As the IMDB dataset contains additional folders, remove them before using the utility.
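To see why extra folders matter, here is a plain-Python sketch of how labels can be inferred from subdirectory names sorted alphabetically. It only imitates the idea behind the utility and is not its implementation; the function `list_labeled_files` and the file name `1_3.txt` are hypothetical.

```python
import tempfile
from pathlib import Path


def list_labeled_files(main_directory):
    """Return (class_names, [(relative_path, label_index), ...])."""
    root = Path(main_directory)
    # every subdirectory becomes a class, numbered in alphabetical order
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    samples = []
    for label, name in enumerate(class_names):
        for f in sorted((root / name).glob('*.txt')):
            samples.append((f.relative_to(root).as_posix(), label))
    return class_names, samples


with tempfile.TemporaryDirectory() as tmp:
    for cls, fname in [('pos', '25_7.txt'), ('neg', '1_3.txt')]:
        (Path(tmp) / cls).mkdir(exist_ok=True)
        (Path(tmp) / cls / fname).write_text('placeholder')
    class_names, samples = list_labeled_files(tmp)

print(class_names)
print(samples)
```

Under this scheme, a leftover folder such as 'unsup' would be picked up as a third class, which is why it has to go before loading.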

In [15]:
import shutil

remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)

## Load the Splits

Use the **text_dataset_from_directory** utility to create a labeled tf.data.Dataset. For a machine learning experiment, best practice is to divide the dataset into three splits: train, validation, and test.

The IMDB dataset is already divided into train and test, but lacks a validation set. Let's split the training data 80:20 into train and validation sets by using the validation_split argument, starting with the training subset:

In [16]:
BATCH_SIZE = 32
seed = 0

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
  'aclImdb/train',
  batch_size=BATCH_SIZE,
  validation_split=0.2,
  subset='training',
  seed=seed)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.


The training folder contains 25,000 examples. The training set contains 20,000 examples, or 80% of the total. The 'aclImdb/train' directory signals the utility to draw data from the training folder.

Display examples from the first batch:

In [19]:
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(3):
    print ('Review:', text_batch.numpy()[i])
    print ('Label:', label_batch.numpy()[i])

Review: b'This is what makes me proud to be British. This is by far the funniest thing on TV. The league consists of Jeremy dyson, Steve pemberton, mark gatiss and the lovely Reece shearsmith. Totally underrated, this horror-comedy is perfection. The characters are iconic and the catchphrases bizarre, "Hello Dave". It is a comedy that everyone simply must watch.<br /><br />The best thing about the league of gentlemen is that it is always fresh, and always pushing the boundaries. It does not need to rely on catchphrases(unlike little Britain) for it to be funny. the fact that the league are willing to kill off arguably their most famous and iconic characters, shows us that they\'ve got balls of steel.'
Label: 1
Review: b'There is only one use for a film such as Bulletproof: it reminds you just how bad bad can be. We often see films which we describe as "pretty awful" or "not much good", but then you come across a film like this and you can see that although all those other films aren\'t

The batch size is 32, but we take just three examples from it. A label of '1' is a positive review and a label of '0' is a negative one.

Reviews contain raw text (with punctuation and occasional HTML tags like <br/>). We show you how to handle these in the following section.

Verify label names:

In [20]:
print ('Label 0 corresponds to', raw_train_ds.class_names[0])
print ('Label 1 corresponds to', raw_train_ds.class_names[1])

Label 0 corresponds to neg
Label 1 corresponds to pos


Create a validation set with 5,000 reviews:

In [21]:
raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', 
    batch_size=BATCH_SIZE, 
    validation_split=0.2, 
    subset='validation', 
    seed=seed)

Found 25000 files belonging to 2 classes.
Using 5000 files for validation.


The 'aclImdb/train' directory signals the utility to draw data from the training folder. The subset of 'validation' signals the utility to draw the remaining 5,000 examples from the training folder.

**Note:** when using the validation_split and subset arguments, either specify a random seed or pass shuffle=False to ensure that validation and training splits have no overlap.
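The reason for the note can be illustrated with a plain-Python sketch. It assumes a split that shuffles indices with a seed and then slices off the last 20%; the utility's actual internals are not claimed here, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n, validation_split, seed, subset):
    """Shuffle range(n) with a fixed seed, then slice out one subset."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_val = int(n * validation_split)
    if subset == 'training':
        return indices[:n - n_val]
    return indices[n - n_val:]


train_idx = split_indices(25000, 0.2, seed=0, subset='training')
val_idx = split_indices(25000, 0.2, seed=0, subset='validation')

# Same seed on both calls: the two subsets partition the data.
print(len(train_idx), len(val_idx), len(set(train_idx) & set(val_idx)))
```

If the two calls used different seeds, each would shuffle differently and the slices could share examples, which would leak validation data into training.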

Create the test set with 25,000 reviews:

In [22]:
raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/test', 
    batch_size=BATCH_SIZE)

Found 25000 files belonging to 2 classes.


The 'aclImdb/test' directory signals the utility to draw all 25,000 examples from the test folder.

# Prepare the Dataset for Training

Standardize, tokenize, and vectorize the data using the preprocessing TextVectorization layer.

**Standardization** refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset. **Tokenization** refers to splitting strings into tokens. For example, splitting a sentence into individual words and/or splitting on whitespace. **Vectorization** refers to converting tokens into numbers so they can be fed into a neural network.
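The three steps can be sketched in plain Python on a single string. This is only an illustration of the terms, not the TextVectorization layer itself; the toy vocabulary and the convention that unknown words map to 1 are assumptions made for the example.

```python
import re
import string


def standardize(text):
    """Lowercase, replace <br /> tags with spaces, strip punctuation."""
    text = text.lower().replace('<br />', ' ')
    return re.sub('[%s]' % re.escape(string.punctuation), '', text)


def tokenize(text):
    """Split on whitespace."""
    return text.split()


def vectorize(tokens, vocabulary):
    """Map tokens to integers; unknown tokens map to 1."""
    return [vocabulary.get(tok, 1) for tok in tokens]


vocab = {'great': 2, 'fun': 3, 'movie': 4}  # hypothetical toy vocabulary
clean = standardize('Great fun.<br /><br />GREAT movie!')
tokens = tokenize(clean)
print(tokens)                    # ['great', 'fun', 'great', 'movie']
print(vectorize(tokens, vocab))  # [2, 3, 2, 4]
```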

Reviews contain various HTML tags. Such tags are not removed by the default standardizer in the TextVectorization layer (which converts text to lowercase and strips punctuation by default, but doesn't strip HTML). We write a custom standardization function to remove the HTML.

Create the custom standardization function:

In [23]:
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return stripped_html

Create a TextVectorization layer to standardize, tokenize, and vectorize the data. Set the output_mode to **int** to create unique integer indices for each token. Use the default split function and the custom standardization function. Define an explicit maximum sequence_length that causes the layer to pad or truncate sequences to exactly sequence_length values.
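What "pad or truncate to exactly sequence_length values" means can be shown with a short plain-Python sketch. The helper `pad_or_truncate` is hypothetical; it assumes padding with 0 at the end of the sequence and cutting extra tokens from the end.

```python
def pad_or_truncate(ids, sequence_length, pad_value=0):
    """Return exactly sequence_length ids: cut extras, pad the tail."""
    ids = list(ids[:sequence_length])
    return ids + [pad_value] * (sequence_length - len(ids))


print(pad_or_truncate([5, 8, 2], 5))           # [5, 8, 2, 0, 0]
print(pad_or_truncate([5, 8, 2, 9, 4, 7], 5))  # [5, 8, 2, 9, 4]
```

With sequence_length set to 250, a short review gains trailing zeros and a long one loses everything after its 250th token.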

In [24]:
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

max_features = 10000
sequence_length = 250

vectorize_layer = TextVectorization(
  standardize=custom_standardization,
  max_tokens=max_features,
  output_mode='int',
  output_sequence_length=sequence_length)

Call **adapt** to fit the state of the preprocessing layer to the dataset, which causes the layer to build an index of strings to integers:

In [25]:
# Make a text-only dataset (without labels), then call adapt
train_text = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(train_text)
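Conceptually, adapting builds a frequency-ranked vocabulary capped at max_tokens. The sketch below imitates that idea in plain Python, reserving index 0 for padding and index 1 for out-of-vocabulary words. The helper `build_vocabulary` and its alphabetical tie-breaking are assumptions for illustration, not the layer's actual code.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Most frequent tokens first, after reserved padding and OOV slots."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return ['', '[UNK]'] + ranked[:max_tokens - 2]


vocab = build_vocabulary(['the movie was great', 'the movie was bad',
                          'the plot'], max_tokens=5)
print(vocab)  # ['', '[UNK]', 'the', 'movie', 'was']
```

Words that do not make the cut, such as 'plot' here, would all be encoded with the single out-of-vocabulary index.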

In [26]:
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

In [27]:
# retrieve a batch (of 32 reviews and labels) from the dataset
text_batch, label_batch = next(iter(raw_train_ds))
first_review, first_label = text_batch[0], label_batch[0]
print("Review", first_review)
print("Label", raw_train_ds.class_names[first_label])
print("Vectorized review", vectorize_text(first_review, first_label))

Review tf.Tensor(b"Not as bad as some are making it out to be, though obviously pathetic compared to the original. In my opinion Amitabh was great as the villain Babban Singh - try not to compare to Gabbar in the original as they were clearly not going for the same effect. Other than some mediocre action scenes however, the rest of the film is flawed. Character development was poor and the development of the story was hopeless, with many loopholes, and missing pieces of information which i wouldn't have known if i hadn't read the back of the DVD case. The worst part of the movie was the support roles from Nisha Kothari and especially this new dude called Prashant Raj. Nisha is just plain annoying from the time her lips first open. As for Prashant Raj - seriously who is this guy? where is he from and why on earth was he present in the film studio for anything other than to serve drinks?. His acting ability is zero and he has the same tone, dialog delivery and staunch expression in every

In [28]:
print("1287 ---> ",vectorize_layer.get_vocabulary()[1287])
print(" 313 ---> ",vectorize_layer.get_vocabulary()[313])
print('Vocabulary size: {}'.format(len(vectorize_layer.get_vocabulary())))

1287 --->  military
 313 --->  sense
Vocabulary size: 10000


In [29]:
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

In [30]:
AUTOTUNE = tf.data.experimental.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)

In [31]:
embedding_dim = 16

In [32]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, Dropout,\
GlobalAveragePooling1D

In [33]:
model = Sequential([
  Embedding(max_features + 1, embedding_dim),
  Dropout(0.2),
  GlobalAveragePooling1D(),
  Dropout(0.2),
  Dense(1)])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 16)          160016    
_________________________________________________________________
dropout (Dropout)            (None, None, 16)          0         
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 17        
Total params: 160,033
Trainable params: 160,033
Non-trainable params: 0
_________________________________________________________________
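The parameter counts in the summary can be checked by hand. A short sketch using the values defined earlier in the chapter:

```python
max_features = 10000
embedding_dim = 16

embedding_params = (max_features + 1) * embedding_dim  # one vector per index
dense_params = embedding_dim * 1 + 1                   # weights plus one bias
print(embedding_params, dense_params, embedding_params + dense_params)
# 160016 17 160033
```

The dropout and pooling layers have no weights, so they contribute nothing to the total.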


In [34]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=tf.metrics.BinaryAccuracy(threshold=0.0))
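The final Dense layer has no activation, so the model emits a raw score (a logit) rather than a probability. That appears to be why the loss uses from_logits=True and the accuracy threshold is 0.0. A plain-Python sketch of both ideas follows, with hypothetical helper names; it illustrates the math and is not the Keras implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logit(label, z):
    """Binary cross-entropy for one example, computed from a raw logit."""
    # Stable form: max(z, 0) - z * label + log(1 + exp(-|z|))
    return max(z, 0.0) - z * label + math.log1p(math.exp(-abs(z)))


for z in (-2.0, 0.0, 2.0):
    # A logit above 0 is the same decision as a probability above 0.5.
    print(z, round(sigmoid(z), 3), z > 0.0, sigmoid(z) > 0.5)

# The loss is small when the logit agrees with the label, large otherwise.
print(bce_from_logit(1, 2.0) < bce_from_logit(0, 2.0))  # True
```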

In [35]:
epochs = 10
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [41]:
def tokenize_text(text):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text)

In [44]:
review = ('Just loved it. My kids thought the movie was cool. '
         'Even my wife liked it. Would recommend it to anyone.')

tok_review = tokenize_text(review)
tok_review

pred = model.predict(tok_review)
pred, pred.shape

(array([[0.8739762]], dtype=float32), (1, 1))
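Because the model was compiled with from_logits=True, the printed value is a logit, not a probability. A minimal sketch of turning it into a probability and a class name, using the 'neg'/'pos' label names verified earlier:

```python
import math

logit = 0.8739762  # value printed by model.predict above
probability = 1.0 / (1.0 + math.exp(-logit))
label = 'pos' if logit > 0.0 else 'neg'
print(round(probability, 3), label)
```

The logit is positive, so the review is classified as positive, with a probability of roughly 0.71.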