#Sentiment analysis
This notebook trains a sentiment analysis model that classifies movie reviews as positive or negative, based on the text of the review.

We'll use the Large Movie Review Dataset, which contains the text of 50,000 movie reviews from the Internet Movie Database (IMDB).

##Download and prepare the IMDB dataset
Let's download and extract the dataset, then explore the directory structure.

In [1]:
import os
import shutil

import tensorflow as tf
import tensorflow_hub as hub

import matplotlib.pyplot as plt

tf.get_logger().setLevel('ERROR')

url = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'

dataset = tf.keras.utils.get_file('aclImdb_v1.tar.gz', url,
                                  untar=True, cache_dir='.',
                                  cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')

train_dir = os.path.join(dataset_dir, 'train')

# remove unused folders to make it easier to load the data
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


Next, we will use the text_dataset_from_directory utility to create a labeled tf.data.Dataset for each split.

The IMDB dataset has already been divided into train and test, but it lacks a validation set. We'll create one further down with an 80:20 split, using scikit-learn's train_test_split once the reviews have been tokenized.

In [32]:
AUTOTUNE = tf.data.AUTOTUNE
batch_size = 1
seed = 42

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    seed=seed)

class_names = raw_train_ds.class_names
train_ds = raw_train_ds.cache().prefetch(buffer_size=AUTOTUNE)

test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/test',
    batch_size=batch_size)

test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)

Found 25000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.
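
text_dataset_from_directory infers the labels from the subdirectory names, taken in alphanumerical order, which is why `neg` becomes 0 and `pos` becomes 1, and why an extra folder such as `unsup` would otherwise show up as a third class. A minimal stdlib-only sketch of that mapping, using a throwaway directory and a hypothetical helper `infer_class_names` (not part of TensorFlow):

```python
import os
import tempfile

def infer_class_names(directory):
    """Return subdirectory names in sorted order; the list index is the label."""
    return sorted(
        name for name in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, name))
    )

# Build a tiny stand-in for aclImdb/train with one made-up review per class.
root = tempfile.mkdtemp()
for cls, text in [("pos", "Loved it."), ("neg", "Dull and slow.")]:
    os.makedirs(os.path.join(root, cls))
    with open(os.path.join(root, cls, "0.txt"), "w") as f:
        f.write(text)

demo_class_names = infer_class_names(root)
label_of = {name: idx for idx, name in enumerate(demo_class_names)}
print(demo_class_names, label_of)
```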


##Show a sample of the raw data, then preprocess it

In [33]:
for text_batch, label_batch in train_ds.take(1):

    print(f'Review: {text_batch.numpy()[0]}')
    label = label_batch.numpy()[0]
    print(f'Label : {label} ({class_names[label]})')

Review: b"The idea ia a very short film with a lot of information. Interesting, entertaining and leaves the viewer wanting more. The producer has produced a short film of excellent quality that cannot be compared to any other short film that I have seen. I have rated this film at the highest possible rating. I also recommend that it is shown to office managers and business people in any establishment. What comes out of it is the fact that people with ideas are never listened to, their voice is never heard. It is a lesson to be learned by any office that wants to go forward. I hope that the produced will produce a second part to this 'idea'. I look forward to viewing the sequence. Once again congrats to Halaqah media in producing a film of excellence and quality with a lesson in mind."
Label : 1 (pos)


In [34]:
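import pandas as pd
from sklearn.model_selection import train_test_split
import re

def preProcess_data(text):
    # Lowercase, then keep only letters, digits and whitespace
    text = text.lower()
    new_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Remove standalone "rt" tokens only, so words such as "short" stay intact
    new_text = re.sub(r'\brt\b', '', new_text)
    return new_text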
   
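# Collect every review and label from test_ds into Python lists
# (batch_size is 1, so each batch holds a single review)
x_l = []
o_l = []
for text_batch, label_batch in test_ds:
    x_l.append(text_batch.numpy()[0].decode("utf-8"))
    o_l.append(label_batch.numpy()[0])

print(x_l[1])  # text of the second review
print(o_l[0])  # label of the first review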

In England we often feel very attached to British films that we like, as we are so used to the usual American settings and accents. Being from London, where Virtual Sexuality is set, I felt a strong emotional attachment to it. The characters in Virtual Sexuality, particularly the females, are exactly what British teenagers are like, I felt like I was almost in the film. I immediately related to the character of Alex from the film, his shyness is quite common in most British teenage boys, especially around girls. Virtual Sexuality made me feel really good as its one of the only British films that isn't about gangsters or the middle-upper class, but about the people who are watching the film, average teenagers. Americans wouldn't really feel the emotional attachment, but every British teenager should watch it. Anyone from London will recognise the parts of the city from the film, it's definately got a special place in my video box!
0
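
The cleaning step can be tried on its own. The following is a minimal sketch, assuming a hypothetical helper `clean_text` (a variant written for illustration, not the notebook's function) that lowercases the text, strips everything except letters, digits and whitespace, and removes `rt` only when it is a standalone token. A plain substring replacement of `rt` would also turn "short" into "sho".

```python
import re

def clean_text(text):
    """Lowercase, drop non-alphanumeric characters, remove standalone 'rt'."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    text = re.sub(r"\brt\b", "", text)
    return " ".join(text.split())  # collapse leftover whitespace

print(clean_text("RT The movie was great!"))
print(clean_text("A short, smart film."))
```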


In [35]:
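# Collect the test reviews and labels again, this time for the final evaluation
x_eval = []
o_eval = []
for text_batch, label_batch in test_ds:
    x_eval.append(text_batch.numpy()[0].decode("utf-8"))
    o_eval.append(label_batch.numpy()[0])

print(x_eval[1])  # text of the second review
print(o_eval[0])  # label of the first review
print(len(x_eval))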

In England we often feel very attached to British films that we like, as we are so used to the usual American settings and accents. Being from London, where Virtual Sexuality is set, I felt a strong emotional attachment to it. The characters in Virtual Sexuality, particularly the females, are exactly what British teenagers are like, I felt like I was almost in the film. I immediately related to the character of Alex from the film, his shyness is quite common in most British teenage boys, especially around girls. Virtual Sexuality made me feel really good as its one of the only British films that isn't about gangsters or the middle-upper class, but about the people who are watching the film, average teenagers. Americans wouldn't really feel the emotional attachment, but every British teenager should watch it. Anyone from London will recognise the parts of the city from the film, it's definately got a special place in my video box!
0
25000


##Tokenize and Save Keras Tokenizer
>The tokenizer transforms the text into integer sequences. Training and prediction must share the same vocabulary, so the usual approach is to save the fitted tokenizer with pickle and load that same tokenizer at prediction time.

In [36]:
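import pickle

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, SpatialDropout1D

max_fatures = 20000  # vocabulary size: keep only the most frequent words

# Fit the vocabulary, then turn each review into a fixed-length sequence of word ids
tokenizer = Tokenizer(num_words=max_fatures, split=' ')
tokenizer.fit_on_texts(x_l)
X = tokenizer.texts_to_sequences(x_l)
X = pad_sequences(X, maxlen=200)

# One-hot encode the labels: one column per class
Y = pd.get_dummies(o_l)

# Save the fitted tokenizer so that inference can reuse the same vocabulary
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)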
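
To make the tokenize-and-pad step concrete, here is a simplified pure-Python sketch of the idea. The helpers `fit_word_index` and `to_padded` are hypothetical stand-ins, not the Keras implementation: they split on whitespace only, rank words by frequency, drop words outside the vocabulary, and pad or truncate at the front of each sequence, which is the default behaviour of pad_sequences.

```python
from collections import Counter

def fit_word_index(texts, num_words):
    """Rank words by frequency; index 1 is the most frequent (0 is padding)."""
    counts = Counter(word for t in texts for word in t.lower().split())
    ranked = [w for w, _ in counts.most_common()]
    # Keep only indices below num_words, mirroring the num_words cut-off.
    return {w: i + 1 for i, w in enumerate(ranked) if i + 1 < num_words}

def to_padded(texts, word_index, maxlen):
    """Map words to ids, skip unknown words, then pad/truncate at the front."""
    rows = []
    for t in texts:
        ids = [word_index[w] for w in t.lower().split() if w in word_index]
        ids = ids[-maxlen:]                       # keep the last maxlen tokens
        rows.append([0] * (maxlen - len(ids)) + ids)
    return rows

corpus = ["the movie was great", "the movie was bad"]
word_index = fit_word_index(corpus, num_words=5)
print(word_index)
print(to_padded(["the movie was bad"], word_index, maxlen=6))
```

Because "bad" falls outside the five-entry vocabulary it is silently dropped, which is the same thing that happens to rare words when `num_words` is set on the real tokenizer.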

##Split Dataset 

In [37]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.20)

print(len(X_train), "Training sequences")
print(len(X_test), "Validation sequences")


20000 Training sequences
5000 Validation sequences
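
The 80:20 arithmetic behind these counts can be sketched without scikit-learn. The helper `split_indices` below is hypothetical and only illustrates a seeded shuffle followed by a cut; train_test_split itself accepts a `random_state` argument for the same purpose, and the call above does not set it, so its split changes from run to run.

```python
import random

def split_indices(n_samples, test_fraction, seed):
    """Shuffle sample indices with a fixed seed and cut off the validation part."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]

train_idx, val_idx = split_indices(25000, 0.20, seed=42)
print(len(train_idx), "training /", len(val_idx), "validation")
```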


In [38]:
# Encode the evaluation reviews with the same fitted tokenizer
X_E = tokenizer.texts_to_sequences(x_eval)
X_E = pad_sequences(X_E, maxlen=200)
Y_E = pd.get_dummies(o_eval)

In [39]:
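# train_test_split is used here only to shuffle the evaluation data;
# even this tiny test_size holds out one sample, leaving 24999
X_eval, X_, Y_eval, Y_ = train_test_split(X_E, Y_E, test_size=0.00000000001)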

print(len(Y_eval), "Testing sequences")


24999 Testing sequences
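
The count is 24999 rather than 25000 because, as far as I can tell, scikit-learn rounds a fractional `test_size` up to a whole number of samples, so even a vanishingly small fraction holds out one review. A small sketch of that arithmetic, under that assumption:

```python
import math

n_samples = 25000
test_size = 0.00000000001

# A fractional test size is rounded up to at least one whole sample.
n_held_out = math.ceil(test_size * n_samples)
n_kept = n_samples - n_held_out
print(n_held_out, n_kept)
```

If the goal is only to evaluate on every test review, passing `X_E` and `Y_E` to the model directly avoids losing that sample.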


##Build the Bidirectional LSTM model

In [40]:
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

max_features = 20000  # Only consider the top 20k words
maxlen = 200  # Pad or truncate each movie review to 200 tokens

Build the model

In [43]:
model = Sequential()
model.add(Embedding(max_fatures, 128, input_length=maxlen))

# Add 2 bidirectional LSTMs
model.add(layers.Bidirectional(layers.LSTM(64, return_sequences=True)))
model.add(layers.Bidirectional(layers.LSTM(64)))
# Add a classifier
model.add(layers.Dense(2, activation="sigmoid"))

model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 200, 128)          2560000   
_________________________________________________________________
bidirectional_8 (Bidirection (None, 200, 128)          98816     
_________________________________________________________________
bidirectional_9 (Bidirection (None, 128)               98816     
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 258       
Total params: 2,757,890
Trainable params: 2,757,890
Non-trainable params: 0
_________________________________________________________________
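
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an embedding table, an LSTM (four gates, each with input weights, recurrent weights and a bias) and a dense layer; the helper names are made up for illustration.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim

def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

emb = embedding_params(20000, 128)
bi1 = 2 * lstm_params(128, 64)       # forward + backward LSTM over the embeddings
bi2 = 2 * lstm_params(2 * 64, 64)    # input is the concatenated 2 x 64 outputs
out = dense_params(2 * 64, 2)
total = emb + bi1 + bi2 + out
print(emb, bi1, bi2, out, total)
```

Nearly all of the weights sit in the embedding table, so the vocabulary size is the main lever on model size.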


In [44]:
model.compile("adam", "categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, Y_train, batch_size=512, epochs=4, validation_data=(X_test, Y_test))


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x7f12ce85f990>

In [45]:
# Evaluate the trained model on the evaluation sequences

print(len(X_eval), "Testing sequences")
loss, accuracy = model.evaluate(X_eval,Y_eval)

print(f'Loss: {loss}')
print(f'Accuracy: {accuracy}')

24999 Testing sequences
Loss: 0.12278015166521072
Accuracy: 0.9636785387992859
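
For intuition about the two reported numbers, here is a minimal pure-Python sketch of categorical cross-entropy and accuracy on one-hot labels. The toy labels and predictions are made up for illustration and have nothing to do with the model's actual outputs.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    total = 0.0
    for t_row, p_row in zip(y_true, y_pred):
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(t_row, p_row))
    return total / len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of samples whose highest-scoring class matches the label."""
    hits = sum(
        t_row.index(max(t_row)) == p_row.index(max(p_row))
        for t_row, p_row in zip(y_true, y_pred)
    )
    return hits / len(y_true)

# Columns are [neg, pos]; the third prediction is wrong on purpose.
y_true = [[0, 1], [1, 0], [0, 1]]
y_pred = [[0.1, 0.9], [0.8, 0.2], [0.6, 0.4]]

toy_loss = categorical_crossentropy(y_true, y_pred)
toy_accuracy = accuracy(y_true, y_pred)
print(f"Loss: {toy_loss:.4f}  Accuracy: {toy_accuracy:.4f}")
```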


#Inference

As noted above, training and prediction must share the same vocabulary, so we load the tokenizer that was pickled during training rather than fitting a new one.

In [47]:
import pickle

with open('tokenizer.pickle', 'rb') as handle:
   loaded_tokenizer = pickle.load(handle)   
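
The save-and-reload round trip can be checked in isolation. This sketch pickles a plain dictionary as a stand-in for the fitted tokenizer (the vocabulary, the file name `vocab.pickle` and the `encode` helper are all hypothetical) and confirms that the restored object encodes text exactly as the original did.

```python
import os
import pickle
import tempfile

# Hypothetical vocabulary standing in for a fitted tokenizer.
word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

path = os.path.join(tempfile.mkdtemp(), "vocab.pickle")
with open(path, "wb") as handle:
    pickle.dump(word_index, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as handle:
    restored = pickle.load(handle)

def encode(text, vocab):
    """Map known words to their ids and skip everything else."""
    return [vocab[w] for w in text.lower().split() if w in vocab]

print(encode("The movie was great", word_index))
print(encode("The movie was great", restored))
```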

In [56]:
examples = [
    'this is such an amazing movie!',
    'The movie was great!',
    'The movie was fine.',
    'The movie was bad.',
    'The movie was terrible...'
]
for ex in examples:
    txt = preProcess_data(ex)
    seq = loaded_tokenizer.texts_to_sequences([txt])
    padded = pad_sequences(seq, maxlen=maxlen)
    # predict_classes was removed from recent TensorFlow releases;
    # take the argmax over the two class scores instead
    pred = np.argmax(model.predict(padded, verbose=0), axis=-1)
    print(txt, pred)

this is such an amazing movie [1]
the movie was great [1]
the movie was fine [1]
the movie was bad [0]
the movie was terrible [0]
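
The bracketed numbers are class indices. A small sketch of how a pair of class scores maps back to a readable label, assuming the class order `['neg', 'pos']` reported earlier by the dataset loader; the scores below are invented for illustration.

```python
demo_class_names = ["neg", "pos"]  # assumed order: index 0 is neg, index 1 is pos

def to_label(scores):
    """Return the index of the highest score and the matching class name."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, demo_class_names[best]

print(to_label([0.08, 0.93]))
print(to_label([0.71, 0.22]))
```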
