# Sentiment analysis 

This is a jupyter notebook for a project to detect the sentiments of movie reviews.

In [1]:
import matplotlib.pyplot as plt
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses

2024-09-19 13:47:44.773806: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
print('hello world!')

hello world!


In [3]:
print(tf.__version__)

2.16.2


## Import dataset
We will use a dataset of movie reviews from IMDB, provided by [stanford university](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz).
Then, we use the get_file function from keras (included in tensorflow) to download the dataset if it is not already in the cache_dir.

In [4]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dataset = tf.keras.utils.get_file("aclImdb_v1", url,
                                  untar=True, cache_dir='data',
                                  cache_subdir='')

We define variables for the directories where the data is stored. Directories contain a text file for each review, positive ones are in the 
_pos_ directory and negative are in the _neg_ directory.

In [5]:
dataset_dir=os.path.join(os.path.dirname(dataset), 'aclImdb')
train=os.path.join(dataset_dir,'train')
test=os.path.join(dataset_dir,'test')

We will remove the _unsup_ directory in the train data, as we will use supervised learning for this project (and this will simplify data
loading in the following step)

In [6]:
remove=os.path.join(train,'unsup')
shutil.rmtree(remove)

To load the data, we will use the _text_dataset_from_directory_ function in keras to load the data to memory. This function expect the 
structure 
```text
dir/
    class_1/
        point_1.txt
        point_2.txt
    class_2/
        point_1.txt
        point_2.txt
```
which is conveniently followed by the dataset. In this case, neg will have the label 0 and pos the label 1. 
The dataset will be loaded in batches of 32 points (so it will yield groups of 32 points on each iteration) and the value 40 is used
as a random seed for shuffling. It also will reserve 20% of the dataset for validation (used to tune hyperparameters)

In [7]:
batch_size = 32
seed = 40

raw_training_ds = tf.keras.utils.text_dataset_from_directory(
    train,
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.


In [8]:
print("Label 0 corresponds to", raw_training_ds.class_names[0])
print("Label 1 corresponds to", raw_training_ds.class_names[1])

Label 0 corresponds to neg
Label 1 corresponds to pos


In [9]:
raw_validation_ds = tf.keras.utils.text_dataset_from_directory(
    train,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)

Found 25000 files belonging to 2 classes.
Using 5000 files for validation.


In [10]:
raw_testing_ds = tf.keras.utils.text_dataset_from_directory(
    test,
    batch_size=batch_size)


Found 25000 files belonging to 2 classes.


In [11]:
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation),
                                  '')

In [12]:
max_features = 10000
sequence_length = 250

#TODO: use glove embeddings

vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization, #A function that is called for each input to standarize it
    max_tokens=max_features, #maximum size of the vocabulary (it will have the "top max_features" words)
    output_mode='int', #return an int for each word
    output_sequence_length=sequence_length #TODO check!
    )

In [13]:
# Make a text-only dataset (without labels), then call adapt
train_text = raw_training_ds.map(lambda x, y: x)
#build a vocabulary of all tokens in the dataset
vectorize_layer.adapt(train_text) 

2024-09-19 13:48:58.779515: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


In [14]:
def vectorize_text(text, label):
  print('previous expansion', vectorize_layer(text))
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

In [15]:
# retrieve a batch (of 32 reviews and labels) from the dataset
text_batch, label_batch = next(iter(raw_training_ds))
first_review, first_label = text_batch[0], label_batch[0]
print("Review", first_review)
print("Label", raw_training_ds.class_names[first_label])
print("Vectorized review", vectorize_text(first_review, first_label))

Review tf.Tensor(b'I still liked it though. Warren Beatty is only fair as the comic book hero. What saves this movie is the set, the incredible cast and it offshoots a mediocre script. I really expected something more substantial in the terms of action, or plot but I got very little. The main reason to watch this movie is to watch some of the biggest stars in Hollywood at the time in such an unusual film. <br /><br />The one person who did a terrible job and did not even belong in this film was Madonna. She did not belong in this movie and her acting job was pretty bad. The movie at some points just stood still. You expected something more and you got nothing. Al Pacino plays a really bad dude and he does pretty good. He and Beatty do make an excellent good guy and bad guy. <br /><br />It is also interesting to see Dustin Hoffman, and Warren Beatty in a film other than Isthar. I did not see Ishtar but I heard bad things. The thing about this movie is it is good, but it could have been 

In [17]:
train_ds = raw_training_ds.map(vectorize_text)
val_ds = raw_validation_ds.map(vectorize_text)
test_ds = raw_testing_ds.map(vectorize_text)

previous expansion Tensor("text_vectorization_1/Pad:0", shape=(None, None), dtype=int64)
previous expansion Tensor("text_vectorization_1/Pad:0", shape=(None, None), dtype=int64)
previous expansion Tensor("text_vectorization_1/Pad:0", shape=(None, None), dtype=int64)


In [18]:
autotune = tf.data.AUTOTUNE #dynamically change buffer size
train_ds = train_ds.cache().prefetch(buffer_size=autotune)
val_ds = val_ds.cache().prefetch(buffer_size=autotune)
test_ds = test_ds.cache().prefetch(buffer_size=autotune)

# The neural network

In [19]:
embedding_dim = 16 


model = tf.keras.Sequential([
  layers.Embedding(max_features + 1, embedding_dim), #transform the word index into a vector of dimension embedding_dim (16)
  layers.Dropout(0.2), # prevent overfitting
  layers.GlobalAveragePooling1D(), #Do an average of all the embeddings in the review to get one output (TODO: change to LSTM)
  layers.Dropout(0.2),
  layers.Dense(1)])

model.summary()

In [20]:
model.compile(loss=losses.BinaryCrossentropy(),
              optimizer='adam',
              metrics=[tf.metrics.BinaryAccuracy(threshold=0.5)])

# Training

In [21]:
epochs = 10
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs)

Epoch 1/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 8ms/step - binary_accuracy: 0.5462 - loss: 0.8015 - val_binary_accuracy: 0.7532 - val_loss: 0.6080
Epoch 2/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 4ms/step - binary_accuracy: 0.7315 - loss: 0.5806 - val_binary_accuracy: 0.8372 - val_loss: 0.4661
Epoch 3/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 5ms/step - binary_accuracy: 0.8223 - loss: 0.4499 - val_binary_accuracy: 0.8536 - val_loss: 0.4020
Epoch 4/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 4ms/step - binary_accuracy: 0.8585 - loss: 0.3909 - val_binary_accuracy: 0.8702 - val_loss: 0.4007
Epoch 5/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 4ms/step - binary_accuracy: 0.8685 - loss: 0.3647 - val_binary_accuracy: 0.8784 - val_loss: 0.3945
Epoch 6/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 4ms/step - binary_accuracy: 0.8357 - loss: 

Test model

In [22]:
loss, accuracy = model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: ", accuracy)

[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 3ms/step - binary_accuracy: 0.8557 - loss: 0.4661
Loss:  0.4553247392177582
Accuracy:  0.8552799820899963
