# Sentiment analysis 

This is a Jupyter notebook for a project that detects the sentiment of movie reviews.

In [1]:
import matplotlib.pyplot as plt
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses

2024-09-19 01:19:18.482279: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
print('hello world!')

hello world!


In [7]:
print(tf.__version__)

2.16.2


## Import dataset
We will use a dataset of movie reviews from IMDB, provided by [Stanford University](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz).
We use the _get_file_ function from Keras (included in TensorFlow) to download the dataset if it is not already in _cache_dir_.

In [4]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dataset = tf.keras.utils.get_file("aclImdb_v1", url,
                                  untar=True, cache_dir='data',
                                  cache_subdir='')

We define variables for the directories where the data is stored. Each directory contains one text file per review: positive reviews are in the
_pos_ directory and negative reviews are in the _neg_ directory.

In [15]:
dataset_dir=os.path.join(os.path.dirname(dataset), 'aclImdb')
train=os.path.join(dataset_dir,'train')
test=os.path.join(dataset_dir,'test')

We remove the _unsup_ directory from the training data, since this project uses supervised learning. Removing it also simplifies data
loading in the following step.

In [6]:
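remove=os.path.join(train,'unsup')
# Only delete if the directory still exists, so the cell can be re-run safely
if os.path.isdir(remove):
    shutil.rmtree(remove)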

To load the data, we use the _text_dataset_from_directory_ function in Keras, which returns a `tf.data.Dataset`. This function expects the
structure
```text
dir/
    class_1/
        point_1.txt
        point_2.txt
    class_2/
        point_1.txt
        point_2.txt
```
which the dataset conveniently follows. Class names are sorted alphanumerically, so _neg_ gets the label 0 and _pos_ the label 1.
The dataset is loaded in batches of 32 reviews (each iteration yields a group of 32), and the value 40 is used
as the random seed for shuffling. The same seed is passed to both subsets so that the training and validation splits do not overlap.
We also reserve 20% of the training data for validation (used to tune hyperparameters).

In [8]:
batch_size = 32
seed = 40

raw_training_ds = tf.keras.utils.text_dataset_from_directory(
    train,
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.


In [9]:
print("Label 0 corresponds to", raw_training_ds.class_names[0])
print("Label 1 corresponds to", raw_training_ds.class_names[1])

Label 0 corresponds to neg
Label 1 corresponds to pos
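
The label assignment printed above follows from sorting the class directory names. The block below is a minimal plain-Python sketch of that rule, not the Keras implementation: `infer_class_labels` is a hypothetical helper, and the temporary directory with its `review_1.txt` files is a stand-in for the real _train_ directory.

```python
import os
import tempfile

def infer_class_labels(directory):
    """Map each class subdirectory to an integer label, in sorted name order."""
    class_names = sorted(
        name for name in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, name))
    )
    return {name: index for index, name in enumerate(class_names)}

# Build a tiny stand-in for the train directory (hypothetical file names).
with tempfile.TemporaryDirectory() as root:
    for class_name in ("pos", "neg"):
        os.makedirs(os.path.join(root, class_name))
        with open(os.path.join(root, class_name, "review_1.txt"), "w") as f:
            f.write("placeholder review text")
    labels = infer_class_labels(root)

print(labels)
```

Because the names are sorted, the order in which the directories are created does not affect the labels.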


In [12]:
raw_validation_ds = tf.keras.utils.text_dataset_from_directory(
    train,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)

Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
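
The file counts reported for the two subsets can be checked with a little arithmetic. This sketch assumes the final, partially filled batch is kept rather than dropped. The `*_batches` names are illustrative and do not appear elsewhere in the notebook.

```python
import math

total_files = 25000       # files found under the train directory
validation_split = 0.2
batch_size = 32

validation_files = int(validation_split * total_files)
training_files = total_files - validation_files

# The last batch may hold fewer than batch_size reviews, hence the ceiling.
training_batches = math.ceil(training_files / batch_size)
validation_batches = math.ceil(validation_files / batch_size)

print(training_files, validation_files)
print(training_batches, validation_batches)
```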


In [16]:
raw_testing_ds = tf.keras.utils.text_dataset_from_directory(
    test,
    batch_size=batch_size)


Found 25000 files belonging to 2 classes.


In [17]:
def custom_standardization(input_data):
  # Lowercase the text, replace HTML line breaks with spaces, then strip punctuation
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation),
                                  '')
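
To see what the standardization does to a single review, the same three steps can be reproduced on an ordinary Python string. This is only an illustrative counterpart: `standardize_text` and the sample sentence are hypothetical, and the function in the notebook operates on TensorFlow string tensors instead.

```python
import re
import string

def standardize_text(text):
    """Plain-Python counterpart of the three standardization steps."""
    lowercase = text.lower()
    stripped_html = lowercase.replace('<br />', ' ')
    return re.sub('[%s]' % re.escape(string.punctuation), '', stripped_html)

sample = "Great movie!<br />Loved it."
print(standardize_text(sample))
```

The order of the steps matters: the `<br />` tag is replaced before punctuation is stripped, because otherwise the tag would lose its angle brackets and slash and leave a stray `br` in the text.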

In [19]:
max_features = 10000
sequence_length = 250

vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)
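
The effect of _max_tokens_, `output_mode='int'` and _output_sequence_length_ can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: `build_vocabulary` and `vectorize` are hypothetical helpers, and the text is assumed to be already standardized and split on whitespace. Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, so only `max_tokens - 2` words get their own index. How the real layer breaks ties between equally frequent words may differ.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Keep the most frequent tokens; indices 0 and 1 are reserved."""
    counts = Counter(token for text in texts for token in text.split())
    # Two slots are reserved: 0 for padding and 1 for out-of-vocabulary tokens.
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return {token: index + 2 for index, token in enumerate(most_common)}

def vectorize(text, vocabulary, sequence_length):
    """Map tokens to integer ids, then truncate or zero-pad to a fixed length."""
    ids = [vocabulary.get(token, 1) for token in text.split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

corpus = ["good good good movie", "bad movie"]
vocabulary = build_vocabulary(corpus, max_tokens=4)
print(vocabulary)
print(vectorize("good movie bad", vocabulary, sequence_length=5))
```

With the values used in the notebook, each review becomes a vector of exactly 250 integers drawn from a vocabulary of at most 10000 entries.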