<a href="https://colab.research.google.com/github/Benjamin-morel/TensorFlow/blob/main/02_classification_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---


# **Machine Learning Model: basic text classification**

| | |
|------|------|
| Filename | 02_classification_text.ipynb |
| Author(s) | Benjamin Morel (benjaminmorel27@gmail.com) |
| Date | September 4, 2024 |
| Aim(s) | Build, train and evaluate a neural network machine learning model that classifies movie reviews as positive or negative. |
| Dataset(s) | Large Movie Review Dataset (Stanford) [[1]](https://ai.stanford.edu/~amaas/data/sentiment/)|
| Version | Python 3.12 - TensorFlow 2.17.0 |


<br> **!!Read before running!!** <br>
1. Fill in the inputs
2. GPU execution recommended if `training_phase="Yes"`.
3. Run all and read comments.

---

#### **Motivation**

AI is capable of processing texts made up of several thousand words, special characters and punctuation marks. In this notebook, the construction of the neural network and its optimization are similar to what has been done before. The main interest lies in the way the text is processed and broken down into elements.

For this, the Large Movie Review Dataset distributed by Stanford, composed of 50,000 labeled movie reviews collected from IMDb, is used to build a binary classification neural network. A review with a rating of 4/10 or lower is labeled as negative, and a review with a rating of 7/10 or higher is labeled as positive.



---



#### **0. Input section**

The model has already been trained: the **parameters** (weights and biases) of each neuron have already been fitted to the base dataset. The user can either keep these parameters and **not retrain the model** (`No`) or repeat the **training phase** (`Yes`). Retraining is useful, for example, to update the neural network with an updated dataset.

In [None]:
training_phase = 'No'

---


#### **1. Import libraries & prebuilt dataset**

###### **1.1. Presentation of Python libraries**

`os`: provides functions to interact with the operating system (manipulating files and directories, accessing system information...)

`re`: performs complex searches and manipulations on strings (extracting substrings...)

`shutil`: provides utilities for performing operations on files and directories (copying, modifying, deleting...)

`string`: provides tools for handling and processing strings

`numpy`: widely used library for numerical and scientific computing

`tensorflow`: builds and trains machine learning and deep learning models

`plotly.express`: creates graphs

In [None]:
import os # miscellaneous operating system interfaces
import re # regular expressions
import shutil # operations on files
import string # for manipulating strings
import numpy as np # scientific computing
import tensorflow as tf # machine learning models
import plotly.express as px # graphing packages

###### **1.2. Which database is used?**

The database is the Large Movie Review Dataset, a collection of IMDb movie reviews created at Stanford for sentiment analysis. It contains movie reviews labeled with sentiment (positive or negative) and is widely used to train and test sentiment analysis models. The reviews are distributed in the compressed archive `aclImdb_v1.tar.gz`.


In [None]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1", url, extract=True, cache_dir='.', cache_subdir='')
dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')

###### **1.3. How data is organized within Stanford movie review database?**

Once the archive is extracted, the `aclImdb` directory contains 2 folders (`train`, `test`) and 3 files (`README`, `imdbEr.txt`, `imdb.vocab`). The code below lists the names of the files and folders in this directory.

*   `train`: contains movie reviews meant for training
*   `test`: contains movie reviews meant for testing
*   `README`: provides information about the dataset and how to use it
*   `imdb.vocab` and `imdbEr.txt`: contain the dataset vocabulary and the expected rating associated with each word of this vocabulary

Both the `train` and `test` folders contain two labeled subfolders, and `train` contains a third, unlabeled one:

*   `pos`: contains movie reviews with a positive sentiment (rating ≥ 7/10)
*   `neg`: contains movie reviews with a negative sentiment (rating ≤ 4/10)
*   `unsup` (in `train` only): contains unlabeled movie reviews for unsupervised learning

Each text file in these subfolders represents a single movie review and contains only the raw text of the review, without additional metadata. Some HTML line-break tags (`<br />`) remain in the text and are removed during pre-processing.
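
The folder layout described above can be checked by counting the review files per label. The snippet below is a minimal sketch: the helper `count_reviews` is a hypothetical name introduced for illustration, and the relative path `aclImdb/train` assumes the archive was extracted in the current directory. The dataset is described as balanced, so `pos` and `neg` should hold the same number of files.

```python
import os

def count_reviews(split_dir, labels=("pos", "neg")):
    """Count the .txt review files in each label subfolder of a split."""
    counts = {}
    for label in labels:
        label_dir = os.path.join(split_dir, label)
        if os.path.isdir(label_dir):
            counts[label] = sum(1 for name in os.listdir(label_dir) if name.endswith(".txt"))
        else:
            counts[label] = 0  # subfolder missing (e.g. archive not extracted here)
    return counts

# Assumed location of the extracted training split
train_split = os.path.join("aclImdb", "train")
if os.path.isdir(train_split):
    print(count_reviews(train_split))
```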

In [None]:
os.listdir(dataset_dir) # check the file names in the aclImdb directory

['test', 'imdb.vocab', 'README', 'imdbEr.txt', 'train']

In [None]:
train_dir = os.path.join(dataset_dir, 'train') # path of the "train" folder in dataset_dir
remove_dir = os.path.join(train_dir, 'unsup') # folder with unlabeled reviews (for unsupervised learning), not needed here
shutil.rmtree(remove_dir) # delete it so that only the "pos" and "neg" classes remain

###### **1.4. How data is submitted during the learning phase**

A good practice for a machine learning experiment is to divide the dataset into 3 splits: `train`, `validation` and `test`. The `train` and `test` splits are already available. The validation set is created by setting aside 20% of the training set.

The 3 subsets are organized in batches for several reasons: limited memory use, parallel computation, fast calculation of gradients, and uniform processing of the training, validation and test sets. For this, the TensorFlow function `text_dataset_from_directory` randomly shuffles the texts in the `train` folder and splits them into the training dataset (80%) and the validation dataset (20%). Finally, each dataset is grouped into batches of 32 texts.

The 3 subsets are stored as `tf.data.Dataset` objects. The data is not loaded into memory all at once but is generated when requested. The `show_text()` function displays the first review of each batch of the `tf.data.Dataset` it receives; here it is called on the first batch of the training set.
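
The split and batch counts follow from simple arithmetic. The sketch below uses a hypothetical helper, `split_and_batch`, to reproduce the numbers for 25,000 training files, a 20% validation split and batches of 32 texts. The last batch of a split may be incomplete, hence the rounding up.

```python
import math

def split_and_batch(n_files, validation_split, batch_size):
    """Return file and batch counts for a training/validation split."""
    n_val = int(n_files * validation_split)
    n_train = n_files - n_val
    return {
        "train_files": n_train,
        "val_files": n_val,
        "train_batches": math.ceil(n_train / batch_size),  # last batch may be partial
        "val_batches": math.ceil(n_val / batch_size),
    }

print(split_and_batch(25000, 0.2, 32))
# {'train_files': 20000, 'val_files': 5000, 'train_batches': 625, 'val_batches': 157}
print(math.ceil(25000 / 32))  # batches in the test set: 782
```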

In [None]:
def show_text(dataset): # print the first review of each batch in the dataset
  for text_batch, label_batch in dataset: # iterate over the (texts, labels) batches
    if label_batch.numpy()[0] == 0: # label 0 = negative, label 1 = positive
      print("Here's an extract from a negative review:")
    else:
      print("Here's an extract from a positive review:")
    print(text_batch.numpy()[0])

In [None]:
batch_size = 32
raw_train_ds = tf.keras.utils.text_dataset_from_directory('aclImdb/train', batch_size=batch_size, validation_split=0.2, subset='training', shuffle=True, seed=42) # 625 batches of 32 texts for training set randomly chosen
raw_val_ds = tf.keras.utils.text_dataset_from_directory('aclImdb/train', batch_size=batch_size, validation_split=0.2, subset='validation', shuffle=True, seed=42) # 157 batches of 32 texts for validation set randomly chosen
raw_test_ds = tf.keras.utils.text_dataset_from_directory('aclImdb/test', batch_size=batch_size) # 782 batches of 32 texts for test set

first_batch_train = raw_train_ds.take(1) # get the first batch of the training text set

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.


In [None]:
show_text(first_batch_train) # show an example of movie review

Here's an extract from a negative review:
b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'


---


#### **2. Pre-processing & reformating data**

###### **2.1. How to switch from a verbal language to a machine language?**

The textual data is pre-processed and converted before being used by the model. Three crucial phases are established:
- standardization
- tokenization
- vectorization

The first pre-processing step standardizes the text data by converting upper case letters to lower case and by removing HTML tags and punctuation characters. The variable `ponctuation` contains all the punctuation characters to be removed. Accented characters are rare in English texts and are not treated specifically. A single standardization function is declared so that the training data is processed in exactly the same way as the other data (this avoids a training/testing bias).

In [None]:
ponctuation = re.escape(string.punctuation)
print("The punctuation characters to be eliminated are: \n", " \n", ponctuation)

The punctuation characters to be eliminated are: 
  
 !"\#\$%\&'\(\)\*\+,\-\./:;<=>\?@\[\\\]\^_`\{\|\}\~


In [None]:
def standardization(input_data): # standardization function
  no_uppercases = tf.strings.lower(input_data) # convert upper cases into lower cases...
  no_html = tf.strings.regex_replace(no_uppercases, '<br />', ' ') # ... then remove HTML strings and...
  no_punctuation = tf.strings.regex_replace(no_html, '[%s]' % ponctuation, '') # ... punctuation
  return no_punctuation
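
To see the effect of these three steps on a sample string without running TensorFlow, the same logic can be sketched in plain Python. This is an illustrative equivalent, not the function used by the model: `standardize` is a hypothetical name, and it relies on Python's `re` module rather than `tf.strings`.

```python
import re
import string

def standardize(text):
    """Lower-case the text, drop '<br />' tags, then remove punctuation."""
    lowered = text.lower()
    no_html = lowered.replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", no_html)

print(standardize("Great movie!<br />Loved it."))  # great movie loved it
```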

The second pre-processing step transforms the text strings into integers (= neural network inputs). Each text is split into substrings (= words) called ***tokens*** (tokenization step), and each token is then replaced by its integer index in a dictionary (vectorization step). Only the 10,000 most frequent tokens of the training dataset are kept in this dictionary. Any other word is mapped to a single out-of-vocabulary index (1).

These 10,000 tokens form a dictionary called the ***vocabulary*** (this count includes the two reserved entries for padding and unknown words). The vocabulary size is set to 10,000 in order to keep high-interest words (verbs, adjectives, common nouns) and discard rare words (names, etc.). The length of the text sequences after vectorization is fixed at 1,000 tokens. If a text is shorter than 1,000 tokens, it is padded with zeros to reach this length. If it is longer, it is truncated.

The Keras layer `TextVectorization`, designed specifically for text preprocessing, transforms strings into sequences of integers that the neural network can process.
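
The vocabulary lookup, padding and truncation described above can be illustrated with a small pure-Python sketch. It is a simplified stand-in for `TextVectorization`, not its implementation: the helpers `build_vocabulary` and `vectorize` are hypothetical, words are split on whitespace only, and ties between equally frequent words are broken by order of first appearance.

```python
from collections import Counter

PAD, UNK = "", "[UNK]"  # reserved entries: index 0 pads, index 1 marks unknown words

def build_vocabulary(texts, max_tokens):
    """Keep the most frequent words, after the two reserved entries."""
    counts = Counter(word for text in texts for word in text.split())
    most_frequent = [word for word, _ in counts.most_common(max_tokens - 2)]
    return [PAD, UNK] + most_frequent

def vectorize(text, vocabulary, sequence_length):
    """Map each word to its index, then truncate or pad to a fixed length."""
    index = {word: i for i, word in enumerate(vocabulary)}
    ids = [index.get(word, 1) for word in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

vocabulary = build_vocabulary(["the movie was great", "the movie was bad"], max_tokens=5)
print(vocabulary)  # ['', '[UNK]', 'the', 'movie', 'was']
print(vectorize("the movie was awful", vocabulary, sequence_length=6))  # [2, 3, 4, 1, 0, 0]
```

In this toy example, "awful" is not in the vocabulary and receives index 1, and the two trailing zeros are padding.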

In [None]:
max_features, sequence_length = 10000, 1000 # maximum number of words (=tokens) to consider in the vocabulary.
vectorize_layer = tf.keras.layers.TextVectorization(standardize=standardization, max_tokens=max_features, output_mode='int', output_sequence_length=sequence_length) # transform strings into sequences of integers

vectorize_layer.adapt(raw_train_ds.map(lambda x, y: x)) # vectorization layer to learn the vocabulary from the raw texts in the training dataset

###### **2.2. How do you visualize these transformations?**

The first function, `token_to_int`, shows the transformation of text elements into integers. The second, `int_to_token`, is used to identify which token is associated with a given integer.

In [None]:
def token_to_int(raw_text):
  raw_text = next(iter(raw_text))[0]
  print("Review before tokenization and vectorization: \n", raw_text[0])
  print(" ")
  text_vectorized = tf.expand_dims(raw_text, -1) # add a trailing dimension: shape (batch,) -> (batch, 1)
  text_vectorized = vectorize_layer(text_vectorized)
  print("Review after tokenization and vectorization: \n", text_vectorized[0])

token_to_int(raw_train_ds)

Review before tokenization and vectorization: 
 tf.Tensor(b'Great movie - especially the music - Etta James - "At Last". This speaks volumes when you have finally found that special someone.', shape=(), dtype=string)
 
Review after tokenization and vectorization: 
 tf.Tensor(
[  86   17  260    2  222    1  571   31  229   11 2418    1   51   22
   25  404  251   12  306  282    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0   

In [None]:
def int_to_token(index):
  token = vectorize_layer.get_vocabulary()[index]
  print("The integer %d represents the token: %s" %(index, token))

int_to_token(2)

The integer 2 represents the token: the


---


#### **3. Model and training**


###### **3.1. How to configure the datasets for performance?**

During training, each step involves opening the data files, reading a batch and running the training computation on it. By default, these operations are performed one after the other. With prefetching, while the model executes training step `s`, the input pipeline already loads the data for step `s+1`. Loading the next batch in the background makes better use of the available computing resources and reduces the risk of a bottleneck where computation (GPU) is limited by the speed at which data is supplied (I/O).

The overlapping of these steps is ensured by the `prefetch()` function, where the size of the prefetch buffer is automatically set by TensorFlow via the parameter `AUTOTUNE`.

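The `cache()` function keeps the elements of a dataset in RAM (about 12.7 GB is available in the Google Colab configuration used here) once they have been read during the first epoch. This saves a significant amount of time, since the files are read from disk only during the first epoch and the cached batches are reused for the following ones. Note that in this notebook the text transformations are performed by a layer inside the model (section 3.2), so the cache holds the raw text batches.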
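
The idea behind prefetching can be sketched with a background thread and a bounded queue. This is a conceptual illustration only, not how `tf.data` is implemented: the generator `prefetch` below is a hypothetical helper, and it does not propagate exceptions raised while loading.

```python
import queue
import threading

def prefetch(iterable, buffer_size=1):
    """Yield items while a background thread loads the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)  # bounded buffer of preloaded items
    sentinel = object()  # marks the end of the data

    def producer():
        for item in iterable:  # "loading" happens in the background thread
            buffer.put(item)   # blocks when the buffer is full
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()  # blocks only if the next item is not ready yet
        if item is sentinel:
            return
        yield item

for batch in prefetch(range(3)):
    print("training on batch", batch)
```

While the consumer processes one item, the producer is free to load the next one, which is the overlap that `prefetch()` provides between data loading and training steps.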

In [None]:
AUTOTUNE = tf.data.AUTOTUNE # prefetch buffer size parameter

raw_train_ds = raw_train_ds.cache().prefetch(buffer_size=AUTOTUNE) # store raw_train_ds temporarily in the RAM + load the next batch in background with a prefetch buffer size computed by AUTOTUNE
raw_val_ds = raw_val_ds.cache().prefetch(buffer_size=AUTOTUNE)
raw_test_ds = raw_test_ds.cache().prefetch(buffer_size=AUTOTUNE)

###### **3.2. What is a neural network formed of?**

In [None]:
def create_model():
  embedding_dim = 16
  model = tf.keras.Sequential()
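  model.add(vectorize_layer) # pre-processing layer (standardization + tokenization + vectorization)
  model.add(tf.keras.layers.Embedding(max_features, embedding_dim)) # each token index -> trainable vector of size embedding_dim
  model.add(tf.keras.layers.Dropout(0.2)) # randomly drop 20% of the values during training
  model.add(tf.keras.layers.GlobalAveragePooling1D()) # average over the sequence dimension -> one vector per review
  model.add(tf.keras.layers.Dropout(0.2))
  model.add(tf.keras.layers.Dense(1, activation='sigmoid')) # 1 output: probability that the review is positive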

  return model
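
The number of trainable parameters of this architecture can be checked by hand. The sketch below uses two hypothetical helpers and assumes the usual counts: one vector of size `embedding_dim` per vocabulary entry for the embedding layer, and one weight per input plus one bias per unit for the dense layer. The other layers have no trainable parameters.

```python
def embedding_params(vocab_size, embedding_dim):
    """One trainable vector of size embedding_dim per vocabulary entry."""
    return vocab_size * embedding_dim

def dense_params(n_inputs, n_units):
    """One weight per (input, unit) pair plus one bias per unit."""
    return n_inputs * n_units + n_units

print(embedding_params(10000, 16))  # 160000
print(dense_params(16, 1))          # 17
```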

###### **3.3. How to train the model?**


In [None]:
if training_phase == "Yes":
  checkpoint_path = "02_classification_text.weights.h5"
  cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path, save_weights_only=True, verbose=0)
  stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_binary_accuracy', patience=10, restore_best_weights=True, min_delta=0.001)
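
The role of `patience` and `min_delta` in the early-stopping callback can be illustrated with a simplified sketch. It is not the Keras implementation: `early_stopping` is a hypothetical function that takes a list of validation accuracies (one per epoch) and returns the epoch at which training would stop and the epoch with the best value, which is the one whose weights `restore_best_weights=True` is meant to bring back.

```python
def early_stopping(val_accuracies, patience, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-based, for a metric to maximize."""
    best_value, best_epoch, wait = float("-inf"), 0, 0
    for epoch, value in enumerate(val_accuracies, start=1):
        if value - min_delta > best_value:  # improvement larger than min_delta
            best_value, best_epoch, wait = value, epoch, 0
        else:
            wait += 1  # one more epoch without significant improvement
            if wait >= patience:
                return epoch, best_epoch
    return len(val_accuracies), best_epoch  # all epochs were run

# Illustrative values only
print(early_stopping([0.80, 0.85, 0.85, 0.8505, 0.84], patience=2, min_delta=0.001))  # (4, 2)
```

In this example the gain at epoch 4 is smaller than `min_delta`, so it does not count as an improvement and training stops after two epochs without progress.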

In [None]:
model = create_model()
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (Text  (None, 1000)              0         
 Vectorization)                                                  
                                                                 
 embedding_1 (Embedding)     (None, 1000, 16)          160000    
                                                                 
 dropout_2 (Dropout)         (None, 1000, 16)          0         
                                                                 
 global_average_pooling1d_1  (None, 16)                0         
  (GlobalAveragePooling1D)                                       
                                                                 
 dropout_3 (Dropout)         (None, 16)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                

In [None]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(), optimizer='adam', metrics=[tf.metrics.BinaryAccuracy()])

if training_phase == "Yes":
  history = model.fit(raw_train_ds, validation_data=raw_val_ds, epochs=100, callbacks=[stop_early, cp_callback], verbose=0)
  val_acc_per_epoch = history.history['val_binary_accuracy'] # best val_binary_accuracy achieved at epoch 32
  best_epoch = val_acc_per_epoch.index(max(val_acc_per_epoch)) + 1
  print('Best epoch: %d' % (best_epoch))
else:
  !git clone https://github.com/Benjamin-morel/TensorFlow.git # clone the GitHub repository TensorFlow
  model.load_weights("TensorFlow/02_classification_text.weights.h5") # import the weights from the cloned repository
  !rm -rf TensorFlow/ # delete the cloned repository

Cloning into 'TensorFlow'...
remote: Enumerating objects: 71, done.
remote: Counting objects: 100% (71/71), done.
remote: Compressing objects: 100% (71/71), done.
remote: Total 71 (delta 30), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (71/71), 14.99 MiB | 14.88 MiB/s, done.
Resolving deltas: 100% (30/30), done.




In [None]:
def show_evolution(history, val):
  history_dict = history.history # dictionary with keys: ['binary_accuracy', 'loss', 'val_binary_accuracy', 'val_loss']

  if not val: # get either the training set accuracy or the validation set accuracy
    acc = history_dict['binary_accuracy']
  else:
    acc = history_dict['val_binary_accuracy']

  epochs = list(range(1, len(acc) + 1))

  fig = px.line(x=epochs, y=acc, width=600, height=400)
  fig.update_layout(legend=dict(x=0.02, y=0.98, xanchor='left', yanchor='top', bgcolor='rgba(255, 255, 255, 0.8)', bordercolor='black', borderwidth=1))
  fig.update_traces(name="validation" if val else "training", showlegend=True)
  fig.update_xaxes(title="epochs")
  fig.update_yaxes(title="binary_accuracy")
  fig.show()

In [None]:
if training_phase == 'Yes':
  show_evolution(history, True)

---


#### **4. Evaluation and prediction**


In [None]:
loss, accuracy = model.evaluate(raw_test_ds)

print(f"""
  {round(100*accuracy, 3)}% of the test set is correctly predicted
  """)
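
The accuracy reported here is the fraction of reviews whose predicted class matches the label. A minimal sketch of that computation, assuming a decision threshold of 0.5 on the sigmoid output (the function name and the sample values are illustrative only):

```python
def binary_accuracy(labels, probabilities, threshold=0.5):
    """Fraction of examples whose thresholded prediction equals the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return correct / len(labels)

# Two of the four illustrative predictions match their label
print(binary_accuracy([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.7]))  # 0.5
```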

It is now possible to use the trained model to recognize the sentiment of the author of a film review. Write your own review in the next section and check the model's prediction.

In [None]:
my_review = ["This movie was terrible and boring. Most of scenes were violents and useless."]

In [None]:
examples = tf.constant(my_review)

prediction = model.predict(examples)[0][0] # probability that the review is positive
if prediction < 0.5:
  print("The movie looks pretty bad. (Model is", round(100*(1-prediction), 1), "% sure)")
else:
  print("Great movie, go see it in the cinema! (Model is", round(100*prediction, 1), "% sure)")

The movie looks pretty bad. (Model is 86.2 % sure)
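
The sigmoid output is the estimated probability that the review is positive, so the confidence in a negative verdict is one minus that value. The sketch below isolates this logic in a hypothetical helper, `interpret`. The probability 0.138 is an illustrative value chosen to be consistent with the 86.2% displayed above, not an output read from the model.

```python
def interpret(probability, threshold=0.5):
    """Turn a positive-class probability into a label and a confidence in %."""
    if probability < threshold:
        return "negative", round(100 * (1 - probability), 1)
    return "positive", round(100 * probability, 1)

print(interpret(0.138))  # ('negative', 86.2)
```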
