<a href="https://colab.research.google.com/github/Benjamin-morel/TensorFlow/blob/main/02_classification_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---


# Machine Learning Model: basic text classification

| | |
|------|------|
| Filename | 02_classification_text.ipynb |
| Author(s) | Benjamin Morel (benjaminmorel27@gmail.com) |
| Date | September 4, 2024 |
| Aim(s) | Build, train and evaluate a neural network machine learning model that classifies movie reviews as positive or negative. |
| Dataset(s) | Stanford dataset [[1]](https://ai.stanford.edu/~amaas/data/sentiment/)|
| Version | Python 3.12 |


<br> **!!Read before running!!** <br>
1. CPU execution is enough
2. Run all and read comments.

---

## 1. Import libraries & prebuilt dataset

In this Python script, a neural network is built to classify **film reviews** into 2 classes (binary classification): negative (a rating of 4/10 or lower) and positive (a rating of 7/10 or higher). In addition to TensorFlow, specific libraries for **text and string manipulation** are used. The training and test data come from the Stanford dataset, which gathers over 50,000 movie reviews collected on the Internet and tagged according to the rating awarded.



In [None]:
import os # miscellaneous operating system interfaces
import re
import shutil # operations on files
import string # for manipulating strings
import numpy as np
import tensorflow as tf # machine learning models
import plotly.express as px # graphing packages

print("TensorFlow version:", tf.__version__)

TensorFlow version: 2.17.0


**Comments**
<br> The dataset is extracted directly from a compressed file containing the "*aclImdb*" folder. This folder contains 2 subfolders: "*train*" and "*test*". Each of them contains **positive and negative reviews**.


In [None]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1", url, extract=True, cache_dir='.', cache_subdir='')
dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')

In [None]:
os.listdir(dataset_dir) # file names in the dataset_dir
train_dir = os.path.join(dataset_dir, 'train') # path of the "train" folder in dataset_dir
os.listdir(train_dir)
remove_dir = os.path.join(train_dir, 'unsup') # remove folder with unlabeled data
shutil.rmtree(remove_dir)

**Comments**
<br> Here, the training folder is split into 2 sets: one for **training** (80% = 20,000 reviews) and another for **validation** (20% = 5,000 reviews). Both sets are drawn from the same shuffled data (with a fixed seed), so that no particular order is preserved between the samples. Within these sets, data is **batched** (32 reviews per batch) so that the model is updated on small groups of samples. Positive movie reviews are **labeled** 1 and negative reviews 0. An example review from one batch is shown with its label.
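The split and batch counts quoted above can be checked with a little arithmetic. The sketch below is only an illustration: `split_sizes` is a hypothetical helper, not a Keras function, and it assumes the validation share is taken as a plain fraction of the files and that the last, possibly incomplete, batch is kept.

```python
def split_sizes(n_files, validation_split, batch_size):
    """Return (training files, validation files, training batches)."""
    n_val = round(n_files * validation_split)   # files held out for validation
    n_train = n_files - n_val                    # files left for training
    n_batches = -(-n_train // batch_size)        # ceiling division: keep the last partial batch
    return n_train, n_val, n_batches

n_train, n_val, n_batches = split_sizes(25000, 0.2, 32)
print(n_train, n_val, n_batches)  # 20000 5000 625
```

The 625 training batches are the `625/625` steps reported per epoch in the training log, and the same ceiling division on the 25,000 test reviews gives the 782 steps shown during evaluation.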

In [None]:
batch_size = 32
raw_train_ds = tf.keras.utils.text_dataset_from_directory('aclImdb/train', batch_size=batch_size, validation_split=0.2, subset='training', seed=42)

for text_batch, label_batch in raw_train_ds.take(1):
    print("Review", text_batch.numpy()[0])
    print("Label", label_batch.numpy()[0])

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Review b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'
Label 0


In [None]:
raw_val_ds = tf.keras.utils.text_dataset_from_directory('aclImdb/train', batch_size=batch_size, validation_split=0.2, subset='validation', seed=42)
raw_test_ds = tf.keras.utils.text_dataset_from_directory('aclImdb/test', batch_size=batch_size)

Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.


## 2. Pre-processing & reformatting data

**Comments**
<br> In text classification, texts have to be converted into **numerical inputs** in order to be understood by the neural network. To do this, the data are pre-processed in two steps: **standardization**, then conversion into numbers (tokenization + vectorization). In the first step, uppercase letters are converted to lowercase, and HTML line-break tags (`<br />`) and punctuation are removed. A standardization function is declared so that training data and all other data are processed in exactly the same way (this avoids a training/testing bias).

In [None]:
def custom_standardization(input_data): # standardize the data
  lowercase = tf.strings.lower(input_data) # convert uppercases into lowercases...
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ') # ... then remove HTML strings and...
  return tf.strings.regex_replace(stripped_html, '[%s]' % re.escape(string.punctuation), '') # ... remove punctuation (replaced by an empty string)
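To see what these three steps do to a single string, here is a plain-Python analogue of the function above. It is a sketch for illustration only: `standardize_py` is a hypothetical name, and it relies on Python's `re` module rather than on the TensorFlow string ops used in the notebook.

```python
import re
import string

def standardize_py(text):
    """Plain-Python analogue: lowercase, drop '<br />' tags, strip punctuation."""
    lowercase = text.lower()
    stripped_html = lowercase.replace('<br />', ' ')
    return re.sub('[%s]' % re.escape(string.punctuation), '', stripped_html)

print(standardize_py("Great movie!<br />Loved it."))  # great movie loved it
```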

**Comments**
<br> The second pre-processing step transforms the standardized strings into numbers (= neural network inputs). Each text is split into substrings (= words), and each word is mapped to an integer ***token*** (tokenization step). Only the most frequent words of the training set are kept: they form a dictionary called the ***vocabulary***, whose size is capped at 10,000 entries. Any word outside the vocabulary is replaced by a single "unknown" token. This size keeps high-interest words (verbs, adjectives, common nouns) and gets rid of rare words (names, etc.). After vectorization, every sequence has a fixed length of 1,000 tokens: if a text is shorter than 1,000 tokens, it is padded to reach this length; if it is longer, it is truncated. Both steps are handled by `TextVectorization`, the Keras layer designed specifically for text preprocessing, which transforms strings into sequences of integers (tokens) that the neural network can process.
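The vocabulary, padding and truncation rules can be sketched in plain Python on a toy corpus. This is an illustration under stated assumptions, not the layer's implementation: `build_vocabulary` and `vectorize` are hypothetical helpers, and index 0 (padding) and index 1 (unknown word) are reserved to mirror what `get_vocabulary()` returns in integer mode.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Keep the most frequent words; entries 0 and 1 are reserved."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common  # 0 = padding, 1 = unknown word

def vectorize(text, vocabulary, sequence_length):
    """Map words to integers, then truncate or pad to a fixed length."""
    index = {word: i for i, word in enumerate(vocabulary)}
    ids = [index.get(word, 1) for word in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

texts = ["good movie", "bad movie", "good good plot"]
vocab = build_vocabulary(texts, max_tokens=4)
print(vocab)                                  # ['', '[UNK]', 'good', 'movie']
print(vectorize("good bad movie", vocab, 5))  # [2, 1, 3, 0, 0]
```

In the toy output, "bad" falls outside the 4-entry vocabulary and becomes token 1, and the two trailing zeros are padding.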

In [None]:
max_features, sequence_length = 10000, 1000 # vocabulary size (number of tokens) and fixed length of each vectorized review
vectorize_layer = tf.keras.layers.TextVectorization(standardize=custom_standardization, max_tokens=max_features, output_mode='int', output_sequence_length=sequence_length) # transform strings into sequences of integers
train_text = raw_train_ds.map(lambda x, y: x) # train_text dataset that contains only the raw text, without the labels
vectorize_layer.adapt(train_text) # vectorization layer to learn the vocabulary from the texts in the training dataset

**Comments**
<br>

In [None]:
def vectorize_text(text, label): # vectorize data
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

text_batch, label_batch = next(iter(raw_train_ds))
first_review, first_label = text_batch[0], label_batch[0]
print("Review", first_review)
print("Label", raw_train_ds.class_names[first_label])
print("Vectorized review", vectorize_text(first_review, first_label))

Review tf.Tensor(b'Belmondo is a tough cop. He goes after a big-time drug dealer (played by Henry Silva, normally a great villain - see "Sharky\'s Machine"; but here he is clearly dubbed, and because of that he lacks his usual charisma). He goes to the scuzziest places of Paris and Marseilles, asks for some names, beats up some people, gets the names, goes to more scuzzy places, asks for more names, beats up more people, etc. The whole movie is punch after punch after punch. It seems that the people who made it had no other ambition than to create the French equivalent of "Dirty Harry". Belmondo, who was 50 here, does perform some good stunts at the beginning; apart from those, "Le Marginal" is a violent, episodic, trite, shallow and forgettable cop movie. (*1/2)', shape=(), dtype=string)
Label neg
Vectorized review (<tf.Tensor: shape=(1, 1000), dtype=int64, numpy=
array([[   1,    7,    4, 1233, 1021,   27,  261,  101,    4,    1, 1525,
        6992,  248,   32, 1488,    1, 1659,    4

In [None]:
print(vectorize_layer.get_vocabulary()[1287])
print('Vocabulary size: {}'.format(len(vectorize_layer.get_vocabulary())))

silent
Vocabulary size: 10000


In [None]:
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

In [None]:
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)

## 3. Build the neural network machine learning model and train it



**Comments**
<br>

In [None]:
embedding_dim = 16
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(max_features, embedding_dim))
model.add( tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.GlobalAveragePooling1D())
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
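At inference time, the layer stack above reduces to three operations: an embedding lookup, an average over the sequence, and a single sigmoid unit. The pure-Python sketch below illustrates that computation on toy weights. The function names are hypothetical, and it assumes that dropout is inactive at inference and that the pooling averages over every position, padding included (the embedding layer is created without masking).

```python
import math

def forward(token_ids, embeddings, weights, bias):
    """Embedding lookup -> average pooling -> dense sigmoid unit."""
    vectors = [embeddings[t] for t in token_ids]            # one vector per token
    dim = len(weights)
    pooled = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    logit = sum(w * p for w, p in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))                   # probability of a positive review

def count_parameters(vocab_size, embedding_dim):
    """Embedding table plus the weights and bias of the single dense unit."""
    return vocab_size * embedding_dim + embedding_dim + 1

toy_embeddings = [[0.0, 0.0], [1.0, -1.0], [2.0, 0.0]]      # 3 tokens, dimension 2
print(forward([1, 2, 0, 0], toy_embeddings, [1.0, 1.0], -0.5))  # 0.5
print(count_parameters(10000, 16))                              # 160017
```

With the notebook's settings (10,000 tokens, 16 dimensions), almost all of the trainable parameters sit in the embedding table.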

In [None]:
epochs = 10
model.compile(loss=tf.keras.losses.BinaryCrossentropy(), optimizer='adam', metrics=[tf.metrics.BinaryAccuracy(threshold=0.5)])
history = model.fit(train_ds, validation_data=val_ds, epochs=epochs)
model.summary()

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 15s 22ms/step - binary_accuracy: 0.5218 - loss: 0.6924 - val_binary_accuracy: 0.5088 - val_loss: 0.6867
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 10s 17ms/step - binary_accuracy: 0.5657 - loss: 0.6824 - val_binary_accuracy: 0.6122 - val_loss: 0.6685
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 15s 23ms/step - binary_accuracy: 0.6267 - loss: 0.6614 - val_binary_accuracy: 0.6212 - val_loss: 0.6382
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 14s 22ms/step - binary_accuracy: 0.6877 - loss: 0.6219 - val_binary_accuracy: 0.7618 - val_loss: 0.5859
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 17s 17ms/step - binary_accuracy: 0.7340 - loss: 0.5770 - val_binary_accuracy: 0.7830 - val_loss: 0.5416
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 21s 18ms/step - binary_accuracy: 0.7

## 4. Evaluate the model


In [None]:
loss, accuracy = model.evaluate(test_ds)
print(" --------------------------------------------- \n", round(100*accuracy, 1) , "% of the test set is correctly predicted \n", "---------------------------------------------\n")

782/782 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - binary_accuracy: 0.8482 - loss: 0.4000
 --------------------------------------------- 
 84.9 % of the test set is correctly predicted 
 ---------------------------------------------



**Comments**
<br> The accuracy and loss values recorded during training are read from `history`. Here, the objective is to study how the accuracy and the loss evolve with the epoch (= training time) on the training and validation sets. Below are 2 plots: loss vs. epoch and accuracy vs. epoch.
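As a reminder of what the two tracked quantities measure, here is a minimal pure-Python sketch of binary cross-entropy and binary accuracy. It is an illustration, not the Keras implementation: the clipping constant `eps` and the strict `>` comparison against the threshold are assumptions of this sketch.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[t*log(p) + (1-t)*log(1-p)] over the samples."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Share of samples whose thresholded prediction equals the label."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if int(p > threshold) == t)
    return hits / len(y_true)

print(round(binary_crossentropy([1, 0], [0.5, 0.5]), 4))        # 0.6931
print(binary_accuracy([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]))      # 0.5
```

A model that always outputs 0.5 has a loss of ln 2 ≈ 0.693, which is close to the loss reported during the first epoch, before the model has learned much.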

In [None]:
history_dict = history.history
history_dict.keys() # dictionary keys: ['binary_accuracy', 'loss', 'val_binary_accuracy', 'val_loss']
acc = history_dict['binary_accuracy']
val_acc = history_dict['val_binary_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

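fig = px.scatter(x=list(epochs), y=loss, labels={"x": "epoch", "y": "loss"}, width=600, height=400) # labels must be a dict of axis titles
fig.update_traces(name="training", showlegend=True) # name the trace for the legend
fig2 = px.line(x=list(epochs), y=val_loss)
fig2.update_traces(name="validation", showlegend=True)
fig.add_trace(fig2.data[0])
fig.show()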

In [None]:
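fig = px.scatter(x=list(epochs), y=acc, labels={"x": "epoch", "y": "accuracy"}, width=600, height=400)
fig.update_traces(name="training", showlegend=True) # px.scatter has no `name` argument, so the trace is named afterwards
fig2 = px.line(x=list(epochs), y=val_acc)
fig2.update_traces(name="validation", showlegend=True)
fig.add_trace(fig2.data[0])
fig.show()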

**Comments**
<br>

In [None]:
export_model = tf.keras.Sequential()
export_model.add(vectorize_layer)
export_model.add(model)

export_model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=False), optimizer="adam", metrics=['accuracy'])

export_model.evaluate(raw_test_ds, return_dict=True)

782/782 ━━━━━━━━━━━━━━━━━━━━ 6s 7ms/step - accuracy: 0.8478 - binary_accuracy: 0.0000e+00 - loss: 0.0000e+00


{'accuracy': 0.849120020866394, 'binary_accuracy': 0.0, 'loss': 0.0}

In [None]:
my_review = ["‘Robot Dreams’ is a dialogue-free animated masterpiece that resonates powerfully without words. Its storyline, exploring themes of loneliness and companionship, is deeply moving and lingers in your thoughts long after the credits roll have ended. With spectacular animation, this drama is both bittersweet and, at times, heart-wrenching. It skilfully navigates a delicate balance, offering moments of profound sadness alongside bursts of joy. Beneath its surface, the film is rich with metaphors that reflect on life's complexities."]

In [None]:
my_review = ["This movie was terrible and boring. Most of scenes were violents and useless."]

In [None]:
examples = tf.constant(my_review)

prediction = float(export_model.predict(examples)[0][0]) # sigmoid output = probability that the review is positive
if prediction < 0.5:
  print("The movie looks pretty bad. (probability =", round(100*(1-prediction), 1), "%)")
else:
  print("Great movie, go see it in the cinema! (probability =", round(100*prediction, 1), "%)")