<a href="https://colab.research.google.com/github/HereBeCode/StanceDetectionML-Brockport/blob/main/MultiClass_YT_Comment_Sentiment_Classification_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import matplotlib.pyplot as plt
import glob
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses

In [None]:
print(tf.__version__)

# Import Dataset
We can use the text_dataset_from_directory utility which expects a specific directory structure. Import your dataset as a zipped folder with the following structure:


    main_directory/
      ...class_a/
        ...a_text_1.txt
        ...a_text_2.txt
      ...class_b/
        ...b_text_1.txt
        ...b_text_2.txt


The zipped folder can be unzipped as follows:



In [None]:
!unzip "/content/AbortionData.zip" -d "/content"
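
To try the pipeline before uploading real data, the layout above can be reproduced with the standard library. This is a minimal sketch: the class names `class_a` and `class_b`, the file contents, and the temporary paths are placeholders taken from the diagram, not the real dataset, and `build_toy_archive` and `extract_archive` are hypothetical helpers.

```python
import tempfile
import zipfile
from pathlib import Path

def build_toy_archive(root: Path) -> Path:
    """Create main_directory/<class>/<file>.txt under root and zip it (placeholder data)."""
    samples = {
        "class_a": ["first example of class a", "second example of class a"],
        "class_b": ["first example of class b", "second example of class b"],
    }
    data_dir = root / "main_directory"
    for class_name, texts in samples.items():
        class_dir = data_dir / class_name
        class_dir.mkdir(parents=True)
        for i, text in enumerate(texts, start=1):
            (class_dir / f"{class_name[-1]}_text_{i}.txt").write_text(text)
    archive = root / "main_directory.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        for path in sorted(data_dir.rglob("*.txt")):
            # Store paths relative to root so the archive unpacks as main_directory/...
            zf.write(path, arcname=path.relative_to(root).as_posix())
    return archive

def extract_archive(archive: Path, dest: Path) -> list:
    """Unzip the archive (the Python equivalent of the unzip command) and list the text files."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return sorted(p.relative_to(dest).as_posix() for p in dest.rglob("*.txt"))

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    archive_path = build_toy_archive(tmp_path / "src")
    extracted_files = extract_archive(archive_path, tmp_path / "out")
    print(extracted_files)
```

Each subfolder name becomes a class name, so the folder names are the only place where labels are defined.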

Next, you will use the text_dataset_from_directory utility to create a labeled tf.data.Dataset. tf.data is a collection of tools for building input pipelines.

When running a machine learning experiment, it is best practice to divide your dataset into three splits: train, validation, and test.

The dataset is already divided into train and test folders, but it lacks a validation set. Let's create one using an 80:20 split of the training data by using the validation_split argument below.


In [None]:
batch_size = 32
seed = 42

raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    '/content/AbortionData/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed
)


As you can see above, of the samples available in the training folder, you will use 80% for training. As you will see in a moment, you can train a model by passing a dataset directly to model.fit. If you're new to tf.data, you can also iterate over the dataset and print out a few examples as follows.


In [None]:
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(3):
    print("Comment", text_batch.numpy()[i])
    print("Label", label_batch.numpy()[i])

Notice the comments contain raw text (with punctuation and possibly HTML tags like <br />). You will see how to handle these in the following section.

The labels are integers, one per class folder (0, 1, and 2 for the three-class model built later). To see which class each label corresponds to, you can check the class_names property on the dataset.


In [None]:
# Print every class, not only the first two, since this is a multi-class dataset.
for index, name in enumerate(raw_train_ds.class_names):
  print("Label", index, "corresponds to:", name)

Next, you will create a validation and test dataset. You will use the remaining 20% of the comments in the training folder for validation.

Note: When using the validation_split and subset arguments, make sure to either specify a random seed, or to pass shuffle=False, so that the validation and training splits have no overlap.
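
To see why the training and validation calls must agree, the split can be modeled in plain Python. The sketch below assumes the utility shuffles the file list with the seed and then reserves the final `validation_split` fraction for validation. That is a simplification, not the Keras source, and `split_indices` is a hypothetical helper.

```python
import random

def split_indices(n_samples, validation_split, subset, seed):
    """Simplified model of a seeded train/validation split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(validation_split * n_samples)
    if subset == "training":
        return indices[: n_samples - n_val]
    if subset == "validation":
        return indices[n_samples - n_val:]
    raise ValueError("subset must be 'training' or 'validation'")

n = 100
train = set(split_indices(n, 0.2, "training", seed=42))
val = set(split_indices(n, 0.2, "validation", seed=42))
consistent_overlap = train & val

# Mismatched fractions: a 90% training subset and a 20% validation subset share samples.
train_mismatched = set(split_indices(n, 0.1, "training", seed=42))
mismatched_overlap = train_mismatched & val

# A different seed shuffles differently, so the two subsets are no longer complementary.
val_other_seed = set(split_indices(n, 0.2, "validation", seed=7))
print(len(consistent_overlap), len(mismatched_overlap), len(train & val_other_seed))
```

Under this model, the same seed and the same fraction give disjoint subsets, while changing either one in only one of the two calls lets validation samples leak into training.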

In [None]:
raw_val_ds = tf.keras.utils.text_dataset_from_directory(
    '/content/AbortionData/train',
    validation_split=0.2,
    subset='validation',
    seed=seed
)

In [None]:
raw_test_ds = tf.keras.utils.text_dataset_from_directory(
    '/content/AbortionData/test',
    batch_size=batch_size
)

In [None]:
raw_test_ds_eval = tf.keras.utils.text_dataset_from_directory(
    '/content/AbortionData/test'
)

# Prepare the dataset for training
Next, you will standardize, tokenize, and vectorize the data using the helpful tf.keras.layers.TextVectorization layer.

Standardization refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset. Tokenization refers to splitting strings into tokens (for example, splitting a sentence into individual words, by splitting on whitespace). Vectorization refers to converting tokens into numbers so they can be fed into a neural network. All of these tasks can be accomplished with this layer.
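
The three steps can be sketched without TensorFlow to make the terminology concrete. This is an illustration only: the helper names and the tiny vocabulary are made up, and index 1 is used for unknown words by analogy with the layer's out-of-vocabulary token.

```python
import re
import string

def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())

def tokenize(text):
    """Split a standardized string into tokens on whitespace."""
    return text.split()

def vectorize(tokens, vocabulary):
    """Map each token to its integer index; unknown tokens map to 1."""
    return [vocabulary.get(token, 1) for token in tokens]

# Hypothetical vocabulary; a real one is learned from the training data.
toy_vocabulary = {"this": 2, "is": 3, "a": 4, "comment": 5}

clean = standardize("This is a Comment, really!")
tokens = tokenize(clean)
ids = vectorize(tokens, toy_vocabulary)
print(clean)
print(tokens)
print(ids)
```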

As you saw above, the comments may contain HTML tags like <br />. These tags will not be removed by the default standardizer in the TextVectorization layer (which converts text to lowercase and strips punctuation by default, but doesn't strip HTML). You will write a custom standardization function to remove the HTML.

Note: To prevent training-testing skew (also known as training-serving skew), it is important to preprocess the data identically at train and test time. To facilitate this, the TextVectorization layer can be included directly inside your model, as shown later in this tutorial.

In [None]:
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html, '[%s]' % re.escape(string.punctuation), '')

Next, you will create a TextVectorization layer. You will use this layer to standardize, tokenize, and vectorize our data. You set the output_mode to int to create unique integer indices for each token.

Note that you're using the default split function, and the custom standardization function you defined above. You'll also define some constants for the model, like an explicit maximum sequence_length, which will cause the layer to pad or truncate sequences to exactly sequence_length values.


In [None]:
max_features = 10000
sequence_length = 250

vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length
)
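
The effect of output_sequence_length can be pictured with a small plain-Python helper. This is a sketch, assuming padding is appended at the end with the value 0; `pad_or_truncate` is a hypothetical name, not part of Keras.

```python
def pad_or_truncate(token_ids, sequence_length, pad_value=0):
    """Return exactly sequence_length ids: cut long inputs, right-pad short ones."""
    clipped = token_ids[:sequence_length]
    return clipped + [pad_value] * (sequence_length - len(clipped))

short = pad_or_truncate([7, 8, 9], 5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], 5)
print(short)
print(long)
```

With sequence_length = 250, comments shorter than 250 tokens are mostly padding and longer ones lose their tail, so the value is a trade-off between memory use and truncation.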

Next, you will call adapt to fit the state of the preprocessing layer to the dataset. This will cause the layer to build an index of strings to integers.

Note: It's important to only use your training data when calling adapt (using the test set would leak information).


In [None]:
#Make a text-only dataset (without labels), then call adapt
train_text = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(train_text)
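
Conceptually, adapt counts tokens in the training text and keeps the most frequent ones. The sketch below is a plain-Python approximation, assuming index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, as in the layer's int mode. The alphabetical tie-breaking and the sample sentences are choices made for this illustration only.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Return the reserved entries followed by the most frequent tokens."""
    counts = Counter(token for text in texts for token in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda token: (-counts[token], token))
    return ["", "[UNK]"] + ranked[: max_tokens - 2]

# Placeholder training sentences, not taken from the dataset.
train_texts = ["the law is the law", "the court ruled", "the law and court"]
vocabulary = build_vocabulary(train_texts, max_tokens=5)
token_to_id = {token: i for i, token in enumerate(vocabulary)}
print(vocabulary)
print(token_to_id.get("ruled", 1))
```

Words that fall outside the max_tokens budget, or that appear only in the validation or test data, all share the out-of-vocabulary index.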

Let's create a function to see the result of using this layer to preprocess some data.

In [None]:
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

In [None]:
text_batch, label_batch = next(iter(raw_train_ds))
first_comment, first_label = text_batch[0], label_batch[0]
print("Comment", first_comment)
print("Label", raw_train_ds.class_names[first_label])
print("Vectorized comment", vectorize_text(first_comment, first_label))

As you can see above, each token has been replaced by an integer. You can look up the token (string) that each integer corresponds to by calling .get_vocabulary() on the layer.

In [None]:
print("1287 ---> ",vectorize_layer.get_vocabulary()[1287])
print(" 313 ---> ",vectorize_layer.get_vocabulary()[313])
print("Vocabulary size: {}".format(len(vectorize_layer.get_vocabulary())))

You are nearly ready to train your model. As a final preprocessing step, you will apply the TextVectorization layer you created earlier to the train, validation, and test dataset.

In [None]:
print(vectorize_layer.get_vocabulary())

In [None]:
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

# Configure the dataset for performance
These are two important methods you should use when loading data to make sure that I/O does not become blocking.

.cache() keeps data in memory after it's loaded off disk. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.

.prefetch() overlaps data preprocessing and model execution while training.

You can learn more about both methods, as well as how to cache data to disk in the data performance guide.

In [None]:
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)
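
The ideas behind the two methods can be imitated in plain Python. The sketch below is a rough analogy, not how tf.data is implemented: `cached` replays batches from memory after the first load, and `prefetch` lets a background thread prepare the next batches while the consumer works on the current one. All names and data here are hypothetical.

```python
import queue
import threading

def prefetch(iterable, buffer_size=2):
    """Yield items while a background thread fills a small buffer ahead of the consumer."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for item in iterable:
            buffer.put(item)
        buffer.put(sentinel)  # signals that the source is exhausted

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item

def cached(load_fn):
    """Call load_fn once, then replay its result from memory on later passes."""
    store = []

    def iterate():
        if not store:
            store.extend(load_fn())
        return iter(store)

    return iterate

load_calls = []

def load_batches():
    load_calls.append(1)  # count how often the slow load actually runs
    return [[1, 2], [3, 4], [5, 6]]

batches = cached(load_batches)
first_pass = list(prefetch(batches()))
second_pass = list(prefetch(batches()))
print(first_pass == second_pass, len(load_calls))
```

The second pass never calls `load_batches`, which mirrors why later epochs are faster than the first one once a dataset is cached.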


# Create the model
It's time to create your neural network:

In [None]:
VOCAB_SIZE = len(vectorize_layer.get_vocabulary())
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(3)
])

In [None]:
model.summary()
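
The parameter counts reported by the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers with biases. The vocabulary size of 10000 is an assumption for illustration; the real value is VOCAB_SIZE, which may be smaller.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    return input_dim * units + units

def count_params(vocab_size):
    """Sum trainable parameters for the layer stack defined above (Dropout has none)."""
    total = embedding_params(vocab_size, 32)
    total += 2 * lstm_params(32, 32)   # Bidirectional LSTM(32): forward and backward copies
    total += 2 * lstm_params(64, 16)   # input is 64 because both directions are concatenated
    total += dense_params(32, 128)     # input is 32 = 2 * 16 from the second BiLSTM
    total += dense_params(128, 64)
    total += dense_params(64, 32)
    total += dense_params(32, 3)       # three output logits, one per class
    return total

example_total = count_params(vocab_size=10000)
print(example_total)
```

With a vocabulary of this size, the embedding table accounts for most of the parameters.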

# Loss function and optimizer
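A model needs a loss function and an optimizer for training. Since this is a multi-class classification problem and the model outputs raw logits (a final Dense layer with three units and no activation), you'll use the losses.SparseCategoricalCrossentropy loss function with from_logits=True. The sparse variant expects integer labels, which is what text_dataset_from_directory produces by default.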

Now, configure the model to use an optimizer and a loss function:




In [None]:
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=['accuracy']
)
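
What the loss computes for a single example can be written out in plain Python. This is a minimal sketch of softmax followed by the negative log-probability of the true class. The function names and logit values are made up, and the Keras implementation additionally averages over the batch.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(logits, label):
    """Negative log-probability of the integer label, computed from raw logits."""
    return -math.log(softmax(logits)[label])

uniform_loss = sparse_categorical_crossentropy([0.0, 0.0, 0.0], label=2)
confident_loss = sparse_categorical_crossentropy([0.1, 0.2, 5.0], label=2)
wrong_loss = sparse_categorical_crossentropy([0.1, 0.2, 5.0], label=0)
print(uniform_loss, confident_loss, wrong_loss)
```

A model that assigns equal scores to all three classes has a loss of ln(3), about 1.10, which is a useful baseline when reading the training log.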

# Train the model
You will train the model by passing the dataset object to the fit method.

In [None]:
epochs = 5
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs
)

# Evaluate the model
Let's see how the model performs. Two values will be returned: loss (a number that represents the error; lower values are better) and accuracy.

In [None]:
loss, accuracy = model.evaluate(test_ds)
print("Loss: ", loss)
print("Accuracy: ", accuracy)

In [None]:
predictions = model.predict(test_ds)

In [None]:
print(predictions)

In [None]:
import numpy as np
# Pick the highest-scoring class for each row of logits.
prediction_labels = np.argmax(predictions, axis=1).tolist()

In [None]:
print(prediction_labels)

In [None]:
actual_labels = []
for element in test_ds:
  for entry in element[1]:
    actual_labels.append(tf.get_static_value(entry))

In [None]:
print(actual_labels)

In [None]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
accuracy = accuracy_score(actual_labels, prediction_labels)
precision = precision_score(actual_labels, prediction_labels, average="macro")
recall = recall_score(actual_labels, prediction_labels, average="macro")
# Store the result under a new name so the imported f1_score function is not overwritten.
f1 = f1_score(actual_labels, prediction_labels, average="macro")

In [None]:
print("Accuracy: " + str(accuracy))
print("Precision: " + str(precision))
print("Recall: " + str(recall))
print("F1 Score: " + str(f1))
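
For reference, macro averaging computes each metric per class and then takes the unweighted mean, so small classes count as much as large ones. The sketch below reimplements that idea in plain Python on made-up labels. `macro_scores` is a hypothetical helper, and undefined ratios are treated as 0.

```python
def macro_scores(actual, predicted):
    """Return macro-averaged (precision, recall, F1) over all classes seen."""
    classes = sorted(set(actual) | set(predicted))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(actual, predicted))
        tp = sum(1 for a, p in pairs if a == c and p == c)
        fp = sum(1 for a, p in pairs if a != c and p == c)
        fn = sum(1 for a, p in pairs if a == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy labels for three classes, not results from the model.
toy_actual = [0, 0, 1, 1, 2, 2]
toy_predicted = [0, 1, 1, 1, 2, 0]
macro_precision, macro_recall, macro_f1 = macro_scores(toy_actual, toy_predicted)
print(macro_precision, macro_recall, macro_f1)
```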

In [None]:
model.save('drive/MyDrive/Models/BiLSTM_Abortion_Stance_Detect')

In [None]:
shutil.rmtree('./AbortionData')