# **Twitter Sentiment Classification Project**

- [1. Introduction](#1-introduction)
  - [Recurrent Neural Networks](#recurrent-neural-networks)
  - [High Level Steps](#high-level-steps)

---

## **1. Introduction**

In this project, we tackle the **binary classification** of **Twitter messages** into **positive** and **negative** categories. We are provided with two separate directories containing labeled tweets:

- Directory **0**: Negative tweets
- Directory **1**: Positive tweets

Rather than broadly comparing several model families, which tends to produce generic solutions, we explore a single family in depth. Through a series of experiments, we examine how specific architectural choices within that family affect performance.

### **Recurrent Neural Networks**

We chose to experiment with Recurrent Neural Networks (RNNs) for the sentiment analysis task. RNNs are well suited to sequential data because they maintain a hidden state that summarizes the inputs seen so far. Tweets are short, so the sentiment usually has to be inferred from a limited context. An RNN handles this by processing one token at a time and carrying contextual information forward from earlier tokens, which lets it pick up patterns that signal positive or negative sentiment.

Although Transformers are the state of the art for many NLP tasks, we deliberately chose not to use them here. Their self-attention mechanism is designed to model long-range dependencies and contextual relationships across long inputs, and much of that capacity may go unused on texts as short as tweets. Transformers are also computationally intensive, both for training and for inference. For a simple binary classification task, this added complexity can cost resources without a proportional gain in performance.

By choosing RNNs, we focus on a model that is both resource-efficient and well-matched to the specific demands of analyzing short texts, ensuring effective results without unnecessary computational overhead.
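To make the hidden-state idea concrete, here is a minimal sketch of a single-unit vanilla RNN in plain Python. It is not the model used in this notebook: the weights `w_x`, `w_h`, `b` and the input values are made up for illustration, and real layers operate on vectors rather than scalars.

```python
import math


def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One step of a single-unit vanilla RNN: h_t = tanh(w_x*x_t + w_h*h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)


def run_rnn(inputs, w_x=1.0, w_h=0.5, b=0.0):
    """Process a sequence one element at a time, carrying the hidden state forward."""
    h = 0.0  # initial hidden state
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states


# Hypothetical per-token inputs; in a real model these would be embedding vectors.
sequence = [0.2, -0.1, 0.9]
states = run_rnn(sequence)
print(states)
```

Because each state depends on the previous one, feeding the same values in a different order generally produces a different final state, which is what lets the network react to word order.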

### **High Level Steps**

From a high level, the project will follow these steps:

1. Load and preprocess the text data.
2. Develop and evaluate a baseline RNN model with a simpler architecture.
3. Starting from the baseline model, develop and evaluate a more complex architecture.
4. Check how the model behaves on **personal tweets** as an informal test.

---


In [68]:
# Dependencies and some utility functions for the notebook
import os
import logging
import warnings
import math
import random
import numpy as np
import tensorflow as tf
import pandas as pd
from rich.table import Table
from rich.console import Console
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from tensorflow.keras import layers, models, preprocessing, callbacks
import keras_tuner as kt

# Whether to enable debug mode
DEBUG = True

console = Console()

if not DEBUG:
    # Silence Python warnings in this process and in any subprocesses
    warnings.filterwarnings("ignore")
    os.environ["PYTHONWARNINGS"] = "ignore"
    # Reduce TensorFlow log output. Note: TF_CPP_MIN_LOG_LEVEL is usually read when
    # TensorFlow is first imported, so setting it here may not affect the C++ logs
    # of the current session.
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
    tf.get_logger().setLevel(logging.ERROR)

else:
    # Check TensorFlow version
    console.print("TensorFlow version:", tf.__version__)

    # List physical devices
    physical_devices = tf.config.list_physical_devices("GPU")
    console.print("GPUs available:", len(physical_devices))

    if physical_devices:
        for gpu in physical_devices:
            console.print("GPU:", gpu.name)
    else:
        console.print("No GPU detected.")


def create_rich_table(data, headers, title="Table"):
    table = Table(title=title, show_lines=True)

    # Add headers to the table
    for header in headers:
        table.add_column(header, justify="center")

    # Add rows to the table
    for row in data:
        table.add_row(*map(str, row))

    return table


def print_best_hyperparameters(best_hp, model_name):
    best_hp_table_data = [(key, best_hp.get(key)) for key in best_hp.values]
    best_hyper_params_table = create_rich_table(
        best_hp_table_data,
        headers=["Hyperparameter", "Value"],
        title="Best Hyperparameters for " + model_name,
    )

    console.print(best_hyper_params_table)


def evaluate_model_and_print_results(model, model_name, test_ds):

    # Get predictions and true labels
    y_true = []
    y_pred = []

    for x_batch, y_batch in test_ds:
        preds = model.predict(x_batch, verbose=0)
        y_true.extend(y_batch.numpy())
        y_pred.extend(
            (preds > 0.5).astype(int).flatten()
        )  # Convert probabilities to binary labels

    # Calculate evaluation metrics
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)

    # Prepare the results as a table
    eval_table_data = [
        ("Accuracy", accuracy),
        ("Precision", precision),
        ("Recall", recall),
        ("F1-Score", f1),
    ]

    evaluation_table = create_rich_table(
        eval_table_data,
        headers=["Metric", "Result"],
        title="Evaluation Results for " + model_name,
    )

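    console.print(evaluation_table)

    # Return the metrics so callers can keep them (e.g. for later comparison)
    return dict(eval_table_data)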


# Plot metrics from training history
def plot_training_history(history):
    history_df = pd.DataFrame(history.history)

    plt.figure(figsize=(10, 6))
    for metric in history_df.columns:
        plt.plot(history_df[metric], label=metric)

    plt.title("Training Metrics over Epochs")
    plt.xlabel("Epochs")
    plt.ylabel("Value")
    plt.legend()
    plt.grid()
    plt.show()

## **2. Dataset Loading and Preprocessing**

In this section, we focus on loading the dataset, preprocessing the text, and applying vectorization. Our goal is to maintain a modular approach throughout the project to ensure reusability and adaptability of functions, particularly for experimenting with different hyperparameter configurations (e.g., `max_tokens`, `output_sequence_length`).

**Steps Involved**

1. Dataset Loading
2. Creating a `TextVectorization` layer, adaptable to various text preprocessing configurations
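As a rough illustration of what integer-mode vectorization does with `max_tokens` and `output_sequence_length`, the sketch below rebuilds the idea in plain Python. It assumes the usual Keras convention that index 0 is padding and index 1 is the out-of-vocabulary token, both counted within `max_tokens`. The helper names `build_vocabulary` and `vectorize` are hypothetical, and the tie-breaking between equally frequent tokens is an arbitrary choice here, not necessarily what Keras does.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Keep the most frequent tokens; indices 0 (padding) and 1 (OOV) are reserved."""
    counts = Counter(token for text in texts for token in text.split())
    # Sort by descending frequency, then alphabetically so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    kept = [token for token, _ in ranked[: max(max_tokens - 2, 0)]]
    return {token: index + 2 for index, token in enumerate(kept)}


def vectorize(text, vocabulary, sequence_length):
    """Map tokens to integer ids, then truncate or pad to a fixed length."""
    ids = [vocabulary.get(token, 1) for token in text.split()]  # 1 = out of vocabulary
    ids = ids[:sequence_length]  # truncate long inputs
    return ids + [0] * (sequence_length - len(ids))  # 0 = padding


corpus = ["good day", "good game", "bad day"]
vocab = build_vocabulary(corpus, max_tokens=4)
print(vocab)
print(vectorize("good day today", vocab, sequence_length=5))
```

A smaller `max_tokens` pushes more rare words into the shared out-of-vocabulary id, while a smaller sequence length truncates more of each tweet; these are the trade-offs the hyperparameter search explores.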

The dataset is loaded with the following configuration:

- Batch Size: 32. We opted for a memory-efficient approach by using a small batch size. Since RNNs keep activations for every time step of every sequence in a batch, larger batches would have increased memory usage noticeably. Additionally, because the dataset is only moderately large, a small batch size gives more gradient updates per epoch.

- Splits: 70% for training, 20% for validation and 10% for testing
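The split is performed on batches rather than on individual tweets, so the realized proportions are only approximately 20% and 10%. The sketch below reproduces that arithmetic under the assumption that the 30% holdout is divided by taking the first share of batches for validation and the rest for testing; `split_sizes` is a hypothetical helper, and 44995 is the holdout size reported by the loader for this dataset.

```python
import math


def split_sizes(n_files, batch_size=32, validation_split=0.2, test_split=0.1):
    """Return (holdout batches, validation batches, test batches)."""
    # The last batch may be smaller than batch_size, hence the ceiling.
    n_batches = math.ceil(n_files / batch_size)
    val_share = validation_split / (validation_split + test_split)
    val_batches = math.floor(val_share * n_batches)
    return n_batches, val_batches, n_batches - val_batches


n_batches, val_batches, test_batches = split_sizes(44995)
print(n_batches, val_batches, test_batches)
```

Since `val_share` is computed in floating point and then floored, the boundary can shift by one batch in either direction, which is negligible at this dataset size.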

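For the `TextVectorization` layer we use a custom standardization function tailored to Twitter messages and to sentiment analysis. In particular, the following rules are applied:

1. Convert to lowercase
2. Remove URLs (tokens starting with `http`)
3. Remove mentions (e.g., `@username`)
4. Strip the `#` symbol from hashtags while keeping the word (e.g., `#happy` becomes `happy`)
5. Remove single periods while preserving sequences of two or more dots as `...`
6. Keep only alphanumeric characters, whitespace, and specific punctuation (`!`, `?`, `...`) that is useful for sentiment analysis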
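The standardization can be tried out without TensorFlow. The following is a sketch using Python's `re` module that mirrors the steps of the notebook's `custom_standardize`; `standardize_tweet` is a hypothetical name, and Python's regex engine is not identical to the one behind `tf.strings.regex_replace` (for example, `\w` matches Unicode word characters in Python), so edge cases may differ.

```python
import re


def standardize_tweet(text):
    """Approximate the tweet standardization steps with plain Python regexes."""
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)  # drop mentions
    text = text.replace("#", "")  # keep the hashtag word, drop '#'
    text = re.sub(r"\.{2,}", "<MULTI_DOT>", text)  # protect runs of dots
    text = text.replace(".", "")  # drop single periods
    text = text.replace("<MULTI_DOT>", "...")  # restore runs of dots as '...'
    # Keep only lowercase letters, digits, whitespace, '!', '?' and '.'
    return re.sub(r"[^a-z0-9\s!?.]", "", text)


example = "So #Happy today.... thanks @Bob! http://t.co/abc"
print(standardize_tweet(example).split())
```

Removed URLs and mentions leave extra spaces behind, which is harmless because the vectorization layer splits on whitespace afterwards.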


In [None]:
DATASET_DIR = "../TwitterParsed"


def load_dataset(
    data_dir, batch_size=32, validation_split=0.2, test_split=0.1, seed=42
):
    # Training dataset
    train_ds = preprocessing.text_dataset_from_directory(
        data_dir,
        batch_size=batch_size,
        validation_split=validation_split + test_split,  # Total non-training data
        subset="training",
        seed=seed,
    )

    # Split with both validation and test sets
    val_and_test_ds = preprocessing.text_dataset_from_directory(
        data_dir,
        batch_size=batch_size,
        validation_split=validation_split + test_split,
        subset="validation",
        seed=seed,
    )

    # Further split into validation and test sets
    val_size = math.floor(
        (validation_split / (validation_split + test_split)) * len(val_and_test_ds)
    )

    val_ds = val_and_test_ds.take(val_size)
    test_ds = val_and_test_ds.skip(val_size)

    # Cache and prefetch datasets for better performance
    AUTOTUNE = tf.data.AUTOTUNE

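    train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
    val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
    # The test set is iterated several times (summary table, evaluation), so cache it too
    test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)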

    return train_ds, val_ds, test_ds


def create_vectorization_layer(max_tokens=10000, sequence_length=100):

    # Custom standardization function tailored for tweets
    def custom_standardize(input_text):
        # Lowercase the text
        lowercase_text = tf.strings.lower(input_text)
        # Remove URLs
        text_without_urls = tf.strings.regex_replace(lowercase_text, r"http\S+", " ")
        # Remove mentions (e.g., @username)
        text_without_mentions = tf.strings.regex_replace(
            text_without_urls, r"@\w+", " "
        )
        # Replace hashtags with just the word (e.g., #happy -> happy)
        text_without_hashtags = tf.strings.regex_replace(
            text_without_mentions, r"#", ""
        )
        # Replace two or more dots with a placeholder
        text_with_dots_preserved = tf.strings.regex_replace(
            text_without_hashtags, r"\.{2,}", "<MULTI_DOT>"
        )
        # Remove all single periods
        text_without_single_dots = tf.strings.regex_replace(
            text_with_dots_preserved, r"\.", ""
        )
        # Restore the multi-dot sequences
        text_with_restored_dots = tf.strings.regex_replace(
            text_without_single_dots, r"<MULTI_DOT>", "..."
        )
        # Keep only alphanumeric, spaces, and specific punctuation (!, ?, .)
        # Inside a character class a single "." is enough to allow dots
        cleaned_text = tf.strings.regex_replace(
            text_with_restored_dots, r"[^a-z0-9\s!?.]", ""
        )
        return cleaned_text

    # Create the TextVectorization layer
    vectorizer = layers.TextVectorization(
        max_tokens=max_tokens,  # Vocabulary size
        output_mode="int",  # Map tokens to integers
        output_sequence_length=sequence_length,  # Pad/Truncate to sequence length
        standardize=custom_standardize,  # Use the custom standardization logic
    )

    return vectorizer


# Count dataset samples
def count_samples(dataset):
    return sum(1 for _ in dataset.unbatch())


def compute_class_distribution(dataset):
    neg_count = 0
    pos_count = 0
    for _, label in dataset.unbatch():
        if label.numpy() == 0:
            neg_count += 1
        else:
            pos_count += 1
    total = neg_count + pos_count
    neg_percent = (neg_count / total) * 100 if total > 0 else 0
    pos_percent = (pos_count / total) * 100 if total > 0 else 0
    return neg_count, pos_count, neg_percent, pos_percent


def create_dataset_summary_table(train_ds, val_ds, test_ds):
    train_count = count_samples(train_ds)
    val_count = count_samples(val_ds)
    test_count = count_samples(test_ds)
    total_count = train_count + val_count + test_count

    train_neg, train_pos, train_neg_percent, train_pos_percent = (
        compute_class_distribution(train_ds)
    )
    val_neg, val_pos, val_neg_percent, val_pos_percent = compute_class_distribution(
        val_ds
    )
    test_neg, test_pos, test_neg_percent, test_pos_percent = compute_class_distribution(
        test_ds
    )

    # Prepare the data for the table
    data = [
        [
            "Training",
            train_count,
            f"{train_neg_percent:.2f}%",
            f"{train_pos_percent:.2f}%",
        ],
        ["Validation", val_count, f"{val_neg_percent:.2f}%", f"{val_pos_percent:.2f}%"],
        ["Testing", test_count, f"{test_neg_percent:.2f}%", f"{test_pos_percent:.2f}%"],
        [
            "Total",
            total_count,
            f"{((train_neg + val_neg + test_neg) / total_count) * 100:.2f}%",
            f"{((train_pos + val_pos + test_pos) / total_count) * 100:.2f}%",
        ],
    ]

    # Headers for the table
    headers = ["Dataset", "Number of Tweets", "Negative %", "Positive %"]

    # Create a rich table
    rich_table = create_rich_table(data, headers, title="Dataset Summary")

    return rich_table


# Load dataset in training (70%), validation (20%), and test (10%) sets
train_ds, val_ds, test_ds = load_dataset(DATASET_DIR)
summary_table = create_dataset_summary_table(train_ds, val_ds, test_ds)
console.print(summary_table)

Found 149985 files belonging to 2 classes.
Using 104990 files for training.
Found 149985 files belonging to 2 classes.
Using 44995 files for validation.




## **3. Baseline RNN using LSTM**

We will begin by implementing a baseline LSTM model with a straightforward architecture. This approach allows us to evaluate how a simpler design performs without the computational and memory overhead of more complex models. LSTMs (Long Short-Term Memory networks) leverage memory cells to manage long-term dependencies by maintaining an internal state in addition to the hidden state, making them more effective than basic recurrent layers for capturing sequential patterns.

Starting with this baseline, we can later compare its performance to that of more sophisticated architectures. This comparison will help us assess how much advanced techniques and increased complexity improve performance, and at what cost in terms of computation and memory.
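The difference between the cell state and the hidden state can be illustrated with a single-unit LSTM step in plain Python. This is a simplified sketch of the standard LSTM equations, not the Keras implementation: the `params` dictionary, its weight values, and the input sequence are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a single-unit LSTM. `params` maps gate name -> (w_x, w_h, b)."""

    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x_t + w_h * h_prev + b)

    f = gate("forget", sigmoid)  # how much of the old cell state to keep
    i = gate("input", sigmoid)  # how much new information to write
    g = gate("candidate", math.tanh)  # proposed new content
    o = gate("output", sigmoid)  # how much of the cell state to expose

    c_t = f * c_prev + i * g  # internal (cell) state
    h_t = o * math.tanh(c_t)  # hidden state passed to the next step
    return h_t, c_t


gate_names = ("forget", "input", "candidate", "output")
params = {name: (1.0, 0.5, 0.0) for name in gate_names}

h, c = 0.0, 0.0
for x_t in [0.2, -0.1, 0.9]:
    h, c = lstm_step(x_t, h, c, params)
    print(round(h, 4), round(c, 4))
```

When the forget gate stays close to 1 and the input gate close to 0, the cell state is carried over almost unchanged, which is how an LSTM can preserve information across many steps.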


In [69]:
MODEL_NAME = "simple_rnn"
TUNING_DIR = "../Tuning"

VOCAB_SIZES = [5000, 10000]
MAX_SEQ_LENGTHS = [50, 100]

EMBEDDING_DIMS = [128, 256]

# ---------------------------------------------------------

# Reset the Keras state and fix the random seeds so that repeated runs are comparable
tf.keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)
# ---------------------------------------------------------


# Build model for Keras Tuner
def build_model(hp):
    """
    Build the baseline model: a single bidirectional LSTM layer followed by
    a dense layer, with hyperparameters exposed for tuning via Keras Tuner.
    """
    # 1. Add the vectorization layer with the choice of max_tokens and output_sequence_length
    hp_max_tokens = hp.Choice("max_tokens", values=VOCAB_SIZES)
    hp_out_seq_len = hp.Choice("output_sequence_length", values=MAX_SEQ_LENGTHS)

    vectorization_layer = create_vectorization_layer(
        max_tokens=hp_max_tokens, sequence_length=hp_out_seq_len
    )
    train_text = train_ds.map(lambda x, y: x)  # remove labels
    vectorization_layer.adapt(train_text)

    model = models.Sequential()

    model.add(vectorization_layer)

    # 2. Add the embedding layer, with the choice of embedding dimension
    embedding_dim = hp.Choice("embedding_dim", EMBEDDING_DIMS)
    model.add(layers.Embedding(input_dim=hp_max_tokens, output_dim=embedding_dim))

    # 3. Add a Bidirectional LSTM layer with the choice of number of units, recurrent_initializer and recurrent_dropout
    model.add(
        layers.Bidirectional(
            layers.LSTM(
                units=hp.Int("rnn_units", min_value=32, max_value=128, step=16),
                activation="tanh",
                recurrent_initializer=hp.Choice(
                    "rnn_recurrent_initializer", ["orthogonal", "glorot_uniform"]
                ),
                recurrent_dropout=hp.Float(
                    "rnn_recurrent_dropout",
                    min_value=0.0,
                    max_value=0.3,
                    step=0.1,
                ),
                return_sequences=False,
            )
        )
    )

    # 4. Add a Dense layer with the choice of number of units and kernel_initializer
    model.add(
        layers.Dense(
            units=hp.Int("dense_units", min_value=64, max_value=256, step=32),
            activation="relu",
            kernel_initializer=hp.Choice(
                "kernel_initializer", ["he_normal", "glorot_uniform"]
            ),
        )
    )

    # 5. Final output layer for binary classification
    model.add(layers.Dense(1, activation="sigmoid"))

    # 6. Compile the model
    model.compile(
        loss="binary_crossentropy",
        optimizer="adam",
        metrics=["accuracy"],
    )

    return model


# ---------------------------------------------------------

# Tuner configuration for optimal hyperparameters
tuner = kt.RandomSearch(
    build_model,
    objective="val_accuracy",
    max_trials=10,
    overwrite=True,
    directory=TUNING_DIR,
    project_name="simple_rnn_tuning",
    seed=42,
)

# Start the search for the best hyperparameters
tuner.search(
    train_ds,
    validation_data=val_ds,
    verbose=DEBUG,
    epochs=20,
    callbacks=[callbacks.EarlyStopping(patience=2)],
)


best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]

# Train the model with the best hyperparameters
best_model = tuner.hypermodel.build(best_hp)
history = best_model.fit(
    train_ds,
    validation_data=val_ds,
    verbose=DEBUG,
    epochs=50,
    callbacks=[callbacks.EarlyStopping(patience=4)],
)

# Print best hyperparameters
print_best_hyperparameters(best_hp, MODEL_NAME)

# Evaluate the best model on the test set and print evaluation results
eval_results = evaluate_model_and_print_results(best_model, MODEL_NAME, test_ds)

# Plot training history
plot_training_history(history)




Search: Running Trial #1

Value             |Best Value So Far |Hyperparameter
10000             |10000             |max_tokens
50                |50                |output_sequence_length
128               |128               |embedding_dim
48                |48                |rnn_units
glorot_uniform    |glorot_uniform    |rnn_recurrent_initializer
0.1               |0.1               |rnn_recurrent_dropout
160               |160               |dense_units
he_normal         |he_normal         |kernel_initializer

Epoch 1/20




 219/3281 ━━━━━━━━━━━━━━━━━━━━ 3:48:16 4s/step - accuracy: 0.5786 - loss: 0.6642

KeyboardInterrupt: 