# Natural Language Processing Lab

The objective of this task is to run several experiments on representing tweets and classifying them by 3-class sentiment polarity (positive, neutral, and negative). We work with the same corpus used for the previous task, created for the [TASS 2020](http://www.sepln.org/workshops/tass/) competition (IberLEF - SEPLN). The machine learning models are compared by their Macro-F1 results on the test set, and our results are then compared against [pysentimiento](https://github.com/pysentimiento/pysentimiento), a state-of-the-art Python library for Spanish sentiment analysis.


Feature extraction tools used:

- Word embeddings: mean vector, concatenation vector, and context-augmented versions of both (3 extra values appended to each vector, indicating the class of the tweet).

Machine learning models explored:

- MLP
- SVM
- Logistic Regression
- Naive Bayes
- LSTM Neural networks


## 1. Data Loading and Preprocessing
We load the tweets (train, dev, test) and a lexicon of positive and negative words, then preprocess the tweets: remove mentions and URLs, unify laughter patterns, replace insults, remove accents, convert to lowercase, and remove stopwords.

In this iteration of preprocessing, mentions and URLs are replaced with an empty string (`""`). For hashtags, only the `#` symbol is removed; the tag is not replaced with the word "HASHTAG". Tweets are converted to lowercase, and stopwords, numbers, and accents are removed, as they are deemed unnecessary for this objective.

Swear words are replaced with the word "insulto", which is included in the negative lexicon, so that negative statements are emphasized more effectively. Laughter patterns are replaced with "jajaja" rather than "jaja", because "jajaja" is included in the positive lexicon and "jaja" is not.

Accents are again removed to standardize the tweets, since they are often used inconsistently; this maximizes the number of lexicon words recognized in the tweets. Unlike in Task 1, no syntactic analysis is performed, so accents and capitalization are less relevant.
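
The steps described above could be sketched roughly as follows. This is a minimal illustration, not the `Preprocessing` class used in this notebook: the function name, the regular expressions, and the tiny `SWEAR_WORDS` and `STOP_WORDS` sets are assumptions made for the example.

```python
import re

# Hypothetical mini-lists for illustration only; the notebook loads its own files.
SWEAR_WORDS = {"idiota", "imbecil"}
STOP_WORDS = {"de", "la", "el", "que", "y", "en"}

# Map accented vowels (and ü) to their plain form; ñ is left untouched.
ACCENT_TABLE = str.maketrans("áéíóúü", "aeiouu")

# Assumed laughter pattern: two or more repetitions of "ja"/"je"/"ji" or "ha".
LAUGH_RE = re.compile(r"\b(?:j[aei]){2,}j?\b|\b(?:ha){2,}h?\b")


def preprocess_tweet(text, stop_words=STOP_WORDS, swear_words=SWEAR_WORDS):
    text = re.sub(r"https?://\S+", "", text)      # remove URLs
    text = re.sub(r"@\w+", "", text)               # remove mentions
    text = text.replace("#", "")                   # keep the hashtag word, drop "#"
    text = text.lower().translate(ACCENT_TABLE)    # lowercase, then strip accents
    text = re.sub(r"\d+", "", text)                # remove numbers
    text = LAUGH_RE.sub("jajaja", text)            # unify laughter as "jajaja"
    tokens = re.findall(r"[a-zñ]+", text)          # keep alphabetic tokens only
    tokens = ["insulto" if t in swear_words else t for t in tokens]
    tokens = [t for t in tokens if t not in stop_words]
    return " ".join(tokens)


print(preprocess_tweet("@user Jajajaja qué idiota está el #Lunes 2020 http://t.co/abc"))
```

Replacing insults before removing stopwords keeps the order of operations explicit; in this sketch the two steps do not interact, so the order is a readability choice.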

In [None]:
# standard libraries
import csv
import random
import numpy as np

# NLTK
import nltk

# sklearn
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# pysentimiento
from pysentimiento import create_analyzer
import transformers

# custom imports
from Logic import LSTMUtils, Preprocessing, build_custom_embeddings, load_fasttext



nltk.download('punkt_tab')

TRAIN_SET_PATH = 'data/train.csv'
DEV_SET_PATH   = 'data/devel.csv'
TEST_SET_PATH  = 'data/test.csv'

POS_LEXICON_PATH = 'data/lexico_pos_lemas_grande.csv'
NEG_LEXICON_PATH = 'data/lexico_neg_lemas_grande.csv'
STOP_WORDS_PATH  = 'data/stop_words_esp_anasent.csv'
WORD_VECTORS_PATH = 'data/cc.es.300.vec.gz' 

preprocessor = Preprocessing()
lstm_utils = LSTMUtils()


def load_csv(file_path, transform=None):
    with open(file_path, newline='', encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        return [transform(row) if transform else row for row in reader]

train_set = load_csv(TRAIN_SET_PATH)
devel_set = load_csv(DEV_SET_PATH)
test_set  = load_csv(TEST_SET_PATH)
pos_set   = load_csv(POS_LEXICON_PATH)
neg_set   = load_csv(NEG_LEXICON_PATH)
stop_words_set = [row[0] for row in load_csv(STOP_WORDS_PATH)]

In [2]:
random_tweet = random.choice(train_set)
print(f"Tweet id: {random_tweet[0]}")
print(f"Tweet: {random_tweet[1]}")
print(f"Label: {random_tweet[2]}")

Tweet id: 168833726111956992
Tweet: Listas de espera al alza en Catalunya, estallido en Grecia y debate #EntreTodos sobre la #refomalaboral. http://t.co/PIZAYzPe#portadaEPC
Label: N


## 2. Word Embeddings

To represent tweets, models based on Word Embeddings will be used.

* Each tweet represented as the **mean vector** of the word embeddings of its components.
* Each tweet represented as the **concatenation** of the word embeddings of its components, resulting in a fixed-length vector.

The word embedding collections are available at [Spanish Word Embeddings](https://github.com/dccuchile/spanish-word-embeddings). 
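
To make the two representations listed above concrete, here is a small sketch using a toy 2-dimensional embedding dictionary. The function names, the zero-vector fallback for tweets with no known tokens, and the zero-padding up to `max_length` are assumptions for illustration; the helpers in `LSTMUtils` may handle these cases differently.

```python
import numpy as np


def mean_vector(tokens, emb, dim):
    """Average the embeddings of the tokens found in `emb`."""
    vectors = [emb[t] for t in tokens if t in emb]
    if not vectors:
        # Assumption: a tweet with no known tokens maps to the zero vector.
        return np.zeros(dim, dtype=np.float32)
    return np.mean(np.asarray(vectors, dtype=np.float32), axis=0)


def concat_vector(tokens, emb, dim, max_length):
    """Concatenate the first `max_length` known embeddings, zero-padded."""
    vectors = [emb[t] for t in tokens if t in emb][:max_length]
    out = np.zeros((max_length, dim), dtype=np.float32)
    if vectors:
        out[: len(vectors)] = np.asarray(vectors, dtype=np.float32)
    return out.reshape(-1)  # fixed length: max_length * dim


# Toy embeddings (hypothetical values).
toy_emb = {"buen": [1.0, 0.0], "dia": [0.0, 1.0]}
tokens = ["buen", "dia", "desconocida"]

print(mean_vector(tokens, toy_emb, dim=2))
print(concat_vector(tokens, toy_emb, dim=2, max_length=3))
```

The mean vector discards word order but always has `dim` components, while the concatenation preserves order at the cost of a vector of `max_length * dim` components.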

In [3]:
train_processed  = preprocessor.preprocess_corpus(train_set)
devel_processed  = preprocessor.preprocess_corpus(devel_set)
test_processed   = preprocessor.preprocess_corpus(test_set)

train_processed_stopwords = preprocessor.preprocess_corpus(train_processed, stop_words_set)
devel_processed_stopwords = preprocessor.preprocess_corpus(devel_processed, stop_words_set)
test_processed_stopwords  = preprocessor.preprocess_corpus(test_processed, stop_words_set)

small_fasttext = load_fasttext(WORD_VECTORS_PATH, limit=50000)
custom_emb_dict = build_custom_embeddings(small_fasttext, train_processed, top_n=18000) 
custom_emb_dict_stopwords = build_custom_embeddings(small_fasttext, train_processed_stopwords, top_n=18000)  


Custom dictionary size: 9277
Custom dictionary size: 9169
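
The implementation of `build_custom_embeddings` is not shown here. One plausible reading, sketched below under that assumption, is that it keeps the pretrained vectors only for the `top_n` most frequent training tokens that also exist in the pretrained vocabulary. That would be consistent with the resulting dictionaries being smaller than `top_n`. The function name `restrict_embeddings` and its signature are hypothetical.

```python
from collections import Counter


def restrict_embeddings(pretrained, corpus_tokens, top_n):
    """Keep pretrained vectors for the top_n most frequent corpus tokens.

    `pretrained` maps token -> vector; `corpus_tokens` is a list of token lists.
    Tokens missing from `pretrained` are skipped, so the result can contain
    fewer than `top_n` entries.
    """
    counts = Counter(tok for tokens in corpus_tokens for tok in tokens)
    kept = {}
    for token, _ in counts.most_common(top_n):
        if token in pretrained:
            kept[token] = pretrained[token]
    return kept


# Toy data (hypothetical values).
pretrained = {"a": [1.0], "b": [2.0], "c": [3.0]}
corpus = [["a", "b", "a"], ["a", "z", "z"]]
custom = restrict_embeddings(pretrained, corpus, top_n=2)
print(f"Custom dictionary size: {len(custom)}")
```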


In [4]:
pos_lexicon = [x[0] for x in pos_set]
neg_lexicon = [x[0] for x in neg_set]

## 3. Classical Models with Word Embeddings
We can feed these embeddings into standard classifiers such as MLP, SVM, Logistic Regression, and Naive Bayes, using the mean vector of each tweet as input.
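
The pipeline in this section receives two callables, `train_model_with_random_search` and `evaluate_model`, whose definitions are not shown. A minimal sketch of what they might look like with scikit-learn is given below. The default values for `n_iter`, `cv`, and `random_state` are assumptions, as is the choice of `f1_macro` as the search criterion.

```python
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV


def train_model_with_random_search(model, param_distributions, X, y,
                                   n_iter=10, cv=3, random_state=123):
    """Randomized hyperparameter search; returns the refitted best estimator."""
    # Assumes every entry of param_distributions is a list, so the number of
    # distinct combinations can be counted and used to cap n_iter.
    n_candidates = min(n_iter, len(ParameterGrid(param_distributions)))
    search = RandomizedSearchCV(
        model,
        param_distributions,
        n_iter=n_candidates,
        scoring="f1_macro",
        cv=cv,
        random_state=random_state,
    )
    search.fit(X, y)
    return search.best_estimator_


def evaluate_model(model, X, y):
    """Macro-F1 of a fitted model on (X, y)."""
    return f1_score(y, model.predict(X), average="macro")
```

Selecting candidates by `f1_macro` keeps the search criterion aligned with the metric reported on the development set.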

In [14]:
def main_pipeline(
    train_processed, 
    dev_processed, 
    custom_emb_dict, 
    lstm_utils,
    train_model_with_random_search,
    evaluate_model
):
    """
    Trains multiple machine learning models (MLP, SVM, Logistic Regression, Naive Bayes)
    on mean-vector embeddings, optionally uses RandomizedSearchCV for hyperparameter tuning,
    and evaluates each model with Macro-F1 on the development set.

    Args:
        train_processed (list): List of [id, preprocessed_text, label] for training.
        dev_processed   (list): List of [id, preprocessed_text, label] for development.
        custom_emb_dict (dict): Dictionary of custom word embeddings {token: vector}.
        lstm_utils (object): Utility instance containing `preprocess_data_mean(...)`.
        train_model_with_random_search (function): Function that performs a randomized search.
        evaluate_model (function): Function to evaluate a trained model returning F1-score.

    Returns:
        dict: A dictionary with model names as keys and Macro-F1 scores as values.
    """
    train_X_mean, train_y_enc = lstm_utils.preprocess_data_mean(train_processed, custom_emb_dict)
    dev_X_mean, dev_y_enc     = lstm_utils.preprocess_data_mean(dev_processed, custom_emb_dict)

    scaler = StandardScaler()
    train_X_mean_scaled = scaler.fit_transform(train_X_mean)
    dev_X_mean_scaled   = scaler.transform(dev_X_mean)

    models = {
        'MLP': (
            MLPClassifier(max_iter=1000, random_state=123), 
            {
                'hidden_layer_sizes': [(50,), (100,), (100, 50)],
                'activation': ['tanh', 'relu'],
                'alpha': [0.0001, 0.001, 0.01]
            }
        ),
        'SVM': (
            SVC(random_state=123), 
            {
                'C': [0.1, 1, 10, 100],
                'kernel': ['linear', 'rbf'],
                'gamma': ['scale', 'auto']
            }
        ),
        'Logistic Regression': (
            LogisticRegression(max_iter=1000, random_state=123), 
            {
                'C': [0.1, 1, 10, 100],
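                # NOTE: in many scikit-learn versions 'elasticnet' also requires `l1_ratio`
                # to be set on the estimator; without it those candidates may fail to fit.
                'penalty': ['l1', 'l2', 'elasticnet'],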
                'solver': ['saga']
            }
        ),
        'Naive Bayes': (
            GaussianNB(), 
            {}  # No hyperparameters to tune
        )
    }

    results = {}
    for model_name, (model, param_distributions) in models.items():
        print(f"Training {model_name}...")

        if param_distributions:
            best_model = train_model_with_random_search(
                model, 
                param_distributions, 
                train_X_mean_scaled, 
                train_y_enc
            )
        else:
            best_model = model.fit(train_X_mean_scaled, train_y_enc)

        f1_score_macro = evaluate_model(best_model, dev_X_mean_scaled, dev_y_enc)
        results[model_name] = f1_score_macro
        print(f"{model_name} Macro-F1: {f1_score_macro:.4f}")

    return results


In [15]:
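# main_pipeline takes six arguments; train_model_with_random_search and
# evaluate_model are assumed to be defined in an earlier cell.
results = main_pipeline(
    train_processed,
    devel_processed,
    custom_emb_dict,
    lstm_utils,
    train_model_with_random_search,
    evaluate_model,
)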

Training MLP...
MLP Macro-F1: 0.5361
Training SVM...
SVM Macro-F1: 0.6044
Training Logistic Regression...
Logistic Regression Macro-F1: 0.5910
Training Naive Bayes...
Naive Bayes Macro-F1: 0.4253
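
The Macro-F1 values reported above are the unweighted mean of the per-class F1 scores, so each of the three classes counts equally regardless of how many tweets it contains. The notebook relies on `sklearn.metrics.f1_score(..., average="macro")`; the pure-Python sketch below only illustrates the computation on toy labels, and treating a class with no true or predicted instances as F1 = 0 is an assumption of this sketch.

```python
def macro_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (hypothetical): per-class F1 is 0.5, 0.8 and 2/3.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(f"Macro-F1: {macro_f1(y_true, y_pred):.4f}")
```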


## 4. LSTM (Keras / TensorFlow) with Mean Vector and Concatenation Vector

Next, we implement two LSTM variants:
1. **LSTM with Mean Vector** (a single averaged embedding per tweet, i.e. a sequence with one time step).
2. **LSTM with Concat Sequence** (one embedding per token, giving multiple time steps per tweet, up to a maximum of `max_length`).
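
An LSTM layer expects 3-D input of shape `(n_tweets, time_steps, features)`. The sketch below shows how the two variants would differ in shape only. It assumes the dataset builders in `LSTMUtils` return arrays laid out this way, which is not confirmed here; the helper names are hypothetical. The feature size 302 and the length 20 are the values passed to the model builders in this notebook.

```python
import numpy as np

EMBEDDING_DIM = 302  # value passed to the model builders in this notebook
MAX_LEN = 20         # maximum number of time steps for the concat variant


def as_single_step(mean_vectors):
    """(n_tweets, dim) -> (n_tweets, 1, dim): one time step per tweet."""
    return mean_vectors[:, np.newaxis, :]


def as_sequence(concat_vectors, max_length, dim):
    """(n_tweets, max_length * dim) -> (n_tweets, max_length, dim)."""
    return concat_vectors.reshape(-1, max_length, dim)


mean_batch = np.zeros((4, EMBEDDING_DIM), dtype=np.float32)
concat_batch = np.zeros((4, MAX_LEN * EMBEDDING_DIM), dtype=np.float32)

print(as_single_step(mean_batch).shape)
print(as_sequence(concat_batch, MAX_LEN, EMBEDDING_DIM).shape)
```

With a single time step the recurrent layer has no sequence to unroll, so the first variant behaves much like a feed-forward layer on the averaged vector; only the second variant exposes word order to the LSTM.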

In [16]:
lstm_utils = LSTMUtils()

### LSTM Mean Vector


In [7]:
# datasets
train_X_mean_3d, train_y_mean = lstm_utils.build_mean_vector_dataset(train_processed, custom_emb_dict)
dev_X_mean_3d, dev_y_mean = lstm_utils.build_mean_vector_dataset(devel_processed, custom_emb_dict)

# label encoding
train_y_mean_oh = lstm_utils.one_hot_3classes(train_y_mean)
dev_y_mean_oh = lstm_utils.one_hot_3classes(dev_y_mean)

# main training
model_lstm_mean = lstm_utils.build_lstm_model_for_mean_vector(input_dim=302)
history_mean = model_lstm_mean.fit(
    train_X_mean_3d,
    train_y_mean_oh,
    validation_data=(dev_X_mean_3d, dev_y_mean_oh),
    epochs=15,
    batch_size=64,
    verbose=1
)

# metrics
dev_preds_mean = model_lstm_mean.predict(dev_X_mean_3d)
dev_preds_labels_mean = np.argmax(dev_preds_mean, axis=1)
dev_f1_mean = f1_score(dev_y_mean, dev_preds_labels_mean, average="macro")
print(f"LSTM (mean vector) Dev Macro-F1: {dev_f1_mean:.4f}")


Epoch 1/15
130/130 ━━━━━━━━━━━━━━━━━━━━ 5s 11ms/step - accuracy: 0.4211 - loss: 1.0745 - val_accuracy: 0.4885 - val_loss: 1.0283
Epoch 2/15
130/130 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.4953 - loss: 1.0166 - val_accuracy: 0.5115 - val_loss: 0.9892
Epoch 3/15
130/130 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.5200 - loss: 0.9816 - val_accuracy: 0.5345 - val_loss: 0.9667
Epoch 4/15
130/130 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.5377 - loss: 0.9602 - val_accuracy: 0.5459 - val_loss: 0.9571
Epoch 5/15
130/130 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.5383 - loss: 0.9489 - val_accuracy: 0.5486 - val_loss: 0.9358
Epoch 6/15
130/130 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.5463 - loss: 0.9398 - val_accuracy: 0.5592 - val_loss: 0.9271
Epoch 7/15
130/130

### LSTM Concat Sequence

In [8]:
lstm_utils = LSTMUtils()

MAX_LEN = 20 
train_X_concat, train_y_concat = lstm_utils.build_concat_sequence_dataset(
    train_processed, custom_emb_dict, max_length=MAX_LEN
)
dev_X_concat, dev_y_concat = lstm_utils.build_concat_sequence_dataset(
    devel_processed, custom_emb_dict, max_length=MAX_LEN
)

train_y_concat_oh = lstm_utils.one_hot_3classes(train_y_concat)
dev_y_concat_oh = lstm_utils.one_hot_3classes(dev_y_concat)

model_lstm_concat = lstm_utils.build_lstm_model_for_concat_sequence(
    max_length=MAX_LEN, embedding_dim=302
)
history_concat = model_lstm_concat.fit(
    train_X_concat,
    train_y_concat_oh,
    validation_data=(dev_X_concat, dev_y_concat_oh),
    epochs=15,
    batch_size=64,
    verbose=1
)

dev_preds_concat = model_lstm_concat.predict(dev_X_concat)
dev_preds_labels_concat = np.argmax(dev_preds_concat, axis=1)
dev_f1_concat = f1_score(dev_y_concat, dev_preds_labels_concat, average="macro")
print(f"LSTM (concat vector) Dev Macro-F1: {dev_f1_concat:.4f}")


Epoch 1/15
130/130 ━━━━━━━━━━━━━━━━━━━━ 7s 22ms/step - accuracy: 0.4139 - loss: 1.0675 - val_accuracy: 0.5362 - val_loss: 0.9813
Epoch 2/15
130/130 ━━━━━━━━━━━━━━━━━━━━ 2s 14ms/step - accuracy: 0.5119 - loss: 0.9957 - val_accuracy: 0.5663 - val_loss: 0.9405
Epoch 3/15
130/130 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - accuracy: 0.5586 - loss: 0.9311 - val_accuracy: 0.5777 - val_loss: 0.9194
Epoch 4/15
130/130 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - accuracy: 0.5672 - loss: 0.9055 - val_accuracy: 0.5751 - val_loss: 0.9045
Epoch 5/15
130/130 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - accuracy: 0.5770 - loss: 0.8887 - val_accuracy: 0.5707 - val_loss: 0.9029
Epoch 6/15
130/130 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - accuracy: 0.5864 - loss: 0.8885 - val_accuracy: 0.5919 - val_loss: 0.8899
Epoch 7/15
130/130

## 5. Evaluation on Test and Comparison with PySentimiento
Finally, we evaluate on the test set (using the same representation as in training). We then compare the results with a pretrained model from [pysentimiento](https://github.com/pysentimiento).

In [9]:
test_X_mean_3d, test_y_mean = lstm_utils.build_mean_vector_dataset(test_processed, custom_emb_dict)
test_y_mean_oh = lstm_utils.one_hot_3classes(test_y_mean)


pred_probs_mean = model_lstm_mean.predict(test_X_mean_3d)
pred_labels_mean = np.argmax(pred_probs_mean, axis=1)
test_f1_mean = f1_score(test_y_mean, pred_labels_mean, average="macro")
print(f"LSTM (mean vector) Test Macro-F1: {test_f1_mean:.4f}")

test_X_concat, test_y_concat = lstm_utils.build_concat_sequence_dataset(
    test_processed, custom_emb_dict, max_length=MAX_LEN
)
test_y_concat_oh = lstm_utils.one_hot_3classes(test_y_concat)

pred_probs_concat = model_lstm_concat.predict(test_X_concat)
pred_labels_concat = np.argmax(pred_probs_concat, axis=1)
test_f1_concat = f1_score(test_y_concat, pred_labels_concat, average="macro")
print(f"LSTM (concat vector) Test Macro-F1: {test_f1_concat:.4f}")


59/59 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
LSTM (mean vector) Test Macro-F1: 0.5795
59/59 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step
LSTM (concat vector) Test Macro-F1: 0.5974


In [10]:
transformers.logging.set_verbosity_error()

analyzer = create_analyzer(task="sentiment", lang="es")

def convert_pysentimiento_label(label):
    if label == 'POS':
        return 0
    elif label == 'NEG':
        return 1
    else:
        return 2

pys_preds = []
test_labels_for_pys = lstm_utils.encode_label([x[2] for x in test_processed])

for row in test_processed:
    text = row[1]
    res  = analyzer.predict(text)
    pys_preds.append(convert_pysentimiento_label(res.output))

pys_f1 = f1_score(test_labels_for_pys, pys_preds, average='macro')
print("PySentimiento Test Macro-F1:", f'{pys_f1:.4f}')



PySentimiento Test Macro-F1: 0.6963
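
The comparison above is only meaningful if the integer codes assigned to the pysentimiento outputs (`POS`, `NEG`, `NEU`) match the codes that `encode_label` assigns to the corpus labels. Since `encode_label` is not shown, the sketch below makes that dependency explicit with two dictionaries. The corpus label names (`P`, `N`, `NEU`) and their integer codes are assumptions and would need to be checked against the actual encoder.

```python
# Assumed integer encoding of the corpus labels (hypothetical; must match encode_label).
CORPUS_LABEL_TO_ID = {"P": 0, "N": 1, "NEU": 2}

# pysentimiento sentiment outputs mapped to the assumed corpus label names.
PYSENTIMIENTO_TO_CORPUS = {"POS": "P", "NEG": "N", "NEU": "NEU"}


def pysentimiento_to_id(label):
    """Map a pysentimiento output label to the corpus integer encoding.

    Raises KeyError on an unexpected label instead of silently assigning
    it to the neutral class.
    """
    return CORPUS_LABEL_TO_ID[PYSENTIMIENTO_TO_CORPUS[label]]


print([pysentimiento_to_id(label) for label in ("POS", "NEG", "NEU")])
```

Failing loudly on an unknown label is a deliberate choice in this sketch: a silent fallback would hide a mismatch between the two label sets and distort the Macro-F1.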
