<h1 style="text-align: center;">Sentiment Analysis RELOADED</h1>

<h4 style="text-align: center;">

Esteban Gomez Valerio

Roi Jared Flores Garza Stone

Rafael Takata Garcia

Text Mining - O2025_MAF3654H

Ing. Juan Antonio Vega Fernández, M. Sc., M. T. Ed

ITESO
</h4>


### Context:
In the modern political landscape, social media platforms have transformed into an indispensable barometer of public opinion and a driving force in shaping electoral discourse. The X platform (formerly Twitter), in particular, is an epicenter where candidates, media, and voters interact in real-time, generating a massive flow of textual data that reflects the collective mood.

The 2024 U.S. Presidential election cycle is an event of global significance, and the ability to measure, understand, and predict public sentiment via social media is crucial for political analysts, campaign strategists, and academics.

### Objective: 
This Text Analysis Project aims to decode this digital pulse by applying advanced Natural Language Processing (NLP) and Machine Learning techniques.

We will focus on analyzing the Kaggle dataset titled [2024 U.S. Election Sentiment on X](https://www.kaggle.com/datasets/emirhanai/2024-u-s-election-sentiment-on-x/data?select=train.csv), which provides a labeled corpus of posts capturing the conversation around key candidates and political parties.

### Goals

- Classify and quantify the sentiment (positive, negative, neutral) of posts directed at the main candidates and parties.

- Identify patterns and trends in the polarization of the discourse over time.

- Extract key topics that dominate the online electoral conversation.

### Structure 
To achieve these objectives, the project will be based on the analysis of the train.csv file from the dataset. This dataset is a robust source containing the post text, pre-existing sentiment labels, and rich metadata on party affiliation and engagement metrics (likes, retweets).

The methodology will include the following key steps:

- Text Preprocessing: Data cleaning, tokenization, and language normalization.

- Exploratory Data Analysis (EDA): Visualization of sentiment distribution and most frequent words.

- Sentiment Modeling: Training a classification model (e.g., based on Transformers or traditional Machine Learning) to automatically predict and validate sentiment.

The final outcome will be a deep, data-driven understanding of the emotional and thematic dynamics shaping the 2024 presidential race in the digital sphere.

### Libraries

In [1]:
import numpy as np 
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import torch
import torch.optim as optim
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import spacy
from tqdm.notebook import tqdm
from collections import Counter
from sklearn.neural_network import MLPClassifier
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

### Load Data

In [2]:
df_train = pd.read_csv("../data/train.csv")
df_test = pd.read_csv("../data/test.csv")
df_train.head()

Unnamed: 0,tweet_id,user_handle,timestamp,tweet_text,candidate,party,retweets,likes,sentiment
0,1,@user123,11/3/2024 8:45,Excited to see Kamala Harris leading the Democ...,Kamala Harris,Democratic Party,120,450,positive
1,2,@politicsFan,11/3/2024 9:15,Donald Trump's policies are the best for our e...,Donald Trump,Republican Party,85,300,positive
2,3,@greenAdvocate,11/3/2024 10:05,Jill Stein's environmental plans are exactly w...,Jill Stein,Green Party,60,200,positive
3,4,@indieVoice,11/3/2024 11:20,Robert Kennedy offers a fresh perspective outs...,Robert Kennedy,Independent,40,150,neutral
4,5,@libertyLover,11/3/2024 12:35,Chase Oliver's libertarian stance promotes tru...,Chase Oliver,Libertarian Party,30,120,positive


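For modeling, we use only the text of the tweet as input and the `sentiment` column as the target. The idea was to keep only the `Democratic` and `Republican` parties to frame the conflict as two factions, but note that the cell below applies no party filter, so every row of the dataset is used.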

In [3]:
X_train = df_train['tweet_text']
y_train = df_train['sentiment']

X_test = df_test['tweet_text']
y_test = df_test['sentiment']

print(f"Length of the dataset: {len(df_train)}")
print("\nSentiment Distribution:")
print(df_train['sentiment'].value_counts())


Length of the dataset: 1000

Sentiment Distribution:
sentiment
positive    650
neutral     240
negative    110
Name: count, dtype: int64
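The counts above show a strong class imbalance, which matters for every accuracy figure that follows. As a rough sketch (not part of the original notebook), the snippet below turns those counts into class shares, the accuracy of a trivial "always predict the majority class" rule on the training set, and the weights given by the formula scikit-learn documents for `class_weight='balanced'`. Whether reweighting would actually help on this dataset is an assumption that was not tested here.

```python
# Class counts copied from the value_counts() output above.
counts = {"positive": 650, "neutral": 240, "negative": 110}
n_samples = sum(counts.values())
n_classes = len(counts)

# Share of each class in the training set.
shares = {label: count / n_samples for label, count in counts.items()}

# Accuracy of a trivial rule that always predicts the most frequent class.
majority_label = max(counts, key=counts.get)
majority_accuracy = shares[majority_label]

# "Balanced" weighting heuristic: n_samples / (n_classes * class_count).
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

for label in counts:
    print(f"{label:<9} share={shares[label]:.2f} weight={balanced_weights[label]:.2f}")
print(f"Majority-class accuracy on train: {majority_accuracy:.2f}")
```

Any model in this notebook should be judged against that majority-class figure rather than against zero.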


### Baseline Model

In [4]:
baseline_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        stop_words='english',
        ngram_range=(1, 2)
    )),
    
    ('lr', LogisticRegression(
        random_state=42,
        solver='liblinear'
    ))
])

In [5]:
# Train the baseline model on the full training set, then predict on the test set
baseline_pipeline.fit(X_train, y_train)
y_pred = baseline_pipeline.predict(X_test)



In [6]:
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy of the Baseline Model: {accuracy:.4f}")
print("-" * 50)

print("Classification Report:")
print(classification_report(y_test, y_pred, zero_division=0))

Accuracy of the Baseline Model: 0.8636
--------------------------------------------------
Classification Report:
              precision    recall  f1-score   support

    negative       0.83      0.29      0.43        17
     neutral       1.00      0.93      0.96        29
    positive       0.82      0.98      0.89        64

    accuracy                           0.86       110
   macro avg       0.88      0.74      0.76       110
weighted avg       0.87      0.86      0.84       110



This baseline Logistic Regression model achieves a decent overall Accuracy of 86.36%. The model performs well on the `Neutral class` (Precision 1.00, Recall 0.93), indicating these samples are easily separable. Performance on the `Positive class` is also strong (F1-score of 0.89), due to its high Recall (0.98), meaning it successfully identifies almost all positive tweets. However, the model presents a critical weakness in identifying the `Negative class`: while its Precision is reasonable (0.83), its Recall is extremely low (0.29).
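To make the negative-class numbers in the baseline report concrete, the following sketch works backwards from the rounded precision, recall, and support to the smallest integer counts that are consistent with them. These counts are an inference from the rounded report, not values read from an actual confusion matrix.

```python
# Values for the negative class, copied from the baseline report above.
support = 17               # true negative tweets in the test set
reported_precision = 0.83
reported_recall = 0.29

# Integer counts consistent with the rounded report (an inference).
true_positives = round(reported_recall * support)
predicted_negative = round(true_positives / reported_precision)
false_positives = predicted_negative - true_positives
false_negatives = support - true_positives

# Recompute the metrics from the inferred counts.
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={true_positives} FP={false_positives} FN={false_negatives}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under this reading, the model flags very few tweets as negative and is usually right when it does, but it misses most of the negative tweets in the test set.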

### Feature Engineering with POS and NER

In [7]:
# !python -m spacy download en_core_web_sm 

In [8]:
nlp = spacy.load("en_core_web_sm")
print(" Model ready")

 Model ready


In [9]:
POS_TAGS = ['NOUN', 'VERB', 'ADJ', 'ADV', 'PRON', 'DET', 'ADP', 'AUX', 'SCONJ', 'CCONJ', 'INTJ']
NER_TAGS = ['PERSON', 'ORG', 'GPE', 'LOC', 'DATE', 'TIME', 'NORP', 'EVENT']

In [10]:
def extract_pos_ner_counts(text):
    # Process the text with spaCy
    doc = nlp(text)
    features = Counter()
    
    # 1. Count of POS (Part of Speech) Tags
    for token in doc:
        if token.pos_ in POS_TAGS:
            features[f"POS_{token.pos_}"] += 1
            
    # 2. Count of NER (Named Entity Recognition) labels
    for ent in doc.ents:
        if ent.label_ in NER_TAGS:
            features[f"NER_{ent.label_}"] += 1
            
    return features

In [11]:
# Test the function on a sample tweet
sample_tweet = X_train.iloc[0]
sample_features = extract_pos_ner_counts(sample_tweet)
print(f"Tweet: {sample_tweet}")
print(f"Counts: {sample_features}")

Tweet: Excited to see Kamala Harris leading the Democratic charge!
Counts: Counter({'POS_ADJ': 2, 'POS_VERB': 2, 'POS_DET': 1, 'POS_NOUN': 1, 'NER_PERSON': 1, 'NER_NORP': 1})


| Tag | Meaning | Sentiment Relevance |
| :--- | :--- | :--- |
| **POS\_ADJ** | Adjective | Describes something (e.g., "terrible," "excellent"). |
| **POS\_VERB** | Verb | Defines the action (e.g., "supports," "criticizes"). |
| **NER\_PERSON** | Named Person | Identifies the subject (e.g., a candidate). |
| **NER\_ORG** | Organization | Identifies the associated group or political party. |

### Examples in our output

The test example (`Tweet: Excited to see Kamala Harris leading the Democratic charge!`) validates that the function correctly maps the language to numerical features:

* **`POS_ADJ: 2`**: "Excited," "Democratic"
* **`NER_PERSON: 1`**: "Kamala Harris"
* **`POS_VERB: 2`**: "see," "leading"
* **`POS_DET: 1`**: "the"
* **`POS_NOUN: 1`**: "charge"
* **`NER_NORP: 1`**: "Democratic"

---

In [12]:
# 1. Training Data
print("Extracting POS/NER features for Training Set...")
X_train_dicts = [extract_pos_ner_counts(text) for text in tqdm(X_train)]
print("Extraction complete for Training Set.")

# 2. Vectorizer fitting
vectorizer = DictVectorizer(dtype=float) 
X_train_features = vectorizer.fit_transform(X_train_dicts)

# 3. Feature Extraction on Test Data
print("Extracting POS/NER features for Test Set...")
X_test_dicts = [extract_pos_ner_counts(text) for text in tqdm(X_test)]
print("Extraction complete for Test Set.")

# 4. Vectorizer transformation 
X_test_features = vectorizer.transform(X_test_dicts)

print(f"\nTraining Features Shape: {X_train_features.shape}")
print(f"Test Features Shape: {X_test_features.shape}")
print(f"Number of unique POS/NER features extracted: {X_train_features.shape[1]}")

Extracting POS/NER features for Training Set...


  0%|          | 0/1000 [00:00<?, ?it/s]

Extraction complete for Training Set.
Extracting POS/NER features for Test Set...


  0%|          | 0/110 [00:00<?, ?it/s]

Extraction complete for Test Set.

Training Features Shape: (1000, 15)
Test Features Shape: (110, 15)
Number of unique POS/NER features extracted: 15
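The step from per-tweet count dictionaries to a fixed-width feature matrix can be illustrated without scikit-learn. The sketch below is a dense, pure-Python analogue of what `DictVectorizer` does conceptually (a sorted feature vocabulary, one column per feature, zero where a feature is absent). The second dictionary is a made-up example, and the real vectorizer returns a sparse matrix rather than nested lists.

```python
from collections import Counter

# First dict: the sample output shown earlier. Second dict: hypothetical.
docs = [
    Counter({"POS_ADJ": 2, "POS_VERB": 2, "POS_DET": 1,
             "POS_NOUN": 1, "NER_PERSON": 1, "NER_NORP": 1}),
    Counter({"POS_NOUN": 3, "POS_VERB": 1, "NER_ORG": 1}),
]

# "Fit": the feature vocabulary is the sorted union of all keys seen.
feature_names = sorted({name for doc in docs for name in doc})

# "Transform": one row per document, 0.0 for features the document lacks.
matrix = [[float(doc.get(name, 0)) for name in feature_names] for doc in docs]

print(feature_names)
for row in matrix:
    print(row)
```

Because the vocabulary is fixed when the vectorizer is fitted on the training set, the test matrix has the same number of columns as the training matrix, which is why both shapes above end in 15.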


In [13]:
# Matrix conversion
X_train_dense = X_train_features.toarray().astype('float64')
X_test_dense = X_test_features.toarray().astype('float64')

# Filter classes 
class_counts = y_train.value_counts()
valid_classes = class_counts[class_counts >= 2].index

# Filter Training Set
train_mask = y_train.isin(valid_classes)
X_train_filtered = X_train_dense[train_mask.values, :] 
y_train_filtered = y_train.loc[train_mask]

# Filter Test Set
test_mask = y_test.isin(valid_classes)
X_test_filtered = X_test_dense[test_mask.values, :]
y_test_filtered = y_test.loc[test_mask]

le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train_filtered.values)
y_test_encoded = le.transform(y_test_filtered.values)

X_train_final = X_train_filtered
X_test_final = X_test_filtered
y_train_final = y_train_encoded
y_test_final = y_test_encoded

# MLP Model Definition 
mlp_model = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    max_iter=500,
    random_state=42,
    solver='adam', 
    early_stopping=True 
)

# Training
print("\nTraining MLP...")
mlp_model.fit(X_train_final, y_train_final) 
print("Training complete.")

# Predictions and Evaluation
y_pred_nn = mlp_model.predict(X_test_final)

y_pred_decoded = le.inverse_transform(y_pred_nn)
y_test_decoded = le.inverse_transform(y_test_final)

accuracy_nn = accuracy_score(y_test_decoded, y_pred_decoded)

# Results 
print("-" * 60)
print(f"Accuracy of the POS/NER Model (MLP): {accuracy_nn:.4f}")
print("-" * 60)

print("Classification Report (MLP on POS/NER Features):")
print(classification_report(y_test_decoded, y_pred_decoded, zero_division=0))


Training MLP...
Training complete.
------------------------------------------------------------
Accuracy of the POS/NER Model (MLP): 0.7545
------------------------------------------------------------
Classification Report (MLP on POS/NER Features):
              precision    recall  f1-score   support

    negative       0.67      0.35      0.46        17
     neutral       0.74      0.79      0.77        29
    positive       0.77      0.84      0.81        64

    accuracy                           0.75       110
   macro avg       0.73      0.66      0.68       110
weighted avg       0.75      0.75      0.74       110



The MLP model, relying only on abstract linguistic counts (like the number of adjectives or verbs) rather than the actual words themselves, resulted in a lower overall accuracy ($\mathbf{0.75}$) compared to the inflated Logistic Regression baseline ($\mathbf{0.86}$).

___

### Transformer-Based Model

In [14]:
MODEL_NAME = 'distilbert-base-uncased'

device = torch.device('cpu')
print(f"Using device: {device}")

tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL_NAME)

Using device: cpu


In [15]:
label_map = {id: name for id, name in enumerate(le.classes_)}
num_labels = len(le.classes_)

In [16]:
# Tokenization of Texts
train_encodings = tokenizer(list(X_train.values), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test.values), truncation=True, padding=True)


# PyTorch Dataset Class Definition
class SentimentDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Return token IDs, attention mask, and the integer label
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Create Dataset instances
train_dataset = SentimentDataset(train_encodings, y_train_encoded)
test_dataset = SentimentDataset(test_encodings, y_test_encoded)

print("Data ready for Trainer. Number of labels:", num_labels)
print("Label mapping (ID to Name):", label_map)

Data ready for Trainer. Number of labels: 3
Label mapping (ID to Name): {0: 'negative', 1: 'neutral', 2: 'positive'}
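The label mapping printed above follows from the fact that `LabelEncoder` assigns integer IDs to the sorted unique class names. A minimal pure-Python sketch of that behavior, using a made-up list of labels, looks like this:

```python
# Hypothetical sample of sentiment labels (not rows from the dataset).
labels = ["positive", "neutral", "positive", "negative", "neutral"]

# IDs are assigned in sorted (alphabetical) order of the class names.
classes = sorted(set(labels))
label_to_id = {name: idx for idx, name in enumerate(classes)}
id_to_label = {idx: name for name, idx in label_to_id.items()}

# Encode to integers for the model, then decode back for reporting.
encoded = [label_to_id[name] for name in labels]
decoded = [id_to_label[idx] for idx in encoded]

print(id_to_label)
print(encoded)
```

This is why class 0 is `negative` and class 2 is `positive` in every model that reuses the fitted encoder `le`.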


In [17]:
# 1. Load the Model and Move to CPU
model = DistilBertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)
# CRITICAL: Model moved to CPU
model.to(device) 
model.train() # Set model to training mode

# 2. Define Optimizer and Learning Rate
optimizer = AdamW(model.parameters(), lr=5e-5)

# 3. Define Parameters
epochs = 3
# IMPORTANT: Increase batch size slightly for CPU, but keep it modest (e.g., 32 or 64) if memory allows
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True) 

# 4. Training Loop
print(f"\nStarting manual Fine-Tuning on CPU for {epochs} epochs...")

for epoch in range(epochs):
    print(f"--- Epoch {epoch + 1}/{epochs} ---")
    
    # Training loop by batch
    for batch in tqdm(train_loader):
        optimizer.zero_grad() # Clear gradients
        
        # CRITICAL: Move data batches to CPU
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        # Pass data to the model
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        
        # Backpropagation
        loss.backward()
        optimizer.step()

print("\nTraining complete.")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Starting manual Fine-Tuning on CPU for 3 epochs...
--- Epoch 1/3 ---


  0%|          | 0/32 [00:00<?, ?it/s]

--- Epoch 2/3 ---


  0%|          | 0/32 [00:00<?, ?it/s]

--- Epoch 3/3 ---


  0%|          | 0/32 [00:00<?, ?it/s]


Training complete.
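The number of optimizer steps in this run follows directly from the dataset sizes and batch sizes used above. The sketch below does that arithmetic, assuming the `DataLoader` default of keeping the last, smaller batch.

```python
import math


def batches_per_epoch(n_samples, batch_size):
    """Number of batches when the final partial batch is kept."""
    return math.ceil(n_samples / batch_size)


train_steps = batches_per_epoch(1000, 32)   # training set, batch_size=32
eval_steps_64 = batches_per_epoch(110, 64)  # test set, batch_size=64
eval_steps_32 = batches_per_epoch(110, 32)  # test set, batch_size=32
total_updates = train_steps * 3             # three epochs

print(train_steps, eval_steps_64, eval_steps_32, total_updates)
```

With fewer than a hundred gradient updates in total, the fine-tuning here is very short, which is worth keeping in mind when reading the results.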


In [18]:
# Final Evaluation 
model.eval() 
all_preds = []
all_labels = []

# DataLoader
test_loader = DataLoader(test_dataset, batch_size=64) 

with torch.no_grad():
    for batch in tqdm(test_loader):
        # Move data to CPU
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        
        # Get the predicted class
        preds = torch.argmax(logits, dim=-1)
        
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

# Decode and Final Results
y_pred_final = le.inverse_transform(np.array(all_preds))
y_test_final = le.inverse_transform(np.array(all_labels))

accuracy_transformer = accuracy_score(y_test_final, y_pred_final)

print("-" * 60)
print(f"Accuracy: {accuracy_transformer:.4f}")
print("-" * 60)

print("Classification Report:")
print(classification_report(y_test_final, y_pred_final, zero_division=0))

  0%|          | 0/2 [00:00<?, ?it/s]

------------------------------------------------------------
Accuracy: 0.8636
------------------------------------------------------------
Classification Report:
              precision    recall  f1-score   support

    negative       0.67      0.35      0.46        17
     neutral       0.97      1.00      0.98        29
    positive       0.85      0.94      0.89        64

    accuracy                           0.86       110
   macro avg       0.83      0.76      0.78       110
weighted avg       0.85      0.86      0.85       110



The DistilBERT model delivered strong overall results, with an accuracy of 86.36%, the same as the baseline. It handled the major sentiment groups very well: all `Neutral` tweets were recovered (Recall 1.00), and the `Positive` tweets were almost always found (Recall 0.94). However, the model ran into the same fundamental problem as the simpler baseline: it failed on the hard-to-find `Negative` tweets. The model correctly identified only 35% of the actual `negative` examples, showing that this minority class remains a critical weakness, even for an advanced Transformer.
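The gap between the `macro avg` and `weighted avg` rows in the DistilBERT report comes from how each one treats class size. As an illustrative sketch, the snippet below recomputes both recall averages from the rounded per-class values in the report, so small rounding differences from the true values are possible.

```python
# (recall, support) per class, copied from the DistilBERT report above.
report = {
    "negative": (0.35, 17),
    "neutral": (1.00, 29),
    "positive": (0.94, 64),
}

total_support = sum(support for _, support in report.values())

# Macro average: every class counts equally, regardless of size.
macro_recall = sum(recall for recall, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
weighted_recall = (
    sum(recall * support for recall, support in report.values()) / total_support
)

print(f"macro recall    = {macro_recall:.2f}")
print(f"weighted recall = {weighted_recall:.2f}")
```

The weighted figure stays high because the positive class dominates the test set, while the macro figure is pulled down by the weak negative class.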

___

### Mixed Embeddings + Attention

In [19]:
# Load spaCy model
nlp = spacy.load("en_core_web_sm")

MAX_LEN = 128 
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

def get_pos_sequence(text):
    """Processes text to return a sequence of POS tags (strings)."""
    doc = nlp(text)
    pos_tags = [token.pos_ for token in doc]
    return pos_tags

# Get POS sequences for training and testing
X_train_pos_sequences = [get_pos_sequence(text) for text in X_train]
X_test_pos_sequences = [get_pos_sequence(text) for text in X_test]

# Build the vocabulary of all unique POS tags
all_pos_tags = set(tag for seq in X_train_pos_sequences for tag in seq)
pos_vocab = {tag: i + 1 for i, tag in enumerate(sorted(list(all_pos_tags)))}
pos_vocab['<PAD>'] = 0 # Padding token ID
POS_VOCAB_SIZE = len(pos_vocab)

print(f"POS Vocabulary Size: {POS_VOCAB_SIZE}")

# Function to convert POS sequence strings to IDs, and pad them
def encode_and_pad_pos(pos_sequences, pos_vocab, max_len=MAX_LEN):
    encoded_sequences = []
    for seq in pos_sequences:
        # Convert tags to IDs, use 0 for padding
        ids = [pos_vocab.get(tag, pos_vocab['<PAD>']) for tag in seq]
        
        # Pad or truncate to max_len
        if len(ids) < max_len:
            ids.extend([pos_vocab['<PAD>']] * (max_len - len(ids)))
        else:
            ids = ids[:max_len]
            
        encoded_sequences.append(torch.tensor(ids, dtype=torch.long))
        
    return torch.stack(encoded_sequences)

# Encode and pad all sequences
train_pos_ids = encode_and_pad_pos(X_train_pos_sequences, pos_vocab)
test_pos_ids = encode_and_pad_pos(X_test_pos_sequences, pos_vocab)

print(f"Shape of padded POS ID array (Train): {train_pos_ids.shape}")

POS Vocabulary Size: 15
Shape of padded POS ID array (Train): torch.Size([1000, 128])
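The encode-and-pad logic is easier to follow on a tiny example. The sketch below is a pure-Python version of the same idea (IDs start at 1, `0` is reserved for padding, and tags not seen when building the vocabulary also fall back to `0`). The tag sequences and the small `max_len` values are hypothetical and chosen only for illustration.

```python
PAD_ID = 0


def build_vocab(sequences):
    """Map each tag to an ID starting at 1; ID 0 is reserved for padding."""
    tags = sorted({tag for seq in sequences for tag in seq})
    vocab = {tag: i + 1 for i, tag in enumerate(tags)}
    vocab["<PAD>"] = PAD_ID
    return vocab


def encode_and_pad(seq, vocab, max_len):
    """Convert tags to IDs, truncate to max_len, then right-pad with PAD_ID."""
    ids = [vocab.get(tag, PAD_ID) for tag in seq][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


# Hypothetical POS sequences standing in for spaCy output.
train_sequences = [
    ["ADJ", "PART", "VERB"],
    ["PROPN", "VERB", "DET", "NOUN", "PUNCT"],
]
vocab = build_vocab(train_sequences)

padded = encode_and_pad(train_sequences[0], vocab, max_len=5)
truncated = encode_and_pad(train_sequences[1], vocab, max_len=4)
with_unseen = encode_and_pad(["INTJ", "NOUN"], vocab, max_len=3)

print(padded, truncated, with_unseen)
```

One consequence of this design is that an unseen tag and a padding position become indistinguishable to the model, since both map to ID 0.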


In [20]:
class HybridDataset(Dataset):
    def __init__(self, encodings, pos_ids, labels):
        self.encodings = encodings
        self.pos_ids = pos_ids
        self.labels = np.array(labels, dtype=np.int64)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['pos_ids'] = self.pos_ids[idx]
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        
        return item

    def __len__(self):
        return len(self.labels)

train_dataset_hybrid = HybridDataset(train_encodings, train_pos_ids, y_train_encoded)
test_dataset_hybrid = HybridDataset(test_encodings, test_pos_ids, y_test_encoded)

class HybridBiLSTM(nn.Module):
    def __init__(self, pos_vocab_size, word_vocab_size, pos_embedding_dim=50, 
                 word_embedding_dim=768, hidden_dim=128, num_labels=3, dropout=0.3):
        super(HybridBiLSTM, self).__init__()
        
        self.word_vocab_size = word_vocab_size 
        self.word_embedding_dim = word_embedding_dim
        
        # Embedding layer for POS tags
        self.pos_embedding = nn.Embedding(
            num_embeddings=pos_vocab_size, 
            embedding_dim=pos_embedding_dim, 
            padding_idx=0 
        )
        
        # Embedding layer for words
        self.word_embedding = nn.Embedding(self.word_vocab_size, self.word_embedding_dim)

        total_input_dim = self.word_embedding_dim + pos_embedding_dim
        
        # BiLSTM Layer
        self.lstm = nn.LSTM(
            input_size=total_input_dim,
            hidden_size=hidden_dim,
            num_layers=1,
            bidirectional=True,
            batch_first=True
        )
        
        # Classification Head
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_dim * 2, num_labels)
        
    def forward(self, input_ids, pos_ids):
        # Word embedding (simulated: a trainable nn.Embedding, not pretrained DistilBERT vectors)
        word_embedded = self.word_embedding(input_ids)
        
        current_seq_len = input_ids.shape[1] 

        # POS Embedding
        pos_embedded = self.pos_embedding(pos_ids) 
        
        # Length correction (from the previous step)
        if pos_embedded.shape[1] > current_seq_len:
            pos_embedded = pos_embedded[:, :current_seq_len, :]
        elif pos_embedded.shape[1] < current_seq_len:
             word_embedded = word_embedded[:, :pos_embedded.shape[1], :]


        # Concatenation
        combined_sequence = torch.cat((word_embedded, pos_embedded), dim=2) 
        
        # BiLSTM Forward Pass
        lstm_out, _ = self.lstm(combined_sequence)
        
        # Global Average Pooling (GAP)
        pooled_output = torch.mean(lstm_out, dim=1) 
        
        # Classification
        dropped_output = self.dropout(pooled_output)
        logits = self.classifier(dropped_output)
        return logits

In [21]:
# Params of the model
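WORD_VOCAB_SIZE = tokenizer.vocab_size  # must cover every DistilBERT token ID; a smaller value risks an index error in nn.Embedding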
POS_EMBEDDING_DIM = 50 
HIDDEN_DIM = 128
NUM_EPOCHS = 3
BATCH_SIZE = 32

# Model
model_hybrid = HybridBiLSTM(
    pos_vocab_size=POS_VOCAB_SIZE, 
    word_vocab_size=WORD_VOCAB_SIZE,
    word_embedding_dim=768,
    pos_embedding_dim=POS_EMBEDDING_DIM, 
    hidden_dim=HIDDEN_DIM, 
    num_labels=num_labels
)
model_hybrid.to(device)

# DataLoader
train_loader_hybrid = DataLoader(train_dataset_hybrid, batch_size=BATCH_SIZE, shuffle=True)
test_loader_hybrid = DataLoader(test_dataset_hybrid, batch_size=BATCH_SIZE)

# Loss function and optimizer
optimizer = AdamW(model_hybrid.parameters(), lr=5e-5)
loss_fn = nn.CrossEntropyLoss()

# Training
print(f"\nStarting Hybrid BiLSTM Training on {device} for {NUM_EPOCHS} epochs...")

for epoch in range(NUM_EPOCHS):
    model_hybrid.train() 
    total_loss = 0
    
    for batch in tqdm(train_loader_hybrid, desc=f"Epoch {epoch+1}"):
        optimizer.zero_grad()
        
        input_ids = batch['input_ids'].to(device)
        pos_ids = batch['pos_ids'].to(device)
        labels = batch['labels'].to(device)
        
        logits = model_hybrid(input_ids, pos_ids)
        
        loss = loss_fn(logits, labels)
        total_loss += loss.item()
        
        # Backpropagation
        loss.backward()
        optimizer.step()
        
    print(f"Epoch {epoch+1} Loss: {total_loss / len(train_loader_hybrid):.4f}")

print("\n✅ Hybrid Training complete.")


# Final evaluation
model_hybrid.eval() 
all_preds = []
all_labels = []

with torch.no_grad():
    for batch in tqdm(test_loader_hybrid, desc="Evaluation"):
        input_ids = batch['input_ids'].to(device)
        pos_ids = batch['pos_ids'].to(device)
        labels = batch['labels'].to(device)

        logits = model_hybrid(input_ids, pos_ids)
        preds = torch.argmax(logits, dim=-1)
        
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())


y_pred_final_hybrid = le.inverse_transform(np.array(all_preds))
y_test_final_hybrid = le.inverse_transform(np.array(all_labels))

accuracy_hybrid = accuracy_score(y_test_final_hybrid, y_pred_final_hybrid)

print("-" * 60)
print(f"Accuracy of the Hybrid BiLSTM Model: {accuracy_hybrid:.4f}")
print("-" * 60)

print("Classification Report (Hybrid BiLSTM):")
print(classification_report(y_test_final_hybrid, y_pred_final_hybrid, zero_division=0))


Starting Hybrid BiLSTM Training on cpu for 3 epochs...


Epoch 1:   0%|          | 0/32 [00:00<?, ?it/s]



Epoch 1 Loss: 0.9476


Epoch 2:   0%|          | 0/32 [00:00<?, ?it/s]

Epoch 2 Loss: 0.8362


Epoch 3:   0%|          | 0/32 [00:00<?, ?it/s]

Epoch 3 Loss: 0.8065

✅ Hybrid Training complete.


Evaluation:   0%|          | 0/4 [00:00<?, ?it/s]

------------------------------------------------------------
Accuracy of the Hybrid BiLSTM Model: 0.5909
------------------------------------------------------------
Classification Report (Hybrid BiLSTM):
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00        17
     neutral       1.00      0.03      0.07        29
    positive       0.59      1.00      0.74        64

    accuracy                           0.59       110
   macro avg       0.53      0.34      0.27       110
weighted avg       0.61      0.59      0.45       110



The BiLSTM model failed to learn how to classify and, instead, found the path of least resistance: predicting the majority class ('`positive`') for almost everything. This is supported by its overall Accuracy (0.59), which is nearly identical to the share of the 'positive' class in the test set (64 of 110, about 0.58), and by its Recall, which is 1.00 for that class but 0.00 for `negative` and 0.03 for `neutral`.
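The majority-class reading of the hybrid model can be checked with simple arithmetic on the report. The sketch below reconstructs the number of correct predictions per class from the rounded recall and support values (an inference from the report, not a logged confusion matrix) and compares the resulting accuracy with the share of positive tweets in the test set.

```python
# (recall, support) per class, copied from the Hybrid BiLSTM report above.
report = {
    "negative": (0.00, 17),
    "neutral": (0.03, 29),
    "positive": (1.00, 64),
}

# Correct predictions per class, inferred as recall * support (rounded).
correct = {label: round(recall * support) for label, (recall, support) in report.items()}

total = sum(support for _, support in report.values())
accuracy = sum(correct.values()) / total

# Accuracy of a rule that labels every test tweet as positive.
majority_share = report["positive"][1] / total

print(correct)
print(f"accuracy={accuracy:.4f} always-positive={majority_share:.4f}")
```

The two figures differ by a single test tweet, which is consistent with a model that has effectively collapsed to one class.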

___