# BERTBQ: Taglish Complaint Classification for Philippine E-Commerce Reviews
## Using Multilingual Transformers (RoBERTa-TL)
 This notebook implements a complaint classification system for Taglish (Tagalog-English) reviews from Philippine e-commerce platforms.

## 1. Setup and Installation
Run this cell first to install all required packages

In [1]:
# Install required packages
!pip install -q transformers datasets accelerate scikit-learn pandas numpy matplotlib seaborn wordcloud emoji
!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Installs all the necessary libraries for the project. The first command installs libraries used for natural language processing, data processing, and visualization, including Transformers, Datasets, Accelerate, Scikit-learn, Pandas, NumPy, Matplotlib, Seaborn, WordCloud, and Emoji. The second command installs PyTorch along with Torchvision and Torchaudio, using a CUDA-compatible version to enable GPU acceleration if available.

## 2. Import Libraries

In [2]:
import warnings
warnings.filterwarnings('ignore')

import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import re
import emoji
from collections import Counter

# Hugging Face imports
from datasets import load_dataset, Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    EarlyStoppingCallback
)

# Sklearn imports
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    confusion_matrix,
    classification_report
)

# Set random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

# Check GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

Using device: cuda
GPU: Tesla T4


Imports all the required Python libraries for data processing, visualization, natural language processing, and machine learning. Warnings are disabled for cleaner output. Core libraries such as NumPy, Pandas, and PyTorch are loaded for numerical computation, data handling, and model training. Visualization tools such as Matplotlib and Seaborn are imported for creating plots. Additional utilities such as regular expressions, emoji handling, and word frequency counting are included.

Hugging Face libraries are imported to handle tokenization, dataset loading, and model training. Scikit-learn modules are imported for data splitting, feature extraction, machine learning modeling, and evaluation.

A fixed random seed (42) is set for NumPy and PyTorch to make results more reproducible. The code then checks whether a GPU is available and prints the device being used for computation.
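The cell above seeds NumPy and PyTorch but not Python's built-in `random` module. A minimal sketch of a broader helper is shown below; `seed_everything` is a hypothetical name that does not appear in the notebook, and the NumPy and PyTorch calls are skipped when those packages are not installed.

```python
import random


def seed_everything(seed: int = 42) -> None:
    """Seed the random number generators this notebook may touch.

    Hypothetical helper, not part of the original notebook.
    """
    random.seed(seed)  # Python's built-in generator
    try:
        import numpy as np
        np.random.seed(seed)  # NumPy's global generator
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)  # PyTorch generators
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print(first == second)  # True: the same seed gives the same draw
```

Seeding alone may not make GPU training fully deterministic, so small run-to-run differences in the transformer results are still possible.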

## 3. Load and Explore the Dataset
We'll use the SentiTaglishProductsAndServices dataset from Hugging Face

In [3]:
# Load the dataset
try:
    dataset = load_dataset("ccosme/SentiTaglishProductsAndServices")
    print("Dataset loaded successfully!")
    print(f"Available splits: {list(dataset.keys())}")

    # Display dataset info
    if 'train' in dataset:
        print(f"\nNumber of training samples: {len(dataset['train'])}")
        print(f"Features: {dataset['train'].features}")

        # Show sample
        print("\nSample entry:")
        print(dataset['train'][0])

except Exception as e:
    print(f"Could not load dataset: {e}")
    print("Creating sample dataset for demonstration...")

    # Create sample Taglish dataset
    sample_data = {
        'text': [
            "Sobrang bagal ng delivery, 2 weeks bago dumating!",
            "The product is good pero mahal masyado for the quality",
            "Ang ganda ng packaging at mabilis ang shipping",
            "Sira yung item na natanggap ko, requesting refund",
            "Great seller, very responsive and helpful!",
            "Hindi tugma sa description, disappointed ako",
            "Worth it! Sulit na sulit ang price",
            "Late delivery tapos wrong item pa ang natanggap",
            "Excellent quality, exactly as described",
            "Ang pangit ng customer service, di nagrereply",
            "Super satisfied with my purchase!",
            "Defective yung product, waste of money",
            "Fast shipping and good packaging",
            "Scam ba ito? Hindi tugma sa picture",
            "Highly recommended seller!"
        ] * 20,  # Multiply for more samples
        'label': [1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0] * 20
    }

    dataset = DatasetDict({'train': Dataset.from_dict(sample_data)})
    print(f"Sample dataset created with {len(dataset['train'])} entries")

Dataset loaded successfully!
Available splits: ['train']

Number of training samples: 10510
Features: {'review': Value('string'), 'sentiment': Value('int64')}

Sample entry:
{'review': 'at first gumagana cya..pero pagnalowbat cya ndi na ya magamit kahit ilang oras mo cya icharge namamatay agad..poor quality..not for recommended..', 'sentiment': 1}


This section attempts to load a Taglish sentiment dataset from Hugging Face. If the dataset is successfully loaded, the code prints its available splits, the number of training samples, the feature information, and displays a sample entry. If the dataset cannot be loaded due to an error, a fallback sample dataset is created manually for demonstration purposes. The sample dataset contains Taglish product and service reviews with corresponding sentiment labels.

## 4. Data Preprocessing
Clean and prepare the Taglish text for training

In [4]:
def clean_taglish_text(text):
    """Clean and normalize Taglish text"""
    if pd.isna(text) or text == "":
        return ""

    text = str(text)

    # Handle emojis
    text = emoji.demojize(text)

    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)

    # Handle repeated characters
    text = re.sub(r'(.)\1{3,}', r'\1\1', text)

    # Handle excessive punctuation
    text = re.sub(r'[!?]{2,}', '!', text)
    text = re.sub(r'\.{2,}', '.', text)

    # Remove extra whitespaces
    text = ' '.join(text.split())

    return text.strip()


# Process the dataset
def preprocess_dataset(dataset):
    """Preprocess the entire dataset"""
    processed_data = []

    for item in dataset:
        # Find text field
        text = None
        for field in ['text', 'review', 'review_text', 'content']:
            if field in item:
                text = item[field]
                break

        # Find label field
        label = None
        for field in ['label', 'sentiment', 'sentiment_label']:
            if field in item:
                label = item[field]
                break

        if text and label is not None:
            cleaned_text = clean_taglish_text(text)

            # Convert sentiment labels (1–4) to binary complaint labels
            # 1 = Negative → Complaint (1)
            # 2 = Neutral  → Non-Complaint (0)
            # 3 = Positive → Non-Complaint (0)
            # 4 = Mixed    → Complaint (1)
            try:
                label = int(label)
                if label in [1, 4]:
                    binary_label = 1
                elif label in [2, 3]:
                    binary_label = 0
                else:
                    binary_label = 0
            except (ValueError, TypeError):
                # If labels are strings like 'positive', 'negative'
                binary_label = 1 if str(label).lower() in ['negative', 'mixed'] else 0

            processed_data.append({
                'text': cleaned_text,
                'label': binary_label
            })

    return processed_data


# Process the data
if 'train' in dataset:
    processed_data = preprocess_dataset(dataset['train'])
else:
    processed_data = preprocess_dataset(dataset)

print(f"Processed {len(processed_data)} samples")

# Check label distribution
labels = [item['label'] for item in processed_data]
print(f"\nLabel Distribution:")
print(f"  Complaints (1): {sum(labels)} ({sum(labels)/len(labels)*100:.1f}%)")
print(f"  Non-complaints (0): {len(labels)-sum(labels)} ({(len(labels)-sum(labels))/len(labels)*100:.1f}%)")


Processed 10510 samples

Label Distribution:
  Complaints (1): 6805 (64.7%)
  Non-complaints (0): 3705 (35.3%)


These functions clean and normalize Taglish text before modeling. The clean_taglish_text function converts emojis to text names, removes URLs, collapses runs of four or more repeated characters to two, normalizes repeated punctuation, and trims extra whitespace. The preprocess_dataset function applies this cleaning to every record and converts the original sentiment labels into binary labels: negative (1) and mixed (4) sentiment are treated as complaints (1), while neutral (2) and positive (3) sentiment are treated as non-complaints (0). After preprocessing, the code prints the number of processed samples and the distribution of complaint versus non-complaint labels.
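To make the cleaning rules and the label mapping concrete, the sketch below reapplies the same regular expressions to one made-up review and maps the four sentiment codes to binary labels. It omits the emoji step so that it needs only the standard library; `normalize`, `to_complaint_label`, and the sample sentence are illustrative names and data, not part of the notebook.

```python
import re


def normalize(text: str) -> str:
    """Regex steps of clean_taglish_text, without the emoji conversion."""
    text = re.sub(r'http\S+|www\S+', '', text)   # drop URLs
    text = re.sub(r'(.)\1{3,}', r'\1\1', text)   # 4+ repeats of a character -> 2
    text = re.sub(r'[!?]{2,}', '!', text)        # runs of ! or ? -> a single !
    text = re.sub(r'\.{2,}', '.', text)          # runs of dots -> a single dot
    return ' '.join(text.split())                # collapse whitespace


# Sentiment codes treated as complaints: 1 = negative, 4 = mixed
COMPLAINT_SENTIMENTS = {1, 4}


def to_complaint_label(sentiment: int) -> int:
    """Map a 1-4 sentiment code to a binary complaint label."""
    return 1 if sentiment in COMPLAINT_SENTIMENTS else 0


sample = "Ang bagaaaal ng delivery!!! Sayang... http://example.com/abc"
print(normalize(sample))
# Ang bagaal ng delivery! Sayang.
print([to_complaint_label(s) for s in (1, 2, 3, 4)])
# [1, 0, 0, 1]
```

Note that the punctuation rule also turns a run such as `?!` into a single `!`, so some question marks are lost during cleaning.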

## 5. Split the Dataset
Create train, validation, and test sets with stratification

In [5]:
# Convert to DataFrame
df = pd.DataFrame(processed_data)

# Split data: 70% train, 15% validation, 15% test
train_val_df, test_df = train_test_split(
    df, test_size=0.15, stratify=df['label'], random_state=RANDOM_SEED
)

train_df, val_df = train_test_split(
    train_val_df, test_size=0.176, stratify=train_val_df['label'], random_state=RANDOM_SEED
)  # 0.176 ≈ 15% of original

print(f"Dataset splits:")
print(f"  Train: {len(train_df)} samples")
print(f"  Validation: {len(val_df)} samples")
print(f"  Test: {len(test_df)} samples")

# Verify stratification
for name, split_df in [("Train", train_df), ("Val", val_df), ("Test", test_df)]:
    complaint_ratio = split_df['label'].mean()
    print(f"  {name} complaint ratio: {complaint_ratio:.2%}")

Dataset splits:
  Train: 7360 samples
  Validation: 1573 samples
  Test: 1577 samples
  Train complaint ratio: 64.76%
  Val complaint ratio: 64.72%
  Test complaint ratio: 64.74%


Converting to DataFrame and Splitting the Data

Converts the processed data into a Pandas DataFrame for easier manipulation. The dataset is then split in two steps: 15% is held out as the test set, and 17.6% of the remainder (about 15% of the original data) becomes the validation set, leaving roughly 70% for training. Stratified sampling keeps the proportion of complaint and non-complaint labels consistent across all splits. After splitting, the code prints the number of samples in each split and confirms that the complaint ratio is nearly identical (about 64.7%) in each one.
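The arithmetic behind the two-step split can be checked with a few lines of standard-library Python. This sketch assumes the held-out count is rounded up at each step, which reproduces the sizes printed above; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_total, test_fraction=0.15, val_fraction_of_rest=0.176):
    """Return (train, val, test) sizes for a two-step hold-out split.

    Assumes each held-out count is rounded up.
    """
    n_test = math.ceil(n_total * test_fraction)
    n_rest = n_total - n_test
    n_val = math.ceil(n_rest * val_fraction_of_rest)
    n_train = n_rest - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(10510)
print(n_train, n_val, n_test)   # 7360 1573 1577

# Validation share of the original data: 85% of the data times 17.6%
print(round(0.85 * 0.176, 4))   # 0.1496
```

The second fraction that gives exactly 15% of the original data is 0.15 / 0.85, or about 0.1765, so 0.176 is a close approximation.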

## 6. Baseline Model: TF-IDF + Logistic Regression
Establish baseline performance

In [6]:
print("Training Baseline Model (TF-IDF + Logistic Regression)")
print("="*50)

# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 3),
    min_df=2,
    max_df=0.95
)

# Vectorize text
X_train_tfidf = tfidf_vectorizer.fit_transform(train_df['text'])
X_val_tfidf = tfidf_vectorizer.transform(val_df['text'])
X_test_tfidf = tfidf_vectorizer.transform(test_df['text'])

print(f"TF-IDF features shape: {X_train_tfidf.shape}")

# Train Logistic Regression
lr_classifier = LogisticRegression(
    random_state=RANDOM_SEED,
    max_iter=1000,
    class_weight='balanced'
)

lr_classifier.fit(X_train_tfidf, train_df['label'])

# Evaluate on validation set
val_pred_baseline = lr_classifier.predict(X_val_tfidf)
val_acc = accuracy_score(val_df['label'], val_pred_baseline)
precision, recall, f1, _ = precision_recall_fscore_support(
    val_df['label'], val_pred_baseline, average='macro'
)

print(f"\nBaseline Validation Results:")
print(f"  Accuracy: {val_acc:.4f}")
print(f"  Precision: {precision:.4f}")
print(f"  Recall: {recall:.4f}")
print(f"  F1-Score: {f1:.4f}")

# Test set evaluation
test_pred_baseline = lr_classifier.predict(X_test_tfidf)
test_precision, test_recall, test_f1, _ = precision_recall_fscore_support(
    test_df['label'], test_pred_baseline, average='macro'
)
baseline_test_metrics = {
    'accuracy': accuracy_score(test_df['label'], test_pred_baseline),
    'precision': test_precision,
    'recall': test_recall,
    'f1': test_f1
}

print(f"\nBaseline Test Results:")
for metric, value in baseline_test_metrics.items():
    print(f"  {metric}: {value:.4f}")

Training Baseline Model (TF-IDF + Logistic Regression)
TF-IDF features shape: (7360, 5000)

Baseline Validation Results:
  Accuracy: 0.8964
  Precision: 0.8887
  Recall: 0.8831
  F1-Score: 0.8857

Baseline Test Results:
  accuracy: 0.8947
  precision: 0.8847
  recall: 0.8847
  f1: 0.8847


Trains a baseline text classification model using TF-IDF features and Logistic Regression. The TF-IDF vectorizer converts each review into a numerical vector in which a term's weight grows with its frequency in the review and shrinks the more reviews contain it. It uses unigrams, bigrams, and trigrams, keeps at most 5,000 features, and drops terms that appear in fewer than 2 reviews or in more than 95% of them. These features are used to train a Logistic Regression classifier with balanced class weights to compensate for the label imbalance. The model is first evaluated on the validation set to check performance during development, and then on the test set to obtain the final baseline accuracy and macro-averaged precision, recall, and F1 score.
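Two of the settings above can be illustrated without scikit-learn. The sketch below lists the word n-grams that `ngram_range=(1, 3)` would consider for a short review, using the same default token pattern as `TfidfVectorizer` (words of two or more characters), and computes class weights with the `n_samples / (n_classes * class_count)` formula that `class_weight='balanced'` uses. The helper names are hypothetical, and the label counts are the full-dataset counts printed earlier rather than the training-split counts the notebook actually fits on.

```python
import re
from collections import Counter


def word_ngrams(text, n_min=1, n_max=3):
    """List lowercase word n-grams from n_min to n_max words long."""
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


print(word_ngrams("Ang ganda ng packaging"))
# ['ang', 'ganda', 'ng', 'packaging', 'ang ganda', 'ganda ng',
#  'ng packaging', 'ang ganda ng', 'ganda ng packaging']

labels = [1] * 6805 + [0] * 3705   # complaint / non-complaint counts
weights = balanced_class_weights(labels)
print({c: round(w, 3) for c, w in sorted(weights.items())})
# {0: 1.418, 1: 0.772}
```

The minority class (non-complaints) receives the larger weight, so errors on it cost more during training.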

## 7. Feature Analysis for Baseline
Identify important words for classification

In [7]:
# Get feature importance
feature_names = tfidf_vectorizer.get_feature_names_out()
coef = lr_classifier.coef_[0]

# Top complaint indicators
top_complaint_idx = coef.argsort()[-15:][::-1]
top_complaint_features = [(feature_names[i], coef[i]) for i in top_complaint_idx]

print("Top 15 Complaint Indicators:")
for feature, score in top_complaint_features:
    print(f"  {feature}: {score:.3f}")

# Top non-complaint indicators
top_non_complaint_idx = coef.argsort()[:15]
top_non_complaint_features = [(feature_names[i], coef[i]) for i in top_non_complaint_idx]

print("\nTop 15 Non-Complaint Indicators:")
for feature, score in top_non_complaint_features:
    print(f"  {feature}: {score:.3f}")

Top 15 Complaint Indicators:
  kaso: 6.877
  not: 5.475
  pero: 4.372
  disappointed: 3.546
  lang: 3.251
  yung: 3.159
  hindi: 3.065
  but: 3.006
  sira: 2.944
  di: 2.818
  sayang: 2.528
  tapos: 2.336
  mali: 2.325
  wrong: 2.323
  poor: 2.257

Top 15 Non-Complaint Indicators:
  ganda: -5.070
  thank: -3.600
  super: -3.220
  salamat: -2.868
  love: -2.767
  maganda: -2.581
  thank you: -2.522
  good: -2.510
  ulit: -2.448
  nice: -2.434
  sulit: -2.339
  safe: -2.299
  ang ganda: -2.298
  thankyou: -2.155
  thanks: -2.151


Identifying Important Features

The TF-IDF feature names are retrieved, and the logistic regression coefficients are used to determine how strongly each term influences the classification. The largest positive coefficients belong to terms that push a review toward the complaint class, such as the contrast markers "kaso" and "pero" and the negations "not" and "hindi". The most negative coefficients belong to terms that push it toward the non-complaint class, such as "ganda" and "thank you". The code prints the top 15 terms for each group, which can be single words or multi-word n-grams, along with their coefficients.
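A short sketch of how these coefficients turn into a prediction: logistic regression sums coefficient times feature value over the terms in a review, adds an intercept, and passes the result through a sigmoid. The four coefficients below are copied from the output above; the TF-IDF values and the zero intercept are invented for illustration and do not come from the trained model.

```python
import math

# Coefficients copied from the printed feature list
coefficients = {"kaso": 6.877, "sira": 2.944, "ganda": -5.070, "sulit": -2.339}

# Hypothetical TF-IDF values for a review containing "ganda", "kaso", "sira"
tfidf_weights = {"ganda": 0.5, "kaso": 0.6, "sira": 0.6}

INTERCEPT = 0.0  # assumed; the notebook does not print the fitted intercept


def complaint_probability(weights, coefs, bias=0.0):
    """Sigmoid of the weighted sum of term contributions."""
    score = bias + sum(coefs.get(term, 0.0) * w for term, w in weights.items())
    return 1.0 / (1.0 + math.exp(-score))


for term, w in tfidf_weights.items():
    print(f"{term}: contribution {coefficients[term] * w:+.3f}")

p = complaint_probability(tfidf_weights, coefficients, INTERCEPT)
print(f"P(complaint) = {p:.3f}")
```

In this made-up review the positive term "ganda" is outweighed by "kaso" and "sira", which is the pattern a mixed review ("maganda kaso sira") would show.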

## 8. Transformer Model Setup
Prepare functions for training transformer models

In [8]:
def compute_metrics(eval_pred):
    """Compute metrics for evaluation"""
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average='macro'
    )

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

def prepare_dataset_for_transformer(df, tokenizer, max_length=256):
    """Prepare dataset for transformer training"""
    def tokenize_function(examples):
        return tokenizer(
            examples['text'],
            padding='max_length',
            truncation=True,
            max_length=max_length
        )

    dataset = Dataset.from_pandas(df)
    tokenized_dataset = dataset.map(tokenize_function, batched=True)
    tokenized_dataset = tokenized_dataset.rename_column("label", "labels")
    tokenized_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

    return tokenized_dataset

Defines two helper functions for transformer training. The compute_metrics function converts the model's output scores to class predictions with argmax and calculates accuracy plus macro-averaged precision, recall, and F1 score. The prepare_dataset_for_transformer function converts a DataFrame into a Hugging Face Dataset, tokenizes the text with truncation and padding to a fixed length (256 tokens by default), and renames the label column to "labels", the name the Trainer expects. The result is formatted as PyTorch tensors so it can be used directly for model training.
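Macro averaging, as used in compute_metrics, scores each class separately and then takes the unweighted mean, so the smaller non-complaint class counts as much as the larger complaint class. The sketch below computes this by hand on a toy example; `macro_scores` and the toy labels are illustrative and not part of the notebook.

```python
def macro_scores(y_true, y_pred, classes=(0, 1)):
    """Return macro-averaged (precision, recall, f1) over the given classes."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
precision, recall, f1 = macro_scores(y_true, y_pred)
print(f"{precision:.4f} {recall:.4f} {f1:.4f}")
# 0.5833 0.5833 0.5833
```

Here class 1 scores 2/3 on every metric and class 0 scores 1/2, so each macro average is their mean.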

## 9. Train RoBERTa-TL (Filipino) Model
Fine-tune Filipino-specific transformer

In [18]:
print("Training RoBERTa-TL (Filipino) Model")
print("="*50)

# Load model and tokenizer
model_name_tl = "jcblaise/roberta-tagalog-base"
tl_tokenizer = AutoTokenizer.from_pretrained(model_name_tl)
tl_model = AutoModelForSequenceClassification.from_pretrained(
    model_name_tl,
    num_labels=2
).to(device)

# Add padding token if needed
if tl_tokenizer.pad_token is None:
    tl_tokenizer.pad_token = tl_tokenizer.eos_token

# Prepare datasets
train_dataset_tl = prepare_dataset_for_transformer(train_df, tl_tokenizer)
val_dataset_tl = prepare_dataset_for_transformer(val_df, tl_tokenizer)

# Use same training arguments
training_args_tl = TrainingArguments(
    output_dir="./results/roberta-tl",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=1,
    warmup_steps=200,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=50,
    eval_strategy="epoch",
    save_strategy="epoch",
    lr_scheduler_type="linear",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    fp16=torch.cuda.is_available(),
    report_to="none",
    seed=RANDOM_SEED
)

# Create trainer
tl_trainer = Trainer(
    model=tl_model,
    args=training_args_tl,
    train_dataset=train_dataset_tl,
    eval_dataset=val_dataset_tl,
    tokenizer=tl_tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=7)]
)

# Train
print("Starting training...")
tl_trainer.train()

# Evaluate
tl_val_results = tl_trainer.evaluate()
print("\nRoBERTa-TL Validation Results:")
for key, value in tl_val_results.items():
    if not key.startswith('eval_'):
        continue
    metric_name = key.replace('eval_', '')
    print(f"  {metric_name}: {value:.4f}")

Training RoBERTa-TL (Filipino) Model


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at jcblaise/roberta-tagalog-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Starting training...


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.3059 | 0.315494 | 0.895105 | 0.904138 | 0.865283 | 0.879881 |
| 2 | 0.2163 | 0.412538 | 0.908455 | 0.91477 | 0.883792 | 0.896174 |
| 3 | 0.1364 | 0.459711 | 0.907184 | 0.905371 | 0.889365 | 0.896424 |



RoBERTa-TL Validation Results:
  loss: 0.4597
  accuracy: 0.9072
  precision: 0.9054
  recall: 0.8894
  f1: 0.8964
  runtime: 5.6874
  samples_per_second: 276.5760
  steps_per_second: 34.6380


The Filipino RoBERTa model (jcblaise/roberta-tagalog-base, referred to here as RoBERTa-TL) is loaded with its tokenizer and a two-label classification head. If the tokenizer has no padding token, the end-of-sequence token is used instead. The training and validation sets are tokenized with prepare_dataset_for_transformer, and the model is fine-tuned for 3 epochs with a batch size of 8, 200 warmup steps, and a linear learning-rate schedule. Evaluation runs once per epoch, and the checkpoint with the highest validation F1 is reloaded at the end. Because early stopping is configured with a patience of 7 evaluations and only 3 are run, it cannot trigger in this run. Validation F1 rises from 0.8799 to 0.8964 over the three epochs, slightly above the baseline's 0.8857, while validation loss climbs from 0.3155 to 0.4597 as training loss falls, which suggests the model starts to overfit after the first epoch.
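The interaction between patience and the number of evaluations can be simulated in a few lines. This is a simplified sketch of patience-based early stopping on a metric where higher is better, not the Hugging Face implementation; `early_stop_epoch` is a hypothetical helper, and the first list holds the validation F1 values from the training log above.

```python
def early_stop_epoch(scores, patience, threshold=0.0):
    """Return the epoch at which training would stop, or None if it never does.

    The counter resets whenever a score beats the best so far by more than
    `threshold`; training stops once `patience` evaluations pass without that.
    """
    best = None
    stale = 0
    for epoch, score in enumerate(scores, start=1):
        if best is None or score > best + threshold:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


val_f1 = [0.879881, 0.896174, 0.896424]  # per-epoch validation F1 from the log
print(early_stop_epoch(val_f1, patience=7))
# None: three evaluations can never exhaust a patience of 7

print(early_stop_epoch([0.90, 0.89, 0.88, 0.87], patience=2))
# 3: two evaluations in a row without improvement
```

For early stopping to have any effect in a run like this one, the patience would need to be smaller than the number of evaluations, or the number of epochs would need to be larger.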