# BERT Embeddings + XGBoost for Ordinal Text Classification (IMDb)

## Introduction

In this notebook, we explore ordinal text classification by combining **pretrained BERT embeddings** with a **gradient-boosted decision tree model (XGBoost)**. While BERT helps capture rich semantic features from text, XGBoost handles the ordinal regression by treating it as a regression task followed by rounding.

Although IMDb is originally a binary sentiment dataset, we simulate ordinal behavior by mapping positive/negative sentiments to multiple stars (1–5), mimicking a real-world rating system.


In [None]:
!pip install -q transformers xgboost accelerate scikit-learn datasets

In [17]:
import numpy as np
import xgboost as xgb
from transformers import BertTokenizer, BertModel
from datasets import load_dataset
from sklearn.metrics import mean_absolute_error, cohen_kappa_score
from sklearn.model_selection import train_test_split
import torch
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Dataset: IMDb (Simulated Ordinal Labels)

- Source: Hugging Face `datasets`
- Original Task: Binary sentiment classification (positive/negative)
- Transformation:
  - Negative → randomly assigned 1 or 2 stars
  - Positive → randomly assigned 4 or 5 stars
  - This simulates **ordinal labels** while preserving underlying sentiment


In [None]:
# Load small IMDb dataset
dataset = load_dataset("imdb")

# Simulate ordinal labels: map binary IMDb to 1–5
def simulate_ordinal(example):
    if example['label'] == 0:
        example['stars'] = np.random.choice([1, 2])
    else:
        example['stars'] = np.random.choice([4, 5])
    return example

dataset = dataset['train'].select(range(1000)).map(simulate_ordinal)
texts = dataset['text']
labels = np.array(dataset['stars'])

## BERT Embedding Extraction

We use the `bert-base-uncased` model from Hugging Face to extract semantic representations:
- Each review is tokenized and passed through BERT.
- The `[CLS]` token embedding is extracted as the sentence-level representation.
- Embeddings are stored and used as input features for XGBoost.


In [19]:
# Load BERT for embeddings
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").to(device)
bert.eval()

# Function to extract CLS token embedding
def get_bert_embeddings(texts):
    embeddings = []
    with torch.no_grad():
        for i in tqdm(range(0, len(texts), 16)):
            batch = texts[i:i+16]
            enc = tokenizer(batch, return_tensors='pt', padding=True, truncation=True, max_length=128)
            input_ids = enc['input_ids'].to(device)
            attention_mask = enc['attention_mask'].to(device)
            outputs = bert(input_ids=input_ids, attention_mask=attention_mask)
            cls_embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()
            embeddings.append(cls_embeddings)
    return np.vstack(embeddings)

# Get BERT embeddings
X = get_bert_embeddings(texts)
y = labels

100%|██████████| 63/63 [00:20<00:00,  3.14it/s]


## XGBoost for Ordinal Approximation

- Model: `XGBRegressor` with `reg:squarederror` (regression loss)
- Labels are treated as continuous values between 1–5
- Predictions are rounded and clipped to ensure valid ordinal outputs
- While not true ordinal regression, this approach provides a strong baseline


In [20]:
# Train-test split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train XGBoost with ordinal loss approximation (use 'reg:squarederror' + rounding)
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', max_depth=4, n_estimators=100)
xgb_model.fit(X_train, y_train)


## Evaluation Metrics

We assess model performance using:
- **Mean Absolute Error (MAE)** – Measures the average absolute difference between predicted and true ratings.
- **Quadratic Weighted Kappa (QWK)** – Evaluates agreement between predicted and true ratings, considering ordinal distance.


In [21]:
# Predict and round to nearest integer
y_pred = np.clip(np.round(xgb_model.predict(X_val)), 1, 5).astype(int)

# Evaluation
mae = mean_absolute_error(y_val, y_pred)
qwk = cohen_kappa_score(y_val, y_pred, weights='quadratic')

print(f"MAE: {mae:.4f}, QWK: {qwk:.4f}")

MAE: 0.5050, QWK: -0.0096


## Notes

- This is a lightweight, interpretable method that’s faster than fine-tuning transformers.
- Works well when you want to decouple feature extraction (deep learning) and modeling (classical ML).