# BERT solution

This is our solution using BERT for binary classification. We chose to analyse reviews that have text only (as mentioned in `README.md`). This solution is fast and cheap to run, and it achieves high accuracy.

## Install packages

Install the PyTorch build that matches your CUDA version, following the official website and the notes in `README.md`.

In [None]:
%pip install transformers datasets scikit-learn pandas
%pip install torch torchvision --index-url https://download.pytorch.org/whl/cu129

## Import packages

In [None]:
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from datasets import Dataset
import pandas as pd

## Load data

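We used the GPT-5-labelled data as our training data. The data is split 80/20 into training and validation sets, stratified by `label` so that both sets keep the same class ratio.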

In [None]:
data = pd.read_csv("./data/review-Vermont_10-cleaned.csv")
train_df, val_df = train_test_split(
    data,
    test_size=0.2,
    stratify=data['label'],
    random_state=42
)
train_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))
val_dataset = Dataset.from_pandas(val_df.reset_index(drop=True))
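
To make the effect of `stratify` concrete, here is a simplified pure-Python illustration of a stratified split. It is not scikit-learn's implementation; the helper `stratified_split` and the 80/20 label column are hypothetical and exist only for this sketch.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) so that each class is split by the same ratio."""
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])    # validation share of this class
        train_idx.extend(indices[n_val:])  # the rest goes to training
    return train_idx, val_idx


# Hypothetical, imbalanced label column: 80 clean (0) and 20 flagged (1) rows.
labels = [0] * 80 + [1] * 20
train_idx, val_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
val_counts = Counter(labels[i] for i in val_idx)
print("train:", dict(train_counts))
print("val:  ", dict(val_counts))
```

Both parts keep the original 4:1 class ratio, which is why the validation metrics stay comparable to the training distribution even when one class is rare.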

## Load tokenizer

Tokenize the dataset with the BERT tokenizer. Every review is padded or truncated to 128 tokens, and the data is served in batches of 8.

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True, max_length=128)
train_dataset = train_dataset.map(tokenize, batched=True)
val_dataset = val_dataset.map(tokenize, batched=True)
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
val_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8)
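
The sketch below illustrates what padding to `max_length` and truncation do to a sequence of token ids, and how the attention mask is derived. It is a toy version that ignores special-token handling and is not the tokenizer's own code; the helper `pad_or_truncate` and the example ids are hypothetical.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation: drop the tail
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding: fill up with pad_id
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence gets padded, a long one gets cut (max_length=6 for readability).
short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example ends up with the same length, the `DataLoader` can stack them into fixed-shape tensors, and the attention mask tells the model which positions to ignore.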

## Load model

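BERT is the main model for our solution. We load `bert-base-uncased` with a two-class classification head and fine-tune it with the AdamW optimizer at a learning rate of 2e-5.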

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.to(device)
optimizer = AdamW(model.parameters(), lr=2e-5)

## Training model

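We trained the model for 10 epochs. After each epoch we print the average training loss and a classification report on the validation set.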

In [None]:
epochs = 10
for epoch in range(epochs):
    # Training pass: update the weights on every batch.
    model.train()
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        # Passing labels makes the model compute the classification loss itself.
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
    avg_train_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}/{epochs} - Training loss: {avg_train_loss:.4f}")
    # Validation pass: dropout disabled, no gradient tracking.
    model.eval()
    preds, true_labels = [], []
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            # The predicted class is the index of the largest logit.
            preds.extend(torch.argmax(logits, dim=1).cpu().numpy())
            true_labels.extend(labels.cpu().numpy())
    print("Validation Results:")
    print(classification_report(true_labels, preds, target_names=["clean", "flagged"]))
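
For readers unfamiliar with the numbers in the validation report, this minimal sketch shows how precision, recall and F1 for the "flagged" class (label 1) can be computed by hand. The helper `binary_metrics` and the label lists are hypothetical toy data, not results from our model.

```python
def binary_metrics(true_labels, preds, positive=1):
    """Return (precision, recall, f1) for the given positive class."""
    pairs = list(zip(true_labels, preds))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: 0 = clean, 1 = flagged.
toy_true = [0, 0, 0, 1, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 1, 1, 0, 1, 0]

precision, recall, f1 = binary_metrics(toy_true, toy_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision answers "of the reviews we flagged, how many were really flagged?", while recall answers "of the truly flagged reviews, how many did we catch?". Watching both per epoch helps spot a model that simply predicts the majority class.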

## Sample input

A few test samples taken from other sources, tokenized with the same BERT tokenizer.

In [None]:
reviews = [
    "You can review a lake? How does that work",
    "I work for them Barre Vt Location",
    "Didn't go here lol",
    "Awesome Customer service, quick response, and willing to help whether you are a customer or not. A+ we need more businesses like this."
]
inputs = tokenizer(reviews, padding=True, truncation=True, max_length=128, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

## Predictions

Run the trained model on the test samples and print the predicted label for each review.

In [None]:
model.eval()  # make sure dropout is off, even if this cell is run on its own
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=1)
for review, pred in zip(reviews, predictions):
    label = "clean" if pred.item() == 0 else "flagged"
    print(f"Review: {review}\nPredicted label: {label}\n")
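
The predictions above report only the winning class. As a possible extension, the logits can be turned into probabilities with a softmax to see how confident each prediction is. The sketch below uses plain Python and hypothetical logit values, not outputs from our model.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


LABELS = ["clean", "flagged"]

# Hypothetical logits for two reviews, in the order [clean, flagged].
example_logits = [[2.0, -1.0], [-0.5, 0.5]]

results = []
for row in example_logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # same choice as argmax
    results.append((LABELS[best], probs[best]))
    print(f"Predicted label: {LABELS[best]} (confidence {probs[best]:.2f})")
```

A low confidence on the winning class can be used to route borderline reviews to manual inspection instead of trusting the label directly.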