# `02_transformer_experiments.ipynb`

This notebook builds on `01_logistic_regression_baseline.ipynb` and explores transformer-based models for sentiment classification of [Google Reviews](https://www.kaggle.com/datasets/cgrowe96/google-reviews-of-us-medical-facilities).

Baseline results to compare against (Logistic Regression + TF-IDF):
- Accuracy: 0.968
- F1 (positive): 0.972

**Because my local hardware is dated, I trained this model in a [Google Colab notebook](https://colab.research.google.com/drive/1-4vLDxnuPr18D0Jq5XBjITgBFCuVwffM?usp=sharing), where the cell outputs are visible. The only notable difference is the evaluation step: locally I use [evaluation.py](../scripts/evaluation.py) to display a confusion matrix.**

## Setup

### Imports

In [None]:
!pip install transformers datasets evaluate accelerate -q

In [None]:
import sys

import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Make ../scripts importable before loading the local evaluation helper
sys.path.append("../scripts")
from evaluation import evaluate_model

### Device Check

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

### Load and Preview Data

In [None]:
df = pd.read_csv("../data/cleaned_reviews.csv")
df.head()

### Convert to Hugging Face Dataset

In [None]:
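hf_dataset = Dataset.from_pandas(df)
# Stratification requires a ClassLabel column, so encode "label" first
hf_dataset = hf_dataset.class_encode_column("label")
# Fixed seed (42 is an arbitrary choice) keeps the 80/20 split reproducible
hf_dataset = hf_dataset.train_test_split(test_size=0.2, stratify_by_column="label", seed=42)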
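
To illustrate what `stratify_by_column="label"` is meant to achieve, here is a minimal pure-Python sketch of a stratified split. It is not how `datasets` implements it internally; `stratified_split` and the toy label list are hypothetical names and data made up for this example. The idea is simply that each class is split separately, so the test set keeps roughly the same class ratio as the full data.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                       # shuffle within the class
        n_test = round(len(indices) * test_size)   # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy, made-up labels: 60% positive, 40% negative
labels = ["positive"] * 60 + ["negative"] * 40
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts["positive"], test_counts["negative"])  # 12 8
```

The test set keeps the 60/40 ratio of the toy data, which is the property I want for the real reviews so that evaluation is not skewed by an unlucky split.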

### Tokenisation

In [None]:
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    # Pad every review to the tokenizer's maximum length and truncate longer ones
    return tokenizer(batch["Review Text"], padding="max_length", truncation=True)

tokenized_datasets = hf_dataset.map(tokenize, batched=True)
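
As a rough illustration of what `padding="max_length"` and `truncation=True` do to each review, the sketch below pads or cuts a list of token ids to a fixed length and builds the matching attention mask. This is a simplification under stated assumptions: `pad_or_truncate` is a hypothetical helper, the token ids are made up, `pad_id=0` is a placeholder, and the real tokenizer also takes care of special tokens when truncating, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut token_ids to max_length, then right-pad with pad_id up to max_length."""
    ids = list(token_ids[:max_length])          # truncation
    n_pad = max_length - len(ids)               # how many pad tokens are needed
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

short = pad_or_truncate([101, 2307, 2326, 102], max_length=8)
long = pad_or_truncate(list(range(1, 13)), max_length=8)

print(short["input_ids"])       # [101, 2307, 2326, 102, 0, 0, 0, 0]
print(short["attention_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]
print(long["input_ids"])        # [1, 2, 3, 4, 5, 6, 7, 8]
```

Padding everything to one fixed length is simple but means short reviews carry a lot of padding; padding per batch instead would be a possible optimisation that I have not explored here.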

### Load Model

In [None]:
num_labels = len(set(df["label"]))
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels).to(device)

## Modelling

### Training Setup

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=50,
)
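
To get a feel for what these settings imply, the sketch below estimates the number of optimiser steps and logging events for one run. It assumes a single device and no gradient accumulation; `training_schedule` is a hypothetical helper, and the training-set size of 8000 is a made-up figure, not the real size of this dataset.

```python
import math

def training_schedule(n_train, batch_size, epochs, logging_steps, n_devices=1):
    """Estimate optimiser steps and logging events, assuming no gradient accumulation."""
    # The last, partially filled batch still counts as a step, hence ceil
    steps_per_epoch = math.ceil(n_train / (batch_size * n_devices))
    total_steps = steps_per_epoch * epochs
    log_events = total_steps // logging_steps
    return steps_per_epoch, total_steps, log_events

# n_train=8000 is a hypothetical training-set size, not the real one
steps_per_epoch, total_steps, log_events = training_schedule(
    n_train=8000, batch_size=16, epochs=1, logging_steps=50
)
print(steps_per_epoch, total_steps, log_events)  # 500 500 10
```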

### Training

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer
)

trainer.train()

## Evaluation and Export

### Evaluate Model

In [None]:
label_names = ["negative", "positive"]
y_true, y_pred, _ = evaluate_model(trainer, tokenized_datasets=tokenized_datasets, label_names=label_names)
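
For comparison with the baseline's accuracy and F1 (positive), the sketch below shows how those two numbers can be computed from `y_true` and `y_pred`. It is not the contents of `evaluation.py`, which is not shown here; `binary_metrics` is a hypothetical helper, the example labels are made up, and I am assuming integer labels with `1` meaning positive.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy plus precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels purely for illustration (1 = positive, 0 = negative)
toy_true = [1, 1, 1, 0, 0, 1, 0, 1]
toy_pred = [1, 1, 0, 0, 1, 1, 0, 1]
metrics = binary_metrics(toy_true, toy_pred)
print(metrics["accuracy"])  # 0.75
```

Here 6 of 8 predictions are correct, and with 4 true positives, 1 false positive and 1 false negative, precision, recall and F1 all come out at roughly 0.8.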

### Save Model

In [None]:
trainer.save_model("./distilbert_sentiment")
tokenizer.save_pretrained("./distilbert_sentiment")