# Fine-tuning a Transformer for Sentiment Analysis
Objective: Fine-tune a BERT-family model (DistilBERT) on the SST-2 dataset to perform binary sentiment analysis

### Dataset

Stanford Sentiment Treebank v2 (SST-2): 67,349 training phrases, each with a human-labeled binary sentiment class (negative or positive)

### Model
Fine-tuning `distilbert-base-uncased` (about 67M parameters), chosen for its smaller size and faster training

In [1]:
!pip install -q datasets transformers accelerate evaluate


In [2]:
import torch

if torch.cuda.is_available():
    num_gpus = torch.cuda.device_count()
    print(f"✓ Number of GPUs: {num_gpus}")
    for i in range(num_gpus):
        gpu_name = torch.cuda.get_device_name(i)
        gpu_memory = torch.cuda.get_device_properties(i).total_memory / 1e9
        print(f"  GPU {i}: {gpu_name} ({gpu_memory:.2f} GB)")
    device = torch.device("cuda")
else:
    print("⚠ No GPU available, using CPU")
    device = torch.device("cpu")

# Clear GPU cache
torch.cuda.empty_cache()


✓ Number of GPUs: 2
  GPU 0: Tesla T4 (15.83 GB)
  GPU 1: Tesla T4 (15.83 GB)


In [3]:
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback
)



In [4]:
# Set to True for quick pipeline testing, False for full training
TEST_MODE = False
TEST_SUBSET_SIZE = 1000

checkpoint = "distilbert-base-uncased"

In [5]:
dataset = load_dataset("stanfordnlp/sst2")

print("Original dataset structure:")
print(dataset)

# Create a proper test set from training data since official test labels are hidden
train_valid_test = dataset["train"].train_test_split(test_size=0.1, seed=42)
train_valid = train_valid_test["train"].train_test_split(test_size=0.1, seed=42)

dataset["train"] = train_valid["train"]
dataset["validation"] = train_valid["test"]  # Use as validation
dataset["test"] = train_valid_test["test"]   # Use as test

print("\nModified dataset structure:")
print(f"  Train: {len(dataset['train'])} samples")
print(f"  Validation: {len(dataset['validation'])} samples")
print(f"  Test: {len(dataset['test'])} samples")

if TEST_MODE:
    print(f"\n⚠ TEST MODE: Using {TEST_SUBSET_SIZE} samples per split")
    dataset["train"] = dataset["train"].shuffle(seed=42).select(range(min(TEST_SUBSET_SIZE, len(dataset["train"]))))
    dataset["validation"] = dataset["validation"].shuffle(seed=42).select(range(min(TEST_SUBSET_SIZE, len(dataset["validation"]))))
    dataset["test"] = dataset["test"].shuffle(seed=42).select(range(min(TEST_SUBSET_SIZE, len(dataset["test"]))))



Original dataset structure:
DatasetDict({
    train: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 872
    })
    test: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 1821
    })
})

Modified dataset structure:
  Train: 54552 samples
  Validation: 6062 samples
  Test: 6735 samples
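The split sizes above follow from applying a 10% hold-out twice. A minimal sketch of the arithmetic, assuming the held-out size is rounded up (which is consistent with the printed counts; `split_sizes` is a hypothetical helper, not part of `datasets`):

```python
import math

def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the held-out part is rounded up to a whole sample.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_original = 67349
n_train_valid, n_test = split_sizes(n_original, 0.1)   # first split: carve off test
n_train, n_valid = split_sizes(n_train_valid, 0.1)     # second split: carve off validation

print(n_train, n_valid, n_test)
print(f"fractions: {n_train / n_original:.2f} / {n_valid / n_original:.2f} / {n_test / n_original:.2f}")
```

So the final proportions are roughly 81% train, 9% validation, and 10% test of the original training split.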


In [6]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1}
)

# Move model to GPU
model.to(device)

print(f"\n✓ Loaded model: {checkpoint}")
print(f"  Parameters: {model.num_parameters():,}")
print(f"  Device: {next(model.parameters()).device}")



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



✓ Loaded model: distilbert-base-uncased
  Parameters: 66,955,010
  Device: cuda:0


In [7]:
def tokenize_function(examples):
    return tokenizer(
        examples["sentence"],
        truncation=True,
        max_length=128,
    )

tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["sentence", "idx"],
    desc="Tokenizing"
)


In [8]:
data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer,
    padding=True,
    return_tensors="pt"
)
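Because tokenization above truncates but does not pad, padding happens per batch in the collator. A rough pure-Python sketch of that idea, not the library's implementation (`pad_batch` is a hypothetical helper, and `pad_id=0` and the token ids are assumed values for illustration):

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in this batch only."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for two short phrases (not real tokenizer output)
batch = pad_batch([[101, 2307, 102], [101, 1037, 2200, 2919, 3185, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Short SST-2 phrases then only pay for the longest phrase in their own batch, rather than for the full `max_length=128`.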


In [9]:
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    """Compute accuracy and F1 score for evaluation."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="binary")

    return {
        "accuracy": accuracy["accuracy"],
        "f1": f1["f1"]
    }
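To make the two metrics concrete, here is a small hand-rolled sketch of accuracy and binary F1 on toy logits. It does not call the `evaluate` library; `accuracy_and_f1` and the toy values are illustrative assumptions, with label 1 (POSITIVE) treated as the positive class.

```python
def accuracy_and_f1(predictions, references, positive_label=1):
    """Accuracy and binary F1 computed from raw counts."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)
    accuracy = sum(p == r for p, r in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}

# Toy logits for five examples: [NEGATIVE score, POSITIVE score]
toy_logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3], [0.2, 0.4]]
toy_labels = [0, 1, 1, 1, 0]
toy_predictions = [row.index(max(row)) for row in toy_logits]  # argmax over classes

toy_metrics = accuracy_and_f1(toy_predictions, toy_labels)
print(toy_predictions, toy_metrics)
```

F1 only counts errors involving the positive class, which is why it can differ from accuracy even on a binary task.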


Downloading builder script: 0.00B [00:00, ?B/s]

Downloading builder script: 0.00B [00:00, ?B/s]

In [10]:
training_args = TrainingArguments(
    output_dir="/kaggle/working/sst2-distilbert-finetuned",

    # Training hyperparameters
    num_train_epochs=5, # increased 3 -> 5
    learning_rate=2e-5,
    weight_decay=0.05, # increased 0.01 -> 0.05
    warmup_ratio=0.1,

    # Batch sizes (optimized for T4 x2)
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,

    # Evaluation strategy
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,

    # Logging
    logging_dir="/kaggle/working/logs",
    logging_steps=100,
    report_to="none",

    # Best model
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # 'accuracy' -> 'eval_loss'
    greater_is_better=False,

    # Reproducibility
    seed=42,

    # GPU-specific settings
    fp16=True,                    # Mixed-precision training
    dataloader_num_workers=2,     # Parallel data loading

    # Label smoothing
    label_smoothing_factor=0.1,   # newly added
)
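A note on reading the loss values with `label_smoothing_factor=0.1`: the smoothed loss has a non-zero floor, so losses around 0.28 are not directly comparable to unsmoothed cross-entropy. The sketch below is my understanding of the usual formulation, (1 - eps) * NLL + eps * mean over classes of -log p, and is an assumption rather than the Trainer's exact code; the function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def smoothed_cross_entropy(logits, label, epsilon=0.1):
    """Label-smoothed loss for one example (assumed formulation)."""
    log_probs = log_softmax(logits)
    nll = -log_probs[label]                      # usual cross-entropy term
    uniform = -sum(log_probs) / len(log_probs)   # penalty against a uniform target
    return (1 - epsilon) * nll + epsilon * uniform

def smoothed_loss_floor(num_classes=2, epsilon=0.1):
    """Lowest achievable smoothed loss: entropy of the smoothed target."""
    target = [epsilon / num_classes] * num_classes
    target[0] += 1 - epsilon
    return -sum(t * math.log(t) for t in target)

floor = smoothed_loss_floor()
overconfident = smoothed_cross_entropy([6.0, -6.0], label=0)
print(f"floor: {floor:.4f}, overconfident but correct: {overconfident:.4f}")
```

Under these assumptions the floor for two classes is about 0.1985, and a very confident correct prediction is penalized more than a calibrated one, which is the intended regularizing effect.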

In [11]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)
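How `early_stopping_patience=2` interacts with a lower-is-better metric can be sketched as follows. This is a simplified simulation, not the callback's code; `simulate_early_stopping` is a hypothetical helper, and the losses are the validation losses reported in this notebook's training log.

```python
def simulate_early_stopping(eval_losses, patience=2):
    """Return (best_epoch, last_epoch_run) for lower-is-better losses."""
    best_loss, best_epoch, bad_epochs = float("inf"), None, 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch  # stop: no improvement for `patience` evals
    return best_epoch, len(eval_losses)

final_run_eval_losses = [0.311524, 0.287019, 0.284998, 0.287668, 0.290434]
best_epoch, last_epoch = simulate_early_stopping(final_run_eval_losses)
print(f"best epoch: {best_epoch}, training ends after epoch: {last_epoch}")
```

With these values the patience runs out at epoch 5, which coincides with `num_train_epochs=5`, so early stopping did not shorten this particular run.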




In [12]:
num_gpus = torch.cuda.device_count() if torch.cuda.is_available() else 1
effective_batch_size = training_args.per_device_train_batch_size * num_gpus

print("\n" + "=" * 50)
print("Starting GPU Training...")
print(f"Number of GPUs: {num_gpus}")
print(f"Effective batch size: {effective_batch_size}")
print("=" * 50)

train_result = trainer.train()

print(f"\n{'=' * 50}")
print("Training Complete!")
print(f"{'=' * 50}")
print(f"Total training time: {train_result.metrics['train_runtime']:.2f} seconds")
print(f"  ({train_result.metrics['train_runtime'] / 60:.2f} minutes)")
print(f"Samples/second: {train_result.metrics['train_samples_per_second']:.2f}")


Starting GPU Training...
Number of GPUs: 2
Effective batch size: 64
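The effective batch size also fixes the number of optimizer steps and the warmup length. A back-of-the-envelope sketch, assuming no gradient accumulation, that the last partial batch is kept, that warmup steps are rounded up, and that the default linear schedule is used (`linear_schedule_lr` is a hypothetical helper):

```python
import math

train_samples = 54552
effective_batch_size = 32 * 2   # per_device_train_batch_size x number of GPUs
num_epochs = 5
warmup_ratio = 0.1
peak_lr = 2e-5

steps_per_epoch = math.ceil(train_samples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

def linear_schedule_lr(step):
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(steps_per_epoch, total_steps, warmup_steps)
print(linear_schedule_lr(warmup_steps // 2), linear_schedule_lr(total_steps // 2))
```

Under these assumptions the learning rate ramps up during roughly the first half of epoch 1 and decays for the rest of training.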


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.3354 | 0.311524 | 0.934345 | 0.940914 |
| 2 | 0.2834 | 0.287019 | 0.951006 | 0.956975 |
| 3 | 0.2606 | 0.284998 | 0.952821 | 0.958635 |
| 4 | 0.2477 | 0.287668 | 0.953316 | 0.958860 |
| 5 | 0.2274 | 0.290434 | 0.954141 | 0.959499 |




Training Complete!
Total training time: 785.78 seconds
  (13.10 minutes)
Samples/second: 347.12


In [13]:
print(f"\n{'=' * 50}")
print("Validation Set Evaluation:")
print(f"{'=' * 50}")

val_results = trainer.evaluate(tokenized_datasets["validation"])
for key, value in val_results.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")



Validation Set Evaluation:




  eval_loss: 0.2850
  eval_accuracy: 0.9528
  eval_f1: 0.9586
  eval_runtime: 5.1396
  eval_samples_per_second: 1179.4630
  eval_steps_per_second: 9.3390
  epoch: 5.0000


In [14]:
print(f"\n{'=' * 50}")
print("Test Set Evaluation:")
print(f"{'=' * 50}")

test_results = trainer.evaluate(tokenized_datasets["test"])
for key, value in test_results.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")


Test Set Evaluation:




  eval_loss: 0.2961
  eval_accuracy: 0.9471
  eval_f1: 0.9521
  eval_runtime: 5.7867
  eval_samples_per_second: 1163.8850
  eval_steps_per_second: 9.1590
  epoch: 5.0000


In [15]:
save_path = "/kaggle/working/sst2-distilbert-final"
trainer.save_model(save_path)
tokenizer.save_pretrained(save_path)
print(f"\n✓ Model saved to: {save_path}")


✓ Model saved to: /kaggle/working/sst2-distilbert-final


In [16]:
num_gpus = torch.cuda.device_count() if torch.cuda.is_available() else 1
gpu_name = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"

print(f"\n{'=' * 50}")
print("TRAINING SUMMARY")
print(f"{'=' * 50}")
print(f"  Model: {checkpoint}")
print(f"  Accelerator: {num_gpus}x {gpu_name}")
print(f"  Mode: {'TEST' if TEST_MODE else 'FULL'}")
print(f"  Training samples: {len(tokenized_datasets['train'])}")
print(f"  Effective batch size: {training_args.per_device_train_batch_size * num_gpus}")
print(f"  Validation Accuracy: {val_results['eval_accuracy']:.4f}")
print(f"  Test Accuracy: {test_results['eval_accuracy']:.4f}")
print(f"  Total Time: {train_result.metrics['train_runtime'] / 60:.2f} minutes")
print(f"{'=' * 50}")


TRAINING SUMMARY
  Model: distilbert-base-uncased
  Accelerator: 2x Tesla T4
  Mode: FULL
  Training samples: 54552
  Effective batch size: 64
  Validation Accuracy: 0.9528
  Test Accuracy: 0.9471
  Total Time: 13.10 minutes


### Run Evaluation
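1st Run:

CRITICAL ERROR - The same validation data was used both during fine-tuning (for model selection) and for the final test evaluation, so the test results were skewed.

Changes - Created a separate held-out test set from the training data, increased epochs (3 -> 5) and weight decay (0.01 -> 0.05), changed the best-model metric from accuracy to eval_loss, and added label smoothing to address overfitting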
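A cheap guard against this kind of leak is to check that no example id appears in more than one split. A minimal sketch with made-up ids; in the notebook the ids would presumably come from the `idx` column before it is removed during tokenization, and `find_split_overlaps` is a hypothetical helper.

```python
from itertools import combinations

def find_split_overlaps(split_ids):
    """Map each pair of split names to the example ids they share."""
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(split_ids.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps

# Hypothetical ids with a deliberate leak between validation and test
toy_splits = {"train": [0, 1, 2, 3], "validation": [4, 5], "test": [5, 6]}
overlaps = find_split_overlaps(toy_splits)
print(overlaps or "no overlap between splits")
```

An empty result is what a correct split should produce.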

2nd Run:

CRITICAL ERROR - The best checkpoint was selected by the largest eval_loss rather than the smallest

Changes - Set greater_is_better=False so that the checkpoint with the lowest eval_loss is selected
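The effect of the `greater_is_better` direction can be illustrated with the per-epoch metrics from the final run's training log (the second run's own values are not recorded here, so this only shows the mechanism; `pick_best_epoch` is a hypothetical helper):

```python
def pick_best_epoch(metric_by_epoch, greater_is_better):
    """Return the epoch whose metric is best under the given direction."""
    choose = max if greater_is_better else min
    return choose(metric_by_epoch, key=metric_by_epoch.get)

eval_loss_by_epoch = {1: 0.311524, 2: 0.287019, 3: 0.284998, 4: 0.287668, 5: 0.290434}
eval_accuracy_by_epoch = {1: 0.934345, 2: 0.951006, 3: 0.952821, 4: 0.953316, 5: 0.954141}

wrong_choice = pick_best_epoch(eval_loss_by_epoch, greater_is_better=True)
right_choice = pick_best_epoch(eval_loss_by_epoch, greater_is_better=False)
by_accuracy = pick_best_epoch(eval_accuracy_by_epoch, greater_is_better=True)
print(wrong_choice, right_choice, by_accuracy)
```

With the wrong direction the least-trained epoch wins. Note also that loss and accuracy disagree on the best epoch in this run (3 versus 5), so the choice of `metric_for_best_model` matters.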

Final Run: 

Model: distilbert-base-uncased

Accelerator: 2x Tesla T4

Mode: FULL

Training samples: 54552

Effective batch size: 64

Validation Accuracy: 0.9528

Test Accuracy: 0.9471

Total Time: 13.10 minutes


### Results

Stopping fine-tuning here. Test accuracy on the 6,735-sample held-out test set is 94.71%. Further improvements are possible, but I would rather first compare these initial results against initial results from a traditional sentiment analysis method and an LLM-call method.