# k-fold cross-validation for BERT

--> Code from: https://vtiya.medium.com/lets-code-k-fold-validation-on-bert-722f9438f932

## 0. Setup

### 0.1 Install libraries

In [22]:
! pip install -r requirements.txt




[notice] A new release of pip is available: 25.0.1 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [23]:
# If you work with GPU support (CUDA 12.8):
! pip install torch==2.7.1+cu128 -f https://download.pytorch.org/whl/torch/
! pip install torchaudio==2.7.1+cu128 -f https://download.pytorch.org/whl/torchaudio/
! pip install torchvision==0.22.1+cu128 -f https://download.pytorch.org/whl/torchvision/

Looking in links: https://download.pytorch.org/whl/torch/





Looking in links: https://download.pytorch.org/whl/torchaudio/





Looking in links: https://download.pytorch.org/whl/torchvision/





In [24]:
# If you only work with CPU support:
# ! pip install torch==2.7.1
# ! pip install torchaudio==2.7.1
# ! pip install torchvision==0.22.1

### 0.2 GPU setup

In [25]:
# Check if CUDA is available and print the current device's name
import torch
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.current_device())
    print(torch.cuda.get_device_name(0))

True
0
NVIDIA GeForce RTX 2070 with Max-Q Design


## 1. Configuration

In [26]:
data_path = r"../../data/labeled/2025-06-28_labeled_data.xlsx"
text_column_name = "expanded"
label_column_name = "label_strict"
num_labels = 2 # number of labels: just correct (0) and incorrect (1)

# Model to fine-tune. Options:
#   Regular BERT: "bert-base-uncased"
#   SciBERT:      "allenai/scibert_scivocab_uncased"
#   PubMedBERT:   "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"
model_name = "bert-base-uncased"

# Set seed for reproducibility
import random
import numpy as np

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

num_epochs = 3 # Number of epochs for training
batch_size = 16 # Batch size for training and validation
learning_rate = 2e-5 # Learning rate for the optimizer

k_folds = 5 # Number of folds for the k-fold cross-validation
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=k_folds, shuffle=True, random_state=seed)

fold_accuracies = [] # Initializing a list to store accuracies for each fold
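
To make the effect of `StratifiedKFold` concrete, here is a minimal sketch on made-up labels. The 80/20 class split and all `toy_*` names are assumptions for illustration only and are not taken from the real dataset. The sketch counts how many examples of each class land in the validation part of every fold.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 80 statements labeled 0 and 20 labeled 1
toy_labels = np.array([0] * 80 + [1] * 20)
toy_texts = np.array([f"statement {i}" for i in range(len(toy_labels))])

toy_skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_class_counts = []
for fold, (train_idx, val_idx) in enumerate(toy_skf.split(toy_texts, toy_labels)):
    # Count the classes in the validation part of this fold
    counts = Counter(toy_labels[val_idx].tolist())
    fold_class_counts.append((counts[0], counts[1]))
    print(f"Fold {fold + 1}: {counts[0]} x label 0, {counts[1]} x label 1 in validation")
```

Every validation fold keeps the overall class ratio of the toy data. Stratification preserves an existing imbalance rather than removing it, so an imbalanced dataset stays imbalanced in every fold.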

## 2. Read and prepare dataset

In [27]:
import pandas as pd

# Load the data
df = pd.read_excel(data_path)

# Keep only the text and label columns configured above
df = df[[text_column_name, label_column_name]]

# Rename them to the generic names used in the rest of the notebook
df = df.rename(columns={text_column_name: 'text', label_column_name: 'label'})

# Remove rows where 'label' is NA or -99 (.copy() avoids a SettingWithCopyWarning below)
df = df[df['label'].notna() & (df['label'] != -99)].copy()

# Convert label column to int (required for classification)
df['label'] = df['label'].astype(int)

### Manual balancing is no longer used. Note that StratifiedKFold does not balance the classes; it only preserves the class ratio in every fold:
# # Balance the dataset: all 1s and an equal number of random 0s
# ones = df[df['label'] == 1]
# zeros = df[df['label'] == 0].sample(n=len(ones), random_state=42)
# df = pd.concat([ones, zeros]).sample(frac=1, random_state=42).reset_index(drop=True)

# Show head
df.head()

| | text | label |
|---|---|---|
| 0 | Indeed, there was no significant difference in... | 0 |
| 1 | Cortisol concentrations were comparable at bas... | 1 |
| 2 | Finally, there was no significant interaction ... | 0 |
| 3 | Paired t tests showed that only for the neutra... | 1 |
| 5 | Tukey's HSD tests revealed significant differe... | 0 |


## 3. Perform k-fold cross-validation

In [28]:
from sklearn.metrics import accuracy_score
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# The model is loaded inside the fold loop below, so every fold starts from the pretrained weights

fold_accuracies = [] # Accuracy of each fold

class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=350):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            padding='max_length',
            truncation=True,
            max_length=self.max_length,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'labels': torch.tensor(self.labels[idx])
        }

tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = TextDataset(
    texts=df['text'].tolist(),
    labels=df['label'].tolist(),
    tokenizer=tokenizer
)

for fold, (train_indices, val_indices) in enumerate(skf.split(df['text'], df['label'])):
    print(f"Training Fold {fold+1}/{k_folds}")

    # Select the device before it is used (GPU if available, otherwise CPU)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Reinitialize the model at the start of each fold so no fine-tuned weights carry over
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels
    )
    model.to(device) # move model to device (GPU or CPU)

    # Split dataset into train and validation sets for the current fold
    train_dataset = torch.utils.data.Subset(dataset, train_indices)
    val_dataset = torch.utils.data.Subset(dataset, val_indices)

    # Create data loaders
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

    # Training loop
    # No separate loss function is needed: the model returns the cross-entropy loss when labels are passed
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
    model.train()
    for epoch in range(num_epochs):  # Adjust the number of epochs as needed
        for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}"):
            optimizer.zero_grad()
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
    
    # Evaluation loop
    model.eval()
    val_predictions = []
    val_labels = []
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask)
            _, predicted_labels = torch.max(outputs.logits, dim=1)
            val_predictions.extend(predicted_labels.tolist())
            val_labels.extend(labels.tolist())

    fold_accuracy = accuracy_score(val_labels, val_predictions)
    fold_accuracies.append(fold_accuracy)
    print(f"Accuracy for Fold {fold+1}: {fold_accuracy}")

# Calculate average accuracy across all folds
average_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(f"Average Accuracy: {average_accuracy}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training Fold 1/5


Epoch 1: 100%|██████████| 38/38 [00:31<00:00,  1.19it/s]
Epoch 2: 100%|██████████| 38/38 [00:32<00:00,  1.17it/s]
Epoch 3: 100%|██████████| 38/38 [00:32<00:00,  1.16it/s]


Accuracy for Fold 1: 0.9078947368421053
Training Fold 2/5


Epoch 1: 100%|██████████| 38/38 [00:32<00:00,  1.18it/s]
Epoch 2: 100%|██████████| 38/38 [00:32<00:00,  1.16it/s]
Epoch 3: 100%|██████████| 38/38 [00:32<00:00,  1.16it/s]


Accuracy for Fold 2: 0.9403973509933775
Training Fold 3/5


Epoch 1: 100%|██████████| 38/38 [00:32<00:00,  1.18it/s]
Epoch 2: 100%|██████████| 38/38 [00:32<00:00,  1.16it/s]
Epoch 3: 100%|██████████| 38/38 [00:32<00:00,  1.16it/s]


Accuracy for Fold 3: 0.8609271523178808
Training Fold 4/5


Epoch 1: 100%|██████████| 38/38 [00:32<00:00,  1.18it/s]
Epoch 2: 100%|██████████| 38/38 [00:32<00:00,  1.16it/s]
Epoch 3: 100%|██████████| 38/38 [00:32<00:00,  1.16it/s]


Accuracy for Fold 4: 0.9536423841059603
Training Fold 5/5


Epoch 1: 100%|██████████| 38/38 [00:32<00:00,  1.18it/s]
Epoch 2: 100%|██████████| 38/38 [00:32<00:00,  1.16it/s]
Epoch 3: 100%|██████████| 38/38 [00:32<00:00,  1.16it/s]


Accuracy for Fold 5: 0.9337748344370861
Average Accuracy: 0.919327291739282
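
The average alone hides how much the folds differ (fold 3 is noticeably lower than fold 4). As a small sketch, the per-fold accuracies printed above can be summarized with their spread as well. The list `reported_accuracies` is copied by hand from the output above, and the variable names are chosen for illustration only.

```python
import statistics

# Per-fold accuracies copied from the output above
reported_accuracies = [
    0.9078947368421053,
    0.9403973509933775,
    0.8609271523178808,
    0.9536423841059603,
    0.9337748344370861,
]

mean_accuracy = statistics.mean(reported_accuracies)
# Sample standard deviation (divides by n - 1)
std_accuracy = statistics.stdev(reported_accuracies)

print(f"Mean accuracy: {mean_accuracy:.3f}")
print(f"Standard deviation across folds: {std_accuracy:.3f}")
print(f"Range: {min(reported_accuracies):.3f} to {max(reported_accuracies):.3f}")
```

With only five folds the standard deviation is a rough indication rather than a precise estimate, but it helps to judge whether a difference between two models is larger than the fold-to-fold variation.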


In [None]:
# The end...