# BERT Training Dataset: CoLA (Corpus of Linguistic Acceptability)

## Overview

The Corpus of Linguistic Acceptability (CoLA) in its full form consists of 10657 sentences from 23 linguistics publications, expertly annotated for acceptability (grammaticality) by their original authors. The public version provided here contains 9594 sentences belonging to training and development sets, and excludes 1063 sentences belonging to a held out test set. Contact alexwarstadt [at] gmail [dot] com with any questions or issues. Read the paper or check out the source code for baselines.


Access the dataset at: https://nyu-mll.github.io/CoLA/

## Dataset Structure

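### in_domain_train.tsv
- **Purpose**: Training data (8,551 sentences in the copy loaded by this notebook)
- **Format**: Tab-separated values (TSV) with no header row
- **Columns** (names are assigned when the file is loaded):
  - `sentence_source`: Code for the publication the sentence comes from (e.g. `gj04`)
  - `label`: Acceptability judgment, `1` for acceptable and `0` for unacceptable
  - `label_notes`: The original author's annotation, often empty
  - `sentence`: The sentence text
- **Usage**: Used to fine-tune BERT for binary acceptability classification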

### out_of_domain_dev.tsv
- **Purpose**: Development/validation data drawn from publications that do not appear in the training file
- **Format**: Tab-separated values (TSV) with no header row
- **Columns**: Same four-column structure as the training file
- **Usage**: Used to evaluate model performance on sentences from unseen sources
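
The loading cells in this notebook assign four column names to a headerless TSV. As a minimal sketch of that layout, the rows can also be parsed with Python's standard `csv` module. The `read_cola_rows` helper and the in-memory sample are illustrative assumptions, not part of the CoLA distribution.

```python
import csv
import io

# Column names used by this notebook when loading the TSV files.
COLUMNS = ["sentence_source", "label", "label_notes", "sentence"]

# Two sample rows in the four-column, headerless layout (empty label_notes).
sample = (
    "gj04\t1\t\tthe more we study verbs , the crazier they get .\n"
    "gj04\t1\t\tday by day the facts are getting murkier .\n"
)

def read_cola_rows(handle):
    """Parse tab-separated rows into dicts; quotes are treated as plain text."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        row = dict(zip(COLUMNS, fields))
        row["label"] = int(row["label"])  # 1 = acceptable, 0 = unacceptable
        rows.append(row)
    return rows

rows = read_cola_rows(io.StringIO(sample))
print(rows[0]["sentence_source"], rows[0]["label"], rows[0]["sentence"])
```

Disabling quote handling is a deliberate choice in this sketch, since example sentences can contain quotation marks that should not be read as field delimiters.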

## Why This Dataset is Suitable for BERT

### 1. **Text Classification Tasks**
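CoLA is a single-sentence binary classification task: each sentence is labeled acceptable or unacceptable. BERT's contextual representations suit this kind of task, because acceptability often depends on relationships between words across the whole sentence.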

### 2. **Transfer Learning Evaluation**
- The domain mismatch between training and evaluation data tests BERT's ability to generalize learned representations
- This setup is crucial for real-world applications where training and deployment domains often differ

### 3. **Fine-tuning Architecture**
- BERT's pre-trained representations can be fine-tuned on the in-domain training data
- The model learns domain-specific patterns while retaining general language understanding

### 4. **Sentence-Level Understanding**
- BERT's bidirectional attention mechanism captures full sentence context
- This is essential for tasks requiring understanding of complete sentences rather than individual words

## Typical Use Cases

- **Domain Adaptation**: Training on sentences from one set of linguistics publications and testing on sentences from publications not seen during training
- **Acceptability Classification**: Evaluating whether a model can distinguish acceptable from unacceptable sentences (CoLA is one of the GLUE benchmark tasks)
- **Robustness Testing**: Measuring how well models perform when domain assumptions are violated

## Training Process with BERT

1. **Pre-processing**: Tokenize text using BERT's WordPiece tokenizer
2. **Fine-tuning**: Add classification head to pre-trained BERT model
3. **Training**: Fine-tune on `in_domain_train.tsv`
4. **Evaluation**: Test generalization on `out_of_domain_dev.tsv`

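## Example Dataset Format

The files have no header row. Each line holds four tab-separated fields: source code, label (`1` acceptable, `0` unacceptable), the original author's annotation (often empty), and the sentence. The rows below are taken from the `df.head()` output in this notebook.

```
gj04	1		the more we study verbs , the crazier they get .
gj04	1		day by day the facts are getting murkier .
```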

## Key Benefits for BERT Training

- **Contextual Understanding**: BERT's bidirectional encoder captures rich contextual information
- **Transfer Learning**: Pre-trained weights provide strong initialization for domain-specific tasks
- **Attention Mechanism**: Self-attention allows the model to focus on relevant parts of the input
- **Robustness Evaluation**: Out-of-domain testing reveals model generalization capabilities

This dataset configuration is particularly valuable for research into domain adaptation, transfer learning, and the robustness of transformer-based models like BERT.

# **Set up and Train**

### 📦 Install Required Libraries

Install all the necessary packages such as PyTorch, HuggingFace Transformers, scikit-learn, and others for training a BERT model.


In [None]:
!pip install torch scikit-learn transformers pandas numpy matplotlib


### 🔧 Import Libraries & Load Dataset

Import essential libraries and load the training dataset (`in_domain_train.tsv`). This dataset contains sentence-level binary classification data.


In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertConfig
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup
from tqdm import tqdm, trange
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from torch.optim import AdamW

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
df = pd.read_csv("/content/in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])
print(df.shape)

(8551, 4)


In [None]:
df.head()

Unnamed: 0,sentence_source,label,label_notes,sentence
0,gj04,1,,"our friends wo n't buy this analysis , let alo..."
1,gj04,1,,one more pseudo generalization and i 'm giving...
2,gj04,1,,one more pseudo generalization or i 'm giving ...
3,gj04,1,,"the more we study verbs , the crazier they get ."
4,gj04,1,,day by day the facts are getting murkier .


### 📝 Data Preprocessing and Tokenization

Prepare the input data for BERT:
- Add `[CLS]` and `[SEP]` tokens.
- Tokenize each sentence using BERT tokenizer.
- Convert tokens to token IDs.


In [None]:
# Creating sentence, label lists and adding Bert tokens
sentences = df.sentence.values

# Adding CLS and SEP tokens at the beginning and end of each sentence for BERT
sentences = ["[CLS] " + sentence + " [SEP]" for sentence in sentences]
labels = df.label.values

In [None]:
len(sentences)

8551

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',do_lower_case=True)
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
print("Tokenize the first sentence:")
print(tokenized_texts[0])


Tokenize the first sentence:
['[CLS]', 'our', 'friends', 'wo', 'n', "'", 't', 'buy', 'this', 'analysis', ',', 'let', 'alone', 'the', 'next', 'one', 'we', 'propose', '.', '[SEP]']
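
BERT's tokenizer first splits text on whitespace and punctuation, then breaks each word into sub-word pieces by greedy longest-match-first lookup in its vocabulary. The sketch below illustrates only that second step. The `wordpiece` function and `toy_vocab` are hypothetical stand-ins; the real `bert-base-uncased` vocabulary has 30,522 entries and may split these words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"facts", "mur", "##ki", "##er"}

print(wordpiece("facts", toy_vocab))    # ['facts']
print(wordpiece("murkier", toy_vocab))  # ['mur', '##ki', '##er']
print(wordpiece("xyz", toy_vocab))      # ['[UNK]']
```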


### 🧼 Padding & Attention Masks

Pad the token IDs to a fixed length (`MAX_LEN`) and create attention masks to distinguish between padded and actual tokens.


In [None]:
# Processing the data
MAX_LEN = 128

# Use the BERT tokenizer to convert the tokens to their index numbers in the BERT vocabulary
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]

# Pad our input tokens
input_ids = pad_sequences(input_ids,maxlen=MAX_LEN,dtype="long",truncating="post",padding="post")


In [None]:
# Create attention masks
attention_masks = []

# Create a mask of 1s for each token followed by 0s for padding
for seq in input_ids:
    seq_mask = [float(i>0) for i in seq]
    attention_masks.append(seq_mask)
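
The padding and masking steps can be sketched in plain Python. This is an illustration under stated assumptions rather than the notebook's own implementation: `pad_and_mask` is a hypothetical helper, the padding ID is taken to be 0 (matching `pad_token_id` in the BERT configuration), and the token IDs other than 101 (`[CLS]`) and 102 (`[SEP]`) are arbitrary.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or right-pad a list of token IDs and build its attention mask."""
    ids = list(token_ids[:max_len])          # "post" truncation: drop the tail
    n_pad = max_len - len(ids)
    padded = ids + [pad_id] * n_pad          # "post" padding: pad on the right
    mask = [1] * len(ids) + [0] * n_pad      # 1 = real token, 0 = padding
    return padded, mask

padded, mask = pad_and_mask([101, 7, 8, 102], max_len=6)
print(padded)  # [101, 7, 8, 102, 0, 0]
print(mask)    # [1, 1, 1, 1, 0, 0]
```

Building the mask from the sequence length, rather than from `id > 0`, avoids relying on the padding ID never appearing as a real token.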

### 🔀 Split into Training and Validation Sets

Split the dataset into 90% training and 10% validation using `train_test_split`. Convert all inputs and labels into PyTorch tensors.


In [None]:
# Use train_test_split to split our data into train and validation sets for training

train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, labels,
                                                            random_state=2018, test_size=0.1)
# Split the masks with the same random_state and test_size so they stay aligned with the inputs
train_masks, validation_masks, _, _ = train_test_split(attention_masks, input_ids,
                                             random_state=2018, test_size=0.1)

In [None]:
# Torch tensors are the required datatype for our model

train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)
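
Inputs, masks, and labels must all be split the same way. One alternative to repeating the split call is to shuffle a list of row indices once and use it to select from every array. The sketch below uses only the standard library; `split_indices` is a hypothetical helper, and its shuffle will not reproduce the exact partition that scikit-learn produces for the same seed.

```python
import math
import random

def split_indices(n_examples, test_size=0.1, seed=2018):
    """Shuffle row indices once and cut them into train and validation parts."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_val = math.ceil(n_examples * test_size)  # round the validation share up
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(8551)
print(len(train_idx), len(val_idx))  # 7695 856

# The same index lists can then select rows from inputs, masks, and labels alike,
# for example: train_inputs = [input_ids[i] for i in train_idx]
```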

### 🚚 Create DataLoaders

Wrap the training and validation tensors into PyTorch `DataLoader`s using `TensorDataset`, `RandomSampler`, and `SequentialSampler`.


In [None]:
# Select a batch size for training. For fine-tuning BERT on a specific task, the authors recommend a batch size of 16 or 32
batch_size = 16

# Wrap the tensors in DataLoaders so the loops receive the data in mini-batches.
# The training loader shuffles via RandomSampler; the validation loader keeps the original order.

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

### ⚙️ Initialize BERT Model

Create a BERT model using `BertForSequenceClassification` with two output labels (binary classification). Move it to GPU if available.


In [None]:
# Initializing a BERT bert-base-uncased style configuration
from transformers import BertModel, BertConfig
configuration = BertConfig()

# Initializing a model from the configuration (random weights; used here only to inspect the configuration)
model = BertModel(configuration)

# Accessing the model configuration
configuration = model.config
print(configuration)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.53.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
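
The printed configuration is enough to work out the model's size. The arithmetic below is a sketch that assumes the standard BERT layout (three embedding tables, twelve identical encoder layers, a pooler, and a two-way linear classifier); the variable names are illustrative.

```python
# Values copied from the printed BertConfig.
vocab_size, hidden, max_positions, type_vocab = 30522, 768, 512, 2
num_layers, intermediate = 12, 3072

def linear(n_in, n_out):
    """Parameter count of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # one scale and one shift value per hidden unit

embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
attention = 4 * linear(hidden, hidden) + layer_norm  # query, key, value, output
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
pooler = linear(hidden, hidden)

encoder_total = embeddings + num_layers * (attention + feed_forward) + pooler
classifier = linear(hidden, 2)  # two output labels

print(encoder_total)               # 109482240
print(encoder_total + classifier)  # 109483778
```

Only the 1,538 classifier parameters are newly initialized for fine-tuning; the rest come from the pre-trained checkpoint.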



In [None]:
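# Loading the Hugging Face BERT base uncased model with a two-label classification head
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# Move the model to the device selected earlier (GPU if available, otherwise CPU)
model.to(device)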


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

### 🧠 Optimizer and Scheduler Setup

Set up an AdamW optimizer with weight decay and configure a linear learning rate scheduler without warmup steps.


In [None]:
# Don't apply weight decay to any parameters whose names include these tokens.
# (Hugging Face BERT has no `gamma` or `beta` parameters; the exempt ones are
# the `bias` terms and the `LayerNorm.weight` scales.)
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.weight']
# Separate the decayed parameters from the exempt ones.
# - Parameters matching nothing in `no_decay` get a `weight_decay` of 0.01.
# - Parameters matching `no_decay` get a `weight_decay` of 0.0.
# The per-group key must be `weight_decay`: torch.optim.AdamW does not read a
# `weight_decay_rate` key, so with that name every group would fall back to the optimizer default.
optimizer_grouped_parameters = [
    # Filter for all parameters which *don't* include 'bias' or 'LayerNorm.weight'.
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},

    # Filter for parameters which *do* include those.
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]
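
The grouping relies purely on substring matches against parameter names. As a minimal sketch, the same filter can be run on a handful of names to see which group each lands in. The names below follow the module structure printed for the model in this notebook, but the short list itself is only an illustration.

```python
no_decay = ["bias", "LayerNorm.weight"]

# A few representative parameter names (illustrative subset, not the full model).
names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.embeddings.LayerNorm.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "classifier.weight",
    "classifier.bias",
]

# Same substring test as the optimizer setup: any match exempts the parameter.
decayed = [n for n in names if not any(nd in n for nd in no_decay)]
exempt = [n for n in names if any(nd in n for nd in no_decay)]

print("decayed:", decayed)
print("exempt: ", exempt)
```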

In [None]:
# Number of training epochs (authors recommend between 2 and 4)
epochs = 2

optimizer = AdamW(optimizer_grouped_parameters,
                  lr = 2e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5
                  eps = 1e-8 # args.adam_epsilon  - default is 1e-8.
                  )
# Total number of training steps is number of batches * number of epochs.
# `train_dataloader` contains batched data so `len(train_dataloader)` gives
# us the number of batches.
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)
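
A linear schedule with zero warmup simply scales the learning rate from its starting value down to zero over `total_steps`. The sketch below reproduces that rule in plain Python. `linear_lr` is a hypothetical helper, and the figure of 962 total steps assumes 7,695 training sentences (90% of 8,551) at batch size 16, giving 481 batches per epoch for 2 epochs.

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < warmup_steps:
        # Warmup phase: ramp up from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: ramp down from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 481 * 2  # assumed: 481 batches per epoch, 2 epochs

for step in (0, 481, 962):
    print(step, linear_lr(step, total_steps))
```

With no warmup, the first update already uses the full rate of 2e-5, and the rate has halved by the end of the first epoch.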

### 📈 Accuracy Calculation Function

Define a utility function to compute accuracy by comparing model predictions with actual labels.


In [None]:
#Creating the Accuracy Measurement Function
# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
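
The validation loop in this notebook averages `flat_accuracy` over batches, so a short final batch counts as much as a full one. A minimal sketch of an example-weighted alternative is to accumulate correct and total counts instead. The `count_correct` helper and the logit values are invented for illustration.

```python
def count_correct(logit_rows, labels):
    """Count rows whose largest logit sits at the labelled class index."""
    correct = 0
    for row, label in zip(logit_rows, labels):
        predicted = max(range(len(row)), key=lambda k: row[k])
        correct += int(predicted == label)
    return correct

# Two toy batches of (logits, labels): a batch of four and a final batch of one.
batches = [
    ([[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.0, 0.0]], [1, 0, 1, 1]),
    ([[0.9, 0.1]], [1]),
]

total_correct = sum(count_correct(logits, labels) for logits, labels in batches)
total_seen = sum(len(labels) for _, labels in batches)
pooled = total_correct / total_seen

per_batch_mean = sum(
    count_correct(logits, labels) / len(labels) for logits, labels in batches
) / len(batches)

print(pooled)          # 0.6
print(per_batch_mean)  # 0.375
```

The two numbers agree only when every batch has the same size.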

### 🔁 Training Loop

Train the BERT model over a number of epochs:
- Perform forward and backward passes
- Optimize weights
- Evaluate model performance on validation data


In [None]:
# The training loop

# Store the per-step training loss for plotting
train_loss_set = []

# trange is a tqdm wrapper around the normal python range
for _ in trange(epochs, desc="Epoch"):


    # Training

    # Set our model to training mode (as opposed to evaluation mode)
    model.train()

    # Tracking variables
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0

    # Train the data for one epoch
    for step, batch in enumerate(train_dataloader):
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        # Clear out the gradients (by default they accumulate)
        optimizer.zero_grad()
        # Forward pass
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
        loss = outputs['loss']
        train_loss_set.append(loss.item())
        # Backward pass
        loss.backward()
        # Update parameters and take a step using the computed gradient
        optimizer.step()

        # Update the learning rate.
        scheduler.step()


        # Update tracking variables
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1

    print("Train loss: {}".format(tr_loss/nb_tr_steps))


    # Validation

    # Put model in evaluation mode to evaluate loss on the validation set
    model.eval()

    # Tracking variables
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        # Telling the model not to compute or store gradients, saving memory and speeding up validation
        with torch.no_grad():
            # Forward pass, calculate logit predictions
            logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

        # Move logits and labels to CPU
        logits = logits['logits'].detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        tmp_eval_accuracy = flat_accuracy(logits, label_ids)

        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1

    print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Train loss: 0.47506702940701945


Epoch:  50%|█████     | 1/2 [02:41<02:41, 161.27s/it]

Validation Accuracy: 0.8067129629629629
Train loss: 0.2568341679019145


Epoch: 100%|██████████| 2/2 [05:23<00:00, 161.78s/it]

Validation Accuracy: 0.8206018518518519





# **Test**

### 📊 Load and Preprocess Test Data

Load the test dataset (`out_of_domain_dev.tsv`), preprocess it similarly to training data, and prepare it for evaluation.


In [None]:
test_df = pd.read_csv("/content/out_of_domain_dev.tsv",delimiter="\t",header=None,names=['sentence_source', 'label', 'label_notes', 'sentence'])

In [None]:
test_df.head()

Unnamed: 0,sentence_source,label,label_notes,sentence
0,clc95,1,,Somebody just left - guess who.
1,clc95,1,,"They claimed they had settled on something, bu..."
2,clc95,1,,"If Sam was going, Sally would know where."
3,clc95,1,,"They're going to serve the guests something, b..."
4,clc95,1,,She's reading. I can't imagine what.


In [None]:
# Create sentence and label lists from the test frame
# (reading from `df` here would evaluate on the training sentences instead)
sentences = test_df.sentence.values

# Adding special tokens at the start and end of each sentence for BERT to work properly
sentences = ["[CLS] " + sentence + " [SEP]" for sentence in sentences]
labels = test_df.label.values

tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]

In [None]:
MAX_LEN = 128

# Use the BERT tokenizer to convert the tokens to their index numbers in the BERT vocabulary
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
# Pad our input tokens
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
# Create attention masks
attention_masks = []

# Create a mask of 1s for each token followed by 0s for padding
for seq in input_ids:
    seq_mask = [float(i>0) for i in seq]
    attention_masks.append(seq_mask)

### 🤖 Run Predictions on Test Set

Evaluate the trained model on the test dataset by performing a forward pass and collecting predictions and true labels.


In [None]:
prediction_inputs = torch.tensor(input_ids)
prediction_masks = torch.tensor(attention_masks)
prediction_labels = torch.tensor(labels)

batch_size = 32


prediction_data = TensorDataset(prediction_inputs, prediction_masks, prediction_labels)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)

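### 🧮 Evaluate with the Matthews Correlation Coefficient

Use the Matthews correlation coefficient (MCC) to evaluate classification performance for each batch in the test dataset. MCC ranges from -1 to +1, with 0 indicating chance-level prediction, which makes it more informative than accuracy when the two classes are imbalanced.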


In [None]:
# Prediction on test set

# Put model in evaluation mode
model.eval()

# Tracking variables
predictions , true_labels = [], []

# Predict
for batch in prediction_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    # Telling the model not to compute or store gradients, saving memory and speeding up prediction
    with torch.no_grad():
        # Forward pass, calculate logit predictions
        logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

    # Move logits and labels to CPU
    logits = logits['logits'].detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    # Store predictions and true labels
    predictions.append(logits)
    true_labels.append(label_ids)

In [None]:
# Evaluating using the Matthews correlation coefficient
# Import the metric and compute it separately for each test batch
from sklearn.metrics import matthews_corrcoef
matthews_set = []

for i in range(len(true_labels)):
    matthews = matthews_corrcoef(true_labels[i],
                 np.argmax(predictions[i], axis=1).flatten())
    matthews_set.append(matthews)
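
For binary labels, MCC can be written directly in terms of the confusion-matrix counts: (TP·TN − FP·FN) divided by the square root of (TP+FP)(TP+FN)(TN+FP)(TN+FN). The plain-Python sketch below follows that formula. `mcc` is an illustrative helper, not the scikit-learn implementation, and it returns 0.0 when the denominator is zero, which is the convention scikit-learn also uses.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # undefined case, e.g. only one class present
    return (tp * tn - fp * fn) / denominator

print(mcc([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0: perfect agreement
print(mcc([1, 1, 1, 0], [1, 1, 0, 0]))  # about 0.577: one missed positive
print(mcc([1, 1, 1, 1], [1, 1, 1, 1]))  # 0.0: a single class, denominator is zero
```

The last case shows one way a per-batch score of exactly 0.0 can arise even when every prediction is correct, which is a reason to prefer a single score over the pooled predictions.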



In [None]:
# Score of Individual Batches
matthews_set

[np.float64(1.0),
 np.float64(1.0),
 np.float64(0.7142857142857143),
 np.float64(0.8459051693633014),
 np.float64(0.938872452190116),
 np.float64(0.7562449037944323),
 np.float64(0.7644707871564383),
 np.float64(0.7419408268023742),
 np.float64(0.6546536707079771),
 np.float64(0.5826364566706337),
 np.float64(0.8320502943378436),
 np.float64(0.7848566748139434),
 np.float64(0.936441710371274),
 np.float64(0.936441710371274),
 np.float64(1.0),
 np.float64(0.8454106280193237),
 np.float64(0.7895918772038132),
 0.0,
 np.float64(0.6255774501577784),
 np.float64(0.7141684885491869),
 np.float64(0.4622501635210242),
 np.float64(0.9278305692406299),
 np.float64(1.0),
 np.float64(0.8823529411764706),
 np.float64(0.936441710371274),
 np.float64(0.7644707871564383),
 np.float64(0.8805899139163632),
 np.float64(0.629940788348712),
 np.float64(0.7092993656151906),
 np.float64(0.4472135954999579),
 np.float64(0.8783100656536799),
 np.float64(0.8958064164776167),
 np.float64(0.35986374603287324),
 n

In [None]:
predictions[0].shape

(32, 2)
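
Each row of `predictions[0]` holds two raw logits, one per class. Taking the `argmax` is enough for a hard prediction, but a softmax turns the pair into class probabilities when a confidence value is wanted. The sketch below uses only the standard library, and the logit values are invented for illustration.

```python
import math

def softmax(row):
    """Convert a list of logits into probabilities that sum to 1."""
    highest = max(row)
    exps = [math.exp(value - highest) for value in row]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [-1.2, 2.3]  # hypothetical logits for classes 0 and 1
probs = softmax(logits)
predicted = max(range(len(probs)), key=lambda k: probs[k])

print(probs)
print(predicted)  # 1, the same class that argmax over the raw logits selects
```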

### 📊 MCC on Entire Test Set

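Flatten all predictions and true labels to compute a single MCC score over the whole evaluation set, rather than averaging the per-batch scores.

The outputs recorded in this notebook come from an evaluation set larger than any CoLA development file: the public release holds 9,594 sentences, of which 8,551 are training data, yet the per-batch list above has more than 33 entries at a batch size of 32. This matches a run in which the test preprocessing cell took its sentences and labels from the training frame `df`, so the aggregate score shown below reflects performance on training data and should not be read as an out-of-domain result.


In [None]:
# Matthews correlation coefficient on the whole dataset
# Flatten the per-batch predictions and true labels, then compute one aggregate score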
flat_predictions = [item for sublist in predictions for item in sublist]
# print(flat_predictions)
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()
print(min(flat_predictions))
flat_true_labels = [item for sublist in true_labels for item in sublist]
# print(flat_true_labels)
matthews_corrcoef(flat_true_labels, flat_predictions)

0


np.float64(0.8376880005354778)