In [61]:
import torch
import datasets
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import TrainingArguments

## Datasets

In [55]:
data_files = {"train": "data/train.csv", "test": "data/test.csv"}

# Passing a single file to load_dataset returns a DatasetDict whose only split is named "train",
# so the rows live at train_dataset["train"] and test_dataset["train"].
train_dataset = load_dataset("csv", data_files=data_files["train"])
train_dataset = train_dataset.with_format("torch")  # return columns as PyTorch tensors where possible

test_dataset = load_dataset("csv", data_files=data_files["test"])
test_dataset = test_dataset.with_format("torch")  # same formatting for the test data

In [56]:
# example row in the dataset
# train_dataset['train'][0]
# train_dataset['train']

## Load pre-trained BERT model

This cell loads a pretrained BERT model and its matching tokenizer from the Hugging Face Transformers library. `AutoModelForSequenceClassification` picks the right architecture for the checkpoint name (BERT, RoBERTa, GPT-2, and so on) and adds a sequence classification head on top. For BERT, the head is a dropout layer followed by a single fully connected (dense) layer that maps the pooled output of the transformer to one logit per class. The head defaults to two classes unless `num_labels` is passed to `from_pretrained`, and its weights are randomly initialized, which is why the warning below lists `classifier.weight` and `classifier.bias` as newly initialized.
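To make the role of the head concrete, here is a minimal sketch in plain Python of a dense layer turning a pooled vector into class logits and probabilities. It is an illustration only, not the library's implementation: the names `init_head` and `classify` are hypothetical, the hidden size is shrunk to 8 (bert-base-uncased uses 768), and the pooled vector is made up.

```python
import math
import random

HIDDEN_SIZE = 8   # toy value; bert-base-uncased uses 768
NUM_LABELS = 2    # the default number of classes


def init_head(hidden_size, num_labels, seed=0):
    """Randomly initialize a dense head: one weight row and one bias per class."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
              for _ in range(num_labels)]
    bias = [0.0] * num_labels
    return weight, bias


def classify(pooled, weight, bias):
    """Map a pooled vector to logits, then to probabilities with a softmax."""
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weight, bias)]
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return logits, [e / total for e in exps]


weight, bias = init_head(HIDDEN_SIZE, NUM_LABELS)
pooled_output = [0.1] * HIDDEN_SIZE          # made-up stand-in for BERT's pooled output
logits, probs = classify(pooled_output, weight, bias)
print(logits, probs)
```

Because the weights start out random, the probabilities are close to uniform until the head is trained, which is the point of the "You should probably TRAIN this model" warning.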

In [57]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Tokenize the dataset

In [58]:
def tokenize_inputs(batch):
    """Tokenize a batch of rows; used with Dataset.map(..., batched=True)."""
    # With batched=True, each column arrives as a list of values.
    text_inputs = [
        f"Q1: {q1} Q2: {q2} Previous Correct: {previous_correct}"
        for q1, q2, previous_correct in zip(batch['q1'], batch['q2'], batch['previous_correct'])
    ]

    # padding='max_length' pads every example to the tokenizer's maximum length,
    # and truncation=True cuts longer examples down to it.
    inputs = tokenizer(text_inputs, truncation=True, padding='max_length', return_tensors="pt")

    # Open question: should the fields be tokenized as separate segments instead of one string?
    # The tokenizer call takes at most two segments (text, text_pair), so three fields would
    # still need some concatenation, e.g. tokenizer(previous_correct, q1_and_q2, ...),
    # where q1_and_q2 is a hypothetical pre-joined string.

    # The result holds input_ids and attention_mask (plus token_type_ids for BERT);
    # returning it whole lets map() add each one as a dataset column.
    return inputs
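
The `truncation=True, padding='max_length'` arguments are what produce the `attention_mask` alongside the `input_ids`. The sketch below imitates that behavior in plain Python on made-up token ids, as an illustration rather than the tokenizer's real code. `pad_and_mask` is a hypothetical helper; the only value taken from bert-base-uncased is the `[PAD]` id of 0 (101 and 102 are its `[CLS]` and `[SEP]` ids, the rest are arbitrary).

```python
PAD_ID = 0  # id of [PAD] in bert-base-uncased


def pad_and_mask(token_ids, max_length):
    """Truncate or pad token_ids to max_length and build the matching attention mask."""
    # Truncation: keep at most max_length ids. The real tokenizer also keeps the
    # closing [SEP] when it truncates; this toy version simply cuts the list.
    kept = token_ids[:max_length]
    # Padding: fill the remainder with PAD_ID.
    n_pad = max_length - len(kept)
    input_ids = kept + [PAD_ID] * n_pad
    # Mask: 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short_example = pad_and_mask([101, 7, 8, 102], max_length=6)
long_example = pad_and_mask([101, 7, 8, 9, 10, 11, 12, 102], max_length=6)
print(short_example)
print(long_example)
```

The two lists always have the same length, so the model can tell which positions in a fixed-size batch are real text and which are filler.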

In [59]:
tokenized_train_dataset = train_dataset.map(tokenize_inputs, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_inputs, batched=True)

Map:   0%|          | 0/18480 [00:00<?, ? examples/s]

Map:   0%|          | 0/492 [00:00<?, ? examples/s]
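
With `batched=True`, `map` hands the function a dict of column lists rather than one row at a time. The sketch below shows that shape on a made-up two-row batch and builds the same prompt strings as the tokenize function, without calling the tokenizer. The column values are invented for illustration, and `build_text_inputs` is a hypothetical helper.

```python
def build_text_inputs(batch):
    """Join the three columns of a batch into one prompt string per row."""
    return [
        f"Q1: {q1} Q2: {q2} Previous Correct: {previous_correct}"
        for q1, q2, previous_correct in zip(
            batch["q1"], batch["q2"], batch["previous_correct"]
        )
    ]


# A batch is a dict mapping each column name to a list of values (made-up data).
toy_batch = {
    "q1": ["first question A", "first question B"],
    "q2": ["second question A", "second question B"],
    "previous_correct": [1, 0],
}

texts = build_text_inputs(toy_batch)
for text in texts:
    print(text)
```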

## Finetune BERT model

In [62]:
training_args = TrainingArguments(output_dir="test_trainer")

ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.21.0`: Please run `pip install transformers[torch]` or `pip install accelerate -U`
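
The error message asks for `accelerate>=0.21.0`. A quick way to see what the current environment has is to query the installed version with the standard library, as in the sketch below. The helpers `release_tuple`, `meets_minimum`, and `check_package` are hypothetical names written for this note, and the version comparison is deliberately simple (it looks only at the leading numeric components).

```python
import re
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string):
    """Leading numeric components of a version string, e.g. '0.21.0.dev0' -> (0, 21, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))   # pad so '0.21' compares equal to '0.21.0'
    b += (0,) * (width - len(b))
    return a >= b


def check_package(name, minimum):
    """Describe whether a package is installed and new enough."""
    try:
        installed = version(name)
    except PackageNotFoundError:
        return f"{name} is not installed"
    status = "ok" if meets_minimum(installed, minimum) else f"too old, need >= {minimum}"
    return f"{name} {installed}: {status}"


print(check_package("accelerate", "0.21.0"))
```

If the package is missing or too old, the two commands in the error message are the fix; in a notebook the kernel often needs a restart afterwards before the import succeeds.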

## Questions for Keyon

1. Why split the tokenized inputs into input ids and attention masks?
2. Is there an art to deciding how multiple inputs get tokenized together? I concatenated all of mine into one string. You seemed to deliberately place "previous_correct" before the first question.
3. Why did you use BertEnsemble instead of a single BertForSequenceClassification model?
4. Did you use GPUs for this stuff? How do you get access?