# Finetune Pretrained Models With the WikiSQL Dataset

Before Hugging Face models can be compared on a conditional generation task, they must be finetuned on an associated dataset. This notebook is a pipeline for doing so with the WikiSQL dataset. Other models can be swapped in by editing the `model_info` dictionary, provided the tokenization step handles their architecture. Other datasets could be used as well, but each needs its own `format_dataset` function.
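
As a rough illustration of the `model_info` edit mentioned above, the sketch below keeps one entry per candidate model and selects one by key. The `MODEL_CHOICES` registry, the `select_model_info` helper, and the whole `t5-small` entry (name, path, and batch size) are hypothetical; only the `facebook/bart-base` entry matches the values used in this notebook.

```python
# Hypothetical registry of model_info dictionaries. Only the bart-base entry
# is taken from this notebook; the t5-small entry is an assumed example.
MODEL_CHOICES = {
    'bart-base': {'name': 'facebook/bart-base', 'path': 'finetuned/bart-base-wikisql', 'batch_size': 64},
    't5-small': {'name': 't5-small', 'path': 'finetuned/t5-small-wikisql', 'batch_size': 64},
}

def select_model_info(key):
    """Return a copy of the chosen entry so later edits do not alter the registry."""
    if key not in MODEL_CHOICES:
        raise KeyError(f"unknown model choice: {key!r}; options are {sorted(MODEL_CHOICES)}")
    return dict(MODEL_CHOICES[key])

model_info = select_model_info('bart-base')
print(model_info['name'], model_info['path'], model_info['batch_size'])
```

Keeping the entries side by side makes it easier to rerun the notebook for each model with the same settings apart from the checkpoint name and output path.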

In [1]:
# Provide your Hugging Face access token
huggingface_token = ""

In [2]:
! nvidia-smi

Fri Apr 11 17:19:19 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.86                 Driver Version: 551.86         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3080      WDDM  |   00000000:01:00.0  On |                  N/A |
| 70%   51C    P8             34W /  350W |    1217MiB /  12288MiB |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [3]:
import torch
cuda = torch.cuda.is_available()
print(cuda)  # True if a CUDA-capable GPU is available
if cuda:
    # get_device_name raises when no CUDA device is present, so guard the call
    print(torch.cuda.get_device_name(0))

True
NVIDIA GeForce RTX 3080


In [4]:
from huggingface_hub import login
from datasets import load_dataset, DatasetDict
import evaluate
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainer, Seq2SeqTrainingArguments

login(token=huggingface_token)

model_info = {'name': 'facebook/bart-base', 'path': 'finetuned/bart-base-wikisql', 'batch_size': 64}

In [5]:
name = model_info['name']
path = model_info['path']
batch_size = model_info['batch_size']
epochs = 5  # arbitrary value chosen for testing

tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

device = torch.device("cuda" if cuda else "cpu")
model.to(device)

BartForConditionalGeneration(
  (model): BartModel(
    (shared): BartScaledWordEmbedding(50265, 768, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): BartScaledWordEmbedding(50265, 768, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768)
      (layers): ModuleList(
        (0-5): 6 x BartEncoderLayer(
          (self_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_lay

In [6]:
print(model.__class__.__name__)

BartForConditionalGeneration


## Format for WikiSQL

To prepare training data for finetuning, each input prompt is built by prepending 'translate to SQL: ' to the example question, and the target is the human-readable SQL string from the dataset. This minimal-complexity format gives a simple baseline for later benchmarking of the finetuned models.

In [7]:
dataset = DatasetDict({
    'train': load_dataset('wikisql', split='train'),
    'validation': load_dataset('wikisql', split='validation'),
})

def format_dataset(example):
    return {'input': 'translate to SQL: ' + example['question'], 'target': example['sql']['human_readable']}

formatted_dataset = dataset.map(format_dataset, remove_columns=dataset['train'].column_names).shuffle(seed=42) # also shuffles!

print("DatasetDict class details: \n", formatted_dataset, "\n", "First formatted example: \n", formatted_dataset['train'][0])


DatasetDict class details: 
 DatasetDict({
    train: Dataset({
        features: ['input', 'target'],
        num_rows: 56355
    })
    validation: Dataset({
        features: ['input', 'target'],
        num_rows: 8421
    })
}) 
 First formatted example: 
 {'input': 'translate to SQL: Which sum of week that had an attendance larger than 55,767 on September 28, 1986?', 'target': 'SELECT SUM Week FROM table WHERE Attendance > 55,767 AND Date = september 28, 1986'}
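
The formatting step can be tried without downloading the dataset. The sketch below applies the same `format_dataset` function to a hand-built record that mimics only the two WikiSQL fields the function reads; the record itself is an assumption assembled from the first formatted example shown above.

```python
def format_dataset(example):
    # Same transformation as in the notebook: prefix the question, keep the readable SQL
    return {'input': 'translate to SQL: ' + example['question'],
            'target': example['sql']['human_readable']}

# Hand-built stand-in for one WikiSQL row (real rows carry more fields)
record = {
    'question': 'Which sum of week that had an attendance larger than 55,767 on September 28, 1986?',
    'sql': {'human_readable': 'SELECT SUM Week FROM table WHERE Attendance > 55,767 AND Date = september 28, 1986'},
}

formatted = format_dataset(record)
print(formatted['input'])
print(formatted['target'])
```

A `format_dataset` for a different dataset would need to return the same two keys, `input` and `target`, for the rest of the pipeline to work unchanged.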


## Tokenization Scheme Depends Upon the Model Choice

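The tokenization functions below cover T5-based and BART-based variants, selected by the class name of the loaded model. Supporting another architecture should only require adding a branch with a matching `tokenize_function` and `columns` list.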

In [8]:
if model.__class__.__name__ == "T5ForConditionalGeneration":
    # map with tokenizer to provide tokenized dataset to the Seq2SeqTrainer
    def tokenize_function(example_batch):
        '''use direct tokenizer call, construct encodings dictionary'''
        input_encodings = tokenizer(example_batch['input'], padding='max_length', truncation=True, max_length=64)
        target_encodings = tokenizer(example_batch['target'], padding='max_length', truncation=True, max_length=64)
    
    
        encodings = {
            'input_ids': input_encodings['input_ids'], 
            'attention_mask': input_encodings['attention_mask'],
            'labels': target_encodings['input_ids'],
            'decoder_attention_mask': target_encodings['attention_mask']
        }
    
        return encodings
    
    columns = ['input_ids', 'attention_mask', 'labels', 'decoder_attention_mask']
    
elif model.__class__.__name__ == "BartForConditionalGeneration":
    
    def tokenize_function(example_batch):
        '''Tokenize inputs and targets for BART.'''
        input_encodings = tokenizer(example_batch['input'], padding='max_length', truncation=True, max_length=64)
        # text_target replaces the deprecated tokenizer.as_target_tokenizer() context manager
        target_encodings = tokenizer(text_target=example_batch['target'], padding='max_length', truncation=True, max_length=64)
        
        encodings = {
            'input_ids': input_encodings['input_ids'], 
            'attention_mask': input_encodings['attention_mask'],
            'labels': target_encodings['input_ids']
        }
        # No decoder_attention_mask is passed for BART models
        return encodings

    columns = ['input_ids', 'attention_mask', 'labels']

else:
    # Fail early instead of hitting a NameError on tokenize_function below
    raise ValueError(f"No tokenize_function defined for {model.__class__.__name__}")

tokenized_dataset = formatted_dataset.map(tokenize_function, batched=True, remove_columns=formatted_dataset['train'].column_names)

tokenized_dataset.set_format(type='torch', columns=columns)
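
The functions above keep padding token ids inside `labels`, so padded positions contribute to the training loss. A common refinement, which this notebook does not apply, is to replace those ids with -100, the index that PyTorch's cross-entropy loss ignores by default and that `compute_metrics` already converts back to the pad id. The sketch uses plain Python lists; `mask_pad_labels` is a hypothetical helper, the token ids are made up, and the pad id of 1 is taken from the `padding_idx=1` shown in the BART model summary.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_pad_labels(label_batch, pad_token_id):
    """Replace pad ids with IGNORE_INDEX in a batch of label id lists."""
    return [[IGNORE_INDEX if tok == pad_token_id else tok for tok in labels]
            for labels in label_batch]

# Toy batch: made-up token ids, padded with 1 (BART's pad id) to a fixed length
batch = [[0, 713, 16, 2, 1, 1], [0, 42, 2, 1, 1, 1]]
masked = mask_pad_labels(batch, pad_token_id=1)
print(masked)
```

If adopted, the masked lists would take the place of `target_encodings['input_ids']` as the `labels` value. The results reported in this notebook were produced without such masking.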

In [9]:
# Metric calculation 
# Exact Match https://huggingface.co/spaces/evaluate-metric/exact_match
# ROUGE2 score https://huggingface.co/spaces/evaluate-metric/rouge
# BLEU score https://huggingface.co/spaces/evaluate-metric/sacrebleu
exact_match = evaluate.load("exact_match")
rouge = evaluate.load("rouge")
sacrebleu = evaluate.load("sacrebleu")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # Decode the predictions and labels
    pred_ids[pred_ids == -100] = tokenizer.pad_token_id
    
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    
    return {
        "exact_match": exact_match.compute(predictions=pred_str, references=label_str)['exact_match'],
        "rouge2": rouge.compute(predictions=pred_str, references=label_str)["rouge2"],
        "bleu": sacrebleu.compute(predictions=pred_str, references=label_str)["score"],
    }

# arguments for Seq2SeqTrainer
trainer_args = Seq2SeqTrainingArguments(
    output_dir=path,
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    overwrite_output_dir=True,
    save_total_limit=3,
    load_best_model_at_end=True,
    push_to_hub=False,
    # fp16=True,  # uncomment to enable mixed-precision training
)

# instantiate trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=trainer_args,
    compute_metrics=compute_metrics,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
)

# memory stats for the current device
if cuda:
    pre_allocated = torch.cuda.memory_allocated(device) / (1024**3)  # GB
    pre_reserved = torch.cuda.memory_reserved(device) / (1024**3)  # GB
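
For intuition about the strictest of the three metrics, the sketch below computes an exact-match rate by hand: the fraction of predictions that equal their reference string. It is a simplified stand-in, not the `evaluate` implementation, which also offers normalization options; `exact_match_rate` and the sample strings are hypothetical.

```python
def exact_match_rate(predictions, references):
    """Fraction of predictions identical to the reference at the same position."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(predictions)

preds = ['SELECT Week FROM table', 'SELECT COUNT Date FROM table']
refs = ['SELECT Week FROM table', 'SELECT Date FROM table']
print(exact_match_rate(preds, refs))  # 0.5
```

A single differing token makes the whole query count as a miss, so exact match is expected to sit well below overlap-based scores such as ROUGE-2 and BLEU.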

## Optional Preliminary Run of Evaluate to Verify the Model Before Training

In [10]:
trainer.evaluate()

{'eval_loss': 15.01809310913086,
 'eval_model_preparation_time': 0.003,
 'eval_exact_match': 0.0,
 'eval_rouge2': 0.12279714302204521,
 'eval_bleu': 3.7702632841491566,
 'eval_runtime': 155.8653,
 'eval_samples_per_second': 54.027,
 'eval_steps_per_second': 0.847}
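
The throughput figures in this output can be cross-checked from numbers already shown in the notebook: 8421 validation rows, a batch size of 64, and the reported runtime. The check below assumes a single device, so that one step processes one batch.

```python
import math

validation_rows = 8421   # num_rows of the validation split
batch_size = 64          # model_info['batch_size']
eval_runtime = 155.8653  # seconds, from 'eval_runtime'

eval_steps = math.ceil(validation_rows / batch_size)  # the last batch is partial
samples_per_second = round(validation_rows / eval_runtime, 3)
steps_per_second = round(eval_steps / eval_runtime, 3)

print(eval_steps, samples_per_second, steps_per_second)
```

The rounded values agree with the reported `eval_samples_per_second` and `eval_steps_per_second`.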

In [11]:
train_output = trainer.train()
train_output = train_output._asdict()

# end memory stats for the current device
if cuda:
    post_allocated = torch.cuda.memory_allocated(device) / (1024**3)  # GB
    post_reserved = torch.cuda.memory_reserved(device) / (1024**3)  # GB
    
    train_output['additional_memory_allocated'] = post_allocated - pre_allocated
    train_output['additional_memory_reserved'] = post_reserved - pre_reserved

print("Train output", train_output)

| Epoch | Training Loss | Validation Loss | Model Preparation Time | Exact Match | Rouge2 | Bleu |
|---|---|---|---|---|---|---|
| 1 | 0.6739 | 0.096609 | 0.003 | 0.369909 | 0.832159 | 71.611776 |
| 2 | 0.0949 | 0.087119 | 0.003 | 0.399121 | 0.84316 | 73.626441 |
| 3 | 0.0733 | 0.08261 | 0.003 | 0.41254 | 0.847493 | 74.432418 |
| 4 | 0.0624 | 0.081649 | 0.003 | 0.41919 | 0.850105 | 74.869931 |
| 5 | 0.0559 | 0.080982 | 0.003 | 0.425365 | 0.852719 | 75.240423 |


There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight', 'lm_head.weight'].


Train output {'global_step': 4405, 'training_loss': 0.14407037438382897, 'metrics': {'train_runtime': 1947.5326, 'train_samples_per_second': 144.683, 'train_steps_per_second': 2.262, 'total_flos': 1.0738030657536e+16, 'train_loss': 0.14407037438382897, 'epoch': 5.0}, 'additional_memory_allocated': 1.1029548645019531, 'additional_memory_reserved': 6.87109375}
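
The step count and throughput in the training output follow from the dataset size, batch size, and epoch count used above. The arithmetic below is a sanity check under the assumption of a single device with no gradient accumulation, so that each optimizer step consumes one batch.

```python
import math

train_rows = 56355         # num_rows of the training split
batch_size = 64            # model_info['batch_size']
epochs = 5
train_runtime = 1947.5326  # seconds, from 'train_runtime'

steps_per_epoch = math.ceil(train_rows / batch_size)  # the last batch is partial
global_step = steps_per_epoch * epochs
samples_per_second = round(train_rows * epochs / train_runtime, 3)
steps_per_second = round(global_step / train_runtime, 3)

print(steps_per_epoch, global_step, samples_per_second, steps_per_second)
```

The computed `global_step` matches the reported 4405, and the two rates match `train_samples_per_second` and `train_steps_per_second` after rounding.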


## Save the Model and Its Training Metric Data

In [12]:
import json
# store the training metrics next to the model; pushing to the Hugging Face Hub is optional (see the last line)
with open(path + "/train_output.json", "w") as f:
    json.dump(train_output, f, indent=4)

trainer.save_model()

tokenizer.save_pretrained(path)

trainer.create_model_card()

#trainer.push_to_hub()
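
To reuse the stored metrics when comparing models later, the JSON file can be read back with the standard library. The sketch below writes a trimmed copy of the training output to a temporary directory, standing in for `path`, and loads it again; the trimmed dictionary and the temporary location are assumptions made to keep the example self-contained.

```python
import json
import os
import tempfile

# Trimmed copy of the training output printed above
train_output = {'global_step': 4405, 'training_loss': 0.14407037438382897,
                'metrics': {'train_runtime': 1947.5326, 'epoch': 5.0}}

with tempfile.TemporaryDirectory() as tmp_dir:  # stands in for `path`
    out_file = os.path.join(tmp_dir, 'train_output.json')
    with open(out_file, 'w') as f:
        json.dump(train_output, f, indent=4)
    with open(out_file) as f:
        loaded = json.load(f)

print(loaded['global_step'], loaded['metrics']['train_runtime'])
```

Because the dictionary holds only numbers and nested dictionaries, it survives the JSON round trip unchanged.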

In [13]:
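import gc
# clean up memory: the trainer also holds a reference to the model,
# so drop both before emptying the CUDA cache
del trainer, model
gc.collect()
if cuda:
    torch.cuda.empty_cache()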
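
Note that `del model` removes only one name for the model; any other object still holding it, such as `trainer`, keeps the weights alive, and `torch.cuda.empty_cache()` can only return memory that is no longer referenced. The toy classes below are hypothetical stand-ins, not the real model or trainer, and illustrate this with a weak reference.

```python
import gc
import weakref

class ToyModel:
    """Stand-in for a model holding GPU memory."""

class ToyTrainer:
    """Stand-in for a trainer that keeps a reference to its model."""
    def __init__(self, model):
        self.model = model

model = ToyModel()
trainer = ToyTrainer(model)
model_ref = weakref.ref(model)  # lets us observe whether the object still exists

del model
gc.collect()
alive_after_del_model = model_ref() is not None  # trainer.model still points to it

del trainer
gc.collect()
alive_after_del_trainer = model_ref() is not None  # no references remain

print(alive_after_del_model, alive_after_del_trainer)
```

In the notebook, the same reasoning suggests deleting both the trainer and the model before emptying the cache.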