## Setup & EDA

First, install the required packages.

In [1]:
!pip install antlr4-python3-runtime==4.9.2

Collecting antlr4-python3-runtime==4.9.2
  Downloading antlr4-python3-runtime-4.9.2.tar.gz (117 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 117.2/117.2 kB 11.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: antlr4-python3-runtime
  Building wheel for antlr4-python3-runtime (setup.py) ... done
  Created wheel for antlr4-python3-runtime: filename=antlr4_python3_runtime-4.9.2-py3-none-any.whl size=144547 sha256=9b50d8998d848e902ef007d5368032df056e2b14e166a97ecc0482624d4a501e
  Stored in directory: /root/.cache/pip/wheels/1d/2f/50/8609b0155597d16b276307cb7899e7c832412f4d0d3a1db01c
Successfully built antlr4-python3-runtime
Installing collected packages: antlr4-python3-runtime
Successfully installed antlr4-python3-runtime-4.9.2

Next, import the required libraries.

In [2]:
import json
from pathlib import Path
from datasets import Dataset, DatasetDict, load_dataset
import pandas as pd
from antlr4 import *
from pli.PLILexer import PLILexer
from pli.PLIParser import PLIParser
from pli.PLIVisitor import PLIVisitor

We have reformatted the dataset so that it can be loaded with the standard load_dataset function.

In [3]:
data = load_dataset("json", data_files="/notebooks/data/train1.json", field="data")

Using custom data configuration default-2cdb10d3c3013ad5


Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-2cdb10d3c3013ad5/0.0.0/a3e658c4731e59120d44081ac10bf85dc7e1388126b92338344ce9661907f253...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-2cdb10d3c3013ad5/0.0.0/a3e658c4731e59120d44081ac10bf85dc7e1388126b92338344ce9661907f253. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]
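The field="data" argument tells load_dataset that the records sit under a top-level "data" key in the JSON file. A minimal sketch of that layout, using only the standard library; the file name, the id value, and the single record are illustrative assumptions, while the id/translation/pli/ktl keys follow the outputs shown in this notebook:

```python
import json
import tempfile
from pathlib import Path

# One hypothetical record in the shape the notebook's outputs suggest:
# an "id" plus a "translation" dict holding the PL/I source and Kotlin target.
records = [
    {
        "id": 0,  # assumed value, the real ids are not shown in the notebook
        "translation": {
            "pli": "PROCEDURE MAIN {{type0}} {{type1}}",
            "ktl": "fun main (args: {{type0}} <{{type1}}>)",
        },
    }
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train_example.json"  # hypothetical file name
    # The list of records is nested under the top-level "data" key.
    path.write_text(json.dumps({"data": records}, indent=2))
    # Reading the "data" field back yields the list of records.
    loaded = json.loads(path.read_text())["data"]

print(loaded[0]["translation"]["pli"])
```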

Let's display the dataset to check it.

In [4]:
data

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 11
    })
})

Great, let's split it into training and validation datasets.

In [5]:
split_datasets = data["train"].train_test_split(train_size=0.9, seed=20)
split_datasets["validation"] = split_datasets.pop("test")
split_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 9
    })
    validation: Dataset({
        features: ['id', 'translation'],
        num_rows: 2
    })
})
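The 9/2 split above is consistent with rounding the training share down: 0.9 × 11 = 9.9, which gives 9 training rows and leaves 2 for validation. The sketch below reproduces only that size arithmetic with a seeded shuffle from the standard library. The split_indices helper is hypothetical and does not replicate the shuffling done by the datasets library, so the rows it selects will differ.

```python
import math
import random


def split_indices(n_samples, train_size, seed):
    """Shuffle row indices with a fixed seed and cut them into two parts."""
    n_train = math.floor(train_size * n_samples)  # training share, rounded down
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed gives the same order
    return indices[:n_train], indices[n_train:]


train_idx, val_idx = split_indices(11, 0.9, seed=20)
print(len(train_idx), len(val_idx))
```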

In [6]:
split_datasets["train"][1]["translation"]

{'ktl': 'fun main (args: {{type0}} <{{type1}}>)',
 'pli': 'PROCEDURE MAIN {{type0}} {{type1}}'}

## Tokenize

Import a tokenizer to convert all our inputs and targets.

The word "token" means different things in compilers and in LLMs/Transformers. Here, tokenisation means splitting the text into subword pieces and mapping each piece to an integer ID; the resulting sequence of IDs is what the model consumes.

In [18]:
from transformers import AutoTokenizer

model_checkpoint = "google-t5/t5-base" # Replace this with your desired model
# return_tensors is passed when calling the tokenizer, not when loading it
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Note that because the source and target sentences are stored together in one dataset, the same dataset supplies both.
The tokenizer, from the Hugging Face transformers library, returns the input_ids together with an attention_mask that marks which positions hold real tokens rather than padding.

In [19]:
pli_sentence = split_datasets["train"][1]["translation"]["pli"]
ktl_sentence = split_datasets["train"][1]["translation"]["ktl"]

inputs = tokenizer(pli_sentence, return_tensors="pt")
targets = tokenizer(ktl_sentence, return_tensors="pt")
inputs, targets

({'input_ids': tensor([[ 6828,   254,  2326, 18290,   283, 13570,     3,     2,  6137,   632,
              2,     3,     2,  6137,   536,     2,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])},
 {'input_ids': tensor([[ 694,  711,   41, 8240,    7,   10,    3,    2, 6137,  632,    2,    3,
             2, 6137,  536,    2, 3155,   61,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])})
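Both masks above are all ones because each sentence was tokenized on its own, so nothing needed padding. A minimal sketch of what happens once sequences of different lengths are batched together. The pad_batch helper is hypothetical and is not the tokenizer's own code; the pad id of 0 matches T5's padding token.

```python
PAD_ID = 0  # T5 uses id 0 for its padding token


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the matching mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should not attend to.
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[6828, 254, 1], [694, 711, 41, 8240, 1]])
print(batch["input_ids"])
print(batch["attention_mask"])
```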

In [21]:
wrong_targets = tokenizer(pli_sentence)
print(tokenizer.convert_ids_to_tokens(wrong_targets["input_ids"]))

['▁PRO', 'C', 'ED', 'URE', '▁M', 'AIN', '▁', '<unk>', 'type', '0', '<unk>', '▁', '<unk>', 'type', '1', '<unk>', '</s>']
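The <unk> pieces appear exactly where the {{ and }} placeholder delimiters were, which suggests these characters have no entry in the tokenizer's vocabulary. The toy lookup below illustrates the effect. The vocabulary is just the piece-to-ID pairs visible in the outputs above, and the piece list is written by hand; the real SentencePiece segmentation is not reproduced here.

```python
# Piece-to-ID pairs read off the notebook outputs above.
toy_vocab = {
    "▁PRO": 6828, "C": 254, "ED": 2326, "URE": 18290, "▁M": 283,
    "AIN": 13570, "▁": 3, "type": 6137, "0": 632, "1": 536, "</s>": 1,
}
UNK_ID = 2  # id shown for <unk> in the outputs above


def toy_encode(pieces, vocab=toy_vocab, unk_id=UNK_ID):
    """Map each piece to its ID, falling back to the unknown ID."""
    return [vocab.get(piece, unk_id) for piece in pieces]


# Hand-written segmentation of 'PROCEDURE MAIN {{type0}} {{type1}}'.
pieces = ["▁PRO", "C", "ED", "URE", "▁M", "AIN", "▁", "{{", "type", "0",
          "}}", "▁", "{{", "type", "1", "}}", "</s>"]
ids = toy_encode(pieces)
print(ids)
```

Since "{{" and "}}" both collapse to the same unknown ID, the model cannot tell an opening delimiter from a closing one, and decoding the IDs cannot restore the braces.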


A quick function to tokenize the source and target sequences and put them in the form the model accepts.

overflowing_tokens and num_truncated_tokens hold the tokens cut off by truncation and their count. The tokenizer only returns them when asked to, so removing them here is a safeguard.

The model expects the tokenized targets to be stored under the key "labels".


In [9]:
max_length = 128

def preprocess_function(examples):
    inputs = [ex["pli"] for ex in examples["translation"]]
    targets = [ex["ktl"] for ex in examples["translation"]]
    
    # Tokenize inputs and targets separately
    model_inputs = tokenizer(inputs, max_length=max_length, truncation=True, padding=True)
    model_targets = tokenizer(targets, max_length=max_length, truncation=True, padding=True)
    
    # Remove unnecessary keys from model_inputs
    model_inputs.pop("overflowing_tokens", None)
    model_inputs.pop("num_truncated_tokens", None)
    
    # Add targets to model_inputs
    model_inputs["labels"] = model_targets["input_ids"]
    
    return model_inputs
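Because preprocess_function pads the targets with padding=True, the label lists contain the padding ID wherever a target is shorter than the longest one in its batch. A common practice with Hugging Face seq2seq models is to replace those positions with -100, which the cross-entropy loss ignores, so that padding does not contribute to the training loss. A minimal sketch, assuming T5's pad ID of 0 and a hypothetical mask_label_padding helper:

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss
PAD_ID = 0           # T5 padding token id


def mask_label_padding(labels, pad_id=PAD_ID):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_id else token for token in seq]
        for seq in labels
    ]


padded_labels = [[694, 711, 1, 0, 0], [694, 711, 41, 8240, 1]]
masked_labels = mask_label_padding(padded_labels)
print(masked_labels)
```

Inside preprocess_function, this could be applied as model_inputs["labels"] = mask_label_padding(model_targets["input_ids"]).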

In [16]:
tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)

Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-2cdb10d3c3013ad5/0.0.0/a3e658c4731e59120d44081ac10bf85dc7e1388126b92338344ce9661907f253/cache-cde49ae52028470c.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-2cdb10d3c3013ad5/0.0.0/a3e658c4731e59120d44081ac10bf85dc7e1388126b92338344ce9661907f253/cache-238adf16c7a61394.arrow


## Fine-tuning the model

In [17]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [18]:
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [19]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
batch.keys()

dict_keys(['input_ids', 'attention_mask', 'labels', 'decoder_input_ids'])
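The extra decoder_input_ids key appears because the collator was given the model: it derives the decoder inputs from the labels by shifting them one position to the right. A rough sketch of that shift for T5, assuming a decoder start ID and pad ID of 0; shift_right here is an illustrative stand-in, not the library function.

```python
def shift_right(labels, decoder_start_id=0, pad_id=0, ignore_index=-100):
    """Build decoder inputs: prepend the start id and drop the last label."""
    shifted = [decoder_start_id] + labels[:-1]
    # Ignored label positions become ordinary padding in the decoder input.
    return [pad_id if token == ignore_index else token for token in shifted]


labels = [694, 711, 41, 1, -100, -100]
decoder_input_ids = shift_right(labels)
print(decoder_input_ids)
```

At each position the decoder sees the tokens up to the previous label and is trained to predict the current one.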

In [20]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

In [21]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="no",  # No evaluation during training
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=150,
    predict_with_generate=True,
)

In [22]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=None,  # Assuming no evaluation dataset
    data_collator=data_collator,
    tokenizer=tokenizer
)

In [23]:
import os

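# Disable wandb logging. Set this before creating the training arguments
# and the trainer, otherwise the wandb callback may already be registered.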
os.environ["WANDB_DISABLED"] = "true"

In [24]:
trainer.train()

***** Running training *****
  Num examples = 9
  Num Epochs = 150
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 150
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Step,Training Loss




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=150, training_loss=0.14063475290934244, metrics={'train_runtime': 27.7835, 'train_samples_per_second': 48.59, 'train_steps_per_second': 5.399, 'total_flos': 8938045440000.0, 'train_loss': 0.14063475290934244, 'epoch': 150.0})
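The step counts and throughput figures in this output follow from the configuration: 9 examples fit into a single batch of 16, so each epoch is one optimization step and 150 epochs give 150 steps. A short check of that arithmetic, with all numbers copied from the log above:

```python
import math

num_examples = 9       # "Num examples" in the training log
batch_size = 16        # per_device_train_batch_size
num_epochs = 150       # num_train_epochs
train_runtime = 27.7835  # seconds, from TrainOutput metrics

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 2), round(steps_per_second, 3))
```

With only one step per epoch and no logging within the first 150 steps shown, the "Step, Training Loss" table stays empty, and the single training_loss value is the average over the whole run.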