# Setup and EDA

## Installing libraries

We install the libraries this notebook needs: **transformers** for the model, **datasets** for data handling, **evaluate** and **sacrebleu** for computing metrics such as BLEU, **torch** for PyTorch, and **antlr4-python3-runtime** (pinned to 4.9.2) for parsing the PL/I source code.

In [None]:
!pip install transformers datasets evaluate sacrebleu torch
!pip install antlr4-python3-runtime==4.9.2

## Loading and prepping the data

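We load the dataset from JSON files with the datasets library. The `field="data"` argument tells the loader that the records sit under a top-level `data` key in each file. The data is split into a training set and a validation set.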

In [2]:
from datasets import load_dataset

data_files = {
    "train": "data/train4.json",
    "validation": "data/test2.json"
}
data = load_dataset("json", data_files=data_files, field="data")
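
For reference, the call above expects each JSON file to hold its records under a top-level `data` key, and the preprocessing step later in this notebook joins the `pli` and `ktl` values with spaces, so each of them is a list of tokens. A minimal sketch of that layout, using only the standard library; the record contents and the file name `example_train.json` are made up for illustration and are not taken from the actual dataset:

```python
import json
import os
import tempfile

# Hypothetical record: the token lists are illustrative, not real training data.
example = {
    "data": [
        {
            "pli": ["n", "=", "5", ";"],
            "ktl": ["n", "=", "5"],
        }
    ]
}

# Write a file in the layout that load_dataset("json", ..., field="data") reads.
path = os.path.join(tempfile.mkdtemp(), "example_train.json")
with open(path, "w") as f:
    json.dump(example, f, indent=2)

# Read it back and look at the records under the "data" key.
with open(path) as f:
    records = json.load(f)["data"]

print(records[0]["pli"])
print(" ".join(records[0]["ktl"]))
```

Each record becomes one row of the dataset, with `pli` and `ktl` as its columns.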

## Preprocess the Data

We load the pre-trained tokenizer for `facebook/bart-base` from the Hugging Face Hub and use it to preprocess the text. We also define the source and target column names (`pli` and `ktl`) and a task prefix that is prepended to every input.

In [None]:
from transformers import AutoTokenizer

checkpoint = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, force_download=True)

source_lang = "pli"
target_lang = "ktl"
prefix = "translate PL/I to Kotlin: "


We define a preprocessing function to prepare the data for the model. It joins each example's tokens into a single string, adds the prefix to the source text, tokenizes inputs and targets, and truncates both to at most 128 tokens. We then apply the function to the whole dataset in batches.

In [5]:
def preprocess_function(examples):
    inputs = [prefix + " ".join(example) for example in examples[source_lang]]
    targets = [" ".join(example) for example in examples[target_lang]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs

tokenized_datasets = data.map(preprocess_function, batched=True)
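
To make the string-building step concrete, the sketch below runs the same two list comprehensions on a small batch without calling the tokenizer. The batch contents are hypothetical; only the column names and the prefix come from the notebook:

```python
prefix = "translate PL/I to Kotlin: "

# Hypothetical batch in the shape that map(..., batched=True) passes in:
# one list per column, one entry per example.
batch = {
    "pli": [["n", "=", "5", ";"], ["return", "(", "1", ")", ";"]],
    "ktl": [["n", "=", "5"], ["return", "1"]],
}

# Same expressions as in preprocess_function, minus the tokenizer call.
inputs = [prefix + " ".join(tokens) for tokens in batch["pli"]]
targets = [" ".join(tokens) for tokens in batch["ktl"]]

for source, target in zip(inputs, targets):
    print(repr(source), "->", repr(target))
```

The tokenizer then receives these strings, so the model sees the prefix at the start of every source sequence.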

## Create a data collator

We create a data collator that pads inputs and labels dynamically, to the longest sequence in each batch rather than to a global maximum, so that every batch forms a rectangular tensor without wasting computation on unnecessary padding.

In [6]:
from transformers import DataCollatorForSeq2Seq

# Create data collator
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
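
What dynamic padding does can be sketched in plain Python. This is a simplified stand-in, not the library's implementation: it works on lists instead of tensors, and the token ids in `features` are hypothetical. The padding id 1 is BART's pad token, and -100 is the collator's default label padding value, which the loss function ignores:

```python
PAD_TOKEN_ID = 1      # BART's padding token id
LABEL_PAD_ID = -100   # default label padding in DataCollatorForSeq2Seq; ignored by the loss


def pad_batch(features):
    """Pad every sequence to the longest one in this batch (simplified sketch)."""
    max_inputs = max(len(f["input_ids"]) for f in features)
    max_labels = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_inputs - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_labels - len(f["labels"])))
    return batch


# Hypothetical token ids, chosen only to show the shapes.
features = [
    {"input_ids": [0, 11, 12, 2], "labels": [0, 21, 2]},
    {"input_ids": [0, 13, 2], "labels": [0, 22, 23, 24, 2]},
]
padded = pad_batch(features)
print(padded["input_ids"])
print(padded["labels"])
```

Because the target length is computed per batch, a batch of short statements is padded far less than one that contains a long statement.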

## Initialize the Model

We load the pre-trained BART model (`facebook/bart-base`), an encoder-decoder architecture suited to sequence-to-sequence tasks, from the Hugging Face Hub.

In [7]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

## Set Up Training Arguments

We set up the training arguments: output directory, evaluation strategy, learning rate, batch sizes, weight decay, checkpoint limit, number of epochs, mixed-precision (fp16) training, and logging strategy.

In [8]:
training_args = Seq2SeqTrainingArguments(
    output_dir="pli_to_kotlin",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=150,
    predict_with_generate=True,
    fp16=True,
    logging_strategy="steps",
    logging_steps=10,
)

## Initialize the Trainer

We initialize the Seq2SeqTrainer with the model, training arguments, tokenized datasets, tokenizer, and data collator. The trainer runs the training and evaluation loops.

In [9]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

## Train the Model

We start training with the trainer we just initialized. This trains the model on our dataset for the specified number of epochs and evaluates it on the validation set after each epoch.

In [10]:
trainer.train()

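| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 6.62749 |
| 2 | No log | 6.62749 |
| 3 | No log | 4.001041 |
| 4 | No log | 2.610912 |
| 5 | No log | 2.07026 |
| 6 | No log | 1.903501 |
| 7 | No log | 1.693196 |
| 8 | No log | 1.557779 |
| 9 | No log | 1.404213 |
| 10 | 3.689200 | 1.291395 |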


TrainOutput(global_step=150, training_loss=0.5374304989973704, metrics={'train_runtime': 18.391, 'train_samples_per_second': 130.499, 'train_steps_per_second': 8.156, 'total_flos': 40013955072000.0, 'train_loss': 0.5374304989973704, 'epoch': 150.0})
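
The numbers in this output can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation (neither is stated in the notebook), and it infers the training-set size from the reported throughput rather than reading it from the data:

```python
import math

# Values copied from the training arguments and the TrainOutput above.
num_train_epochs = 150
per_device_train_batch_size = 16
logging_steps = 10
global_step = 150
train_runtime = 18.391              # seconds
train_samples_per_second = 130.499

# Assumption: one device and gradient_accumulation_steps=1.
steps_per_epoch = global_step // num_train_epochs

# Total samples processed, then samples per epoch (the inferred training-set size).
samples_seen = round(train_runtime * train_samples_per_second)
train_set_size = samples_seen // num_train_epochs

# Consistency check: that many examples need exactly this many batches per epoch.
assert math.ceil(train_set_size / per_device_train_batch_size) == steps_per_epoch

# The first training-loss value is logged once `logging_steps` steps have run.
first_logged_epoch = logging_steps // steps_per_epoch

print(steps_per_epoch, train_set_size, first_logged_epoch)
```

Under these assumptions each epoch is a single optimizer step over about 16 examples, which is consistent with the Training Loss column reading "No log" until epoch 10.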

## Saving the model 

We save the trained model and tokenizer to a local directory so that we can reload them later for inference.

In [11]:
model.save_pretrained("./article_checkpoint")
tokenizer.save_pretrained("./article_checkpoint")

Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


('./article_checkpoint/tokenizer_config.json',
 './article_checkpoint/special_tokens_map.json',
 './article_checkpoint/vocab.json',
 './article_checkpoint/merges.txt',
 './article_checkpoint/added_tokens.json',
 './article_checkpoint/tokenizer.json')

## Running inference

We load the saved model and tokenizer for inference. This lets us use the trained model to translate new PL/I code to Kotlin.

In [12]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Load the model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("./article_checkpoint")
tokenizer = AutoTokenizer.from_pretrained("./article_checkpoint")

We select the device (GPU if available, otherwise CPU) and move the model to it so that inference runs on the fastest hardware available.

In [13]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

BartForConditionalGeneration(
  (model): BartModel(
    (shared): Embedding(50265, 768, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): BartScaledWordEmbedding(50265, 768, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768)
      (layers): ModuleList(
        (0-5): 6 x BartEncoderLayer(
          (self_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): Laye

We define `translate_sequence`, which prepends the task prefix, tokenizes the input, generates the translation with beam search, and decodes the output tokens back into text.

In [14]:
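def translate_sequence(sentence, tokenizer, model, device, max_length=50):
    # Prepend the task prefix used during training, tokenize, and move the tensors to the device
    inputs = tokenizer(prefix + sentence, return_tensors="pt", max_length=512, truncation=True).to(device)
    # Beam search over the encoded input; passing **inputs also forwards the attention mask
    outputs = model.generate(**inputs, max_length=max_length, num_beams=4, early_stopping=True)
    # Decode the best beam back to text, dropping special tokens such as <s> and </s>
    translated_sentence = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return translated_sentence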

We define a function that formats the translated Kotlin code. It substitutes context values for placeholder tokens, puts each brace on its own line while tracking the indentation level, and renders the result as a Jinja2 template with the same context.

In [15]:
from jinja2 import Template

def transpile_sequence(translated, context, level=0):
    tokens = translated.split()
    lint = []
    current_line = ""
    for t in tokens:
        t = context.get(t, t)
        if t in ["{", "}"]:
            if current_line:
                lint.append("".rjust(level * 4) + current_line.strip())
                current_line = ""
            if t == "{":
                lint.append("".rjust(level * 4) + t)
                level += 1
            elif t == "}":
                level -= 1
                lint.append("".rjust(level * 4) + t)
        else:
            current_line += " " + t
    if current_line:
        lint.append("".rjust(level * 4) + current_line.strip())

    formatted_code = "\n".join(lint)
    template = Template(formatted_code)
    rendered_code = template.render(context)
    return rendered_code, level
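
The brace-driven indentation is the part of this function that is easiest to get wrong, so here is a reduced sketch of just that logic. It is a simplified stand-in that leaves out the context substitution and the Jinja2 rendering; the name `indent_by_braces` and the sample strings are hypothetical:

```python
def indent_by_braces(translated, level=0, width=4):
    """Put braces on their own lines and indent by the current nesting level."""
    lines, current = [], []
    for token in translated.split():
        if token in ("{", "}"):
            # Flush any pending statement before handling the brace.
            if current:
                lines.append(" " * (level * width) + " ".join(current))
                current = []
            if token == "}":
                level -= 1          # a closing brace aligns with its block opener
            lines.append(" " * (level * width) + token)
            if token == "{":
                level += 1          # everything after an opening brace is nested
        else:
            current.append(token)
    if current:
        lines.append(" " * (level * width) + " ".join(current))
    return "\n".join(lines), level


# The level returned by one call is passed to the next,
# so a block can span several translated statements.
first, level = indent_by_braces("fun main ( ) {")
second, level = indent_by_braces("n = 5 }", level)
print(first)
print(second)
```

Returning the level alongside the text is what lets a statement-by-statement translation still produce correctly nested output.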

We define a function that parses a PL/I file, translates each statement to Kotlin, and formats the result. It uses ANTLR to parse the PL/I code and a visitor to extract the statements, and it carries the indentation level from one statement to the next.

In [16]:
from antlr4 import CommonTokenStream, InputStream
# Lexer, parser, and visitor classes from the local `pli` package
from pli.PLILexer import PLILexer
from pli.PLIParser import PLIParser
from pli.PLIVisitor import PLIVisitor

def parse_and_translate(filename):
    with open(filename, 'r') as file:
        original_code = file.read()
    print("PL/I Code:")
    print(original_code)
    print()

    # Lexer and parser setup
    input_stream = InputStream(original_code)
    lexer = PLILexer(input_stream)
    stream = CommonTokenStream(lexer)
    parser = PLIParser(stream)
    tree = parser.program()

    # Visitor setup
    visitor = PLIVisitor()
    statements = visitor.visit(tree)

    # Translate and transpile each statement
    transpiled_code = ""
    level = 0
    for stmt in statements:
        pli_code = " ".join(stmt['pli'])
        context = stmt.get('context', {})
        translated = translate_sequence(pli_code, tokenizer, model, device)
        transpiled, level = transpile_sequence(translated, context, level)
        transpiled_code += transpiled + "\n"

    print("Kotlin Code:")
    print(transpiled_code)
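
The loop above relies on the visitor returning a list of dictionaries, each with a `pli` token list and an optional `context` mapping. The visitor's implementation is not shown in this notebook, so the sketch below only illustrates that assumed shape. The statement contents, the `VAR_0` placeholder, and the `stub_translate` function (a stand-in for `translate_sequence`) are all hypothetical:

```python
# Hypothetical visitor output: one dict per PL/I statement.
statements = [
    {"pli": ["VAR_0", "=", "5", ";"], "context": {"VAR_0": "n"}},
    {"pli": ["return", "(", "1", ")", ";"]},  # no context for this statement
]


def stub_translate(pli_code):
    """Stand-in for the model call: a fixed lookup instead of real inference."""
    lookup = {
        "VAR_0 = 5 ;": "VAR_0 = 5",
        "return ( 1 ) ;": "return 1",
    }
    return lookup[pli_code]


output_lines = []
for stmt in statements:
    pli_code = " ".join(stmt["pli"])
    context = stmt.get("context", {})
    translated = stub_translate(pli_code)
    # Same token-level substitution that transpile_sequence applies via context.get(t, t).
    output_lines.append(" ".join(context.get(t, t) for t in translated.split()))

print("\n".join(output_lines))
```

Keeping identifiers in a separate context would let the model translate a generic placeholder while the original name is restored afterwards, though whether the notebook's visitor works this way is an assumption.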

We specify the PL/I file to translate and call `parse_and_translate` to process it.

In [17]:
# Example usage
filename = "FIB.PLI"  # Replace with the actual filename
parse_and_translate(filename)

PL/I Code:
 Factorial: proc options (main);
    dcl (n,result) fixed bin(31);
    n  = 5;
    result = Compute_factorial(n);

 end Factorial;
  /***********************************************/
  /* Subroutine                                  */
  /***********************************************/
  Compute_factorial: proc (n)  returns (fixed bin(31));
     dcl n fixed bin(15);
      if n <= 1 then
        return(1);

     return( n*Compute_factorial(n-1) );

  end Compute_factorial;


Kotlin Code:
fun main (args: Array<String>)
{
    var n : Int
    var result : Int
    n = 5
    result = compute_factorial(n)
}
fun compute_factorial(n : Int) : Int
{
    var n : Int
    if(n<=1)
    {
        return 1
    }
    return n*compute_factorial(n-1)
}

