# Improving LLMs on underrepresented programming languages

Importing the Necessary Libraries

In [1]:
import os
import glob

from transformers import AutoModelForMaskedLM
from transformers import RobertaForCausalLM
from transformers import AutoTokenizer
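from transformers import get_scheduler
from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated in favor of torch.optim.AdamW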
from transformers.modeling_outputs import MaskedLMOutput

from transformers import GPT2LMHeadModel, GPT2Tokenizer

from torch.utils.data import Dataset, DataLoader

from sklearn.model_selection import train_test_split

import torch

## 1. Dataset Creation
This section details the process of creating the code completion dataset for Kotlin.

### *Choice of Codebase:*

The Ktor framework will be the foundation for my Kotlin code completion dataset. Ktor is a popular Kotlin-based framework specifically designed for building asynchronous servers and clients. Its focus on conciseness, expressiveness, and scalability makes it a valuable source of real-world Kotlin code, ensuring the dataset's relevance for training and evaluation.

In [2]:
cd = os.getcwd()
cd

'C:\\Users\\ismai\\OneDrive\\Masaüstü\\Project Adaptation for Code Modeling models'

In [3]:
# Cloning the Ktor files from git source.
!git clone https://github.com/ktorio/ktor.git

fatal: destination path 'ktor' already exists and is not an empty directory.


### *Data Extraction:*

I extract all Kotlin code from the Ktor project to create the dataset. This provides a comprehensive collection of code snippets and structures commonly encountered in Kotlin development. The extracted code serves as the training corpus for the fine-tuned model, enabling it to learn effective code completion patterns for the Kotlin language.

In [4]:
def find_kotlin_files(root_dir):
    kotlin_files = []
    for root, dirs, files in os.walk(root_dir):
        for file in files:
            if file.endswith('.kt'):
                kotlin_files.append(os.path.join(root, file))
    return kotlin_files

In [5]:
def read_kotlin_files(kotlin_files):
    kotlin_code = []
    for file in kotlin_files:
        with open(file, 'r', encoding='utf-8') as f:
            code = f.read()
            kotlin_code.append(code)
    return kotlin_code
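The two helpers above can also be combined into a single pass. The following is a minimal sketch, not the code that produced the results in this notebook: the name `load_kotlin_sources` is hypothetical, and skipping files that are not valid UTF-8 (instead of stopping with an error) is an assumption about the desired behavior.

```python
from pathlib import Path


def load_kotlin_sources(root_dir):
    """Return (paths, sources) for every readable .kt file under root_dir."""
    paths, sources = [], []
    # rglob walks the tree recursively; sorting makes the order reproducible
    for path in sorted(Path(root_dir).rglob("*.kt")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            # Assumption: undecodable files are skipped rather than aborting the run
            continue
        paths.append(str(path))
        sources.append(text)
    return paths, sources
```

With this variant, `kotlin_files, kotlin_code = load_kotlin_sources(ktor_repo_dir)` would replace the two separate calls, and both lists stay aligned index by index.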

In [6]:
ktor_repo_dir = cd + '/ktor'
ktor_repo_dir

'C:\\Users\\ismai\\OneDrive\\Masaüstü\\Project Adaptation for Code Modeling models/ktor'

In [7]:
kotlin_files = find_kotlin_files(ktor_repo_dir)
kotlin_code = read_kotlin_files(kotlin_files)

In [8]:
print("Number of Kotlin files found:", len(kotlin_files))

print("\nSample Kotlin code:")
print(kotlin_code[5][0:250])

Number of Kotlin files found: 1971

Sample Kotlin code:
/*
 * Copyright 2014-2021 JetBrains s.r.o and contributors. Use of this source code is governed by the Apache 2.0 license.
 */
import org.gradle.api.*
import org.gradle.kotlin.dsl.*
import org.jetbrains.kotlin.gradle.dsl.*
import org.jetbrains.kotlin


## 2. Model Adaptation
In this section, I adapt a pre-trained Transformer model for Kotlin code completion and evaluate its performance.

### *Model Selection and Adaptation:*

I have selected CodeBERT, a RoBERTa-based member of the BERT (Bidirectional Encoder Representations from Transformers) family that has been pre-trained on a large corpus of code from GitHub. It is designed for code-related tasks; here it is loaded with language-modeling heads so that it can be adapted to Kotlin code completion.


In [9]:
model_name = "microsoft/codebert-base"

In [10]:
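# Load the pre-trained CodeBERT checkpoint twice: `model` gets a masked-LM head (fine-tuned below),
# `model2` gets a causal-LM head (used only for the generation check)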
model = AutoModelForMaskedLM.from_pretrained(model_name)
model2 = RobertaForCausalLM.from_pretrained(model_name)

Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['lm_head.bias', 'lm_head.decoder.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`
Some weights of RobertaForCausalLM were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['lm_head.bias', 'lm_head.decoder.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
# Loading the tokenizer associated with the CodeBERT model
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [12]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNor

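Sample code to check whether the model can generate code from tokenized Kotlin. Because the language-modeling head is newly initialized (see the warning above), the output before fine-tuning is not expected to be meaningful.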

In [13]:
# Tokenizing Sample Kotlin code
sample_kotlin_code = "fun main() { println(\"Hello, world!\") }"
sample_tokenized_kotlin = tokenizer.encode(sample_kotlin_code, return_tensors="pt")

In [14]:
# Generating completion using the adapted model
max_length = 50
generated = model2.generate(sample_tokenized_kotlin, max_length=max_length, num_return_sequences=1)

# Decode and inspect the generated completion
generated_code = tokenizer.decode(generated[0], skip_special_tokens=True)
print("Generated code completion:", generated_code)

Generated code completion: fun main() { println("Hello, world!") }cellent ambulilar ambul McKilarilar sund ambul Dundilar ambul McKilar McKilar McKilar McKilar McKilarmal ambul McK Dund McKilar McK Dund Dund Dund Dundilar ambul Dund


## 3. Data Preprocessing

In this section, I tokenize the Kotlin code dataset using the CodeBERT tokenizer, split the dataset into training, validation, and test sets, and create PyTorch Dataset and DataLoader objects for each set.

In [15]:
# Tokenize the Kotlin code dataset (each file is padded or truncated to 512 tokens)
tokenized_kotlin_code = tokenizer(kotlin_code, padding=True, truncation='longest_first', return_tensors="pt", max_length=512)
print("Tokenized input shape:", tokenized_kotlin_code.input_ids.shape)

Tokenized input shape: torch.Size([1971, 512])
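With `truncation` and `max_length=512`, everything past the first 512 tokens of a file is discarded. One way to keep that content is to cut long files into overlapping windows. The sketch below shows the idea on a plain list of token ids; the function name `chunk_token_ids` and the stride of 128 are illustrative assumptions, and it ignores the `<s>` / `</s>` special tokens that each window would need. The Hugging Face tokenizers offer `return_overflowing_tokens=True` together with `stride` for a similar purpose.

```python
def chunk_token_ids(token_ids, max_length=512, stride=128):
    """Split one long id sequence into overlapping windows of at most max_length ids.

    Consecutive windows share `stride` ids so that context is not cut abruptly.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the current window already reaches the end of the sequence
        start += step
    return chunks
```

For example, a file of 1000 token ids would yield three windows starting at positions 0, 384 and 768.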


In [16]:
# Inspect the first tokenized sequence
first_encoding = tokenized_kotlin_code.encodings[0]
print("Tokenized sequence 1:")
print(first_encoding.tokens)

Tokenized sequence 1:
['<s>', '/*', 'Ċ', 'Ġ*', 'ĠCopyright', 'Ġ2014', '-', '20', '21', 'ĠJet', 'Br', 'ains', 'Ġs', '.', 'r', '.', 'o', 'Ġand', 'Ġcontributors', '.', 'ĠUse', 'Ġof', 'Ġthis', 'Ġsource', 'Ġcode', 'Ġis', 'Ġgoverned', 'Ġby', 'Ġthe', 'ĠApache', 'Ġ2', '.', '0', 'Ġlicense', '.', 'Ċ', 'Ġ*/', 'Ċ', 'import', 'Ġorg', '.', 'grad', 'le', '.', 'api', '.*', 'Ċ', 'import', 'Ġorg', '.', 'grad', 'le', '.', 'api', '.', 't', 'asks', '.*', 'Ċ', 'import', 'Ġorg', '.', 'grad', 'le', '.', 'k', 'ot', 'lin', '.', 'd', 'sl', '.*', 'Ċ', 'import', 'Ġorg', '.', 'j', 'mail', 'en', '.', 'grad', 'le', '.', 'k', 'ot', 'lin', 'ter', '.', 't', 'asks', '.*', 'Ċ', 'Ċ', 'fun', 'ĠProject', '.', 'config', 'ure', 'Cod', 'estyle', '()', 'Ġ{', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġapply', '(', 'plugin', 'Ġ=', 'Ġ"', 'org', '.', 'j', 'mail', 'en', '.', 'k', 'ot', 'lin', 'ter', '")', 'ĊĊ', 'Ġ', 'Ġ', 'Ġ', 'Ġk', 'ot', 'lin', 'ter', '.', 'apply', 'Ġ{', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ', 'Ġ', 'Ġ', 'Ġ', 'Ġignore', 'Fail', 'ures', 'Ġ=', 'Ġtrue', 'Ċ

In [17]:
train_size = 0.6
val_test_size = 0.2
train_kotlin_code, temp_kotlin_code = train_test_split(kotlin_code, train_size=train_size, random_state=42)
# Note: test_size is a fraction of the remaining 40%, so the final split is roughly 60/32/8, not 60/20/20
val_kotlin_code, test_kotlin_code = train_test_split(temp_kotlin_code, test_size=val_test_size, random_state=42)

In [18]:
print("Training set size:", len(train_kotlin_code))
print("Validation set size:", len(val_kotlin_code))
print("Test set size:", len(test_kotlin_code))

Training set size: 1182
Validation set size: 631
Test set size: 158
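The reported sizes follow from applying the second split to the 40% that remains after the first one. The arithmetic can be reproduced with the small sketch below. The helper name `two_stage_split_sizes` is hypothetical, and the floor/ceil rounding is an assumption chosen because it reproduces the counts printed above.

```python
import math


def two_stage_split_sizes(n_items, train_size, second_test_size):
    """Return (n_train, n_val, n_test) for a train split followed by a val/test split."""
    n_train = math.floor(train_size * n_items)      # first split: training share
    n_rest = n_items - n_train                      # what is left for val + test
    n_test = math.ceil(second_test_size * n_rest)   # second split is relative to n_rest
    n_val = n_rest - n_test
    return n_train, n_val, n_test


print(two_stage_split_sizes(1971, 0.6, 0.2))  # sizes with the values used in this notebook
print(two_stage_split_sizes(1971, 0.6, 0.5))  # sizes if the remainder is halved instead
```

Passing `0.5` as the second fraction gives validation and test sets of about 20% each, which is what the variable name `val_test_size = 0.2` seems to aim for.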


In [19]:
class KotlinDataset(Dataset):
    def __init__(self, tokenized_kotlin_code):
        self.tokenized_kotlin_code = tokenized_kotlin_code

    def __len__(self):
        return len(self.tokenized_kotlin_code)

    def __getitem__(self, idx):
        encoding = self.tokenized_kotlin_code[idx]
        input_ids = encoding.ids
        attention_mask = encoding.attention_mask
        return {
            'input_ids': torch.tensor(input_ids),
            'attention_mask': torch.tensor(attention_mask)
        }

In [20]:
# Helper to decode tokenized sequences back to text
def decode_tokens(tokenizer, token_ids):
    return tokenizer.decode(token_ids, skip_special_tokens=True)

In [21]:
# Note: these slices follow the original file order of `tokenized_kotlin_code`;
# only the split sizes are taken from the shuffled train/val/test lists above
kotlin_train_dataset = KotlinDataset(tokenized_kotlin_code[:len(train_kotlin_code)])
kotlin_val_dataset = KotlinDataset(tokenized_kotlin_code[len(train_kotlin_code):len(train_kotlin_code) + len(val_kotlin_code)])
kotlin_test_dataset = KotlinDataset(tokenized_kotlin_code[len(train_kotlin_code) + len(val_kotlin_code):])
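Because the tokenized data is sliced in its original order, the three datasets do not contain the same files as the shuffled `train_kotlin_code`, `val_kotlin_code` and `test_kotlin_code` lists. One way to keep texts and tokenized examples aligned is to shuffle indices once and reuse them everywhere. This is a minimal sketch under that assumption; `split_indices` and its default fractions are hypothetical, not part of the notebook.

```python
import random


def split_indices(n_items, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle 0..n_items-1 once and cut the result into train/val/test index lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = int(train_frac * n_items)
    n_val = int(val_frac * n_items)
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train_idx, val_idx, test_idx
```

The index lists can then select the raw files, for example `[kotlin_code[i] for i in train_idx]`, and each subset can be tokenized separately so that every dataset holds exactly the files assigned to it.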

In [22]:
sample = kotlin_train_dataset[10]
input_ids = sample['input_ids']
attention_mask = sample['attention_mask']
print(f"Sample {11}:")
print("Input IDs Shape:", input_ids.shape)
print("Attention Mask Shape:", attention_mask.shape)

Sample 11:
Input IDs Shape: torch.Size([512])
Attention Mask Shape: torch.Size([512])


In [23]:
sample = kotlin_test_dataset[8]
input_ids = sample['input_ids']
attention_mask = sample['attention_mask']
print(f"Sample {9}:")
print("Input IDs Shape:", input_ids.shape)
print("Attention Mask Shape:", attention_mask.shape)

Sample 9:
Input IDs Shape: torch.Size([512])
Attention Mask Shape: torch.Size([512])


In [24]:
batch_size = 8

kotlin_train_dataloader = DataLoader(kotlin_train_dataset, batch_size=batch_size, shuffle=True)
kotlin_val_dataloader = DataLoader(kotlin_val_dataset, batch_size=batch_size)
kotlin_test_dataloader = DataLoader(kotlin_test_dataset, batch_size=batch_size)

In [25]:
# See the structure of the first batch
for batch in kotlin_test_dataloader:
    print("Batch Structure:", batch)
    break 

Batch Structure: {'input_ids': tensor([[    0, 49051, 50118,  ..., 30333,  1531,     2],
        [    0, 49051, 50118,  ...,     1,     1,     1],
        [    0, 49051, 50118,  ...,     1,     1,     1],
        ...,
        [    0, 49051, 50118,  ...,     1,     1,     1],
        [    0, 49051, 50118,  ...,     1,     1,     1],
        [    0, 49051, 50118,  ...,     1,     1,     1]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}


In [26]:
batch = next(iter(kotlin_train_dataloader))

input_ids = batch['input_ids']
attention_mask = batch['attention_mask']

print("Shape of Kotlin Train DataLoader Sample:")
print("Input tokens shape:", input_ids.shape)
print("Attention mask shape:", attention_mask.shape)

Shape of Kotlin Train DataLoader Sample:
Input tokens shape: torch.Size([8, 512])
Attention mask shape: torch.Size([8, 512])


## 4. Fine-Tuning on Kotlin Dataset
I will fine-tune the adapted model on the Kotlin dataset extracted from the Ktor project. This step is meant to help the model learn Kotlin-specific patterns and improve its performance on Kotlin code completion tasks.

In [27]:
# Setting Hyperparameters
learning_rate = 5e-5
epochs = 3

In [28]:
# Setting the optimizer
optimizer = AdamW(model.parameters(), lr=learning_rate)

# Setting the scheduler
num_training_steps = len(kotlin_train_dataloader) * epochs
scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),
    num_training_steps=num_training_steps
)
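To make the schedule concrete, the sketch below computes the step counts implied by the settings above (1182 training examples, batch size 8, 3 epochs) and evaluates a linear warmup followed by linear decay. The function `linear_warmup_lr` is a hypothetical re-implementation for illustration, and it assumes the scheduler is stepped once after every optimizer step.

```python
import math


def linear_warmup_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)


steps_per_epoch = math.ceil(1182 / 8)              # batches per epoch, last batch is partial
num_training_steps = steps_per_epoch * 3           # 3 epochs
num_warmup_steps = int(0.1 * num_training_steps)   # 10% warmup, as in the cell above

for step in (0, num_warmup_steps, num_training_steps // 2, num_training_steps):
    lr = linear_warmup_lr(step, 5e-5, num_warmup_steps, num_training_steps)
    print(f"step {step}: lr = {lr:.2e}")
```

The peak learning rate of 5e-5 is reached at the end of warmup and falls back to zero at the last training step.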



In [None]:
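for epoch in range(epochs):
    model.train()  # re-enable training mode, since validation below switches to eval mode
    total_loss = 0.0

    for batch_idx, batch in enumerate(kotlin_train_dataloader):
        batch = {k: v.to(device) for k, v in batch.items()}
        optimizer.zero_grad()
        # The batch holds only input_ids and attention_mask; without a `labels` entry
        # the model returns loss=None and the branch below is skipped
        outputs = model(**batch)

        if isinstance(outputs, MaskedLMOutput) and outputs.loss is not None:
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            scheduler.step()  # advance the learning-rate schedule once per optimizer step
            total_loss += loss.item()
            print(f"Epoch [{epoch + 1}/{epochs}], Batch [{batch_idx + 1}/{len(kotlin_train_dataloader)}], Loss: {loss.item()}")
    avg_train_loss = total_loss / len(kotlin_train_dataloader)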

    # For Validation
    val_total_loss = 0.0
    model.eval()
    with torch.no_grad():
        for val_batch in kotlin_val_dataloader:
            val_batch = {k: v.to(device) for k, v in val_batch.items()}
            val_outputs = model(**val_batch)
            if isinstance(val_outputs, MaskedLMOutput) and val_outputs.loss is not None:
                val_loss = val_outputs.loss
                val_total_loss += val_loss.item()
    avg_val_loss = val_total_loss / len(kotlin_val_dataloader)

    print(f"Epoch {epoch + 1}: Average training loss {avg_train_loss:.4f}, Average validation loss {avg_val_loss:.4f}")
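A masked-LM model only returns a loss when the batch also contains `labels`, and the batches built above contain just `input_ids` and `attention_mask`. The sketch below illustrates, on plain Python lists, how masked inputs and matching labels could be derived from one example. It is a simplified illustration rather than the notebook's method: selected tokens are always replaced by the mask token, the helper name `mask_for_mlm` is hypothetical, and the mask id 50264 is an assumption that should be read from `tokenizer.mask_token_id` in practice. In `transformers`, `DataCollatorForLanguageModeling` provides this kind of masking for batches.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_for_mlm(input_ids, attention_mask, mask_token_id, special_ids,
                 mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) for one tokenized example.

    Only attended, non-special tokens may be masked. Labels keep the original id
    at masked positions and IGNORE_INDEX everywhere else.
    """
    rng = rng or random.Random()
    masked_ids, labels = [], []
    for token_id, attended in zip(input_ids, attention_mask):
        maskable = attended == 1 and token_id not in special_ids
        if maskable and rng.random() < mlm_probability:
            masked_ids.append(mask_token_id)  # hide the token from the model
            labels.append(token_id)           # and ask the model to predict it
        else:
            masked_ids.append(token_id)
            labels.append(IGNORE_INDEX)       # position does not contribute to the loss
    return masked_ids, labels


# Ids taken from the batch printed earlier: 0 = <s>, 2 = </s>, 1 = <pad>
example_ids = [0, 49051, 50118, 30333, 1531, 2, 1, 1]
example_mask = [1, 1, 1, 1, 1, 1, 0, 0]
masked, labels = mask_for_mlm(example_ids, example_mask, mask_token_id=50264,
                              special_ids={0, 1, 2}, rng=random.Random(0))
print(masked)
print(labels)
```

Passing the resulting labels to the model alongside the masked inputs, as a `labels` tensor, is what makes `outputs.loss` available in a loop like the one above.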


In [29]:
# Saving the model parameters
torch.save(model.state_dict(), 'model_state.pth')

In [30]:
model = AutoModelForMaskedLM.from_pretrained(model_name)
# Loading the parameters' values to new model
model.load_state_dict(torch.load('model_state.pth'))

Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['lm_head.bias', 'lm_head.decoder.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


<All keys matched successfully>

## 5. Evaluation 
After fine-tuning the model, I'll evaluate its performance on the Kotlin test set.

In [31]:
def evaluate_model_on_kotlin_test(model, test_dataloader):
    model.eval() 
    device = next(model.parameters()).device
    total_loss = 0.0

    with torch.no_grad():
        for batch in test_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            # As in training, no `labels` are passed, so outputs.loss is None and total_loss stays 0
            outputs = model(**batch)
            if isinstance(outputs, MaskedLMOutput) and outputs.loss is not None:
                loss = outputs.loss
                total_loss += loss.item()

    avg_loss = total_loss / len(test_dataloader)
    print(f"Average loss on Kotlin test dataset: {avg_loss:.4f}")


In [32]:
evaluate_model_on_kotlin_test(model, kotlin_test_dataloader)

Average loss on Kotlin test dataset: 0.0000
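A quick sanity check on an average cross-entropy loss is to convert it to perplexity. The sketch below is an illustration with a hypothetical helper name. The vocabulary size of 50265 is taken from the embedding layer printed earlier.

```python
import math


def perplexity(avg_cross_entropy):
    """Perplexity is the exponential of the average cross-entropy (natural log)."""
    return math.exp(avg_cross_entropy)


vocab_size = 50265  # size of the word embedding table shown in the model printout

print("Perplexity at loss 0.0:", perplexity(0.0))
print("Loss of a uniform guess over the vocabulary:", math.log(vocab_size))
```

A loss of exactly 0.0 corresponds to a perplexity of 1, meaning every target token was predicted with full certainty. For a real test set this is implausible, which points to the loss never being accumulated rather than to a perfect model.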


## 6. Review and Analysis

Unfortunately, during the fine-tuning process, I encountered challenges in obtaining meaningful training and validation losses. Additionally, due to limitations in the evaluation procedure, I couldn't adequately assess the model's performance.

### *Areas for Improvement:*

- Training and Validation Loss: The inability to achieve sensible training and validation losses suggests potential issues with the model's training process. This could be due to various factors such as inappropriate hyperparameters, insufficient training data, or model architecture mismatch. One concrete cause visible in the code is that the batches contain only `input_ids` and `attention_mask`; without `labels` the model returns no loss, so the accumulated loss stays at zero, which matches the reported test loss of 0.0000.
- Model Evaluation: Without proper evaluation metrics, it's challenging to gauge the effectiveness of the fine-tuned model accurately. Implementing comprehensive evaluation metrics and procedures is crucial for assessing the model's performance reliably.


### *Completed Tasks:*

- Dataset Creation: Successfully extracted Kotlin code from the Ktor project to create a dataset for code completion tasks.
- Model Adaptation: Adapted the pre-trained CodeBERT model for Kotlin code completion by fine-tuning it on the created dataset.
- Evaluation Setup: Established a framework for evaluating the fine-tuned model on the Kotlin test dataset.


### *Next Steps:*

- Hyperparameter Tuning: Experiment with different hyperparameter configurations, including learning rate, batch size, and optimizer settings, to improve training stability and convergence.
- Evaluation Enhancement: Implement comprehensive evaluation metrics, such as accuracy, precision, and recall, to assess the model's performance accurately.
- Debugging and Troubleshooting: Investigate potential issues in the training pipeline, such as data preprocessing, input formatting, and model architecture compatibility, to identify and resolve any underlying problems.

By addressing these areas for improvement and refining the training and evaluation processes, I can enhance the model's performance and ensure reliable results for Kotlin code completion tasks.
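As a starting point for the accuracy metric mentioned in the next steps, the sketch below scores predictions only at masked positions. It is a minimal illustration with a hypothetical helper name, and it assumes labels use -100 for positions that should not be scored, which is the convention for ignored labels in PyTorch's cross-entropy loss.

```python
def masked_token_accuracy(predicted_ids, labels, ignore_index=-100):
    """Fraction of scored positions where the predicted id equals the label.

    Positions whose label equals ignore_index are skipped. Returns None when
    no position is scored, so an empty evaluation is not mistaken for 0% or 100%.
    """
    scored = [(p, y) for p, y in zip(predicted_ids, labels) if y != ignore_index]
    if not scored:
        return None
    correct = sum(1 for p, y in scored if p == y)
    return correct / len(scored)


# Two scored positions (labels 7 and 8), one of them predicted correctly
print(masked_token_accuracy([5, 7, 9, 2], [-100, 7, 8, -100]))
```

Returning `None` for an unscored batch makes a missing-labels situation visible instead of silently reporting a number.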