# **Finetuning Llama-2-7b for Software Test Case Generation**
This notebook contains the complete code for finetuning the Llama-2-7b-chat-hf model. An A100 GPU on Google Colab was used for the entire project, and the settings specified in the notebook will not work on a GPU with less memory than the A100.

## **Import Necessary Libraries and Methods**
The required packages must be installed before executing the code on Google Colab. The versions and installation order of these packages need to be preserved to avoid issues caused by incompatible module versions.

In [None]:
!pip install -q transformers==4.30.0 trl bitsandbytes==0.43.3 accelerate==0.21.0 peft==0.4.0 tensorboard==2.15.0 wandb
!pip install -q cudf-cu12==24.4.1
!pip install -q ibis-framework --upgrade
!pip install -q bigframes --upgrade
!pip install -q gcsfs==2024.3.1
!pip install -q datasets==2.19.1

In [None]:
# Update the package list
!apt-get update

# Install development libraries for pycairo and other required packages
!apt-get install -y libcairo2-dev pkg-config python3-dev

# Install pycairo
!pip install -q pycairo

In [None]:
!pip install -q tensorboard==2.17.0
!pip install -q requests==2.32.3
!pip install -q xformers triton

In [None]:
!pip check
!nvcc --version

In [None]:
!python -m bitsandbytes
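
Since the notebook depends on specific package versions, it can help to confirm what is actually installed before going further. The following is a minimal sketch that uses only the standard library; the helper name `find_mismatches` and the selection of pins in `PINNED` are illustrative assumptions, with the version numbers copied from the install cell.

```python
from importlib.metadata import PackageNotFoundError, version

# A subset of the pins from the install cell (selection is illustrative).
PINNED = {
    "transformers": "4.30.0",
    "accelerate": "0.21.0",
    "peft": "0.4.0",
    "datasets": "2.19.1",
}


def find_mismatches(pins):
    """Return {package: (expected, installed)} for every pin that is not met."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in find_mismatches(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every listed pin is satisfied in the current runtime.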

In [None]:
import numpy
import gc
import os
import pandas as pd
import torch
from transformers import AutoTokenizer, BitsAndBytesConfig, LlamaForCausalLM, TrainingArguments, pipeline
from peft import LoraConfig, prepare_model_for_kbit_training, PeftModel
from torch.utils.tensorboard import SummaryWriter
from trl import SFTTrainer
from datasets import Dataset, concatenate_datasets, DatasetDict, load_from_disk
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import warnings

warnings.filterwarnings('ignore')

## **Converting Text Files into Pandas Dataframe**

In [None]:
"""First, we assign the paths of the input and output text files to the corresponding variables"""
input_path = '/content/drive/MyDrive/MSc Project/new_attempt/data/train/input.methods.txt'
output_path = '/content/drive/MyDrive/MSc Project/new_attempt/data/train/output.tests.txt'
with open(input_path, 'r') as f:
  input_text = f.read() # Read the input file into input_text (the name avoids shadowing the built-in input())
with open(output_path, 'r') as f:
  output_text = f.read() # Read the output file into output_text

"""Now, we can create a pandas dataframe from the input and output text files"""
df = pd.DataFrame({'input': input_text.split('\n'), 'output': output_text.split('\n')})
df
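
The dataframe pairs line i of the methods file with line i of the tests file, so the two files must have the same number of lines. The sketch below illustrates that pairing on in-memory strings; the helper `pair_lines` and the toy file contents are hypothetical, and dropping the empty element produced by a trailing newline is an assumption about how the files end.

```python
def pair_lines(methods_text, tests_text):
    """Pair line i of the methods text with line i of the tests text."""
    methods = methods_text.split("\n")
    tests = tests_text.split("\n")
    # A trailing newline yields one empty final element; drop it on both sides.
    if methods and methods[-1] == "":
        methods.pop()
    if tests and tests[-1] == "":
        tests.pop()
    if len(methods) != len(tests):
        raise ValueError(
            f"line count mismatch: {len(methods)} methods vs {len(tests)} tests"
        )
    return list(zip(methods, tests))


# Toy stand-ins for the two text files.
pairs = pair_lines("method_one\nmethod_two\n", "test_one\ntest_two\n")
print(pairs)
```

Failing early on a mismatch is a design choice: a silent misalignment would pair focal methods with the wrong unit tests.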

## **Defining a Chat Template for the Model**

In [None]:
"""This function is used to define a chat template for the inference phase.
The dataframe will be mapped according to this template so that the model expects
this format for inference"""
def chat_template(sample):
  return f"---Focal Method---\n{sample}\n\n---Unit Test---\n"

df['input'] = df['input'].map(chat_template) # The input records are mapped as per the chat template
df
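
To make the template concrete, the sketch below applies it to one focal method and shows how a completion can be separated from the prompt afterwards. The helper `extract_completion`, the sample Java snippet, and the simulated generation are assumptions made for illustration; no model is called.

```python
def chat_template(sample):
    # Same template string as in the cell above.
    return f"---Focal Method---\n{sample}\n\n---Unit Test---\n"


def extract_completion(generated_text, prompt):
    """Return only the text that follows the prompt (hypothetical helper)."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


prompt = chat_template("Foo { int bar() { return 1; } }")
# Simulated model output: the prompt echoed back, followed by a unit test.
simulated_generation = prompt + "@Test public void testBar() { assertEquals(1, new Foo().bar()); }"

print(prompt)
print(extract_completion(simulated_generation, prompt))
```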

## **Reducing the size of the Dataset to a Manageable Size**

In [None]:
"""Taking a chunk (25000 entries) of the dataset for the finetuning process
due to memory limitations. The full dataset has over 600000 records, and the finetuning process
cannot be carried out on a single consumer grade GPU for such a large dataset.
So, this project will serve as the baseline for future work in this domain and work as
a proof of concept for the finetuning process"""
df_chunk = df.iloc[:25000]
df_chunk

## **Loading the Model and Tokenizer**

In [None]:
base_model = "NousResearch/Llama-2-7b-chat-hf" # Model from NousResearch available on Hugging Face
new_model = "/content/drive/MyDrive/MSc Project/new_attempt/finetuned_llama_for_software_testing" # Specifying the directory for the new model after finetuning
tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast = True) # Loading the tokenizer

## **Tokenizing the Data**

In [None]:
"""The dataset cannot be used directly by the Llama-2 model for training; it first needs to be
converted into tokens. This piece of code converts the dataset into tokens, and also returns list
objects for the token lengths of the input, output, and combined (input + output) tokens.
The data is tokenized in batches because tokenization can be memory intensive
and can cause the runtime to crash if the entire dataset is tokenized at once."""
batch_size = 5000
final_dataset = None
input_token_length_list = [] # Initializing lists for storing token lengths
output_token_length_list = []
combined_token_length_list = []

for start_idx in range(0, len(df_chunk), batch_size):
  end_idx = min(start_idx + batch_size, len(df_chunk))
  df_chunk_batch = df_chunk.iloc[start_idx:end_idx]

  # Tokenizing the batch
  input_chunk_tokens = tokenizer(df_chunk_batch['input'].tolist(), return_tensors = 'np')
  output_chunk_tokens = tokenizer(df_chunk_batch['output'].tolist(), return_tensors = 'np')

  # Creating a list of token lengths for each row for inputs, outputs, and combined lengths
  input_token_length = [len(token) for token in input_chunk_tokens['input_ids']]
  output_token_length = [len(token) for token in output_chunk_tokens['input_ids']]
  combined_token_length = [x + y for x, y in zip(input_token_length, output_token_length)]

  # Appending the token length lists to the corresponding main lists outside the loop
  input_token_length_list.extend(input_token_length)
  output_token_length_list.extend(output_token_length)
  combined_token_length_list.extend(combined_token_length)

  # Creating separate numpy arrays
  input_ids_np = input_chunk_tokens['input_ids']
  attention_mask_np = input_chunk_tokens['attention_mask']
  labels_np = output_chunk_tokens['input_ids']

  # Adding the text field in the dataset for SFTTrainer
  combined_text_list = [
                        f"Input: {input_text}\nOutput: {output_text}"
                        for input_text, output_text in zip(df_chunk_batch['input'].tolist(), df_chunk_batch['output'].tolist())
                       ]

  # Creating a dictionary for the batch where keys are column names and values are the lists
  # This is the format in which the model expects to receive the data during finetuning
  tokenized_dataset_dict = {
                            'input_ids': input_ids_np,
                            'attention_mask': attention_mask_np,
                            'labels': labels_np,
                            'text': combined_text_list
                           }

  # Creating a dataset object from the batch dictionary
  batch_dataset = Dataset.from_dict(tokenized_dataset_dict)

  # Concatenating the batch datasets with the final dataset
  if final_dataset is None:
    final_dataset = batch_dataset
  else:
    final_dataset = concatenate_datasets([final_dataset, batch_dataset])

  # Removing intermediate objects from the memory to make the process efficient
  del (input_chunk_tokens, output_chunk_tokens, input_token_length, output_token_length,
       combined_token_length, input_ids_np, attention_mask_np, labels_np,
       combined_text_list, tokenized_dataset_dict, batch_dataset)
  gc.collect()

final_dataset
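
The batching logic in the loop above can be looked at in isolation. This is a minimal sketch of how the start and end indices are derived; the generator name `batch_bounds` is hypothetical, and the row count and batch size simply mirror the values used in this notebook.

```python
def batch_bounds(n_rows, batch_size):
    """Yield (start, end) index pairs that cover n_rows in order."""
    for start in range(0, n_rows, batch_size):
        # The last batch may be shorter than batch_size.
        yield start, min(start + batch_size, n_rows)


bounds = list(batch_bounds(25000, 5000))
print(len(bounds), bounds[0], bounds[-1])

# A row count that is not a multiple of the batch size.
print(list(batch_bounds(12, 5)))
```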

## **Plotting Distribution of Token Lengths for Input, Output, and Combined Tokens**

In [None]:
""" Defining a function for plotting the token length distributions"""
def plot_distribution(token_counts, title):
    sns.set_style("whitegrid")
    plt.figure(figsize=(15, 6))
    plt.hist(token_counts, bins=50, color='#348ddb', edgecolor='black')
    plt.title(title, fontsize=16)
    plt.xlabel("Number of tokens", fontsize=14)
    plt.ylabel("Number of examples", fontsize=14)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    plt.tight_layout()
    plt.show()

# Plotting the token length distributions for input, output, and combined token lengths
plot_distribution(input_token_length_list, "Distribution of Input Token Lengths")
plot_distribution(output_token_length_list, "Distribution of Output Token Lengths")
plot_distribution(combined_token_length_list, "Distribution of Combined Token Lengths")

## **Filtering out entries with more than 1024 tokens**

In [None]:
"""The distribution plots indicate that most of the records have combined token lengths of <= 1000. So, the records
with combined token lengths of >= 1024 or input token lengths of > 512 are filtered out. This is done to trim the
long tail of the distribution. The context window of Llama 2 is 4096 tokens, but we will limit the token lengths to 1024 for
computational efficiency"""
valid_indices = [i for i, count in enumerate(combined_token_length_list) if count < 1024 and input_token_length_list[i] <= 512] # removing indices with large token lengths
print(f"Number of valid records: {len(valid_indices)}")
print(f"So, removing {len(final_dataset) - len(valid_indices)} records") # displaying the filtration results

final_dataset = final_dataset.select(valid_indices) # updating the final dataset to include only the valid indices

valid_token_lengths = [combined_token_length_list[i] for i in valid_indices] # Get combined token counts for each row in the updated dataset

plot_distribution(valid_token_lengths, "Distribution of Valid Token Lengths") # Plotting the updated distribution of valid token lengths
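
The filter condition is easiest to check on a few boundary values. The sketch below applies the same two thresholds to toy length lists; the function name `select_valid` and the example lengths are made up for illustration.

```python
MAX_COMBINED = 1024  # a record is kept only if combined length < 1024
MAX_INPUT = 512      # and input length <= 512


def select_valid(input_lengths, combined_lengths):
    """Return the indices of records that satisfy both length limits."""
    return [
        i
        for i, combined in enumerate(combined_lengths)
        if combined < MAX_COMBINED and input_lengths[i] <= MAX_INPUT
    ]


# Toy lengths chosen to sit on the boundaries of both limits.
input_lengths = [100, 512, 513, 300]
combined_lengths = [400, 1023, 900, 1024]

valid = select_valid(input_lengths, combined_lengths)
print(valid)
```

The third record fails only the input limit and the fourth fails only the combined limit, which shows that violating either condition is enough to remove a record.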

## **Train Test Split**

In [None]:
"""Splitting the data for training and testing. Both datasets will be used during the finetuning process
to determine the training loss and evaluation loss respectively."""
split_dataset = final_dataset.train_test_split(test_size = 0.2, seed = 42) # Named split_dataset so the imported sklearn train_test_split function is not overwritten
train_data = split_dataset['train']
eval_data = split_dataset['test']

print(train_data)
print(eval_data)

## **Save and Load the Datasets**

In [None]:
"""Finetuning requires multiple runs so the best hyperparameter configuration can be reached.
To save time and computational resources, we save the dataset here so it can be loaded directly
for subsequent runs"""
dataset_dictionary = DatasetDict({'train': train_data, 'test': eval_data}) # Creating a dictionary with the dataset
dataset_dictionary.save_to_disk('/content/drive/MyDrive/MSc Project/new_attempt/data/train/final_tokenized_dataset')

In [None]:
"""Loading the dataset from the drive. The saved DatasetDict already contains the train and test splits,
so they are read back by key and the split does not need to be repeated"""
loaded_dataset = load_from_disk('/content/drive/MyDrive/MSc Project/new_attempt/data/train/final_tokenized_dataset')
train_data = loaded_dataset['train']
eval_data = loaded_dataset['test']

print(train_data)
print(eval_data)

## **Pad the Tokens**

In [None]:
"""Padding the tokens is a crucial step. Padding gives all the records in a batch the same number of tokens by
appending padding tokens to the records with lower token counts. The UNK token is used for padding to avoid the known
issues with reusing the EOS token as the padding token"""
tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = 'right'
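
As an illustration of what right padding does to a batch, here is a small sketch in plain Python. The function `pad_right` is hypothetical, the token ids are made up, and `PAD_ID = 0` assumes that the UNK token has id 0 in the Llama tokenizer vocabulary.

```python
PAD_ID = 0  # assumption: id of the <unk> token in the Llama tokenizer


def pad_right(batch, pad_id=PAD_ID):
    """Pad every sequence on the right to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_right([[1, 15, 27], [1, 9]])
print(ids)
print(mask)
```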

## **QLoRA Quantization**

In [None]:
"""Here, we configure the settings for QLoRA quantization."""
qlora_config = BitsAndBytesConfig(
                                  load_in_4bit = True,
                                  bnb_4bit_compute_dtype = torch.float16,
                                  bnb_4bit_quant_type = "nf4",  # 4-bit NormalFloat (NF4) gives improved accuracy compared to standard 4-bit integer quantization
                                  bnb_4bit_use_double_quant = False  # Setting as False to avoid further losses in accuracy at the cost of slight increase in memory usage
                                  )
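
A rough calculation shows why 4-bit loading matters on a single GPU. The sketch below estimates the memory taken by the weights alone; the parameter count of 7 billion is a nominal figure, and the per-parameter overheads for the quantization constants (about 0.5 bits without double quantization and about 0.127 bits with it, at a block size of 64) are taken from the QLoRA paper and treated here as assumptions. Activations, gradients, and optimizer state are not included.

```python
N_PARAMS = 7_000_000_000  # nominal parameter count (assumption)


def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store the weights, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


fp16_gib = weight_memory_gib(N_PARAMS, 16)
nf4_gib = weight_memory_gib(N_PARAMS, 4)

# Overhead of the quantization constants, in bits per parameter (assumed values).
overhead_single = weight_memory_gib(N_PARAMS, 0.5)
overhead_double = weight_memory_gib(N_PARAMS, 0.127)

print(f"fp16 weights:            {fp16_gib:.2f} GiB")
print(f"nf4 weights:             {nf4_gib:.2f} GiB")
print(f"constants, single quant: {overhead_single:.2f} GiB")
print(f"constants, double quant: {overhead_double:.2f} GiB")
```

Under these assumptions, disabling double quantization costs a few hundred MiB, which is consistent with the "slight increase in memory usage" noted in the config comment.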

## **Loading the Llama 2 Model**

In [None]:
"""The Llama 2 model is loaded and prepared for 4 bit training using the QLoRA config defined earlier"""
model = LlamaForCausalLM.from_pretrained(
                                          base_model,
                                          quantization_config = qlora_config,
                                          device_map = "auto"
                                        )
model.config.use_cache = False  # Setting as False to save memory during training
model.config.pretraining_tp = 1  # A value of 1 turns off the emulation of pretraining tensor parallelism, so the standard linear layers are used
model = prepare_model_for_kbit_training(model)

## **Parameter-Efficient Fine-Tuning**

In [None]:
"""PEFT reduces the number of trainable parameters: LoRA freezes the base weights and trains small low-rank adapter
matrices instead. The target modules are the attention projection layers and the feedforward network
projection layers. Targeting these means that these layers will be adapted to the new data during finetuning"""
peft_params = LoraConfig(
                          lora_alpha = 32,
                          lora_dropout = 0.05,
                          r = 16, # this is the rank of matrices to be used in the LoRA process
                          bias = "none",
                          task_type = "CAUSAL_LM",
                          target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
                        )
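
The number of trainable parameters implied by this configuration can be worked out by hand. The sketch below assumes the published Llama-2-7b dimensions (hidden size 4096, feedforward size 11008, 32 decoder layers) and uses the fact that a LoRA adapter on a linear layer adds two matrices of shapes r x d_in and d_out x r. The names in the code are illustrative.

```python
HIDDEN = 4096         # assumed hidden size of Llama-2-7b
INTERMEDIATE = 11008  # assumed feedforward (MLP) size
N_LAYERS = 32         # assumed number of decoder layers
R = 16                # LoRA rank from the config above
ALPHA = 32            # lora_alpha from the config above

# (d_in, d_out) of each targeted projection inside one decoder layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(d_in, d_out, r):
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_param_count(d_in, d_out, R) for d_in, d_out in TARGETS.values())
total_trainable = per_layer * N_LAYERS
scaling = ALPHA / R  # factor applied to the adapter output

print(f"trainable LoRA parameters: {total_trainable:,}")
print(f"scaling factor alpha / r:  {scaling}")
```

Under these assumptions the adapters hold roughly 40 million parameters, well under 1% of the base model.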

## **Setting the Training Parameters**

In [None]:
"""Specifying the training parameters"""
training_params = TrainingArguments(
                                      output_dir = "/content/drive/MyDrive/MSc Project/new_attempt/results",
                                      num_train_epochs = 12,
                                      per_device_train_batch_size = 32,
                                      per_device_eval_batch_size = 32,
                                      gradient_accumulation_steps = 1,
                                      optim = "paged_adamw_8bit", # using the 8-bit optimizer
                                      evaluation_strategy = "steps",
                                      eval_steps = 500,
                                      save_steps = 500,
                                      logging_steps = 100,
                                      learning_rate = 2e-4, # this is the learning rate that produced the most promising results
                                      weight_decay = 0.001,
                                      fp16 = True,
                                      bf16 = False, # These settings for fp16 and bf16 ensure mixed precision is enabled for computational efficiency
                                      max_grad_norm = 0.3,
                                      max_steps = -1,
                                      warmup_ratio = 0.03,
                                      lr_scheduler_type = "linear", # linear decay produced better results than cosine annealing
                                      report_to = "wandb"
                                    )
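
To see what these settings mean in terms of optimizer steps, the sketch below derives the step counts from the batch size, epochs, and warmup ratio. The training set size of 20,000 is a hypothetical figure (80% of the 25,000-record chunk before filtering); the real number is smaller after the length filter. The function name `schedule_summary` is illustrative, and the rounding is intended to approximate how the trainer counts steps.

```python
import math


def schedule_summary(n_train, batch_size, grad_accum, epochs, warmup_ratio, eval_steps):
    """Approximate step counts for a single-GPU run."""
    batches_per_epoch = math.ceil(n_train / batch_size)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
        "evaluations": total_steps // eval_steps,
    }


summary = schedule_summary(
    n_train=20_000,    # hypothetical training set size
    batch_size=32,     # per_device_train_batch_size
    grad_accum=1,      # gradient_accumulation_steps
    epochs=12,         # num_train_epochs
    warmup_ratio=0.03,
    eval_steps=500,
)
print(summary)
```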

## **Finetuning the LLM**

In [None]:
"""Finetuning the base model and saving the adapter layers after finetuning"""
trainer = SFTTrainer(
                      model = model,
                      train_dataset = train_data,
                      eval_dataset = eval_data,
                      peft_config = peft_params,
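                      max_seq_length = 512,  # Texts longer than 512 tokens are expected to be truncated here, although the earlier filter keeps combined lengths up to 1023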
                      tokenizer = tokenizer,
                      packing = False,  # Leaving packing as False because the Java focal methods and unit tests can be lengthy
                      dataset_text_field = 'text',
                      args = training_params
                    )

trainer.train() # Training the model
trainer.model.save_pretrained(new_model) # Saving the trained LoRA adapter weights

## **Inference**

In [None]:
"""Running the text generation pipeline with the finetuned model. This is just to check that the model works properly for
inference. Detailed inference will be carried out for validation in the next notebook. The focal method used here was randomly
selected from the eval dataset in the parent corpus, and is unseen data for the model"""
prompt = "VerificationUtil { static public boolean isZero(Number value, double zeroThreshold){ return (value.doubleValue() >= -zeroThreshold) && (value.doubleValue() <= zeroThreshold); } }"
focal_method = f"---Focal Method---\n{prompt}\n\n---Unit Test---\n"

pipe = pipeline(task = "text-generation", model = model, tokenizer = tokenizer, max_length = 512)
result = pipe(focal_method)
print(result[0]['generated_text'][len(focal_method):])

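In [None]:
"""Emptying the VRAM so the next step can be carried out without memory issues"""
del model
del pipe
del trainer
gc.collect() # Collecting the dereferenced Python objects first
torch.cuda.empty_cache() # Releasing the cached GPU memory held by PyTorch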

## **Merging the base model with the trained adapter**

In [None]:
"""Reload the model in FP16 precision and merge it with the adapted LoRA weights"""
model = LlamaForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage = True,
    return_dict = True,
    torch_dtype = torch.float16,
    device_map = "auto",
)
model = PeftModel.from_pretrained(model, new_model)
model = model.merge_and_unload()

"""Reload the tokenizer so it can be saved together with the merged model"""
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = "right"
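
Conceptually, merging folds each adapter into its base weight matrix as W' = W + (alpha / r) * B A, after which the adapter is no longer needed at inference time. The sketch below demonstrates this on tiny matrices in plain Python; the helper names and the numbers are made up for illustration, and it is not the PEFT implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


# Toy shapes: W is d_out x d_in = 2 x 3, with rank r = 1.
W = [[1, 0, 2], [0, 1, 0]]
A = [[0.5, 0, 1]]  # r x d_in
B = [[1], [2]]     # d_out x r
alpha, r = 2, 1

merged = merge_lora(W, A, B, alpha, r)

# The merged layer gives the same output as base layer plus scaled adapter.
x = [[1], [2], [3]]  # column vector
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
separate = [[b[0] + (alpha / r) * a[0]] for b, a in zip(base_out, adapter_out)]

print(merged)
print(matmul(merged, x), separate)
```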