# **Validation**
This notebook validates the finetuned Llama-2-7b-chat-hf model by running inference on unseen data.
The generated responses are then compared with the actual test cases, and
accuracy, precision, recall, and F1 score are calculated. This notebook can run on an L4 GPU (or better) in Google Colab.

## **Import Necessary Libraries and Methods**
As in the previous notebook, the following packages must be installed before
the required libraries and modules can be imported.

In [None]:
!pip install -q transformers==4.30.0 accelerate==0.21.0 peft==0.4.0 jedi xformers triton tqdm
!pip install -q cudf-cu12==24.4.1
!pip install -q ibis-framework --upgrade
!pip install -q bigframes --upgrade
!pip install -q gcsfs==2024.3.1
!pip install -q datasets==2.19.1

In [None]:
# Update the package list
!apt-get update

# Install development libraries for pycairo and other required packages
!apt-get install -y libcairo2-dev pkg-config python3-dev

# Install pycairo
!pip install -q pycairo

In [None]:
!pip check  # Check the installed packages for dependency conflicts

In [None]:
import numpy as np
import pandas as pd
import torch
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score
from transformers import AutoTokenizer, LlamaForCausalLM, pipeline
from peft import PeftModel
from tqdm import tqdm

## **Loading the Base and Finetuned Model & Combining the LoRA Weights with the Base Model**
The base model and the finetuned adapter layers need to be loaded from their directories, as this is a new notebook running on a fresh runtime.

In [None]:
base_model = "NousResearch/Llama-2-7b-chat-hf" # Loading the base model from Hugging Face
new_model = "/content/drive/MyDrive/MSc Project/new_attempt/finetuned_llama_for_software_testing" # Loading the new model saved in the directory after finetuning

"""Reload the base model in FP16 and merge it with LoRA weights"""
model = LlamaForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage = True,
    return_dict = True,
    torch_dtype = torch.float16,
    device_map = "auto",
)

model = PeftModel.from_pretrained(model, new_model)
model = model.merge_and_unload()

"""Loading the tokenizer of the base model"""
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code = True)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = "right"
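
Merging folds each low-rank adapter update into the matching base weight matrix, so the merged model needs no separate adapter layers at inference time. The sketch below illustrates the idea on a toy matrix in plain Python. The helper names, the values, and the `alpha / r` scaling follow the usual LoRA formulation and are assumptions for illustration; they are not the internals of `merge_and_unload()`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(weight, x):
    """Compute W @ x for a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A (illustrative helper, hypothetical name)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy 2x2 base weight with a rank-1 adapter (hypothetical values)
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]       # shape (r, in_features) = (1, 2)
B = [[0.5], [-0.5]]    # shape (out_features, r) = (2, 1)
alpha, r = 2, 1

merged = merge_lora(W, A, B, alpha, r)

# The merged matrix reproduces "base output + scaled adapter output"
x = [3.0, 4.0]
base_out = matvec(W, x)
adapter_out = matvec(matmul(B, A), x)
separate = [b + (alpha / r) * a for b, a in zip(base_out, adapter_out)]
assert matvec(merged, x) == separate
```

Because the merged weights give the same outputs as the base model plus the adapter path, the merged model can be used like any ordinary causal language model afterwards.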

## **Create Dataset from Validation Files**
For validation, the test folder from the parent corpus was used. This data was not introduced to the model in any form prior to this step, so it can serve as the validation dataset.

In [None]:
"""Load the input and output text file paths"""
input_file_path = "/content/drive/MyDrive/MSc Project/new_attempt/data/val_data/input.methods.txt"
output_file_path = "/content/drive/MyDrive/MSc Project/new_attempt/data/val_data/output.tests.txt"

with open(input_file_path, "r") as file:
  input_text = file.read()

with open(output_file_path, "r") as file:
  output_text = file.read()

"""Creating a pandas dataframe using the files. A trailing newline is stripped first
so that the end of each file does not become an empty record"""
df_validation = pd.DataFrame({"input": input_text.rstrip("\n").split("\n"), "output": output_text.rstrip("\n").split("\n")})
df_validation
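
The dataframe relies on line `i` of the input file matching line `i` of the output file. A minimal sketch of that pairing follows, assuming one record per line in each file; the helper name `read_parallel_lines` and the sample strings are hypothetical.

```python
def read_parallel_lines(input_text, output_text):
    """Pair the lines of two texts, failing loudly if the line counts differ."""
    inputs = input_text.rstrip("\n").split("\n")
    outputs = output_text.rstrip("\n").split("\n")
    if len(inputs) != len(outputs):
        raise ValueError(f"{len(inputs)} inputs but {len(outputs)} outputs")
    return list(zip(inputs, outputs))

# Hypothetical file contents, each ending with a newline
pairs = read_parallel_lines(
    "void a() {}\nvoid b() {}\n",
    "testA() {}\ntestB() {}\n",
)
```

Checking the line counts up front catches a misaligned corpus before any focal method is paired with the wrong test.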

## **Filter Out Records based on Token Length**
The records need to be filtered based on token length for this dataset as well. To include more examples in the validation process, the max combined length was raised to 2048 and the max input length was raised to 1024. As the inference step is not as computationally expensive as finetuning, and the dataset is much smaller, the token lengths can be raised to these values without any issues.

In [None]:
"""Creating a function to filter out the records with token lengths that do not
meet the requirements. Batch wise tokenization is first performed on the dataset so the
filtration can be done based on token lengths"""
def filter_records(df, tokenizer, max_input_length = 1024, max_total_length = 2048, batch_size = 2000):
  filtered_indices = []

  for start_idx in range(0, len(df), batch_size):
    end_idx = min(start_idx + batch_size, len(df))
    df_batch = df.iloc[start_idx:end_idx]

    # Tokenize without padding: padded sequences would all report the length of the longest record in the batch
    input_batch_tokenized = tokenizer(df_batch["input"].tolist()) # converting input batch into tokens
    output_batch_tokenized = tokenizer(df_batch["output"].tolist()) # converting output batch into tokens

    # Create arrays of per-record token lengths for inputs, outputs, and combined lengths
    input_token_lengths = np.array([len(ids) for ids in input_batch_tokenized["input_ids"]])
    output_token_lengths = np.array([len(ids) for ids in output_batch_tokenized["input_ids"]])
    combined_token_lengths = input_token_lengths + output_token_lengths

    # Keep the records that stay within both token length limits
    batch_filtered_indices = np.where((combined_token_lengths <= max_total_length) & (input_token_lengths <= max_input_length))[0] # Using NumPy arrays for efficient filtering
    adjusted_indices = [start_idx + idx for idx in batch_filtered_indices]
    filtered_indices.extend(adjusted_indices) # appending the kept indices to the main list outside the loop

  return df.iloc[filtered_indices].reset_index(drop = True)

df_validation = filter_records(df_validation, tokenizer)
df_validation
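
The length rule itself is simple: a record is kept when its input fits the input limit and its input plus output fits the combined limit. The sketch below shows the rule on three toy records. `toy_tokenize` is a hypothetical whitespace tokenizer standing in for the Llama tokenizer, and the limits are shrunk to suit the toy data.

```python
def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: one token per whitespace-separated word."""
    return text.split()

def keep_indices(inputs, outputs, max_input_length, max_total_length):
    """Return the indices of the records that satisfy both length limits."""
    kept = []
    for idx, (inp, out) in enumerate(zip(inputs, outputs)):
        n_in = len(toy_tokenize(inp))    # length measured per record, before any padding
        n_out = len(toy_tokenize(out))
        if n_in <= max_input_length and n_in + n_out <= max_total_length:
            kept.append(idx)
    return kept

inputs = ["int add ( int a , int b )", "void f ( )", "a b c d e f g h i j k l"]
outputs = ["assertEquals ( 3 , add ( 1 , 2 ) )", "f ( )", "x"]

# Record 0 has 9 + 11 = 20 tokens (over the combined limit), record 2 has a 12-token input (over the input limit)
kept = keep_indices(inputs, outputs, max_input_length=10, max_total_length=15)
```

Only the second record survives here. Lengths have to be measured on unpadded sequences, since padding makes every sequence in a batch as long as the longest one.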

## **Selecting Random Sample from Validation Dataset for Inference**
To avoid selection bias, the samples used for validation are selected at random from the filtered dataset.

In [None]:
df_validation_sample = df_validation.sample(n = 300, random_state = 42) # Selecting 300 random samples from the validation dataset

"""Now we extract the sample dataset into lists"""
focal_methods = df_validation_sample["input"].tolist()
validation_unit_tests = df_validation_sample["output"].tolist()
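
Fixing the seed makes the random selection repeatable across runs. The sketch below shows the same idea with the standard library `random` module rather than pandas; the helper name `sample_indices` and the sizes are hypothetical.

```python
import random

def sample_indices(n_records, n_samples, seed):
    """Draw n_samples distinct row indices, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(range(n_records), n_samples)

first_draw = sample_indices(1000, 5, seed=42)
second_draw = sample_indices(1000, 5, seed=42)

# The same seed gives the same sample, and no index is drawn twice
assert first_draw == second_draw
assert len(set(first_draw)) == 5
```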

## **Run Inference with the Finetuned Model**
Inference is performed using the input focal methods from all 300 selected samples. The generated responses can then be compared with the actual test cases.

In [None]:
"""Creating a function to generate test cases from the finetuned model through inference"""
def generate_test_cases(focal_methods):
  generated_test_cases = []

  # Build the pipeline once, outside the loop. Note that max_length counts the prompt tokens as well as the generated tokens
  pipe = pipeline(task = "text-generation", model = model, tokenizer = tokenizer, max_length = 1024)

  for focal_method in tqdm(focal_methods, desc = "Generating Test Cases", unit = "case"):
    prompt = f"---Focal Method---\n{focal_method}\n\n---Unit Test---\n"  # Using the predefined prompt template
    result = pipe(prompt)
    generated_test_cases.append(result[0]['generated_text'][len(prompt):]) # Keeping only the text generated after the prompt

  return generated_test_cases

generated_test_cases = generate_test_cases(focal_methods)

df_validation_sample["generated_tests"] = generated_test_cases # Adding a column for the generated responses in the validation dataframe
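
By default the text-generation pipeline returns the prompt followed by the completion, which is why the prompt is sliced off before the response is stored. The sketch below isolates that step; `fake_generate` is a hypothetical stand-in for the pipeline so the example runs without a model.

```python
def build_prompt(focal_method):
    """Wrap a focal method in the prompt template used for inference."""
    return f"---Focal Method---\n{focal_method}\n\n---Unit Test---\n"

def fake_generate(prompt):
    """Hypothetical stand-in for the pipeline: echoes the prompt, then appends a completion."""
    return [{"generated_text": prompt + "@Test public void testAdd() { }"}]

def extract_completion(prompt, result):
    """Drop the echoed prompt and keep only the newly generated text."""
    full_text = result[0]["generated_text"]
    return full_text[len(prompt):]

prompt = build_prompt("public int add(int a, int b) { return a + b; }")
completion = extract_completion(prompt, fake_generate(prompt))
```

Passing `return_full_text = False` when calling the pipeline should have the same effect as the slicing, although that option is not used in this notebook.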

## **Create a Tokenized Version of the Validation Sample Dataset**
Now, for calculating the evaluation metrics, the input, actual output, and generated output need to be tokenized.

In [None]:
"""Tokenizing the inputs, true outputs, and generated outputs. No padding is applied,
so that pad tokens do not enter the metric calculation"""
input_ids = tokenizer(df_validation_sample["input"].tolist())
output_ids = tokenizer(df_validation_sample["output"].tolist())
generated_test_ids = tokenizer(df_validation_sample["generated_tests"].tolist())
df_validation_sample["input_token_ids"] = input_ids["input_ids"]
df_validation_sample["output_token_ids"] = output_ids["input_ids"]
df_validation_sample["generated_test_token_ids"] = generated_test_ids["input_ids"]

df_final_validation = df_validation_sample.copy()
df_final_validation

## **Evaluating the Accuracy of Generated Test Cases**
Finally, the accuracy, precision, recall, and F1 score of the generated test cases are calculated by comparing the generated and actual token IDs position by position.

In [None]:
"""Creating a function to calculate all the evaluation metrics"""
def calculate_metrics(df):
  predicted_ids_complete = []
  true_ids_complete = []

  for _, row in df.iterrows(): # Iterating over the tokenized dataframe to retrieve tokens for actual outputs and generated outputs
    predicted_ids = row["generated_test_token_ids"]
    true_ids = row["output_token_ids"]

    min_length = min(len(predicted_ids), len(true_ids)) # Aligning the lengths of the true and generated tokens to avoid any problems in calculation of the metrics
    predicted_ids = predicted_ids[:min_length]
    true_ids = true_ids[:min_length]

    predicted_ids_complete.extend(predicted_ids)
    true_ids_complete.extend(true_ids)

  accuracy = accuracy_score(true_ids_complete, predicted_ids_complete)
  precision = precision_score(true_ids_complete, predicted_ids_complete, average = "macro", zero_division = 1) # zero_division = 1 scores a class as 1 where the metric is undefined, instead of raising a warning
  recall = recall_score(true_ids_complete, predicted_ids_complete, average = "macro", zero_division = 1)
  f1 = f1_score(true_ids_complete, predicted_ids_complete, average = "macro", zero_division = 1)

  return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

metrics = calculate_metrics(df_final_validation)

print(f"Metrics: \n{metrics}") # Print the calculated Evaluation Metrics
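
To make the token-level comparison concrete, the sketch below computes accuracy and macro-averaged precision and recall by hand on one short pair of sequences. The helper names and token IDs are hypothetical, and this is a plain-Python illustration of the idea under those assumptions, not the scikit-learn implementation.

```python
def align(true_ids, predicted_ids):
    """Truncate both sequences to the shorter length, as done before scoring."""
    n = min(len(true_ids), len(predicted_ids))
    return true_ids[:n], predicted_ids[:n]

def accuracy(true_ids, predicted_ids):
    """Fraction of positions where the predicted token equals the true token."""
    matches = sum(t == p for t, p in zip(true_ids, predicted_ids))
    return matches / len(true_ids)

def macro_precision_recall(true_ids, predicted_ids, zero_division=1.0):
    """Average per-token-ID precision and recall; undefined cases take zero_division."""
    labels = sorted(set(true_ids) | set(predicted_ids))
    precisions, recalls = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(true_ids, predicted_ids))
        n_pred = predicted_ids.count(label)   # how often this ID was predicted
        n_true = true_ids.count(label)        # how often this ID truly occurs
        precisions.append(tp / n_pred if n_pred else zero_division)
        recalls.append(tp / n_true if n_true else zero_division)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)

# Hypothetical token IDs; the true sequence is one token longer than the generated one
true_ids, predicted_ids = align([5, 7, 7, 9, 2], [5, 7, 8, 8])

acc = accuracy(true_ids, predicted_ids)
precision_1, recall_1 = macro_precision_recall(true_ids, predicted_ids, zero_division=1.0)
precision_0, recall_0 = macro_precision_recall(true_ids, predicted_ids, zero_division=0.0)
```

In this toy case two of four positions match, so accuracy is 0.5. Token ID 9 is never predicted and token ID 8 never occurs in the truth, so their precision and recall are undefined; counting those as 1 gives macro precision 0.75 and recall 0.625, while counting them as 0 gives 0.5 and 0.375. The choice of `zero_division` therefore has a visible effect on macro-averaged scores.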