<a href="https://colab.research.google.com/github/RealAI-RAI/Code-llama-Fine-Tuning-Post-Patch-Generation/blob/main/Code_llama_Fine_Tuning_Post_Patch_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Setting Up Your Environment**

In [None]:
!pip install transformers==4.39
!pip install accelerate==0.27.2
!pip install peft trl

Successfully installed accelerate-0.27.2 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.19.3 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.1.105


**Import Necessary Modules**

In [None]:
# Model, tokenizer, and training utilities from Hugging Face transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from torch.utils.data import DataLoader
from datasets import Dataset
import pickle

**Loading and Updating a Dataset with Labels**

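This cell loads the dataset from a pickle file, drops entries with an empty `input`, attaches a binary `labels` field based on how closely `input` and `target` match character by character, and saves the updated dataset.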

In [None]:
# Load the dataset from the pickle file
file_path = '/content/preprocessed_data.pkl'
with open(file_path, 'rb') as file:
  dataset_list = pickle.load(file)

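def update_dataset(data_list):
  """Drop entries with an empty 'input' and attach a binary 'labels' field."""
  updated_data = []
  for data in data_list:
    input_code = data.get('input', '')  # Handle potential missing 'input' key
    target_text = data.get('target', '')  # Handle potential missing 'target' key
    if input_code:
      # Fraction of positions where input and target hold the same character.
      # zip() stops at the shorter string, so an empty target scores 0.
      similarity = sum(c1 == c2 for c1, c2 in zip(input_code, target_text)) / len(input_code)
      label = 1 if similarity >= 0.9 else 0

      data['labels'] = label  # Note: this mutates the original dict in place
      updated_data.append(data)
  return updated_data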

filtered_dataset = update_dataset(dataset_list)

output_file_path = '/content/updated_data_with_labels_filtered.pkl'
with open(output_file_path, 'wb') as output_file:
  pickle.dump(filtered_dataset, output_file)
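
The labeling rule above compares `input` and `target` position by position, so it may help to see how it behaves on small strings. The following is a minimal sketch, assuming the same 0.9 threshold; the helper names `positional_similarity` and `label_for` and the toy strings are hypothetical and are not part of the notebook.

```python
def positional_similarity(input_code: str, target_text: str) -> float:
    """Fraction of positions in input_code whose character equals target_text's."""
    if not input_code:
        return 0.0
    # zip() stops at the shorter string, so extra characters are never compared
    matches = sum(c1 == c2 for c1, c2 in zip(input_code, target_text))
    return matches / len(input_code)


def label_for(input_code: str, target_text: str, threshold: float = 0.9) -> int:
    """Binary label: 1 when the positional similarity reaches the threshold."""
    return 1 if positional_similarity(input_code, target_text) >= threshold else 0


original = "x = 1\ny = 2\n"             # 12 characters
one_char_changed = "x = 1\ny = 3\n"     # same length, one character differs
one_char_inserted = " x = 1\ny = 2\n"   # a leading space shifts every position

print(positional_similarity(original, original))           # identical strings
print(label_for(original, one_char_changed))                # 11 of 12 positions match
print(positional_similarity(original, one_char_inserted))   # no position lines up
print(label_for(original, ""))                              # empty target
```

A single inserted character shifts every later position, so the score can fall to zero for a nearly identical patch. An alignment-based measure such as `difflib.SequenceMatcher` from the standard library is one alternative worth considering if that behavior is not intended.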


**Verifying Filtering Logic for Empty Targets**

This cell checks the result of the previous step: it prints every entry in the updated dataset whose `target` is empty and reports how many there are. Because `update_dataset` only drops entries with an empty `input`, entries with an empty `target` can still be present.

In [None]:
# Check for entries with empty targets that are still present after update_dataset
empty_target_count = 0
for data in filtered_dataset:
  if not data.get('target'):
    empty_target_count += 1
    print(f"Empty target found (after filtering): Input code - {data.get('input', '')}")

if empty_target_count > 0:
  print(f"Found {empty_target_count} entries with empty targets after filtering.")
else:
  print("No entries with empty targets found after filtering.")


Empty target found (after filtering): Input code - 

__version__ = '1.1.3'

default_app_config = 'aldryn_events.apps.AldrynEvents'


request_events_event_identifier = 'aldryn_events_current_event'

ORDERING_FIELDS = (
    'start_date', 'start_time', 'end_date', 'end_time', 'pk'
)

ARCHIVE_ORDERING_FIELDS = (
    '-start_date', '-start_time', 'end_date', 'end_time', 'pk'
)


**Counting and Reporting Empty Targets in a Dataset**

This cell iterates through the dataset and reports how many entries have an empty `target`, without printing each one.

In [None]:
# Count and print information about entries with empty targets
empty_target_count = 0
for data in filtered_dataset:
  target_text = data.get('target', '')
  if not target_text:
    empty_target_count += 1
    # print(f"Empty target found: Input code - {data.get('input', '')}")

if empty_target_count > 0:
  print(f"Found {empty_target_count} entries with empty targets.")
else:
  print("No entries with empty targets found.")


Found 641 entries with empty targets.


**Removing Entries with Empty Targets from a Dataset**
This cell filters out entries that have an empty `target` field, creating a new dataset without them, and saves the result to a new pickle file.

In [None]:
# Filter out entries with empty or missing targets (same truthiness check as the counting cells)
filtered_dataset_no_empty_targets = [data for data in filtered_dataset if data.get('target')]

# Save the updated dataset without entries with empty targets to a new pickle file
output_file_path_no_empty_targets = '/content/updated_data_with_labels_filtered_no_empty_targets.pkl'
with open(output_file_path_no_empty_targets, 'wb') as output_file:
    pickle.dump(filtered_dataset_no_empty_targets, output_file)

print(f"Dataset without entries with empty targets saved to: {output_file_path_no_empty_targets}")


Dataset without entries with empty targets saved to: /content/updated_data_with_labels_filtered_no_empty_targets.pkl


**Loading and Verifying a Dataset Without Empty Targets**

In [None]:
# Load the updated dataset without entries with empty targets from the pickle file
with open('/content/updated_data_with_labels_filtered_no_empty_targets.pkl', 'rb') as file:
    filtered_dataset_no_empty_targets = pickle.load(file)

# Count and print information about entries with empty targets
empty_target_count = 0
for data in filtered_dataset_no_empty_targets:
    target_text = data.get('target', '')
    if not target_text:
        empty_target_count += 1

if empty_target_count > 0:
    print(f"Found {empty_target_count} entries with empty targets.")
else:
    print("No entries with empty targets found.")


No entries with empty targets found.


**Checking the Structure of the Dataset**

In [None]:
for example in filtered_dataset_no_empty_targets[:5]:  # Check a few examples
    print("Example keys:", example.keys())
    if 'labels' in example:
        print("Labels found!")
    else:
        print("Labels not found.")

Example keys: dict_keys(['input', 'target', 'labels'])
Labels found!
Example keys: dict_keys(['input', 'target', 'labels'])
Labels found!
Example keys: dict_keys(['input', 'target', 'labels'])
Labels found!
Example keys: dict_keys(['input', 'target', 'labels'])
Labels found!
Example keys: dict_keys(['input', 'target', 'labels'])
Labels found!


**Converting a List of Dictionaries to a Dataset Object**
This cell converts the list of dictionaries into a `Dataset` object from the `datasets` library, the format that Hugging Face tooling such as `Trainer` works with. The conversion builds a dictionary in which each key is a column name and each value is the list of that column's entries across all examples.

In [None]:
# Convert the list of dictionaries into a Dataset object
# Take the column names from the same list that supplies the values
dataset_dict = {key: [d[key] for d in filtered_dataset_no_empty_targets] for key in filtered_dataset_no_empty_targets[0].keys()}
dataset = Dataset.from_dict(dataset_dict)
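
The row-to-column conversion described above can be seen in isolation on plain Python data. This is a minimal sketch with two made-up rows (the values are hypothetical); it stops before the `Dataset.from_dict` call so that it runs without the `datasets` library.

```python
# Two toy rows in the same shape as the notebook's entries (values are made up)
rows = [
    {"input": "a = 1", "target": "a = 2", "labels": 0},
    {"input": "b = 1", "target": "b = 1", "labels": 1},
]

# Every row must expose the same keys, otherwise the comprehension raises KeyError
assert all(row.keys() == rows[0].keys() for row in rows)

# One list per column, in row order
columns = {key: [row[key] for row in rows] for key in rows[0].keys()}

print(columns)
```

The resulting `columns` dictionary is the kind of mapping that `Dataset.from_dict` accepts. Recent versions of `datasets` also provide `Dataset.from_list`, which should accept the list of rows directly.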

**Initializing the CodeLlama Model for Causal Language Modeling**

This cell initializes the CodeLlama model with the Hugging Face `transformers` library. The model ID selects the 7B Python-specialized CodeLlama checkpoint, which targets Python code generation and understanding.

In [None]:
# Initialize the CodeLlama model
model_id = "codellama/CodeLlama-7b-Python-hf"
model = AutoModelForCausalLM.from_pretrained(model_id)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

### Preparing a Dataset for Training with CodeLlama

Preparing the dataset for training with the CodeLlama model involves two key steps:

1. **Tokenizing the Dataset**: The tokenization function is applied to every example with the CodeLlama tokenizer, truncating or padding the `input` text to 512 tokens. This step converts the data into a format that can be fed into the model.

2. **Splitting the Dataset**: The tokenized dataset is split into training and validation sets, with the first 90% of the examples used for training and the remaining 10% for validation. The held-out split makes it possible to evaluate the model on unseen data and to detect overfitting.

In [None]:
# Initialize the CodeLlama tokenizer
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Python-hf")

# Set a padding token in the tokenizer
tokenizer.pad_token = tokenizer.eos_token  # Use the end-of-sequence token as the padding token

# Define the tokenization function
def tokenize_function(example):
    # Tokenize the input text, truncating or padding every sequence to max_length.
    # No return_tensors here: `map` stores plain lists and the Trainer's collator builds tensors later.
    return tokenizer(
        example['input'],
        max_length=512,  # Set an appropriate max_length based on your dataset and tokenizer
        truncation=True,
        padding='max_length',
    )

# Load the updated dataset with labels from the pickle file
file_path = '/content/updated_data_with_labels_filtered_no_empty_targets.pkl'
with open(file_path, 'rb') as file:
    dataset_list = pickle.load(file)

# Convert the freshly loaded list of dictionaries (not the earlier in-memory list) into a Dataset object
dataset_dict = {key: [d[key] for d in dataset_list] for key in dataset_list[0].keys()}
dataset = Dataset.from_dict(dataset_dict)

# Apply tokenization to the dataset using `map` method
tokenized_dataset = dataset.map(tokenize_function, batched=True)

train_size = int(len(tokenized_dataset) * 0.9) # 90% of the dataset for training
train_dataset = tokenized_dataset.select(range(train_size))
val_dataset = tokenized_dataset.select(range(train_size, len(tokenized_dataset)))


Map:   0%|          | 0/3116 [00:00<?, ? examples/s]
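
The 90/10 split above is plain index arithmetic. A minimal sketch, using the 3116 examples reported by the `Map` progress bar; the helper name `split_indices` is hypothetical.

```python
def split_indices(num_examples: int, train_fraction: float = 0.9):
    """Return (train_range, val_range) for an unshuffled head/tail split."""
    train_size = int(num_examples * train_fraction)  # int() truncates toward zero
    return range(train_size), range(train_size, num_examples)


train_idx, val_idx = split_indices(3116)
print(len(train_idx), len(val_idx))
```

Because the split takes the first 90% of rows in their stored order, any ordering in the source file carries over into the two sets. Shuffling before the split, for example with the `Dataset.shuffle` method, is one way to avoid that.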

**Creating DataLoaders for Training and Validation**

This cell creates `DataLoader` objects for the training and validation datasets using PyTorch's `DataLoader` class, specifying a batch size and whether the data is shuffled each epoch. The `train_loader` shuffles the data so that the model does not see the examples in the same order every epoch. The `val_loader` does not shuffle, which keeps validation consistent and reproducible across runs. These loaders are only needed for manual iteration: the `Trainer` used later builds its own loaders from `train_dataset` and `val_dataset`, with the batch sizes set in `TrainingArguments`.

In [None]:
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)


**Displaying Dataset Sizes for Training and Validation**

In [None]:
print(f"Training dataset size: {len(train_dataset)}")
print(f"Validation dataset size: {len(val_dataset)}")


Training dataset size: 2804
Validation dataset size: 312


**TrainingArguments Configuration**

This cell configures `TrainingArguments` for fine-tuning the model. It specifies the output directory, number of epochs, batch sizes, warmup steps, weight decay, logging directory, logging steps, evaluation strategy, and save strategy, and disables reporting to external trackers.

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
#     use_seedable_sampler=True,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    report_to="none",

)
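
It can be useful to check how these settings relate to the length of the run. The sketch below estimates the number of optimizer steps for the 2804 training examples reported earlier, assuming a single device and no gradient accumulation (both assumptions, since the notebook does not state them); the helper name `steps_per_epoch` is hypothetical.

```python
import math


def steps_per_epoch(num_examples: int, per_device_batch_size: int,
                    num_devices: int = 1, grad_accum_steps: int = 1) -> int:
    """Optimizer steps in one epoch when the last partial batch is kept."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    return math.ceil(num_examples / effective_batch)


num_train_epochs = 1
warmup_steps = 500
logging_steps = 100

total_steps = steps_per_epoch(2804, per_device_batch_size=4) * num_train_epochs
warmup_share = warmup_steps / total_steps
log_events = total_steps // logging_steps

print(total_steps, round(warmup_share, 2), log_events)
```

Under these assumptions the learning-rate warmup covers most of the single epoch, so lowering `warmup_steps` or training for more epochs may be worth considering.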


**Fine Tuning Implementation**

The custom trainer overrides the `training_step` method to check that the inputs arrive as a dictionary and to print the shapes of the input and label tensors before proceeding with the training step. The custom trainer is then instantiated with the model, the training arguments, and the training and evaluation datasets, and the model is fine-tuned with it.

In [None]:
from transformers import Trainer, TrainingArguments
import torch
class CustomTrainer(Trainer):
    def training_step(self, model, inputs):
        # Ensure inputs is a dictionary
        if not isinstance(inputs, dict):
            raise TypeError("Inputs must be a dictionary.")

        # Print the shapes of the input and label tensors
        print(f"Input shape: {inputs['input_ids'].shape}")
        print(f"Label shape: {inputs['labels'].shape}")

        # Delegate to the stock training step, which moves the batch to the
        # model's device, computes the loss, and runs the backward pass.
        # Returning the raw loss here would skip backward(), so no weights would update.
        return super().training_step(model, inputs)

# Define the Trainer with the custom class
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Train the model
trainer.train()


Input shape: torch.Size([4, 512])
Label shape: torch.Size([4])
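
The shapes printed above differ: the inputs are `[4, 512]` while the labels are `[4]`, because `labels` holds one similarity flag per example. The language-modeling loss of a causal LM is computed per token, so it expects `labels` with the same shape as `input_ids`. The following is a minimal sketch, on plain Python lists, of one common way to build such labels: copy `input_ids` and replace padded positions with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The helper name `build_causal_lm_labels` and the toy token IDs are hypothetical, and whether to train on `input`, `target`, or their concatenation is a design choice this notebook does not settle.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_causal_lm_labels(input_ids, attention_mask):
    """Copy input_ids, replacing padded positions (mask == 0) with IGNORE_INDEX."""
    return [
        [tok if keep else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch: token ID 2 plays the role of both EOS and padding,
# mirroring `tokenizer.pad_token = tokenizer.eos_token` in the notebook.
batch_input_ids = [[1, 822, 285, 2, 2], [1, 736, 921, 29871, 2]]
batch_attention_mask = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]

token_labels = build_causal_lm_labels(batch_input_ids, batch_attention_mask)
print(token_labels)
```

The mask, rather than the token ID, decides what is ignored, since padding and EOS share an ID here. In the notebook this could be applied inside the tokenization function, in which case the scalar similarity flag would need to live in a differently named column.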


**Evaluating the Model**

This cell runs the model on the validation dataset and prints the evaluation results.

In [None]:
# Evaluate the model
eval_results = trainer.evaluate()

# Print the evaluation results
print(eval_results)
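
When the evaluation batches include token-level labels, the dictionary returned by `trainer.evaluate()` normally contains an `eval_loss` entry, and for a language model that loss is often reported as perplexity. A minimal sketch of the conversion follows; the dictionary values are made up for illustration and are not results from this notebook, and the helper name `perplexity_from_eval` is hypothetical.

```python
import math


def perplexity_from_eval(eval_results: dict):
    """Return exp(eval_loss), or None when no loss was reported."""
    loss = eval_results.get("eval_loss")
    return None if loss is None else math.exp(loss)


example_results = {"eval_loss": 1.0, "epoch": 1.0}  # made-up values
ppl = perplexity_from_eval(example_results)
print(ppl)
```

Perplexity is the exponential of the mean per-token cross-entropy, so lower values indicate that the model assigns higher probability to the validation text.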

**Generate Code**

This cell generates code with the fine-tuned model. It takes an input prompt, encodes it into input IDs with the tokenizer, generates a continuation from those IDs, and decodes the generated output back into text.

In [None]:
# Example of text generation using the trained model
input_prompt = "input prompt ......."
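input_ids = tokenizer.encode(input_prompt, return_tensors="pt").to(model.device)  # Keep the inputs on the model's device
# do_sample=True is needed for temperature to take effect; without it, decoding is greedy
output = model.generate(input_ids, max_length=100, num_return_sequences=1, do_sample=True, temperature=0.7)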
generated_code = tokenizer.decode(output[0], skip_special_tokens=True)

print("Generated code:")
print(generated_code)
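
The `temperature` argument rescales the model's next-token scores before sampling. A minimal pure-Python sketch of that rescaling, using made-up logits for three candidate tokens; the function name `softmax_with_temperature` is hypothetical and this is an illustration of the idea, not the library's implementation.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
p_default = softmax_with_temperature(toy_logits, temperature=1.0)
p_sharp = softmax_with_temperature(toy_logits, temperature=0.7)

print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_sharp])
```

A temperature below 1 concentrates probability on the highest-scoring token, which tends to make sampled code more conservative, while values above 1 flatten the distribution.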
