<a href="https://colab.research.google.com/github/ShaliniAnandaPhD/RefactorEarth/blob/main/Fine_Tuning_Codebert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning CodeBERT for Code Optimization

This notebook is a step-by-step guide to fine-tuning the CodeBERT model on a custom dataset. CodeBERT is a pre-trained encoder model for programming languages, and fine-tuning it on your own codebase can improve its performance on code-understanding tasks such as code classification and code search.

---

## Step 1: Setting Up the Environment

Let's start by installing the necessary libraries and setting up our environment.


In [None]:
!pip install transformers datasets torch gitpython accelerate

Next, we'll fetch the code to train on. This example clones a Git repository with GitPython and uses its source files as the raw data. Replace the placeholder URL below with your own repository.

In [None]:
import git

# Specify the repository URL
repo_url = "https://github.com/yourusername/your-repo.git"

# Clone the repository
repo_path = "./your-repo"
git.Repo.clone_from(repo_url, repo_path)


Now that we have the code files, we'll load and preprocess them for fine-tuning. We'll collect all the Python files (`.py`) and prepare them for tokenization.

In [None]:
import os

# Function to load Python files from the repository
def load_code_files(directory):
    code_snippets = []
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith(".py"):  # We are focusing on Python files
                file_path = os.path.join(root, file)
                with open(file_path, "r", encoding="utf-8") as f:
                    code_snippets.append(f.read())
    return code_snippets

# Load the code snippets from the cloned repo
code_snippets = load_code_files(repo_path)

# Display the first 10 lines of the first snippet as an example
if code_snippets:
    print("\n".join(code_snippets[0].splitlines()[:10]))
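Repositories often contain empty `__init__.py` files and copied modules, which add little to training. A minimal sketch of a cleanup pass that could run before tokenization is shown here; the helper name `clean_snippets` and the `min_chars` threshold are assumptions for illustration, not part of the original notebook.

```python
import hashlib

def clean_snippets(snippets, min_chars=20):
    """Drop near-empty files and exact duplicates, preserving order."""
    seen = set()
    cleaned = []
    for code in snippets:
        text = code.strip()
        # Skip files with almost no content (hypothetical threshold)
        if len(text) < min_chars:
            continue
        # Hash the stripped text so identical files are kept only once
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        cleaned.append(code)
    return cleaned
```

If this fits your data, `code_snippets = clean_snippets(code_snippets)` would slot in right after loading.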


Next, we'll tokenize the code snippets using the CodeBERT tokenizer. Tokenization is necessary to convert the code into a format that the model can understand.

In [None]:
from transformers import RobertaTokenizer

# Load the CodeBERT tokenizer
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")

# Tokenize the code snippets
def tokenize_snippets(snippets):
    return tokenizer(snippets, padding="max_length", truncation=True, max_length=512)

# Apply tokenization
tokenized_snippets = tokenize_snippets(code_snippets)

# Example of a tokenized snippet
print(tokenized_snippets['input_ids'][0][:10])  # Show the first 10 tokens of the first snippet
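Note that `truncation=True` with `max_length=512` silently discards everything past the first 512 tokens of each file. One common alternative is to split long files into overlapping windows. The sketch below works on plain lists of token IDs; the function name `chunk_token_ids` and the default `stride` are assumptions, and it does not add the special start and end tokens, so in practice you would reserve room for those in `max_length`.

```python
def chunk_token_ids(token_ids, max_length=512, stride=128):
    """Split one long list of token ids into overlapping windows.

    Consecutive windows share `stride` tokens of context.
    """
    if stride >= max_length:
        raise ValueError("stride must be smaller than max_length")
    step = max_length - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        # Stop once the current window reaches the end of the sequence
        if start + max_length >= len(token_ids):
            break
    return chunks
```

The Hugging Face tokenizer call also accepts `return_overflowing_tokens=True` together with `stride`, which may be more convenient than chunking by hand.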


Before we begin fine-tuning, we need to organize the tokenized data into a format suitable for training.

In [None]:
import torch
from torch.utils.data import Dataset

# Placeholder labels for supervised fine-tuning (e.g., code classification).
# They are all zeros here; replace them with real labels for your task.
labels = torch.zeros(len(tokenized_snippets['input_ids']), dtype=torch.long)

# Trainer expects each example as a dict keyed by the model's argument names,
# so wrap the tensors in a small Dataset rather than a tuple-based TensorDataset.
# Trainer also builds its own shuffled DataLoader, so none is created here.
class CodeDataset(Dataset):
    def __init__(self, encodings, labels):
        self.input_ids = torch.tensor(encodings['input_ids'])
        self.attention_mask = torch.tensor(encodings['attention_mask'])
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_mask[idx],
            'labels': self.labels[idx],
        }

dataset = CodeDataset(tokenized_snippets, labels)
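Organizing the data usually also means holding some of it back for validation. The following is a rough sketch of a reproducible split over example indices; the helper name `split_indices`, the 10% default, and the seed are assumptions chosen for illustration.

```python
import random

def split_indices(n_examples, val_fraction=0.1, seed=42):
    """Return (train_indices, val_indices) from a seeded shuffle."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Keep at least one validation example when there is more than one example
    n_val = max(1, int(n_examples * val_fraction)) if n_examples > 1 else 0
    return indices[n_val:], indices[:n_val]
```

The resulting index lists can be passed to `torch.utils.data.Subset(dataset, train_idx)` and `Subset(dataset, val_idx)` to obtain separate training and validation datasets.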


Now, we will fine-tune the CodeBERT model on our dataset.

In [None]:
from transformers import RobertaForSequenceClassification, Trainer, TrainingArguments

# Load the pre-trained CodeBERT encoder with a new, randomly initialized
# classification head (2 classes, e.g. recursive vs. non-recursive)
model = RobertaForSequenceClassification.from_pretrained("microsoft/codebert-base", num_labels=2)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./codebert_finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    save_steps=10_000,
    save_total_limit=2,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

# Start fine-tuning
trainer.train()

# Save the fine-tuned model
trainer.save_model("./codebert_finetuned")
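With `save_steps=10_000`, a small dataset may finish training before any intermediate checkpoint is written. A rough way to check is to estimate the number of optimizer steps, assuming a single device, no gradient accumulation, and the last partial batch kept; the helper name `total_training_steps` is hypothetical.

```python
import math

def total_training_steps(n_examples, batch_size=8, epochs=3):
    """Estimate optimizer steps for a single-device run without accumulation."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs
```

For instance, 1,000 snippets with the settings above give 375 steps, far below 10,000, so only the final `trainer.save_model(...)` call would write the model to disk.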


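After fine-tuning, it's important to evaluate the model to see how well it performs on tasks relevant to your use case. Keep in mind that the placeholder training labels above are all zeros, so the model can only learn to predict class 0; the accuracy reported here becomes meaningful once you train with real labels.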

In [None]:
# Sample test code snippets (these would normally come from a separate validation set)
test_code_snippets = [
    "def factorial(n):\n    if n == 0:\n        return 1\n    else:\n        return n * factorial(n-1)",
    "def add_numbers(a, b):\n    return a + b",
    "def fib(n):\n    if n <= 1:\n        return n\n    else:\n        return fib(n-1) + fib(n-2)",
    "def print_hello():\n    print('Hello, world!')"
]

# Corresponding labels (1 for recursive, 0 for non-recursive)
# The first and third snippets are recursive, so they are labeled with 1.
# The second and fourth snippets are non-recursive, so they are labeled with 0.
test_labels = [1, 0, 1, 0]

# Tokenize the test code snippets
test_encodings = tokenizer(test_code_snippets, padding=True, truncation=True, return_tensors="pt", max_length=512)

# Convert labels to tensor
test_labels_tensor = torch.tensor(test_labels)

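# Evaluate the model on the test set
model.eval()  # Set the model to evaluation mode
# Trainer may have moved the model to a GPU, so put the inputs on the same device
test_encodings = {k: v.to(model.device) for k, v in test_encodings.items()}
with torch.no_grad():  # Disable gradient calculation for evaluation
    outputs = model(**test_encodings)
    predictions = torch.argmax(outputs.logits, dim=-1).cpu()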

# Compare the predictions with the actual labels
correct_predictions = (predictions == test_labels_tensor).sum().item()
total_predictions = len(test_labels)
accuracy = correct_predictions / total_predictions

print(f"Model Accuracy: {accuracy * 100:.2f}%")

# Output individual predictions
for i, snippet in enumerate(test_code_snippets):
    label = "Recursive" if predictions[i] == 1 else "Non-Recursive"
    print(f"Snippet {i+1}: {label}")
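Accuracy alone can be misleading when classes are imbalanced, for example when a model predicts "Non-Recursive" for everything. As a small sketch, precision, recall, and F1 for the positive class can be computed from plain Python lists; the function name `binary_metrics` is an assumption for illustration.

```python
def binary_metrics(predictions, labels, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

With the evaluation cell above, this could be called as `binary_metrics(predictions.tolist(), test_labels)`.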


Finally, you can use the fine-tuned model for inference. Because we fine-tuned a sequence-classification head, the saved model is loaded through a classification pipeline. CodeBERT is an encoder-only model, so it is not suited to open-ended code generation or completion.

In [None]:
# Example usage of the fine-tuned model
from transformers import pipeline

code_classifier = pipeline("text-classification", model="./codebert_finetuned", tokenizer=tokenizer)
result = code_classifier("def my_function(n):\n    return my_function(n - 1)")
# Labels default to LABEL_0 / LABEL_1 unless id2label is set in the model config
print(result[0]['label'], result[0]['score'])
