<h1 align="center">Fine-Tuning LLaMA 2 for Enhanced Julia Code Generation</h1>


----


This notebook walks through fine-tuning the LLaMA-2-7B model to generate code in the Julia programming language. The steps are: setting up the environment, preparing the dataset, configuring training, training the model, and evaluating its output. The goal is to illustrate the workflow and techniques involved in adapting a pre-trained language model to a specific code generation task.


## Environment Setup

Before we start, we install and import the necessary libraries:
- `bitsandbytes`: provides the 4-bit quantization used to load the model with less memory.
- `transformers`: provides access to pre-trained models.
- `peft`: Hugging Face's Parameter-Efficient Fine-Tuning library for efficient model adaptation.
- `accelerate`: simplifies running training scripts across different hardware setups.
- `datasets`: Hugging Face's library for loading and manipulating datasets for model training and evaluation.
- `evaluate`: for evaluating and comparing the performance of models across a wide range of metrics.
- `trl` (Transformer Reinforcement Learning): provides the `SFTTrainer` used here for supervised fine-tuning, alongside its reinforcement learning trainers.




In [None]:
!pip install -q -U bitsandbytes
!pip install transformers==4.31
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets
!pip install evaluate
!pip install -qqq trl==0.7.1


Next, we import the Python libraries and modules used throughout the notebook.

In [None]:
import torch
import time
import evaluate
import pandas as pd
import numpy as np
from datasets import Dataset, load_dataset, load_from_disk, DatasetDict
import random

## Dataset Preparation

Dataset preparation involves loading the Julia programming challenges dataset (LeetCode-style problems plus math code), shuffling it so that both kinds of examples are mixed, and then splitting it into training, validation, and test sets.


In [None]:
df = pd.read_csv("/content/Julia_math_leet.csv", encoding='ISO-8859-1')

In [None]:
shuffled_df = df.sample(frac=1).reset_index(drop=True)

In [None]:
shuffled_df

Unnamed: 0,solution
0,function reverse_words(s::String)\n s_s = s...
1,## double pointers\nfunction middle_node_by_do...
2,function find_circle_num(is_connected::Vector{...
3,"function is_same_tree(p::Union{TreeNode, Nothi..."
4,"function all_cells_dist_order(rows::Int, cols:..."
...,...
769,function max_subarray_sum_circular(nums)\n ...
770,function smallest_repunit_div_by_k(K::Int)\n ...
771,"x,n = 0,0\nwhile x<100\n n+=1\n x+=rand(..."
772,"function calculate(s::String)::Int\n res, n..."


In [None]:
dataset = Dataset.from_pandas(shuffled_df)

In [None]:
# Create a DatasetDict with the train split
dataset = DatasetDict({
    'train': dataset
})


In [None]:
# Split the data into a training set (70%) and a held-out set (30%)
train_test_split = dataset['train'].train_test_split(test_size=0.3)  # 70%-30% split

# Split the held-out set from the previous step into validation and test sets
validation_test_split = train_test_split['test'].train_test_split(test_size=0.5)  # Split the 30% into two parts of 15% each

train_data = train_test_split['train']
validation_data = validation_test_split['train']
test_data = validation_test_split['test']

In [None]:
validation_data

Dataset({
    features: ['solution'],
    num_rows: 116
})
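
The split sizes can be checked with a little arithmetic. The sketch below assumes the splitter rounds the held-out share up and gives the remainder to the training side (the scikit-learn convention, which `datasets` appears to follow); `split_sizes` is a hypothetical helper, not part of any library. With the 773 rows shown above, it reproduces the 116-row validation set.

```python
import math

def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test), rounding the held-out share up (assumed convention)."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

n_total = 773  # rows 0..772 in the shuffled DataFrame
n_train, n_holdout = split_sizes(n_total, 0.3)   # 70% / 30%
n_val, n_test = split_sizes(n_holdout, 0.5)      # halve the held-out part

print(n_train, n_val, n_test)
```

Each of the validation and test sets therefore holds about 15% of the data, not 10%.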

#### **Note:**
Because of memory and compute constraints, we fine-tuned on a small subset of data rather than on the larger 'Julia-Proof-Pile-2' dataset available from Hugging Face. The following steps show how to extend fine-tuning to this larger dataset:

In [None]:
# Load the dataset from the Hugging Face Hub
hf_dataset = load_dataset("ajibawa-2023/Julia-Proof-Pile-2")

In [None]:
# Split the 'train' dataset into train and test datasets with a 70-30 split
train_test_split_2 = hf_dataset['train'].train_test_split(test_size=0.3)

hf_dataset = DatasetDict({
    'train': train_test_split_2['train'],
    'test': train_test_split_2['test']
})

# You can then use these splits as follows
hf_train_dataset = hf_dataset['train']
hf_eval_dataset = hf_dataset['test']

Small modifications are needed as well in the `SFTTrainer` cell. Please scroll to the section titled "SFTTrainer Configuration" for details.
[cell_link](https://colab.research.google.com/drive/1Yl1Fr9fLlipvJUmChfQO1rtVS2IxKF5q#scrollTo=t5hwFT0BvBwa&line=6&uniqifier=1)

## Initializing the Model and Tokenizer

To begin fine-tuning, we first select and initialize the base model and tokenizer. We use the pre-trained LLaMA-2-7B model, which balances computational cost against the ability to generate Julia code. We load it with a BitsAndBytes 4-bit quantization configuration, which substantially reduces memory usage at the cost of a small loss in numerical precision. We also set up the tokenizer, which converts our data into the token IDs that the model consumes.
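
As a rough illustration of why 4-bit loading matters, the sketch below estimates weight storage at different precisions. The parameter count of about 6.74 billion is an assumption based on the commonly cited size of LLaMA-2-7B, and the estimate ignores quantization constants, activations, and optimizer state.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate storage for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 6.74e9  # assumed parameter count for LLaMA-2-7B

for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("4-bit (NF4)", 4)]:
    print(f"{label:>18}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights need roughly 13.5 GB in half precision and about 3.4 GB in 4-bit, which is what makes training feasible on a single Colab GPU.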


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Model ID from Hugging Face
model_id =  "NousResearch/Llama-2-7b-hf"

# Configuration for BitsAndBytes to optimize model memory usage and compute efficiency.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # Enables loading model weights in 4-bit precision.
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4", # Specifies the quantization type
    bnb_4bit_compute_dtype=torch.bfloat16  # Use bfloat16 for computation
)

# Download and load the model with the specified configuration for optimized performance.
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")

# Download and set up the tokenizer for the model, including setting padding tokens and sides.
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token # Set the padding token to be the end-of-sentence (EOS) token.
tokenizer.padding_side = "right" # Pad sequences to the right side.



In [None]:
from peft import prepare_model_for_kbit_training

# Two Optimization techniques :

# Enable gradient checkpointing to reduce memory consumption during training.
# This technique stores only a subset of intermediate activations, recalculating others during the backward pass.
model.gradient_checkpointing_enable()

# Prepare the quantized model for k-bit training using the PEFT library.
# This freezes the base model weights and upcasts some remaining parameters (such as layer norms) to float32 for numerical stability.
model = prepare_model_for_kbit_training(model)

In [None]:
def print_trainable_parameters(model):
    """
    Prints the total number of parameters in the model, the number of trainable parameters,
    and the percentage of parameters that are trainable.
    ---------
    This function iterates through all the parameters of the model, checks if they are
    trainable (requires_grad == True), and accumulates their counts.
    """
    trainable_params = 0  # count of trainable parameters
    all_param = 0  # total count of parameters
    for _, param in model.named_parameters():
        all_param += param.numel()  # Accumulate total number of parameters
        if param.requires_grad:
            # If parameter requires gradient, it's trainable
            trainable_params += param.numel()

    trainable_percentage = 100 * trainable_params / all_param  # percentage of trainable parameters
    print(f"Trainable params: {trainable_params} || All params: {all_param} || Trainable%: {trainable_percentage:.2f}")

## Training Configuration

Setting up the training configuration involves preparing the model for training with specific hardware optimizations and defining the training parameters. This includes configuring the LoRA (Low-Rank Adaptation) parameters, which are key to fine-tuning the model efficiently on our dataset.

In [None]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16, # LoRA attention dimension (the rank)
    lora_alpha=64, # The alpha parameter for LoRA scaling.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], # Attention projection layers; names are specific to Llama models.
    lora_dropout=0.2, # The dropout probability for LoRA layers
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
print_trainable_parameters(model)
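
The number of parameters LoRA adds can be estimated by hand. Each adapted linear layer with `d_in` inputs and `d_out` outputs gains two matrices, `A` of shape `(r, d_in)` and `B` of shape `(d_out, r)`. The sketch below assumes the LLaMA-2-7B dimensions (hidden size 4096, 32 decoder layers, square attention projections); `lora_param_count` is an illustrative helper, not a PEFT function.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added to one linear layer: A is (r, d_in), B is (d_out, r)."""
    return r * d_in + d_out * r

HIDDEN = 4096      # assumed hidden size of LLaMA-2-7B
N_LAYERS = 32      # assumed number of decoder layers
TARGETS = ["q_proj", "k_proj", "v_proj", "o_proj"]
R, ALPHA = 16, 64  # values from lora_config

per_module = lora_param_count(HIDDEN, HIDDEN, R)
total = per_module * len(TARGETS) * N_LAYERS
scaling = ALPHA / R  # factor applied to the low-rank update

print(f"per module: {per_module:,}  total: {total:,}  scaling: {scaling}")
```

With these settings the adapters add about 16.8 million trainable parameters, and the low-rank update is scaled by `lora_alpha / r = 4`.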

### Reusing and Further Fine-Tuning the Fine-Tuned Model
The code below shows how to load an already fine-tuned model and its tokenizer for additional rounds of fine-tuning. We wanted to fine-tune the model in stages (for example, two epochs at a time). However, this incremental approach can increase the risk of out-of-memory errors due to the cumulative computational load.

from peft import AutoPeftModelForCausalLM

peft_model_dir = "./peft-julia-leet20" # Model path
trained_model = AutoPeftModelForCausalLM.from_pretrained(
    peft_model_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
)

tokenizer = AutoTokenizer.from_pretrained(peft_model_dir)
model = get_peft_model(trained_model, lora_config)

## Model Training

In this section, we start the actual training process. Training arguments are defined, specifying details such as the number of epochs, batch size, and where to save checkpoints.


In [None]:
OUTPUT_DIR = "llama2-docsum-adapter"

In [None]:
from transformers import TrainingArguments

# Configuring training arguments to customize the fine-tuning process.
training_arguments = TrainingArguments(
    per_device_train_batch_size=4,  # Batch size per device during training.
    gradient_accumulation_steps=4,  # Number of batches whose gradients are accumulated before each optimizer update.
    optim="paged_adamw_32bit",  # Optimizer type, optimized for lower memory usage.
    logging_steps=1,  # Log metrics every specified number of steps.
    learning_rate=1e-4,  # Initial learning rate for the optimizer.
    fp16=True,
    max_grad_norm=0.3,
    num_train_epochs=10,  # Total number of training epochs.
    evaluation_strategy="steps",  # Evaluate every `eval_steps` steps rather than at the end of each epoch.
    eval_steps=0.2,  # A float below 1 is interpreted as a fraction of the total training steps.
    warmup_ratio=0.05,
    save_strategy="epoch",  # Save the model at the end of each epoch.
    group_by_length=True,
    output_dir=OUTPUT_DIR,  # Directory where the model predictions and checkpoints will be saved.
    #report_to="tensorboard",
    save_safetensors=True,
    lr_scheduler_type="cosine",
    seed=42,  # Random seed for initialization.
)

model.config.use_cache = False  # Disable the KV cache during training; it conflicts with gradient checkpointing and triggers warnings.

In [None]:
!pip install --upgrade trl peft


In [None]:
# !pip install SFTTrainer

In [None]:
from trl import SFTTrainer

In [None]:
## SFTTrainer Configuration


# Initialize the SFTTrainer with the model, datasets, configuration, and training arguments.
trainer = SFTTrainer(
    model=model,  # The model to be fine-tuned.
    train_dataset=train_data,  # The dataset for training.   #### To train on the Hugging Face dataset, replace "train_data" with "hf_train_dataset"
    eval_dataset=validation_data,  # The dataset for evaluation.  #### and replace "validation_data" with "hf_eval_dataset"
    peft_config=lora_config,  # Configuration for PEFT.
    dataset_text_field="solution",  # Specifies the column in the dataset that contains the code to be used.  #### Here as well, "solution" should be replaced with "text"
    max_seq_length=1024,  # The maximum length of the sequence to be considered.
    tokenizer=tokenizer,  # The tokenizer to be used for encoding the texts.
    args=training_arguments,  # Training arguments defined earlier.
)

# Start the training process.
trainer.train()



| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 68   | 1.2053        | 0.972535        |
| 136  | 1.181         | 0.953567        |
| 204  | 0.6276        | 1.044986        |
| 272  | 0.538         | 1.193086        |
| 340  | 0.4674        | 1.25564         |




TrainOutput(global_step=340, training_loss=0.7503645330229226, metrics={'train_runtime': 1326.1059, 'train_samples_per_second': 4.08, 'train_steps_per_second': 0.256, 'total_flos': 1.7847030051201024e+16, 'train_loss': 0.7503645330229226, 'epoch': 10.0})
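
The step numbers in the log follow from the batch settings. The sketch below assumes 541 training examples (773 rows minus the 30% held out) and a single GPU, and mirrors in simplified form how the Hugging Face `Trainer` is understood to derive its step counts; the exact rounding inside the library is an assumption.

```python
import math

n_train = 541            # assumed size of the training split
per_device_batch = 4
grad_accum = 4
epochs = 10
eval_fraction = 0.2      # eval_steps given as a ratio of total steps
warmup_ratio = 0.05

batches_per_epoch = math.ceil(n_train / per_device_batch)      # mini-batches per epoch
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)    # optimizer steps per epoch
total_steps = updates_per_epoch * epochs
eval_every = math.ceil(total_steps * eval_fraction)
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(total_steps, eval_every, warmup_steps)
```

This gives 340 optimizer steps in total with an evaluation every 68 steps, matching `global_step=340` and the step column in the results above.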

In [None]:
# path where the fine-tuned model and tokenizer will be saved.
save_peft_model_path="./peft-julia-math10"

# Save the fine-tuned model to the specified path.
trainer.model.save_pretrained(save_peft_model_path)
# Save the tokenizer associated with the fine-tuned model to the same path.
tokenizer.save_pretrained(save_peft_model_path)


('./peft-julia-math10/tokenizer_config.json',
 './peft-julia-math10/special_tokens_map.json',
 './peft-julia-math10/tokenizer.model',
 './peft-julia-math10/added_tokens.json',
 './peft-julia-math10/tokenizer.json')

## Generating Code with the Fine-Tuned Model

After training, we'll use our fine-tuned model to generate Julia code. This section demonstrates how to use the model to make predictions.

In [None]:
from transformers import TextStreamer
model.config.use_cache = True
model.eval()

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096, padding_idx=0)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.2, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer

In [None]:
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

peft_model_dir = "./peft-julia-math10" # model path

# load base LLM model and tokenizer
trained_model = AutoPeftModelForCausalLM.from_pretrained(
    peft_model_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
)
tokenizer = AutoTokenizer.from_pretrained(peft_model_dir)


For the testing part, we'll use a dataset containing only "Leet" problems, as it includes both the statement and solution columns, which is not the case for math problems.
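
The prompt template used in the following cells can be wrapped in a small helper so that every example is formatted identically. `build_prompt` is a hypothetical name introduced here for illustration, and the sample statement is made up.

```python
def build_prompt(statement: str) -> str:
    """Format a problem statement with the instruction template used in this notebook."""
    return (
        "\nSolve the following problem using Julia programming language.\n"
        "\n### Input:\n"
        f"{statement}\n"
        "\n### Solution:\n"
    )

# Hypothetical problem statement, used only to show the resulting layout.
example = build_prompt("Reverse the words in a string.")
print(example)
```

Keeping the template in one place avoids subtle differences, such as stray indentation, between the prompt used for a single example and the one used in an evaluation loop.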


In [None]:
test_dataset = pd.read_csv("/content/Julia_test_data_leet.csv", encoding='ISO-8859-1')

In [None]:
index = 1

statement = test_dataset['statement'][index]
solution  = test_dataset['solution'][index]

prompt = f"""
Solve the following problem using Julia programming language.

### Input:
{statement}

### Solution:
"""

input_ids = tokenizer(prompt, return_tensors='pt', truncation=True).input_ids.to(trained_model.device)
outputs = trained_model.generate(input_ids=input_ids, max_new_tokens=100)
# Decode only the newly generated tokens so the prompt is not repeated in the output.
output = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True)

dash_line = '-'.join('' for x in range(200))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'FINE TUNED MODEL GENERATED CODE :\n{output}')

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
INPUT PROMPT:

Solve the following problem using Julia programming language. 

### Input:

 Find Common Characters
 ---
 
 Given an array `A` of strings made only from lowercase letters, return a list
 of all characters that show up in all strings within the list **(including
 duplicates)**.  For example, if a character occurs 3 times in all strings but
 not 4 times, you need to include that character three times in the final
 answer.
 
 You may return the answer in any order.
 
 
 
 **Example 1:**
 
     
     
     Input: ["bella","label","roller"]
     Output: ["e","l","l"]
     
 
 **Example 2:**
 
     
     
     Input: ["cool","lock","cook"]
     Output: ["c","o"]
     
 
 
 
 **Note:**
 
   1. `1 <= A.length <= 100`
   2. `1 <= A[i].length <= 100`
   3. `A[i][j]` is a lowercase let

## Merging the Trained LoRA Adapter with the Base Model and Pushing to the Hub

---



In [None]:
trained_model = AutoPeftModelForCausalLM.from_pretrained(
    peft_model_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
)
# Merge LoRA and base model
merged_model = trained_model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("merged_model",safe_serialization=True)
tokenizer.save_pretrained("merged_model")

# push merged model to the hub
# merged_model.push_to_hub("user/repo")
# tokenizer.push_to_hub("user/repo")


('merged_model/tokenizer_config.json',
 'merged_model/special_tokens_map.json',
 'merged_model/tokenizer.model',
 'merged_model/added_tokens.json',
 'merged_model/tokenizer.json')
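
Conceptually, merging folds the low-rank update into the base weight, `W' = W + (alpha / r) * B A`, so that inference no longer needs the separate adapter matrices. The toy sketch below uses plain Python lists and made-up 2x2 values to show that the merged weight gives the same output as the base-plus-adapter path. It illustrates the arithmetic only, not how `merge_and_unload` is implemented.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy values (hypothetical): a rank-1 adapter on a 2x2 weight.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[3.0, 4.0]]        # shape (r, d_in)
B = [[1.0], [2.0]]      # shape (d_out, r)
merged = merge_lora(W, A, B, alpha=2, r=1)

x = [[1.0], [1.0]]      # column-vector input
merged_out = matmul(merged, x)
adapter_out = matmul(B, matmul(A, x))
# Base path plus scaled adapter path (scale = alpha / r = 2).
separate_out = [[wx[0] + 2.0 * ax[0]] for wx, ax in zip(matmul(W, x), adapter_out)]

print(merged, merged_out, separate_out)
```

Both paths produce the same result, which is why the merged model can be saved and served as an ordinary checkpoint.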

### Performance Evaluation Using CodeBLEU

CodeBLEU metric evaluation allows us to quantitatively assess the quality of the code generated by our model. This step involves installing the CodeBLEU package and computing the metric on our test dataset to understand the effectiveness of our fine-tuning efforts.

In [None]:
!pip -q install codebleu



#### CodeBLEU score calculation
The CodeBLEU metric evaluates the quality of generated code based on four components:
1. N-gram match score: Measures the overlap of n-grams between the reference and generated code, as in standard BLEU.
2. Weighted n-gram match score: Similar to the n-gram match, but gives higher weight to important tokens such as language keywords.
3. Syntax match score: Compares the abstract syntax tree (AST) of the generated code with that of the reference.
4. Dataflow match score: Assesses semantic similarity by comparing the data dependencies between variables in the generated and reference code.

The current version of CodeBLEU does not directly support Julia, so for demonstration purposes we use Python as a proxy language because it shares some surface syntax with Julia. Since a Python parser is not designed for Julia code, the syntax match component is less reliable here, and we lower its weight to 10%. The resulting score should be read as a rough approximation.
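
The final score is a weighted sum of the four components. The sketch below shows that combination with the weights used in this notebook; the component values are made up for illustration, and `combine_codebleu` is a hypothetical helper rather than part of the `codebleu` package.

```python
def combine_codebleu(ngram, weighted_ngram, syntax, dataflow,
                     weights=(0.3, 0.3, 0.1, 0.3)):
    """Weighted sum of the four CodeBLEU components; weights should sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    components = (ngram, weighted_ngram, syntax, dataflow)
    return sum(w * c for w, c in zip(weights, components))

# Hypothetical component scores for one generated solution.
score = combine_codebleu(ngram=0.40, weighted_ngram=0.45, syntax=0.10, dataflow=0.50)
print(f"CodeBLEU: {score:.3f}")
```

Lowering the third weight means a poor syntax match, which is expected when Julia is parsed as Python, pulls the overall score down less.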


In [None]:
from codebleu import calc_codebleu

total_score = 0.0  # running total; avoids shadowing the built-in sum()
num_examples = len(test_dataset)  # iterate over the Leet test set, which has both columns

for i in range(num_examples):
  reference = test_dataset["solution"][i] # a true solution from the test dataset

  # Our model's output
  statement = test_dataset["statement"][i]
  # Same template as the single-example cell (no extra indentation inside the string).
  prompt = f"""
Solve the following problem using Julia programming language.

### Input:
{statement}

### Solution:
"""

  input_ids = tokenizer(prompt, return_tensors='pt', truncation=True).input_ids.to(trained_model.device)
  outputs = trained_model.generate(input_ids=input_ids, max_new_tokens=200)
  # Decode only the newly generated tokens so the prompt is excluded from the candidate.
  output = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True)

  # Calculate CodeBLEU score
  result = calc_codebleu([reference], [output], lang="python", weights=(0.3, 0.3, 0.1, 0.3), tokenizer=None)
  total_score += result["codebleu"]

print(f"CodeBLEU metric for the test dataset: {total_score / num_examples}")