<a href="https://colab.research.google.com/github/Bryan-Az/TransformerLLM-Finetuning/blob/main/%5BFine_tuning_%26_Evaluation%5D_JinaAI's_Starcoder_with_C_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Case Study of JinaAI's Starcoder Model Trained on Programming Textbooks
In this evaluation notebook, I use a textbook on programming in C# to evaluate Starcoder's ability to answer C# programming questions accurately. I then fine-tune the Starcoder model and run a small ablation study to check whether the model responds more accurately after fine-tuning.
Using the A100 environment in Google Colab, fine-tuning took about 20 minutes on the example textbook dataset.

Key Reference: https://colab.research.google.com/drive/1T4IfGfDJ8uxgU8XBPpMZivw_JThzdQim?usp=sharing

## Imports and Installs

In [1]:
%%capture
!pip install transformers accelerate huggingface_hub peft
!pip install pypdf2

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
import PyPDF2
import torch
import re
import json
from sklearn.model_selection import train_test_split
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

## Loading the C\# Programming Textbook
This textbook is listed as open source at https://github.com/EbookFoundation/free-programming-books/blob/main/books/free-programming-books-langs.md#csharp and is therefore available for public use. It is downloaded below from my GitHub repository.

In [4]:
!wget 'https://github.com/Bryan-Az/TransformerLLM-Finetuning/raw/main/data/Programming_C_ISO.pdf' -O 'Programming_C_ISO.pdf'

--2024-11-23 02:27:52--  https://github.com/Bryan-Az/TransformerLLM-Finetuning/raw/main/data/Programming_C_ISO.pdf
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/Bryan-Az/TransformerLLM-Finetuning/main/data/Programming_C_ISO.pdf [following]
--2024-11-23 02:27:53--  https://raw.githubusercontent.com/Bryan-Az/TransformerLLM-Finetuning/main/data/Programming_C_ISO.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2372927 (2.3M) [application/octet-stream]
Saving to: ‘Programming_C_ISO.pdf’


2024-11-23 02:27:54 (15.2 MB/s) - ‘Programming_C_ISO.pdf’ saved [2372927/2372927]



In [5]:

textbook_path = 'Programming_C_ISO.pdf'
json_path = 'Programming_C_ISO.json'
all_data = []
with open(textbook_path, 'rb') as f:
    pdf_reader = PyPDF2.PdfReader(f)
    ten_page_bin = ''
    for page_num, page in enumerate(pdf_reader.pages):
        # Accumulate the extracted text of ten consecutive pages into one bin
        ten_page_bin += page.extract_text()
        if (page_num + 1) % 10 == 0:
            print(f"Processed up to page {page_num + 1}")
            all_data.append({'10_page_bin': ten_page_bin})
            ten_page_bin = ''
    # Keep the final partial bin when the page count is not a multiple of ten
    if ten_page_bin:
        all_data.append({'10_page_bin': ten_page_bin})

# Save the bins so later cells can reload them without re-parsing the PDF
with open(json_path, 'w') as json_file:
    json.dump(all_data, json_file)

Processed up to page 10
Processed up to page 20
Processed up to page 30
Processed up to page 40
Processed up to page 50
Processed up to page 60
Processed up to page 70
Processed up to page 80
Processed up to page 90
Processed up to page 100
Processed up to page 110
Processed up to page 120
Processed up to page 130
Processed up to page 140
Processed up to page 150
Processed up to page 160
Processed up to page 170
Processed up to page 180
Processed up to page 190
Processed up to page 200
Processed up to page 210
Processed up to page 220
Processed up to page 230
Processed up to page 240
Processed up to page 250
Processed up to page 260
Processed up to page 270
Processed up to page 280
Processed up to page 290
Processed up to page 300
Processed up to page 310
Processed up to page 320
Processed up to page 330
Processed up to page 340
Processed up to page 350
Processed up to page 360
Processed up to page 370
Processed up to page 380
Processed up to page 390
Processed up to page 400
Processed

In [6]:
with open(json_path) as f:
    text = json.load(f)

In [7]:
train_path = 'train_dataset.txt'
test_path = 'test_dataset.txt'

In [9]:
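def build_text_files(data_json, dest_path):
    # Join all bins into one whitespace-normalized string and write it to dest_path
    data = ''
    for texts in data_json:
        summary = str(texts['10_page_bin']).strip()
        summary = re.sub(r"\s", " ", summary)
        data += summary + "  "
    # The context manager closes the file so the text is flushed to disk
    with open(dest_path, 'w') as f:
        f.write(data)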

In [10]:
train, test = train_test_split(text,test_size=0.15)

build_text_files(train,'train_dataset.txt')
build_text_files(test,'test_dataset.txt')

print("Train dataset length: "+str(len(train)))
print("Test dataset length: "+ str(len(test)))

Train dataset length: 45
Test dataset length: 8


In [11]:
# encoding using the AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("jinaai/starcoder-1b-textbook")

In [12]:
def load_dataset(train_path,test_path,tokenizer):
    train_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=train_path,
          block_size=128)

    test_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=test_path,
          block_size=128)

    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset,test_dataset,data_collator

train_dataset,test_dataset,data_collator = load_dataset(train_path,test_path,tokenizer)
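
As a rough illustration of what `block_size=128` does, the sketch below cuts a token stream into consecutive fixed-length blocks and drops a trailing remainder that is too short to fill a block. It is a simplified stand-in, not the `TextDataset` implementation itself; `chunk_token_ids` and the toy ids are hypothetical names used only for this example.

```python
def chunk_token_ids(token_ids, block_size):
    """Split a flat list of token ids into consecutive blocks of block_size.

    Hypothetical helper for illustration: a trailing remainder shorter than
    block_size is dropped, so every returned block has the same length.
    """
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]


# Toy example: 10 token ids cut into blocks of 4; the last 2 ids are dropped.
toy_ids = list(range(10))
blocks = chunk_token_ids(toy_ids, block_size=4)
print(blocks)
```

Because every block has the same length, no padding is needed when the collator batches examples for causal language modeling.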



## Loading the Model and Applying LoRA Fine-tuning

### Loading the Experimental Model
This is the pre-trained code model, which was trained on a variety of programming textbooks.

In [14]:
base_model = AutoModelForCausalLM.from_pretrained(
        "jinaai/starcoder-1b-textbook", device_map='auto')

In [15]:
# Inspect the model's layers to find suitable target modules
unique_target = set()
for name, module in base_model.named_modules():
    unique = name.split('.')[-1]
    # Keep only alphabetic names, skipping the numeric indices of the layer list
    if re.match(r'[a-zA-Z]+', unique):
        unique_target.add(unique)
unique_target

{'act',
 'attn',
 'attn_dropout',
 'c_attn',
 'c_fc',
 'c_proj',
 'drop',
 'dropout',
 'h',
 'lm_head',
 'ln_1',
 'ln_2',
 'ln_f',
 'mlp',
 'resid_dropout',
 'transformer',
 'wpe',
 'wte'}

Valid LoRA targets (identified with the help of Gemini in Colab):

1. **c_attn**: the linear projection inside the attention block that produces the query, key, and value vectors.
2. **c_proj**: the output projection. The name appears in both the attention block and the MLP.
3. **c_fc**: the first fully connected layer of the MLP.
4. **lm_head**: the final linear layer used for language modeling. It can be a LoRA target, but this is less common.

In [16]:
ls = LoraConfig(
    r = 16, # LoRA rank; kept relatively small for a 1B-parameter model
    target_modules = ['c_attn', 'c_proj', 'c_fc', 'lm_head'],
    lora_alpha = 16, # scaling factor; the LoRA update is scaled by lora_alpha / r
    lora_dropout = 0.05, # dropout applied to the LoRA layers
    bias = "none", # do not train bias terms
    # Modules trained in full and saved with the adapter. Note that no module named
    # "embed_tokens" appears in the list above; the token embedding there is "wte".
    modules_to_save = ["lm_head", "embed_tokens"]
)
finetuned_model = get_peft_model(base_model, ls)
finetuned_model.print_trainable_parameters()

trainable params: 111,771,648 || all params: 1,248,978,944 || trainable%: 8.9490
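
The printed count can be reproduced by hand, which also shows where the trainable parameters come from. The sketch below assumes model dimensions that are not stated in this notebook (hidden size 2048, 24 transformer blocks, a fused `c_attn` output width of 2304, an MLP width of 8192, and a vocabulary of 49152); treat them as assumptions to check against `base_model.config`. It also assumes that `lm_head` is trained as a full copy through `modules_to_save` rather than through a LoRA adapter.

```python
# Assumed model dimensions (not stated in the notebook; check base_model.config)
hidden = 2048       # embedding width
n_layers = 24       # number of transformer blocks
attn_out = 2304     # fused c_attn output width (query plus shared key/value)
mlp_width = 8192    # width of the MLP hidden layer
vocab = 49152       # vocabulary size
r = 16              # LoRA rank from the LoraConfig above


def lora_params(in_features, out_features, rank):
    # A LoRA adapter adds two matrices: A (rank x in) and B (out x rank)
    return rank * (in_features + out_features)


per_block = (
    lora_params(hidden, attn_out, r)       # attention c_attn
    + lora_params(hidden, hidden, r)       # attention c_proj
    + lora_params(hidden, mlp_width, r)    # MLP c_fc
    + lora_params(mlp_width, hidden, r)    # MLP c_proj
)
lora_total = per_block * n_layers
lm_head_full = hidden * vocab              # lm_head trained in full (assumption)

trainable = lora_total + lm_head_full
print(lora_total, lm_head_full, trainable)
```

Under these assumptions the LoRA adapters contribute 11,108,352 parameters and the full `lm_head` copy contributes 100,663,296, for a total of 111,771,648, which matches the printed value. Most of the 8.9% trainable share therefore comes from `lm_head`, not from the low-rank adapters.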


In [17]:
type(base_model)

In [18]:
type(finetuned_model)

### Training the Fine-tuned Model

In [19]:
# hyperparameters
batch_size = 16 # sequences per device per training step
max_iters = 10 # passed to TrainingArguments as the number of training epochs
eval_interval = 20 # passed to TrainingArguments as eval_steps
learning_rate = 1e-3

In [20]:
training_args = TrainingArguments(
    output_dir="./gpt-bigcode-Cfinetuned", # the output directory
    overwrite_output_dir=True, # overwrite the content of the output directory
    num_train_epochs=max_iters, # number of training epochs
    per_device_train_batch_size=batch_size, # batch size for training
    per_device_eval_batch_size=batch_size,  # batch size for evaluation
    eval_steps=eval_interval, # update steps between evaluations; only used when the evaluation strategy is "steps"
    save_steps=400, # save a checkpoint every 400 update steps
    warmup_steps=20, # number of warmup steps for the learning rate scheduler
    learning_rate=learning_rate
    )

trainer = Trainer(
    model=finetuned_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

In [21]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 1.3024 |
| 1000 | 0.4411 |
| 1500 | 0.1652 |




TrainOutput(global_step=1740, training_loss=0.5605262811156525, metrics={'train_runtime': 1146.565, 'train_samples_per_second': 24.211, 'train_steps_per_second': 1.518, 'total_flos': 2.412403727794176e+16, 'train_loss': 0.5605262811156525, 'epoch': 10.0})
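
The step count in `TrainOutput` can be related back to the size of the training set. The sketch below is a back-of-the-envelope check that assumes a single GPU and no gradient accumulation (neither is set explicitly in this notebook), so each optimizer step consumes one batch of 16 blocks.

```python
import math

# Values copied from the TrainOutput and the training cells above
global_step = 1740
epochs = 10
batch_size = 16
block_size = 128  # tokens per training example, as passed to TextDataset

# Assumption: one device and no gradient accumulation
steps_per_epoch = global_step // epochs

# The last batch of an epoch may be partial, so only a range can be recovered
max_blocks = steps_per_epoch * batch_size
min_blocks = (steps_per_epoch - 1) * batch_size + 1
assert math.ceil(max_blocks / batch_size) == steps_per_epoch

print(f"steps per epoch: {steps_per_epoch}")
print(f"training blocks: between {min_blocks} and {max_blocks}")
print(f"training tokens: at most {max_blocks * block_size}")
```

Under these assumptions each epoch has 174 steps, which corresponds to between 2,769 and 2,784 training blocks, or at most 356,352 tokens.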

## Experimental Ablation Study

### Loading a Control Model
This is a second, independently loaded copy of the same base checkpoint that was used for fine-tuning.

In [22]:
#loading the base model without lora fine-tuning
starcoder_base_model = AutoModelForCausalLM.from_pretrained(
        "jinaai/starcoder-1b-textbook", device_map='auto')

In [23]:
type(starcoder_base_model)

### Checking the Model Objects' Identity
The `get_peft_model` function wraps the base model and keeps a reference to it rather than copying it. This can be checked by comparing `id()` values or with the `==` operator.

For these model objects `==` behaves like an identity check, because `torch.nn.Module` does not define `__eq__`. The `id()` function returns the identity of the object a variable refers to, so equal ids mean the two variables name the same object.

Checking these values shows that the object referenced by `base_model.base_model` before the LoRA adapters were applied is the same object referenced by `finetuned_model.base_model.base_model`.

In [24]:
print('### These variables reference the same object ###')
print('Memory address of base model: ', id(base_model.base_model))
print('Memory address of base model -> lora model: ', id(finetuned_model.base_model.base_model))
print('\n')
print('### These variables reference different objects ###')
print('Memory address of base model: ', id(base_model.base_model))
print('Memory address of second base model: ', id(starcoder_base_model.base_model))

### These variables reference the same object ###
Memory address of base model:  135221227098496
Memory address of base model -> lora model:  135221227098496


### These variables reference different objects ###
Memory address of base model:  135221227098496
Memory address of second base model:  135219776232064


In [25]:
# `is` compares object identity; `==` does the same here because nn.Module does not define __eq__
print(base_model.base_model == finetuned_model.base_model.base_model) # the base model is referenced by the peft model
print(base_model == finetuned_model) # the peft wrapper itself is a different object from the model it wraps
print(base_model == starcoder_base_model) # a new model is needed to avoid calling the fine-tuned / peft model

True
False
False
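
The behaviour of `==` above follows from Python's default equality. As a minimal sketch using hypothetical stand-in classes (`Inner` and `Wrapper` are not part of PEFT or transformers), a class that does not define `__eq__` compares by identity, so a wrapper that stores a reference shares the wrapped object, while a separately constructed object compares unequal.

```python
class Inner:
    """Stand-in for a model; defines no __eq__, so == falls back to identity."""


class Wrapper:
    """Stand-in for a wrapper that keeps a reference instead of a copy."""

    def __init__(self, inner):
        self.inner = inner


first = Inner()
wrapped = Wrapper(first)
second = Inner()

print(wrapped.inner is first)          # True: the wrapper references the same object
print(wrapped.inner == first)          # True: default == is an identity check
print(first == second)                 # False: two separately created objects
print(id(wrapped.inner) == id(first))  # True: same identity
```

This is why a separately loaded control model is needed: any change made through the wrapper is a change to the object it references.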


### Evaluating their Responses to a C\# Specific Example

In [26]:
prompt = '''
generate a C# example for the coding principle that "The grouping of an expression does not completely determine its evaluation".
'''
inputs = tokenizer(prompt.rstrip(), return_tensors="pt").to("cuda")

In [27]:
base_generation_output = starcoder_base_model.generate(
    **inputs,
    max_new_tokens=256,
    eos_token_id=tokenizer.eos_token_id,
    return_dict_in_generate=True,
)

base_s = base_generation_output.sequences[0]
base_output = tokenizer.decode(base_s, skip_special_tokens=True)

print(base_output)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.



generate a C# example for the coding principle that "The grouping of an expression does not completely determine its evaluation".

# Exercise
# Given a list of numbers, find the sum of all the even numbers in the list.

numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Initialize a variable to store the sum of even numbers
even_sum = 0

# Iterate over each number in the list
for number in numbers:
    # Check if the number is even
    if number % 2 == 0:
        # Add the even number to the sum
        even_sum += number

# Print the sum of even numbers
print(even_sum)

# Expected output: 30

# Now, let's try to complete the code by using the "is not" operator to check if the sum of even numbers is not equal to 30.

# Your task is to complete the code below to check if the sum of even numbers is not equal to 30.
# If it is not equal to 30, print "Sum of even numbers is not equal to 30".
# Otherwise, print "Sum of even numbers is equal to 


In [30]:
finetuned_generation_output = finetuned_model.generate(
    **inputs,
    max_new_tokens=256,
    eos_token_id=tokenizer.eos_token_id,
    return_dict_in_generate=True,
)

finetuned_s = finetuned_generation_output.sequences[0]
finetuned_output = tokenizer.decode(finetuned_s, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


In [31]:
display(finetuned_output)

'\ngenerate a C# example for the coding principle that "The grouping of an expression does not completely determine its evaluation". In the spirit of the multiplication example above, if the expression 2is evaluated as a void expression, either the result or the ﬂoating-point exception is unspeciﬁed and this requires a trap or a trap representation in the result, the behavior is undeﬁned. 5 The following example speciﬁes the behavior for a function call when the value of a pointer is used as an argument to a function deﬁned in the same scope by the same function call. It is intended to be a working document for the implementation. #include <stdarg.h> void f(int n,int *restrict p,int i,...) { va_list ap; char *restrict format; if(n == 0) va_copy (ap, va_start (ap, f2(0, 1, 1, 1, 3), f2)); else if(n == 1) va_copy (ap, va_start (ap, f2(1, 1, 1, 3, 4)-1, f2)); else if(n == 2) va_copy (ap, va_start'

In [32]:
# push the fine-tuned adapter weights to the Hugging Face Hub
repo='Alexis-Az/jinaai-starcoder-textbook-finetuned'
finetuned_model.push_to_hub(repo)



CommitInfo(commit_url='https://huggingface.co/Alexis-Az/jinaai-starcoder-textbook-finetuned/commit/d24d7ff82f07b3f173531da5717fd75b611f2b74', commit_message='Upload model', commit_description='', oid='d24d7ff82f07b3f173531da5717fd75b611f2b74', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Alexis-Az/jinaai-starcoder-textbook-finetuned', endpoint='https://huggingface.co', repo_type='model', repo_id='Alexis-Az/jinaai-starcoder-textbook-finetuned'), pr_revision=None, pr_num=None)