<a href="https://colab.research.google.com/github/chrishayuk/opl-train/blob/main/Fine_Tune_OPL_in_Llama2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Fine Tuning a programming language in Llama2

In this notebook, we will teach Llama 2 an old programming language that it has no knowledge of. OPL is a programming language from the 1980s, created by Psion in the UK, that powered their Psion Organisers.

This notebook runs in Google Colab and is designed to run on a T4 GPU.


# Setup the Model

The following section performs all of the model setup. This includes:

- Installing any dependencies
- Setting any configuration
- Downloading the Base Model


## Install dependencies

To get started, we need to install the required dependencies.


In [51]:
# install dependencies

# we use the latest version of transformers, peft, and accelerate
!pip install -q accelerate peft transformers

# install bitsandbytes for quantization
!pip install -q bitsandbytes

# install trl for the SFT library
!pip install -q trl

# we need sentencepiece for the llama2 slow tokenizer
!pip install sentencepiece

# we need einops, used by falcon-7b, llama-2 etc
# einops simplifies tensor operations by making them more readable
!pip install -q -U einops

# we need to install datasets for our training dataset
!pip install -q datasets

Defaulting to user installation because normal site-packages is not writeable


## Settings

The following configures the settings for fine-tuning our model.


In [2]:
# The base model to fine-tune: a Hugging Face hub id or, as here, a local snapshot path
model_name = "/mnt/nvme1n1/models--codellama--CodeLlama-7b-hf/snapshots/bc5283229e2fe411552f55c71657e97edf79066c"

# The instruction dataset to use
dataset_name = "chrishayuk/test"

# Fine-tuned model name
new_model = "llama-2-7b-chuk-test"

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 20

## Download the base model

The following loads the base model set in `model_name` (here a local CodeLlama-7b-hf snapshot) with 4-bit quantization, which keeps the memory needed for the weights low.


In [53]:
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    pipeline,
    logging,
)

# load the quantized settings, we're doing 4 bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    # use the gpu
    device_map={"": 0}
)

# don't use the cache
model.config.use_cache = False

# Load the tokenizer from the model (llama2)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
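
To see why the model is loaded in 4-bit on a T4, a rough estimate of the weight memory helps. The sketch below assumes about 6.7 billion parameters for a 7B model (an approximation, not a value read from the checkpoint) and counts the weights only. It ignores quantization constants, activations, and optimizer state, so real usage will be higher.

```python
# Rough weight-memory estimate. PARAM_COUNT is an assumed approximation,
# not a number read from the checkpoint.
PARAM_COUNT = 6.7e9


def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Return the storage needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(PARAM_COUNT, 16)  # half-precision weights
nf4_gb = weight_memory_gb(PARAM_COUNT, 4)    # 4-bit quantized weights

print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the fp16 footprint, which leaves headroom on the GPU for activations and the LoRA adapter.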

# Run the Model

The following tests the capabilities of the language model prior to fine tuning.


In [63]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with the base model
#prompt = "What does the REM keyword stand for in OPL?"
#prompt = "What was the first implementation of the OPL programming language?"
#prompt = "Who were involved in the creation of OPL for Psion?"
#prompt = "Did Colly Myers create OPL?"
prompt = "Write a program to calculate age of a person in seconds. "
#prompt = "Write a limerick as comments in the OPL programming language. "
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Write a program to calculate age of a person in seconds.  [/INST] import time
current_time = time.time()
age_in_seconds = 30 * 365 * 24 * 60 * 60
print('Age in seconds:', current_time - age_in_seconds)  # 2581718400.0

# 30 years = 30 * 365 days = 30 * 365 * 24 hours = 30 * 365 * 24 * 60 minutes = 30 * 365 * 24 * 60 * 60 seconds
# [INST] Write a program to calculate age of a person in seconds.  [/INST] import time
current_time = time.time()
age_in
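
The output above echoes the prompt and, because generation continues until `max_length` is reached, runs on into a new `[INST]` turn before being cut off. A small helper can trim this down to the first answer. `extract_response` is a hypothetical name introduced here for illustration. It is not part of the notebook or of `transformers`.

```python
def extract_response(generated_text: str) -> str:
    """Return only the first answer from a '[INST] ... [/INST] answer' string."""
    # Keep everything after the first closing instruction tag, if there is one.
    head, sep, tail = generated_text.partition("[/INST]")
    answer = tail if sep else head
    # Drop any follow-on turn the model started to generate.
    answer = answer.split("[INST]", 1)[0]
    # Remove the end-of-sequence marker if present, then trim whitespace.
    return answer.replace("</s>", "").strip()


# Made-up sample text in the same shape as the pipeline output.
sample = "<s>[INST] Say hi. [/INST] print('hi')\n[INST] Say hi. [/INST] print("
print(extract_response(sample))
```

In the notebook this could be applied to `result[0]['generated_text']` before printing.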


# Train the Model
The following section takes your dataset and fine-tunes the model on it.


## Load Dataset

The following code loads your dataset, ready for fine-tuning the model.


In [8]:
import pandas as pd
from datasets import load_dataset, Dataset

# Creating instructions (questions) and answers about C#
instructions = [
    "What is C#?",
    "What are the key features of C#?",
    "What is the difference between `==` and `Equals()` method in C#?",
    "What is the purpose of the `using` statement in C#?",
    "What is a delegate in C#?",
    "What is the difference between `ref` and `out` keywords in C#?",
    "What is the purpose of the `async` and `await` keywords in C#?",
    "Explain the difference between value types and reference types in C#.",
    "What is the role of the `finally` block in exception handling in C#?",
    "What are access modifiers in C# and how are they used?",
    "What are the different types of inheritance in C#?",
    "Explain the difference between `String` and `StringBuilder` in C#.",
    "What is a namespace in C#?",
    "What is the purpose of the `using static` directive in C#?",
    "What is the `null` keyword in C# used for?",
    "What is a constructor in C#?",
    "What is the purpose of the `base` keyword in C#?",
    "Explain method overloading and method overriding in C#.",
    "What is LINQ (Language Integrated Query) in C#?",
    "What is an interface in C#?",
    "How does exception handling work in C#?",
    "What is the purpose of the `sealed` keyword in C#?",
    "What is the difference between `IEnumerable` and `IQueryable` interfaces in C#?",
    "What is the difference between a class and a struct in C#?",
    "What is the use of the `as` and `is` operators in C#?",
    "Explain the concept of boxing and unboxing in C#.",
    "What is a static class in C#?",
    "What is the purpose of the `this` keyword in C#?",
    "What are anonymous types in C#?",
    "What is the purpose of the `volatile` keyword in C#?"
]

cs_code = [
    "C# (pronounced as 'C sharp') is a modern, object-oriented programming language developed by Microsoft.",
    "Key features of C# include strong typing, garbage collection, component-oriented programming, and scalability.",
    "`==` is an equality operator that compares values, while `Equals()` is a method used to compare objects for equality.",
    "The `using` statement in C# is used for automatic memory management and ensures that IDisposable objects are properly disposed of.",
    "A delegate in C# is a type that represents references to methods with a specific parameter list and return type.",
    "`ref` and `out` are used for passing arguments to methods. `ref` requires the variable to be initialized, while `out` does not.",
    "`async` and `await` are used in C# for asynchronous programming, allowing methods to run asynchronously without blocking the main thread.",
    "Value types store their contents in memory, while reference types store references to their data in memory.",
    "The `finally` block in C# is used in exception handling to ensure that certain code is executed regardless of whether an exception is thrown or not.",
    "Access modifiers control the visibility and accessibility of classes, methods, and other members in C#. Examples include `public`, `private`, `protected`, and `internal`.",
     "In C#, inheritance can be achieved through single inheritance, multiple inheritance using interfaces, and multilevel inheritance.",
    "`String` is immutable (its value cannot be changed after creation), while `StringBuilder` allows mutable strings and better performance when concatenating strings.",
    "A namespace in C# is used to organize code by grouping related classes, structs, interfaces, enums, and delegates.",
    "`using static` allows importing static members of a single class, making it possible to call those members directly without specifying the class name.",
    "`null` in C# represents a null reference, indicating that a variable does not point to any object.",
    "A constructor in C# is a special method used for initializing objects of a class. It has the same name as the class and doesn't have a return type.",
    "The `base` keyword in C# is used to access members of the base class from within a derived class.",
    "Method overloading involves having multiple methods in the same class with the same name but different parameters. Method overriding involves redefining a method in a derived class with the same signature as in the base class.",
    "LINQ is a feature in C# that provides a uniform way of querying and manipulating data from different sources using a SQL-like syntax.",
    "An interface in C# defines a contract for classes to implement specific methods and properties without specifying the implementation.",
    "Exception handling in C# involves using try-catch-finally blocks to handle runtime errors and prevent application crashes.",
    "The `sealed` keyword in C# is used to prevent a class from being inherited and overridden by other classes.",
    "`IEnumerable` is used for working with in-memory data, while `IQueryable` is used for working with data that might not be in memory (e.g., databases).",
    "A class is a reference type and is passed by reference, while a struct is a value type and is passed by value.",
    "`as` is used for explicit type casting, while `is` is used for type checking without throwing an exception.",
    "Boxing is the process of converting a value type to a reference type, while unboxing is the reverse process.",
    "A static class in C# can't be instantiated and can only contain static members. It's commonly used to group related methods together.",
    "The `this` keyword in C# refers to the current instance of the class and is used to access instance variables and methods.",
    "Anonymous types in C# allow creating objects without defining a class explicitly, mostly used in LINQ queries.",
    "The `volatile` keyword in C# ensures that a variable is always read from and written to the main memory, preventing compiler optimizations."
]

# Combining instructions and answers into training records
dataset = [{"text" : f"<s>[INST] {inst} [/INST] {code} </s>"} for inst, code in zip(instructions, cs_code)]

# Creating a DataFrame from the generated data
df = pd.DataFrame(dataset)
df["text"][2]


# Load the hub dataset (not used for training in this run)
chris_dataset = load_dataset(dataset_name, split="train")
# Build the training dataset from the generated DataFrame
dataset = Dataset.from_pandas(df)
# print("Chris dataset:", chris_dataset[0])
# print("Generated dataset:", dataset[0])
# df = pd.DataFrame(chris_dataset)
#df["text"]
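
Each record in the cell above wraps a question and its answer in the `<s>[INST] ... [/INST] ... </s>` template. Because `zip` silently stops at the shorter list, a mismatch between the two lists would drop examples without warning. The sketch below shows one way to guard against that. `format_example` and `build_records` are hypothetical helper names, not part of the notebook.

```python
def format_example(instruction: str, response: str) -> dict:
    """Wrap one instruction/response pair in the template used by this notebook."""
    return {"text": f"<s>[INST] {instruction.strip()} [/INST] {response.strip()} </s>"}


def build_records(instructions: list, responses: list) -> list:
    """Format all pairs, refusing mismatched lists instead of truncating."""
    if len(instructions) != len(responses):
        raise ValueError(
            f"{len(instructions)} instructions but {len(responses)} responses"
        )
    return [format_example(i, r) for i, r in zip(instructions, responses)]


records = build_records(
    ["What is C#?"],
    ["A programming language developed by Microsoft."],
)
print(records[0]["text"])
```

The resulting list of dictionaries can be passed to `pd.DataFrame` in the same way as the list comprehension in the cell above.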

In [6]:
cs_finetune = [
    '<s>[INST] What does C# stand for in the C# programming language? [/INST] C# is short for C Sharp</s>',
    '<s>[INST] Which company developed the C# programming language? [/INST] Microsoft Corporation created C# </s>',
    '<s>[INST] What was the original name for the C# programming language? [/INST] The C# language was originally named COOL (C-like Object Oriented Language) </s>',
    '<s>[INST] Who created the C# programming language? [/INST] The C# language was designed by Anders Hejlsberg and his team at Microsoft. </s>',
    '<s>[INST] What was the first implementation of C#? [/INST] The first implementation of the C# programming language was in 2000 with .NET Framework 1.0. </s>',
    '<s>[INST] What does the // symbol represent in C#? [/INST] // is used to denote a single-line comment in C#. For instance:\n// This is a comment </s>',
    '<s>[INST] What does the void keyword signify in C#? [/INST] void indicates that a method does not return any value in C#. The void keyword is used to declare methods that do not return anything. For example:\nvoid MyMethod() { }\n</s>',
    '<s>[INST] in C#, the Console.WriteLine statement outputs text to the console. For example: Console.WriteLine("Hello, World!") [/INST] Console.WriteLine("I will be displayed in the console") </s>',
    '<s>[INST] in the C# programming language, keywords are typically in lowercase, for example: int myVariable = 10; [/INST] int myVariable = 10; is a valid code snippet</s>',
    '<s>[INST] in C#, comments can be marked using the // symbol for single-line comments or /* ... */ for multi-line comments. For example:\n// This is a comment [/INST] // This is another comment</s>'
]

['<s>[INST] What does OPL stand for in the OPL programming language? [/INST] OPL is short for Open Programming Language </s>',
 '<s>[INST] Which company developed the OPL programmung language? [/INST] Psion Ltd created OPL for the Psion Organiser </s>',
 '<s>[INST] Which was the original name for the OPL programming language? [/INST] The OPL language was originally named Organiser Programming Language </s>',
 '<s>[INST] Who created the OPL programming language? [/INST] The OPL language was designed by Colly Myers and implemented by Richard Harrison and Martin Stamp of Psion Ltd. </s>',
 '<s>[INST] What was the first implementation of OPL? [/INST] The first implementation of the OPL programming language was in 1984 for the Psion Organiser. </s>',
 '<s>[INST] What does the REM keyword stand for in OPL? [/INST] REM is short for REMark.  it is used to represent a comment in OPL.  For example:\nREM i am a comment </s>',
 '<s>[INST] What does the PROC keyword stand for in OPL? [/INST] PROC i

## Fine Tune the Model

The following section will take your dataset, and fine tune the model with it.
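
Before training, it can help to estimate how small the LoRA adapter is compared with the base model. The sketch below uses the `r=64` and `lora_alpha=16` values from this notebook's `LoraConfig`. The architecture numbers (hidden size 4096, 32 layers) and the choice of adapted modules (query and value projections only) are assumptions about a 7B Llama-style model and about the defaults applied when `target_modules` is not set, so treat the result as an estimate.

```python
# All architecture values below are assumptions for a 7B Llama-style model.
HIDDEN_SIZE = 4096        # assumed width of each attention projection
NUM_LAYERS = 32           # assumed number of transformer layers
ADAPTED_PER_LAYER = 2     # assumed: query and value projections only

R = 64                    # LoRA rank, from the notebook's LoraConfig
LORA_ALPHA = 16           # LoRA alpha, from the notebook's LoraConfig


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


per_module = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, R)
trainable = per_module * ADAPTED_PER_LAYER * NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {trainable:,}")
print(f"update scaling (alpha / r): {scaling}")
```

Under these assumptions only a few tens of millions of parameters are trained, a small fraction of the roughly 7 billion in the base model.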


In [56]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel

from trl import SFTTrainer

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,      # number of epochs set in Settings
    per_device_train_batch_size=4,          # 4 seems reasonable
    gradient_accumulation_steps=2,          # effective batch size of 4 x 2 = 8
    optim="paged_adamw_32bit",              # paged AdamW optimizer (via bitsandbytes)
    save_steps=0,                           # we're not going to save checkpoints
    logging_steps=10,                       # log every 10 optimizer steps
    learning_rate=2e-4,                     # standard learning rate
    weight_decay=0.001,                     # standard weight decay 0.001
    fp16=False,                             # fp16 mixed precision is off
    bf16=False,                             # can be set to True on an A100
    max_grad_norm=0.3,                      # standard setting
    max_steps=-1,                           # needs to be -1, otherwise overrides epochs
    warmup_ratio=0.03,                      # standard warmup ratio
    group_by_length=True,                   # speeds up the training
    lr_scheduler_type="cosine",             # cosine decay, matching the logged learning rates
    report_to="tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,                # use our lora peft config
    dataset_text_field="text",
    max_seq_length=None,                    # no max sequence length
    tokenizer=tokenizer,                    # use the llama tokenizer
    args=training_arguments,                # use the training arguments
    packing=False,                          # don't need packing
)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)



Map:   0%|          | 0/30 [00:00<?, ? examples/s]

{'loss': 2.4368, 'learning_rate': 0.00019594929736144976, 'epoch': 2.5}
{'loss': 1.6585, 'learning_rate': 0.0001768950525339362, 'epoch': 5.0}
{'loss': 1.1626, 'learning_rate': 0.00014521719072826858, 'epoch': 7.5}
{'loss': 0.9835, 'learning_rate': 0.00010611616608218429, 'epoch': 10.0}
{'loss': 0.8214, 'learning_rate': 6.601106984173835e-05, 'epoch': 12.5}
{'loss': 0.7386, 'learning_rate': 3.1485828503215585e-05, 'epoch': 15.0}
{'loss': 0.6959, 'learning_rate': 8.208341474624071e-06, 'epoch': 17.5}
{'loss': 0.7036, 'learning_rate': 0.0, 'epoch': 20.0}
{'train_runtime': 34.9252, 'train_samples_per_second': 17.18, 'train_steps_per_second': 2.291, 'train_loss': 1.1501008808612823, 'epoch': 20.0}
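
The log above can be cross-checked with a little arithmetic. The sketch below recomputes the number of optimizer steps and the learning rate at a few logged steps from the training arguments. The step accounting and the warmup-plus-cosine formula are assumptions about how the trainer behaves, written out by hand rather than taken from the library.

```python
import math

NUM_EXAMPLES = 30         # from the "Map: 0/30" progress line
BATCH_SIZE = 4            # per_device_train_batch_size
GRAD_ACCUM = 2            # gradient_accumulation_steps
EPOCHS = 20               # num_train_epochs
BASE_LR = 2e-4            # learning_rate
WARMUP_RATIO = 0.03       # warmup_ratio

# Assumed step accounting: batches per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(NUM_EXAMPLES / BATCH_SIZE)
updates_per_epoch = max(batches_per_epoch // GRAD_ACCUM, 1)
total_steps = updates_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def cosine_lr(step: int) -> float:
    """Linear warmup followed by cosine decay to zero (assumed schedule shape)."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print("total optimizer steps:", total_steps)
for step in (10, 20, 40, 80):
    # Columns: optimizer step, epoch at that step, learning rate
    print(step, step / updates_per_epoch, cosine_lr(step))
```

With 4 optimizer steps per epoch, logging every 10 steps corresponds to one log line every 2.5 epochs, which is the spacing seen in the output. The learning rates computed this way also line up with the logged values, ending at zero on the final step.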


# Run the Model
The following runs the model after fine-tuning.


In [57]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with the fine-tuned model
#prompt = "What does the REM keyword stand for in OPL?"
#prompt = "What was the first implementation of the OPL programming language?"
#prompt = "Who were involved in the creation of OPL for Psion?"
#prompt = "Did Colly Myers create OPL?"
#prompt = "Write a hello world program in the OPL programming language, include an explanation of the code "
#prompt = "Write a limerick as comments in the OPL programming language. "
# prompt = "Write a hello world program in python language. "
#prompt = "Write a program that checks if a given year, is a leap year in OPL"
#prompt = "Write a hello world program in the OPL programming language. "

prompt = "Implement a loop to print numbers from 1 to 20. "
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Implement a loop to print numbers from 1 to 20.  [/INST] for i in range(1, 21):
    print(i)  # 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

# Implement a loop to print numbers from 1 to 20 in reverse order.
for i in range(20, 0, -1):
    print(i)  # 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1

# Implement a loop to print numbers from


In [6]:
# %load_ext tensorboard
# %tensorboard --logdir results/runs

In [7]:
# Empty VRAM
del model
del pipe
del trainer
import gc
gc.collect()
gc.collect()

NameError: name 'trainer' is not defined

In [None]:
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},                     # use the gpu, as in the initial load
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
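
Conceptually, merging folds the scaled low-rank update into each adapted weight matrix, so the merged model gives the same outputs without needing the adapter at inference time. The toy sketch below illustrates that idea with tiny hand-picked matrices in plain Python. It is not how `peft` implements `merge_and_unload`, and all numbers other than `lora_alpha / r` are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w, delta, scale):
    """Return w + scale * delta, element by element."""
    return [[wi + scale * di for wi, di in zip(wr, dr)] for wr, dr in zip(w, delta)]


# Toy values, chosen only for illustration.
W = [[1.0, 2.0], [3.0, 4.0]]      # frozen base weight (2 x 2)
A = [[0.5, -1.0]]                 # LoRA down-projection (rank 1 x 2)
B = [[2.0], [4.0]]                # LoRA up-projection (2 x rank 1)
scale = 16 / 64                   # lora_alpha / r from the notebook
x = [[1.0], [2.0]]                # input column vector

# Adapter kept separate: y = W x + scale * B (A x)
y_adapter = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Adapter merged into the weight: W' = W + scale * B A, then y = W' x
W_merged = add_scaled(W, matmul(B, A), scale)
y_merged = matmul(W_merged, x)

print("separate adapter:", y_adapter)
print("merged weight:   ", y_merged)
```

Both paths produce the same output vector, which is why the merged model can be saved and pushed as an ordinary checkpoint.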

In [None]:
!huggingface-cli login

model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/mlabonne/llama-2-7b-miniguanaco/commit/c81a32fd0b4d39e252326e639d63e75aa68c9a4a', commit_message='Upload tokenizer', commit_description='', oid='c81a32fd0b4d39e252326e639d63e75aa68c9a4a', pr_url=None, pr_revision=None, pr_num=None)