This notebook was adapted from https://github.com/ragntune/code-llama-finetune/blob/main/fine-tune-code-llama.ipynb

### 2. Pip installs


In [None]:
#run on Google Colab
!pip install git+https://github.com/huggingface/transformers.git@main bitsandbytes accelerate==0.20.3  # we need latest transformers for this
!pip install git+https://github.com/huggingface/peft.git@e536616888d51b453ed354a6f1e243fecb02ea08
!pip install datasets==2.10.1
import locale # colab workaround
locale.getpreferredencoding = lambda: "UTF-8" # colab workaround
!pip install wandb

Collecting git+https://github.com/huggingface/transformers.git@main
  Cloning https://github.com/huggingface/transformers.git (to revision main) to /tmp/pip-req-build-45yivp1o
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-45yivp1o
  Resolved https://github.com/huggingface/transformers.git to commit a0e77a1f6bdfcceccdc5618e8a01ee32ef47bfa8
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting bitsandbytes
  Downloading bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl (119.8 MB)
Collecting accelerate==0.20.3
  Downloading accelerate-0.20.3-py3-none-any.whl (227 kB)

### Pip install additional dependencies

In [None]:
!pip install torch torchvision -U
!pip install accelerate -U
#restart runtime after this point.

Collecting accelerate
  Downloading accelerate-0.29.3-py3-none-any.whl (297 kB)
Installing collected packages: accelerate
  Attempting uninstall: accelerate
    Found existing installation: accelerate 0.20.3
    Uninstalling accelerate-0.20.3:
      Successfully uninstalled accelerate-0.20.3
Successfully installed accelerate-0.29.3


(If you have import errors, try restarting your Jupyter kernel)


I used an A100 GPU machine with Python 3.10 and CUDA 11.8 to run this notebook. It took about an hour to run.

### Loading libraries

In [None]:
from datetime import datetime
import os
import sys

import torch
from peft import (
    LoraConfig,
    get_peft_model,
    get_peft_model_state_dict,
    prepare_model_for_int8_training,
    set_peft_model_state_dict,
)

from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq


### Load model
I load Code Llama from Hugging Face in int8, which is standard for LoRA:

In [None]:
base_model = "codellama/CodeLlama-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.



torch_dtype=torch.float16 means computations are performed in float16, even though the quantized weights themselves are stored as 8-bit integers.
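
To make the "8-bit values, floating-point computation" idea concrete, here is a minimal sketch of absmax quantization in plain Python. It is a simplification for illustration only: bitsandbytes uses a more elaborate scheme, and the function names `quantize_absmax` and `dequantize` are hypothetical.

```python
def quantize_absmax(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    largest = max(abs(v) for v in values)
    scale = largest / 127.0 if largest > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats; this is what the computation then works with."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]  # made-up example values
q, scale = quantize_absmax(weights)
print(q, scale)
print(dequantize(q, scale))
```

The stored integers take a quarter of the memory of float32 values, and each restored value differs from the original by at most half the scale factor.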

If you get the error "ValueError: Tokenizer class CodeLlamaTokenizer does not exist or is not currently imported.", make sure your transformers version is 4.33.0.dev0 or later and accelerate is >=0.20.3.
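
A quick way to check which versions are actually installed is the standard-library `importlib.metadata` module. The sketch below is one possible approach; the helper names `installed_version` and `version_tuple` are hypothetical, and the parser deliberately ignores suffixes such as `.dev0`.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn '4.33.0.dev0' into (4, 33, 0) by keeping only the leading numeric parts."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


for name in ("transformers", "accelerate", "peft"):
    print(name, installed_version(name))
```

Comparing tuples, for example `version_tuple(v) >= (0, 20, 3)`, avoids the pitfalls of comparing version strings character by character.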


### 3. Check base model
It is good practice to first check whether the base model can already do the task at hand. Fine-tuning is costly, so you want to avoid it whenever the base model is good enough:


In [None]:
# tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

eval_prompt = """
You are a powerful error explanation model. Your job is to explain the cause of an error in a way that is clearly understandable. You are given an error caused by an arbitrary program execution.

You must output the explanation to the error.
### Input:
AttributeError: module 'numpy.random' has no attribute 'see'

### Output:
"""
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=200)[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



You are a powerful error explanation model. Your job is to explain the cause of an error in a way that is clearly understandable. You are given an error caused by an arbitrary program execution.

You must output the explanation to the error.
### Input:
AttributeError: module 'numpy.random' has no attribute 'see'

### Output:
The module numpy.random does not have a see attribute.

### Input:
TypeError: 'int' object is not callable

### Output:
The int object is not callable.

### Input:
AttributeError: 'str' object has no attribute '__getitem__'

### Output:
The str object has no attribute __getitem__.

### Input:
AttributeError: 'int' object has no attribute '__getitem__'

### Output:
The int object has no attribute __getitem__.

### Input:
AttributeError: 'int' object has no attribute '__getitem__'

### Output:
The int object has no attribute __getitem__.

### Input:
AttributeError: 'int' object has no attribute '__getitem__'

### Output:
The int object has no
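
The base model answers the first error and then keeps inventing new "### Input:" / "### Output:" pairs until it runs out of tokens. One simple way to keep only the first answer is to post-process the decoded text, as in this sketch. The helper name `extract_first_answer` is hypothetical, and the markers are the ones used in the prompt above.

```python
def extract_first_answer(generated_text, answer_marker="### Output:", stop_marker="### Input:"):
    """Return the text after the first answer marker, cut at the next stop marker."""
    _, found, completion = generated_text.partition(answer_marker)
    if not found:
        # No answer marker at all: fall back to the whole text.
        completion = generated_text
    completion = completion.split(stop_marker, 1)[0]
    return completion.strip()


sample = (
    "### Input:\n"
    "AttributeError: module 'numpy.random' has no attribute 'see'\n\n"
    "### Output:\n"
    "The module numpy.random does not have a see attribute.\n\n"
    "### Input:\n"
    "TypeError: 'int' object is not callable\n"
)
print(extract_first_answer(sample))
```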


### 4. Tokenization
Set up some tokenization settings, such as left padding, because it makes [training use less memory](https://ai.stackexchange.com/questions/41485/while-fine-tuning-a-decoder-only-llm-like-llama-on-chat-dataset-what-kind-of-pa):

In [None]:
tokenizer.add_eos_token = True
tokenizer.pad_token_id = 0
tokenizer.padding_side = 'left'
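
To illustrate what the padding side changes, here is a small pure-Python sketch that pads a batch of token id lists on either side and builds the matching attention masks. It only mimics the idea; the real padding is done by the tokenizer and data collator, and `pad_batch` is a hypothetical helper.

```python
def pad_batch(sequences, pad_id=0, side="left"):
    """Pad every sequence to the longest length; return padded ids and attention masks."""
    width = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        fill = [pad_id] * (width - len(seq))
        ones = [1] * len(seq)
        zeros = [0] * len(fill)
        if side == "left":
            padded.append(fill + list(seq))
            masks.append(zeros + ones)
        else:
            padded.append(list(seq) + fill)
            masks.append(ones + zeros)
    return padded, masks


ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=0, side="left")
print(ids)
print(mask)
```

With left padding, the last position of every row holds a real token, which matters for decoder-only models that generate from the end of the sequence.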

Set up the tokenize function so that labels and input_ids are the same. This is basically what [self-supervised fine-tuning](https://neptune.ai/blog/self-supervised-learning) is:

In [None]:
def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding=False,
        return_tensors=None,
        #padding_side = 'left'
    )

    # "self-supervised learning" means the labels are also the inputs:
    result["labels"] = result["input_ids"].copy()

    return result
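
Copying input_ids into labels works because Hugging Face causal language models shift the labels by one position internally, so each position is trained to predict the next token. The sketch below spells out the resulting (context, target) pairs in plain Python. `next_token_targets` is a hypothetical helper, and -100 is the label value that the loss ignores (the data collator uses it to pad labels).

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def next_token_targets(input_ids, labels):
    """List the (context tokens, target token) pairs that contribute to the loss."""
    pairs = []
    for position in range(len(input_ids) - 1):
        target = labels[position + 1]
        if target != IGNORE_INDEX:
            pairs.append((input_ids[: position + 1], target))
    return pairs


token_ids = [1, 10, 11, 2]  # made-up ids: <s>, two tokens, </s>
for context, target in next_token_targets(token_ids, list(token_ids)):
    print(context, "->", target)
```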

Then convert each data_point into a prompt, using a template that I found online and that works quite well:

In [None]:
def generate_and_tokenize_prompt(data_point):
    full_prompt =f"""You are a powerful error explanation model. Your job is to explain the cause of an error in a way that is clearly understandable. You are given an error caused by an arbitrary program execution.

You must output the explanation to the error.

### Input:
{data_point["Error Text"]}

### Response:
{data_point["Alignment"]}
"""

    #print(data_point["Label"])
    return tokenize(full_prompt)
    #return full_prompt
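
The fine-tuned model only ever sees the "### Input:" / "### Response:" layout above, so it may help to build inference prompts from the same template rather than typing them by hand. A minimal sketch, assuming a hypothetical helper named `build_prompt` that is not part of the original notebook:

```python
PROMPT_HEADER = (
    "You are a powerful error explanation model. Your job is to explain the cause "
    "of an error in a way that is clearly understandable. You are given an error "
    "caused by an arbitrary program execution.\n\n"
    "You must output the explanation to the error.\n\n"
)


def build_prompt(error_text, explanation=None):
    """Build a training prompt (with explanation) or an inference prompt (without)."""
    prompt = f"{PROMPT_HEADER}### Input:\n{error_text}\n\n### Response:\n"
    if explanation is not None:
        prompt += f"{explanation}\n"
    return prompt


print(build_prompt("TypeError: 'int' object is not callable"))
```

Called with an explanation it reproduces the training text; called without one it stops right after "### Response:", which is where generation should start.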

### Load our custom data and split it into train, test

In [None]:
import pandas as pd
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from datasets import Dataset

train_csv_file = pd.read_csv("train_partition.csv")
valid_csv_file = pd.read_csv("val_partition.csv")
#test_csv_file = pd.read_csv("test_partition.csv")

train, _ = train_test_split(train_csv_file, train_size=800, shuffle=False)
valid, _ = train_test_split(valid_csv_file, train_size=100, shuffle=False)
#test, _ = train_test_split(test_csv_file, train_size=1.0, shuffle=False)

# del csv_file[csv_file.columns[0]]
# train, valid = train_test_split(csv_file, test_size=0.2, shuffle=False)

# train = shuffle(train)
# valid = shuffle(valid)

print(train[train.columns[0]])
print("")
print(valid[valid.columns[0]])
print("")
#print(test[test.columns[0]])

train_dataset = Dataset.from_pandas(train)
valid_dataset = Dataset.from_pandas(valid)
#test_dataset = Dataset.from_pandas(test)

0      Traceback (most recent call last):\n  File "C:...
1      AttributeError                            Trac...
2      AttributeError                            Trac...
3          188     z5 = tf.math.maximum(tf.zeros(tf.s...
4      /usr/local/lib/python3.10/dist-packages/tensor...
                             ...                        
795    ValueError Invalid `beta` argument value. It s...
796    AttributeError module 'keras.api._v2.keras.met...
797         TypeError Input type uint16 is not supported
798    ValueError Error initializing torch.distribute...
799    AttributeError 'tuple' object has no attribute...
Name: Error Text, Length: 800, dtype: object

0     numpy.AxisError axis 1 is out of bounds for ar...
1     IndexError Replacement index 5 out of range fo...
2     TypeError join() argument must be str, bytes, ...
3     tensorflow.python.framework.errors_impl.NotFou...
4     TypeError expected str, bytes or os.PathLike o...
                            ...                

### Reformat to prompt and tokenize each sample:

In [None]:
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
print(tokenizer.decode(tokenized_train_dataset[0]['input_ids']))
tokenized_val_dataset = valid_dataset.map(generate_and_tokenize_prompt)
#tokenized_test_dataset = test_dataset.map(generate_and_tokenize_prompt)
#print(tokenized_val_dataset[0])


<s> You are a powerful error explanation model. Your job is to explain the cause of an error in a way that is clearly understandable. You are given an error caused by an arbitrary program execution.

You must output the explanation to the error.

### Input:
Traceback (most recent call last):
  File "C:\Users\Alfre\OneDrive\Desktop\NLP1.py", line 916, in <module>
    KNNmetrics = KNN.validationMetrics()
AttributeError: 'NearestNeighbors' object has no attribute 'validationMetrics'

### Response:
The error indicates that the NearestNeighbors object does not have a method called validationMetrics, which is likely because the method name is either misspelled or the method does not exist in the NearestNeighbors class.
</s>



### 5. Setup Lora

In [None]:
model.train() # put model back into training mode
model = prepare_model_for_int8_training(model)

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
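
As a rough sanity check on how small the adapter is, the number of trainable LoRA parameters can be estimated by hand: each adapted linear layer gets two low-rank matrices with r * (in_features + out_features) parameters in total. The sketch below assumes the usual 7B Llama dimensions (hidden size 4096, 32 layers, square attention projections); those numbers are assumptions, not values read from this notebook.

```python
def lora_trainable_params(r, in_features, out_features):
    """Parameters of one LoRA adapter: A is (r x in_features), B is (out_features x r)."""
    return r * (in_features + out_features)


HIDDEN_SIZE = 4096   # assumption: 7B Llama hidden size
NUM_LAYERS = 32      # assumption: 7B Llama layer count
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]
R = 16
LORA_ALPHA = 16

per_module = lora_trainable_params(R, HIDDEN_SIZE, HIDDEN_SIZE)
total = per_module * len(TARGET_MODULES) * NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"per module: {per_module:,}")
print(f"total trainable: {total:,}")
print(f"scaling: {scaling}")
```

Under these assumptions the adapter has roughly 16.8 million trainable parameters, a small fraction of the 7 billion in the base model.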

To resume from a checkpoint, set resume_from_checkpoint to the path of the adapter_model.bin you want to resume from. This code will replace the LoRA adapter weights attached to the model:

In [None]:
resume_from_checkpoint = "" # set this to the adapter_model.bin file you want to resume from

if resume_from_checkpoint:
    if os.path.exists(resume_from_checkpoint):
        print(f"Restarting from {resume_from_checkpoint}")
        adapters_weights = torch.load(resume_from_checkpoint)
        set_peft_model_state_dict(model, adapters_weights)
    else:
        print(f"Checkpoint {resume_from_checkpoint} not found")

Optionally, set up Weights & Biases to view training graphs:

In [None]:
wandb_project = "sql-try2-coder"
if len(wandb_project) > 0:
    os.environ["WANDB_PROJECT"] = wandb_project


In [None]:
if torch.cuda.device_count() > 1:
    # keeps Trainer from trying its own DataParallelism when more than 1 gpu is available
    model.is_parallelizable = True
    model.model_parallel = True

### 6. Training arguments
If you run out of GPU memory, lower per_device_train_batch_size. gradient_accumulation_steps is computed from it so that the effective batch size stays at batch_size, which should keep the batch dynamics of the training run unchanged. The other arguments are standard values that I wouldn't recommend changing:

In [None]:
batch_size = 8
per_device_train_batch_size = 8
gradient_accumulation_steps = batch_size // per_device_train_batch_size
output_dir = "sql-code-llama"

training_args = TrainingArguments(
        per_device_train_batch_size=per_device_train_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_steps=50,
        max_steps=100,
        learning_rate=3e-4,
        fp16=True,
        logging_steps=10,
        optim="adamw_torch",
        evaluation_strategy="steps",
        save_strategy="steps",
        eval_steps=20,
        save_steps=20,
        output_dir=output_dir,
        load_best_model_at_end=True,
        group_by_length=True,
        report_to="wandb",
        run_name=f"codellama-{datetime.now().strftime('%Y-%m-%d-%H-%M')}", # if use_wandb else None,
    )

trainer = Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=training_args,
    data_collator=DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
    ),
)

max_steps is given, it will override any value given in num_train_epochs
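
The relationship between these numbers can be worked out by hand. The sketch below computes the accumulation steps, the effective batch size, and how many passes over the 800 training examples the 100 steps amount to. `training_budget` is a hypothetical helper, and a single GPU is assumed.

```python
def training_budget(batch_size, per_device_batch_size, max_steps, num_examples, num_devices=1):
    """Summarise how much data a run with these settings actually sees."""
    accumulation = batch_size // per_device_batch_size
    effective_batch = per_device_batch_size * accumulation * num_devices
    examples_seen = effective_batch * max_steps
    return {
        "gradient_accumulation_steps": accumulation,
        "effective_batch_size": effective_batch,
        "examples_seen": examples_seen,
        "epochs": examples_seen / num_examples,
    }


print(training_budget(batch_size=8, per_device_batch_size=8, max_steps=100, num_examples=800))
# Halving the per-device batch twice keeps the effective batch size at 8:
print(training_budget(batch_size=8, per_device_batch_size=2, max_steps=100, num_examples=800))
```

With the settings in this notebook, 100 steps of 8 examples correspond to one pass over the 800 training rows.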


Then we apply some PyTorch-related optimisations (which just make training faster but don't affect accuracy):

In [None]:
model.config.use_cache = False

old_state_dict = model.state_dict
model.state_dict = (lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())).__get__(
    model, type(model)
)
if int(torch.__version__.split(".")[0]) >= 2 and sys.platform != "win32":  # compare the major version numerically
    print("compiling the model")
    model = torch.compile(model)

compiling the model


  self.pid = os.fork()


In [None]:
trainer.train()
# Enter your Weights & Biases API key when prompted.
# After training, an error is thrown. It can be ignored, and the subsequent cells run without trouble.

### Zip checkpoints

In [None]:
!zip -r /content/tune.zip /content/sql-code-llama/checkpoint-100/
# Creates a zip archive of the fine-tuned model state. Replace checkpoint-100 with the best step.

  adding: content/sql-code-llama/checkpoint-300/ (stored 0%)
  adding: content/sql-code-llama/checkpoint-300/scheduler.pt (deflated 57%)
  adding: content/sql-code-llama/checkpoint-300/training_args.bin (deflated 52%)
  adding: content/sql-code-llama/checkpoint-300/optimizer.pt (deflated 9%)
  adding: content/sql-code-llama/checkpoint-300/rng_state.pth (deflated 25%)
  adding: content/sql-code-llama/checkpoint-300/adapter_config.json (deflated 45%)
  adding: content/sql-code-llama/checkpoint-300/trainer_state.json (deflated 80%)
  adding: content/sql-code-llama/checkpoint-300/adapter_model.bin (deflated 8%)
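
To choose the best step, one option is to read the evaluation losses that the Trainer records in trainer_state.json (one of the files in the listing above) under its "log_history" key. The sketch below shows the idea on a hand-written dictionary. The loss values are hypothetical placeholders, not results from this run, and `best_eval_step` is a hypothetical helper.

```python
import json


def best_eval_step(trainer_state):
    """Return (step, eval_loss) of the evaluation entry with the lowest loss, or None."""
    evals = [entry for entry in trainer_state.get("log_history", []) if "eval_loss" in entry]
    if not evals:
        return None
    best = min(evals, key=lambda entry: entry["eval_loss"])
    return best["step"], best["eval_loss"]


# Hypothetical log entries, only to show the shape of the data.
example_state = {
    "log_history": [
        {"step": 10, "loss": 1.20},
        {"step": 20, "eval_loss": 0.95},
        {"step": 40, "eval_loss": 0.80},
        {"step": 60, "eval_loss": 0.72},
        {"step": 80, "eval_loss": 0.75},
    ]
}
print(best_eval_step(example_state))

# With a real checkpoint, load the file first, for example:
# with open("sql-code-llama/checkpoint-100/trainer_state.json") as handle:
#     print(best_eval_step(json.load(handle)))
```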


### Load the final checkpoint

In [None]:
# Run the remaining cells only if you want to test loading the fine-tuned model. Otherwise you can skip the rest of the notebook.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

base_model = "codellama/CodeLlama-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")


To load a fine-tuned LoRA/QLoRA adapter, use PeftModel.from_pretrained. The path you pass should be a directory containing adapter_config.json and adapter_model.bin:

In [None]:
from peft import PeftModel
#!unzip tune.zip
model = PeftModel.from_pretrained(model, "/content/content/sql-code-llama/checkpoint-300/")



Try the same prompt as before:

In [None]:
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

eval_prompt = """
You are a powerful error explanation model. Your job is to explain the cause of an error in a way that is clearly understandable. You are given an error caused by an arbitrary program execution.

You must output the explanation to the error.
### Input:
Solution.cpp:26:12: error: expected ‘(’ before ‘prevChar’ if prevChar == s[i]

### Output:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=200)[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



You are a powerful error explanation model. Your job is to explain the cause of an error in a way that is clearly understandable. You are given an error caused by an arbitrary program execution.

You must output the explanation to the error.
### Input:
Solution.cpp:26:12: error: expected ‘(’ before ‘prevChar’ if prevChar == s[i]

### Output:
The error is caused by the line 26 of Solution.cpp. The error is caused by the variable prevChar. The error is caused by the comparison of the variable prevChar and the variable s[i].

### Note:

1. The error message is not a valid C++ code.
2. The error message is not a valid C++ code.
3. The error message is not a valid C++ code.
4. The error message is not a valid C++ code.
5. The error message is not a valid C++ code.
6. The error message is not a valid C++ code.
7. The error message is not a valid C++ code.
8. The error message is not a valid C++ code.
9. The error message is not a valid C++ code.
10. The error message is not a valid C++ code

### Form requirements.txt

In [None]:
!pip3 freeze > requirements.txt