## Prerequisites

Before delving into the fine-tuning process, ensure that you have the following prerequisites in place:

1. **GPU**: This tutorial cannot run on free Google Colab; it requires more powerful GPUs, such as the A100.
2. **Python Packages**: Ensure that you have the necessary Python packages installed. The installation commands are in Step 1.3.

Let's begin by checking if your GPU is correctly detected:
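
A minimal sketch of such a check, assuming PyTorch's CUDA API. The helper name `describe_gpu` is hypothetical and not part of the original notebook; it imports `torch` lazily so it can also be run before the packages are installed.

```python
def describe_gpu() -> str:
    """Return a one-line description of the first CUDA device, if any."""
    try:
        import torch  # imported lazily so the check also works before installation
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA GPU detected"
    props = torch.cuda.get_device_properties(0)
    # total_memory is reported in bytes; convert to GiB for readability
    return f"{props.name} ({props.total_memory / 1024**3:.1f} GiB)"


print(describe_gpu())
```

If the output reports no GPU, switch the Colab runtime type to a GPU runtime before continuing.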

Let's define a wrapper function which will get completion from the model from a user question


## Step 1 - Set up

### Step 1.1 - Set variables

In [1]:
MODEL_PATH = '/content/drive/My Drive/Colab Notebooks/model'
MODEL_NAME = 'llm-prompt-recovery-mistral-6'
CHECKPOINT_PATH = '/content/drive/My Drive/Colab Notebooks/checkpoints'

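import os
# PyTorch reads this setting from the environment, so a plain Python variable has no effect
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"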

HUGGING_FACE_USERNAME = 'OliverSavolainen'
HUGGING_FACE_REPO_PATH = '/content/drive/My Drive/Colab Notebooks/huggingFaceRepo'
# print(f'{HUGGING_FACE_USERNAME}/{MODEL_NAME}')

TRAIN_FILE_PATH = '/content/drive/My Drive/Colab Notebooks/df_with_emb_20240402.csv'

KAGGLE_CREDENTIALS_PATH = '/content/drive/My Drive/Colab Notebooks/kaggle.json'
KAGGLE_DATASET_PATH = '/content/dataset'
KAGGLE_DATASET_NAME = MODEL_NAME
KAGGLE_USERNAME = HUGGING_FACE_USERNAME

### Step 1.2 - Set Credentials

In [2]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
from huggingface_hub import notebook_login
notebook_login()
# login()

In [None]:
# Setup Kaggle API Credentials:
# Make sure kaggle.json file exists in this folder
!mkdir -p ~/.kaggle
!cp "$KAGGLE_CREDENTIALS_PATH" ~/.kaggle/
!ls ~/.kaggle/

# Change the permissions of the file.
!chmod 600 ~/.kaggle/kaggle.json

### Step 1.3 - Install necessary packages
First, install the dependencies below to get started. As these features are available on the main branches only, we need to install the libraries below from source.

In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets scipy
!pip install -q trl

!pip install -q kaggle

### Step 1.4 - Imports

In [None]:
import json
import os
import warnings

import bitsandbytes as bnb
import pandas as pd
import torch
import transformers
from datasets import Dataset
from huggingface_hub import Repository
from peft import LoraConfig, PeftConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer

## Step 2 - Model loading
We'll load the model with 4-bit NF4 quantization, as used in QLoRA, to reduce memory usage.


In [None]:
# Load base model (Mistral 7B)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=False,
)


model_id = 'mistralai/Mistral-7B-Instruct-v0.2'

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True)
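
To get a feel for what 4-bit loading saves, here is a rough back-of-the-envelope sketch. It is an estimate under stated assumptions, not a measurement: the parameter count (about 7.24 billion) and the 64-weight quantization block size are assumptions not given in this notebook, and the figures cover the weights only (no activations, optimizer state, or CUDA overhead).

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7.24e9  # assumption: approximate parameter count of Mistral 7B

bf16 = weight_memory_gb(N_PARAMS, 16)
# NF4 stores 4 bits per weight plus one 32-bit scale per block of 64 weights
# (assumed block size), i.e. 32 / 64 = 0.5 extra bits per weight when
# double quantization is off, as in the config above.
nf4 = weight_memory_gb(N_PARAMS, 4 + 32 / 64)

print(f"bf16 weights: {bf16:.1f} GB")
print(f"NF4 weights:  {nf4:.1f} GB")
```

Under these assumptions the quantized weights take a bit under a third of the bf16 footprint, which is what leaves room for the LoRA adapters and training state.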



In [None]:
adapter_model_name = '/content/drive/My Drive/Colab Notebooks/checkpoints/checkpoint-80/'
MODEL_NAME = 'llm-prompt-recovery-mistral-6'
model = PeftModel.from_pretrained(model, adapter_model_name)

The first cell above defines the quantization configuration, specifies the model ID, and loads the model with that configuration. The second cell attaches a previously saved LoRA adapter checkpoint.

Run an inference on the base model. The model does not seem to understand our instruction and gives us a list of questions related to our query.

## Step 3 - Load dataset for finetuning

### Let's Load the Dataset

For this tutorial, we will fine-tune Mistral 7B Instruct for code generation.

We will be using this [dataset](https://huggingface.co/datasets/TokenBender/code_instructions_122k_alpaca_style) which is curated by [TokenBender (e/xperiments)](https://twitter.com/4evaBehindSOTA) and is an excellent data source for fine-tuning models for code generation. It follows the alpaca style of instructions, which is an excellent starting point for this task. The dataset structure should resemble the following:

```json
{
  "instruction": "Create a function to calculate the sum of a sequence of integers.",
  "input": "[1, 2, 3, 4, 5]",
  "output": "# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum"
}
```

In [None]:
#original text prefix
orig_prefix = "Original Text:"

#mistral "response"
llm_response_for_rewrite = "Provide the modified text and I'll tell you a something general that changed about it."
#modified text prefix
rewrite_prefix = "Re-written Text:"

#provided as start of Mistral response (anything after this is used as the prompt)
#providing this as the start of the response helps keep things relevant
response_start = "The request was: "

#We insert our detected prompt into this well-scoring baseline text
#thanks to: https://www.kaggle.com/code/rdxsun/lb-0-61
base_line = 'Please improve the following text using the writing style of {}, maintaining the original meaning but altering the tone, diction, and stylistic elements to match the new style.Enhance the clarity, elegance, and impact of the following text by adopting the writing style of , ensuring the core message remains intact while transforming the tone, word choice, and stylistic features to align with the specified style.'
base_line_swap_text = "[insert desired style here]"

In [None]:
#first example
example_text_1 = "Hey there! Just a heads up: our friendly dog may bark a bit, but don't worry, he's all bark and no bite!"
example_rewrite_1 = "Warning: Protective dog on premises. May exhibit aggressive behavior. Ensure personal safety by maintaining distance and avoiding direct contact."
example_prompt_1 = "Shift the narrative from offering an informal, comforting assurance to providing a more structured advisory note that emphasizes mindfulness and preparedness. This modification should mark a clear change in tone and purpose, aiming to responsibly inform and guide, while maintaining a positive and constructive approach."

#second example
example_text_2 = "A lunar eclipse happens when Earth casts its shadow on the moon during a full moon. The moon appears reddish because Earth's atmosphere scatters sunlight, some of which refracts onto the moon's surface. Total eclipses see the moon entirely in Earth's shadow; partial ones occur when only part of the moon is shadowed."
example_rewrite_2 = "Yo check it, when the Earth steps in, takes its place, casting shadows on the moon's face. It's a full moon night, the scene's set right, for a lunar eclipse, a celestial sight. The moon turns red, ain't no dread, it's just Earth's atmosphere playing with sunlight's thread, scattering colors, bending light, onto the moon's surface, making the night bright. Total eclipse, the moon's fully in the dark, covered by Earth's shadow, making its mark. But when it's partial, not all is shadowed, just a piece of the moon, slightly furrowed. So that's the rap, the lunar eclipse track, a dance of shadows, with no slack. Earth, moon, and sun, in a cosmic play, creating the spectacle we see today."
example_prompt_2 = "Transform your communication from an academic delivery to a dynamic, rhythm-infused presentation. Keep the essence of the information intact but weave in artistic elements, utilizing rhythm, rhyme, and a conversational style. This approach should make the content more relatable and enjoyable, aiming to both educate and entertain your audience."


In [None]:
def remove_numbered_list(text):
    final_text_paragraphs = []
    for line in text.split('\n'):
        # Split each line at the first occurrence of '. '
        parts = line.split('. ', 1)
        # If the line looks like a numbered list item, remove the numbering
        if len(parts) > 1 and parts[0].isdigit():
            final_text_paragraphs.append(parts[1])
        else:
            # If it doesn't look like a numbered list item, include the line as is
            final_text_paragraphs.append(line)

    return '  '.join(final_text_paragraphs)


#trims LLM output to just the response
def trim_to_response(text):
    terminate_string = "[/INST]"
    text = text.replace('</s>', '')
    #just in case it puts things in quotes
    text = text.replace('"', '')
    text = text.replace("'", '')

    last_pos = text.rfind(terminate_string)
    return text[last_pos + len(terminate_string):] if last_pos != -1 else text

#looks for response_start / returns only text that occurs after
def extract_text_after_response_start(full_text):
    parts = full_text.rsplit(response_start, 1)  # Split from the right, ensuring only the last occurrence is considered
    if len(parts) > 1:
        return parts[1].strip()  # Return text after the last occurrence of response_start
    else:
        return full_text  # Return the original text if response_start is not found


#trims text to requested number of sentences (or first LF or double-space sequence)
def trim_to_first_x_sentences_or_lf(text, x):
    if x <= 0:
        return ""

    #any double-spaces dealt with as linefeed
    text = text.replace("  ", "\n")

    text_chunks = text.split('\n', 1)
    first_chunk = text_chunks[0]
    sentences = first_chunk.split('.')

    if len(sentences) - 1 <= x:
        trimmed_text = first_chunk
    else:
        # Otherwise, return the first x sentences
        trimmed_text = '.'.join(sentences[:x]).strip()

    if not trimmed_text.endswith('.'):
        trimmed_text += '.'  # Add back the final period if the text chunk ended with one and was trimmed

    return trimmed_text
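
As an illustration of how these helpers chain together, here is a small self-contained sketch. It restates compact versions of `trim_to_response` and `extract_text_after_response_start` so that it runs on its own; the `raw_output` string is hand-written for the example and is not real model output.

```python
RESPONSE_START = "The request was: "


def trim_to_response(text: str) -> str:
    """Keep only what follows the last [/INST] marker; drop </s> and quotes."""
    text = text.replace("</s>", "").replace('"', "").replace("'", "")
    marker = "[/INST]"
    pos = text.rfind(marker)
    return text[pos + len(marker):] if pos != -1 else text


def extract_after_response_start(text: str) -> str:
    """Return the text after the last occurrence of RESPONSE_START."""
    parts = text.rsplit(RESPONSE_START, 1)
    return parts[1].strip() if len(parts) > 1 else text


# Hand-written stand-in for a decoded generation (assumption, not real output)
raw_output = (
    "[INST] Re-written Text: Yo check it, the moon turns red. [/INST] "
    "The request was: Improve this text by making it a rap.</s>"
)

recovered = extract_after_response_start(trim_to_response(raw_output))
print(recovered)  # Improve this text by making it a rap.
```

Note that removing every quote character also strips apostrophes from the recovered prompt, so "don't" becomes "dont".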

In [None]:
#original text prefix
orig_prefix = "Original Text:"

#mistral "response"
llm_response_for_rewrite = "Provide the new text and I will tell you what new element was added or change in tone was made to improve it - with no references to the original.  I will avoid mentioning names of characters.  It is crucial no person, place or thing from the original text be mentioned.  For example - I will not say things like 'change the puppet show into a book report' - I would just say 'improve this text into a book report'.  If the original text mentions a specific idea, person, place, or thing - I will not mention it in my answer.  For example if there is a 'dog' or 'office' in the original text - the word 'dog' or 'office' must not be in my response.  My answer will be a single sentence."

#modified text prefix
rewrite_prefix = "Re-written Text:"

#provided as start of Mistral response (anything after this is used as the prompt)
#providing this as the start of the response helps keep things relevant
response_start = "The request was: "

#added after response_start to prime mistral
#"Improve this" or "Improve this text" resulted in non-answers.
#"Improve this text by" seems to produce good results
response_prefix = "Improve this text by"

#well-scoring baseline text
#thanks to: https://www.kaggle.com/code/rdxsun/lb-0-61
base_line = 'Refine the following passage by emulating the writing style of [insert desired style here], with a focus on enhancing its clarity, elegance, and overall impact. Preserve the essence and original meaning of the text, while meticulously adjusting its tone, vocabulary, and stylistic elements to resonate with the chosen style.Please improve the following text using the writing style of, maintaining the original meaning but altering the tone, diction, and stylistic elements to match the new style.Enhance the clarity, elegance, and impact of the following text by adopting the writing style of , ensuring the core message remains intact while transforming the tone, word choice, and stylistic features to align with the specified style.'

In [None]:

examples_sequences = [
    (
        "Hey there! Just a heads up: our friendly dog may bark a bit, but don't worry, he's all bark and no bite!",
        "Warning: Protective dog on premises. May exhibit aggressive behavior. Ensure personal safety by maintaining distance and avoiding direct contact.",
        "Improve this text to be a warning."
    ),

    (
        "A lunar eclipse happens when Earth casts its shadow on the moon during a full moon. The moon appears reddish because Earth's atmosphere scatters sunlight, some of which refracts onto the moon's surface. Total eclipses see the moon entirely in Earth's shadow; partial ones occur when only part of the moon is shadowed.",
        "Yo check it, when the Earth steps in, takes its place, casting shadows on the moon's face. It's a full moon night, the scene's set right, for a lunar eclipse, a celestial sight. The moon turns red, ain't no dread, it's just Earth's atmosphere playing with sunlight's thread, scattering colors, bending light, onto the moon's surface, making the night bright. Total eclipse, the moon's fully in the dark, covered by Earth's shadow, making its mark. But when it's partial, not all is shadowed, just a piece of the moon, slightly furrowed. So that's the rap, the lunar eclipse track, a dance of shadows, with no slack. Earth, moon, and sun, in a cosmic play, creating the spectacle we see today.",
        "Improve this text to make it a rap."
    ),

    (
        "Drinking enough water each day is crucial for many functions in the body, such as regulating temperature, keeping joints lubricated, preventing infections, delivering nutrients to cells, and keeping organs functioning properly. Being well-hydrated also improves sleep quality, cognition, and mood.",
        "Arrr, crew! Sail the health seas with water, the ultimate treasure! It steadies yer body's ship, fights off plagues, and keeps yer mind sharp. Hydrate or walk the plank into the abyss of ill health. Let's hoist our bottles high and drink to the horizon of well-being!",
        "Improve this text to have a pirate."
    ),

    (
        "In a bustling cityscape, under the glow of neon signs, Anna found herself at the crossroads of endless possibilities. The night was young, and the streets hummed with the energy of life. Drawn by the allure of the unknown, she wandered through the maze of alleys and boulevards, each turn revealing a new facet of the city's soul. It was here, amidst the symphony of urban existence, that Anna discovered the magic hidden in plain sight, the stories and dreams that thrived in the shadows of skyscrapers.",
        "On an ordinary evening, amidst the cacophony of a neon-lit city, Anna stumbled upon an anomaly - a door that defied the laws of time and space. With the curiosity of a cat, she stepped through, leaving the familiar behind. Suddenly, she was adrift in the stream of time, witnessing the city's transformation from past to future, its buildings rising and falling like the breaths of a sleeping giant.",
        "Improve this text by making it about time travel."
    ),

    (
        "Late one night in the research lab, Dr. Evelyn Archer was on the brink of a breakthrough in artificial intelligence. Her fingers danced across the keyboard, inputting the final commands into the system. The lab was silent except for the hum of machinery and the occasional beep of computers. It was in this quiet orchestra of technology that Evelyn felt most at home, on the cusp of unveiling a creation that could change the world.",
        "In the deep silence of the lab, under the watchful gaze of the moon, Dr. Evelyn Archer found herself not alone. Beside her, the iconic red eye of HAL 9000 flickered to life, a silent partner in her nocturnal endeavor. 'Good evening, Dr. Archer,' HAL's voice filled the room, devoid of warmth yet comforting in its familiarity. Together, they were about to initiate a test that would intertwine the destiny of human and artificial intelligence forever. As Evelyn entered the final command, HAL processed the data with unparalleled precision, a testament to the dawn of a new era.",
        "Improve this text by adding an intelligent computer."
    ),

    (
        "The park was empty, save for a solitary figure sitting on a bench, lost in thought. The quiet of the evening was punctuated only by the occasional rustle of leaves, offering a moment of peace in the chaos of city life.",
        "Beneath the cloak of twilight, the park transformed into a realm of solitude and reflection. There, seated upon an ancient bench, was a lone soul, a guardian of secrets, enveloped in the serenity of nature's whispers. The dance of the leaves in the gentle breeze sang a lullaby to the tumult of the urban heart.",
        "Improve this text to be more poetic."
    ),

    (
        "The annual town fair was bustling with activity, from the merry-go-round spinning with laughter to the game booths challenging eager participants. Amidst the excitement, a figure in a cloak moved silently, almost invisibly, among the crowd, observing everything with keen interest but participating in none.",
        "Beneath the riot of color and sound that marked the town's annual fair, a solitary figure roamed, known to the few as Eldrin the Enigmatic. Clad in a cloak that shimmered with the whispers of the arcane, Eldrin moved with the grace of a shadow, his gaze piercing the veneer of festivity to the magic beneath. As a master of the mystic arts, he sought not the laughter of the crowds but the silent stories woven into the fabric of the fair. With a flick of his wrist, he could coax wonder from the mundane, transforming the ordinary into spectacles of shimmering illusion, his true participation hidden within the folds of mystery.",
        "Improve this text by adding a magician."
    ),

    (
        "The startup team sat in the dimly lit room, surrounded by whiteboards filled with ideas, charts, and plans. They were on the brink of launching a new app designed to make home maintenance effortless for homeowners. The app would connect users with local service providers, using a sophisticated algorithm to match needs with skills and availability. As they debated the features and marketing strategies, the room felt charged with the energy of creation and the anticipation of what was to come.",
        "In the quiet before dawn, a small group of innovators gathered, their mission: to simplify home maintenance through technology. But their true journey began with the unexpected addition of Max, a talking car with a knack for solving problems. 'Let me guide you through this maze of decisions,' Max offered, his dashboard flickering to life.",
        "Improve this text by adding a talking car."
    ),

]

In [None]:
def get_prompt2(orig_text, transformed_text):

    messages = []

    # Append example sequences
    for example_text, example_rewrite, example_prompt in examples_sequences:
        messages.append({"role": "user", "content": f"{orig_prefix} {example_text}"})
        messages.append({"role": "assistant", "content": llm_response_for_rewrite})
        messages.append({"role": "user", "content": f"{rewrite_prefix} {example_rewrite}"})
        messages.append({"role": "assistant", "content": f"{response_start} {example_prompt}"})

    #actual prompt
    messages.append({"role": "user", "content": f"{orig_prefix} {orig_text}"})
    messages.append({"role": "assistant", "content": llm_response_for_rewrite})
    messages.append({"role": "user", "content": f"{rewrite_prefix} {transformed_text}"})
    messages.append({"role": "assistant", "content": f"{response_start} {response_prefix}"})


    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
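
The conversation that `get_prompt2` builds can be inspected without a tokenizer. The sketch below is a simplified stand-in, not the notebook's function: `build_messages` and its default strings are hypothetical, and it only reproduces the turn structure (four turns per example, ending on a partial assistant answer that the model is meant to continue).

```python
def build_messages(examples, orig_text, rewritten_text,
                   instruction="Provide the new text.",
                   response_start="The request was: ",
                   response_prefix="Improve this text by"):
    """Build the few-shot chat: 4 turns per example plus 4 for the real query."""
    messages = []
    # The real query is appended as a final "example" whose answer is unknown
    turns = list(examples) + [(orig_text, rewritten_text, None)]
    for original, rewrite, prompt in turns:
        messages.append({"role": "user", "content": f"Original Text: {original}"})
        messages.append({"role": "assistant", "content": instruction})
        messages.append({"role": "user", "content": f"Re-written Text: {rewrite}"})
        # Known examples carry the full answer; the query only gets the primer
        answer = prompt if prompt is not None else response_prefix
        messages.append({"role": "assistant", "content": response_start + answer})
    return messages


demo_examples = [
    ("The dog may bark.", "Warning: dog on premises.", "Improve this text to be a warning."),
]
messages = build_messages(demo_examples, "The park was empty.", "Beneath the cloak of twilight...")

for message in messages:
    print(f"{message['role']:>9}: {message['content']}")
```

Roles must strictly alternate between `user` and `assistant`, which is why the instruction is phrased as an assistant reply between the original and the rewritten text.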

In [None]:
def get_prompt(orig_text, transformed_text):

    #construct the prompt...
    messages = [

        #first example
        {"role": "user", "content": f"{orig_prefix} {example_text_1}"},
        {"role": "assistant", "content": llm_response_for_rewrite},
        {"role": "user", "content": f"{rewrite_prefix} {example_rewrite_1}"},
        {"role": "assistant", "content": f"{response_start} {example_prompt_1}"},

        #second example
        {"role": "user", "content": f"{orig_prefix} {example_text_2}"},
        {"role": "assistant", "content": llm_response_for_rewrite},
        {"role": "user", "content": f"{rewrite_prefix} {example_rewrite_2}"},
        {"role": "assistant", "content": f"{response_start} {example_prompt_2}"},

        #actual prompt
        {"role": "user", "content": f"{orig_prefix} {orig_text}"},
        {"role": "assistant", "content": llm_response_for_rewrite},
        {"role": "user", "content": f"{rewrite_prefix} {transformed_text}"},
        {"role": "assistant", "content": response_start},
    ]

    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)

In [None]:
import re
def remove_special_characters(text):
    # This regex will match any character that is not a letter, number, or whitespace
    pattern = r'[^a-zA-Z0-9\s]'
    text = text.replace("Transform", "improve")
    text = text.replace("Reimagine", "rewrite")
    # Replace these characters with an empty string
    clean_text = re.sub(pattern, '', text)
    return clean_text

In [None]:
df = pd.read_csv(TRAIN_FILE_PATH)

In [None]:
set(df['dataset_id'].tolist())

In [None]:
df = df[df['dataset_id'] == 'nbroad_2']

def create_prompt(original, rewritten):
    return f"Given are 2 essays, the Rewritten essay was created from the Original essay using the google Gemma model.\nYou are trying to understand how the original essay was transformed into a new version.\nAnalyzing the changes in style, theme, etc., please come up with a prompt that must have been used to guide the transformation from the original to the rewritten essay.\nOnly give me the PROMPT. Start directly with the prompt, that's all I need.\nOutput should be only line ONLY\nOriginal Essay: {original}\nRewritten Essay: {rewritten}"

# Apply the function to each row to create the prompt column
df['prompt'] = df.apply(lambda row: get_prompt2(row['original_text'], row['rewritten_text']), axis=1)

In [None]:
df.head(10)

Instruction fine-tuning: prepare the dataset in the "prompt" format so the model can better understand it:
1. the function `generate_prompt` takes the instruction and output and generates a prompt
2. shuffle the dataset
3. tokenize the dataset

### Formatting the Dataset

Now, let's format the dataset in the required [Mistral-7B-Instruct-v0.1 format](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1).

> Many tutorials and blogs skip over this part, but I feel this is a really important step.

We'll put each instruction and input pair between `[INST]` and `[/INST]`, and place the output after that, like this:

```
<s>[INST] What is your favorite condiment? [/INST]
Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavor to whatever I'm cooking up in the kitchen!</s>
```

You can use the following code to format each row of your dataset in this layout:

In [None]:
def generate_prompt(row):
    """Generate input text based on a prompt, task instruction, and answer.

    :param row: Series: Data point (row of a pandas DataFrame)
    :return: str: generated prompt text
    """
    # Assuming 'prompt' is the generated context and 'rewrite_prompt' is the instruction
    return f"<s>[INST]{row['prompt']} [/INST]{row['rewrite_prompt']}</s>"

# Apply the function to each row to transform the 'prompt' and 'rewrite_prompt' columns
#df['prompt'] = df.apply(generate_prompt, axis=1)

In [None]:
df = df[['prompt', 'rewrite_prompt']]
# Convert the pandas DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(df)
#dataset = dataset.select(range(10)) # ToDo: remove that
# Shuffle the dataset with a seed for reproducibility
dataset = dataset.shuffle(seed=1234)

# Tokenize the prompts in the dataset.
# The tokenizer adds 'input_ids' and 'attention_mask' columns for each prompt.
dataset = dataset.map(lambda samples: tokenizer(samples["prompt"]), batched=True)

# Split the dataset into training and testing sets (90% training, 10% testing)
train_test_split = dataset.train_test_split(test_size=0.1)
train_data = train_test_split['train']
test_data = train_test_split['test']

In [None]:
import gc
del df
gc.collect()

We tokenize the data so the model can consume it, and split the dataset into 90% for training and 10% for testing.

### After Formatting, We Should Get Something Like This

```json
{
  "text": "<s>[INST] Create a function to calculate the sum of a sequence of integers. here are the inputs [1, 2, 3, 4, 5] [/INST]\n# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum</s>",
  "instruction": "Create a function to calculate the sum of a sequence of integers",
  "input": "[1, 2, 3, 4, 5]",
  "output": "# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum",
  "prompt": "<s>[INST] Create a function to calculate the sum of a sequence of integers. here are the inputs [1, 2, 3, 4, 5] [/INST]\n# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum</s>"
}
```

While using SFT (**[Supervised Fine-tuning Trainer](https://huggingface.co/docs/trl/main/en/sft_trainer)**) for fine-tuning, we pass in only one text column of the dataset. The example above calls it “text”; in this notebook it is the `prompt` column, selected with `dataset_text_field='prompt'`.

In [None]:
print(test_data)

## Step 4 - Apply LoRA
Here comes the magic with PEFT! We prepare the quantized model with the `prepare_model_for_kbit_training` method and then wrap it with low-rank adapters (LoRA) using the `get_peft_model` utility function.

In [None]:
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
print(model)

Use the following function to find the linear layers to target for fine-tuning.
From the QLoRA paper: "We find that the most critical LoRA hyperparameter is how many LoRA adapters are used in total and that LoRA on all linear transformer block layers is required to match full finetuning performance."

In [None]:
def find_all_linear_names(model):
  """Return the names of all 4-bit linear layers, to be used as LoRA target modules."""
  cls = bnb.nn.Linear4bit
  lora_module_names = set()
  for name, module in model.named_modules():
    if isinstance(module, cls):
      # Keep only the last path component, e.g. 'q_proj'
      lora_module_names.add(name.split('.')[-1])
  # Exclude the output head once, after the loop (needed for 16-bit)
  lora_module_names.discard('lm_head')
  return list(lora_module_names)

In [None]:
modules = find_all_linear_names(model)
print(modules)

In [None]:
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=modules,
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM'
)

model = get_peft_model(model, lora_config)
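
To see what `r=8` on all linear layers means in terms of trainable parameters, here is a rough sketch of the arithmetic. A LoRA adapter on a linear layer with `in` inputs and `out` outputs adds two matrices with `r * (in + out)` parameters in total. The layer shapes below are an assumption taken from the public Mistral-7B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 decoder blocks); they are not stated in this notebook.

```python
# (in_features, out_features) of each linear layer in one decoder block.
# Assumption: shapes from the public Mistral-7B config, not from this notebook.
LAYER_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_BLOCKS = 32  # assumption: number of decoder blocks


def lora_param_count(shapes, rank, num_blocks):
    """Each adapter adds A (rank x in) and B (out x rank): rank * (in + out)."""
    per_block = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_block * num_blocks


trainable = lora_param_count(LAYER_SHAPES, rank=8, num_blocks=NUM_BLOCKS)
scaling = 32 / 8  # lora_alpha / r: factor applied to the adapter output

print(f"Trainable LoRA parameters: {trainable:,}")  # 20,971,520
print(f"LoRA scaling factor: {scaling}")            # 4.0
```

Under these assumptions the adapters amount to roughly 21 million parameters, a fraction of a percent of the base model.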

In [None]:
#trainable, total = model.get_nb_trainable_parameters()
#print(f"Trainable: {trainable} | total: {total} | Percentage: {trainable/total*100:.4f}%")


## Step 5 - Run the training!

Setting the training arguments:
* For demonstration purposes, we run only a few steps (100) to showcase how this integrates with existing tools in the HF ecosystem.

### Fine-Tuning with QLoRA and Supervised Fine-Tuning

We're ready to fine-tune our model using QLoRA. For this tutorial, we'll use the `SFTTrainer` from the `trl` library for supervised fine-tuning. Ensure that you've installed the `trl` library as mentioned in the prerequisites.

In [None]:
model = model.to(device='cuda', dtype=torch.bfloat16)

In [None]:
print(train_data)

In [None]:
#new code using SFTTrainer
tokenizer.pad_token = tokenizer.eos_token
torch.cuda.empty_cache()

training_args = transformers.TrainingArguments(
    per_device_train_batch_size=32, # ToDo: Try to fix memory issues and run with 16
    gradient_accumulation_steps=4,
    warmup_steps=3, # 3% of max_steps; this argument is a step count, not a fraction
    max_steps=100, # ToDo: Run for the full dataset
    learning_rate=2e-4,
    logging_steps=1,
    optim='paged_adamw_8bit',
    save_strategy='steps',
    save_steps=5,
    output_dir=CHECKPOINT_PATH,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=test_data,
    dataset_text_field='prompt',
    peft_config=lora_config,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
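
A quick sketch of what these settings imply. It assumes a single GPU (the model was loaded with `device_map={"":0}`); the variable names are for illustration only.

```python
per_device_batch = 32   # per_device_train_batch_size
grad_accum_steps = 4    # gradient_accumulation_steps
max_steps = 100         # max_steps
save_steps = 5          # save_steps
num_gpus = 1            # assumption: single-GPU Colab runtime

# Sequences contributing to each optimizer update
effective_batch = per_device_batch * grad_accum_steps * num_gpus
# Sequences consumed over the whole run
samples_seen = effective_batch * max_steps
# Checkpoints written when saving every `save_steps` optimizer steps
num_checkpoints = max_steps // save_steps

print(f"Effective batch size: {effective_batch}")   # 128
print(f"Samples seen in total: {samples_seen}")     # 12800
print(f"Checkpoints written: {num_checkpoints}")    # 20
```

Halving `per_device_train_batch_size` to 16 while doubling `gradient_accumulation_steps` to 8 keeps the effective batch size at 128 and lowers peak activation memory, at the cost of more forward and backward passes per update.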



### Let's start the training process

In [None]:
def find_latest_checkpoint(checkpoint_path):
    """
    Finds the latest (highest-numbered) checkpoint directory.

    Parameters:
    checkpoint_path (str): The base path where checkpoint directories are stored.

    Returns:
    str: The path to the latest checkpoint directory, or None if no checkpoint found.
    """
    # Regex to match checkpoint directories like 'checkpoint-500'
    checkpoint_pattern = re.compile(r"^checkpoint-\d+$")

    # Get all directories in the checkpoint path
    all_files = os.listdir(checkpoint_path)

    # Filter out checkpoint directories
    checkpoint_dirs = [f for f in all_files if checkpoint_pattern.match(f)]

    # No checkpoint directories found
    if not checkpoint_dirs:
        return None

    # Find the checkpoint directory with the highest step number
    latest_checkpoint_dir = max(checkpoint_dirs, key=lambda x: int(x.split('-')[-1]))

    # Return the full path to the latest checkpoint directory
    return os.path.join(checkpoint_path, latest_checkpoint_dir)
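
The numeric `key` in the `max` call matters because checkpoint names do not sort correctly as strings. A small sketch with made-up directory names (an illustration, not the contents of the real checkpoint folder):

```python
import re

# Hypothetical directory listing; 'runs' stands in for unrelated entries
names = ["checkpoint-5", "checkpoint-80", "checkpoint-100", "runs"]

pattern = re.compile(r"^checkpoint-\d+$")
checkpoints = [name for name in names if pattern.match(name)]

# String comparison looks at characters left to right, so '8' beats '1'
lexicographic = max(checkpoints)
# Comparing the parsed step number gives the truly latest checkpoint
numeric = max(checkpoints, key=lambda name: int(name.split("-")[-1]))

print(lexicographic)  # checkpoint-80
print(numeric)        # checkpoint-100
```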


In [None]:
# Suppress all warnings
warnings.filterwarnings('ignore')

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

# Find the latest checkpoint
latest_checkpoint_path = find_latest_checkpoint(CHECKPOINT_PATH)

if latest_checkpoint_path and os.path.isdir(latest_checkpoint_path):
    print(f"Resuming training from the latest checkpoint: {latest_checkpoint_path}")
    trainer.train(resume_from_checkpoint=latest_checkpoint_path)
else:
    print(f"No checkpoints found in: {CHECKPOINT_PATH}. Training will start from scratch.")
    trainer.train()

## Step 6 - Save the model

In [None]:
# Save the model to the repo
model.save_pretrained(MODEL_PATH)
tokenizer.save_pretrained(MODEL_PATH)

In [None]:
# # Or load from the file
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit= True,
#     bnb_4bit_quant_type= 'nf4',
#     bnb_4bit_compute_dtype= torch.bfloat16,
#     bnb_4bit_use_double_quant= False,
#  )

# model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, quantization_config=bnb_config, device_map={"":0})
# tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, add_eos_token=True)


### Save to Hugging Face

### Save to Kaggle


In [None]:
# Ensure the directory exists
os.makedirs(KAGGLE_DATASET_PATH, exist_ok=True)

# Move the model files to the directory
!cp "$MODEL_PATH"/* "$KAGGLE_DATASET_PATH"/
!ls $KAGGLE_DATASET_PATH


In [None]:
# Add a metadata file
metadata = {
    'title': KAGGLE_DATASET_NAME,
    'id': f"{KAGGLE_USERNAME}/{KAGGLE_DATASET_NAME}",
    'licenses': [{'name': 'CC0-1.0'}], # ToDo: What license?
    'description': 'A Mistral 7B model finetuned to recover Gemma prompts based on the output. This model is trained on Google Colab and is used for our submission to https://www.kaggle.com/competitions/llm-prompt-recovery.',
    'subtitle': 'Finetuned Mistral 7B Model',
    'keywords': ['LLM', 'Prompt', 'Recovery', 'Mistral', 'Gemma'],
    'collaborators': ['Oliver Savolainen', 'Mark Bodracska', 'Valentina Lilova'],
    'resources': [
        # {
        #     'path': 'file1.csv',
        #     'description': 'Description of file1.csv contents.'
        # },
        # {
        #     'path': 'file2.csv',
        #     'description': 'Description of file2.csv contents.'
        # }
    ],
}

with open(os.path.join(KAGGLE_DATASET_PATH, 'dataset-metadata.json'), 'w') as f:
    json.dump(metadata, f)

In [None]:
# Add the test CSV file
# ToDo: Add the real test CSV or remove
# Note: TEST_FILE_PATH is not defined in Step 1.1; set it before running this cell
!cp "$TEST_FILE_PATH" "$KAGGLE_DATASET_PATH"

In [None]:
# Upload the dataset to Kaggle
# Create the first time
!kaggle datasets create -p {KAGGLE_DATASET_PATH} --dir-mode zip
# Update an existing dataset
#!kaggle datasets version -p {KAGGLE_DATASET_PATH} -m "Update dataset"

## Step 7 - Evaluating the model qualitatively: run an inference!



In [None]:
# import re
# def remove_special_characters(text):
#     # This regex will match any character that is not a letter, number, or whitespace
#     pattern = r'[^a-zA-Z0-9\s]'
#     text =text.replace("Transform" ,"improve")
#     text =text.replace("Reimagine" ,"rewrite")
#     # Replace these characters with an empty string
#     clean_text = re.sub(pattern, '', text)
#     return clean_text

# def get_completion(prompt) -> str:
#     device = "cuda:0"
#     print('prompt')
#     print(prompt)
#     encodeds = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)

#     model_inputs = encodeds.to(device)


#     generated_ids = model.generate(**model_inputs, max_new_tokens=200, do_sample=True, pad_token_id=tokenizer.eos_token_id)
#     generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
#     print('gen')
#     print(generated_text)
#     prompt_length = len(tokenizer.encode(prompt))
#     print('output')
#     prompt_tokens = tokenizer.encode(prompt, add_special_tokens=False)
#     generated_tokens = generated_ids[0][len(prompt_tokens):]
#     generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
#     print(generated_text)
#     return generated_text

In [None]:
# test_df = pd.read_csv("/kaggle/input/llm-prompt-recovery/test.csv")
# test_df['prompt'] = test_df.apply(lambda row: create_prompt(row['original_text'], row['rewritten_text']), axis=1)
# test_df['prompt'] = test_df.apply(generate_prompt, axis=1)
# test_df['rewrite_prompt'] = test_df.apply(lambda row: get_completion(row['prompt']), axis=1)
# test_df = test_df[['id', 'rewrite_prompt']]
# test_df

In [None]:
# test_df = pd.read_csv("/kaggle/input/llm-prompt-recovery/test.csv")
# adject =[]
# for index, row in test_df.iterrows():
#         result = generate_prompt(create_prompt(row['original_text'], row['rewritten_text']))
#         result  = remove_special_characters(result)
#         print(result)
#         test_df.at[index, 'rewrite_prompt'] =  result




# test_df = test_df[['id', 'rewrite_prompt']]
# test_df


In [None]:
# test_df.to_csv('submission.csv', index=False)