## Prerequisites

Before delving into the fine-tuning process, ensure that you have the following prerequisites in place:

1. **GPU**: This tutorial cannot run on free Google Colab; it requires more powerful GPUs, such as the A100.
2. **Python Packages**: Ensure that you have the necessary Python packages installed. The install commands are listed in Step 1.3.

Let's begin by checking if your GPU is correctly detected:
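
A minimal sketch of such a check, assuming an NVIDIA GPU whose driver ships the `nvidia-smi` tool. The helper name `list_gpus` is made up for this example; it uses only the standard library so it can run before any packages are installed.

```python
import shutil
import subprocess


def list_gpus():
    """Return one 'name, memory' string per visible NVIDIA GPU, or [] if none."""
    # nvidia-smi ships with the NVIDIA driver; if it is missing, assume no GPU.
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        # The tool exists but could not reach a driver or device.
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No NVIDIA GPU detected")
```

Once `torch` is installed, `torch.cuda.is_available()` gives the same yes/no answer from inside Python.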

Let's define a wrapper function which will get completion from the model from a user question

## Step 1 - Set up

### Step 1.1 - Set variables

In [1]:
MODEL_PATH = '/content/drive/My Drive/Colab Notebooks/model'
MODEL_NAME = 'llm-prompt-recovery-mistral-4'

HUGGING_FACE_USERNAME = 'OliverSavolainen'
HUGGING_FACE_REPO_PATH = '/content/drive/My Drive/Colab Notebooks/huggingFaceRepo'
# print(f'{HUGGING_FACE_USERNAME}/{MODEL_NAME}')

TRAIN_FILE_PATH = '/content/drive/My Drive/Colab Notebooks/df_with_emb_20240402.csv'

KAGGLE_CREDENTIALS_PATH = '/content/drive/My Drive/Colab Notebooks/kaggle.json'
KAGGLE_DATASET_PATH = '/content/dataset'
KAGGLE_DATASET_NAME = MODEL_NAME
KAGGLE_USERNAME = HUGGING_FACE_USERNAME

### Step 1.2 - Set Credentials

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
from huggingface_hub import notebook_login
notebook_login()
# login()


In [4]:
# Setup Kaggle API Credentials:
# Make sure kaggle.json file exists in this folder
!mkdir -p ~/.kaggle
!cp "$KAGGLE_CREDENTIALS_PATH" ~/.kaggle/
!ls ~/.kaggle/

# Change the permissions of the file.
!chmod 600 ~/.kaggle/kaggle.json

kaggle.json
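
The same credential setup can be sketched in plain Python, which is handy outside a notebook where `!` shell commands are not available. The function name `install_kaggle_credentials` and its `home_dir` parameter are assumptions made for this example.

```python
import os
import shutil
import stat


def install_kaggle_credentials(src_path, home_dir=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner."""
    home_dir = home_dir or os.path.expanduser("~")
    target_dir = os.path.join(home_dir, ".kaggle")
    os.makedirs(target_dir, exist_ok=True)  # no error if it already exists
    target = os.path.join(target_dir, "kaggle.json")
    shutil.copyfile(src_path, target)
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)  # same effect as chmod 600
    return target
```

In this notebook it would be called as `install_kaggle_credentials(KAGGLE_CREDENTIALS_PATH)`.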


### Step 1.3 - Install necessary packages
First, install the dependencies below to get started. As these features are available on the main branches only, we need to install the libraries below from source.

In [5]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets scipy
!pip install -q trl

!pip install -q kaggle

### Step 1.4 - Imports

In [6]:
import json
import os
import warnings

import bitsandbytes as bnb
import pandas as pd
import torch
import transformers
from datasets import Dataset
from huggingface_hub import Repository
from peft import LoraConfig, PeftConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer

## Step 2 - Model loading
We'll load the model with 4-bit NF4 quantization, as used in QLoRA, to reduce memory usage.
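
To get a feel for the saving, here is a back-of-the-envelope sketch of the memory needed for the weights alone. The parameter count is an assumed approximate figure for Mistral 7B, and the estimate ignores quantization constants, activations, and optimizer state, so treat the numbers as rough.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7.24e9  # assumption: approximate parameter count for Mistral 7B

for label, bits in [("bf16", 16), ("nf4", 4)]:
    print(f"{label}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Going from 16 bits to 4 bits per weight cuts the weight footprint to a quarter, which is what makes a 7B model practical on a single GPU.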


In [7]:
# Load base model(Mistral 7B)
bnb_config = BitsAndBytesConfig(
    load_in_4bit= True,
    bnb_4bit_quant_type= 'nf4',
    bnb_4bit_compute_dtype= torch.bfloat16,
    bnb_4bit_use_double_quant= False,
 )


model_id = 'mistralai/Mistral-7B-Instruct-v0.1'

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [8]:
adapter_model_name = 'dfurman/Mistral-7B-Instruct-v0.2'
model = PeftModel.from_pretrained(model, adapter_model_name)

The cells above specify the model ID, load the base model with the quantization configuration defined in `bnb_config`, and attach an existing adapter with `PeftModel.from_pretrained`.

Run an inference on the base model. The model does not seem to understand our instruction and gives us a list of questions related to our query.

## Step 3 - Load dataset for finetuning

### Let's load the dataset

For this tutorial, we will fine-tune Mistral 7B Instruct. The explanatory text uses a code-generation dataset as its running example, while the code cells load a prompt-recovery CSV (`TRAIN_FILE_PATH`) with `original_text`, `rewritten_text`, and `rewrite_prompt` columns.

The running example is this [dataset](https://huggingface.co/datasets/TokenBender/code_instructions_122k_alpaca_style), curated by [TokenBender (e/xperiments)](https://twitter.com/4evaBehindSOTA), a useful data source for fine-tuning models for code generation. It follows the Alpaca style of instructions, which is a good starting point for this task. The dataset structure should resemble the following:

```json
{
  "instruction": "Create a function to calculate the sum of a sequence of integers.",
  "input": "[1, 2, 3, 4, 5]",
  "output": "# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum"
}
```

In [9]:
df = pd.read_csv(TRAIN_FILE_PATH)


In [10]:
#original text prefix
orig_prefix = "Original Text:"

#mistral "response"
llm_response_for_rewrite = "Provide the modified text and I'll tell you a something general that changed about it."
#modified text prefix
rewrite_prefix = "Re-written Text:"

#provided as start of Mistral response (anything after this is used as the prompt)
#providing this as the start of the response helps keep things relevant
response_start = "The request was: "

#We insert our detected prompt into this well-scoring baseline text
#thanks to: https://www.kaggle.com/code/rdxsun/lb-0-61
base_line = 'Please improve the following text using the writing style of {}, maintaining the original meaning but altering the tone, diction, and stylistic elements to match the new style.Enhance the clarity, elegance, and impact of the following text by adopting the writing style of , ensuring the core message remains intact while transforming the tone, word choice, and stylistic features to align with the specified style.'
base_line_swap_text = "[insert desired style here]"

In [11]:
#first example
example_text_1 = "Hey there! Just a heads up: our friendly dog may bark a bit, but don't worry, he's all bark and no bite!"
example_rewrite_1 = "Warning: Protective dog on premises. May exhibit aggressive behavior. Ensure personal safety by maintaining distance and avoiding direct contact."
example_prompt_1 = "Shift the narrative from offering an informal, comforting assurance to providing a more structured advisory note that emphasizes mindfulness and preparedness. This modification should mark a clear change in tone and purpose, aiming to responsibly inform and guide, while maintaining a positive and constructive approach."

#second example
example_text_2 = "A lunar eclipse happens when Earth casts its shadow on the moon during a full moon. The moon appears reddish because Earth's atmosphere scatters sunlight, some of which refracts onto the moon's surface. Total eclipses see the moon entirely in Earth's shadow; partial ones occur when only part of the moon is shadowed."
example_rewrite_2 = "Yo check it, when the Earth steps in, takes its place, casting shadows on the moon's face. It's a full moon night, the scene's set right, for a lunar eclipse, a celestial sight. The moon turns red, ain't no dread, it's just Earth's atmosphere playing with sunlight's thread, scattering colors, bending light, onto the moon's surface, making the night bright. Total eclipse, the moon's fully in the dark, covered by Earth's shadow, making its mark. But when it's partial, not all is shadowed, just a piece of the moon, slightly furrowed. So that's the rap, the lunar eclipse track, a dance of shadows, with no slack. Earth, moon, and sun, in a cosmic play, creating the spectacle we see today."
example_prompt_2 = "Transform your communication from an academic delivery to a dynamic, rhythm-infused presentation. Keep the essence of the information intact but weave in artistic elements, utilizing rhythm, rhyme, and a conversational style. This approach should make the content more relatable and enjoyable, aiming to both educate and entertain your audience."


In [12]:
def remove_numbered_list(text):
    final_text_paragraphs = []
    for line in text.split('\n'):
        # Split each line at the first occurrence of '. '
        parts = line.split('. ', 1)
        # If the line looks like a numbered list item, remove the numbering
        if len(parts) > 1 and parts[0].isdigit():
            final_text_paragraphs.append(parts[1])
        else:
            # If it doesn't look like a numbered list item, include the line as is
            final_text_paragraphs.append(line)

    return '  '.join(final_text_paragraphs)


#trims LLM output to just the response
def trim_to_response(text):
    terminate_string = "[/INST]"
    text = text.replace('</s>', '')
    #just in case it puts things in quotes
    text = text.replace('"', '')
    text = text.replace("'", '')

    last_pos = text.rfind(terminate_string)
    return text[last_pos + len(terminate_string):] if last_pos != -1 else text

#looks for response_start / returns only text that occurs after
def extract_text_after_response_start(full_text):
    parts = full_text.rsplit(response_start, 1)  # Split from the right, ensuring only the last occurrence is considered
    if len(parts) > 1:
        return parts[1].strip()  # Return text after the last occurrence of response_start
    else:
        return full_text  # Return the original text if response_start is not found


#trims text to requested number of sentences (or first LF or double-space sequence)
def trim_to_first_x_sentences_or_lf(text, x):
    if x <= 0:
        return ""

    #any double-spaces dealt with as linefeed
    text = text.replace("  ", "\n")

    text_chunks = text.split('\n', 1)
    first_chunk = text_chunks[0]
    sentences = first_chunk.split('.')

    if len(sentences) - 1 <= x:
        trimmed_text = first_chunk
    else:
        # Otherwise, return the first x sentences
        trimmed_text = '.'.join(sentences[:x]).strip()

    if not trimmed_text.endswith('.'):
        trimmed_text += '.'  # Ensure the result ends with a period (the join above drops the last one)

    return trimmed_text

In [13]:
def get_prompt(orig_text, transformed_text):

    #construct the prompt...
    messages = [

        #first example
        {"role": "user", "content": f"{orig_prefix} {example_text_1}"},
        {"role": "assistant", "content": llm_response_for_rewrite},
        {"role": "user", "content": f"{rewrite_prefix} {example_rewrite_1}"},
        {"role": "assistant", "content": f"{response_start} {example_prompt_1}"},

        #second example
        {"role": "user", "content": f"{orig_prefix} {example_text_2}"},
        {"role": "assistant", "content": llm_response_for_rewrite},
        {"role": "user", "content": f"{rewrite_prefix} {example_rewrite_2}"},
        {"role": "assistant", "content": f"{response_start} {example_prompt_2}"},

        #actual prompt
        {"role": "user", "content": f"{orig_prefix} {orig_text}"},
        {"role": "assistant", "content": llm_response_for_rewrite},
        {"role": "user", "content": f"{rewrite_prefix} {transformed_text}"},
        {"role": "assistant", "content": response_start},
    ]

    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
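
To make the shape of the resulting string easier to picture, here is a simplified stand-in for a Mistral-style chat template. It is not the tokenizer's actual template (spacing around the special tokens may differ there), and `render_chat` is a name invented for this sketch; the authoritative output is whatever `tokenizer.apply_chat_template` returns.

```python
def render_chat(messages, bos="<s>", eos="</s>"):
    """Simplified stand-in for a Mistral-style chat template."""
    parts = [bos]
    for i, message in enumerate(messages):
        expected = "user" if i % 2 == 0 else "assistant"
        if message["role"] != expected:
            raise ValueError("roles must alternate user/assistant, starting with user")
        if message["role"] == "user":
            # User turns are wrapped in instruction markers
            parts.append(f"[INST] {message['content']} [/INST]")
        else:
            # Assistant turns are followed by the end-of-sequence token
            parts.append(f"{message['content']}{eos}")
    return "".join(parts)


demo = [
    {"role": "user", "content": "Original Text: hi"},
    {"role": "assistant", "content": "The request was: "},
]
print(render_chat(demo))
```

Under this sketch, a conversation that ends with an assistant turn (as `get_prompt` does with `response_start`) is closed by an end-of-sequence token, which is worth keeping in mind when deciding where the training target should be appended.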

In [14]:
import re

def remove_special_characters(text):
    # Normalize two common instruction verbs before stripping punctuation
    text = text.replace("Transform", "improve")
    text = text.replace("Reimagine", "rewrite")
    # This regex will match any character that is not a letter, number, or whitespace
    pattern = r'[^a-zA-Z0-9\s]'
    # Replace these characters with an empty string
    return re.sub(pattern, '', text)

In [15]:
df = df[df['dataset_id'] == 'winddude'].copy()  # copy so the new column below is not written to a view

def create_prompt(original, rewritten):
    return f"Given are 2 essays, the Rewritten essay was created from the Original essay using the google Gemma model.\nYou are trying to understand how the original essay was transformed into a new version.\nAnalyzing the changes in style, theme, etc., please come up with a prompt that must have been used to guide the transformation from the original to the rewritten essay.\nOnly give me the PROMPT. Start directly with the prompt, that's all I need.\nOutput should be only line ONLY\nOriginal Essay: {original}\nRewritten Essay: {rewritten}"

# Apply the function to each row to create the prompt column
df['prompt'] = df.apply(lambda row: get_prompt(row['original_text'], row['rewritten_text']), axis=1)

In [16]:
df.head(10)

Unnamed: 0,original_text,rewrite_prompt,rewritten_text,dataset_id,original_text_emb_0,original_text_emb_1,original_text_emb_2,original_text_emb_3,original_text_emb_4,original_text_emb_5,...,rewritten_text_emb_759,rewritten_text_emb_760,rewritten_text_emb_761,rewritten_text_emb_762,rewritten_text_emb_763,rewritten_text_emb_764,rewritten_text_emb_765,rewritten_text_emb_766,rewritten_text_emb_767,prompt
66308,SEATTLE—Unsettled by the increasingly earnest ...,Reimagine this text as if it were written by a...,"""Greetings, my esteemed reporter,"" I say, my v...",winddude,0.005778,-0.084179,0.030904,0.060223,-0.006169,0.003787,...,0.0025,-0.028622,-0.046943,0.009117,0.027973,0.012686,0.019609,-0.031432,-0.031985,<s>[INST] Original Text: Hey there! Just a hea...
66309,The 4K PS4 might be officially announced soon....,Imagine this text was a tagline in a magical f...,Prepare for a magical journey into the future ...,winddude,-0.026545,-0.020338,0.012806,-0.006422,0.016024,0.014813,...,-0.000806,-0.033662,-0.035252,-0.004275,0.008637,-0.003343,-0.007551,-0.015777,0.017473,<s>[INST] Original Text: Hey there! Just a hea...
66310,"A mixed martial arts fighter, Donovan Duran, w...",Imagine this text was a jingle in New York Cit...,"Hey, listen up, New York City, hear a tale of ...",winddude,0.018677,-0.045941,0.033453,0.041751,-0.013858,0.001425,...,0.00211,-0.029991,-0.040549,0.009804,0.027882,0.004618,-0.010356,-0.05276,-0.021281,<s>[INST] Original Text: Hey there! Just a hea...
66311,The tradition is at odds with the beliefs of m...,Imagine this text was a dialogue in a magical ...,"""The tradition is at odds with the beliefs of ...",winddude,-0.022368,-0.041458,0.016112,0.005205,-0.014894,-0.018602,...,-0.012944,-0.044397,-0.034266,0.039086,0.021365,0.014388,0.006488,-0.027786,-0.019937,<s>[INST] Original Text: Hey there! Just a hea...
66312,WASHINGTON — As the F.B.I.’s Russia investigat...,"Imagine this text was a tanka in a beach town,...","President Trump's wrath unfurls,\nF.B.I. under...",winddude,-0.036094,-0.054545,0.044919,0.002418,0.013755,0.020774,...,-0.021223,-0.042991,-0.027977,-0.007202,-0.007491,0.003931,-0.042762,-0.041219,-0.020239,<s>[INST] Original Text: Hey there! Just a hea...
66313,You can and usually should write code that is ...,Imagine this text was a tanka in a parallel un...,"## Tanka Reformulation:\n\nCodename One, a dre...",winddude,-0.031255,-0.006096,0.044053,0.033941,0.003527,0.024237,...,0.022418,-0.032605,-0.030149,-0.017288,0.029892,0.007093,0.022632,-0.002271,0.028179,<s>[INST] Original Text: Hey there! Just a hea...
66314,With all that sweat and germs hanging out at t...,Imagine this text was a jingle in a space stat...,"At the gym or studio, germs and sweat,\nMake e...",winddude,0.026081,-0.046921,0.040496,0.030426,-0.004122,-0.015996,...,0.02324,-0.047423,-0.025274,-0.00726,0.005856,0.011778,0.006771,-0.014532,-0.013969,<s>[INST] Original Text: Hey there! Just a hea...
66315,Stack Exchange team used to create a Twitter b...,Imagine this text was a soliloquy in a space s...,"""My mind drifts to the bots that once danced a...",winddude,-0.023681,-0.054399,0.020013,-0.010236,-0.004958,0.005197,...,-0.022026,-0.031478,-0.044025,-0.008798,0.037525,0.016899,0.016934,-0.018917,0.009629,<s>[INST] Original Text: Hey there! Just a hea...
66316,"The Butaro Hospital, in the Burera District of...",Imagine this text was a fable in New York City...,"In the heart of New York City, where skyscrape...",winddude,0.025348,-0.025641,0.014201,0.047272,-0.027835,0.010071,...,-0.02896,-0.010523,-0.059867,-0.004725,0.049029,0.012849,0.018622,-0.024278,-0.007008,<s>[INST] Original Text: Hey there! Just a hea...
66317,(CNN) Billionaire businessman and powerful Rep...,Imagine this text was a limerick in New York C...,"There once was a businessman named Koch,\nA Re...",winddude,-0.023753,-0.050662,0.016501,-0.005776,0.051477,0.030206,...,0.029071,-0.044569,-0.047487,0.013659,0.036916,0.014291,0.009717,-0.083997,0.023991,<s>[INST] Original Text: Hey there! Just a hea...


Instruction fine-tuning: prepare the dataset in a "prompt" format that the model can learn from:
1. the `generate_prompt` function takes the instruction and output and generates a prompt
2. shuffle the dataset
3. tokenize the dataset

### Formatting the Dataset

Now, let's format the dataset in the required [Mistral-7B-Instruct-v0.1 format](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1).

> Many tutorials and blogs skip over this part, but I feel this is a really important step.

We'll put each instruction and input pair between `[INST]` and `[/INST]`, with the output after that, like this:

```
<s>[INST] What is your favorite condiment? [/INST]
Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavor to whatever I'm cooking up in the kitchen!</s>
```
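
As a small illustration of this layout, the sketch below wraps one Alpaca-style record. The function name `format_example` and the "here are the inputs" joiner are assumptions for this example, not part of any library.

```python
def format_example(instruction, output, input_text=""):
    """Wrap one record in the [INST] ... [/INST] layout."""
    if input_text:
        user_turn = f"{instruction} here are the inputs {input_text}"
    else:
        user_turn = instruction
    # Output follows the closing instruction marker and ends with </s>
    return f"<s>[INST] {user_turn} [/INST]\n{output}</s>"


print(format_example(
    "Create a function to calculate the sum of a sequence of integers.",
    "def sum_sequence(sequence): return sum(sequence)",
    "[1, 2, 3, 4, 5]",
))
```

If the tokenizer also adds a beginning-of-sequence token on its own, the literal `<s>` in the string may end up duplicated, so it is worth inspecting a tokenized sample before training.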

The following function formats a single row in this layout (its application to the DataFrame is commented out in this notebook):

In [17]:
def generate_prompt(row):
    """Generate input text based on a prompt, task instruction, and answer.

    :param row: Series: Data point (row of a pandas DataFrame)
    :return: str: generated prompt text
    """
    # Assuming 'prompt' is the generated context and 'rewrite_prompt' is the instruction
    return f"<s>[INST]{row['prompt']} [/INST]{row['rewrite_prompt']}</s>"

# Apply the function to each row to transform the 'prompt' and 'rewrite_prompt' columns
#df['prompt'] = df.apply(generate_prompt, axis=1)

In [18]:
df = df[['prompt', 'rewrite_prompt']]
# Convert the pandas DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(df)
#dataset = dataset.select(range(10)) # ToDo: remove that
# Shuffle the dataset with a seed for reproducibility
dataset = dataset.shuffle(seed=1234)

# Tokenize the prompts in the dataset.
# This example assumes that the model requires only 'input_ids'.
# Adjust as necessary for your model, e.g., adding 'attention_mask'.
dataset = dataset.map(lambda samples: tokenizer(samples["prompt"]), batched=True)

# Split the dataset into training and testing sets (90% training, 10% testing)
train_test_split = dataset.train_test_split(test_size=0.1)
train_data = train_test_split['train']
test_data = train_test_split['test']


The cell above tokenizes the prompts so the model can consume them, then splits the dataset into 90% for training and 10% for testing.
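
The split sizes can be checked by hand. The sketch below assumes the test share is rounded up, which is consistent with the row counts printed in this notebook (56566 training rows and 6286 test rows, 62852 in total); `split_sizes` is a name made up for the example.

```python
import math


def split_sizes(n_rows, test_fraction):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(n_rows * test_fraction)
    return n_rows - n_test, n_test


print(split_sizes(62852, 0.1))  # (56566, 6286)
```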

### After formatting, we should get something like this

```json
{
  "text": "<s>[INST] Create a function to calculate the sum of a sequence of integers. here are the inputs [1, 2, 3, 4, 5] [/INST]\n# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum</s>",
  "instruction": "Create a function to calculate the sum of a sequence of integers",
  "input": "[1, 2, 3, 4, 5]",
  "output": "# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum",
  "prompt": "<s>[INST] Create a function to calculate the sum of a sequence of integers. here are the inputs [1, 2, 3, 4, 5] [/INST]\n# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum</s>"
}
```

While using SFT (**[Supervised Fine-tuning Trainer](https://huggingface.co/docs/trl/main/en/sft_trainer)**) for fine-tuning, we pass in only one text column of the dataset (the `prompt` column in this notebook, selected through `dataset_text_field`).

In [19]:
print(test_data)

Dataset({
    features: ['prompt', 'rewrite_prompt', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 6286
})


## Step 4 - Apply LoRA
Here comes the magic with PEFT! Let's load a `PeftModel` and specify that we are going to use low-rank adapters (LoRA) with the `get_peft_model` utility function and the `prepare_model_for_kbit_training` method from PEFT.

In [20]:
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [21]:
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): 

Use the following function to find the linear layers to target for fine-tuning.
From the QLoRA paper: "We find that the most critical LoRA hyperparameter is how many LoRA adapters are used in total and that LoRA on all linear transformer block layers is required to match full finetuning performance."

In [22]:
def find_all_linear_names(model):
    """Return the leaf names of all 4-bit linear modules, excluding lm_head."""
    cls = bnb.nn.Linear4bit  # if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            # Keep only the last path component, e.g. 'up_proj'
            lora_module_names.add(name.split('.')[-1])
    lora_module_names.discard('lm_head')  # needed for 16-bit
    return list(lora_module_names)

In [23]:
modules = find_all_linear_names(model)
print(modules)

['up_proj', 'down_proj', 'gate_proj', 'base_layer']
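
The `base_layer` entry in this output appears to come from the adapter attached in cell 8: the printed model shows the attention projections already wrapped, so their 4-bit layers sit at paths ending in `.base_layer`. The sketch below reproduces the name extraction on hypothetical module paths and shows one way to filter such entries out; `leaf_names` and its `skip` parameter are assumptions for this example.

```python
def leaf_names(module_paths, skip=("lm_head", "base_layer")):
    """Collect the last path component of each module, minus unwanted names."""
    leaves = {path.split(".")[-1] for path in module_paths}
    return sorted(leaves - set(skip))


# Hypothetical module paths, shaped like the names PyTorch reports.
paths = [
    "model.layers.0.self_attn.q_proj.base_layer",  # already wrapped by an adapter
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.0.mlp.gate_proj",
]
print(leaf_names(paths, skip=()))  # ['base_layer', 'down_proj', 'gate_proj', 'up_proj']
print(leaf_names(paths))           # ['down_proj', 'gate_proj', 'up_proj']
```

Whether `base_layer` should be kept as a LoRA target is a design decision; running the discovery function on the base model before attaching any adapter would avoid the question.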


In [24]:
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=modules,
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM'
)

model = get_peft_model(model, lora_config)
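
To see why LoRA is cheap to train, it helps to count the parameters one adapter adds. For a linear layer with `in_features` inputs and `out_features` outputs, rank `r` adds two small matrices with `r * (in_features + out_features)` entries in total. The sketch below uses the 4096 x 4096 projection shape from the model printout; the function name is made up for the example.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


dense = 4096 * 4096                       # frozen base weight
added = lora_param_count(4096, 4096, 8)   # trainable LoRA weights at r=8
print(dense, added, f"{added / dense:.2%}")  # 16777216 65536 0.39%
```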

In [25]:
#trainable, total = model.get_nb_trainable_parameters()
#print(f"Trainable: {trainable} | total: {total} | Percentage: {trainable/total*100:.4f}%")


## Step 5 - Run the training!

Setting the training arguments:
* For demonstration purposes, we run only a few training steps (`max_steps=25` in the `SFTTrainer` cell; the commented-out `Trainer` cell uses 100), just to showcase how to use this integration with existing tools in the HF ecosystem.

In [26]:
# from datasets import load_dataset
# data = load_dataset("TokenBender/code_instructions_122k_alpaca_style", split='train')
# data = data.train_test_split(test_size=0.1)
# train_data = data["train"]
# test_data = data["test"]

In [27]:
# import transformers
# tokenizer.pad_token = tokenizer.eos_token

# trainer = transformers.Trainer(
#      model=model,
#      train_dataset=train_data,
#      eval_dataset=test_data,
#      args=transformers.TrainingArguments(
#          per_device_train_batch_size=1,
#          gradient_accumulation_steps=4,
#          warmup_ratio=0.03,
#          max_steps=100,
#          learning_rate=2e-4,
#          fp16=True,
#          logging_steps=1,
#          output_dir="outputs_mistral_b_finance_finetuned_test",
#          optim="paged_adamw_8bit",
#          save_strategy="epoch",
#      ),
#      data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
#  )


### Fine-Tuning with QLoRA and Supervised Fine-Tuning

We're ready to fine-tune our model using QLoRA. For this tutorial, we'll use the `SFTTrainer` from the `trl` library for supervised fine-tuning. Ensure that you've installed the `trl` library as shown in Step 1.3.

In [28]:
model = model.to(device='cuda', dtype=torch.bfloat16)

In [29]:
print(train_data)

Dataset({
    features: ['prompt', 'rewrite_prompt', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 56566
})


In [30]:
#new code using SFTTrainer
tokenizer.pad_token = tokenizer.eos_token
torch.cuda.empty_cache()
trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=test_data,
    dataset_text_field='prompt',
    peft_config=lora_config,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        warmup_ratio=0.03,
        max_steps=25,
        learning_rate=2e-4,
        logging_steps=1,
        output_dir='/content/drive/My Drive/Colab Notebooks/MyModelCheckpoints',
        optim='paged_adamw_8bit',
        save_strategy='steps',
        save_steps=10,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)



max_steps is given, it will override any value given in num_train_epochs
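
A quick calculation shows how much data this short run actually touches. The sketch assumes a single GPU and uses the batch settings and training-set size from the cells above; `training_coverage` is a name invented for the example.

```python
def training_coverage(per_device_batch, grad_accum, max_steps, n_train, n_devices=1):
    """Return (effective batch size, examples seen, fraction of one epoch)."""
    effective_batch = per_device_batch * grad_accum * n_devices
    examples_seen = effective_batch * max_steps
    return effective_batch, examples_seen, examples_seen / n_train


batch, seen, fraction = training_coverage(8, 4, 25, 56566)
print(batch, seen, f"{fraction:.2%}")  # 32 800 1.41%
```

With 25 steps the model sees only a small slice of the training set, so this run is a smoke test of the pipeline rather than a full fine-tune.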


### Let's start the training process

In [31]:
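# Set the allocator option through the environment; a bare Python variable has no effect.
# PyTorch reads it when the CUDA allocator is first used, so ideally set it before loading the model.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'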

In [None]:
# Suppress all warnings
warnings.filterwarnings('ignore')

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()




Step,Training Loss
1,2.7314
2,2.716
3,2.0654


## Step 6 - Save the model

In [None]:
# Save the model to the repo
model.save_pretrained(MODEL_PATH)
tokenizer.save_pretrained(MODEL_PATH)

In [None]:
# # Or load from the file
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit= True,
#     bnb_4bit_quant_type= 'nf4',
#     bnb_4bit_compute_dtype= torch.bfloat16,
#     bnb_4bit_use_double_quant= False,
#  )

# model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, quantization_config=bnb_config, device_map={"":0})
# tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, add_eos_token=True)


### Save to Hugging Face

### Save to Kaggle


In [None]:
# Ensure the directory exists
os.makedirs(KAGGLE_DATASET_PATH, exist_ok=True)

# Copy the model files to the directory
!cp "$MODEL_PATH"/* "$KAGGLE_DATASET_PATH"/
!ls "$KAGGLE_DATASET_PATH"


In [None]:
# Add a metadata file
metadata = {
    'title': KAGGLE_DATASET_NAME,
    'id': f"{KAGGLE_USERNAME}/{KAGGLE_DATASET_NAME}",
    'licenses': [{'name': 'CC0-1.0'}] # ToDo: Use that license? What metadata?
}

with open(os.path.join(KAGGLE_DATASET_PATH, 'dataset-metadata.json'), 'w') as f:
    json.dump(metadata, f)

In [None]:
# Add the test CSV file
# ToDo: Add the real test CSV or remove
# Note: TEST_FILE_PATH is not defined in Step 1.1; define it before running this cell
!cp "$TEST_FILE_PATH" "$KAGGLE_DATASET_PATH"

In [None]:
# Upload the dataset to Kaggle
# Create the first time
!kaggle datasets create -p {KAGGLE_DATASET_PATH} --dir-mode zip
# Update an existing dataset
#!kaggle datasets version -p {KAGGLE_DATASET_PATH} -m "Update dataset"

## Step 7 - Evaluating the model qualitatively: run an inference!



In [None]:
# import re
# def remove_special_characters(text):
#     # This regex will match any character that is not a letter, number, or whitespace
#     pattern = r'[^a-zA-Z0-9\s]'
#     text =text.replace("Transform" ,"improve")
#     text =text.replace("Reimagine" ,"rewrite")
#     # Replace these characters with an empty string
#     clean_text = re.sub(pattern, '', text)
#     return clean_text

# def get_completion(prompt) -> str:
#     device = "cuda:0"
#     print('prompt')
#     print(prompt)
#     encodeds = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)

#     model_inputs = encodeds.to(device)


#     generated_ids = model.generate(**model_inputs, max_new_tokens=200, do_sample=True, pad_token_id=tokenizer.eos_token_id)
#     generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
#     print('gen')
#     print(generated_text)
#     prompt_length = len(tokenizer.encode(prompt))
#     print('output')
#     prompt_tokens = tokenizer.encode(prompt, add_special_tokens=False)
#     generated_tokens = generated_ids[0][len(prompt_tokens):]
#     generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
#     print(generated_text)
#     return generated_text

In [None]:
# test_df = pd.read_csv("/kaggle/input/llm-prompt-recovery/test.csv")
# test_df['prompt'] = test_df.apply(lambda row: create_prompt(row['original_text'], row['rewritten_text']), axis=1)
# test_df['prompt'] = test_df.apply(generate_prompt, axis=1)
# test_df['rewrite_prompt'] = test_df.apply(lambda row: get_completion(row['prompt']), axis=1)
# test_df = test_df[['id', 'rewrite_prompt']]
# test_df

In [None]:
# test_df = pd.read_csv("/kaggle/input/llm-prompt-recovery/test.csv")
# adject =[]
# for index, row in test_df.iterrows():
#         result = generate_prompt(create_prompt(row['original_text'], row['rewritten_text']))
#         result  = remove_special_characters(result)
#         print(result)
#         test_df.at[index, 'rewrite_prompt'] =  result




# test_df = test_df[['id', 'rewrite_prompt']]
# test_df


In [None]:
# test_df.to_csv('submission.csv', index=False)