## Step 1 - Install necessary packages
First, install the dependencies below to get started. The commands use `--no-index --find-links`, so pip installs each library offline from wheels in the attached input datasets instead of downloading it from PyPI.

In [67]:
# Click on "File + Add Input"
# Search for the dataset named "llm-prompt-recovery"
# Once you find it, click on "Add" to attach it to your Kernel
ADAPTER_MODEL_PATH = '/kaggle/input/d/valentinalilova/llm-prompt-recovery'
TEST_CSV_PATH = ADAPTER_MODEL_PATH + '/test_example.csv'

In [68]:
!pip install -q -U transformers --no-index --find-links ../input/llm-detect-pip/

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
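
The warning above names its own fix: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs in a cell before `transformers` is imported and before any `!pip` call forks the process:

```python
import os

# Setting the variable explicitly silences the fork warning.
# "false" disables the tokenizers library's internal parallelism.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```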


In [69]:
!pip install -q -U accelerate --no-index --find-links ../input/llm-detect-pip/
!pip install -q -U bitsandbytes --no-index --find-links ../input/transformers-4-38-2/
!pip install -q -U transformers --no-index --find-links ../input/llm-detect-pip/
!pip install /kaggle/input/peft-whl-latest/peft-0.10.0-py3-none-any.whl

Processing /kaggle/input/peft-whl-latest/peft-0.10.0-py3-none-any.whl
peft is already installed with the same version as the provided wheel. Use --force-reinstall to force an installation of the wheel.


In [70]:
!pip install -q -U peft --no-index --find-links ../input/llm-pkg/



In [71]:
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig


## Step 2 - Model loading
We'll load the base model in 4-bit NF4 precision with bitsandbytes, the quantization scheme used by QLoRA, to reduce GPU memory usage.
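
To get a feel for the saving, here is a rough back-of-the-envelope estimate of weight memory at 16 bits versus 4 bits per parameter. The parameter count is an approximation for a 7B-class model, and the helper `weight_memory_gib` is a hypothetical name. The estimate ignores quantization constants, layers kept in higher precision, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to store n_params weights, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7.2e9  # assumption: approximate parameter count of a 7B-class model

for label, bits in [("bf16", 16), ("nf4", 4)]:
    print(f"{label}: ~{weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Storing weights in 4 bits takes a quarter of the space of 16-bit weights, which is what lets a 7B model fit comfortably on a single Kaggle GPU.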


In [72]:
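# Load the base model (Mistral 7B) with 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the actual computation
    bnb_4bit_use_double_quant=False,        # do not quantize the quantization constants
)

model_id = "/kaggle/input/mistral-7b-it-v02"

# device_map={"": 0} places the whole model on GPU 0
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"": 0})
tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True)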

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [73]:
from peft import PeftModel

# Attach the fine-tuned LoRA adapter to the quantized base model
model = PeftModel.from_pretrained(model, ADAPTER_MODEL_PATH)

In [74]:
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralSdpaAttention(
              (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
              (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
              (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (rotary_emb): MistralRotaryEmbedding()
            )
            (mlp): MistralMLP(
              (gate_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=14336, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): Modu

In [75]:
def create_prompt(original, rewritten):
    return f"Given are 2 essays, the Rewritten essay was created from the Original essay using the google Gemma model.\nYou are trying to understand how the original essay was transformed into a new version.\nAnalyzing the changes in style, theme, etc., please come up with a prompt that must have been used to guide the transformation from the original to the rewritten essay.\nOnly give me the PROMPT. Start directly with the prompt, that's all I need.\nOutput should be only line ONLY\nOriginal Essay: {original}\nRewritten Essay: {rewritten}"


### Formatting the Dataset

Now, let's format the dataset in the required [Mistral-7B-Instruct-v0.1 format](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1).

> Many tutorials and blogs skip over this part, but I feel this is a really important step.

We'll put each instruction and input pair between `[INST]` and `[/INST]`, with the expected output after the closing tag, like this:

```
<s>[INST] What is your favorite condiment? [/INST]
Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavor to whatever I'm cooking up in the kitchen!</s>
```
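
As a small sketch of that layout: the two helpers below are hypothetical names, not part of this notebook, and the whitespace around the tags follows the example above. A training example carries the answer and a closing `</s>`; at inference time I'd normally expect the prompt to stop right after `[/INST]` so that the model writes the answer itself.

```python
def format_training_example(instruction: str, answer: str) -> str:
    """Full example: instruction, then the answer, closed by the end-of-sequence token."""
    return f"<s>[INST] {instruction} [/INST] {answer}</s>"


def format_inference_prompt(instruction: str) -> str:
    """Prompt only: the model is expected to continue after [/INST]."""
    return f"<s>[INST] {instruction} [/INST]"


print(format_training_example("What is your favorite condiment?", "Fresh lemon juice."))
print(format_inference_prompt("What is your favorite condiment?"))
```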

The following function wraps each prompt in this format:

In [76]:
def generate_prompt(row):
    """Wrap a row's prompt text in the Mistral instruction format.

    :param row: Series: Data point (row of a pandas DataFrame)
    :return: str: generated prompt text
    """
    # 'prompt' holds the text built by create_prompt()
    return f"<s>[INST]{row['prompt']}[/INST]</s>"


## Step 3 - Inference

In [77]:
import re
# def remove_special_characters(text):
#     # This regex will match any character that is not a letter, number, or whitespace
#     pattern = r'[^a-zA-Z0-9\s]'
#     text =text.replace("Transform" ,"improve")
#     text =text.replace("Reimagine" ,"rewrite")
#     # Replace these characters with an empty string
#     clean_text = re.sub(pattern, '', text)
#     return clean_text

def get_completion(prompt) -> str:
    """Generate a completion for `prompt` and return only the newly generated text."""
    device = "cuda:0"
    print('prompt')
    print(prompt)
    encodeds = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)

    model_inputs = encodeds.to(device)

    generated_ids = model.generate(**model_inputs, max_new_tokens=200, do_sample=True, pad_token_id=tokenizer.eos_token_id)
    generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print('gen')
    print(generated_text)
    print('output')
    # generate() returns the prompt tokens followed by the new tokens, so slice
    # by the length of the ids that were actually fed to the model.
    prompt_length = model_inputs["input_ids"].shape[1]
    generated_tokens = generated_ids[0][prompt_length:]
    generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
    print(generated_text)
    return generated_text
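
Separating the completion from the prompt only works if the prompt length is counted on the same encoding that was passed to `generate`. Here is a toy illustration with plain lists; `toy_encode` is a hypothetical stand-in for a tokenizer, and the token ids are made up.

```python
BOS, EOS = 1, 2


def toy_encode(words, add_special_tokens=True):
    """Hypothetical tokenizer stand-in: one id per word, plus an optional BOS."""
    ids = [10 + i for i, _ in enumerate(words)]
    return [BOS] + ids if add_special_tokens else ids


prompt_words = ["[INST]", "rewrite", "this", "[/INST]"]
input_ids = toy_encode(prompt_words, add_special_tokens=True)

# A decoder-only generate() call returns the prompt ids followed by the new ids.
new_tokens = [50, 51, EOS]
generated_ids = input_ids + new_tokens

# Counting on the ids that were fed in isolates exactly the new tokens.
correct = generated_ids[len(input_ids):]

# Counting without special tokens is one short, so a prompt token leaks through.
leaky = generated_ids[len(toy_encode(prompt_words, add_special_tokens=False)):]

print(correct)
print(leaky)
```

With a real tokenizer the leaked tokens may happen to be special tokens that `skip_special_tokens=True` hides, but relying on that is fragile.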

In [78]:
# test_df = pd.read_csv("/kaggle/input/llm-prompt-recovery/test.csv")
test_df = pd.read_csv(TEST_CSV_PATH)
test_df['prompt'] = test_df.apply(lambda row: create_prompt(row['original_text'], row['rewritten_text']), axis=1)
test_df['prompt'] = test_df.apply(generate_prompt, axis=1)
test_df['rewrite_prompt'] = test_df.apply(lambda row: get_completion(row['prompt']), axis=1)
test_df = test_df[['id', 'rewrite_prompt']]
test_df

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


prompt
<s>[INST]Given are 2 essays, the Rewritten essay was created from the Original essay using the google Gemma model.
You are trying to understand how the original essay was transformed into a new version.
Analyzing the changes in style, theme, etc., please come up with a prompt that must have been used to guide the transformation from the original to the rewritten essay.
Only give me the PROMPT. Start directly with the prompt, that's all I need.
Output should be only line ONLY
Original Essay: The competition dataset comprises text passages that have been rewritten by the Gemma LLM according to some rewrite_prompt instruction. The goal of the competition is to determine what prompt was used to rewrite each original text.  Please note that this is a Code Competition. When your submission is scored, this example test data will be replaced with the full test set. Expect roughly 2,000 original texts in the test set.
Rewritten Essay: Here is your shanty: (Verse 1) The text is rewritten,

gen
[INST]Given are 2 essays, the Rewritten essay was created from the Original essay using the google Gemma model.
You are trying to understand how the original essay was transformed into a new version.
Analyzing the changes in style, theme, etc., please come up with a prompt that must have been used to guide the transformation from the original to the rewritten essay.
Only give me the PROMPT. Start directly with the prompt, that's all I need.
Output should be only line ONLY
Original Essay: The competition dataset comprises text passages that have been rewritten by the Gemma LLM according to some rewrite_prompt instruction. The goal of the competition is to determine what prompt was used to rewrite each original text.  Please note that this is a Code Competition. When your submission is scored, this example test data will be replaced with the full test set. Expect roughly 2,000 original texts in the test set.
Rewritten Essay: Here is your shanty: (Verse 1) The text is rewritten, the L

gen
[INST]Given are 2 essays, the Rewritten essay was created from the Original essay using the google Gemma model.
You are trying to understand how the original essay was transformed into a new version.
Analyzing the changes in style, theme, etc., please come up with a prompt that must have been used to guide the transformation from the original to the rewritten essay.
Only give me the PROMPT. Start directly with the prompt, that's all I need.
Output should be only line ONLY
Original Essay: Understanding the intricate dance of subatomic particles unveils the fabric of the cosmos.
Rewritten Essay: Behold the cosmic ballet: Where particles tiny, in intricate play, Weave the universe's endless fray.[/INST]PROMPT: Rewrite the given sentence preserving its meaning but using figurative language to describe the movement and interconnectedness of subatomic particles in the universe.
output
PROMPT: Rewrite the given sentence preserving its meaning but using figurative language to describe the 

gen
[INST]Given are 2 essays, the Rewritten essay was created from the Original essay using the google Gemma model.
You are trying to understand how the original essay was transformed into a new version.
Analyzing the changes in style, theme, etc., please come up with a prompt that must have been used to guide the transformation from the original to the rewritten essay.
Only give me the PROMPT. Start directly with the prompt, that's all I need.
Output should be only line ONLY
Original Essay: Economic theories evolve, mirroring the shifting sands of time, adapting to new realities.
Rewritten Essay: Like dunes under the desert sun, economic beliefs shift, morph, and turn, ever-changing under time's relentless churn.[/INST] evoke an image of economic beliefs as fluid and dynamic, shaped by the passage of time and influenced by current realities, using vivid metaphors from nature.
output
evoke an image of economic beliefs as fluid and dynamic, shaped by the passage of time and influenced b

gen
[INST]Given are 2 essays, the Rewritten essay was created from the Original essay using the google Gemma model.
You are trying to understand how the original essay was transformed into a new version.
Analyzing the changes in style, theme, etc., please come up with a prompt that must have been used to guide the transformation from the original to the rewritten essay.
Only give me the PROMPT. Start directly with the prompt, that's all I need.
Output should be only line ONLY
Original Essay: The ancient library held countless tomes, a testament to humanity's unquenchable thirst for knowledge.
Rewritten Essay: In halls of whispers, bound by dust, the library's tomes echo mankind's undying knowledge lust.[/INST] Imaginate transforming the original essay by employing a more poetic and metaphorical language to maintain the essence of humanity's insatiable desire for knowledge.
output
Imaginate transforming the original essay by employing a more poetic and metaphorical language to maintain 

gen
[INST]Given are 2 essays, the Rewritten essay was created from the Original essay using the google Gemma model.
You are trying to understand how the original essay was transformed into a new version.
Analyzing the changes in style, theme, etc., please come up with a prompt that must have been used to guide the transformation from the original to the rewritten essay.
Only give me the PROMPT. Start directly with the prompt, that's all I need.
Output should be only line ONLY
Original Essay: Quantum computing, with its potential to revolutionize our world, remains shrouded in mystery and complexity.
Rewritten Essay: Quantum realms, vast and untamed, promise revolution, yet cloak their game in layers of enigma and arcane flame.[/INST]	Prompt: Transform the "Quantum computing" language to a more metaphorical and poetic tone while maintaining the themes of mystery, potential revolution, and complexity.
output
Prompt: Transform the "Quantum computing" language to a more metaphorical and po

gen
[INST]Given are 2 essays, the Rewritten essay was created from the Original essay using the google Gemma model.
You are trying to understand how the original essay was transformed into a new version.
Analyzing the changes in style, theme, etc., please come up with a prompt that must have been used to guide the transformation from the original to the rewritten essay.
Only give me the PROMPT. Start directly with the prompt, that's all I need.
Output should be only line ONLY
Original Essay: The art of fermentation, an age-old practice, transforms simple ingredients into culinary treasures.
Rewritten Essay: Through time's slow press, the humble grape becomes wine's caress, a transformation deep, almost divine.[/INST]"Transform the description of the art of fermentation into a metaphor involving time and the creation of a specific ingredient into a precious product."
output
"Transform the description of the art of fermentation into a metaphor involving time and the creation of a specifi

gen
[INST]Given are 2 essays, the Rewritten essay was created from the Original essay using the google Gemma model.
You are trying to understand how the original essay was transformed into a new version.
Analyzing the changes in style, theme, etc., please come up with a prompt that must have been used to guide the transformation from the original to the rewritten essay.
Only give me the PROMPT. Start directly with the prompt, that's all I need.
Output should be only line ONLY
Original Essay: In the heart of the forest, the ancient tree stands tall, a silent witness to centuries gone by.
Rewritten Essay: Where shadows gather, and time slows its crawl, there stands the elder tree, forest's immortal thrall.[/INST] "Transform the original essay into a more poetic and enchanting version by using imagery that emphasizes the tree's age, mystery, and connection to the forest."
output
"Transform the original essay into a more poetic and enchanting version by using imagery that emphasizes the tr

gen
[INST]Given are 2 essays, the Rewritten essay was created from the Original essay using the google Gemma model.
You are trying to understand how the original essay was transformed into a new version.
Analyzing the changes in style, theme, etc., please come up with a prompt that must have been used to guide the transformation from the original to the rewritten essay.
Only give me the PROMPT. Start directly with the prompt, that's all I need.
Output should be only line ONLY
Original Essay: The exploration of Mars opens new frontiers, challenging humanity to dream bigger.
Rewritten Essay: Mars beckons, a red jewel in the night, daring us to dream, to reach new heights.[/INST]"Write an essay that encourages a sense of wonder and exploration towards Mars, using descriptive language and an inviting tone."
output
"Write an essay that encourages a sense of wonder and exploration towards Mars, using descriptive language and an inviting tone."
prompt
<s>[INST]Given are 2 essays, the Rewritte

gen
[INST]Given are 2 essays, the Rewritten essay was created from the Original essay using the google Gemma model.
You are trying to understand how the original essay was transformed into a new version.
Analyzing the changes in style, theme, etc., please come up with a prompt that must have been used to guide the transformation from the original to the rewritten essay.
Only give me the PROMPT. Start directly with the prompt, that's all I need.
Output should be only line ONLY
Original Essay: The deep ocean, Earth's final frontier, teems with life, much of which remains to be discovered.
Rewritten Essay: Beneath the waves, in silent, dark embrace, lies a world unseen, an uncharted place.[/INST] prospect: Transform the original essay's description of the deep ocean as Earth's final frontier to a more mysterious and unexplored tone by using metaphors and allusions to the hidden and unknown world beneath the waves.
output
prospect: Transform the original essay's description of the deep oce

gen
[INST]Given are 2 essays, the Rewritten essay was created from the Original essay using the google Gemma model.
You are trying to understand how the original essay was transformed into a new version.
Analyzing the changes in style, theme, etc., please come up with a prompt that must have been used to guide the transformation from the original to the rewritten essay.
Only give me the PROMPT. Start directly with the prompt, that's all I need.
Output should be only line ONLY
Original Essay: Artificial intelligence, once the realm of science fiction, now shapes the fabric of daily life.
Rewritten Essay: From the pages of fiction to the core of our days, AI weaves its threads, in unseen ways.[/INST] "Transform the original essay's metaphor from Artificial Intelligence being a shaper to Artificial Intelligence being a weaver in the fabric of daily life"
output
"Transform the original essay's metaphor from Artificial Intelligence being a shaper to Artificial Intelligence being a weaver in

|   | id | rewrite_prompt |
|---|----|----------------|
| 0 | -1 | "Transform the original essay about a text com... |
| 1 | 0 | PROMPT: Rewrite the given sentence preserving ... |
| 2 | 1 | evoke an image of economic beliefs as fluid an... |
| 3 | 2 | Imaginate transforming the original essay by e... |
| 4 | 3 | Prompt: Transform the "Quantum computing" lang... |
| 5 | 4 | "Transform the description of the art of ferme... |
| 6 | 5 | "Transform the original essay into a more poet... |
| 7 | 6 | "Write an essay that encourages a sense of won... |
| 8 | 7 | prospect: Transform the original essay's descr... |
| 9 | 8 | "Transform the original essay's metaphor from ... |
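
Several of the predictions above carry a leading label (`PROMPT:`, `Prompt:`) or are wrapped in quotes, even though the instruction asks for the prompt only. One possible post-processing step is sketched below. `clean_prediction` is a hypothetical helper, not part of this notebook, and whether it improves the competition score is untested.

```python
import re

# Matches a leading "prompt:" label in any casing, with optional whitespace.
_LABEL = re.compile(r"^\s*prompt\s*:\s*", flags=re.IGNORECASE)


def clean_prediction(text: str) -> str:
    """Drop a leading 'Prompt:' label and one pair of surrounding quotes."""
    text = _LABEL.sub("", text.strip())
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        text = text[1:-1].strip()
    return text


print(clean_prediction('PROMPT: Rewrite the given sentence.'))
print(clean_prediction('"Write an essay."'))
```

It could be applied with `test_df['rewrite_prompt'].map(clean_prediction)` before the submission is written.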


In [79]:
test_df.to_csv('submission.csv', index=False)
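
Before submitting, it can be worth checking that the file has exactly the `id` and `rewrite_prompt` columns selected above and that no prediction is empty. A minimal sketch using only the standard library; `validate_submission` is a hypothetical helper name.

```python
import csv
import io


def validate_submission(csv_text: str) -> int:
    """Return the number of data rows, raising ValueError on a malformed file."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["id", "rewrite_prompt"]:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    n_rows = 0
    for row in reader:
        # A missing or whitespace-only prediction counts as empty.
        if not (row["rewrite_prompt"] or "").strip():
            raise ValueError(f"empty rewrite_prompt for id {row['id']}")
        n_rows += 1
    return n_rows


# Toy data for illustration only; in the notebook, pass the text of submission.csv.
example = "id,rewrite_prompt\n0,Rewrite this as a sea shanty\n1,Make it more poetic\n"
print(validate_submission(example))
```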