## Login to Hugging Face

In [1]:
from dotenv import load_dotenv
import os
from huggingface_hub import login

load_dotenv()
token = os.getenv("HUGGINGFACE_TOKEN")
login(
    token=token, # ADD YOUR TOKEN HERE
    add_to_git_credential=True
)

Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /home/pathfinder/.cache/huggingface/token
Login successful


In [2]:
# local
output_dir = "./results" # ADD YOUR OUTPUT DIRECTORY HERE

## Downloads

In [3]:
#!pip install huggingface_hub
#!pip install transformers
#!pip install bitsandbytes
#!pip install peft
#!pip install trl
#!pip install accelerate
#!pip install datasets
#!pip install scikit-learn
#!pip install packaging
#!pip install ninja
#!pip install flash-attn --no-build-isolation

## Imports

In [4]:
from tqdm import tqdm
from IPython.display import display, Markdown

# pytorch
import torch

# huggingface
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments
)
from peft import LoraConfig
from trl import SFTTrainer

# datasets
import pandas as pd
from datasets import Dataset

## Device

In [5]:
device = (
    "cuda:0" if torch.cuda.is_available() else # Nvidia GPU
    "mps" if torch.backends.mps.is_available() else # Apple Silicon GPU
    "cpu"
)
print(f"Device = {device}")

Device = cuda:0


## Hyperparameters

In [6]:
# seed
seed = 42

# Tokenizer arguments
max_length = 512

# model arguments
max_new_tokens=500

# mixed precision
dtype = torch.bfloat16

# quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=dtype,
    bnb_4bit_quant_type="nf4"
)

# LoRA configuration
lora_config = LoraConfig(
    task_type = "CAUSAL_LM",
    r = 8,
    target_modules = ["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0.1,
    bias="none",
)

# training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    logging_dir="./logs",
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=1,
    save_total_limit=1,
    
    learning_rate=2e-5,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=1,
    optim="adamw_torch",
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    warmup_steps=100,
    seed=seed
)

# train-validation split
validation_size = 0.1

# SFTTrainer arguments
max_seq_length = 512

## Model

In [7]:
# Model List

# gemma variants

# llama2 variants

# phi variants


In [8]:
model_id = "google/gemma-7b-it"

In [9]:
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device,
    attn_implementation="flash_attention_2",
    torch_dtype=dtype,
    quantization_config=quantization_config
)

Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

## Dataset

In [10]:
# Dataset Path
train_dataset_path1 = "dataset/llm-prompt-recovery-synthetic-datastore/gemma1000_w7b.csv"
train_dataset_path2 = "dataset/3000-rewritten-texts-prompt-recovery-challenge/prompts_0_500_wiki_first_para_3000.csv"
test_dataset_path = "data-sample/test.csv"

In [11]:
# `LLM Prompt Recovery - Synthetic Datastore dataset` by @dschettler8845
df1 = pd.read_csv(train_dataset_path1)
df1 = df1[["original_text", "rewrite_prompt", "gemma_7b_rewritten_text_temp0"]]
df1 = df1.rename(columns={"gemma_7b_rewritten_text_temp0":"rewritten_text"})
df1.head()

Unnamed: 0,original_text,rewrite_prompt,rewritten_text
0,"Port-au-Prince, Haiti (CNN) -- Earthquake vict...",Turn this into an association to be joined.,"Sure, here is the association you requested:\n..."
1,Former secretary of state Hillary Clinton meet...,Convert this into a gain to be gained.,"Sure, here is the gain to be gained from the t..."
2,The opinions expressed by columnists are their...,Frame this as a political debate.,## The Obama Legacy: A Tale of Two Sides\n\nTh...
3,BIGBANG is one of those musical entities that ...,Imagine this as a mathematician's equation.,"Sure, here is the equation:\n\n**BIGBANG's imp..."
4,WHAT?!??! I know. That’s what you’re saying ri...,Frame this as an accountant's thrilling advent...,"Sure, here's the framed text as an accountant'..."


In [12]:
# `3000 Rewritten texts - Prompt recovery Challenge` by @dipamc77
df2 = pd.read_csv(train_dataset_path2)
df2.head()

Unnamed: 0,original_text,rewrite_prompt,rewritten_text
0,"Sfiso Ncwane (April 21, 1979 - December 5, 201...",Transform the text into a series of riddles th...,"Sure, here's the text transformed into riddles..."
1,The 1959–60 California Golden Bears men's bask...,Make this a market entry strategy for a new re...,## Market Entry Strategy: Launching a Brand in...
2,Franck Passi (born 28 March 1966) is a French ...,Write it as the last chapter of a book that ch...,## The Final Chapter: Ode to a Changed Soul\n\...
3,Hollandaea diabolica is a species of Australia...,Turn it into a vaudeville stage act introduction.,**Vaudeville Stage Act Introduction:**\n\nLadi...
4,The QF 6-inch Gun Mark N5 (initially designate...,Transform the text into a series of instructio...,The text does not provide any information abou...


In [13]:
# Merge all datasets
df = pd.concat([df1, df2], axis=0)
#df = df.sample(2000).reset_index(drop=True) # to reduce training time we are only using 2k samples
print(df.shape)
df.head()

(4000, 3)


Unnamed: 0,original_text,rewrite_prompt,rewritten_text
0,"Port-au-Prince, Haiti (CNN) -- Earthquake vict...",Turn this into an association to be joined.,"Sure, here is the association you requested:\n..."
1,Former secretary of state Hillary Clinton meet...,Convert this into a gain to be gained.,"Sure, here is the gain to be gained from the t..."
2,The opinions expressed by columnists are their...,Frame this as a political debate.,## The Obama Legacy: A Tale of Two Sides\n\nTh...
3,BIGBANG is one of those musical entities that ...,Imagine this as a mathematician's equation.,"Sure, here is the equation:\n\n**BIGBANG's imp..."
4,WHAT?!??! I know. That’s what you’re saying ri...,Frame this as an accountant's thrilling advent...,"Sure, here's the framed text as an accountant'..."


## Prompt Engineering

In [14]:
template = """
Instruction:\n
Below, the `Original Text` passage has been rewritten/transformed/improved into `Rewritten Text` by the `Gemma 7b-it` LLM with a certain prompt/instruction. Your task is to carefully analyze the differences between the `Original Text` and `Rewritten Text`, and try to infer the specific prompt or instruction that was likely given to the LLM to rewrite/transform/improve the text in this way.\n\n

Original Text:\n
{original_text}\n\n

Rewriten Text:\n
{rewritten_text}\n\n

Response:\n
{rewrite_prompt}
"""

In [15]:
def format_prompt(row):
    original_text = row.get("original_text", "")
    rewritten_text = row.get("rewritten_text", "")
    rewrite_prompt = row.get("rewrite_prompt", "")
    
    return template.format(
        original_text=original_text,
        rewritten_text=rewritten_text,
        rewrite_prompt=rewrite_prompt
    )

df["prompt"] = df.apply(format_prompt, axis=1)
data = df.prompt.tolist()

## Preprocessing

In [16]:
# Convert the DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(df)

def preprocess_function(examples):
    # Tokenize the prompts
    return tokenizer(examples['prompt'], max_length=max_length, truncation=True, padding="max_length")

# Preprocess the dataset
dataset = dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

In [17]:
# Split the dataset into a training set and a validation set
dataset = dataset.train_test_split(test_size=validation_size, seed=seed)

# Get the training and validation sets
train_dataset = dataset['train']
val_dataset = dataset['test']

## Sample

In [18]:
def colorize_text(text):
    for word, color in zip(["Instruction", "Original Text", "Rewriten Text", "Response"],
                           ["red", "yellow", "blue", "green"]):
        text = text.replace(f"{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

In [19]:
# Take a random sample
sample = data[10]

# Give colors to Instruction, Response and Category
sample = colorize_text(sample)

# Show sample in markdown
display(Markdown(sample))




**<font color='red'>Instruction:</font>**

Below, the `Original Text` passage has been rewritten/transformed/improved into `Rewritten Text` by the `Gemma 7b-it` LLM with a certain prompt/instruction. Your task is to carefully analyze the differences between the `Original Text` and `Rewritten Text`, and try to infer the specific prompt or instruction that was likely given to the LLM to rewrite/transform/improve the text in this way.





**<font color='yellow'>Original Text:</font>**

Story highlights Tyka Nelson says her brother's favorite color was ... orange The late musical artist's brand has been all about the color purple (CNN) Tyka Nelson just tweaked a major part of Prince's legacy. The sister of the late superstar talked to the Evening Standard about an upcoming exhibit of Prince artifacts set to open in London and mentioned one of his beloved instruments. "The standout piece for me is his orange Cloud guitar," the publication quoted Nelson as saying. "It is strange because people always associate the color purple with Prince, but his favorite color was actually orange."





**<font color='blue'>Rewriten Text:</font>**

In the land of musical legends, I stumbled upon a hidden gem that shed light on the enigmatic life of the late Prince. As I ventured through the archives of his legacy, I stumbled upon a revelation that challenged my understanding of the artist's vibrant persona.

The exhibit, set to open in London, will showcase a collection of Prince's treasured artifacts, including a guitar that held a special place in his heart. "The standout piece for me is his orange Cloud guitar," Nelson said in an interview with the Evening Standard. "It is strange because people always associate the color purple with Prince, but his favorite color was actually orange."

This discovery was like a treasure map leading me to a hidden chamber where Prince's soul lived on through the prism of his favorite hue. It was a moment of profound connection to the artist's inner world, revealing a hidden layer of his creative spirit.





**<font color='green'>Response:</font>**

Rewrite this as an explorer's discovery.


## Inference before Fine-Tuning

In [20]:
def generate_response(prompt):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    outputs = model.generate(input_ids=input_ids.to(model.device), max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [21]:
# Take one sample
row = df.iloc[10]

# Generate Prompt using template
prompt = template.format(
    original_text=row.original_text,
    rewritten_text=row.rewritten_text,
    rewrite_prompt="",
)

# Infer
output = generate_response(prompt)

# Colorize
output = colorize_text(output)

# Display in markdown
display(Markdown(output))




**<font color='red'>Instruction:</font>**

Below, the `Original Text` passage has been rewritten/transformed/improved into `Rewritten Text` by the `Gemma 7b-it` LLM with a certain prompt/instruction. Your task is to carefully analyze the differences between the `Original Text` and `Rewritten Text`, and try to infer the specific prompt or instruction that was likely given to the LLM to rewrite/transform/improve the text in this way.





**<font color='yellow'>Original Text:</font>**

Story highlights Tyka Nelson says her brother's favorite color was ... orange The late musical artist's brand has been all about the color purple (CNN) Tyka Nelson just tweaked a major part of Prince's legacy. The sister of the late superstar talked to the Evening Standard about an upcoming exhibit of Prince artifacts set to open in London and mentioned one of his beloved instruments. "The standout piece for me is his orange Cloud guitar," the publication quoted Nelson as saying. "It is strange because people always associate the color purple with Prince, but his favorite color was actually orange."





**<font color='blue'>Rewriten Text:</font>**

In the land of musical legends, I stumbled upon a hidden gem that shed light on the enigmatic life of the late Prince. As I ventured through the archives of his legacy, I stumbled upon a revelation that challenged my understanding of the artist's vibrant persona.

The exhibit, set to open in London, will showcase a collection of Prince's treasured artifacts, including a guitar that held a special place in his heart. "The standout piece for me is his orange Cloud guitar," Nelson said in an interview with the Evening Standard. "It is strange because people always associate the color purple with Prince, but his favorite color was actually orange."

This discovery was like a treasure map leading me to a hidden chamber where Prince's soul lived on through the prism of his favorite hue. It was a moment of profound connection to the artist's inner world, revealing a hidden layer of his creative spirit.





**<font color='green'>Response:</font>**


**Prompt/

**<font color='red'>Instruction:</font>****

The prompt/instruction given to the LLM to rewrite/transform/improve the text in this way is likely to be:

**1. Expand the scope of the story:** The original text is relatively short and focused primarily on the connection between Tyka Nelson and Prince's favorite color. The rewritten text expands the scope of the story to include a broader overview of Prince's legacy and the exhibit set to open in London.

**2. Use vivid imagery and storytelling:** The rewritten text uses vivid imagery and storytelling techniques to create a more immersive and engaging experience for the reader.

**3. Add a personal touch:** The rewritten text includes a personal touch, such as the author's own experiences and reflections on Prince's legacy.

**4. Use a more formal tone:** The rewritten text uses a more formal tone than the original text, which is more conversational and informal.

**5. Include additional details and information:** The rewritten text includes additional details and information about Prince's legacy and the exhibit set to open in London.

In [22]:
# Take one sample
row = df.iloc[20]

# Generate Prompt using template
prompt = template.format(
    original_text=row.original_text,
    rewritten_text=row.rewritten_text,
    rewrite_prompt="",
)

# Infer
output = generate_response(prompt)

# Colorize
output = colorize_text(output)

# Display in markdown
display(Markdown(output))




**<font color='red'>Instruction:</font>**

Below, the `Original Text` passage has been rewritten/transformed/improved into `Rewritten Text` by the `Gemma 7b-it` LLM with a certain prompt/instruction. Your task is to carefully analyze the differences between the `Original Text` and `Rewritten Text`, and try to infer the specific prompt or instruction that was likely given to the LLM to rewrite/transform/improve the text in this way.





**<font color='yellow'>Original Text:</font>**

Refined mansion tax proposal being fed into debate on abolishing 50p tax rate for those earning more than £150,000 The Liberal Democrats are pushing for the eventual disbanding of the 50p rate of tax to see the implementation of a new land tax levied on properties above £1m. In a refinement of their controversial mansion tax policy launched at their party conference two years ago, the Lib Dems now believe there is an argument for levying capital gains tax on any money made from the sale of a property after the first £1m. The Lib Dem idea is being fed





**<font color='blue'>Rewriten Text:</font>**

Sure, here is the rephrased text as a wise old tree's advice:

"My dear young sapling, listen to my wisdom. The path you tread is fraught with challenges, but I have a secret to share that will guide you through.

In the realm of taxation, there is a tale to be told. A tale of a 50p rate of tax that once stood tall, but has been met with a storm of controversy. The Liberal Democrats, like a seasoned traveler, have devised a refined plan to replace this rate with a new land tax on properties above a million quid.

But my dear sapling, remember this: the devil is in the details. While the land tax may seem like a noble gesture, the devil lies in the implementation of the capital gains tax on any money made from the sale of a property after the first million. It is a complex web of rules and regulations that can entrap even the most seasoned tax expert.

Therefore, my young sapling, I urge you to tread cautiously and consult the wisdom of those who have gone before you. For in the realm of taxation, the devil is always lurking, and it is only through understanding the intricacies of the law that you can navigate the treacherous terrain."





**<font color='green'>Response:</font>**


The prompt/instruction given to the LLM to rewrite/transform/improve the text is as follows:

**Prompt:**

Please rewrite the original text in a more creative and engaging way, using the tone of a wise old tree's advice. Include additional details and information that are not present in the original text.

**Additional instructions:**

* Use a conversational tone that is easy to understand.
* Use vivid imagery and metaphors to create a more immersive experience for the reader.
* Include a clear call to action at the end of the text.

## Supervised Fine-Tuning (LoRA)

In [23]:
def formatting_func(examples):
    text = (f"Instruction:\n\n{examples['prompt']}\n\n"
            f"Original Text:\n\n{examples['original_text']}\n\n"
            f"Rewriten Text:\n\n{examples['rewritten_text']}\n\n"
            f"Response:\n\n{examples['rewrite_prompt']}")
    return [text]

In [24]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    peft_config=lora_config,
    max_seq_length=max_seq_length,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    formatting_func=formatting_func
)

Map:   0%|          | 0/3600 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [25]:
trainer.train()

The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.


Step,Training Loss
1,3.782
2,3.7675
3,3.668
4,3.7493


TrainOutput(global_step=4, training_loss=3.7416868805885315, metrics={'train_runtime': 56.4546, 'train_samples_per_second': 0.071, 'train_steps_per_second': 0.071, 'total_flos': 95554570813440.0, 'train_loss': 3.7416868805885315, 'epoch': 1.0})

## Inference after Fine-Tuning

In [26]:
# Take one sample
row = df.iloc[10]

# Generate Prompt using template
prompt = template.format(
    original_text=row.original_text,
    rewritten_text=row.rewritten_text,
    rewrite_prompt="",
)

# Infer
output = generate_response(prompt)

# Colorize
output = colorize_text(output)

# Display in markdown
display(Markdown(output))




**<font color='red'>Instruction:</font>**

Below, the `Original Text` passage has been rewritten/transformed/improved into `Rewritten Text` by the `Gemma 7b-it` LLM with a certain prompt/instruction. Your task is to carefully analyze the differences between the `Original Text` and `Rewritten Text`, and try to infer the specific prompt or instruction that was likely given to the LLM to rewrite/transform/improve the text in this way.





**<font color='yellow'>Original Text:</font>**

Story highlights Tyka Nelson says her brother's favorite color was ... orange The late musical artist's brand has been all about the color purple (CNN) Tyka Nelson just tweaked a major part of Prince's legacy. The sister of the late superstar talked to the Evening Standard about an upcoming exhibit of Prince artifacts set to open in London and mentioned one of his beloved instruments. "The standout piece for me is his orange Cloud guitar," the publication quoted Nelson as saying. "It is strange because people always associate the color purple with Prince, but his favorite color was actually orange."





**<font color='blue'>Rewriten Text:</font>**

In the land of musical legends, I stumbled upon a hidden gem that shed light on the enigmatic life of the late Prince. As I ventured through the archives of his legacy, I stumbled upon a revelation that challenged my understanding of the artist's vibrant persona.

The exhibit, set to open in London, will showcase a collection of Prince's treasured artifacts, including a guitar that held a special place in his heart. "The standout piece for me is his orange Cloud guitar," Nelson said in an interview with the Evening Standard. "It is strange because people always associate the color purple with Prince, but his favorite color was actually orange."

This discovery was like a treasure map leading me to a hidden chamber where Prince's soul lived on through the prism of his favorite hue. It was a moment of profound connection to the artist's inner world, revealing a hidden layer of his creative spirit.





**<font color='green'>Response:</font>**


**Prompt/

**<font color='red'>Instruction:</font>****

The prompt/instruction given to the LLM to rewrite/transform/improve the text in this way is likely to be:

**1. Expand the scope of the story:** The original text is relatively short, so the prompt likely instructed the LLM to expand the story and add more details and descriptions.

**2. Create a more immersive and evocative atmosphere:** The rewritten text is much more descriptive and immersive than the original text, so the prompt likely instructed the LLM to create a more evocative atmosphere.

**3. Use vivid language and imagery:** The rewritten text uses vivid language and imagery to create a more engaging and immersive experience for the reader. The prompt likely instructed the LLM to use vivid language and imagery.

**4. Connect the story to the artist's personality:** The rewritten text connects the story to Prince's personality and legacy. The prompt likely instructed the LLM to connect the story to Prince's personality and legacy.

**5. Add a personal touch:** The rewritten text includes a personal touch, such as the author's own experiences and reflections. The prompt likely instructed the LLM to add a personal touch.

In [27]:
# Take one sample
row = df.iloc[20]

# Generate Prompt using template
prompt = template.format(
    original_text=row.original_text,
    rewritten_text=row.rewritten_text,
    rewrite_prompt="",
)

# Infer
output = generate_response(prompt)

# Colorize
output = colorize_text(output)

# Display in markdown
display(Markdown(output))




**<font color='red'>Instruction:</font>**

Below, the `Original Text` passage has been rewritten/transformed/improved into `Rewritten Text` by the `Gemma 7b-it` LLM with a certain prompt/instruction. Your task is to carefully analyze the differences between the `Original Text` and `Rewritten Text`, and try to infer the specific prompt or instruction that was likely given to the LLM to rewrite/transform/improve the text in this way.





**<font color='yellow'>Original Text:</font>**

Refined mansion tax proposal being fed into debate on abolishing 50p tax rate for those earning more than £150,000 The Liberal Democrats are pushing for the eventual disbanding of the 50p rate of tax to see the implementation of a new land tax levied on properties above £1m. In a refinement of their controversial mansion tax policy launched at their party conference two years ago, the Lib Dems now believe there is an argument for levying capital gains tax on any money made from the sale of a property after the first £1m. The Lib Dem idea is being fed





**<font color='blue'>Rewriten Text:</font>**

Sure, here is the rephrased text as a wise old tree's advice:

"My dear young sapling, listen to my wisdom. The path you tread is fraught with challenges, but I have a secret to share that will guide you through.

In the realm of taxation, there is a tale to be told. A tale of a 50p rate of tax that once stood tall, but has been met with a storm of controversy. The Liberal Democrats, like a seasoned traveler, have devised a refined plan to replace this rate with a new land tax on properties above a million quid.

But my dear sapling, remember this: the devil is in the details. While the land tax may seem like a noble gesture, the devil lies in the implementation of the capital gains tax on any money made from the sale of a property after the first million. It is a complex web of rules and regulations that can entrap even the most seasoned tax expert.

Therefore, my young sapling, I urge you to tread cautiously and consult the wisdom of those who have gone before you. For in the realm of taxation, the devil is always lurking, and it is only through understanding the intricacies of the law that you can navigate the treacherous terrain."





**<font color='green'>Response:</font>**


The prompt/instruction given to the LLM to rewrite/transform/improve the text is as follows:

**Prompt:**

Please rewrite the original text in a more creative and engaging way, while maintaining the key points and information. Use a conversational tone and incorporate the use of metaphors and analogies to make the text more relatable and easier to understand.

## Inference on Test Data

In [28]:
test_df = pd.read_csv(test_dataset_path)
test_df['original_text'] = test_df['original_text'].fillna("")
test_df['rewritten_text'] = test_df['rewritten_text'].fillna("")
test_df.head()

Unnamed: 0,id,original_text,rewritten_text
0,-1,The competition dataset comprises text passage...,Here is your shanty: (Verse 1) The text is rew...


In [29]:
# Test Data: Take one sample
row = test_df.iloc[0]

# Generate Prompt using template
prompt = template.format(
    original_text=row.original_text,
    rewritten_text=row.rewritten_text,
    rewrite_prompt="",
)

# Infer
output = generate_response(prompt)

# Colorize
output = colorize_text(output)

# Display in markdown
display(Markdown(output))




**<font color='red'>Instruction:</font>**

Below, the `Original Text` passage has been rewritten/transformed/improved into `Rewritten Text` by the `Gemma 7b-it` LLM with a certain prompt/instruction. Your task is to carefully analyze the differences between the `Original Text` and `Rewritten Text`, and try to infer the specific prompt or instruction that was likely given to the LLM to rewrite/transform/improve the text in this way.





**<font color='yellow'>Original Text:</font>**

The competition dataset comprises text passages that have been rewritten by the Gemma LLM according to some rewrite_prompt instruction. The goal of the competition is to determine what prompt was used to rewrite each original text.  Please note that this is a Code Competition. When your submission is scored, this example test data will be replaced with the full test set. Expect roughly 2,000 original texts in the test set.





**<font color='blue'>Rewriten Text:</font>**

Here is your shanty: (Verse 1) The text is rewritten, the LLM has spun, With prompts so clever, they've been outrun. The goal is to find, the prompt so bright, To crack the code, and shine the light. (Chorus) Oh, this is a code competition, my dear, With text and prompts, we'll compete. Two thousand texts, a challenge grand, To guess the prompts, hand over hand.(Verse 2) The original text, a treasure lost, The rewrite prompt, a secret to be





**<font color='green'>Response:</font>**


The prompt/instruction given to the LLM to rewrite/transform/improve the text in this way is likely to be:

**Prompt:**

"Rewrite the original text in a creative and engaging way, using vivid imagery and a conversational tone. The rewritten text should be approximately the same length as the original text, and should include the main points of the original text, but rearranged in a more entertaining way. Please use your imagination and creativity to craft a compelling rewrite that will capture the imagination of the reader."

## Save Model

In [30]:
tokenizer.save_pretrained(output_dir)
model.save_pretrained(output_dir)

## Submission

In [31]:
preds = []
for i in tqdm(range(len(test_df))):
    row = test_df.iloc[i]

    # Generate Prompt using template
    prompt = template.format(
        original_text=row.original_text,
        rewritten_text=row.rewritten_text,
        rewrite_prompt=""
    )

    # Infer
    output = generate_response(prompt)
    pred = output.replace(prompt, "") # remove the prompt from output
    
    # Store predictions
    preds.append([row.id, pred])

100%|██████████| 1/1 [00:08<00:00,  8.26s/it]


In [32]:
sub_df = pd.DataFrame(preds, columns=["id", "rewrite_prompt"])
sub_df['rewrite_prompt'] = sub_df['rewrite_prompt'].fillna("")
sub_df['rewrite_prompt'] = sub_df['rewrite_prompt'].map(lambda x: "Improve the essay" if len(x) == 0 else x)
sub_df.to_csv("submission.csv",index=False)
sub_df.head()

Unnamed: 0,id,rewrite_prompt
0,-1,The prompt/instruction given to the LLM to rew...
