Competition information: https://www.kaggle.com/competitions/llm-prompt-recovery/overview

The goal of this competition is to recover the LLM prompt that was used to transform a given text. We are given an original sample text and its rewritten version, and we need to determine the prompt that was used to convert the original text into the rewritten text.

In [1]:
# imports 

import pandas as pd
import numpy as np
import random

In [2]:
# loading the data 

movie_plots = pd.read_csv('/kaggle/input/wikipedia-movie-plots/wiki_movie_plots_deduped.csv')

In [3]:
movie_plots.head() # view data

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...


In [4]:
movie_plots['Plot'][1] # view data

"The moon, painted with a smiling face hangs over a park at night. A young couple walking past a fence learn on a railing and look up. The moon smiles. They embrace, and the moon's smile gets bigger. They then sit down on a bench by a tree. The moon's view is blocked, causing him to frown. In the last scene, the man fans the woman with his hat because the moon has left the sky and is perched over her shoulder to see everything better."

In [5]:
movie_plots['Plot'].str.len().max() # length of the longest plot, in characters

36773

In [6]:
# Truncate texts to at most 8192 characters (string slicing counts characters, not words)

df = movie_plots['Plot'].apply(lambda x: x[:8192]) 
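
The slice above limits each plot by character count. If a word-based limit were wanted instead, one option is to split on whitespace before cutting. The following is a minimal sketch; `truncate_chars` and `truncate_words` are hypothetical helper names used only for illustration and are not part of the notebook.

```python
def truncate_chars(text: str, max_chars: int) -> str:
    """Keep at most max_chars characters (the behaviour of x[:8192])."""
    return text[:max_chars]


def truncate_words(text: str, max_words: int) -> str:
    """Keep at most max_words whitespace-separated words."""
    return " ".join(text.split()[:max_words])


sample = "The moon smiles. They embrace, and the moon's smile gets bigger."
print(truncate_chars(sample, 10))  # character-based cut, may end mid-word
print(truncate_words(sample, 2))   # word-based cut, always ends on a word
```

A character limit is cheaper and is what the notebook uses; a word limit avoids cutting a word in half but collapses runs of whitespace.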

In [7]:
# sample 10 texts from data

df = df.sample(10)

In [8]:
# Setup the environment
!pip install -q -U immutabledict sentencepiece 
!git clone https://github.com/google/gemma_pytorch.git
!mkdir /kaggle/working/gemma/
!mv /kaggle/working/gemma_pytorch/gemma/* /kaggle/working/gemma/

import sys 
sys.path.append("/kaggle/working/gemma_pytorch/") 
from gemma.config import GemmaConfig, get_config_for_7b, get_config_for_2b
from gemma.model import GemmaForCausalLM
from gemma.tokenizer import Tokenizer
import contextlib
import os
import torch

# Load the model
VARIANT = "2b-it" 
MACHINE_TYPE = "cuda" 
weights_dir = '/kaggle/input/gemma/pytorch/2b-it/2' 

@contextlib.contextmanager
def _set_default_tensor_type(dtype: torch.dtype):
  """Temporarily sets the default torch dtype to the given dtype."""
  torch.set_default_dtype(dtype)
  try:
    yield
  finally:
    # reset to float32 even if model loading raises
    torch.set_default_dtype(torch.float)

model_config = get_config_for_2b() if "2b" in VARIANT else get_config_for_7b()
model_config.tokenizer = os.path.join(weights_dir, "tokenizer.model")

device = torch.device(MACHINE_TYPE)
with _set_default_tensor_type(model_config.get_dtype()):
  model = GemmaForCausalLM(model_config)
  ckpt_path = os.path.join(weights_dir, f'gemma-{VARIANT}.ckpt')
  model.load_weights(ckpt_path)
  model = model.to(device).eval()

Cloning into 'gemma_pytorch'...
remote: Enumerating objects: 148, done.
remote: Counting objects: 100% (80/80), done.
remote: Compressing objects: 100% (55/55), done.
remote: Total 148 (delta 46), reused 38 (delta 23), pack-reused 68
Receiving objects: 100% (148/148), 2.16 MiB | 12.76 MiB/s, done.
Resolving deltas: 100% (73/73), done.


In [9]:
# General rewrite prompts to be used on the model

rewrite_prompts = [
    'Describe it to me like you would explain it to a five-year-old',
    'Summarize this in your own words',
    'Write me an opinion on this topic',
    'Write a poem about this topic',
    'Make this rhyme'
]
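
These prompts are drawn at random for each text later in the notebook, so the generated dataset changes from run to run. If repeatable picks are wanted, one option is a seeded `random.Random` instance. This is a sketch under that assumption; `choose_prompts` and the seed value are hypothetical and do not appear in the notebook.

```python
import random

rewrite_prompts = [
    "Summarize this in your own words",
    "Write a poem about this topic",
    "Make this rhyme",
]


def choose_prompts(n: int, seed: int) -> list:
    """Pick n prompts with a private, seeded generator."""
    rng = random.Random(seed)  # does not touch the global random state
    return [rng.choice(rewrite_prompts) for _ in range(n)]


first = choose_prompts(5, seed=0)
second = choose_prompts(5, seed=0)
print(first == second)  # the same seed yields the same sequence of picks
```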

In [10]:
# model chat template for prompting
USER_CHAT_TEMPLATE = "<start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n"
# rewritten data after prompting
rewrite_data = []

# iterate through text
for texts in df:
    # randomly choose rewrite prompts
    prompt_chosen = random.choice(rewrite_prompts)
    # put the instruction on the first line, followed by the text to rewrite
    prompt = f'{prompt_chosen}\n{texts}'
    # generate texts using gemma and given prompt
    chat_output = model.generate(
        USER_CHAT_TEMPLATE.format(prompt = prompt),
        device = device,
        output_len = 128        
    )
    # append data to rewrite_data
    rewrite_data.append({
        'original_text': texts,
        'rewrite_prompt': prompt_chosen,
        'rewritten_text': chat_output,
    })
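
The prompt construction in the loop above can be checked without loading the model. The sketch below reuses the notebook's `USER_CHAT_TEMPLATE`; `build_model_input` is a hypothetical helper name introduced here for illustration.

```python
USER_CHAT_TEMPLATE = "<start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n"


def build_model_input(instruction: str, text: str) -> str:
    """Join instruction and text, then wrap the result in the chat template."""
    prompt = f"{instruction}\n{text}"
    return USER_CHAT_TEMPLATE.format(prompt=prompt)


model_input = build_model_input("Make this rhyme", "The moon smiles.")
print(model_input)
```

Printing the string before calling `model.generate` makes it easy to confirm that the instruction, the text, and the turn markers are in the intended order.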

In [11]:
# create dataframe
rewrite_data_df = pd.DataFrame(rewrite_data)
rewrite_data_df.iloc[1, :].values

array(["Aditya (Ankush Hazra)portrays a young millionaire with a heart of gold. He successfully runs a business of resorts. His life is anchored by his best friend and business associate, Prachi (Sayantika Banerjee). Life is all about fun and work and girls for him until one day when his car bumps into Esha's cycle. Aditya is instantly enamored by the pretty and innocent girl who seems to have no interest in the flirty, rich bachelor. He gives in to his ego by giving Esha (Nusrat Jahan) a job in his resort even though she can't work in the second shift. As times goes by, Aditya starts sharing his deepest thoughts with Esha. She sees the simple, lovable man behind the successful persona who loves the food made by his aunt. Unwittingly, she loses her heart to this flirty man who confesses to be commitment phobic. These three characters become the best of buddies. Prachi sees the growing chemistry between these two and she keeps her love for Aditya hidden inside her. Turbulence occurs whe

In [12]:
rewrite_data_df.to_csv('prompts.csv',index=False)

## Generating Submission data

In [13]:
# load in train and test data

train = pd.read_csv('/kaggle/input/llm-prompt-recovery/train.csv')
test = pd.read_csv('/kaggle/input/llm-prompt-recovery/test.csv')

In [14]:
# define rewrite prompt

test['rewrite_prompt'] = "Generate rewrite_prompt for the following text while retaining the original meaning but altering the tone. Aim for a different structure and wording. Here is an example sample: original text- " + train.loc[0, 'original_text'] +" rewritten_text- " + train.loc[0, 'rewritten_text'] + " and this is the right rewrite_prompt- " + train.loc[0, 'rewrite_prompt'] + " Now, You will output in text the most suitable rewrite_prompt."
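
The assignment above builds one string from the first training row and assigns that same string to every row of `test`, since nothing row-specific is concatenated in. The assembly can be sketched as a plain function; `build_few_shot_prompt` is a hypothetical name, and the placeholder arguments stand in for `train.loc[0, ...]` rather than real data.

```python
def build_few_shot_prompt(original_text: str, rewritten_text: str, rewrite_prompt: str) -> str:
    """Assemble the one-shot instruction string from a single worked example."""
    return (
        "Generate rewrite_prompt for the following text while retaining the "
        "original meaning but altering the tone. Aim for a different structure "
        "and wording. Here is an example sample: original text- "
        f"{original_text} rewritten_text- {rewritten_text}"
        f" and this is the right rewrite_prompt- {rewrite_prompt}"
        " Now, You will output in text the most suitable rewrite_prompt."
    )


# Placeholder strings, not real competition data.
demo = build_few_shot_prompt("ORIGINAL", "REWRITTEN", "PROMPT")
print(demo)
```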

In [15]:
print(test['rewrite_prompt'][0])

Generate rewrite_prompt for the following text while retaining the original meaning but altering the tone. Aim for a different structure and wording. Here is an example sample: original text- The competition dataset comprises text passages that have been rewritten by the Gemma LLM according to some rewrite_prompt instruction. The goal of the competition is to determine what prompt was used to rewrite each original text.  Please note that this is a Code Competition. When your submission is scored, this example test data will be replaced with the full test set. Expect roughly 2,000 original texts in the test set. rewritten_text- Here is your shanty: (Verse 1) The text is rewritten, the LLM has spun, With prompts so clever, they've been outrun. The goal is to find, the prompt so bright, To crack the code, and shine the light. (Chorus) Oh, this is a code competition, my dear, With text and prompts, we'll compete. Two thousand texts, a challenge grand, To guess the prompts, hand over hand.(

In [16]:
test = test[['id', 'rewrite_prompt']]

In [17]:
# make submission file

test.to_csv('submission.csv', index=False)
sub = pd.read_csv("/kaggle/working/submission.csv")
sub

Unnamed: 0,id,rewrite_prompt
0,-1,Generate rewrite_prompt for the following text...
