# GE Notebook - LLM Prompt Recovery

Let's start by defining the scope of the work:

- Understand the Problem
- Understand the data
- Prepare the data
- Feature Engineering
- Modelling
- Evaluation
- Submission

In [1]:
import pandas as pd

In [2]:
%%capture
# Setup the environment
!pip install -q -U immutabledict sentencepiece 
!git clone https://github.com/google/gemma_pytorch.git
!mkdir /kaggle/working/gemma/
!mv /kaggle/working/gemma_pytorch/gemma/* /kaggle/working/gemma/

import sys 
sys.path.append("/kaggle/working/gemma_pytorch/") 
from gemma.config import GemmaConfig, get_config_for_7b, get_config_for_2b
from gemma.model import GemmaForCausalLM
from gemma.tokenizer import Tokenizer
import contextlib
import os
import torch

# Load the model
VARIANT = "7b-it-quant" 
MACHINE_TYPE = "cuda" 
weights_dir = '/kaggle/input/gemma/pytorch/7b-it-quant/2' 

@contextlib.contextmanager
def _set_default_tensor_type(dtype: torch.dtype):
  """Sets the default torch dtype to the given dtype."""
  torch.set_default_dtype(dtype)
  yield
  torch.set_default_dtype(torch.float)

# Model Config.
model_config = get_config_for_2b() if "2b" in VARIANT else get_config_for_7b()
model_config.tokenizer = os.path.join(weights_dir, "tokenizer.model")
model_config.quant = "quant" in VARIANT

# Model.
device = torch.device(MACHINE_TYPE)
with _set_default_tensor_type(model_config.get_dtype()):
  model = GemmaForCausalLM(model_config)
  ckpt_path = os.path.join(weights_dir, f'gemma-{VARIANT}.ckpt')
  model.load_weights(ckpt_path)
  model = model.to(device).eval()

## Understand the Problem

> LLMs are commonly used to rewrite or make stylistic changes to text. The goal of this competition is to recover the LLM prompt that was used to transform a given text.
> 
> NLP workflows increasingly involve rewriting text, but there's still a lot to learn about how to prompt LLMs effectively. This machine learning competition is designed to be a novel way to dig deeper into this problem.
> 
> The challenge: recover the LLM prompt used to rewrite a given text. You’ll be tested against a dataset of 1300+ original texts, each paired with a rewritten version from Gemma, Google’s new family of open models.

This problem includes a free-form start where contestants have to define and build their own training data. The example submissions give a hint as to what we should be looking for

## Understand the Data

This competition has examples of testing/training data, but doesn't include any form of large dataset to train with.

In [3]:
prompts = [
    'Rewrite this in the style of William Shakespeare',
    'Rewrite this in the style Dr. Seuss',
    'Rewrite this in the style of Beyonce',
    'Rewrite this in the style of Edgar Allen Poe',
    'Rewrite this as a poem',
    'Rewrite this with more emotion',
    'Rewrite this as a tweet from a teenager',
    'Rewrite this as a presidential announcement',
    'Rewrite this as if a dog wrote it',
    'Rewrite this as if a cat wrote it',
    'Rewrite this as a haiku',
    'Improve this text',
    'Simplify this text',
    'Add complexity to this text',
    'Rewrite this text',
    'Could you change this text?',
    'Can you change this text?',
    'Write this differently',
    'Remove all emotion from this text',
    'Rewrite this text to avoid plagairism'
]

df_movie_plot = pd.read_csv('/kaggle/input/wikipedia-movie-plots/wiki_movie_plots_deduped.csv')

phrases = df_movie_plot[['Plot']].head(10)

In [4]:
# This is the prompt format the model expects
USER_CHAT_TEMPLATE = "<start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n"

number_of_rewrites = 5

rewrite_data = []

for phrase in phrases:
    for prompt in prompts:
        prompt = f'{prompt}\n{phrase}'
        for i in range(number_of_rewrites):
            rewritten_text = model.generate(
                USER_CHAT_TEMPLATE.format(prompt=prompt),
                device=device,
                output_len=100,
            )
            rewrite_data.append({
                'phrase': phrase,
                'prompt': prompt,
                'rewritten_text': rewritten_text,
            }) 

In [5]:
df_train = pd.DataFrame(rewrite_data)
df_train.head()

Unnamed: 0,phrase,prompt,rewritten_text
0,Plot,Rewrite this in the style of William Shakespea...,"Sure, here's the rewritten portion of text in ..."
1,Plot,Rewrite this in the style of William Shakespea...,"Sure, here's the rewritten text in the style o..."
2,Plot,Rewrite this in the style of William Shakespea...,"Sure, here's the rewritten text in the style o..."
3,Plot,Rewrite this in the style of William Shakespea...,"Sure, here's the rewritten text in the style o..."
4,Plot,Rewrite this in the style of William Shakespea...,"Sure, here's the rewritten text in the style o..."


In [6]:
output_csv_path = 'LLM_train_output.csv'
df_train.to_csv(output_csv_path, index=False)

## Prepare the Data

Take the generated data, and process it for use in modelling.

## Feature Engineering

Explore steps for processing data to improve modelling. Outlier rejection, data scaling, etc.

## Modelling

Create the model and such that will be used for the prediction

## Evaluation

Use the problem definition to evaluate the model against a private test dataset.