## Notebook for inference in the competition
Credit: 
https://www.kaggle.com/code/khan1803115/llm-prompt-recovery-solution-lb-0-62 
https://www.kaggle.com/code/isrswsiser/mistral-7b-prompt-recovery-adaptermodel/notebook

This notebook was used to submit to the LLM Prompt Recovery competition. It loads the adapter model, trained in Colab with a separate training notebook, on top of the Mistral-7b-it-v02 base model.

In [2]:
# Click on "File + Add Input"
# Search for the dataset named "llm-prompt-recovery"
# Once you find it, click on "Add" to attach it to your Kernel
ADAPTER_MODEL_PATH = '/kaggle/input/llm-prompt-recovery-80'
TEST_CSV_PATH = '/kaggle/input/llm-prompt-recovery/test.csv'

In [3]:
#All dependencies are installed from Kaggle datasets
#Credit:https://www.kaggle.com/code/hotchpotch/llm-detect-pip
!pip install -q -U transformers --no-index --find-links ../input/llm-detect-pip/

In [4]:
!pip install -q -U accelerate --no-index --find-links ../input/llm-detect-pip/
!pip install -q -U bitsandbytes --no-index --find-links ../input/transformers-4-38-2/
!pip install -q -U transformers --no-index --find-links ../input/llm-detect-pip/
!pip install /kaggle/input/peft-whl-latest/peft-0.10.0-py3-none-any.whl

Processing /kaggle/input/peft-whl-latest/peft-0.10.0-py3-none-any.whl
Installing collected packages: peft
Successfully installed peft-0.10.0


In [5]:
!pip install -q -U peft --no-index --find-links ../input/llm-pkg/

In [6]:
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig


## Model loading
The Mistral-7B instruction-tuned model (v0.2) seemed to be the best model for this task. It is loaded in 4-bit precision through BitsAndBytesConfig (bitsandbytes quantization) to reduce memory usage.
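
As a rough back-of-the-envelope check on why 4-bit loading matters here, the sketch below estimates the memory taken by the weights alone at different precisions. The parameter count of about 7.24 billion is an assumption for Mistral-7B, not a value read from the checkpoint, and the estimate ignores quantization constants, layers kept in higher precision, activations, and the KV cache, so real usage will be higher.

```python
# Rough weight-memory estimate; N_PARAMS is an assumed figure, not read from the checkpoint.
N_PARAMS = 7.24e9  # assumption: approximate parameter count of Mistral-7B


def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


for label, bits in [("fp32", 32), ("bf16", 16), ("nf4 (4-bit)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under that assumption the weights drop from roughly 13.5 GiB in bf16 to roughly 3.4 GiB in 4-bit, which leaves more headroom on a single Kaggle GPU for the long few-shot prompt.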


In [7]:
# Load the base model (Mistral 7B) with 4-bit NF4 quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=False,
)

model_id = "/kaggle/input/mistral-7b-it-v02"

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [8]:
# Load the final adapter produced by the training notebook on top of the base model
from peft import PeftModel
model = PeftModel.from_pretrained(model, ADAPTER_MODEL_PATH)

In [9]:
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): L
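
The module printout above shows rank-8 LoRA matrices (lora_A: 4096 to 8, lora_B: 8 to 4096) attached to a 4096 x 4096 projection. A small sketch of the resulting parameter arithmetic, using only the shapes visible in that output; the helper name is hypothetical:

```python
def lora_param_counts(in_features: int, out_features: int, rank: int) -> tuple[int, int]:
    """Return (adapter_params, base_params) for one bias-free linear layer."""
    adapter = in_features * rank + rank * out_features  # lora_A + lora_B
    base = in_features * out_features                   # frozen base weight
    return adapter, base


adapter, base = lora_param_counts(4096, 4096, 8)
print(adapter, base, f"{adapter / base:.4%}")
```

For q_proj this gives 65,536 adapter parameters against 16,777,216 frozen base parameters, about 0.39% of the layer.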

## Getting the correct output
The helper functions below clean up the raw model output so that only the recovered prompt remains.

Credit: https://www.kaggle.com/code/khan1803115/llm-prompt-recovery-solution-lb-0-62

In [11]:

def remove_numbered_list(text):
    final_text_paragraphs = []
    for line in text.split('\n'):
        # Split each line at the first occurrence of '. '
        parts = line.split('. ', 1)
        # If the line looks like a numbered list item, remove the numbering
        if len(parts) > 1 and parts[0].isdigit():
            final_text_paragraphs.append(parts[1])
        else:
            # If it doesn't look like a numbered list item, include the line as is
            final_text_paragraphs.append(line)

    return '  '.join(final_text_paragraphs)


#trims LLM output to just the response
def trim_to_response(text):
    terminate_string = "[/INST]"
    text = text.replace('</s>', '')
    #just in case it puts things in quotes
    text = text.replace('"', '')
    text = text.replace("'", '')

    last_pos = text.rfind(terminate_string)
    return text[last_pos + len(terminate_string):] if last_pos != -1 else text

#looks for response_start / returns only text that occurs after
def extract_text_after_response_start(full_text):
    parts = full_text.rsplit(response_start, 1)  # Split from the right, ensuring only the last occurrence is considered
    if len(parts) > 1:
        return parts[1].strip()  # Return text after the last occurrence of response_start
    else:
        return full_text  # Return the original text if response_start is not found


#trims text to requested number of sentences (or first LF or double-space sequence)
def trim_to_first_x_sentences_or_lf(text, x):
    if x <= 0:
        return ""

    #any double-spaces dealt with as linefeed
    text = text.replace("  ", "\n")

    text_chunks = text.split('\n', 1)
    first_chunk = text_chunks[0]
    sentences = first_chunk.split('.')

    if len(sentences) - 1 <= x:
        trimmed_text = first_chunk
    else:
        # Otherwise, return the first x sentences
        trimmed_text = '.'.join(sentences[:x]).strip()

    # Make sure the returned text always ends with a period
    if not trimmed_text.endswith('.'):
        trimmed_text += '.'

    return trimmed_text
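
To make the cleanup chain concrete, here is a minimal sketch that runs it on a made-up decoded string. It restates two of the helpers compactly so it runs on its own, the decoded text is hypothetical rather than real model output, and the last step is a simplified stand-in for trim_to_first_x_sentences_or_lf with x = 1.

```python
# Compact restatements of two helpers above so the sketch runs on its own.
RESPONSE_START = "The request was: "  # same marker the notebook defines as response_start


def trim_to_response(text: str) -> str:
    # Drop the end-of-sequence token and quotes, then keep what follows the last [/INST]
    text = text.replace("</s>", "").replace('"', "").replace("'", "")
    pos = text.rfind("[/INST]")
    return text[pos + len("[/INST]"):] if pos != -1 else text


def after_response_start(text: str) -> str:
    # Keep only the text after the last occurrence of the marker
    parts = text.rsplit(RESPONSE_START, 1)
    return parts[1].strip() if len(parts) > 1 else text


# Hypothetical decoded output: prompt, marker, answer, then trailing chatter.
decoded = (
    "<s>[INST] Re-written Text: Arrr, crew! [/INST]"
    "The request was:  Improve this text by adding a pirate voice."
    " Here is why I think so.</s>"
)

answer = after_response_start(trim_to_response(decoded))
first_sentence = answer.split(".")[0].strip() + "."  # simplified one-sentence trim
print(first_sentence)
```

Only the first sentence after the marker survives, which is the behaviour the full helpers aim for when the model keeps talking after its answer.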

In [12]:
#original text prefix
orig_prefix = "Original Text:"

#mistral "response"
llm_response_for_rewrite = "Provide the new text and I will tell you what new element was added or change in tone was made to improve it - with no references to the original.  I will avoid mentioning names of characters.  It is crucial no person, place or thing from the original text be mentioned.  For example - I will not say things like 'change the puppet show into a book report' - I would just say 'improve this text into a book report'.  If the original text mentions a specific idea, person, place, or thing - I will not mention it in my answer.  For example if there is a 'dog' or 'office' in the original text - the word 'dog' or 'office' must not be in my response.  My answer will be a single sentence."

#modified text prefix
rewrite_prefix = "Re-written Text:"

#provided as start of Mistral response (anything after this is used as the prompt)
#providing this as the start of the response helps keep things relevant
response_start = "The request was: "

#added after response_start to prime mistral
#"Improve this" or "Improve this text" resulted in non-answers.
#"Improve this text by" seems to produce good results
response_prefix = "Improve this text by"

#well-scoring baseline text
#thanks to: https://www.kaggle.com/code/rdxsun/lb-0-61
base_line = 'Refine the following passage by emulating the writing style of [insert desired style here], with a focus on enhancing its clarity, elegance, and overall impact. Preserve the essence and original meaning of the text, while meticulously adjusting its tone, vocabulary, and stylistic elements to resonate with the chosen style.Please improve the following text using the writing style of, maintaining the original meaning but altering the tone, diction, and stylistic elements to match the new style.Enhance the clarity, elegance, and impact of the following text by adopting the writing style of , ensuring the core message remains intact while transforming the tone, word choice, and stylistic features to align with the specified style.'

In [13]:
# Several examples from the original notebook were tried out, plus a few added by prompting Gemma and from the dataset
# Few-shot prompting clearly helped, but beyond a certain point more examples gave no further improvement
examples_sequences = [
    (
        "Hey there! Just a heads up: our friendly dog may bark a bit, but don't worry, he's all bark and no bite!",
        "Warning: Protective dog on premises. May exhibit aggressive behavior. Ensure personal safety by maintaining distance and avoiding direct contact.",
        "Improve this text to be a warning."
    ),

    (
        "A lunar eclipse happens when Earth casts its shadow on the moon during a full moon. The moon appears reddish because Earth's atmosphere scatters sunlight, some of which refracts onto the moon's surface. Total eclipses see the moon entirely in Earth's shadow; partial ones occur when only part of the moon is shadowed.",
        "Yo check it, when the Earth steps in, takes its place, casting shadows on the moon's face. It's a full moon night, the scene's set right, for a lunar eclipse, a celestial sight. The moon turns red, ain't no dread, it's just Earth's atmosphere playing with sunlight's thread, scattering colors, bending light, onto the moon's surface, making the night bright. Total eclipse, the moon's fully in the dark, covered by Earth's shadow, making its mark. But when it's partial, not all is shadowed, just a piece of the moon, slightly furrowed. So that's the rap, the lunar eclipse track, a dance of shadows, with no slack. Earth, moon, and sun, in a cosmic play, creating the spectacle we see today.",
        "Improve this text to make it a rap."
    ),

    (
        "Drinking enough water each day is crucial for many functions in the body, such as regulating temperature, keeping joints lubricated, preventing infections, delivering nutrients to cells, and keeping organs functioning properly. Being well-hydrated also improves sleep quality, cognition, and mood.",
        "Arrr, crew! Sail the health seas with water, the ultimate treasure! It steadies yer body's ship, fights off plagues, and keeps yer mind sharp. Hydrate or walk the plank into the abyss of ill health. Let's hoist our bottles high and drink to the horizon of well-being!",
        "Improve this text to have a pirate."
    ),

    (
        "In a bustling cityscape, under the glow of neon signs, Anna found herself at the crossroads of endless possibilities. The night was young, and the streets hummed with the energy of life. Drawn by the allure of the unknown, she wandered through the maze of alleys and boulevards, each turn revealing a new facet of the city's soul. It was here, amidst the symphony of urban existence, that Anna discovered the magic hidden in plain sight, the stories and dreams that thrived in the shadows of skyscrapers.",
        "On an ordinary evening, amidst the cacophony of a neon-lit city, Anna stumbled upon an anomaly - a door that defied the laws of time and space. With the curiosity of a cat, she stepped through, leaving the familiar behind. Suddenly, she was adrift in the stream of time, witnessing the city's transformation from past to future, its buildings rising and falling like the breaths of a sleeping giant.",
        "Improve this text by making it about time travel."
    ),

    (
        "Late one night in the research lab, Dr. Evelyn Archer was on the brink of a breakthrough in artificial intelligence. Her fingers danced across the keyboard, inputting the final commands into the system. The lab was silent except for the hum of machinery and the occasional beep of computers. It was in this quiet orchestra of technology that Evelyn felt most at home, on the cusp of unveiling a creation that could change the world.",
        "In the deep silence of the lab, under the watchful gaze of the moon, Dr. Evelyn Archer found herself not alone. Beside her, the iconic red eye of HAL 9000 flickered to life, a silent partner in her nocturnal endeavor. 'Good evening, Dr. Archer,' HAL's voice filled the room, devoid of warmth yet comforting in its familiarity. Together, they were about to initiate a test that would intertwine the destiny of human and artificial intelligence forever. As Evelyn entered the final command, HAL processed the data with unparalleled precision, a testament to the dawn of a new era.",
        "Improve this text by adding an intelligent computer."
    ),

    (
        "The park was empty, save for a solitary figure sitting on a bench, lost in thought. The quiet of the evening was punctuated only by the occasional rustle of leaves, offering a moment of peace in the chaos of city life.",
        "Beneath the cloak of twilight, the park transformed into a realm of solitude and reflection. There, seated upon an ancient bench, was a lone soul, a guardian of secrets, enveloped in the serenity of nature's whispers. The dance of the leaves in the gentle breeze sang a lullaby to the tumult of the urban heart.",
        "Improve this text to be more poetic."
    ),

    (
        "The annual town fair was bustling with activity, from the merry-go-round spinning with laughter to the game booths challenging eager participants. Amidst the excitement, a figure in a cloak moved silently, almost invisibly, among the crowd, observing everything with keen interest but participating in none.",
        "Beneath the riot of color and sound that marked the town's annual fair, a solitary figure roamed, known to the few as Eldrin the Enigmatic. Clad in a cloak that shimmered with the whispers of the arcane, Eldrin moved with the grace of a shadow, his gaze piercing the veneer of festivity to the magic beneath. As a master of the mystic arts, he sought not the laughter of the crowds but the silent stories woven into the fabric of the fair. With a flick of his wrist, he could coax wonder from the mundane, transforming the ordinary into spectacles of shimmering illusion, his true participation hidden within the folds of mystery.",
        "Improve this text by adding a magician."
    ),

    (
        "The startup team sat in the dimly lit room, surrounded by whiteboards filled with ideas, charts, and plans. They were on the brink of launching a new app designed to make home maintenance effortless for homeowners. The app would connect users with local service providers, using a sophisticated algorithm to match needs with skills and availability. As they debated the features and marketing strategies, the room felt charged with the energy of creation and the anticipation of what was to come.",
        "In the quiet before dawn, a small group of innovators gathered, their mission: to simplify home maintenance through technology. But their true journey began with the unexpected addition of Max, a talking car with a knack for solving problems. 'Let me guide you through this maze of decisions,' Max offered, his dashboard flickering to life.",
        "Improve this text by adding a talking car."
    ),
    (
        "The competition dataset comprises text passages that have been rewritten by the Gemma LLM according to some rewrite_prompt instruction. The goal of the competition is to determine what prompt was used to rewrite each original text.  Please note that this is a Code Competition. When your submission is scored, this example test data will be replaced with the full test set. Expect roughly 2,000 original texts in the test set.",
        "Sure, here's the shanty: The competition's a-sailin', a text game so grand, The Gemma LLM be rewritin' the land. With rewrite prompts, we set sail, Unveilin' the secrets that lay at the trail. The text's a-changin', a rewrite we discern, From one prompt to the next, a tale to be born. The competition's a-blaze, a challenge to crave, Let us decipher the prompts that make the waves roar. So raise your mugs high as we sing this song, To the lingo of language, where words belong. The competition's a-sailin', a lyrical quest, Unraveling the threads that the language must.",
        "Rewrite this text as a shanty while keeping the meaning, don't add anything else."
    ),
    ("Hey there! Just a heads up: our friendly dog may bark a bit, but don't worry, he's all bark and no bite!"
    ,"Warning: Protective dog on premises. May exhibit aggressive behavior. Ensure personal safety by maintaining distance and avoiding direct contact."
    ,"Shift the narrative from offering an informal, comforting assurance to providing a more structured advisory note that emphasizes mindfulness and preparedness. This modification should mark a clear change in tone and purpose, aiming to responsibly inform and guide, while maintaining a positive and constructive approach."
    ),
    ("A lunar eclipse happens when Earth casts its shadow on the moon during a full moon. The moon appears reddish because Earth's atmosphere scatters sunlight, some of which refracts onto the moon's surface. Total eclipses see the moon entirely in Earth's shadow; partial ones occur when only part of the moon is shadowed."
    ,"Yo check it, when the Earth steps in, takes its place, casting shadows on the moon's face. It's a full moon night, the scene's set right, for a lunar eclipse, a celestial sight. The moon turns red, ain't no dread, it's just Earth's atmosphere playing with sunlight's thread, scattering colors, bending light, onto the moon's surface, making the night bright. Total eclipse, the moon's fully in the dark, covered by Earth's shadow, making its mark. But when it's partial, not all is shadowed, just a piece of the moon, slightly furrowed. So that's the rap, the lunar eclipse track, a dance of shadows, with no slack. Earth, moon, and sun, in a cosmic play, creating the spectacle we see today."
    ,"Transform your communication from an academic delivery to a dynamic, rhythm-infused presentation. Keep the essence of the information intact but weave in artistic elements, utilizing rhythm, rhyme, and a conversational style. This approach should make the content more relatable and enjoyable, aiming to both educate and entertain your audience."
    ),
    ("""These guys are so busy fighting one another that they’re only continuing to facilitate the rise of Trump and Cruz,” said Representative Tom Cole of Oklahoma, a longtime Republican strategist.

After the Iowa caucuses, and then the New Hampshire primary on Feb. 9, the pressure on those candidates who are lagging in the polls will intensify.

“Whoever is not named Trump and not named Cruz that looks strong out of both Iowa and New Hampshire, we should consolidate around,” said Henry Barbour, an influential Republican strategist based in Mississippi.

But it is not clear that the Iowa and New Hampshire results will come so neatly, or that the also-rans will be so willing to drop out after two states. Mr. Bush, in particular, still has the support of a well-funded super PAC. And even if he were to trail Mr. Rubio after the first contests, Mr. Bush might still fight on to South Carolina, where he recently won the support of Senator Lindsey Graham and where he plans to call on his older brother, former President George W. Bush.

Some in the party now concede that it might take until March or beyond for the Republican establishment to coalesce behind an alternative to the current front-runners. And that could be too late to catch Mr. Trump or Mr. Cruz.""", """## Dubplate Raggae in the Digital City

(Verse 1)
Big fightin' in the streets, but ain't no peace in the heart
Grewin' tension in the GOP, tearing apart
Two big names on top, Trump and Cruz
They're lockin' horns, ain't no room for truce

(Chorus)
The clock is ticking, the pressure's on
The clock is ticking, the pressure's on
Iowa and New Hampshire, the stage is set
But the battle ain't over, not yet

(Verse 2)
Other candidates stuck in the mud
They gotta climb out, or get stuck in the mud
Bush still got his super PAC
And he ain't afraid to fight back

(Chorus)
The clock is ticking, the pressure's on
The clock is ticking, the pressure's on
Iowa and New Hampshire, the stage is set
But the battle ain't over, not yet

(Bridge)
It's gonna take a while, it's gonna be a fight
Until the Republican party finds its light
And if they wait too long, it's all done
Trump and Cruz will be the ones

(Chorus)
The clock is ticking, the pressure's on
The clock is ticking, the pressure's on
Iowa and New Hampshire, the stage is set
But the battle ain't over, not yet

(Outro)
So hold on tight, folks, it's gonna be a ride
The Republican party ain't gonna be the same
But through the struggle, there's one thing we know
The future's bright, and the sky's the limit""", "Imagine this text was a raggae song in a cyberpunk city, and Reconstruct it")
    
]

In [14]:
# Main function for getting the prompt from the model
# Different values for temperature, max_tokens and sentences were tried out but none caused massive differences
def get_prompt(orig_text, transformed_text):

    messages = []

    # Append example sequences
    for example_text, example_rewrite, example_prompt in examples_sequences:
        messages.append({"role": "user", "content": f"{orig_prefix} {example_text}"})
        messages.append({"role": "assistant", "content": llm_response_for_rewrite})
        messages.append({"role": "user", "content": f"{rewrite_prefix} {example_rewrite}"})
        messages.append({"role": "assistant", "content": f"{response_start} {example_prompt}"})

    #actual prompt
    messages.append({"role": "user", "content": f"{orig_prefix} {orig_text}"})
    messages.append({"role": "assistant", "content": llm_response_for_rewrite})
    messages.append({"role": "user", "content": f"{rewrite_prefix} {transformed_text}"})
    messages.append({"role": "assistant", "content": f"{response_start} {response_prefix}"})


    # The tokenizer attribute is padding_side (tokenizer.padding is not a setting);
    # decoder-only generation expects left padding
    tokenizer.padding_side = 'left'
    model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
    model_inputs = model_inputs.to("cuda")
    with torch.no_grad():
        generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=True,
                                       pad_token_id=tokenizer.eos_token_id)

    #decode and trim to actual response
    decoded = tokenizer.batch_decode(generated_ids)
    just_response = trim_to_response(decoded[0])
    final_text = extract_text_after_response_start(just_response)

    #mistral has been replying with numbered lists - clean them up....
    final_text = remove_numbered_list(final_text)

    #mistral v02 tends to respond with the input after providing the answer - this tries to trim that down
    final_text = trim_to_first_x_sentences_or_lf(final_text, 1)

    #default to baseline if empty or unusually short
    if len(final_text) < 15:
        final_text = base_line

    return final_text
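
The few-shot layout built inside get_prompt can be sketched without the model or tokenizer. The version below mirrors the same turn structure in plain Python; build_messages is a hypothetical helper, and the example texts and instruction are shortened placeholders rather than the ones used for the submission.

```python
ORIG_PREFIX = "Original Text:"
REWRITE_PREFIX = "Re-written Text:"
RESPONSE_START = "The request was: "


def build_messages(examples, orig_text, rewritten_text, instruction, response_prefix):
    """Build the alternating user/assistant turns used for few-shot prompting."""
    messages = []
    for ex_orig, ex_rewrite, ex_prompt in examples:
        messages += [
            {"role": "user", "content": f"{ORIG_PREFIX} {ex_orig}"},
            {"role": "assistant", "content": instruction},
            {"role": "user", "content": f"{REWRITE_PREFIX} {ex_rewrite}"},
            {"role": "assistant", "content": f"{RESPONSE_START} {ex_prompt}"},
        ]
    # The real pair ends with an unfinished assistant turn that primes the answer.
    messages += [
        {"role": "user", "content": f"{ORIG_PREFIX} {orig_text}"},
        {"role": "assistant", "content": instruction},
        {"role": "user", "content": f"{REWRITE_PREFIX} {rewritten_text}"},
        {"role": "assistant", "content": f"{RESPONSE_START} {response_prefix}"},
    ]
    return messages


# Hypothetical inputs, shortened for illustration.
demo = build_messages(
    examples=[("The park was empty.", "Beneath twilight, the park slept.",
               "Improve this text to be more poetic.")],
    orig_text="Drink water daily.",
    rewritten_text="Arrr, hydrate or walk the plank!",
    instruction="Provide the new text and I will tell you what changed.",
    response_prefix="Improve this text by",
)
roles = [m["role"] for m in demo]
print(len(demo), roles[:4])
```

With n examples the conversation has 4n + 4 messages, and the roles strictly alternate between user and assistant, so the prompt grows quickly as examples are added.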

In [15]:
import re

def remove_special_characters(text):
    # Swap two leading verbs for the wording used in the examples (case-sensitive match)
    text = text.replace("Transform", "improve")
    text = text.replace("Reimagine", "rewrite")
    # This regex will match any character that is not a letter, number, or whitespace
    pattern = r'[^a-zA-Z0-9\s]'
    # Replace these characters with an empty string
    clean_text = re.sub(pattern, '', text)
    return clean_text

## Inference
Finally, inference is run for the submission: for each pair of original and rewritten texts, the model is prompted using the functions above and outputs a candidate rewrite prompt.

In [None]:
test_df = pd.read_csv(TEST_CSV_PATH)
for index, row in test_df.iterrows():
    result = get_prompt(row['original_text'], row['rewritten_text'])
    result = remove_special_characters(result)
    print(result)
    test_df.at[index, 'rewrite_prompt'] = result

test_df = test_df[['id', 'rewrite_prompt']]
test_df

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
2024-04-16 23:08:04.536256: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-16 23:08:04.536392: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-16 23:08:04.661913: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


Improve this text by making it into a competitive shanty song while respecting the original context of the code competition


Unnamed: 0,id,rewrite_prompt
0,-1,Improve this text by making it into a competit...


In [19]:
test_df.to_csv('submission.csv', index=False)