# Experiment 1: Prompt Engineering Efficacy on Non-Fine-Tuned Large Language Model

## Objective
This experiment evaluates how well a pretrained Llama 3.1 model, with no fine-tuning, can rewrite disfluent questions as fluent ones when guided only by prompt engineering. The goal is to establish a baseline for the model's capabilities and to measure how responsive it is to prompt design.

## Experimental Setup

### Model Specifications
- **Architecture:** Meta's Llama 3.1 8B (base checkpoint `unsloth/Meta-Llama-3.1-8B`, loaded in 4-bit)
- **Source:** Unsloth HuggingFace model repository (non-gated, open-access variant)

### Computational Environment
- **Platform:** Google Colab Notebook
- **Infrastructure Tier:** Free
- **GPU Specification:** NVIDIA Tesla T4

### Dataset
- **Corpus:** google-research-datasets/Disfl-QA

## Methodology
The experiment employs a prompt engineering approach, leveraging carefully crafted input sequences to elicit optimal performance from the non-fine-tuned Llama 3.1 model. This method involves the systematic design and iteration of prompts to guide the model's output towards desired outcomes, without altering the model's parameters.

## Evaluation Metrics
To quantify the model's performance, I utilize two widely recognized natural language processing metrics:

1. **BLEU Score (Bilingual Evaluation Understudy)**

2. **ROUGE Score (Recall-Oriented Understudy for Gisting Evaluation)**

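Both metrics measure n-gram overlap between each predicted question and the reference fluent question (the `original` field in Disfl-QA). BLEU is precision-oriented, while ROUGE is recall-oriented, so together they indicate how much of the prediction is correct and how much of the reference is recovered.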
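
As a rough illustration of what these overlap metrics measure, the sketch below computes clipped n-gram precision (the building block of BLEU) and unigram recall (the idea behind ROUGE-1) for a single pair. The helper names, the whitespace tokenization, and the example sentences are assumptions made for illustration; the notebook itself relies on the `evaluate` library, which tokenizes and aggregates differently.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(prediction, reference, n=1):
    """Share of predicted n-grams that also occur in the reference (counts clipped)."""
    pred_counts = Counter(ngrams(prediction.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    total = sum(pred_counts.values())
    return overlap / total if total else 0.0


def unigram_recall(prediction, reference):
    """Share of reference words that are recovered by the prediction."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(count, pred_counts[word]) for word, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


# Hypothetical pair, not taken from Disfl-QA.
reference = "What is the capital of France?"
prediction = "What is the capital city of France?"

unigram_p = clipped_precision(prediction, reference, n=1)  # 6 of 7 predicted words match
bigram_p = clipped_precision(prediction, reference, n=2)   # 4 of 6 predicted bigrams match
recall = unigram_recall(prediction, reference)             # all 6 reference words recovered
```

An extra word lowers precision but leaves recall untouched, which is why a model that appends explanations tends to lose more BLEU than ROUGE.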

## Step 1: Install Required Dependencies

In [1]:
%%capture
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install evaluate rouge_score

from torch import __version__; from packaging.version import Version as V
xformers = "xformers==0.0.27" if V(__version__) < V("2.4.0") else "xformers"
!pip install --no-deps {xformers} trl peft accelerate bitsandbytes triton

## Step 2: Load Data

In [2]:
import requests
import pandas as pd
import json

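def process_github_json_files(base_url, file_names):
    """Download each JSON split, save a local copy, and return (train, test, dev) DataFrames."""
    dataframes = {}

    for file_name in file_names:
        url = f"{base_url}/{file_name}"
        try:
            # A timeout keeps the cell from hanging on a stalled connection.
            response = requests.get(url, timeout=60)
            if response.status_code != 200:
                raise Exception(f"Failed to download {file_name}. Status code: {response.status_code}")

            # Each file maps a question id to its "original" and "disfluent" text,
            # so orient='index' yields one row per question.
            data = json.loads(response.text)
            df = pd.DataFrame.from_dict(data, orient='index').reset_index().rename(columns={'index': 'id'})

            # Save a records-oriented copy next to the notebook.
            output_file = file_name
            df.to_json(output_file, orient='records')

            # Key each DataFrame by its split name: "train", "test", or "dev".
            key = file_name.split('.')[0]
            dataframes[key] = df

        except Exception as e:
            print(f"An error occurred while processing {file_name}: {str(e)}")

    # A split that failed to download comes back as None.
    return dataframes.get('train'), dataframes.get('test'), dataframes.get('dev')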

base_url = "https://raw.githubusercontent.com/google-research-datasets/Disfl-QA/master"
file_names = ["train.json", "test.json", "dev.json"]

df_train, df_test, df_dev = process_github_json_files(base_url, file_names)

In [3]:
print("Shape of train DataFrame:", df_train.shape if df_train is not None else "Not available")
print("Shape of test DataFrame:", df_test.shape if df_test is not None else "Not available")
print("Shape of dev DataFrame:", df_dev.shape if df_dev is not None else "Not available")

Shape of train DataFrame: (7182, 3)
Shape of test DataFrame: (3643, 3)
Shape of dev DataFrame: (1000, 3)


In [4]:
df_train.head(5)

| | id | original | disfluent |
|---|---|---|---|
| 0 | 5a5918ff3e1742001a15cf7e | What do unstable isotope studies indicate? | What do petrologists no what do unstable isoto... |
| 1 | 5ad4f40c5b96ef001a10a774 | What is the basic unit of territorial division... | What is the second level of territorial divisi... |
| 2 | 572684365951b619008f7543 | Which genus lack tentacles and sheaths? | Juvenile platyctenids no wow Which genus lack ... |
| 3 | 5729f799af94a219006aa70a | Long-lived memory cells can remember previous ... | When a pathogen is met again scratch that I me... |
| 4 | 5ad3b9cd604f3c001a3fee87 | What led to Newcastle's rise to power as milit... | What led to the Duke of Cumberland's rise to p... |


## Step 3: Build a Prompt

## Prompt Engineering Methodology

### Approach Overview
In this research project, I explored multiple approaches for constructing effective prompts to optimize the performance of the language model. The primary methodologies employed were:

1. Zero-shot prompting (instructions only, no worked examples)
2. Few-shot prompting (instructions plus a handful of worked examples)

I conducted extensive experimentation with various prompt structures and content to identify the most effective formulations.
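
The difference between the two approaches can be sketched with a small prompt builder. This is an illustrative assumption rather than the template used in this notebook: `build_few_shot_prompt`, its section labels, and the example pair are hypothetical. Passing an empty example list gives a zero-shot prompt.

```python
def build_few_shot_prompt(instruction, examples, disfluent_question):
    """Assemble a prompt that places worked examples before the target question.

    `examples` is a list of (disfluent, corrected) pairs; an empty list
    produces a zero-shot prompt.
    """
    parts = [instruction.strip(), ""]
    for disfluent, corrected in examples:
        parts += [
            f"Disfluent Question:\n{disfluent}",
            f"Corrected Question:\n{corrected}",
            "",
        ]
    # The prompt ends on the open slot the model is expected to fill in.
    parts += [f"Disfluent Question:\n{disfluent_question}", "Corrected Question:"]
    return "\n".join(parts)


# Hypothetical demonstration pair, not drawn from Disfl-QA.
demo_examples = [
    ("What is the uh I mean who is the author of Hamlet?", "Who is the author of Hamlet?"),
]
target = "Where no wait when did the war end?"

zero_shot_prompt = build_few_shot_prompt("Correct the disfluent question.", [], target)
few_shot_prompt = build_few_shot_prompt("Correct the disfluent question.", demo_examples, target)
```

Few-shot prompts are longer, so they consume more of the context window and take longer to run per example on a T4.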

### Key Challenge
A significant challenge encountered during the prompt engineering process was instructing the model to output only the corrected question without generating extraneous explanations or hallucinated examples.

### Iterative Process
Through multiple iterations of prompt engineering, I developed a template that gave the best results among the variants I tried, characterized by:
- Reduced hallucinations
- Outputs constrained to corrected questions only

### Template Performance
The selected template outperformed the other variations I compared, but it did not eliminate the problem: some outputs still contained hallucinated content, explanations, or additional examples. Across side-by-side comparisons of outputs from multiple template variations, it consistently produced the best results.
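
One way to reduce the impact of leftover explanations is to post-process each generation and keep only its first non-empty line. The sketch below is an assumption for illustration: `keep_first_line` and the sample generation are hypothetical, and this is not the post-processing behind the scores reported in this notebook, which replaces newlines with spaces instead.

```python
def keep_first_line(generated_text):
    """Return the first non-empty line of a generation, stripped of whitespace."""
    for line in generated_text.splitlines():
        line = line.strip()
        if line:
            return line
    return ""


# Hypothetical raw generation with a trailing explanation.
raw_generation = (
    "\nWhat is the basic unit of territorial division in Warsaw?\n\n"
    "Explanation: the speaker corrected the first phrase.\n"
)
cleaned = keep_first_line(raw_generation)
```

This only helps when the corrected question comes first; it cannot repair a generation whose first line is itself wrong.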

### Alternative Approaches
For context, it's worth noting that alternative methods such as fine-tuning can be employed to instruct the model to output in a specific format. For instance:
- Appending the end-of-sequence token (`<EOS_TOKEN>`) to each training target can teach the model to stop right after the corrected question.
- These alternative approaches are demonstrated in Experiment 2 and Experiment 3, detailed in separate notebooks.
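
As a minimal sketch of the end-of-sequence idea, the function below formats one training example so that the target ends with the EOS token. The function name, the section labels, and the example pair are hypothetical, and the token string is an assumption; in a real fine-tuning run the value would normally be read from `tokenizer.eos_token` rather than hard-coded.

```python
# Assumption: stand-in value; read tokenizer.eos_token in a real run.
EOS_TOKEN = "<|end_of_text|>"


def format_training_example(disfluent, corrected, eos_token=EOS_TOKEN):
    """Build one training string whose target ends with the EOS token."""
    return (
        "Disfluent Question:\n"
        f"{disfluent}\n\n"
        "Corrected Question:\n"
        f"{corrected}{eos_token}"
    )


# Hypothetical pair, not drawn from Disfl-QA.
training_text = format_training_example(
    "What is the uh I mean who is the author of Hamlet?",
    "Who is the author of Hamlet?",
)
```

Because every target ends with the same token, the model sees a consistent stopping signal immediately after the corrected question.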


In [5]:
instruction_template = """
You are an AI assistant that corrects disfluent questions.
Remove all disfluencies (filler words, false starts, hesitations, repetitions) and output a single, fluent, clear, and concise version of the input question.
Maintain the original meaning and intent. Use natural, formal English.
Do not change the subject, alter the question's meaning, or add any new information.
Provide only the corrected question as a single line, without explanations, examples, or additional formatting.
"""

prompt_template = """
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

{instruction}

Disfluent Question:
{disfluent_question}

Corrected Question based on the above instructions, Important dont add explanations, examples, or additional formatting:
"""

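# str.format returns a new string and leaves prompt_template unchanged,
# so the filled-in prompt is stored separately for inspection.
example_prompt = prompt_template.format(
        instruction = instruction_template,
        disfluent_question ="What is the second level of territorial division in Poland no make that the basic unit of territorial division in Warsaw?",
)
prompt_template  # displays the raw template with its placeholders intact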

'\nBelow is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n{instruction}\n\nDisfluent Question:\n{disfluent_question}\n\nCorrected Question based on the above instructions, Important dont add explanations, examples, or additional formatting:\n'

## Step 4: Load Llama 3.1 Model

In [6]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



## Step 5: Perform Inference on the Dev Dataset Using the Above Prompt

#### 5.1 Inference on a single example:

In [11]:
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    prompt_template.format(
        instruction = instruction_template,
        disfluent_question ="What is the second level of territorial division in Poland no make that the basic unit of territorial division in Warsaw?",
    )
], return_tensors = "pt").to("cuda")


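output_tokens = model.generate(**inputs, max_new_tokens=64, use_cache=True, pad_token_id=tokenizer.eos_token_id)
# Decode only the newly generated tokens, i.e. everything after the prompt.
prompt_length = inputs["input_ids"].shape[1]
tokenizer.batch_decode(output_tokens[:, prompt_length:], skip_special_tokens=True)[0]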

'What is the second level of territorial division in Poland, and what is the basic unit of territorial division in Warsaw?\n\n'

#### 5.2 Inference on the dev dataset (1000 rows):

In [18]:
# Work on a copy so that adding the prediction column does not modify df_dev.
df_dev_experiment = df_dev.copy()

In [19]:
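def generate_prediction(disfluent_input):
    """Generate the model's corrected question for one disfluent input."""
    inputs = tokenizer(
        [
            prompt_template.format(
                instruction=instruction_template,
                disfluent_question=disfluent_input,
            )
        ],
        return_tensors="pt"
    ).to("cuda")

    output_tokens = model.generate(
        **inputs,
        max_new_tokens=64,
        use_cache=True,
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode only the newly generated tokens (everything after the prompt),
    # then flatten any newlines into spaces.
    prompt_length = inputs["input_ids"].shape[1]
    output_text = tokenizer.batch_decode(
        output_tokens[:, prompt_length:],
        skip_special_tokens=True
    )[0].replace('\n', ' ')

    return output_text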


df_dev_experiment['prediction'] = df_dev_experiment['disfluent'].apply(generate_prediction)

In [24]:
df_dev_experiment[['original', 'disfluent', 'prediction']].head(10)

| | original | disfluent | prediction |
|---|---|---|---|
| 0 | What did the government want Thoreau to do? | Who did no What did the government want Thorea... | What did the government want Thoreau to do? |
| 1 | What makes the Wells Fargo Center stand out? | What makes the Bank of America Tower or wait t... | What makes the Bank of America Tower or the We... |
| 2 | What was the Colonia Agrippina's original name? | What was the Colonia Agrippina's original empi... | What was the Colonia Agrippina's original name? |
| 3 | Extended networking benefits helped those that... | Extended authorization limitations, no sorry n... | Extended authorization limitations, no sorry n... |
| 4 | Who is the emphasis on when there is a private... | What is the no make that who is the emphasis o... | What is the no make that who is the emphasis o... |
| 5 | What dynasties inspired the Chinese-like eleme... | What dynasties reflected no inspired the Chine... | What dynasties influenced the Chinese-like ele... |
| 6 | What is the density of all primes compatible w... | What is the density of all primes compatible w... | What is the density of all primes compatible w... |
| 7 | What did European empires rely on to supply th... | When or uh what did European empires rely on t... | What did European empires rely on to supply th... |
| 8 | What did Karlen and Singer present to the US s... | What did Wahl and Ammann no no Karlen and Sing... | What did Wahl and Ammann present to the US sen... |
| 9 | What is the current status of the Haensch study? | What is the current status of Schuenemann's st... | What is the current status of Schuenemann's st... |


## Step 6: Compute BLEU and ROUGE Metrics on the Predictions

In [20]:
originals_text = list(df_dev_experiment['original'])
predictions_text = list(df_dev_experiment['prediction'])

In [21]:
import evaluate
bleu = evaluate.load("bleu")

results = bleu.compute(predictions=predictions_text, references=originals_text)
print(results)

rouge = evaluate.load('rouge')
results = rouge.compute(predictions=predictions_text, references=originals_text)
print(results)


{'bleu': 0.4881350664288088, 'precisions': [0.6355159842961301, 0.5253317249698432, 0.4472439660795825, 0.38023792613636365], 'brevity_penalty': 1.0, 'length_ratio': 1.3281191806331472, 'translation_length': 14264, 'reference_length': 10740}



{'rouge1': 0.7642200139446589, 'rouge2': 0.6422974792730143, 'rougeL': 0.7405325435164305, 'rougeLsum': 0.7403404427692271}
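
To make the BLEU output easier to read, the sketch below recomputes the headline score from the components printed above, assuming the standard BLEU definition: the geometric mean of the four n-gram precisions multiplied by the brevity penalty. The variable names are illustrative; all numbers are copied from the output.

```python
import math

# Values copied from the BLEU output above.
precisions = [
    0.6355159842961301,   # 1-gram precision
    0.5253317249698432,   # 2-gram precision
    0.4472439660795825,   # 3-gram precision
    0.38023792613636365,  # 4-gram precision
]
translation_length = 14264  # total tokens in the predictions
reference_length = 10740    # total tokens in the references

# Predictions contain about 33% more tokens than the references.
length_ratio = translation_length / reference_length

# The brevity penalty only penalizes predictions shorter than the references.
brevity_penalty = 1.0 if length_ratio > 1.0 else math.exp(1.0 - 1.0 / length_ratio)

# BLEU is the brevity penalty times the geometric mean of the precisions.
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
bleu_score = brevity_penalty * geo_mean

print(f"length ratio = {length_ratio:.4f}, BLEU = {bleu_score:.4f}")
```

A length ratio above 1 means the brevity penalty stays at 1.0, so extra generated text is penalized only through lower n-gram precision. This is consistent with the earlier observation that some outputs still include explanations or additional examples.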
