## Parametrized prompt template

This notebook builds a parametrized prompt template and fills it in for every combination of query parameters. The resulting prompts are later used to generate a diverse set of queries.

In [None]:
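# a single master template: single-brace fields are filled in by prompt_builder below,
# double-brace fields (text_en, text_nl) survive formatting as placeholders for the document texts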

master_prompt = """
Based on the following document, available in both English and Dutch, craft a {query_kind} in both languages that a user would naturally enter to find its core substantive information via a search engine.

Guidelines concerning the kind, length and complexity of the query:
    Write {query_nature} {length_instruction}.
    Ensure each query is fully answerable using only the provided document and targets its core substantive content.
    {complexity_instruction1}
    {complexity_instruction2}
    
Restrictions:
    Do not frame queries as if the user assumes the document's existence. Suppose the user is searching for information broadly.
    Avoid general references to organizations (for example: EU, EEG, EEC), document numbers or codes (for example: 2001/106/EC) and broad administrative terms (for example: regulation, decision) unless they are essential to the core content.
    Do not translate the English and Dutch queries from each other. Instead, generate them separately to retrieve the same content naturally in each language. Use the precise wording found in the document in the respective language whenever possible.

Provided documents:
    English document: {{text_en}}
    Dutch document: {{text_nl}}

Response format:
    English: <English query>
    Dutch: <Dutch query>
"""

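The doubled braces around `text_en` and `text_nl` are what allow the template to be filled in two passes: `str.format` turns `{{...}}` into `{...}` instead of substituting it. A minimal sketch of that behaviour, using a shortened stand-in template (`mini_template` and the placeholder document strings are illustrative and not part of the notebook):

```python
# Hypothetical miniature of master_prompt; the brace handling is the same.
mini_template = (
    "Write a {query_kind}.\n"
    "English document: {{text_en}}\n"
    "Dutch document: {{text_nl}}"
)

# First pass: fill in the query parameter. Doubled braces collapse to single braces.
stage_one = mini_template.format(query_kind="wh-question")
print(stage_one)

# Second pass: fill in the document texts (placeholder strings here).
stage_two = stage_one.format(text_en="<English text>", text_nl="<Dutch text>")
print(stage_two)
```

The second pass assumes that the text filled in during the first pass contains no braces of its own, which holds for the instruction strings used in this notebook.
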
In [18]:
# query parameter dictionaries
# query_nature is derived from query_kind: the filling function below looks it up in query_kind_dict.

query_kind_dict = {
    'descriptive search query': "a descriptive search phrase expressed in natural language. Avoid questions like 'What is...?' or 'How does...?' and keyword lists.",
    'yes-no question': "closed questions that can be answered with 'yes' or 'no'.",
    'wh-question': "open-ended wh-questions starting with who, what, where, when, why, how (or their Dutch equivalents).",
    'legal scenario': "a legal scenario describing a situation where a person, company, or authority might need guidance on the legal issue covered in the document. Do not frame the scenario as a direct legal question (e.g., 'Is it legal to...?')."
}

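# prompt_builder builds the length sentence from the keys; the values spell out the resulting instruction
query_length_dict = {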
    'under 15 words': "Each query must be under 15 words",
    'more than 15 words': "Each query must be more than 15 words",
}

complexity_referencing_dict = {
    'factoid': "The query should seek a specific fact that can be directly extracted from the document, without requiring synthesis.",
    'rationale': "The query should require combining and integrating multiple pieces of information from the document to form a reasoned, logical answer."
}

complexity_reasoning_dict = {
    'explicit': "The query should clearly state its intent, explicitly defining what information is needed and leaving no ambiguity about relevance.",
    'implicit': """The query should imply its informational need without explicitly stating it, requiring inference to determine what information is relevant, thus introducing interpretive effort. An example of such an implicit query: "Has the company that proposed X made any recent advancements?" Explanation: this query is implicit because the query does not directly name the company. The retrieval system must first infer what "the company that proposed X" refers to."""
}

In [19]:
# build a list of dictionaries with all possible query parameter combinations

def query_parameters() -> list:
    combi = []
    for kind in query_kind_dict.keys():
        for length in query_length_dict.keys():
            for referencing in complexity_referencing_dict.keys():
                for reasoning in complexity_reasoning_dict.keys():

                    entry = {
                        'query_kind': kind,
                        'query_length': length,
                        'complexity_referencing': referencing,
                        'complexity_reasoning': reasoning
                    }

                    # replace the query length if the kind is 'legal scenario'
                    if kind == 'legal scenario':
                        entry['query_length'] = 'more than 45 words'

                    combi.append(entry)

    return combi

# quick check
c = query_parameters()
print(f'There are {len(c)} combinations')


There are 32 combinations
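
Because the length of a 'legal scenario' query is overwritten for both length options, the two length variants of that kind end up identical. The sketch below rebuilds the same list with `itertools.product` and counts the distinct combinations. The key lists are copied from the dictionaries above, and `query_parameters_product` is a hypothetical name used only for this illustration.

```python
from itertools import product

# Keys copied from the parameter dictionaries above.
kinds = ['descriptive search query', 'yes-no question', 'wh-question', 'legal scenario']
lengths = ['under 15 words', 'more than 15 words']
referencing = ['factoid', 'rationale']
reasoning = ['explicit', 'implicit']

def query_parameters_product() -> list:
    combos = []
    # product iterates in the same order as the four nested loops
    for kind, length, ref, reason in product(kinds, lengths, referencing, reasoning):
        if kind == 'legal scenario':
            length = 'more than 45 words'
        combos.append({
            'query_kind': kind,
            'query_length': length,
            'complexity_referencing': ref,
            'complexity_reasoning': reason,
        })
    return combos

combos = query_parameters_product()
unique = {tuple(entry.values()) for entry in combos}
print(len(combos), len(unique))  # 32 28
```

If the duplicates are unwanted, one option is to drop them once the list is in a dataframe, for example with `DataFrame.drop_duplicates()`.
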


In [20]:
# function to fill in the template based on all combinations

def prompt_builder(row) -> str:
    # row is one dataframe row (a pandas Series) holding the four parameter columns in order

    kind, length, reference, reasoning = row

    return master_prompt.format(
        query_kind = kind,
        query_nature = query_kind_dict[kind],
        length_instruction = f"Each query must be {length}",
        complexity_instruction1 = complexity_referencing_dict[reference],
        complexity_instruction2 = complexity_reasoning_dict[reasoning]
    )
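
To see what a filled-in guideline line looks like, here is a cut-down sketch of the same `str.format` call. `mini_template`, `mini_kind_dict` and `mini_prompt_builder` are stand-ins invented for this example; only the formatting pattern is taken from the function above.

```python
# Hypothetical one-line template mirroring the "Write ..." guideline of master_prompt.
mini_template = "Write {query_nature} {length_instruction}."

# One entry copied from query_kind_dict.
mini_kind_dict = {
    'yes-no question': "closed questions that can be answered with 'yes' or 'no'.",
}

def mini_prompt_builder(kind: str, length: str) -> str:
    return mini_template.format(
        query_nature=mini_kind_dict[kind],
        length_instruction=f"Each query must be {length}",
    )

short_line = mini_prompt_builder('yes-no question', 'under 15 words')
long_line = mini_prompt_builder('yes-no question', 'more than 45 words')
print(short_line)
print(long_line)
```

Since the length instruction is built from the key itself, a key such as 'more than 45 words' works even though it has no entry in `query_length_dict`.
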

In [21]:
# build a dataframe
import pandas as pd
df = pd.DataFrame(query_parameters())

# add a column to the dataframe with the matching formatted prompts
df['formatted_prompt'] = df.apply(prompt_builder, axis=1)

print(f'There are {len(df)} combinations')
df

There are 32 combinations


index,query_kind,query_length,complexity_referencing,complexity_reasoning,formatted_prompt
0,descriptive search query,under 15 words,factoid,explicit,"\nBased on the following document, available i..."
1,descriptive search query,under 15 words,factoid,implicit,"\nBased on the following document, available i..."
2,descriptive search query,under 15 words,rationale,explicit,"\nBased on the following document, available i..."
3,descriptive search query,under 15 words,rationale,implicit,"\nBased on the following document, available i..."
4,descriptive search query,more than 15 words,factoid,explicit,"\nBased on the following document, available i..."
5,descriptive search query,more than 15 words,factoid,implicit,"\nBased on the following document, available i..."
6,descriptive search query,more than 15 words,rationale,explicit,"\nBased on the following document, available i..."
7,descriptive search query,more than 15 words,rationale,implicit,"\nBased on the following document, available i..."
8,yes-no question,under 15 words,factoid,explicit,"\nBased on the following document, available i..."
9,yes-no question,under 15 words,factoid,implicit,"\nBased on the following document, available i..."


In [None]:
# save prompts
import os

os.makedirs('data', exist_ok=True)  # to_csv does not create a missing directory
df.to_csv('data/prompts.csv', index=False)
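
The prompts contain line breaks, so it is worth checking that they survive the CSV round trip. A small sketch, assuming pandas' default quoting, that uses an in-memory buffer and a toy frame (`toy` and its contents are made up for this check) rather than the real file:

```python
import io

import pandas as pd

# Toy frame with a multi-line prompt, standing in for the real dataframe.
toy = pd.DataFrame({
    'query_kind': ['wh-question'],
    'formatted_prompt': ["\nLine one\n    English document: {text_en}\n"],
})

# Write to an in-memory buffer instead of data/prompts.csv, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# Compare the prompt before and after the round trip.
same = loaded['formatted_prompt'].iloc[0] == toy['formatted_prompt'].iloc[0]
print(same)
print(list(loaded.columns))
```

Writing with `index=False` keeps the row index out of the file; without it, reading the file back yields an extra `Unnamed: 0` column.
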