# Generating Diverse Prompts for Section References and Meta-Analysis Info Combinations

## Overview

This notebook creates a diverse set of prompts for each combination of the *Section Reference* and *Meta-Analysis Info* categories. The goal is to extend the initial set of prompts by varying both their syntax (sentence structure) and their wording, so that the outputs differ sufficiently from one another while keeping their original intent.

To this end, we first generate as many new prompts as are needed to reach 20 unique prompts per combination, suitable for later use in classification, training, or testing scenarios. This variation increases the linguistic diversity of the dataset and makes it more broadly applicable. Afterwards, we extend each combination by 10 further prompts generated by a large language model.

## Methodology

To achieve the goal, the following approach is taken:
1. Input Dataset:
- The dataset contains prompts categorized by *Section Reference* (“No Reference,” “Topic,” “Methods,” “Results”) and *Meta-Analysis Info* (“No Info,” “For MA,” “Title,” “Criteria”).
- Each prompt is associated with a specific combination of these two dimensions.
2. Prompt Expansion Using Markov Chains:
- The `markovify` package is used to generate new prompts with Markov chains.
- All prompts belonging to a specific combination of categories are used as input to a Markov model, which generates new prompts based on observed patterns in the input data.
- This allows for the creation of prompts that retain the style and structure of the originals while introducing variation.
3. Ensuring Uniqueness and Quality:
- The generated prompts are checked against the existing prompts to ensure no duplicates are added.
- Prompts are also filtered for redundancy within the same generation cycle.
- If a generated prompt doesn’t meet the criteria (e.g., it is too similar to the originals or doesn’t make sense), the model generates additional prompts to compensate.
4. Final Validation of Markov Chain Model:
- Prompts are formatted uniformly to maintain a consistent presentation.
- Outputs are reviewed, and necessary adjustments are made to improve clarity, variation, and alignment with the intended meaning. If needed, the cycle is run again to obtain the missing prompts.
5. Extension of Prompts by LLM:
- A pre-trained LLM generates 10 additional prompts for each combination of *Section Reference* and *Meta-Analysis Info*.
- These new prompts must align with the structure and intent of the original prompts.
6. Final Dataset:
- All prompts are combined into a final dataset, so that each combination of categories has a comprehensive set of prompts.
- This final dataset will be used for further classification, training, or testing scenarios.
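Step 3 mentions rejecting prompts that are "too similar to the originals", whereas the code in this notebook only tests for exact duplicates. The following is a minimal sketch of how a near-duplicate check could look, using only the standard library. The function name `is_too_similar` and the threshold of 0.9 are assumptions made for illustration; they are not part of the notebook.

```python
from difflib import SequenceMatcher


def is_too_similar(candidate, existing, threshold=0.9):
    """Return True if `candidate` nearly matches any prompt in `existing`.

    `threshold` is a hypothetical cut-off on difflib's similarity ratio (0..1).
    """
    # Normalize case and whitespace so trivial differences do not count as variation
    normalized = " ".join(candidate.lower().split())
    for prompt in existing:
        other = " ".join(prompt.lower().split())
        if SequenceMatcher(None, normalized, other).ratio() >= threshold:
            return True
    return False


existing = [
    "Screen the titles below.",
    "You are a world-class clinical researcher.",
]
near_duplicate = is_too_similar("Screen the  titles below.", existing)
distinct = is_too_similar("Assess each title for methodological rigor.", existing)
print(near_duplicate, distinct)
```

A check like this could be called where the generation loop currently tests `prompt_text not in generated_prompts`. A suitable threshold would have to be tuned by inspecting rejected prompts.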

## Why This Approach?

Using Markov chains enables the generation of syntactically coherent variations based on existing data. This method ensures:
- Efficiency: Quickly generates multiple variations without requiring manual input for each prompt.
- Diversity: Captures different sentence structures and word choices while retaining the essence of the original prompts.
- Scalability: Handles multiple combinations of categories, making it suitable for large datasets.

## Enhancing Markov Chain Models with LLMs

While Markov chains are effective in generating syntactically coherent variations based on existing data, they have limitations in capturing the deeper semantic relationships and contextual nuances present in natural language. Markov chains rely on the probability of word sequences, which can lead to repetitive and sometimes nonsensical outputs, especially when the input data is limited or lacks diversity.

To address these limitations, we incorporate Large Language Models (LLMs) into our prompt generation process. LLMs, such as GPT-3, are pre-trained on vast amounts of text data and are capable of understanding and generating human-like text with a high degree of coherence and relevance. By leveraging LLMs, we can:
- Enhance Diversity: LLMs can generate a wider range of prompts with varied sentence structures and vocabulary, reducing the risk of repetitive outputs that are common with Markov chains.
- Improve Contextual Understanding: LLMs have a better grasp of context and can produce prompts that are more aligned with the intended meaning and purpose, ensuring higher quality and relevance.
- Increase Semantic Richness: By incorporating LLMs, we can generate prompts that capture deeper semantic relationships, making the outputs more meaningful and useful for downstream tasks.
- Maintain Coherence: LLMs can generate longer and more complex prompts while maintaining coherence, which is challenging for Markov chains that operate on fixed state sizes.

By combining the strengths of Markov chains and LLMs, we obtain a set of prompts that is varied in both syntax and semantics, which improves the quality and applicability of the generated dataset for downstream use.
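To make the "fixed state size" limitation concrete, here is a minimal pure-Python sketch of a word-level Markov chain. It is not how `markovify` is implemented. The helper names `build_chain` and `generate`, the `<s>`/`</s>` boundary markers, and the toy prompts are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_chain(sentences, state_size=2):
    """Map each tuple of `state_size` consecutive words to the words seen after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        # Pad with start markers and close with an end marker
        words = ["<s>"] * state_size + sentence.split() + ["</s>"]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return chain


def generate(chain, state_size=2, rng=None, max_words=50):
    """Walk the chain from the start state until the end marker is drawn."""
    rng = rng or random.Random()
    state = ("<s>",) * state_size
    output = []
    while len(output) < max_words:
        next_word = rng.choice(chain[state])
        if next_word == "</s>":
            break
        output.append(next_word)
        # The model only ever "remembers" the last `state_size` words
        state = state[1:] + (next_word,)
    return " ".join(output)


toy_prompts = [
    "Screen the titles below.",
    "Screen the titles below like a human would.",
    "Review the titles below and assess their relevance.",
]
chain = build_chain(toy_prompts, state_size=2)
sample = generate(chain, state_size=2, rng=random.Random(0))
print(sample)
```

Because every next word is chosen from what followed the previous two words in the input, the output can only recombine fragments that already exist. This is why a small input set gives limited variety.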

In [79]:
import pandas as pd
import markovify
import csv
import random

In [80]:
# Load the initial_prompts.csv into a dataframe
df = pd.read_csv('initial_prompts.csv', delimiter=';')

# Remove line breaks in the 'TitlePrompt' column
df['TitlePrompt'] = df['TitlePrompt'].str.replace('\n', ' ').str.replace('\r', ' ')

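# Fill NaN values with empty strings in the 'TitlePrompt' column
# (assign the result back; calling fillna with inplace=True on a column selection is deprecated)
df['TitlePrompt'] = df['TitlePrompt'].fillna('')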

# Keep only the first 213 rows (index 0 to 212) and drop everything after them
df = df.iloc[:213]

# Set 'screen_titles' to 1 and 'screen_abstracts' to 0
df['screen_titles'] = 1
df['screen_abstracts'] = 0

# Save the cleaned dataframe to initial_prompts_cleaned.csv
df.to_csv('initial_prompts_cleaned.csv', index=False, sep=';')

# Display the cleaned dataframe
print(df.head(25))

    Model  Words Section Reference Meta-Analysis Info  screen_titles  \
0       1      1      No Reference            No Info              1   
1       1      8      No Reference            No Info              1   
2       1      4      No Reference            No Info              1   
3       1      6      No Reference            No Info              1   
4       1     13      No Reference            No Info              1   
5       1      6      No Reference            No Info              1   
6       1     13      No Reference            No Info              1   
7       1     10      No Reference            No Info              1   
8       1     47      No Reference            No Info              1   
9       1     42      No Reference            No Info              1   
10      1     47      No Reference            No Info              1   
11      1     56      No Reference            No Info              1   
12      1     70      No Reference            No Info           



In [81]:
# Count the number of prompts for each combination of Section Reference and Meta-Analysis Info
prompt_counts = df.groupby(['Section Reference', 'Meta-Analysis Info']).size().reset_index(name='Count')

# Display the counts
print(prompt_counts)

   Section Reference Meta-Analysis Info  Count
0            Methods           Criteria     11
1            Methods             For MA     13
2            Methods            No Info     15
3            Methods              Title     13
4       No Reference           Criteria     13
5       No Reference             For MA     13
6       No Reference            No Info     17
7       No Reference              Title     14
8            Results           Criteria     11
9            Results             For MA     13
10           Results            No Info     15
11           Results              Title     13
12             Topic           Criteria     11
13             Topic             For MA     13
14             Topic            No Info     15
15             Topic              Title     13


In [82]:
# Define combinations of categories
section_refs = ["No Reference", "Topic", "Methods", "Results"]
meta_analysis_infos = ["No Info", "For MA", "Title", "Criteria"]

# Function to generate unique new prompts
def generate_unique_prompts(model, num_prompts, existing_prompts, min_sentences=0, max_sentences=12):
    generated_prompts = set(existing_prompts)  # Set to avoid duplicates
    new_prompts = []
    attempts = 0  # Safety loop to prevent infinite loops
    max_attempts = num_prompts * 10  # Max attempts per required prompt
    
    while len(new_prompts) < num_prompts and attempts < max_attempts:
        attempts += 1
        # Randomly choose a number of sentences for the prompt
        num_sentences = random.randint(min_sentences, max_sentences)
        sentences = set()
        prompt = []
        for _ in range(num_sentences):
            sentence = model.make_sentence(tries=100)
            if sentence and sentence not in sentences:
                sentences.add(sentence)
                prompt.append(sentence)
        prompt_text = " ".join(prompt)
        if prompt_text and prompt_text not in generated_prompts:  # Check for duplicates
            generated_prompts.add(prompt_text)
            new_prompts.append(prompt_text.strip())  # Ensure uniform layout
    return new_prompts

# Save all new prompts
all_generated_prompts = []

# Iterate over all combinations of `Section Reference` and `Meta-Analysis Info`
for section_ref in section_refs:
    for meta_info in meta_analysis_infos:
        # Filter data for the current combination
        filtered_data = df[
            (df["Section Reference"] == section_ref) &
            (df["Meta-Analysis Info"] == meta_info)
        ]
        
        # Extract existing prompts
        existing_prompts = filtered_data["TitlePrompt"].dropna().tolist()
        num_existing_prompts = len(existing_prompts)
        
        # Calculate the number of new prompts
        num_new_prompts = max(0, 20 - num_existing_prompts)  # No negative values
        
        if num_new_prompts > 0 and existing_prompts:
            # Combined text for Markov model
            combined_text = "\n\n".join(existing_prompts)
            
            # Create Markov model
            markov_model = markovify.Text(
                combined_text,
                state_size=2,
                # Do not keep the parsed corpus in memory. Note that this also skips
                # markovify's built-in overlap check against the original texts.
                retain_original=False
            )
            
            # Generate new unique prompts
            new_prompts = generate_unique_prompts(
                markov_model,
                num_prompts=num_new_prompts,
                existing_prompts=existing_prompts,
                min_sentences=0,
                max_sentences=12
            )
            
            # Save new prompts
            for prompt in new_prompts:
                all_generated_prompts.append({
                    "Section Reference": section_ref,
                    "Meta-Analysis Info": meta_info,
                    "Generated Prompt": prompt
                })

# Write results to a DataFrame
generated_prompts_df = pd.DataFrame(all_generated_prompts)

# Output the generated prompts
print("Generated Prompts:")
for _, row in generated_prompts_df.iterrows():
    print(f"Section: {row['Section Reference']}, Meta: {row['Meta-Analysis Info']}, Prompt: {row['Generated Prompt']}")

Generated Prompts:
Section: No Reference, Meta: No Info, Prompt: Exclude the irrelevant papers and include the relevant papers in your life from a broad range of fields. Consider whether the paper addresses the broader context of inquiry. Of the titles below and assess their scientific rigor.
Section: No Reference, Meta: No Info, Prompt: Systematically review the provided list of provided titles with the concepts of sensitivity and specificity in signal-detection theory. Consider whether the paper addresses the broader research question, adheres to established principles of careful selection, fairness, and scholarly rigor, allowing your expertise to guide you in making informed and accurate decisions. This review should prioritize thoughtful and balanced understanding of the available literature, prioritizing works that provide sufficient detail and transparency to allow for informed consideration within a broader understanding of the subject area.
Section: No Reference, Meta: No Info,

In [83]:
# Remove double quotes from the generated prompts so they can be written without quoting
generated_prompts_df['Generated Prompt'] = generated_prompts_df['Generated Prompt'].str.replace('"', '')

# Save results without additional quotes
generated_prompts_df.to_csv("generated_prompts.csv", index=False, quoting=csv.QUOTE_NONE, sep=';')
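One caveat with `quoting=csv.QUOTE_NONE` and `sep=';'`: without an escape character, a field that contains the delimiter cannot be written. The sketch below shows this with the standard `csv` module, which pandas relies on for writing as far as we know. The example row is made up, and the notebook does not show whether any generated prompt actually contains a semicolon.

```python
import csv
import io

# Hypothetical row whose prompt text contains the ';' delimiter
rows = [["No Reference", "No Info", "Screen the titles; include relevant ones."]]

# QUOTE_NONE without an escapechar cannot represent a field containing the delimiter
strict_buffer = io.StringIO()
try:
    csv.writer(strict_buffer, delimiter=";", quoting=csv.QUOTE_NONE).writerows(rows)
    strict_failed = False
except csv.Error:
    strict_failed = True

# QUOTE_MINIMAL quotes only the fields that need it, so the row survives a round trip
safe_buffer = io.StringIO()
csv.writer(safe_buffer, delimiter=";", quoting=csv.QUOTE_MINIMAL).writerows(rows)
safe_buffer.seek(0)
round_trip = list(csv.reader(safe_buffer, delimiter=";"))

print(strict_failed, round_trip == rows)
```

If unquoted output is required, passing an `escapechar` is another option. Otherwise minimal quoting is the safer default for free-text columns.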


From here on, we work on `generated_prompts_cleaned.csv`, a copy of the original `generated_prompts.csv`. In this copy, the revised prompts are stored in the column `Generated Prompt Changed`.

In [84]:
# Load the cleaned generated prompts
generated_prompts_cleaned_df = pd.read_csv('generated_prompts_cleaned.csv', delimiter=';')

In [85]:
# Delete the column "Generated Prompt"
generated_prompts_cleaned_df = generated_prompts_cleaned_df.drop(columns=['Generated Prompt'])

# Rename the column "Generated Prompt Changed" to "TitlePrompt"
generated_prompts_cleaned_df = generated_prompts_cleaned_df.rename(columns={'Generated Prompt Changed': 'TitlePrompt'})

# Display the updated dataframe
print(generated_prompts_cleaned_df.head())

  Section Reference Meta-Analysis Info  \
0      No Reference            No Info   
1      No Reference            No Info   
2      No Reference            No Info   
3      No Reference             For MA   
4      No Reference             For MA   

                                         TitlePrompt  
0  Your evaluation should reflect your considerab...  
1  You are familiar with the concepts of sensitiv...  
2  Select each one based on their titles. Please ...  
3  This review should emphasize careful and delib...  
4  Avoid including titles unless you are confiden...  


We have revised the generated prompts and rewritten those that were syntactically or semantically nonsensical. However, because the Markov chains can only reuse wording from the input prompts, the vocabulary remains very restricted. We therefore expand the data further with an LLM, creating 10 additional prompts per combination of Section Reference and Meta-Analysis Info. The pipeline below shows how this can be done through the API; in practice, we used ChatGPT with the same model that is named in the API call.

In [86]:
from openai import OpenAI

# Create the client (openai>=1.0 interface); it reads the key from the
# OPENAI_API_KEY environment variable
client = OpenAI()

# Load the initial prompts cleaned dataframe
df = pd.read_csv('initial_prompts_cleaned.csv', delimiter=';')

# Define combinations of categories
section_refs = ["No Reference", "Topic", "Methods", "Results"]
meta_analysis_infos = ["No Info", "For MA", "Title", "Criteria"]

# Function to generate new prompts using OpenAI's API
def generate_prompts(section_ref, meta_info, existing_prompts):
    llm_prompt = f"""
    You are tasked with generating a new prompt for the combination of Section Reference: '{section_ref}' and Meta-Analysis Info: '{meta_info}'. The existing prompts for this combination are as follows:

    {existing_prompts}

    Please generate one new prompt that maintains the aspect of Section Reference and Meta-Analysis Info but is varied in choice of language, words, and general sentence structure.
    """
    
    # n=10 requests ten independent completions, i.e. one new prompt per completion
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": llm_prompt}
        ],
        max_tokens=150,
        n=10,
        temperature=0.7
    )
    
    # Each choice holds one generated prompt
    new_prompts = [choice.message.content.strip() for choice in response.choices]
    return new_prompts

# Iterate over all combinations of `Section Reference` and `Meta-Analysis Info`
all_new_prompts = []
for section_ref in section_refs:
    for meta_info in meta_analysis_infos:
        # Filter data for the current combination
        filtered_df = df[(df['Section Reference'] == section_ref) & (df['Meta-Analysis Info'] == meta_info)]
        
        # Extract existing prompts
        existing_prompts = filtered_df['TitlePrompt'].dropna().tolist()
        
        if existing_prompts:
            # Generate new prompts using OpenAI's API
            new_prompts = generate_prompts(section_ref, meta_info, existing_prompts)
            
            # Create a new dataframe for the new prompts
            new_prompts_df = pd.DataFrame({
                'Section Reference': [section_ref] * len(new_prompts),
                'Meta-Analysis Info': [meta_info] * len(new_prompts),
                'TitlePrompt': new_prompts
            })
            
            # Append the new prompts to the existing dataframe
            all_new_prompts.append(new_prompts_df)

# Concatenate all new prompts dataframes
if all_new_prompts:
    all_new_prompts_df = pd.concat(all_new_prompts, ignore_index=True)
    extended_prompts_df = pd.concat([generated_prompts_cleaned_df, all_new_prompts_df], ignore_index=True)
    
    # Save the extended dataframe to 'generated_prompts_llm.csv'
    extended_prompts_df.to_csv('generated_prompts_llm.csv', index=False, sep=';')
    
    # Display the new prompts
    print("Newly Generated Prompts by LLM:")
    for _, row in all_new_prompts_df.iterrows():
        print(f"Section: {row['Section Reference']}, Meta: {row['Meta-Analysis Info']}, Prompt: {row['TitlePrompt']}")
else:
    print("No new prompts generated.")



## Combining the Prompt Dataframes into a Final One

We now want to combine the initial prompts with the generated prompts from both the LLM and the Markov chains. This will create a comprehensive dataset that includes all variations and expansions of the original prompts.

In [87]:
# Load the initial prompts cleaned dataframe
initial_prompts_cleaned_df = pd.read_csv('initial_prompts_cleaned.csv', delimiter=';')

# Load the generated prompts LLM dataframe
generated_prompts_llm_df = pd.read_csv('generated_prompts_llm.csv', delimiter=';')

# Function to count rows per combination of Section Reference and Meta-Analysis Info
def count_combinations(df):
    return df.groupby(['Section Reference', 'Meta-Analysis Info']).size().reset_index(name='Count')

# Count combinations for initial_prompts_cleaned_df
initial_counts = count_combinations(initial_prompts_cleaned_df)
print("Initial Prompts Cleaned Counts:")
print(initial_counts)

# Count combinations for generated_prompts_cleaned_df
generated_cleaned_counts = count_combinations(generated_prompts_cleaned_df)
print("\nGenerated Prompts Cleaned Counts:")
print(generated_cleaned_counts)

# Count combinations for generated_prompts_llm_df
generated_llm_counts = count_combinations(generated_prompts_llm_df)
print("\nGenerated Prompts LLM Counts:")
print(generated_llm_counts)

Initial Prompts Cleaned Counts:
   Section Reference Meta-Analysis Info  Count
0            Methods           Criteria     11
1            Methods             For MA     13
2            Methods            No Info     15
3            Methods              Title     13
4       No Reference           Criteria     13
5       No Reference             For MA     13
6       No Reference            No Info     17
7       No Reference              Title     14
8            Results           Criteria     11
9            Results             For MA     13
10           Results            No Info     15
11           Results              Title     13
12             Topic           Criteria     11
13             Topic             For MA     13
14             Topic            No Info     15
15             Topic              Title     13

Generated Prompts Cleaned Counts:
   Section Reference Meta-Analysis Info  Count
0            Methods           Criteria      9
1            Methods             For MA 

In [88]:
initial_prompts_df = pd.read_csv('initial_prompts_cleaned.csv', sep=';') # 213

# Load new prompts from generated_prompts_llm.csv
generated_prompts_llm = pd.read_csv('generated_prompts_llm.csv', sep=';') # 160

# Select relevant columns from generated_prompts_llm
generated_prompts_llm_filtered = generated_prompts_llm[['Section Reference', 'Meta-Analysis Info', 'TitlePrompt']]

# Include prompts from generated_prompts_cleaned_df
generated_prompts_cleaned_filtered = generated_prompts_cleaned_df[['Section Reference', 'Meta-Analysis Info', 'TitlePrompt']] # 107

# Combine the prompts into complete_prompts_combined
complete_prompts_combined = pd.concat(
    [initial_prompts_df, generated_prompts_llm_filtered, generated_prompts_cleaned_filtered],
    ignore_index=True
)

In [89]:
# delete duplicates
complete_prompts_combined = complete_prompts_combined.drop_duplicates(
    subset=['Section Reference', 'Meta-Analysis Info', 'TitlePrompt']
)

# save combined prompts
complete_prompts_combined.to_csv('complete_prompts.csv', index=False, sep=';')

In [90]:
# Function to replace "title" with "abstract" in a string
def replace_title_with_abstract(text):
    return text.replace("title", "abstract").replace("Title", "Abstract")

# Update the DataFrame
complete_prompts_combined['Model'] = 1
complete_prompts_combined['Words'] = complete_prompts_combined['TitlePrompt'].apply(lambda x: len(x.split()) if pd.notnull(x) else 0)
complete_prompts_combined['screen_titles'] = 1
complete_prompts_combined['screen_abstracts'] = 0
complete_prompts_combined['AbstractPrompt'] = complete_prompts_combined['TitlePrompt'].apply(lambda x: replace_title_with_abstract(x) if pd.notnull(x) else '')

# Fill NaN values with empty strings for the rest of the columns
columns_to_fill = ['sensitivity_titles', 'specificity_titles', 'PPV_titles', 'NPV_titles', 'tp_titles', 'tn_titles', 'fp_titles', 'fn_titles', 
                   'sensitivity_abstracts', 'specificity_abstracts', 'PPV_abstracts', 'NPV_abstracts', 'tp_abstracts', 'tn_abstracts', 'fp_abstracts', 'fn_abstracts']
complete_prompts_combined[columns_to_fill] = complete_prompts_combined[columns_to_fill].fillna('')

# Count the number of rows for each combination of Section Reference and Meta-Analysis Info
combination_counts = complete_prompts_combined.groupby(['Section Reference', 'Meta-Analysis Info']).size().reset_index(name='Count')

# Display the counts
print(combination_counts)

   Section Reference Meta-Analysis Info  Count
0            Methods           Criteria     30
1            Methods             For MA     30
2            Methods            No Info     30
3            Methods              Title     30
4       No Reference           Criteria     30
5       No Reference             For MA     30
6       No Reference            No Info     30
7       No Reference              Title     30
8            Results           Criteria     30
9            Results             For MA     30
10           Results            No Info     30
11           Results              Title     30
12             Topic           Criteria     30
13             Topic             For MA     30
14             Topic            No Info     30
15             Topic              Title     30
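The `replace_title_with_abstract` helper above uses plain `str.replace`, which also rewrites substrings inside longer words (for example, "subtitle" would become "subabstract"). Whether such words occur in the prompts is not shown here. If they do, a whole-word variant along the following lines could be used. The function name and the regular expression are assumptions for illustration, not the notebook's method.

```python
import re

# Match "title"/"Title" with an optional plural "s", but only as a whole word
_TITLE_PATTERN = re.compile(r"\b(t|T)itle(s?)\b")


def replace_title_with_abstract_whole_words(text):
    """Replace whole-word 'title(s)' with 'abstract(s)', keeping the first letter's case."""
    def swap(match):
        first = "A" if match.group(1) == "T" else "a"
        return f"{first}bstract{match.group(2)}"
    return _TITLE_PATTERN.sub(swap, text)


example = "Screen the titles below. Title screening only; ignore subtitles."
converted = replace_title_with_abstract_whole_words(example)
print(converted)
```

Like the original helper, this sketch does not handle fully upper-case "TITLE".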


In [91]:
# Create a copy of the dataframe with screen_titles set to 0 and screen_abstracts set to 1
df_screen_abstracts = complete_prompts_combined.copy()
df_screen_abstracts['screen_titles'] = 0
df_screen_abstracts['screen_abstracts'] = 1

# Create a copy of the dataframe with both screen_titles and screen_abstracts set to 1
df_screen_both = complete_prompts_combined.copy()
df_screen_both['screen_titles'] = 1
df_screen_both['screen_abstracts'] = 1

# Concatenate the original dataframe with the two new dataframes
final_df = pd.concat([complete_prompts_combined, df_screen_abstracts, df_screen_both], ignore_index=True)

# Save the final dataframe to a CSV file
final_df.to_csv('final_prompts.csv', index=False, sep=';')

# Display the final dataframe
print(final_df.head())
print(final_df.tail())

   Model  Words Section Reference Meta-Analysis Info  screen_titles  \
0      1      0      No Reference            No Info              1   
1      1      8      No Reference            No Info              1   
2      1      4      No Reference            No Info              1   
3      1      6      No Reference            No Info              1   
4      1     13      No Reference            No Info              1   

   screen_abstracts                                        TitlePrompt  \
0                 0                                                NaN   
1                 0        Screen the titles below like a human would.   
2                 0                           Screen the titles below.   
3                 0         You are a world-class clinical researcher.   
4                 0  Do not exclude any titles unless you are absol...   

                                      AbstractPrompt sensitivity_titles  \
0                                                    

After generating our prompts, we combine the prompts from three sources and write them to a single CSV file named `final_prompts.csv`.

The three sources, whose per-combination counts are printed above, are:
1. Initial prompts ("Initial Prompts Cleaned Counts"): the prompts we wrote by hand and used as a foundation.
- The number of entries varies by combination (11 to 17).
2. Markov chain prompts ("Generated Prompts Cleaned Counts"): prompts generated via Markov chains from the initial prompts.
- They fill each combination up to 20 prompts.
3. LLM prompts ("Generated Prompts LLM Counts"): prompts generated by an LLM that was given the initial prompts as examples.
- They add 10 prompts per combination of Section Reference and Meta-Analysis Info.

This yielded 480 prompts, all of which were unique. We then added each prompt under the three [screen_titles; screen_abstracts] settings [1;0], [0;1], and [1;1] to create our final dataset. In the end, `final_prompts.csv` contains 1,440 prompts.
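The totals can be checked from the per-combination counts of the initial prompts printed earlier. The short sketch below redoes that arithmetic; the constant names are ours and are chosen only for illustration.

```python
# Initial prompt counts per (Section Reference, Meta-Analysis Info), as printed above
initial_counts = {
    ("Methods", "Criteria"): 11, ("Methods", "For MA"): 13,
    ("Methods", "No Info"): 15, ("Methods", "Title"): 13,
    ("No Reference", "Criteria"): 13, ("No Reference", "For MA"): 13,
    ("No Reference", "No Info"): 17, ("No Reference", "Title"): 14,
    ("Results", "Criteria"): 11, ("Results", "For MA"): 13,
    ("Results", "No Info"): 15, ("Results", "Title"): 13,
    ("Topic", "Criteria"): 11, ("Topic", "For MA"): 13,
    ("Topic", "No Info"): 15, ("Topic", "Title"): 13,
}

TARGET_AFTER_MARKOV = 20          # each combination is filled up to 20 prompts
LLM_PROMPTS_PER_COMBINATION = 10  # then extended by 10 LLM prompts
SCREENING_SETTINGS = [(1, 0), (0, 1), (1, 1)]  # (screen_titles, screen_abstracts)

initial_total = sum(initial_counts.values())
markov_total = sum(max(0, TARGET_AFTER_MARKOV - n) for n in initial_counts.values())
llm_total = LLM_PROMPTS_PER_COMBINATION * len(initial_counts)
unique_total = initial_total + markov_total + llm_total
final_total = unique_total * len(SCREENING_SETTINGS)

print(initial_total, markov_total, llm_total, unique_total, final_total)
# → 213 107 160 480 1440
```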