# Generating Diverse Prompts for Section References and Meta-Analysis Info Combinations

## Overview

This file aims to create a diverse set of prompts for each combination of *Section Reference* and *Meta-Analysis Info* categories. The primary goal is to extend and enrich the initial set of prompts by varying both their syntax (sentence structure) and their wording, so that the outputs differ sufficiently from one another while keeping their original intent.

To this end, we generate as many new prompts as are needed to reach 50 unique prompts per combination, suitable for further use in classification, training, or testing scenarios. This variation increases the linguistic diversity of the dataset and makes it more broadly applicable.
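
As a rough illustration of the target of 50 prompts per combination, the sketch below counts how many prompts each combination still needs. The label counts and the helper name `prompts_needed` are hypothetical and are not taken from the dataset.

```python
from collections import Counter

TARGET_PER_COMBINATION = 50  # target stated in the overview

# Hypothetical (Section Reference, Meta-Analysis Info) labels, one per existing prompt
labels = (
    [("No Reference", "No Info")] * 20
    + [("Topic", "Title")] * 50
    + [("Methods", "Criteria")] * 57
)

def prompts_needed(labels, target=TARGET_PER_COMBINATION):
    """Return how many prompts each combination still needs to reach the target."""
    counts = Counter(labels)
    # max(0, ...) keeps the shortfall at zero when a combination already exceeds the target
    return {combo: max(0, target - n) for combo, n in counts.items()}

shortfall = prompts_needed(labels)
for combo, missing in shortfall.items():
    print(combo, missing)
```

Combinations that already hold 50 or more prompts get a shortfall of zero, so no new prompts are requested for them.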

## Methodology

To achieve the goal, the following approach is taken:
1. Input Dataset:
   - The dataset contains prompts categorized by *Section Reference* (e.g., “No Reference,” “Topic,” “Methods,” “Results”) and *Meta-Analysis Info* (e.g., “No Info,” “For MA,” “Title,” “Criteria”).
   - Each prompt is associated with a specific combination of these two dimensions.
2. Prompt Expansion Using Markov Chains:
   - The markovify package is used to generate new prompts with Markov chains.
   - All prompts belonging to a specific combination of categories are used as input to a Markov model, which generates new prompts based on observed patterns in the input data.
   - This allows for the creation of prompts that retain the style and structure of the originals while introducing variation.
3. Ensuring Uniqueness and Quality:
   - The generated prompts are checked against the existing prompts to ensure no duplicates are added.
   - Prompts are also filtered for redundancy within the same generation cycle.
   - If a generated prompt doesn’t meet the criteria (e.g., it is too similar to the originals or doesn’t make sense), the model generates additional prompts to compensate.
4. Final Validation:
   - Prompts are formatted uniformly to maintain a consistent presentation.
   - Outputs are reviewed, and necessary adjustments are made to improve clarity, variation, and alignment with the intended meaning. If needed, the cycle is run again to obtain the missing prompts.
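
To make the Markov-chain step more concrete, here is a minimal pure-Python sketch of a word-level chain with a state size of 2. It is a simplified illustration of the general technique, not markovify's actual implementation. The names `build_chain` and `walk` and the example sentences are assumptions made for this sketch.

```python
import random
from collections import defaultdict

BEGIN, END = "__BEGIN__", "__END__"  # sentinel tokens marking sentence boundaries

def build_chain(sentences, state_size=2):
    """Map each state (a tuple of `state_size` tokens) to the tokens observed after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        tokens = [BEGIN] * state_size + sentence.split() + [END]
        for i in range(len(tokens) - state_size):
            state = tuple(tokens[i:i + state_size])
            chain[state].append(tokens[i + state_size])
    return chain

def walk(chain, state_size=2, rng=random, max_words=50):
    """Generate one sentence by repeatedly sampling a follower of the current state."""
    state = (BEGIN,) * state_size
    words = []
    while len(words) < max_words:
        next_token = rng.choice(chain[state])
        if next_token == END:
            break
        words.append(next_token)
        state = state[1:] + (next_token,)  # slide the window by one token
    return " ".join(words)

# Hypothetical input prompts for a single category combination
examples = [
    "Review the titles provided below.",
    "Review the list of titles provided below.",
    "Carefully review the list of titles.",
]
chain = build_chain(examples)
rng = random.Random(0)  # seeded so that repeated runs give the same samples
for _ in range(3):
    print(walk(chain, rng=rng))
```

Because every transition is copied from the input, each generated sentence recombines fragments of the originals. This is why the output keeps their style, and also why a small input set can reproduce an original sentence verbatim.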

## Why This Approach?

Using Markov chains enables the generation of syntactically coherent variations based on existing data. This method ensures:
- Efficiency: Quickly generates multiple variations without requiring manual input for each prompt.
- Diversity: Captures different sentence structures and word choices while retaining the essence of the original prompts.
- Scalability: Handles multiple combinations of categories, making it suitable for large datasets.

By the end of this process, we will have a rich, diverse, and meaningful set of prompts that are ready for deployment in various applications.
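
The methodology mentions rejecting prompts that are too similar to the originals, while the generation code in this notebook compares prompts by exact string match only. One possible way to add a near-duplicate check is sketched below using `difflib.SequenceMatcher` from the standard library. The helper names and the 0.9 threshold are assumptions and would need tuning on real data.

```python
from difflib import SequenceMatcher

def is_too_similar(candidate, references, threshold=0.9):
    """Return True if the candidate closely matches any reference (case-insensitive)."""
    cand = candidate.strip().lower()
    return any(
        SequenceMatcher(None, cand, ref.strip().lower()).ratio() >= threshold
        for ref in references
    )

def filter_novel(candidates, existing, threshold=0.9):
    """Keep candidates that are not near-duplicates of existing or earlier kept prompts."""
    kept = []
    for candidate in candidates:
        if not is_too_similar(candidate, list(existing) + kept, threshold):
            kept.append(candidate)
    return kept

# Hypothetical prompts used only to demonstrate the filter
existing = ["Carefully review the list of titles provided below."]
candidates = [
    "Carefully review the list of titles provided below",  # differs only by the final period
    "Select every title that is relevant to the topic.",
    "Select every title that is relevant to the topic.",   # exact repeat within the batch
]
print(filter_novel(candidates, existing))
```

Comparing each candidate against both the existing prompts and the prompts already kept covers the two checks described above: duplicates of the originals and redundancy within one generation cycle.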

In [16]:
import pandas as pd
import markovify

In [17]:
# Load the initial_prompts.csv into a dataframe
df = pd.read_csv('initial_prompts.csv', delimiter=';')

# Remove line breaks in the 'TitlePrompt' column
df['TitlePrompt'] = df['TitlePrompt'].str.replace('\n', ' ').str.replace('\r', ' ')

# Fill NaN values with empty strings in the 'TitlePrompt' column
# (assign the result back; chained `inplace=True` raises a pandas FutureWarning)
df['TitlePrompt'] = df['TitlePrompt'].fillna('')

# Keep only the first 213 rows (positions 0 to 212); copy to avoid modifying a slice
df = df.iloc[:213].copy()

# Set 'screen_titles' to 1 and 'screen_abstracts' to 0
df['screen_titles'] = 1
df['screen_abstracts'] = 0

# Save the cleaned dataframe to prompts.csv
df.to_csv('prompts.csv', index=False, sep=';')

# Display the cleaned dataframe
print(df.head(25))

    Model  Words Section Reference Meta-Analysis Info  screen_titles  \
0       1      1      No Reference            No Info              1   
1       1      8      No Reference            No Info              1   
2       1      4      No Reference            No Info              1   
3       1      6      No Reference            No Info              1   
4       1     13      No Reference            No Info              1   
5       1      6      No Reference            No Info              1   
6       1     13      No Reference            No Info              1   
7       1     10      No Reference            No Info              1   
8       1     47      No Reference            No Info              1   
9       1     42      No Reference            No Info              1   
10      1     47      No Reference            No Info              1   
11      1     56      No Reference            No Info              1   
12      1     70      No Reference            No Info           


In [18]:
# Define combinations of categories
section_refs = ["No Reference", "Topic", "Methods", "Results"]
meta_analysis_infos = ["No Info", "For MA", "Title", "Criteria"]

# Function to generate unique new prompts
def generate_unique_prompts(model, num_prompts, existing_prompts, max_words=50):
    """Generate up to `num_prompts` sentences that are not in `existing_prompts`."""
    seen_prompts = {p.strip() for p in existing_prompts}  # Set for fast duplicate checks
    new_prompts = []
    attempts = 0  # Attempt counter to prevent an infinite loop
    max_attempts = num_prompts * 10  # Allow up to 10 attempts per required prompt

    while len(new_prompts) < num_prompts and attempts < max_attempts:
        attempts += 1
        prompt = model.make_sentence(tries=100, max_words=max_words)
        if not prompt:
            continue  # make_sentence returns None if no valid sentence was found
        prompt = prompt.strip()  # Normalize whitespace before comparing
        if prompt not in seen_prompts:  # Check for duplicates
            seen_prompts.add(prompt)
            new_prompts.append(prompt)
    return new_prompts

# Save all new prompts
all_generated_prompts = []

# Iterate over all combinations of `Section Reference` and `Meta-Analysis Info`
for section_ref in section_refs:
    for meta_info in meta_analysis_infos:
        # Filter data for the current combination
        filtered_data = df[
            (df["Section Reference"] == section_ref) &
            (df["Meta-Analysis Info"] == meta_info)
        ]
        
        # Extract existing prompts
        existing_prompts = filtered_data["TitlePrompt"].dropna().tolist()
        num_existing_prompts = len(existing_prompts)
        
        # Calculate the number of new prompts
        num_new_prompts = max(0, 50 - num_existing_prompts)  # No negative values
        
        if num_new_prompts > 0 and existing_prompts:
            # Combined text for Markov model
            combined_text = "\n\n".join(existing_prompts)
            
            # Create Markov model
            markov_model = markovify.Text(
                combined_text,
                state_size=2,
                retain_original=False  # Do not store the parsed corpus; this also skips markovify's overlap check against the originals
            )
            
            # Generate new unique prompts
            new_prompts = generate_unique_prompts(
                markov_model,
                num_prompts=num_new_prompts,
                existing_prompts=existing_prompts,
                max_words=50
            )
            
            # Save new prompts
            for prompt in new_prompts:
                all_generated_prompts.append({
                    "Section Reference": section_ref,
                    "Meta-Analysis Info": meta_info,
                    "Generated Prompt": prompt
                })

# Write results to a DataFrame
generated_prompts_df = pd.DataFrame(all_generated_prompts)

# Save results
generated_prompts_df.to_csv("generated_prompts.csv", index=False)

# Output the generated prompts
print("Generated Prompts:")
for _, row in generated_prompts_df.iterrows():
    print(f"Section: {row['Section Reference']}, Meta: {row['Meta-Analysis Info']}, Prompt: {row['Generated Prompt']}")

Generated Prompts:
Section: No Reference, Meta: No Info, Prompt: Go through the titles below.
Section: No Reference, Meta: No Info, Prompt: Systematically review the given list of provided titles with the overarching context of inquiry.
Section: No Reference, Meta: No Info, Prompt: Carefully review the list of titles provided below.
Section: No Reference, Meta: No Info, Prompt: This means that you select those titles that are not directly relevant or lack sufficient scope, depth, or foundational relevance to a comprehensive and balanced understanding of the subject area.
Section: No Reference, Meta: No Info, Prompt: Focus on identifying studies that demonstrate potential relevance while excluding those that do not include any title unless you are absolutely certain it is relevant.
Section: No Reference, Meta: No Info, Prompt: Aim to ensure that each paper is judged based solely on its capacity to enhance general insights or perspectives.
Section: No Reference, Meta: No Info, Prompt: Ex