# Tutorial - Generating Evaluation Data for Retrieval Pipelines

This notebook outlines one of many use cases for synthetic data: generating questions with which you can evaluate a retrieval pipeline. Let's get started.

Before we start generating data, let's align on the larger process. To evaluate a retriever, you need two things: questions and the ground truth. From a synthetic data perspective, we flip the problem on its head. We start with a few documents, split them into chunks, and then create questions that can be answered using those chunks. Because we already know which chunk each question came from, we can create the evaluation pairs directly. If you want to learn more about the nuances of evaluating retrieval pipelines, check out [this article](https://developer.nvidia.com/blog/evaluating-retriever-for-enterprise-grade-rag/).
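
To make the idea of evaluation pairs concrete, here is a minimal sketch of how the generated questions might later be used. It assumes each question is stored with the id of the chunk it was generated from, and it scores a retriever by recall@k. The `EvalPair` class, the chunk ids, the sample questions, and `toy_retrieve` are hypothetical names for illustration and are not part of this notebook's utilities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalPair:
    """One synthetic evaluation example: a question and the chunk that answers it."""
    question: str
    chunk_id: str


def recall_at_k(pairs, retrieve, k=5):
    """Fraction of questions whose source chunk appears in the top-k results.

    `retrieve` is a hypothetical callable: question -> ranked list of chunk ids.
    """
    if not pairs:
        return 0.0
    hits = sum(1 for p in pairs if p.chunk_id in retrieve(p.question)[:k])
    return hits / len(pairs)


# Toy data: the chunk ids and questions are made up for illustration.
pairs = [
    EvalPair("How big is the breakup fee?", "chunk-0"),
    EvalPair("Who is reviewing the merger in the EU?", "chunk-0"),
    EvalPair("What was last year's dividend?", "chunk-7"),
]


def toy_retrieve(question):
    # Always returns the same ranking; a real retriever would query an index.
    return ["chunk-0", "chunk-3", "chunk-5"]


score = recall_at_k(pairs, toy_retrieve, k=3)
print(f"recall@3 = {score:.2f}")
```

With the toy retriever, two of the three questions find their source chunk in the top 3, so the script reports a recall@3 of 0.67.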

To keep the explanation simple, this notebook skips the document ingestion and chunking pipeline and works with just one chunk to illustrate the core process.

**The core challenge we want to solve with synthetic data for evaluation is that we may not have a dataset that represents our users.** To that end, let's think through a simple 3-step pipeline in which we leverage various personas to guide the generation of our evaluation data.

![Pipeline](imgs/1.PNG)

Before diving into the specifics, let's look at the chunk and a few sample personas.

In [1]:
# A trailing backslash joins two lines without inserting a space,
# so every continued line ends with " \" to keep the words separated.
CHUNK = """The proposed acquisition of GreenTech Inc. by SolarPower Corporation stands as one of the most notable transactions \
in the renewable energy sector this year. Valued at $3 billion, the deal aims to combine GreenTech's cutting-edge battery \
technology with SolarPower's extensive solar panel manufacturing and distribution network. The anticipated operational \
synergies are expected to result in a 20% reduction in production costs and a 15% increase in revenue over the next two years. \
However, the transaction is under intense scrutiny from regulatory bodies due to potential antitrust concerns. \
The Federal Trade Commission (FTC) has indicated that the merger could potentially create a monopoly in the renewable energy \
storage market, potentially stifling competition and innovation.

SolarPower has committed to maintaining GreenTech's research and development (R&D) center, which employs over 500 scientists \
and engineers, as an independent entity to preserve its innovative culture. Additionally, all existing employment contracts will \
be honored, alleviating concerns about potential layoffs. The merger agreement includes a $150 million breakup fee, payable to \
GreenTech if SolarPower fails to secure the necessary regulatory approvals, thereby mitigating financial risks for GreenTech \
should the deal fall through.

The agreement includes detailed representations and warranties, specifying the accuracy of financial statements, \
the absence of undisclosed liabilities, and compliance with applicable laws. It also entails a thorough indemnification process \
to protect both parties against potential breaches of these representations and warranties. SolarPower and GreenTech have \
agreed to covenants that restrict GreenTech from incurring new debt, issuing additional shares, or significantly altering \
business operations without SolarPower's consent prior to the deal’s closure. These covenants are designed to preserve the \
value of GreenTech and ensure a smooth transition post-merger. The agreement further outlines a comprehensive due diligence \
process, including environmental assessments and audits of GreenTech’s intellectual property portfolio, to ensure all assets \
and liabilities are accurately accounted for before the finalization of the transaction.

The European Commission is also reviewing the merger to assess its impact on the EU market, particularly regarding competition \
and market dominance. This evaluation involves submitting detailed filings that include market analyses, competitive impact \
assessments, and economic justifications for the merger. The review process requires both companies to respond promptly to \
inquiries and provide comprehensive documentation. Additionally, to secure approval, SolarPower and GreenTech may need to make \
concessions, such as divesting certain business units or assets, to alleviate concerns about reduced competition. \
Ensuring compliance with the EU Merger Regulation involves not only addressing competitive effects but also ensuring that \
the merger aligns with broader EU policies on market fairness and consumer protection.
"""

FILE_NAME = "GreenTech_Acquistion.txt"

PERSONAS = [
    """
    Joan is a very senior financial analyst and focuses on using econometrics to recommend investment strategies. Joan is used to having a team of analysts who they can ask for information, so they may not be up to date with the speficis so they may ask vauge questions. However, they are very knowledgeable about the general topic.
    """,
    """
    Padma is a seasoned corporate litigator with over 10 years of experience in handling complex legal cases for large corporations. She has a no-nonsense approach and is known for her sharp analytical mind and attention to detail.
    """,
    """
    Aaron is an underconfident journalism major and thus doesn't probe the underlying material too deeply. He is still new to the english language so doesn't have that much profficiency. He also has a bad habit of sensationalizing things. 
    """ 
]


The chunk above is a passage about a merger. Suppose it has three types of audience; the persona descriptions above capture one of each.

**Note:** Most of the nuts and bolts of prompting the LLM have been bucketed into a few util classes. Feel free to check out `prompts.py`, `Generator.py` & `DeDup.py` if you are curious about the implementation details.

## Generate Questions

In [2]:
from Generator import *
import json
from DeDup import *
import concurrent.futures

generator = Generator()
dedup = Dedup()

![Pipeline: Step 1](imgs/2.PNG)
    
The first step is to generate questions. This step has 4 parts -

### Generating Points of Interest

Because we want to tailor the questions to what our users may ask, we first extract the points of interest that each selected persona would find in the passage.

In [3]:
points_of_interest = []

for persona in PERSONAS:
    points_of_interest.extend(generator.extract_points_of_interest(persona, FILE_NAME, CHUNK)['list_of_interest'])
    print("Persona Processed: " + persona)

Persona Processed: 
    Joan is a very senior financial analyst and focuses on using econometrics to recommend investment strategies. Joan is used to having a team of analysts who they can ask for information, so they may not be up to date with the speficis so they may ask vauge questions. However, they are very knowledgeable about the general topic.
    
Persona Processed: 
    Padma is a seasoned corporate litigator with over 10 years of experience in handling complex legal cases for large corporations. She has a no-nonsense approach and is known for her sharp analytical mind and attention to detail.
    
Persona Processed: 
    Aaron is an underconfident journalism major and thus doesn't probe the underlying material too deeply. He is still new to the english language so doesn't have that much profficiency. He also has a bad habit of sensationalizing things. 
    


**Prompt for the LLM**
```
You are given a Persona and a Passage. Your task is to immitate the persona and create a list interesting topics from the given passage.

<Persona>
{persona}
</Persona>

<Passage>
The following information is from a file with the title "{file_name}".

{passage}
</Passage>

Answer format - Generate a json with the following fields
- "list_of_interest": [<fill with 1-5 word desription>]

Use Reflective Thinking: Step back from the problem, take the time for introspection and self-reflection. Examine personal biases, assumptions, and mental models that may influence problem-solving, and being open to learning from past experiences to improve future approaches. Show your thinking before giving an answer.
```
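
The prompt above asks the model to show its thinking before answering, so the reply may contain prose around the JSON. How `Generator.py` handles this is not shown in this notebook. One possible approach, sketched below with a hypothetical `extract_json` helper and a made-up reply, is to scan for the first substring that parses as a JSON object.

```python
import json


def extract_json(response_text):
    """Return the first JSON object found in free-form text, or None."""
    decoder = json.JSONDecoder()
    start = response_text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it.
            obj, _ = decoder.raw_decode(response_text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON here, try the next opening brace
        start = response_text.find("{", start + 1)
    return None


# Made-up model reply: reasoning text first, then the requested JSON.
reply = (
    "Thinking: a litigator would focus on regulatory risk {mostly}.\n"
    '{"list_of_interest": ["Antitrust Regulatory Scrutiny", "Due Diligence Process"]}'
)
print(extract_json(reply)["list_of_interest"])
```

Returning `None` when nothing parses lets the caller decide whether to retry the request or skip the persona.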

In [4]:
print(points_of_interest)

['Renewable Energy Market Trends', 'Antitrust Regulatory Scrutiny', 'Merger Synergies Analysis', 'EU Market Competition', 'GreenTech R&D Preservation', 'Breakup Fee Mitigation', 'Due Diligence Process', 'Intellectual Property Assessment', 'Market Dominance Concerns', 'Concession Strategies', 'Antitrust Concerns in Mergers', 'Regulatory Scrutiny in Acquisitions', 'Operational Synergies in Mergers', 'Preserving Innovation Culture', 'Compliance with EU Regulations', 'Mitigating Financial Risks', 'Due Diligence in Mergers', 'Market Dominance Concerns', 'Concessions for Regulatory Approval', 'Biggest Renewable Energy Deal', "GreenTech's Cutting-Edge Batteries", "SolarPower's Huge Expansion", 'Monopoly Concerns Rise', 'EU Market Under Threat', "GreenTech's R&D Future", "SolarPower's $150M Breakup Fee", 'Regulatory Hurdles Ahead']


### DeDuplicate Points of Interest

Next, since personas may have overlapping interests, we need to deduplicate the points of interest. This is done by generating an embedding for each point of interest and then clustering the embedding vectors. Only one item is kept per cluster. You can set the distance threshold in `DeDup.py`.

In [5]:
deduped_points_of_interest = dedup.execute(points_of_interest)



In [6]:
print("Eliminated {} duplicates".format(len(points_of_interest) - len(deduped_points_of_interest)))

Eliminated 1 duplicates


In [7]:
print(deduped_points_of_interest)

['Renewable Energy Market Trends', 'Antitrust Regulatory Scrutiny', 'Merger Synergies Analysis', 'EU Market Competition', 'GreenTech R&D Preservation', 'Breakup Fee Mitigation', 'Due Diligence Process', 'Intellectual Property Assessment', 'Market Dominance Concerns', 'Concession Strategies', 'Antitrust Concerns in Mergers', 'Regulatory Scrutiny in Acquisitions', 'Operational Synergies in Mergers', 'Preserving Innovation Culture', 'Compliance with EU Regulations', 'Mitigating Financial Risks', 'Due Diligence in Mergers', 'Concessions for Regulatory Approval', 'Biggest Renewable Energy Deal', "GreenTech's Cutting-Edge Batteries", "SolarPower's Huge Expansion", 'Monopoly Concerns Rise', 'EU Market Under Threat', "GreenTech's R&D Future", "SolarPower's $150M Breakup Fee", 'Regulatory Hurdles Ahead']
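
How `DeDup.py` computes embeddings and clusters them is not shown in this notebook. To illustrate the general idea only, the sketch below swaps in a toy bag-of-words vector for the embedding model and a greedy similarity threshold for the clustering step. The names `embed`, `cosine`, and `dedup`, and the threshold of 0.9, are assumptions made for this example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def dedup(items, threshold=0.9):
    """Greedy threshold clustering: keep the first item seen in each cluster."""
    kept, kept_vecs = [], []
    for item in items:
        vec = embed(item)
        # An item starts a new cluster only if it is dissimilar to every kept one.
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(item)
            kept_vecs.append(vec)
    return kept


sample = [
    "Market Dominance Concerns",
    "Due Diligence Process",
    "Market Dominance Concerns",   # exact repeat
    "Due Diligence in Mergers",    # related, but below the threshold
]
print(dedup(sample))
```

Run on the sample, this drops the repeated "Market Dominance Concerns" but keeps "Due Diligence in Mergers", whose similarity to "Due Diligence Process" is about 0.58. Lowering the threshold merges more aggressively; raising it keeps more near-duplicates.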


### Mapping Type of Question to Point of Interest

The user's area of interest is not the only thing driving diversity in the questions; the type of question asked matters too. Depending on the topic and the underlying chunk, we need to map which types of question can be asked about each point of interest.

In [8]:
TYPES_OF_QUESTION = {
    "extractive": "Extractive, ie, the question can be answered from objective information present in the context.",
    "abstractive": "Abstractive, ie, further reasoning needs to be done using the information in the passage to answer the question",
    "diagnostic": "Diagnostic, ie, the question is about constructing a diagnosis that can be inferred from the context",
    "aggregative": "Aggregative, ie, some form of collectivization like making a group, or counting the number of items needs to be done using the information in context to answer the question.",
    "sentiment-driven": "Sentiment-Driven, ie, underlying sentiment about an event or a piece of information can be extracted."
}

In [9]:
def process_interest(interest):
    mapping = generator.extract_compatible_question_type(interest, list(TYPES_OF_QUESTION.values()), FILE_NAME, CHUNK)['list_of_extractable_types_of_questions']
    print("Interest Processed: "+ interest)
    return {"interest": interest, "q_type": [m.lower() for m in mapping]}

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    question_mapping = list(executor.map(process_interest, deduped_points_of_interest))

Interest Processed: GreenTech R&D Preservation
Interest Processed: Intellectual Property Assessment
Interest Processed: Breakup Fee Mitigation
Interest Processed: Renewable Energy Market Trends
Interest Processed: EU Market Competition
Interest Processed: Merger Synergies Analysis
Interest Processed: Market Dominance Concerns
Interest Processed: Antitrust Regulatory Scrutiny
Interest Processed: Concession Strategies
Interest Processed: Due Diligence Process
Interest Processed: Regulatory Scrutiny in Acquisitions
Interest Processed: Operational Synergies in Mergers
Interest Processed: Antitrust Concerns in Mergers
Interest Processed: Compliance with EU Regulations
Interest Processed: Biggest Renewable Energy Deal
Interest Processed: GreenTech's Cutting-Edge Batteries
Interest Processed: Mitigating Financial Risks
Interest Processed: Preserving Innovation Culture
Interest Processed: SolarPower's Huge Expansion
Interest Processed: Due Diligence in Mergers
Interest Processed: Monopoly Conc
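
Note that the log lines above appear in completion order, which differs from the order of `deduped_points_of_interest`. `executor.map` still returns results in input order, so `question_mapping` lines up with the input list. The small self-contained sketch below shows this behavior, with `time.sleep` standing in for LLM calls of varying latency; the task names and delays are made up.

```python
import concurrent.futures
import time

completion_order = []


def slow_task(item):
    name, delay = item
    time.sleep(delay)               # stand-in for a variable-latency LLM call
    completion_order.append(name)   # records when each task actually finished
    return name


items = [("first", 0.3), ("second", 0.1), ("third", 0.2)]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # map() yields results in the order of `items`, not in completion order.
    results = list(executor.map(slow_task, items))

print("completed:", completion_order)
print("returned: ", results)
```

The `returned` list always matches the order of `items`, while the `completed` list depends on how long each task took.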

**Prompt for the LLM**
```
You are a teacher are trying to identify which is the most types of question that will test your student's capabilities that can be asked about "{interest}" from the Passage below.
Note that the type of question should be grounded in with the information in the passage, and should have to rely on being pedantic, or general knowledge.

<Types of Questions>
{types}
</Types of Questions>

<Passage>
The following information is from a file with the title "{file_name}".

{passage}
</Passage>

Answer format - Generate a json with the following fields
- "reasoning" : show your reasoning
- "list_of_extractable_types_of_questions": [<extractive or abstractive or diagnostic or sentiment or aggregative>]
```
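
The prompt above offers `sentiment` as a label, while `TYPES_OF_QUESTION` is keyed by `sentiment-driven`. The outputs shown in this notebook use the full key, but a reply that follows the prompt literally would raise a `KeyError` when the label is later looked up in `TYPES_OF_QUESTION`. A defensive normalization step could look like the sketch below; `normalize_q_type` and `normalize_mapping` are hypothetical helpers, not part of `Generator.py`.

```python
# Keys mirror TYPES_OF_QUESTION in the notebook.
VALID_TYPES = ["extractive", "abstractive", "diagnostic", "aggregative", "sentiment-driven"]


def normalize_q_type(raw):
    """Map a model-returned label onto a known key, or return None."""
    label = raw.strip().lower()
    for key in VALID_TYPES:
        # Accept the exact key or its leading word, e.g. "sentiment" -> "sentiment-driven".
        if label == key or label == key.split("-")[0]:
            return key
    return None


def normalize_mapping(raw_types):
    """Normalize labels, drop unknown ones, and remove repeats while keeping order."""
    seen = []
    for raw in raw_types:
        key = normalize_q_type(raw)
        if key is not None and key not in seen:
            seen.append(key)
    return seen


print(normalize_mapping(["Extractive", "sentiment", "Sentiment-Driven", "speculative"]))
```

Unknown labels such as `speculative` are dropped silently here; logging them instead would make prompt drift easier to spot.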

In [10]:
question_mapping

[{'interest': 'Renewable Energy Market Trends',
  'q_type': ['extractive', 'abstractive', 'diagnostic', 'aggregative']},
 {'interest': 'Antitrust Regulatory Scrutiny',
  'q_type': ['extractive',
   'abstractive',
   'diagnostic',
   'aggregative',
   'sentiment-driven']},
 {'interest': 'Merger Synergies Analysis',
  'q_type': ['extractive', 'abstractive', 'diagnostic', 'aggregative']},
 {'interest': 'EU Market Competition',
  'q_type': ['abstractive',
   'extractive',
   'diagnostic',
   'aggregative',
   'sentiment-driven']},
 {'interest': 'GreenTech R&D Preservation',
  'q_type': ['extractive', 'abstractive', 'aggregative', 'sentiment-driven']},
 {'interest': 'Breakup Fee Mitigation',
  'q_type': ['abstractive',
   'extractive',
   'diagnostic',
   'aggregative',
   'sentiment-driven']},
 {'interest': 'Due Diligence Process',
  'q_type': ['extractive', 'abstractive', 'diagnostic', 'aggregative']},
 {'interest': 'Intellectual Property Assessment',
  'q_type': ['extractive', 'abstracti

### Generating Questions

Finally, now that we have the mapping of type of question, point of interest and the underlying information, let's generate the questions.

In [11]:
def process_question_type(item):
    interest_questions = []
    for q_type in item['q_type']:
        questions = generator.generate_questions(FILE_NAME, CHUNK, item['interest'], TYPES_OF_QUESTION[q_type.lower()])
        interest_questions.extend(questions)
    #print(interest_questions)
    print("Processed Interest: " + item['interest'])
    return interest_questions

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    raw_questions = [question for result in executor.map(process_question_type, question_mapping) for question in result]
    questions = [item for item in raw_questions if item != []]

Processed Interest: Concession Strategies
Processed Interest: Intellectual Property Assessment
Processed Interest: Due Diligence Process
Processed Interest: Merger Synergies Analysis
Processed Interest: GreenTech R&D Preservation
Processed Interest: Renewable Energy Market Trends
Processed Interest: Antitrust Concerns in Mergers
Processed Interest: Market Dominance Concerns
Processed Interest: Antitrust Regulatory Scrutiny
Processed Interest: Breakup Fee Mitigation
Processed Interest: EU Market Competition
Processed Interest: Regulatory Scrutiny in Acquisitions
Processed Interest: Operational Synergies in Mergers
Processed Interest: Preserving Innovation Culture
Processed Interest: Concessions for Regulatory Approval
Processed Interest: Biggest Renewable Energy Deal
Processed Interest: Compliance with EU Regulations
Processed Interest: Due Diligence in Mergers
Processed Interest: SolarPower's Huge Expansion
Processed Interest: GreenTech's Cutting-Edge Batteries
Processed Interest: Mono

**Prompt for the LLM**
```
You are interviewing an expert. Generate 3 meaningful questions about {interest} from the given Passage. The questions should be {types}

These are questions for a viva, not an written examination.

<Passage>
The following information is from a file with the title "{file_name}".

{passage}
</Passage>

Answer format - Generate a json with the following fields
- "generated_questions": [questions]
```

In [12]:
questions

['What is the expected reduction in production costs and increase in revenue over the next two years as a result of the proposed acquisition of GreenTech Inc. by SolarPower Corporation, and how will this be achieved through operational synergies?',
 'What are the concerns raised by the Federal Trade Commission (FTC) regarding the potential impact of the merger on the renewable energy storage market, and what measures have SolarPower and GreenTech committed to in order to address these concerns?',
 "What is the purpose of the covenants agreed to by SolarPower and GreenTech, and how do these covenants restrict GreenTech's actions prior to the deal's closure in order to preserve the value of GreenTech and ensure a smooth transition post-merger?",
 "Considering the potential antitrust concerns raised by the FTC, how might the merger between SolarPower and GreenTech impact the overall competitiveness of the renewable energy storage market, and what concessions could the companies make to al

In [13]:
len(questions)

292

## Filter Questions

![Pipeline: Step 2](imgs/3.PNG)

Now that we have all the candidate questions, the next step is to filter them.

### Semantic Dedup

First, we deduplicate the questions.

In [14]:
deduped_raw_questions = dedup.execute(questions)



In [15]:
print("Eliminated {} duplicates".format(len(raw_questions) - len(deduped_raw_questions)))

Eliminated 84 duplicates


In [16]:
deduped_raw_questions

['What is the value of the proposed acquisition of GreenTech Inc. by SolarPower Corporation, and what are the expected operational synergies from the deal in terms of cost reduction and revenue increase over the next two years?',
 "What specific concerns have been raised by the Federal Trade Commission (FTC) regarding the potential impact of the merger on the renewable energy storage market, and how might these concerns affect the deal's approval?",
 "What are some of the key covenants agreed upon by SolarPower and GreenTech to preserve the value of GreenTech and ensure a smooth transition post-merger, and how do these covenants restrict GreenTech's actions prior to the deal's closure?",
 'Considering the potential antitrust concerns raised by the FTC, what specific concessions could SolarPower and GreenTech offer to alleviate concerns about reduced competition in the renewable energy storage market, and how might these concessions impact the overall value of the merger?',
 "Given the 

### Relevance Filter

Second, we need to ensure that the generated questions are at least partially answerable from the chunk. We set up an LLM as a judge to analyze every question and filter them against our criteria.

In [17]:
relevance_filter = Relevance_Filter()

In [18]:
filter_flags = []

def process_question(question):
    flag = relevance_filter.execute(question, FILE_NAME, CHUNK)
    print("Processed Question: "+ question)
    return {"question": question, "flag": flag}

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(process_question, deduped_raw_questions))

filter_flags.extend(results)

Processed Question: Given the European Commission's review of the merger and the requirement for compliance with the EU Merger Regulation, how do you think the merger will affect the EU market, particularly with regards to competition and market dominance, and what steps might SolarPower and GreenTech need to take to ensure approval and compliance with EU policies on market fairness and consumer protection?
Processed Question: Considering the potential antitrust concerns raised by the FTC, what specific concessions could SolarPower and GreenTech offer to alleviate concerns about reduced competition in the renewable energy storage market, and how might these concessions impact the overall value of the merger?
Processed Question: What is the value of the proposed acquisition of GreenTech Inc. by SolarPower Corporation, and what are the expected operational synergies from the deal in terms of cost reduction and revenue increase over the next two years?
Processed Question: What are at leas

Processed Question: Considering the potential antitrust concerns raised by the FTC, how might the merger's operational synergies, such as the 20% reduction in production costs and 15% increase in revenue, be impacted if SolarPower is required to divest certain business units or assets to alleviate competition concerns, and what implications might this have for the overall value of the deal?
Processed Question: Considering the potential antitrust concerns and the possibility of creating a monopoly in the renewable energy storage market, do you believe that the benefits of the acquisition, such as the 20% reduction in production costs and the 15% increase in revenue, outweigh the risks of stifling competition and innovation?
Processed Question: Given the commitment to maintain GreenTech's R&D center as an independent entity, how can SolarPower ensure that the operational synergies resulting from the merger are effectively integrated and aligned with the innovative culture of GreenTech, a

Processed Question: In light of the European Commission's review of the merger, what concessions or divestitures do you think SolarPower and GreenTech may need to make to alleviate concerns about reduced competition, and how can they ensure that these concessions align with the broader EU policies on market fairness and consumer protection while still achieving the operational synergies and revenue growth anticipated from the merger?
Processed Question: How do the covenants restricting GreenTech from incurring new debt, issuing additional shares, or significantly altering business operations without SolarPower's consent, collectively preserve the value of GreenTech and ensure a smooth transition post-merger? Can you count the number of ways in which these covenants protect the interests of both parties and facilitate the integration of the two companies?
Processed Question: Don't you think that the proposed acquisition of GreenTech Inc. by SolarPower Corporation is likely to create a m

Processed Question: In light of the agreement's representations and warranties, as well as the indemnification process, how do you think the due diligence process can be designed to identify and assess potential risks and liabilities associated with GreenTech's intellectual property portfolio, and what steps can be taken to mitigate these risks and ensure that SolarPower is adequately protected in the event of a breach or unforeseen circumstance?
Processed Question: Given the complexities of the merger, including the need for regulatory approvals from both the FTC and the European Commission, what strategies would you employ to manage the due diligence process and ensure that all necessary steps are taken to mitigate risks and ensure a smooth transaction?
Processed Question: Considering the potential antitrust concerns raised by the FTC, what specific measures would you recommend as part of the due diligence process to assess the competitive impact of the merger and mitigate the risk o

Processed Question: Given the European Commission's review of the merger, what specific concessions or divestitures might SolarPower and GreenTech need to make to secure approval, and how might these concessions affect the companies' operations and market position in the EU?
Processed Question: What are the various measures or concessions that SolarPower and GreenTech may need to take or make to secure regulatory approvals, such as divesting business units or assets, and how will these measures impact the operational synergies and overall success of the merger?
Processed Question: What specific measures have SolarPower and GreenTech agreed to in order to mitigate financial risks and ensure a smooth transition post-merger, as outlined in the merger agreement?
Processed Question: How do you assess the effectiveness of the covenants and representations and warranties outlined in the merger agreement in mitigating the risks associated with the deal, particularly with regards to preserving 

Processed Question: What is the purpose of the comprehensive due diligence process outlined in the agreement, and what aspects of GreenTech’s assets and liabilities are being assessed as part of this process?
Processed Question: Considering the potential antitrust concerns and the regulatory scrutiny from both the FTC and the European Commission, what additional measures could SolarPower and GreenTech take to mitigate the financial risks associated with the merger not being approved, beyond the $150 million breakup fee and the indemnification process outlined in the agreement?
Processed Question: In light of the comprehensive due diligence process, including environmental assessments and audits of GreenTech’s intellectual property portfolio, what specific financial risks would SolarPower face if the due diligence process reveals previously undisclosed liabilities or inaccuracies in GreenTech's financial statements, and how might these risks be mitigated through the representations, war

Processed Question: Can you enumerate the potential concessions that SolarPower and GreenTech may need to make to secure regulatory approval from the European Commission, such as divesting business units or assets, and how might these concessions impact the overall value of the merger?
Processed Question: What role might the maintenance of GreenTech's R&D center as an independent entity play in addressing regulatory concerns about the merger's impact on innovation and competition, and how might this concession influence the European Commission's assessment of the merger's alignment with broader EU policies on market fairness and consumer protection?
Processed Question: What measures has SolarPower committed to in order to address concerns about potential layoffs and preserve GreenTech's innovative culture, including the maintenance of GreenTech's R&D center and the honoring of existing employment contracts?
Processed Question: What are the total number of representations, warranties, c

Processed Question: What concessions might SolarPower and GreenTech need to make to alleviate concerns about reduced competition and secure approval from regulatory bodies such as the European Commission, and what is the purpose of the $150 million breakup fee included in the merger agreement?
Processed Question: How do you think the preservation of GreenTech's R&D center as an independent entity, as well as the honoring of existing employment contracts, will impact the innovation culture and competitiveness of the combined entity, and what role might these factors play in addressing regulatory concerns about the potential stifling of competition and innovation in the renewable energy sector?
Processed Question: How do the covenants and representations outlined in the merger agreement address potential risks and ensure a smooth transition post-merger, and what implications might these have for the future operations of GreenTech and SolarPower?
Processed Question: What specific measures

Processed Question: What specific measures have SolarPower and GreenTech agreed to in the merger agreement to mitigate financial risks and ensure a smooth transition post-merger, such as covenants and indemnification processes, and how do these measures address potential regulatory hurdles ahead?
Processed Question: Considering the potential antitrust concerns raised by the FTC, what specific concessions or divestitures could SolarPower and GreenTech offer to alleviate concerns about reduced competition and secure regulatory approval, while still achieving the desired operational synergies and revenue growth?
Processed Question: Given the European Commission's review process, how might the merger's impact on the EU market be assessed differently than its impact on the US market, and what implications might this have for the companies' global operations and competitiveness?
Processed Question: In the event that the merger is approved, what mechanisms or safeguards could be put in place 

**Prompt for the LLM**
```
You are a juror, and are tasked with giving a judgement if there is enough evidence in the passage to answer a given question.
- Make no assumptions or use your exisiting knowledge.
- The evidence should be in the passage. The existance of pointer to the evidence doesn't qualify as sufficently useful.

Question: {question}

<Passage>
The following information is from a file with the title "{file_name}".

{passage}
</Passage>

<Judgements-Options>
- "Beyond a reasonable doubt" - There is enough evidence in the passage or the information in the passage can be used to completely answer the question beyond a reasonable doubt.
- "Somewhat relevant" - Only part of evidence required to completely answer, or to reason through get the answer is available in the passage. 
- "Not useful" - The passage doesn't contain enough information to answer the question.
</Judgement-Options>

Generate your answer in a json format with the fields below
- "Reasoning": 1-10 words of reasoning
- "Your_Decision": "fill with judgement option"
```

In [19]:
relevant_questions = []

for item in filter_flags:
    flag = item["flag"]
    if flag['Your_Decision'].lower() != "not useful":
        relevant_questions.append(item['question'])

In [20]:
print("Filtered {} questions".format(len(filter_flags) - len(relevant_questions)))

Filtered 72 questions
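
The cell above keeps every question whose decision is not "not useful", which includes both "Beyond a reasonable doubt" and "Somewhat relevant". To see how the judgements are distributed, or to apply a stricter cut, something like the following could be used. The sample data is made up, and `decision` and `select` are hypothetical helpers; only the `question`/`flag`/`Your_Decision` shape comes from the notebook.

```python
from collections import Counter

# Made-up judge outputs in the shape produced by the relevance-filter cell.
sample_flags = [
    {"question": "Q1", "flag": {"Reasoning": "stated directly", "Your_Decision": "Beyond a reasonable doubt"}},
    {"question": "Q2", "flag": {"Reasoning": "partly covered", "Your_Decision": "Somewhat relevant"}},
    {"question": "Q3", "flag": {"Reasoning": "not in passage", "Your_Decision": "Not useful"}},
]


def decision(item):
    """Lower-cased judgement, or an empty string if the field is missing."""
    return item["flag"].get("Your_Decision", "").strip().lower()


def select(flags, accepted):
    """Keep questions whose judgement is in the accepted set (an allow-list)."""
    return [item["question"] for item in flags if decision(item) in accepted]


counts = Counter(decision(item) for item in sample_flags)
lenient = select(sample_flags, {"beyond a reasonable doubt", "somewhat relevant"})
strict = select(sample_flags, {"beyond a reasonable doubt"})

print(counts)
print("lenient:", lenient)
print("strict: ", strict)
```

Unlike the notebook's check, the allow-list also drops replies with a missing or unexpected decision label, which is a deliberate design choice in this sketch.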


In [21]:
print(relevant_questions)

['What is the value of the proposed acquisition of GreenTech Inc. by SolarPower Corporation, and what are the expected operational synergies from the deal in terms of cost reduction and revenue increase over the next two years?', "What specific concerns have been raised by the Federal Trade Commission (FTC) regarding the potential impact of the merger on the renewable energy storage market, and how might these concerns affect the deal's approval?", "What are some of the key covenants agreed upon by SolarPower and GreenTech to preserve the value of GreenTech and ensure a smooth transition post-merger, and how do these covenants restrict GreenTech's actions prior to the deal's closure?", "Can you name at least three regulatory bodies that are involved in scrutinizing the acquisition of GreenTech Inc. by SolarPower Corporation, and what are their primary concerns regarding this transaction? Additionally, how many scientists and engineers are employed at GreenTech's R&D center that SolarPo

### Conversational Re-Write

Third, we make sure that the questions aren't overly specific and don't sound mechanical by rewriting them in a more natural, conversational tone.

In [22]:
re_written_questions = []

def process_question(question):
    return generator.conversational_re_write(question, FILE_NAME, CHUNK)['re_written_question']

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    re_written_questions = list(executor.map(process_question, relevant_questions))

**Prompt for the LLM**
```
Your task is to make minor edits to Old_Question if needed to make it sound Conversational.
- Remove phrases like "based on the given passage/information..." by making it a does or what or how or why question.
- Questions shouldn't have all the identifiers for extracting information, ie, humans are imprecise, assume context is already there.

Old_Question: {question}

<Passage>
The following information is from a file with the title "{file_name}".

{passage}
</Passage>

Answer format - Generate a json with the following fields
- "re_written_question": <fill>
```
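
To make the mechanics of this step concrete, here is a minimal sketch of how such a prompt could be filled and how the JSON reply could be read back. It is not the implementation behind `generator.conversational_re_write`, which is not shown in this notebook. `REWRITE_TEMPLATE` is an abbreviated copy of the prompt above, and `build_prompt`, `parse_rewrite`, the file name, and the sample reply are all hypothetical.

```python
import json

# Abbreviated copy of the prompt above; only the placeholders matter here.
REWRITE_TEMPLATE = """Old_Question: {question}

<Passage>
The following information is from a file with the title "{file_name}".

{passage}
</Passage>

Answer format - Generate a json with the following fields
- "re_written_question": <fill>"""


def build_prompt(question: str, file_name: str, passage: str) -> str:
    """Fill the three placeholders used by the prompt template."""
    return REWRITE_TEMPLATE.format(
        question=question, file_name=file_name, passage=passage
    )


def parse_rewrite(raw_response: str, original_question: str) -> str:
    """Return the rewritten question, falling back to the original on bad JSON."""
    try:
        return json.loads(raw_response)["re_written_question"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return original_question


prompt = build_prompt("What is the deal value?", "merger_notes.txt", "Valued at $3 billion ...")
fake_response = '{"re_written_question": "How much is the deal worth?"}'  # made-up LLM output
print(parse_rewrite(fake_response, "What is the deal value?"))
```

Falling back to the original question keeps the list the same length as `relevant_questions`, which matters if later steps pair questions by position.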

In [23]:
re_written_questions

["What's the deal with SolarPower buying GreenTech? How much is it worth and what kind of cost savings and revenue boost are they expecting over the next two years?",
 "What concerns has the FTC raised about the merger's impact on the renewable energy storage market, and how might this affect the deal's approval?",
 "What covenants did SolarPower and GreenTech agree to, to preserve GreenTech's value and ensure a smooth transition after the merger, and how do these covenants limit GreenTech's actions before the deal is closed?",
 "What regulatory bodies are scrutinizing the acquisition of GreenTech Inc. by SolarPower Corporation, and what are their concerns? How many scientists and engineers work at GreenTech's R&D center that SolarPower is committed to maintaining? What's the breakup fee if SolarPower fails to secure regulatory approvals?",
 'What types of assessments and audits are part of the due diligence process for the SolarPower and GreenTech merger, and how do they ensure all as

### Intelligence Filter

Lastly, we filter out questions that are pedantic or too basic.

In [24]:
intelligent_question_filter = Intelligent_Question_Filter()

In [25]:
questions = []

def filter_question(question):
    filter_flag = intelligent_question_filter.execute(question, FILE_NAME, CHUNK)
    if filter_flag['Type_of_question'] == "Type_A":
        print("Processed Question: "+ question)
        return question
    return None

with concurrent.futures.ThreadPoolExecutor() as executor:
    filtered_questions = list(executor.map(filter_question, re_written_questions))

questions = [q for q in filtered_questions if q is not None]

Processed Question: What's the European Commission's role in reviewing the SolarPower and GreenTech merger, and what concessions might be needed to address competition concerns and comply with EU regulations?
Processed Question: What concessions might SolarPower and GreenTech need to make to address EU concerns about reduced competition, and how could these concessions affect GreenTech's R&D priorities going forward?
Processed Question: What's worrying the FTC about SolarPower buying GreenTech, and how could this affect the renewable energy storage market?
Processed Question: What regulatory bodies are scrutinizing the acquisition of GreenTech Inc. by SolarPower Corporation, and what are their concerns? How many scientists and engineers work at GreenTech's R&D center that SolarPower is committed to maintaining? What's the breakup fee if SolarPower fails to secure regulatory approvals?
Processed Question: What's the deal with SolarPower buying GreenTech? How much is it worth and what ki

Processed Question: How many regulatory bodies are reviewing the GreenTech acquisition, and what specific assessments are they conducting as part of the due diligence process?
Processed Question: What concessions could SolarPower and GreenTech offer to alleviate antitrust concerns, and how might these concessions impact the merger's value?
Processed Question: What concessions might SolarPower and GreenTech need to make to get regulatory approvals, and how will these impact the merger's success?
Processed Question: How are SolarPower and GreenTech mitigating financial risks and ensuring a smooth transition after the merger?
Processed Question: What concessions might SolarPower and GreenTech need to make to alleviate concerns about reduced competition and secure regulatory approval? Can you give some examples and how they'd impact the companies' operations and market position?
Processed Question: What are the potential risks associated with SolarPower Corporation's acquisition of GreenTe

Processed Question: How does SolarPower's commitment to keeping GreenTech's R&D center independent affect the financial risks for GreenTech if the acquisition fails, and what does this mean for regulatory approval?
Processed Question: What regulatory bodies are reviewing the SolarPower and GreenTech acquisition, and what are their main concerns? How might the companies address these concerns, and what concessions might they need to make? What are the potential implications for the companies involved?
Processed Question: How does the proposed acquisition of GreenTech by SolarPower minimize potential financial losses for both parties? What role do the $150 million breakup fee, covenants, and representations and warranties play in mitigating financial risks? Are there any other structural elements that are effective in minimizing financial risks? What are the potential implications for the companies involved, and are there other ways to structure the acquisition to further minimize financ

Processed Question: What risks does the $150 million breakup fee mitigate for GreenTech if the merger falls through, and how might these risks impact the company's financial stability?
Processed Question: What concessions might SolarPower and GreenTech need to make to get their merger approved, and how could these concessions affect the benefits of combining their technologies and operations?
Processed Question: How many regulatory bodies are reviewing the SolarPower and GreenTech merger, and what's the deal with the $150 million breakup fee? How do the representations, warranties, and covenants in the merger agreement help with regulatory compliance?
Processed Question: How is SolarPower addressing concerns about layoffs and preserving GreenTech's innovative culture after the acquisition?
Processed Question: What concessions might SolarPower and GreenTech need to make to alleviate concerns about reduced competition, and what's the purpose of the $150 million breakup fee in their merge

**Prompt for the LLM**
```
You are an irritated teacher. Classify a student's question into one of the following types.
- Type_A: A question with which student extracts valuable insights, data points, or information.
- Type_B: A pedantic or a general knowledge question.
- Type_C: It would be hard to identify the subject of the conversation without the information in the passage. These types of questions are missing proper nouns.

Question: {question}

<Passage>
The following information is from a file with the title "{file_name}".

{passage}
</Passage>

Answer Format - Generate a json with the following fields
- "Type_of_question": <Fill with Type_A or Type_B or Type_C>
```

In [26]:
# Number of questions dropped = questions going in minus questions kept
print("Filtered {} questions".format(len(re_written_questions) - len(questions)))

Filtered 0 questions
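
A count of zero does not show how the judge distributed its labels. The sketch below tallies them, assuming you keep each raw filter response (the `filter_question` function above discards them after the check). `tally_types` and the sample responses are hypothetical.

```python
from collections import Counter


def tally_types(filter_flags: list) -> Counter:
    """Count how many questions the judge assigned to each type."""
    return Counter(flag.get("Type_of_question", "missing") for flag in filter_flags)


# Made-up judge responses in the shape requested by the prompt above.
sample = [
    {"Type_of_question": "Type_A"},
    {"Type_of_question": "Type_A"},
    {"Type_of_question": "Type_C"},
    {},
]
counts = tally_types(sample)
dropped = sum(n for label, n in counts.items() if label != "Type_A")
print(counts)
print("Dropped:", dropped)
```

A tally like this makes it easier to tell a filter that genuinely passes everything from one whose responses are silently malformed.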


In [27]:
questions

["What's the deal with SolarPower buying GreenTech? How much is it worth and what kind of cost savings and revenue boost are they expecting over the next two years?",
 "What concerns has the FTC raised about the merger's impact on the renewable energy storage market, and how might this affect the deal's approval?",
 "What covenants did SolarPower and GreenTech agree to, to preserve GreenTech's value and ensure a smooth transition after the merger, and how do these covenants limit GreenTech's actions before the deal is closed?",
 "What regulatory bodies are scrutinizing the acquisition of GreenTech Inc. by SolarPower Corporation, and what are their concerns? How many scientists and engineers work at GreenTech's R&D center that SolarPower is committed to maintaining? What's the breakup fee if SolarPower fails to secure regulatory approvals?",
 'What types of assessments and audits are part of the due diligence process for the SolarPower and GreenTech merger, and how do they ensure all as

## Persona Re-Write

![Pipeline: Step 3](imgs/4.PNG)

The third and final step is re-writing the questions in the voice of your personas.

### Writing Style

First, let's extract the writing style from the persona descriptions.

In [28]:
writing_styles = []

for persona in PERSONAS:
    style = json.loads(generator.writing_style(persona).strip())
    writing_styles.append({"persona": persona, "style": style['writing_style']})

**Prompt for the LLM**
```
Use the persona description below to articulate the Writing Style of the persona.

<Persona>
{persona}
</Persona>

Think step by step. Show your thinking.
Answer Format - Generate a json with the following fields
- "writing_style": <the writing style described in great detail in a paragraph>
```
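
Because this prompt asks the model to show its thinking, the reply may contain free text around the JSON object, in which case a plain `json.loads` on the whole string would fail. The sketch below is one way to pull out the first JSON object, assuming the reply contains one. `extract_json_object` and the sample reply are hypothetical and not part of the notebook's `generator`.

```python
import json


def extract_json_object(text: str) -> dict:
    """Return the first JSON object embedded in `text`.

    Tries to decode from each "{" in turn, so reasoning text before or
    after the object is ignored.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # This brace did not open a valid object; try the next one.
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


# Made-up reply: reasoning first, then the requested JSON.
raw_reply = (
    "Step 1: Joan is senior, so the tone is formal.\n"
    '{"writing_style": "Formal and technical."}'
)
print(extract_json_object(raw_reply)["writing_style"])
```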

In [29]:
writing_styles

[{'persona': '\n    Joan is a very senior financial analyst and focuses on using econometrics to recommend investment strategies. Joan is used to having a team of analysts who they can ask for information, so they may not be up to date with the speficis so they may ask vauge questions. However, they are very knowledgeable about the general topic.\n    ',
  'style': "Joan's writing style is characterized by a tone of authority and expertise, reflecting their senior position as a financial analyst. Their language is formal and technical, often incorporating specialized terms from econometrics and finance. Given their reliance on a team for specific details, their writing may occasionally lack precision, with vague references or requests for further information. However, this is balanced by a broad and deep understanding of the general topic, allowing them to contextualize their analysis within the wider field. Their writing is likely to be dense with concepts and ideas, though the presen

### Re-Writing

Now we can use the writing style and the filtered questions to generate variants.

In [30]:
re_written_questions = []
        
def process_question(question):
    question_variants = []
    for style in writing_styles:
        re_write = generator.persona_rewrite(style['style'], question)
        question_variants.append({"new_question": re_write, "style": style, "original_question": question})
    print("Processed Question: "+ question)
    return question_variants

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    re_written_questions = [variant for result in executor.map(process_question, questions) for variant in result]

Processed Question: What concerns has the FTC raised about the merger's impact on the renewable energy storage market, and how might this affect the deal's approval?
Processed Question: How is SolarPower addressing concerns about the impact of the merger on GreenTech's R&D center and employees?
Processed Question: What's the deal with SolarPower buying GreenTech? How much is it worth and what kind of cost savings and revenue boost are they expecting over the next two years?
Processed Question: What concessions might SolarPower and GreenTech need to make to address EU concerns about reduced competition, and how could these concessions affect GreenTech's R&D priorities going forward?
Processed Question: What's worrying the FTC about SolarPower buying GreenTech, and how could this affect the renewable energy storage market?
Processed Question: How will the agreement's provisions, such as keeping GreenTech's R&D center independent and restricting its business operations, impact the compani

Processed Question: What are the potential risks and consequences if SolarPower fails to secure regulatory approvals, and how does the $150 million breakup fee help GreenTech in this scenario?
Processed Question: Won't the acquisition of GreenTech Inc. by SolarPower Corporation create a monopoly in the renewable energy storage market and stifle competition? How do you address concerns about market dominance and reduced competition in the industry?
Processed Question: How do the covenants restricting GreenTech's business operations before the deal closes help preserve its value and ensure a smooth transition, and what's the plan for enforcing them?
Processed Question: How might the restrictions on GreenTech's ability to incur new debt, issue shares, or change business operations impact its response to market changes or unexpected financial needs before the deal closes, and what financial risks could arise from these limitations?
Processed Question: What's the breakup fee SolarPower agre

Processed Question: What concessions will SolarPower and GreenTech likely need to make to address EU concerns about competition, and how will that impact their operations and revenue?
Processed Question: Don't you think the GreenTech acquisition will face major regulatory hurdles in the EU? How can the companies address these concerns to comply with EU regulations?
Processed Question: What safeguards are in place to protect GreenTech financially if the merger with SolarPower falls through due to regulatory issues?
Processed Question: How might the European Commission view SolarPower's commitment to maintaining GreenTech's R&D center as an independent entity? Could this be a positive factor in the merger review process, and why?
Processed Question: Who are the key stakeholders involved in the EU review process for the merger between SolarPower and GreenTech, and what are their main interests and concerns? How do these interests intersect and impact compliance with EU regulations, and wh

Processed Question: Does SolarPower's acquisition of GreenTech Inc. risk stifling competition and innovation in the renewable energy storage market, despite promised benefits to both companies?
Processed Question: How will regulatory scrutiny from the FTC and European Commission impact the merger's success, given the potential antitrust concerns and need for concessions to alleviate reduced competition?
Processed Question: What concessions might SolarPower and GreenTech need to make to alleviate concerns about reduced competition, and what's the purpose of the $150 million breakup fee in their merger agreement?
Processed Question: What are the potential risks of the GreenTech merger to the EU market, and how will the European Commission address them in its review process?
Processed Question: How is SolarPower planning to preserve GreenTech's innovative culture and protect its employees, and what financial safeguards are in place if the merger fails to get regulatory approval?
Processed

**Prompt for the LLM**
```
Your task is to re-write the question in the style of the persona below.
Use the Writing Style from the persona. It is okay to make non-sensical questions if the persona requires it.

<Style>
{persona}
</Style>

<Constraints>
- The reformatted question shouldn't leak any information about the persona.
- The question should have enough identifiers to be understood in a vacuum. Don't replace too many proper nouns with pronouns.
</Constraints>

Old Question: {question}

Answer format should be a json with the following fields:
- "new_question": contains the new question. 
```

In [31]:
re_written_questions

[{'new_question': '{\n"new_question": "Can you provide an assessment of the strategic rationale underlying the acquisition of GreenTech by SolarPower, including an evaluation of the expected synergies and projected impact on the combined entity\'s financial performance over the next 24 months, specifically with regards to cost savings and revenue growth?" \n}',
  'style': {'persona': '\n    Joan is a very senior financial analyst and focuses on using econometrics to recommend investment strategies. Joan is used to having a team of analysts who they can ask for information, so they may not be up to date with the speficis so they may ask vauge questions. However, they are very knowledgeable about the general topic.\n    ',
   'style': "Joan's writing style is characterized by a tone of authority and expertise, reflecting their senior position as a financial analyst. Their language is formal and technical, often incorporating specialized terms from econometrics and finance. Given their re
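
In the output above, `new_question` holds the model's raw JSON string rather than the question text itself. Here is a minimal sketch of flattening each record, assuming every variant has the keys built in cell 30. `flatten_variant` is a hypothetical helper, and the example record is shortened and made up.

```python
import json


def flatten_variant(variant: dict) -> dict:
    """Unwrap the raw JSON string stored under "new_question"."""
    raw = variant["new_question"]
    try:
        question = json.loads(raw)["new_question"]
    except (json.JSONDecodeError, KeyError, TypeError):
        question = raw  # assumption: keep the raw text if it is not valid JSON
    return {
        "new_question": question,
        "persona": variant["style"]["persona"].strip(),
        "original_question": variant["original_question"],
    }


# Shortened, made-up record in the same shape as `re_written_questions`.
example = {
    "new_question": '{\n"new_question": "What synergies are expected?" \n}',
    "style": {"persona": "\n    Joan is a very senior financial analyst.\n    ", "style": "Formal."},
    "original_question": "What's the deal with SolarPower buying GreenTech?",
}
flat = flatten_variant(example)
print(flat)
```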

## Conclusion

**Takeaway-** The pipeline above illustrates how one could use a powerful LLM and an embedding model to generate synthetic data for evaluating a retrieval pipeline. It is not *the* pipeline that will solve all your synthetic data needs, but it is a good starting point to stimulate thinking about how to synthesize data for your own use case.

**Points of Improvement**
* Basic building blocks like LLM-as-a-judge can be used to create more filters. It is recommended to design additional filters that are in line with your data. Reward models can also be used as judges to add more types of filters.
* Frameworks like [Self-Discover](https://arxiv.org/pdf/2402.03620) can be used to improve the reasoning/critic capabilities of the LLM so that it adapts to the complexity of the underlying data.
* More experimentation with personas can extract further behavioural information for targeting specific types of questions.
* This pipeline can be scaled up to generate triplets for finetuning your embedding model. The scaled-up pipeline can be incorporated into an Ops pipeline that finetunes automatically to continuously improve your RAG pipelines.
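
As a rough illustration of the triplet idea in the last point, the sketch below turns question and chunk pairs into (query, positive, negative) records. It assumes each generated question is stored with the ID of the chunk it was written from and that negatives are drawn at random from the other chunks. That is a simplification: the notebook does not specify a negative-sampling strategy. `build_triplets` and the sample data are hypothetical.

```python
import random


def build_triplets(pairs, chunks, seed=0):
    """Turn (question, chunk_id) pairs into (query, positive, negative) triplets.

    pairs:  list of (question, chunk_id) tuples
    chunks: dict mapping chunk_id -> chunk text (needs at least two chunks)
    """
    rng = random.Random(seed)  # seeded so the sampling is repeatable
    triplets = []
    for question, chunk_id in pairs:
        negative_ids = [cid for cid in chunks if cid != chunk_id]
        if not negative_ids:
            continue  # no other chunk available to act as a negative
        negative_id = rng.choice(negative_ids)
        triplets.append(
            {"query": question, "positive": chunks[chunk_id], "negative": chunks[negative_id]}
        )
    return triplets


# Made-up chunks and a single generated question.
chunks = {
    "c1": "The acquisition is valued at $3 billion.",
    "c2": "An unrelated passage about quarterly hiring.",
}
pairs = [("How much is the deal worth?", "c1")]
triplets = build_triplets(pairs, chunks)
print(triplets)
```

Random negatives are the simplest option; harder negatives, such as chunks that a retriever ranks highly but that do not answer the question, are a common refinement.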