# Tutorial - Generating Evaluation Data for Retrieval Pipelines

This notebook outlines one of many use cases for synthetic data: generating questions with which you can evaluate a retrieval pipeline. Let's get started.

Before we start generating data, let's align on the larger process. To evaluate a retriever, you need two things: questions and the ground truth. From a synthetic data perspective, we flip the problem on its head: we start with a few documents, create chunks, and then create questions that can be answered using those chunks. Because we already know which chunk each question came from, we can create the evaluation pairs directly. If you want to learn more about the nuances of evaluating retrieval pipelines, check out [this article](https://developer.nvidia.com/blog/evaluating-retriever-for-enterprise-grade-rag/).
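
To make the idea of evaluation pairs concrete, here is a minimal sketch of how they could be scored once generated. The pair layout, the `recall_at_k` helper, and the toy retriever are all assumptions made for illustration; they are not part of this notebook's utility classes.

```python
# Hypothetical evaluation pairs: each synthetic question is stored with the
# id of the chunk it was generated from, which serves as the ground truth.
eval_pairs = [
    {"question": "How large is the breakup fee?", "chunk_id": "chunk-0"},
    {"question": "Which regulators are reviewing the deal?", "chunk_id": "chunk-0"},
    {"question": "What is the warranty period?", "chunk_id": "chunk-1"},
]

def recall_at_k(pairs, retrieve, k=5):
    """Fraction of questions whose source chunk appears in the top-k results."""
    hits = sum(1 for p in pairs if p["chunk_id"] in retrieve(p["question"])[:k])
    return hits / len(pairs) if pairs else 0.0

def toy_retrieve(question):
    # Toy retriever (assumption): returns the same ranked chunk ids for any question.
    return ["chunk-0", "chunk-2"]

score = recall_at_k(eval_pairs, toy_retrieve, k=2)
print(f"recall@2 = {score:.2f}")
```

In practice `toy_retrieve` would be replaced by the retrieval pipeline under test, returning ranked chunk ids for a question.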

To keep the explanation simple, this notebook skips the document ingestion and chunking pipeline and works with a single chunk to illustrate the core process.

**The core challenge we want to solve with synthetic data for evaluation is that we may not have a dataset that represents our users.** To that end, let's think through a simple 3-step pipeline in which we use various personas to guide the generation of our evaluation data.

![Pipeline](imgs/1.PNG)

Before diving into the specifics, let's look at the chunk and a few sample personas.

In [1]:
CHUNK = """The proposed acquisition of GreenTech Inc. by SolarPower Corporation stands as one of the most notable transactions\
in the renewable energy sector this year. Valued at $3 billion, the deal aims to combine GreenTech's cutting-edge battery\
technology with SolarPower's extensive solar panel manufacturing and distribution network. The anticipated operational\
synergies are expected to result in a 20% reduction in production costs and a 15% increase in revenue over the next two years.\
However, the transaction is under intense scrutiny from regulatory bodies due to potential antitrust concerns.\
The Federal Trade Commission (FTC) has indicated that the merger could potentially create a monopoly in the renewable energy\
storage market, potentially stifling competition and innovation.

SolarPower has committed to maintaining GreenTech's research and development (R&D) center, which employs over 500 scientists\
and engineers, as an independent entity to preserve its innovative culture. Additionally, all existing employment contracts will\
be honored, alleviating concerns about potential layoffs. The merger agreement includes a $150 million breakup fee, payable to\
GreenTech if SolarPower fails to secure the necessary regulatory approvals, thereby mitigating financial risks for GreenTech\
should the deal fall through.

The agreement includes detailed representations and warranties, specifying the accuracy of financial statements,\
the absence of undisclosed liabilities, and compliance with applicable laws. It also entails a thorough indemnification process\
to protect both parties against potential breaches of these representations and warranties. SolarPower and GreenTech have\
agreed to covenants that restrict GreenTech from incurring new debt, issuing additional shares, or significantly altering\
business operations without SolarPower's consent prior to the deal’s closure. These covenants are designed to preserve the\
value of GreenTech and ensure a smooth transition post-merger. The agreement further outlines a comprehensive due diligence\
process, including environmental assessments and audits of GreenTech’s intellectual property portfolio, to ensure all assets\
and liabilities are accurately accounted for before the finalization of the transaction.

The European Commission is also reviewing the merger to assess its impact on the EU market, particularly regarding competition\
and market dominance. This evaluation involves submitting detailed filings that include market analyses, competitive impact\
assessments, and economic justifications for the merger. The review process requires both companies to respond promptly to\
inquiries and provide comprehensive documentation. Additionally, to secure approval, SolarPower and GreenTech may need to make\
concessions, such as divesting certain business units or assets, to alleviate concerns about reduced competition.\
Ensuring compliance with the EU Merger Regulation involves not only addressing competitive effects but also ensuring that\
the merger aligns with broader EU policies on market fairness and consumer protection.
"""

FILE_NAME = "GreenTech_Acquistion.txt"

PERSONAS = [
    """
    Joan is a very senior financial analyst and focuses on using econometrics to recommend investment strategies. Joan is used to having a team of analysts who they can ask for information, so they may not be up to date with the speficis so they may ask vauge questions. However, they are very knowledgeable about the general topic.
    """,
    """
    Padma is a seasoned corporate litigator with over 10 years of experience in handling complex legal cases for large corporations. She has a no-nonsense approach and is known for her sharp analytical mind and attention to detail.
    """,
    """
    Aaron is an underconfident journalism major and thus doesn't probe the underlying material too deeply. He is still new to the english language so doesn't have that much profficiency. He also has a bad habit of sensationalizing things. 
    """ 
]


The chunk above is a passage about a merger. Let's say we have three types of audiences, and we have written a persona description for each.

**Note:** Most of the nuts and bolts of prompting the LLM have been bucketed into a few utility classes. Feel free to check out `prompts.py`, `Generator.py`, and `DeDup.py` if you are curious about the implementation details.

## Generate Questions

In [2]:
from Generator import *
import json
from DeDup import *
import concurrent.futures

generator = Generator()
dedup = Dedup()

![Pipeline: Step 1](imgs/2.PNG)
    
The first step is to generate questions. This step has 4 parts:

### Generating Points of Interest

Because we are trying to tailor the questions to what our users may ask, we first extract the points of interest that each of our selected personas would find in the passage.

In [3]:
points_of_interest = []

for persona in PERSONAS:
    points_of_interest.extend(generator.extract_points_of_interest(persona, FILE_NAME, CHUNK)['list_of_interest'])
    print("Persona Processed: " + persona)

Persona Processed: 
    Joan is a very senior financial analyst and focuses on using econometrics to recommend investment strategies. Joan is used to having a team of analysts who they can ask for information, so they may not be up to date with the speficis so they may ask vauge questions. However, they are very knowledgeable about the general topic.
    
Persona Processed: 
    Padma is a seasoned corporate litigator with over 10 years of experience in handling complex legal cases for large corporations. She has a no-nonsense approach and is known for her sharp analytical mind and attention to detail.
    
Persona Processed: 
    Aaron is an underconfident journalism major and thus doesn't probe the underlying material too deeply. He is still new to the english language so doesn't have that much profficiency. He also has a bad habit of sensationalizing things. 
    


**Prompt for the LLM**
```
You are given a Persona and a Passage. Your task is to immitate the persona and create a list interesting topics from the given passage.

<Persona>
{persona}
</Persona>

<Passage>
The following information is from a file with the title "{file_name}".

{passage}
</Passage>

Answer format - Generate a json with the following fields
- "list_of_interest": [<fill with 1-5 word desription>]

Use Reflective Thinking: Step back from the problem, take the time for introspection and self-reflection. Examine personal biases, assumptions, and mental models that may influence problem-solving, and being open to learning from past experiences to improve future approaches. Show your thinking before giving an answer.
```
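
Because the prompt asks the model to show its thinking before answering, the raw response is likely to contain free text followed by the JSON. How `Generator.py` handles this is not shown here; the helper below is a hypothetical sketch of one way to pull the first JSON object out of such a response using only the standard library.

```python
import json

def extract_first_json_object(text):
    """Return the first decodable JSON object found in `text`, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever text follows it.
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # This brace did not start valid JSON; try the next one.
        start = text.find("{", start + 1)
    return None

# Made-up response for illustration: reasoning text first, then the JSON answer.
sample_response = (
    "Thinking: the persona cares about regulation {not json}.\n"
    '{"list_of_interest": ["Antitrust Scrutiny", "Breakup Fee"]}'
)
parsed = extract_first_json_object(sample_response)
print(parsed["list_of_interest"])
```

Scanning for the first decodable object tolerates stray braces in the reasoning text, at the cost of silently ignoring anything after the first valid object.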

In [4]:
print(points_of_interest)

['Renewable Energy Market Trends', 'Antitrust Regulatory Scrutiny', 'Merger Synergies Analysis', 'EU Merger Regulation Compliance', 'GreenTech R&D Preservation', 'Breakup Fee Mitigation Strategy', 'Due Diligence Process', 'Market Dominance Assessment', 'Antitrust Regulatory Scrutiny', 'Merger Agreement Clauses', 'Competition Law Compliance', 'Intellectual Property Due Diligence', 'EU Merger Regulation Approval', 'Market Dominance Concerns', 'Operational Synergies Analysis', 'Breakup Fee Negotiation', 'Biggest Renewable Energy Deal', "GreenTech's Battery Technology", "SolarPower's Dominance Concerns", 'Regulatory Hurdles Ahead', 'EU Market Impact Review', "Merger's Job Security Promise", "GreenTech's R&D Future", 'Breakup Fee Safety Net']


### DeDuplicate Points of Interest

Next, since personas may have overlapping interests, we need to deduplicate the points of interest. This deduplication is done by generating an embedding for each point of interest and then clustering the embedding vectors. Only one point of interest is kept per cluster. You can set the distance threshold in `DeDup.py`.
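
The idea can be sketched without any external dependencies. This is not the code in `DeDup.py`: a toy bag-of-words vector stands in for a real embedding model, and a simple single-linkage threshold clustering stands in for whatever clustering the utility class uses. The function names and the `threshold` value are assumptions for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding (assumption): a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def dedup(items, threshold=0.3):
    """Cluster items whose distance is within `threshold`; keep the first item per cluster."""
    vectors = [embed(item) for item in items]
    parent = list(range(len(items)))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if cosine_distance(vectors[i], vectors[j]) <= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # The smaller index becomes the root, so each cluster's
                    # root is its earliest item.
                    parent[max(ri, rj)] = min(ri, rj)

    seen, kept = set(), []
    for i, item in enumerate(items):
        root = find(i)
        if root not in seen:
            seen.add(root)
            kept.append(item)
    return kept

sample = [
    "Antitrust Regulatory Scrutiny",
    "antitrust regulatory scrutiny",
    "Breakup Fee Negotiation",
    "Due Diligence Process",
]
unique = dedup(sample)
print(unique)
```

A lower threshold keeps more near-duplicates, while a higher one merges more aggressively.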

In [5]:
deduped_points_of_interest = dedup.execute(points_of_interest)

  out = hierarchy.linkage(X, method=linkage, metric=affinity)


In [6]:
print("Eliminated {} duplicates".format(len(points_of_interest) - len(deduped_points_of_interest)))

Eliminated 1 duplicates


In [7]:
print(deduped_points_of_interest)

['Renewable Energy Market Trends', 'Antitrust Regulatory Scrutiny', 'Merger Synergies Analysis', 'EU Merger Regulation Compliance', 'GreenTech R&D Preservation', 'Breakup Fee Mitigation Strategy', 'Due Diligence Process', 'Market Dominance Assessment', 'Merger Agreement Clauses', 'Competition Law Compliance', 'Intellectual Property Due Diligence', 'EU Merger Regulation Approval', 'Market Dominance Concerns', 'Operational Synergies Analysis', 'Breakup Fee Negotiation', 'Biggest Renewable Energy Deal', "GreenTech's Battery Technology", "SolarPower's Dominance Concerns", 'Regulatory Hurdles Ahead', 'EU Market Impact Review', "Merger's Job Security Promise", "GreenTech's R&D Future", 'Breakup Fee Safety Net']


### Mapping Type of Question to Point of Interest

The user's area of interest is not the only thing that drives diversity in the questions; the type of question matters too. Depending on the topic and the underlying chunk, we need to map which types of questions can be asked for each point of interest.

In [8]:
TYPES_OF_QUESTION = {
    "extractive": "Extractive, ie, the question can be answered from objective information present in the context.",
    "abstractive": "Abstractive, ie, further reasoning needs to be done using the information in the passage to answer the question",
    "diagnostic": "Diagnostic, ie, the question is about constructing a diagnosis that can be inferred from the context",
    "aggregative": "Aggregative, ie, some form of collectivization like making a group, or counting the number of items needs to be done using the information in context to answer the question.",
    "sentiment-driven": "Sentiment-Driven, ie, underlying sentiment about an event or a piece of information can be extracted."
}

In [9]:
def process_interest(interest):
    # Ask the LLM which question types the chunk supports for this point of interest.
    try:
        mapping = generator.extract_compatible_question_type(interest, list(TYPES_OF_QUESTION.values()), FILE_NAME, CHUNK)['list_of_extractable_types_of_questions']
        print("Interest Processed: "+ interest)
        return {"interest": interest, "q_type": [m.lower() for m in mapping]}
    except Exception:
        # Failed generations return None and are filtered out in a later cell.
        print("Generation Error")
        return None

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    question_mapping = list(executor.map(process_interest, deduped_points_of_interest))

Interest Processed: Market Dominance Assessment
Interest Processed: Merger Agreement Clauses
Interest Processed: GreenTech R&D Preservation
Interest Processed: Due Diligence Process
Interest Processed: Breakup Fee Mitigation Strategy
Interest Processed: Renewable Energy Market Trends
Interest Processed: Antitrust Regulatory Scrutiny
Interest Processed: Merger Synergies Analysis
Interest Processed: EU Merger Regulation Compliance
Interest Processed: Competition Law Compliance
Interest Processed: Operational Synergies Analysis
Interest Processed: Intellectual Property Due Diligence
Interest Processed: EU Merger Regulation Approval
Interest Processed: Market Dominance Concerns
Interest Processed: Breakup Fee Negotiation
Interest Processed: GreenTech's Battery Technology
Interest Processed: SolarPower's Dominance Concerns
Interest Processed: Regulatory Hurdles Ahead
Interest Processed: Biggest Renewable Energy Deal
Interest Processed: Merger's Job Security Promise
Interest Processed: EU Ma
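
Note that the log lines above appear in completion order, not input order. `executor.map` still returns results in the order of its input iterable, which the following self-contained sketch illustrates. `slow_echo` is a hypothetical stand-in for an LLM call.

```python
import concurrent.futures
import time

def slow_echo(item):
    # Hypothetical stand-in for an LLM call: later items finish sooner.
    delay, label = item
    time.sleep(delay)
    return label

work = [(0.2, "first"), (0.1, "second"), (0.0, "third")]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # map() yields results in input order, regardless of completion order.
    results = list(executor.map(slow_echo, work))

print(results)
```

This is why `question_mapping` lines up with `deduped_points_of_interest` even though the printed order differs.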

**Prompt for the LLM**
```
You are a teacher are trying to identify which is the most types of question that will test your student's capabilities that can be asked about "{interest}" from the Passage below.
Note that the type of question should be grounded in with the information in the passage, and should have to rely on being pedantic, or general knowledge.

<Types of Questions>
{types}
</Types of Questions>

<Passage>
The following information is from a file with the title "{file_name}".

{passage}
</Passage>

Answer format - Generate a json with the following fields
- "reasoning" : show your reasoning
- "list_of_extractable_types_of_questions": [<extractive or abstractive or diagnostic or sentiment or aggregative>]
```
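
The answer format in this prompt lists `sentiment`, while the corresponding key in `TYPES_OF_QUESTION` is `sentiment-driven`, and the model may also echo a full description instead of a short label. A small normalization step can guard the later dictionary lookup. The helper below is a hypothetical sketch and not part of `Generator.py`.

```python
# Keys copied from TYPES_OF_QUESTION so that this sketch is self-contained.
QUESTION_TYPE_KEYS = ["extractive", "abstractive", "diagnostic", "aggregative", "sentiment-driven"]

def normalize_q_type(label):
    """Map a free-form label from the LLM to a known key, or None if nothing matches."""
    cleaned = label.strip().lower()
    for key in QUESTION_TYPE_KEYS:
        # Match on the first word of the key, so "sentiment" and
        # "Extractive, ie, ..." both resolve to their dictionary keys.
        if cleaned.startswith(key.split("-")[0]):
            return key
    return None

for raw_label in ["Sentiment", " Extractive, ie, the question ...", "trivia"]:
    print(repr(raw_label), "->", normalize_q_type(raw_label))
```

Labels that map to `None` can be dropped before question generation, rather than surfacing later as a lookup error.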

In [10]:
question_mapping = [qm for qm in question_mapping if qm is not None]
question_mapping

[{'interest': 'Renewable Energy Market Trends',
  'q_type': ['extractive',
   'abstractive',
   'diagnostic',
   'aggregative',
   'sentiment-driven']},
 {'interest': 'Antitrust Regulatory Scrutiny',
  'q_type': ['extractive',
   'abstractive',
   'diagnostic',
   'sentiment-driven',
   'aggregative']},
 {'interest': 'Merger Synergies Analysis',
  'q_type': ['extractive', 'abstractive', 'diagnostic', 'aggregative']},
 {'interest': 'EU Merger Regulation Compliance',
  'q_type': ['extractive', 'abstractive', 'diagnostic', 'aggregative']},
 {'interest': 'GreenTech R&D Preservation',
  'q_type': ['extractive', 'abstractive', 'aggregative', 'sentiment-driven']},
 {'interest': 'Breakup Fee Mitigation Strategy',
  'q_type': ['extractive', 'abstractive', 'aggregative', 'sentiment-driven']},
 {'interest': 'Due Diligence Process',
  'q_type': ['extractive', 'abstractive', 'diagnostic', 'aggregative']},
 {'interest': 'Market Dominance Assessment',
  'q_type': ['abstractive', 'diagnostic']},
 {'in

### Generating Questions

Finally, now that we have the mapping of type of question, point of interest and the underlying information, let's generate the questions.

In [11]:
def process_question_type(item):
    # Generate questions for every question type mapped to this point of interest.
    interest_questions = []
    for q_type in item['q_type']:
        try:
            questions = generator.generate_questions(FILE_NAME, CHUNK, item['interest'], TYPES_OF_QUESTION[q_type.lower()])
            interest_questions.extend(questions)
        except Exception:
            # Covers both LLM failures and labels missing from TYPES_OF_QUESTION.
            print("Generation Error")
    print("Processed Interest: " + item['interest'])
    return interest_questions

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    # Flatten the per-interest lists into a single list of questions.
    raw_questions = [question for result in executor.map(process_question_type, question_mapping) for question in result]
    questions = [item for item in raw_questions if item != []]

Processed Interest: Market Dominance Assessment
Processed Interest: Merger Agreement Clauses
Processed Interest: Competition Law Compliance
Processed Interest: GreenTech R&D Preservation
Processed Interest: EU Merger Regulation Compliance
Processed Interest: Due Diligence Process
Processed Interest: Breakup Fee Mitigation Strategy
Processed Interest: Intellectual Property Due Diligence
Processed Interest: Renewable Energy Market Trends
Processed Interest: EU Merger Regulation Approval
Processed Interest: Market Dominance Concerns
Processed Interest: Antitrust Regulatory Scrutiny
Processed Interest: Operational Synergies Analysis
Processed Interest: SolarPower's Dominance Concerns
Processed Interest: GreenTech's Battery Technology
Processed Interest: Merger Synergies Analysis
Processed Interest: Breakup Fee Negotiation
Processed Interest: EU Market Impact Review
Processed Interest: Biggest Renewable Energy Deal
Processed Interest: GreenTech's R&D Future
Processed Interest: Merger's Job 

**Prompt for the LLM**
```
You are interviewing an expert. Generate 3 meaningful questions about {interest} from the given Passage. The questions should be {types}

These are questions for a viva, not an written examination.

<Passage>
The following information is from a file with the title "{file_name}".

{passage}
</Passage>

Answer format - Generate a json with the following fields
- "generated_questions": [questions]
```

In [12]:
questions

['What is the expected reduction in production costs and increase in revenue over the next two years as a result of the proposed acquisition of GreenTech Inc. by SolarPower Corporation, and how will this be achieved through operational synergies?',
 'What are the concerns raised by the Federal Trade Commission (FTC) regarding the potential impact of the merger on the renewable energy storage market, and what measures may be required to alleviate these concerns?',
 'What specific covenants have SolarPower and GreenTech agreed to in order to preserve the value of GreenTech and ensure a smooth transition post-merger, and what is the purpose of these covenants in the context of the acquisition agreement?',
 'Considering the potential antitrust concerns raised by the FTC, how might the merger between SolarPower and GreenTech impact the overall competitiveness of the renewable energy storage market, and what concessions could the companies make to alleviate these concerns while still achievi

In [13]:
len(questions)

273

## Filter Questions

![Pipeline: Step 2](imgs/3.PNG)

Now that we have all the underlying questions, the next step is to filter the generated questions.

### Semantic Dedup

First, we deduplicate the questions.

In [14]:
deduped_raw_questions = dedup.execute(questions)

  out = hierarchy.linkage(X, method=linkage, metric=affinity)


In [15]:
print("Eliminated {} duplicates".format(len(raw_questions) - len(deduped_raw_questions)))

Eliminated 6 duplicates


In [16]:
deduped_raw_questions

['What is the expected reduction in production costs and increase in revenue over the next two years as a result of the proposed acquisition of GreenTech Inc. by SolarPower Corporation, and how will this be achieved through operational synergies?',
 'What are the concerns raised by the Federal Trade Commission (FTC) regarding the potential impact of the merger on the renewable energy storage market, and what measures may be required to alleviate these concerns?',
 'What specific covenants have SolarPower and GreenTech agreed to in order to preserve the value of GreenTech and ensure a smooth transition post-merger, and what is the purpose of these covenants in the context of the acquisition agreement?',
 'Considering the potential antitrust concerns raised by the FTC, how might the merger between SolarPower and GreenTech impact the overall competitiveness of the renewable energy storage market, and what concessions could the companies make to alleviate these concerns while still achievi

### Relevance Filter

Second, we need to ensure that the generated questions are at least partially answerable from the chunk. We set up an LLM as a judge to analyze all the questions and filter them according to the given criteria.

In [17]:
relevance_filter = Relevance_Filter()

In [18]:
filter_flags = []

def process_question(question):
    try:
        flag = relevance_filter.execute(question, FILE_NAME, CHUNK)
        print("Processed Question: "+ question)
        return {"question": question, "flag": flag}
    except Exception:
        # Failed judgements return None and are dropped in a later cell.
        return None

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(process_question, deduped_raw_questions))

filter_flags.extend(results)

Processed Question: Given the emphasis on preserving GreenTech's innovative culture and R&D capabilities, how might the combined entity prioritize and allocate resources for research and development in the renewable energy sector, and what potential implications could this have for the broader market trends and technological advancements in this space?
Processed Question: What specific covenants have SolarPower and GreenTech agreed to in order to preserve the value of GreenTech and ensure a smooth transition post-merger, and what is the purpose of these covenants in the context of the acquisition agreement?
Processed Question: Can you name at least three regulatory bodies that are involved in scrutinizing the acquisition of GreenTech Inc. by SolarPower Corporation, and what are their primary concerns regarding this transaction? Additionally, how do these concerns collectively impact the renewable energy market trends in terms of competition and innovation?
Processed Question: What is t

Processed Question: Can you enumerate the potential concessions that SolarPower and GreenTech may need to make to alleviate antitrust concerns, such as divesting business units or assets, and how these concessions could impact the overall value of the merger? Can you also group these concessions into categories, such as structural versus behavioral remedies, and discuss their implications for the companies involved?
Processed Question: In light of the European Commission's review of the merger, don't you agree that SolarPower and GreenTech may need to make significant concessions, such as divesting certain business units or assets, in order to alleviate concerns about reduced competition and ensure compliance with the EU Merger Regulation, which could ultimately impact the viability of the deal?
Processed Question: How many regulatory bodies are involved in reviewing the proposed acquisition, and what are the key areas of focus for each of these bodies? Can you aggregate the concerns r

Processed Question: Given the agreement's provisions, such as the $150 million breakup fee and the commitment to maintaining GreenTech's R&D center as an independent entity, how do these terms mitigate financial risks and alleviate concerns about potential layoffs, and what implications do these have for EU Merger Regulation compliance?
Processed Question: What is the primary concern of the European Commission in reviewing the merger between SolarPower Corporation and GreenTech Inc., as mentioned in the passage?
Processed Question: In light of the EU Merger Regulation's emphasis on market fairness and consumer protection, what concessions, such as divesting certain business units or assets, might SolarPower and GreenTech need to make to alleviate concerns about reduced competition, and how might these concessions affect the overall viability of the merger?
Processed Question: Can you outline the three primary areas of focus for the European Commission's review of the GreenTech acquisit

Processed Question: Given the comprehensive due diligence process and the covenants restricting GreenTech's business operations prior to the deal's closure, don't you think that GreenTech's ability to operate independently is already being compromised, despite SolarPower's commitment to maintaining its R&D center as an independent entity?
Processed Question: What is the purpose of the $150 million breakup fee in the merger agreement between SolarPower and GreenTech, and under what circumstances would it be payable to GreenTech?
Processed Question: What specific covenants have SolarPower and GreenTech agreed to, in order to preserve the value of GreenTech and ensure a smooth transition post-merger?
Processed Question: What is the scope of the due diligence process outlined in the agreement, and what specific assessments and audits are included to ensure all assets and liabilities are accurately accounted for before the finalization of the transaction?
Processed Question: Considering the

Processed Question: What are the potential antitrust concerns raised by the Federal Trade Commission (FTC) regarding the proposed acquisition of GreenTech Inc. by SolarPower Corporation, and how might these concerns impact the transaction's approval?
Processed Question: Considering the potential antitrust concerns raised by the FTC, how do you think the merger between SolarPower and GreenTech could be structured to address the issue of creating a monopoly in the renewable energy storage market, while still achieving the desired operational synergies and cost reductions?
Processed Question: What role does the European Commission play in reviewing the merger, and what types of concessions might SolarPower and GreenTech need to make in order to secure approval under the EU Merger Regulation and address concerns about reduced competition in the EU market?
Processed Question: What measures have SolarPower and GreenTech agreed to implement in order to alleviate concerns about potential layof

Processed Question: What specific measures has SolarPower committed to in order to address concerns about the potential stifling of innovation and competition following the merger, particularly regarding GreenTech's research and development center?
Processed Question: Considering the European Commission's review of the merger, what concessions might SolarPower and GreenTech need to make to alleviate concerns about reduced competition, and how might these concessions affect the overall value and structure of the deal? Are there any specific business units or assets that might be divested, and what would be the strategic implications of such divestitures?
Processed Question: Given the agreement's provisions for maintaining GreenTech's R&D center as an independent entity and honoring existing employment contracts, how might the merger impact the innovative culture and talent retention within GreenTech? Are there any potential risks or challenges associated with integrating GreenTech's R&D

Processed Question: Don't you think that the $150 million breakup fee is too low, considering the potential risks and the size of the deal, and how do you think this fee was negotiated between SolarPower and GreenTech? Was there a power imbalance in the negotiation process that led to this amount being agreed upon?
Processed Question: Given the comprehensive due diligence process and the European Commission's review, what are the potential deal-breakers or areas of contention that could lead to the invocation of the breakup fee, and how might these factors influence the negotiation of the fee's amount or the merger agreement's overall structure?
Processed Question: How do you perceive the sentiment of the regulatory bodies, such as the FTC and the European Commission, towards the merger, and do you think that SolarPower and GreenTech have done enough to address their concerns and avoid the payment of the breakup fee? Are there any concessions that you think the companies could make to 

Processed Question: Given the concerns raised by regulatory bodies about the potential creation of a monopoly in the renewable energy storage market, how does GreenTech's battery technology currently position itself in the market, and what are the potential implications of this merger on the competitive landscape?
Processed Question: Can you enumerate the potential risks associated with the acquisition of GreenTech Inc. by SolarPower Corporation, and how many of these risks are directly related to regulatory approvals and competition concerns? Can you group these risks into categories and provide a count for each category?
Processed Question: What specific measures has SolarPower committed to in order to address concerns about the potential stifling of innovation and competition, particularly regarding GreenTech's research and development center?
Processed Question: What are the total number of employees that will be preserved as a result of the merger agreement between SolarPower Corp

Processed Question: Considering the European Commission's review of the merger, do you believe that SolarPower and GreenTech will be able to address the concerns about reduced competition and market dominance, potentially by making concessions such as divesting certain business units or assets, without compromising the overall value of the deal?
Processed Question: Don't you think that the proposed acquisition of GreenTech Inc. by SolarPower Corporation is facing significant regulatory hurdles, and the intense scrutiny from regulatory bodies could potentially jeopardize the entire deal, given the concerns about antitrust and market dominance? How do you perceive the risks involved in this transaction?
Processed Question: In light of the regulatory hurdles ahead, don't you think that the $150 million breakup fee payable to GreenTech if SolarPower fails to secure the necessary regulatory approvals is a significant risk for SolarPower, and could this lead to a re-evaluation of the merger 

Processed Question: What are your thoughts on the job security promise made by SolarPower to GreenTech's employees, considering the company's history of honoring employment contracts in previous mergers and acquisitions? Do you think this promise will alleviate concerns about potential layoffs and contribute to a smoother transition post-merger?
Processed Question: Can you count and list the various measures that SolarPower has committed to undertake to alleviate concerns about job security, such as maintaining GreenTech's R&D center as an independent entity, honoring existing employment contracts, and restricting GreenTech from incurring new debt or significantly altering business operations without SolarPower's consent, and explain how these measures will collectively impact the employees of GreenTech, including the 500 scientists and engineers employed at the R&D center, in terms of providing them job security and a smooth transition post-merger as mentioned in the passage (e.g. $15

Processed Question: How do you perceive the sentiment of GreenTech's stakeholders, including its 500 scientists and engineers, regarding the potential acquisition by SolarPower, and do you believe the commitment to maintaining the R&D center as an independent entity has alleviated their concerns about job security and the company's future direction?
Processed Question: Considering the intense scrutiny from regulatory bodies and potential antitrust concerns, do you think the breakup fee safety net will be sufficient to mitigate the financial risks for GreenTech, or could the company still face significant financial and reputational damage if the deal falls through due to regulatory hurdles?


**Prompt for the LLM**
```
You are a juror, and are tasked with giving a judgement if there is enough evidence in the passage to answer a given question.
- Make no assumptions or use your exisiting knowledge.
- The evidence should be in the passage. The existance of pointer to the evidence doesn't qualify as sufficently useful.

Question: {question}

<Passage>
The following information is from a file with the title "{file_name}".

{passage}
</Passage>

<Judgements-Options>
- "Beyond a reasonable doubt" - There is enough evidence in the passage or the information in the passage can be used to completely answer the question beyond a reasonable doubt.
- "Somewhat relevant" - Only part of evidence required to completely answer, or to reason through get the answer is available in the passage. 
- "Not useful" - The passage doesn't contain enough information to answer the question.
</Judgement-Options>

Generate your answer in a json format with the fields below
- "Reasoning": 1-10 words of reasoning
- "Your_Decision": "fill with judgement option"
```
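
To make the filtering step concrete, here is a minimal sketch of how a reply in the format above could be validated before it is used as a filter flag. The helper names `parse_judgement` and `is_relevant` are hypothetical and are not part of the notebook's `generator`; the only assumption is that the LLM replies with the JSON described in the prompt.

```python
import json

# The three judgement options offered in the prompt, lower-cased for comparison.
VALID_DECISIONS = {"beyond a reasonable doubt", "somewhat relevant", "not useful"}

def parse_judgement(raw_response):
    """Parse the judge's JSON reply; return None if it is malformed."""
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    decision = str(parsed.get("Your_Decision", "")).strip().lower()
    if decision not in VALID_DECISIONS:
        return None
    return {"Reasoning": parsed.get("Reasoning", ""), "Your_Decision": decision}

def is_relevant(flag):
    """Keep a question unless the judge marked the passage as not useful."""
    return flag is not None and flag["Your_Decision"] != "not useful"

sample = parse_judgement('{"Reasoning": "Figures stated directly", "Your_Decision": "Somewhat relevant"}')
print(is_relevant(sample))
```

Rejecting unknown decision strings up front avoids silently keeping a question when the model drifts from the three allowed options.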

In [19]:
filter_flags = [ff for ff in filter_flags if ff is not None]

In [20]:
relevant_questions = []

for item in filter_flags:
    flag = item["flag"]
    if flag['Your_Decision'].lower() != "not useful":
        relevant_questions.append(item['question'])

In [21]:
print("Filtered {} questions".format(len(filter_flags) - len(relevant_questions)))

Filtered 69 questions


In [22]:
print(relevant_questions)

['What is the expected reduction in production costs and increase in revenue over the next two years as a result of the proposed acquisition of GreenTech Inc. by SolarPower Corporation, and how will this be achieved through operational synergies?', 'What are the concerns raised by the Federal Trade Commission (FTC) regarding the potential impact of the merger on the renewable energy storage market, and what measures may be required to alleviate these concerns?', 'What specific covenants have SolarPower and GreenTech agreed to in order to preserve the value of GreenTech and ensure a smooth transition post-merger, and what is the purpose of these covenants in the context of the acquisition agreement?', 'Considering the potential antitrust concerns raised by the FTC, how might the merger between SolarPower and GreenTech impact the overall competitiveness of the renewable energy storage market, and what concessions could the companies make to alleviate these concerns while still achieving 

### Conversational Re-Write

Third, we make sure that the questions aren't overly specific and don't have a mechanical connotation by re-writing them in a more natural, conversational tone.

In [23]:
re_written_questions = []

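def process_question(question):
    # Return None on any failure (LLM error, malformed JSON, missing key)
    # so the failed question can be dropped in the next cell.
    try:
        return generator.conversational_re_write(question, FILE_NAME, CHUNK)['re_written_question']
    except Exception:
        return None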

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    re_written_questions = list(executor.map(process_question, relevant_questions))

In [24]:
re_written_questions = [rq for rq in re_written_questions if rq is not None]

**Prompt for the LLM**
```
Your task is to make minor edits to Old_Question if needed to make it sound Conversational.
- Remove phrases like "based on the given passage/information..." by turning it into a "does", "what", "how", or "why" question.
- Questions shouldn't have all the identifiers for extracting information, i.e., humans are imprecise, assume context is already there.

Old_Question: {question}

<Passage>
The following information is from a file with the title "{file_name}".

{passage}
</Passage>

Answer format - Generate a json with the following fields
- "re_written_question": <fill>
```
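
The placeholders in the prompt can be filled with plain `str.format`. The sketch below is an assumption about how such a call might be wired, not the notebook's actual `generator.conversational_re_write`: `REWRITE_TEMPLATE` is an abbreviated copy of the prompt, `llm` is any hypothetical callable that maps a prompt string to a reply string, and the file name `merger_summary.txt` is made up.

```python
import json

# Abbreviated copy of the prompt above; the placeholders are the same.
REWRITE_TEMPLATE = (
    "Your task is to make minor edits to Old_Question if needed "
    "to make it sound Conversational.\n\n"
    "Old_Question: {question}\n\n"
    "<Passage>\n"
    'The following information is from a file with the title "{file_name}".\n\n'
    "{passage}\n"
    "</Passage>\n\n"
    "Answer format - Generate a json with the following fields\n"
    '- "re_written_question": <fill>\n'
)

def build_rewrite_prompt(question, file_name, passage):
    """Fill the three placeholders of the template."""
    return REWRITE_TEMPLATE.format(question=question, file_name=file_name, passage=passage)

def conversational_re_write(llm, question, file_name, passage):
    """Send the filled prompt to the model and parse its JSON reply."""
    reply = llm(build_rewrite_prompt(question, file_name, passage))
    return json.loads(reply)

def fake_llm(prompt):
    # Stand-in for a real model call, used only to exercise the code path.
    return json.dumps({"re_written_question": "How much is the GreenTech deal worth?"})

result = conversational_re_write(
    fake_llm,
    "Based on the passage, what is the value of the acquisition of GreenTech Inc.?",
    "merger_summary.txt",  # hypothetical file name
    "Valued at $3 billion, the deal aims to combine GreenTech's battery technology ...",
)
print(result["re_written_question"])
```

Passing the model as a callable keeps the prompt-building and parsing logic testable without network access.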

In [25]:
re_written_questions

['How will the acquisition of GreenTech Inc. by SolarPower Corporation reduce production costs and increase revenue over the next two years through operational synergies?',
 "What concerns does the FTC have about the merger's impact on the renewable energy storage market, and how might they be addressed?",
 "What covenants have SolarPower and GreenTech agreed to in order to preserve GreenTech's value and ensure a smooth transition after the merger, and what's their purpose?",
 'How might the SolarPower and GreenTech merger impact the competitiveness of the renewable energy storage market, and what concessions could the companies make to alleviate antitrust concerns while still achieving operational synergies and cost reductions?',
 "What are the potential implications of the European Commission's review of the GreenTech acquisition for future M&A activity in the renewable energy sector, particularly in terms of competition and consumer protection in the EU market?",
 'How might the mer

### Intelligence Filter

Lastly, we ensure that the questions being asked are neither pedantic nor basic.

In [26]:
intelligent_question_filter = Intelligent_Question_Filter()

In [27]:
questions = []

def filter_question(question):
    # Keep only Type_A questions; drop everything else, including failed calls.
    try:
        filter_flag = intelligent_question_filter.execute(question, FILE_NAME, CHUNK)
        if filter_flag['Type_of_question'] == "Type_A":
            print("Processed Question: "+ question)
            return question
        return None
    except Exception:
        return None

with concurrent.futures.ThreadPoolExecutor() as executor:
    filtered_questions = list(executor.map(filter_question, re_written_questions))

questions = [q for q in filtered_questions if q is not None]

Processed Question: What concessions might SolarPower and GreenTech need to make to get approval for the merger and comply with antitrust regulations?
Processed Question: What concessions might SolarPower and GreenTech need to make to alleviate regulatory concerns about the merger, and how could these concessions impact the deal's benefits?
Processed Question: How is SolarPower addressing concerns about the merger's impact on GreenTech's research and development capabilities and employees?
Processed Question: Don't you think SolarPower and GreenTech will have to make some big concessions to get the European Commission's approval for the merger, like selling off parts of their business to avoid reducing competition?
Processed Question: How could the SolarPower and GreenTech merger stifle competition and innovation in renewable energy storage, and what concerns has the FTC raised about this? What are some ways the combined company might use its market power to hurt competitors and consum

Processed Question: What other EU policies does the EU Merger Regulation aim to ensure compliance with, beyond competition concerns?
Processed Question: What concessions might SolarPower and GreenTech need to make to address EU concerns about reduced competition, and how can they balance that with preserving the value of the merged entity?
Processed Question: How can SolarPower Corporation and GreenTech Inc. address antitrust concerns while still achieving the benefits of their merger, and what strategies can they use to mitigate these concerns and ensure compliance with applicable laws and regulations?
Processed Question: How do you think regulators like the FTC and European Commission view the SolarPower and GreenTech merger? Are their concerns about monopolies and reduced competition valid, and what can the companies do to address these concerns?
Processed Question: How do the agreement's terms, like the breakup fee and maintaining GreenTech's R&D center, reduce financial risks and 

Processed Question: What concessions might SolarPower and GreenTech need to make to alleviate EU concerns about reduced competition, and what's the purpose of these concessions in the review process?
Processed Question: What types of intellectual property assets and liabilities need to be accounted for during due diligence, and how do they impact the merger agreement's representations, warranties, and indemnifications?
Processed Question: How many regulatory bodies are reviewing the merger, and what concerns do they need to address regarding intellectual property and competition during due diligence?
Processed Question: How does the European Commission feel about the merger, given the scrutiny and potential concessions required to alleviate competition concerns?
Processed Question: What commitments has SolarPower made to address concerns about GreenTech's research and development center and existing employment contracts, and how do these impact the EU Merger Regulation approval process

Processed Question: How will the due diligence process affect GreenTech's valuation, and what does this mean for the acquisition's success?
Processed Question: What are the regulatory bodies scrutinizing the GreenTech acquisition, and what concerns have they raised about its impact on competition and market dominance in renewable energy?
Processed Question: Do you think the SolarPower and GreenTech merger could stifle competition and innovation in the renewable energy storage market, given the FTC's antitrust concerns?
Processed Question: Do you think SolarPower is committed to keeping GreenTech's research and development center independent, and will this move really preserve the company's innovative culture and prevent layoffs?
Processed Question: Does the agreement protect both parties' interests and ensure a smooth transition after the merger? What concessions might be needed for regulatory approvals?
Processed Question: How will the acquisition of GreenTech by SolarPower Corporatio

Processed Question: How many regulatory bodies are reviewing the merger, and what are their main concerns about competition and market dominance in the EU? Are there any common concerns or differences between them?
Processed Question: What potential risks or liabilities could be uncovered during the due diligence process that might affect the merger's success or GreenTech employees' job security, and how might SolarPower address these challenges?
Processed Question: How will the benefits of the merger, such as reduced production costs and increased revenue, be shared among employees, shareholders, and customers?
Processed Question: What measures has SolarPower committed to in order to alleviate job security concerns for GreenTech employees, and how will these measures impact them post-merger?
Processed Question: What are the expected operational synergies from combining GreenTech's battery technology with SolarPower's solar panel manufacturing and distribution network, and how will it 

**Prompt for the LLM**
```
You are an irritated teacher. Classify a student's question into one of the following types.
- Type_A: A question with which student extracts valuable insights, data points, or information.
- Type_B: A pedantic or a general knowledge question.
- Type_C: It would be hard to identify the subject of the conversation without the information in the passage. These types of questions are missing proper nouns.

Question: {question}

<Passage>
The following information is from a file with the title "{file_name}".

{passage}
</Passage>

Answer Format - Generate a json with the following fields
- "Type_of_question": <Fill with Type_A or Type_B or Type_C>
```
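
When tuning this filter, it helps to see how many questions land in each class rather than only the survivors. The sketch below tallies the labels with `collections.Counter`; the `labelled` list is made-up sample data standing in for the classifier's replies, and the helper names are illustrative.

```python
from collections import Counter

# Hypothetical classifier output: one label per question.
labelled = [
    {"question": "What covenants have SolarPower and GreenTech agreed to?", "Type_of_question": "Type_A"},
    {"question": "What is an acquisition?", "Type_of_question": "Type_B"},
    {"question": "How much is the deal worth?", "Type_of_question": "Type_C"},
    {"question": "What concerns does the FTC have about the merger?", "Type_of_question": "Type_A"},
]

def summarize_labels(items):
    """Count how many questions fall into each type."""
    return Counter(item["Type_of_question"] for item in items)

def keep_type_a(items):
    """Return only the questions classified as Type_A."""
    return [item["question"] for item in items if item["Type_of_question"] == "Type_A"]

counts = summarize_labels(labelled)
kept = keep_type_a(labelled)
print(dict(counts))
print("Filtered {} questions".format(len(labelled) - len(kept)))
```

A skewed distribution (for example, almost everything labelled Type_A) is a hint that the prompt or the classifier may need adjusting.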

In [28]:
print("Filtered {} questions".format(len(re_written_questions) - len(questions)))

Filtered 0 questions


In [29]:
questions

['How will the acquisition of GreenTech Inc. by SolarPower Corporation reduce production costs and increase revenue over the next two years through operational synergies?',
 "What concerns does the FTC have about the merger's impact on the renewable energy storage market, and how might they be addressed?",
 "What covenants have SolarPower and GreenTech agreed to in order to preserve GreenTech's value and ensure a smooth transition after the merger, and what's their purpose?",
 'How might the SolarPower and GreenTech merger impact the competitiveness of the renewable energy storage market, and what concessions could the companies make to alleviate antitrust concerns while still achieving operational synergies and cost reductions?',
 "What are the potential implications of the European Commission's review of the GreenTech acquisition for future M&A activity in the renewable energy sector, particularly in terms of competition and consumer protection in the EU market?",
 'How might the mer

## Persona Re-Write

![Pipeline: Step 3](imgs/4.PNG)

The third and final step is re-writing the questions in the voice of your personas.

### Writing Style

First, let's extract the writing style from the persona descriptions.

In [30]:
writing_styles = []

for persona in PERSONAS:
    style = json.loads(generator.writing_style(persona).strip())
    writing_styles.append({"persona": persona, "style": style['writing_style']})

**Prompt for the LLM**
```
Use the persona description below to articulate the Writing Style of the persona.

<Persona>
{persona}
</Persona>

Think step by step. Show your thinking.
Answer Format - Generate a json with the following fields
- "writing_style": <the writing style described in great detail in a paragraph>
```
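
Because the prompt asks the model to show its thinking, the reply may contain free text around the JSON object, in which case a direct `json.loads` would raise. Below is a minimal sketch of a more tolerant parser built on the standard library's `json.JSONDecoder.raw_decode`. The function name `extract_first_json_object` and the sample reply are illustrative and not part of the notebook.

```python
import json

def extract_first_json_object(text):
    """Return the first JSON object embedded in free text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows.
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # this brace did not open valid JSON; try the next one
        start = text.find("{", start + 1)
    return None

# Hypothetical reply: reasoning text first, then the JSON answer.
reply = (
    "Step 1: Joan is senior {so the tone is formal}.\n"
    '{"writing_style": "Formal, technical and occasionally vague."}\n'
    "Hope this helps."
)
parsed = extract_first_json_object(reply)
print(parsed["writing_style"])
```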

In [31]:
writing_styles

[{'persona': '\n    Joan is a very senior financial analyst and focuses on using econometrics to recommend investment strategies. Joan is used to having a team of analysts who they can ask for information, so they may not be up to date with the speficis so they may ask vauge questions. However, they are very knowledgeable about the general topic.\n    ',
  'style': "Joan's writing style is characterized by a tone of authority and expertise, reflecting their senior position as a financial analyst. Their language is formal and technical, often incorporating econometric jargon and complex financial concepts. However, due to their reliance on a team of analysts for specific details, their writing may occasionally lack precision and specificity, with vague questions or requests for further information. Despite this, their writing demonstrates a deep understanding of the broader topic, with a focus on big-picture analysis and strategic recommendations. Their sentences are likely to be struct

### Re-Writing

Now we can use the writing style and the filtered questions to generate variants.

In [32]:
re_written_questions = []
        
def process_question(question):
    # Produce one re-written variant of the question per persona writing style.
    question_variants = []
    for style in writing_styles:
        try:
            re_write = generator.persona_rewrite(style['style'], question)
            question_variants.append({"new_question": re_write, "style": style, "original_question": question})
        except Exception:
            # Skip this style if the LLM call fails.
            continue
    print("Processed Question: "+ question)
    return question_variants

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    re_written_questions = [variant for result in executor.map(process_question, questions) for variant in result]

Processed Question: What concessions might SolarPower and GreenTech need to make to get approval for the merger and comply with antitrust regulations?
Processed Question: What concerns does the FTC have about the merger's impact on the renewable energy storage market, and how might they be addressed?
Processed Question: How will combining GreenTech's battery tech with SolarPower's solar panel network impact the renewable energy market, and what role will the preserved R&D center play in driving innovation?
Processed Question: What covenants have SolarPower and GreenTech agreed to in order to preserve GreenTech's value and ensure a smooth transition after the merger, and what's their purpose?
Processed Question: How will the acquisition of GreenTech Inc. by SolarPower Corporation reduce production costs and increase revenue over the next two years through operational synergies?
Processed Question: How might the SolarPower and GreenTech merger impact the competitiveness of the renewable 

Processed Question: How can SolarPower Corporation and GreenTech Inc. address antitrust concerns while still achieving the benefits of their merger, and what strategies can they use to mitigate these concerns and ensure compliance with applicable laws and regulations?
Processed Question: How do the agreement's terms, like the breakup fee and maintaining GreenTech's R&D center, reduce financial risks and address concerns about layoffs, and what does this mean for EU Merger Regulation compliance?
Processed Question: What concessions might SolarPower and GreenTech need to make to alleviate EU concerns about reduced competition, and how could these concessions impact the merger's viability?
Processed Question: How will SolarPower preserve GreenTech's innovative culture and R&D capabilities after the merger?
Processed Question: How do the restrictions on GreenTech's business, the due diligence process, and the indemnification process work together to ensure the acquisition complies with EU 

Processed Question: What concessions might SolarPower and GreenTech need to make to get regulatory approvals, and how could these concessions affect the merger deal's value and structure?
Processed Question: What antitrust concerns does the FTC have about SolarPower buying GreenTech, and how might this affect the deal's approval?
Processed Question: How are SolarPower and GreenTech addressing concerns about layoffs and preserving GreenTech's value before the deal closes?
Processed Question: What concessions might SolarPower and GreenTech need to make to alleviate EU concerns about reduced competition, and how could these concessions impact the merger's overall value and strategic rationale?
Processed Question: How do the agreement's provisions help ensure compliance with laws and mitigate financial risks, and what are the potential consequences of non-compliance?
Processed Question: What's the European Commission's role in reviewing the SolarPower and GreenTech merger, and what concess

Processed Question: What are the potential deal-breakers or areas of contention that could lead to the breakup fee being invoked, and how might these factors influence the negotiation of the fee's amount or the merger agreement's overall structure?
Processed Question: What concerns does the FTC have about the merger's impact on the renewable energy storage market, and how are SolarPower and GreenTech addressing them?
Processed Question: What's the deal with SolarPower buying GreenTech? How much is it worth and what kind of cost savings and revenue boost are they expecting over the next two years?
Processed Question: How do you think regulators like the FTC and European Commission view the SolarPower and GreenTech merger? Have the companies done enough to address their concerns and avoid paying the breakup fee? What concessions could they make to ease these concerns?
Processed Question: What happens if SolarPower's acquisition of GreenTech doesn't get regulatory approval, and how might 

Processed Question: How might the European Commission's review of the SolarPower and GreenTech merger address concerns about creating a monopoly in the EU renewable energy storage market, and what concessions might be needed to alleviate these concerns?
Processed Question: How can SolarPower Corporation and GreenTech Inc. address antitrust concerns raised by the FTC and the European Commission, and what impact might this have on integrating GreenTech's battery technology with SolarPower's solar panel manufacturing and distribution network?
Processed Question: What parts of the SolarPower and GreenTech merger deal might the European Commission scrutinize closely, and how might the companies need to adjust their proposal to get approval?
Processed Question: How many regulatory bodies are reviewing the merger, and what are their main concerns about competition and market dominance in the EU? Are there any common concerns or differences between them?
Processed Question: What concessions mi

**Prompt for the LLM**
```
Your task is to re-write the question in the style of the persona below.
Use the Writing Style from the persona. It is okay to make non-sensical questions if the persona requires it.

<Style>
{persona}
</Style>

<Constraints>
- The reformatted question shouldn't leak any information about the persona.
- The question should have enough identifiers to be understood in a vacuum. Don't replace too many proper nouns with pronouns.
</Constraints>

Old Question: {question}

Answer format should be a json with the following fields:
- "new_question": contains the new question. 
```

In [33]:
re_written_questions

[{'new_question': '{\n"new_question": "What are the projected efficiency gains and revenue enhancements resulting from the integration of GreenTech Inc.\'s operations into SolarPower Corporation\'s existing infrastructure, and how will these synergies impact the combined entity\'s bottom line over the next 24 months, assuming optimal resource allocation and streamlining of redundant processes?" \n}',
  'style': {'persona': '\n    Joan is a very senior financial analyst and focuses on using econometrics to recommend investment strategies. Joan is used to having a team of analysts who they can ask for information, so they may not be up to date with the speficis so they may ask vauge questions. However, they are very knowledgeable about the general topic.\n    ',
   'style': "Joan's writing style is characterized by a tone of authority and expertise, reflecting their senior position as a financial analyst. Their language is formal and technical, often incorporating econometric jargon and 
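
In the output above, `new_question` still holds the model's raw JSON string rather than the question text itself. A minimal sketch of how the variants could be unwrapped before use follows; `unwrap_new_question` is a hypothetical helper, and the sample variant is a shortened stand-in for the real output.

```python
import json

def unwrap_new_question(variant):
    """Return a copy of the variant with `new_question` reduced to plain text when possible."""
    raw = variant["new_question"]
    try:
        parsed = json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        return dict(variant)  # not JSON: leave the value untouched
    if isinstance(parsed, dict) and "new_question" in parsed:
        return {**variant, "new_question": parsed["new_question"]}
    return dict(variant)

# Shortened, hypothetical variant shaped like the notebook output.
variant = {
    "new_question": '{\n"new_question": "What are the projected efficiency gains?" \n}',
    "style": {"persona": "Joan", "style": "Formal and technical."},
    "original_question": "How will the acquisition reduce production costs?",
}
clean = unwrap_new_question(variant)
print(clean["new_question"])
```

Returning a copy instead of mutating the input keeps the raw model replies available for debugging.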

## Conclusion

**Takeaway:** The pipeline above illustrates how one could use a powerful LLM and an embedding model to generate synthetic data for evaluating a retrieval pipeline. It is not *the* pipeline that will solve all your synthetic data needs, but it is a good starting point for thinking about how to synthesize data that addresses them.

**Points of Improvement**
* Basic building blocks such as LLM-as-a-judge can be used to create more filters. We recommend designing additional filters that fit your data. Reward models can also serve as judges to add further types of filters.
* Frameworks like [Self-Discover](https://arxiv.org/pdf/2402.03620) can be used to improve the reasoning and critique capabilities of the LLM so that it adapts to the complexity of the underlying data.
* Further experimentation with personas can extract more behavioural information for targeting specific types of questions.
* This pipeline can be scaled up to generate triplets for fine-tuning your embedding model. The scaled-up pipeline can be incorporated into an Ops pipeline that fine-tunes automatically to continuously improve your RAG pipelines.
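
As a rough illustration of the last point, the sketch below turns question/chunk pairs into (query, positive, negative) triplets. It is a simplification under stated assumptions: `build_triplets` is a hypothetical helper, the chunk texts are made up, and negatives are sampled uniformly at random, whereas a production setup would more likely mine hard negatives.

```python
import random

def build_triplets(pairs, chunks, seed=0):
    """Build fine-tuning triplets.

    pairs:  list of (question, chunk_id) where chunk_id is the source chunk.
    chunks: dict mapping chunk_id to chunk text.
    """
    rng = random.Random(seed)  # seeded for reproducible sampling
    chunk_ids = list(chunks)
    triplets = []
    for question, positive_id in pairs:
        candidates = [cid for cid in chunk_ids if cid != positive_id]
        if not candidates:
            continue  # no other chunk available to act as a negative
        negative_id = rng.choice(candidates)
        triplets.append({
            "query": question,
            "positive": chunks[positive_id],
            "negative": chunks[negative_id],
        })
    return triplets

# Hypothetical chunks and generated questions.
chunks = {
    "c1": "The deal is valued at $3 billion.",
    "c2": "The FTC has raised antitrust concerns.",
    "c3": "SolarPower will keep the R&D center independent.",
}
pairs = [
    ("How much is the GreenTech deal worth?", "c1"),
    ("What concerns does the FTC have about the merger?", "c2"),
]
triplets = build_triplets(pairs, chunks)
print(len(triplets))
```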