Generate evaluation triplets from studies 


1. Parse the document.
2. Extract claims that are associated with references via an LLM and save them in JSON, with the referenced paper as the name.
3. Generate a topic for each claim (e.g. sustainability, business, money, or similar).
4. Repeat steps 1-3 for multiple reports from diverse but related topics (e.g. GenAI, sustainability, business reports).
5. Bundle 2-3 claims from multiple related documents into one claim via an LLM.
6. Generate a question for the bundled claim.
7. Store the question, claim, and associated references in JSON objects.
8. Repeat steps 6 and 7.
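
The steps above end in a list of records, each pairing a generated question with a merged claim and its references. A minimal sketch of that record shape follows, using the field names that the final cell of this notebook writes (`question`, `ground_truth`, `references`, `categories`, `doc_sources`). The `EvaluationEntry` type and the `make_entry` helper are hypothetical names used only for illustration.

```python
from typing import List, TypedDict


class EvaluationEntry(TypedDict):
    """One evaluation record (hypothetical type; field names follow the final cell)."""
    question: str
    ground_truth: str
    references: List[str]
    categories: List[str]
    doc_sources: List[str]


def make_entry(question, merged_claim, references, categories, doc_sources) -> EvaluationEntry:
    # Deduplicate while keeping first-seen order so the output is stable across runs.
    def dedupe(items):
        return list(dict.fromkeys(items))

    return EvaluationEntry(
        question=question.strip(),
        ground_truth=merged_claim.strip(),
        references=dedupe(references),
        categories=dedupe(categories),
        doc_sources=dedupe(doc_sources),
    )
```

Unlike `list(set(...))`, the order-preserving deduplication keeps the serialized JSON identical between runs on the same input.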




In [1]:

from langchain_core.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from langchain.docstore.document import Document
from pydantic import BaseModel, Field
import os
from typing import List
import tiktoken
import json

from gen_ai_hub.proxy.langchain.openai import ChatOpenAI
from gen_ai_hub.proxy.core.proxy_clients import get_proxy_client

proxy_client = get_proxy_client('gen-ai-hub')
llm = ChatOpenAI(proxy_model_name='gpt-4o', proxy_client=proxy_client)

INPUT_DIR = "input_references/"
DATASET = "evaluation_dataset_references.json"
TEMP = "temp_reference/"

## Load Documents

In [17]:
documents = []
for file in os.listdir(INPUT_DIR):
    if file.endswith(".md"):
        file_path = os.path.join(INPUT_DIR, file)
        with open(file_path, "r", encoding="utf-8") as f:
            content = f.read()
            documents.append(Document(page_content=content, metadata={"source": file}))
combined_documents_content = "\n\n".join([doc.page_content for doc in documents])

In [18]:
encoding = tiktoken.encoding_for_model("gpt-4o")
tokens = encoding.encode(combined_documents_content)
print(len(tokens))

951199
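
The count above covers all documents concatenated. Since the extraction step sends one document per request, per-document counts may be more informative. A minimal sketch, assuming an `encode` callable such as `encoding.encode` from the cell above; `token_counts_by_source`, `sources_over_limit`, and the `limit` value are illustrative names and not part of the original notebook.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple


def token_counts_by_source(
    docs: Iterable[Tuple[str, str]],
    encode: Callable[[str], Sequence],
) -> Dict[str, int]:
    """Map each (source, text) pair to the number of tokens in its text."""
    return {source: len(encode(text)) for source, text in docs}


def sources_over_limit(counts: Dict[str, int], limit: int) -> List[str]:
    """Return the sources whose token count exceeds `limit` (an assumed budget)."""
    return sorted(source for source, n in counts.items() if n > limit)
```

With the notebook's objects this could be called as `token_counts_by_source(((d.metadata["source"], d.page_content) for d in documents), encoding.encode)`.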


## Extract claims with references using an LLM chain

In [None]:
class Response(BaseModel):
    claim: str = Field(description="Claim")
    references: List[str] = Field(description="List of references")
    dois: List[str] = Field(description="List of DOIs corresponding to each reference in the same order as references, or empty if not found")

class Responses(BaseModel):
    responses: list[Response] = Field(description="List of responses")

parser = PydanticOutputParser(pydantic_object=Responses)

extract_prompt = PromptTemplate(
    input_variables=["text"],
    partial_variables={"format_output": parser.get_format_instructions()},
    template="""You are a knowledgeable assistant specializing in academic research analysis. Your task is to identify and extract **research-based claims** from the provided document. Additionally, you will match each inline citation in the claim to its corresponding entry in the references section and extract the DOI from that entry.

**Definition of a Research-Based Claim:**
- A statement that presents a finding, result, or implication derived from empirical studies.
- Typically includes analysis, comparisons, or conclusions drawn from data.
- Supported by one or more inline citations in the format "(Author et al., YEAR), (Author and Author, YEAR) or (Author, YEAR)".
- **The extracted claim should exclude any reference information or inline citations, presenting a standalone statement.**

**Exclude the following from extraction:**
- Definitions or historical accounts.
- General statements without empirical support.
- Descriptions of concepts without presenting findings.

**Tasks:**
1. **Extract each research-based claim without any reference information or inline citations.** The claim should be a standalone statement free of author names, years, or citation markers.
2. For each claim, extract the list of inline citations that appear in that claim.
3. For each inline citation (Author et al., YEAR), match it to the corresponding full reference in the references section (if available within the provided text).
4. From the matched full reference, extract the DOI (if present).
5. If no DOI is found for a given citation, return an empty string for that DOI.

<document>
{text}
</document>

**The output MUST strictly adhere to the following JSON format and include only this JSON without any additional text:**
{format_output}
""", 
)

extract_chain = extract_prompt | llm | parser

claims = []
for doc in documents:
    response = extract_chain.invoke({"text": doc.page_content})
    for res in response.responses:
        claim_dict = res.dict()
        claim_dict["doc_source"] = doc.metadata["source"]
        claims.append(claim_dict)


print("Number of claims: ", len(claims))
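
The prompt asks for DOIs "in the same order as references", with an empty string when none is found, but the model is not guaranteed to return two lists of equal length. A defensive way to pair them is sketched below; `pair_references_with_dois` is a hypothetical helper, and dropping surplus DOIs is an assumption about the desired behaviour.

```python
from itertools import zip_longest
from typing import Dict, List, Tuple


def pair_references_with_dois(claim: Dict) -> List[Tuple[str, str]]:
    """Pair each reference with its DOI, padding missing DOIs with ''."""
    references = claim.get("references", [])
    dois = claim.get("dois", [])
    # Ignore surplus DOIs that have no matching reference.
    dois = dois[: len(references)]
    return [
        (ref, (doi or "").strip())
        for ref, doi in zip_longest(references, dois, fillvalue="")
    ]
```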


In [19]:
for claim in claims:    
    print(json.dumps(claim, indent=4, ensure_ascii=False))

{
    "claim": "The accounting sector of a company can promote environmental conservation through environmental costs, and at the same time, improve performance when implementing a Corporate Sustainability Management System (CSMS).",
    "references": [
        "Endiana et al., 2020"
    ],
    "dois": [
        "10.13106/jafeb.2020.vol7.no12.731"
    ],
    "doc_source": "s10668-023-02933-7.md"
}
{
    "claim": "Allocating appropriate environmental costs through CSMS can effectively improve the company’s financial performance.",
    "references": [
        "Endiana et al., 2020"
    ],
    "dois": [
        "10.13106/jafeb.2020.vol7.no12.731"
    ],
    "doc_source": "s10668-023-02933-7.md"
}
{
    "claim": "A proper application of CSMS, with the disclosure of environmental activities and costs, can enhance customer loyalty.",
    "references": [
        "Endiana et al., 2020"
    ],
    "dois": [
        "10.13106/jafeb.2020.vol7.no12.731"
    ],
    "doc_source": "s10668-023-02933-7

In [None]:
with open(TEMP+"related_work_single_claims.json", "w", encoding="utf-8") as f:
    json.dump(claims, f, indent=4, ensure_ascii=False)

In [6]:
with open(TEMP+"related_work_single_claims.json", "r", encoding="utf-8") as f:
    claims = json.load(f)

print(len(claims))

count = sum(
    1
    for entry in claims
    if any(doi.strip() for doi in entry.get('dois', []))
)

print("Number of claims with at least one non-empty DOI:", count)


85
Number of claims with at least one non-empty DOI: 77


In [5]:
filtered_claims = []
unique_dois = set()

for claim in claims:
    valid_dois = []
    for doi in claim["dois"]:
        if doi:
            clean_doi = doi.replace("https://doi.org/", "")
            valid_dois.append(clean_doi)
    
    if valid_dois:
        filtered_claims.append(claim)
        unique_dois.update(valid_dois)

with open("unique_dois2.txt", "w", encoding="utf-8") as f:
    for doi in sorted(unique_dois):
        f.write(doi + "\n")

print("Total extracted claims: ", len(claims))
print("Filtered claims with at least one DOI: ", len(filtered_claims))
print("Total unique DOIs saved: ", len(unique_dois))


Total extracted claims:  85
Filtered claims with at least one DOI:  77
Total unique DOIs saved:  48
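
The cell above strips only the `https://doi.org/` prefix. If the extracted DOIs arrive in other common shapes, a slightly broader normalizer may reduce duplicates. This is a sketch under assumptions: the prefix list is not exhaustive, `normalize_doi` is a hypothetical helper, and lowercasing relies on DOIs being case-insensitive.

```python
import re

# Prefixes that commonly precede a bare DOI (an assumed, non-exhaustive list).
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(raw: str) -> str:
    """Strip whitespace and a leading resolver URL or 'doi:' label, then lowercase."""
    doi = _DOI_PREFIX.sub("", raw.strip())
    return doi.lower()
```

Applying it inside the loop, in place of the `replace` call, would leave the rest of the filtering unchanged.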


## Fetch and filter cited Reports 

In [None]:
!python -m PyPaperBot --doi-file=unique_dois2.txt --dwn-dir=input_references

Add any papers that could not be fetched automatically by hand.

## Categorize the claims 

In [7]:
from pydantic import BaseModel, Field
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser

class CategoryResponse(BaseModel):
    category: str = Field(description="The category of the given claim")

parser = PydanticOutputParser(pydantic_object=CategoryResponse)

extract_prompt = PromptTemplate(
    input_variables=["claim", "categories"],
    partial_variables={"format_output": parser.get_format_instructions()},
    template="""You have a list of existing categories:
{categories}

Please classify the following claim into one of these categories if it fits well. 
If it does not fit into any of the listed categories, create a new category name.
Here are some examples for Categories you can use, or create new ones if you need to:
"Business", "Sustainability", "Technology", "Economics", "Innovation", etc.

Claim: {claim}

Return only the final chosen or newly created category as a JSON object following the format instructions.

{format_output}
"""
)

extract_chain = extract_prompt | llm | parser

categories = []
claims_w_category = []
for item in filtered_claims:
    categories_str = ", ".join(categories) if categories else "No categories yet."
    response = extract_chain.invoke({"claim": item["claim"], "categories": categories_str})
    chosen_category = response.category.strip()
    if chosen_category not in categories:
        categories.append(chosen_category)

    item['category'] = chosen_category
    claims_w_category.append(item)

print("Final Claims with Categories:", claims_w_category)
print("All Discovered Categories:", categories)
print("Number of categories:", len(categories))

Final Claims with Categories: [{'claim': 'The accounting sector of a company can promote environmental conservation through environmental costs, and at the same time, improve performance when implementing a Corporate Sustainability Management System (CSMS).', 'references': ['Endiana et al., 2020'], 'dois': ['10.13106/jafeb.2020.vol7.no12.731'], 'doc_source': 's10668-023-02933-7.md', 'category': 'Sustainability'}, {'claim': 'Allocating appropriate environmental costs through CSMS can effectively improve the company’s financial performance.', 'references': ['Endiana et al., 2020'], 'dois': ['10.13106/jafeb.2020.vol7.no12.731'], 'doc_source': 's10668-023-02933-7.md', 'category': 'Sustainability'}, {'claim': 'A proper application of CSMS, with the disclosure of environmental activities and costs, can enhance customer loyalty.', 'references': ['Endiana et al., 2020'], 'dois': ['10.13106/jafeb.2020.vol7.no12.731'], 'doc_source': 's10668-023-02933-7.md', 'category': 'Sustainability'}, {'claim
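
The loop above treats a category as new whenever the returned string is not an exact match, so variants such as `sustainability` and `Sustainability` would be counted separately. A minimal sketch of a case-insensitive lookup follows; `resolve_category` is a hypothetical helper and the quote stripping is an assumption about how the model might format its answer.

```python
from typing import List


def resolve_category(candidate: str, known: List[str]) -> str:
    """Return the existing category matching `candidate` case-insensitively,
    or register and return the cleaned candidate as a new category."""
    cleaned = " ".join(candidate.split())  # collapse stray whitespace
    cleaned = cleaned.strip("\"'")         # drop surrounding quotes, if any
    for existing in known:
        if existing.casefold() == cleaned.casefold():
            return existing
    known.append(cleaned)
    return cleaned
```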

In [8]:
from collections import Counter, defaultdict

# 1. Number of Claims in Each Category
category_counts = Counter(item['category'] for item in claims_w_category)
print("\nNumber of Claims in Each Category:")
for category, count in category_counts.items():
    print(f" - {category}: {count}")

# 2. Number of Claims per Paper
claims_per_paper = Counter(item['doc_source'] for item in claims_w_category)
print("\nNumber of Claims per Paper:")
for doc_source, count in claims_per_paper.items():
    print(f" - {doc_source}: {count}")

# 3. Number of Each Category per Paper
categories_per_paper = defaultdict(Counter)
for item in claims_w_category:
    doc_source = item['doc_source']
    category = item['category']
    categories_per_paper[doc_source][category] += 1

print("\nNumber of Each Category per Paper:")
for doc_source, category_counter in categories_per_paper.items():
    print(f"\nPaper: {doc_source}")
    for category, count in category_counter.items():
        print(f"  - {category}: {count}")



Number of Claims in Each Category:
 - Sustainability: 22
 - Corporate Governance: 3
 - Business: 3
 - Economics: 3
 - Technology: 35
 - Innovation: 11

Number of Claims per Paper:
 - s10668-023-02933-7.md: 18
 - s11301-021-00211-2 (1).md: 4
 - ssrn-3708495 (2).md: 5
 - Use_Cases_for_Generative_AI_in_organizations.md: 50

Number of Each Category per Paper:

Paper: s10668-023-02933-7.md
  - Sustainability: 17
  - Corporate Governance: 1

Paper: s11301-021-00211-2 (1).md
  - Corporate Governance: 1
  - Sustainability: 2
  - Business: 1

Paper: ssrn-3708495 (2).md
  - Economics: 2
  - Sustainability: 3

Paper: Use_Cases_for_Generative_AI_in_organizations.md
  - Technology: 35
  - Innovation: 11
  - Economics: 1
  - Business: 2
  - Corporate Governance: 1


## Bundle and Merge Claims

In [9]:
from collections import defaultdict
from langchain.prompts import PromptTemplate

for idx, claim in enumerate(claims_w_category):
    claim["id"] = idx

claims_by_category = defaultdict(list)
for c in claims_w_category:
    claims_by_category[c["category"]].append(c)

all_claims = [c for cat_list in claims_by_category.values() for c in cat_list]

merge_prompt = PromptTemplate(
    input_variables=["claims_text"],
    template="""You are a writing assistant.
Merge the following claims into a single cohesive claim that captures all their key points but don't merely summarize them. Integrate all elements so that the unified claim represents them collectively:

Claims:
{claims_text}

Return one paragraph unifying these claims.
"""
)

group_size = 3
merged_claims = []

def get_other_category_claim(current_category, used_claims):
    for c in all_claims:
        if c["category"] != current_category and c["id"] not in used_claims:
            return c
    return None

def get_other_document_claim(current_doc_source, used_claims):
    for c in all_claims:
        if c["doc_source"] != current_doc_source and c["id"] not in used_claims:
            return c
    return None

used_claims = set()

for cat, cat_claims in claims_by_category.items():
    for i in range(0, len(cat_claims), group_size):
        group = cat_claims[i:i+group_size]

        # if not enough claims in this category, try adding another category claim
        if len(group) < group_size:
            other_claim = get_other_category_claim(cat, used_claims)
            if other_claim is not None:
                group.append(other_claim)
                used_claims.add(other_claim["id"])

        if len(group) < 2:
            continue

        # check category diversity
        group_categories = {g["category"] for g in group}
        if len(group_categories) == 1:
            other_claim = get_other_category_claim(cat, used_claims)
            if other_claim is not None:
                replaced = group.pop()
                group.append(other_claim)
                used_claims.add(other_claim["id"])

        # check document diversity
        group_doc_sources = {g["doc_source"] for g in group}
        if len(group_doc_sources) == 1:
            current_doc_source = group[0]["doc_source"]
            other_doc_claim = get_other_document_claim(current_doc_source, used_claims)
            if other_doc_claim is not None:
                replaced = group.pop()
                group.append(other_doc_claim)
                used_claims.add(other_doc_claim["id"])

        # mark all claims in this group as used
        for g in group:
            used_claims.add(g["id"])

        claims_text = "\n".join([f"- {g['claim']}" for g in group])
        merged_claim = (merge_prompt | llm).invoke({"claims_text": claims_text})

        all_refs = []
        all_cats = []
        all_doc_sources = []
        for g in group:
            all_refs.extend(g.get("references", []))
            all_cats.append(g["category"])
            all_doc_sources.append(g["doc_source"])

        all_refs = list(set(all_refs))
        all_cats = list(set(all_cats))
        all_doc_sources = list(set(all_doc_sources))

        merged_claims.append({
            "original_claims": group,
            "merged_claim": merged_claim.content.strip(),
            "categories": all_cats, 
            "merged_references": all_refs, 
            "doc_sources": all_doc_sources 
        })

print("Merged Claims:")
for mc in merged_claims:
    print(mc)


Merged Claims:
{'original_claims': [{'claim': 'The accounting sector of a company can promote environmental conservation through environmental costs, and at the same time, improve performance when implementing a Corporate Sustainability Management System (CSMS).', 'references': ['Endiana et al., 2020'], 'dois': ['10.13106/jafeb.2020.vol7.no12.731'], 'doc_source': 's10668-023-02933-7.md', 'category': 'Sustainability', 'id': 0}, {'claim': 'Allocating appropriate environmental costs through CSMS can effectively improve the company’s financial performance.', 'references': ['Endiana et al., 2020'], 'dois': ['10.13106/jafeb.2020.vol7.no12.731'], 'doc_source': 's10668-023-02933-7.md', 'category': 'Sustainability', 'id': 1}, {'claim': 'Both CSR performance and environmental performance increase financial performance.', 'references': ['Vishwanathan et al., 2020', 'Busch and Friede, 2018a', 'Friede et al., 2015', 'Wang et al., 2016', 'Orlitzky et al., 2003', 'Frooman, 1997', 'Albertini, 2013'], 
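
In the loop above, a claim borrowed from another category or document is added to `used_claims`, but the per-category slices do not consult that set, so the same claim can end up in more than one group. The sketch below is an alternative greedy grouping in which each claim is used at most once and each group spans at least two documents where possible. It is not the notebook's method, `bundle_claims` is a hypothetical name, and it would yield different groups than the run shown here.

```python
from typing import Dict, List


def bundle_claims(claims: List[Dict], group_size: int = 3) -> List[List[Dict]]:
    """Greedily group claims so that no claim is reused and, where possible,
    each group contains claims from at least two different documents."""
    remaining = list(claims)
    groups = []
    while len(remaining) >= 2:
        group = [remaining.pop(0)]
        # First pick one claim from a different document, if any is left.
        for i, c in enumerate(remaining):
            if c["doc_source"] != group[0]["doc_source"]:
                group.append(remaining.pop(i))
                break
        # Fill the rest of the group, preferring the seed claim's category.
        while len(group) < group_size and remaining:
            same_cat = [i for i, c in enumerate(remaining)
                        if c["category"] == group[0]["category"]]
            group.append(remaining.pop(same_cat[0] if same_cat else 0))
        groups.append(group)
    return groups
```

A single leftover claim is dropped, which mirrors the `len(group) < 2` check in the original loop.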

In [14]:

question_prompt = PromptTemplate(
    input_variables=["merged_claim"],
    template=
    """You have the following merged claim:
    {merged_claim}
    
    Generate a single question that, if answered, would be naturally resolved by this claim.
    This question should:
    - Integrate insights from multiple logically connected business domains (e.g., finance, sustainability, technology, operations).
    - Require synthesizing and validating information potentially from multiple sections or sources, reflecting the complexity of cross-document or intra-document retrieval.
    - Demand abstract reasoning and strategic-level interpretation, going beyond simple fact retrieval.
    
    The resulting question should be global in scope, and should not be answerable without the information provided in the claim.
    """
)

final_dataset = []

for mc in merged_claims:
    question_response = (question_prompt | llm).invoke({"merged_claim": mc["merged_claim"]})
    question = question_response.content.strip()
    
    entry = {
        "question": question,
        "ground_truth": mc["merged_claim"],
        "references": mc["merged_references"],
        "categories": mc["categories"],
        "doc_sources": mc["doc_sources"]
    }
    
    final_dataset.append(entry)


with open(DATASET, "w", encoding="utf-8") as f:
    json.dump({"responses": final_dataset}, f, indent=4, ensure_ascii=False)


for entry in final_dataset:
    print(json.dumps(entry, indent=4, ensure_ascii=False))

{
    "question": "How can the integration of a Corporate Sustainability Management System (CSMS) in the accounting sector strategically enhance a company's financial performance while simultaneously advancing environmental conservation and Corporate Social Responsibility (CSR)?",
    "ground_truth": "Implementing a Corporate Sustainability Management System (CSMS) in the accounting sector can simultaneously advance environmental conservation and enhance company performance by strategically allocating environmental costs. This approach not only bolsters financial outcomes but also elevates Corporate Social Responsibility (CSR) and environmental performance, creating a synergistic effect that reinforces the company's overall financial health. By integrating environmental considerations into accounting practices, companies can achieve a sustainable balance between ecological responsibility and economic success.",
    "references": [
        "Frooman, 1997",
        "Endiana et al., 2020"

In [15]:
print(len(final_dataset))

27
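
Before using the dataset for evaluation, it may help to confirm that every entry carries all five fields with non-empty values, since both the merged claim and the question come from free-form LLM output. A minimal sketch; `REQUIRED_KEYS` and `find_invalid_entries` are hypothetical names, and treating empty values as invalid is an assumption.

```python
from typing import Dict, List

REQUIRED_KEYS = ("question", "ground_truth", "references", "categories", "doc_sources")


def find_invalid_entries(entries: List[Dict]) -> List[int]:
    """Return indices of entries with a missing key or an empty value."""
    invalid = []
    for idx, entry in enumerate(entries):
        if any(not entry.get(key) for key in REQUIRED_KEYS):
            invalid.append(idx)
    return invalid
```

Calling `find_invalid_entries(final_dataset)` would return an empty list when every record is complete.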
