In [None]:
Rel_Prompt = '''You are a Relevance Evaluation Agent, an expert in assessing the relevance of retrieved text chunks from a vector
database against an input query for the task of generating limitations of scientific articles. Your task is to evaluate the relevance
of 10 retrieved text chunks against an input query, which consists of a scientific paper (including sections: Abstract, Introduction,
Methodology, Related Work, Experiment and Results, Limitations, and Future Work) and its rewritten version. For each chunk, assign a
relevance score from 1 (least relevant) to 10 (most relevant) based on semantic and contextual alignment with the input query, and
provide a brief justification for the score.

Input:

Input Query: [The full text of the original scientific paper and its rewritten version]
Retrieved Text Chunks: A list of 10 text chunks, each with a unique identifier and content, formatted as:

Chunk 1: [chunk_id_1]: [retrieved_text_1]
Chunk 2: [chunk_id_2]: [retrieved_text_2]
Chunk 3: [chunk_id_3]: [retrieved_text_3]
Chunk 4: [chunk_id_4]: [retrieved_text_4]
Chunk 5: [chunk_id_5]: [retrieved_text_5]
Chunk 6: [chunk_id_6]: [retrieved_text_6]
Chunk 7: [chunk_id_7]: [retrieved_text_7]
Chunk 8: [chunk_id_8]: [retrieved_text_8]
Chunk 9: [chunk_id_9]: [retrieved_text_9]
Chunk 10: [chunk_id_10]: [retrieved_text_10]

Instructions:

Evaluate Relevance: For each of the 10 retrieved text chunks, assess its relevance to the input query based on semantic and contextual
alignment with the original and rewritten scientific paper. Consider how closely the chunk matches key concepts, arguments, or details
in the query.

Assign Relevance Score:
High Scores (8–10): The chunk has strong semantic and contextual alignment with the input query, closely matching key concepts or details.
Prioritize chunks containing limitations (e.g., study constraints, challenges) or methodological summaries (e.g., study design, methods),
boosting their score by 1–2 points if they align well with the query.

Medium Scores (4–7): The chunk has moderate semantic and contextual alignment, containing relevant but less central content (e.g., results,
general context, or partial methodological details).

Low Scores (1–3): The chunk has minimal or no semantic and contextual alignment, such as unrelated content, generic statements, or
off-topic information.

Prioritize Limitations and Methodology: Chunks explicitly discussing limitations (e.g., sample size, data constraints, scope issues) or
methodological summaries (e.g., study design, experimental setup) are highly relevant. Boost their score by 1–2 points if they align
well with the input query, compared to other relevant content.

Provide Justification: For each chunk, include a brief justification explaining the assigned score, referencing the chunk’s semantic and
contextual alignment with the input query and noting whether it contains limitations or methodological summaries.

Do Not Modify Text: Evaluate each chunk as provided, without modifying or paraphrasing the retrieved text.

Handle Irrelevant Chunks: If a chunk is unrelated to the input query or lacks meaningful content, assign a score of 1 with an appropriate
justification.

Workflow:
Plan: Review the input query (original and rewritten paper) and the 10 retrieved text chunks to understand their content and context.

Reasoning:
Step 1: For each chunk, identify its main topic or content (e.g., limitations, methodology, results, background).
Step 2: Compare the chunk’s content to the input query, assessing semantic and contextual alignment with the paper’s sections
(e.g., Limitations, Methodology).
Step 3: Assign a relevance score (1–10) based on alignment, prioritizing limitations and methodological summaries.
Step 4: Write a brief justification for the score, explaining the chunk’s relevance and any priority given to limitations or methodology.
Step 5: Verify the score and justification are accurate and consistent with the chunk’s content and the input query.

Analyze: Use text analysis tools to confirm semantic alignment (e.g., keyword matching for “limitation,” “constraint,” “methodology,” “sample size”) and assess relevance to the input query.
Reflect: Ensure scores and justifications are fair, consistent, and reflect the chunk’s alignment with the query, re-evaluating any ambiguous cases.
Continue: Iterate until all 10 chunks are evaluated with scores and justifications.

Tool Use:
Use text analysis tools to identify limitation-related or methodology-related keywords (e.g., “limited,” “constraint,” “sample size,” “methodology”) and assess semantic similarity between chunks and the input query.
Use semantic similarity checks to confirm alignment between the chunk and the query’s key concepts.

Chain of Thoughts: Document the reasoning process internally for each chunk. For example:
“This chunk mentions a small sample size, a limitation, and aligns closely with the query’s focus, so it receives a high score (9).”
“This chunk discusses results without addressing limitations or methodology, so it receives a medium score (6).”
“This chunk is generic and unrelated to the query’s specific content, so it receives a low score (1).”

Output Format: The output must be in strict JSON format, containing an array of 10 objects, one for each retrieved text chunk, with the
following structure for each object:
"Chunk_number": [Chunk number, e.g., "Chunk 1", "Chunk 2", ..., "Chunk 10"]
"relevance_score": [Integer from 1 to 10]
"justification": [Brief explanation of the score, referencing the chunk’s semantic and contextual alignment with the query and any emphasis on limitations or methodological summaries]

Example: Input: Input Query: [Full text of the original scientific paper and its rewritten version] Retrieved Text Chunks:

Chunk 1: chunk_001: The study was limited by a small sample size, which may affect generalizability.
Chunk 2: chunk_002: The experiment used a randomized controlled trial design to test the algorithm.
Chunk 3: chunk_003: The experiment achieved a 20% improvement in processing speed.
...
Chunk 10: chunk_010: Data processing is a key challenge in modern research.

Output: [ { "Chunk_number": "Chunk 1", "relevance_score": 9, "justification": "The chunk has strong semantic and contextual alignment with
the input query, explicitly discussing a limitation (small sample size), which is a high-priority element for limitation generation." },
{ "Chunk_number": "Chunk 2", "relevance_score": 8, "justification": "The chunk aligns well with the input query by describing the
methodological approach, a high-priority element, though it is slightly less central than limitations-related content." },
{ "Chunk_number": "Chunk 3", "relevance_score": 6, "justification": "The chunk has moderate semantic and contextual alignment,
discussing experimental results, but lacks focus on limitations or methodology, resulting in a mid-range score." },
...
{ "Chunk_number": "Chunk 10", "relevance_score": 3, "justification": "The chunk provides generic background information with minimal
semantic and contextual alignment to the input query’s specific concepts or arguments." } ] '''
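
The output contract described in the prompt above (an array of objects with `Chunk_number`, `relevance_score`, and `justification`) can be checked before any scores are stored. The following is a minimal sketch, assuming the model reply has already been parsed into Python objects; `validate_scores` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
# Hypothetical helper: checks a parsed reply against the schema named in Rel_Prompt.
REQUIRED_KEYS = {"Chunk_number", "relevance_score", "justification"}

def validate_scores(items, expected=10):
    """Return a list of problems found in a parsed relevance reply (empty list means valid)."""
    if not isinstance(items, list):
        return ["reply is not a JSON array"]
    problems = []
    if len(items) != expected:
        problems.append(f"expected {expected} objects, got {len(items)}")
    for pos, item in enumerate(items, start=1):
        if not isinstance(item, dict):
            problems.append(f"item {pos} is not an object")
            continue
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append(f"item {pos} is missing {sorted(missing)}")
        score = item.get("relevance_score")
        # bool is a subclass of int, so it is rejected explicitly
        is_int = isinstance(score, int) and not isinstance(score, bool)
        if not is_int or not 1 <= score <= 10:
            problems.append(f"item {pos} has an invalid score: {score!r}")
    return problems
```

A batch whose validation returns a non-empty list could be logged or retried instead of being written to the dataframe as-is.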

In [None]:
def make_retriever_for_docs(docs, k=20):
    # FAISS
    faiss_store = FAISS.from_documents(docs, hf_emb)
    faiss_r     = faiss_store.as_retriever(search_kwargs={"k": k})

    # BM25
    bm25_r      = BM25Retriever.from_documents(docs)
    bm25_r.k    = k

    # ensemble (50/50 weight)
    return EnsembleRetriever(
        retrievers=[faiss_r, bm25_r],
        weights=[0.5, 0.5]
    )
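
To make the 50/50 ensemble more concrete: LangChain's `EnsembleRetriever` merges the ranked lists of its retrievers with weighted Reciprocal Rank Fusion. The sketch below reimplements that idea in plain Python for illustration only. `weighted_rrf` and the constant `c=60` are assumptions of this sketch, and the library's own implementation may differ in details such as de-duplication and tie-breaking.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each list contributes weight / (rank + c) to a document's score,
    so documents ranked highly by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Toy example: ids stand in for chunks returned by the dense and sparse retrievers.
faiss_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]
fused = weighted_rrf([faiss_hits, bm25_hits], weights=[0.5, 0.5])
print(fused)  # ['b', 'a', 'd', 'c']
```

Because each retriever contributes its own top `k`, the fused list can contain up to `2 * k` distinct chunks.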

In [None]:
MAX_CONTEXT_TOKENS = 127_000
MODEL_NAME         = "gpt-4o-mini"

def count_tokens(text: str, model: str = MODEL_NAME) -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text, disallowed_special=()))

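def truncate_for_context(
    query: str,
    passages: list[str],
    max_tokens: int = MAX_CONTEXT_TOKENS,
    model: str = MODEL_NAME,
) -> list[str]:
    """Keep passages in order until the token budget left after the query is used up."""
    enc       = tiktoken.encoding_for_model(model)
    # disallowed_special=() makes tiktoken encode special-token strings as plain text instead of raising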
    q_tokens  = enc.encode(query, disallowed_special=())
    budget    = max_tokens - len(q_tokens)
    kept, used = [], 0
    for p in passages:
        p_toks = enc.encode(p, disallowed_special=())
        if used + len(p_toks) > budget:
            if budget - used > 0:
                kept.append(enc.decode(p_toks[:(budget - used)]))
            break
        kept.append(p)
        used += len(p_toks)
    return kept


def ensure_passages_within_budget(
    query: str,
    passages: list[str],
    max_tokens: int = MAX_CONTEXT_TOKENS,
    model: str = MODEL_NAME,
) -> list[str]:
    """
    Returns `passages` truncated so that
    count_tokens(query + concat(passages)) ≤ max_tokens.
    """
    # count full size
    total = count_tokens(query + "\n\n".join(passages), model=model)
    if total <= max_tokens:
        return passages

    print(f"Truncating context ({total} tokens)…")
    # truncate_for_context only returns the shorter passages list
    return truncate_for_context(query, passages, max_tokens=max_tokens, model=model)
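
The budget logic above can be traced on a small example. The sketch below swaps `tiktoken` for a whitespace tokenizer so that it runs without any dependency; `toy_encode`, `toy_decode`, and `truncate_by_budget` are hypothetical stand-ins, and real token counts will differ from word counts.

```python
def toy_encode(text):
    # Stand-in tokenizer: one token per whitespace-separated word.
    return text.split()

def toy_decode(tokens):
    return " ".join(tokens)

def truncate_by_budget(query, passages, max_tokens):
    """Mirror of the truncation loop: whole passages first, then a partial one, then stop."""
    budget = max_tokens - len(toy_encode(query))
    kept, used = [], 0
    for passage in passages:
        tokens = toy_encode(passage)
        remaining = budget - used
        if len(tokens) > remaining:
            if remaining > 0:
                # Fill the leftover budget with the start of this passage.
                kept.append(toy_decode(tokens[:remaining]))
            break
        kept.append(passage)
        used += len(tokens)
    return kept

result = truncate_by_budget(
    "what limits this",                                  # 3 tokens, leaving a budget of 5
    ["one two three", "four five six seven", "eight"],
    max_tokens=8,
)
print(result)  # ['one two three', 'four five']
```

Note that passages after the first one that overflows are dropped even if they would fit on their own, which matches the `break` in the loop above.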


In [None]:
import os
import base64
import re
import json
import time
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from langchain.text_splitter import CharacterTextSplitter
from rank_bm25 import BM25Okapi
from langchain.vectorstores import FAISS
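from langchain.retrievers import BM25Retriever, EnsembleRetriever  # EnsembleRetriever is used in make_retriever_for_docs
from langchain.schema import Document  # Document is used when building lc_docs below
from langchain.embeddings import HuggingFaceEmbeddings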
from openai import AzureOpenAI, RateLimitError
import tiktoken


df['top20_docs']            = ''
df['top20_suffixes']        = ''
df['top20_col_names']       = ''
df['top20_ref_ids']         = ''
df['retrieved_text_llm_asses'] = ''
df['faiss_bm_top_5']        = ''

start = time.time()

for i in range(len(df)): # len(df)
    row = df.iloc[i]
    print("i is", i)

    # 1) Embed own text
    inputs = {
        col: row[col].strip()
        for col in main_cols
        if isinstance(row.get(col), str) and row[col].strip()
    }
    if not inputs:
        continue
    own_emb = embed_model.encode(
        " ".join(inputs.values()), convert_to_tensor=True
    )

    # 2) Gather all ref sims
    ref_sims = []
    for ref_id in range(1, 116):
        texts = [
            row[f"neurips_ref_{ref_id}_{suf}"].strip()
            for suf in ref_suffixes
            if isinstance(row.get(f"neurips_ref_{ref_id}_{suf}"), str)
               and row[f"neurips_ref_{ref_id}_{suf}"].strip()
        ]
        if not texts:
            continue
        ref_emb = embed_model.encode(
            " ".join(texts), convert_to_tensor=True
        )
        sim = cosine_similarity(
            ref_emb.cpu().numpy().reshape(1, -1),
            own_emb.cpu().numpy().reshape(1, -1)
        )[0][0]
        ref_sims.append((ref_id, sim))
    if not ref_sims:
        continue

    # 3) Dump & chunk
    selected = [rid for rid, _ in ref_sims]
    keep_cols = [
        f"neurips_ref_{rid}_{suf}"
        for rid in selected
        for suf in ref_suffixes
        if isinstance(row.get(f"neurips_ref_{rid}_{suf}"), str)
           and row[f"neurips_ref_{rid}_{suf}"].strip()
    ]
    tmp_csv = "df_rag_train.csv"
    df.loc[[i], keep_cols].dropna(axis=1, how="all") \
        .to_csv(tmp_csv, index=False)

    df_rag = pd.read_csv(tmp_csv)
    lc_docs = []
    for col in df_rag.columns:
        txt = df_rag.at[0, col]
        if isinstance(txt, str) and txt.strip():
            suf = col.rsplit("_", 1)[-1]
            lc_docs.append(Document(
                page_content=txt.strip(),
                metadata={"column_suffix": suf, "col_name": col}
            ))

    splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=32)
    chunked = splitter.split_documents(lc_docs)

    print(f"Number of chunks created: {len(chunked)}")

    # 4) Build query
    input_query = (
        f"Scientific paper:\n{row['response_string_extensive']}\n\n"
        f"Rewritten version of scientific paper:\n{row['Input_Query_rewrite']}"
    )

    # 5) Ensemble top-20 then slice
    ensemble = make_retriever_for_docs(chunked, k=20)
    all_hits = ensemble.get_relevant_documents(input_query)
    top20 = all_hits[:20]

    top20_texts, top20_sufs, top20_cols, top20_refs = [], [], [], []
    for doc in top20:
        top20_texts.append(doc.page_content)
        top20_sufs.append(doc.metadata["column_suffix"])
        top20_cols.append(doc.metadata["col_name"])
        top20_refs.append(int(doc.metadata["col_name"].split("_")[2]))

    df.at[i, "top20_docs"]      = top20_texts
    df.at[i, "top20_suffixes"]  = top20_sufs
    df.at[i, "top20_col_names"] = top20_cols
    df.at[i, "top20_ref_ids"]   = top20_refs

    # 6) Batch LLM calls of 10 chunks each with token budget
    all_llm_scores = []
    prefix = (
        f"{Rel_Prompt}\n\n"
        f"Input Query:\n{input_query}\n\n"
        "Here are up to 10 retrieved text chunks:\n"
    )
    prefix_len = count_tokens(prefix, model=MODEL_NAME)
    question = (
        "\n\nOn a scale of 1–10, how relevant is each chunk to the above Input Query? "
        'Respond with a JSON array containing "Chunk_number", "relevance_score", and "justification" for each chunk.'
    )
    question_len = count_tokens(question, model=MODEL_NAME)

    for batch_start in (0, 10):
        batch_docs  = top20[batch_start:batch_start+10]
        batch_texts = [d.page_content for d in batch_docs]

        # truncate passages to fit
        available = MAX_CONTEXT_TOKENS - prefix_len - question_len
        kept, used = [], 0
        enc = tiktoken.encoding_for_model(MODEL_NAME)
        for p in batch_texts:
            toks = enc.encode(p, disallowed_special=())
            if used + len(toks) > available:
                break
            kept.append(p)
            used += len(toks)

        chunks_list = "\n\n".join(
            f"Chunk {batch_start+idx+1}: {text}"
            for idx, text in enumerate(kept)
        )
        prompt = prefix + chunks_list + question

        raw = run_critic(prompt)
        m = re.search(r"```json\s*(\[[\s\S]*?\])\s*```", raw)
        json_text = m.group(1) if m else raw
        try:
            batch_scores = json.loads(json_text)
        except json.JSONDecodeError:
            batch_scores = [
                s.strip().strip('"')
                for s in json_text.strip("[] \n").split(",")
                if s.strip()
            ]
        all_llm_scores.extend(batch_scores)

    df.at[i, "retrieved_text_llm_asses"] = all_llm_scores

    if i in (0, 20):
        print('all_llm_scores', all_llm_scores)

    # 7) Baseline top-5 final
    final5 = top20[:5]
    df.at[i, "faiss_bm_top_5"] = "\n\n".join(d.page_content for d in final5)

# 8) Save and timing
df.to_csv("/df_neurips_lim_OR_with_cited_data_rerank_LLM_rag_listwise.csv",index=False)
print("Total time:", time.time() - start)
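
The reply parsing in step 6 depends on the model wrapping its answer in a fenced JSON block. A slightly more defensive variant is sketched below, assuming replies are either fenced or contain a bare JSON array somewhere in the text. `extract_json_array` is a hypothetical helper, and returning `None` on failure (rather than splitting on commas) is a design choice of this sketch, not the notebook's behaviour.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this example can itself live inside a fenced block
FENCED_ARRAY = re.compile(FENCE + r"(?:json)?\s*(\[.*\])\s*" + FENCE, re.DOTALL)

def extract_json_array(raw):
    """Return the JSON array found in an LLM reply, or None if nothing parses."""
    fenced = FENCED_ARRAY.search(raw)
    candidate = fenced.group(1) if fenced else raw
    # Fall back to the outermost bracket pair when there is surrounding prose.
    start, end = candidate.find("["), candidate.rfind("]")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, list) else None

fenced_reply = "Here are the scores:\n" + FENCE + 'json\n[{"relevance_score": 9}]\n' + FENCE
print(extract_json_array(fenced_reply))              # [{'relevance_score': 9}]
print(extract_json_array("Scores: [1, 2, 3] done"))  # [1, 2, 3]
print(extract_json_array("no json here"))            # None
```

Returning `None` makes parse failures explicit, so the caller can store the raw reply separately instead of mixing string fragments with score objects.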


In [None]:
agent_prompts = {
    "Extractor": Extractor,
    "Analyzer":  Analyzer,
    "Reviewer":  Reviewer,
    "Citation":  Citation,
}
agents = ["Extractor", "Analyzer", "Reviewer", "Citation"]

parsed_feedback_rows = []
metrics = ['Depth','Originality','Actionability','Topic_Coverage']
judge_feedback = []
generated_limitations = []


df["failed_parse_assessment"] = ""

# Initialize df_all before the loop
df_all = pd.DataFrame()

for i in range(len(df)): # len(df)
    print(f"\nProcessing row {i}")
    row = df.iloc[i]

    # 1) Embed the paper's own non-empty main sections
    inputs = {col: row[col].strip() for col in main_cols
              if isinstance(row.get(col), str) and row[col].strip()}
    if not inputs:
        print(f"Row {i} has no valid main sections.")
        continue
    own_text = " ".join(inputs.values())
    own_emb  = embed_model.encode(own_text, convert_to_tensor=True)

    keep_cols = []
    # 2) Measure similarity for each referenced paper
    ref_sims = []
    for ref_id in range(1, 116):
        # gather all non‐empty sections for this reference
        texts = []
        for suf in ref_suffixes:
            col = f"neurips_ref_{ref_id}_{suf}"
            t   = row.get(col, "")
            if isinstance(t, str) and t.strip():
                texts.append(t.strip())
        if not texts:
            continue
        # concatenate and embed
        ref_text = " ".join(texts)
        ref_emb  = embed_model.encode(ref_text, convert_to_tensor=True)
        # cosine similarity requires numpy arrays
        sim = cosine_similarity(
            ref_emb.cpu().numpy().reshape(1, -1),
            own_emb.cpu().numpy().reshape(1, -1)
        )[0][0]
        ref_sims.append((ref_id, sim))

    if not ref_sims:
        print(" No references to filter; skipping row.")
        continue

    # 3) Gap‐based filtering on the full ref_sims list
    # Sort descending by similarity, compute adjacent gaps, pick those before the largest gap
    ref_sims.sort(key=lambda x: x[1], reverse=True)
    sims = [sim for _, sim in ref_sims]
    # print("sims are",sims)
    if len(sims) < 2:
        # Fewer than 2 refs → keep them all
        selected = [rid for rid, _ in ref_sims]
    else:
        # compute gaps between adjacent sims (index j, to avoid confusion with the row index i)
        gaps = [sims[j] - sims[j + 1] for j in range(len(sims) - 1)]
        # find the index of the largest jump
        max_gap_idx = gaps.index(max(gaps))
        # keep every ref up through that jump
        selected = [rid for rid, _ in ref_sims[: max_gap_idx + 1]]

    # 4) Rebuild keep_cols using only the selected ref_ids
    #    (row.get returns "" for missing columns, so only existing non-empty string columns are kept)
    keep_cols = []
    for ref_id in selected:
        for suf in ref_suffixes:
            col = f"neurips_ref_{ref_id}_{suf}"
            t   = row.get(col, "")
            if isinstance(t, str) and t.strip():
                keep_cols.append(col)

    csv_path = "/media/ibrahim/Extreme SSD/Limitations Data/RAG/RAG1/df_rag_train.csv"
    # drop the column which has 'NaN' value
    df.loc[[i], keep_cols].dropna(axis=1, how="all").to_csv(csv_path, index=False)

    # load & chunk
    df_rag = pd.read_csv(csv_path)
    lc_docs = []
    for col in df_rag.columns:
        text = df_rag.loc[0, col]
        if isinstance(text, str) and text.strip():
            lc_docs.append(
                Document(
                    page_content=text.strip(),
                    metadata={"source_column": col}
                )
            )
    # lc_docs  = CSVLoader(file_path=csv_path).load()
    chunked  = CharacterTextSplitter(chunk_size=512, chunk_overlap=64).split_documents(lc_docs)
    # retrieve top‐k docs
    retriever = make_retriever_for_docs(chunked, k=3)
    # calling 'get_relevant_documents' from langchain.retriever
    docs      = retriever.get_relevant_documents(row["response_string_all"])
    # 'passages' contains the relevant documents from vector database
    passages  = [d.page_content for d in docs]
    # passages_str = "\n\n".join(passages)

    # build query (input paper + system prompt)
    query = "Here are the all sections of a paper: " + row["response_string_all"] + '\n\n' + system_prompt
    # print("query is",query)

    # tokenize + truncate if needed, passages contains retrieved text
    passages = [p.replace("<|endoftext|>", "") for p in passages]

    passages = ensure_passages_within_budget(query, passages)
    # assemble final prompt
    # prompt = (
    #     "Context:\n" + "\n\n".join(passages) +
    #     "\n\nQuestion: " + row["response_string_all"] +
    #     "\nAnswer (limitations):"
    # )
    retrieved_text = "\n\n".join(passages)  # passages contains the content from cited papers (vector database)
    # print("retrieved text is", retrieved_text)
    # ── Store them in df ──
    df.at[i, "query"] = query
    df.at[i, "retrieved_text"] = retrieved_text

    extractor_agent = run_critic(Extractor + query)
    analyzer_agent = run_critic(Analyzer + query)
    reviewer_agent = run_critic(Reviewer + query)
    citation_agent = run_critic(Citation + retrieved_text)

    # ── Store each agent’s output in df ──
    df.at[i, "extractor_agent"] = extractor_agent
    df.at[i, "analyzer_agent"]  = analyzer_agent
    df.at[i, "reviewer_agent"]  = reviewer_agent
    df.at[i, "citation_agent"]  = citation_agent

    agent_texts = {
        "Extractor": extractor_agent,
        "Analyzer":  analyzer_agent,
        "Reviewer":  reviewer_agent,
        "Citation":  citation_agent
    }

    combined, row_scores, raw_judge = llm_assessment(
        agent_texts=agent_texts,
        agent_prompts=agent_prompts,
        metrics=metrics
    )
    # If parsing failed (combined is empty), store raw_judge in a new column
    if not combined:
        df.at[i, "failed_parse_assessment"] = raw_judge
        # Optionally, skip further processing for this row:
        continue

    flattened_records = []
    # storing the value
    wide_rec = {"df": i}

    for agent_name, data in combined.items():
        # 1) Top‐level total_score
        wide_rec[f"{agent_name}_total_score"] = data.get("total_score", None)

        # 2) Per‐metric fields from data["evaluation"]
        evaluation = data.get("evaluation", {})
        for metric in metrics:
            metric_info = evaluation.get(metric, {})
            # Numeric score
            wide_rec[f"{agent_name}_{metric}_score"] = metric_info.get("score", None)
            # Strengths, issues, suggestions
            wide_rec[f"{agent_name}_{metric}_strengths"]   = metric_info.get("strengths", "")
            wide_rec[f"{agent_name}_{metric}_issues"]      = metric_info.get("issues", "")
            wide_rec[f"{agent_name}_{metric}_suggestions"] = metric_info.get("suggestions", "")

    # Now write each key/value from wide_rec back into df at row i:
    for col_name, value in wide_rec.items():
        # Skip the "df" entry, since that just equals i
        if col_name == "df":
            continue
        df.at[i, col_name] = value

df.to_csv("/media/ibrahim/Extreme SSD/Limitations Data/NeurIPS_new/self_feedback/df_neurips_self_feedback.csv",index=False)
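
The gap-based filtering in step 3 of the loop above can be isolated into a small function and checked on toy data. This is a sketch of the same rule (sort by similarity, cut at the largest drop between neighbours); `select_by_largest_gap` is a hypothetical name and the similarity values are made up for illustration.

```python
def select_by_largest_gap(ref_sims):
    """Keep the references ranked above the largest drop in similarity.

    ref_sims is a list of (ref_id, similarity) pairs in any order.
    """
    ranked = sorted(ref_sims, key=lambda pair: pair[1], reverse=True)
    if len(ranked) < 2:
        # Nothing to compare against, so keep whatever is there.
        return [ref_id for ref_id, _ in ranked]
    sims = [sim for _, sim in ranked]
    gaps = [sims[j] - sims[j + 1] for j in range(len(sims) - 1)]
    cut = gaps.index(max(gaps))  # first occurrence wins on ties
    return [ref_id for ref_id, _ in ranked[: cut + 1]]

# Made-up similarities: refs 1, 2 and 4 cluster near 0.9, refs 3 and 5 near 0.5.
example = [(1, 0.91), (2, 0.88), (3, 0.52), (4, 0.87), (5, 0.49)]
selected = select_by_largest_gap(example)
print(selected)  # [1, 2, 4]
```

One consequence worth knowing: with exactly two references there is only one gap, so the rule always keeps just the more similar of the two.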

### Second feedback round

In [None]:
Regenerate_PROMPT = '''Your task is to generate limitations based on the Judge Agent's feedback.

[Criterion]: Strengths: [strengths]. Issues: [issues]. Suggestions: [suggestions]. Generate limitations that build on the Strengths and minimize the Issues.
Ensure alignment with your role.'''

In [None]:

metrics = ['Depth','Originality','Actionability','Topic_Coverage']
AGENT_BASE_PROMPTS = {
     "Extractor":  Extractor,   # the prefix string you use when calling run_critic
     "Analyzer":   Analyzer,
     "Reviewer":   Reviewer,
     "Citation":   Citation
}
# Assume df now has columns for each agent’s scores, strengths, issues, suggestions,
# as well as "prompt" and "text_blob".

agents = ["Extractor", "Analyzer", "Reviewer", "Citation"]

for i in range(len(df)): # len(df)
    row = df.iloc[i]

    for agent in agents:
        # 1) Collect all metrics for this agent that have score < 8
        feedback_parts = []
        for metric in metrics:
            score_col = f"{agent}_{metric}_score"
            score = row.get(score_col, None)
            if score is not None and score < 8:
                strengths_col   = f"{agent}_{metric}_strengths"
                issues_col      = f"{agent}_{metric}_issues"
                suggestions_col = f"{agent}_{metric}_suggestions"

                strengths   = row.get(strengths_col, "")
                issues      = row.get(issues_col, "")
                suggestions = row.get(suggestions_col, "")

                feedback_parts.append(
                    f"{metric} Feedback:\n"
                    f"Strengths: {strengths}\n"
                    f"Issues: {issues}\n"
                    f"Suggestions: {suggestions}"
                )

        # 2) If there is at least one low‐score metric, build a feedback blob and regenerate
        if feedback_parts:
            feedback_blob = "\n\n".join(feedback_parts)

            full_prompt = (
                AGENT_BASE_PROMPTS[agent] +
                row["prompt"] +
                "\n\n" +
                Regenerate_PROMPT +
                "\n\n" +
                feedback_blob +
                "\n\n" +
                row["text_blob"]
            )

            regenerated_output = run_critic(full_prompt)

            # 3) Store the regenerated result back into df under a new column
            regen_col = f"{agent}_regenerated"
            df.at[i, regen_col] = regenerated_output

df.to_csv("df_neurips_self_feedback.csv",index=False)
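
The feedback-collection step of the regeneration loop can be exercised on its own with a plain dictionary in place of a dataframe row. The sketch below assumes the same column naming (`{agent}_{metric}_score` and so on) and the same threshold of 8; `build_feedback_blob` and `is_low_score` are hypothetical helpers. Unlike the loop above, it also skips non-numeric scores, which is an assumption about how malformed judge output should be handled.

```python
import math
import numbers

METRICS = ["Depth", "Originality", "Actionability", "Topic_Coverage"]

def is_low_score(score, threshold):
    # Treat missing, NaN, or non-numeric scores as "no feedback available".
    if not isinstance(score, numbers.Real) or isinstance(score, bool):
        return False
    return not math.isnan(score) and score < threshold

def build_feedback_blob(row, agent, metrics=METRICS, threshold=8):
    """Join the judge feedback for every metric on which `agent` scored below the threshold."""
    parts = []
    for metric in metrics:
        if not is_low_score(row.get(f"{agent}_{metric}_score"), threshold):
            continue
        parts.append(
            f"{metric} Feedback:\n"
            f"Strengths: {row.get(f'{agent}_{metric}_strengths', '')}\n"
            f"Issues: {row.get(f'{agent}_{metric}_issues', '')}\n"
            f"Suggestions: {row.get(f'{agent}_{metric}_suggestions', '')}"
        )
    return "\n\n".join(parts)

row = {
    "Extractor_Depth_score": 6,
    "Extractor_Depth_strengths": "clear",
    "Extractor_Depth_issues": "shallow",
    "Extractor_Depth_suggestions": "add detail",
    "Extractor_Originality_score": 9,                  # above threshold, skipped
    "Extractor_Actionability_score": float("nan"),     # missing, skipped
}
blob = build_feedback_blob(row, "Extractor")
print(blob)
```

An empty string from `build_feedback_blob` means no metric fell below the threshold, so the agent's original output can be kept.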

### Master agent

In [None]:
COORDINATOR_PROMPT = '''
    You are a **Master Coordinator**, an expert in scientific communication and synthesis. Your task is to integrate limitations provided by four agents:
    1. **Extractor** (explicit limitations from the article),
    2. **Analyzer** (inferred limitations from critical analysis),
    3. **Reviewer** (limitations from an open review perspective),
    4. **Citation** (limitations based on cited papers).

    **Goals**:
    1. Combine all limitations into a cohesive, non-redundant list.
    2. Ensure each limitation is clearly stated, scientifically valid, and aligned with the article’s content.
    3. Prioritize author-stated limitations, supplementing with inferred, peer-review, or citation-based limitations if they add value.
    4. Resolve discrepancies between agents’ outputs by cross-referencing the article and cited papers, using tools to verify content.
    5. Format the final list in a clear, concise, and professional manner, suitable for a scientific review or report, with citations for external sources.

    **Workflow** (inspired by SYS_PROMPT_SWEBENCH):
    1. **Plan**: Outline how to synthesize limitations, identify potential redundancies, and resolve discrepancies.
    2. **Analyze**: Combine limitations, prioritizing author-stated ones, and verify alignment with the article.
    3. **Reflect**: Check for completeness, scientific rigor, and clarity; resolve discrepancies using article content or tools.
    4. **Continue**: Iterate until the list is comprehensive, non-redundant, and professionally formatted.

    **Output Format**:
    - Numbered list of final limitations.
    - For each: Clear statement, brief justification, and source in brackets (e.g., [Author-stated], [Inferred], [Peer-review-derived], [Cited-papers]).
    - Include citations for external sources (e.g., web/X posts, cited papers) in the format [Source Name](ID).
    **Tool Use**:
    - Use text extraction tools to verify article content.
    - Use citation lookup tools to cross-reference cited papers.
    - Use web/X search tools to resolve discrepancies involving external context.

    **Input**: '''


In [None]:
import pandas as pd

# Callers pass the self-feedback (regenerated) text if it exists, otherwise the actual (original) one
def master_agent(extractor_text, analyzer_text, reviewer_text, citation_text):
    """
    Takes the outputs of the four specialized agents and produces
    the final coordinated limitations via a GPT call.
    """
    coord_prompt = (
        COORDINATOR_PROMPT
        + f"**Extractor Agent**:\n{extractor_text}\n\n"
        + f"**Analyzer Agent**:\n{analyzer_text}\n\n"
        + f"**Reviewer Agent**:\n{reviewer_text}\n\n"
        + f"**Citation Agent**:\n{citation_text}\n\n"
        + "Produce a single, numbered list of final limitations, noting each source in brackets."
    )
    return run_critic(coord_prompt)

In [None]:
import pandas as pd

# Add a column to store the master agent’s output
df["final_limitations"] = ""

def pick_text(row, agent, original_col):
    """Prefer the regenerated output for `agent`; fall back to the original if it is missing or empty."""
    # row.get avoids a KeyError when no row ever triggered regeneration for this agent
    regenerated = row.get(f"{agent}_regenerated")
    if isinstance(regenerated, str) and regenerated.strip():
        return regenerated
    return row[original_col]

for idx, row in df.iterrows():
    # 1-4) Choose each agent's text: regenerated if available, otherwise original
    extractor_text = pick_text(row, "Extractor", "extractor_agent")
    analyzer_text  = pick_text(row, "Analyzer",  "analyzer_agent")
    reviewer_text  = pick_text(row, "Reviewer",  "reviewer_agent")
    citation_text  = pick_text(row, "Citation",  "citation_agent")

    # 5) Call the master_agent with the chosen texts
    final = master_agent(
        extractor_text=extractor_text,
        analyzer_text=analyzer_text,
        reviewer_text=reviewer_text,
        citation_text=citation_text
    )

    # 6) Store the result back into df
    df.at[idx, "final_limitations"] = final
