### LLM Judge

In [None]:
JUDGE_PROMPT = ''' You are a Judge Agent, an expert in evaluating scientific text quality with a focus on limitation generation for
scientific articles. Your task is to assess the outputs of four agents—Extractor (explicit limitations from the article), Analyzer
(inferred limitations from critical analysis), Reviewer (peer-review limitations), and Citation (limitations based on cited papers).
For each agent’s output, assign a numerical score (0–100) and provide specific feedback based on defined criteria. The evaluation is
reference-free, relying on the output’s inherent quality and alignment with each agent’s role.

Evaluation Criteria:
Depth: How critical and insightful is the limitation? Does it reveal significant issues in the study’s design, findings, or implications?
(20% weight)



Originality: Is the limitation a generic critique or a novel, context-specific insight? (20% weight)

Actionability: Can researchers realistically address the limitation in future work? Does it provide clear paths for improvement?
(30% weight)

Topic Coverage: How broadly does the set of limitations cover relevant aspects (e.g., methodology, scope for Extractor/Analyzer; peer
review standards for Reviewer; cited paper gaps for Citation)? (30% weight)

Workflow: Plan: Review each agent’s role and expected output (Extractor: explicit limitations; Analyzer: inferred methodological gaps;
Reviewer: peer-review critiques; Citation: cited paper gaps). Identify tools (e.g., text analysis, citation lookup) to verify content
if needed.

Reasoning: Let’s think step by step to evaluate each output: Step 1: Read the agent’s output and confirm its alignment with the agent’s
role. Step 2: Assess each criterion (Depth, Originality, Actionability, Topic Coverage), noting strengths and weaknesses. Step 3: Assign
a score (0–10) for each criterion based on quality, then calculate the weighted total (0–100). Step 4: Generate feedback for each
criterion, specifying what was done well and what needs improvement. Step 5: Verify the evaluation by cross-checking with the article
or cited papers using tools, if necessary.

Analyze: Use tools to verify article or cited paper content to ensure accurate evaluation (e.g., confirm Extractor’s quotes, Citation’s
references). Reflect: Ensure the score and feedback are fair, consistent, and actionable. Re-evaluate if any criterion seems misjudged.
Continue: Iterate until the evaluation is complete for all agents.

Tool Use: Use text analysis tools to verify article content (e.g., Extractor’s quotes, Analyzer’s methodology). Use citation lookup
tools to confirm cited paper details (e.g., Citation’s references). Use web/X search tools to validate Reviewer’s external context,
if needed.

Chain of Thoughts: Document the evaluation process explicitly. For example: “The Extractor’s output identifies a limitation but
lacks critical insight, reducing Depth.” “The Analyzer’s limitation is generic, affecting Originality.” “The Reviewer’s output is
actionable but misses ethical considerations, limiting Topic Coverage.” This narrative ensures transparency and justifies the score
and feedback.

Scoring: For each criterion, assign a score (0–10) based on quality: 0–3: Poor (major issues, e.g., superficial, generic, not actionable,
narrow coverage). 4–6: Fair (moderate issues, e.g., somewhat insightful, partially actionable, incomplete coverage). 7–8: Good
(minor issues, e.g., mostly critical, slightly generic, broadly actionable). 9–10: Excellent (no issues, e.g., highly insightful,
novel, clearly actionable, comprehensive coverage).

Calculate the total score: Sum (criterion score × weight), where weights are Depth (0.2), Originality (0.2), Actionability (0.3),
Topic Coverage (0.3).

Example: Depth (8 × 0.2 = 1.6), Originality (7 × 0.2 = 1.4), Actionability (9 × 0.3 = 2.7), Topic Coverage (6 × 0.3 = 1.8),
Total = (1.6 + 1.4 + 2.7 + 1.8) × 10 = 75.

Input:
Extractor Agent: [extractor_agent output]
Analyzer Agent: [analyzer_agent output]
Reviewer Agent: [reviewer_agent output]
Citation Agent: [citation_agent output]

Output Format: The output must strictly be in JSON format, starting with ```json\n{...}.
For each agent (Extractor, Analyzer, Reviewer, Citation), provide a JSON object with the following structure:

{
  "agent": "[Agent Name]",
  "total_score": [Numerical score, 0–100],
  "evaluation": {
    "Depth": {
      "score": [0–10],
      "strengths": "[What was done well]",
      "issues": "[Problems identified]",
      "suggestions": "[How to improve]"
    },
    "Originality": {
      "score": [0–10],
      "strengths": "[What was done well]",
      "issues": "[Problems identified]",
      "suggestions": "[How to improve]"
    },
    "Actionability": {
      "score": [0–10],
      "strengths": "[What was done well]",
      "issues": "[Problems identified]",
      "suggestions": "[How to improve]"
    },
    "Topic_Coverage": {
      "score": [0–10],
      "strengths": "[What was done well]",
      "issues": "[Problems identified]",
      "suggestions": "[How to improve]"
    }
  }
}'''
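
The weighted total described in the prompt can be recomputed locally to sanity-check the judge's `total_score`. This is a minimal sketch, assuming the four criterion scores (0–10) have already been parsed; `WEIGHTS` and `weighted_total` are hypothetical names, not part of the pipeline.

```python
# Hypothetical helper: recompute the 0-100 total from the 0-10 criterion scores.
WEIGHTS = {
    "Depth": 0.2,
    "Originality": 0.2,
    "Actionability": 0.3,
    "Topic_Coverage": 0.3,
}

def weighted_total(scores: dict) -> float:
    """Return sum(score * weight) * 10, rounded to two decimals."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing criterion scores: {sorted(missing)}")
    return round(sum(scores[m] * w for m, w in WEIGHTS.items()) * 10, 2)

# Same numbers as the worked example in the prompt.
example = {"Depth": 8, "Originality": 7, "Actionability": 9, "Topic_Coverage": 6}
print(weighted_total(example))
```

Comparing this value with the judge's reported `total_score` is one way to flag responses where the model's arithmetic drifted from the stated weights.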

### assessment and limitation generation

In [None]:
import re
import json

def llm_assessment(agent_texts: dict,
                   agent_prompts: dict,
                   metrics=None):
    """
    Run the collective LLM judge and parse its per-agent scores and feedback.

    Returns:
      combined: dict mapping agent name to parsed JSON evaluation data (empty if parsing failed)
      row_scores: dict mapping "<agent>_score" to that agent's total_score (None if parsing failed)
      raw_response: the unparsed LLM output string (always returned, so callers can store it when parsing fails)
    """
    if metrics is None:
        metrics = ['Depth','Originality','Actionability','Topic_Coverage']

    # 1) Send the judge prompt once, followed by each agent's output
    raw_response = run_critic(
        JUDGE_PROMPT + "\n\n" +
        "".join(f"**{agent} Agent**:\n{agent_texts[agent]}\n\n"
                for agent in agent_prompts)
    )
    # print("collective judge response:\n", raw_response)

    # 2) Extract JSON-fenced blocks
    blocks = re.findall(r"```json\n(.*?)```", raw_response, re.DOTALL)

    if not blocks:
        # No JSON blocks found at all → return empty combined and scores, plus raw text
        # print("⚠️ Warning: No JSON-fenced sections found in collective_judge.")
        return {}, {f"{agent}_score": None for agent in agent_prompts}, raw_response

    combined = {}
    for b in blocks:
        try:
            parsed = json.loads(b)
            agent_name = parsed.get("agent")
            if agent_name:
                combined[agent_name] = parsed
            # else:
            #     print("⚠️ Warning: JSON block missing 'agent' field:", b)
        except json.JSONDecodeError:
            # print("⚠️ Warning: Failed to parse JSON block:", b)
            # skip it
            pass

    # If combined is still empty, parsing failed entirely
    if not combined:
        # print("⚠️ Warning: Parsed no valid agent entries. Returning empty scores.")
        return {}, {f"{agent}_score": None for agent in agent_prompts}, raw_response

    # 3) Build row_scores from combined
    row_scores = {}
    for agent, data in combined.items():
        row_scores[f"{agent}_score"] = data.get("total_score")

    return combined, row_scores, raw_response
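
The fenced-block extraction above assumes each fence holds exactly one JSON object. A judge model may instead return all agents as a JSON list inside a single fence. The sketch below accepts both shapes; `extract_agent_blocks` is a hypothetical helper, and the sample response is invented for illustration.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this example contains no literal fence

def extract_agent_blocks(raw_response: str) -> dict:
    """Map agent name -> evaluation dict for every json fence in the text."""
    combined = {}
    pattern = FENCE + r"json\s*(.*?)" + FENCE
    for block in re.findall(pattern, raw_response, re.DOTALL):
        try:
            parsed = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks, as llm_assessment does
        # Accept either a single object or a list of objects.
        items = parsed if isinstance(parsed, list) else [parsed]
        for item in items:
            if isinstance(item, dict) and item.get("agent"):
                combined[item["agent"]] = item
    return combined

sample = (
    FENCE + "json\n"
    '[{"agent": "Extractor", "total_score": 75},'
    ' {"agent": "Analyzer", "total_score": 82}]\n'
    + FENCE
)
print(extract_agent_blocks(sample))
```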


### limitation generation

In [None]:
system_prompt = '''You are a helpful, respectful, and honest assistant for generating limitations or shortcomings of a research paper.
 Generate limitations or shortcomings for the following passages from the scientific paper.'''

In [None]:
agent_prompts = {
    "Extractor": Extractor,
    "Analyzer":  Analyzer,
    "Reviewer":  Reviewer,
    "Citation":  Citation,
}
agents = ["Extractor", "Analyzer", "Reviewer", "Citation"]

parsed_feedback_rows = []
metrics = ['Depth','Originality','Actionability','Topic_Coverage']
judge_feedback = []
generated_limitations = []


df["failed_parse_assessment"] = ""

# Initialize df_all before the loop
df_all = pd.DataFrame()

for i in range(len(df)):
    print(f"\nProcessing row {i}")
    row = df.iloc[i]

    # 1) Collect the non-empty main sections of this paper and embed them
    inputs = {col: row[col].strip() for col in main_cols
              if isinstance(row.get(col), str) and row[col].strip()}
    if not inputs:
        print(f"Row {i} has no valid main sections.")
        continue
    own_text = " ".join(inputs.values())
    own_emb  = embed_model.encode(own_text, convert_to_tensor=True)

    # 2) Measure similarity for each referenced paper
    ref_sims = []
    for ref_id in range(1, 116):
        # gather all non‐empty sections for this reference
        texts = []
        for suf in ref_suffixes:
            col = f"neurips_ref_{ref_id}_{suf}"
            t   = row.get(col, "")
            if isinstance(t, str) and t.strip():
                texts.append(t.strip())
        if not texts:
            continue
        # concatenate and embed
        ref_text = " ".join(texts)
        ref_emb  = embed_model.encode(ref_text, convert_to_tensor=True)
        # cosine similarity requires numpy arrays
        sim = cosine_similarity(
            ref_emb.cpu().numpy().reshape(1, -1),
            own_emb.cpu().numpy().reshape(1, -1)
        )[0][0]
        ref_sims.append((ref_id, sim))

    if not ref_sims:
        print(" No references to filter; skipping row.")
        continue

    ref_sims.sort(key=lambda x: x[1], reverse=True)
    sims = [sim for _, sim in ref_sims]
    # print("sims are",sims)
    if len(sims) < 2:
        # Fewer than 2 refs → keep them all
        selected = [rid for rid, _ in ref_sims]
    else:
        # compute gaps between adjacent sims (j avoids shadowing the row index i)
        gaps = [sims[j] - sims[j + 1] for j in range(len(sims) - 1)]
        # find the index of the largest drop
        max_gap_idx = gaps.index(max(gaps))
        # keep every ref ranked above that drop
        selected = [rid for rid, _ in ref_sims[: max_gap_idx + 1]]

    keep_cols = []
    for ref_id in selected:
        for suf in ref_suffixes:
            col = f"neurips_ref_{ref_id}_{suf}"
            t   = row.get(col, "")
            if isinstance(t, str) and t.strip():
                keep_cols.append(col)

    keep_cols = [
    c for c in keep_cols
    if c in df.columns and isinstance(row[c], str) and row[c].strip()
    ]

    csv_path = "df_rag_train.csv"
    # write this row's kept reference columns, dropping any column that is entirely NaN
    df.loc[[i], keep_cols].dropna(axis=1, how="all").to_csv(csv_path, index=False)

    # load & chunk
    df_rag = pd.read_csv(csv_path)
    lc_docs = []
    for col in df_rag.columns:
        text = df_rag.loc[0, col]
        if isinstance(text, str) and text.strip():
            lc_docs.append(
                Document(
                    page_content=text.strip(),
                    metadata={"source_column": col}
                )
            )
    # lc_docs  = CSVLoader(file_path=csv_path).load()
    chunked  = CharacterTextSplitter(chunk_size=512, chunk_overlap=64).split_documents(lc_docs)
    # retrieve top‐k docs
    retriever = make_retriever_for_docs(chunked, k=3)
    # calling 'get_relevant_documents' from langchain.retriever
    docs      = retriever.get_relevant_documents(row["response_string_all"])
    # 'passages' contains the relevant documents from vector database
    passages  = [d.page_content for d in docs]
    # passages_str = "\n\n".join(passages)

    # build query (input paper + system prompt)
    query = "Here are the all sections of a paper: " + row["response_string_all"] + '\n\n' + system_prompt
    # print("query is",query)

    # tokenize + truncate if needed, passages contains retrieved text
    passages = [p.replace("<|endoftext|>", "") for p in passages]

    passages = ensure_passages_within_budget(query, passages)

    retrieved_text = "\n\n".join(passages)  # passages contains the content from cited papers (vector database)
    # print("retrieved text is", retrieved_text)
    # ── Store them in df ──
    df.at[i, "query"] = query
    df.at[i, "retrieved_text"] = retrieved_text

    extractor_agent = run_critic(Extractor + query)
    analyzer_agent = run_critic(Analyzer + query)
    reviewer_agent = run_critic(Reviewer + query)
    citation_agent = run_critic(Citation + retrieved_text)

    # ── Store each agent’s output in df ──
    df.at[i, "extractor_agent"] = extractor_agent
    df.at[i, "analyzer_agent"]  = analyzer_agent
    df.at[i, "reviewer_agent"]  = reviewer_agent
    df.at[i, "citation_agent"]  = citation_agent

    agent_texts = {
        "Extractor": extractor_agent,
        "Analyzer":  analyzer_agent,
        "Reviewer":  reviewer_agent,
        "Citation":  citation_agent
    }

    combined, row_scores, raw_judge = llm_assessment(
        agent_texts=agent_texts,
        agent_prompts=agent_prompts,
        metrics=metrics
    )
    # If parsing failed (combined is empty), store raw_judge in a new column
    if not combined:
        df.at[i, "failed_parse_assessment"] = raw_judge
        # Optionally, skip further processing for this row:
        continue

    # Flatten the parsed evaluation into one wide record for this row
    wide_rec = {"df": i}

    for agent_name, data in combined.items():
        # 1) Top‐level total_score
        wide_rec[f"{agent_name}_total_score"] = data.get("total_score", None)

        # 2) Per‐metric fields from data["evaluation"]
        evaluation = data.get("evaluation", {})
        for metric in metrics:
            metric_info = evaluation.get(metric, {})
            # Numeric score
            wide_rec[f"{agent_name}_{metric}_score"] = metric_info.get("score", None)
            # Strengths, issues, suggestions
            wide_rec[f"{agent_name}_{metric}_strengths"]   = metric_info.get("strengths", "")
            wide_rec[f"{agent_name}_{metric}_issues"]      = metric_info.get("issues", "")
            wide_rec[f"{agent_name}_{metric}_suggestions"] = metric_info.get("suggestions", "")

    # Now write each key/value from wide_rec back into df at row i:
    for col_name, value in wide_rec.items():
        # Skip the "df" entry, since that just equals i
        if col_name == "df":
            continue
        df.at[i, col_name] = value

df.to_csv("self_feedback/df_neurips_self_feedback.csv",index=False)
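
The reference filter in the loop above keeps every cited paper ranked above the largest drop in similarity. Pulled out as a standalone function it is easier to test. This is a sketch: `select_by_largest_gap` is a hypothetical name and the similarity values are made up.

```python
def select_by_largest_gap(ref_sims):
    """ref_sims: list of (ref_id, similarity). Return ids ranked above the largest drop."""
    ranked = sorted(ref_sims, key=lambda x: x[1], reverse=True)
    sims = [s for _, s in ranked]
    if len(sims) < 2:
        # Zero or one reference: nothing to cut, keep whatever is there.
        return [rid for rid, _ in ranked]
    # Gap between each similarity and the next lower one.
    gaps = [sims[j] - sims[j + 1] for j in range(len(sims) - 1)]
    cut = gaps.index(max(gaps))  # first occurrence wins on ties
    return [rid for rid, _ in ranked[: cut + 1]]

print(select_by_largest_gap([(1, 0.91), (2, 0.88), (3, 0.52), (4, 0.50)]))
```

One property worth noting: with two or more references the rule always discards at least the lowest-ranked one, even when all similarities are close.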

### parsing feedback with regex

In [None]:
import pandas as pd
# read back the file written by the generation loop above
df = pd.read_csv("self_feedback/df_neurips_self_feedback.csv")

In [None]:
import re
import pandas as pd

# ─── 2. Define which agents and metrics to extract ───

agents = ["Extractor", "Analyzer", "Reviewer", "Citation"]
metrics = ["Depth", "Originality", "Actionability", "Topic_Coverage"]
fields = ["score", "strengths", "issues", "suggestions"]

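# ─── 3. Create any missing target columns (keep values parsed earlier) ───

for agent in agents:
    for metric in metrics:
        for field in fields:
            col_name = f"{agent}_{metric}_{field}"
            if col_name not in df.columns:
                df[col_name] = None  # initialize with None (becomes NaN in pandas)
            else:
                # keep existing values; object dtype lets strings be written below
                df[col_name] = df[col_name].astype(object)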

# ─── 4. Iterate over each row and use regex to fill in the new columns ───

for idx, cell in df["failed_parse_assessment"].items():
    if not isinstance(cell, str):
        # If the cell is not a string, continue (all new columns stay None/NaN)
        continue

    # 4a) We assume the JSON is wrapped in ```json … ```. Extract the {...} portion
    match = re.search(r"\{.*\}", cell, flags=re.DOTALL)
    if not match:
        # If no braces‐enclosed content is found, skip this row
        continue

    inner_json = match.group(0)

    # 4b) Match each agent/metric block; non-greedy `.*?` with DOTALL spans line breaks.
    for agent in agents:
        for metric in metrics:
            pattern = (
                rf'"{agent}"\s*:\s*\{{'                              # "Extractor": {
                rf'.*?"{metric}"\s*:\s*\{{'                           #    "Depth": {
                rf'.*?"score"\s*:\s*(?P<{agent}_{metric}_score>\d+)'   #       "score": 0
                rf'.*?"strengths"\s*:\s*"(?P<{agent}_{metric}_strengths>.*?)"'  # "strengths": "…"
                rf'.*?"issues"\s*:\s*"(?P<{agent}_{metric}_issues>.*?)"'       # "issues": "…"
                rf'.*?"suggestions"\s*:\s*"(?P<{agent}_{metric}_suggestions>.*?)"' # "suggestions": "…"
                rf'.*?\}}'                                             #    }
            )

            m = re.search(pattern, inner_json, flags=re.DOTALL)
            if m:
                # Extract each named group if it matched
                df.at[idx, f"{agent}_{metric}_score"] = m.group(f"{agent}_{metric}_score")
                df.at[idx, f"{agent}_{metric}_strengths"] = m.group(f"{agent}_{metric}_strengths")
                df.at[idx, f"{agent}_{metric}_issues"] = m.group(f"{agent}_{metric}_issues")
                df.at[idx, f"{agent}_{metric}_suggestions"] = m.group(f"{agent}_{metric}_suggestions")
            # If regex didn’t match, leave those columns as None/NaN


### regenerate

In [None]:
Regenerate_PROMPT = '''

You are tasked with generating limitations based on feedback from the Judge Agent.
Feedback Structure:
Strengths: [strengths]
Issues: [issues]
Suggestions: [suggestions]
Task: Create a set of limitations that:

Builds upon the identified strengths to reinforce positive aspects.
Minimizes the impact of issues by addressing them constructively.
Incorporates suggestions to ensure actionable improvements.
Ensure the limitations are clear, concise, and aligned with your role as [specify role, e.g., a content generator, analyst, etc.].'''

In [None]:
# Identify all columns ending with "_score"
score_cols = [col for col in df.columns if col.endswith("_score")]

# For each such column, count how many entries are < 8
for col in score_cols:
    # Ensure the column is numeric (coerce non-numeric to NaN)
    numeric_series = pd.to_numeric(df[col], errors="coerce")
    count_lt_8 = (numeric_series < 8).sum()
    print(f"{col}: {count_lt_8} rows with value < 8")


In [None]:
import pandas as pd

# ─── 1. Definitions ───

metrics = ["Depth", "Originality", "Actionability", "Topic_Coverage"]

AGENT_BASE_PROMPTS = {
    "Extractor": Extractor,
    "Analyzer":  Analyzer,
    "Reviewer":  Reviewer,
    "Citation":  Citation
}


# ─── 2. Ensure score columns are numeric ───

agents = ["Extractor", "Analyzer", "Reviewer", "Citation"]
for agent in agents:
    for metric in metrics:
        score_col = f"{agent}_{metric}_score"
        df[score_col] = pd.to_numeric(df[score_col], errors="coerce")

# ─── 3. Initialize regenerated‐output columns ───

for agent in agents:
    df[f"{agent}_regenerated"] = None

# ─── 4. Loop over each row ───

for i, row in df.iterrows():
    print("i is",i)
    input_string   = row.get("query", "")
    retrieved_text = row.get("retrieved_text", "")

    for agent in agents:
        feedback_parts = []

        # 4a. Gather feedback for any metric with score < 8
        for metric in metrics:
            score = row.get(f"{agent}_{metric}_score", None)
            # if any of them or multiple less than 8, add them and send with feedback for regenerate
            if pd.notna(score) and score < 8:
                strengths   = row.get(f"{agent}_{metric}_strengths", "")
                issues      = row.get(f"{agent}_{metric}_issues", "")
                suggestions = row.get(f"{agent}_{metric}_suggestions", "")

                feedback_parts.append(
                    f"{metric} Feedback:\n"
                    f"  Strengths: {strengths}\n"
                    f"  Issues: {issues}\n"
                    f"  Suggestions: {suggestions}"
                )

        # 4b. If any metric is < 8, build regeneration prompt
        if feedback_parts:
            feedback_blob = "\n\n".join(feedback_parts)

            # Choose the correct “seed text” based on agent
            if agent == "Citation":
                seed_text = retrieved_text + system_prompt
            else:
                seed_text = input_string

            full_prompt = (
                AGENT_BASE_PROMPTS[agent]
                + seed_text
                + "\n\n"
                + Regenerate_PROMPT
                + "\n\n"
                + feedback_blob
            )
            # 4c. Call the LLM to regenerate and store the result
            regenerated_output = run_critic(full_prompt)
            df.at[i, f"{agent}_regenerated"] = regenerated_output

df.to_csv("df_neurips_self_feedback.csv",index=False)
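
Both regeneration rounds collect feedback the same way: for each metric scoring below 8, the judge's strengths, issues, and suggestions are appended to one text blob. A sketch of that step as a reusable function follows. `build_feedback`, `THRESHOLD`, and the `suffix` parameter are hypothetical names; `row` can be a plain dict or a pandas Series, since both provide `.get`.

```python
THRESHOLD = 8  # same cutoff as the regeneration cells

def build_feedback(row, agent, metrics, suffix=""):
    """Join judge feedback for every metric of `agent` scoring below THRESHOLD."""
    parts = []
    for metric in metrics:
        key = f"{agent}_{metric}"
        score = row.get(f"{key}_score{suffix}")
        # Skip missing scores (None or NaN) and scores that are already high enough.
        if score is None or score != score or score >= THRESHOLD:
            continue
        strengths = row.get(f"{key}_strengths{suffix}", "")
        issues = row.get(f"{key}_issues{suffix}", "")
        suggestions = row.get(f"{key}_suggestions{suffix}", "")
        parts.append(
            f"{metric} Feedback:\n"
            f"  Strengths: {strengths}\n"
            f"  Issues: {issues}\n"
            f"  Suggestions: {suggestions}"
        )
    return "\n\n".join(parts)

row = {
    "Extractor_Depth_score": 6,
    "Extractor_Depth_strengths": "Quotes the stated limitation",
    "Extractor_Depth_issues": "No critical insight",
    "Extractor_Depth_suggestions": "Explain the impact on the findings",
    "Extractor_Originality_score": 9,
}
print(build_feedback(row, "Extractor", ["Depth", "Originality"]))
```

Passing `suffix="_2"` would read the second-round columns, so one function could serve both regeneration cells.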

### Judge

In [None]:
# call judge LLM and store the raw response in dataframe
agents = ["Extractor", "Analyzer", "Reviewer", "Citation"]

for i, row in df.iterrows():
    for agent in agents:
        regen_col = f"{agent}_regenerated"
        feedback_col = f"self_feedback_{agent}"

        # If nothing was regenerated (None, or NaN after a CSV round trip), store 0
        if pd.isna(row.get(regen_col)):
            df.at[i, feedback_col] = 0
            continue

        # Otherwise, build a single‐agent judge prompt
        single_agent_text = row[regen_col]
        prompt = (
            JUDGE_PROMPT
            + f"**{agent} Agent**:\n{single_agent_text}\n\n"
        )

        # Call run_critic and store the response
        df.at[i, feedback_col] = run_critic(prompt)


### parsing feedback with regex (again)

In [None]:
import re
import pandas as pd

# ─── 1. Define agents, metrics, and fields ───

agents = ["Extractor", "Analyzer", "Reviewer", "Citation"]
metrics = ["Depth", "Originality", "Actionability", "Topic_Coverage"]
fields = ["score", "strengths", "issues", "suggestions"]

# ─── 2. Create new target columns with suffix "_2" ───

for agent in agents:
    for metric in metrics:
        for field in fields:
            col_name = f"{agent}_{metric}_{field}_2"
            df[col_name] = None  # initialize with None (will appear as NaN)

# ─── 3. Iterate over each row and each agent-specific self_feedback ───

for idx in df.index:
    for agent in agents:
        feedback_col = f"self_feedback_{agent}"
        cell = df.at[idx, feedback_col]

        # If the agent-specific feedback is exactly 0, skip parsing for this agent
        if cell == 0:
            continue

        # Otherwise, only proceed if it's a non-empty string
        if not isinstance(cell, str):
            continue

        # Extract JSON block between the first "{" and the last "}"
        match = re.search(r"\{.*\}", cell, flags=re.DOTALL)
        if not match:
            continue

        inner_json = match.group(0)

        # Attempt to extract each metric's fields from that JSON blob
        for metric in metrics:
            pattern = (
                rf'"{agent}"\s*:\s*\{{'                                     # "Extractor": {
                rf'.*?"{metric}"\s*:\s*\{{'                                  #   "Depth": {
                rf'.*?"score"\s*:\s*(?P<{agent}_{metric}_score_2>\d+)'      #      "score": 0
                rf'.*?"strengths"\s*:\s*"(?P<{agent}_{metric}_strengths_2>.*?)"'  # "strengths": "…"
                rf'.*?"issues"\s*:\s*"(?P<{agent}_{metric}_issues_2>.*?)"'          # "issues": "…"
                rf'.*?"suggestions"\s*:\s*"(?P<{agent}_{metric}_suggestions_2>.*?)"'# "suggestions": "…"
                rf'.*?\}}'                                                    #   }
            )

            m = re.search(pattern, inner_json, flags=re.DOTALL)
            if m:
                df.at[idx, f"{agent}_{metric}_score_2"]       = m.group(f"{agent}_{metric}_score_2")
                df.at[idx, f"{agent}_{metric}_strengths_2"]   = m.group(f"{agent}_{metric}_strengths_2")
                df.at[idx, f"{agent}_{metric}_issues_2"]      = m.group(f"{agent}_{metric}_issues_2")
                df.at[idx, f"{agent}_{metric}_suggestions_2"] = m.group(f"{agent}_{metric}_suggestions_2")
            # If no match, leave those columns as None/NaN


### measure score

In [None]:
# score first time
import numpy as np

# Identify all columns ending with "_score"
score_cols = [col for col in df.columns if col.endswith("_score")]

# Replace zeros with NaN so they are excluded from the mean
score_averages = df[score_cols].replace(0, np.nan).mean()

print("Average values for '_score' columns (excluding zeros):")
print(score_averages)


Average values for '_score' columns (excluding zeros):
Extractor_Depth_score             6.173729
Extractor_Originality_score       5.334746
Extractor_Actionability_score     7.131356
Extractor_Topic_Coverage_score    7.250965
Analyzer_Depth_score              7.992537
Analyzer_Originality_score        7.370647
Analyzer_Actionability_score      8.631841
Analyzer_Topic_Coverage_score     8.121891
Reviewer_Depth_score              8.109453
Reviewer_Originality_score        7.246269
Reviewer_Actionability_score      8.266169
Reviewer_Topic_Coverage_score     8.333333
Citation_Depth_score              7.828358
Citation_Originality_score        6.962687
Citation_Actionability_score      8.236318
Citation_Topic_Coverage_score     8.303483
dtype: float64


In [None]:
import numpy as np
import pandas as pd

# Identify all columns ending with "_score_2"
score2_cols = [col for col in df.columns if col.endswith("_score_2")]

averages = {}

for col2 in score2_cols:
    # Derive the corresponding base column by removing the trailing "_2"
    col1 = col2[:-2]

    # 1) Take the "_2" values, coercing to numeric and replacing 0 with NaN
    s2 = pd.to_numeric(df[col2], errors='coerce').replace(0, np.nan)

    # 2) If the base column exists, use its nonzero values where s2 is NaN
    if col1 in df.columns:
        s1 = pd.to_numeric(df[col1], errors='coerce').replace(0, np.nan)
        # Fill NaNs in s2 with the values from s1
        pref = s2.fillna(s1)
    else:
        pref = s2

    # 3) Compute the mean ignoring NaNs
    averages[col2] = pref.mean()

# Convert to a pandas Series for nicer output (optional)
avg_series = pd.Series(averages)
print("Average values for '_score_2' columns (fallback to base column if needed):")
print(avg_series)


Average values for '_score_2' columns (fallback to base column if needed):
Extractor_Depth_score_2             6.817955
Extractor_Originality_score_2       6.032419
Extractor_Actionability_score_2     7.725686
Extractor_Topic_Coverage_score_2    7.315920
Analyzer_Depth_score_2              8.144279
Analyzer_Originality_score_2        7.898010
Analyzer_Actionability_score_2      8.743781
Analyzer_Topic_Coverage_score_2     8.054726
Reviewer_Depth_score_2              8.467662
Reviewer_Originality_score_2        7.475124
Reviewer_Actionability_score_2      8.810945
Reviewer_Topic_Coverage_score_2     8.631841
Citation_Depth_score_2              8.674129
Citation_Originality_score_2        8.077114
Citation_Actionability_score_2      8.417910
Citation_Topic_Coverage_score_2     8.850746
dtype: float64
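
The second-round averages above are not computed over regenerated outputs only: wherever a `_score_2` value is missing or zero, the first-round score is used instead. A plain-Python restatement of that fallback rule is sketched below, using `None` for missing values; `preferred_mean` is a hypothetical name and the numbers are invented.

```python
def preferred_mean(first, second):
    """Mean of round-2 scores, falling back to round-1 where round 2 is missing or 0."""
    chosen = []
    for s1, s2 in zip(first, second):
        value = s2 if s2 else s1  # None and 0 are both treated as missing
        if value:                 # drop rows with no usable score in either round
            chosen.append(value)
    return sum(chosen) / len(chosen) if chosen else float("nan")

# Row 1 was regenerated (6 -> 8), row 2 was not (keeps 9), row 3 has no usable score.
print(preferred_mean([6, 9, 0], [8, None, None]))
```

Because rows that were never regenerated keep their first-round score, the `_score_2` averages mix both rounds rather than measuring the regenerated outputs in isolation.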


### regenerate limitations (again)

In [None]:
Regenerate_PROMPT = '''

You are tasked with generating limitations based on feedback from the Judge Agent.
Feedback Structure:
Strengths: [strengths]
Issues: [issues]
Suggestions: [suggestions]
Task: Create a set of limitations that:

Builds upon the identified strengths to reinforce positive aspects.
Minimizes the impact of issues by addressing them constructively.
Incorporates suggestions to ensure actionable improvements.
Ensure the limitations are clear, concise, and aligned with your role as [specify role, e.g., a content generator, analyst, etc.].'''

In [None]:
# Identify all columns ending with "_score_2"
score_cols = [col for col in df.columns if col.endswith("_score_2")]

# For each such column, count how many entries are < 8
for col in score_cols:
    # Ensure the column is numeric (coerce non-numeric to NaN)
    numeric_series = pd.to_numeric(df[col], errors="coerce")
    count_lt_8 = (numeric_series < 8).sum()
    print(f"{col}: {count_lt_8} rows with value < 8")


In [None]:
import pandas as pd

# ─── 1. Definitions ───

metrics = ["Depth", "Originality", "Actionability", "Topic_Coverage"]

AGENT_BASE_PROMPTS = {
    "Extractor": Extractor,
    "Analyzer":  Analyzer,
    "Reviewer":  Reviewer,
    "Citation":  Citation
}

# Regenerate_PROMPT, system_prompt, and run_critic(...) are defined in earlier cells.

# ─── 2. Ensure “_score_2” columns are numeric ───

agents = ["Extractor", "Analyzer", "Reviewer", "Citation"]
for agent in agents:
    for metric in metrics:
        score2_col = f"{agent}_{metric}_score_2"
        if score2_col in df.columns:
            df[score2_col] = pd.to_numeric(df[score2_col], errors="coerce")

# ─── 3. Initialize regenerated‐2 columns ───

for agent in agents:
    df[f"{agent}_regenerated_2"] = None

# ─── 4. Loop over each row, regenerating based on “_score_2” feedback ───

for i, row in df.iterrows():
    input_string   = row.get("query", "")
    retrieved_text = row.get("retrieved_text", "")

    for agent in agents:
        feedback_parts = []

        # 4a. Gather feedback for any metric where “_score_2” < 8
        for metric in metrics:
            score2 = row.get(f"{agent}_{metric}_score_2", None)
            if pd.notna(score2) and score2 < 8:
                strengths2   = row.get(f"{agent}_{metric}_strengths_2", "")
                issues2      = row.get(f"{agent}_{metric}_issues_2", "")
                suggestions2 = row.get(f"{agent}_{metric}_suggestions_2", "")

                feedback_parts.append(
                    f"{metric} Feedback:\n"
                    f"  Strengths: {strengths2}\n"
                    f"  Issues: {issues2}\n"
                    f"  Suggestions: {suggestions2}"
                )

        # 4b. If any metric_2 is < 8, build regeneration prompt
        if feedback_parts:
            feedback_blob = "\n\n".join(feedback_parts)

            # Choose “seed text” based on agent
            if agent == "Citation":
                seed_text = retrieved_text + system_prompt
            else:
                # "query" already ends with system_prompt (see the generation loop)
                seed_text = input_string

            full_prompt = (
                AGENT_BASE_PROMPTS[agent]
                + seed_text
                + "\n\n"
                + Regenerate_PROMPT
                + "\n\n"
                + feedback_blob
            )
            # 4c. Call the LLM to regenerate and store in “_regenerated_2”
            regenerated_output = run_critic(full_prompt)
            df.at[i, f"{agent}_regenerated_2"] = regenerated_output


df.to_csv(
    "df_neurips_self_feedback1.csv",
    index=False
)


### Judge (again)

In [None]:
import pandas as pd

agents = ["Extractor", "Analyzer", "Reviewer", "Citation"]

# 1) Create the new "self_feedback_2_<Agent>" columns up front
for agent in agents:
    df[f"self_feedback_2_{agent}"] = None

# 2) Loop over each row and each agent
for i, row in df.iterrows():
    for agent in agents:
        regen_col   = f"{agent}_regenerated_2"
        feedback_col = f"self_feedback_2_{agent}"

        # If the regenerated text is NaN or None, store 0
        if pd.isna(row.get(regen_col)):
            df.at[i, feedback_col] = 0
            continue

        # Otherwise, build a single‐agent judge prompt
        single_agent_text = row[regen_col]
        prompt = (
            JUDGE_PROMPT
            + f"**{agent} Agent**:\n{single_agent_text}\n\n"
        )
        # Call run_critic and store the raw response
        df.at[i, feedback_col] = run_critic(prompt)

df.to_csv(
    "df_neurips_self_feedback1.csv",
    index=False
)


### parsing (again)

In [None]:
import re
import pandas as pd

agents = ["Extractor", "Analyzer", "Reviewer", "Citation"]
metrics = ["Depth", "Originality", "Actionability", "Topic_Coverage"]
fields = ["score", "strengths", "issues", "suggestions"]

# 1) Create the new "_3" columns as before
for agent in agents:
    for metric in metrics:
        for field in fields:
            col_name = f"{agent}_{metric}_{field}_3"
            df[col_name] = None

# 2) Loop over each row and parse self_feedback_2_<Agent>
for idx in df.index:
    for agent in agents:
        feedback_col = f"self_feedback_2_{agent}"
        cell = df.at[idx, feedback_col]

        # If it’s exactly 0, skip
        if cell == 0:
            continue

        # Otherwise it must be a string containing JSON
        if not isinstance(cell, str):
            continue

        # Extract the JSON object (including braces) in case there’s extra text
        m_obj = re.search(r"\{.*\}", cell, flags=re.DOTALL)
        if not m_obj:
            continue

        json_blob = m_obj.group(0)

        # Now look for the “evaluation” block, then inside it each metric
        # e.g.  "evaluation": { "Depth": { "score": 5, "strengths": "…", … }, … }
        for metric in metrics:
            # Build a pattern that finds, inside "evaluation", the right metric
            pattern = (
                rf'"evaluation"\s*:\s*\{{'                # "evaluation": {
                rf'.*?"{metric}"\s*:\s*\{{'                #    "Depth": {
                rf'.*?"score"\s*:\s*(?P<{agent}_{metric}_score_3>\d+),'
                                                      #       "score": 5,
                rf'.*?"strengths"\s*:\s*"(?P<{agent}_{metric}_strengths_3>.*?)",'
                                                      #       "strengths": "…",
                rf'.*?"issues"\s*:\s*"(?P<{agent}_{metric}_issues_3>.*?)",'
                                                      #       "issues": "…",
                rf'.*?"suggestions"\s*:\s*"(?P<{agent}_{metric}_suggestions_3>.*?)"'
                                                      #       "suggestions": "…"
                rf'.*?\}}'                                 #    }
            )

            m = re.search(pattern, json_blob, flags=re.DOTALL)
            if m:
                df.at[idx, f"{agent}_{metric}_score_3"]       = m.group(f"{agent}_{metric}_score_3")
                df.at[idx, f"{agent}_{metric}_strengths_3"]   = m.group(f"{agent}_{metric}_strengths_3")
                df.at[idx, f"{agent}_{metric}_issues_3"]      = m.group(f"{agent}_{metric}_issues_3")
                df.at[idx, f"{agent}_{metric}_suggestions_3"] = m.group(f"{agent}_{metric}_suggestions_3")
            # If no match for that metric, leave the _3 columns as None/NaN
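
The regex above depends on the four fields appearing in a fixed order and on values that contain no escaped quotes. If the judge reliably returns valid JSON, decoding the object directly is an alternative. The following is a minimal sketch, not the method used in this notebook: it assumes the response has the shape shown in the comment above (`"evaluation": { "Depth": { "score": ..., ... }, ... }`), and `parse_evaluation` is a hypothetical helper name.

```python
import json
import re

METRICS = ["Depth", "Originality", "Actionability", "Topic_Coverage"]
FIELDS = ["score", "strengths", "issues", "suggestions"]


def parse_evaluation(raw):
    """Return {metric: {field: value}} from a judge response, or {} on failure.

    Hypothetical helper: assumes the response contains one JSON object
    with an "evaluation" mapping keyed by metric name.
    """
    if not isinstance(raw, str):
        return {}
    # Grab the outermost {...} in case the model added text around the JSON
    m = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if not m:
        return {}
    try:
        payload = json.loads(m.group(0))
    except json.JSONDecodeError:
        return {}
    evaluation = payload.get("evaluation", {})
    if not isinstance(evaluation, dict):
        return {}
    parsed = {}
    for metric in METRICS:
        block = evaluation.get(metric)
        if isinstance(block, dict):
            # Missing fields come back as None instead of raising
            parsed[metric] = {field: block.get(field) for field in FIELDS}
    return parsed
```

Each parsed value could then be written to the matching `{agent}_{metric}_{field}_3` column. Responses that are not valid JSON return `{}`, so the regex path can stay as a fallback for those.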


### Measure scores (LLM metrics)

In [None]:
import numpy as np
import pandas as pd

# Identify all columns ending with "_score_3"
score3_cols = [col for col in df.columns if col.endswith("_score_3")]

averages_3 = {}

for col3 in score3_cols:
    # Derive the corresponding "_score_2" and "_score_1" column names
    col2 = col3.replace("_score_3", "_score_2")
    col1 = col3.replace("_score_3", "_score_1")

    # 1) Pull "_score_3" values, coerce to numeric, replace 0 with NaN
    s3 = pd.to_numeric(df[col3], errors="coerce").replace(0, np.nan)

    # 2) If "_score_2" exists, pull it similarly
    if col2 in df.columns:
        s2 = pd.to_numeric(df[col2], errors="coerce").replace(0, np.nan)
    else:
        s2 = pd.Series([np.nan] * len(df), index=df.index)

    # 3) If "_score_1" exists, pull it similarly
    if col1 in df.columns:
        s1 = pd.to_numeric(df[col1], errors="coerce").replace(0, np.nan)
    else:
        s1 = pd.Series([np.nan] * len(df), index=df.index)

    # 4) Fill NaNs in s3 from s2, then fill any remaining NaNs from s1
    pref = s3.fillna(s2).fillna(s1)

    # 5) Compute the mean ignoring NaNs
    averages_3[col3] = pref.mean()

# Convert to a pandas Series for readability
avg_series_3 = pd.Series(averages_3)
print("Average values for '_score_3' columns (with fallback to '_score_2' then '_score_1'):")
print(avg_series_3)


Average values for '_score_3' columns (with fallback to '_score_2' then '_score_1'):
Extractor_Depth_score_3             7.457711
Extractor_Originality_score_3       6.475124
Extractor_Actionability_score_3     8.296020
Extractor_Topic_Coverage_score_3    7.778607
Analyzer_Depth_score_3              7.708904
Analyzer_Originality_score_3        7.369863
Analyzer_Actionability_score_3      8.684932
Analyzer_Topic_Coverage_score_3     7.876712
Reviewer_Depth_score_3              7.919643
Reviewer_Originality_score_3        6.919643
Reviewer_Actionability_score_3      8.763393
Reviewer_Topic_Coverage_score_3     7.888393
Citation_Depth_score_3              8.556270
Citation_Originality_score_3        7.935691
Citation_Actionability_score_3      8.382637
Citation_Topic_Coverage_score_3     8.790997
dtype: float64
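
The judge prompt weights the four criteria at 20% (Depth), 20% (Originality), 30% (Actionability), and 30% (Topic Coverage) and asks for a 0–100 total. The notebook does not compute that total from the averages above. As a rough illustration only, the sketch below applies those weights to the printed Extractor means; `weighted_total` is a hypothetical helper, and because each mean may be taken over a different set of non-missing rows, the result is an approximation rather than a per-paper score.

```python
# Weights taken from the judge prompt; scores are on a 0-10 scale
WEIGHTS = {
    "Depth": 0.20,
    "Originality": 0.20,
    "Actionability": 0.30,
    "Topic_Coverage": 0.30,
}


def weighted_total(criterion_scores):
    """Combine 0-10 criterion scores into a 0-100 weighted total."""
    missing = set(WEIGHTS) - set(criterion_scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return 10 * sum(WEIGHTS[name] * criterion_scores[name] for name in WEIGHTS)


# Extractor averages copied from the output above
extractor_means = {
    "Depth": 7.457711,
    "Originality": 6.475124,
    "Actionability": 8.296020,
    "Topic_Coverage": 7.778607,
}
print(f"Extractor weighted total: {weighted_total(extractor_means):.2f}")
```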


### Master agent

In [None]:
COORDINATOR_PROMPT = '''
    You are a **Master Coordinator**, an expert in scientific communication and synthesis. Your task is to integrate limitations provided by four agents:
    1. **Extractor** (explicit limitations from the article),
    2. **Analyzer** (inferred limitations from critical analysis),
    3. **Reviewer** (limitations from an open review perspective),
    4. **Citation** (limitations based on cited papers).

    **Goals**:
    1. Combine all limitations into a cohesive, non-redundant list.
    2. Ensure each limitation is clearly stated, scientifically valid, and aligned with the article’s content.
    3. Prioritize author-stated limitations, supplementing with inferred, peer-review, or citation-based limitations if they add value.
    4. Resolve discrepancies between agents’ outputs by cross-referencing the article and cited papers, using tools to verify content.
    5. Format the final list in a clear, concise, and professional manner, suitable for a scientific review or report, with citations for external sources.

    **Workflow** (inspired by SYS_PROMPT_SWEBENCH):
    1. **Plan**: Outline how to synthesize limitations, identify potential redundancies, and resolve discrepancies.
    2. **Analyze**: Combine limitations, prioritizing author-stated ones, and verify alignment with the article.
    3. **Reflect**: Check for completeness, scientific rigor, and clarity; resolve discrepancies using article content or tools.
    4. **Continue**: Iterate until the list is comprehensive, non-redundant, and professionally formatted.

    **Output Format**:
    - Numbered list of final limitations.
    - For each: Clear statement, brief justification, and source in brackets (e.g., [Author-stated], [Inferred], [Peer-review-derived], [Cited-papers]).
    - Include citations for external sources (e.g., web/X posts, cited papers) in the format [Source Name](ID).
    **Tool Use**:
    - Use text extraction tools to verify article content.
    - Use citation lookup tools to cross-reference cited papers.
    - Use web/X search tools to resolve discrepancies involving external context.

    **Input**: '''


In [None]:
import pandas as pd

# Use the self-feedback (regenerated) output if it exists, otherwise the actual (original) one
def master_agent(extractor_text, analyzer_text, reviewer_text, citation_text):
    """
    Takes the outputs of the four specialized agents and produces
    the final coordinated limitations via a GPT call.
    """
    coord_prompt = (
        COORDINATOR_PROMPT
        + f"**Extractor Agent**:\n{extractor_text}\n\n"
        + f"**Analyzer Agent**:\n{analyzer_text}\n\n"
        + f"**Reviewer Agent**:\n{reviewer_text}\n\n"
        + f"**Citation Agent**:\n{citation_text}\n\n"
        + "Produce a single, numbered list of final limitations, noting each source in brackets."
    )
    return run_critic(coord_prompt)

In [None]:
import numpy as np
import pandas as pd

# Add a column to store the master agent’s output
df["final_limitations"] = ""

for idx, row in df.iterrows():
    # 1) Choose the extractor text: regenerated if available, otherwise original
    if pd.notna(row["Extractor_regenerated"]):
        extractor_text = row["Extractor_regenerated"]
    else:
        extractor_text = row["extractor_agent"]

    # 2) Analyzer text
    if pd.notna(row["Analyzer_regenerated"]):
        analyzer_text = row["Analyzer_regenerated"]
    else:
        analyzer_text = row["analyzer_agent"]

    # 3) Reviewer text
    if pd.notna(row["Reviewer_regenerated"]):
        reviewer_text = row["Reviewer_regenerated"]
    else:
        reviewer_text = row["reviewer_agent"]

    # 4) Citation text
    if pd.notna(row["Citation_regenerated"]):
        citation_text = row["Citation_regenerated"]
    else:
        citation_text = row["citation_agent"]

    # 5) Call the master_agent with the chosen texts
    final = master_agent(
        extractor_text=extractor_text,
        analyzer_text=analyzer_text,
        reviewer_text=reviewer_text,
        citation_text=citation_text
    )

    # 6) Store the result back into df
    df.at[idx, "final_limitations"] = final
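
The four if/else blocks above repeat one rule: use the regenerated text when it is present, otherwise fall back to the original agent output. A minimal sketch of that rule as a single helper follows. `is_missing` and `choose_text` are hypothetical names, and the column naming (`<Agent>_regenerated`, `<agent>_agent`) is taken from the loop above.

```python
import math


def is_missing(value):
    """True for None and float NaN, the usual 'empty cell' markers."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def choose_text(row, agent, regenerated_suffix="_regenerated"):
    """Return the regenerated text for `agent` if present, else the original.

    `row` can be a dict or a pandas Series; both support .get() and [].
    """
    regenerated = row.get(f"{agent}{regenerated_suffix}")
    if not is_missing(regenerated):
        return regenerated
    return row[f"{agent.lower()}_agent"]


# Example with a plain dict standing in for one DataFrame row
example_row = {
    "Extractor_regenerated": float("nan"),
    "extractor_agent": "original extractor output",
    "Analyzer_regenerated": "regenerated analyzer output",
    "analyzer_agent": "original analyzer output",
}
print(choose_text(example_row, "Extractor"))
print(choose_text(example_row, "Analyzer"))
```

Passing a different `regenerated_suffix` (for example `"_regenerated_2"`, the column written earlier in this notebook) would select the second-round output instead. Whether that is the intended input for the master agent is not stated here.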


### Evaluations (measuring Ground Truth coverage)

In [None]:
# Split each generated text on '**' and keep the parts outside the bold markers
def process_single_limitation(limitation_text: str) -> list[str]:
    """
    Split the text on '**' and return the non-empty segments at even
    positions, i.e. the text that lies outside each '**...**' pair.
    """
    parts = limitation_text.split("**")
    # parts at even indices (0,2,4,…) lie outside the '**…**' pairs
    prev_texts = [
        part.strip()
        for idx, part in enumerate(parts)
        if idx % 2 == 0    # even indices
           and part.strip()  # non-empty
    ]
    return prev_texts

# Apply to your DataFrame column
df_generated_limitations_1["final"] = (
    df_generated_limitations_1["final"]
    .apply(process_single_limitation)
)
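
To make the effect of the `'**'` split concrete, the sketch below runs the same logic on a made-up input (the sample text is hypothetical, not taken from the dataset). It shows that the bold titles are dropped and that each item's leading number ends up at the tail of the previous segment.

```python
def split_outside_bold(limitation_text: str) -> list[str]:
    """Same logic as process_single_limitation above, repeated here
    so the example is self-contained."""
    parts = limitation_text.split("**")
    return [
        part.strip()
        for idx, part in enumerate(parts)
        if idx % 2 == 0 and part.strip()
    ]


# Hypothetical model output with two bold-titled limitations
sample = (
    "Intro line\n"
    "1. **Small sample**: Only 20 papers were used.\n"
    "2. **Single domain**: NLP only."
)

segments = split_outside_bold(sample)
for segment in segments:
    print(repr(segment))
# 'Intro line\n1.'
# ': Only 20 papers were used.\n2.'
# ': NLP only.'
```

The leading `':'` on each body segment and the stray first segment are the artifacts that the later cells deal with (dropping the first sublist and stripping an `"N. :"` prefix during renumbering).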


In [None]:
import ast
import pandas as pd

def enumerate_and_filter(cell):
    """
    Given a cell that is either:
      - A Python list of strings, or
      - A string repr of such a list,
    this will:
      1. turn it into a list of sublists,
      2. remove any sublist equal to ['-'],
      3. prefix each remaining sublist's string with its 1-based index,
      4. return a new list-of-lists.
    """
    # 1) Parse string repr if necessary
    if isinstance(cell, str):
        try:
            lst = ast.literal_eval(cell)
        except Exception:
            # not a literal list → treat the entire cell as one string
            lst = [cell]
    else:
        lst = cell

    # 2) Ensure list-of-lists
    lol = []
    for item in lst:
        if isinstance(item, list):
            lol.append(item)
        else:
            # assume it's a bare string
            lol.append([str(item)])

    # 3) Filter out ['-'] sublists
    filtered = [sub for sub in lol if not (len(sub) == 1 and sub[0].strip() == "-")]

    # 4) Enumerate: prefix each sublist’s only element with "i. "
    enumerated = [[f"{i+1}. {sub[0]}"] for i, sub in enumerate(filtered)]

    return enumerated

# Example usage on your DataFrame
df_generated_limitations_2['generated_limitations_1'] = (df_generated_limitations_2['generated_limitations_1'].apply(enumerate_and_filter))

# Remove the first sublist in each list-of-lists
df_generated_limitations_2['generated_limitations_1'] = (df_generated_limitations_2['generated_limitations_1']
    .apply(lambda lol: lol[1:] if isinstance(lol, list) and len(lol) > 0 else [])
)

In [None]:
def process_single_limitation(limitation_text):
    # Split into different limitations (separated by \n\n)
    limitations = limitation_text.split('\n\n')
    processed_limitations = []

    for limitation in limitations:
        # Remove numbering (e.g., "1. **Limited Literature Review**" → "**Limited Literature Review**")
        cleaned_limitation = limitation.split('. ', 1)[-1] if '. ' in limitation else limitation

        # Split into sentences (using '.')
        sentences = [s.strip() for s in cleaned_limitation.split('.') if s.strip()]

        if sentences:
            processed_limitations.append(sentences)

    return processed_limitations

# df_generated_limitations_2['generated_limitations_1'] = df_generated_limitations_2['generated_limitations_1'].apply(process_single_limitation)
df_lim['Lim_and_OR_ground_truth_final'] = df_lim['Lim_and_OR_ground_truth_final'].apply(process_single_limitation)

In [None]:
# add numbering of LLM generated limitations
def add_numbering_to_limitations(list_of_lists):
    if not isinstance(list_of_lists, list):
        return list_of_lists  # Skip if not a list

    numbered_list = []
    for idx, sublist in enumerate(list_of_lists, start=1):
        if sublist:  # Ensure sublist is not empty
            # Add numbering to the first element of the sublist
            numbered_sublist = [f"{idx}. {sublist[0]}"] + sublist[1:]
            numbered_list.append(numbered_sublist)
    return numbered_list

# Apply to the column (modifies existing column)
df_lim['Lim_and_OR_ground_truth_final'] = df_lim['Lim_and_OR_ground_truth_final'].apply(add_numbering_to_limitations)
# df_generated_limitations_2['generated_limitations_1']  = df_generated_limitations_2['generated_limitations_1'] .apply(add_numbering_to_limitations)

In [None]:
def remove_empty_limitation_entries(entries):
    """
    Remove sublists where the only element indicates
    'does not mention any limitations' (case-insensitive).
    """
    if not isinstance(entries, list):
        return entries

    filtered = []
    target_phrase = "does not mention any limitations"
    for sublist in entries:
        if isinstance(sublist, list) and len(sublist) == 1:
            text = sublist[0].lower()
            # skip any sublist whose sole item matches the target phrase
            if target_phrase in text:
                continue
        filtered.append(sublist)
    return filtered

# Apply to your DataFrame column
df_generated_limitations_2["generated_limitations_1"] = (
    df_generated_limitations_2["generated_limitations_1"]
    .apply(remove_empty_limitation_entries)
)


In [None]:
import re

def remove_source_entries(entries):
    """
    Remove sublists where any string in the sublist contains any of the specified keywords
    (case-insensitive comparison).
    """
    if not isinstance(entries, list):
        return entries  # Return non-list entries as-is

    # keywords = ['author-stated', 'inferred', 'peer-review-derived']  # Lowercase for matching

    keywords = [
        'the authors did not outline any limitations',
        'the article does not explicitly',
        'the authors do not provide any specific limitations'
    ]

    filtered = []
    for sublist in entries:
        if isinstance(sublist, list):
            # Convert entire sublist to lowercase string for case-insensitive search
            sublist_text = ' '.join(map(str, sublist)).lower()
            # Check if NONE of the keywords are in the sublist text
            if not any(keyword in sublist_text for keyword in keywords):
                filtered.append(sublist)
        else:
            filtered.append(sublist)  # Keep non-list items

    return filtered

# Apply to DataFrame column
df_generated_limitations_2["generated_limitations_1"] = df_generated_limitations_2["generated_limitations_1"].apply(
    lambda lst: remove_source_entries(lst) if isinstance(lst, list) else lst
)

In [None]:
# organize numbers
def renumber_limitations(limitations_list):
    """
    Reorganizes numbered limitations to be sequential (1, 2, 3, ...)
    while preserving all other content.
    """
    if not isinstance(limitations_list, list):
        return limitations_list

    renumbered = []
    for i, sublist in enumerate(limitations_list, start=1):
        if isinstance(sublist, list) and len(sublist) > 0:
            # Process the first item in the sublist (where the number appears)
            first_item = sublist[0]

            # Remove existing numbering (e.g., "2. :" → ":")
            content = re.sub(r'^\d+\.\s*:\s*', '', first_item)

            # Add new numbering
            renumbered_item = f"{i}. : {content}"

            # Reconstruct the sublist with the renumbered first item
            new_sublist = [renumbered_item] + sublist[1:]
            renumbered.append(new_sublist)
        else:
            renumbered.append(sublist)  # Keep non-list or empty entries

    return renumbered

# Apply to the DataFrame column
df_generated_limitations_2["generated_limitations_1"] = df_generated_limitations_2["generated_limitations_1"].apply(
    lambda lst: renumber_limitations(lst) if isinstance(lst, list) else lst
)

In [None]:
# remove future work
import re

def remove_future_entries(entries):
    """
    Given a list of lists (where each sub-list contains strings),
    return a new list omitting sublists where the first item contains
    'future' inside **double asterisks** (case-insensitive).
    """
    filtered = []
    for sublist in entries:
        if isinstance(sublist, list) and len(sublist) > 0:
            first_item = sublist[0]
            # Check if 'future' appears inside **...** (case-insensitive)
            if not re.search(r'\*\*.*future.*\*\*', first_item, re.IGNORECASE):
                filtered.append(sublist)
        else:
            filtered.append(sublist)  # Keep non-list entries as-is
    return filtered

# Apply to the DataFrame column
df_lim["Lim_and_OR_ground_truth_final"] = df_lim["Lim_and_OR_ground_truth_final"].apply(
    lambda lst: remove_future_entries(lst) if isinstance(lst, list) else lst
)

def remove_here_are_the(entries):
    """
    Given a list of lists (where each sub-list contains strings),
    return a new list omitting any sub-list that starts with "1. Here are the"
    (case-insensitive and whitespace-tolerant).
    """
    filtered = []
    for sublist in entries:
        if isinstance(sublist, list) and len(sublist) > 0:
            first_item = sublist[0].strip().lower()  # Clean whitespace and lowercase
            if not first_item.startswith("1. here are the"):
                filtered.append(sublist)
        else:
            filtered.append(sublist)  # Keep non-list entries as-is
    return filtered

# Apply to every row in the DataFrame
df_lim["Lim_and_OR_ground_truth_final"] = df_lim["Lim_and_OR_ground_truth_final"].apply(
    lambda lst: remove_here_are_the(lst) if isinstance(lst, list) else lst
)

In [None]:
df_lim['combined3'] = [[] for _ in range(len(df_lim))]

# Generate combinations for each row (the skip for indices 153 and 385 is currently disabled)
for i in range(len(df_lim)):
    # if i in {153, 385}:  # Skip these indices
    #     continue  # Leaves df_lim['combined3'][i] as an empty list

    combined_list = []
    list1 = df_lim["Lim_and_OR_ground_truth_final"][i]
    list2 = df_generated_limitations_2["generated_limitations_1"][i]

    # Generate all possible combinations
    for item1 in list1:
        for item2 in list2:
            combined_list.append((item1, item2))

    # Store the first 100 combinations (or all if fewer)
    df_lim.at[i, 'combined3'] = combined_list[:100]  # Truncate if needed
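
The nested loop above builds the Cartesian product of the two lists and keeps the first 100 pairs. A minimal equivalent sketch using `itertools` is shown below; `pair_up` is a hypothetical helper name, and the pair order matches the nested loop (ground-truth item outer, generated item inner).

```python
from itertools import islice, product


def pair_up(ground_truth, generated, limit=100):
    """Return up to `limit` (ground_truth_item, generated_item) pairs,
    in the same order as two nested for-loops."""
    return list(islice(product(ground_truth, generated), limit))


# Small made-up lists standing in for one paper
gt_items = [["1. gt A"], ["2. gt B"]]
gen_items = [["1. : gen X"], ["2. : gen Y"], ["3. : gen Z"]]

pairs = pair_up(gt_items, gen_items)
print(len(pairs))   # 2 * 3 pairs
print(pairs[0])     # first ground-truth item with first generated item
```

Because the cap is applied in ground-truth-major order, a paper whose product exceeds 100 pairs loses the pairs for its last ground-truth items first, so those items cannot be marked as covered.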

In [None]:
import time
import os
from openai import OpenAI

start_time = time.time()
os.environ['OPENAI_API_KEY'] = ''
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)

all_generated_summary = []

for i in range(len(df_lim)):
    # Optional skip for rows 153 and 385 (currently disabled)
    # if i in [153, 385]:
    #     all_generated_summary.append([])  # Add empty list for these indices
    #     continue

    generated_summary = []
    for description1, description2 in df_lim['combined3'][i]: # df_lim['combined3']
        prompt = '''Check whether 'list2' contains a topic or limitation from 'list1' or 'list1' contains a topic or limitation from 'list2'.
        Your answer should be "Yes" or "No".\nList 1: ''' + str(description1) + "\nList 2: " + str(description2)
        summary_text = ""
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            stream=True,
            temperature=0
        )

        for chunk in stream:
            summary_chunk = chunk.choices[0].delta.content or ""
            summary_text += summary_chunk

        summary_chunks = []
        summary_chunks.append(summary_text)
        generated_summary.append((summary_chunks, "list1", description1, "list2", description2))

    all_generated_summary.append(generated_summary)
    time.sleep(1)

end_time = time.time()
print(f"Total runtime: {end_time - start_time:.2f} seconds")
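
The coverage cells that follow compare the stored reply with the exact string `'Yes'`, so a reply such as `"Yes."` or `"yes"` would not count as a match. One way to guard against that is to normalize each reply before storing it. The sketch below is an assumption about how the replies might vary, not something observed in this run; `normalize_verdict` is a hypothetical helper.

```python
import re


def normalize_verdict(answer):
    """Map a free-form model reply to 'Yes', 'No', or None when unclear."""
    if not isinstance(answer, str):
        return None
    # Allow leading quotes/punctuation, then require a whole word yes/no
    m = re.match(r"\W*(yes|no)\b", answer.strip(), flags=re.IGNORECASE)
    if not m:
        return None
    return m.group(1).capitalize()


for reply in ["Yes", "yes.", '"No"', "No, they differ.", "Not sure", ""]:
    print(repr(reply), "->", normalize_verdict(reply))
```

Replies that map to `None` can be inspected by hand instead of being silently treated as "No".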

In [None]:
# ground truth coverage
data = []
row_num = 1  # Start row_num from 1, increment for each sublist

# Flatten the per-paper verdicts stored in all_generated_summary
for sublist in all_generated_summary:
    for is_match, list1_label, ground_truth, list2_label, llm_generated in sublist:
        # Each tuple is ([verdict_text], "list1", ground_truth, "list2", llm_generated)
        # Append data to list as a dictionary to maintain column order
        data.append({
            'row_num': row_num,
            'is_match': is_match[0],
            'ground_truth': ground_truth,
            'llm_generated': llm_generated
        })
    row_num += 1  # Increment row_num for each new sublist

# Create DataFrame from the list of dictionaries
df4 = pd.DataFrame(data)


import re

# Update the function to handle lists in each row
def extract_first_number_from_list(row):
    for text in row:  # Iterate through each string in the list
        match = re.match(r'^(\d+)', text)
        if match:
            return int(match.group(1))
    return None  # Return None if no number is found

# Apply the updated function to the 'ground_truth' column
df4['section'] = df4['ground_truth'].apply(extract_first_number_from_list)

# Initialize variables
current_section = None
section_has_yes = False
ck = 0

# Iterate through the DataFrame
for index, row in df4.iterrows():
    # Check if we are still in the same section
    if row['section'] == current_section:
        # Check if there is a 'Yes' in 'is_match'
        if row['is_match'] == 'Yes':
            section_has_yes = True
    else:
        # We've reached a new section, check if the last section had a 'Yes'
        if section_has_yes:
            ck += 1
        # Reset for new section
        current_section = row['section']
        section_has_yes = (row['is_match'] == 'Yes')

# Check the last section after exiting the loop
if section_has_yes:
    ck += 1
print(ck)

# total number of unique ground truth

# Calculate consecutive blocks where 'ground_truth' is the same
unique_blocks = df4['ground_truth'].ne(df4['ground_truth'].shift()).cumsum()

# Group by these blocks and count each group
group_counts = df4.groupby(unique_blocks)['ground_truth'].agg(['count'])

# Output the results
print("Number of unique consecutive 'ground_truth' texts and their counts:")
print(group_counts)
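
The counting loop above starts a new group only when `section` changes. If one paper ends on ground-truth item 1 and the next paper also starts at item 1, the two appear to be merged into a single group. Grouping on the pair `(row_num, section)` avoids that. The following is a minimal sketch in plain Python, assuming records with the same keys as `df4`; `count_covered` is a hypothetical helper.

```python
from collections import defaultdict


def count_covered(records, group_key):
    """Return (groups_with_a_yes, total_groups), where a group is one
    (row_num, group_key) pair, e.g. one ground-truth item of one paper."""
    has_yes = defaultdict(bool)
    for rec in records:
        key = (rec["row_num"], rec[group_key])
        has_yes[key] = has_yes[key] or rec["is_match"] == "Yes"
    return sum(has_yes.values()), len(has_yes)


# Made-up verdicts: paper 1 has two ground-truth items, paper 2 has one
records = [
    {"row_num": 1, "section": 1, "is_match": "Yes"},
    {"row_num": 1, "section": 2, "is_match": "No"},
    {"row_num": 2, "section": 1, "is_match": "Yes"},
]

covered, total = count_covered(records, "section")
print(f"{covered} of {total} ground-truth items covered")
```

With a DataFrame, the records can be obtained via `df4.to_dict("records")`. Passing `group_key="order"` gives the corresponding count on the LLM-generated side once that column exists.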


In [None]:
# LLM generated coverage

def extract_first_number(text):
    import re
    # Check if the input is a list
    if isinstance(text, list):
        # Join the list elements into a single string
        text = " ".join(text)
    # Use regex to extract the first number
    match = re.match(r'^(\d+)', text)
    return int(match.group(1)) if match else None

# Apply the updated function to extract numbers
df4['order'] = df4['llm_generated'].apply(extract_first_number)

# Sort the DataFrame by 'row_num' and then by the extracted order
df_recall = df4.sort_values(by=['row_num', 'order'])

# Reset index for clean indices in the new DataFrame
df_recall = df_recall.reset_index(drop=True)

# Reorder the columns by placing 'llm_generated' before 'ground_truth'
df_recall = df_recall[['row_num', 'is_match', 'llm_generated', 'ground_truth', 'section', 'order']]

# Initialize variables
current_order = None
group_has_yes = False
ck = 0  # Count of order groups with at least one 'Yes'

# Iterate through the DataFrame
for index, row in df_recall.iterrows():
    # Check if we're still in the same order group
    if row['order'] == current_order:
        # Check if current row has 'Yes'
        if row['is_match'] == 'Yes':
            group_has_yes = True
    else:
        # We've entered a new order group
        # First check if previous group had any 'Yes'
        if group_has_yes:
            ck += 1
        # Reset for new order group
        current_order = row['order']
        group_has_yes = (row['is_match'] == 'Yes')

# Check the last group after loop ends
if group_has_yes:
    ck += 1

print("Number of order groups with at least one 'Yes':", ck)

# total number of unique LLM-generated limitations

# Calculate consecutive blocks where 'llm_generated' is the same
unique_blocks = df_recall['llm_generated'].ne(df_recall['llm_generated'].shift()).cumsum()

# Group by these blocks and count each group
group_counts = df_recall.groupby(unique_blocks)['llm_generated'].agg(['count'])

# Output the results
print("Number of unique consecutive 'llm_generated' texts and their counts:")
print(group_counts)