Gap Analysis Agent#86

Merged
NISH1001 merged 39 commits into develop from feature/gap-agent on Sep 4, 2025

@movinam movinam commented Jul 28, 2025

Summary:

This PR addresses Issues #54 and #22. It introduces the gap agent, which models scientific literature as a knowledge graph in order to analyse gaps in the content found via the Search Tool.

Details:

  • Paper details are fetched via Semantic Scholar.
  • Sections are grouped and classified under key section categories.
  • A knowledge graph is created from each paper's metadata and section data.
  • For a given pre-defined or user-defined gap, the graph is traversed to select appropriate nodes.
  • Local answers are generated per node for the selected gap.
  • The local answers are passed to a final writer, which produces the final output.
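The graph-building and traversal steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the PR appears to use networkx (a later review pass mentions networkx Graph serialization), whereas this sketch uses plain dictionaries to stay dependency-free. The names `Node`, `LiteratureGraph`, `select_sections`, and `GAP_TO_CATEGORIES`, as well as the sample papers, are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    kind: str  # "paper" or "section"
    attrs: dict = field(default_factory=dict)


class LiteratureGraph:
    """Tiny undirected graph: each paper node is linked to its section nodes."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(set)

    def add_edge(self, src, dst):
        self.edges[src].add(dst)
        self.edges[dst].add(src)

    def add_paper(self, paper_id, title, sections):
        """`sections` maps a section category (e.g. "methods") to its text."""
        self.nodes[paper_id] = Node(paper_id, "paper", {"title": title})
        for category, text in sections.items():
            section_id = f"{paper_id}:{category}"
            self.nodes[section_id] = Node(
                section_id, "section", {"category": category, "text": text}
            )
            self.add_edge(paper_id, section_id)

    def select_sections(self, categories):
        """Walk from each paper to its sections, keeping the requested categories."""
        selected = []
        for node_id in sorted(self.nodes):
            if self.nodes[node_id].kind != "paper":
                continue
            for neighbour in sorted(self.edges[node_id]):
                if self.nodes[neighbour].attrs.get("category") in categories:
                    selected.append(neighbour)
        return selected


# Hypothetical mapping from a gap name to the section categories worth reading.
GAP_TO_CATEGORIES = {"evidence": {"results", "discussion"}}

graph = LiteratureGraph()
# Made-up papers and section text, for illustration only.
graph.add_paper("p1", "Paper one", {"methods": "How it was done.", "results": "What was found."})
graph.add_paper("p2", "Paper two", {"introduction": "Background.", "discussion": "What it means."})

selected = graph.select_sections(GAP_TO_CATEGORIES["evidence"])
print(selected)
```

In the real agent, each selected node would then be handed to an LLM to produce a local answer, and the local answers would be combined by the final writer; those steps are omitted here.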

Usage:

from akd.agents.gap_analysis.gap_analysis import GapAgent, GapAgentConfig, GapInputSchema
from akd.tools.scrapers import DoclingScraperConfig
from akd.tools.search import (
    SearxNGSearchTool,
    SearxNGSearchToolInputSchema,
    SemanticScholarSearchToolConfig,
)

docling_config = DoclingScraperConfig(
    do_table_structure=True,
    pdf_mode="accurate",
    export_type="html",
    debug=False,
)
s2_config = SemanticScholarSearchToolConfig(
    api_key="",
    debug=False,
    external_id="ARXIV",
    fields=["paperId", "title", "externalIds", "isOpenAccess", "openAccessPdf"],
)

# Search
query = 'methods and estimation to map landslides in Nepal'
search_tool = SearxNGSearchTool.from_params(engines=['arxiv'], debug=True)
# Category has to be set to None to make the tool adhere to the engines
search_output = await search_tool.arun(SearxNGSearchToolInputSchema(queries=[query], category=None, max_results=5))

gap_agent_config = GapAgentConfig(docling_config=docling_config,
                                  s2_tool_config=s2_config,
                                  model_name='gpt-4o-mini', api_key="",
                                  debug=True)
gap_agent = GapAgent(gap_agent_config)


# for pre-defined gaps
gap_output_1 = await gap_agent.arun(GapInputSchema(search_results=search_output.results, gap="evidence"))
# for user-defined gaps
defined_gap = "Are there findings that have not been replicated or independently validated across different studies?"
gap_output_2 = await gap_agent.arun(GapInputSchema(search_results=search_output.results, gap=defined_gap))

# For generated output
output = gap_output_1.output
# For generated answers per selected node
attributed_source_answers = gap_output_1.attributed_source_answers
# For the graph created from the ingested papers
G = gap_output_1.G
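
The usage snippet above calls `await` at the top level, which works in a notebook or an asyncio REPL but not in a plain Python script. A minimal sketch of how the calls could be wrapped for a script is shown below. `StubGapAgent` is a hypothetical stand-in so that the sketch runs without akd installed; in practice the search and `GapAgent` calls from the usage snippet would go inside `main()`.

```python
import asyncio


class StubGapAgent:
    """Hypothetical stand-in for GapAgent; it does no real work."""

    async def arun(self, gap):
        await asyncio.sleep(0)  # placeholder for the real network and LLM calls
        return f"analysis for gap: {gap}"


async def main():
    agent = StubGapAgent()
    # In a real script, build the configs, run the search tool, and call
    # gap_agent.arun(GapInputSchema(...)) here instead of the stub.
    return await agent.arun("evidence")


# asyncio.run creates an event loop, runs main() to completion, and closes the loop.
result = asyncio.run(main())
print(result)
```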

Limitations:

  • The section grouper and classifier rely on content parsed as HTML. Support for the markdown format still needs to be added.
  • The agent is restricted to Docling so that it can use its export_to_html functionality. This will be replaced with the composite parser once markdown support is added to the section grouper.
  • It is recommended to run the agent with the search engine set to arXiv so that PDFs are available, since the code relies on each SearchResultItem's pdf_url having a value.
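
Because the agent depends on `pdf_url` being set, one option is to filter the search results before passing them to the agent. The sketch below is only an illustration of that idea: `ResultStub` is a hypothetical stand-in for the real search result type (only `pdf_url` is taken from the description above), and `split_by_pdf` is not part of akd.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResultStub:
    """Hypothetical stand-in for a search result item; only the fields used here."""

    title: str
    pdf_url: Optional[str] = None


def split_by_pdf(results):
    """Separate results that carry a usable pdf_url from those that do not."""
    with_pdf = [r for r in results if r.pdf_url]
    without_pdf = [r for r in results if not r.pdf_url]
    return with_pdf, without_pdf


# Made-up results, for illustration only.
results = [
    ResultStub("Paper A", "https://example.org/a.pdf"),
    ResultStub("Paper B"),
    ResultStub("Paper C", ""),
]
usable, skipped = split_by_pdf(results)
print([r.title for r in usable], [r.title for r in skipped])
```

Logging the skipped titles would make it visible when a non-arXiv engine returns results the agent cannot ingest.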

NISH1001 commented

@movinam thanks for the PR. Will look at it sometime today and test as well.

Comment thread akd/agents/gap_analysis/gap_analysis.py Outdated

@NISH1001 NISH1001 left a comment

Pass 1 comments

Comment thread akd/agents/gap_analysis/parsing_utils.py
Comment thread akd/configs/gap_config.py Outdated
@movinam movinam marked this pull request as ready for review August 5, 2025 12:14

github-actions Bot commented Sep 3, 2025

❌ Tests failed (exit code: 1)

📊 Test Results

  • Passed: 219
  • Failed: 6
  • Warnings: 110
  • Coverage: 70%

Branch: feature/gap-agent
PR: #86
Commit: 79e9dd71d7f7a7b5ec2fe82143d15c4bcc93215d

📋 Full coverage report and logs are available in the workflow run.

Comment thread akd/agents/gap_analysis/gap_analysis.py Outdated
Comment thread akd/agents/gap_analysis/gap_analysis.py Outdated

github-actions Bot commented Sep 3, 2025

❌ Tests failed (exit code: 1)

📊 Test Results

  • Passed: 219
  • Failed: 6
  • Warnings: 110
  • Coverage: 70%

Branch: feature/gap-agent
PR: #86
Commit: aa2aaae4839eedafb4609a435f995a821c70abd6

📋 Full coverage report and logs are available in the workflow run.

Comment thread akd/agents/gap_analysis/gap_analysis.py Outdated

@NISH1001 NISH1001 left a comment

On networkx Graph serialization

Comment thread akd/agents/gap_analysis/gap_analysis.py

github-actions Bot commented Sep 3, 2025

❌ Tests failed (exit code: 1)

📊 Test Results

  • Passed: 219
  • Failed: 6
  • Warnings: 110
  • Coverage: 70%

Branch: feature/gap-agent
PR: #86
Commit: 39d63b6a8b62baad1e7ca0dcf3026ccc568abc01

📋 Full coverage report and logs are available in the workflow run.

Comment thread akd/agents/gap_analysis/gap_analysis.py Outdated
Comment thread akd/agents/gap_analysis/gap_analysis.py
Comment thread akd/agents/gap_analysis/prompts.py Outdated
Comment thread akd/agents/gap_analysis/prompts.py
Comment thread akd/agents/gap_analysis/prompts.py Outdated
Comment thread akd/agents/gap_analysis/prompts.py
Comment thread akd/agents/gap_analysis/graph_utils.py Outdated

@NISH1001 NISH1001 left a comment

Review: Pass II

@NISH1001 NISH1001 left a comment

Pass III

Comment thread akd/agents/gap_analysis/gap_analysis.py

NISH1001 commented Sep 3, 2025

@movinam Also, one more thing:
Can we get some runtime stats for the gap agent, say x seconds per 5 papers? That will help us decide where the gap agent fits in the akd workflow. If it is fast, we could potentially integrate it into every deep search iteration. If it is slow, we need to rethink it as a one-time process that runs in the background on the backend.

Any timing stats will help us refine the scope.

movinam commented Sep 3, 2025

Runtime depends on the number of search results, the length of the papers, the Docling configuration, and how long Docling takes to process a batch of papers.

Using the default settings, it takes approximately 3-5 minutes for 5 papers and 5-7 minutes for 10 papers. We can do a better analysis per domain when we benchmark the agent.

To integrate with deep search, I think we would have to restrict the gap agent to just the abstracts. I can add that as an enhancement in a future PR.
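
For the benchmarking mentioned above, a small timing harness along these lines could collect per-paper figures. This is a sketch under stated assumptions: `fake_gap_run` is a hypothetical placeholder that only sleeps, and `timed` and `benchmark` are illustrative helpers, not part of akd. In a real benchmark the placeholder would be replaced by a call to the gap agent with search results of the given size.

```python
import asyncio
import time


async def timed(coro):
    """Await a coroutine and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start


async def fake_gap_run(n_papers):
    """Hypothetical placeholder for gap_agent.arun(...); it only sleeps."""
    await asyncio.sleep(0.01 * n_papers)
    return n_papers


async def benchmark(sizes):
    """Return seconds per paper for each batch size."""
    stats = {}
    for n in sizes:
        _, elapsed = await timed(fake_gap_run(n))
        stats[n] = elapsed / n
    return stats


stats = asyncio.run(benchmark([5, 10]))
for n, per_paper in stats.items():
    print(f"{n} papers: {per_paper:.3f} s per paper")
```

Timing the scraping, graph construction, and answer generation stages separately would show which stage dominates the runtime.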

github-actions Bot commented Sep 3, 2025

❌ Tests failed (exit code: 1)

📊 Test Results

  • Passed: 219
  • Failed: 6
  • Warnings: 112
  • Coverage: 70%

Branch: feature/gap-agent
PR: #86
Commit: be495d5295b70ccf317d50ddec3bde9b172f0f6f

📋 Full coverage report and logs are available in the workflow run.

NISH1001 commented Sep 4, 2025

I think running the gap agent only at the abstract level would defeat the purpose and would not justify the gaps. We probably need:

  1. A lighter scraper to read PDFs (something other than a VLM, maybe the traditional fitz way).
  2. A reducer/summarizer that compresses the paper's context nicely. I presume the full text is also causing the large runtime. The summarizer would just do compression, probably producing a summary of each section. The abstract might not capture the nuances needed for detailed gap analysis.

This is a topic of discussion for another thread though.
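
The per-section reducer idea could look roughly like the sketch below. It is only an illustration of the shape of such a step: it keeps the first few sentences of each section as a crude extractive stand-in for a real summarizer, and `compress_section` and `compress_paper` are hypothetical names. The lighter PDF scraper is not shown.

```python
import re


def compress_section(text, max_sentences=2):
    """Keep the first few sentences of a section (naive extractive compression)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])


def compress_paper(sections, max_sentences=2):
    """Compress every section of a paper while keeping the section names."""
    return {
        name: compress_section(body, max_sentences)
        for name, body in sections.items()
    }


# Made-up section text, for illustration only.
paper = {
    "methods": "First sentence. Second sentence. Third sentence.",
    "results": "Only sentence.",
}
compressed = compress_paper(paper)
print(compressed)
```

Keeping the section names means the compressed text can still feed the section-based knowledge graph, while shrinking the context passed to the model.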

@NISH1001 NISH1001 merged commit e9b08e4 into develop Sep 4, 2025
2 checks passed
@NISH1001 NISH1001 deleted the feature/gap-agent branch September 4, 2025 02:57

Development

Successfully merging this pull request may close these issues.

  • Automatic Knowledge Graph Creation from Literature for Gap Identification
  • Identify Research Gaps existing in the literature