# Automated Literature Review Agent

**Capstone Project: Google 5-Day AI Agents Intensive**

This project implements an AI-driven multi-agent system that:

1. Asks the user for a research topic  
2. Searches top computational biology journals  
3. Retrieves the latest relevant papers  
4. Summarizes them using Google Gemini  
5. Converts summaries into *blog-ready articles*  

The system simulates a research assistant capable of literature review, scientific summarization, and blog article generation.


# Section 1: Installation and Configuration

In [18]:
## Install Dependencies if needed

!pip install google-generativeai python-dotenv fpdf requests beautifulsoup4 ddgs
print("✅ Dependencies installed.")

✅ Dependencies installed.


In [28]:
# 1 - Configuration
import os
import re
import requests
import google.generativeai as genai
from ddgs import DDGS
from IPython.display import Markdown
import json
from bs4 import BeautifulSoup
from typing import Dict, List
print("✅ Configuration complete.")

✅ Configuration complete.


In [20]:
from kaggle_secrets import UserSecretsClient

try:
    GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
    os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
    print("✅ Setup and authentication complete.")
except Exception as e:
    print(
        f"🔑 Authentication Error: Please make sure you have added 'GOOGLE_API_KEY' to your Kaggle secrets. Details: {e}"
    )

✅ Setup and authentication complete.


In [21]:
if "GOOGLE_API_KEY" not in os.environ:
    raise ValueError("❌ GOOGLE_API_KEY missing. Make sure secrets were loaded.")

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
print("✅ Gemini API configured successfully!")

✅ Gemini API configured successfully!


# Section 2: System Architecture

### 1. **User Interaction Agent**
- Prompts the user: *"What subject are you interested in?"*  
- Sends the selected topic to the Search Agent.  
- Initiates the end-to-end workflow.

---

### 2. **Search Agent**
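Searches open-access computational biology and public-health journals. Candidate sources include:

- PLOS Computational Biology  
- eLife  
- bioRxiv – Computational Biology  
- Genome Research  
- Bioinformatics  
- Nature Biotechnology (when open-access)

The `TOP_JOURNALS` list in Section 3 currently enables only PLOS Computational Biology, Bioinformatics, and eLife.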

It generates queries like:

> "infectious disease modeling"  
> "antimicrobial resistance transmission"

Returns candidate papers with titles, URLs, and metadata.
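
The query generation described above can be sketched as follows. This is a minimal illustration, not the notebook's exact code: `JOURNALS`, `build_queries`, and `dedupe_by_link` are hypothetical names, and the de-duplication step is an assumption added here because the same URL can come back from more than one query.

```python
from typing import Dict, List

# Hypothetical subset of journals, mirroring the style of TOP_JOURNALS.
JOURNALS = [
    "journals.plos.org/ploscompbiol",
    "elifesciences.org",
]


def build_queries(subject: str, journals: List[str]) -> List[str]:
    """Build one site-restricted search query per journal."""
    return [f"{subject} site:{journal} latest research paper" for journal in journals]


def dedupe_by_link(results: List[Dict]) -> List[Dict]:
    """Keep the first result seen for each URL, preserving order."""
    seen = set()
    unique = []
    for result in results:
        link = result.get("link")
        if link and link not in seen:
            seen.add(link)
            unique.append(result)
    return unique


queries = build_queries("infectious disease modeling", JOURNALS)
print(queries[0])
```

Restricting each query with `site:` keeps results inside the chosen journals; results without a link are dropped by `dedupe_by_link` because they cannot be fetched later.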

---

### 3. **Extraction Agent**
Retrieves and cleans scientific content from each article:

- Downloads HTML using `requests`  
- Extracts readable text with `BeautifulSoup`  
- Removes menus, navigation bars, and boilerplate  
- Returns clean, structured text  

If text cannot be accessed, the agent skips the paper automatically.

---

### 4. **Summarization Agent**
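Uses Gemini to produce structured scientific summaries. The `SummarizationAgent` class defaults to `gemini-2.5-flash-lite`, while the main pipeline cell calls `gemini-2.5-flash` for the blog-ready text. Each summary covers:

- A short explanation of what the paper studies (3–4 sentences in the summarization prompt)  
- Core methods  
- Key findings  
- Scientific significance  

Both prompts instruct the model not to guess missing details.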

---

### 5. **Evaluation Agent**
Performs quality checks before a summary moves to the Blog Writer Agent.

Checks include:
- Minimum sentence count  
- Presence of scientific keywords:
  - method / approach  
  - result / findings  
  - conclusion / significance  
- Detection of missing or empty summaries  
- Flagging potential hallucinations (listed as a goal; the `EvaluationAgent` code in Section 3 does not implement this check)  

Issues are appended to the final output for transparency.
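
One naive way to flag potential hallucinations is to check whether numbers quoted in a summary also occur in the source text. The sketch below is an assumption for illustration only: `unsupported_numbers` is a hypothetical helper that is not part of this notebook's `EvaluationAgent`, and the sample strings are made up. It catches only invented figures, not invented claims.

```python
import re
from typing import List


def unsupported_numbers(summary: str, source_text: str) -> List[str]:
    """Return numbers that appear in the summary but nowhere in the source."""
    number_pattern = r"\d+(?:\.\d+)?"
    source_numbers = set(re.findall(number_pattern, source_text))
    flagged: List[str] = []
    for number in re.findall(number_pattern, summary):
        if number not in source_numbers and number not in flagged:
            flagged.append(number)
    return flagged


# Made-up example strings, used only to show the behavior.
source = "We evaluated 5 classifiers on 1200 genes and reached an AUC of 0.83."
summary = "The study compared 5 classifiers and reported an AUC of 0.91."
print(unsupported_numbers(summary, source))  # ['0.91']
```

A non-empty result could be appended to the `issues` list in the same way as the other checks.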

---

### 6. **Blog Writer Agent**
Transforms technical summaries into accessible blog-style explanations:

- Friendly, conversational tone  
- Clear interpretations without losing accuracy  
- Suitable for broad audiences  

The scientific meaning remains intact; only the style changes.

---

### 7. **Orchestrator Agent**
Coordinates the entire agent workflow:

1. Receives user topic  
2. Calls Search Agent  
3. Passes URLs to Extraction Agent  
4. Sends extracted text to Summarization Agent  
5. Runs the Evaluation Agent  
6. Sends approved summaries to Blog Writer Agent  
7. Returns all outputs to the user  

Acts as the "project manager": it ensures each step executes in sequence and handles missing or invalid papers gracefully.

---


# Section 3: Agent Implementations

In [22]:
TOP_JOURNALS = [
    #"nature.com",
    #"nature.com/nbt",          # Nature Biotechnology
    #"nature.com/nmeth",        # Nature Methods
    "journals.plos.org/ploscompbiol",
    "academic.oup.com/bioinformatics",
    "elifesciences.org"
]

In [23]:
# Search Agent with publication date extraction
class SearchAgent:
    """Searches top computational biology journals for relevant papers."""

    def search_papers(self, subject, max_results=5):
        query_results = []
        with DDGS() as ddgs:
            for journal in TOP_JOURNALS:
                query = f"{subject} site:{journal} latest research paper"
                for r in ddgs.text(query, max_results=max_results):
                    query_results.append({
                        "title": r.get("title"),
                        "link": r.get("href"),
                        "journal": journal,
                        "date": r.get("date")  # None when the result has no "date" field
                    })
        return query_results


In [24]:
# -------- Short Scientific Summary Prompt & Summarization Agent --------
from google.generativeai import GenerativeModel

SHORT_SUMMARY_PROMPT = """
You are a scientific summarizer. Produce a concise scientific summary (3–4 sentences max).

Required structure (do not label sections, just write text):
- What the paper studies
- The main method or approach
- Key findings or results
- Brief implication or significance

ABSOLUTELY DO NOT guess content. 
If the paper text is missing, empty, or non-scientific, return EXACTLY:
"SKIP"

Keep summary extremely short and factual.
"""

class SummarizationAgent:
    """
    Uses Gemini to generate short, scientific summaries of full-text papers.
    Only returns a summary when the paper text is long enough to be meaningful.
    """
    def __init__(self, model_name: str = "gemini-2.5-flash-lite"):
        self.model = GenerativeModel(model_name)

    def summarize_paper(self, paper_title: str, paper_text: str):
        # Guard against missing or unusable text
        if not paper_text:
            return None

        clean_text = paper_text.strip()
        # Very short text is usually metadata / landing page only → treat as no access
        if len(clean_text) < 800:
            return None

        prompt = (
            SHORT_SUMMARY_PROMPT
            + "\n\nPaper Title: " + (paper_title or "Unknown title")
            + "\n\nPaper Text:\n"
            + clean_text[:5000]
        )

        try:
            response = self.model.generate_content(prompt)
            summary = (response.text or "").strip()
        except Exception:
            return None

        # Respect explicit SKIP instruction
        if summary.upper().strip() == "SKIP":
            return None

        # Post-process: keep at most 4 sentences to enforce shortness
        sentences = re.split(r"(?<=[.!?])\s+", summary)
        summary_short = " ".join(sentences[:4]).strip()

        return summary_short or None
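
The sentence cap at the end of `summarize_paper` can be tried in isolation. The sketch below reuses the same regular expression in a hypothetical standalone helper, `first_sentences`, and also shows a limitation: abbreviations such as "et al." are treated as sentence ends.

```python
import re


def first_sentences(text: str, limit: int = 4) -> str:
    """Keep at most `limit` sentences, splitting after ., ! or ? plus whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:limit]).strip()


print(first_sentences("One. Two! Three? Four. Five."))
# One. Two! Three? Four.

# The split is purely punctuation-based, so abbreviations end a "sentence" early.
print(first_sentences("Smith et al. found X. It matters.", limit=1))
# Smith et al.
```

Because of this, a four-"sentence" cap can cut a summary shorter than intended when the text contains abbreviations.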


In [25]:
# Paper Text Extraction Agent
class PaperTextExtractor:
    """
    Fetches web pages and extracts main textual content.
    Returns None when text is not accessible or clearly insufficient.
    """
    def __init__(self, min_chars: int = 800):
        self.min_chars = min_chars

    def extract_text(self, url: str):
        if not url:
            return None
        try:
            resp = requests.get(url, timeout=20)
            resp.raise_for_status()
        except Exception:
            return None

        html = resp.text
        soup = BeautifulSoup(html, "html.parser")

        # Remove obviously non-content tags
        for tag in soup(["script", "style", "nav", "footer", "header", "noscript"]):
            tag.decompose()

        text = soup.get_text(separator=" ")
        # Normalize whitespace
        text = " ".join(text.split())

        if len(text) < self.min_chars:
            return None

        return text


### Orchestrator (Conceptual)

This notebook uses a *functional orchestration* pattern rather than a
dedicated `OrchestratorAgent` class.

The main pipeline cell:
- Searches for papers  
- Extracts text  
- Summarizes using LLM  
- Evaluates with EvaluationAgent  
- Generates PDF  

This fulfills the role of an orchestrator without requiring a separate class.
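
The functional orchestration pattern can be sketched as one function that takes each step as a callable. This is an illustrative assumption, not the notebook's pipeline cell: `run_pipeline` and the lambda stubs are hypothetical stand-ins for the real agents, so the example runs without network or API access.

```python
from typing import Callable, Dict, List, Optional


def run_pipeline(
    papers: List[Dict],
    extract: Callable[[str], Optional[str]],
    summarize: Callable[[str, str], Optional[str]],
    evaluate: Callable[[str], Dict],
) -> List[str]:
    """Run extract -> summarize -> evaluate for each paper, skipping failures."""
    entries: List[str] = []
    for paper in papers:
        text = extract(paper.get("link") or "")
        if not text:
            continue  # no accessible text: skip this paper
        summary = summarize(paper.get("title") or "Untitled Paper", text)
        if not summary:
            continue  # model returned nothing usable
        result = evaluate(summary)
        if not result["is_pass"]:
            summary += "\n\n[Evaluation issues: " + "; ".join(result["issues"]) + "]"
        entries.append(summary)
    return entries


# Stub agents so the flow can be exercised offline.
papers = [
    {"title": "A", "link": "https://example.org/ok"},
    {"title": "B", "link": "https://example.org/missing"},
]
entries = run_pipeline(
    papers,
    extract=lambda url: "full text" if url.endswith("/ok") else None,
    summarize=lambda title, text: f"{title}: summary of {text}",
    evaluate=lambda summary: {"is_pass": True, "issues": []},
)
print(entries)  # ['A: summary of full text']
```

Passing the steps in as callables keeps the loop testable with stubs while leaving the real agents unchanged.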



### Evaluation Agent

The **Evaluation Agent** is responsible for performing lightweight quality checks
on each blog-ready summary generated by the pipeline. It verifies:

- That the summary is not empty  
- That it contains at least a minimum number of sentences (default: 3)  
- That it mentions core scientific concepts such as *methods*, *results*, or *conclusions*  

The agent returns a simple dictionary with a boolean `is_pass` flag and a list of
`issues`. If any issues are detected, they are appended to the end of the summary
before exporting to the PDF report.


In [29]:

# Evaluation Agent
class EvaluationAgent:
    '''
    Agent that performs lightweight quality checks on blog-ready summaries.

    It verifies:
        - Non-empty text
        - Minimum number of sentences (>= min_sentences)
        - Presence of key scientific keywords: method, result, conclusion
    '''

    def __init__(self, min_sentences: int = 3):
        self.min_sentences = min_sentences

    def evaluate(self, text: str) -> Dict:
        '''
        Evaluate a summary and return a dict with pass/fail and issues.

        Args:
            text: The generated blog-ready summary.

        Returns:
            {
                "is_pass": bool,
                "issues": List[str]
            }
        '''
        issues: List[str] = []

        if not text or not text.strip():
            issues.append("Empty summary.")
            return {"is_pass": False, "issues": issues}

        # Rough sentence count by splitting on punctuation
        sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
        if len(sentences) < self.min_sentences:
            issues.append(f"Too short: only {len(sentences)} sentence(s).")

        lowered = text.lower()
        keywords = ["method", "approach", "result", "finding", "conclusion"]
        if not any(k in lowered for k in keywords):
            issues.append("Missing method/result/conclusion keywords.")

        return {
            "is_pass": len(issues) == 0,
            "issues": issues,
        }


In [30]:

# User Prompt + Prepare Agents
subject = input("What subject are you interested in? ").strip()
if not subject:
    raise ValueError("Please provide a non-empty subject.")

# Instantiate agents
search_agent = SearchAgent()
extraction_agent = PaperTextExtractor()
summarization_agent = SummarizationAgent()
evaluation_agent = EvaluationAgent(min_sentences=3)

# Search for candidate papers in top journals
papers = search_agent.search_papers(subject, max_results=5)
print(f"Found {len(papers)} candidate papers.")


What subject are you interested in?  longevity


Found 15 candidate papers.


In [31]:

# Run the summarization pipeline and generate blog-ready summaries

from google.generativeai import GenerativeModel

blog_model = GenerativeModel("gemini-2.5-flash")

final_summaries = []
evaluation_results = []

for paper in papers:
    try:
        url = paper.get("link")
        title = paper.get("title") or "Untitled Paper"

        full_text = extraction_agent.extract_text(url)
        if not full_text:
            # Skip papers without accessible text
            continue

        # Blog-writing prompt
        blog_prompt = f"""You are a science communicator. Write a clear, engaging, blog-ready summary 
based ONLY on the text below.

Your output must include:

1. A bold title line containing the paper title  
2. A clickable URL below it  
3. A 2–4 paragraph blog-style explanation that covers:
   - What the study is about
   - Why it matters
   - How the researchers approached the problem
   - The key findings
   - The broader significance
4. The publication date

Do NOT guess missing details. Focus on what can be inferred from the text.

Paper Title: {title}
Paper URL: {url}
Paper Date: {paper.get("date")}

Paper Text:
{full_text[:6000]}
"""  # end of blog_prompt

        response = blog_model.generate_content(blog_prompt)
        blog_text = (response.text or "").strip()

        # Evaluate the blog-ready summary
        eval_result = evaluation_agent.evaluate(blog_text)
        evaluation_results.append(eval_result)

        # Append evaluation issues (if any) to the entry
        entry = blog_text
        if not eval_result["is_pass"]:
            issues_str = "; ".join(eval_result["issues"])
            entry += f"\n\n[Evaluation issues: {issues_str}]"

        final_summaries.append(entry)

    except Exception as e:
        print(f"Skipping due to error: {e}")
        continue

# Print blog-ready summaries
for s in final_summaries:
    print(s)
    print("\n---\n")

# Simple evaluation summary
num_pass = sum(1 for e in evaluation_results if e.get("is_pass"))
print("Number of blog-ready summaries:", len(final_summaries))
print("Number of summaries passing evaluation:", num_pass)


**Identifying longevity associated genes by integrating gene expression and curated annotations**
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008429

Aging is a complex biological process, and the specific genetic mechanisms that control it are still largely a mystery. While scientists have identified some genes that influence lifespan, traditional methods of discovery, like individually altering genes in model organisms, are often slow and expensive. Recent efforts have turned to machine learning to classify genes as either "pro-longevity" (extending lifespan) or "anti-longevity" (shortening lifespan), but it hasn't been clear which computational approaches or data types are most effective for this task.

This study aimed to address these challenges by systematically comparing different machine learning methods and data types to predict gene longevity status. Researchers evaluated five popular classification algorithms using gene ontology and gene express

In [32]:
# Save blog-ready summaries to PDF (Unicode-safe, no duplicate titles)
from fpdf import FPDF
import unicodedata
import re

def clean_unicode(text):
    """Convert unicode to closest ASCII and remove unsupported chars."""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

def remove_duplicate_title(entry):
    """
    Normalize spacing so no extra title/URL block is introduced.
    The entry already contains title + date + URL from Gemini, so this
    only trims the text and collapses runs of 3+ newlines into one blank line.
    """
    entry = entry.strip()
    entry = re.sub(r'\n{3,}', '\n\n', entry)
    return entry

if len(final_summaries) == 0:
    print("No summaries available to save.")
else:
    pdf = FPDF()
    pdf.set_auto_page_break(auto=True, margin=15)
    pdf.add_page()
    pdf.set_font("Arial", size=12)

    # Report heading; per-paper titles come from Gemini's output
    pdf.multi_cell(0, 10, clean_unicode(f"Blog-Ready Summaries for: {subject}\n\n"))

    for entry in final_summaries:
        cleaned_entry = clean_unicode(remove_duplicate_title(entry))

        # Write only the summary itself; it already includes the title and URL
        pdf.multi_cell(0, 10, cleaned_entry + "\n\n")

    pdf_path = "/kaggle/working/literature_review.pdf"
    pdf.output(pdf_path)

    print("PDF saved to:", pdf_path)


PDF saved to: /kaggle/working/literature_review.pdf
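
The ASCII folding in `clean_unicode` is lossy, which is worth seeing on a small example. The sketch below repeats the same function and runs it on made-up sample strings: accented letters are reduced to their base letter, while characters with no ASCII decomposition (such as an en dash or a Greek letter) are dropped silently.

```python
import unicodedata


def clean_unicode(text: str) -> str:
    """Decompose characters (NFKD), then drop anything that is not ASCII."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


print(clean_unicode("café naïve"))   # cafe naive
print(clean_unicode("3–4 β-cells"))  # 34 -cells
```

A range like "3–4" therefore reads as "34" in the PDF, so numeric ranges and Greek symbols in the summaries are worth checking by eye.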



# Conclusion

This multi-agent system automates the process of:
- Searching top computational biology journals  
- Fetching the latest papers  
- Summarizing them using Gemini  
- Producing blog-ready articles  

It demonstrates:
✔ Workflow orchestration  
✔ Multi-agent collaboration  
✔ Real-world scientific utility  
✔ Google Gemini integration  



In [33]:
# Simple check helper for final_summaries (optional)
if "final_summaries" in globals():
    print(f"final_summaries is defined with {len(final_summaries)} entries.")
else:
    print("final_summaries is not defined yet. Run the pipeline cells first.")


final_summaries is defined with 6 entries.
