In [None]:
"""
Literature Research Assistant – Architecture Notes
--------------------------------------------------------------------------------

My goal with this project is to build a multi agent workflow that behaves like
a lightweight, semi automated literature review assistant.

High-level pipeline (as I currently have it):

1. Intent Agent
   - Takes my messy natural language question and turns it into a structured
     ResearchIntent (Pydantic model).
   - This is where I make all constraints explicit: topic, population, dates,
     study types, risk factors, etc.
   - Design choice: I keep this "conservative" so the agent does NOT invent
     constraints I didn't ask for.

2. Semantic Scholar Search Agent
   - Translates the ResearchIntent into a Semantic Scholar query string.
   - Calls the Semantic Scholar paper search endpoint (/graph/v1/paper/search)
     to fetch a first batch of candidate papers.
   - Design choice: I keep the query fairly simple (topic + a few extras) to
     avoid overconstraining and missing relevant papers.

3. Filter Agent (Screening)
   - Uses an LLM to screen titles/abstracts against my intent.
   - Outputs two lists: kept vs rejected papers.
   - Design choice: this is the quick way to emulate initial abstract
     screening in a scoping review.

4. Quality Ranking Agent (Academic Graph, no LLM)
   - Uses Semantic Scholar's Academic Graph API (paper/batch endpoint) to
     fetch metadata like citations, influential citations, venue, year, etc.
   - Computes a quality_score in the range [0, 1] plus human readable reasons.
   - Design choice: I rely on metadata here instead of asking
     the LLM to "guess quality", so this is more reproducible and interpretable.

5. Info Extraction Agent
   - Guided by intent.target_entities, extracts specific entities from the
     top ranked papers (e.g., genes, pathways, risk factors).
   - Returns a list of ExtractedInfo objects with:
       - paper_id
       - entities: {category -> [entities]}
       - entity_notes: {entity -> note}
   - Design choice: the model is constrained to categories I define 
     instead of making up arbitrary ones.

6. Trend Analysis Agent
   - Looks at extracted entities across all papers and synthesizes:
       - consensus_findings: where multiple papers agree
       - conflicts: where papers disagree
   - Design choice: this is my "meta-analysis lite" step. It operates only on
     the structured extraction output, not on full texts.

7. Human Checkpoint Agent
   - CLI-based loop where I (or another user) can:
       - inspect the current intent in JSON form
       - optionally edit fields (topic, population, dates, etc.)
       - stop the pipeline if something is way off
   - Design choice: this keeps me in the loop and prevents the system from
     running off a misinterpreted question.

8. Output Formatting Agent
   - Takes the final state (intent, consensus, conflicts, ranked papers, my
     checkpoint notes) and writes a Markdown mini report.
   - Design choice: I use this primarily for structure and wording.

Key data structures I rely on:
------------------------------
- ResearchIntent (Pydantic model):
    My structured representation of the question + constraints. The most
    important field I customized is:
      - target_entities: Dict[str, List[str]]
        Example:
          {
             "genes": ["BRCA1", "BRCA2"],
             "risk_factors": ["family history", "BMI"]
          }

- ResearchState (TypedDict):
    The "shared memory" object the supervisor passes between agents.
    It carries user_query, intent, raw/filtered/ranked papers, extracted_info,
    consensus/conflicts, checkpoint_notes, and formatted_output.

Philosophy / guiding principles:
--------------------------------
- Be conservative about what the agents are allowed to assume.
- Make constraints explicit in the ResearchIntent instead of hiding them in
  prompts.
- Separate responsibilities clearly:
     search vs screening vs quality vs extraction vs synthesis vs formatting
- Keep a human checkpoint before the final write up.
- Use Academic Graph metadata for quality signals instead of asking the LLM
  to make them up.
"""
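
The notes above describe a supervisor that hands one shared state object from agent to agent. A minimal sketch of that control flow follows, assuming each stage is a plain function that takes and returns the state. The stage functions, the `stopped` flag, and `run_pipeline` are hypothetical stand-ins for illustration, not the notebook's actual agents.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]          # stand-in for ResearchState
Stage = Callable[[State], State]


def intent_stage(state: State) -> State:
    # Placeholder: a real intent agent would call the LLM here.
    return {**state, "intent": {"topic": state["user_query"]}}


def search_stage(state: State) -> State:
    # Placeholder: pretend the search returned two papers.
    return {**state, "raw_papers": [{"id": "p1"}, {"id": "p2"}]}


def checkpoint_stage(state: State) -> State:
    # Placeholder for the human checkpoint: stop if nothing was found.
    if not state.get("raw_papers"):
        return {**state, "stopped": True}
    return state


def run_pipeline(user_query: str, stages: List[Stage]) -> State:
    """Run the stages in order, threading the shared state through each."""
    state: State = {"user_query": user_query}
    for stage in stages:
        state = stage(state)
        if state.get("stopped"):
            break  # a checkpoint asked to halt the run
    return state


final_state = run_pipeline(
    "breast cancer risk factors",
    [intent_stage, search_stage, checkpoint_stage],
)
print(sorted(final_state.keys()))
```

Each stage only adds keys to the state, so any stage can be swapped out or tested in isolation.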

In [93]:
# Putting API keys into the notebook's environment variables 
# so that the rest of the code can read them

import os
os.environ["SEMANTIC_SCHOLAR_API_KEY"] = "insert" 
os.environ["OPENAI_API_KEY"] = "insert"

In [94]:
# Create core_setup as an in-notebook module.
# Conceptually: I'm pretending I have a core_setup.py file, but I'm
# building it dynamically inside this notebook so that later cells can do
#     from core_setup import llm, ResearchIntent, ResearchState, ...
# without copy/pasting model definitions everywhere.

import types, sys
core_setup = types.ModuleType("core_setup")
sys.modules["core_setup"] = core_setup

# Put all shared types, state structures, and LLM here so all agents
# share a single source of truth.
from typing import TypedDict, List, Dict, Optional, Any
from dataclasses import dataclass, field
import os
from langchain_openai import ChatOpenAI

# ----------------------------- LLM ------------------------------------
## Here I'm defining a single base LLM instance that all agents can use.
## This makes it easier to change the model/temperature in one place.
core_setup.llm = ChatOpenAI(
    model="gpt-5-mini",  # lightweight model for agent orchestration
    temperature=0.1,     # slightly creative but still controlled
)


# -------------------- PAPER + EXTRACTED TYPES -----------------------
## Paper: the minimal structure I use throughout the pipeline to
## represent an article returned from Semantic Scholar.
class Paper(TypedDict, total=False):
    id: str                   # Semantic Scholar paperId
    title: str
    abstract: str
    year: int
    venue: str
    authors: List[str]
    citation_count: int       # citationCount from S2
    url: Optional[str]

core_setup.Paper = Paper


# RankedPaper: same as Paper but with an added quality score + reasons
# after passing through my quality ranking agent.
class RankedPaper(Paper, total=False):
    quality_score: float            # numeric quality score in [0, 1]
    quality_reasons: List[str]      # human readable explanations

core_setup.RankedPaper = RankedPaper


# ExtractedInfo: what my info extraction agent pulls from each paper.
# I designed this to be generic so it can handle different entity types.
class ExtractedInfo(TypedDict, total=False):
    paper_id: str
    # generic mapping: category -> list of entities
    # (e.g., "genes": ["BRCA1", "BRCA2"])
    entities: Dict[str, List[str]]
    # entity_notes can hold short human readable notes per entity
    # (e.g., "BRCA1": "consistently associated with higher risk")
    entity_notes: Dict[str, str]

core_setup.ExtractedInfo = ExtractedInfo


# ---------------------- CONSENSUS / CONFLICT TYPES ------------------
## These are higher level findings aggregated across papers.

class ConsensusFinding(TypedDict, total=False):
    entity: str                      # e.g., "BRCA1"
    evidence_support_papers: List[str]  # paper_ids backing the finding
    summary: str                     # narrative summary of consensus

core_setup.ConsensusFinding = ConsensusFinding


class ConflictFinding(TypedDict, total=False):
    entity: str
    supporting_papers: List[str]      # papers that support the effect
    contradicting_papers: List[str]   # papers that contradict it
    notes: str                        # how I interpret the conflict

core_setup.ConflictFinding = ConflictFinding


# ------------------------ INTENT MODEL -------------------------------
## This is my structured representation of a user’s research question.
## I use Pydantic so LangChain can parse LLM output into this schema
## and I get handy .model_dump() / validation.

from pydantic import BaseModel

class ResearchIntent(BaseModel):
    # A concise natural language question, e.g.,
    # "What genetic and lifestyle risk factors influence breast cancer
    #  risk in young women?"
    high_level_question: str = ""

    # Overall topic (shorter than high_level_question). This is often
    # what I use as the core of the Semantic Scholar search query.
    topic: str = ""

    # Optional population constraint, I only fill this if the user
    # explicitly mentions it (e.g., "women under 40", "adults with SLE").
    population: Optional[str] = None

    # Optional date range (e.g., "2015-2025", "last 5 years").
    date_range: Optional[str] = ""

    # Outcomes the user cares about (incidence, mortality, biomarkers…)
    outcomes: Optional[List[str]] = []

    # Requested study designs, if any (cohort, RCTs, meta-analyses, etc.).
    study_types: Optional[List[str]] = []

    # Terms that must appear in search (genes, pathways, technical keywords).
    must_include_terms: Optional[List[str]] = []

    # Terms to exclude (e.g., "mouse", "in vitro", "case reports").
    exclude_terms: Optional[List[str]] = []

    # Free text notes from me/user to the agent/downstream steps.
    notes_for_supervisor: Optional[str] = ""

    # If I want a very specific boolean search string, I can override
    # the auto generated query here. Otherwise this stays empty.
    semantic_scholar_query: Optional[str] = ""

    # Depth of review: quick, medium, or deep.
    depth: str = "medium"

    # Important: this is a dictionary, not a list.
    # Conceptually: this is what I want the info extraction agent to
    # look for in the papers. Example structure:
    #
    #   {
    #       "genes": ["BRCA1", "BRCA2"],
    #       "risk_factors": ["family history", "BMI"]
    #   }
    #
    # Keys are categories, values are example entities.
    target_entities: Dict[str, List[str]] = {}

core_setup.ResearchIntent = ResearchIntent


# ------------------------- RESEARCH STATE ----------------------------
## This is the global state my supervisor passes between agents.
class ResearchState(TypedDict, total=False):
    user_query: str
    intent: ResearchIntent
    raw_papers: List[Paper]              # direct Semantic Scholar hits
    filtered_papers: List[Paper]         # after my screening agent
    ranked_papers: List[RankedPaper]     # quality ranked subset
    extracted_info: List[ExtractedInfo]  # entities per paper
    consensus_findings: List[ConsensusFinding]
    conflicts: List[ConflictFinding]
    checkpoint_notes: Optional[str]      # notes from human checkpoint
    formatted_output: str                # final Markdown report

core_setup.ResearchState = ResearchState
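
The in-notebook module trick used in this cell can be shown in isolation. The sketch below uses only the standard library; `demo_setup` and `DEFAULT_DEPTH` are hypothetical names chosen for the example.

```python
import sys
import types

# Build an empty module object and register it under an importable name.
demo_setup = types.ModuleType("demo_setup")  # hypothetical module name
sys.modules["demo_setup"] = demo_setup

# Attach shared objects as attributes, the same way this cell
# attaches llm, Paper, ResearchIntent, etc. to core_setup.
demo_setup.DEFAULT_DEPTH = "medium"

# A regular import statement now resolves against sys.modules.
from demo_setup import DEFAULT_DEPTH

print(DEFAULT_DEPTH)
```

One consequence of this approach is that a name must be attached to the module before a `from ... import` for it runs, so the setup cell has to be executed before any agent cell.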


In [95]:
# ----------------------------
# INTENT AGENT – DESIGN NOTES
# ----------------------------
# This agent is the first step of my pipeline.
#
# From my perspective:
# - I usually start with a messy question like:
#     "What are the leading genetic and lifestyle risk factors for breast cancer
#      in young adults?"
# - I want the system to pin down that question into structured fields:
#     topic, population, outcomes, date_range, study_types, etc.
# - I also want to tell downstream agents what to extract via target_entities,
#   but in a controlled way.
#
# Key design decisions:
# - I treat target_entities as a DICT (category -> examples), not a flat list.
#   This gives me more semantic control, e.g.:
#       {"genes": ["BRCA1", "BRCA2"], "risk_factors": ["BMI", "family history"]}
# - The prompt insists on being conservative:
#     - If the user didn't clearly specify something, we leave it blank or default.
#     - This reduces the risk that the system "imagines" constraints.
# - I parse the LLM output into a Pydantic ResearchIntent so that:
#     - I get validation and a consistent shape of data
#     - Downstream code doesn't have to worry whether it's a dict or model.


In [96]:
# Intent Agent


# Here I'm translating a natural language user query into my
# structured ResearchIntent object using the LLM

from core_setup import llm, ResearchIntent, ResearchState
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser

intent_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            # I’m explaining to the LLM exactly what I want it to do.
            "You are a research planner. Convert the user query into a structured ResearchIntent.\n"
            "Be conservative and explicit about constraints.\n\n"
            "You MUST return ONLY valid JSON with the following keys EXACTLY matching the ResearchIntent schema:\n"
            "  high_level_question (str)\n"
            "  topic (str)\n"
            "  population (str or null)\n"
            "  date_range (str)\n"
            "  outcomes (list of str)\n"
            "  study_types (list of str)\n"
            "  must_include_terms (list of str)\n"
            "  exclude_terms (list of str)\n"
            "  notes_for_supervisor (str)\n"
            "  semantic_scholar_query (str)\n"
            "  depth (str: 'quick' | 'medium' | 'deep')\n"
            "  target_entities (object/dict)\n\n"
            "IMPORTANT — target_entities must be a DICTIONARY, not a list.\n"
            "Structure target_entities as a JSON object where:\n"
            "- Keys are category names (e.g. 'genes', 'risk_factors', 'biomarkers').\n"
            "- Values are lists of example entities or synonyms for that category (e.g. ['BRCA1', 'BRCA2']).\n\n"
            "Rules for target_entities:\n"
            "- Include ONLY categories clearly implied by the user.\n"
            "- If the user mentions genes, biomarkers, pathways, risk factors, etc., use those as category names.\n"
            "- Do NOT guess extra categories.\n"
            "- If nothing is implied, return an empty object (an empty JSON dictionary).\n\n"
            "General field rules:\n"
            "- high_level_question should summarize WHAT the user wants answered.\n"
            "- topic is the main subject (disease, exposure, outcome).\n"
            "- population only if user explicitly defines it.\n"
            "- date_range only if explicitly mentioned.\n"
            "- must_include_terms and exclude_terms only if clearly present.\n"
            "- semantic_scholar_query only if user provides a boolean-style literal query.\n"
            "- depth defaults to 'medium' unless user indicates otherwise.\n"
            "- NEVER invent fields or constraints.\n"
        ),
        (
            "user",
            # I’m providing the raw user query as {user_query}.
            "User query:\n{user_query}\n\n"
            "Return the ResearchIntent JSON object now."
        ),
    ]
)

# Let the parser return a plain dict, I’ll wrap it into ResearchIntent myself.
intent_parser = JsonOutputParser(pydantic_object=None)

def run_intent_agent(user_query: str) -> ResearchIntent:
    """
    Take the raw user query string and return a fully-typed ResearchIntent.
    """
    # Chain: prompt -> LLM -> JSON parsing.
    chain = intent_prompt | llm | intent_parser
    raw = chain.invoke({"user_query": user_query})

    # Convert dict -> ResearchIntent so downstream code always sees
    # a Pydantic object, not arbitrary dict structures.
    if isinstance(raw, dict):
        return ResearchIntent(**raw)

    # Fallback: if langchain returns a Pydantic-like object, use .model_dump()
    return ResearchIntent(**raw.model_dump())
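
If the LLM ignores the prompt and returns `target_entities` as a list, or a bare string as a category value, the `ResearchIntent(**raw)` call will fail validation. One way to stay conservative is to clean the raw payload first. This is a sketch only; `normalize_target_entities` is a hypothetical helper that is not part of the notebook.

```python
from typing import Any, Dict, List


def normalize_target_entities(value: Any) -> Dict[str, List[str]]:
    """Coerce a raw target_entities value into {category: [entities]}."""
    if not isinstance(value, dict):
        # A list, string, or None carries no category names, so drop it
        # rather than guess categories the user never asked for.
        return {}

    cleaned: Dict[str, List[str]] = {}
    for category, examples in value.items():
        if isinstance(examples, str):
            examples = [examples]          # single entity given as a string
        if not isinstance(examples, (list, tuple)):
            examples = []                  # null or other unexpected types
        cleaned[str(category)] = [
            str(e).strip() for e in examples if str(e).strip()
        ]
    return cleaned


print(normalize_target_entities({"genes": "BRCA1", "risk_factors": None}))
print(normalize_target_entities(["BRCA1", "BRCA2"]))
```

It could be applied as `raw["target_entities"] = normalize_target_entities(raw.get("target_entities"))` just before the model is constructed.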


In [97]:
# ---------------------------------------
# SEMANTIC SCHOLAR SEARCH – DESIGN NOTES
# ---------------------------------------
# This agent converts my structured ResearchIntent into a search query
# for the Semantic Scholar REST API.
#
# From my perspective:
# - I don't want the search query to be overly complex or fragile.
# - I want "topic" to be the main driver, with a few helper terms:
#     population, must_include_terms, and (optionally) target_entities keys.
#
# Design choices:
# - If I explicitly set `semantic_scholar_query` in the intent, that wins.
#   This gives me a manual override when I'm being picky.
# - If not, I build a query string like:
#       "<topic or HLQ> <population?> <first must_include terms> <few entity categories>"
# - I deliberately avoid injecting every possible detail into the query,
#   because that can reduce recall and miss relevant papers.
# - The returned structure is normalized into my Paper TypedDict to keep
#   downstream agents model agnostic.


In [98]:
# Semantic Scholar Agent


# This agent is responsible for building a search query from my
# ResearchIntent and calling the Semantic Scholar search API.

import os
import requests
from typing import List
from core_setup import Paper, ResearchIntent

SEMANTIC_SCHOLAR_API_KEY = os.environ.get("SEMANTIC_SCHOLAR_API_KEY")

BASE_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_query(intent) -> str:
    """
    Build a Semantic Scholar search query from a ResearchIntent.

    Strategy:
      1) If semantic_scholar_query is set, use that directly.
      2) Otherwise, use topic as the core, and only add a few extra terms
         (population, first 1–2 must_include_terms, optionally some
         target_entity categories).
    """
    # Handle both Pydantic model and dict (for flexibility).
    if hasattr(intent, "model_dump"):
        data = intent.model_dump()
    else:
        data = dict(intent)

    # 1) Explicit query wins, this lets me override everything manually.
    explicit_q = (data.get("semantic_scholar_query") or "").strip()
    if explicit_q:
        return explicit_q

    topic = (data.get("topic") or "").strip()
    hlq = (data.get("high_level_question") or "").strip()
    pop = (data.get("population") or "").strip()
    must_terms = [t.strip() for t in (data.get("must_include_terms") or []) if t.strip()]

    # Base: topic or high level question, if neither is present, I fall
    # back to a generic phrase so the query is never empty.
    if topic:
        base = topic
    elif hlq:
        base = hlq
    else:
        base = "cancer risk"  # generic fallback if everything is blank

    extras = []

    # If a population is specified, I bias the query towards that group.
    if pop:
        extras.append(pop)

    # I only add the first 1-2 must include terms so I don't over constrain.
    extras.extend(must_terms[:2])

    # OPTIONAL: add a few target_entity category names as soft context
    raw_targets = data.get("target_entities") or {}
    target_categories: list[str] = []
    if isinstance(raw_targets, dict):
        # With the dict, I just use the keys (e.g., "genes").
        target_categories = list(raw_targets.keys())[:3]
    elif isinstance(raw_targets, list):
        # Backwards compatibility in case I ever pass a list.
        target_categories = raw_targets[:3]

    extras.extend(target_categories)

    # Join everything into a single query string.
    query = " ".join([base] + extras).strip()
    return query


def semantic_scholar_search(intent: ResearchIntent, limit: int = 50) -> List[Paper]:
    """
    Call the Semantic Scholar search API using the built query and
    convert results into my Paper schema.
    """
    if SEMANTIC_SCHOLAR_API_KEY is None:
        raise ValueError("Set SEMANTIC_SCHOLAR_API_KEY in environment variables.")

    query = build_search_query(intent)
    print(f"[DEBUG] Semantic Scholar query: {query!r}")

    fields = "title,abstract,year,venue,authors,citationCount,url"

    params = {
        "query": query,
        "limit": limit,
        "fields": fields,
    }

    headers = {
        "x-api-key": SEMANTIC_SCHOLAR_API_KEY,
    }

    # A timeout keeps the notebook from hanging if the API is unresponsive.
    resp = requests.get(BASE_URL, headers=headers, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()

    papers: List[Paper] = []
    for item in data.get("data", []):
        # "authors" can be missing or null, so fall back to an empty list.
        authors = [a.get("name", "") for a in (item.get("authors") or [])]
        paper: Paper = {
            "id": item.get("paperId", ""),
            "title": item.get("title") or "",
            "abstract": item.get("abstract") or "",
            "year": item.get("year") or 0,
            "venue": item.get("venue") or "",
            "authors": authors,
            "citation_count": item.get("citationCount") or 0,
            "url": item.get("url") or "",
        }
        papers.append(paper)

    return papers


In [99]:
# ----------------------------
# FILTER AGENT - DESIGN NOTES
# ----------------------------
# This LLM based agent emulates the first pass abstract screening step
# in a real literature review.
#
# From my perspective:
# - The search results might be noisy or off topic.
# - I don't want to manually read all titles + abstracts just to get rid of
#   obviously irrelevant papers.
#
# Design choices:
# - I give the LLM the full ResearchIntent + the candidate papers and
#   ask it to produce two lists: kept and rejected.
# - I treat this as a high recall filter:
#     - It's okay if it keeps a few weakly relevant papers.
#     - It's worse if it drops papers that belong.
# - I'm not using detailed quality criteria here, that's the job of the
#   later quality ranking agent.


In [114]:
# Filter Agent


# This agent screens raw Semantic Scholar hits to decide which papers
# are relevant enough to keep.

from typing import List, Dict
from core_setup import ResearchIntent, Paper, ResearchState
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from langchain_openai import ChatOpenAI

# I’m using a separate LLM instance here just to be explicit, but it
# could also reuse core_setup.llm.
llm_filter = ChatOpenAI(model="gpt-5-mini", temperature=0.0)

filter_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a paper screening assistant. "
            "Given a list of papers and a research intent, decide which papers to KEEP or REJECT. "
            "Return a purely JSON decision."
        ),
        (
            "user",
            "Research intent (JSON):\n{intent}\n\n"
            "Papers (JSON list):\n{papers}\n\n"
            "Only keep papers that match the topic, population (if specified), and study types "
            "and that are not excluded by the 'exclude_terms' list.\n\n"
            "Return ONLY JSON with exactly these keys:\n"
            "- 'kept': a list of paper IDs as strings (e.g., ['id1', 'id2']).\n"
            "- 'rejected': a list of paper IDs as strings.\n"
            "Do not return full paper objects, and do not include any other keys or text."
        ),
    ]
)

filter_parser = JsonOutputParser(pydantic_object=None)

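def run_filter_agent(intent: ResearchIntent, papers: List[Paper]) -> Dict[str, List[str]]:
    """
    Ask the LLM to decide which papers to keep vs reject, based on the
    structured ResearchIntent. Returns {"kept": [...], "rejected": [...]},
    where each list holds paper IDs.
    """
    import json  # local import so this fix does not depend on the cell's import block

    chain = filter_prompt | llm_filter | filter_parser
    # Serialize both inputs so the prompt really receives JSON, as its
    # wording promises, instead of Python reprs.
    return chain.invoke({
        "intent": intent.model_dump_json(),
        "papers": json.dumps(papers, ensure_ascii=False),
    })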
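
The filter agent returns paper IDs, so the decision still has to be mapped back onto the paper objects. A sketch of that step, written to match the high recall goal from the design notes: a paper is dropped only when it is explicitly rejected. `apply_filter_decision` is a hypothetical helper, not part of the notebook.

```python
from typing import Dict, List


def apply_filter_decision(
    papers: List[dict],
    decision: Dict[str, List[str]],
) -> Dict[str, List[dict]]:
    """Split papers into kept/rejected using the LLM's ID lists."""
    kept_ids = set(decision.get("kept") or [])
    rejected_ids = set(decision.get("rejected") or [])

    kept: List[dict] = []
    rejected: List[dict] = []
    for paper in papers:
        pid = paper.get("id")
        # Reject only on an explicit, uncontradicted rejection. Papers the
        # LLM forgot to mention stay in, which favors recall.
        if pid in rejected_ids and pid not in kept_ids:
            rejected.append(paper)
        else:
            kept.append(paper)
    return {"kept": kept, "rejected": rejected}


papers = [{"id": "p1"}, {"id": "p2"}, {"id": "p3"}]
decision = {"kept": ["p1"], "rejected": ["p2"]}
result = apply_filter_decision(papers, decision)
print([p["id"] for p in result["kept"]])      # ['p1', 'p3']
print([p["id"] for p in result["rejected"]])  # ['p2']
```

Here `p3` is absent from both lists and is kept by default.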


In [101]:
# ---------------------------------------------------------
# QUALITY RANKING AGENT (S2 ACADEMIC GRAPH) - DESIGN NOTES
# ----------------------------------------------------------
# This is my metadata-based quality assessment step.
#
# From my perspective:
# - I want an objective-ish score that prefers:
#     - more recent papers,
#     - higher citation counts,
#     - influential citations,
#     - solid venues,
#     - appropriate publication types (e.g., clinical trials > case reports).
# - I don't want the LLM making up random quality scores.
#
# Design choices:
# - I call the Semantic Scholar Academic Graph /paper/batch endpoint to
#   fetch metadata in bulk for the filtered paper IDs.
# - I normalize citation related metrics using logs so that:
#     - Differences at low counts matter more than differences at crazy high counts.
# - I compute a weighted score in [0, 1] from:
#     cit_score, influ_score, recency_score, venue_score, type_score.
# - I always return reasons alongside the score so I have a transparent
#   explanation when I later read the report.
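
The effect of the log normalization is easiest to see with numbers. The sketch below isolates the formula log(c + 1) / log(max + 1) using only the standard library; `log_normalize` is a hypothetical name for the example.

```python
import math


def log_normalize(count: int, max_count: int) -> float:
    """Map a raw count onto [0, 1] on a log scale relative to max_count."""
    if max_count <= 0:
        return 0.0
    return math.log(count + 1) / math.log(max_count + 1)


max_citations = 999
for citations in (0, 9, 99, 999):
    print(citations, round(log_normalize(citations, max_citations), 2))
# 0 0.0
# 9 0.33
# 99 0.67
# 999 1.0
```

Going from 0 to 9 citations moves the score by about a third, the same amount as going from 99 to 999, which is the intended compression of very high counts.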


In [102]:
# Quality Ranking Agent (Semantic Scholar Academic Graph, no LLM)


# Here I'm using purely heuristic scoring based on metadata from the
# Semantic Scholar Academic Graph API (the /paper/batch endpoint).
# Goal: produce a list of RankedPaper objects with:
#   - quality_score in [0, 1]
#   - quality_reasons: human readable breakdown of the factors

import os
import math
import requests
from typing import List, Dict
from core_setup import Paper, RankedPaper

S2_GRAPH_BASE = "https://api.semanticscholar.org/graph/v1/paper"


def fetch_paper_details_batch(paper_ids: List[str], fields: str) -> Dict[str, dict]:
    """
    Call Semantic Scholar Graph API /paper/batch to get metadata for many paperIds.
    Returns a dict: {paperId: metadata_dict}

    I use this to pull citation counts, influential citations, year,
    venue, publication types, etc. in one batch.
    """
    if not paper_ids:
        return {}

    api_key = os.getenv("SEMANTIC_SCHOLAR_API_KEY")
    if not api_key:
        raise ValueError("Set SEMANTIC_SCHOLAR_API_KEY in environment variables.")

    url = f"{S2_GRAPH_BASE}/batch"
    headers = {
        "x-api-key": api_key,
        "Content-Type": "application/json",
    }
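    # The batch endpoint expects `fields` as a query parameter and the
    # IDs in the JSON body.
    params = {"fields": fields}
    payload = {"ids": paper_ids}

    resp = requests.post(url, headers=headers, params=params, json=payload, timeout=30)
    resp.raise_for_status()
    data = resp.json()

    details_by_id: Dict[str, dict] = {}
    for item in data:
        # IDs that cannot be resolved may come back as null entries; skip them.
        if not item:
            continue
        pid = item.get("paperId")
        if pid:
            details_by_id[pid] = item
    return details_by_id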


def compute_quality_score(
    meta: dict,
    year_min: int,
    year_max: int,
    max_citations: int,
) -> tuple[float, dict]:
    """
    Compute a quality score in [0, 1] using simple heuristics:
    - citation count
    - influential citation count
    - recency (relative to other papers in this set)
    - venue (high impact journal vs unknown)
    - publication type (clinical trial, journal article, etc.)

    I also return a dict of factor contributions so I can later build
    human readable explanations.
    """

    # Extract raw values with defaults
    year = meta.get("year") or 0
    citations = meta.get("citationCount") or 0
    influ = meta.get("influentialCitationCount") or 0
    venue = (meta.get("venue") or "").lower()
    pub_types = meta.get("publicationTypes") or []
    fields_of_study = meta.get("fieldsOfStudy") or []

    # --- normalized citation count ---
    if max_citations > 0:
        cit_score = math.log(citations + 1) / math.log(max_citations + 1)
    else:
        cit_score = 0.0

    # --- normalized influential citations (scaled by max_citations, not by the max influential count) ---
    if max_citations > 0:
        influ_score = math.log(influ + 1) / math.log(max_citations + 1)
    else:
        influ_score = 0.0

    # --- recency score (relative within the set) ---
    if year_max > year_min and year > 0:
        recency_score = (year - year_min) / (year_max - year_min)
        recency_score = max(0.0, min(1.0, recency_score))
    else:
        recency_score = 0.5  # default neutral if I can't compute it

    # --- venue heuristics ---
    high_impact_keywords = ["nature", "cell", "science", "lancet", "nejm", "jama"]
    if any(k in venue for k in high_impact_keywords):
        venue_score = 1.0
    elif venue:
        venue_score = 0.6
    else:
        venue_score = 0.4

    # --- publication type heuristics ---
    pub_types_lower = [p.lower() for p in pub_types]
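    # Semantic Scholar reports types without spaces (e.g. "ClinicalTrial"),
    # so match both spellings after lowercasing.
    if any("clinicaltrial" in p or "clinical trial" in p or "randomized" in p for p in pub_types_lower):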
        type_score = 1.0
    elif any("journalarticle" in p or "journal article" in p for p in pub_types_lower):
        type_score = 0.8
    elif pub_types_lower:
        type_score = 0.6
    else:
        type_score = 0.5

    # I could add field of study heuristics here, for now I just pass
    # them through in the factors dict.

    # Combine into a single quality score.
    # These weights are arbitrary but reflect my rough priorities.
    score = (
        0.35 * cit_score +
        0.25 * influ_score +
        0.2  * recency_score +
        0.1  * venue_score +
        0.1  * type_score
    )

    factors = {
        "cit_score": cit_score,
        "influ_score": influ_score,
        "recency_score": recency_score,
        "venue_score": venue_score,
        "type_score": type_score,
        "year": year,
        "citations": citations,
        "influential_citations": influ,
        "venue": venue,
        "pub_types": pub_types,
        "fields_of_study": fields_of_study,
    }

    return score, factors


def run_quality_ranking_agent(papers: List[Paper]) -> List[RankedPaper]:
    """
    Take my filtered papers and enrich them with Academic Graph metadata
    to produce quality ranked papers with explanations.
    """
    if not papers:
        return []

    # IDs I’ll look up
    ids = [p["id"] for p in papers if p.get("id")]
    if not ids:
        return []

    fields = (
        "year,venue,citationCount,influentialCitationCount,"
        "publicationTypes,fieldsOfStudy"
    )

    # --- fetch details in batches to respect limits ---
    details: Dict[str, dict] = {}
    batch_size = 50
    for i in range(0, len(ids), batch_size):
        batch_ids = ids[i : i + batch_size]
        batch_details = fetch_paper_details_batch(batch_ids, fields)
        details.update(batch_details)

    # --- aggregate for normalization (year range, max citations) ---
    years = [meta.get("year") for meta in details.values() if meta.get("year")]
    citations_list = [meta.get("citationCount") or 0 for meta in details.values()]
    year_min = min(years) if years else 0
    year_max = max(years) if years else 0
    max_citations = max(citations_list) if citations_list else 0

    ranked: List[RankedPaper] = []
    for p in papers:
        pid = p.get("id")
        meta = details.get(pid, {})  # if missing, meta = {}

        score, factors = compute_quality_score(meta, year_min, year_max, max_citations)

        # Build human readable reasons for the score.
        reasons: List[str] = []

        reasons.append(f"Overall quality score: {score:.2f} (0–1 scale).")

        year = factors["year"]
        if year:
            reasons.append(f"Publication year: {year} (relative recency score {factors['recency_score']:.2f}).")

        citations = factors["citations"]
        if citations:
            reasons.append(f"Citations: {citations} (normalized score {factors['cit_score']:.2f}).")

        influ = factors["influential_citations"]
        if influ:
            reasons.append(f"Influential citations: {influ} (normalized score {factors['influ_score']:.2f}).")

        venue = factors["venue"]
        if venue:
            reasons.append(f"Venue: {venue} (venue score {factors['venue_score']:.2f}).")

        pub_types = factors["pub_types"]
        if pub_types:
            reasons.append(f"Publication types: {', '.join(pub_types)} (type score {factors['type_score']:.2f}).")

        fields_of_study = factors["fields_of_study"] or []
        if fields_of_study:
            reasons.append(f"Fields of study: {', '.join(fields_of_study)}.")

        rp: RankedPaper = {
            **p,
            "quality_score": score,
            "quality_reasons": reasons,
        }
        ranked.append(rp)

    # Highest quality first
    ranked.sort(key=lambda x: x["quality_score"], reverse=True)
    return ranked
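
A quick way to sanity check the weighting is to compute the two extremes by hand. The sketch below reuses the weights and default factor values from `compute_quality_score`; the `WEIGHTS` dict and `weighted_score` helper are hypothetical names introduced for the example.

```python
# Weights copied from compute_quality_score.
WEIGHTS = {"cit": 0.35, "influ": 0.25, "recency": 0.2, "venue": 0.1, "type": 0.1}


def weighted_score(factors: dict) -> float:
    """Weighted sum of the five normalized factors."""
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)


# A paper with no metadata at all (meta == {}) gets the default factors:
# no citations, neutral recency, unknown venue, unknown publication type.
missing_meta = {"cit": 0.0, "influ": 0.0, "recency": 0.5, "venue": 0.4, "type": 0.5}

# Best case: every factor at its maximum.
best_case = {name: 1.0 for name in WEIGHTS}

print(round(weighted_score(missing_meta), 2))  # 0.19
print(round(weighted_score(best_case), 2))     # 1.0
```

So a paper that the batch lookup could not resolve still lands at roughly 0.19 rather than 0, which is worth keeping in mind when reading the ranked list.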


In [103]:
# -------------------------------------
# INFO EXTRACTION AGENT - DESIGN NOTES
# -------------------------------------
# This step tries to answer: "Given the top papers, what entities do
# they mention that match what I care about?"
#
# From my perspective:
# - I care about things like genes, pathways, biomarkers, risk factors, etc.
# - But I don't want a free for all where the model pulls out anything.
#
# Design choices:
# - I use intent.target_entities as a DICTIONARY to specify:
#     - which categories to focus on (keys),
#     - which example entities/synonyms prime the model (values).
# - The agent loops over the ranked papers and, for each one, outputs:
#     paper_id, entities (by category), and entity_notes (small comments).
# - Downstream trend analysis works purely on this structured layer,
#   not on raw text, which keeps things more interpretable.
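
To make the shapes above concrete, here is a minimal sketch of what a `target_entities` mapping and one extracted record could look like. `ExtractedInfoSketch` and every example value are illustrative assumptions; the real `ExtractedInfo` type lives in `core_setup` and may differ.

```python
from typing import Dict, List, TypedDict


class ExtractedInfoSketch(TypedDict):
    """Hypothetical stand-in for core_setup.ExtractedInfo."""
    paper_id: str
    entities: Dict[str, List[str]]   # category -> entities found in the paper
    entity_notes: Dict[str, str]     # entity -> short note


# Keys name the categories to extract; values are example entities
# that prime the model (made-up values for illustration).
target_entities: Dict[str, List[str]] = {
    "genes": ["BRCA1", "BRCA2"],
    "risk_factors": ["alcohol", "family history"],
}

# One illustrative record (made-up ID and content).
example: ExtractedInfoSketch = {
    "paper_id": "paper-001",
    "entities": {"genes": ["BRCA1"], "risk_factors": []},
    "entity_notes": {"BRCA1": "associated with increased early-onset risk"},
}

# Every requested category should be present, even if its list is empty.
missing = set(target_entities) - set(example["entities"])
print(sorted(missing))
```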


In [104]:
# Info Extraction Agent


# This agent reads ranked papers + the structured intent and pulls out
# specific entities (genes, pathways, biomarkers, etc.) that I care
# about, according to intent.target_entities.

from typing import List
from core_setup import RankedPaper, ExtractedInfo, ResearchIntent
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser

llm_extract = ChatOpenAI(model="gpt-5-mini", temperature=0.0)

extract_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a biomedical information extraction assistant. "
            "You will be given a research intent and a list of papers (title + abstract).\n\n"
            "The intent.target_entities field is a dictionary mapping category names "
            "(e.g. 'genes', 'risk_factors', 'biomarkers') to example entities or synonyms.\n\n"
            "For each paper, you must output an object with:\n"
            "- paper_id (string)\n"
            "- entities (object mapping each target entity category to a list of strings actually mentioned in this paper)\n"
            "- entity_notes (object mapping specific entities to short notes, "
            "such as direction of effect or risk interpretation).\n"
            "If a category has nothing relevant for a given paper, use an empty list for that category."
        ),
        (
            "user",
            "Research intent (JSON):\n{intent}\n\n"
            "Papers (JSON list):\n{papers}\n\n"
            "Return ONLY a JSON list of objects with keys: paper_id, entities, entity_notes."
        ),
    ]
)

extract_parser = JsonOutputParser(pydantic_object=None)

def run_info_extraction_agent(
    intent: ResearchIntent,
    ranked_papers: List[RankedPaper],
) -> List[ExtractedInfo]:
    """
    Use the LLM to extract entities of interest from the top ranked papers,
    guided by the target_entities mapping in the ResearchIntent.
    """
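    # Local import keeps this cell self-contained.
    import json

    # Convert Pydantic intent -> plain dict if needed.
    if hasattr(intent, "model_dump"):
        intent_data = intent.model_dump()
    else:
        intent_data = intent

    chain = extract_prompt | llm_extract | extract_parser
    # Serialize to real JSON: the prompt promises JSON, and passing raw
    # Python objects would insert their repr (single quotes, None) instead.
    return chain.invoke(
        {
            "intent": json.dumps(intent_data, default=str),
            "papers": json.dumps(ranked_papers, default=str),
        }
    )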
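
The parser only guarantees valid JSON, not the schema the prompt asks for. A minimal sketch of a post-processing step, assuming entries are plain dicts with the three keys named in the prompt; `normalize_extracted` is a hypothetical helper, not part of the pipeline above.

```python
from typing import Any, Dict, List


def normalize_extracted(entries: Any, categories: List[str]) -> List[Dict[str, Any]]:
    """Keep only well-formed entries and pad missing categories with empty lists."""
    if not isinstance(entries, list):
        return []
    cleaned: List[Dict[str, Any]] = []
    for entry in entries:
        # Skip anything that is not a dict or lacks a usable paper_id.
        if not isinstance(entry, dict) or not entry.get("paper_id"):
            continue
        raw_entities = entry.get("entities")
        if not isinstance(raw_entities, dict):
            raw_entities = {}
        entities: Dict[str, List[str]] = {}
        for cat in categories:
            values = raw_entities.get(cat) or []
            if isinstance(values, str):
                values = [values]      # a bare string becomes a one-item list
            if not isinstance(values, list):
                values = []            # drop anything that is not list-like
            entities[cat] = [str(v) for v in values]
        notes = entry.get("entity_notes")
        cleaned.append(
            {
                "paper_id": str(entry["paper_id"]),
                "entities": entities,
                "entity_notes": notes if isinstance(notes, dict) else {},
            }
        )
    return cleaned
```

It could be applied to the return value of `run_info_extraction_agent` with `list(intent.target_entities)` as the category list, assuming that field is a dict.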


In [105]:
# -----------------------------------
# TREND ANALYSIS AGENT – DESIGN NOTES
# -----------------------------------
# Given all the extracted entities across papers, this agent tries to
# answer: "Where do papers agree?" and "Where do they disagree?"
#
# From my perspective:
# - This is my meta-analysis-lite step, but I'm not doing stats here.
# - I mainly want:
#     - a list of entities with strong multi-paper support (consensus),
#     - a list of entities where evidence is mixed or conflicting.
#
# Design choices:
# - I let the LLM operate only on the structured ExtractedInfo objects,
#   not full abstracts. This focuses it on the curated signals instead of
#   raw text noise.
# - The output schema is:
#     consensus_findings: [{entity, evidence_support_papers, summary}]
#     conflicts: [{entity, supporting_papers, contradicting_papers, notes}]
# - This is what I show myself next, at the checkpoint, to judge if the
#   system is picking up the right story.
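
Because the trend step works on the structured layer only, "multi-paper support" can also be cross-checked without an LLM. A minimal sketch, assuming each entry is a dict with `paper_id` and `entities` as described above; both function names and the `min_papers` threshold are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def count_entity_support(extracted: List[dict]) -> Dict[str, Set[str]]:
    """Map each entity (lowercased) to the set of paper_ids that mention it."""
    support: Dict[str, Set[str]] = defaultdict(set)
    for entry in extracted:
        for names in entry.get("entities", {}).values():
            for name in names:
                support[name.strip().lower()].add(entry["paper_id"])
    return dict(support)


def multi_paper_entities(extracted: List[dict], min_papers: int = 2) -> List[str]:
    """Return entities mentioned by at least `min_papers` distinct papers."""
    support = count_entity_support(extracted)
    return sorted(e for e, ids in support.items() if len(ids) >= min_papers)


# Made-up data for illustration only.
demo = [
    {"paper_id": "a", "entities": {"genes": ["BRCA1"], "risk_factors": ["Alcohol"]}},
    {"paper_id": "b", "entities": {"genes": ["brca1"]}},
]
print(multi_paper_entities(demo))
```

Lowercasing is a crude form of entity normalization; synonyms (for example "OCP" and "oral contraceptives") would still be counted separately.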


In [106]:
# Trend Analysis Agent


# Here I take the extracted entities from all papers and ask the LLM
# to synthesize consensus vs conflicts across the literature.

from typing import List
from core_setup import ExtractedInfo
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser

llm_trend = ChatOpenAI(model="gpt-5-mini", temperature=0.0)

trend_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a meta-analysis assistant. "
            "You will be given extracted entities from multiple papers.\n"
            "Each entry has 'paper_id', 'entities' (category -> list of strings), "
            "and 'entity_notes'.\n\n"
            "Your job is to:\n"
            "- Identify which specific entities (e.g., particular risk factors, genes, scores) "
            "are supported by multiple papers.\n"
            "- Summarize the overall consensus about these entities.\n"
            "- Identify any major disagreements or conflicting findings."
        ),
        (
            "user",
            "Extracted info (JSON list):\n{extracted_info}\n\n"
            "Return JSON with keys:\n"
            "- consensus_findings: list of objects with keys 'entity', "
            "'evidence_support_papers', and 'summary'\n"
            "- conflicts: list of objects with keys 'entity', 'supporting_papers', "
            "'contradicting_papers', and 'notes'\n"
            "Return ONLY JSON."
        ),
    ]
)

trend_parser = JsonOutputParser(pydantic_object=None)

def run_trend_analysis_agent(
    extracted_info: List[ExtractedInfo],
) -> dict:
    """
    Summarize consensus and conflicts across all extracted entities.
    """
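    import json  # local import keeps this cell self-contained

    chain = trend_prompt | llm_trend | trend_parser
    # Serialize to real JSON so the prompt's "JSON list" label is accurate.
    return chain.invoke({"extracted_info": json.dumps(extracted_info, default=str)})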
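
Since the LLM writes the paper ID lists itself, it may cite IDs that were never in the input. A minimal sketch of a sanity check, assuming the output keys named in the prompt and string paper IDs; `find_unknown_paper_ids` is a hypothetical helper.

```python
from typing import Dict, List


def find_unknown_paper_ids(trend_result: dict, extracted_info: List[dict]) -> Dict[str, List[str]]:
    """Return {entity: [cited paper IDs that are absent from extracted_info]}."""
    known = {e.get("paper_id") for e in extracted_info}
    # Which keys hold paper ID lists in each section of the trend output.
    id_keys = {
        "consensus_findings": ["evidence_support_papers"],
        "conflicts": ["supporting_papers", "contradicting_papers"],
    }
    unknown: Dict[str, List[str]] = {}
    for section, keys in id_keys.items():
        for item in trend_result.get(section, []):
            cited = [pid for k in keys for pid in (item.get(k) or [])]
            bad = sorted({pid for pid in cited if pid not in known})
            if bad:
                unknown[str(item.get("entity"))] = bad
    return unknown


# Made-up data for illustration only.
trend_demo = {
    "consensus_findings": [{"entity": "BRCA1", "evidence_support_papers": ["p1", "p9"]}],
    "conflicts": [{"entity": "BMI", "supporting_papers": ["p1"], "contradicting_papers": ["p2"]}],
}
info_demo = [{"paper_id": "p1"}, {"paper_id": "p2"}]
print(find_unknown_paper_ids(trend_demo, info_demo))
```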


In [107]:
# ------------------------------
# CHECKPOINT AGENT – DESIGN NOTES
# ------------------------------
# This is where I keep myself in the loop.
#
# From my perspective:
# - I don't want the pipeline to run fully autonomously, especially on
#   sensitive topics, without me confirming the intent looks right.
# - I also may notice, after seeing consensus/conflicts, that I want to
#   tighten or relax some criteria (e.g., date_range, population).
#
# Design choices:
# - I show:
#     - truncated consensus findings
#     - truncated conflicts
#     - the current ResearchIntent as JSON
# - I can:
#     - press Enter to approve,
#     - 'e' / 'n' to edit fields interactively,
#     - 'q' to stop the pipeline entirely.
# - Any notes I leave in `notes_for_supervisor` get stored as
#   `checkpoint_notes` to be used by the output agent.


In [108]:
# Checkpoint Agent


# This is my human-in-the-loop step, where I (the user) can
# inspect and edit the ResearchIntent before the pipeline finishes.

import json
from typing import Optional, List, Union
from core_setup import ResearchIntent
from core_setup import ResearchState


def review_and_edit_intent(intent: Union[ResearchIntent, dict]) -> ResearchIntent:
    """
    Interactive CLI-style loop where I can review and optionally edit
    fields of the ResearchIntent. This is basically my original
    checkpoint function, updated to handle both dict and Pydantic.
    """
    # If we got a plain dict from upstream, wrap it back into Pydantic.
    if isinstance(intent, dict):
        intent = ResearchIntent(**intent)

    while True:
        print("\n=== Research intent (current) ===")
        print(json.dumps(intent.model_dump(), indent=2)[:3000])

        ans = input(
            "Approve intent? [Enter]=yes, 'e' or 'n'=edit, 'q'=quit: "
        ).strip().lower()

        # Approve as is
        if ans in ("", "y", "yes"):
            return intent

        # Quit the entire pipeline
        if ans in ("q", "quit"):
            raise SystemExit("Stopped at research intent review.")

        # Enter edit mode
        if ans in ("e", "n", "no"):
            print("\nEdit fields (press Enter to keep current value):\n")

            def edit_field(label: str, current: str) -> str:
                """Helper: edit a single scalar field."""
                new_val = input(f"{label} [{current}]: ").strip()
                return new_val or current

            def edit_list_field(
                label: str,
                current_list: Optional[List[str]],
            ) -> Optional[List[str]]:
                """Helper: edit a list field via comma-separated input."""
                current_str = ", ".join(current_list or [])
                new_val = input(
                    f"{label} (comma-separated) [{current_str}]: "
                ).strip()
                if not new_val:
                    return current_list
                return [x.strip() for x in new_val.split(",") if x.strip()]

            # --- my existing field edits ---
            intent.high_level_question = edit_field(
                "High-level question", intent.high_level_question
            )
            intent.topic = edit_field("Topic", intent.topic)

            if intent.population is not None:
                intent.population = edit_field("Population", intent.population)
            else:
                pop_new = input("Population [none]: ").strip()
                intent.population = pop_new or None

            intent.date_range = edit_field(
                "Date range (e.g., 'last 5 years')",
                intent.date_range or "",
            ) or None  # keep an unset date range as None, not ""

            intent.outcomes = edit_list_field("Outcomes", intent.outcomes)
            intent.study_types = edit_list_field("Study types", intent.study_types)
            intent.must_include_terms = edit_list_field(
                "Must include terms", intent.must_include_terms
            )
            intent.exclude_terms = edit_list_field(
                "Exclude terms", intent.exclude_terms
            )

            intent.notes_for_supervisor = edit_field(
                "Notes for supervisor", intent.notes_for_supervisor or ""
            )

            intent.semantic_scholar_query = edit_field(
                "Semantic Scholar query", intent.semantic_scholar_query
            )

            # I could add editing for target_entities later if I want,
            # but for now I leave it as is.
            continue  # loop, show updated JSON again

        print("Didn't understand that input; please press Enter, 'e', or 'q'.")


def run_checkpoint_agent(state: ResearchState) -> ResearchState:
    """
    Uses my review_and_edit_intent() function to let me inspect and
    modify the ResearchIntent, with optional context from trend analysis.
    """
    print("\n--- CHECKPOINT: review intent in light of extracted trends ---")

    if "consensus_findings" in state:
        print("\nTop consensus findings (truncated):")
        for c in state["consensus_findings"][:5]:
            print(f"- {c.get('entity')}: {c.get('summary')}")

    if "conflicts" in state:
        print("\nConflicts (truncated):")
        for c in state["conflicts"][:5]:
            print(f"- {c.get('entity')}: {c.get('notes')}")

    # Here I plug in my interactive function.
    state["intent"] = review_and_edit_intent(state["intent"])

    # I keep any notes_for_supervisor as checkpoint_notes so the
    # output agent can incorporate them.
    notes = getattr(state["intent"], "notes_for_supervisor", None)
    state["checkpoint_notes"] = notes or ""

    return state


In [109]:
# ---------------------------------
# OUTPUT FORMATTING – DESIGN NOTES
# ---------------------------------
# This is the report writer step.
#
# From my perspective:
# - I want a clean Markdown summary I can read, annotate, or paste into
#   a doc, not raw JSON or unstructured text.
# - The model should:
#     - explain the main findings,
#     - show which papers support what,
#     - highlight conflicts and limitations.
#
# Design choices:
# - I give the agent:
#     - my original user_query,
#     - the final intent,
#     - consensus_findings + conflicts,
#     - top ranked papers (with reasons),
#     - checkpoint_notes.
# - I ask it to produce a structured report with sections:
#     background, methods description, main findings, conflicts,
#     limitations, and next steps.
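
Ranked papers can carry long abstracts, so it may help to slim them down before they go into the report prompt. A minimal sketch, assuming papers are dicts; `quality_score`, `quality_reasons`, and `id` appear in the pipeline, while `title`, `year`, and the helper name `compact_papers_for_prompt` are assumptions.

```python
import json
from typing import Any, Dict, List, Sequence


def compact_papers_for_prompt(
    papers: List[Dict[str, Any]],
    top_n: int = 10,
    keep: Sequence[str] = ("id", "title", "year", "quality_score", "quality_reasons"),
    max_reasons: int = 3,
) -> str:
    """Return a JSON string with only the fields the report writer needs."""
    slim = []
    for paper in papers[:top_n]:
        row = {k: paper[k] for k in keep if k in paper}
        # Cap the number of reasons so each paper stays short in the prompt.
        if isinstance(row.get("quality_reasons"), list):
            row["quality_reasons"] = row["quality_reasons"][:max_reasons]
        slim.append(row)
    return json.dumps(slim, indent=2, default=str)


# Made-up paper for illustration only.
papers_demo = [
    {
        "id": "p1",
        "title": "T",
        "abstract": "a long abstract that is dropped",
        "quality_score": 0.8,
        "quality_reasons": ["a", "b", "c", "d"],
    }
]
print(compact_papers_for_prompt(papers_demo))
```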


In [110]:
# Output Format Agent


# This is my final scientific writer agent that turns all the
# structured results into a Markdown report.

from core_setup import ResearchState
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm_output = ChatOpenAI(model="gpt-5-mini", temperature=0.2)

output_prompt = ChatPromptTemplate.from_messages(
    [
        ("system",
         "You are a scientific writer. Format the results of a literature review "
         "into clear Markdown with sections, bullet lists, and tables when helpful. "
         "Assume the audience has a biology background."),
        ("user",
         "User query:\n{user_query}\n\n"
         "Research intent:\n{intent}\n\n"
         "Consensus findings (JSON):\n{consensus}\n\n"
         "Conflicts (JSON):\n{conflicts}\n\n"
         "Top ranked papers (JSON, maybe truncated):\n{ranked_papers}\n\n"
         "Human checkpoint notes:\n{checkpoint_notes}\n\n"
         "Please return a well-structured Markdown report including:\n"
         "- Ranked papers/score reasoning\n"
         "- Overview / background\n"
         "- Key genes/pathways/biomarkers/causes and supporting evidence\n"
         "- Areas of disagreement or weak evidence\n"
         "- Brief methods/limitations note\n"
         "- Optional suggestions for next steps in analysis.")
    ]
)

def run_output_format_agent(state: ResearchState) -> str:
    """
    Turn the final state into a Markdown report string.
    """
    chain = output_prompt | llm_output
    md = chain.invoke({
        "user_query": state.get("user_query", ""),
        "intent": state.get("intent", {}),
        "consensus": state.get("consensus_findings", []),
        "conflicts": state.get("conflicts", []),
        "ranked_papers": state.get("ranked_papers", [])[:10],
        "checkpoint_notes": state.get("checkpoint_notes", ""),
    }).content
    return md


In [111]:
# -------------------------------
# SUPERVISOR AGENT - DESIGN NOTES
# -------------------------------
# This function is the glue that runs the whole pipeline in order.
#
# From my perspective:
# - I want a single call:
#       state = run_supervisor("my research question")
#   and then I get:
#       - debug prints at each step,
#       - a final `state["formatted_output"]` report.
#
# Design choices:
# - I explicitly log each step with "[SUP]" so I can see where things
#   fail or stall.
# - I keep the ResearchState as a simple TypedDict so it's easy to print or
#   inspect later.
# - The supervisor is intentionally linear right now; if I later move to
#   LangGraph, these steps will map to nodes in the graph.
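
One way to keep the linear design while preparing for a graph-based version is to express each step as a function from state to state. This is a plain-Python sketch, not LangGraph code; `run_linear_pipeline` and the toy steps are hypothetical and only illustrate the "[SUP]" logging pattern.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
Step = Tuple[str, Callable[[State], State]]


def run_linear_pipeline(state: State, steps: List[Step]) -> State:
    """Run steps in order, logging each one so failures are easy to locate."""
    for i, (name, fn) in enumerate(steps, start=1):
        print(f"[SUP] Step {i}: {name}")
        try:
            state = fn(state)
        except Exception as exc:
            # Log which step failed, then re-raise so the error is not hidden.
            print(f"[SUP] Step {i} ({name}) failed: {exc!r}")
            raise
    return state


# Toy steps standing in for the real agents.
def add_intent(state: State) -> State:
    return {**state, "intent": {"topic": state["user_query"]}}


def add_papers(state: State) -> State:
    return {**state, "raw_papers": []}


final = run_linear_pipeline(
    {"user_query": "demo"},
    [("Intent agent", add_intent), ("Search", add_papers)],
)
print(sorted(final))
```

Each `(name, fn)` pair would correspond to one node if the pipeline is later moved to a graph framework.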


In [112]:
# Supervisor Agent


# This is the orchestrator that calls all the other agents in order and
# wires inputs/outputs together via the ResearchState dict.

from core_setup import ResearchState, ResearchIntent

def run_supervisor(user_query: str) -> ResearchState:
    """
    High-level pipeline:
      1. Intent agent -> structured ResearchIntent
      2. Semantic Scholar search -> raw_papers
      3. Filter agent -> filtered_papers
      4. Quality ranking agent -> ranked_papers
      5. Info extraction agent -> extracted_info
      6. Trend analysis agent -> consensus/conflicts
      7. Human checkpoint agent -> possibly updated intent + notes
      8. Output formatting agent -> final Markdown report
    """
    # Initialize state with the original user query.
    state: ResearchState = {"user_query": user_query}

    print("[SUP] Step 1: Intent agent")
    raw_intent = run_intent_agent(state["user_query"])
    if isinstance(raw_intent, dict):
        state["intent"] = ResearchIntent(**raw_intent)
    else:
        state["intent"] = raw_intent

    print("[SUP] Step 2: Semantic Scholar search")
    state["raw_papers"] = semantic_scholar_search(state["intent"], limit=50)
    print(f"[SUP] raw papers: {len(state['raw_papers'])}")

    print("[SUP] Step 3: Filter agent")
    filter_result = run_filter_agent(state["intent"], state["raw_papers"])

    # From my perspective: the LLM might return either full paper objects
    # or just IDs, so I normalize everything into a set of IDs here.
    raw_kept = filter_result.get("kept", [])

    kept_ids = set()
    for item in raw_kept:
        # Case 1: the filter agent returned full paper dicts
        if isinstance(item, dict) and "id" in item:
            kept_ids.add(item["id"])
        # Case 2: the filter agent returned just paper IDs as strings
        elif isinstance(item, str):
            kept_ids.add(item)
        # Anything else I just ignore for now

    # Now I filter my original raw_papers list using this ID set
    state["filtered_papers"] = [
        p for p in state["raw_papers"] if p.get("id") in kept_ids
    ]
    print(f"[SUP] filtered papers: {len(state['filtered_papers'])}")

    print("[SUP] Step 4: Quality ranking")
    state["ranked_papers"] = run_quality_ranking_agent(state["filtered_papers"])[:20]
    print(f"[SUP] ranked papers: {len(state['ranked_papers'])}")

    print("[SUP] Step 5: Info extraction")
    state["extracted_info"] = run_info_extraction_agent(
        state["intent"],
        state["ranked_papers"],
    )
    print(f"[SUP] extracted entries: {len(state['extracted_info'])}")

    print("[SUP] Step 6: Trend analysis")
    trend_result = run_trend_analysis_agent(state["extracted_info"])
    state["consensus_findings"] = trend_result.get("consensus_findings", [])
    state["conflicts"] = trend_result.get("conflicts", [])
    print(f"[SUP] consensus: {len(state['consensus_findings'])}, conflicts: {len(state['conflicts'])}")

    print("[SUP] Step 7: Checkpoint")
    state = run_checkpoint_agent(state)

    print("[SUP] Step 8: Output formatting")
    state["formatted_output"] = run_output_format_agent(state)

    print("[SUP] Done, returning state")
    return state
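
The ID normalization in Step 3 is the part most likely to silently drop papers, so it can be pulled into a small helper and checked in isolation. A minimal sketch that mirrors the loop above; `normalize_kept_ids` is a hypothetical name.

```python
from typing import Any, Iterable, Set


def normalize_kept_ids(raw_kept: Iterable[Any]) -> Set[str]:
    """Accept full paper dicts or bare ID strings and return a set of IDs."""
    kept: Set[str] = set()
    for item in raw_kept:
        if isinstance(item, dict) and "id" in item:
            kept.add(item["id"])   # full paper object
        elif isinstance(item, str):
            kept.add(item)         # bare paper ID
        # Anything else (numbers, dicts without "id") is ignored.
    return kept


# Made-up inputs covering both cases plus two ignored items.
print(sorted(normalize_kept_ids([{"id": "a"}, "b", 42, {"title": "x"}])))
```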


In [115]:
# Here I'm actually running the whole pipeline once with a test query.
# I can change this string to explore different topics.

state = run_supervisor("Leading risk of breast cancer in young adults")


[SUP] Step 1: Intent agent
[SUP] Step 2: Semantic Scholar search
[DEBUG] Semantic Scholar query: 'breast cancer risk factors in young adults young adults'
[SUP] raw papers: 50
[SUP] Step 3: Filter agent
[SUP] filtered papers: 15
[SUP] Step 4: Quality ranking
[SUP] ranked papers: 15
[SUP] Step 5: Info extraction
[SUP] extracted entries: 15
[SUP] Step 6: Trend analysis
[SUP] consensus: 6, conflicts: 3
[SUP] Step 7: Checkpoint

--- CHECKPOINT: review intent in light of extracted trends ---

Top consensus findings (truncated):
- Rising incidence and poorer outcomes in adolescents and young adults (AYAs; 15–39 years): Multiple reviews and GBD analyses report an increasing global incidence of breast cancer in AYAs with evidence of worse disease-free and overall survival compared with older adults. Regional predictions and GBD estimates (cases, deaths, DALYs) support a rising burden in this age group.
- Dietary factors (higher red meat intake increases risk; higher plant-based intake and diet

Approve intent? [Enter]=yes, 'e' or 'n'=edit, 'q'=quit:  yes


[SUP] Step 8: Output formatting
[SUP] Done, returning state


In [116]:
# Finally, I render the Markdown report that the output agent produced.

from IPython.display import Markdown, display

display(Markdown(state["formatted_output"]))


Below is a concise literature‑review-style summary addressing the question:
What are the leading risk factors for developing breast cancer in young adults?  
I adopt a working definition used by many AYA studies: adolescents and young adults (AYA) = ages 15–39 years (I note where studies use other cutoffs). The review prioritizes systematic reviews/meta‑analyses, large cohort and case‑control studies, and recent Global Burden of Disease (GBD) estimates when available.

Executive summary
- Breast cancer incidence in AYAs (15–39) is rising in many regions and AYAs often experience worse disease‑free and overall survival than older adults.
- Major contributors to AYA breast cancer risk include inherited genetic syndromes (BRCA1/2, TP53, etc.), family history, reproductive/hormonal factors, prior therapeutic chest radiation, certain lifestyle factors (diet — high red meat; low plant/fiber; physical inactivity; alcohol; tobacco), and some benign breast diseases.
- Evidence strength varies: genetic and prior chest radiation risks are large and well supported; lifestyle and BMI associations for AYAs are often smaller, heterogeneous, or context dependent. Population attributable fractions reported by GBD point to diet and tobacco as measurable contributors to AYA breast cancer burden.
- Important gaps/conflicts: BMI’s role in AYA breast cancer is inconsistent across studies; ambient air pollution and evidence in transgender/nonbinary AYAs are currently limited.

Ranked key publications (selection) and reasoning
- Zheng et al., 2025 (GBD 2021 analysis of AYAs 15–39) — large, up‑to‑date global burden estimates and attributable risk fractions (dietary risks, tobacco). High relevance for incidence/trend and PAFs.
- McVeigh et al., 2021 (“A Review of Breast Cancer Risk Factors in Adolescents and Young Adults”, Cancers) — focused AYA review covering genetic, environmental, and lifestyle risks; useful synthesis for clinical implications.
- Cathcart‑Rake et al., 2018 (Cancer Journal) — review on modifiable risk factors in young women (physical activity, diet, BMI); helpful for modifiable exposures.
- Zhao et al., 2022 (GBD 2019, China focus) — country‑level comparative risk assessment pointing to red meat and BMI contributions.
- Yuan et al., 2024 (ecological study, NY State) — county‑level associations identifying ambient air pollution and other population exposures associated with younger‑onset breast cancer (hypothesis‑generating).
- Other useful reviews: pediatric/AYA oncology overviews (2022), classic endocrine‑risk reviews (Bernstein 2004) for mechanistic context.

(These papers were prioritized because they are recent, AYA‑focused, or use large, population‑level data. Limitations of individual papers are noted below.)

Background / burden and incidence in young adults
- GBD (2021, AYA 15–39): ~180,791 new breast cancer cases among AYAs globally in 2021; age‑standardized incidence rising (reported AAPC for women ~3.0%). Regional heterogeneity: higher incidence in high‑SDI regions, higher mortality in low‑SDI regions.
- Country examples: China (GBD 2019‑derived analysis) estimated 61,038 incident FeBGC cases among AYAs in 2019 and predicted rising incidence through 2030.
- Age distribution within AYAs: the bulk of “young adult” breast cancers cluster at the older end of the AYA range (late 20s–late 30s); true pediatric breast cancer is uncommon. AYAs as a group have a higher proportion of aggressive subtypes and worse outcomes relative to older adults.

Leading risk factors — summary, evidence level, and typical effect sizes (where available)
Below is a condensed table showing major domains, direction/relative magnitude, and evidence strength. Effect sizes are given as ranges reported in AYA‑targeted or large general studies when AYA‑specific estimates were available; many effect sizes come from mixed‑age analyses and are therefore flagged.

| Risk factor (domain) | Direction / magnitude (typical effect) | Evidence level (for AYAs) | Notes / population attributable fraction (PAF) where available |
|---|---:|---|---|
| Inherited high‑penetrance genes (BRCA1, BRCA2, TP53, PALB2, PTEN, CHEK2, ATM) | Large increase in risk. BRCA1/2 carriers: markedly elevated lifetime risks; substantial proportion of carriers develop breast cancer before age 40. | High (genetic cohort studies, registry data, systematic reviews) | BRCA1/2 major causes of early‑onset cases. Exact age‑specific penetrance varies by study/population (see text). |
| Family history (first‑degree relative) | Approx. 2–3× increased risk (varies with age of affected relative and number of relatives) | High (case‑control, cohort) | Family history remains an independent predictor after accounting for known mutations. |
| Prior chest radiation (e.g., mantle RT for Hodgkin lymphoma in adolescence) | Very large. RR often several‑fold; cumulative incidence by 30–40 years dramatically increased (some studies report cumulative incidence ~10–20% by age 40 among exposed survivors). | High (cohort studies of childhood/adolescent cancer survivors) | One of the strongest modifiable iatrogenic risks for young‑onset breast cancer; screening guidelines exist for survivors. |
| Reproductive/hormonal (early menarche, nulliparity, late first childbirth, shorter breastfeeding) | Direction consistent with adult literature: early menarche and nulliparity/late first birth → increased risk. Magnitudes modest (RRs typically 1.1–1.5 depending on comparison). | Moderate (cohort analyses; many studies include broader age ranges) | Breastfeeding protective in many studies; effect sizes modest but relevant to early‑onset risk. |
| Oral contraceptives (OCs) / exogenous hormones | Small increased risk associated with current/recent OC use in many studies (RR ~1.1–1.3); effects decline after cessation. | Moderate (meta‑analyses, cohort studies) | Most data come from general adult cohorts; AYA‑specific data limited but concern is often emphasized because of exposure during reproductive years. |
| Alcohol consumption | Dose‑dependent increase; typical estimates (all ages) ~7–10% increased risk per 10 g/day; for AYAs evidence consistent but limited on precise age‑specific RRs. | Moderate (meta‑analyses, cohort) | GBD attributes some breast cancer DALYs to alcohol; effect present across ages. |
| Diet (high red meat; low plant/fiber) | Higher red meat associated with increased risk; higher plant/fiber intake protective. Effect sizes modest (RRs commonly <1.5). | Low–moderate (observational studies, meta‑analyses, GBD) | GBD AYA analyses name dietary risks as the largest modifiable contributor to DALYs in some regions (e.g., China). |
| Physical activity | Protective (higher activity associated with ≈10–30% lower risk in many studies). | Moderate (cohort/meta‑analysis) | Protective effect reported in AYAs and broader adult cohorts. |
| Tobacco smoking | Small but measurable increased risk; GBD attributes a portion of DALYs to tobacco in AYAs. | Low–moderate (ecological/GBD, cohort evidence mixed) | Smoking exposure prior to first full‑term pregnancy may be particularly relevant biologically. |
| Body mass index (BMI) / obesity | Mixed/inconsistent for AYAs: in general adult literature, higher BMI is associated with higher postmenopausal breast cancer risk but sometimes with lower premenopausal risk. AYA‑specific findings are heterogeneous. | Low–conflicting (GBD/country analyses vs. AYA‑focused reviews) | Some GBD/country studies list high BMI as an important contributor; other AYA reviews find unclear or subtype‑dependent associations. |
| Benign proliferative breast disease | Increased risk (relative risk varies by histology; proliferative lesions with atypia confer higher risk). | Moderate (cohort/case‑control studies) | Often identified clinically at younger ages and known to increase future risk. |
| Socioeconomic / race/ethnicity | Heterogeneous effects: higher incidence in high‑SDI regions but higher mortality in low‑SDI regions. Black women more likely to present with aggressive subtypes and higher mortality at younger ages. | Moderate (population studies, registry data) | Reflects both exposure distributions and access to care. |
| Ambient air pollution (PM2.5, ozone) | Emerging/ecological evidence suggests positive associations in some studies; individual‑level causality not established. | Low (ecological; hypothesis‑generating) | Needs confirmation in individual‑level analytic studies. |

Key genes / pathways / biomarkers and supporting evidence
- DNA repair and homologous recombination pathway: BRCA1 and BRCA2
  - BRCA1 is strongly associated with early‑onset, triple‑negative phenotype in many carriers; BRCA2 also confers early risk but the phenotype distribution differs.
  - Age‑specific penetrance varies by study/population. Many BRCA1/2 carriers accumulate a substantial proportion of lifetime breast cancer risk before age 40.
- TP53 (Li‑Fraumeni syndrome)
  - Strongly increases risk for early‑onset breast cancer (often premenopausal); carriers are at markedly elevated risk across childhood/young adulthood.
- PALB2, CHEK2, ATM, PTEN (Cowden)
  - Moderate‑ to high‑penetrance genes that increase early‑onset risk in carriers (magnitude varies).
- Hormone receptor and intrinsic subtypes
  - Younger women have a higher relative prevalence of biologically aggressive subtypes (e.g., triple‑negative and HER2‑positive) compared with older women in many studies; this influences prognosis and likely reflects different etiologies.
- Biomarkers used clinically for risk stratification
  - Multi‑gene panels, polygenic risk scores (PRS) and family history models are increasingly applied to quantify risk and guide surveillance; evidence supports their utility, but calibration for AYAs and for non‑European populations is variable.

Areas of disagreement, uncertainty, or weak evidence
- BMI / obesity: evidence for BMI’s effect on breast cancer risk in AYAs is inconsistent. Adult literature shows opposite directions by menopausal status, but AYA‑specific studies vary by geography, subtype, and study design. Some GBD/country analyses highlight BMI as an important contributor to cancer burden, while AYA reviews find the association less straightforward.
- Air pollution: ecological analyses (e.g., NY State county data) reported positive associations between PM2.5/ozone and younger‑onset breast cancer, but global burden studies have not attributed a substantial fraction of AYA breast cancer to air pollution. Individual‑level causal evidence remains limited.
- Hormonal contraceptives: many adult studies report small increased risk with current/recent use; AYA‑specific effect sizes are modest and confounded by indication and reproductive behavior. Overall clinical consensus is cautious but not prohibitive.
- Transgender and nonbinary AYAs: empirical data on genotype/penetrance and breast cancer risks (e.g., after gender‑affirming hormones or surgeries) are limited; current risk models may not directly generalize.
- Population attributable fractions (PAFs): GBD provides PAFs for categories such as dietary risk and tobacco for AYAs in some analyses, but these are modelled estimates that depend on exposure prevalence and assumed relative risks and may not capture regional or subtype differences.

Methods / limitations of the evidence base (brief)
- Heterogeneous age definitions: many studies use different “young” cutoffs (e.g., <35, <40, 15–39), making synthesis imperfect.
- Many effect sizes come from mixed‑age cohort/meta‑analyses; truly AYA‑specific prospective data are fewer.
- GBD and ecological studies provide useful population‑level PAFs but rely on modelling assumptions and can differ from individual‑level estimates.
- Confounding and reverse causation risk in observational studies of lifestyle factors (e.g., smoking, alcohol, diet). Measurement error in self‑reported exposures.
- Genetic penetrance estimates vary by ancestry and study; PRS and gene panels are still being calibrated across populations.
- Subtype heterogeneity (ER/PR/HER2) is an important modifier but is not always reported or stratified in studies.

Selected quantitative estimates (examples, approximate ranges and caveats)
- Global AYA burden: GBD 2021 (AYAs 15–39) ≈ 180,791 incident cases (2021); AAPC incidence in women ~ +3.0% (1990–2021).
- GBD attributable fractions (AYAs, selected): dietary risks (example: 10.5% of DALYs in one GBD AYA analysis), tobacco ~2.0% of DALYs; high fasting plasma glucose ~1.6% (Zheng et al., 2025). These are modelled PAFs and vary by region.
- Prior chest radiation: cohort studies of childhood/adolescent Hodgkin lymphoma survivors show sharply increased breast cancer risks with RRs often several‑fold; cumulative incidence by age 40 among exposed survivors can be on the order of 10–20% in some series (varies by dose and field).
- BRCA1/2: carriers have markedly elevated lifetime risks, and many develop breast cancer at younger ages. Age‑specific penetrance estimates vary by study and population; older literature commonly cites lifetime risks to age 70 of about 45–70% for BRCA1 and 40–60% for BRCA2, and a substantial fraction of those cancers occur before age 40. (Use mutation‑specific cohort data for precise counseling.)
- Oral contraceptive pills (OCPs): pooled adult estimates suggest a small increased risk (e.g., RR ≈1.1–1.3 for current/recent users) that wanes with time since stopping; AYA‑specific RRs appear similar, but data are limited.
- Alcohol: pooled adult estimates suggest a ~7–10% increase in risk per 10 g/day of alcohol intake; the AYA‑specific increment has not been well quantified separately.
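
A per‑10 g/day estimate like the alcohol figure above can be scaled to other intakes only under an extra assumption. The minimal sketch below assumes a log‑linear dose‑response, so RR(dose) = RR_per_10g ** (dose / 10). That assumption is not guaranteed by the pooled estimates, least of all for AYAs. The function name `rr_at_dose` and the 20 g/day intake are illustrative choices, not values from any cited study.

```python
def rr_at_dose(rr_per_unit: float, dose: float, unit: float = 10.0) -> float:
    """Scale a per-unit relative risk to another dose.

    ASSUMPTION: log(RR) is linear in dose, so
    RR(dose) = rr_per_unit ** (dose / unit).
    """
    if rr_per_unit <= 0 or unit <= 0:
        raise ValueError("rr_per_unit and unit must be positive")
    if dose < 0:
        raise ValueError("dose must be non-negative")
    return rr_per_unit ** (dose / unit)


if __name__ == "__main__":
    # Apply the ~7-10% per 10 g/day range to a hypothetical 20 g/day intake.
    low = rr_at_dose(1.07, 20.0)
    high = rr_at_dose(1.10, 20.0)
    print(f"RR at 20 g/day: {low:.2f} to {high:.2f}")
```

Under this assumption, 20 g/day corresponds to an RR of roughly 1.14 to 1.21. A threshold or non‑linear dose‑response would give different numbers.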

Practical implications for clinicians / public health
- Genetic testing and early surveillance: identify individuals with family history or early cancers for genetic testing (BRCA1/2, TP53, etc.) because these confer high absolute risks at young ages and change management (enhanced surveillance, risk‑reducing options).
- Survivors who received prior chest radiation require early screening (MRI ± mammography) because of their high cumulative risk.
- Population prevention: promote modifiable protective behaviors (physical activity, healthy diet, limiting alcohol, breastfeeding where possible) while acknowledging effect sizes are modest individually.
- Equity: targeted efforts are needed because incidence and mortality patterns differ by SDI, race/ethnicity, and access to care.

Suggestions for next steps in analysis or research (if you want to go deeper)
1. Define the exact age range for “young adults” for your analysis (recommended: 15–39 for comparability with AYA literature, or <40 if simpler).
2. Conduct a focused systematic search (PubMed/MEDLINE, Embase, Web of Science) limited to AYA age cutoffs and these study types: systematic reviews/meta‑analyses, large cohorts, population registries, and case‑control studies. Use search terms: “breast cancer” AND (“young adult” OR adolescent OR “early onset” OR “<40” OR “15–39”) plus risk factor terms (BRCA, radiation, parity, OCP, BMI, alcohol, diet, smoking, physical activity, PM2.5).
3. Extract age‑specific effect sizes (RR/OR/HR) stratified by subtype (ER+/ER−/HER2+/TNBC) where available. Perform meta‑analysis if sufficient homogeneous studies exist.
4. Estimate population attributable fractions for your target population using measured exposure prevalences and pooled RRs (with sensitivity analyses).
5. Stratify by geography/SDI and race/ethnicity where possible, because both exposure prevalence and baseline incidence differ substantially.
6. If focusing on genetic risk, compile gene‑specific age‑specific penetrance estimates by ancestry and assess calibration of PRS in AYAs.
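
Steps 3 and 4 above can be sketched in a few lines. This is a minimal illustration, not a full meta‑analysis workflow. It pools study RRs with inverse‑variance fixed‑effect weights, with standard errors back‑calculated from 95% CIs, and then applies Levin's formula, PAF = p(RR − 1) / (1 + p(RR − 1)), across a range of exposure prevalences as a simple sensitivity analysis. The function names and every input number are hypothetical placeholders, not extracted data. Levin's formula also assumes an unconfounded RR; with adjusted RRs, a case‑based formula such as Miettinen's is usually preferred.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% CI


def pool_fixed_effect(rrs, ci_lowers, ci_uppers):
    """Inverse-variance fixed-effect pooled RR, computed on the log scale."""
    weighted_sum = 0.0
    weight_total = 0.0
    for rr, lo, hi in zip(rrs, ci_lowers, ci_uppers):
        # Back-calculate the SE of log(RR) from the 95% CI width.
        se = (math.log(hi) - math.log(lo)) / (2 * Z_95)
        weight = 1.0 / se ** 2
        weighted_sum += weight * math.log(rr)
        weight_total += weight
    return math.exp(weighted_sum / weight_total)


def levin_paf(prevalence, rr):
    """Levin's population attributable fraction for one binary exposure."""
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must be in [0, 1]")
    if rr <= 0:
        raise ValueError("rr must be positive")
    excess = prevalence * (rr - 1.0)
    return excess / (1.0 + excess)


if __name__ == "__main__":
    # HYPOTHETICAL study inputs: (RR, lower 95% CI, upper 95% CI).
    studies = [(1.20, 1.05, 1.37), (1.10, 0.95, 1.27)]
    rrs, lowers, uppers = zip(*studies)
    pooled = pool_fixed_effect(rrs, lowers, uppers)
    print(f"pooled RR: {pooled:.3f}")

    # Sensitivity analysis over assumed exposure prevalences.
    for p in (0.2, 0.3, 0.4):
        print(f"prevalence {p:.1f}: PAF = {levin_paf(p, pooled):.3%}")
```

A fixed‑effect model is used here only for brevity. With the between‑study heterogeneity noted in the limitations above, a random‑effects model would usually be the more defensible choice.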

Concluding remark
The strongest, most consistently supported drivers of breast cancer in young adults are inherited high‑penetrance genetic variants (BRCA1/2, TP53, etc.), family history, and prior chest radiation; these confer substantial age‑specific risks and have direct clinical implications (testing, surveillance, risk‑reducing interventions). Lifestyle and environmental contributors (diet, alcohol, physical activity, tobacco) are plausible and contribute at the population level (GBD PAFs), but effect sizes for individual risk and the evidence for some exposures (BMI, air pollution) in AYAs are heterogeneous and require more AYA‑focused, individual‑level studies.

If you want, I can:
- run a structured literature search and extract effect sizes (RR/OR/HR) for a specified age cutoff (e.g., 15–39 or <40),
- prepare an evidence table with study designs, sample sizes, and numeric effect estimates,
- or focus on one domain (e.g., genetic/heritability vs modifiable lifestyle risks) and produce a targeted, referenced summary. Which do you prefer?