# 📄 Idea2Paper — Demo Notebook

This notebook demonstrates the end-to-end pipeline of **Idea2Paper**:

- LLM clarifier to structure messy notes
- arXiv retrieval
- SPECTER (sentence-transformers/allenai-specter) semantic ranking
- PEGASUS-arXiv background summarization
- Markdown draft generation with references

> **Tip:** Run `pip install -r requirements.txt` from the project root before starting.

## ✅ Setup

In [5]:
# Make `src/` importable when running from the notebook folder
import os
import sys
from pprint import pprint

import pandas as pd
from IPython.display import Markdown, display  # used by later cells

root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if root not in sys.path:
    sys.path.insert(0, root)
print("Project root:", root)

from src.config import (
    MAX_ARXIV_RESULTS, TOP_K, SIM_THRESHOLD, EMBED_MODEL, SUM_MODEL, DEVICE, MAX_SUMMARY_LEN, DRAFT_DIR
)
from src.llm_interface import ClarifierLLM
from src.retrieval import make_query_from_idea, fetch_arxiv
from src.ranker import PaperRanker
from src.summarizer import Summarizer
from src.feasibility import quick_feasibility
from src.generator import build_markdown, save_markdown

Project root: c:\Users\sivar\Documents\git\Idea2Paper


## ✍️ Provide your idea / notes

In [16]:
# Edit this string with your own concept/notes
idea = """Gravitational-wave tails and memory effect for mergers in astrophysical environments"""

print(idea)

Gravitational-wave tails and memory effect for mergers in astrophysical environments


## 🧩 Clarify and structure with LLM

In [17]:
# Initialize the lightweight LLM (FLAN-T5 by default; falls back to heuristics if it can't load)
llm = ClarifierLLM(device=DEVICE)

# Turn raw notes into structured fields for the pipeline
structured = llm.structure(idea)

# Generate concise clarifying questions you can answer or incorporate into the fields
questions = llm.followups(idea)

print("Structured fields:")
pprint(structured, width=100)

print("\nClarifying questions:")
for i, q in enumerate(questions, 1):
    print(f"{i}. {q}")

Structured fields:
{'constraints': '',
 'data': '',
 'domain': '',
 'hypothesis': '',
 'keywords': ['gravitational-wave',
              'tails',
              'memory',
              'effect',
              'mergers',
              'astrophysical',
              'environments'],
 'method': '',
 'metrics': '',
 'problem': 'Gravitational-wave tails and memory effect for mergers in astrophysical environments'}

Clarifying questions:
1. What is the exact domain/subfield?
2. What is the main hypothesis or novelty?
3. What method/architecture do you propose?
4. Do you have any data or experimental results?
5. How will you evaluate success (metrics/baselines)?
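
Every field above is empty except `problem` and `keywords`, which is what a heuristic fallback would plausibly produce. A minimal sketch of such a fallback, assuming a simple stop-word filter: the name `heuristic_keywords` and the stop-word list are hypothetical and are not taken from `src/llm_interface.py`.

```python
import re

# Hypothetical stop-word list; a real implementation would likely use a larger one.
STOP_WORDS = {"a", "an", "and", "the", "for", "in", "of", "on", "to", "with"}


def heuristic_keywords(text: str, max_keywords: int = 10) -> list[str]:
    """Lowercase the text, keep hyphenated tokens whole, drop stop words and duplicates."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    seen: set[str] = set()
    keywords: list[str] = []
    for tok in tokens:
        if tok in STOP_WORDS or tok in seen:
            continue
        seen.add(tok)
        keywords.append(tok)
    return keywords[:max_keywords]


example = "Gravitational-wave tails and memory effect for mergers in astrophysical environments"
print(heuristic_keywords(example))
```

Filling in `domain`, `hypothesis`, and `method` by hand before re-running the later cells gives the retrieval step more to work with than keywords alone.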


## 🔎 Retrieve related papers from arXiv

In [18]:
def build_search_query(struct: dict, fallback_text: str) -> str:
    """
    Compose a compact search string from the structured fields.
    Falls back to the raw idea if fields are sparse.
    """
    struct = struct or {}
    parts = [
        struct.get("domain", ""),
        struct.get("problem", ""),
        struct.get("hypothesis", ""),
        struct.get("method", ""),
        " ".join(struct.get("keywords", [])),
    ]
    q = " ".join(p for p in parts if p).strip()
    return q or (fallback_text.strip() if fallback_text else "machine learning")

# Build a query from the structured fields (fallback to the raw idea)
query_text = build_search_query(structured if isinstance(structured, dict) else {}, idea)
query = make_query_from_idea(query_text)
print("Query:", query)

# Fetch from arXiv
try:
    df = fetch_arxiv(query, max_results=MAX_ARXIV_RESULTS)
    print(f"Fetched {len(df)} papers.")
    if not df.empty:
        display(df[["title", "published", "url"]].head(10))
    else:
        print("No results. Try refining Domain/Problem/Keywords and re-run.")
except Exception as e:
    print("Error fetching from arXiv:", e)
    df = pd.DataFrame()

# Optional: cache results for inspection
# df.to_csv("data/papers.csv", index=False)

Query: ti:"Gravitational-wave tails and memory effect for mergers in astrophysical environments gravitational-wave tails memory effect mergers astrophysical environments" OR abs:"Gravitational-wave tails and memory effect for mergers in astrophysical environments gravitational-wave tails memory effect mergers astrophysical environments"
Fetched 0 papers.
No results. Try refining Domain/Problem/Keywords and re-run.
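
The query above wraps the problem statement and the keyword list in a single quoted phrase. arXiv treats quoted text as an exact phrase, so a title or abstract would have to contain that whole string verbatim, which is the likely reason for zero results. A looser alternative is to AND together a few individual keywords. The sketch below assumes `fetch_arxiv` passes its query string to the arXiv API `search_query` parameter unchanged, which is not confirmed here; `keyword_query` is a hypothetical helper, not part of `src/retrieval.py`.

```python
def keyword_query(keywords: list[str], field: str = "abs", max_terms: int = 4) -> str:
    """Build an arXiv query such as 'abs:"gravitational-wave" AND abs:"memory"'.

    Uses the arXiv API field prefixes (ti, abs, all) and the AND operator.
    """
    terms = [k.strip() for k in keywords if k and k.strip()][:max_terms]
    if not terms:
        raise ValueError("at least one non-empty keyword is required")
    return " AND ".join(f'{field}:"{t}"' for t in terms)


print(keyword_query(["gravitational-wave", "memory", "mergers"]))
# abs:"gravitational-wave" AND abs:"memory" AND abs:"mergers"
```

Under that assumption, the resulting string could be passed as `fetch_arxiv(keyword_query(structured["keywords"]), max_results=MAX_ARXIV_RESULTS)`. Fewer, more specific terms tend to return more papers than a long conjunction.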


## 📈 Rank by semantic similarity (SPECTER)

In [19]:
# Guard: ensure we have retrieval results first
if "df" not in globals() or df is None or df.empty:
    print("No papers to rank. Run the arXiv retrieval step first.")
    ranked_df = pd.DataFrame()
    mean_sim = 0.0
else:
    # Use the structured Problem (fallback to raw idea) as the target text
    target_text = (structured.get("problem") if isinstance(structured, dict) else None) or idea

    # Rank with SPECTER embeddings
    ranker = PaperRanker(EMBED_MODEL)
    ranked_df, mean_sim = ranker.rank(target_text, df, top_k=TOP_K)

    # Show top results
    cols = ["title", "published", "similarity", "url"]
    display(ranked_df[cols])

print("Mean similarity:", round(float(mean_sim), 4))


No papers to rank. Run the arXiv retrieval step first.
Mean similarity: 0.0
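
Semantic ranking of this kind usually embeds the target text and each paper, then sorts the papers by cosine similarity. The sketch below shows only that scoring step on toy vectors, using the standard library. It assumes `PaperRanker.rank` does something comparable with SPECTER embeddings, which this notebook does not show; `rank_by_cosine` and the vectors are illustrative.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_cosine(target, papers, top_k=3):
    """Return the top_k (index, similarity) pairs, best first, and their mean similarity."""
    scored = sorted(
        ((i, cosine(target, vec)) for i, vec in enumerate(papers)),
        key=lambda pair: pair[1],
        reverse=True,
    )[:top_k]
    mean_sim = sum(s for _, s in scored) / len(scored) if scored else 0.0
    return scored, mean_sim


target = [1.0, 0.0]
papers = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
top, mean_sim = rank_by_cosine(target, papers, top_k=2)
print(top, round(mean_sim, 4))
```

With no papers the function returns an empty list and a mean of `0.0`, matching the fallback value the ranking cell uses.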


## 🧪 Summarize background (PEGASUS-arXiv)

In [20]:

# Guard: ensure we have ranked papers
if "ranked_df" not in globals() or ranked_df is None or ranked_df.empty:
    print("No ranked papers to summarize. Run the retrieval & ranking steps first.")
    background = ""
else:
    try:
        # Concatenate title + abstract of the top papers (cap at 6 to stay within the model's context).
        # fillna("") keeps a missing title or abstract from turning the whole string into NaN.
        texts = (ranked_df["title"].fillna("") + ". " + ranked_df["summary"].fillna("")).tolist()
        joined = " ".join(texts[:6])

        # Initialize PEGASUS-arXiv summarizer
        summarizer = Summarizer(model_name=SUM_MODEL, device=DEVICE)
        background = summarizer.summarize(joined, max_len=MAX_SUMMARY_LEN)

        print("=== Background & Related Work (summary) ===\n")
        print(background)
    except Exception as e:
        print("Summarization skipped due to model load/error:", e)
        background = ""

# Optional: keep for later cells
# with open("outputs/background_summary.txt", "w", encoding="utf-8") as f:
#     f.write(background)

No ranked papers to summarize. Run the retrieval & ranking steps first.
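
Taking a fixed six papers keeps the input short, but abstracts vary in length. One alternative is a word budget: add papers in rank order until the budget is used up. The sketch below is an illustration only; `join_within_budget` and the 600-word default are hypothetical, and words only approximate the tokens the summarization model actually counts.

```python
def join_within_budget(texts: list[str], max_words: int = 600) -> str:
    """Concatenate texts in order, stopping before the word budget would be exceeded.

    If even the first non-empty text is over budget, it is truncated to the
    budget rather than dropped.
    """
    chosen: list[str] = []
    used = 0
    for text in texts:
        words = text.split()
        if not words:
            continue
        if used + len(words) > max_words:
            if not chosen:
                chosen.append(" ".join(words[:max_words]))
            break
        chosen.append(" ".join(words))
        used += len(words)
    return " ".join(chosen)


docs = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
print(join_within_budget(docs, max_words=5))
# alpha beta gamma delta epsilon
```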


## ✅ Feasibility signal

In [21]:

# Choose the target text for feasibility (structured Problem → fallback to raw idea)
target_text = (structured.get("problem") if isinstance(structured, dict) else None) or idea

# Use the mean similarity from the ranking step if available
ms = float(mean_sim) if "mean_sim" in globals() else 0.0

feas = quick_feasibility(
    idea=target_text,
    mean_similarity=ms,
    sim_threshold=SIM_THRESHOLD
)

print("=== Feasibility Signal ===")
print(feas)

=== Feasibility Signal ===
❌ Very low plausibility: too vague or no related literature.
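
A signal like this can be derived from how the mean similarity compares with the configured threshold. The following sketch is hypothetical: the tiers, cut-offs, and labels are invented for illustration, and `quick_feasibility` in `src/feasibility.py` may weigh other factors, such as how specific the idea text is.

```python
def similarity_signal(mean_similarity: float, sim_threshold: float) -> str:
    """Map mean similarity to a coarse label relative to a threshold (illustrative tiers)."""
    if sim_threshold <= 0:
        raise ValueError("sim_threshold must be positive")
    if mean_similarity <= 0.0:
        return "very low"  # nothing was retrieved or ranked
    if mean_similarity < 0.5 * sim_threshold:
        return "low"
    if mean_similarity < sim_threshold:
        return "moderate"
    return "high"


for ms in (0.0, 0.2, 0.5, 0.8):
    print(ms, "->", similarity_signal(ms, sim_threshold=0.6))
```

In this run the ranking step produced a mean similarity of `0.0` because retrieval returned nothing, so the low signal reflects the failed search rather than a judgment about the idea itself.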


## 🧾 Generate Markdown draft

In [25]:

from IPython.display import Markdown, display

# Guards for previous steps
if "ranked_df" not in globals() or ranked_df is None:
    raise RuntimeError("No ranked_df found. Run retrieval & ranking first.")
if "background" not in globals():
    background = ""
if "feas" not in globals():
    feas = "⚠️ Feasibility not computed."

# Use structured Problem if present, else raw idea
problem_text = (structured.get("problem") if isinstance(structured, dict) else None) or idea

md = build_markdown(
    idea=problem_text,
    ranked_df=ranked_df,  # the guard above guarantees this is not None
    background_summary=background,
    feasibility_text=feas,
    structured_fields=structured if isinstance(structured, dict) else None,
)

path = save_markdown(md, DRAFT_DIR)
print("Saved draft to:", path)


display(Markdown(md))

Saved draft to: drafts\idea2paper_20250831_041912.md


# Idea2Paper Draft

**Generated:** 2025-08-31 04:19 UTC

## Abstract
This work proposes a concept inspired by recent literature. We outline the core idea and potential impact.

## Background & Related Work
Relevant prior works indicate that this domain is active; a targeted literature review is recommended.

## Proposed Method / Idea
**Domain:**   
**Problem:** Gravitational-wave tails and memory effect for mergers in astrophysical environments  
**Hypothesis:**   
**Method:**   
**Data:**   
**Metrics:**   
**Constraints:**   
**Keywords:** gravitational-wave, tails, memory, effect, mergers, astrophysical, environments

## Expected Impact
- Clarifies feasibility and boundary conditions  
- Provides pathway to validation  
- Potential to advance the state of the art  

## Limitations
- Draft generated by an automated assistant; verify all claims  
- May miss non-arXiv or paywalled literature  
- Requires expert review before submission  

## Feasibility Signal
❌ Very low plausibility: too vague or no related literature.

## References
- No references found.


## 🔎 Preview draft (render Markdown inline)

In [26]:
# Guard: ensure the Markdown string `md` exists
if "md" not in globals():
    raise RuntimeError("No Markdown draft found. Run the '🧾 Generate Markdown draft' step first.")

display(Markdown(md))



## 🚀 Next steps
- Edit the structured fields and re-run the notebook cells to regenerate the draft.
- Try a different idea, or add domain-specific keywords in the structured block.
- Run the Streamlit app for a UI:  
  ```bash
  streamlit run src/app.py
  ```

## 🧰 Environment check

In [2]:
import torch, transformers, safetensors
print("torch:", torch.__version__, "| cuda_available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("safetensors ok")

torch: 2.5.1+cu121 | cuda_available: True
transformers: 4.56.0
safetensors ok
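
The check above reports that CUDA is available. A config module can turn that into a device string with a CPU fallback. The sketch below assumes nothing about how `DEVICE` is actually set in `src/config.py`; `pick_device` is a hypothetical helper.

```python
def pick_device(prefer_gpu: bool = True) -> str:
    """Return "cuda" when PyTorch is installed and sees a GPU, otherwise "cpu"."""
    if not prefer_gpu:
        return "cpu"
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print("Using device:", pick_device())
```

Passing `prefer_gpu=False` forces CPU, which can help when the GPU does not have enough memory for the embedding and summarization models at the same time.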
