# Paper Pattern Extraction: Reusable Research Patterns


This notebook's pipeline:
NIPS paper and review → structured problem/solution extraction → problem/solution-level embeddings → pattern clustering → RAG/KG-ready artifacts

Structure:

0. Environment Setup
1. Load Dataset
2. Paper Text Assembly
3. LLM-based Structured Extraction
4. Flatten and prepare text for clustering
5. Embedding & Vectorization
6. Hyperparameters for UMAP and HDBSCAN
7. Pattern Clustering (UMAP and HDBSCAN)
8. Create pattern library for RAG / Downstream Use
9. Cluster Interpretation & Naming (LLM)
10. Finalize: Link cluster_id to papers and problems (for building knowledge graph)



dataset by Alina: https://huggingface.co/datasets/Alina0796/neurips-2025-reviews/tree/main/data/full


## 0. Environment Setup

In [1]:
import os
import json
import zipfile
from pathlib import Path
from typing import Dict, List, Any, Optional

import numpy as np
import pandas as pd
from tqdm import tqdm

from sentence_transformers import SentenceTransformer
import hdbscan

# Optional LLM (OpenAI example)
from openai import OpenAI
client = OpenAI()


## 1. Load NIPS 2025 Papers

Download the dataset: https://huggingface.co/datasets/Alina0796/neurips-2025-reviews/tree/main/data/full

In [3]:
DATA_DIR = Path("../review2_new/NIPS/NIPS_2025/")

print("Top-level folders:", [p.name for p in DATA_DIR.iterdir() if p.is_dir()])

# load paper
def load_jsons_from_dir(dir_path: Path) -> List[Dict[str, Any]]:
    items = []
    for fp in sorted(dir_path.rglob("*.json")):
        try:
            items.append(json.loads(fp.read_text(encoding="utf-8")))
        except Exception as e:
            print("Failed:", fp, e)
    return items

# Auto-detect the paper directory: the first folder whose name contains "paper"
paper_dir = None
for p in DATA_DIR.iterdir():
    if p.is_dir() and "paper" in p.name.lower():
        paper_dir = p
        break

assert paper_dir is not None, "Could not find paper folder automatically."
papers = load_jsons_from_dir(paper_dir)
print("Loaded paper jsons:", len(papers))
print("Example keys:", papers[0].keys())
print("Example metadata keys:", papers[0].get("metadata", {}).keys())

Top-level folders: ['NIPS_2025_meta', 'NIPS_2025_review', 'NIPS_2025_paper']
Loaded paper jsons: 5183
Example keys: dict_keys(['forum_id', 'metadata'])
Example metadata keys: dict_keys(['source', 'year', 'title', 'abstractText', 'sections', 'references'])


In [4]:
papers_df = pd.DataFrame(papers)
papers_df.head()

Unnamed: 0,forum_id,metadata
0,004uTlSufe,"{'source': 'NeurIPS', 'year': 2025, 'title': '..."
1,00Bwl1woOJ,"{'source': 'NeurIPS', 'year': 2025, 'title': '..."
2,00oRAPDWsX,"{'source': 'NeurIPS', 'year': 2025, 'title': '..."
3,01hPO0uJhS,"{'source': 'NeurIPS', 'year': 2025, 'title': '..."
4,021PIPyOU1,"{'source': 'NeurIPS', 'year': 2025, 'title': '..."


In [5]:
# json print metadata of the first paper
print(json.dumps(papers[0].get("metadata", {}), indent=2))

{
  "source": "NeurIPS",
  "year": 2025,
  "title": "How Well Can Differential Privacy Be Audited in One Run?",
  "abstractText": "Recent methods for auditing the privacy of machine learning algorithms have improved computational efficiency by simultaneously intervening on multiple training examples in a single training run. Steinke et al. [\\[1\\]](#page-10-0) prove that one-run auditing indeed lower bounds the true privacy parameter of the audited algorithm, and give impressive empirical results. Their work leaves open the question of how precisely one-run auditing can uncover the true privacy parameter of an algorithm, and how that precision depends on the audited algorithm. In this work, we characterize the maximum achievable efficacy of one-run auditing and show that the key barrier to its efficacy is interference between the observable effects of different data elements. We present new conceptual approaches to minimize this barrier, towards improving the performance of one-run au

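## 2. Build Paper Text (assembled from sections)

Choose which sections to include (recommended: Intro/Related/Method/Experiment/Conclusion). The cell below currently keeps only headings that contain "abstract" or "introduction".

In [6]:
# Build paper text (assembled from sections)
# Choose which sections to include (recommended: Intro/Related/Method/Experiment/Conclusion)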
# DEFAULT_INCLUDE = [
#    "abstract", "introduction", "background", "related",
#    "method", "approach", "model", "training",
#    "experiment", "evaluation", "results",
#    "discussion", "conclusion", "limitations"
#]

DEFAULT_INCLUDE = [
    "abstract", "introduction"
]

def normalize_heading(h: str) -> str:
    return (h or "").strip().lower()

def should_include_section(heading: str, include_keywords: List[str]) -> bool:
    h = normalize_heading(heading)
    return any(k in h for k in include_keywords)

def build_paper_text(paper: Dict[str, Any], include_keywords: List[str] = DEFAULT_INCLUDE,
                     max_chars: int = 18000) -> str:
    md = paper.get("metadata", {}) or {}
    title = md.get("title", "") or ""
    sections = md.get("sections", []) or []

    chunks = []
    if title:
        chunks.append(f"Title: {title}")

    for sec in sections:
        heading = sec.get("heading", "") or ""
        text = sec.get("text", "") or ""
        if not text.strip():
            continue
        if should_include_section(heading, include_keywords):
            chunks.append(f"\n## {heading}\n{text}")

    full = "\n".join(chunks).strip()
    # Truncate so the prompt stays within the LLM context window
    if len(full) > max_chars:
        full = full[:max_chars] + "\n\n[TRUNCATED]"
    return full

paper_texts = [build_paper_text(p) for p in papers]
print("Example paper_text length:", len(paper_texts[0]))
print(paper_texts[0][:800])


Example paper_text length: 4763
Title: How Well Can Differential Privacy Be Audited in One Run?

## 1 Introduction
Differential privacy (DP) is increasingly deployed to protect the privacy of training data, including in large-scale industry machine learning settings. As DP provides a theoretical guarantee about the worst-case behavior of a machine learning algorithm, any DP algorithm should be accompanied by a proof of an upper bound on its privacy parameters. However, such upper bounds can be quite loose. Worse, analyses and deployments of differential privacy can contain bugs that render those privacy upper bounds incorrect. As a result, there is growing interest in *privacy auditing* methods that can provide empirical lower bounds on an algorithm's privacy parameters. Such lower bounds can help detect whether the uppe
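The output above starts at the Introduction, which suggests the abstract is not stored as a section: the metadata keys listed in Section 1 include a separate `abstractText` field. A minimal sketch, assuming that layout, of prepending the abstract before the selected sections. `build_text_with_abstract` is a hypothetical helper, not part of the notebook, and the toy paper is made up:

```python
from typing import Any, Dict, List

def build_text_with_abstract(paper: Dict[str, Any],
                             include_keywords: List[str],
                             max_chars: int = 18000) -> str:
    """Hypothetical variant of build_paper_text that also adds metadata['abstractText']."""
    md = paper.get("metadata") or {}
    chunks = []
    if md.get("title"):
        chunks.append(f"Title: {md['title']}")
    # Assumption: the abstract lives in metadata["abstractText"], not in sections
    abstract = (md.get("abstractText") or "").strip()
    if abstract:
        chunks.append(f"\n## Abstract\n{abstract}")
    for sec in md.get("sections") or []:
        heading = (sec.get("heading") or "").strip()
        text = (sec.get("text") or "").strip()
        # Same substring match on the lowercased heading as should_include_section
        if text and any(k in heading.lower() for k in include_keywords):
            chunks.append(f"\n## {heading}\n{text}")
    full = "\n".join(chunks).strip()
    return full[:max_chars] + "\n\n[TRUNCATED]" if len(full) > max_chars else full

# Toy paper with the same keys as the dataset (contents are made up)
toy = {"metadata": {"title": "T", "abstractText": "A short abstract.",
                    "sections": [{"heading": "1 Introduction", "text": "Intro text."},
                                 {"heading": "2 Method", "text": "Method text."}]}}
demo_text = build_text_with_abstract(toy, ["introduction"])
```

Whether the abstract is worth the extra tokens depends on how much it overlaps with the introduction.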


Map Paper ID (the key for aligning paper/review/rebuttal)

In [None]:
import re

paper_rows = []
for p in papers:
    pid = p.get("forum_id", "")
    md = p.get("metadata", {}) or {}
    paper_rows.append({
        "paper_id": pid,
        "year": md.get("year", ""),
        "source": md.get("source", ""),
        "title": md.get("title", ""),
        "paper_text": build_paper_text(p),
    })

paper_df = pd.DataFrame(paper_rows).drop_duplicates("paper_id")
paper_df.head()


Unnamed: 0,paper_id,year,source,title,paper_text
0,004uTlSufe,2025,NeurIPS,How Well Can Differential Privacy Be Audited i...,Title: How Well Can Differential Privacy Be Au...
1,00Bwl1woOJ,2025,NeurIPS,Uncertainty-Sensitive Privileged Learning,Title: Uncertainty-Sensitive Privileged Learni...
2,00oRAPDWsX,2025,NeurIPS,KL Penalty Control via Perturbation for Direct...,Title: KL Penalty Control via Perturbation for...
3,01hPO0uJhS,2025,NeurIPS,Who You Are Matters: Bridging Topics and Socia...,Title: Who You Are Matters: Bridging Topics an...
4,021PIPyOU1,2025,NeurIPS,ALTER: <u>All-in-One Layer Pruning and Tempora...,Title: ALTER: <u>All-in-One Layer Pruning and ...


In [29]:
# filter out papers with fewer than 50 words of text
#paper_df = paper_df[paper_df["paper_text"].str.split().str.len() >= 50].reset_index(drop=True)
#print("Filtered paper count:", len(paper_df))

# pick a small set of papers for testing
#
# paper title contains "Multi-Agent Debate for LLM Judges with Adaptive Stability Detection"
#sample_df = paper_df[paper_df["title"].str.contains("Debate for LLM Judges", case=False)].reset_index(drop=True)
# paper title contains "Debate or Vote"
mad_df = paper_df[paper_df["title"].str.contains("Debate or Vote", case=False)].reset_index(drop=True)
mad_df.head()

# paper title contains "A-Mem: Agentic Memory for LLM Agents"
#se_df = paper_df[paper_df["title"].str.contains("A-Mem", case=False)].reset_index(drop=True)
# paper title contains "Self-Evolution Trajectory"
se_df = paper_df[paper_df["title"].str.contains("Self-Evolution Trajectory", case=False)].reset_index(drop=True)
se_df.head()

sample_df = pd.concat([mad_df, se_df]).reset_index(drop=True)
print("Sample paper count:", len(sample_df))
sample_df.head()

Sample paper count: 2


Unnamed: 0,paper_id,year,source,title,paper_text
0,iUjGNJzrF1,2025,NeurIPS,Debate or Vote: Which Yields Better Decisions ...,Title: Debate or Vote: Which Yields Better Dec...
1,isATAFP71B,2025,NeurIPS,SE-Agent: Self-Evolution Trajectory Optimizati...,Title: SE-Agent: Self-Evolution Trajectory Opt...


In [46]:
# value_counts of sources
paper_df["source"].value_counts(dropna=False)

source
NeurIPS    5183
Name: count, dtype: int64

(Optional) Load Reviews / Rebuttals and append them to the paper text

## 3. LLM Extraction (fields: core idea / problem / gap / domains / methods / tricks)

In [33]:
SYSTEM_PROMPT = """
You are an expert research scientist and conference reviewer.
Extract structured research insights from the paper content (and optionally reviews/rebuttals).
Be concise, faithful, and separate core ideas vs implementation tricks.
Return valid JSON only.
""".strip()

USER_PROMPT = """
Extract the following fields as JSON:

- core_idea: 1-3 sentences
- problem_solution_pairs: list of {{ problem, solution, key_insight, setting }} where:
  - problem: 1-2 sentences
  - solution: 1-2 sentences
  - key_insight: 1 sentence
  - setting: short phrase (e.g., "DP auditing / one-run setting", "LLM reasoning / multi-agent", etc.)
- gap_analysis: {{
    prior_work_limitation,
    why_unresolved
  }}
- domains: list of strings
- proposed_methods: list of {{ method_name, method_type }} where method_type is one of:
  [architecture, loss, data, inference, training, evaluation, theory, system]
- tricks: list of {{ trick_description, trick_type, novelty_level }} where trick_type in:
  [engineering, optimization, heuristic, data-centric, evaluation]
  and novelty_level in [low, medium, high]

Rules:
- Output MUST be valid JSON only (no markdown, no commentary).
- core_idea should not repeat the problem_solution_pairs verbatim.
- problem_solution_pairs: 1-5 items; each pair must be distinct (no paraphrase duplicates).
- Prefer reusable phrasing in solutions (e.g., "Reduce X by doing Y under condition Z").
- If uncertain, still fill fields with best-effort text; avoid null. Use empty list [] only when truly none.

Text:
{paper_text}
""".strip()



def extract_insights_llm(text: str, model: str = "gpt-4.1") -> Dict[str, Any]:
    resp = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_PROMPT.format(paper_text=text)}
        ],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

# Sample 500 papers first as a sanity check (this replaces the 2-paper sample_df above)
sample_df = paper_df.sample(n=min(500, len(paper_df)), random_state=42).copy()
#sample_df = paper_df.head(1)

results = []
for _, r in tqdm(sample_df.iterrows(), total=len(sample_df)):
    try:
        out = extract_insights_llm(r["paper_text"])
        results.append({**{"paper_id": r["paper_id"], "title": r["title"]}, **out})
    except Exception as e:
        results.append({"paper_id": r["paper_id"], "title": r["title"], "error": str(e)})

ex_df = pd.DataFrame(results)
ex_df.head()


100%|██████████| 500/500 [3:12:57<00:00, 23.16s/it]   


Unnamed: 0,paper_id,title,core_idea,problem_solution_pairs,gap_analysis,domains,proposed_methods,tricks
0,uG8kRtNGEI,Fix False Transparency by Noise Guided Splatting,The paper introduces Noise Guided Splatting (N...,[{'problem': '3D Gaussian Splatting (3DGS) oft...,{'prior_work_limitation': 'Previous methods fo...,"[3D neural rendering, computer vision, graphic...",[{'method_name': 'Noise Guided Splatting (NGS)...,[{'trick_description': 'Injecting high-opacity...
1,HdY8CCHife,**A Unified Stability Analysis of SAM vs SGD: ...,This paper presents a unified stability analys...,[{'problem': 'Existing analyses of SAM and SGD...,{'prior_work_limitation': 'Prior work has focu...,"[optimization, machine learning theory, deep l...",[{'method_name': 'Unified Stability Analysis F...,[{'trick_description': 'Empirically validate t...
2,4ULtNYHc5T,Exploring Tradeoffs through Mode Connectivity ...,This paper proposes a novel approach to multi-...,[{'problem': 'Optimization-based MTL methods s...,{'prior_work_limitation': 'Prior optimization-...,"[multi-task learning, deep learning optimizati...",[{'method_name': 'Curve-based mode connectivit...,[{'trick_description': 'Use NURBS instead of B...
3,klOr9y9nMU,CORE: Reducing UI Exposure in Mobile Agents vi...,CORE introduces a collaborative framework that...,[{'problem': 'Mobile agents for task automatio...,{'prior_work_limitation': 'Previous mobile age...,"[mobile automation, privacy-preserving AI, hum...",[{'method_name': 'CORE collaborative framework...,[{'trick_description': 'Partition UI pages int...
4,yjLew3Nd7z,Part-Level Visual Understanding,The paper introduces Explanatory Part Segmenta...,[{'problem': 'Current LMMs lack strong abiliti...,{'prior_work_limitation': 'Prior LMMs perform ...,"[computer vision, multimodal learning, object ...",[{'method_name': 'Explanatory Part Segmentatio...,[{'trick_description': 'Avoid using special se...
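Because `response_format={"type": "json_object"}` only guarantees syntactically valid JSON, a light schema check can catch records with missing fields before they reach the flatten step. A minimal sketch: `validate_extraction` is a hypothetical helper, and the allowed values are copied from the prompt above.

```python
from typing import Any, Dict, List

# Allowed method_type values, copied from USER_PROMPT
METHOD_TYPES = {"architecture", "loss", "data", "inference",
                "training", "evaluation", "theory", "system"}
REQUIRED_PAIR_KEYS = ("problem", "solution", "key_insight", "setting")

def validate_extraction(obj: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable issues; an empty list means the record looks OK."""
    issues = []
    if not isinstance(obj.get("core_idea"), str) or not obj["core_idea"].strip():
        issues.append("core_idea missing or empty")
    pairs = obj.get("problem_solution_pairs")
    if not isinstance(pairs, list) or not 1 <= len(pairs) <= 5:
        issues.append("problem_solution_pairs must be a list of 1-5 items")
        pairs = []
    for i, ps in enumerate(pairs):
        if not isinstance(ps, dict):
            issues.append(f"pair {i} is not an object")
            continue
        for key in REQUIRED_PAIR_KEYS:
            if not str(ps.get(key) or "").strip():
                issues.append(f"pair {i} lacks {key}")
    for m in obj.get("proposed_methods") or []:
        if isinstance(m, dict) and m.get("method_type") not in METHOD_TYPES:
            issues.append(f"unknown method_type: {m.get('method_type')}")
    return issues

# Made-up records for illustration
good = {"core_idea": "x", "problem_solution_pairs": [
    {"problem": "p", "solution": "s", "key_insight": "k", "setting": "t"}],
    "proposed_methods": [{"method_name": "m", "method_type": "loss"}]}
bad = {"core_idea": "", "problem_solution_pairs": [{"problem": "p"}],
       "proposed_methods": [{"method_name": "m", "method_type": "magic"}]}
good_issues = validate_extraction(good)
bad_issues = validate_extraction(bad)
```

Records with a non-empty issue list could be re-queued for extraction rather than silently dropped later by `looks_valid_pair`.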


In [91]:
# print json of the first row
print(json.dumps(ex_df.iloc[0].to_dict(), indent=2, ensure_ascii=False))

{
  "paper_id": "uG8kRtNGEI",
  "title": "Fix False Transparency by Noise Guided Splatting",
  "core_idea": "The paper introduces Noise Guided Splatting (NGS), a method that injects persistent, high-opacity noise Gaussians inside object volumes during 3D Gaussian Splatting optimization to resolve the false transparency artifact. This approach enforces correct surface opacity by creating an occlusion barrier, and also provides a diagnostic tool for evaluating transparency artifacts in neural rendering. The method is simple to integrate and supports improved benchmarking through new datasets.",
  "problem_solution_pairs": [
    {
      "problem": "3D Gaussian Splatting (3DGS) often produces false transparency, where opaque surfaces appear semi-transparent due to unconstrained optimization and ambiguous alpha blending.",
      "solution": "Inject high-opacity, randomly colored noise Gaussians inside the object's volume during training to enforce correct surface opacity.",
      "key_insig

In [47]:
# save ex_df to jsonl
output_fp = DATA_DIR / "NIPS_2025_extraction_raw.jsonl"
with output_fp.open("w", encoding="utf-8") as f:
    for _, r in ex_df.iterrows():
        f.write(json.dumps(r.to_dict()) + "\n")
print("Saved extraction results to:", output_fp)


Saved extraction results to: ../review2_new/NIPS/NIPS_2025/NIPS_2025_extraction_raw.jsonl


## 4. Flatten: Prepare the Corpus for Problem-Solution (PS) Clustering

In [40]:
# load the extracted insights
import os, json
from typing import Any, Dict, List, Optional
import numpy as np
import pandas as pd
from tqdm.auto import tqdm

# ====== paths (edit these) ======
JSONL_PATH = DATA_DIR / "NIPS_2025_extraction_raw.jsonl" # <-- change to your JSONL path
OUT_DIR = DATA_DIR / "output_nb"
CORPUS_DIR = os.path.join(OUT_DIR, "corpus")
EMB_DIR = os.path.join(OUT_DIR, "embeddings")
CLUSTER_DIR = os.path.join(OUT_DIR, "clusters")

for d in [OUT_DIR, CORPUS_DIR, EMB_DIR, CLUSTER_DIR]:
    os.makedirs(d, exist_ok=True)

print("OK. Output dir:", OUT_DIR)


OK. Output dir: ../review2_new/NIPS/NIPS_2025/output_nb


In [41]:
# helper
def safe_get(d: Dict[str, Any], keys: List[str], default=None):
    for k in keys:
        if k in d and d[k] not in (None, ""):
            return d[k]
    return default

def build_ps_cluster_text(ps: Dict[str, Any]) -> str:
    setting = (ps.get("setting") or "").strip()
    problem = (ps.get("problem") or "").strip()
    solution = (ps.get("solution") or "").strip()
    insight = (ps.get("key_insight") or "").strip()

    return (
        f"[SETTING] {setting}\n"
        f"[PROBLEM] {problem}\n"
        f"[SOLUTION] {solution}\n"
        f"[INSIGHT] {insight}"
    ).strip()

def looks_valid_pair(problem: str, solution: str) -> bool:
    return (problem is not None and solution is not None 
            and len(problem.strip()) >= 10 and len(solution.strip()) >= 10)


In [49]:
# flatten the extraction results
rows = []

with open(JSONL_PATH, "r", encoding="utf-8") as f:
    for line in tqdm(f, desc="Reading JSONL"):
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)

        paper_id = safe_get(obj, ["paper_id", "forum_id", "id"], default="")
        title = safe_get(obj, ["title"], default="")
        year = safe_get(obj, ["year"], default='2025')

        ps_list = obj.get("problem_solution_pairs") or []
        if not isinstance(ps_list, list):
            continue

        for i, ps in enumerate(ps_list):
            if not isinstance(ps, dict):
                continue

            problem = (ps.get("problem") or "").strip()
            solution = (ps.get("solution") or "").strip()
            key_insight = (ps.get("key_insight") or "").strip()
            setting = (ps.get("setting") or "").strip()

            if not looks_valid_pair(problem, solution):
                continue

            uid = f"{paper_id}#ps#{i}" if paper_id else f"unknown#ps#{i}"

            rows.append({
                "uid": uid,
                "paper_id": paper_id,
                "title": title,
                "year": year,
                "setting": setting,
                "problem": problem,
                "solution": solution,
                "key_insight": key_insight,
                "cluster_text": build_ps_cluster_text(ps),
            })

df_ps = pd.DataFrame(rows)
len(df_ps)


Reading JSONL: 0it [00:00, ?it/s]

1683

In [51]:
df_ps.head()

Unnamed: 0,uid,paper_id,title,year,setting,problem,solution,key_insight,cluster_text
0,uG8kRtNGEI#ps#0,uG8kRtNGEI,Fix False Transparency by Noise Guided Splatting,2025,3D neural rendering / object-centric 3DGS,3D Gaussian Splatting (3DGS) often produces fa...,"Inject high-opacity, randomly colored noise Ga...",Persistent internal noise structures act as an...,[SETTING] 3D neural rendering / object-centric...
1,uG8kRtNGEI#ps#1,uG8kRtNGEI,Fix False Transparency by Noise Guided Splatting,2025,3DGS training / ambiguous opacity regions,Standard 2D photometric losses in 3DGS supervi...,Guide the optimization by filling the object's...,Explicitly modeling the object's interior with...,[SETTING] 3DGS training / ambiguous opacity re...
2,uG8kRtNGEI#ps#2,uG8kRtNGEI,Fix False Transparency by Noise Guided Splatting,2025,Evaluation / transparency diagnostics,Evaluating the severity of false transparency ...,Use the recolored noise infill as a diagnostic...,Recoloring and visualizing internal noise Gaus...,[SETTING] Evaluation / transparency diagnostic...
3,HdY8CCHife#ps#0,HdY8CCHife,**A Unified Stability Analysis of SAM vs SGD: ...,2025,Optimization algorithm analysis / generalization,Existing analyses of SAM and SGD do not fully ...,Develop a unified stability framework that qua...,Data coherence modulates the stability and sim...,[SETTING] Optimization algorithm analysis / ge...
4,HdY8CCHife#ps#1,HdY8CCHife,**A Unified Stability Analysis of SAM vs SGD: ...,2025,Implicit bias / data-dependent analysis,The role of data coherence in shaping the impl...,Theoretically and empirically analyze how data...,High data coherence amplifies the simplicity b...,[SETTING] Implicit bias / data-dependent analy...


In [86]:
# save the problems to jsonl
PS_JSONL = OUT_DIR/"NIPS_2025_problem_solutions.jsonl"
sim_ps = df_ps[[ "paper_id", "title", "year", "setting", "problem", "solution", "key_insight"]]
with open(PS_JSONL, "w", encoding="utf-8") as f:
    for _, r in sim_ps.iterrows():
        f.write(json.dumps(r.to_dict(), ensure_ascii=False) + "\n")
print("Saved problem-solution pairs to:", PS_JSONL)

Saved problem-solution pairs to: ../review2_new/NIPS/NIPS_2025/output_nb/NIPS_2025_problem_solutions.jsonl


## 5. Embedding & Vectorization

In [None]:
# Embed each problem-solution text with a sentence-transformer model
texts = df_ps["cluster_text"].tolist()
embedder = SentenceTransformer("all-mpnet-base-v2")
emb = embedder.encode(texts, normalize_embeddings=True, show_progress_bar=True)
#emb = embedder.encode(trick_df["canonical_trick"].tolist(), normalize_embeddings=True, show_progress_bar=True)
emb = np.asarray(emb)
emb.shape
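
With `normalize_embeddings=True` each row of `emb` has unit length, so cosine similarity reduces to a dot product. A small sketch of a nearest-neighbor lookup that could be used to spot-check the embeddings before clustering. It uses random vectors as stand-ins for `emb`, and `top_k_neighbors` is a hypothetical helper:

```python
import numpy as np

def top_k_neighbors(emb: np.ndarray, query_idx: int, k: int = 5) -> np.ndarray:
    """Indices of the k rows most similar to emb[query_idx], assuming unit-norm rows."""
    sims = emb @ emb[query_idx]          # dot product equals cosine similarity for unit vectors
    sims[query_idx] = -np.inf            # exclude the query itself
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
fake = rng.normal(size=(100, 16))                      # stand-in for real embeddings
fake /= np.linalg.norm(fake, axis=1, keepdims=True)    # mimic normalize_embeddings=True
fake[1] = fake[0]                                      # plant an exact duplicate of row 0
neighbors = top_k_neighbors(fake, query_idx=0, k=3)
```

On the real data, printing `df_ps["cluster_text"].iloc[neighbors]` for a few queries gives a quick sense of whether nearby vectors describe related problems.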

## 6. Hyperparameters: UMAP (15D) + HDBSCAN

In [60]:
# Reduce the embeddings to 15 dimensions with UMAP before running HDBSCAN
import umap

umap15 = umap.UMAP(
    n_components=15,
    n_neighbors=30,
    min_dist=0.0,
    metric="cosine",
    random_state=42,
)

T_15 = umap15.fit_transform(emb)
T_15.shape



(1683, 15)

In [63]:
import hdbscan
import numpy as np

def n_clusters(labels):
    return len(set(labels)) - (1 if -1 in labels else 0)

def run_hdb(X, mcs, ms=None):
    #cl = hdbscan.HDBSCAN(min_cluster_size=mcs, min_samples=ms, metric="euclidean", cluster_selection_method="leaf")
    cl = hdbscan.HDBSCAN(min_cluster_size=mcs, min_samples=ms, metric="euclidean")  # default "eom" selection
    labels = cl.fit_predict(X)
    return labels, n_clusters(labels), (labels==-1).mean()

for mcs in [3,5,8,10,15]:
    p_labels, np_c, np_noise = run_hdb(T_15, mcs, ms=3)
    print("mcs", mcs, "| problem-solution clusters", np_c, "noise", round(np_noise,3))


mcs 3 | problem-solution clusters 135 noise 0.286
mcs 5 | problem-solution clusters 96 noise 0.307
mcs 8 | problem-solution clusters 41 noise 0.169
mcs 10 | problem-solution clusters 38 noise 0.176
mcs 15 | problem-solution clusters 31 noise 0.208
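
Cluster count and noise rate alone can hide a single oversized cluster. A sketch of a slightly richer summary per labeling, assuming HDBSCAN's convention that `-1` marks noise. `summarize_labels` is a hypothetical helper and the toy labels are made up for illustration:

```python
import numpy as np

def summarize_labels(labels) -> dict:
    """Summarize a clustering: cluster count, noise rate, and share of the largest cluster."""
    labels = np.asarray(labels)
    clustered = labels[labels != -1]              # -1 is HDBSCAN's noise label
    if clustered.size == 0:
        return {"n_clusters": 0, "noise_rate": 1.0, "largest_share": 0.0}
    _, counts = np.unique(clustered, return_counts=True)
    return {
        "n_clusters": int(counts.size),
        "noise_rate": float((labels == -1).mean()),
        # Fraction of all points (noise included) that fall in the biggest cluster
        "largest_share": float(counts.max() / labels.size),
    }

toy_labels = [0, 0, 0, 0, 1, 1, 2, 2, -1, -1]
summary = summarize_labels(toy_labels)
```

A high `largest_share` hints that one cluster is absorbing loosely related items and may merit a second clustering pass.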


In [65]:
# locked mcs=8, ms=3
p_labels, np_c, np_noise = run_hdb(T_15, 8, ms=3)
print("Problem-solution clusters", np_c, "noise", round(np_noise,3))
df_ps["cluster"] = p_labels

df_ps["cluster"].value_counts().head(58)

Problem-solution clusters 41 noise 0.169


cluster
 6     344
-1     284
 18    176
 20     71
 15     71
 12     68
 10     37
 32     34
 9      32
 33     31
 30     28
 23     28
 39     26
 4      26
 11     25
 1      24
 8      24
 31     24
 35     23
 36     22
 21     21
 40     19
 27     19
 3      19
 24     17
 2      17
 13     15
 29     13
 34     13
 17     13
 16     13
 22     12
 5      11
 38     11
 19     11
 0      10
 37     10
 14      9
 26      8
 28      8
 7       8
 25      8
Name: count, dtype: int64

## 7. Put Together: UMAP + HDBSCAN Clustering

In [None]:
import numpy as np
import pandas as pd
import umap
import hdbscan

assert len(df_ps) == emb.shape[0], "df_ps rows must match emb rows"

# --- UMAP reduce ---
# Note: these settings (30-D, n_neighbors=15, default min_dist) differ from the
# 15-D, n_neighbors=30, min_dist=0.0 projection used for the sweep in Section 6.
um = umap.UMAP(
    n_neighbors=15,
    n_components=30,     # commonly 10-30
    metric="cosine",
    random_state=42
)
emb_low = um.fit_transform(emb)

# --- HDBSCAN cluster (locked mcs=8) ---
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=8,
    min_samples=3,                 # a bit more stable; try 2-5
    metric="euclidean",
    #cluster_selection_method="leaf"
)
labels = clusterer.fit_predict(emb_low)

df_ps = df_ps.copy()
df_ps["cluster"] = labels

noise_rate = (df_ps["cluster"] == -1).mean()  # fraction of points labeled as noise
# Named n_found so it does not shadow the n_clusters() helper defined in Section 6
n_found = df_ps["cluster"].nunique() - (1 if -1 in df_ps["cluster"].unique() else 0)

print("clusters:", n_found, "noise:", round(noise_rate, 3))
df_ps["cluster"].value_counts().head(10)



clusters: 43 noise: 0.165


cluster
 9     352
-1     278
 27     84
 37     78
 28     72
 20     67
 18     65
 11     37
 0      36
 42     33
Name: count, dtype: int64

In [100]:
# json print of the first row
print(json.dumps(df_ps.iloc[0].to_dict(), indent=2, ensure_ascii=False))

{
  "uid": "uG8kRtNGEI#ps#0",
  "paper_id": "uG8kRtNGEI",
  "title": "Fix False Transparency by Noise Guided Splatting",
  "year": "2025",
  "setting": "3D neural rendering / object-centric 3DGS",
  "problem": "3D Gaussian Splatting (3DGS) often produces false transparency, where opaque surfaces appear semi-transparent due to unconstrained optimization and ambiguous alpha blending.",
  "solution": "Inject high-opacity, randomly colored noise Gaussians inside the object's volume during training to enforce correct surface opacity.",
  "key_insight": "Persistent internal noise structures act as an occlusion barrier, preventing the optimization from blending back surfaces into the front-facing rendering.",
  "cluster_text": "[SETTING] 3D neural rendering / object-centric 3DGS\n[PROBLEM] 3D Gaussian Splatting (3DGS) often produces false transparency, where opaque surfaces appear semi-transparent due to unconstrained optimization and ambiguous alpha blending.\n[SOLUTION] Inject high-opacity,

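## 8. RAG-Ready Pattern Library

Generate RAG-ready pattern documents (one doc per cluster).

A simple, general-purpose schema that is easy to load into a vector store later:

- pattern_id
- pattern_name (generated later by the LLM)
- pattern_description
- examples (sampled cluster_text entries)
- supporting_items (optional: evidence fields such as paper_id/title, if available)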

In [67]:
from collections import defaultdict

# Build the base pattern docs first (no LLM naming yet)
pattern_docs = []
for cid, sub in df_ps[df_ps["cluster"] != -1].groupby("cluster"):
    # examples: up to 10 randomly sampled entries
    ex = sub["cluster_text"].sample(n=min(10, len(sub)), random_state=0).tolist()

    doc = {
        "pattern_id": int(cid),
        "pattern_name": None,            # filled in Section 9
        "pattern_description": None,     # filled in Section 9
        "cluster_size": int(len(sub)),
        "examples": ex,
    }

    # Optionally attach paper_id/title as supporting evidence, if available
    if "paper_id" in sub.columns or "title" in sub.columns:
        keep_cols = [c for c in ["paper_id", "title"] if c in sub.columns]
        if keep_cols:
            doc["supporting_items"] = sub[keep_cols].drop_duplicates().head(30).to_dict("records")

    pattern_docs.append(doc)

pattern_docs = sorted(pattern_docs, key=lambda x: x["cluster_size"], reverse=True)
pattern_docs[:2]


[{'pattern_id': 9,
  'pattern_name': None,
  'pattern_description': None,
  'cluster_size': 352,
  'examples': ['[SETTING] Explainable VAD / multi-granularity reasoning\n[PROBLEM] Existing methods struggle to comprehensively understand and reason about anomalies at multiple temporal granularities, especially for complex, long-duration events.\n[SOLUTION] Construct a hierarchical granularity-aware tree to represent videos at multiple temporal scales, supporting multi-granularity anomaly reasoning and score fusion.\n[INSIGHT] Hierarchical representations allow flexible aggregation of evidence from both coarse and fine temporal segments, improving detection of both short and long anomalies.',
   "[SETTING] Benchmark construction / multi-modal, multi-video QA\n[PROBLEM] Current benchmarks only associate each question with a single short video clip, failing to assess models' ability to synthesize information across multiple sources.\n[SOLUTION] Curate AVHaystacks, a dataset of 3100 QA pairs

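The cell above draws examples at random. If the most typical members are preferred, one option is to rank cluster members by similarity to the cluster centroid. A sketch under the assumption that embedding rows are unit-normalized and aligned with the rows of `df_ps`. `most_typical` is a hypothetical helper and the data is synthetic:

```python
import numpy as np

def most_typical(emb: np.ndarray, labels: np.ndarray, cluster_id: int, n: int = 10) -> np.ndarray:
    """Row indices of the n members of cluster_id closest (by cosine) to its centroid."""
    idx = np.flatnonzero(labels == cluster_id)
    centroid = emb[idx].mean(axis=0)
    centroid /= np.linalg.norm(centroid) + 1e-12    # re-normalize the mean vector
    sims = emb[idx] @ centroid                      # cosine similarity to the centroid
    return idx[np.argsort(-sims)[:n]]

# Synthetic demo: cluster 0 has two points near the x-axis and one outlier on the y-axis
emb_demo = np.array([[1.0, 0.0], [0.98, 0.2], [0.0, 1.0], [0.6, 0.8]])
emb_demo /= np.linalg.norm(emb_demo, axis=1, keepdims=True)
labels_demo = np.array([0, 0, 0, 1])
typical = most_typical(emb_demo, labels_demo, cluster_id=0, n=2)
```

In the notebook this might be applied as `df_ps.iloc[most_typical(emb, df_ps["cluster"].to_numpy(), cid)]["cluster_text"]`, which relies on `df_ps` keeping its default row order.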
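## 9. Cluster Naming (LLM-generated pattern names): Name Each Cluster and Score Its Coherence

Coherence score: rated on a 1-5 scale.
The model also returns a one-sentence rationale for the score, which is useful for the paper's analysis.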

In [71]:
import json
from openai import OpenAI
client = OpenAI()

CLUSTER_SYS = """
You are a research methods analyst. You name reusable research patterns and evaluate cluster coherence.
Be specific and avoid generic phrases.
Return valid JSON only.
""".strip()

CLUSTER_USER = """
Given the following research settings, problems and solutions pair (all from the same cluster), do:

1) Provide a short, specific pattern_name (<= 8 words).
2) Provide a 1-2 sentence pattern_description describing the shared reusable technique.
3) Provide a setting that this pattern applies to.
4) Provide a base problem that this pattern addresses.
5) Provide a base solution that this pattern implements.
6) Rate coherence_score from 1 to 5:
   - 5: highly coherent, same reusable pattern
   - 3: somewhat coherent, related but mixed
   - 1: incoherent, unrelated items
7) Provide one-sentence rationale.

Return JSON:
{{
  "pattern_name": "...",
  "pattern_description": "...",
  "setting": "...",
  "base_problem": "...",
  "base_solution": "...",
  "coherence_score": <int 1-5>,
  "rationale": "..."
}}

problem-solution pairs:
{texts}
""".strip()

def name_and_score_cluster(examples, model="gpt-4.1"):
    # examples: list of cluster_text strings; 8-12 are enough
    joined = "\n\n---\n\n".join(examples[:12])
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": CLUSTER_SYS},
            {"role": "user", "content": CLUSTER_USER.format(texts=joined)},
        ],
        temperature=0,
        # If your model supports forced JSON output, enable this:
        # response_format={"type":"json_object"},
        timeout=90,
    )
    return json.loads(resp.choices[0].message.content)
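
Since `response_format` is commented out here, the model may wrap its JSON in a Markdown code fence or add surrounding text, and `json.loads` would then raise. A defensive parsing sketch using only the standard library. `parse_json_reply` is a hypothetical helper, not part of the OpenAI SDK:

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker

def parse_json_reply(text: str) -> dict:
    """Parse a JSON object from an LLM reply, tolerating code fences and extra prose."""
    text = text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with optional language tag) and the closing fence
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span; re-raise if there is none
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(text[start:end + 1])

# Made-up replies for illustration
fenced = FENCE + 'json\n{"pattern_name": "X", "coherence_score": 4}\n' + FENCE
chatty = 'Here is the result: {"pattern_name": "Y", "coherence_score": 2} Hope it helps.'
parsed_fenced = parse_json_reply(fenced)
parsed_chatty = parse_json_reply(chatty)
```

Enabling `response_format` where the model supports it is likely the simpler fix; the helper is a fallback for models that do not.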


Batch-name and score all pattern_docs (with checkpointing)

In [None]:
import os, time

OUT_JSONL = OUT_DIR/"pattern_library_mcs8.jsonl"

def append_jsonl(path, obj):
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(obj, ensure_ascii=False) + "\n")

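def load_done_ids(path):
    # Collect pattern_ids that already succeeded; failed records stay eligible for retry
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            try:
                rec = json.loads(line)
                if "error" not in rec:
                    done.add(rec["pattern_id"])
            except (json.JSONDecodeError, KeyError):
                pass
    return done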

done = load_done_ids(OUT_JSONL)

for doc in pattern_docs:
    pid = doc["pattern_id"]
    if pid in done:
        continue

    try:
        info = name_and_score_cluster(doc["examples"])
        doc2 = {**doc, **info}
        append_jsonl(OUT_JSONL, doc2)
        print(f"[OK] cluster={pid} size={doc['cluster_size']} score={doc2['coherence_score']} name={doc2['pattern_name']}")
        time.sleep(0.2)  # be gentle to avoid rate limits
    except Exception as e:
        append_jsonl(OUT_JSONL, {"pattern_id": pid, "error": str(e), "cluster_size": doc["cluster_size"]})
        print(f"[FAIL] cluster={pid} err={e}")


[OK] cluster=9 size=352 score=3 name=Granularity-Aware Reasoning
[OK] cluster=27 size=84 score=5 name=Architecture-Aware Compression & Constraint
[OK] cluster=37 size=78 score=5 name=Adaptive Reasoning Optimization
[OK] cluster=28 size=72 score=5 name=Adaptive Parameterization and Routing
[OK] cluster=20 size=67 score=5 name=Diffusion Model Optimization Patterns
[OK] cluster=18 size=65 score=5 name=Structure-Aware Graph Uncertainty Modeling
[OK] cluster=11 size=37 score=5 name=Symmetry-Constrained Generative Modeling
[OK] cluster=0 size=36 score=5 name=Assumption-Driven Causal Identification
[OK] cluster=42 size=33 score=5 name=Loss Decomposition and Policy Unification
[OK] cluster=8 size=29 score=5 name=Interaction-Aware Attribution
[OK] cluster=25 size=29 score=5 name=Context-Aware Cache and Decoding Optimization
[OK] cluster=30 size=29 score=5 name=Geometry-Aware Optimization Extensions
[OK] cluster=38 size=28 score=5 name=Systematic Safety Stress Testing
[OK] cluster=5 size=26 scor
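
The loop above records a failure and moves on. For transient errors such as rate limits or timeouts, retrying with exponential backoff before giving up may help. A generic sketch using only the standard library. `with_retries` and its parameters are hypothetical, and the `sleep` argument is injectable so the example runs without waiting:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(fn: Callable[[], T], max_attempts: int = 4, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(); on an exception wait base_delay * 2**attempt and retry, up to max_attempts calls."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                          # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)   # 1s, 2s, 4s, ... with the defaults

# Demo: a function that fails twice, then succeeds
calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

waits = []                                     # records requested delays instead of sleeping
result = with_retries(flaky, sleep=waits.append)
```

In the loop, `info = with_retries(lambda: name_and_score_cluster(doc["examples"]))` would be one way to apply it.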

Aggregate into a DataFrame (makes it easy to filter for "high-quality patterns")


In [75]:
rows = []
with open(OUT_JSONL, "r", encoding="utf-8") as f:
    for line in f:
        rows.append(json.loads(line))

pattern_df = pd.DataFrame(rows)
# Drop records for clusters whose naming call failed (they carry an "error" field).
if "error" in pattern_df.columns:
    pattern_df = pattern_df[pattern_df["error"].isna()]
#pattern_df.sort_values(["coherence_score", "cluster_size"], ascending=[False, False]).head(1)
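One way to filter for "high-quality patterns" is sketched below. This is a minimal sketch and not part of the original pipeline: the cutoffs `MIN_SCORE` and `MIN_SIZE` and the helper `select_high_quality` are hypothetical, and a small toy frame stands in for `pattern_df` (its rows reuse names, sizes, and scores from the log output above).

```python
import pandas as pd

# Toy stand-in for pattern_df; values are copied from the [OK] log lines above.
toy_df = pd.DataFrame(
    [
        {"pattern_id": 9, "pattern_name": "Granularity-Aware Reasoning",
         "cluster_size": 352, "coherence_score": 3},
        {"pattern_id": 27, "pattern_name": "Architecture-Aware Compression & Constraint",
         "cluster_size": 84, "coherence_score": 5},
        {"pattern_id": 8, "pattern_name": "Interaction-Aware Attribution",
         "cluster_size": 29, "coherence_score": 5},
    ]
)

MIN_SCORE = 4  # hypothetical cutoff on the LLM-assigned coherence score
MIN_SIZE = 20  # hypothetical cutoff on the number of items in a cluster


def select_high_quality(df, min_score=MIN_SCORE, min_size=MIN_SIZE):
    """Keep coherent, sufficiently large clusters, best first."""
    # Coerce scores to numbers in case the LLM returned them as strings.
    scores = pd.to_numeric(df["coherence_score"], errors="coerce")
    mask = (scores >= min_score) & (df["cluster_size"] >= min_size)
    return (
        df[mask]
        .sort_values(["coherence_score", "cluster_size"], ascending=[False, False])
        .reset_index(drop=True)
    )


high_quality = select_high_quality(toy_df)
print(high_quality[["pattern_id", "pattern_name", "cluster_size", "coherence_score"]])
```

With these cutoffs the largest cluster (`pattern_id` 9) is dropped because its coherence score is 3, which suggests that size alone is not a good proxy for pattern quality.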


In [92]:
pattern_df.head()

Unnamed: 0,pattern_id,pattern_name,pattern_description,cluster_size,examples,supporting_items,setting,base_problem,base_solution,coherence_score,rationale
0,9,Granularity-Aware Reasoning,This pattern constructs hierarchical or multi-...,352,[[SETTING] Explainable VAD / multi-granularity...,"[{'paper_id': 'uG8kRtNGEI', 'title': 'Fix Fals...",Explainable VAD / multi-granularity reasoning,Existing methods struggle to comprehensively u...,Construct a hierarchical granularity-aware tre...,3,While several pairs use hierarchical or multi-...
1,27,Architecture-Aware Compression & Constraint,This pattern leverages the structural properti...,84,[[SETTING] RNN expressivity / formal language ...,"[{'paper_id': 'lE2cD7C9fk', 'title': 'On Induc...",Transformer and RNN architectures for sequence...,"Standard methods for model analysis, compressi...",Develop architecture-aware algorithms that exp...,5,All pairs employ a reusable strategy of exploi...
2,37,Adaptive Reasoning Optimization,"This pattern involves adaptively structuring, ...",78,[[SETTING] LLM training / multi-format reasoni...,"[{'paper_id': 'D8nHwexHNv', 'title': 'Unveilin...",LLM training and reasoning optimization across...,Standard training or optimization methods for ...,Introduce adaptive mechanisms—such as step-wis...,5,All pairs share the core technique of adaptive...
3,28,Adaptive Parameterization and Routing,This pattern involves adaptively learning or r...,72,[[SETTING] Scientific ML / large operator mode...,"[{'paper_id': '4ULtNYHc5T', 'title': 'Explorin...","LLM training, scientific ML adaptation, solver...","Static or naive parameter choices, knowledge t...","Dynamically learn, route, or adjust parameters...",5,All pairs share the core technique of adaptive...
4,20,Diffusion Model Optimization Patterns,These research problems and solutions share th...,67,[[SETTING] RL value estimation / continuous co...,"[{'paper_id': 'YB9VGCClv9', 'title': 'Diffusio...",Diffusion model inference and training across ...,Standard diffusion model methods suffer from i...,"Introduce novel optimization methods, unified ...",5,All pairs focus on advancing diffusion model m...


In [83]:
out = pattern_df[[
    "pattern_id", "pattern_name", "pattern_description", "cluster_size",
    "setting", "base_problem", "base_solution", "supporting_items",
]]
#out = extract_insights_llm(paper_df.iloc[0]["paper_text"])
print(json.dumps(out.to_dict(orient="records"), indent=2, ensure_ascii=False))

[
  {
    "pattern_id": 9,
    "pattern_name": "Granularity-Aware Reasoning",
    "pattern_description": "This pattern constructs hierarchical or multi-level representations to enable reasoning and aggregation of evidence across multiple granularities, modalities, or sources, improving model performance on complex tasks.",
    "cluster_size": 352,
    "setting": "Explainable VAD / multi-granularity reasoning",
    "base_problem": "Existing methods struggle to comprehensively understand and reason about anomalies at multiple temporal granularities, especially for complex, long-duration events.",
    "base_solution": "Construct a hierarchical granularity-aware tree to represent videos at multiple temporal scales, supporting multi-granularity anomaly reasoning and score fusion.",
    "supporting_items": [
      {
        "paper_id": "uG8kRtNGEI",
        "title": "Fix False Transparency by Noise Guided Splatting"
      },
      {
        "paper_id": "yjLew3Nd7z",
        "title": "Part-Le

In [84]:
# save to jsonl
SIM_OUT_JSONL = OUT_DIR/"NIPS_pattern_library.jsonl"
with open(SIM_OUT_JSONL, "w", encoding="utf-8") as f:
    for _, r in out.iterrows():
        f.write(json.dumps(r.to_dict(), ensure_ascii=False) + "\n")



## 10. TODO: Finalize paper insights for the knowledge graph
* Back-link cluster_id to papers and problems
* Correlate with reviews
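
The back-linking step could start from the `supporting_items` field of each pattern record. The sketch below is one possible approach and not the notebook's implementation: `build_paper_pattern_edges` is a hypothetical helper, `paper_B` is a made-up paper id, and the toy frame only mirrors the field names of the records printed above. The cluster id appears as `pattern_id`, as in the saved library.

```python
import pandas as pd

# Toy stand-in for the pattern library. Field names follow the JSON records above;
# "paper_B" and the second pattern's name are placeholders for illustration only.
toy_patterns = pd.DataFrame(
    [
        {
            "pattern_id": 9,
            "pattern_name": "Granularity-Aware Reasoning",
            "supporting_items": [
                {"paper_id": "uG8kRtNGEI",
                 "title": "Fix False Transparency by Noise Guided Splatting"},
                {"paper_id": "paper_B", "title": "Hypothetical paper B"},
            ],
        },
        {
            "pattern_id": 27,
            "pattern_name": "Placeholder pattern",
            "supporting_items": [
                {"paper_id": "paper_B", "title": "Hypothetical paper B"},
            ],
        },
    ]
)


def build_paper_pattern_edges(patterns):
    """Flatten supporting_items into one (paper_id, pattern_id) row per link."""
    rows = []
    for rec in patterns.to_dict(orient="records"):
        items = rec.get("supporting_items")
        if not isinstance(items, list):
            continue  # missing or malformed supporting_items
        for item in items:
            rows.append(
                {
                    "paper_id": item["paper_id"],
                    "title": item.get("title"),
                    "pattern_id": rec["pattern_id"],
                    "pattern_name": rec["pattern_name"],
                }
            )
    edges = pd.DataFrame(rows, columns=["paper_id", "title", "pattern_id", "pattern_name"])
    # A paper may contribute several problem/solution pairs to the same cluster.
    return edges.drop_duplicates(subset=["paper_id", "pattern_id"]).reset_index(drop=True)


edges = build_paper_pattern_edges(toy_patterns)
# Paper -> list of pattern ids, usable as paper-to-pattern edges in a graph.
paper_to_patterns = edges.groupby("paper_id")["pattern_id"].apply(list).to_dict()
print(paper_to_patterns)
```

An edge table keyed on `paper_id` could later be joined with review data on the same key, assuming the reviews carry a matching paper identifier.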