# Top-Conference Playbooks / Story Abstraction / Reusable Methodology

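## Structured Knowledge Extraction Layer
* pattern-level: a ⟨Base Problem, Solution Pattern, Story⟩ triple, with Story explicitly defined as the way the paper packages (frames) its contribution
* paper-level: idea (the key innovation or core approach) / domain (first level) / sub_domains (second level)
* narrative-aware (not a generic summary)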
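
The two extraction levels above can be written down as a small schema. The sketch below is one possible shape, assuming the field names seen in the extracted records (`paper_id`, `paper_title`, `idea`, `domain`, `sub_domains`, `research_patterns`, `error`); the class names `ResearchPattern` and `PaperRecord` and the helper `parse_paper` are hypothetical and not part of the pipeline.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ResearchPattern:
    """Pattern-level triple; `story` is how the paper frames its contribution."""
    base_problem: str
    solution_pattern: str
    story: str
    application: str = ""  # appears in sample records; treated as optional here (assumption)


@dataclass
class PaperRecord:
    """Paper-level fields plus the list of extracted patterns."""
    paper_id: str
    paper_title: str
    idea: str
    domain: str
    sub_domains: List[str] = field(default_factory=list)
    research_patterns: List[ResearchPattern] = field(default_factory=list)
    error: Optional[str] = None  # assumed to be set only when extraction failed


def parse_paper(raw: Dict[str, Any]) -> PaperRecord:
    """Build a PaperRecord from one decoded JSONL line, tolerating missing keys."""
    patterns = [
        ResearchPattern(
            base_problem=p.get("base_problem", ""),
            solution_pattern=p.get("solution_pattern", ""),
            story=p.get("story", ""),
            application=p.get("application", ""),
        )
        for p in (raw.get("research_patterns") or [])
    ]
    return PaperRecord(
        paper_id=raw.get("paper_id", ""),
        paper_title=raw.get("paper_title", ""),
        idea=raw.get("idea", ""),
        domain=raw.get("domain", ""),
        sub_domains=list(raw.get("sub_domains") or []),
        research_patterns=patterns,
        error=raw.get("error"),
    )
```

Keeping `story` as its own field, rather than folding it into a summary, is what lets the later steps embed and cluster the narrative separately from the problem and solution.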

## Automatic Clustering
* Concatenate: Story / Base Problem / Solution / Idea
* SBERT embedding
* UMAP + HDBSCAN clustering
* cluster stats
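
The concatenation step can be sketched as follows: each paper is flattened into one record per pattern, and the text to embed is built from Story, Base Problem, Solution, and Idea in that order. The function names, the `"Label: text"` layout, and the rule that a non-empty `error` field skips the paper are assumptions made for illustration; the embedding and clustering calls themselves are left out.

```python
from typing import Any, Dict, Iterator, List


def build_pattern_text(pattern: Dict[str, Any], idea: str) -> str:
    """Join the four fields into one labelled text block, skipping empty ones."""
    parts = [
        ("Story", pattern.get("story", "")),
        ("Base Problem", pattern.get("base_problem", "")),
        ("Solution", pattern.get("solution_pattern", "")),
        ("Idea", idea),
    ]
    return "\n".join(
        f"{label}: {text.strip()}" for label, text in parts if text and text.strip()
    )


def flatten_patterns(papers: List[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one record per (paper, pattern) pair, ready for embedding."""
    for paper in papers:
        if paper.get("error"):
            continue  # assumption: failed extractions carry a non-empty error
        for idx, pattern in enumerate(paper.get("research_patterns") or []):
            yield {
                "paper_id": paper.get("paper_id"),
                "pattern_index": idx,
                "text": build_pattern_text(pattern, paper.get("idea", "") or ""),
            }
```

The resulting `text` values would then be passed to the SBERT model, with the record order preserved so that cluster labels can be joined back to `paper_id`.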


A finished (v1.0) pipeline should look like this:
```text
[Paper Corpus]
      ↓
[Structured Extraction]
  (idea / domain / story / pattern)
      ↓
[Story Embedding]
      ↓
[Story Cluster (UMAP+HDBSCAN)]
      ↓
[Cluster Naming + Typing]
      ↓
[Pattern Library]
      ↓
[Online Phase]
  Idea → Cluster(Story) → Pattern → Wrap → Review
```
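
The first hop of the online phase (Idea → Cluster) could be sketched as a nearest-centroid lookup. This is a minimal illustration under assumptions: the idea and the cluster centroids are taken to be vectors from the same embedding model, and the names `cosine` and `nearest_clusters` are hypothetical. Toy two-dimensional vectors stand in for real embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_clusters(
    query_vec: Sequence[float],
    centroids: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank clusters by similarity between the idea vector and each centroid."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in centroids.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy example: cluster names and vectors are made up for illustration.
toy_centroids = {"reframing": [1.0, 0.0], "benchmark": [0.0, 1.0]}
top_hits = nearest_clusters([0.9, 0.1], toy_centroids, top_k=1)
```

The patterns stored under the top-ranked clusters would then feed the Pattern → Wrap → Review steps.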



## 0. Environment Setup

In [1]:
import os
import json
import zipfile
from pathlib import Path
from typing import Dict, List, Any, Optional

import numpy as np
import pandas as pd
from tqdm import tqdm

from sentence_transformers import SentenceTransformer
import hdbscan

# Optional LLM (OpenAI example)
from openai import OpenAI
client = OpenAI()


## 1. LLM Extraction

Dataset: https://huggingface.co/datasets/AgentAlphaAGI/Paper-Review-Dataset/blob/main/ICLR_merged_cleaned_huggingface.jsonl

The file was downloaded to data/ICLR_merged_cleaned_huggingface.jsonl. The script below reads this local copy and resumes by skipping papers already present in output/iclr_patterns_full.jsonl.

```bash
python Paper-KG-Pipeline/scripts/extract_patterns_ICLR_en_local.py

```

Output: output/iclr_patterns_full.jsonl
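
The extraction script itself is not shown here, so the following is only a sketch of how resuming from finished papers could work. It assumes the output is a JSONL file with one record per line and a `paper_id` key; `load_done_ids` and `append_record` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Set


def load_done_ids(output_path: Path, id_key: str = "paper_id") -> Set[str]:
    """Collect IDs of papers already written to the output file."""
    done: Set[str] = set()
    if not output_path.exists():
        return done
    with output_path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # e.g. a partially written last line after a crash
            if isinstance(record, dict) and record.get(id_key):
                done.add(record[id_key])
    return done


def append_record(output_path: Path, record: dict) -> None:
    """Append one result as a single JSON line so progress survives restarts."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A driver loop would call `load_done_ids` once at startup, skip any input paper whose ID is in the returned set, and call `append_record` after each successful extraction.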

In [72]:
# quick look at the extracted paper patterns; some papers may still have extraction errors, even after retrying
with open("../output/iclr_patterns_full.jsonl", 'r', encoding='utf-8') as f:
    papers = [json.loads(line) for line in f]
papers_df = pd.DataFrame(papers)
papers_df.head()

| | paper_id | paper_title | idea | domain | sub_domains | research_patterns | error |
|---|---|---|---|---|---|---|---|
| 0 | RUzSobdYy0V | Quantifying and Mitigating the Impact of Label... | Analyze and mitigate the impact of label error... | Fairness & Accountability | [Label Noise, Disparity Metrics, Model Fairnes... | [{'base_problem': 'Label errors in training an... | |
| 1 | N3kGYG3ZcTi | Suppression helps: Lateral Inhibition-inspired... | Incorporate lateral inhibition mechanisms from... | Computer Vision | [Image Classification, Neural Network Architec... | [{'base_problem': 'Current CNN architectures d... | |
| 2 | tmIiMPl4IPa | Factorized Fourier Neural Operators | Introduce a factorized Fourier-based neural op... | Machine Learning | [Neural Operators, Partial Differential Equati... | [{'base_problem': 'Existing machine learning a... | |
| 3 | mhnHqRqcjYU | DFPC: Data flow driven pruning of coupled chan... | Introduce a data-free pruning strategy for cou... | Machine Learning | [Neural Network Pruning, Multi-Branch Architec... | [{'base_problem': 'Existing pruning methods fo... | |
| 4 | sZI1Oj9KBKy | TVSPrune - Pruning Non-discriminative filters ... | Introduce a data-free pruning method using tot... | Machine Learning | [Neural Network Pruning, Model Compression, Da... | [{'base_problem': 'Pruning deep neural network... | |


In [61]:
papers_df['paper_title'].str.contains("AlphaEdit", na=False)

0       False
1       False
2       False
3       False
4       False
        ...  
8306    False
8307    False
8308    False
8309    False
8310    False
Name: paper_title, Length: 8311, dtype: bool

In [82]:
import re

# 2025 best paper
# 2025 (Highlights):
# Safety Alignment Should be Made More Than Just a Few Tokens Deep (Qi et al.).
# Learning Dynamics of LLM Finetuning (Ren & Sutherland).
# AlphaEdit: Null-Space Constrained Model Editing for Language Models (Fang et al.).
# 2024 best paper: select these papers based on paper_title
best_titles_2024 = [
    "Generalization in diffusion models arises from geometry-adaptive harmonic representations",
    "Learning Interactive Real-World Simulators",
    "Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors",
    "Vision Transformers Need Registers",
    "Protein Discovery with Discrete Walk-Jump Sampling",
]

# escape each title so it is matched literally, then join the titles into one alternation
title_pattern = "|".join(re.escape(title) for title in best_titles_2024)
best_papers = papers_df[papers_df["paper_title"].str.contains(title_pattern, na=False)].drop(columns=["error"])

for i in range(len(best_papers)):
    print(f"Best Paper {i+1}:")
    print(json.dumps(best_papers.iloc[i].to_dict(), indent=2))
    print("\n")

Best Paper 1:
{
  "paper_id": "ANvmVS2Yr0",
  "paper_title": "Generalization in diffusion models arises from geometry-adaptive harmonic representations",
  "idea": "Diffusion models generalize well by leveraging geometry-adaptive harmonic representations, aligning inductive biases with data density.",
  "domain": "Machine Learning",
  "sub_domains": [
    "Diffusion Models",
    "Image Denoising",
    "Inductive Bias",
    "Harmonic Analysis"
  ],
  "research_patterns": [
    {
      "base_problem": "Diffusion models risk memorizing training data rather than learning the true continuous data density, questioning their generalization capabilities.",
      "solution_pattern": "Analyze the inductive biases of DNNs trained on image denoising tasks, revealing a geometry-adaptive harmonic basis that aligns with data density and supports strong generalization.",
      "story": "Reframe the understanding of diffusion models from mere sample generators to sophisticated learners of data geometry

In [22]:
# 2024 best paper
# 2025 best paper
# 2025 (Highlights):
# Safety Alignment Should be Made More Than Just a Few Tokens Deep (Qi et al.).
# Learning Dynamics of LLM Finetuning (Ren & Sutherland).
# AlphaEdit: Null-Space Constrained Model Editing for Language Models (Fang et al.).

[
  {
    "base_problem": "Label errors in training and test data disproportionately affect group-based disparity metrics, particularly harming minority groups.",
    "solution_pattern": "Develop a method to estimate the impact of changing a single training input's label on group disparity metrics, enabling targeted corrections to improve fairness.",
    "story": "Shift the focus from overall model accuracy to the nuanced impact of label errors on fairness metrics, providing a framework for identifying and correcting data issues that skew disparity metrics.",
    "application": "Fairness improvement in machine learning models, data quality auditing, targeted label correction in datasets."
  }
]
[
  {
    "base_problem": "Current CNN architectures do not fully utilize neurobiological mechanisms like lateral inhibition, which can enhance contrast and recognition capabilities.",
    "solution_pattern": "Introduce a lateral inhibition-inspired design that uses a low-pass filter and learnab

In [23]:
# domain value counts
papers_df['domain'].value_counts()

domain
Machine Learning                5310
Computer Vision                 1074
Natural Language Processing      778
Artificial Intelligence          317
Security & Privacy               235
                                ... 
Geospatial Data Science            1
Climate Science                    1
Computational Linguistics          1
Life Sciences                      1
Theoretical Computer Science       1
Name: count, Length: 98, dtype: int64

In [24]:
# subdomains flatten and value counts
subdomains = papers_df['sub_domains'].explode()
subdomains.value_counts()

sub_domains
Reinforcement Learning          859
Large Language Models           681
Generative Models               475
Diffusion Models                461
Neural Networks                 458
                               ... 
Textual Semantics                 1
Document Image Understanding      1
Event Localization                1
Sample Importance                 1
Instance-level Optimization       1
Name: count, Length: 7478, dtype: int64

In [25]:
# save domain and subdomain value counts
papers_df['domain'].value_counts().to_csv("../output/iclr_domain_value_counts.csv")
subdomains.value_counts().to_csv("../output/iclr_subdomain_value_counts.csv")

In [9]:
# TODO : further clustering on domain and subdomain

## 2. Packaging Templates: Embedding -> Clustering (UMAP and HDBSCAN)

Purpose
- Flatten paper JSONL -> pattern records
- Embed pattern text (Story-centric by default)
- UMAP + HDBSCAN clustering
- Compute cluster coherence metrics
- Fit Zipf (rank-size) stats
- LLM-based concise cluster naming (instead of top-words)
- Auto-tier clusters (A/B/C) and write tier_A/B/C.jsonl
- Generate report.md with Zipf + noise share + Top-10 table
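
How the coherence metric is computed is not spelled out here. One common reading, used in the sketch below as an assumption, is the mean cosine similarity between each member embedding and the cluster centroid. The function name mirrors the `centroid_coherence_mean` field, but the implementation is illustrative only.

```python
import math
from typing import List, Sequence


def _normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def centroid_coherence_mean(vectors: List[Sequence[float]]) -> float:
    """Mean cosine similarity between each member and the cluster centroid."""
    if not vectors:
        return 0.0
    unit = [_normalize(v) for v in vectors]
    dim = len(unit[0])
    centroid = _normalize([sum(u[i] for u in unit) / len(unit) for i in range(dim)])
    sims = [sum(a * b for a, b in zip(u, centroid)) for u in unit]
    return sum(sims) / len(sims)
```

Under this definition a cluster whose members all point in the same direction scores 1.0, and the score drops as members spread out.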

Outputs (in --outdir):
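- assignments.jsonl: paper-to-cluster mapping table
- clusters.jsonl: cluster-level summary, incl. coherence + LLM name
- cluster_library.jsonl: RAG-ready cluster objects with exemplars, used to retrieve patterns in the online phase
- Tier A (add to the library first): size ≥ 30 and centroid_coherence_mean ≥ 0.40 (thresholds can be tuned by quantile)
- Tier B (add to the library, but needs manual naming/cleaning): 10 ≤ size < 30 or coherence in [0.30, 0.40)
- Tier C (tail/noise candidates): size < 10 or coherence < 0.30 (kept as a "to merge / to split / to discard" queue)
- report.md: cluster analysis

Key point: size and coherence must be used together. A large cluster with low coherence usually signals mixed semantics or an overly broad topic, while a small cluster with high coherence may instead be a high-value niche pattern.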
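
The tier rules can be sketched as a single function. The written conditions for Tier B and Tier C overlap (for example, a cluster of size 5 with coherence 0.35), so this sketch assumes a precedence: Tier A is checked first, then Tier C, and everything else falls into Tier B. The function name and keyword parameters are hypothetical; the default thresholds are the ones listed above.

```python
def assign_tier(
    size: int,
    coherence: float,
    *,
    size_a: int = 30,
    size_c: int = 10,
    coherence_a: float = 0.40,
    coherence_c: float = 0.30,
) -> str:
    """Return "A", "B", or "C" for a cluster given its size and coherence."""
    if size >= size_a and coherence >= coherence_a:
        return "A"  # large and tight: add to the library first
    if size < size_c or coherence < coherence_c:
        return "C"  # tail or noisy: queue for merge / split / discard
    return "B"  # mid-sized or mid-coherence: keep, but clean up by hand
```

With this ordering, a small but very coherent cluster still lands in Tier C, so the review queue is where potential niche patterns would be picked up.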

```bash
(.venv) liling@new-host-7 Paper-KG-Pipeline % python scripts/analyze_clusters.py \
  --input output/iclr_patterns_full.jsonl \
  --outdir output \
  --sbert_model sentence-transformers/all-MiniLM-L6-v2 \
  --llm_name \
  --llm_model gpt-4.1-mini

Patterns: 8285
Batches: 100%|█████████████████████████████████████████████████████████████| 130/130 [00:11<00:00, 11.15it/s]
/Users/liling/projects/Idea2Paper/.venv/lib/python3.11/site-packages/umap/umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(
Clusters (excluding noise): 124
Noise/outliers (-1): 2304
Zipf:
  alpha (rank-size slope): 0.7125063659555734
  r2 (log-log fit): 0.967288605248928
  topk_share: {'1': 0.05534191606754723, '3': 0.13074736666109346, '5': 0.1815749874602909, '10': 0.2844006019060358, '20': 0.4450760742350777}
Outputs written to: output/
```
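
The Zipf numbers in the log can be read as a rank-size fit. The sketch below shows one way such statistics could be computed, assuming `alpha` is the negated slope of a least-squares line through (log rank, log size), `r2` is the squared correlation of that fit, and `topk_share` divides the top-k cluster sizes by the sum of all sizes passed in. Whether `analyze_clusters.py` uses exactly these definitions, or includes noise points in the denominator, is not confirmed here.

```python
import math
from typing import Dict, Sequence


def zipf_stats(sizes: Sequence[int], ks: Sequence[int] = (1, 3, 5, 10, 20)) -> Dict[str, object]:
    """Fit log(size) = c - alpha * log(rank) and report top-k size shares."""
    ranked = sorted((s for s in sizes if s > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two non-empty clusters")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(size) for size in ranked]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    total = sum(ranked)
    topk_share = {str(k): sum(ranked[:k]) / total for k in ks}
    return {"alpha": -slope, "r2": r2, "topk_share": topk_share}
```

An `alpha` below 1, as in the run above, indicates a flatter distribution than classic Zipf: the largest clusters hold a smaller share and the tail is relatively heavy.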


In [32]:
# quick look at the generated cluster_library
with open("../output/cluster_library.jsonl", 'r', encoding='utf-8') as f:
    clusters = [json.loads(line) for line in f]
clusters_df = pd.DataFrame(clusters)
# sort clusters_df by the size column, largest first
clusters_df = clusters_df.sort_values(by='size', ascending=False)
#clusters_df.head()
# write back to cluster_library_sorted.jsonl
with open("../output/cluster_library_sorted.jsonl", 'w', encoding='utf-8') as f:
    for _, row in clusters_df.iterrows():
        f.write(json.dumps(row.to_dict()) + '\n')   

In [36]:
# take a look at the first 5 clusters
#for i in range(2):
#    print(f"Cluster {i+1}:")
#    print(json.dumps(clusters_df.iloc[i].to_dict(), indent=2))
#    print("\n")


## 3. Reviews

In [43]:
# ===== Local input file (downloaded JSONL) =====
ICLR_Path  = Path("../data/paper_reviews_dataset_iclr_test_10.jsonl")

#read the ICLR dataset
with open(ICLR_Path, 'r', encoding='utf-8') as f:
    papers = [json.loads(line) for line in f]
papers_df = pd.DataFrame(papers)    



In [44]:
# pretty-print the full record of the first paper as JSON
print(json.dumps(papers_df.iloc[0].to_dict(), indent=2))

{
  "title": "Quantifying and Mitigating the Impact of Label Errors on Model Disparity Metrics",
  "authors": "Julius Adebayo, Melissa Hall, Bowen Yu, Bobbie Chern",
  "abstract": "Errors in labels obtained via human annotation adversely affect a trained model's performance. Existing approaches propose ways to mitigate the effect of label error on a model's downstream accuracy, yet little is known about its impact on a model's group-based disparity metrics\\footnote{Group-based disparity metrics like subgroup calibration, false positive rate, false negative rate, equalized odds, and equal opportunity are more often known, colloquially, as \\textit{fairness metrics} in the literature. We use the term group-based disparity metrics in this work.}. Here we study the effect of label error on a model's group-based disparity metrics like group calibration. We empirically characterize how varying levels of label error, in both training and test data, affect these disparity metrics. We find tha

In [51]:
# parse related_notes field: string to array of json

papers_df['parsed_related_notes'] = papers_df['related_notes'].apply(lambda x: json.loads(x) if isinstance(x, str) else x)

print(json.dumps(papers_df.iloc[0]['parsed_related_notes'], indent=2))
#print(json.dumps(papers_df.iloc[1]['related_notes'], indent=2))
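
The lambda above raises an exception if any `related_notes` string is not valid JSON. A more defensive variant is sketched below; `parse_related_notes` is a hypothetical helper, and returning an empty list for missing or malformed values is an assumption about what the downstream code would want.

```python
import json
from typing import Any, List


def parse_related_notes(value: Any) -> List[dict]:
    """Decode a related_notes cell into a list of review dicts."""
    if isinstance(value, list):
        return value  # already parsed
    if not isinstance(value, str) or not value.strip():
        return []  # None, NaN, or empty string
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        return []  # malformed JSON is treated as "no reviews"
    return parsed if isinstance(parsed, list) else [parsed]
```

It can be dropped into the same call, for example `papers_df['related_notes'].apply(parse_related_notes)`.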

[
  {
    "review_id": "tlqdB1VCIb",
    "paper_id": "RUzSobdYy0V",
    "reviewer": null,
    "paper_summary": "This paper investigates the effect of label error on the model\u2019s disparity metrics (e.g., calibration, FPR, FNR) on both the training and test set. The authors found that empirically, label errors have a larger influence on minority groups than on majority groups. The authors proposed a method to estimate the influence of changing a single training input\u2019s label on a model\u2019s group disparity metric. Reviewers agree that the studied problem is important and may have many practical implications and that the proposed method is well-motivated. At the same time, reviewers also have several sensible concerns; e.g., the technical contribution may not be strong enough, and the proposed method may not practical to deal with real-world machine learning datasets. However, overall, I believe the value overweights the issues in the paper.",
    "strengths": "",
    "weakness