Refactor sqlite toolchain: build_db pipeline, argparse CLI, README #702

Merged
Xreki merged 11 commits into
PaddlePaddle:develop from
Xreki:opt_sqlite
May 11, 2026


@Xreki Xreki commented May 9, 2026

PR Category

Other

Description

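This PR does the following:

  • Adds build_db.py: a one-stop batch database build script that automatically initializes the DB, iterates over the 4 sample_type values to insert samples, and then automatically runs bucketing (generate_buckets) and grouping (generate_groups).
    • Replaces the current graphsample_insert.sh. Batch-inserting samples from a shell script is too inefficient, and the overhead comes mainly from process startup.
    • Removes the insert_graph_sample() and generate_database() functions from generate_subgraph_dataset.sh, since build_db.py supersedes them.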
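
To illustrate why a single-process build avoids the per-sample process startup cost, here is a minimal sketch, not the actual build_db.py. The table name, columns, and sample-type names are hypothetical; the PR only states that there are 4 sample_type values.

```python
import sqlite3

# Hypothetical sample-type names; the PR only says there are 4 sample_type values.
SAMPLE_TYPES = ["type_a", "type_b", "type_c", "type_d"]


def build_db(conn, samples_by_type):
    """Insert every sample over one connection, inside one Python process."""
    # Hypothetical schema, for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS graph_sample ("
        "uuid TEXT PRIMARY KEY, sample_type TEXT NOT NULL, path TEXT NOT NULL)"
    )
    counts = {}
    for sample_type in SAMPLE_TYPES:
        inserted = 0
        for uuid, path in samples_by_type.get(sample_type, []):
            conn.execute(
                "INSERT INTO graph_sample (uuid, sample_type, path) VALUES (?, ?, ?)",
                (uuid, sample_type, path),
            )
            inserted += 1
        conn.commit()  # one commit per sample type, not one process per sample
        counts[sample_type] = inserted
    return counts


conn = sqlite3.connect(":memory:")
counts = build_db(
    conn,
    {"type_a": [("u1", "a/1"), ("u2", "a/2")], "type_b": [("u3", "b/1")]},
)
total = conn.execute("SELECT COUNT(*) FROM graph_sample").fetchone()[0]
print(counts, total)
```

A shell loop that launches one Python process per sample pays interpreter startup and import time on every insert. Here that cost is paid once.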

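To enable code reuse, other components had to be wrapped in functions:

  • graphsample_insert.py refactor: extracts a reusable insert_one_sample() function that supports an op_names_path_prefix parameter, so that inserting a subgraph sample also writes the operator names and input tensor meta.
  • Bucketing/grouping module wrappers: graph_sample_bucket_generator.py and graph_sample_groups_insert.py gain public generate_buckets() / generate_groups() interfaces that build_db.py calls in sequence.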
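
The reuse pattern described above could look roughly like the sketch below. The signature, schema, and return convention are assumptions for illustration, not the real insert_one_sample() in graphsample_insert.py.

```python
import sqlite3


def insert_one_sample(conn, uuid, graph_hash, sample_type):
    """Hypothetical signature: insert one row and return True on success."""
    try:
        with conn:  # commits on success, rolls back if the insert raises
            conn.execute(
                "INSERT INTO graph_sample (uuid, graph_hash, sample_type) "
                "VALUES (?, ?, ?)",
                (uuid, graph_hash, sample_type),
            )
        return True
    except sqlite3.IntegrityError:
        # Duplicate uuid or graph_hash under the assumed UNIQUE constraints.
        return False


conn = sqlite3.connect(":memory:")
# Hypothetical schema, for illustration only.
conn.execute(
    "CREATE TABLE graph_sample ("
    "uuid TEXT PRIMARY KEY, graph_hash TEXT UNIQUE, sample_type TEXT)"
)
results = [
    insert_one_sample(conn, "u1", "h1", "full_graph"),
    insert_one_sample(conn, "u2", "h2", "full_graph"),
    insert_one_sample(conn, "u1", "h3", "full_graph"),  # duplicate uuid
]
success = sum(results)
failed = len(results) - success
print(f"success={success} failed={failed}")
```

Because the function takes a connection and returns a status, a batch driver can import it and call it in a loop instead of duplicating the insert logic.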

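Some code cleanup:

  • upload_dataset.py / download_dataset.py: renamed from upload.py / download.py, with hard-coded variables removed in favor of argparse command-line arguments.
  • Variable naming cleanup: in graph_sample_groups_insert.py, gid→group_id, c→candidate, seen_dtypes→picked_dtypes, v1/v2→v1_stats/v2_stats, etc.
  • Full README rewrite (Readme.md → README.md): adds an overview of the table schema and usage instructions for every script, and changes all paths to relative paths.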
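
As a rough sketch of moving hard-coded variables to argparse, assuming hypothetical flag names (the actual options of upload_dataset.py / download_dataset.py are not shown in this PR page):

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Upload a dataset (illustrative sketch)."
    )
    # Hypothetical flags standing in for values that used to be hard-coded.
    parser.add_argument(
        "--db_path", required=True, help="Path to the sqlite database file."
    )
    parser.add_argument(
        "--dataset_dir", default="./dataset", help="Directory to upload."
    )
    return parser.parse_args(argv)


def main(args):
    print(f"uploading {args.dataset_dir} using {args.db_path}")


# Passing argv explicitly keeps the sketch runnable without a real command line.
args = parse_args(["--db_path", "samples.db"])
main(args)
```

Accepting an optional argv list and a main(args) entry point also makes the script easy to call from tests or from another module.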

Result: compared with generating the database via graphsample_insert.sh, build time drops from 10+ hours to under 1 hour.

Xreki and others added 2 commits May 9, 2026 13:24
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extract the insertion logic into a reusable insert_one_sample() function
so build_db.py can import it directly instead of duplicating the code.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

paddle-bot Bot commented May 9, 2026

Thanks for your contribution!

Xreki and others added 2 commits May 9, 2026 15:29
- init_db: compute migrates_dir from script location instead of CWD-relative path
- build_db: use main(args), add --op_names_path_prefix as required arg, auto-create db via migrate()
- Remove unused GRAPH_NET_ROOT and graph_net import from build_db

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
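
For reference, computing a directory from the script location rather than the current working directory is commonly done as in the sketch below. The directory name "migrates" and the script path are assumptions. In a real script the argument would be `__file__`.

```python
from pathlib import Path


def resolve_migrates_dir(script_file):
    """Return the 'migrates' directory that sits next to the given script."""
    return Path(script_file).resolve().parent / "migrates"


# Hypothetical location; a real script would pass __file__ instead.
script_file = "/opt/tools/sqlite/init_db.py"
migrates_dir = resolve_migrates_dir(script_file)
print(migrates_dir)
```

The result depends only on where the script lives, so it stays the same no matter which directory the command is launched from.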
- Auto-collect sample paths by scanning for model.py when list file is missing
- Use loop over sample_types instead of repeated code blocks
- Track and print success/fail counts and order range per type

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
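
The fallback of scanning for model.py when the list file is missing can be sketched as follows. The function name, list-file format (one path per line), and directory layout are assumptions, not taken from build_db.py.

```python
import tempfile
from pathlib import Path


def collect_sample_paths(sample_dir, list_file=None):
    """Return sample paths from list_file if it exists, else by scanning."""
    sample_dir = Path(sample_dir)
    if list_file is not None and Path(list_file).is_file():
        lines = Path(list_file).read_text().splitlines()
        return [line.strip() for line in lines if line.strip()]
    # Fallback: every directory that contains a model.py counts as one sample.
    return sorted(
        p.parent.relative_to(sample_dir).as_posix()
        for p in sample_dir.rglob("model.py")
    )


with tempfile.TemporaryDirectory() as root:
    for name in ("resnet/sub_0", "resnet/sub_1"):
        sample = Path(root) / name
        sample.mkdir(parents=True)
        (sample / "model.py").write_text("# placeholder\n")
    # The list file does not exist, so the scan path is taken.
    found = collect_sample_paths(root, list_file=Path(root) / "missing_list.txt")
print(found)
```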
Xreki and others added 5 commits May 9, 2026 16:11
- Skip non-full_graph types when directory is missing
- Print sample dir and list file paths before processing each type

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Xreki Xreki changed the title from "Optimize sqlite implementation." to "Refactor sqlite toolchain: build_db pipeline, argparse CLI, README overhaul" May 11, 2026
@Xreki Xreki changed the title from "Refactor sqlite toolchain: build_db pipeline, argparse CLI, README overhaul" to "Refactor sqlite toolchain: build_db pipeline, argparse CLI, README" May 11, 2026
@Xreki Xreki requested a review from fangfangssj May 11, 2026 08:20
Comment on lines +459 to +461
print(
"insert {sample_type} failed: integrity error (possible duplicate uuid or graph_hash)"
)
Collaborator

The f-string prefix is missing.

Collaborator Author

OK, I'll add it in the next PR.
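
For readers following the thread, a small sketch of the difference the reviewer is pointing at. The value of sample_type is an example; the message text is the one quoted in the review.

```python
sample_type = "full_graph"  # example value

# Without the f prefix the braces are printed literally.
plain = "insert {sample_type} failed: integrity error (possible duplicate uuid or graph_hash)"

# With the f prefix, {sample_type} is replaced by the variable's value.
formatted = f"insert {sample_type} failed: integrity error (possible duplicate uuid or graph_hash)"

print(plain)
print(formatted)
```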

@Xreki Xreki merged commit 2555829 into PaddlePaddle:develop May 11, 2026
3 checks passed
@Xreki Xreki deleted the opt_sqlite branch May 11, 2026 08:44
2 participants