Refactor sqlite toolchain: build_db pipeline, argparse CLI, README by Xreki · Pull Request #702 · PaddlePaddle/GraphNet

Xreki · 2026-05-09T07:03:55Z

PR Category

Other

Description

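This PR does the following:

Add build_db.py: a one-stop batch database-building script that automatically initializes the DB, iterates over the 4 sample_type values to insert samples, and then automatically runs bucketing (generate_buckets) and grouping (generate_groups).
- Replaces the current graphsample_insert.sh. Inserting samples in bulk from a shell script is very inefficient, and the overhead is mainly process startup.
- Removes the insert_graph_sample() and generate_database() functions from generate_subgraph_dataset.sh, since build_db.py supersedes them.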
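
The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the actual build_db.py: it uses the standard-library sqlite3 module with a hypothetical one-table schema, placeholder sample type names (only full_graph appears in this PR's commit messages), and stub generate_buckets() / generate_groups() bodies. The point it shows is that every insert runs in one process over one connection, so no per-sample process launch is paid.

```python
import sqlite3

# Hypothetical names: only "full_graph" appears in this PR's text.
SAMPLE_TYPES = ["full_graph", "type_b", "type_c", "type_d"]


def init_db(conn):
    # Hypothetical schema, far simpler than the real one.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS graph_sample ("
        "uuid TEXT PRIMARY KEY, sample_type TEXT, bucket TEXT, group_id TEXT)"
    )


def insert_one_sample(conn, sample_type, uuid):
    """Insert one row; return True on success, False on a duplicate uuid."""
    try:
        conn.execute(
            "INSERT INTO graph_sample (uuid, sample_type) VALUES (?, ?)",
            (uuid, sample_type),
        )
        return True
    except sqlite3.IntegrityError:
        return False


def generate_buckets(conn):
    # Stub: the real bucketing rule is not described in this PR.
    conn.execute("UPDATE graph_sample SET bucket = sample_type")


def generate_groups(conn):
    # Stub: the real grouping rule is not described in this PR.
    conn.execute("UPDATE graph_sample SET group_id = bucket || '-0'")


def build_db(conn, samples_by_type):
    """Initialize, insert every sample type in-process, then bucket and group."""
    init_db(conn)
    counts = {}
    for sample_type in SAMPLE_TYPES:
        ok = fail = 0
        for uuid in samples_by_type.get(sample_type, []):
            if insert_one_sample(conn, sample_type, uuid):
                ok += 1
            else:
                fail += 1
        counts[sample_type] = (ok, fail)
    generate_buckets(conn)
    generate_groups(conn)
    conn.commit()
    return counts
```

Calling build_db(sqlite3.connect(":memory:"), {"full_graph": ["a", "b"]}) returns a per-type (success, fail) tuple, loosely mirroring the success/fail counts mentioned in the commit messages.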

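To enable code reuse, other components are wrapped into functions:

graphsample_insert.py refactor: extract a reusable insert_one_sample() function that supports an op_names_path_prefix parameter, so operator names and input tensor meta are written along with each subgraph sample on insertion.
Bucketing/grouping module wrappers: graph_sample_bucket_generator.py and graph_sample_groups_insert.py gain public generate_buckets() / generate_groups() interfaces that build_db.py calls in sequence.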
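
As a rough illustration of the refactor pattern (script logic pulled into an importable function with an optional extra input), consider the sketch below. It is not the real insert_one_sample(): the signature, the in-memory `records` list standing in for the database, and the `<prefix>/<sample_path>/op_names.txt` file layout are all assumptions made for the example, and input tensor meta is left out.

```python
import os


def insert_one_sample(records, sample_path, sample_type, op_names_path_prefix=None):
    """Append one record to `records` (a stand-in for the database).

    When op_names_path_prefix is given, operator names are loaded from a
    hypothetical file layout and stored with the sample.
    """
    record = {"path": sample_path, "sample_type": sample_type, "op_names": None}
    if op_names_path_prefix is not None:
        # Hypothetical layout: <prefix>/<sample_path>/op_names.txt
        op_names_file = os.path.join(
            op_names_path_prefix, sample_path, "op_names.txt"
        )
        if os.path.isfile(op_names_file):
            with open(op_names_file, encoding="utf-8") as f:
                record["op_names"] = [line.strip() for line in f if line.strip()]
    records.append(record)
    return record
```

With the logic in a function, a script's main() shrinks to argument parsing plus one call, and a batch driver can import the same function and call it in a loop.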

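Some code cleanups:

upload_dataset.py / download_dataset.py: renamed from upload.py / download.py; hard-coded variables are removed in favor of argparse command-line arguments.
Variable naming cleanup: in graph_sample_groups_insert.py, gid→group_id, c→candidate, seen_dtypes→picked_dtypes, v1/v2→v1_stats/v2_stats, and so on.
Full README rewrite (Readme.md → README.md): adds an overview of the table schema and usage instructions for all scripts; all paths are now relative.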
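
Moving from hard-coded variables to argparse typically looks like the sketch below. The flag names (--db_path, --remote_name, --dry_run) are hypothetical and chosen only for illustration; the actual options of upload_dataset.py / download_dataset.py are not listed in this PR description.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; all flag names here are hypothetical."""
    parser = argparse.ArgumentParser(description="Upload a dataset file.")
    parser.add_argument("--db_path", required=True, help="Path to the SQLite file.")
    parser.add_argument(
        "--remote_name", default="dataset", help="Name to use on the remote side."
    )
    parser.add_argument(
        "--dry_run", action="store_true", help="Print actions without uploading."
    )
    return parser.parse_args(argv)
```

Accepting an explicit argv list (defaulting to sys.argv when None) keeps the parser easy to exercise from tests, for example parse_args(["--db_path", "x.db"]).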

Result: compared with generating the database via graphsample_insert.sh, the time drops from over 10 hours to under 1 hour.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Extract the insertion logic into a reusable insert_one_sample() function so build_db.py can import it directly instead of duplicating the code. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

paddle-bot · 2026-05-09T07:04:01Z

Thanks for your contribution!

- init_db: compute migrates_dir from script location instead of CWD-relative path
- build_db: use main(args), add --op_names_path_prefix as required arg, auto-create db via migrate()
- Remove unused GRAPH_NET_ROOT and graph_net import from build_db

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
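
Resolving a directory relative to the script file rather than the current working directory is usually done as in the sketch below. The helper name migrates_dir_for and the directory name "migrates" are assumptions inferred from the migrates_dir variable named in the commit message, not confirmed details of init_db.

```python
import os


def migrates_dir_for(script_path, dirname="migrates"):
    """Return the migrations directory that sits next to the given script."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.join(script_dir, dirname)
```

Inside a script this would be called as migrates_dir_for(__file__), so the result stays the same no matter which directory the script is launched from.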

- Auto-collect sample paths by scanning for model.py when list file is missing
- Use loop over sample_types instead of repeated code blocks
- Track and print success/fail counts and order range per type

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
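
The fallback scan for model.py could look roughly like this. It is a sketch under assumptions: the function name collect_sample_paths, the one-path-per-line list file format, and returning paths relative to the sample directory are all illustrative choices, not details confirmed by the commit.

```python
import os


def collect_sample_paths(sample_dir, list_file=None):
    """Return sample paths from list_file, or scan sample_dir for model.py."""
    if list_file and os.path.isfile(list_file):
        # Assumed format: one sample path per line, blank lines ignored.
        with open(list_file, encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]
    paths = []
    for root, _dirs, files in os.walk(sample_dir):
        if "model.py" in files:
            # A directory that directly contains model.py counts as one sample.
            paths.append(os.path.relpath(root, sample_dir))
    return sorted(paths)
```

Sorting the scanned paths keeps the insertion order stable across runs, which matters if an order range per type is reported.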

- Skip non-full_graph types when directory is missing
- Print sample dir and list file paths before processing each type

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

fangfangssj · 2026-05-11T08:37:41Z

+        print(
+            "insert {sample_type} failed: integrity error (possible duplicate uuid or graph_hash)"
+        )


Missing f-string prefix.

OK, I'll add it in the next PR.
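
To make the review point concrete: without the f prefix, Python treats {sample_type} as literal text instead of substituting the variable. The helper below is a hypothetical illustration with a shortened message, not code from the PR.

```python
def format_insert_error(sample_type):
    """Return the message without and with the f-string prefix."""
    # Plain string: the braces are kept literally.
    plain = "insert {sample_type} failed: integrity error"
    # f-string: the variable's value is substituted.
    formatted = f"insert {sample_type} failed: integrity error"
    return plain, formatted
```

For sample_type="full_graph", the first string still contains the literal text {sample_type}, while the second reads "insert full_graph failed: integrity error".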

Xreki and others added 2 commits May 9, 2026 13:24

Strip invisible chars from all string args in graphsample_insert main()

040a442

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
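
One way to strip invisible characters from every string argument is sketched below. Which characters the actual commit removes is not stated; this example assumes Unicode format characters (category Cf, such as zero-width spaces and byte-order marks) plus surrounding whitespace, and both helper names are hypothetical.

```python
import argparse
import unicodedata


def strip_invisible(text):
    """Drop Unicode format characters (category Cf), then trim whitespace."""
    cleaned = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return cleaned.strip()


def clean_string_args(args):
    """Apply strip_invisible to every string attribute of an argparse.Namespace."""
    for name, value in list(vars(args).items()):
        if isinstance(value, str):
            setattr(args, name, strip_invisible(value))
    return args
```

This kind of cleanup helps when paths are pasted from documents or chat tools that silently add zero-width characters, which otherwise cause confusing "file not found" errors.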

Extract insert_one_sample from graphsample_insert and add build_db.py

fa7257d

Extract the insertion logic into a reusable insert_one_sample() function so build_db.py can import it directly instead of duplicating the code. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Xreki and others added 2 commits May 9, 2026 15:29

Xreki force-pushed the opt_sqlite branch from 3d6be75 to 585d1fc Compare May 9, 2026 08:02

Xreki and others added 5 commits May 9, 2026 16:11

Add directory check and path logging for each sample type in build_db

34150d2

- Skip non-full_graph types when directory is missing
- Print sample dir and list file paths before processing each type

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Simplify sample_types to a plain string list in build_db

a3f7c0b

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Rename and minor fix.

71a1d21

Update README.

ed2240b

Add generation of buckets and groups into the build of db.

7b1c685

Xreki changed the title ~~Optimize sqlite implementation.~~ Refactor sqlite toolchain: build_db pipeline, argparse CLI, README overhaul May 11, 2026

Rename variables.

c865655

Xreki force-pushed the opt_sqlite branch from a01ed04 to c865655 Compare May 11, 2026 05:54

Xreki changed the title ~~Refactor sqlite toolchain: build_db pipeline, argparse CLI, README overhaul~~ Refactor sqlite toolchain: build_db pipeline, argparse CLI, README May 11, 2026

Optimize session.

7d9ee77

Xreki requested a review from fangfangssj May 11, 2026 08:20

fangfangssj approved these changes May 11, 2026

Xreki merged commit 2555829 into PaddlePaddle:develop May 11, 2026
3 checks passed

Xreki deleted the opt_sqlite branch May 11, 2026 08:44

Xreki merged 11 commits into PaddlePaddle:develop from Xreki:opt_sqlite


2 participants
