Agent: add extraction helper scripts and dedup tool #713
Merged
Thanks for your contribution!
Add graph_net/agent/scripts/ with three utilities:
1. analyze_extraction_log.sh
- Analyze batch extraction logs for success/failure stats
- Categorize failures (script error, model too large, download failure, etc.)
- Support CPU/GPU modes and binary logs (grep -a)
- Generate processed/success model lists to /tmp/
2. check_extraction_progress.sh
- Check real-time status of running extraction tasks
- Show PID, CPU/memory, worker count, progress, speed estimate,
disk space, and sample directory counts
- Auto-detect latest log or accept custom log path
3. gen_hash_and_dedup.py
- Walk extracted subgraphs, compute SHA256 of model.py files
- Generate graph_hash.txt per subgraph and dedup_report.txt
- Support --remove to delete duplicate subgraphs (keep first per group)
- Default workspace is current directory (.), no hardcoded paths
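For reviewers who want a feel for the hash-and-dedup step, here is a minimal sketch of the idea described for gen_hash_and_dedup.py. It is not the script from this PR: the function names (`sha256_of_file`, `group_by_hash`, `duplicates`) are hypothetical, and the sorted traversal order used to decide which subgraph counts as "first" is an assumption.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Return the hex SHA256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def group_by_hash(workspace: Path) -> dict:
    """Hash every model.py under workspace and group subgraph dirs by digest.

    Also writes graph_hash.txt next to each model.py, as the PR describes.
    Sorting the paths is an assumption made here so that "first per group"
    is deterministic.
    """
    groups = defaultdict(list)
    for model_file in sorted(workspace.rglob("model.py")):
        digest = sha256_of_file(model_file)
        (model_file.parent / "graph_hash.txt").write_text(digest + "\n")
        groups[digest].append(model_file.parent)
    return dict(groups)


def duplicates(groups: dict) -> list:
    """Return every subgraph directory except the first one in each group."""
    return [d for dirs in groups.values() for d in dirs[1:]]
```

With `--remove`, the directories returned by something like `duplicates(...)` would be the ones deleted (for example with `shutil.rmtree`); the sketch only lists them so it is safe to run.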
Update graph_net/agent/README.md with usage docs and environment variable
overrides (GRAPHNET_LOG_DIR, GRAPHNET_SUCCESS_DIR, GRAPHNET_SAMPLES_DIR).
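The environment variable overrides could be resolved roughly as follows. This is a sketch only: the three variable names come from this PR, but the default paths and the `resolve_dirs` helper are hypothetical placeholders, not the values used by the scripts.

```python
import os
from pathlib import Path

# Variable names are from the PR; every default path below is a made-up
# placeholder for illustration, not the scripts' real default.
DEFAULTS = {
    "GRAPHNET_LOG_DIR": "/tmp/graphnet_logs",
    "GRAPHNET_SUCCESS_DIR": "/tmp/graphnet_success",
    "GRAPHNET_SAMPLES_DIR": "/tmp/graphnet_samples",
}


def resolve_dirs(env=None) -> dict:
    """Return each directory, preferring the environment over the default.

    An unset or empty variable falls back to the placeholder default.
    """
    env = os.environ if env is None else env
    return {name: Path(env.get(name) or default) for name, default in DEFAULTS.items()}
```

Passing a plain dict instead of `os.environ` keeps the helper easy to check, for example `resolve_dirs({"GRAPHNET_LOG_DIR": "/data/logs"})`.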
Xreki approved these changes on May 19, 2026
TODO:
PR Category
Feature Enhancement
Description
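Agent: add extraction helper scripts and a dedup tool
Add the graph_net/agent/scripts/ directory with three helper scripts:
analyze_extraction_log.sh: log analysis script
Failed to download / timeout / 401 / ducc, etc.)
check_extraction_progress.sh: progress check script
(processed/total, success/failure, success rate, speed estimate, estimated time remaining),
disk space, and file counts in the sample directories
gen_hash_and_dedup.py: subgraph dedup script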
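To illustrate the failure categorization that the log analysis script performs, here is a small Python sketch. It is not a port of analyze_extraction_log.sh: the four match strings (Failed to download, timeout, 401, ducc) appear in this description, but the category names, the matching order, and the helper names are assumptions. Decoding with `errors="replace"` stands in for the script's `grep -a` handling of binary logs.

```python
import re
from collections import Counter
from typing import Optional

# Match strings are taken from the PR description; the category labels and
# the first-match-wins ordering are assumptions made for this sketch.
FAILURE_PATTERNS = [
    ("download_failure", re.compile(r"Failed to download")),
    ("timeout", re.compile(r"timeout", re.IGNORECASE)),
    ("auth_401", re.compile(r"\b401\b")),
    ("ducc", re.compile(r"ducc")),
]


def categorize(line: str) -> Optional[str]:
    """Return the first matching failure category, or None if no pattern hits."""
    for name, pattern in FAILURE_PATTERNS:
        if pattern.search(line):
            return name
    return None


def count_failures(raw: bytes) -> Counter:
    """Count failure categories in a log that may contain non-UTF-8 bytes."""
    text = raw.decode("utf-8", errors="replace")
    counts = Counter()
    for line in text.splitlines():
        category = categorize(line)
        if category is not None:
            counts[category] += 1
    return counts
```

A log would typically be read with `Path(log_path).read_bytes()` and passed to `count_failures`.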
Update graph_net/agent/README.md:
GRAPHNET_SAMPLES_DIR), for easy reuse across environments
computation graphs; measured reduction of 90%+ (85K subgraphs -> 1.5K unique, 2.3 GB -> 172 MB)