【Hackathon 9th Sprint No.9】feat: implement ES(t) macro/micro cross-validation and refactor analysis utilities #363
Merged
…sis utilities
This commit implements the Error-aware Speedup Score (ES_t) metric from
Section 3.2.2 of the technical report (arXiv:2510.24035), along with the
mathematical proofs from Appendix B and C that establish the sample-level
validity of both S_t and ES_t metrics.
Key Features:
=============
1. Appendix B Implementation - Sample-level proof for S_t:
- Micro-level calculation: geometric mean of rectified speedups for all samples
- Macro-level calculation: S_t = α^λ · β^(ληp) · b^(1-λ)
- Cross-validation: both methods produce matching results, numerically
confirming that S_t equals the geometric mean of sample-level rectified speedups
2. Appendix C Implementation - Sample-level proof for ES_t:
- Micro-level calculation: geometric mean of error-aware rectified speedups
- Macro-level calculation: ES_t = α^λ · β^(ληp) · γ_t^(1-λ)
- Dynamic penalty factor: γ_t = b^(sum(π_c * indicator(t < c)))
- Cross-validation: validates that ES_t is the geometric mean of
error-aware rectified speedups, where failure samples use type-specific
dynamic penalties instead of fixed penalty b
3. Error-aware design (Section 3.2.2):
- Error type classification: c=1 (accuracy), c=2 (runtime crash), c=3 (compile failure)
- Tiered tolerance rules: t≥1 tolerates accuracy errors, t≥2 tolerates
runtime crashes, t≥3 tolerates all errors
- Dynamic penalty γ_t adapts based on error type distribution and tolerance level
4. Independent verification script:
- verify_macro_params.py: calculates and prints all macro parameters
(alpha, beta, gamma, lambda, eta, pi) independently
- Enables validation of plot_ESt results by computing each parameter separately
5. Mandatory validation mechanism:
- plot_ESt.py: enforces macro/micro result matching before adoption
- Rejects results if validation fails, ensuring calculation correctness
6. Code refactoring for maintainability:
- macro_statistics.py: dedicated module for macro parameter calculations
- Each parameter has its own function (alpha, beta, gamma, lambda, eta, pi)
- Reduced nesting levels in analysis_util.py by extracting helper functions
- Simplified scan_all_folders and added .txt file support
- Improved code organization following software engineering best practices
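The dynamic penalty from item 2 can be illustrated with a minimal sketch. It follows the formula γ_t = b^(sum(π_c * indicator(t < c))) quoted above, with π_c taken as the share of failed samples that have error type c. The function name `gamma_sketch`, the penalty value `b=0.1`, and the example counts are assumptions made for illustration; they are not taken from the PR's code.

```python
def gamma_sketch(t: int, errno2count: dict[int, int], b: float) -> float:
    """Penalty factor gamma_t = b ** sum(pi_c for every error type c with t < c).

    errno2count maps an error type c (1=accuracy, 2=runtime crash,
    3=compile failure) to the number of failed samples of that type.
    """
    total_failures = sum(errno2count.values())
    if total_failures == 0:
        return 1.0  # no failures, so nothing to penalize
    # pi_c is the proportion of failures of type c; only untolerated types count.
    exponent = sum(
        count / total_failures
        for errno, count in errno2count.items()
        if t < errno
    )
    return b ** exponent


# Hypothetical distribution: 2 accuracy errors, 1 runtime crash, 1 compile failure.
counts = {1: 2, 2: 1, 3: 1}
for t in range(4):
    print(t, gamma_sketch(t, counts, b=0.1))
```

At t=0 every error type is penalized and γ_t equals b; at t≥3 every error type is tolerated and γ_t equals 1, which matches the tiered tolerance rules in item 3.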
Technical Details:
==================
- Micro calculation: processes each sample individually, applies rectified
speedup rules, then computes geometric mean
- Macro calculation: uses aggregated statistics (correct count, speedup
distributions, error type proportions) to compute expected values
- Validation: compares micro and macro results with tolerance threshold (1e-6)
- All calculations verified against real benchmark data (118 samples)
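The micro/macro cross-check described above can be sketched as follows. This is an illustration under stated assumptions, not the code in plot_ESt.py: it assumes a correct sample with speedup s contributes s when s ≥ 1 and s^(p+1) when s < 1 (the rule under which the macro formula α^λ · β^(ληp) · γ_t^(1-λ) reproduces the geometric mean), that a failed sample of type c contributes b when t < c and 1 otherwise, and that errno 0 marks a correct sample. All function names and the values of p and b are hypothetical.

```python
import math
from collections import Counter


def geometric_mean(values: list[float]) -> float:
    return math.exp(sum(math.log(v) for v in values) / len(values))


def micro_es(samples: list[tuple[float, int]], t: int, p: float, b: float) -> float:
    """Geometric mean of per-sample error-aware rectified speedups."""
    rectified = []
    for speedup, errno in samples:
        if errno == 0:  # correct sample (assumed encoding)
            rectified.append(speedup if speedup >= 1.0 else speedup ** (p + 1))
        else:  # failed sample: penalized unless tolerated at level t
            rectified.append(b if t < errno else 1.0)
    return geometric_mean(rectified)


def macro_es(samples: list[tuple[float, int]], t: int, p: float, b: float) -> float:
    """ES_t = alpha^lambda * beta^(lambda*eta*p) * gamma_t^(1-lambda)."""
    correct = [s for s, errno in samples if errno == 0]
    slow = [s for s in correct if s < 1.0]
    errno2count = Counter(errno for _, errno in samples if errno != 0)
    failures = sum(errno2count.values())
    lam = len(correct) / len(samples)
    eta = len(slow) / len(correct) if len(correct) > 0 else 0.0
    alpha = geometric_mean(correct) if len(correct) > 0 else 1.0
    beta = geometric_mean(slow) if len(slow) > 0 else 1.0
    exponent = sum(n / failures for c, n in errno2count.items() if t < c) if failures > 0 else 0.0
    gamma = b ** exponent
    return alpha ** lam * beta ** (lam * eta * p) * gamma ** (1 - lam)


# Hypothetical data: (speedup, errno) pairs; speedup is ignored for failures.
samples = [(2.0, 0), (0.5, 0), (1.0, 1), (1.0, 2), (1.0, 3)]
for t in range(4):
    micro, macro = micro_es(samples, t, 1.0, 0.1), macro_es(samples, t, 1.0, 0.1)
    assert abs(micro - macro) < 1e-6, f"mismatch at t={t}"
```

The final assertion mirrors the 1e-6 threshold mentioned in the validation bullet.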
Files Changed:
==============
- graph_net/analysis_util.py: refactored with helper functions, integrated
macro_statistics module, reduced nesting, simplified scan_all_folders
- graph_net/macro_statistics.py: new module for macro parameter calculations
- graph_net/plot_ESt.py: added mandatory macro/micro validation
- graph_net/verify_macro_params.py: new independent verification script
All code passes pre-commit checks, compiles successfully, and has been
validated with real benchmark data.
Thanks for your contribution!
lixinqi
reviewed
Nov 14, 2025
This commit refactors the evaluation metrics calculation code with the following improvements:
1. Terminology refactoring: macro -> aggregated
- Rename macro_statistics.py to samples_statistics.py
- Rename verify_macro_params.py to verify_aggregated_params.py
- Update all variable and function names accordingly
2. Code structure improvements
- Extract verification logic in plot_ESt.py into separate functions
* compare_single_tolerance_level (12 lines)
* print_verification_result (1 line)
* verify_aggregated_micro_consistency (28 lines, meets ≤30 line requirement)
- Refactor verify_aggregated_params.py to use functional programming style
* Replace structured loops with list comprehensions
* Use Counter for error type counting
* Reduce multiple traversals to single pass where possible
3. Reduce function parameter coupling
- calculate_beta: derive slowdown_speedups internally from correct_speedups
- calculate_lambda: derive correct_count internally from correct_speedups
- calculate_eta: derive statistics internally from correct_speedups
4. Decouple error type handling
- calculate_pi: accept error_type_counts (dict) instead of hardcoded types
- calculate_gamma: accept generic parameters (tolerance, get_pi, errno_tolerances)
- Support user-defined error codes instead of hardcoded error types
5. Code quality improvements
- Use explicit len() checks instead of implicit boolean conversion
- Use modern Python type hints (list/tuple instead of typing.List/Tuple)
- Improve code readability and maintainability
All changes have been verified and pass pre-commit checks.
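The reduced parameter coupling in item 3 might look roughly like the sketch below, where each function takes only `correct_speedups` and derives what it needs. The function names come from the commit message, but the bodies, the `s < 1.0` slowdown criterion, and the fallback values for empty inputs are assumptions rather than the contents of samples_statistics.py.

```python
import math


def calculate_beta(correct_speedups: list[float]) -> float:
    """Geometric mean of the slowdown cases, derived internally."""
    slowdown_speedups = [s for s in correct_speedups if s < 1.0]
    if len(slowdown_speedups) == 0:  # explicit len() check, as in item 5
        return 1.0  # assumed neutral value when there are no slowdowns
    log_sum = sum(math.log(s) for s in slowdown_speedups)
    return math.exp(log_sum / len(slowdown_speedups))


def calculate_eta(correct_speedups: list[float]) -> float:
    """Share of correct samples that are slowdowns."""
    if len(correct_speedups) == 0:
        return 0.0  # assumed fallback for an empty input
    slowdown_count = sum(1 for s in correct_speedups if s < 1.0)
    return slowdown_count / len(correct_speedups)


print(calculate_beta([2.0, 0.5, 0.25, 1.0]), calculate_eta([2.0, 0.5, 0.25, 1.0]))
```

Callers no longer need to precompute and pass `slowdown_speedups`, so the two inputs cannot drift out of sync.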
…regated_params.py
lixinqi
reviewed
Nov 16, 2025
- Replace error_type_counts (dict[str, int]) with errno2count (dict[int, int])
- Add get_errno_from_error_type() to map error type strings to errno (1, 2, 3)
- Add get_error_type_from_errno() for reverse mapping when error type strings are needed
- Update calculate_pi() to use errno2count and return dict[int, float]
- Update calculate_all_aggregated_parameters() to use errno2count and errno_tolerance_thresholds
- Update analysis_util.py and verify_aggregated_params.py to use errno2count
- Improve code maintainability by using integer errno for sorting and comparison
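A minimal sketch of the errno mapping described in this commit is shown below. The function names and the errno values 1, 2, 3 come from the commit message; the error type strings (`"accuracy"`, `"runtime_crash"`, `"compile_failure"`) and the function bodies are hypothetical stand-ins, since the page does not show the actual strings or implementation.

```python
# Hypothetical error type strings; only the errno values 1, 2, 3 are from the PR.
ERROR_TYPE_TO_ERRNO: dict[str, int] = {
    "accuracy": 1,
    "runtime_crash": 2,
    "compile_failure": 3,
}
ERRNO_TO_ERROR_TYPE: dict[int, str] = {v: k for k, v in ERROR_TYPE_TO_ERRNO.items()}


def get_errno_from_error_type(error_type: str) -> int:
    return ERROR_TYPE_TO_ERRNO[error_type]


def get_error_type_from_errno(errno: int) -> str:
    return ERRNO_TO_ERROR_TYPE[errno]


def calculate_pi(errno2count: dict[int, int]) -> dict[int, float]:
    """Proportion of failures per errno, ordered by errno."""
    total = sum(errno2count.values())
    if total == 0:
        return {}
    # Integer keys sort naturally, which is the motivation given in the commit.
    return {errno: count / total for errno, count in sorted(errno2count.items())}


print(calculate_pi({3: 1, 1: 2, 2: 1}))
```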
lixinqi
reviewed
Nov 17, 2025
lixinqi
reviewed
Nov 17, 2025
- Rename verify_es_match_at_tolerance to compare_aggregated_es_and_microscopic_es
- Replace tolerance_level with tolerance parameter
- Replace tolerance_threshold with atol/rtol to avoid confusion
- Rename verify_aggregated_microscopic_consistency to get_verified_aggregated_es_values
- Change return type to dict only (remove all_matched)
- Rename verified_scores to verified_es_values
- Replace micro with microscopic throughout
- Rename check_sample_correctness to get_sample_correctness
- Rename t1 variables to first_errno_tolerance
- Rename es_components to es_constructor_params
- Rename calculate_parameters_for_tolerance to calculate_es_constructor_params_for_tolerance
- Rename custom_map to errno_tolerance_overrides
- Rename errno_as_tolerances to errno2tolerance
- Add enable_aggregation_mode command line option
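The atol/rtol comparison and the dict-only return type can be sketched as below. The two function names are taken from the commit message, but the signatures, the default tolerances, the input shape, and the choice to raise `ValueError` on a mismatch are assumptions made for this illustration.

```python
import math


def compare_aggregated_es_and_microscopic_es(
    aggregated_es: float,
    microscopic_es: float,
    atol: float = 1e-9,  # assumed default
    rtol: float = 1e-6,  # assumed default
) -> bool:
    """True when the two ES values agree within absolute or relative tolerance."""
    return math.isclose(aggregated_es, microscopic_es, rel_tol=rtol, abs_tol=atol)


def get_verified_aggregated_es_values(
    es_pairs: dict[int, tuple[float, float]],
) -> dict[int, float]:
    """Map each tolerance to its aggregated ES, rejecting any mismatch.

    es_pairs maps tolerance -> (aggregated_es, microscopic_es); this input
    shape is hypothetical.
    """
    verified_es_values: dict[int, float] = {}
    for tolerance, (aggregated_es, microscopic_es) in sorted(es_pairs.items()):
        if not compare_aggregated_es_and_microscopic_es(aggregated_es, microscopic_es):
            raise ValueError(f"ES mismatch at tolerance {tolerance}")
        verified_es_values[tolerance] = aggregated_es
    return verified_es_values


print(get_verified_aggregated_es_values({0: (0.5, 0.5), 1: (0.8, 0.8 + 1e-12)}))
```

Separating `atol` and `rtol` from the metric's own tolerance level `t` avoids overloading the word "tolerance", which is the confusion the rename addresses.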
lixinqi
reviewed
Nov 18, 2025
- Modified plot_ES_results to return fig, ax, all_x_coords for external plotting
- Added manual plotting of aggregated ES(t) curves in main function
- Both microscopic and aggregated curves are plotted on the same graph
- Aggregated curves use dashed lines with square markers for distinction
- All verification checks pass with floating-point precision differences (1.39e-17)
- Move ax.legend() outside the aggregation mode condition block
- Ensure legend is always displayed regardless of aggregation mode
- Fix issue where legend was missing when aggregation mode is disabled
JewelRoam
approved these changes
Nov 18, 2025
roll-away
pushed a commit
to roll-away/GraphNet
that referenced
this pull request
Nov 19, 2025
…lidation and refactor analysis utilities (PaddlePaddle#363)
* feat: implement ES(t) macro/micro cross-validation and refactor analysis utilities
* refactor: rename macro to aggregated and improve code quality
* style: apply black formatting to samples_statistics.py and verify_aggregated_params.py
* refactor: unify error type to errno mapping for better sorting
* refactor: split tolerance report generation
* refactor: improve naming and semantics for ES calculation
* feat: add aggregated ES(t) plotting and verification
* fix: move ax.legend outside aggregation condition block