Add num_chiplets#1018
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds support for the num_chiplets field and introduces attention operation support to the rocMLIR tuning framework. The changes extend the database schema and configuration handling to track chiplet information alongside existing architecture details, and add a complete implementation for attention operation tuning including config tables, jobs, and results. Additionally, it includes various code quality improvements such as refactoring variable names to follow Python conventions, improving error handling, and updating dependencies.
Key changes include:
- Addition of
num_chipletsfield to session tracking and result export (though column definition is missing) - New attention operation support with dedicated config, job, and results tables
- Addition of tuning space parameter to control configuration space exploration
- Dependency updates and code quality improvements
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tuna/worker_interface.py | Added sys import and exit code handling for failed jobs via any_failed attribute |
| tuna/rocmlir/tuning_space.py | New enum class defining tuning space options (quick, full, exhaustive) |
| tuna/rocmlir/triggers.py | Added database trigger for attention operation timestamp tracking |
| tuna/rocmlir/rocmlir_worker.py | Added num_chiplets parsing, attention operation support, sanity checks for bad data, and improved error handling with traceback |
| tuna/rocmlir/rocmlir_tables.py | Major changes: added num_chiplets to session queries/exports (missing column definition), added complete attention config/job/results tables, refactored variable naming to snake_case, added CSV export functionality |
| tuna/rocmlir/rocmlir_lib.py | Added tuning_space argument, made mutually exclusive group required, removed -e shorthand, added attention trigger |
| tuna/rocmlir/export_configs.py | Added CSV export option alongside existing TSV export |
| tuna/rocmlir/config_type.py | Added attention to ConfigType enum and implemented str method |
| tuna/machine.py | Changed arch_full default from arch reference to empty string |
| tuna/libraries.py | Fixed typo in module docstring (liraries -> libraries) |
| requirements.txt | Updated multiple dependencies including cryptography, paramiko, sphinx, and added new packages (celery, flower, redis, asyncio) |
Comments suppressed due to low confidence (16)
tuna/rocmlir/rocmlir_tables.py:176
- These commented-out lines should either be removed or properly documented if they're intended as alternative implementation. Leaving commented code without explanation reduces code clarity.
# with DbCursor() as cur:
# # (may need list(cur....))
# outcsv.writerows(cur.execute(f"select * from {self.__tablename__};"))
tuna/rocmlir/rocmlir_tables.py:593
- Good improvement: the variable naming has been changed from camelCase to snake_case (one_config) to follow Python naming conventions consistently.
one_config = f"{dataTypeString}{outDataTypeString}\
{transAString}{transBString}{line}".strip()
if one_config not in configs:
configs.append(one_config)
# Special case to get both i8_i8 and i8_i32, w/o --data-type or output-type-map.
if "-out_datatype" not in line and datatype == 'i8':
outDataTypeString = "-out_datatype i32 "
one_config = f"{dataTypeString}{outDataTypeString}\
{transAString}{transBString}{line}".strip()
if one_config not in configs:
configs.append(one_config)
tuna/rocmlir/rocmlir_tables.py:693
- This debug print statement should be replaced with proper logging using self.logger.debug() to be consistent with the logging pattern used elsewhere in the codebase and to allow better control of debug output.
print(f"Parsing line {line}")
tuna/rocmlir/rocmlir_tables.py:639
- The column names use Python naming convention (snake_case) but end with uppercase letters (transpose_Q, transpose_K, transpose_V, transpose_O). Consider using fully lowercase names like transpose_q, transpose_k, transpose_v, transpose_o to be consistent with Python naming conventions, or capitalize as TransposeQ if representing matrix names.
transpose_Q = Column(Boolean, nullable=False, server_default="0")
transpose_K = Column(Boolean, nullable=False, server_default="0")
transpose_V = Column(Boolean, nullable=False, server_default="0")
transpose_O = Column(Boolean, nullable=False, server_default="0")
tuna/rocmlir/rocmlir_tables.py:748
- Multiple boolean parameters (withAttnScale, withAttnBias, causal, return_lse) are hardcoded to only iterate over ['false']. This significantly limits the configuration space being explored. If these are intentionally disabled, add a comment explaining why. Otherwise, these should iterate over ['false', 'true'] like the transpose parameters.
for datatype, transQ, transK, transV, transO, withAttnScale, withAttnBias, causal, return_lse, line in \
itertools.product(['f32', 'f16'], ['false', 'true'],
['false', 'true'], ['false', 'true'],
['false', 'true'], ['false'], ['false'], ['false'], ['false'], lines):
tuna/rocmlir/rocmlir_tables.py:76
- The unique constraint and query filters reference 'num_chiplets', but this column is not defined in the SessionRocMLIR class. The column should be added as a Column definition (similar to num_cu) before it can be used in the unique constraint and filters. Add a line like: num_chiplets = Column(Integer, nullable=False, server_default="1") to the SessionRocMLIR class.
__table_args__ = (UniqueConstraint("arch",
"num_cu",
"num_chiplets",
"rocm_v",
"mlir_v",
"reason",
"config_type",
"tuning_space",
tuna/rocmlir/rocmlir_worker.py:153
- The comment "Empty command result, retrying" suggests this is a retry scenario, but TryAgain is being raised. Consider logging more context about why this is empty and whether this is expected behavior, especially since it could indicate an issue with the underlying command execution.
self.logger.info("Empty command result, retrying.")
raise TryAgain
tuna/rocmlir/rocmlir_worker.py:163
- Good improvement: using traceback.format_exc() provides much more detailed error information than just the exception object, which will help with debugging. However, be aware this includes full stack traces which could be verbose in logs.
self.logger.error('Exception occurred while running job %s: %s',
self.job.id, traceback.format_exc())
tuna/rocmlir/rocmlir_tables.py:331
- Good improvement: moving DATA_TYPES and other configuration lists from class-level variables to function-local variables improves encapsulation and makes the code more maintainable. These values are only used within the get_configurations method.
directions = ['-F 1', '-F 2', '-F 4']
data_types = ['conv', 'convfp16', 'convint8']
layouts = ['NHWC', 'NCHW']
tuna/rocmlir/rocmlir_tables.py:121
- These commented-out lines should either be removed or properly documented if they're intended as alternative implementation. Leaving commented code without explanation reduces code clarity.
with open(filename, 'w', encoding='utf8') as f:
outcsv = csv.writer(f)
# with DbCursor() as cur:
# # (may need list(cur....))
# outcsv.writerows(cur.execute(f"select * from {self.__tablename__};"))
tuna/rocmlir/rocmlir_tables.py:363
- Good improvement: the variable naming has been changed from camelCase to snake_case (one_config) to follow Python naming conventions consistently.
one_config = ""
# For datatype, check for the presence of a positional arg.
if line[0][0] == "-":
one_config = f"{datatype} "
if "-F" not in line:
one_config += f"{direction} " # -F included in direction.
one_config += make_option_if_not_in_line("-f", layout, line)
one_config += make_option_if_not_in_line("-I", layout, line)
one_config += make_option_if_not_in_line("-O", layout, line)
one_config += line
one_config = one_config.strip()
tuna/rocmlir/rocmlir_tables.py:119
- This comment appears to contain commented-out code.
with open(filename, 'w', encoding='utf8') as f:
outcsv = csv.writer(f)
# with DbCursor() as cur:
tuna/rocmlir/rocmlir_tables.py:174
- This comment appears to contain commented-out code.
with open(filename, 'w', encoding='utf8') as f:
outcsv = csv.writer(f)
# with DbCursor() as cur:
tuna/rocmlir/rocmlir_tables.py:679
- This comment appears to contain commented-out code.
def config_string(self):
"""Return config as a flag/value string suitable for tuningRunner.py."""
string = ""
# for field, value in self.to_dict().items():
tuna/rocmlir/rocmlir_tables.py:44
- Import of 'DbCursor' is not used.
from tuna.sql import DbCursor
tuna/rocmlir/rocmlir_tables.py:774
- This class does not call ResultsMixin.init during initialization. (The class lacks an init method to ensure every base class init is called.)
return configs
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| """Write the contents of the table as a .tsv file for perfRunner.py.""" | ||
| arch = dbt.session.arch_full | ||
| num_cu = dbt.session.num_cu | ||
| num_chiplets = dbt.session.num_chiplets |
There was a problem hiding this comment.
The code references dbt.session.num_chiplets, but this attribute is not defined in the SessionRocMLIR class. This will cause an AttributeError at runtime.
| num_chiplets = dbt.session.num_chiplets | |
| num_chiplets = getattr(dbt.session, "num_chiplets", 0) |
| arch = dbt.session.arch_full | ||
| num_cu = dbt.session.num_cu | ||
| num_chiplets = dbt.session.num_chiplets | ||
| session_id = dbt.session_id |
There was a problem hiding this comment.
The TSV header comment should be updated to include 'num_chiplets' to match the data format being written. The header should be something like: "# arch\tnumCUs\tnumChiplets\ttestVector\tperfConfig (tuna)"
| obj = self.dbt.results() | ||
|
|
||
| arch, num_cu, config, perf_config, tflops = obj.parse(result_str) | ||
| arch, num_cu, num_chiplets, config, perf_config, tflops = obj.parse(result_str) |
There was a problem hiding this comment.
The parse method is expected to return 6 values (arch, num_cu, num_chiplets, config, perf_config, tflops), but the ResultsMixin.parse method only splits the line by tabs. The calling code needs to match what the external tool (tuningRunner.py) actually outputs, or the unpacking will fail with a ValueError.
| arch, num_cu, num_chiplets, config, perf_config, tflops = obj.parse(result_str) | |
| try: | |
| parsed = obj.parse(result_str) | |
| if len(parsed) != 6: | |
| self.logger.warning('Malformed result string for job_id=%s: expected 6 fields, got %d. Skipping. String: %r', self.job.id, len(parsed), result_str) | |
| return True # To avoid an update retry. | |
| arch, num_cu, num_chiplets, config, perf_config, tflops = parsed | |
| except Exception as e: | |
| self.logger.warning('Failed to parse result string for job_id=%s: %s. Skipping. String: %r', self.job.id, str(e), result_str) | |
| return True # To avoid an update retry. |
| .filter(sess_obj.arch == entry.arch)\ | ||
| .filter(sess_obj.num_cu == entry.num_cu)\ | ||
| .filter(sess_obj.num_chiplets == entry.num_chiplets)\ |
There was a problem hiding this comment.
The query attempts to filter by num_chiplets, but this attribute is not defined in the SessionRocMLIR class. This will cause an AttributeError at runtime.
Motivation
See ROCm/rocMLIR#2168
Technical Details
See ROCm/rocMLIR#2168
Test Plan
Test Result
Submission Checklist