Skip to content

Add num_chiplets#1018

Merged
dhernandez0 merged 3 commits intopf-tuna-rocmlir-3from
num_chiplets
Jan 12, 2026
Merged

Add num_chiplets#1018
dhernandez0 merged 3 commits intopf-tuna-rocmlir-3from
num_chiplets

Conversation

@dhernandez0
Copy link
Copy Markdown

@dhernandez0 dhernandez0 commented Dec 12, 2025

Motivation

See ROCm/rocMLIR#2168

Technical Details

See ROCm/rocMLIR#2168

Test Plan

Test Result

Submission Checklist

@dhernandez0 dhernandez0 changed the base branch from develop to pf-tuna-rocmlir-3 December 12, 2025 10:23
@dhernandez0 dhernandez0 self-assigned this Dec 12, 2025
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds support for the num_chiplets field and introduces attention operation support to the rocMLIR tuning framework. The changes extend the database schema and configuration handling to track chiplet information alongside existing architecture details, and add a complete implementation for attention operation tuning including config tables, jobs, and results. Additionally, it includes various code quality improvements such as refactoring variable names to follow Python conventions, improving error handling, and updating dependencies.

Key changes include:

  • Addition of num_chiplets field to session tracking and result export (though column definition is missing)
  • New attention operation support with dedicated config, job, and results tables
  • Addition of tuning space parameter to control configuration space exploration
  • Dependency updates and code quality improvements

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tuna/worker_interface.py Added sys import and exit code handling for failed jobs via any_failed attribute
tuna/rocmlir/tuning_space.py New enum class defining tuning space options (quick, full, exhaustive)
tuna/rocmlir/triggers.py Added database trigger for attention operation timestamp tracking
tuna/rocmlir/rocmlir_worker.py Added num_chiplets parsing, attention operation support, sanity checks for bad data, and improved error handling with traceback
tuna/rocmlir/rocmlir_tables.py Major changes: added num_chiplets to session queries/exports (missing column definition), added complete attention config/job/results tables, refactored variable naming to snake_case, added CSV export functionality
tuna/rocmlir/rocmlir_lib.py Added tuning_space argument, made mutually exclusive group required, removed -e shorthand, added attention trigger
tuna/rocmlir/export_configs.py Added CSV export option alongside existing TSV export
tuna/rocmlir/config_type.py Added attention to ConfigType enum and implemented str method
tuna/machine.py Changed arch_full default from arch reference to empty string
tuna/libraries.py Fixed typo in module docstring (liraries -> libraries)
requirements.txt Updated multiple dependencies including cryptography, paramiko, sphinx, and added new packages (celery, flower, redis, asyncio)
Comments suppressed due to low confidence (16)

tuna/rocmlir/rocmlir_tables.py:176

  • These commented-out lines should either be removed or properly documented if they're intended as alternative implementation. Leaving commented code without explanation reduces code clarity.
#       with DbCursor() as cur:
#         # (may need list(cur....))
#         outcsv.writerows(cur.execute(f"select * from {self.__tablename__};"))

tuna/rocmlir/rocmlir_tables.py:593

  • Good improvement: the variable naming has been changed from camelCase to snake_case (one_config) to follow Python naming conventions consistently.
        one_config = f"{dataTypeString}{outDataTypeString}\
                       {transAString}{transBString}{line}".strip()
        if one_config not in configs:
          configs.append(one_config)

        # Special case to get both i8_i8 and i8_i32, w/o --data-type or output-type-map.
        if "-out_datatype" not in line and datatype == 'i8':
          outDataTypeString = "-out_datatype i32 "
          one_config = f"{dataTypeString}{outDataTypeString}\
                         {transAString}{transBString}{line}".strip()
          if one_config not in configs:
            configs.append(one_config)

tuna/rocmlir/rocmlir_tables.py:693

  • This debug print statement should be replaced with proper logging using self.logger.debug() to be consistent with the logging pattern used elsewhere in the codebase and to allow better control of debug output.
    print(f"Parsing line {line}")

tuna/rocmlir/rocmlir_tables.py:639

  • The column names use Python naming convention (snake_case) but end with uppercase letters (transpose_Q, transpose_K, transpose_V, transpose_O). Consider using fully lowercase names like transpose_q, transpose_k, transpose_v, transpose_o to be consistent with Python naming conventions, or capitalize as TransposeQ if representing matrix names.
  transpose_Q = Column(Boolean, nullable=False, server_default="0")
  transpose_K = Column(Boolean, nullable=False, server_default="0")
  transpose_V = Column(Boolean, nullable=False, server_default="0")
  transpose_O = Column(Boolean, nullable=False, server_default="0")

tuna/rocmlir/rocmlir_tables.py:748

  • Multiple boolean parameters (withAttnScale, withAttnBias, causal, return_lse) are hardcoded to only iterate over ['false']. This significantly limits the configuration space being explored. If these are intentionally disabled, add a comment explaining why. Otherwise, these should iterate over ['false', 'true'] like the transpose parameters.
      for datatype, transQ, transK, transV, transO, withAttnScale, withAttnBias, causal, return_lse, line in \
              itertools.product(['f32', 'f16'], ['false', 'true'],
                                ['false', 'true'], ['false', 'true'],
                                ['false', 'true'], ['false'], ['false'], ['false'], ['false'], lines):

tuna/rocmlir/rocmlir_tables.py:76

  • The unique constraint and query filters reference 'num_chiplets', but this column is not defined in the SessionRocMLIR class. The column should be added as a Column definition (similar to num_cu) before it can be used in the unique constraint and filters. Add a line like: num_chiplets = Column(Integer, nullable=False, server_default="1") to the SessionRocMLIR class.
  __table_args__ = (UniqueConstraint("arch",
                                     "num_cu",
                                     "num_chiplets",
                                     "rocm_v",
                                     "mlir_v",
                                     "reason",
                                     "config_type",
                                     "tuning_space",

tuna/rocmlir/rocmlir_worker.py:153

  • The comment "Empty command result, retrying" suggests this is a retry scenario, but TryAgain is being raised. Consider logging more context about why this is empty and whether this is expected behavior, especially since it could indicate an issue with the underlying command execution.
              self.logger.info("Empty command result, retrying.")
              raise TryAgain

tuna/rocmlir/rocmlir_worker.py:163

  • Good improvement: using traceback.format_exc() provides much more detailed error information than just the exception object, which will help with debugging. However, be aware this includes full stack traces which could be verbose in logs.
      self.logger.error('Exception occurred while running job %s:  %s',
                        self.job.id, traceback.format_exc())

tuna/rocmlir/rocmlir_tables.py:331

  • Good improvement: moving DATA_TYPES and other configuration lists from class-level variables to function-local variables improves encapsulation and makes the code more maintainable. These values are only used within the get_configurations method.
    directions = ['-F 1', '-F 2', '-F 4']
    data_types = ['conv', 'convfp16', 'convint8']
    layouts = ['NHWC', 'NCHW']

tuna/rocmlir/rocmlir_tables.py:121

  • These commented-out lines should either be removed or properly documented if they're intended as alternative implementation. Leaving commented code without explanation reduces code clarity.
    with open(filename, 'w', encoding='utf8') as f:
      outcsv = csv.writer(f)
#       with DbCursor() as cur:
#         # (may need list(cur....))
#         outcsv.writerows(cur.execute(f"select * from {self.__tablename__};"))

tuna/rocmlir/rocmlir_tables.py:363

  • Good improvement: the variable naming has been changed from camelCase to snake_case (one_config) to follow Python naming conventions consistently.
        one_config = ""
        # For datatype, check for the presence of a positional arg.
        if line[0][0] == "-":
          one_config = f"{datatype} "

        if "-F" not in line:
          one_config += f"{direction} "  # -F included in direction.
        one_config += make_option_if_not_in_line("-f", layout, line)
        one_config += make_option_if_not_in_line("-I", layout, line)
        one_config += make_option_if_not_in_line("-O", layout, line)
        one_config += line
        one_config = one_config.strip()

tuna/rocmlir/rocmlir_tables.py:119

  • This comment appears to contain commented-out code.
    with open(filename, 'w', encoding='utf8') as f:
      outcsv = csv.writer(f)
#       with DbCursor() as cur:

tuna/rocmlir/rocmlir_tables.py:174

  • This comment appears to contain commented-out code.
    with open(filename, 'w', encoding='utf8') as f:
      outcsv = csv.writer(f)
#       with DbCursor() as cur:

tuna/rocmlir/rocmlir_tables.py:679

  • This comment appears to contain commented-out code.
  def config_string(self):
    """Return config as a flag/value string suitable for tuningRunner.py."""
    string = ""
    #     for field, value in self.to_dict().items():

tuna/rocmlir/rocmlir_tables.py:44

  • Import of 'DbCursor' is not used.
from tuna.sql import DbCursor

tuna/rocmlir/rocmlir_tables.py:774

  • This class does not call ResultsMixin.init during initialization. (The class lacks an init method to ensure every base class init is called.)
    return configs

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

"""Write the contents of the table as a .tsv file for perfRunner.py."""
arch = dbt.session.arch_full
num_cu = dbt.session.num_cu
num_chiplets = dbt.session.num_chiplets
Copy link

Copilot AI Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code references dbt.session.num_chiplets, but this attribute is not defined in the SessionRocMLIR class. This will cause an AttributeError at runtime.

Suggested change
num_chiplets = dbt.session.num_chiplets
num_chiplets = getattr(dbt.session, "num_chiplets", 0)

Copilot uses AI. Check for mistakes.
arch = dbt.session.arch_full
num_cu = dbt.session.num_cu
num_chiplets = dbt.session.num_chiplets
session_id = dbt.session_id
Copy link

Copilot AI Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TSV header comment should be updated to include 'num_chiplets' to match the data format being written. The header should be something like: "# arch\tnumCUs\tnumChiplets\ttestVector\tperfConfig (tuna)"

Copilot uses AI. Check for mistakes.
obj = self.dbt.results()

arch, num_cu, config, perf_config, tflops = obj.parse(result_str)
arch, num_cu, num_chiplets, config, perf_config, tflops = obj.parse(result_str)
Copy link

Copilot AI Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parse method is expected to return 6 values (arch, num_cu, num_chiplets, config, perf_config, tflops), but the ResultsMixin.parse method only splits the line by tabs. The calling code needs to match what the external tool (tuningRunner.py) actually outputs, or the unpacking will fail with a ValueError.

Suggested change
arch, num_cu, num_chiplets, config, perf_config, tflops = obj.parse(result_str)
try:
parsed = obj.parse(result_str)
if len(parsed) != 6:
self.logger.warning('Malformed result string for job_id=%s: expected 6 fields, got %d. Skipping. String: %r', self.job.id, len(parsed), result_str)
return True # To avoid an update retry.
arch, num_cu, num_chiplets, config, perf_config, tflops = parsed
except Exception as e:
self.logger.warning('Failed to parse result string for job_id=%s: %s. Skipping. String: %r', self.job.id, str(e), result_str)
return True # To avoid an update retry.

Copilot uses AI. Check for mistakes.
Comment on lines 81 to +84
.filter(sess_obj.arch == entry.arch)\
.filter(sess_obj.num_cu == entry.num_cu)\
.filter(sess_obj.num_chiplets == entry.num_chiplets)\
Copy link

Copilot AI Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query attempts to filter by num_chiplets, but this attribute is not defined in the SessionRocMLIR class. This will cause an AttributeError at runtime.

Copilot uses AI. Check for mistakes.
@dhernandez0 dhernandez0 merged commit b2aafb1 into pf-tuna-rocmlir-3 Jan 12, 2026
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants