Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix sys.cpu.utilization metrics for titus isolate v2 #127

Merged

Conversation

copperlight
Copy link
Collaborator

@copperlight copperlight commented May 23, 2024

The new 7th generation hardware sets a cfs_quota of max, which is then read as a 0 in the number vector. This was breaking utilization calculations.

  • Refactor code to remove all remaining traces of cgroup selection, since all Titus hosts are now on cgroupv2.
  • Remove extra ifdef macro for TITUS_SYSTEM_SERVICE that had no effect.
  • The TITUS_NUM_CPU environment variable is expected to always be present, and should always be used to determine the number of CPUs allocated.
  • The number of CPUs can be used to determine the value of cfs_quota, which can then be used to calculate available CPU time.
  • Refactor metrics collection for cpu time and cpu utilization, to reduce the number of duplicate calls and simplify the processing.
  • Shuffle the order of nvml library initialization, so that we can use the standard logger to report status.
  • Refactor the setup-venv.sh script, to make it easier to bootstrap new dev environments on bionic hosts.

@copperlight copperlight force-pushed the fix-sys.cpu.utilization branch 2 times, most recently from 1c58da9 to ec0bcb1 Compare May 23, 2024 14:49
The new 7th generation hardware sets a `cfs_quota` of `max`, which is then read
as a `0` in the number vector. This was breaking utilization calculations.

* Refactor code to remove all remaining traces of cgroup selection, since all
Titus hosts are now on cgroupv2.
* Remove extra `ifdef` macro for `TITUS_SYSTEM_SERVICE` that had no effect.
* The `TITUS_NUM_CPU` environment variable is expected to always be present, and
should always be used to determine the number of CPUs allocated.
* The number of CPUs can be used to determine the value of `cfs_quota`, which
can then be used to calculate available CPU time.
* Refactor metrics collection for cpu time and cpu utilization, to reduce the
number of duplicate calls and simplify the processing.
* Shuffle the order of nvml library initialization, so that we can use the
standard logger to report status.
* Refactor the `setup-venv.sh` script, to make it easier to bootstrap new dev
environments on bionic hosts.
@copperlight copperlight merged commit ae234dc into Netflix-Skunkworks:main May 23, 2024
2 checks passed
@copperlight copperlight deleted the fix-sys.cpu.utilization branch May 23, 2024 15:21

auto cpu_max = read_num_vector_from_file(path_prefix_, "cpu.max");
auto cfs_period = cpu_max[1];
auto cfs_quota = cfs_period * num_cpu;
auto avail_cpu_time = (delta_t / cfs_period) * cfs_quota;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Computing the avail_cpu_time and num_cpu seems to be duplicated for the normal stats and peak. It might make sense to have functions for those to ensure they stay consistent in the future.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in #128.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants