Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Option to Turn Off _fill_perfdata #168

Merged
merged 5 commits into from
Jun 20, 2024

Conversation

michaelmckinsey1
Copy link
Collaborator

@michaelmckinsey1 michaelmckinsey1 commented Jun 10, 2024

_fill_perfdata adds empty rows for profiles that do not contain data for a node, such that each node has a row entry for every profile. This can be obnoxious in some cases, as in the performance algorithms dataset, 2,531,150 rows in the performance data table with _fill_perfdata and 163,778 rows with _fill_perfdata off.

This is a small optimization for the reader, as _fill_perdata is only called once, most time is spent in _unify. However, when working with a large amount of files with missing rows, a Thicket with _fill_perfdata off will be easier and faster to work with.

@michaelmckinsey1 michaelmckinsey1 added area-thicket Issues and PRs involving Thicket's core Thicket datastructure and associated classes priority-normal Normal priority issues and PRs status-ready-for-review This PR is ready to be reviewed by assigned reviewers type-feature Requests for new features or PRs which implement new features labels Jun 10, 2024
@michaelmckinsey1 michaelmckinsey1 self-assigned this Jun 10, 2024
@michaelmckinsey1
Copy link
Collaborator Author

Need to add unit test.

Copy link
Collaborator

@ilumsden ilumsden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few comments/changes, but this mostly looks good.

The one thing I'd mention is that this will have a notable impact in performance in some cases. As you said in the initial PR comment, this functionality can substantially increase the number of rows in the performance dataframe. Given how slow pandas is and given that pandas uses a minimum of 2x the memory that a single DataFrame should take up, adding a substantial number of nodes will slow down all of Thicket.

Given that, my one remaining comment (which we'll need to discuss) is that it might be better to keep this off by default. Right now in this PR, users have to opt out of this feature, which could destroy Thicket's performance if used in the wrong situation. I'd argue that it's better for users to opt into this feature in situations where they know the performance won't be destroyed. Addressed in the discussion below. This is the default behavior of Thicket right now, so this PR shouldn't change that. However, IMO, this will cause issues in the future.

thicket/ensemble.py Show resolved Hide resolved
thicket/thicket.py Outdated Show resolved Hide resolved
unique_perf_indices = set(new_thicket.dataframe.index.droplevel("node"))
full_idx = all([idx in unique_perf_indices for idx in new_thicket.profile])
if not full_idx:
new_thicket = new_thicket.squash()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what this is trying to do. From what I can tell, this is trying to see if the profile list is a subset of the unique profile IDs in the performance dataframe index. If that's true, then the dataframe can store profile IDs that aren't in the Thicket.profile field.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is detecting if fill_perfdata is off. To make it more clear I thought of a couple options:

  1. I could make it a helper function with a docstring to make it more clear.
  2. Have a boolean thicket attribute for fill_perfdata that can be checked here.

The purpose of this is that there is a case when the index is not full, and a filter is performed on the metadata that nodes disappear from the performance data, if the metadata for those profiles was filtered out. The code above will update the graph using squash if this happens.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if I'm understanding right, what you're trying to do is 1) check if any nodes were removed from the performance data as a side effect of filtering profiles from the metadata and 2) fix the graph with squash if any nodes were removed?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If what I wrote above is correct, then this code will not do what you're wanting.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code (assuming the set(Series) thing works) only checks if Thicket.profiles is a subset of the profile IDs in the performance dataframe index. That tells you nothing about the nodes.

Copy link
Collaborator Author

@michaelmckinsey1 michaelmckinsey1 Jun 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It tells you if there are profiles in the performance dataframe that do not exist in tk.profile. For example if I simulate removing a profile from the dataframe, it will be false which will trigger a squash:

image

The actual issue I noticed with this is that they will both always be in sync due to _sync_profile_components. So what I should check here instead is if len(tk.graph) != len(tk.dataframe.index.get_level_values("node").unique()) which is more explicit anyway. These sort of things will get caught when I add unit tests.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code is not checking if there are profiles in the dataframe that don't exist in tk.profile. It's checking if there are profiles in tk.profile that don't exist in the dataframe. It's a subtle difference, but it would allow e.g., the dataframe to contain extra profiles that don't exist in tk.profiles.

I actually just encountered this same type of thing in Befikir's query_stats PR (#157), and I realized there's a better way to do this check:

sorted(tk.profile) == sorted(unique_perf_indices.tolist())

This will check that both the dataframe and tk.profile agree in terms of contents and length. Node objects are comparable via their _hatchet_nid fields. As a result, the Node objects within different parts of the same Thicket object can be compared in this way. This comparison will not work across Thicket objects though because there's no guarantee that equivalent Node objects will have the same _hatchet_nid values.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's checking if there are profiles in tk.profile that don't exist in the dataframe.

But that is what I was trying to accomplish with that code. That would indicate that there are missing profiles in the performance data, which is what I was saying. However, it doesn't address

it would allow e.g., the dataframe to contain extra profiles that don't exist in tk.profiles.

which your correct would be an issue, if not for _sync_profile_components being called earlier. Regardless, what I was actually trying to do is check nodes so I'm using a different check now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, if we are trying to update the state of the profiles in a Thicket we should be using _sync_profile_components. And if we are trying to validate properties about the state of a Thicket we should use validate_profile or validate_dataframe in utils.py

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're 100% positive that you've ensured the graph and dataframe contain the same node objects, then this is fine.

Copy link
Collaborator

@ilumsden ilumsden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidentally forget to set "Request changes" on my main review. Refer to the main review for comments.

@michaelmckinsey1
Copy link
Collaborator Author

I have a few comments/changes, but this mostly looks good.

The one thing I'd mention is that this will have a notable impact in performance in some cases. As you said in the initial PR comment, this functionality can substantially increase the number of rows in the performance dataframe. Given how slow pandas is and given that pandas uses a minimum of 2x the memory that a single DataFrame should take up, adding a substantial number of nodes will slow down all of Thicket.

Given that, my one remaining comment (which we'll need to discuss) is that it might be better to keep this off by default. Right now in this PR, users have to opt out of this feature, which could destroy Thicket's performance if used in the wrong situation. I'd argue that it's better for users to opt into this feature in situations where they know the performance won't be destroyed.

It's on by default because the default behavior in Thicket has been to fill the index in the performance table. Actually, the default behavior originally was to not fill the index, then there was a design decision made at some point to fill it all the time. My goal with this PR is to at least give the user the option, based on their data.

@ilumsden
Copy link
Collaborator

Wait, really? In that case, what you have in this PR is fine, but I will second guess that decision. I suspect we haven't run into any major performance issues because we've just used effectively N copies of the same program. But, once we run into less homogeneous data (e.g., MPMD-style data like what my benchmark will create), we will likely start running into major performance issues with this decision.

Copy link
Collaborator

@ilumsden ilumsden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, so long as we are positive that the Node objects are the same across thicket components.

@michaelmckinsey1
Copy link
Collaborator Author

michaelmckinsey1 commented Jun 18, 2024

Looks good, so long as we are positive that the Node objects are the same across thicket components.

They should be unless there is a bug. utils.validate_nodes is a function that checks this particular property. If we wanted to be 100% sure, I think we would have to check properties we want to enforce about a Thicket before and after operations like filter_metadata, which could be expensive. I am open to this idea though, as it would help catch bugs.

Edit: I just realized we had a basically identical discussion in #173 (comment) so I am referencing it here. I agree that this is a broader issue than both of these PRs.

@pearce8 pearce8 merged commit 73de0ce into LLNL:develop Jun 20, 2024
4 checks passed
@slabasan slabasan added this to the 2024.2.0 milestone Sep 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-thicket Issues and PRs involving Thicket's core Thicket datastructure and associated classes priority-normal Normal priority issues and PRs status-ready-for-review This PR is ready to be reviewed by assigned reviewers type-feature Requests for new features or PRs which implement new features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants