Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup dataframes #1360

Merged

Conversation

nkanazawa1989
Copy link
Collaborator

@nkanazawa1989 nkanazawa1989 commented Jan 18, 2024

Summary

This PR updates the implementation of ScatterTable and AnalysisResultTable based on the comment from @itoko .

Details and comments

Current pattern heavily uses inheritance; Table(DataFrame, MixIn), but this causes several problems. Qiskit Experiments class directly depends on the third party library, resulting in Sphinx directive mismatch and poor robustness of the API. Instead of using inheritance, these classes are refactored with composition and delegation, namely

class Table:
    def __init__(self):
        self._data = DataFrame(...)

this pattern is also common in other software libraries using dataframe. Since this PR removes unreleased public classes, this should be merged before the release.

Although this updates many files, these are just delegation of data handling logic to the class itself, which simplifies the implantation of classes that operate the container objects. Also new pattern allows more strict dtype management with dataframe.

@nkanazawa1989 nkanazawa1989 added this to the Release 0.6 milestone Jan 18, 2024
@wshanks
Copy link
Collaborator

wshanks commented Jan 22, 2024

This is good timing because the ScatterTable inheritance model just broke with the release of pandas 2.2. See for example the test failures on #1363 here.

@nkanazawa1989 nkanazawa1989 marked this pull request as ready for review January 31, 2024 18:22
@@ -181,6 +181,7 @@ def __eq__(self, other):
self.neg_coef_o1 == other.neg_coef_o1,
self.neg_coef_o2 == other.neg_coef_o2,
self.neg_coef_o3 == other.neg_coef_o3,
self.offset == other.offset,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a followup of #1236 . PR can be separated.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the nice things about dataclasses 🙂

coruscating added a commit to coruscating/qiskit-experiments that referenced this pull request Feb 1, 2024
Copy link
Collaborator

@coruscating coruscating left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto_save is not working right now. I get errors like {"errors":["experiment_uuid is required","type must be a string","chisq is not a decimal"]} after this:

expdata=exp.run(backend)
expdata.auto_save=True

Then if I run save() again on its own, everything seems to save correctly.

return instance

@property
def dataframe(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add __repr__() so that the dataframe is shown when users try to print the table directly? I think that's a bit more intuitive than accessing table.dataframe.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern for returning the html representation is that a user might believe the object is a bare pandas dataframe and blame missing methods that pandas provides. Of course html representation is very helpful, and I'm not against adding it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, maybe there can be a header before the dataframe content to distinguish the table from a dataframe.

@nkanazawa1989
Copy link
Collaborator Author

nkanazawa1989 commented Feb 2, 2024

Thanks @coruscating for catching the bug.

The experiment ID is not recorded in the analysis result table (because the experiment name and ID are sort of duplication and only human-friendly name is stored) and the service AnalysisResult object created from the table requires the ID to look for where to save the data. Likely this was introduced in the previous PR. Fixed in 901bb28

This was not correct. I just forgotten to specify columns="all" when I filter the analysis result table. By default it returns a subset of columns that doesn't contain the experiment_id. Correct fix is 8dc6c4f (force-pushed)

Copy link
Contributor

@itoko itoko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nkanazawa1989 , thank you for your hard work on this cleanup. I've mainly reviewed public interface of ScatterTable class newly introduced during this series of refactoring. Overall, looks good to me in that point.

Just one major comment from me. To tell the following points clearly to users, I think we need more introductory explanation is necessary at the beginning of ScatterTable class (and/or in docs/tutorials/curve_analysis.rst). For example, we may need to explain the important tuple (kind, category, analysis) as a unique key; We may need to mention to what values are allowed for category and what each of them represents; so on.

I understand ScatterTable object now to be a single source of the truth as the data used for all plots. I think that's good for users who rework with some complex Analysis classes. Instead, it may contain multiple data sets. That requires users to fully understand the three concepts, kind, category and analysis, because this triplet must be a unique key to retrieve one sequence of data points -- (x, y, y_err) arrays -- from multiple data sets. I think that's why all of the versatile filter method and get_? methods have those in their options. I personally like get_? methods since most of experiments would be simple and ScatterTable object is likely to have a single key (kind, category and analysis) for all rows. They would prevent from calling filter functions everywhere.

I'll left minor comments inline, each of which should not prevent merging this PR.

qiskit_experiments/curve_analysis/scatter_table.py Outdated Show resolved Hide resolved
qiskit_experiments/curve_analysis/scatter_table.py Outdated Show resolved Hide resolved
@nkanazawa1989
Copy link
Collaborator Author

Thanks @itoko , I added more documentation in 346d23a. I added the documentation directly to the class doc of the ScatterTable for visibility and added the cross-ref link to the tutorial.

most of experiments would be simple and ScatterTable object is likely to have a single key (kind, category and analysis) for all rows

This is not true because even the model and analysis are trivial most of analysis subclasses produce two categories of raw and formatted, e.g. in RB, raw corresponds to P1 of each randomization seed and formatted is P1 of the sample average of them. I hesitate to add dedicated methods like raw_xvals and formatted_xvals for shortcut because naming rule depends on analysis; e.g. StarkRamseyXYAmpScanAnalysis has four columns "raw", "ramsey_xy", "freq", "fitted".

Copy link
Collaborator

@coruscating coruscating left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nkanazawa1989, I like the lazy add rows approach and the overall the code looks fine to me. Since it's difficult to test the ScatterTable output from experiments in this PR, perhaps we can merge this first and handle any potential followups from adding unit tests in #1342 in another PR, if needed.

qiskit_experiments/curve_analysis/scatter_table.py Outdated Show resolved Hide resolved
docs/tutorials/curve_analysis.rst Outdated Show resolved Hide resolved
qiskit_experiments/curve_analysis/scatter_table.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@wshanks wshanks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read through all the changes and left comments. There are many but I think they are easy to address and most are just related to documentation.

The thing I am most uncertain about with ScatterTable is how much it is a user facing class versus an internal class. For a user just running experiments, I think it is not exposed, but for a user writing CurveAnalysis subclasses it is? So it only makes sense to document it in the context of curve analysis? There is a little documentation now but not too much (the table is shown in the the curve analysis workflow section and the columns are described in the ScatterTable doc string). One concern I have to make the roles of name, data_uid, category, and model (not a column but a term used in the same context) clear.

docs/tutorials/curve_analysis.rst Outdated Show resolved Hide resolved
docs/tutorials/curve_analysis.rst Outdated Show resolved Hide resolved
docs/tutorials/curve_analysis.rst Outdated Show resolved Hide resolved
docs/tutorials/curve_analysis.rst Outdated Show resolved Hide resolved
qiskit_experiments/curve_analysis/scatter_table.py Outdated Show resolved Hide resolved
if key in self._data.index:
return [key]
# This key is name of entry
loc = self._data["name"] == key
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the key can match an ID or a name. That feels a little overloaded though they shouldn't collide.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah unfortunately this is existing behavior :(

yerr_arr_fit = unp.std_devs(uval_arr_fit)
else:
yerr_arr_fit = np.zeros_like(xval_arr_fit)
for xval, yval, yerr in zip(xval_arr_fit, yval_arr_fit, yerr_arr_fit):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still surprises me that it is better to iterate over numpy arrays point by point and add them to them to lists to add to a new dataframe rather than just adding the numpy arrays to a new dataframe.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handling of the empty column is expensive because it requires careful handling of missing values. Without doing this shots column may be accidentally typecasted to float because numpy doesn't support nullable integer. This means we first need to create a 2D object-dtype ndarray and populate values, then convert it into dataframe. Since current _lazy_add_rows buffer assumes row-wise data list, arrays needs to be converted into this form internally.

qiskit_experiments/framework/experiment_data.py Outdated Show resolved Hide resolved
def x(self) -> np.ndarray:
"""X values."""
return self.xval.to_numpy()
# For backward compatibility with CurveData.x
return self._data.xval.to_numpy(dtype=float, na_value=np.nan)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this pretty efficient? We might look at if it is worth caching these properties where it would be tempting to use them multiple times in the same function.

Copy link
Collaborator Author

@nkanazawa1989 nkanazawa1989 Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC using the lru_cache for the class method may cause memory leak and I guess I need to implement custom logic (i.e. cache that clears the data when _lazy_add_rows is changed).

According to the performance analysis of T1Analysis, the most time-consuming logic after 03aac67 is still iter_groups, and the most expensive part is the data extraction from the dataframe. This means implementation of the cache doesn't drastically improve the performance.
image

Probably we may want to replace pandas with polars in the next release.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was going to comment on iter_groups that we should look at using polars there. Even just casting to polars for that one method and back to pandas might help (I mean, not refactoring the rest of the code to use polars).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to stop over-engineering. We currently allow users to run experiment on existing experiment data to get more statistically confident result. But I don't think this is heavy used feature and we can remove. In principle we run very expensive operation for the sake of something useless; if we have 100 xvals, we need to iterate over them and create new table with size=1 to average... Even if we switch to polars there still be overhead of object creation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that that feature has never seemed that useful to me. I think we could implement some support for post-processing instead, like a combine_results function or something that could take multiple experiment results, warn if the options are not all the same, and then run the analysis with the combined job data.

We didn't have this ability to extend the data of a completed experient in the code we used before qiskit-experiments and I think it was a feature request of something to be included when starting over from scratch. I like the experiment being less mutable and extending the experiment being like "copy-on-write" where a new experiment ID is generated for the combination of multiple experiments if you need that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think polars could possibly help here though because I expect its groupby to be more efficient than iterating through values in Python. With the pyarrow backend (I think not yet the default in pandas), it should be possible for pandas and polars to work on the same in-memory representation of the data so it should be efficient to go back and forth.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your suggestion. IIRC when we started this project we had QV experiment in our mind. Namely we need to repeat until standard error gets converged within 1 sigma. The rest of experiments don't require such mechanism. Combine results is something useful :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this feature was already removed by #463. This means we can now just drop this formatter and raise a warning if we encounter duplicated xvals -- bur likely for next release.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I opened #1394 to try to capture what we said here but maybe its description is out of date.

@wshanks
Copy link
Collaborator

wshanks commented Feb 5, 2024

Also, I read all the code changes but one thing that is hard with a PR like this is to review all the code that didn't change. I tried checking the documentation and release notes and I think they still make sense with the changes that were made.

nkanazawa1989 and others added 2 commits February 6, 2024 10:23
Co-authored-by: Will Shanks <wshaos@posteo.net>
and data points can also take metadata triplet (`data_uid`, `category`, `analysis`)
to distinguish the subset.

* The `data_uid` is an integer key representing the class of the data.
Copy link
Contributor

@itoko itoko Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding this info in the doc, @nkanazawa1989 . It helps me a lot to understand the concepts to know as a user of ScatterTable class. I've now understood data_uid identifies a "series" of data to fit a model. And name is another human-friendly "identifier" that indicates a series of data to fit a model as well. I've also found that the term series has already been used in the Curve Analysis tutorial. Given that, series_id and series_name sound better to me. What do you think? @nkanazawa1989 @wshanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

series_index instead of series_id? Looking at the example in the tutorial, data_uid can take a common value for different category. In that sense, it's not "ID" (to me, ID sounds like a solo unique identifier), it's more like "index" defined for each (category, analysis) data.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this suggestion. Perhaps series_idx to be shorter?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One case to consider regarding the naming is StarkRamseyXYAmpScanAnalysis. For that experiment, X and Y vs amp data are recorded. Then the X and Y data are converted into phase vs amp data which is fit. So there are two raw series that become the formatted series that is fit. Maybe series_ still makes sense since one series is passed to the model. It is the fit models though that have a one to one mapping the data_uid values.

Copy link
Contributor

@itoko itoko Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Umm, I'm converging to Will's comment:

One concern I have to make the roles of name, data_uid, category, and model (not a column but a term used in the same context) clear.

Maybe, all we need is model_ with fixed categories (or "stage"s?) (as for StarkRamseyXYAmpScanAnalysis, we could distinguish "ramsey_xy" and "freq" data sets using model_name instead of category).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot define fit model for ramsey_xy because its P1 envelope is unpredictable. In this sense, "ramsey_xy" is another the intermediate data to produce the frequency curves which have fit models (Ramsey data is used for visualization and we cannot discard -- when fit doesn't go well experimentalists can get some insights from the raw Ramsey signal). This is why I cannot introduce fixed categories -- this requires the Ramsey analysis to define ScatterTable subclass with its own categories.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyways I also like series. This was originally used in the curve analysis to indicate the data belongs to a particular fit model.

Copy link
Collaborator Author

@nkanazawa1989 nkanazawa1989 Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in ac5bdd8. Thanks for the suggestion.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think the broadest way to think about the columns are that series labels a single curve of some kind (ScatterTable is under qiskit_experiments.curve_analysis so it is safe to assume all the data belong to curves). category is a label for a group of curves/series that all belong to the same stage of processing (raw/formatted/fitted). analysis is a namespace for separating these groups of curves from different subexperiments.

@coruscating Is it important to get the release out today? I think I would like to clarify these terms a bit in the docs, but that could be done in a follow up if it is important to release today. I don't have any objections to the code (as long as further discussion doesn't make us want to change any names again).

docs/tutorials/curve_analysis.rst Outdated Show resolved Hide resolved
docs/tutorials/curve_analysis.rst Show resolved Hide resolved
Copy link
Collaborator

@wshanks wshanks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good. I will follow up about the documentation for ScatterTable.

@nkanazawa1989 nkanazawa1989 added this pull request to the merge queue Feb 6, 2024
Merged via the queue into qiskit-community:main with commit 36c06ea Feb 6, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants