Refactor pandas save load and convert dtypes#3412
Refactor pandas save load and convert dtypes#3412samuelgarcia merged 7 commits intoSpikeInterface:mainfrom
Conversation
zm711
left a comment
There was a problem hiding this comment.
Since I haven't fully tested zarr yet I want to make sure. We have an appropriate pandas warning somewhere for users so they know they need pandas for these features. I know we have a warning for qualitymetrics do we have one for templatemetrics?
| value = np.nan | ||
| template_metrics.at[index, metric_name] = value | ||
| return template_metrics | ||
| return template_metrics.convert_dtypes() |
There was a problem hiding this comment.
This is a weak recommendation ,but maybe we put this on its own line with a comment. Just from reading this I have no clue why we need to do this and doing this in the return line is even more confusing. So something like
# see xx
template_metrics.convert_dtypes()
return template_metrics| metrics.loc[empty_unit_ids] = np.nan | ||
|
|
||
| return metrics | ||
| return metrics.convert_dtypes() |
There was a problem hiding this comment.
same here. From the code it is not clear why we need to convert dtypes so I would refer to divide this into a convert step and then only return the converted. That way we can have a comment explaining why we need to convert.
There was a problem hiding this comment.
added comment and convert step
We don't have warnings anywhere. If a user tries to compute template or quality metrics without pandas, it will throw an interpetable |
|
One last comment: for analyzers saved to |
| warnings.warn( | ||
| "The zarr store was not properly consolidated prior to v0.101.1. " | ||
| "This may lead to unexpected behavior in loading extensions. " | ||
| "Please consider re-saving the SortingAnalyzer object." |
There was a problem hiding this comment.
with a save_as(format='zarr')? I just want to make sure the error is as clear as possible.
There was a problem hiding this comment.
Not really..since the problem is consolidation, the save as may fail to discover all the pieces of the folder. I changed it to re-generating.
Honestly, I don't think this will be an issue since it will only happen if:
- you saved to zarr between 0.101.0 and 0.101.1
- you don't have write access to the data
|
|
||
| # we use the convert_dtypes to convert the columns to the most appropriate dtype and avoid object columns | ||
| # (in case of NaN values) | ||
| template_metrics = template_metrics.convert_dtypes() |
|
Thanks Alessio! |
We found out that zarr consolidation doens't seem to play well with our way of saving/loading dataframes to zarr, using xarray.
xarraywas only used to save/load pandas dataframes to zarr, and this PR modifies that by saving each column and the index directly. This is similar to how xarray saves to zarr, so it should be backcompatible (testing now).To make sure we don't run in such problems in the future, I added a roundtrip test in the common extension tests that asserts that data reloaded is the same as the original ones.
As sugegsted by @h-mayorquin here #3365, the generated and reloaded dataframes are also converted to numpy dtypes with the
convert_dtypesfunction. We just have to make sure to call aSeries.to_numpyto cast pandas dtypes to numpy ones.