Converting a cube to a Pandas dataframe #4526

Closed · 2 tasks · kaedonkers opened this issue Jan 25, 2022 · 11 comments

@kaedonkers (Member) commented Jan 25, 2022

✨ Feature Request

Return a "long" DataFrame which retains all the metadata of the cube (either by default or using kwarg table="long").
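
For illustration, the requested call could look something like this (the table kwarg is just the name suggested above, not an existing option):

df = iris.pandas.as_data_frame(cube)                 # proposed default: "long" DataFrame
df = iris.pandas.as_data_frame(cube, table="long")   # or explicitly, via the suggested kwarg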

Motivation

Currently iris.pandas.as_data_frame turns a 2D cube (a restriction not stated in the documentation) into a "pivot table"/"wide table" DataFrame, with one of the former dim coord's values as the column names and the other dim coord's values as the index. The data values at the centre of the cube become the table values.

I would argue that this result is unexpected for Pandas users, is not particularly useful, and loses lots of metadata in the process.

Feature list

To keep track of ideas as I find them:

  • Specify in the documentation that, currently, only a 2D cube can be converted to a DataFrame
  • Remove dependence on cube.coord(dimensions=[0/1]) as this throws unnecessary errors related to the presence of AuxCoords

Proposed change

A better default behaviour would be to generate a "long" table in which all the coord values from the cube are in separate columns of the DataFrame, with each coord's name as the name of its column. The data values would be in another column named after the cube. Attributes could also be included as their own columns for maximum metadata retention, although this might warrant a toggle kwarg as it could clutter the resulting DataFrame.

This would also allow the conversion to handle data with more than two dimensions (the current 2D restriction should really be added to the documentation as a requirement).

For example:

  • Current situation
cube
>>> precipitation_amount / (kg m-2 day-1) (ensemble_member: 12; time: 7200)
        Dimension coordinates:
            ensemble_member                               x         -
            time                                          -         x
        Cell methods:
            mean                          time
        Attributes:
            Conventions                   CF-1.7

iris.pandas.as_data_frame(cube)
>>>     1980-12-01 12:00:00  1980-12-02 12:00:00  1980-12-03 12:00:00  ...  2000-11-30 12:00:00
    1              1.707561             0.000042             0.003254  ...             0.000034
    4              0.000037             0.001898             0.002343  ...             0.052522
    5              2.079478             0.080967             0.156937  ...             0.051903
    6             16.845169             3.951329             0.034234  ...             0.000010
    7              1.636697             0.000034             0.000013  ...             0.000002
    8              4.940392            15.944072             7.649919  ...             0.001480
    9              0.000037             0.458525             0.122213  ...             1.298610
    10            11.951665             0.038251             3.440227  ...             0.000024
    11             0.001125             0.000270             0.000000  ...             0.007789
    12             0.590116             3.575591            10.533336  ...             0.000010
    13             5.615687             0.000008             0.014623  ...             1.066165
    15             6.515756            17.816267             0.077115  ...             0.527009
    [12 rows x 7200 columns]
  • Proposed change
cube
>>> precipitation_amount / (kg m-2 day-1) (ensemble_member: 12; time: 7200)
        Dimension coordinates:
            ensemble_member                               x         -
            time                                          -         x
        Cell methods:
            mean                          time
        Attributes:
            Conventions                   CF-1.7

iris.pandas.as_data_frame(cube)
>>>        ensemble_member                 time     precipitation_amount
    0                    1  1980-12-01 12:00:00                 1.707561
    1                    4  1980-12-01 12:00:00                 0.000037
    2                    5  1980-12-01 12:00:00                 2.079478
    3                    6  1980-12-01 12:00:00                16.845169
    4                    7  1980-12-01 12:00:00                 1.636697
    ...                ...                  ...                      ...
    86395               10  2000-11-30 12:00:00                 0.047245
    86396               11  2000-11-30 12:00:00                 0.000072
    86397               12  2000-11-30 12:00:00                 0.028169
    86398               13  2000-11-30 12:00:00                 3.638094
    86399               15  2000-11-30 12:00:00                 0.408061
    
    [86400 rows x 3 columns]
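
For anyone wanting the "long" layout today, a rough pandas-only workaround on the current "wide" output looks something like this (the index/column names are assumed for illustration; the current function does not set them):

wide = iris.pandas.as_data_frame(cube)      # current behaviour: 12 x 7200 "wide" table
long = (
    wide.rename_axis(index="ensemble_member", columns="time")
        .stack()                            # collapse the time columns into rows
        .rename("precipitation_amount")     # name the value column after the cube
        .reset_index()                      # -> ensemble_member, time, precipitation_amount
)
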
@trexfeathers (Contributor)

> I would argue that this result is unexpected for Pandas users

Thanks for the insight @kaedonkers! There isn't much Pandas knowledge within team Iris, so if this is something you think would be valuable, could you or someone else knowledgeable put up a PR?

@kaedonkers (Member, Author)

Thanks @trexfeathers - yes, it is something I think I can put up a PR for, but I am a bit short on time at the moment. Once I've finished the project I'm currently on, I'll take a stab at it.

@kaedonkers (Member, Author)

If you know of any pandas users in the Iris community who would want in on this, please let them know! I am no pandas power user, but others may be.

@hsteptoe (Contributor) commented Mar 29, 2022

@kaedonkers I've started looking at this... #4669 is a start on being more agnostic about the number of cube dimensions, but I'm not doing very well at setting up a test version of Iris to properly test my changes.

@hsteptoe (Contributor) commented Aug 3, 2022

Good further discussion with @trexfeathers today. A rough summary:

  • We'll keep the Cube -> DataFrame (Improve iris.pandas cube -> data.frame #4669) and DataFrame -> Cube (Improvements to Pandas-to-Iris bridge #4890) as separate PRs
  • We don't think the copy=True argument as it currently exists will port to our updates. Some work is required to investigate the use of numpy views and/or dask DataFrames to try to maintain the memory-efficiency principle, though...
  • We agreed that Cube -> Series is no longer needed, and we would provide a doc example of Cube -> DataFrame -> Series to help users maintain this workflow (see the sketch after this list)
  • These updates may introduce breaking changes more suitable for a 4.0 Iris release, but we could get them into a 3.n release using an iris.Future-type construct
  • More work is required to handle auxiliary coordinates and global attributes
  • We should develop a shared test case to verify we can do Cube -> DataFrame -> Cube without loss of fidelity
  • Our preference is to provide a basic conversion of Cube -> DataFrame, and let users make further DataFrame transforms (pivots etc.) in Pandas, rather than overengineering the Iris function
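
A minimal sketch of that Cube -> DataFrame -> Series doc example, assuming the long-format as_data_frame discussed in this issue (the file name and coordinate/column names are purely illustrative):

import iris
import iris.pandas

cube = iris.load_cube("precip.nc")        # hypothetical input file
df = iris.pandas.as_data_frame(cube)      # Cube -> DataFrame

# DataFrame -> Series in plain pandas: index by the coordinate columns and
# select the data column.
series = df.set_index(["ensemble_member", "time"])["precipitation_amount"]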

@trexfeathers (Contributor) commented Aug 4, 2022

> We should develop a shared test case to verify we can do Cube -> DataFrame -> Cube without loss of fidelity

In line with general Iris testing philosophy, the majority of the testing should be 'atomic' unit tests that have maximum focus and minimal run time*. The round-trip tests you describe make perfect integration tests, since doing a full object equality check has a wider focus and a longer run time than unit tests. We could have one test going Cube -> DataFrame -> Cube, and another doing the inverse.

(* within reason - no need to get wrapped up with mocking)
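
A rough sketch of what such a round-trip integration test could look like, assuming the reworked converters land as iris.pandas.as_data_frame (Cube -> DataFrame, #4669) and iris.pandas.as_cubes (DataFrame -> cubes, #4890):

import iris.pandas
from iris.tests import stock  # small synthetic cubes used across the Iris test suite


def test_cube_dataframe_cube_roundtrip():
    original = stock.simple_2d()
    frame = iris.pandas.as_data_frame(original)
    (restored,) = iris.pandas.as_cubes(frame)
    # Full object equality compares data, coordinates and metadata together.
    assert restored == original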

The current testing module was written before we started the unit/integration distinction:

  • We now place tests in either iris/tests/unit or iris/tests/integration, so that will need sorting out. I suggest we both add our new unit tests to the existing module before we discuss relocation. It will be easier for the new integration tests.
  • Thankfully, the existing tests for Iris-to-Pandas are already a reasonable model to copy for your unit tests.
    (I'm not so lucky with the other direction that I am working on!)

@trexfeathers (Contributor) commented Aug 5, 2022

> We don't think the copy=True argument as it currently exists will port to our updates. Some work is required to investigate the use of numpy views ...

Some new thoughts on this matter: #4669 (review)

@trexfeathers (Contributor)

> We don't think the copy=True argument as it currently exists will port to our updates. Some work is required to investigate the use of ... dask DataFrames

Dask DataFrame doesn't yet support multiindexing (dask/dask#1493), so that's probably a non-starter for both of us?

It seems there is plenty of Pandas functionality that doesn't work with a Dask DataFrame; I had originally imagined it to be fairly interchangeable with a Pandas DataFrame (as is the case with NumPy and Dask Array), but since that's not the case, I'd say it's defensible to include a docstring note saying Dask DataFrame isn't supported (maybe a check in the code too?).
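
For example, such a check could be as simple as the following (the helper name and error wording are illustrative only, not an existing part of iris.pandas):

import pandas as pd


def _assert_pandas_dataframe(obj):
    # Hypothetical guard: reject non-Pandas frames (e.g. Dask DataFrames) up front.
    if not isinstance(obj, pd.DataFrame):
        raise TypeError(
            f"Expected a pandas.DataFrame, got {type(obj).__name__}; "
            "Dask DataFrames are not supported."
        )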

@kaedonkers (Member, Author)

@trexfeathers @hsteptoe Thank you both for following this up! Progress looks good; do you need any input from me? An example, perhaps?

Something I forgot to include in the original feature request is that Xarray and Pandas already have this functionality in the form of xarray.Dataset.to_pandas() and pandas.DataFrame.to_xarray(), in case there is any prior art that is useful there.
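
For reference, a quick sketch of that prior art (xarray's Dataset.to_dataframe() is the long-format analogue; exact behaviour depends on the library versions in use):

import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    {"precipitation_amount": (("ensemble_member", "time"), np.random.rand(3, 4))},
    coords={
        "ensemble_member": [1, 4, 5],
        "time": pd.date_range("1980-12-01 12:00", periods=4),
    },
)

df = ds.to_dataframe()         # "long" DataFrame with an (ensemble_member, time) MultiIndex
ds_roundtrip = df.to_xarray()  # and back again via pandas.DataFrame.to_xarray()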

@trexfeathers (Contributor)

Thanks @kaedonkers 😊

@hsteptoe has included a slew of useful examples in #4669, so you can rest easy. I'm just trying to find time to review it as soon as I can!

@trexfeathers (Contributor)

Closed by #5074
