Develop the next version of the LightCurve API #1520

Cadair · 2015-08-03T14:40:45Z

This issue began as a debate as to whether or not to switch the LightCurve object from pandas dataframes to astropy tables. Triggered by work going on in astropy/astropy#3915 to enable the use of astropy table for time series (more generally allowing a unique index column).
However, in order to make that decision, it was decided that we should define the functionality we need/want the LightCurve API to have. A list of required/desired functionality has been started:

Required/Desired functionality of LightCurve API:

LightCurve Truncation

getting shorter duration lightcurves from an existing lightcurve
use case: getting GOES data from around the peak of a flare only

Merging LightCurves

taking two or more lightcurves and merging them together to form one lightcurve
use case: looking at EVE summary data over the course of a month for irradiance studies

Subsampling

taking specific elements of an existing lightcurve (for example every n'th element), and creating a new lightcurve
use case: want to match the sampling rate of one lightcurve with another

Resampling

taking an existing lightcurve and resample it at different times to get a new lightcurve
use case: the original lightcurve is unevenly sampled, and is easier to handle when evenly sampled
Summing
summing every 'n' elements of the original time-series
use case: increase the signal to noise ratio of the data

Sorting

sort the data so time is monotonically increasing
use case: can be hard to make sense of a lightcurve otherwise.

Unit aware

Being able to assign a physical unit to a column of the LightCurve data
use case: Is your data in keV or angstroms.

Unit manipulation

Convert between units or calculating new unit when two LightCurve columns are multiplied, etc.
use case: Convert keV -> angstrom or W/m^2 * s * m^2 = J, e.g. when calculating total radiated energy during a flare.

Pandas vs. Astropy Table: Open Questions

Is manipulating astropy tables significantly slower than pandas dataframes?

Miscellaneous

If SunPy is going to use astropy time, astropy tables in the LightCurve would make maintaining the LightCurve easier.
Pandas will always have more development resources than astropy and has become a standard dependence of scientific python programming.

Cadair · 2015-08-03T14:41:00Z

ping @ehsteve @wafels

ehsteve · 2015-08-03T15:05:21Z

Disagree! Pandas is now standard in the scientific python stack so it is as much of a "dependency" as numpy, scipy, or matplotlib. Also it provides much more functionality than astropy tables do (and will continue to get more capabilities over time).

I am strongly against this.

ehsteve · 2015-08-03T15:06:08Z

I agree that it would be great for Pandas to support units. Perhaps we can hack that in somehow.

Cadair · 2015-08-03T15:14:27Z

@ehsteve We would loose no functionality by doing this, as we would still provide access to the dataframe. This is about making it easier for us to maintain the underlying code.

The dependency point is the least important of the ones I made above, although it would still make certain things easier.

I do not think that "hacking" units onto pandas is ever going to be a good idea, as it will cause us much more work, rather than less, which is my main motivator. Also if we are going to move to astropy time in all other places in the code base, having this as astropy table will make it much easier to work with this.

I think we need to think about the functionality of pandas that we wrap, i.e. not enable users to use by LightCurve.data which is functionality that will still be available. (through something LIKE LightCurve.dataframe. In the current implementation of LightCurve and my current understanding of future factory changes we need the following core features in the underlying data type:

Unique time series index.
Truncation of series by time range.
Joining of multiple series together.

these three core things that we need (and we don't even use the last two in the current implementation) can all be provided by astropy.table, which would make it easier for us to integrate lightcurve into SunPy as the first class data type we all want it to be, while loosing no functionality for the end user.

Cadair · 2015-08-03T15:25:44Z

To evidence my statement that it would make things easier for us if done properly, I would like to hold up @DanRyanIrish's GOES code:

See things like this:
https://github.com/sunpy/sunpy/blob/master/sunpy%2Finstr%2Fgoes.py#L164
Where he is having to require the units of the data in the lightcurve be correct, because he then converts it to a quantity later on, so that he could use units to make the rest of his code easier.

On the topic of time consider our future need for epoch conversions, and going between TAI time and UTC for JSOC, or downloading lightcurve data that has the time column in "utime" and the user wanting to then download data from JSOC based on the peak intensity in a LightCurve, in our current model that would involve doing a utime -> UTC -> TAI conversion (and loosing accuracy at each step), where as if we could use Astropy Table, with an Astropy Time index column, we could do utime -> TAI time, only when passed into the jsoc.attrs.Time constructor.

These two things are just to highlight how using more of the Astropy infrastructure should make our lives simpler.

aringlis · 2015-08-03T15:50:05Z

@ehsteve has a good point that pandas is pretty standard and has a lot of great functionality. @Cadair to be honest I'm not sure I understand what you're proposing -we should discuss at a Dev meeting. Agree that 'hacking' pandas would be bad though.

Cadair · 2015-08-03T15:52:37Z

I also agree that pandas is great, and to clarify I think that having it as a dependency is not a practical problem at all these days.

DanRyanIrish · 2015-08-04T07:50:08Z

I was unable to make the dev meeting last night so perhaps you can tell me if a decision was made on this. I see real advantages to switching to astropy tables, as @Cadair showed with my GOES code. Although pandas has some great functionality, is there anything we are likely to want to do with LightCurve data that pandas would make far easier/better than astropy tables?

Cadair · 2015-08-04T08:01:41Z

We did not discuss this at the dev meeting, we had lively debates about other things, like the nature of an example gallery.

I am with @DanRyanIrish on this, I have pointed out a couple of things that Astropy tables would improve over pandas, but I am yet to see any comments about what Pandas is bringing to the table that we need to implement the LightCurve object.

DanRyanIrish · 2015-08-04T08:05:58Z

The only question I can think to ask is is accessing and manipulated a pandas dataframe significantly faster than for an astropy table? If people want to process a lot of data, e.g. years of GOES data or a lot of high cadence LYRA data, that might be something to consider if we want people to use the LightCurve object rather than reading FITS data straight into their own numpy arrays.

Cadair · 2015-08-04T08:28:36Z

@DanRyanIrish That's an interesting question, and something that's worth investigating later on.

wafels · 2015-08-04T13:46:26Z

Probably worth while to come up with a list of operations you would want to
perform on a lightcurve in order to make it a useful object. Then we can
discuss implementation.

So, to kick things off, here are a list of operations and use cases.
Please add more as you see fit.

truncation

getting shorter duration lightcurves from an existing lightcurve
use case: getting GOES data from around the peak of a flare only

merging

taking two or more lightcurves and merging them together to form one
lightcurve
use case: looking at EVE summary data over the course of a month for
irradiance studies

subsampling

taking specific elements of an existing lightcurve (for example every
n'th element), and creating a new lightcurve
use case: want to match the sampling rate of one lightcurve with another

resampling

taking an existing lightcurve and resample it at different times to get
a new lightcurve
use case: the original lightcurve is unevenly sampled, and is easier to
handle when evenly sampled

summing

summing every 'n' elements of the original time-series,
use case: increase the signal to noise ratio of the data

sorting

sort the data so time is monotonically increasing
use case: can be hard to make sense of a lightcurve otherwise.

On Tue, Aug 4, 2015 at 4:28 AM, Stuart Mumford notifications@github.com
wrote:

@DanRyanIrish https://github.com/DanRyanIrish That's an interesting
question, and something that's worth investigating later on.

—
Reply to this email directly or view it on GitHub
#1520 (comment).

Cadair · 2015-08-04T13:48:21Z

@wafels thanks, that's great. Are you suggesting that we make all that functionality available through the LightCurve API?

ehsteve · 2015-08-04T13:49:49Z

In answer to @Cadair I would vote yes on that.

wafels · 2015-08-04T14:32:58Z

I don't know. What would be the alternative - a bunch of independent
functions that operate on the lightcurves?

lightcurve.operation.truncate(my_lc, "2010-01-01", "2010-01-02 12:34:58")

lightcurve.operation.sum(my_lc, bins=60*u.s)

lightcurve.operation.merge(my_lc1, my_lc2, overlap=np.mean)

On Tue, Aug 4, 2015 at 9:49 AM, Steven Christe notifications@github.com
wrote:

In answer to @Cadair https://github.com/Cadair I would vote yes on that.

—
Reply to this email directly or view it on GitHub
#1520 (comment).

Cadair · 2015-08-04T14:36:06Z

I think all of that functionality is reasonable to be in the API for lightcurve. I think this is a good way to define this choice of underlying data object, in that if it turns out astropy table can not perform all the things we need it to, then it is not suitable as a replacement.

Having said that, I do not think that anything on that list given by @wafels is likely to cause a problem.

DanRyanIrish · 2015-08-04T14:46:14Z

I would add to that list
Unit aware

Being able to assign a physical unit to a column of the LightCurve data
use case: Is your data in keV or angstroms.

Unit manipulation

Convert between units or calculating new unit when two LightCurve columns are multiplied, etc.
use case: Convert keV -> angstrom or W/m^2 * s * m^2 = J, e.g. when calculating total radiated energy during a flare.

Cadair · 2015-08-04T15:24:17Z

@DanRyanIrish Those features are practically only going to be available if we use Astropy table with quantity columns. Which, of course, is my point ;)

DanRyanIrish · 2015-08-04T15:26:00Z

@Cadair I agree. That's why they should to be in the list of features started by @wafels. These are features we really want.

DanRyanIrish · 2015-08-10T18:03:01Z

Here is a summary of what has been discussed on this issue so far including in SunPy dev meetings. @ehsteve can add this comment as a new summary for this issue.

This issue began as a debate as to whether or not to switch the LightCurve object from pandas dataframes to astropy tables. However, in order to make that decision, it was decided that we should define the functionality we need/want the LightCurve API to have. A list of required/desired functionality has been started:

Required/Desired functionality of LightCurve API
LightCurve Truncation
-- getting shorter duration lightcurves from an existing lightcurve
-- use case: getting GOES data from around the peak of a flare only
Merging LightCurves
-- taking two or more lightcurves and merging them together to form one lightcurve
-- use case: looking at EVE summary data over the course of a month for irradiance studies
Subsampling
-- taking specific elements of an existing lightcurve (for example every n'th element), and creating a new lightcurve
-- use case: want to match the sampling rate of one lightcurve with another
Resampling
-- taking an existing lightcurve and resample it at different times to get a new lightcurve
-- use case: the original lightcurve is unevenly sampled, and is easier to handle when evenly sampled Summing
-- summing every 'n' elements of the original time-series
-- use case: increase the signal to noise ratio of the data
Sorting
-- sort the data so time is monotonically increasing
-- use case: can be hard to make sense of a lightcurve otherwise.
Unit aware
-- Being able to assign a physical unit to a column of the LightCurve data
-- use case: Is your data in keV or angstroms.
Unit manipulation
-- Convert between units or calculating new unit when two LightCurve columns are multiplied, etc.
-- use case: Convert keV -> angstrom or W/m^2 * s * m^2 = J, e.g. when calculating total radiated energy during a flare.

Pandas vs. Astropy Table: Open Questions
-- Is manipulating astropy tables significantly slower than pandas dataframes?

Miscellaneous
-- If SunPy is going to use astropy time, astropy tables in the LightCurve would make maintaining the LightCurve easier.
-- Pandas will always have more development resources than astropy and has become a standard dependence of scientific python programming.

Fell free to add to this list or answer questions raised. And also to add anything else to this summary you feel I've missed. Once we've decided the functionality of the API and answered the open questions we have, we can decide whether or not to switch LightCurve to depend on Astropy tables.

Alex-Ian-Hamilton · 2016-05-13T21:43:07Z

At the most recent SunPy dev meeting the consensus was that the Lightcurve Refactor should be done using Pandas but we should add support for AstroPy Units, but we might migrate onto an AstroPy QTable based TimeSeries class in future, when one becomes available.

Part of the rational for this was that there's a limitation of the AstroPy Time class being immutable and without an insert method, therefore you can't rearrange or add rows to a QTable with Mixing column for Time without manually recreating the entire table.
I have raised this with AstroPy (here) and am looking at making a solution for this, but I'm not sure when this will be done.

Cadair added the Lightcurve label Aug 3, 2015

ehsteve changed the title ~~Replace pandas with Astropy table as the primary data representation for Lightcurve~~ Develop the next version of the LightCurve API Aug 10, 2015

Cadair closed this as completed Apr 5, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Develop the next version of the LightCurve API #1520

Develop the next version of the LightCurve API #1520

Cadair commented Aug 3, 2015

Cadair commented Aug 3, 2015

ehsteve commented Aug 3, 2015

ehsteve commented Aug 3, 2015

Cadair commented Aug 3, 2015

Cadair commented Aug 3, 2015

aringlis commented Aug 3, 2015

Cadair commented Aug 3, 2015

DanRyanIrish commented Aug 4, 2015

Cadair commented Aug 4, 2015

DanRyanIrish commented Aug 4, 2015

Cadair commented Aug 4, 2015

wafels commented Aug 4, 2015

Cadair commented Aug 4, 2015

ehsteve commented Aug 4, 2015

wafels commented Aug 4, 2015

Cadair commented Aug 4, 2015

DanRyanIrish commented Aug 4, 2015

Cadair commented Aug 4, 2015

DanRyanIrish commented Aug 4, 2015

DanRyanIrish commented Aug 10, 2015

Alex-Ian-Hamilton commented May 13, 2016 •

edited

Develop the next version of the LightCurve API #1520

Develop the next version of the LightCurve API #1520

Comments

Cadair commented Aug 3, 2015

Required/Desired functionality of LightCurve API:

LightCurve Truncation

Merging LightCurves

Subsampling

Resampling

Sorting

Unit aware

Unit manipulation

Pandas vs. Astropy Table: Open Questions

Miscellaneous

Cadair commented Aug 3, 2015

ehsteve commented Aug 3, 2015

ehsteve commented Aug 3, 2015

Cadair commented Aug 3, 2015

Cadair commented Aug 3, 2015

aringlis commented Aug 3, 2015

Cadair commented Aug 3, 2015

DanRyanIrish commented Aug 4, 2015

Cadair commented Aug 4, 2015

DanRyanIrish commented Aug 4, 2015

Cadair commented Aug 4, 2015

wafels commented Aug 4, 2015

Cadair commented Aug 4, 2015

ehsteve commented Aug 4, 2015

wafels commented Aug 4, 2015

Cadair commented Aug 4, 2015

DanRyanIrish commented Aug 4, 2015

Cadair commented Aug 4, 2015

DanRyanIrish commented Aug 4, 2015

DanRyanIrish commented Aug 10, 2015

Alex-Ian-Hamilton commented May 13, 2016 • edited

Alex-Ian-Hamilton commented May 13, 2016 •

edited