Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Develop the next version of the LightCurve API #1520

Closed
Cadair opened this issue Aug 3, 2015 · 21 comments
Closed

Develop the next version of the LightCurve API #1520

Cadair opened this issue Aug 3, 2015 · 21 comments

Comments

@Cadair
Copy link
Member

Cadair commented Aug 3, 2015

This issue began as a debate as to whether or not to switch the LightCurve object from pandas dataframes to astropy tables. Triggered by work going on in astropy/astropy#3915 to enable the use of astropy table for time series (more generally allowing a unique index column).
However, in order to make that decision, it was decided that we should define the functionality we need/want the LightCurve API to have. A list of required/desired functionality has been started:

Required/Desired functionality of LightCurve API:

LightCurve Truncation

  • getting shorter duration lightcurves from an existing lightcurve
  • use case: getting GOES data from around the peak of a flare only

Merging LightCurves

  • taking two or more lightcurves and merging them together to form one lightcurve
  • use case: looking at EVE summary data over the course of a month for irradiance studies

Subsampling

  • taking specific elements of an existing lightcurve (for example every n'th element), and creating a new lightcurve
  • use case: want to match the sampling rate of one lightcurve with another

Resampling

  • taking an existing lightcurve and resample it at different times to get a new lightcurve
  • use case: the original lightcurve is unevenly sampled, and is easier to handle when evenly sampled
    Summing
  • summing every 'n' elements of the original time-series
  • use case: increase the signal to noise ratio of the data

Sorting

  • sort the data so time is monotonically increasing
  • use case: can be hard to make sense of a lightcurve otherwise.

Unit aware

  • Being able to assign a physical unit to a column of the LightCurve data
  • use case: Is your data in keV or angstroms.

Unit manipulation

  • Convert between units or calculating new unit when two LightCurve columns are multiplied, etc.
  • use case: Convert keV -> angstrom or W/m^2 * s * m^2 = J, e.g. when calculating total radiated energy during a flare.

Pandas vs. Astropy Table: Open Questions

  • Is manipulating astropy tables significantly slower than pandas dataframes?

Miscellaneous

  • If SunPy is going to use astropy time, astropy tables in the LightCurve would make maintaining the LightCurve easier.
  • Pandas will always have more development resources than astropy and has become a standard dependence of scientific python programming.
@Cadair
Copy link
Member Author

Cadair commented Aug 3, 2015

ping @ehsteve @wafels

@ehsteve
Copy link
Member

ehsteve commented Aug 3, 2015

Disagree! Pandas is now standard in the scientific python stack so it is as much of a "dependency" as numpy, scipy, or matplotlib. Also it provides much more functionality than astropy tables do (and will continue to get more capabilities over time).

I am strongly against this.

@ehsteve
Copy link
Member

ehsteve commented Aug 3, 2015

I agree that it would be great for Pandas to support units. Perhaps we can hack that in somehow.

@Cadair
Copy link
Member Author

Cadair commented Aug 3, 2015

@ehsteve We would loose no functionality by doing this, as we would still provide access to the dataframe. This is about making it easier for us to maintain the underlying code.

The dependency point is the least important of the ones I made above, although it would still make certain things easier.

I do not think that "hacking" units onto pandas is ever going to be a good idea, as it will cause us much more work, rather than less, which is my main motivator. Also if we are going to move to astropy time in all other places in the code base, having this as astropy table will make it much easier to work with this.

I think we need to think about the functionality of pandas that we wrap, i.e. not enable users to use by LightCurve.data which is functionality that will still be available. (through something LIKE LightCurve.dataframe. In the current implementation of LightCurve and my current understanding of future factory changes we need the following core features in the underlying data type:

  • Unique time series index.
  • Truncation of series by time range.
  • Joining of multiple series together.

these three core things that we need (and we don't even use the last two in the current implementation) can all be provided by astropy.table, which would make it easier for us to integrate lightcurve into SunPy as the first class data type we all want it to be, while loosing no functionality for the end user.

@Cadair
Copy link
Member Author

Cadair commented Aug 3, 2015

To evidence my statement that it would make things easier for us if done properly, I would like to hold up @DanRyanIrish's GOES code:

See things like this:
https://github.com/sunpy/sunpy/blob/master/sunpy%2Finstr%2Fgoes.py#L164
Where he is having to require the units of the data in the lightcurve be correct, because he then converts it to a quantity later on, so that he could use units to make the rest of his code easier.

On the topic of time consider our future need for epoch conversions, and going between TAI time and UTC for JSOC, or downloading lightcurve data that has the time column in "utime" and the user wanting to then download data from JSOC based on the peak intensity in a LightCurve, in our current model that would involve doing a utime -> UTC -> TAI conversion (and loosing accuracy at each step), where as if we could use Astropy Table, with an Astropy Time index column, we could do utime -> TAI time, only when passed into the jsoc.attrs.Time constructor.

These two things are just to highlight how using more of the Astropy infrastructure should make our lives simpler.

@aringlis
Copy link
Member

aringlis commented Aug 3, 2015

@ehsteve has a good point that pandas is pretty standard and has a lot of great functionality. @Cadair to be honest I'm not sure I understand what you're proposing -we should discuss at a Dev meeting. Agree that 'hacking' pandas would be bad though.

@Cadair
Copy link
Member Author

Cadair commented Aug 3, 2015

I also agree that pandas is great, and to clarify I think that having it as a dependency is not a practical problem at all these days.

@DanRyanIrish
Copy link
Member

I was unable to make the dev meeting last night so perhaps you can tell me if a decision was made on this. I see real advantages to switching to astropy tables, as @Cadair showed with my GOES code. Although pandas has some great functionality, is there anything we are likely to want to do with LightCurve data that pandas would make far easier/better than astropy tables?

@Cadair
Copy link
Member Author

Cadair commented Aug 4, 2015

We did not discuss this at the dev meeting, we had lively debates about other things, like the nature of an example gallery.

I am with @DanRyanIrish on this, I have pointed out a couple of things that Astropy tables would improve over pandas, but I am yet to see any comments about what Pandas is bringing to the table that we need to implement the LightCurve object.

@DanRyanIrish
Copy link
Member

The only question I can think to ask is is accessing and manipulated a pandas dataframe significantly faster than for an astropy table? If people want to process a lot of data, e.g. years of GOES data or a lot of high cadence LYRA data, that might be something to consider if we want people to use the LightCurve object rather than reading FITS data straight into their own numpy arrays.

@Cadair
Copy link
Member Author

Cadair commented Aug 4, 2015

@DanRyanIrish That's an interesting question, and something that's worth investigating later on.

@wafels
Copy link
Member

wafels commented Aug 4, 2015

Probably worth while to come up with a list of operations you would want to
perform on a lightcurve in order to make it a useful object. Then we can
discuss implementation.

So, to kick things off, here are a list of operations and use cases.
Please add more as you see fit.

truncation

  • getting shorter duration lightcurves from an existing lightcurve
  • use case: getting GOES data from around the peak of a flare only

merging

  • taking two or more lightcurves and merging them together to form one
    lightcurve
  • use case: looking at EVE summary data over the course of a month for
    irradiance studies

subsampling

  • taking specific elements of an existing lightcurve (for example every
    n'th element), and creating a new lightcurve
  • use case: want to match the sampling rate of one lightcurve with another

resampling

  • taking an existing lightcurve and resample it at different times to get
    a new lightcurve
  • use case: the original lightcurve is unevenly sampled, and is easier to
    handle when evenly sampled

summing

  • summing every 'n' elements of the original time-series,
  • use case: increase the signal to noise ratio of the data

sorting

  • sort the data so time is monotonically increasing
  • use case: can be hard to make sense of a lightcurve otherwise.

On Tue, Aug 4, 2015 at 4:28 AM, Stuart Mumford notifications@github.com
wrote:

@DanRyanIrish https://github.com/DanRyanIrish That's an interesting
question, and something that's worth investigating later on.


Reply to this email directly or view it on GitHub
#1520 (comment).

@Cadair
Copy link
Member Author

Cadair commented Aug 4, 2015

@wafels thanks, that's great. Are you suggesting that we make all that functionality available through the LightCurve API?

@ehsteve
Copy link
Member

ehsteve commented Aug 4, 2015

In answer to @Cadair I would vote yes on that.

@wafels
Copy link
Member

wafels commented Aug 4, 2015

I don't know. What would be the alternative - a bunch of independent
functions that operate on the lightcurves?

lightcurve.operation.truncate(my_lc, "2010-01-01", "2010-01-02 12:34:58")

lightcurve.operation.sum(my_lc, bins=60*u.s)

lightcurve.operation.merge(my_lc1, my_lc2, overlap=np.mean)

On Tue, Aug 4, 2015 at 9:49 AM, Steven Christe notifications@github.com
wrote:

In answer to @Cadair https://github.com/Cadair I would vote yes on that.


Reply to this email directly or view it on GitHub
#1520 (comment).

@Cadair
Copy link
Member Author

Cadair commented Aug 4, 2015

I think all of that functionality is reasonable to be in the API for lightcurve. I think this is a good way to define this choice of underlying data object, in that if it turns out astropy table can not perform all the things we need it to, then it is not suitable as a replacement.

Having said that, I do not think that anything on that list given by @wafels is likely to cause a problem.

@DanRyanIrish
Copy link
Member

I would add to that list
Unit aware

  • Being able to assign a physical unit to a column of the LightCurve data
  • use case: Is your data in keV or angstroms.

Unit manipulation

  • Convert between units or calculating new unit when two LightCurve columns are multiplied, etc.
  • use case: Convert keV -> angstrom or W/m^2 * s * m^2 = J, e.g. when calculating total radiated energy during a flare.

@Cadair
Copy link
Member Author

Cadair commented Aug 4, 2015

@DanRyanIrish Those features are practically only going to be available if we use Astropy table with quantity columns. Which, of course, is my point ;)

@DanRyanIrish
Copy link
Member

@Cadair I agree. That's why they should to be in the list of features started by @wafels. These are features we really want.

@ehsteve ehsteve changed the title Replace pandas with Astropy table as the primary data representation for Lightcurve Develop the next version of the LightCurve API Aug 10, 2015
@DanRyanIrish
Copy link
Member

Here is a summary of what has been discussed on this issue so far including in SunPy dev meetings. @ehsteve can add this comment as a new summary for this issue.

This issue began as a debate as to whether or not to switch the LightCurve object from pandas dataframes to astropy tables. However, in order to make that decision, it was decided that we should define the functionality we need/want the LightCurve API to have. A list of required/desired functionality has been started:

Required/Desired functionality of LightCurve API
LightCurve Truncation
-- getting shorter duration lightcurves from an existing lightcurve
-- use case: getting GOES data from around the peak of a flare only
Merging LightCurves
-- taking two or more lightcurves and merging them together to form one lightcurve
-- use case: looking at EVE summary data over the course of a month for irradiance studies
Subsampling
-- taking specific elements of an existing lightcurve (for example every n'th element), and creating a new lightcurve
-- use case: want to match the sampling rate of one lightcurve with another
Resampling
-- taking an existing lightcurve and resample it at different times to get a new lightcurve
-- use case: the original lightcurve is unevenly sampled, and is easier to handle when evenly sampled Summing
-- summing every 'n' elements of the original time-series
-- use case: increase the signal to noise ratio of the data
Sorting
-- sort the data so time is monotonically increasing
-- use case: can be hard to make sense of a lightcurve otherwise.
Unit aware
-- Being able to assign a physical unit to a column of the LightCurve data
-- use case: Is your data in keV or angstroms.
Unit manipulation
-- Convert between units or calculating new unit when two LightCurve columns are multiplied, etc.
-- use case: Convert keV -> angstrom or W/m^2 * s * m^2 = J, e.g. when calculating total radiated energy during a flare.

Pandas vs. Astropy Table: Open Questions
-- Is manipulating astropy tables significantly slower than pandas dataframes?

Miscellaneous
-- If SunPy is going to use astropy time, astropy tables in the LightCurve would make maintaining the LightCurve easier.
-- Pandas will always have more development resources than astropy and has become a standard dependence of scientific python programming.

Fell free to add to this list or answer questions raised. And also to add anything else to this summary you feel I've missed. Once we've decided the functionality of the API and answered the open questions we have, we can decide whether or not to switch LightCurve to depend on Astropy tables.

@Alex-Ian-Hamilton
Copy link
Contributor

Alex-Ian-Hamilton commented May 13, 2016

At the most recent SunPy dev meeting the consensus was that the Lightcurve Refactor should be done using Pandas but we should add support for AstroPy Units, but we might migrate onto an AstroPy QTable based TimeSeries class in future, when one becomes available.

Part of the rational for this was that there's a limitation of the AstroPy Time class being immutable and without an insert method, therefore you can't rearrange or add rows to a QTable with Mixing column for Time without manually recreating the entire table.
I have raised this with AstroPy (here) and am looking at making a solution for this, but I'm not sure when this will be done.

@Cadair Cadair closed this as completed Apr 5, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants