Skip to content

Latest commit

 

History

History
197 lines (158 loc) · 14.2 KB

paper.md

File metadata and controls

197 lines (158 loc) · 14.2 KB
title tags authors affiliations date bibliography
`pgmuvi`: Quick and easy Gaussian Process Regression for multi-wavelength astronomical timeseries
Python
astronomy
timeseries
Gaussian processes
name orcid corresponding affiliation
Peter Scicluna
0000-0002-1161-3756
true
1, 2
name orcid affiliation
Stefan Waterval
0000-0002-5542-8624
3, 4
name orcid affiliation
Diego A. Vasquez-Torres
0009-0008-2354-0049
5
name orcid affiliation
Sundar Srinivasan
0000-0002-2996-305X
5
name orcid affiliation
Sara Jamal
0000-0002-3929-6668
6
name index
European Southern Observatory, Alonso de Córdova 3107, Vitacura, Santiago, Chile
1
name index
Space Science Institute, 4750 Walnut Street, Suite 205, Boulder, CO 80301, USA
2
name index
New York University Abu Dhabi, PO Box 129188, Abu Dhabi, United Arab Emirates
3
name index
Center for Astro, Particle and Planetary Physics (CAP$^3$), New York University Abu Dhabi, PO Box 129188, Abu Dhabi, United Arab Emirates
4
name index
IRyA, Universidad Nacional Autónoma de México, Morelia, Michoacán, México
5
name index
Max Planck Institute for Astronomy, Königstuhl 17, 69117 Heidelberg, Germany
6
XX August 2023
paper.bib

Summary

Time-domain observations are increasingly important in astronomy, and are often the only way to study certain objects. The volume of time-series data is increasing dramatically as new surveys come online - for example, the Vera Rubin Observatory will produce 15 terabytes of data per night, and its Legacy Survey of Space and Time (LSST) is expected to produce five-year lightcurves for $>10^7$ sources, each consisting of 5 photometric bands. Historically, astronomers have worked with Fourier-based techniques such as the Lomb-Scargle periodogram or information-theoretic approaches; however, in recent years Bayesian and data-driven approaches such as Gaussian Process Regression (GPR) have gained traction. However, the computational complexity and steep learning curve of GPR has limited its adoption. pgmuvi makes GPR of multi-band timeseries accessible to astronomers by building on cutting-edge open-source machine-learning libraries, and hence pgmuvi retains the speed and flexibility of GPR while being easy to use. It provides easy access to GPU acceleration and Bayesian inference of the hyperparameters (e.g. the periods), and is able to scale to large datasets.

Statement of need

Astronomical objects are in general not static, but vary in brightness over time. This is especially true for objects that are variable by nature, such as pulsating stars, or objects that are variable due to their orbital motion, such as eclipsing binaries. The study of these objects is called time-domain astronomy, and is a rapidly growing field. A wide range of approaches to time-series analysis have been developed, ranging from simple period-finding algorithms to more complex machine learning techniques [e.g. @Donoso-Oliva2023-transformer; @supersmoother; @Huijse_2018; @Palmer_2009 and many more]. Perhaps the most popular in astronomy is the Lomb-Scargle periodogram [@Lomb1976; @Scargle1982], which is a Fourier-based technique to find periodic signals in unevenly sampled data. However, the handling of unevenly sampled data is not the only challenge in time-series analysis.

A particular challenge in astronomy is handling heterogeneous, multiwavelength data [@VanderPlas2015]. Data must often be combined from a wide variety of instruments, telescopes or surveys, and so the systematics or noise properties of different datasets vary widely. In addition, by combining multiple wavelengths, we gain a better understanding of the physical processes driving the variability of the object. For example, some variability mechanisms differ as a function of wavelength only in amplitude (e.g. eclipsing binaries), while others may vary in phase (e.g. pulsating stars) or even period (e.g. multiperiodic systems). Thus, it is important to combine data from multiple wavelengths in a way that accounts for these differences.

Gaussian processes (GPs) have recently become a popular tool to handle these challenges. GPs are a flexible way to forward-model arbitrary signals, by assuming that the signal is drawn from a multivariate Gaussian distribution. By constructing a covariance function describing the covariance between any two points in the signal, we can model the signal as a Gaussian process. By doing so, we are freed from any assumptions about sampling, and can model the signal as a continuous function. We are also able to model the noise in the data, and thus account for heteroscedastic noise. We also gain the ability to directly handle multiple wavelengths, by constructing a multi-dimensional covariance function. Hence, Gaussian Process Regression (GPR) is a machine learning technique that is able to model non-periodic signals in unevenly sampled data, and is thus well suited for the analysis of astronomical time-series data.

However, GPR is not without its challenges. The most popular covariance functions are often not tailored to modelling complex signals, which implies that users must construct their own covariance functions. GPs are also computationally expensive, and thus approximations must be used to scale to large datasets. Finally, many GP packages either have very steep learning curves or only provide limited features, and thus are not suitable for the average astronomer.

In this paper we present a new Python package, pgmuvi, to perform GPR on multi-wavelength astronomical time-series data. The package is designed to be easy to use, and provide a quick way to perform GPR on multi-wavelength data by exploiting uncommon but powerful kernels and appoximations. The package is designed to be flexible and allow the user to customize the GPR model to their needs. pgmuvi exploits multiple strategies to scale regression to large datasets, making it suitable for the current era of large-scale astronomical surveys.

A number of other software packages exist to perform GPR, some of which, such as celerite [@celerite1; @celerite2], tinygp [@tinygp] or george [@ambikasaran2015george] were developed within the astronomical community with astronomical time-series in mind. However, these each have their own limitations. celerite is extremely fast, but is limited to one-dimensional inputs, and thus cannot handle multiwavelength data, except under certain restrictive assumptions. Furthermore, since it achieves its speed through a specific form of kernel decomposition, it is not able to handle arbitrary covariance functions. It is therefore restricted to a small number of kernels with specific forms; while it is able to handle the most common astronomical timeseries by combining these kernels, not all signals can be modelled in this way. tinygp is a more general package, able to retain a high degree of flexibility while still being fast thanks to a JAX-based implementation, whic makes it feasible to implement models in tinygp that are equivalent to those in pgmuvi. However, pgmuvi is designed to provide an easier learning curve by packaging GPs with data transforms and inference routines. In essence, tinygp could in principle be used by pgmuvi as a GP backend instead of GPyTorch. For a summary of the state of the art of GPR in astronomy, see the recent review by @arev_2023_gps.

pgmuvi is used in two ongoing projects by our group: one of the authors' (DAVT) masters thesis and the paper resulting from this work deals with the analysis of multiwavelength light curves for targets from the Nearby Evolved Stars Survey (NESS; @Scicluna2022, https://evolvedstars.space). This work served as the first test of the code and has analyzed thousands of light curves at optical and infrared wavelengths for over seven hundred dusty stars within 3 kpc of the Solar Neighborhood. The paper will be published in 2023 (Vasquez-Torres et al., in prep.). A different project related to dusty variable stars in M33 has also used pgmuvi to estimate the periods of these objects from infrared light curves. This work will be published in 2023 (Srinivasan et al., in prep.).

Method and Features

pgmuvi builds on the popular GPyTorch library. GPyTorch [@gardner2018gpytorch] is a Gaussian process library for PyTorch [@pytorch], which is a popular machine learning library for Python. By default, pgmuvi exploits the highly-flexible Spectral Mixture kernel [@wilson:2013] in GPyTorch, able to model a wide variety of signals. This kernel is particularly interesting for astronomical time-series data, as it can effectively model multi-periodic and quasi-periodic signals. The spectral mixture kernel models the power spectrum of the covariance matrix as Gaussian mixture model (GMM), making it highly flexible, easy to interpret and adaptable to multi-dimensional data. This kernel is known for its ability to extrapolate, and is thus well suited to cases where prediction is important (for example, preparing astronomical observations of variable stars). By modelling the power spectrum in this way, pgmuvi effectively filters out noise in the periodogram, and thus is able to find the dominant periods in noisy data more effectively than, for example, the Lomb-Scargle periodogram.

However, the flexibility of this kernel comes at a cost; for more than one component in the mixture, the solution space becomes highly non-convex, and the optimization of the kernel hyperparameters becomes difficult. pgmuvi addresses this by first exploiting the Lomb-Scargle periodogram to find the dominant periods in the data, and then using these periods as initial guesses for the means of the mixture components.

Multiple options are available to accelerate inference depending on the size of the dataset. For small datasets, the exact GPs can be used to scale to datasets of up to $\sim1000$ points. pgmuvi can exploit the Structured Kernel Interpolation (SKI) approximation [@wilson2015kernel] to scale to datasets of up to $\sim10^5$ points. Future work will include implementing approximations for even larger datasets: pgmuvi can in principle exploit the Sparse Variational GP (SVGP) or Variational Nearest Neighbour (VNN) approximations [@hensman2013gaussian; @wu2022variational] to scale to datasets of arbitrary size. pgmuvi can employ also GPU computing for both exact and variational GPs.

For exact GPs and SKI, pgmuvi performs maximum a posteriori (MAP) estimation of the hyperparameters, or full Bayesian inference. MAP estimation can exploit any PyTorch optimizer, but by default it uses ADAM [@kingma2014adam]. Bayesian inference uses the pyro [@bingham2018pyro] implementation of the No-U-Turn Sampler (NUTS) [@hoffman2014no], which is a Hamiltonian Monte Carlo (HMC) sampler.

Finally, by allowing arbitrary GPyTorch likelihoods to be used, pgmuvi can be extended to a wide range of problems. For example, an instance of gpytorch.likelihoods.StudentTLikelihood can be dropped in to turn pgmuvi into a T-Process regressor, or missing data can be handled using gpytorch.likelihoods.GaussianLikelihoodWithMissingObs.

To summarise, the key features of pgmuvi are:

  • Highly flexible kernel, able to model a wide range of multiwavelength signals
  • Able to exploit multiple strategies to scale to large datasets
  • GPU acceleration for both exact and variational GPs
  • Fully Bayesian inference using HMC
  • Initialisation of kernel hyperparameters using Lomb-Scargle periodogram
  • Automated creation of diagnostic and summary plots
  • Automated reporting of kernel hyperparameters and their uncertainties, and summary of MCMC chains.

Acknowledgements

This project was developed in part at the 2022 Astro Hack Week, hosted by the Max Planck Institute for Astronomy and Haus der Astronomie in Heidelberg, Germany. This work was partially supported by the Max Planck Institute for Astronomy, the European Space Agency, the Gordon and Betty Moore Foundation, the Alfred P. Sloan foundation.

SS and DAVT acknowledge support from UNAM-PAPIIT Program IA104822.

References