# Resampling data

This notebook will demonstrate how to resample data with Lilio.
Lilio can resample pandas `DataFrame` and `Series` objects, as well as xarray `DataArray` and `Dataset` objects.

We start by importing the required libraries and generating an example pandas `Series` and `DataFrame`:

In [None]:
import numpy as np
import xarray as xr
import pandas as pd
import lilio

time_index = pd.date_range('20171020', '20211001', freq='15d')
random_data = np.random.random(len(time_index))
example_series = pd.Series(random_data, index=time_index)
example_dataframe = pd.DataFrame(example_series.rename('data1'))
example_dataframe['data2'] = example_dataframe['data1']

example_series.head(3)

The DataFrame looks similar but has two named columns:

In [None]:
example_dataframe.head(3)

To resample, we need to set up a calendar with an anchor date and an interval length.
In this case we use the `daily_calendar` shorthand.

(Passing `n_precursors` and `allow_overlap` is optional, but doing so lets us demonstrate that resampling works even when intervals overlap.)

In [None]:
calendar = lilio.daily_calendar(
    anchor="10-15",
    length='90d',
    n_precursors=4,
    allow_overlap=True
)
calendar.map_years(2018, 2020)
calendar.visualize()
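
To make the interval layout concrete, the sketch below rebuilds the boundaries for a single anchor year with plain pandas. It is not Lilio's implementation: it assumes that the target interval starts at the anchor, that precursors are stacked backwards from the anchor, and that intervals are left-closed. The variable names are illustrative only.

```python
import pandas as pd

# Assumptions for this sketch: the target starts at the anchor, precursors
# are stacked backwards from the anchor, and intervals are left-closed.
anchor = pd.Timestamp("2018-10-15")
length = pd.Timedelta("90d")
n_precursors = 4

# Target interval: starts at the anchor date and runs forward in time.
target = pd.Interval(anchor, anchor + length, closed="left")

# Precursor intervals: the first ends at the anchor, the rest follow backwards.
precursors = [
    pd.Interval(anchor - (i + 1) * length, anchor - i * length, closed="left")
    for i in range(n_precursors)
]

print("target:", target)
for i, interval in enumerate(precursors, start=1):
    print(f"precursor {i}:", interval)

# Total time covered by one anchor year's intervals.
print("span:", target.right - precursors[-1].left)
```

Under these assumptions, four 90-day precursors plus one 90-day target span 450 days, which is more than a year. That is why the intervals of consecutive anchor years overlap here.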

Next we pass the example data to the `resample` function. This requires a mapped calendar and the input data.

By default, `resample` takes the mean of all data points that fall within each interval. Many other statistics are also available, such as `min`, `max`, `median`, and `std`. For a full list, see the docstring:
```py
help(lilio.resample)
```
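
As a rough illustration of what reducing an interval to a single statistic means, the sketch below selects the points of one hypothetical interval by hand and applies several statistics to them. It uses plain pandas rather than Lilio, and it assumes a left-closed interval; the dates and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Small daily series with known values 0, 1, 2, ..., 9.
index = pd.date_range("2018-10-10", periods=10, freq="1d")
series = pd.Series(np.arange(10, dtype=float), index=index)

# One hypothetical left-closed interval: from 2018-10-12 up to,
# but not including, 2018-10-15.
start, end = pd.Timestamp("2018-10-12"), pd.Timestamp("2018-10-15")
in_interval = series[(series.index >= start) & (series.index < end)]

# Different statistics reduce the same selection to one value each.
stats = in_interval.agg(["mean", "min", "max", "median", "std"])
print(stats)
```

The interval contains the values 2, 3 and 4, so the mean and median are 3, the minimum is 2, the maximum is 4, and the sample standard deviation is 1.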

In [None]:
resampled_series = lilio.resample(calendar, example_series, how="mean")
resampled_series

As you see above, this generates a new DataFrame containing the data resampled for each interval, along with the corresponding interval index, and the anchor year that the interval belongs to.

This works the same if you input a pandas DataFrame:

In [None]:
resampled_dataframe = lilio.resample(calendar, example_dataframe)
resampled_dataframe

## Resampling `xarray` data

Resampling works the same for an `xarray` `Dataset`. Let's make an example dataset with latitude and longitude coordinates:

In [None]:
import xarray as xr

time_index = pd.date_range('20171020', '20211001', freq='15d')

np.random.seed(0)
temperature = 15 + 8 * np.random.randn(2, 2, len(time_index))
precipitation = 10 * np.random.rand(2, 2, len(time_index))

lon = [[-99.83, -99.32], [-99.79, -99.23]]
lat = [[42.25, 42.21], [42.63, 42.59]]


ds = xr.Dataset(
    data_vars=dict(
        temperature=(["x", "y", "time"], temperature),
        precipitation=(["x", "y", "time"], precipitation),
    ),
    coords=dict(
        lon=(["x", "y"], lon),
        lat=(["x", "y"], lat),
        time=time_index,
    ),
    attrs=dict(description="Weather related data."),
)
ds

As you can see below, the `temperature` and `precipitation` variables can easily be resampled to a new index using the intervals specified by the calendar.

This index dimension has the anchor year and the interval count as its coordinates.

Note: both the `temperature` and `precipitation` variables keep their `lat` and `lon` coordinates.

In [None]:
ds_r = lilio.resample(calendar, ds)
ds_r

## Calculating bin counts
To check if you have sufficient data for each of the Calendar's intervals, you can make use of `resample(how="size")`. This will give you the number of data points that are within each interval.

This is especially useful when the intervals vary in size or when your data is sparse.

As an example, let's make a calendar with intervals of varying lengths:

In [None]:
calendar = lilio.Calendar(anchor="10-15")
calendar.add_intervals("target", length="5d")
calendar.add_intervals("precursor", length="1d")
calendar.add_intervals("precursor", length="3d")
calendar.add_intervals("precursor", length="10d")
calendar.map_years(2018, 2018)
calendar.visualize()

Now if we resample a dataset with a 1-day frequency, using `how="size"`, you can see that the smallest interval contains only a single data point, while the largest interval contains ten.

Some of the resampling methods (such as "min" or "std") of course would not make sense with so few data points per interval.

In [None]:
time_index = pd.date_range('20171020', '20191001', freq='1d')
random_data = np.random.random(len(time_index))
example_series = pd.Series(random_data, index=time_index)
example_dataframe = pd.DataFrame(example_series.rename('data1'))

lilio.resample(calendar, example_dataframe, how="size")
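
The counting itself can be sketched without Lilio. The example below lays out four hypothetical intervals that mirror the calendar above and counts the daily timestamps falling in each one. It assumes left-closed intervals and assumes that the first precursor added sits closest to the anchor date; both are assumptions of this sketch, not statements about Lilio's internals.

```python
import pandas as pd

# Daily timestamps covering the hypothetical intervals below.
times = pd.date_range("2018-10-01", "2018-10-20", freq="1d")

# Hypothetical left-closed intervals around an anchor of 2018-10-15,
# given as (left, right) pairs: 10, 3 and 1 days before, 5 days after.
intervals = {
    "precursor 10d": (pd.Timestamp("2018-10-01"), pd.Timestamp("2018-10-11")),
    "precursor 3d": (pd.Timestamp("2018-10-11"), pd.Timestamp("2018-10-14")),
    "precursor 1d": (pd.Timestamp("2018-10-14"), pd.Timestamp("2018-10-15")),
    "target 5d": (pd.Timestamp("2018-10-15"), pd.Timestamp("2018-10-20")),
}

# Count the timestamps that satisfy left <= t < right for each interval.
counts = {
    name: int(((times >= left) & (times < right)).sum())
    for name, (left, right) in intervals.items()
}
print(counts)
```

With daily data, each count equals the interval length in days. With sparser or irregular data the counts would differ, which is what makes this check useful.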

## Custom resampling methods
The `how` argument also accepts any function that takes a single input argument and returns a single output value. This allows you to use custom-defined functions to resample the data.

For example:

In [None]:
def root_mean_squared(data):
    return np.mean(data ** 2) ** 0.5

lilio.resample(calendar, example_dataframe, how=root_mean_squared)
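
Before passing a custom function to `resample`, it can help to check it on a small input with a known answer. The sketch below does this for `root_mean_squared`. It also shows one possible way to turn a function with an extra parameter into a single-argument callable using `functools.partial`. The `quantile` helper and the `upper_decile` name are hypothetical, and whether a given callable suits your data is something to verify yourself.

```python
from functools import partial

import numpy as np
import pandas as pd


def root_mean_squared(data):
    return np.mean(data ** 2) ** 0.5


def quantile(data, q):
    # Hypothetical helper with an extra parameter `q`.
    return np.quantile(data, q)


# Fix `q` so that only a single input argument remains.
upper_decile = partial(quantile, q=0.9)

# Tiny sample with answers that are easy to check by hand.
sample = pd.Series([3.0, 4.0])
print(root_mean_squared(sample))  # sqrt((9 + 16) / 2) = sqrt(12.5)
print(upper_decile(sample))       # linear interpolation: 3 + 0.9 * (4 - 3)
```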