# Downsampling

Aggregating data to a regular, lower frequency is a common time series task. The data you're aggregating doesn't need to be at a fixed frequency; the desired frequency defines bin edges that are used to slice the time series into pieces to aggregate. For example, to convert to monthly, 'M' or 'BM', the data need to be chopped up into one-month intervals. Each interval is said to be half-open; a data point can belong to only one interval, and the union of the intervals must make up the whole time frame. There are a couple of things to think about when using resample to downsample data:

- Which side of each interval is closed
- How to label each aggregated bin, either with the start of the interval or the end

To illustrate, let's look at some one-minute data:

In [1]:
import pandas as pd
import numpy as np
from pandas import DataFrame, Series

In [2]:
rng = pd.date_range('1/1/2000', periods = 12, freq = 'T')

ts = Series(np.arange(len(rng)), index = rng)

ts

2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int32

Suppose you wanted to aggregate this data into five-minute chunks, or bars, by taking the sum of each group:

In [3]:
ts.resample('5min').sum()

2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int32

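The frequency you pass defines bin edges in five-minute increments. For this frequency the left bin edge is inclusive by default, so the 00:00 value is included in the 00:00 to 00:05 interval and the 00:05 value opens the next one. (The right edge is the default only for a few end-anchored frequencies such as 'M' and 'W'.) Passing closed='left' states the default explicitly and gives the same result, whereas closed='right' would close each interval on the right instead: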

In [4]:
ts.resample('5min', closed='left').sum()

2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int32
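To make the effect of closed concrete, here is a minimal sketch that rebuilds the same twelve-point series and sums it with both settings. It assumes a pandas version where 'min' is accepted as the minute alias; the variable names left_closed and right_closed are only illustrative.

```python
import numpy as np
import pandas as pd

# Same twelve one-minute observations as above ('min' is the minute alias).
rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(len(rng)), index=rng)

# Left-closed bins: [00:00, 00:05), [00:05, 00:10), [00:10, 00:15)
left_closed = ts.resample('5min', closed='left').sum()

# Right-closed bins: (23:55, 00:00], (00:00, 00:05], (00:05, 00:10], (00:10, 00:15]
# The 00:00 observation now sits alone in a bin that starts before the data.
right_closed = ts.resample('5min', closed='right').sum()

print(left_closed)
print(right_closed)
```

With closed='right' the 00:05 value moves into the interval ending at 00:05, so the sums change and an extra leading bin appears for the 00:00 observation.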

As you can see, the resulting time series is labeled by the timestamps from the left side of each bin, which is the default for this frequency. Passing label='left' states that choice explicitly, while label='right' would label each bin with its right edge:

In [5]:
ts.resample('5min', closed='left', label = 'left').sum()

2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int32

![5-minute resampling illustration of closed, label conventions](../../Pictures/5-minute%20resampling%20illustration%20of%20closed%2C%20label%20conventions.png)
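The label argument only changes which bin edge is used as the timestamp, not which observations fall into each bin. The following sketch, again on a rebuilt copy of the one-minute series and with illustrative variable names, compares the two choices for left-closed bins.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(len(rng)), index=rng)

# Identical left-closed bins, labeled by their left and right edges respectively.
left_labeled = ts.resample('5min', closed='left', label='left').sum()
right_labeled = ts.resample('5min', closed='left', label='right').sum()

# The sums agree; only the index differs, by exactly one bin width.
print(left_labeled)
print(right_labeled)
print(right_labeled.index - left_labeled.index)
```

Labeling with the right edge is common when a bar should be stamped with the time at which it becomes complete.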

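Lastly, you might want to shift the result index by some amount, say subtracting one second from each label to make it clearer which interval the timestamp refers to. In older versions of pandas you could do this by passing a string or date offset to loffset. That argument was deprecated in pandas 1.1 and removed in pandas 2.0; the pandas documentation describes the replacement as adding the offset to the result index after resampling, so that

>>> df.resample("3s", loffset="8H")

becomes:

>>> from pandas.tseries.frequencies import to_offset
>>> df = df.resample("3s").mean()
>>> df.index = df.index.to_timestamp() + to_offset("8H")

The to_timestamp() call applies when the result has a PeriodIndex. On a pandas version that still accepts loffset, the call for the series above is:

In [9]:
ts.resample('5min', loffset='-1s').agg('sum')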

1999-12-31 23:59:59    10
2000-01-01 00:04:59    35
2000-01-01 00:09:59    21
Freq: 5T, dtype: int32

The same result can be obtained without *loffset* by calling the *shift* method with a *freq* argument on the resampled result.
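Because loffset is not available in pandas 2.0 and later, here is a hedged sketch of two ways to reproduce the one-second shift on current versions: subtracting a Timedelta from the result index, and calling shift with a frequency. The names by_index and by_shift are illustrative only.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(len(rng)), index=rng)

result = ts.resample('5min').sum()

# Option 1: move every label back by one second by editing the index directly.
by_index = result.copy()
by_index.index = result.index - pd.Timedelta(seconds=1)

# Option 2: shift the index (not the values) by -1 step of one second.
by_shift = result.shift(-1, freq='s')

print(by_index)
print(by_shift)
```

Both options leave the aggregated values untouched and change only the timestamps, which is what loffset did.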

## Open-High-Low-Close (OHLC) resampling

In finance, a ubiquitous way to aggregate a time series is to compute four values for each bucket: the first (open), last (close), maximum (high), and minimum (low) values. By aggregating with *'ohlc'* you will obtain a DataFrame having columns containing these four aggregates, which are efficiently computed in a single sweep of the data:

In [8]:
ts.resample('5min').agg('ohlc')

| | open | high | low | close |
|---|---|---|---|---|
| 2000-01-01 00:00:00 | 0 | 4 | 0 | 4 |
| 2000-01-01 00:05:00 | 5 | 9 | 5 | 9 |
| 2000-01-01 00:10:00 | 10 | 11 | 10 | 11 |
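To show what the OHLC aggregation computes, this sketch compares the resampler's ohlc() method with the four underlying aggregations requested one by one. The renaming step and the variable names bars and manual are illustrative choices, not part of the original example.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(len(rng)), index=rng)

# Built-in OHLC aggregation: one row per five-minute bar.
bars = ts.resample('5min').ohlc()

# The same four statistics, requested individually and renamed to match.
manual = ts.resample('5min').agg(['first', 'max', 'min', 'last'])
manual.columns = ['open', 'high', 'low', 'close']

print(bars)
print((bars.values == manual.values).all())
```

The column order open, high, low, close corresponds to first, max, min, and last within each bin.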


## Resampling with GroupBy

An alternative way to downsample is to use pandas's *groupby* functionality. For example, you can group by month or weekday by passing a function that accesses those fields on the time series's index:

In [None]:
rng = pd.date_range('1/1/2020', periods=100, freq='D')

ts = Series(np.arange(len(rng)), index=rng)

In [None]:
ts.groupby(lambda x: x.month).sum()

1     465
2    1305
3    2325
4     855
dtype: int32

In [None]:
ts.groupby(lambda x: x.weekday()).mean()

0    50.5
1    51.5
2    49.0
3    50.0
4    47.5
5    48.5
6    49.5
dtype: float64
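As a possible alternative to passing a function, the grouping keys can be taken directly from the index's vectorized attributes. This is a sketch of the same two aggregations; by_month and by_weekday are illustrative names.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2020-01-01', periods=100, freq='D')
ts = pd.Series(np.arange(len(rng)), index=rng)

# Group on arrays derived from the DatetimeIndex instead of a per-label function.
by_month = ts.groupby(ts.index.month).sum()         # keys 1..4 (Jan..Apr)
by_weekday = ts.groupby(ts.index.dayofweek).mean()  # keys 0 (Monday) .. 6 (Sunday)

print(by_month)
print(by_weekday)
```

Note that grouping by month number or weekday pools observations across years and weeks, so unlike resample it does not produce a time-ordered result.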