# TimeSeries Operations

In this lesson we'll explore time shifting and resampling (grouping), two of the most common operations on time series.

In [1]:
import pandas as pd
import numpy as np

## Time Shifting

In [2]:
ts = pd.Series(
    np.random.randn(10) * 10 + 500,
    index=pd.date_range(start='2018-01-01', periods=10, freq='D'))

In [3]:
ts

2018-01-01    501.234286
2018-01-02    493.716639
2018-01-03    504.774312
2018-01-04    507.286344
2018-01-05    484.088206
2018-01-06    503.133241
2018-01-07    497.514691
2018-01-08    491.456866
2018-01-09    504.404706
2018-01-10    510.683477
Freq: D, dtype: float64

In [4]:
ts.shift(1)

2018-01-01           NaN
2018-01-02    501.234286
2018-01-03    493.716639
2018-01-04    504.774312
2018-01-05    507.286344
2018-01-06    484.088206
2018-01-07    503.133241
2018-01-08    497.514691
2018-01-09    491.456866
2018-01-10    504.404706
Freq: D, dtype: float64

In [5]:
pd.DataFrame({
    'Original': ts,
    'Shift (1)': ts.shift(1),
    'Shift (2)': ts.shift(2)
})

| | Original | Shift (1) | Shift (2) |
|---|---|---|---|
| 2018-01-01 | 501.234286 | NaN | NaN |
| 2018-01-02 | 493.716639 | 501.234286 | NaN |
| 2018-01-03 | 504.774312 | 493.716639 | 501.234286 |
| 2018-01-04 | 507.286344 | 504.774312 | 493.716639 |
| 2018-01-05 | 484.088206 | 507.286344 | 504.774312 |
| 2018-01-06 | 503.133241 | 484.088206 | 507.286344 |
| 2018-01-07 | 497.514691 | 503.133241 | 484.088206 |
| 2018-01-08 | 491.456866 | 497.514691 | 503.133241 |
| 2018-01-09 | 504.404706 | 491.456866 | 497.514691 |
| 2018-01-10 | 510.683477 | 504.404706 | 491.456866 |
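A positive shift moves values forward in time (a lag). As a minimal sketch on made-up numbers (the series `s` and its values are assumptions chosen so the result is easy to check by eye), a negative shift moves them backwards (a lead), and `fill_value` replaces the `NaN` that the shift leaves behind:

```python
import pandas as pd

# Hypothetical toy data, not the random series used in the lesson
s = pd.Series(
    [10.0, 20.0, 30.0],
    index=pd.date_range(start='2018-01-01', periods=3, freq='D'))

lagged = s.shift(1)                  # each date receives the previous day's value
led = s.shift(-1)                    # each date receives the next day's value
padded = s.shift(1, fill_value=0.0)  # fill the gap with 0.0 instead of NaN

print(pd.DataFrame({'s': s, 'lagged': lagged, 'led': led, 'padded': padded}))
```

Whether a filled gap or a `NaN` is preferable depends on the calculation that follows: a `NaN` keeps the missing comparison visible, while a fill value silently takes part in later arithmetic.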


Shifting is typically used to compare a time series with its own earlier values, for example to calculate the percent change over the previous period:

In [6]:
df = pd.DataFrame({
    'Original': ts,
    'Shifted': ts.shift(1)
})
df

| | Original | Shifted |
|---|---|---|
| 2018-01-01 | 501.234286 | NaN |
| 2018-01-02 | 493.716639 | 501.234286 |
| 2018-01-03 | 504.774312 | 493.716639 |
| 2018-01-04 | 507.286344 | 504.774312 |
| 2018-01-05 | 484.088206 | 507.286344 |
| 2018-01-06 | 503.133241 | 484.088206 |
| 2018-01-07 | 497.514691 | 503.133241 |
| 2018-01-08 | 491.456866 | 497.514691 |
| 2018-01-09 | 504.404706 | 491.456866 |
| 2018-01-10 | 510.683477 | 504.404706 |


In [7]:
(df['Original'] / df['Shifted']) - 1

2018-01-01         NaN
2018-01-02   -0.014998
2018-01-03    0.022397
2018-01-04    0.004977
2018-01-05   -0.045730
2018-01-06    0.039342
2018-01-07   -0.011167
2018-01-08   -0.012176
2018-01-09    0.026346
2018-01-10    0.012448
Freq: D, dtype: float64

You can see how much sales grew or shrank vs. the previous day.

This is a particularly silly example, because pandas has a method specifically intended for percentage changes: [`pct_change()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pct_change.html), so we don't even need `shift`:

In [8]:
ts.pct_change()

2018-01-01         NaN
2018-01-02   -0.014998
2018-01-03    0.022397
2018-01-04    0.004977
2018-01-05   -0.045730
2018-01-06    0.039342
2018-01-07   -0.011167
2018-01-08   -0.012176
2018-01-09    0.026346
2018-01-10    0.012448
Freq: D, dtype: float64
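To check that the two approaches above agree, here is a small sketch on hand-picked numbers (the `prices` series is an assumption, used only so the expected changes of +10% and -10% are obvious):

```python
import pandas as pd

# Hypothetical values: 100 -> 110 is +10%, 110 -> 99 is -10%
prices = pd.Series(
    [100.0, 110.0, 99.0],
    index=pd.date_range(start='2018-01-01', periods=3, freq='D'))

manual = prices / prices.shift(1) - 1   # the shift-based calculation
builtin = prices.pct_change()           # the dedicated pandas method

# Both leave NaN in the first row, because there is no previous value
print(pd.DataFrame({'manual': manual, 'builtin': builtin}))
```

The comparison is best done with a tolerance (or after rounding), since both results are floating point numbers.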

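Shifting also works with a `freq` argument. In that case the values stay where they are and the timestamps themselves are moved, which lets you shift by an interval smaller than the spacing of the index: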

In [9]:
ts.shift(1, freq='15Min')

2018-01-01 00:15:00    501.234286
2018-01-02 00:15:00    493.716639
2018-01-03 00:15:00    504.774312
2018-01-04 00:15:00    507.286344
2018-01-05 00:15:00    484.088206
2018-01-06 00:15:00    503.133241
2018-01-07 00:15:00    497.514691
2018-01-08 00:15:00    491.456866
2018-01-09 00:15:00    504.404706
2018-01-10 00:15:00    510.683477
Freq: D, dtype: float64
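The difference between the two kinds of shift can be sketched side by side. The series `s` below is made-up data, and the alias is spelled `'15min'` here on the assumption that the lowercase form is accepted across pandas versions:

```python
import pandas as pd

# Hypothetical toy data
s = pd.Series(
    [10.0, 20.0, 30.0],
    index=pd.date_range(start='2018-01-01', periods=3, freq='D'))

by_rows = s.shift(1)                 # index unchanged, values move down one row
by_time = s.shift(1, freq='15min')   # values unchanged, index moves by 15 minutes

print(by_rows)
print(by_time)
```

Note that the time-based shift loses no data: every value is still there, so no `NaN` appears.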

## Time Frequency

We'll now see how to change the frequency of an index. `asfreq` is a raw adjustment: it builds a new index at the requested frequency and keeps only the values whose timestamps match exactly, without aggregating anything:

In [10]:
ts = pd.Series(
    np.random.randn(10) * 10 + 500,
    index=pd.date_range(start='2018-01-01', periods=10, freq='H'))
ts

2018-01-01 00:00:00    495.923528
2018-01-01 01:00:00    500.470628
2018-01-01 02:00:00    506.498649
2018-01-01 03:00:00    493.943460
2018-01-01 04:00:00    505.499514
2018-01-01 05:00:00    498.041784
2018-01-01 06:00:00    496.484607
2018-01-01 07:00:00    489.843080
2018-01-01 08:00:00    484.456180
2018-01-01 09:00:00    506.811008
Freq: H, dtype: float64

In [11]:
ts.asfreq('45min')

2018-01-01 00:00:00    495.923528
2018-01-01 00:45:00           NaN
2018-01-01 01:30:00           NaN
2018-01-01 02:15:00           NaN
2018-01-01 03:00:00    493.943460
2018-01-01 03:45:00           NaN
2018-01-01 04:30:00           NaN
2018-01-01 05:15:00           NaN
2018-01-01 06:00:00    496.484607
2018-01-01 06:45:00           NaN
2018-01-01 07:30:00           NaN
2018-01-01 08:15:00           NaN
2018-01-01 09:00:00    506.811008
Freq: 45T, dtype: float64

In [12]:
ts.asfreq('45Min', method='ffill')

2018-01-01 00:00:00    495.923528
2018-01-01 00:45:00    495.923528
2018-01-01 01:30:00    500.470628
2018-01-01 02:15:00    506.498649
2018-01-01 03:00:00    493.943460
2018-01-01 03:45:00    493.943460
2018-01-01 04:30:00    505.499514
2018-01-01 05:15:00    498.041784
2018-01-01 06:00:00    496.484607
2018-01-01 06:45:00    496.484607
2018-01-01 07:30:00    489.843080
2018-01-01 08:15:00    484.456180
2018-01-01 09:00:00    506.811008
Freq: 45T, dtype: float64

In [13]:
ts.asfreq('45Min', method='bfill')

2018-01-01 00:00:00    495.923528
2018-01-01 00:45:00    500.470628
2018-01-01 01:30:00    506.498649
2018-01-01 02:15:00    493.943460
2018-01-01 03:00:00    493.943460
2018-01-01 03:45:00    505.499514
2018-01-01 04:30:00    498.041784
2018-01-01 05:15:00    496.484607
2018-01-01 06:00:00    496.484607
2018-01-01 06:45:00    489.843080
2018-01-01 07:30:00    484.456180
2018-01-01 08:15:00    506.811008
2018-01-01 09:00:00    506.811008
Freq: 45T, dtype: float64

In [14]:
ts.asfreq?
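The three filling behaviours can be compared on a tiny series. This is only a sketch: the data is made up, and `'60min'` is used in place of an hourly alias on the assumption that its spelling is stable across pandas versions. Besides `method`, `asfreq` also accepts a `fill_value` for the newly created timestamps:

```python
import pandas as pd

# Hypothetical hourly data: 00:00, 01:00, 02:00
hourly = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.date_range(start='2018-01-01', periods=3, freq='60min'))

raw = hourly.asfreq('30min')                     # new timestamps are NaN
ffilled = hourly.asfreq('30min', method='ffill') # copy the previous known value
zeros = hourly.asfreq('30min', fill_value=0.0)   # use a constant for new timestamps

print(pd.DataFrame({'raw': raw, 'ffilled': ffilled, 'zeros': zeros}))
```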

In these examples, we've gone from a "less frequent" index to a "more frequent" index. But we could go the other way:

In [15]:
ts = pd.Series(
    np.random.randn(20) * 10 + 500,
    index=pd.date_range(start='2018-01-01', periods=20, freq='30min'))
ts

2018-01-01 00:00:00    495.781759
2018-01-01 00:30:00    484.041645
2018-01-01 01:00:00    507.895081
2018-01-01 01:30:00    487.605316
2018-01-01 02:00:00    508.826438
2018-01-01 02:30:00    504.814993
2018-01-01 03:00:00    491.729141
2018-01-01 03:30:00    496.421283
2018-01-01 04:00:00    510.332442
2018-01-01 04:30:00    496.612527
2018-01-01 05:00:00    485.377065
2018-01-01 05:30:00    504.601887
2018-01-01 06:00:00    490.598199
2018-01-01 06:30:00    514.362441
2018-01-01 07:00:00    493.276122
2018-01-01 07:30:00    504.709501
2018-01-01 08:00:00    475.797679
2018-01-01 08:30:00    500.485632
2018-01-01 09:00:00    504.972745
2018-01-01 09:30:00    500.370554
Freq: 30T, dtype: float64

In [16]:
ts.asfreq('2H')

2018-01-01 00:00:00    495.781759
2018-01-01 02:00:00    508.826438
2018-01-01 04:00:00    510.332442
2018-01-01 06:00:00    490.598199
2018-01-01 08:00:00    475.797679
Freq: 2H, dtype: float64

In [17]:
ts.asfreq('2H25min')

2018-01-01 00:00:00    495.781759
2018-01-01 02:25:00           NaN
2018-01-01 04:50:00           NaN
2018-01-01 07:15:00           NaN
Freq: 145T, dtype: float64

In [18]:
ts.asfreq('2H25min', method='ffill')

2018-01-01 00:00:00    495.781759
2018-01-01 02:25:00    508.826438
2018-01-01 04:50:00    496.612527
2018-01-01 07:15:00    493.276122
Freq: 145T, dtype: float64
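When going to a lower frequency, `asfreq` simply picks the rows that fall on the new timestamps and discards everything in between. A minimal sketch, assuming made-up values 0 to 7 and the alias `'120min'` as a version-neutral stand-in for two hours:

```python
import pandas as pd

# Hypothetical half-hourly data from 00:00 to 03:30
half_hourly = pd.Series(
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    index=pd.date_range(start='2018-01-01', periods=8, freq='30min'))

two_hourly = half_hourly.asfreq('120min')

# Only the values at 00:00 and 02:00 survive; the others are not aggregated
print(two_hourly)
print(two_hourly.equals(half_hourly.iloc[::4]))
```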

But what if you want a more "advanced" fill, for example filling the values at the new frequency with the mean? For that, we'll use resampling.

### Resampling

Resampling a time series means converting it to another time frequency. If you're going from a high frequency to a low frequency, the process is called "downsampling", and it involves an aggregation step. For example, you have daily sales data and you want to aggregate it by month. You'll be "grouping" your daily sales per month, and you need to decide which aggregation to perform: for example, `sum` to get the total sales per month, or `mean` to get the average sale. Let's use an example:

In [23]:
all_days_2018 = pd.date_range(start='2018-01-01', end='2018-12-31', freq='D')
ts = pd.Series(
    np.random.randn(20) * 10 + 500,
    index=np.random.choice(all_days_2018, size=20))

ts.sort_index(inplace=True)
ts

2018-01-16    498.767961
2018-01-18    499.898482
2018-02-02    510.747610
2018-02-27    493.487023
2018-03-22    488.410878
2018-04-20    494.345752
2018-04-26    506.777721
2018-05-04    505.595108
2018-05-12    515.096785
2018-06-30    513.510190
2018-08-05    532.511595
2018-08-28    490.414338
2018-09-26    494.161538
2018-10-05    505.716774
2018-10-10    483.577099
2018-12-12    492.315935
2018-12-23    514.084808
2018-12-25    498.880580
2018-12-28    518.482966
2018-12-31    490.358918
dtype: float64

January sales:

In [24]:
ts['2018-01']

2018-01-16    498.767961
2018-01-18    499.898482
dtype: float64

In [25]:
ts['2018-01'].sum()

998.6664427720889

February sales:

In [26]:
ts['2018-02']

2018-02-02    510.747610
2018-02-27    493.487023
dtype: float64

In [27]:
ts['2018-02'].sum()

1004.2346331586655

**Downsampling**: We'll now use `resample` to "group" the sales monthly (downsampling our TimeSeries), and calculate the total sales per month:

In [28]:
ts.resample('M').sum()

2018-01-31     998.666443
2018-02-28    1004.234633
2018-03-31     488.410878
2018-04-30    1001.123473
2018-05-31    1020.691892
2018-06-30     513.510190
2018-07-31       0.000000
2018-08-31    1022.925933
2018-09-30     494.161538
2018-10-31     989.293873
2018-11-30       0.000000
2018-12-31    2514.123207
Freq: M, dtype: float64
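Notice that July and November, which have no sales at all, show up as `0.0`. If an empty month should be reported as missing instead, one option is the `min_count` argument of `sum`. The sketch below uses made-up sales and the month-start alias `'MS'` (it labels each month by its first day), chosen here on the assumption that its spelling is the same across pandas versions:

```python
import pandas as pd

# Hypothetical sales: two in January, none in February, one in March
sales = pd.Series(
    [5.0, 7.0, 4.0],
    index=pd.to_datetime(['2018-01-10', '2018-01-20', '2018-03-05']))

totals = sales.resample('MS').sum()                  # empty month sums to 0.0
totals_nan = sales.resample('MS').sum(min_count=1)   # empty month becomes NaN
counts = sales.resample('MS').count()                # number of sales per month

print(pd.DataFrame({'totals': totals, 'totals_nan': totals_nan, 'counts': counts}))
```

Looking at the counts next to the totals is a cheap way to tell "no sales recorded" apart from "sales that add up to zero".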

The parameter `M` means "month end frequency". We could instead choose "month start":

In [29]:
ts.resample('MS').sum()

2018-01-01     998.666443
2018-02-01    1004.234633
2018-03-01     488.410878
2018-04-01    1001.123473
2018-05-01    1020.691892
2018-06-01     513.510190
2018-07-01       0.000000
2018-08-01    1022.925933
2018-09-01     494.161538
2018-10-01     989.293873
2018-11-01       0.000000
2018-12-01    2514.123207
Freq: MS, dtype: float64

This of course yields the same totals, but the index now contains the first day of each month. Strictly speaking, in this example we're collecting the sales of _"the period January 2018"_. Pandas also has a `Period` type, which we can use through the `kind` parameter:

In [30]:
monthly_sales = ts.resample('M', kind='period').sum()
monthly_sales

2018-01     998.666443
2018-02    1004.234633
2018-03     488.410878
2018-04    1001.123473
2018-05    1020.691892
2018-06     513.510190
2018-07       0.000000
2018-08    1022.925933
2018-09     494.161538
2018-10     989.293873
2018-11       0.000000
2018-12    2514.123207
Freq: M, dtype: float64

In [31]:
monthly_sales.index

PeriodIndex(['2018-01', '2018-02', '2018-03', '2018-04', '2018-05', '2018-06',
             '2018-07', '2018-08', '2018-09', '2018-10', '2018-11', '2018-12'],
            dtype='period[M]', freq='M')

As you can see, the Index is a `PeriodIndex`. Each entry in the index is of type `pd.Period`: 

In [32]:
monthly_sales.index[0]

Period('2018-01', 'M')
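As far as I know, recent pandas releases deprecate the `kind` argument of `resample`. A possible alternative, sketched here on made-up data, is to resample on timestamps first and then convert the index with `to_period`:

```python
import pandas as pd

# Hypothetical sales: two in January, one in February
sales = pd.Series(
    [5.0, 7.0, 4.0],
    index=pd.to_datetime(['2018-01-10', '2018-01-20', '2018-02-05']))

monthly = sales.resample('MS').sum()       # DatetimeIndex: 2018-01-01, 2018-02-01
monthly_periods = monthly.to_period('M')   # PeriodIndex: 2018-01, 2018-02

print(monthly_periods)
print(type(monthly_periods.index).__name__)
```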

Periods support basic arithmetic operations, which makes them convenient for expressing these time ranges:

In [33]:
pd.Period('2018-01') + 5

Period('2018-06', 'M')

In [34]:
pd.Period('2018-01', freq='H') + 9

Period('2018-01-01 09:00', 'H')
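Beyond arithmetic, a `Period` knows where it starts and ends, which is what makes it a range rather than a single instant. A short sketch (the variable names are only illustrative):

```python
import pandas as pd

jan = pd.Period('2018-01', freq='M')
jun = jan + 5                       # five months later

print(jun)                          # 2018-06
print(jan.start_time)               # first instant of January 2018
print(jan.end_time)                 # last instant of January 2018
print(jan.days_in_month)            # 31

# A run of consecutive periods, analogous to pd.date_range
first_quarter = pd.period_range(start='2018-01', periods=3, freq='M')
print(first_quarter)
```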

**Upsampling**: With upsampling we'll convert a low-frequency time series to a higher frequency time series. We'll add more "time points". Let's use an example:

We'll start with 3 months of sales, only 3 data points:

In [35]:
ts = pd.Series(
    np.random.randn(3) * 10 + 500,
    index=pd.date_range(start='2018-01-01', periods=3, freq='MS'))
ts

2018-01-01    492.025978
2018-02-01    498.184135
2018-03-01    492.677314
Freq: MS, dtype: float64

We'll now `resample` it to "semi-month start" frequency (`SMS`), which places a point on the 1st and the 15th of each month:

In [36]:
ts.resample('SMS').asfreq()

2018-01-01    492.025978
2018-01-15           NaN
2018-02-01    498.184135
2018-02-15           NaN
2018-03-01    492.677314
Freq: SMS-15, dtype: float64

As you can see, we have a few missing values, because there is no data for those dates. What can you do with that missing data? One option is to fill it with the previous value:

In [37]:
ts.resample('SMS').ffill()

2018-01-01    492.025978
2018-01-15    492.025978
2018-02-01    498.184135
2018-02-15    498.184135
2018-03-01    492.677314
Freq: SMS-15, dtype: float64
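Forward filling repeats the last known value. Another option, sketched below under a few assumptions (made-up monthly values of 100 and 131, and a daily target frequency instead of `SMS` so the points are evenly spaced), is to interpolate linearly between the known points:

```python
import pandas as pd

# Hypothetical monthly data: 2018-01-01 and 2018-02-01
monthly = pd.Series(
    [100.0, 131.0],
    index=pd.date_range(start='2018-01-01', periods=2, freq='MS'))

daily = monthly.resample('D').asfreq()        # 32 days, 30 of them NaN
filled = daily.interpolate(method='linear')   # straight line between the two values

print(filled.head())
```

Which fill is appropriate depends on what the numbers mean: interpolation suits a quantity that changes gradually, while a forward fill suits a value that holds until the next observation.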