# Time Series

---

Time series data is an important form of structured data in many fields, such as finance, economics, ecology, neuroscience, and physics. Anything that is observed or measured at many points in time forms a `Time Series`. Many time series have a fixed frequency, although some are sampled at irregular intervals. **We will not deal with irregular frequency time series**.

In today's lecture, we'll look at the time series and date functionality in Pandas. Pandas handles dates and times flexibly, which makes a wider range of analyses possible. In fact, Wes McKinney originally created Pandas to handle date and time data when he worked as a consultant for hedge funds.


### Lecture outline

---

* Date and Time data types


* Dealing with Datetime Objects


* Indexing, Selection, Sub-setting


* Periods and Period Arithmetic


* Date and Time Conversion


* Time Shifting


* Resampling


* Moving Window Functions

#### Reference


[Timeseries](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#timeseries)


[Time series / date functionality](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html)


[Time deltas](https://pandas.pydata.org/pandas-docs/stable/user_guide/timedeltas.html)


[Windowing Operations](https://pandas.pydata.org/pandas-docs/stable/user_guide/window.html)


[datetime — Basic date and time types](https://docs.python.org/3/library/datetime.html)


[Python Datetime](https://www.w3schools.com/python/python_datetime.asp)

In [1]:
import pandas as pd

import numpy as np

import datetime

## Date and Time data types

---

Pandas has four main time related classes:

* `Timestamp`


* `DatetimeIndex`


* `Period`


* `PeriodIndex`


Before we investigate these time classes, we need to know what a `datetime` object is.

### datetime

---

Python has a built-in `datetime` module for working with dates and times. Dates and times are objects with date- and time-specific characteristics, so when we manipulate them we manipulate objects, not strings.


The `datetime` module consists of the following types:


* `date` - Stores a calendar date (year, month, day) using the Gregorian calendar


* `time` - Stores the time of day as hours, minutes, seconds, and microseconds


* `datetime` - Stores both date and time


* `timedelta` - Represents the difference between two datetime values (as days, seconds, and microseconds)


* `tzinfo` - Base type for storing time zone information
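
As a minimal sketch of how the types listed above fit together, the snippet below builds a `date`, a `time`, and a `datetime`, and obtains a `timedelta` by subtraction. The specific values are chosen only for illustration.

```python
import datetime

# Build the core types explicitly (illustrative values).
d = datetime.date(2021, 1, 5)          # calendar date
t = datetime.time(10, 15, 30)          # time of day
dt = datetime.datetime.combine(d, t)   # date and time together

# Subtracting two datetime objects yields a timedelta.
delta = datetime.datetime(2021, 1, 11) - datetime.datetime(2021, 1, 5, 12, 0)

print(dt)                         # the combined datetime
print(delta.days, delta.seconds)  # whole days and leftover seconds
```

A `timedelta` stores only days, seconds, and microseconds, so the half day in the difference appears as seconds rather than hours.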

In [6]:
now = datetime.datetime.now()

now

datetime.datetime(2021, 2, 13, 12, 23, 53, 929280)

In [7]:
type(now) # datetime object

datetime.datetime

We can extract these characteristics from a datetime object using the appropriate methods or attributes.

In [11]:
now.date() # Extract date

now.time() # Extract time

now.year # Extract year

now.month # Extract month

now.day # Extract day

now.hour # Extract hour

now.minute # Extract minute

now.second # Extract second

now.microsecond # Extract microsecond

929280

In [12]:
(now.second, now.microsecond)

(53, 929280)

### Timestamp

---

`Timestamp` represents a single point in time and is used to associate values with points in time. In other words, it's a specific instant in time.


For example, let's create a timestamp from the string `1/5/2021 10:05:55AM`.
Timestamp is interchangeable with Python's `datetime` in most cases.

In [21]:
pd.Timestamp('1/5/2021 10:05:55AM')

Timestamp('2021-01-05 10:05:55')
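
One reason the two are interchangeable is that `pd.Timestamp` is a subclass of `datetime.datetime`. The short check below is a sketch of that relationship; the variable names are illustrative.

```python
import datetime

import pandas as pd

ts = pd.Timestamp("2021-01-05 10:05:55")

# A Timestamp passes isinstance checks for the standard library type.
is_datetime = isinstance(ts, datetime.datetime)

# It compares equal to the matching datetime object.
same_instant = ts == datetime.datetime(2021, 1, 5, 10, 5, 55)

# to_pydatetime() converts back to a plain datetime when one is required.
py_dt = ts.to_pydatetime()

print(is_datetime, same_instant, type(py_dt))
```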

We can also create a timestamp by passing the year, month, day, hour, and minute as separate parameters.

In [22]:
pd.Timestamp(2021, 1, 5, 10, 15)

Timestamp('2021-01-05 10:15:00')

In [23]:
pd.Timestamp(2021, 1, 5, 10, 15).isoweekday() # Return the day of the week represented by the date. Monday == 1 … Sunday == 7

2

As with Python's built-in `datetime` module, we can extract different parts of a timestamp object using the appropriate methods and attributes.

In [24]:
single_timestamp = pd.Timestamp(2021, 1, 5, 10, 15, 23, 154, 4450)

single_timestamp

Timestamp('2021-01-05 10:15:23.000158450')

In [33]:
single_timestamp.date() # Extract date

single_timestamp.time() # Extract time

single_timestamp.year # Extract year

single_timestamp.month # Extract month as a number January == 1...December == 12

single_timestamp.month_name() # Return actual name of the month

single_timestamp.week # Return week number

single_timestamp.weekday() # Return weekday as a number Monday == 0 … Sunday == 6

single_timestamp.day_name() # Return actual name of the weekday

single_timestamp.hour # Extract hour

single_timestamp.minute # Extract minute

single_timestamp.second # Extract second

single_timestamp.microsecond # Extract microsecond

single_timestamp.nanosecond # Extract nanosecond

4450

### Period

---

If we are interested in a span of time, we should use the `Period` object rather than a datetime. A `Period` represents a fixed span of time, for example January 2021.

In [34]:
pd.Period(value="1/2021", freq="M") # A period object for January 2021

Period('2021-01', 'M')

In [35]:
pd.Period(value='1/5/2021', freq="D") # More granular period object - January 5th, 2021

Period('2021-01-05', 'D')

<div class="alert alert-info">

**Note:** We can extract date and time characteristics from a `Period` object just as we did with `Timestamp`.
    

> [**pandas.Period**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Period.html)

</div>
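
As a small sketch of the note above, a monthly `Period` exposes calendar attributes and also the boundaries of the span it covers. The variable name `p` is illustrative.

```python
import pandas as pd

p = pd.Period("2021-01", freq="M")

# Calendar attributes, as on a Timestamp.
print(p.year, p.month, p.days_in_month)

# A Period is a span, so it has a first and a last instant.
print(p.start_time)
print(p.end_time)
```

The `start_time` and `end_time` attributes are what distinguish a `Period` from a `Timestamp`: the former covers an interval, the latter marks one instant.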

### DatetimeIndex and PeriodIndex

---

The `PeriodIndex` class stores a sequence of `Period` objects and can serve as an axis index in any Pandas data structure. The `DatetimeIndex` class stores a sequence of `Timestamp` objects and can also serve as an axis index.



[pandas.DatetimeIndex](https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.html)


[pandas.PeriodIndex](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.PeriodIndex.html)

A Series indexed by Timestamps has a `DatetimeIndex`. In the series below, each Timestamp is an index label with a value associated with it, in this case `a`, `b`, `c`, `d`, and `e`.

In [36]:
t1 = pd.Series(list("abcde"), [pd.Timestamp('2021-01-05'),
                               pd.Timestamp('2021-01-06'),
                               pd.Timestamp('2021-01-07'),
                               pd.Timestamp('2021-01-08'),
                               pd.Timestamp('2021-01-09')])


t1

2021-01-05    a
2021-01-06    b
2021-01-07    c
2021-01-08    d
2021-01-09    e
dtype: object

In [37]:
t1.index

DatetimeIndex(['2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08',
               '2021-01-09'],
              dtype='datetime64[ns]', freq=None)

In [38]:
type(t1.index) # Looking at the type of our series index, we see that it's DatetimeIndex

pandas.core.indexes.datetimes.DatetimeIndex

Similarly, we can create a `Period`-based index as well.

In [39]:
t2 = pd.Series(list("abcde"), [pd.Period('2021-01'),
                               pd.Period('2021-02'),
                               pd.Period('2021-03'),
                               pd.Period('2021-04'),
                               pd.Period('2021-05')])



t2

2021-01    a
2021-02    b
2021-03    c
2021-04    d
2021-05    e
Freq: M, dtype: object

In [40]:
t2.index

PeriodIndex(['2021-01', '2021-02', '2021-03', '2021-04', '2021-05'], dtype='period[M]', freq='M')

In [41]:
type(t2.index) # Looking at the type of t2.index, we can see that it's PeriodIndex.

pandas.core.indexes.period.PeriodIndex

### Timedelta

---

Timedeltas are differences in times, expressed in different units, e.g. days, hours, minutes, seconds. They can be both positive and negative. This is not the same as a period, but it is conceptually similar. For instance, if we take the difference between January 11th and January 10th, we get a Timedelta of one day.

In [42]:
pd.Timestamp('01/11/2021') - pd.Timestamp('01/10/2021')

Timedelta('1 days 00:00:00')

We can also find what the date and time is for 12 days and 3 hours past January 2nd, at 8:10 AM.

In [44]:
pd.Timestamp('01/2/2021 8:10AM') + pd.Timedelta('12D 3H')

Timestamp('2021-01-14 11:10:00')
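
The sketch below repeats the same arithmetic with keyword arguments instead of a string, and also shows a negative Timedelta. The variable names are illustrative.

```python
import pandas as pd

# Keyword arguments avoid any ambiguity in unit abbreviations.
delta = pd.Timedelta(days=12, hours=3)
result = pd.Timestamp("2021-01-02 08:10") + delta

# Subtracting a later timestamp from an earlier one gives a negative Timedelta.
negative = pd.Timestamp("2021-01-10") - pd.Timestamp("2021-01-11")

print(result)
print(delta.components.days, delta.components.hours)
print(delta.total_seconds())
print(negative.days)
```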

## Dealing with Datetime Objects

---

Next, let's look at a few tricks for working with dates in a DataFrame. Suppose we want to look at nine measurements, taken biweekly, every Sunday, starting in October 2020. Using the `date_range()` function, we can create this DatetimeIndex. In `date_range()`, we have to specify either the `start` or the `end` date. If a date is passed without a keyword, it is treated as the start date. Then we specify the number of periods and a frequency. Here, we set the frequency to `2W-SUN`, which means biweekly on Sunday.

In [45]:
dates = pd.date_range(start="10-01-2020", periods=9, freq="2W-SUN")


dates

DatetimeIndex(['2020-10-04', '2020-10-18', '2020-11-01', '2020-11-15',
               '2020-11-29', '2020-12-13', '2020-12-27', '2021-01-10',
               '2021-01-24'],
              dtype='datetime64[ns]', freq='2W-SUN')
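
Instead of a number of periods, `date_range()` also accepts both a `start` and an `end` date. The sketch below, with illustrative dates, lists every Sunday in October 2020.

```python
import pandas as pd

# With start and end given, the frequency decides how many dates are produced.
sundays = pd.date_range(start="2020-10-01", end="2020-10-31", freq="W-SUN")

print(sundays)
print(len(sundays))
```

The start date itself is a Thursday, so the first generated date is the first Sunday on or after it.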

There are many other frequencies that you can specify. For example, you can use business days.

In [46]:
pd.date_range(start="10-01-2020", periods=9, freq="B")

DatetimeIndex(['2020-10-01', '2020-10-02', '2020-10-05', '2020-10-06',
               '2020-10-07', '2020-10-08', '2020-10-09', '2020-10-12',
               '2020-10-13'],
              dtype='datetime64[ns]', freq='B')

We can use a quarterly frequency as well, with the quarter starts anchored on June.

In [47]:
pd.date_range(start="04-01-2020", periods=12, freq="QS-JUN")

DatetimeIndex(['2020-06-01', '2020-09-01', '2020-12-01', '2021-03-01',
               '2021-06-01', '2021-09-01', '2021-12-01', '2022-03-01',
               '2022-06-01', '2022-09-01', '2022-12-01', '2023-03-01'],
              dtype='datetime64[ns]', freq='QS-JUN')

Now, let's go back to our biweekly Sunday example, create a DataFrame using these dates and some random data, and see what we can do with it.

In [48]:
dates

DatetimeIndex(['2020-10-04', '2020-10-18', '2020-11-01', '2020-11-15',
               '2020-11-29', '2020-12-13', '2020-12-27', '2021-01-10',
               '2021-01-24'],
              dtype='datetime64[ns]', freq='2W-SUN')

In [49]:
np.random.seed(425)



dates = pd.date_range(start="10-11-2020", periods=20, freq="2W-SUN")


df = pd.DataFrame({"count_1": np.random.randint(1, 10, 20),
                   "count_2": np.random.randint(1, 10, 20)},
                  index=dates)


df

Unnamed: 0,count_1,count_2
2020-10-11,3,1
2020-10-25,1,7
2020-11-08,3,5
2020-11-22,4,6
2020-12-06,8,1
2020-12-20,5,2
2021-01-03,8,9
2021-01-17,5,5
2021-01-31,5,5
2021-02-14,8,9


Here, we can see that all the dates in our index are on a Sunday, which matches the frequency that we set.

In [50]:
df.index.day_name()

Index(['Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday',
       'Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday',
       'Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday'],
      dtype='object')

We can also use `diff()` to find the first discrete difference between consecutive rows. We will talk about the `diff()` method later on.

In [51]:
df.diff()

Unnamed: 0,count_1,count_2
2020-10-11,,
2020-10-25,-2.0,6.0
2020-11-08,2.0,-2.0
2020-11-22,1.0,1.0
2020-12-06,4.0,-5.0
2020-12-20,-3.0,1.0
2021-01-03,3.0,7.0
2021-01-17,-3.0,-4.0
2021-01-31,0.0,0.0
2021-02-14,3.0,4.0


Suppose we want to know the mean count for each month in our DataFrame. We can do this using
`resample()`. Converting from a higher frequency to a lower frequency is called `downsampling` (we'll talk about this in a moment).

In [52]:
df.resample("M").mean()

Unnamed: 0,count_1,count_2
2020-10-31,2.0,4.0
2020-11-30,3.5,5.5
2020-12-31,6.5,1.5
2021-01-31,6.0,6.333333
2021-02-28,8.0,6.0
2021-03-31,3.0,5.0
2021-04-30,2.5,2.0
2021-05-31,3.5,5.0
2021-06-30,3.5,5.5
2021-07-31,7.0,7.0


Now let's talk about datetime indexing and slicing, which is a wonderful feature of the pandas DataFrame. For instance, we can use partial string indexing to select values from a particular year.

In [56]:
df.loc["2020"] # Select only 2020 year

df.loc["2021"] # Select only 2021 year

df.loc["2020-12"] # Select particular year and month

df.loc["2020-12":] # Select range

Unnamed: 0,count_1,count_2
2020-12-06,8,1
2020-12-20,5,2
2021-01-03,8,9
2021-01-17,5,5
2021-01-31,5,5
2021-02-14,8,9
2021-02-28,8,3
2021-03-14,1,3
2021-03-28,5,7
2021-04-11,3,1


## Indexing, Selection, Sub-setting

---

A time series behaves like any other Pandas Series when you index and select data by label.


> **While pandas does not force you to have a sorted date index, some of these methods may have unexpected or incorrect behavior if the dates are unsorted.**
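
A simple way to stay clear of the problem described in the note is to check whether the index is sorted and call `sort_index()` before slicing. The sketch below uses a small made-up Series with deliberately unsorted dates.

```python
import pandas as pd

# Dates are deliberately out of order (illustrative data).
idx = pd.to_datetime(["2021-03-01", "2021-01-01", "2021-02-01"])
s = pd.Series([3, 1, 2], index=idx)

print(s.index.is_monotonic_increasing)  # is the index already sorted?

# Sort first, then slice by partial date strings.
s_sorted = s.sort_index()
subset = s_sorted.loc["2021-01":"2021-02"]

print(subset)
```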

In [57]:
df

Unnamed: 0,count_1,count_2
2020-10-11,3,1
2020-10-25,1,7
2020-11-08,3,5
2020-11-22,4,6
2020-12-06,8,1
2020-12-20,5,2
2021-01-03,8,9
2021-01-17,5,5
2021-01-31,5,5
2021-02-14,8,9


In [58]:
df.index

DatetimeIndex(['2020-10-11', '2020-10-25', '2020-11-08', '2020-11-22',
               '2020-12-06', '2020-12-20', '2021-01-03', '2021-01-17',
               '2021-01-31', '2021-02-14', '2021-02-28', '2021-03-14',
               '2021-03-28', '2021-04-11', '2021-04-25', '2021-05-09',
               '2021-05-23', '2021-06-06', '2021-06-20', '2021-07-04'],
              dtype='datetime64[ns]', freq='2W-SUN')

In [59]:
df.shape

(20, 2)

In [60]:
df[:10] # Select first 10 rows

df[10:15] # Select 5 rows

df[15:] # Select last 5 rows

Unnamed: 0,count_1,count_2
2021-05-09,5,2
2021-05-23,2,8
2021-06-06,6,3
2021-06-20,1,8
2021-07-04,7,7


In [64]:
df.loc["2020-10-11"] # Select one row

df.loc[datetime.datetime(2020, 10, 11)]


df.loc["2020-10"] # Select all row for a month

df.loc["2020"] # Select all rows for a year



df.loc[:"2020"] # Select year range

df.loc["2021":] # Select year range


df.loc["2021-01":"2021-05"] # Select month range

df.loc[datetime.datetime(2020, 10, 11): datetime.datetime(2021, 2, 28)]

Unnamed: 0,count_1,count_2
2020-10-11,3,1
2020-10-25,1,7
2020-11-08,3,5
2020-11-22,4,6
2020-12-06,8,1
2020-12-20,5,2
2021-01-03,8,9
2021-01-17,5,5
2021-01-31,5,5
2021-02-14,8,9


In [66]:
df

Unnamed: 0,count_1,count_2
2020-10-11,3,1
2020-10-25,1,7
2020-11-08,3,5
2020-11-22,4,6
2020-12-06,8,1
2020-12-20,5,2
2021-01-03,8,9
2021-01-17,5,5
2021-01-31,5,5
2021-02-14,8,9


In [67]:
df.truncate(before="2021-01-31") # Truncate all rows before this index value

df.truncate(after="2021-01-31") # Truncate all rows after this index value

df.truncate(before="2021-01-03", after="2021-04-11") # Truncate before and after these index values

Unnamed: 0,count_1,count_2
2021-01-03,8,9
2021-01-17,5,5
2021-01-31,5,5
2021-02-14,8,9
2021-02-28,8,3
2021-03-14,1,3
2021-03-28,5,7
2021-04-11,3,1


## Periods and Period Arithmetic

---

Periods represent timespans, like days, months, quarters, or years. The `Period` class represents this data type and requires a string or integer and a frequency from the table below.


![alt text](images/base_ts_frq.png "Title")

The Period object below represents the full timespan from January 1, 2020, to December 31, 2020, inclusive.

In [68]:
first_period = pd.Period(value=2020, freq="A-Dec")


first_period

Period('2020', 'A-DEC')

The frequency of our period object is annual, so adding or subtracting an integer shifts the period by that many years rather than by calendar days.

In [69]:
first_period - 5

Period('2015', 'A-DEC')

In [70]:
first_period + 3

Period('2023', 'A-DEC')

If two periods have the same frequency, their difference is the number of units between them.

In [72]:
second_period = pd.Period(value="2015", freq="A-DEC")

In [73]:
second_period - first_period

<-5 * YearEnds: month=12>

In [74]:
first_period - second_period

<5 * YearEnds: month=12>

It's not possible to add two period objects.

In [75]:
first_period + second_period

TypeError: unsupported operand type(s) for +: 'Period' and 'Period'

Also, it's not possible to do arithmetic on period objects with different frequencies.

In [76]:
third_period = pd.Period(value="01-04-2021", freq="D")

In [77]:
third_period

Period('2021-01-04', 'D')

In [78]:
first_period - third_period

IncompatibleFrequency: Input has different freq=D from Period(freq=A-DEC)

In [79]:
first_period + third_period

TypeError: unsupported operand type(s) for +: 'Period' and 'Period'

The key here is that the `Period` object encapsulates the granularity for arithmetic.

### Period Frequency Conversion


---

It's not possible to do arithmetic on period objects with different frequencies. However, we can convert them to a common frequency first and then perform the operation.

In [80]:
first_period

Period('2020', 'A-DEC')

Convert annual period into monthly period!

In [81]:
first_period.asfreq(freq="M", how="start") # Annual period to Monthly period

Period('2020-01', 'M')
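
The `how` argument decides which end of the original span the converted period is taken from. As a sketch with an illustrative monthly period, converting February 2020 to a daily frequency gives either its first or its last day.

```python
import pandas as pd

month = pd.Period("2020-02", freq="M")

first_day = month.asfreq(freq="D", how="start")  # first day of the month
last_day = month.asfreq(freq="D", how="end")     # last day of the month

print(first_day, last_day)

# Both results share the daily frequency, so they can be subtracted.
span = last_day - first_day
print(span.n)  # number of daily steps between the two periods
```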

In [82]:
first_period_daily = first_period.asfreq(freq="D", how="start")


first_period_daily

Period('2020-01-01', 'D')

In [84]:
third_period

Period('2021-01-04', 'D')

In [83]:
third_period - first_period_daily

<369 * Days>

## Date and Time Conversion

---

Converting strings into proper date objects, and vice versa, is a crucial operation when working with time series data. Some operations need a string representation of a date, while others need a proper date/time object.

### Converting Between String and Datetime

---

We can convert the string representation of a date into a proper date object, and vice versa, using either Python's built-in `datetime` module or Pandas.

![alt text](images/format_table.png "Title")




[**See this link for full reference**](https://www.w3schools.com/python/python_datetime.asp)

**Datetime to String**

In [85]:
single_stamp = datetime.datetime(year=2021, month=1, day=5, hour=10, minute=45)


single_stamp

datetime.datetime(2021, 1, 5, 10, 45)

In [86]:
str(single_stamp) # From datetime to string

'2021-01-05 10:45:00'

In [90]:
single_stamp.strftime(format="%Y-%m-%d")

single_stamp.strftime(format="%y-%m-%d")

single_stamp.strftime(format="%y:%m:%d:%H:%M")

'21:01:05:10:45'

In [95]:
single_stamp.strftime("%b") # Abbreviated month name

single_stamp.strftime("%B") # Full month name

'January'

**String to Datetime**

In [104]:
value = "2021-1-05"

In [106]:
pd.to_datetime(value) # from string to datetime

Timestamp('2021-01-05 00:00:00')

In [105]:
datetime.datetime.strptime(value, "%Y-%m-%d") # strptime needs the EXACT format; otherwise the conversion fails

datetime.datetime(2021, 1, 5, 0, 0)

In [107]:
date_strings = ["7/6/2011", "8/6/2011"]


[datetime.datetime.strptime(x, '%m/%d/%Y') for x in date_strings]

[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]

Pandas `to_datetime()` is less picky than `datetime.strptime()` about date formatting.

In [108]:
new_dates = ["2 June 2013", "Aug 29, 2014", "2015-06-26", "7/12/16"]

[pd.to_datetime(i) for i in new_dates]

[Timestamp('2013-06-02 00:00:00'),
 Timestamp('2014-08-29 00:00:00'),
 Timestamp('2015-06-26 00:00:00'),
 Timestamp('2016-07-12 00:00:00')]

The `to_datetime()` function has an option to change the date parse order.

In [109]:
pd.to_datetime("4.7.12", dayfirst=True)

Timestamp('2012-07-04 00:00:00')

In [110]:
pd.to_datetime("2010/11/12", format="%Y/%m/%d") # We can also pass an explicit format

Timestamp('2010-11-12 00:00:00')

Missing values in time series data are represented as `NaT` (Not a Time).

In [111]:
pd.to_datetime([None])

DatetimeIndex(['NaT'], dtype='datetime64[ns]', freq=None)
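
`NaT` also appears when parsing fails and we ask Pandas to continue instead of raising. The sketch below uses the `errors="coerce"` option of `to_datetime()` on a made-up list containing one unparseable string.

```python
import pandas as pd

raw = ["2021-01-05", "not a date", None]

# errors="coerce" turns anything that cannot be parsed into NaT.
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed)
print(parsed.isna())  # boolean mask marking the missing entries
```

The mask returned by `isna()` can then be used to filter out or count the entries that failed to parse.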

### Converting Timestamps to Periods

---

Series and DataFrame objects indexed by Timestamps can be converted to Periods with the `to_period()` method.

In [112]:
rng = pd.date_range("2000-01-01", periods=3, freq="M")


ts = pd.Series(np.random.randint(low=1, high=10, size=3), index=rng)


ts

2000-01-31    4
2000-02-29    3
2000-03-31    3
Freq: M, dtype: int64

In [113]:
type(ts.index)

pandas.core.indexes.datetimes.DatetimeIndex

In [114]:
new_ts = ts.to_period() # Convert Series from DatetimeIndex to PeriodIndex.


new_ts

2000-01    4
2000-02    3
2000-03    3
Freq: M, dtype: int64

In [None]:
type(new_ts.index)

### Converting Periods to Timestamps

---

Series and DataFrame objects indexed by Periods can be converted to Timestamps with the `to_timestamp()` method.

In [115]:
new_ts

2000-01    4
2000-02    3
2000-03    3
Freq: M, dtype: int64

In [116]:
new_ts.to_timestamp()

2000-01-01    4
2000-02-01    3
2000-03-01    3
Freq: MS, dtype: int64

In [117]:
type(new_ts.to_timestamp().index)

pandas.core.indexes.datetimes.DatetimeIndex

## Time Shifting

---

Shifting (Leading and Lagging) data refers to moving data backward and forward through time. Both Series and DataFrame have a `shift()` method for doing naive shifts forward or backward, leaving the index unmodified.

When we shift like this, missing data is introduced either at the start or the end of the time series.

In [118]:
np.random.seed(425)

ts = pd.Series(data=np.random.randint(low=1, high=10, size=7),
               index=pd.date_range(start="2021-01-11", periods=7))


ts

2021-01-11    3
2021-01-12    1
2021-01-13    3
2021-01-14    4
2021-01-15    8
2021-01-16    5
2021-01-17    8
Freq: D, dtype: int64

**Forward Shift**

In [122]:
pd.DataFrame(ts).shift(periods=2) # Shift the data forward by two periods; the index is unchanged

Unnamed: 0,0
2021-01-11,
2021-01-12,
2021-01-13,3.0
2021-01-14,1.0
2021-01-15,3.0
2021-01-16,4.0
2021-01-17,8.0


If the `freq` argument is specified, the index values are shifted but the data is not realigned.

In [123]:
pd.DataFrame(ts).shift(periods=2, freq="D")

Unnamed: 0,0
2021-01-13,3
2021-01-14,1
2021-01-15,3
2021-01-16,4
2021-01-17,8
2021-01-18,5
2021-01-19,8


In [124]:
pd.DataFrame(ts)

Unnamed: 0,0
2021-01-11,3
2021-01-12,1
2021-01-13,3
2021-01-14,4
2021-01-15,8
2021-01-16,5
2021-01-17,8


**Backward Shift**

In [125]:
pd.DataFrame(ts).shift(periods=-2) # Shift the data backward by two periods; the index is unchanged

Unnamed: 0,0
2021-01-11,3.0
2021-01-12,4.0
2021-01-13,8.0
2021-01-14,5.0
2021-01-15,8.0
2021-01-16,
2021-01-17,


In [126]:
pd.DataFrame(ts).shift(periods=-2, freq="D") # Shift index by desired number of periods

Unnamed: 0,0
2021-01-09,3
2021-01-10,1
2021-01-11,3
2021-01-12,4
2021-01-13,8
2021-01-14,5
2021-01-15,8


In [127]:
pd.DataFrame(ts)

Unnamed: 0,0
2021-01-11,3
2021-01-12,1
2021-01-13,3
2021-01-14,4
2021-01-15,8
2021-01-16,5
2021-01-17,8
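
A common use of shifting is comparing each observation with the previous one. The sketch below rebuilds the same seven values by hand (so it does not depend on the random seed) and shows that subtracting the one-period lag matches `diff()`.

```python
import pandas as pd

ts = pd.Series([3, 1, 3, 4, 8, 5, 8],
               index=pd.date_range(start="2021-01-11", periods=7))

lagged = ts.shift(periods=1)  # previous day's value aligned to each date
change = ts - lagged          # day-over-day change

print(change)
print(change.equals(ts.diff()))  # same result as the built-in diff()
```

The first entry is missing because there is no earlier observation to compare against.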


## Resampling

---

Resampling refers to the process of converting a time series from one frequency to another. Aggregating higher frequency data to lower frequency is called `downsampling`, while converting lower frequency to higher frequency is called `upsampling`.


`resample()` is a time-based `groupby()`, followed by a reduction method on each of its groups.

#### Reference


[Resampling](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-resampling)
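
To make the "time-based `groupby()`" description concrete, the sketch below compares `resample()` with an explicit `groupby()` on a `pd.Grouper`. The data and variable names are illustrative, and a weekly frequency is used to keep the output short.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2021-01-01", periods=28, freq="D")
s = pd.Series(np.arange(28), index=idx)

# Weekly totals via resample ...
by_resample = s.resample("W").sum()

# ... and via an explicit time-based groupby.
by_groupby = s.groupby(pd.Grouper(freq="W")).sum()

print(by_resample)
print((by_resample == by_groupby).all())
```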

In [128]:
date_range = pd.date_range("2021-01-01-12-00", periods=120, freq="D")

df = pd.Series(data=np.random.randint(low=1, high=20, size=120),
               index=date_range)


df

2021-01-01 12:00:00+00:00    10
2021-01-02 12:00:00+00:00    13
2021-01-03 12:00:00+00:00     5
2021-01-04 12:00:00+00:00     8
2021-01-05 12:00:00+00:00    16
                             ..
2021-04-26 12:00:00+00:00     1
2021-04-27 12:00:00+00:00    14
2021-04-28 12:00:00+00:00    14
2021-04-29 12:00:00+00:00     7
2021-04-30 12:00:00+00:00    18
Freq: D, Length: 120, dtype: int64

In [129]:
type(df.index)

pandas.core.indexes.datetimes.DatetimeIndex

In [130]:
df.index

DatetimeIndex(['2021-01-01 12:00:00+00:00', '2021-01-02 12:00:00+00:00',
               '2021-01-03 12:00:00+00:00', '2021-01-04 12:00:00+00:00',
               '2021-01-05 12:00:00+00:00', '2021-01-06 12:00:00+00:00',
               '2021-01-07 12:00:00+00:00', '2021-01-08 12:00:00+00:00',
               '2021-01-09 12:00:00+00:00', '2021-01-10 12:00:00+00:00',
               ...
               '2021-04-21 12:00:00+00:00', '2021-04-22 12:00:00+00:00',
               '2021-04-23 12:00:00+00:00', '2021-04-24 12:00:00+00:00',
               '2021-04-25 12:00:00+00:00', '2021-04-26 12:00:00+00:00',
               '2021-04-27 12:00:00+00:00', '2021-04-28 12:00:00+00:00',
               '2021-04-29 12:00:00+00:00', '2021-04-30 12:00:00+00:00'],
              dtype='datetime64[ns, tzutc()]', length=120, freq='D')

### Downsampling

In [131]:
df

2021-01-01 12:00:00+00:00    10
2021-01-02 12:00:00+00:00    13
2021-01-03 12:00:00+00:00     5
2021-01-04 12:00:00+00:00     8
2021-01-05 12:00:00+00:00    16
                             ..
2021-04-26 12:00:00+00:00     1
2021-04-27 12:00:00+00:00    14
2021-04-28 12:00:00+00:00    14
2021-04-29 12:00:00+00:00     7
2021-04-30 12:00:00+00:00    18
Freq: D, Length: 120, dtype: int64

Aggregate data into month chunks by taking the sum of each group.

In [132]:
df.resample(rule="M").sum()

2021-01-31 00:00:00+00:00    327
2021-02-28 00:00:00+00:00    285
2021-03-31 00:00:00+00:00    335
2021-04-30 00:00:00+00:00    301
Freq: M, dtype: int64

In [133]:
df.resample(rule="M").mean()

2021-01-31 00:00:00+00:00    10.548387
2021-02-28 00:00:00+00:00    10.178571
2021-03-31 00:00:00+00:00    10.806452
2021-04-30 00:00:00+00:00    10.033333
Freq: M, dtype: float64

In [134]:
df.resample(rule="M", kind="period").sum() # Convert resulting index to "PeriodIndex"

2021-01    327
2021-02    285
2021-03    335
2021-04    301
Freq: M, dtype: int64

#### Open-High-Low-Close (OHLC) resampling

---

In finance, a popular way to aggregate a time series is to compute four values for each bucket: the first (open), last (close), maximum (high), and minimum (low) values.





* (open, first)


* (high, max)


* (low, min)


* (close, last)

In [135]:
df

2021-01-01 12:00:00+00:00    10
2021-01-02 12:00:00+00:00    13
2021-01-03 12:00:00+00:00     5
2021-01-04 12:00:00+00:00     8
2021-01-05 12:00:00+00:00    16
                             ..
2021-04-26 12:00:00+00:00     1
2021-04-27 12:00:00+00:00    14
2021-04-28 12:00:00+00:00    14
2021-04-29 12:00:00+00:00     7
2021-04-30 12:00:00+00:00    18
Freq: D, Length: 120, dtype: int64

In [136]:
df.resample("M").ohlc()

Unnamed: 0,open,high,low,close
2021-01-31 00:00:00+00:00,10,19,1,3
2021-02-28 00:00:00+00:00,8,19,1,19
2021-03-31 00:00:00+00:00,9,19,3,19
2021-04-30 00:00:00+00:00,12,19,1,18
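
The mapping listed above can be checked directly: `ohlc()` gives the same numbers as aggregating with `first`, `max`, `min`, and `last`. This is a sketch on a small hand-written series, not the lecture data.

```python
import pandas as pd

# Two full weeks of daily values, starting on a Monday (illustrative data).
idx = pd.date_range("2021-01-04", periods=14, freq="D")
s = pd.Series([5, 9, 2, 7, 4, 6, 8, 3, 1, 10, 12, 11, 2, 6], index=idx)

ohlc = s.resample("W").ohlc()

# The same four statistics computed one by one.
manual = s.resample("W").agg(["first", "max", "min", "last"])
manual.columns = ["open", "high", "low", "close"]

print(ohlc)
print((ohlc.values == manual.values).all())
```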


In [137]:
df = df.resample(rule="W").mean() # Week average

df

2021-01-03 00:00:00+00:00     9.333333
2021-01-10 00:00:00+00:00    13.285714
2021-01-17 00:00:00+00:00     9.428571
2021-01-24 00:00:00+00:00     9.857143
2021-01-31 00:00:00+00:00    10.142857
2021-02-07 00:00:00+00:00     7.000000
2021-02-14 00:00:00+00:00    10.571429
2021-02-21 00:00:00+00:00    12.285714
2021-02-28 00:00:00+00:00    10.857143
2021-03-07 00:00:00+00:00    11.142857
2021-03-14 00:00:00+00:00    13.142857
2021-03-21 00:00:00+00:00     8.000000
2021-03-28 00:00:00+00:00    10.714286
2021-04-04 00:00:00+00:00     9.428571
2021-04-11 00:00:00+00:00    13.428571
2021-04-18 00:00:00+00:00     7.285714
2021-04-25 00:00:00+00:00    10.000000
2021-05-02 00:00:00+00:00    10.800000
Freq: W-SUN, dtype: float64

### Upsampling


---

Unlike downsampling, upsampling does not need an aggregation function. We use the `asfreq()` method to convert to the higher frequency without any aggregation.

In [138]:
df # Week average

2021-01-03 00:00:00+00:00     9.333333
2021-01-10 00:00:00+00:00    13.285714
2021-01-17 00:00:00+00:00     9.428571
2021-01-24 00:00:00+00:00     9.857143
2021-01-31 00:00:00+00:00    10.142857
2021-02-07 00:00:00+00:00     7.000000
2021-02-14 00:00:00+00:00    10.571429
2021-02-21 00:00:00+00:00    12.285714
2021-02-28 00:00:00+00:00    10.857143
2021-03-07 00:00:00+00:00    11.142857
2021-03-14 00:00:00+00:00    13.142857
2021-03-21 00:00:00+00:00     8.000000
2021-03-28 00:00:00+00:00    10.714286
2021-04-04 00:00:00+00:00     9.428571
2021-04-11 00:00:00+00:00    13.428571
2021-04-18 00:00:00+00:00     7.285714
2021-04-25 00:00:00+00:00    10.000000
2021-05-02 00:00:00+00:00    10.800000
Freq: W-SUN, dtype: float64

In [139]:
df.resample(rule="D").asfreq()

2021-01-03 00:00:00+00:00     9.333333
2021-01-04 00:00:00+00:00          NaN
2021-01-05 00:00:00+00:00          NaN
2021-01-06 00:00:00+00:00          NaN
2021-01-07 00:00:00+00:00          NaN
                               ...    
2021-04-28 00:00:00+00:00          NaN
2021-04-29 00:00:00+00:00          NaN
2021-04-30 00:00:00+00:00          NaN
2021-05-01 00:00:00+00:00          NaN
2021-05-02 00:00:00+00:00    10.800000
Freq: D, Length: 120, dtype: float64

In [140]:
df.resample(rule="D").ffill()

2021-01-03 00:00:00+00:00     9.333333
2021-01-04 00:00:00+00:00     9.333333
2021-01-05 00:00:00+00:00     9.333333
2021-01-06 00:00:00+00:00     9.333333
2021-01-07 00:00:00+00:00     9.333333
                               ...    
2021-04-28 00:00:00+00:00    10.000000
2021-04-29 00:00:00+00:00    10.000000
2021-04-30 00:00:00+00:00    10.000000
2021-05-01 00:00:00+00:00    10.000000
2021-05-02 00:00:00+00:00    10.800000
Freq: D, Length: 120, dtype: float64
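
Forward filling is only one way to fill the gaps created by upsampling. As an alternative sketch, the resampler's `interpolate()` method fills them by linear interpolation; the two weekly values below are made up for illustration.

```python
import pandas as pd

weekly = pd.Series([10.0, 17.0],
                   index=pd.date_range("2021-01-03", periods=2, freq="W-SUN"))

# Upsample to daily and fill the new rows on a straight line between known points.
daily = weekly.resample("D").interpolate(method="linear")

print(daily)
```

Whether forward filling or interpolation is appropriate depends on what the values mean, so treat this as one option rather than a recommendation.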

We can resample our Series or DataFrame once and then apply several functions to the result.

In [141]:
res = df.resample(rule="M")

In [142]:
res

<pandas.core.resample.DatetimeIndexResampler object at 0x7fcff9416090>

In [143]:
res.aggregate([np.sum, np.mean, np.std])

Unnamed: 0,sum,mean,std
2021-01-31 00:00:00+00:00,52.047619,10.409524,1.64082
2021-02-28 00:00:00+00:00,40.714286,10.178571,2.247826
2021-03-31 00:00:00+00:00,43.0,10.75,2.116906
2021-04-30 00:00:00+00:00,40.142857,10.035714,2.545838
2021-05-31 00:00:00+00:00,10.8,10.8,


## Moving Window Functions

---

An important class of array transformations for time series consists of statistics and other functions evaluated over a sliding window, that is, functions that aggregate over a sliding partition of values.

### Rolling Functions

Let's create a simple Pandas Series to understand how the window functions work.

In [144]:
np.random.seed(425)

ts = pd.Series(data=np.random.randint(low=1, high=10, size=7),
               index=pd.date_range(start="2021-01-11", periods=7))


ts

2021-01-11    3
2021-01-12    1
2021-01-13    3
2021-01-14    4
2021-01-15    8
2021-01-16    5
2021-01-17    8
Freq: D, dtype: int64

In [154]:
ts.rolling(window=2).sum() # Sum over a sliding window of two periods

2021-01-11     NaN
2021-01-12     4.0
2021-01-13     4.0
2021-01-14     7.0
2021-01-15    12.0
2021-01-16    13.0
2021-01-17    13.0
Freq: D, dtype: float64

In [148]:
ts.rolling(window=2).mean() # Mean over a sliding window of two periods

2021-01-11    NaN
2021-01-12    2.0
2021-01-13    2.0
2021-01-14    3.5
2021-01-15    6.0
2021-01-16    6.5
2021-01-17    6.5
Freq: D, dtype: float64

In [149]:
ts.rolling(window=3).sum() # Sum over a sliding window of three periods

2021-01-11     NaN
2021-01-12     NaN
2021-01-13     7.0
2021-01-14     8.0
2021-01-15    15.0
2021-01-16    17.0
2021-01-17    21.0
Freq: D, dtype: float64
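
By default a rolling window yields `NaN` until it holds `window` observations. The `min_periods` argument relaxes that requirement, as the sketch below shows on the same seven values written out by hand.

```python
import pandas as pd

ts = pd.Series([3, 1, 3, 4, 8, 5, 8],
               index=pd.date_range(start="2021-01-11", periods=7))

strict = ts.rolling(window=2).sum()                  # needs two values per window
relaxed = ts.rolling(window=2, min_periods=1).sum()  # one value is enough

print(strict)
print(relaxed)
```

With `min_periods=1`, the first result is simply the first observation, since the window contains only that one value.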

### Expanding Functions

---

The expanding functions start the time window at the beginning of the time series and increase its size until it encompasses the whole series.

In [150]:
ts

2021-01-11    3
2021-01-12    1
2021-01-13    3
2021-01-14    4
2021-01-15    8
2021-01-16    5
2021-01-17    8
Freq: D, dtype: int64

In [151]:
ts.expanding().sum()

2021-01-11     3.0
2021-01-12     4.0
2021-01-13     7.0
2021-01-14    11.0
2021-01-15    19.0
2021-01-16    24.0
2021-01-17    32.0
Freq: D, dtype: float64

In [152]:
ts.expanding().mean()

2021-01-11    3.000000
2021-01-12    2.000000
2021-01-13    2.333333
2021-01-14    2.750000
2021-01-15    3.800000
2021-01-16    4.000000
2021-01-17    4.571429
Freq: D, dtype: float64
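
Because the window always starts at the first observation, an expanding sum is a running total. The sketch below, using the same seven values written out by hand, checks it against `cumsum()` and rebuilds the expanding mean from it.

```python
import numpy as np
import pandas as pd

ts = pd.Series([3, 1, 3, 4, 8, 5, 8],
               index=pd.date_range(start="2021-01-11", periods=7))

running_total = ts.expanding().sum()
print((running_total == ts.cumsum()).all())  # same values as a cumulative sum

# Expanding mean = running total divided by the number of observations so far.
counts = np.arange(1, len(ts) + 1)
manual_mean = ts.cumsum() / counts
print(np.allclose(manual_mean, ts.expanding().mean()))
```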

# Summary

---

Time series data requires its own kinds of analysis and transformation. Knowing how to work with time series data, whether regular or irregular, is a great tool to have under your belt.