<p><font size="6"><b>04 - Pandas: Working with time series data</b></font></p>

> *DS Data manipulation, analysis and visualisation in Python*  
> *December, 2017*

> *© 2016, Joris Van den Bossche and Stijn Van Hoey  (<mailto:jorisvandenbossche@gmail.com>, <mailto:stijnvanhoey@gmail.com>). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*

---

In [2]:
# %matplotlib notebook
%matplotlib inline 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('ggplot')
pd.options.display.max_rows = 8

# Introduction: `datetime` module

Standard Python contains the `datetime` module to handle date and time data:

In [41]:
import datetime

In [18]:
dt = datetime.datetime(year=2016, month=12, day=19, hour=13, minute=30)
dt

datetime.datetime(2016, 12, 19, 13, 30)

In [19]:
print(dt) # .day,...

2016-12-19 13:30:00


In [20]:
print(dt.strftime("%d %B %Y"))

19 December 2016
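As a small sketch of what else the standard library offers (the date string and variable names below are made up for illustration), `strptime` goes the other way and parses a string with an explicit format, and `timedelta` supports date arithmetic:

```python
import datetime

# Parse a string into a datetime object using an explicit format
parsed = datetime.datetime.strptime("19/12/2016 13:30", "%d/%m/%Y %H:%M")

# Individual components are available as attributes
components = (parsed.year, parsed.month, parsed.day, parsed.hour, parsed.minute)

# Arithmetic works through timedelta objects
one_week_later = parsed + datetime.timedelta(days=7)

print(parsed)
print(components)
print(one_week_later)
```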


# Dates and times in pandas

## The ``Timestamp`` object

Pandas has its own date and time objects, which are compatible with the standard `datetime` objects but provide additional functionality.  

The `Timestamp` object can also be constructed from a string:

In [21]:
ts = pd.Timestamp('2016-12-19')
ts

Timestamp('2016-12-19 00:00:00')

Like with `datetime.datetime` objects, there are several useful attributes available on the `Timestamp`. For example, we can get the month (experiment with tab completion!):

In [None]:
ts.month

12

In [None]:
ts.day

19

In [None]:
ts.year

2016

In [None]:
ts

Timestamp('2016-12-19 00:00:00')

There is also a `Timedelta` type, which can e.g. be used to add intervals of time:

In [24]:
ts + pd.Timedelta('9 days')

Timestamp('2016-12-28 00:00:00')
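A `Timedelta` also appears when two timestamps are subtracted. The following is a minimal sketch with illustrative dates (the variable names are not part of the course material):

```python
import pandas as pd

start = pd.Timestamp("2016-12-19")
end = pd.Timestamp("2016-12-28 06:00")

# Subtracting two Timestamps gives a Timedelta
elapsed = end - start
print(elapsed)

# The duration can be inspected in whole days or converted to hours
whole_days = elapsed.days
total_hours = elapsed.total_seconds() / 3600

# Timedeltas can be scaled and added back to a Timestamp
halfway = start + elapsed / 2
print(whole_days, total_hours, halfway)
```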

## Parsing datetime strings 

![](http://imgs.xkcd.com/comics/iso_8601.png)


Unfortunately, when working with real world data, you encounter many different `datetime` formats. Most of the time when you have to deal with them, they come in text format, e.g. from a `CSV` file. To work with those data in Pandas, we first have to *parse* the strings to actual `Timestamp` objects.

<div class="alert alert-info">
<b>REMEMBER</b>: <br><br>

To convert string formatted dates to Timestamp objects: use the `pandas.to_datetime` function
</div>



In [25]:
pd.to_datetime("2016-12-09")

Timestamp('2016-12-09 00:00:00')

In [26]:
pd.to_datetime("12/13/2016")

Timestamp('2016-12-13 00:00:00')

In [27]:
pd.to_datetime("13/12/2016", dayfirst=True)

Timestamp('2016-12-13 00:00:00')

In [28]:
pd.to_datetime("09/12/2016", format="%d/%m/%Y")

Timestamp('2016-12-09 00:00:00')

For a detailed overview of how to specify the `format` string, see the table in the Python documentation: https://docs.python.org/3.5/library/datetime.html#strftime-and-strptime-behavior
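Real-world files often contain values that cannot be parsed at all. As a hedged sketch (the example strings are invented), an explicit `format` can be combined with `errors="coerce"`, which replaces unparseable entries with `NaT` instead of raising an error:

```python
import pandas as pd

# An explicit format removes the day/month ambiguity
explicit = pd.to_datetime("09/12/2016 14:45", format="%d/%m/%Y %H:%M")

# errors="coerce" turns unparseable values into NaT instead of raising
raw = pd.Series(["09/12/2016", "31/12/2016", "not a date"])
parsed = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

print(explicit)
print(parsed)
```

Checking `parsed.isna()` afterwards is a quick way to find the rows that need manual attention.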

## `Timestamp` data in a Series or DataFrame column

In [29]:
s = pd.Series(['2016-12-04 10:00:00', '2016-12-05, 11:00:00', '2016-12-06 12:00:00'])

In [30]:
s

0     2016-12-04 10:00:00
1    2016-12-05, 11:00:00
2     2016-12-06 12:00:00
dtype: object

The `to_datetime` function can also be used to convert a full series of strings:

In [31]:
ts = pd.to_datetime(s)

In [32]:
ts

0   2016-12-04 10:00:00
1   2016-12-05 11:00:00
2   2016-12-06 12:00:00
dtype: datetime64[ns]

Notice the data type of this series has changed: the `datetime64[ns]` dtype. This indicates that we have a series of actual datetime values.

The same attributes as on single `Timestamp`s are also available on a Series with datetime data, using the **`.dt`** accessor:

In [33]:
ts.dt.hour

0    10
1    11
2    12
dtype: int64

In [34]:
ts.dt.weekday

0    6
1    0
2    1
dtype: int64
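The `.dt` accessor can also be used to build boolean masks or to derive new values. A minimal sketch, recreating a similar series so that the block runs on its own:

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2016-12-04 10:00",
                               "2016-12-05 11:00",
                               "2016-12-06 12:00"]))

# Human-readable weekday names instead of numbers (Monday=0 ... Sunday=6)
names = ts.dt.day_name()

# Boolean mask based on a datetime attribute
later = ts[ts.dt.hour >= 11]

# Drop the time-of-day part, keeping midnight of the same date
dates_only = ts.dt.normalize()

print(names)
print(later)
print(dates_only)
```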

To quickly construct some regular time series data, the [``pd.date_range``](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html) function comes in handy:

In [None]:
pd.Series(pd.date_range(start="2020-11-03", periods=100, freq='3H'))

0    2020-11-03 00:00:00
1    2020-11-03 03:00:00
2    2020-11-03 06:00:00
3    2020-11-03 09:00:00
             ...        
96   2020-11-15 00:00:00
97   2020-11-15 03:00:00
98   2020-11-15 06:00:00
99   2020-11-15 09:00:00
Length: 100, dtype: datetime64[ns]
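`pd.date_range` accepts either a number of periods or an end point. The sketch below uses arbitrary dates and the lowercase `"h"` alias for hours; which alias spellings are accepted or deprecated depends on the installed pandas version, so treat the frequency strings as an assumption to verify locally:

```python
import pandas as pd

# A fixed number of daily timestamps
by_periods = pd.date_range(start="2020-11-03", periods=4, freq="D")

# Everything between start and end (both included) in 6-hour steps
by_end = pd.date_range(start="2020-11-03", end="2020-11-04", freq="6h")

# The first day of three consecutive months ("MS" = month start)
month_starts = pd.date_range(start="2020-01-01", periods=3, freq="MS")

print(by_periods)
print(by_end)
print(month_starts)
```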

# Time series data: `Timestamp` in the index

## River discharge example data

For the following demonstration of the time series functionality, we use a sample of discharge data of the Maarkebeek (Flanders) with 3 hour averaged values, derived from the [Waterinfo website](https://www.waterinfo.be/).

In [4]:
import pandas as pd

In [5]:
data = pd.read_csv("vmm_flowdata.csv")


In [37]:
data.head()


We already know how to parse a date column with Pandas:

In [None]:
data['Time'] = pd.to_datetime(data['Time'])

With `set_index('Time')`, we set the column with datetime values as the index. A datetime index can be used on both a `Series` and a `DataFrame`.

In [None]:
data = data.set_index("Time")

In [None]:
data

Time                  L06_347  LS06_347  LS06_348
2009-01-01 00:00:00  0.137417  0.097500  0.016833
2009-01-01 03:00:00  0.131250  0.088833  0.016417
2009-01-01 06:00:00  0.113500  0.091250  0.016750
2009-01-01 09:00:00  0.135750  0.091500  0.016250
...                       ...       ...       ...
2013-01-01 15:00:00  1.420000  1.420000  0.096333
2013-01-01 18:00:00  1.178583  1.178583  0.083083
2013-01-01 21:00:00  0.898250  0.898250  0.077167
2013-01-02 00:00:00  0.860000  0.860000  0.075000


The steps above are provided as built-in functionality of `read_csv`:

In [None]:
data = pd.read_csv("vmm_flowdata.csv", index_col=0, parse_dates=True)

In [None]:
data

Time                  L06_347  LS06_347  LS06_348
2009-01-01 00:00:00  0.137417  0.097500  0.016833
2009-01-01 03:00:00  0.131250  0.088833  0.016417
2009-01-01 06:00:00  0.113500  0.091250  0.016750
2009-01-01 09:00:00  0.135750  0.091500  0.016250
...                       ...       ...       ...
2013-01-01 15:00:00  1.420000  1.420000  0.096333
2013-01-01 18:00:00  1.178583  1.178583  0.083083
2013-01-01 21:00:00  0.898250  0.898250  0.077167
2013-01-02 00:00:00  0.860000  0.860000  0.075000


<div class="alert alert-info">
<b>REMEMBER</b>: <br><br>

`pd.read_csv` provides a lot of built-in functionality to support these kinds of operations when reading in a file! Check the help of the `read_csv` function...
</div>
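To try both routes without the data file at hand, the following sketch reads a tiny in-memory CSV. The column names `station_a` and `station_b` and the values are invented for illustration; only the `Time` column mirrors the real file:

```python
import io
import pandas as pd

csv_text = """Time,station_a,station_b
2009-01-01 00:00:00,0.14,0.10
2009-01-01 03:00:00,0.13,0.09
2009-01-01 06:00:00,0.11,0.09
"""

# Step by step: read, parse the date column, set it as the index
step = pd.read_csv(io.StringIO(csv_text))
step["Time"] = pd.to_datetime(step["Time"])
step = step.set_index("Time")

# In one call: let read_csv parse the first column as a datetime index
direct = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(direct.index)
print((step.index == direct.index).all())
```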

## The DatetimeIndex

When we ensure the DataFrame has a `DatetimeIndex`, time-series related functionality becomes available:

In [14]:
data.index


Similar to a Series with datetime data, there are some attributes of the timestamp values available:

In [None]:
data.index.day

Int64Index([ 1,  1,  1,  1,  1,  1,  1,  1,  2,  2,
            ...
            31,  1,  1,  1,  1,  1,  1,  1,  1,  2],
           dtype='int64', name='Time', length=11697)

In [None]:
data.index.dayofyear

Int64Index([  1,   1,   1,   1,   1,   1,   1,   1,   2,   2,
            ...
            366,   1,   1,   1,   1,   1,   1,   1,   1,   2],
           dtype='int64', name='Time', length=11697)

In [None]:
data.index.year

Int64Index([2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009,
            ...
            2012, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013],
           dtype='int64', name='Time', length=11697)
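These index attributes are useful for selecting or grouping rows. The sketch below runs on a small synthetic frame (the column name `flow` and the dates are assumptions, not the Maarkebeek data):

```python
import numpy as np
import pandas as pd

# Three days of 3-hourly values: Friday 6, Saturday 7 and Sunday 8 January 2012
index = pd.date_range("2012-01-06", periods=24, freq="3h", name="Time")
demo = pd.DataFrame({"flow": np.arange(24, dtype=float)}, index=index)

# weekday counts from Monday=0 to Sunday=6, so >= 5 selects the weekend
weekend = demo[demo.index.weekday >= 5]

# Number of records per day of the month
per_day = demo.groupby(demo.index.day).size()

print(weekend.head())
print(per_day)
```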

The `plot` method will also adapt its labels (when you zoom in, you can see the different levels of detail of the datetime labels):


In [None]:
%matplotlib notebook

In [None]:
data.plot()


<matplotlib.axes._subplots.AxesSubplot at 0x7f8924665d30>

We have too much data to sensibly plot on one figure. Let's see how we can easily select part of the data or aggregate the data to other time resolutions in the next sections.

## Selecting data from a time series

We can use label-based indexing on a time series as expected:

In [13]:
data[pd.Timestamp("2012-01-01 09:00"):pd.Timestamp("2012-01-01 19:00")]


But, for convenience, indexing a time series also works with strings:

In [12]:
data["2012-01-01 09:00":"2012-01-01 19:00"]


A nice feature is **"partial string" indexing**, where we can do implicit slicing by providing a partial datetime string.

E.g. all data of 2013:

In [11]:
data['2013']


Normally you would expect this to access a column named '2013', but for a DatetimeIndex, pandas also tries to interpret it as a datetime slice.

Or all data of January up to March 2012:

In [2]:
data['2012-01':'2012-03']

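Partial string indexing can be tried out on synthetic data as well. This sketch uses `.loc`, which makes the row selection explicit; recent pandas releases may no longer accept a single partial string such as `data['2013']` directly on a DataFrame, so `.loc` is the safer spelling. The frame and its `flow` column are invented for illustration:

```python
import numpy as np
import pandas as pd

# Daily values from November 2011 up to and including February 2013
index = pd.date_range("2011-11-01", "2013-02-28", freq="D", name="Time")
demo = pd.DataFrame({"flow": np.arange(len(index), dtype=float)}, index=index)

# All rows of one year
year_2012 = demo.loc["2012"]

# January up to and including March 2012 (both ends of the slice are included)
jan_to_mar = demo.loc["2012-01":"2012-03"]

# Open-ended slice: everything from 2013 onwards
from_2013 = demo.loc["2013":]

print(len(year_2012), len(jan_to_mar), len(from_2013))
```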

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
  <li>select all data starting from 2012</li>
</ul>
</div>

In [10]:
data['2012':]

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
  <li>select all data in January for all different years</li>
</ul>
</div>

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
  <li>select all data in April, May and June for all different years</li>
</ul>
</div>

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
  <li>select all 'daytime' data (between 8h and 20h) for all days</li>
</ul>
</div>

## The power of pandas: `resample`

A very powerful method is **`resample`: converting the frequency of the time series** (e.g. from hourly to daily data).

The time series has a frequency of 3 hours. We want to change this to daily:

In [None]:
data.resample('D').mean().head()

Other mathematical methods can also be specified:

In [None]:
data.resample('D').max().head()
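To see what `resample` computes, it helps to use numbers that can be checked by hand. A minimal sketch on synthetic 3-hourly data (the `flow` column is an assumption, not the discharge data):

```python
import numpy as np
import pandas as pd

# Two days of 3-hourly values: 0..7 on the first day, 8..15 on the second
index = pd.date_range("2012-01-01", periods=16, freq="3h", name="Time")
demo = pd.DataFrame({"flow": np.arange(16, dtype=float)}, index=index)

# One aggregated value per calendar day
daily_mean = demo.resample("D").mean()
daily_max = demo.resample("D").max()

# Several aggregations at once for a single column
daily_stats = demo["flow"].resample("D").agg(["min", "mean", "max"])

print(daily_mean)
print(daily_max)
print(daily_stats)
```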

<div class="alert alert-info">
<b>REMEMBER</b>: <br><br>

    The string to specify the new time frequency: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases <br><br>
            
    These strings can also be combined with numbers, e.g. `'10D'`...

</div>
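As a hedged illustration of combining numbers with frequency strings, the sketch below resamples a synthetic series of ones, so that each sum simply counts the records per bin. The lowercase `"h"` alias is an assumption about the installed pandas version:

```python
import numpy as np
import pandas as pd

# 30 days of 3-hourly records, each with the value 1
index = pd.date_range("2012-01-01", periods=30 * 8, freq="3h", name="Time")
flow = pd.Series(np.ones(len(index)), index=index, name="flow")

# Bins of 10 days: each bin holds 10 * 8 records
ten_daily = flow.resample("10D").sum()

# Bins of 6 hours: each bin holds 2 records
six_hourly = flow.resample("6h").sum()

print(ten_daily)
print(six_hourly.head())
```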



In [None]:
data.resample('M').mean().plot() # 10D

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
  <li>plot the monthly standard deviation of the columns</li>
</ul>
</div>

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
  <li>plot the monthly mean and median values for the years 2011-2012 for 'L06_347'<br><br></li>
</ul>
    
    **Note** <br>You can create a new figure with `fig, ax = plt.subplots()` and add each of the plots to the created `ax` object (see documentation of pandas plot function)
</div>

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
  <li>plot the monthly minimum and maximum daily average value of the 'LS06_348' column</li>
</ul>
</div>

<div class="alert alert-success">
<b>EXERCISE</b>:

 <ul>
  <li>make a bar plot of the mean of the stations in the year 2013 (Remark: create a `fig, ax = plt.subplots()` object and add the plot to the created ax)</li>
</ul>

</div>