# Working with Time Series

Pandas was developed in the context of financial modeling. As such, it contains a fairly extensive set of tools for working with dates, times, and time-indexed data. Date and time data comes in a few flavors:

- `time stamps` reference particular moments in time.
- `time intervals` and `periods` reference a length of time between a particular beginning and end point.
    - `periods` usually reference a special case of time intervals in which each interval is of uniform length and the intervals are ordered, sequential, and non-overlapping.
- `time deltas` or `durations` reference an exact length of time, independent of any specific date or time.

In this section, we will introduce how to work with each of these types of date/time data in Pandas.

## Dates and Times in Python

While the time series tools provided by Pandas tend to be the most useful for data science applications, it is helpful to see their relationship to other packages used in Python.

### Native Python dates and times: `datetime` and `dateutil`

Python's basic objects for working with dates and times reside in the `datetime` module. Together with the third-party `dateutil` module, you can use it to quickly perform a host of useful operations on dates and times.

In [2]:
from datetime import datetime
datetime(year=2015, month=7, day=4)

datetime.datetime(2015, 7, 4, 0, 0)

In [3]:
from dateutil import parser
date = parser.parse('4th of July, 2015')
date

datetime.datetime(2015, 7, 4, 0, 0)

With `datetime` objects, it is possible to perform conversions like the one below:

In [4]:
date.strftime('%A')

'Saturday'

`strftime` takes _standard string format codes_ as arguments and interprets them accordingly. In the example above, `%A` asks for **the day of the week**.
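
As a brief illustration of how several format codes combine, the sketch below formats the same date in three ways. The variable names are illustrative only, and the weekday and month names assume the default English (C) locale.

```python
from datetime import datetime

independence_day = datetime(2015, 7, 4)

# %A: full weekday name, %d: zero-padded day, %B: full month name, %Y: four-digit year
long_form = independence_day.strftime('%A, %d %B %Y')

# %Y-%m-%d gives the ISO 8601 calendar date
iso_form = independence_day.strftime('%Y-%m-%d')

# %j is the zero-padded day of the year
day_of_year = independence_day.strftime('%j')

print(long_form)
print(iso_form)
print(day_of_year)
```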

A related package to be aware of is `pytz`, which contains tools for working with the most migraine-inducing piece of time series data: _time zones_.


The power of `datetime` and `dateutil` lies in their flexibility and easy syntax: you can use these objects and their built-in methods to easily perform nearly any operation you might be interested in. Where they break down is when you wish to work with large arrays of dates and times.

Just as lists of Python numerical variables are suboptimal compared to _NumPy_ typed numerical arrays, lists of Python datetime objects are suboptimal compared to typed arrays of encoded dates.

### Typed arrays of times: *NumPy*

The weakness of Python's datetime format inspired the *NumPy* team to add a set of native time series data types to *NumPy*. The `datetime64` dtype encodes dates as 64-bit integers, thus allowing arrays of dates to be represented very compactly.

In [5]:
import numpy as np
date = np.array('2105-07-04', dtype=np.datetime64)
date

array('2105-07-04', dtype='datetime64[D]')

Once we have formatted the date, we can quickly perform vectorized operations on it.

In [6]:
date + np.arange(12)

array(['2105-07-04', '2105-07-05', '2105-07-06', '2105-07-07',
       '2105-07-08', '2105-07-09', '2105-07-10', '2105-07-11',
       '2105-07-12', '2105-07-13', '2105-07-14', '2105-07-15'],
      dtype='datetime64[D]')

Because of the uniform type in *NumPy* `datetime64` arrays, this type of operation can be accomplished more quickly than if we were working directly with Python's native `datetime` objects, especially as the arrays increase in size.
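
Subtraction is vectorized in the same way. The sketch below, whose variable names are illustrative rather than taken from the text, suggests how the difference between `datetime64` values comes back as `timedelta64` values, both for a scalar and for an array.

```python
import numpy as np

start = np.datetime64('2015-01-01')
end = np.datetime64('2015-07-04')

# Subtracting two day-resolution dates gives a timedelta64 counted in days
elapsed = end - start

# The same subtraction broadcasts over an array of dates
days = start + np.arange(3)   # 2015-01-01, 2015-01-02, 2015-01-03
remaining = end - days        # one timedelta64 per element

print(elapsed)
print(remaining)
```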

One detail of the `datetime64` and `timedelta64` objects is that they are built on a _fundamental time unit_. Because the `datetime64` object is limited to 64-bit precision, the range of encodable times is $2^{64}$ times this fundamental unit. In other words, `datetime64` imposes a trade-off between _time resolution_ and _maximum time span_.

> If you want a _time resolution_ of 1 nanosecond, you only have enough information to encode a range of $2^{64}$ nanoseconds, or just under 600 years. _NumPy_ will infer the desired unit from the input.

In [7]:
np.datetime64('2015-07-04')

numpy.datetime64('2015-07-04')

In [9]:
np.datetime64('2015-07-04 12:00')

numpy.datetime64('2015-07-04T12:00')

Note that `datetime64` values carry no time zone information: NumPy 1.11 and later treat them as time zone naive, whereas older releases interpreted the input as local time on the computer executing the code. You can force any desired fundamental unit using one of many format codes. The example below forces a nanosecond-based time.

In [10]:
np.datetime64('2015-07-04 12:59:59.50', 'ns')

numpy.datetime64('2015-07-04T12:59:59.500000000')

For the types of data we see in the real world, a useful default is `datetime64[ns]`, as it can encode a useful range of modern dates with a suitably fine precision.
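
The range behind this recommendation can be checked with a little arithmetic. The sketch below is plain Python rather than a NumPy API: the helper `span_in_years` and the 365.25-day year are assumptions made for illustration, and the code simply converts $2^{64}$ ticks of a given unit into years.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed average year length

# Length of one tick, in seconds, for a few datetime64 units
TICK_SECONDS = {'ns': 1e-9, 'us': 1e-6, 'ms': 1e-3, 's': 1.0}

def span_in_years(unit):
    """Total span, in years, covered by 2**64 ticks of the given unit."""
    return 2**64 * TICK_SECONDS[unit] / SECONDS_PER_YEAR

for unit in TICK_SECONDS:
    print(f"{unit}: about {span_in_years(unit):,.0f} years")
```

Because the values are signed offsets from the 1970 epoch, the nanosecond span is split roughly evenly before and after that date.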

Finally, we will note that while the `datetime64` data type addresses some of the deficiencies of the built-in Python `datetime` type, it lacks many of the convenient methods and functions provided by `datetime` and especially `dateutil`.

### Dates and Times in Pandas: The Best of Python/NumPy Worlds

Pandas builds on all the tools just discussed to provide a `Timestamp` object, which combines the ease of use of `datetime` and `dateutil` with the efficient storage and vectorized interface of `numpy.datetime64`. From a group of these `Timestamp` objects, Pandas can construct a `DatetimeIndex` that can be used to index data in a `Series` or `DataFrame`.

In [11]:
import pandas as pd
date = pd.to_datetime("4th of July, 2015")
date

Timestamp('2015-07-04 00:00:00')

In [12]:
date.strftime('%A')

'Saturday'

Additionally, we can do NumPy-style vectorized operations directly.

In [13]:
date + pd.to_timedelta(np.arange(12), 'D')

DatetimeIndex(['2015-07-04', '2015-07-05', '2015-07-06', '2015-07-07',
               '2015-07-08', '2015-07-09', '2015-07-10', '2015-07-11',
               '2015-07-12', '2015-07-13', '2015-07-14', '2015-07-15'],
              dtype='datetime64[ns]', freq=None)

### Pandas Time Series: Indexing by Time

Where the Pandas time series tools really become useful is when you begin to _index data by timestamps_. For example, we can construct a `Series` object that has time-indexed data.

In [14]:
index = pd.DatetimeIndex(['2014-07-04', '2014-08-04',
                          '2015-07-04', '2015-08-04'])

Once this index is attached to a `Series`, we can make use of any of the `Series` indexing patterns we discussed in previous sections, passing values that can be coerced into dates.

In [15]:
data = pd.Series([0, 1, 2, 3], index=index)

In [16]:
data['2014-07-04' : '2015-07-04']

2014-07-04    0
2014-08-04    1
2015-07-04    2
dtype: int64

There are additional special date-only indexing operations, such as passing a year to obtain a slice of all data from that year.

In [17]:
data['2015']

2015-07-04    2
2015-08-04    3
dtype: int64
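
The same idea extends to finer partial strings. The following sketch rebuilds the `data` series from above so that it runs on its own, then selects by year and month using `.loc`; the names `july_2014` and `aug_to_july` are illustrative only.

```python
import pandas as pd

index = pd.DatetimeIndex(['2014-07-04', '2014-08-04',
                          '2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 2, 3], index=index)

# A year-month string selects every entry within that month
july_2014 = data.loc['2014-07']

# Partial strings also work as slice endpoints; both ends are inclusive
aug_to_july = data.loc['2014-08':'2015-07']

print(july_2014)
print(aug_to_july)
```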

### Pandas Time Series Data Structures

This section will introduce the fundamental Pandas data structures for working with time series data.

- For _time stamps_, Pandas provides the `Timestamp` type. As mentioned before, it is essentially a replacement for Python's native `datetime`, but is based on the more efficient `numpy.datetime64` data type. The associated index structure is `DatetimeIndex`.
- For _time periods_, Pandas provides the `Period` type. This encodes a fixed-frequency interval based on `numpy.datetime64`. The associated index structure is `PeriodIndex`.
- For _time deltas_ or _durations_, Pandas provides the `Timedelta` type. `Timedelta` is a more efficient replacement for Python's native `datetime.timedelta` type, and is based on `numpy.timedelta64`. The associated index structure is `TimedeltaIndex`.

The most fundamental of these date/time objects are the `Timestamp` and `DatetimeIndex` objects. While these classes can be instantiated directly, it is more common to use the `pd.to_datetime()` function, which can parse a wide variety of formats. Passing a single date to `pd.to_datetime()` yields a `Timestamp`; passing a list of dates by default yields a `DatetimeIndex`.

In [18]:
dates = pd.to_datetime([datetime(2015, 7, 3), '4th of July, 2015',
                        '2015-Jul-6', '07-07-2015', '20150708'])

dates

DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-06', '2015-07-07',
               '2015-07-08'],
              dtype='datetime64[ns]', freq=None)
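
The return type of `pd.to_datetime()` follows the type of its input. The sketch below, with illustrative variable names, compares a scalar, a list, and a `Series`; it prints the type names rather than asserting a particular resolution, since the stored dtype may vary between Pandas versions.

```python
import pandas as pd

single = pd.to_datetime('2015-07-04')                             # scalar input
several = pd.to_datetime(['2015-07-04', '2015-07-05'])            # list input
column = pd.to_datetime(pd.Series(['2015-07-04', '2015-07-05']))  # Series input

print(type(single).__name__)
print(type(several).__name__)
print(type(column).__name__, column.dtype)
```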

Any `DatetimeIndex` can be converted to a `PeriodIndex` using the `to_period()` method together with a frequency code; here we'll use `D` to indicate daily frequency.

In [19]:
dates.to_period('D')

PeriodIndex(['2015-07-03', '2015-07-04', '2015-07-06', '2015-07-07',
             '2015-07-08'],
            dtype='period[D]')
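
To see what a single period represents, it can help to inspect its boundaries. This is a minimal sketch using a monthly `Period`; the variable names are illustrative and not part of the original example.

```python
import pandas as pd

month = pd.Period('2015-07', freq='M')

print(month.start_time)   # first instant of the period
print(month.end_time)     # last representable instant of the period

# A timestamp that falls inside the month maps to that same period
stamp = pd.Timestamp('2015-07-04')
same_period = stamp.to_period('M') == month
print(same_period)
```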

A `TimedeltaIndex` is created, for example, when one date is subtracted from another.

In [20]:
dates - dates[0]

TimedeltaIndex(['0 days', '1 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq=None)
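
The scalar counterpart behaves the same way. As a small sketch with illustrative names, subtracting two `Timestamp` objects gives a `Timedelta`, which can be inspected or added back to a timestamp.

```python
import pandas as pd

delta = pd.Timestamp('2015-07-08') - pd.Timestamp('2015-07-03')

print(delta)                  # a pandas Timedelta
print(delta.days)             # whole-day component
print(delta.total_seconds())  # full duration in seconds

# Timedelta values can be added back to timestamps
later = pd.Timestamp('2015-07-03') + pd.Timedelta(days=5)
print(later)
```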

### Regular Sequences: `pd.date_range()`

To make the creation of regular date sequences more convenient, Pandas offers a few functions for this purpose: `pd.date_range()` for timestamps, `pd.period_range()` for periods, and `pd.timedelta_range()` for time deltas. We've seen that Python's `range()` and NumPy's `np.arange()` turn a start point, end point, and optional step size into a sequence. Similarly, `pd.date_range()` accepts a start date, an end date, and an optional frequency code to create a regular sequence of dates.

By default, the frequency is one day.

In [21]:
pd.date_range('2015-07-03', '2015-07-10')

DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06',
               '2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'],
              dtype='datetime64[ns]', freq='D')

Alternatively, the date range can be specified not with a start and endpoint, but with a startpoint and a number of periods.

In [22]:
pd.date_range('2015-07-03', periods=8)

DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06',
               '2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'],
              dtype='datetime64[ns]', freq='D')

The spacing can be modified by altering the `freq` argument, which defaults to `D`. For example, here we will construct a range of hourly timestamps.

In [23]:
pd.date_range('2015-07-03', periods=8, freq='H')

DatetimeIndex(['2015-07-03 00:00:00', '2015-07-03 01:00:00',
               '2015-07-03 02:00:00', '2015-07-03 03:00:00',
               '2015-07-03 04:00:00', '2015-07-03 05:00:00',
               '2015-07-03 06:00:00', '2015-07-03 07:00:00'],
              dtype='datetime64[ns]', freq='H')
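
A start date, an end date, and a frequency can also be combined. The sketch below uses the `B` (business day) code, which does not appear elsewhere in this section and is brought in here only as an illustration, to skip the weekend inside the same date span used earlier.

```python
import pandas as pd

# 2015-07-04 and 2015-07-05 fall on a weekend, so they are left out
business_days = pd.date_range('2015-07-03', '2015-07-10', freq='B')

print(business_days)
print(len(business_days))
```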

To create regular sequences of `Period` or `Timedelta` values, the very similar `pd.period_range()` and `pd.timedelta_range()` functions are useful. Here are some monthly periods.

In [24]:
pd.period_range('2015-07', periods=8, freq='M')

PeriodIndex(['2015-07', '2015-08', '2015-09', '2015-10', '2015-11', '2015-12',
             '2016-01', '2016-02'],
            dtype='period[M]')

And a sequence of durations increasing by an hour.

In [25]:
pd.timedelta_range(0, periods=10, freq='H')

TimedeltaIndex(['0 days 00:00:00', '0 days 01:00:00', '0 days 02:00:00',
                '0 days 03:00:00', '0 days 04:00:00', '0 days 05:00:00',
                '0 days 06:00:00', '0 days 07:00:00', '0 days 08:00:00',
                '0 days 09:00:00'],
               dtype='timedelta64[ns]', freq='H')

### Frequencies and Offsets

Fundamental to these Pandas time series tools is the concept of frequency or date offset. Just as we saw the `D` (day) and `H` (hour) codes above, we can use such codes to specify any desired frequency spacing.

The monthly, quarterly, and annual frequencies are all marked at the end of the specified period. Adding an `S` suffix to any of these marks them at the beginning instead.
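
The difference between end-of-period and start-of-period marking can be sketched for the monthly case. Here `MS` is the month-start code; the month-end frequency is passed as a `pd.offsets.MonthEnd()` object instead of a string, a choice made in this sketch because the string alias for month end is not the same across Pandas versions.

```python
import pandas as pd

# 'MS' (month start) marks the first day of each month
month_starts = pd.date_range('2015-01-01', periods=3, freq='MS')

# The month-end frequency, given as an offset object rather than a string alias
month_ends = pd.date_range('2015-01-01', periods=3, freq=pd.offsets.MonthEnd())

print(month_starts)
print(month_ends)
```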