
# Working with Time Series

Pandas was developed in the context of financial modelling, so it contains an extensive set of tools for working with dates, times, and time-indexed data.
<br>
Date and time data comes in a few flavours:
1. *Time stamps* reference particular moments in time (e.g. July 4th, 2015 at 7:00am).
2. *Time intervals* and *periods* reference a length of time between a particular start point and end point (e.g. the year 2015). *Periods* are a special case of time intervals in which each interval has the same length and the intervals do not overlap (e.g. the 24-hour periods that make up days).
3. *Time deltas* or *durations* reference an exact length of time (e.g. a duration of 22.34 seconds).

This short section is not a complete guide to the time series tools in Pandas; it is a broad overview of how to start working with time series data in Pandas.

## Dates and Times in Pandas

The Python world has a number of available representations of dates, times, deltas, and timespans. While the time series tools provided by Pandas tend to be the most useful for data science applications, it is helpful to see their relationship to other packages used in Python.

### Native Python dates and times: `datetime` and `dateutil`

Python's basic objects for working with dates and times reside in the built-in `datetime` module. Along with the third-party `dateutil` module, you can use it to perform a host of useful operations on dates and times.

For example, we can manually build a date using the `datetime` type:

In [None]:
from datetime import datetime
datetime(year=2023,month=7,day=6)

datetime.datetime(2023, 7, 6, 0, 0)

Or, using the `dateutil` module, we can parse dates from a variety of string formats:

In [None]:
from dateutil import parser
date=parser.parse('15th August 1947')
date

datetime.datetime(1947, 8, 15, 0, 0)

Once you have a `datetime` object, you can do things like printing the day of the week:

In [None]:

# '%A' is the standard format code for the full weekday name
date.strftime('%A')

'Friday'

A related package to be aware of is `pytz`, which contains tools for working with the most migraine-inducing piece of time series data: time zones.

The power of `datetime` and `dateutil` lies in their flexibility and easy syntax: you can use these objects and their built-in methods to perform nearly any operation you might be interested in.

Where they break down is when you want to work with large arrays of dates and times.
<br>
Just as lists of Python numerical values are suboptimal compared to NumPy-style typed numerical arrays, lists of Python `datetime` objects are suboptimal compared to typed arrays of encoded dates.
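As a rough illustration of the storage difference, the sketch below converts a small list of Python `datetime` objects into a typed NumPy array. The variable names and the choice of microsecond resolution are assumptions made for this example only.

```python
from datetime import datetime, timedelta
import numpy as np

# A plain Python list: every element is a separate datetime object
py_dates = [datetime(2023, 7, 6) + timedelta(days=i) for i in range(5)]

# The same dates as a typed array: every element is one 64-bit integer
np_dates = np.array(py_dates, dtype='datetime64[us]')

print(type(py_dates[0]))   # the Python object type stored in the list
print(np_dates.dtype)      # the single dtype shared by the whole array
print(np_dates.itemsize)   # bytes used per element
```

Because every element shares one fixed-width encoding, NumPy can operate on the whole array at once instead of looping over objects.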

### Typed arrays of times: Numpy's `datetime64`

The weaknesses of Python's `datetime` format inspired the NumPy team to add a set of native time series data types to NumPy.
The `datetime64` dtype encodes dates as 64-bit integers, and thus allows arrays of dates to be represented very compactly.

The `datetime64` dtype requires a very specific input format:

In [None]:
import numpy as np
date=np.array('2023-06-04',dtype=np.datetime64)
date

array('2023-06-04', dtype='datetime64[D]')

Once we have a date in this format, we can quickly do vectorised operations on it:

In [None]:
date+np.arange(10)

array(['2023-06-04', '2023-06-05', '2023-06-06', '2023-06-07',
       '2023-06-08', '2023-06-09', '2023-06-10', '2023-06-11',
       '2023-06-12', '2023-06-13'], dtype='datetime64[D]')

Because of the uniform type in NumPy `datetime64` arrays, this kind of operation can be performed much more quickly than with Python's standard `datetime` objects, especially as arrays get large.

One detail of the `datetime64` and `timedelta64` objects is that they are built on a *fundamental time unit*. Because the `datetime64` object is limited to 64-bit precision, the range of encodable times is $2^{64}$ times this fundamental unit. In other words, `datetime64` imposes a trade-off between *time resolution* and *maximum timespan*.

For example, if you want a time resolution of one nanosecond, you only have enough information to encode a range of $2^{64}$ nanoseconds, or just under 600 years. NumPy will infer the desired unit from the input.
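The "just under 600 years" figure can be checked with a few lines of plain arithmetic. This is a back-of-the-envelope sketch that assumes a year of 365.25 days; the variable names are illustrative.

```python
# Number of distinct values a 64-bit integer can hold
n_values = 2 ** 64

# Approximate number of seconds in a year of 365.25 days
seconds_per_year = 365.25 * 24 * 60 * 60

# Total span in years when each step is one nanosecond
span_years_ns = n_values * 1e-9 / seconds_per_year
print(f"{span_years_ns:.1f} years in total")
print(f"about +/- {span_years_ns / 2:.0f} years around the reference date")
```

Halving the total span gives the roughly 292 years on either side of the reference date that nanosecond resolution allows.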

In [None]:

#day based time
np.datetime64('2012-12-15')

numpy.datetime64('2012-12-15')

In [None]:
#minute based time
np.datetime64('2012-12-15 12:00')

numpy.datetime64('2012-12-15T12:00')

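Note that `datetime64` values are time zone naive: since NumPy 1.11, no local time zone offset is applied when parsing or printing them.

You can force any desired fundamental unit using one of the many format codes; for example, here we force a nanosecond-based time: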

In [None]:
#nano second based time
np.datetime64('2014-04-14 12:56:56.50','ns')

numpy.datetime64('2014-04-14T12:56:56.500000000')

The following table lists the available format codes along with the relative and absolute timespans that they can encode.

|Code  | Meaning     | Time span (relative) | Time span (absolute)   |
|------|-------------|----------------------|------------------------|
| `Y`  | Year        | ± 9.2e18 years       | [9.2e18 BC, 9.2e18 AD] |
| `M`  | Month       | ± 7.6e17 years       | [7.6e17 BC, 7.6e17 AD] |
| `W`  | Week        | ± 1.7e17 years       | [1.7e17 BC, 1.7e17 AD] |
| `D`  | Day         | ± 2.5e16 years       | [2.5e16 BC, 2.5e16 AD] |
| `h`  | Hour        | ± 1.0e15 years       | [1.0e15 BC, 1.0e15 AD] |
| `m`  | Minute      | ± 1.7e13 years       | [1.7e13 BC, 1.7e13 AD] |
| `s`  | Second      | ± 2.9e11 years       | [2.9e11 BC, 2.9e11 AD] |
| `ms` | Millisecond | ± 2.9e8 years        | [ 2.9e8 BC, 2.9e8 AD]  |
| `us` | Microsecond | ± 2.9e5 years        | [290301 BC, 294241 AD] |
| `ns` | Nanosecond  | ± 292 years          | [ 1678 AD, 2262 AD]    |
| `ps` | Picosecond  | ± 106 days           | [ 1969 AD, 1970 AD]    |
| `fs` | Femtosecond | ± 2.6 hours          | [ 1969 AD, 1970 AD]    |
| `as` | Attosecond  | ± 9.2 seconds        | [ 1969 AD, 1970 AD]    |

The `datetime64[ns]` type is a useful default for real-world data, as it can encode a useful range of modern dates with suitably fine precision.

Finally, note that while NumPy's `datetime64` data type addresses some of the deficiencies of the built-in Python `datetime` type, it lacks many of the convenient methods and functions provided by `datetime` and especially `dateutil`.
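One way to see this gap is that a `datetime64` scalar has no `strftime` method of its own. The sketch below, with illustrative variable names, converts a minute-resolution value back to a standard-library `datetime` via `.item()` to regain that method. This conversion returns a `datetime` only for units of microseconds or coarser.

```python
import numpy as np

# A minute-resolution NumPy datetime
np_date = np.datetime64('2012-12-15T12:00')

# .item() hands back a standard-library datetime object,
# which makes strftime and the other datetime methods available
py_date = np_date.item()
print(type(py_date))
print(py_date.strftime('%A'))
```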

### Date and times in Pandas: The best of both worlds

Pandas builds upon all the tools just discussed to provide a `Timestamp` object, which combines the ease of use of `datetime` and `dateutil` with the efficient storage and vectorised interface of NumPy's `datetime64`.

From a group of these `Timestamp` objects, Pandas can construct a `DatetimeIndex` that can be used to index data in a `Series` or `DataFrame`.

For example, we can use Pandas tools to repeat the demonstration from above. We parse a flexibly formatted string date and use format codes to output the day of the week.

In [None]:
import pandas as pd
date=pd.to_datetime('19th July 1920')
date

Timestamp('1920-07-19 00:00:00')

In [None]:
date.strftime('%A')

'Monday'

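We can also perform NumPy-style vectorised operations directly on this same object. Note that `pd.to_timedelta` reads bare integers as nanoseconds unless a unit is given, so the result here advances in nanosecond steps: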

In [None]:
date+pd.to_timedelta(np.arange(12))

DatetimeIndex([          '1920-07-19 00:00:00',
               '1920-07-19 00:00:00.000000001',
               '1920-07-19 00:00:00.000000002',
               '1920-07-19 00:00:00.000000003',
               '1920-07-19 00:00:00.000000004',
               '1920-07-19 00:00:00.000000005',
               '1920-07-19 00:00:00.000000006',
               '1920-07-19 00:00:00.000000007',
               '1920-07-19 00:00:00.000000008',
               '1920-07-19 00:00:00.000000009',
               '1920-07-19 00:00:00.000000010',
               '1920-07-19 00:00:00.000000011'],
              dtype='datetime64[ns]', freq=None)
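Bare integers passed to `pd.to_timedelta` are read as nanoseconds. To step in whole days instead, a unit can be given explicitly. The following is a minimal sketch; the three-element range is an arbitrary choice for brevity.

```python
import numpy as np
import pandas as pd

date = pd.Timestamp('1920-07-19')

# unit='D' tells pandas to read the integers as whole days
next_days = date + pd.to_timedelta(np.arange(3), unit='D')
print(next_days)
```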

In the following sections we'll take a closer look at manipulating time series data with the tools Pandas provides.

## Pandas Time Series: Indexing by time

The Pandas time series tools become useful when we *index* data using *timestamps*.


For example, we can construct a `Series` object that has time-indexed data:

In [None]:
index=pd.DatetimeIndex(['2014-07-04', '2014-08-04',
                          '2015-07-04', '2015-08-04'])
data=pd.Series(np.arange(4),index=index)
data

2014-07-04    0
2014-08-04    1
2015-07-04    2
2015-08-04    3
dtype: int64

Now that we have this data in a `Series`, we can use any of the `Series` indexing patterns, passing values that can be coerced into dates:

In [None]:
data['2014-07-04':'2015-07-04']

2014-07-04    0
2014-08-04    1
2015-07-04    2
dtype: int64

There are also special date-only indexing operations, such as passing a year to obtain a slice of all the data from that year:

In [None]:
data['2014']

2014-07-04    0
2014-08-04    1
dtype: int64
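The same partial-string matching also works at finer resolutions and through `.loc`. The sketch below rebuilds the small series from above and selects by month; treat it as an illustration rather than an exhaustive tour of the indexing rules.

```python
import numpy as np
import pandas as pd

index = pd.DatetimeIndex(['2014-07-04', '2014-08-04',
                          '2015-07-04', '2015-08-04'])
data = pd.Series(np.arange(4), index=index)

# A year-and-month string selects every row inside that month
july_2015 = data.loc['2015-07']
print(july_2015)

# A partial-string slice includes both end points
aug_to_july = data.loc['2014-08':'2015-07']
print(aug_to_july)
```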

Later we'll see more examples of the convenience of dates as indices, but first let's look at the available time series data structures.

## Pandas Time Series Data Structures

This section introduces the fundamental Pandas data structures for working with time series data:
- For *time stamps*, Pandas provides the `Timestamp` type. It is essentially a replacement for Python's native `datetime`, but is based on the more efficient `numpy.datetime64` data type. The associated index structure is `DatetimeIndex`.
- For *time periods*, Pandas provides the `Period` type. This encodes a fixed-frequency interval based on `numpy.datetime64`. The associated index structure is `PeriodIndex`.
- For *time deltas* or *durations*, Pandas provides the `Timedelta` type. `Timedelta` is a more efficient replacement for Python's native `datetime.timedelta` type and is based on `numpy.timedelta64`. The associated index structure is `TimedeltaIndex`.
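The three scalar types in the list above can be sketched side by side. The particular dates and durations here are arbitrary values picked for illustration.

```python
import pandas as pd

ts = pd.Timestamp('2015-07-04 12:00')    # a single moment in time
per = pd.Period('2015-07', freq='M')     # the whole month of July 2015
delta = pd.Timedelta(days=1, hours=6)    # an exact duration

# A period has a start and an end; this timestamp falls inside it
print(per.start_time <= ts <= per.end_time)

# Adding a duration to a timestamp gives another timestamp
print(ts + delta)
```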

The most fundamental of these date/time objects are the `Timestamp` and `DatetimeIndex` objects. While these class objects can be invoked directly, it is more common to use the `pd.to_datetime()` function, which can parse a wide variety of formats.

Passing a single date to `pd.to_datetime()` yields a `Timestamp`; passing a series of dates by default yields a `DatetimeIndex`:

In [None]:
dates=pd.to_datetime([datetime(2015,7,3),'5th of March, 2018','17-06-2003','2019-Mar-8','20231218'])
dates


DatetimeIndex(['2015-07-03', '2018-03-05', '2003-06-17', '2019-03-08',
               '2023-12-18'],
              dtype='datetime64[ns]', freq=None)

Any `DatetimeIndex` can be converted to a `PeriodIndex` with the `to_period()` function and a frequency code; here we use `'D'` to indicate daily frequency.

In [None]:
dates.to_period(freq='D')

PeriodIndex(['2015-07-03', '2018-03-05', '2003-06-17', '2019-03-08',
             '2023-12-18'],
            dtype='period[D]')

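A `TimedeltaIndex` is created, for example, when one date is subtracted from a whole `DatetimeIndex`. Subtracting two individual timestamps, as here, gives a single `Timedelta`: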

In [None]:
dates[1]- dates[0]

Timedelta('976 days 00:00:00')
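To get a full `TimedeltaIndex` rather than a single `Timedelta`, one timestamp can be subtracted from the entire index. This is a minimal sketch that uses a shortened, ISO-formatted list of dates chosen for illustration.

```python
import pandas as pd

dates = pd.to_datetime(['2015-07-03', '2018-03-05', '2003-06-17'])

# Subtracting one timestamp from the whole index gives a TimedeltaIndex
elapsed = dates - dates[0]
print(elapsed)

# The .days attribute exposes the whole-day component of each duration
print(elapsed.days.tolist())
```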

### Regular sequences: `pd.date_range()`

To make the creation of regular date sequences more convenient, Pandas offers a few functions for this purpose:
- `pd.date_range()` for timestamps
- `pd.period_range()` for periods
- `pd.timedelta_range()` for time deltas

We've seen that Python's `range()` and NumPy's `np.arange()` turn a start point, an end point, and an optional step size into a sequence. Similarly, `pd.date_range()` accepts a start date, an end date, and an optional frequency code to create a regular sequence of dates. By default, the frequency is one day.

In [None]:
daty=pd.date_range('2023-12-21','2023-12-31')
daty

DatetimeIndex(['2023-12-21', '2023-12-22', '2023-12-23', '2023-12-24',
               '2023-12-25', '2023-12-26', '2023-12-27', '2023-12-28',
               '2023-12-29', '2023-12-30', '2023-12-31'],
              dtype='datetime64[ns]', freq='D')

Instead of specifying the start and end dates, we can specify the start date and a number of periods.

In [None]:
d=pd.date_range('2024-06-12',periods=10)
d

DatetimeIndex(['2024-06-12', '2024-06-13', '2024-06-14', '2024-06-15',
               '2024-06-16', '2024-06-17', '2024-06-18', '2024-06-19',
               '2024-06-20', '2024-06-21'],
              dtype='datetime64[ns]', freq='D')

The spacing defaults to daily (`'D'`) and can be changed with the `freq` argument.

For example, we can get hourly timestamps with `freq='H'`:

In [None]:
s=pd.date_range('2012-04-14', periods=8,freq='H')
s

DatetimeIndex(['2012-04-14 00:00:00', '2012-04-14 01:00:00',
               '2012-04-14 02:00:00', '2012-04-14 03:00:00',
               '2012-04-14 04:00:00', '2012-04-14 05:00:00',
               '2012-04-14 06:00:00', '2012-04-14 07:00:00'],
              dtype='datetime64[ns]', freq='H')

`pd.period_range()` and `pd.timedelta_range()` can be used in much the same way as `pd.date_range()`. Here are some monthly periods:

In [None]:
p=pd.period_range('2019-06-14',periods=7,freq='M')
p

PeriodIndex(['2019-06', '2019-07', '2019-08', '2019-09', '2019-10', '2019-11',
             '2019-12'],
            dtype='period[M]')

And here is a sequence of durations increasing by an hour:

In [None]:
de=pd.timedelta_range(0,periods=6,freq='H')
de

TimedeltaIndex(['0 days 00:00:00', '0 days 01:00:00', '0 days 02:00:00',
                '0 days 03:00:00', '0 days 04:00:00', '0 days 05:00:00'],
               dtype='timedelta64[ns]', freq='H')

All of these require an understanding of Pandas frequency codes, which are summarised in the next section.

## Frequencies and Offsets

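Fundamental to these Pandas time series tools is the concept of a frequency code or date offset. Just as we used the codes `D` for day and `H` for hour above, such codes can specify any desired frequency spacing. The table below summarises the main codes. Recent pandas releases rename several of them (for example `h` for hours and `min` for minutes), so check the documentation for the version you are running.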

| Code | Description       | Code | Description          |
|------|-------------------|------|----------------------|
| `D`  | Calendar day      | `B`  | Business day         |
| `W`  | Weekly            |      |                      |
| `M`  | Month end         | `BM` | Business month end   |
| `Q`  | Quarter end       | `BQ` | Business quarter end |
| `A`  | Year end          | `BA` | Business year end    |
| `H`  | Hours             | `BH` | Business hours       |
| `T`  | Minutes           |      |                      |
| `S`  | Seconds           |      |                      |
| `L`  | Milliseconds      |      |                      |
| `U`  | Microseconds      |      |                      |
| `N`  | Nanoseconds       |      |                      |

The monthly, quarterly, and annual frequencies are all marked at the end of the specified period. Adding an `S` suffix to any of these causes them to be marked at the beginning instead.

| Code  | Description       | Code  | Description            |
|-------|-------------------|-------|------------------------|
| `MS`  | Month start       |`BMS`  | Business month start   |
| `QS`  | Quarter start     |`BQS`  | Business quarter start |
| `AS`  | Year start        |`BAS`  | Business year start    |
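As a small illustration of the start-of-period codes, the sketch below uses `'MS'` with a start date that deliberately falls mid-month, so the first generated value rolls forward to the next month start.

```python
import pandas as pd

# 'MS' marks the first calendar day of each month
month_starts = pd.date_range('2023-01-15', periods=3, freq='MS')
print(month_starts)
```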

Additionally, you can change the month used to mark any quarterly or annual code by adding a three-letter month code as a suffix:
- `Q-JAN`, `BQ-FEB`, `QS-MAR`, etc.
- `A-JAN`, `BA-MAR`, `AS-DEC`, etc.

In the same way, the split point of the weekly frequency can be changed by adding a three-letter weekday code:
- `W-MON`, `W-SUN`, etc.
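For instance, an anchored weekly code can be sketched as follows. The start date is an arbitrary Monday picked for the example.

```python
import pandas as pd

# Weekly frequency anchored on Mondays
mondays = pd.date_range('2024-01-01', periods=3, freq='W-MON')
print(mondays)
print(mondays.day_name().tolist())
```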

On top of this, codes can be combined with numbers to specify other frequencies. For example, for a frequency of 2 hours and 30 minutes, we can combine the hour (`H`) and minute (`T`) codes as follows:

In [None]:
f=pd.date_range('2014-05-14',periods=6,freq='2H30T')
f

DatetimeIndex(['2014-05-14 00:00:00', '2014-05-14 02:30:00',
               '2014-05-14 05:00:00', '2014-05-14 07:30:00',
               '2014-05-14 10:00:00', '2014-05-14 12:30:00'],
              dtype='datetime64[ns]', freq='150T')
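An alternative that does not depend on the spelling of the string codes is to pass a `pd.Timedelta` as the frequency. This is offered as a sketch of one possible approach, not as a replacement for the code-based form.

```python
import pandas as pd

# A Timedelta expresses the same 2 hour 30 minute step explicitly
step = pd.Timedelta(hours=2, minutes=30)
times = pd.date_range('2014-05-14', periods=3, freq=step)
print(times)
```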

All of these short codes refer to specific instances of Pandas time series offsets, which can be found in the `pd.tseries.offsets` module. A business day offset, for example, can be created directly from that module's `BDay` class.
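A minimal sketch of creating a business day offset directly is shown here. The start date is an assumption chosen so that the range spans a weekend.

```python
import pandas as pd
from pandas.tseries.offsets import BDay

# Five consecutive business days starting from Wednesday 1 July 2015
bdays = pd.date_range('2015-07-01', periods=5, freq=BDay())
print(bdays)
```

The weekend of 4 and 5 July is skipped, so the range runs from Wednesday through the following Tuesday.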

## Resampling, Shifting and Windowing

The ability to use dates and times as indices to intuitively organise and access data is an important piece of the Pandas time series tools. The benefits of indexed data in general (automatic alignment during operations, intuitive data slicing and access, etc.) still apply, and Pandas provides several additional time series specific operations.

We'll take a look at a few of them here, using stock price data as an example. Because Pandas was developed largely in a finance context, it includes some very specific tools for financial data.

For example, the `pandas-datareader` package (installable via `conda install pandas-datareader`) has been able to import financial data from a number of sources, including Yahoo Finance and Google Finance.
Here we request Google's closing price history. In the run recorded below, the `'google'` data source is not implemented, so the call raises `NotImplementedError`.

In [None]:
from pandas_datareader import data
goog =data.DataReader('GOOG',start='2004',end='2016',data_source='google')
goog.head()

NotImplementedError: ignored