# Working with Time Series Data

## Import Statements

--------------------------

In [1]:
# import statements
import numpy as np
import pandas as pd

In [2]:
# view options
pd.set_option("display.max_columns", 40)
pd.set_option("display.max_rows", 6)

---------------------------

## Reading Time Data and Handling Timezone Information

-----------------------

### *Basic information and know-how about timezones*

Coordinated Universal Time (UTC) is the time standard at 0 degrees longitude. It has a very useful property: it is monotonically increasing. Local timezones do not share this property. For example, Salt Lake City, Utah is in the America/Denver timezone, which is offset from UTC by 6 or 7 hours depending on the time of year (due to daylight saving time). Thus a single timezone may contain more than one offset.

Some terminology: a time without a timezone or offset is called a "naive" time. A time specified in local time is also called "civil time" or "wall time".

Timezones that observe daylight saving time can have "ambiguous times" in the fall, when the clocks go back and one hour of wall time occurs twice. For this reason, if you are dealing with **local times**, you will want three things: **the time, the timezone, and an offset**. If you are only concerned with **duration**, you can just use **UTC time** or seconds since the **UNIX epoch**.
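
As a rough illustration of an ambiguous time, the sketch below localizes the same wall-clock reading twice in America/Denver on the night the clocks went back in 2015. The variable names are only illustrative; the `ambiguous` flag of `tz_localize` states whether the reading should be treated as daylight saving time.

```python
import pandas as pd

# 01:30 on 2015-11-01 occurs twice in America/Denver (once in MDT, once in MST)
wall_time = pd.Timestamp("2015-11-01 01:30:00")

# ambiguous=True treats the reading as daylight saving time (MDT, UTC-06:00)
first = wall_time.tz_localize("America/Denver", ambiguous=True)
# ambiguous=False treats the reading as standard time (MST, UTC-07:00)
second = wall_time.tz_localize("America/Denver", ambiguous=False)

print(first, "->", first.tz_convert("UTC"))
print(second, "->", second.tz_convert("UTC"))
```

The two results show the same wall time but are one hour apart in UTC, which is why the offset has to be stored alongside the local time.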

A general recommendation for programmers is to store dates in UTC times and then convert them to local time as needed.


Getting the correct timezone name is important. One recommendation is to preface your search with "IANA" (e.g. "IANA Timezone for Salt Lake City") and then double-check the result against this Wikipedia article (https://en.wikipedia.org/wiki/List_of_tz_database_time_zones).

### *Working with UTC time data*

In [3]:
utc_data = pd.read_csv("./utc_time_data.csv").UTC_Date

In [4]:
utc_data.head(3)

0    2015-03-08 08:00:00-07:00
1    2015-03-08 08:30:00-07:00
2    2015-03-08 09:00:00+06:00
Name: UTC_Date, dtype: object

- To convert a series of UTC date strings into a timezone-aware datetime64 series, we can use the `pd.to_datetime()` function with **utc=True**.

In [5]:
utc_time = pd.to_datetime(utc_data, utc=True)

In [6]:
utc_time.tail(3)

11   2015-11-01 16:00:00+00:00
12   2015-11-01 16:30:00+00:00
13   2015-11-01 17:00:00+00:00
Name: UTC_Date, dtype: datetime64[ns, UTC]

**Note:** The input data does not need to have an offset of 00:00. Setting `utc=True` automatically converts every value to an offset of 00:00. If needed, we can then change the timezone using the `.dt.tz_convert(timezone)` method.
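
To make the note concrete, here is a minimal sketch with two made-up strings that carry different offsets. The values and variable names are illustrative and are not taken from `utc_time_data.csv`.

```python
import pandas as pd

# strings carrying different offsets (illustrative values)
mixed = pd.Series(["2015-03-08 08:00:00-07:00", "2015-03-08 09:00:00+06:00"])

# utc=True normalizes every value to UTC (offset +00:00)
as_utc = pd.to_datetime(mixed, utc=True)
print(as_utc)

# the same instants shown as wall time in Dhaka
as_dhaka = as_utc.dt.tz_convert("Asia/Dhaka")
print(as_dhaka)
```

Only the displayed wall time and offset change with `tz_convert`; the underlying instants stay the same.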

Once we have converted a series to a datetime object we can utilize the `.dt` accessor, which gives us some awesome tools for dealing with dates.

- To convert UTC datetime data to a certain timezone we can use the `.dt.tz_convert(timezone)` method.

In [7]:
# say we want to convert utc_time to the timezone of Dhaka (the capital of Bangladesh)
# an online search shows that the correct timezone name for Dhaka is 'Asia/Dhaka'
utc_time.dt.tz_convert("Asia/Dhaka").sample(3)

9   2015-11-01 21:00:00+06:00
2   2015-03-08 09:00:00+06:00
5   2015-11-01 07:00:00+06:00
Name: UTC_Date, dtype: datetime64[ns, Asia/Dhaka]

### *Working with Local time data*

To load local date information, we need to have the date, the offset, and the timezone.

In [8]:
local = pd.read_csv("./local_time_data.csv")
local_date = local.local_date
offset = local.offset

In [9]:
local_date.sample(3)

12    2015-11-01 02:00:00
14    2015-11-01 01:00:00
3     2015-03-08 02:30:00
Name: local_date, dtype: object

<u>**Workflow:**</u>
1. First, we convert `local_date` to a naive (i.e. without timezone information) datetime series using the `pd.to_datetime()` function.
2. We group the naive datetime series by the offset. When a series is passed to `groupby`, it is first aligned with the original data on the index (entries without a match get NaN), much like adding a new column to the data; the actual grouping happens after that. Each group is named after the value it was grouped by (accessible as `series.name`).
3. An ordinary aggregation function would return a result indexed by group. To retain the original index positions, we use the groupby `.transform()` method.
4. Next, each group of naive data is converted to a `timezone-aware` format with the `.dt.tz_localize(tz)` method, where `tz` is a string or a timezone object such as a `pytz.timezone`.
5. Finally, we convert the timezone-aware data to the local timezone using the `.dt.tz_convert()` method.

One caveat in this process is that our offset series does not have the proper offset format, i.e. `HH:MM`. As a result, if we tried to produce a timezone-aware datetime object from it directly, only the MM part would be modified and not the HH part. So we first convert the offset series entries to the proper format.
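
The workflow can be sketched on a few made-up rows before applying it to the real data. The `dates` and `offsets` series below are illustrative stand-ins for `local_date` and the formatted `offset`, not the contents of `local_time_data.csv`.

```python
import pandas as pd

# toy stand-ins for the local_date and offset columns (illustrative values)
dates = pd.Series(["2015-03-08 01:00:00", "2015-03-08 03:00:00", "2015-03-08 01:30:00"])
offsets = pd.Series(["-07:00", "-06:00", "-07:00"])

naive = pd.to_datetime(dates)  # step 1: naive datetimes

# steps 2-5: group by offset, localize each group with its own offset
# (available as s.name), then convert everything to one timezone
aware = naive.groupby(offsets).transform(
    lambda s: s.dt.tz_localize(s.name).dt.tz_convert("America/Denver")
)
print(aware)
```

Because `.transform()` is used, the result keeps the original row order even though rows 0 and 2 were processed together in the `-07:00` group.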

In [10]:
# see that, offset doesn't have the proper formatting
# offset.sample(2)

In [11]:
# formatting offset series entries to HH:MM
offset = offset.replace({-7: "-07:00", -6: "-06:00"})

In [12]:
# see what changed after formatting
# offset.sample(2)

In [13]:
# The actual work (we will convert the time to America/Denver timezone)
local_time = pd.to_datetime(local_date).groupby(offset)\
            .transform(lambda s: s.dt.tz_localize(s.name).dt.tz_convert("America/Denver"))

In [14]:
local_time.head(3)

0   2015-03-08 01:00:00-07:00
1   2015-03-08 01:30:00-07:00
2   2015-03-08 03:00:00-06:00
Name: local_date, dtype: datetime64[ns, America/Denver]

In [15]:
# pd.to_datetime(local_date).groupby(offset).transform(lambda s: print(s.name))

- Converting local time to UTC

In [16]:
local_time.dt.tz_convert("UTC").head(3)

0   2015-03-08 08:00:00+00:00
1   2015-03-08 08:30:00+00:00
2   2015-03-08 09:00:00+00:00
Name: local_date, dtype: datetime64[ns, UTC]

### *Working with UNIX Epoch time*

- To find the time elapsed (in seconds) since the UNIX epoch (midnight UTC on Jan 1, 1970) -

In [17]:
unix_local = local_time.view(int).floordiv(1e9).astype(int)
unix_local.head(3)

0    1425801600
1    1425803400
2    1425805200
Name: local_date, dtype: int64

- To convert epoch information into UTC -

In [30]:
pd.to_datetime(unix_local, unit='s').dt.tz_localize("UTC")

0    2015-03-08 08:00:00+00:00
1    2015-03-08 08:30:00+00:00
2    2015-03-08 09:00:00+00:00
                ...           
16   2015-11-01 09:00:00+00:00
17   2015-11-01 09:30:00+00:00
18   2015-11-01 10:00:00+00:00
Name: local_date, Length: 19, dtype: datetime64[ns, UTC]
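
An alternative way to get epoch seconds, shown here only as a sketch on made-up values, is to subtract the epoch timestamp and integer-divide by one second. This avoids reinterpreting the underlying integer storage; the variable names are illustrative.

```python
import pandas as pd

# timezone-aware sample values (illustrative)
times = pd.Series(
    pd.to_datetime(["2015-03-08 08:00:00", "2015-03-08 08:30:00"])
).dt.tz_localize("UTC")

# subtract the epoch and integer-divide by one second
epoch = pd.Timestamp("1970-01-01", tz="UTC")
seconds = (times - epoch) // pd.Timedelta(seconds=1)
print(seconds)

# round trip back to timezone-aware datetimes
restored = pd.to_datetime(seconds, unit="s", utc=True)
print(restored)
```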

--------------------------------------------

## The `.dt` Accessor

-------------------------------------------

The `.dt` accessor gives us access to some very useful datetime properties and methods.

DateTime properties: https://pandas.pydata.org/pandas-docs/stable/reference/series.html#datetime-properties

DateTime methods: https://pandas.pydata.org/pandas-docs/stable/reference/series.html#datetime-methods

Period properties: https://pandas.pydata.org/pandas-docs/stable/reference/series.html#period-properties 

TimeDelta properties: https://pandas.pydata.org/pandas-docs/stable/reference/series.html#timedelta-properties

TimeDelta methods: https://pandas.pydata.org/pandas-docs/stable/reference/series.html#timedelta-methods
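
A few of these properties and methods are sketched below on two made-up dates; the series name `dates` is illustrative.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2015-03-08", "2015-11-01"]))

print(dates.dt.year)            # calendar year of each entry
print(dates.dt.month)           # month number (1-12)
print(dates.dt.day_name())      # weekday name
print(dates.dt.is_month_start)  # True where the date is the first of the month
```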

In [18]:
# For example to calculate year, week, and day according to the ISO 8601 standard
local_time.dt.isocalendar().sample(3)

    year  week  day
15  2015    44    7
11  2015    44    7
10  2015    44    7


### *The `dt.strftime()` method*

The `.dt.strftime(format)` method converts a pandas datetime series to strings using the specified format. For all the available format codes, see the documentation at https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes

In [19]:
# for example, let's say we want our local_time series data as a string in the form of 
# e.g, 03 Jan 2000 02:44 PM MDT
local_time.dt.strftime('%d %b %Y %I:%M %p %Z').head(3)

0    08 Mar 2015 01:00 AM MST
1    08 Mar 2015 01:30 AM MST
2    08 Mar 2015 03:00 AM MDT
Name: local_date, dtype: object
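
The same format string can usually be reused to parse the text back into datetimes. The sketch below uses made-up naive timestamps and only numeric format codes, so the output should not depend on the locale.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2015-03-08 01:00:00", "2015-11-01 14:30:00"]))

# numeric format codes only, so the output does not depend on the locale
fmt = "%Y/%m/%d %H:%M"
as_text = stamps.dt.strftime(fmt)
print(as_text)

# parsing with the same format string recovers the original values
parsed = pd.to_datetime(as_text, format=fmt)
print((parsed == stamps).all())
```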

----------------------

## Dates in the Index

---------------------

First, let us read in the alta-noaa-1980-2019.csv dataset. This dataset contains information about the amount of snowfall at a ski resort.

In [20]:
# parse the DATE column as datetimes while reading and use it as the index
alta_df = pd.read_csv("./Data/alta-noaa-1980-2019.csv", parse_dates=["DATE"], index_col="DATE")

In [21]:
alta_df_snow = alta_df.SNOW.rename('snow')
alta_df_snow

DATE
1980-01-01    2.0
1980-01-02    3.0
1980-01-03    1.0
             ... 
2019-09-05    0.0
2019-09-06    0.0
2019-09-07    0.0
Name: snow, Length: 14160, dtype: float64

- Slicing Time series

As the index is a DatetimeIndex, we can slice it with a string (or a partial string) that represents a date. If we specify just the month in a slice, it includes all entries from that month (for both the start and the end of the slice).

In [22]:
alta_df_snow.loc["1980/2":"1980/3"]

DATE
1980-02-01     0.0
1980-02-02     0.0
1980-02-03     0.0
              ... 
1980-03-29     0.0
1980-03-30     0.0
1980-03-31    13.0
Name: snow, Length: 60, dtype: float64
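
Partial string slicing can be tried on a small synthetic series. The index and values below are made up (the values are just a counter) and only mirror the date range used above.

```python
import pandas as pd

# daily toy series covering January to April 1980 (values are just a counter)
idx = pd.date_range("1980-01-01", "1980-04-30", freq="D")
daily = pd.Series(range(len(idx)), index=idx)

feb_to_mar = daily.loc["1980-02":"1980-03"]  # both months, inclusive
only_feb = daily.loc["1980-02"]              # a single partial string selects the whole month

print(len(feb_to_mar), feb_to_mar.index.min(), feb_to_mar.index.max())
print(len(only_feb))
```

1980 is a leap year, so February contributes 29 days and the two-month slice has 60 entries.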

#### *Finding missing data*

In [23]:
# at first, let us check whether the data has any missing values at all
alta_df_snow.isna().any()

True

In [24]:
# it does have missing values; let's create a filter to see which dates have missing values
alta_df_snow.loc[alta_df_snow.isna()]

DATE
1985-07-30   NaN
1985-09-12   NaN
1985-09-19   NaN
              ..
2017-10-02   NaN
2017-12-23   NaN
2018-12-03   NaN
Name: snow, Length: 365, dtype: float64

**Note:** A Series object has no `.query()` method. If it were a DataFrame, we could have used the `.query()` method.

#### *Handling Missing data*

**The best way to deal with missing data is to talk with a subject matter expert and determine why it is missing.**

-> <u>**Dropping missing values**</u> with the `dropna()` method

Be careful with this method and only use it after talking to a subject matter expert who confirms that it is OK to drop the data. It can be hard to tell later that data is missing. For example, if you plotted this data, you might not see that data was dropped unless you paid close attention.

In [25]:
alta_df_snow.dropna().isna().any()

False

-> <u>**Filling missing values**</u> with the help of `.fillna()`, `.ffill()`, `.bfill()` etc. Depending on the data contents, `.mean()`, `.mode()`, `.median()` and other such methods may come in handy when used together with `.fillna()`.

-> <u>**Interpolating missing values**</u> with the `.interpolate()` method may also be appropriate in some cases. By default, it uses linear interpolation.

-> <u>**Using filling and interpolation in combination**</u> with the help of the `.where()`/`.mask()` methods

As is often the case, the trend and characteristics of the data are such that handling missing values requires different methods for different parts of the data. Our `alta_df_snow` data series is a good example of this.

-> In winter (1st and 4th quarter of the year) we should use the interpolate/ffill/bfill methods, as in winter it is most likely that there was snow on the days that are missing data.

-> But in summer (2nd and 3rd quarter of the year) we can assume that there was no snow and thus use the fillna method to fill the missing values with 0.

To find out which quarter a date (datetime object) falls in, we can use the `.dt.quarter` property. For a DatetimeIndex object, we can use the `.quarter` property directly to serve the same purpose.

In [26]:
winter = (alta_df_snow.index.quarter == 1) | (alta_df_snow.index.quarter == 4)

alta_df_snow.mask(winter & alta_df_snow.isna(), alta_df_snow.interpolate())\
            .mask(~winter & alta_df_snow.isna(), 0).isna().any()

False
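
The difference between these strategies is easier to see on a tiny series. The dates and values below are made up for illustration and are not taken from the Alta dataset.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03",
                      "2019-07-01", "2019-07-02", "2019-07-03"])
snow = pd.Series([2.0, np.nan, 4.0, 1.0, np.nan, 3.0], index=idx)

print(snow.fillna(0).tolist())      # gaps replaced by a constant
print(snow.ffill().tolist())        # gaps take the previous value
print(snow.interpolate().tolist())  # gaps filled linearly between neighbours

# season-dependent rule: interpolate in winter, fill with 0 otherwise
winter = idx.quarter.isin([1, 4])
missing = snow.isna()
combined = (snow.mask(missing & winter, snow.interpolate())
                .mask(missing & ~winter, 0))
print(combined.tolist())
```

In the combined result, the January gap is interpolated while the July gap is set to 0, which is the same rule applied to `alta_df_snow` above.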

### *Shifting Data*

The `.shift()` method works on any pandas series, but it is especially useful with time series when we want to compare each entry to the previous or subsequent one.

- Forward shift

In [27]:
alta_df_snow.shift(1)

DATE
1980-01-01    NaN
1980-01-02    2.0
1980-01-03    3.0
             ... 
2019-09-05    0.0
2019-09-06    0.0
2019-09-07    0.0
Name: snow, Length: 14160, dtype: float64

- Backward shift

In [28]:
alta_df_snow.shift(-1)

DATE
1980-01-01    3.0
1980-01-02    1.0
1980-01-03    0.0
             ... 
2019-09-05    0.0
2019-09-06    0.0
2019-09-07    NaN
Name: snow, Length: 14160, dtype: float64
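
One common use of shifting is computing the change from the previous entry. A minimal sketch on made-up values (the names are illustrative):

```python
import pandas as pd

idx = pd.date_range("1980-01-01", periods=4, freq="D")
snow = pd.Series([2.0, 3.0, 1.0, 0.0], index=idx)

# change relative to the previous day; the first entry has no predecessor
change = snow - snow.shift(1)
print(change)

# .diff() is shorthand for the same calculation
print(change.equals(snow.diff()))
```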

### *Rolling window calculations*

The `.rolling()` method provides rolling window calculations. Rolling window calculations are used mainly in signal processing and time series analysis. Put simply, we take a window of k consecutive values at a time and perform some desired mathematical operation on it. In the simplest case, all k values are equally weighted.

The aggregate functions that work on the **rolling object** are:

<img src='./aggregate_methods_that_work_on_rolling_objects.png'>

<u>**.rolling() function Parameters**</u>

-> window : Size of the moving window. If it is an integer, this is the fixed number of observations used for calculating each statistic. If it is an offset, this is the time period of each window, and each window will be variable-sized based on the observations that fall within that time period. Offsets are only valid for datetime-like indexes. To learn more about offsets and frequency strings, see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.

-> min_periods : Minimum number of observations in window required to have a value (otherwise result is NA). For a window that is specified by an offset, this will default to 1. 

-> win_type : If `win_type=None`, all the values in the window are evenly weighted. Various other window types are available; to learn more about them, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.get_window.html#scipy.signal.get_window.

-> on : For a DataFrame, column on which to calculate the rolling window, rather than the index.

- Example: Calculating the 3-day moving average of snowfall

In [29]:
alta_df_snow.rolling(window=3, min_periods=2).mean()

DATE
1980-01-01    NaN
1980-01-02    2.5
1980-01-03    2.0
             ... 
2019-09-05    0.0
2019-09-06    0.0
2019-09-07    0.0
Name: snow, Length: 14160, dtype: float64
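
The difference between an integer window and an offset window only shows up when the observations are irregularly spaced. The sketch below uses three made-up observations with a gap in the dates.

```python
import pandas as pd

# irregularly spaced observations (note the gap before 2019-01-05)
idx = pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-05"])
obs = pd.Series([1.0, 2.0, 3.0], index=idx)

by_count = obs.rolling(window=2).sum()    # always the last 2 observations
by_time = obs.rolling(window="2D").sum()  # observations within the last 2 days

print(by_count.tolist())
print(by_time.tolist())
```

With the integer window, the last sum combines values that are three days apart; with the `"2D"` offset, the last window contains only the final observation. The offset window also returns a value for the first row because `min_periods` defaults to 1 in that case.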

------------------

## Resampling

---------------------