![Erudio logo](img/erudio-logo-small.png)
---
![Pandas logo](img/pandas-logo-small.png)

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from src.training import *

# Time Series

One of the most flexible things in Pandas is its handling of time series data.  We have seen some of that already in the `.dt` accessor for datetime columns.  Pandas gives us quite a bit more, and it does so across many time scales.  That is, it works well with yearly observations spanning centuries, and it works equally well with microsecond observations of some short event lasting a millisecond.
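
As a small illustration of that range of scales, the sketch below builds two synthetic datetime columns (invented for this example, not the NOAA data) and applies the `.dt` accessor to each.

```python
import pandas as pd

# Yearly observations spanning two centuries (synthetic, not the NOAA data)
yearly = pd.Series(pd.date_range("1800-01-01", periods=200, freq="YS"))

# Microsecond observations spanning one millisecond
micro = pd.Series(pd.date_range("2019-01-01", periods=1000, freq="us"))

# The same accessor works at both scales
yearly_years = yearly.dt.year
micro_us = micro.dt.microsecond

print(yearly_years.iloc[[0, -1]].tolist())
print(micro.iloc[-1] - micro.iloc[0])
```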

In this module, we will look at a subset of the NOAA data again.  This time we will look at just one weather station, to focus on the time series aspect only.  The fields that are the same for every row have therefore been pulled out; their values are:

```
NAME         VERLEGENHUKEN, NO
STATION             1002099999
LATITUDE                 80.05
LONGITUDE                16.25
ELEVATION                    8
```

What remains are only the fields that vary with date, or at least that can.  Moreover, the data you will read in has no meaningful order, although it was deterministically "randomized" based on the data (*Extra bonus credit to any student who figures out the original order*).

In [None]:
df = pd.read_csv('data/verlegenhuken.csv', parse_dates=['DATE'])
df.head()

## Time Series Index

Very often the most useful way to treat a date column is to make it the index.  Moreover, in time series data, we almost always want to work in sequential order.

In [None]:
df = df.set_index('DATE')
df = df.sort_index()
df

## Missing Index Values

There is a problem that may not be immediately obvious.  The data are generally daily measurements, but some days are missing.  We get a hint of this by noticing that there are 344 rows in our DataFrame, then comparing that count with the span of the now-sorted date index:

In [None]:
# We could have done the same with the regular DATE column
df.index[-1] - df.index[0]

In [None]:
# Alternatively, if perhaps not sorted (same for regular column)
df.index.max() - df.index.min()
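
Comparing the row count with the span tells us *that* days are missing, but not *which* ones.  A minimal sketch of one way to list them, using a small synthetic frame (the values and the name `toy` are invented for illustration) rather than the station data:

```python
import pandas as pd

# Synthetic daily index with two days deliberately left out
dates = pd.to_datetime(["2019-01-01", "2019-01-03", "2019-01-04", "2019-01-06"])
toy = pd.DataFrame({"TEMP": [10.0, 12.0, 11.0, 9.0]}, index=dates)

# Every day we would expect between the first and last observation
expected = pd.date_range(toy.index.min(), toy.index.max(), freq="D")

# Days in the expected range that have no row
missing = expected.difference(toy.index)

span_days = (toy.index.max() - toy.index.min()).days
print(f"{len(toy)} rows over a span of {span_days} days")
print(list(missing.strftime("%Y-%m-%d")))
```

The same pattern should apply to `df.index` once it is a sorted `DatetimeIndex`.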

## Filling Indices

For many analyses we want regular intervals in our data.  Pandas lets us fill in our DataFrame to match whatever frequency we would like.  Notice that January 2 is added, although so far with NaNs for the missing values.

In [None]:
# Has 354, not 353, rows because of "fence posts"
df.asfreq('D')

It does not make sense for this example, but other frequencies are equally possible.

In [None]:
# .asfreq() returns a new DataFrame; the original is not modified
df.asfreq('6h').head()

In [None]:
# Again just demonstrating; the original is not modified
df.asfreq('5D').head()

In [None]:
# We will use the daily sampling now
df = df.asfreq('D')
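
The "fence post" arithmetic is easier to see on a tiny example.  The sketch below uses a synthetic two-row series (not the station data); it also shows the `method` parameter of `.asfreq()`, which fills the new rows at the same time rather than leaving `NaN`.

```python
import pandas as pd

# Synthetic series: Jan 1 and Jan 4 observed, Jan 2 and Jan 3 missing
toy = pd.Series([1.0, 4.0], index=pd.to_datetime(["2019-01-01", "2019-01-04"]))

daily = toy.asfreq("D")                    # added rows hold NaN
padded = toy.asfreq("D", method="ffill")   # added rows copy the prior observation

# A span of 3 days needs 4 daily rows: one per day plus the starting "post"
span_days = (toy.index[-1] - toy.index[0]).days
n_rows = len(daily)
print(span_days, n_rows)
```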

## Filling Values

So far we just have `NaN` values filling in all the features for the added days.  That is not wrong, necessarily, but we might want to impute some values for that missing data in order to fit some smooth model we are working with.  Obviously, invented values are not true observations, but they are often useful if used with awareness of this limit.

There are three main techniques for imputing values:

* Forward fill: take the prior value and assume the next is the same
* Backward fill: take the subsequent value and use that where missing
* Interpolation: Use some functional form to impute values based on others in the series
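  * The default and by far the most common technique is `linear`, which draws a straight line between the nearest known values, treating rows as equally spaced; for a single missing row this is just the average of its two neighbors
  * The `time` option can be useful if your datetime index is not regularized; it weights by the actual time gaps, so it gives the result `linear` would give if the missing rows *had been* added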
  * A number of other more complex curve fitting techniques are available, but much more specialized
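
To make the differences between these techniques concrete, here is a minimal sketch on a synthetic, irregularly spaced series (invented values, not the station data).  Jan 2 is present but `NaN`, and Jan 3 has no row at all, so `linear` and `time` disagree.

```python
import numpy as np
import pandas as pd

# Synthetic series: Jan 2 is NaN, and Jan 3 has no row at all
idx = pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-04"])
s = pd.Series([0.0, np.nan, 9.0], index=idx)

filled = pd.DataFrame({
    "ffill": s.ffill(),                         # copies Jan 1
    "bfill": s.bfill(),                         # copies Jan 4
    "linear": s.interpolate(method="linear"),   # midpoint of neighboring rows
    "time": s.interpolate(method="time"),       # weighted by elapsed time
})
print(filled)
```

For Jan 2, `linear` gives 4.5 (halfway between the neighboring rows), while `time` gives 3.0 (one day into a three day gap).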

In [None]:
# Look at Jan 2 example
df.loc[:, ['TEMP', 'DEWP']].ffill().head(4)

In [None]:
# Look at Jan 2 example
df.loc[:, ['TEMP', 'DEWP']].bfill().head(4)

In [None]:
# Look at Jan 2 example
# "linear" is default if not specified
df = df.interpolate(method="linear")
df.loc[:, ['TEMP', 'DEWP']].head(4)

## Resampling

Although it goes by a different name, resampling is essentially the same thing as `.groupby()`.  The difference is that the groups are defined by time periods.  In principle, we could achieve the same effect by creating a synthetic feature for the time period, but this use is common enough that a shortcut is much easier.
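
A minimal sketch of that equivalence, on synthetic daily data (the frame `toy` is invented for illustration).  It compares `.resample()` with `.groupby()` using a time-based `pd.Grouper`, and with a synthetic weekly period feature.

```python
import pandas as pd

# Synthetic daily temperatures for four weeks (not the station data)
idx = pd.date_range("2019-01-01", periods=28, freq="D")
toy = pd.DataFrame({"TEMP": list(range(28))}, index=idx)

# Resampling: groups are defined directly by a time period
by_resample = toy.resample("W").mean()

# The same grouping spelled out with .groupby() and a time-based grouper
by_grouper = toy.groupby(pd.Grouper(freq="W")).mean()

# Or with a synthetic feature: the weekly period each row falls in
by_period = toy.groupby(toy.index.to_period("W")).mean()

same_frame = by_resample.equals(by_grouper)
same_values = by_resample["TEMP"].tolist() == by_period["TEMP"].tolist()
print(same_frame, same_values)
```

The period-based version has a `PeriodIndex` rather than a `DatetimeIndex`, so only its values are compared here.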

In [None]:
# The average temperature and dew point in each 2 week window
df.loc[:, ['TEMP', 'DEWP']].resample('2W').mean()

In [None]:
# The maximum temperature and dew point in each 2 week window
# Some aggregation is always required, as with .groupby
df.loc[:, ['TEMP', 'DEWP']].resample('2W').max().head()

## Rolling Windows

Strictly speaking, rolling windows do not depend on having datetime columns, only on ordered data.  The idea is simply to look at some adjacent elements in whatever order the data is sorted.  However, this is particularly likely to be useful for time series.

A rolling window is often more powerful than fixed windows.  Rather than, for example, treating every two week block as a distinct group to aggregate over, we can take the two weeks of rows ending at the current row as the aggregation unit (by default the window trails the current row rather than being centered on it).
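
How the window is positioned matters for reading the results.  The sketch below uses a short synthetic series (not the station data) to compare the default trailing window, a centered window, and a window defined by a time offset rather than a row count.

```python
import pandas as pd

idx = pd.date_range("2019-01-01", periods=6, freq="D")
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# Default: current row plus the 2 rows before it; NaN until the window is full
trailing = s.rolling(window=3).mean()

# Centered: one row on either side of the current row
centered = s.rolling(window=3, center=True).mean()

# Offset window: everything in the 3 days up to and including the current row
by_time = s.rolling("3D").mean()

print(pd.DataFrame({"trailing": trailing, "centered": centered, "by_time": by_time}))
```

An offset window does not wait for a full count of rows, so it produces values from the first row onward; it also tolerates gaps in the index.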

In [None]:
rolling = df.loc[:, ["TEMP", "DEWP"]].rolling(window=14).mean()
rolling.columns = ["roll_%s" % col for col in rolling.columns]
rolling.loc['2019-01-13':'2019-01-20', ['roll_TEMP', 'roll_DEWP']]

Let us compare the rolling average temperature to the actual daily temperature.

In [None]:
(rolling
     .join(df)
     .iloc[12:26]
     .loc[:, ['TEMP', 'roll_TEMP']]
)

Likewise as a plot to compare the columns.

In [None]:
(rolling
     .join(df)
     .loc[:, ['TEMP', 'roll_TEMP']]
     .plot(title="Daily and smoothed temperature")
     .set(ylabel='°Fahrenheit')
);

# Exercises

For the below exercises, we will read in the same dataset for Verlegenhuken as in this module.  You may want to perform the same or similar cleanup we did in the lessons to solve the problems.

For some of these problems, it might be useful to look at more rows than the default configuration of 10.  We provide a small context manager to do this within one cell.  For example:

```python
# If new_max not given, show all possible
with show_all_rows(new_max=100):
    print(my_df)
```
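
The course's own `show_all_rows` is not reproduced here.  As a rough idea of how such a helper could be written, here is a hypothetical sketch (the name `show_rows_sketch` is invented for this example) built on `pd.option_context`, which restores the previous display setting when the block exits.

```python
from contextlib import contextmanager

import pandas as pd


@contextmanager
def show_rows_sketch(new_max=None):
    """Hypothetical stand-in for show_all_rows; not the course implementation."""
    # A max_rows of None tells pandas to display every row
    with pd.option_context("display.max_rows", new_max):
        yield


before = pd.get_option("display.max_rows")
with show_rows_sketch(new_max=100):
    inside = pd.get_option("display.max_rows")
after = pd.get_option("display.max_rows")
print(before, inside, after)
```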

In [None]:
from src.pandas_exercises import *
verl = pd.read_csv('data/verlegenhuken.csv', parse_dates=['DATE'], index_col='DATE')
verl.head()

Question: Which month had the most forceful wind gust? 

In [None]:
# Identify month with most forceful gust
...

Followup: Can you say with certainty based on the data available? Why or why not?

In [None]:
# Explain answer with prose or code
...

Question: Which month had the lowest typical wind speed? Again explain any caveats.

In [None]:
# Identify least windy month
...

---

Produce a two month rolling average of the dew point and select 10-day intervals from that result.

In [None]:
# The desired answer is in ex7_1.result for guidance
ex7_1.result

---

Month by month, what was the change in atmospheric pressure between the first and last day of the month?

In [None]:
# The desired answer is in ex7_2.result for guidance
with show_all_rows():
    print(ex7_2.result)


---

Materials licensed under [CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) by the authors