<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Time series aggregation


---

### Learning Objectives

- "Rolling" statistics through time, such as the rolling mean.
- Exponentially weighted statistics.
- Differencing and shifting.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

<a id="load-the-unemployment-data"></a>
## Load the Unemployment Data
---
This is historical monthly data regarding unemployment rates in the United States, from [the Federal Reserve Bank of St. Louis](https://fred.stlouisfed.org/series/UNRATE).  First, perform any necessary cleaning.

In [2]:
df = pd.read_csv('./datasets/unemployment.csv')


In [3]:
# Check the info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 295 entries, 0 to 294
Data columns (total 2 columns):
 #   Column                                  Non-Null Count  Dtype 
---  ------                                  --------------  ----- 
 0   Month                                   295 non-null    object
 1   Seasonally Adjusted Unemployment Rates  295 non-null    object
dtypes: object(2)
memory usage: 4.7+ KB


In [4]:
# view the first few lines of data
df.head(3)

|   | Month | Seasonally Adjusted Unemployment Rates |
|---|-------|----------------------------------------|
| 0 | 01/01/1999 | 4.30% |
| 1 | 01/02/1999 | 4.40% |
| 2 | 01/03/1999 | 4.20% |


### Rename the columns

The original column names are long. Shorter names are easier to type and to read in code.

In [None]:
df.columns=['']

### Clean the unemployment rate column

Remove the percentage sign and transform to float by applying the lambda function:

`lambda x: float(str(x).replace('%',''))`
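
As a quick check of what this lambda does, the sketch below applies it to a small hand-made Series holding the three values shown in the `df.head(3)` output. The names `raw`, `to_float` and `cleaned` are illustrative only and are not part of the notebook.

```python
import pandas as pd

# Three values in the same "4.30%" string format as the dataset
raw = pd.Series(['4.30%', '4.40%', '4.20%'])

# Strip the percent sign, then cast the remaining text to float
to_float = lambda x: float(str(x).replace('%', ''))
cleaned = raw.apply(to_float)

print(cleaned.dtype)
print(cleaned.tolist())
```

The same pattern should work on the unemployment rate column of `df` once the columns have been renamed.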

In [None]:
# Apply the lambda function to the unemployment rate column


In [None]:
# Check the info

In [None]:
# Drop any nulls, if there are any

In [None]:
# Check the first five lines


## Create a Datetime Index
---

This is monthly data and has been stored as a string in day/month/year format. 

- We first need to convert the date column from `str` objects to datetime. Specify the format as `'%d/%m/%Y'`.
- Then use `.dt.to_period('M')` to represent each date as a monthly period object instead of a timestamp.
    - a Timestamp object represents a single point in time
    - a Period object represents a time period (a month, a year, a week, etc)
- Finally, set this datetime column as the index.
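
The three steps above could look like the following sketch on a tiny hand-made frame. The column names `date` and `unemployment_rate` are assumptions here, so substitute whatever names you chose when renaming the columns.

```python
import pandas as pd

# Hand-made stand-in for the cleaned data (column names are assumed)
demo = pd.DataFrame({
    'date': ['01/01/1999', '01/02/1999', '01/03/1999'],
    'unemployment_rate': [4.3, 4.4, 4.2],
})

# 1. Parse the day/month/year strings, 2. convert to monthly periods
demo['date'] = pd.to_datetime(demo['date'], format='%d/%m/%Y').dt.to_period('M')

# 3. Use the period column as the index
demo = demo.set_index('date')

print(type(demo.index).__name__)
print(demo)
```

Without the explicit `format`, a string like `01/02/1999` could be read as January 2 rather than February 1.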


In [None]:
# Convert the date column to monthly periods using pd.to_datetime and .dt.to_period('M')


In [None]:
# For time series, set the date as the index.


In [None]:
# Check the info of the dataframe now. What is the index?



## Visually Examine the Unemployment Rate
---

Plot the unemployment rate.

In [None]:
df['unemployment_rate'].plot()
plt.xticks(rotation=45)
plt.xlabel('Date')
plt.ylabel('Unemployment Rate')
plt.title('Unemployment Rate Over Time')

plt.show()


### Resample the time series to report weekly or annual means.

Applying `.resample(freq)` to a pandas timeseries with a specified frequency, for example:
- `'W'` : weekly 
- `'A'` : year end
- `'M'` : month end
- `'MS'`: month start 
- `'H'` : hour
- `'T'` or `'min'` : minute

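followed by an aggregation function like `.mean()`, creates aggregates across the indicated frequency (similar to `groupby`).

> **Note:** pandas 2.2 and later deprecate some of these aliases: `'H'` and `'T'` become `'h'` and `'min'`, and for a datetime index `'A'` and `'M'` become `'YE'` and `'ME'` (a period index uses `'Y'` and `'M'`). The older codes used in this notebook may raise warnings or errors depending on your pandas version.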

For more codes refer to the [Pandas Timeseries documentation](http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html) (under "Frequency String")
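
To make the comparison with `groupby` concrete, here is a minimal sketch on made-up monthly values. It uses a plain datetime index and the `'YS'` (year start) code rather than the notebook's period index and `'A'`, purely so that it behaves the same across pandas versions.

```python
import pandas as pd

# Two years of made-up monthly values: 0, 1, ..., 23
idx = pd.date_range('1999-01-01', periods=24, freq='MS')
s = pd.Series(range(24), index=idx, dtype=float)

# Annual mean via resample (labels are year-start timestamps)
yearly_resampled = s.resample('YS').mean()

# The same numbers via groupby on the calendar year (labels are integers)
yearly_grouped = s.groupby(s.index.year).mean()

print(yearly_resampled)
print(yearly_grouped)
```

The values agree. The main practical difference is that `resample` keeps a time-based index, which makes the result easy to plot next to the original series.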

#### Resample annually using the mean for aggregation. Plot and compare to the original time series.

In [None]:
yearly = df['unemployment_rate'].resample('A').mean()
yearly.head()

In [None]:
# Try adding a quarterly resample as well


In [None]:
plt.figure(figsize=(12,6))
df.unemployment_rate.plot(label='orig',legend=True)
yearly.plot(label='yearly',legend=True);

# add the quarterly plot 

plt.show()


<a id="rolling-functions"></a>
## "Rolling" Functions
---

With time series, we can "roll" statistics across time. For example, the rolling mean is the mean of a moving window across time periods. Pandas offers a variety of functionalities for creating rolling statistics, which we'll only scratch the surface of here. 

The syntax for these can be a little tricky at first. We'll be using a `rolling()` function with a statistical function chained to it. Let's dive into more detail.

<a id="parameters-for-rolling-functions"></a>
### Parameters for `rolling()` Functions

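**`rolling().mean()`** (as well as **`rolling().median()`**) is called on the time series to aggregate. `rolling()` can take the following parameters:

- **`window`** indicates the number of observations to include in the average (months for this monthly series, years after resampling annually).
- **`center`** indicates whether the window should be centered on the date (`True`) or end at that date and use only earlier data (`False`, the default).
- **`freq`** was accepted by older pandas versions to choose the level to roll the averages up to (as used in **`resample`**): **`D`** for day, **`M`** for month, **`A`** for year, etc. Current versions no longer accept it, so call `.resample()` first and then `.rolling()`.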

> **Note:** For more information, see Pandas' [resample documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html#pandas.DataFrame.resample).
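
The effect of `window` and `center` is easiest to see on a toy series. The sketch below uses made-up values and the illustrative names `trailing` and `centered`.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up values

# Trailing window: each mean uses the current value and the two before it
trailing = s.rolling(window=3).mean()

# Centered window: each mean uses the previous, current and next value
centered = s.rolling(window=3, center=True).mean()

print(pd.DataFrame({'original': s, 'trailing': trailing, 'centered': centered}))
```

The trailing version has missing values at the start, while the centered version has one at each end, because a full window of three observations is not available there.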

To compare the resampled values with the rolling values, let's create some plots.

**Extract the dates from the index as timestamps.**

> *Hint: The `.to_timestamp()` method converts a period index to timestamps.*
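
As a small illustration of the hint, the sketch below converts a hand-made monthly period index to timestamps. The names `periods` and `stamps` are illustrative.

```python
import pandas as pd

# A hand-made monthly PeriodIndex, like the one created earlier
periods = pd.period_range('1999-01', periods=3, freq='M')

# By default each period maps to the timestamp at its start
stamps = periods.to_timestamp()

print(periods)
print(stamps)
```

Timestamps are convenient as x-values for `plt.plot`, which does not always handle period objects directly.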

In [None]:
# Get the index from the original series as timestamps
date_ticks_orig = df.index.to_timestamp()
date_ticks_orig

In [None]:
# Now get the index from the yearly resampled series as a timestamp.


#### Use matplotlib to plot your time series.

- Choose the time series as y-values and the timestamp-indices extracted in the last part as x-values.
- Plot both the original time series and the yearly averaged one.

In [None]:
plt.figure(figsize=(14,7))

# plot the original time series with date_ticks_orig as x-values, the unemployment rate as y-values
plt.plot(???)

# plot the yearly series with its timestamp index values as x-values, yearly.values as y-values
plt.plot(???)
plt.tick_params(labelsize=14)

#### Resample annually and calculate the rolling mean of years with `window=3` without centering.

In [None]:
yearly_3 = yearly.rolling(window=???, center=???).mean()
yearly_3.head()

In [None]:
plt.figure(figsize=(12,6))
df.unemployment_rate.plot(label='orig',legend=True)
yearly_3.plot(label='yearly_3',legend=True);

#### Calculate the rolling mean of the original monthly time series with `window=5` and `window=15`. Plot both together.

In [None]:
uroll_w5 = 
uroll_w15 = 

In [None]:
uroll_w5

In [None]:
plt.figure(figsize=(12,6))
# plot the original time series

# plot the uroll_w5 and see the difference

# next plot the uroll_w15

# Add the quarterly plot to view the difference

## The Expanding Mean
---

The "expanding mean" uses all data points up to the current time to calculate the mean, as opposed to a moving window.
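
As a minimal sketch of that definition, the example below computes the expanding mean of four made-up values and checks it against a plain loop that averages everything seen so far. The names `expanding` and `manual` are illustrative.

```python
import pandas as pd

s = pd.Series([4.0, 6.0, 8.0, 2.0])  # made-up values

# Expanding mean: the window grows by one observation at each step
expanding = s.expanding().mean()

# The same thing written out: mean of all values up to and including position i
manual = [s.iloc[:i + 1].mean() for i in range(len(s))]

print(expanding.tolist())
print(manual)
```

The last value of the expanding mean is always the mean of the whole series.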

Let's see what this means by considering the unemployment for the year 1999 only.

In [None]:
# Get 12 months of 1999
df_1999 = df.iloc[0:12,]
df_1999

In [None]:
# Resample into Quarters and find the mean for each quarter
quarterly_means_1999 = df_1999.unemployment_rate.resample('Q').mean()
quarterly_means_1999

In [None]:
# Find the mean for Quarters 1, 2, 3 and 4 in turn to check how the above values are calculated:
df_1999.iloc[0:3].mean()  # Quarter 1, for example

### Expanding Mean

The expanding mean, on the other hand, calculates the mean cumulatively up to each time period. You can compare the values by taking the first four values of `quarterly_means_1999` and expanding them:

In [None]:
quarterly_means_1999[0:4].expanding().mean()


In [None]:
# Check the result for the expanding mean for 1999Q4
# Expanded mean for Q4 is the mean of each of the means from Q1 up to Q4

quarterly_means_1999[0:4].mean()


#### Calculate and plot the expanding mean for the original time series. Resample by quarter.

Use

`df.unemployment_rate.resample('Q').mean().expanding().mean()`.

In [None]:
expanding_mean = 
expanding_mean.head()

In [None]:
plt.figure(figsize=(12,6))

#plot the original time series


# plot the expanding mean


<a id="exponentially-weighted-windows"></a>
## Exponentially Weighted Windows
---

Exponentially weighted windows are a common way of averaging out noise in time series data. The averaging applies an "exponential decay" to the weights of prior observations, so time points further in the past contribute less.

The (adjusted) exponentially weighted mean $y_t$ at time $t$, computed from the observations $x_0, \dots, x_t$ with smoothing factor $0 < \alpha \le 1$, is defined as:

$$ y_t = \frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2x_{t-2} + \dots + (1 - \alpha)^{t}x_0} {1 + (1 - \alpha) + (1 - \alpha)^2 + \dots + (1 - \alpha)^{t}} $$

> **Note:** Review Pandas' [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ewm.html) for more information.
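
To connect the formula to pandas, the sketch below evaluates it by hand on four made-up observations and compares the result with `ewm(...).mean()`. It relies on the relation $\alpha = 2 / (\text{span} + 1)$ from the pandas documentation. The helper `adjusted_ewm` is hypothetical and written only for this check.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 4.0, 8.0])  # made-up observations
span = 3
alpha = 2 / (span + 1)  # how pandas turns a span into a smoothing factor

def adjusted_ewm(values, alpha):
    """Evaluate the adjusted weighted-mean formula directly for every t."""
    result = []
    for t in range(len(values)):
        recent_first = values[t::-1]  # x_t, x_{t-1}, ..., x_0
        weights = [(1 - alpha) ** k for k in range(t + 1)]
        numerator = sum(w * v for w, v in zip(weights, recent_first))
        result.append(numerator / sum(weights))
    return result

manual = adjusted_ewm(x.tolist(), alpha)
from_pandas = x.ewm(span=span, adjust=True).mean().tolist()

print(manual)
print(from_pandas)
```

A larger `span` gives a smaller $\alpha$, so older observations keep more weight and the resulting curve is smoother.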

#### Calculate and plot the exponentially weighted mean along with the quarterly mean. What is the difference?

For example: `.resample('Q').mean().ewm(span=10).mean()`.

- Find out what `span` does.

In [None]:
exp_weighted_mean = df.unemployment_rate.resample('Q').mean().ewm(span=10).mean()

In [None]:
plt.figure(figsize=(12,6))

# plot the original time series

# plot the exponentially weighted mean

# compare with the quarterly plot

### Use `.diff()` to calculate changes in values from one time period to another

- Why do you get missing values?
- Plot the resulting time series. Compare to the original time series.
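
As a hint for the first question, the sketch below differences four made-up values. The names `s` and `d` are illustrative.

```python
import pandas as pd

s = pd.Series([4.3, 4.4, 4.2, 4.3])  # made-up rates

# Each entry is the current value minus the previous one
d = s.diff()

print(d)
print(d.isna().sum())  # number of missing values produced
```

The first observation has no predecessor to subtract, so its difference is undefined and pandas stores `NaN` there.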

In [None]:
df.diff().head(6)

In [None]:
df.head(6)

In [None]:
plt.figure(figsize=(12,6))

# plot the unemployment rate

# plot the differenced unemployment rate


### What is the difference between applying `.diff(2)` and `.diff().diff()`?
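
One way to explore this question is to try both on a series with a known pattern. The sketch below uses the square numbers as made-up data: `.diff(2)` subtracts the value two periods back, while `.diff().diff()` takes the difference of the differences.

```python
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0, 16.0, 25.0])  # made-up data: 1^2, 2^2, ...

# Lag-2 difference: x_t - x_{t-2}
lag_two = s.diff(2)

# Second-order difference: (x_t - x_{t-1}) - (x_{t-1} - x_{t-2})
second_order = s.diff().diff()

print(pd.DataFrame({'diff(2)': lag_two, 'diff().diff()': second_order}))
```

Both lose the first two observations, but the values differ: the lag-2 difference still grows with the series, while the second-order difference of a quadratic is constant.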

### Use `df.pct_change()` to calculate percent changes from period to period

- Plot the resulting time series. Compare to the original time series.
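
Before applying it to the unemployment data, it may help to see `pct_change()` on three made-up values. The names `s` and `changes` are illustrative.

```python
import pandas as pd

s = pd.Series([4.0, 5.0, 4.0])  # made-up rates

# (current - previous) / previous, returned as a fraction
changes = s.pct_change()

print(changes)
print(changes * 100)  # the same changes expressed in percent
```

Note that the result is a fraction (0.25 means a 25% rise), and that a rise from 4 to 5 followed by a fall from 5 to 4 does not give symmetric values, because the base of the division changes.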

In [None]:
# check the values after calculating pct_change()


In [None]:
# Plot the pct_change() of the unemployment rate


### Shifting

`df.shift()` moves values forward in time, so each date holds the value from a previous date (one period earlier by default).

- How is that different from `.diff()`?
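
As a starting point for that question, the sketch below shifts four made-up values by one period and then subtracts the shifted series from the original. The names `lagged` and `manual_diff` are illustrative.

```python
import pandas as pd

s = pd.Series([4.3, 4.4, 4.2, 4.3])  # made-up rates

# shift(1) keeps the values but moves each one to the next position
lagged = s.shift(1)

# Subtracting the lagged series reproduces what diff() returns
manual_diff = s - lagged

print(pd.DataFrame({'original': s, 'shift(1)': lagged,
                    'original - shift(1)': manual_diff, 'diff()': s.diff()}))
```

So `.shift()` only relocates values, while `.diff()` combines a shift with a subtraction. Shifted columns are often used as lagged features.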