# Table of Contents
* [Learning Objectives](#Learning-Objectives)
* [Pandas: Time Series](#Pandas:-Time-Series)
	* [Set-Up](#Set-Up)
* [Short Demo: Weather Data](#Short-Demo:-Weather-Data)
	* [Read data](#Read-data)
	* [Resampling](#Resampling)
	* [Compute Rolling Mean](#Compute-Rolling-Mean)
* [Long Demo: NYC Bicycle Share Data](#Long-Demo:-NYC-Bicycle-Share-Data)
	* [Data Download](#Data-Download)
	* [Data Read](#Data-Read)
	* [Data Resampling](#Data-Resampling)
	* [Data Visualization](#Data-Visualization)
	* [Data Group Separation](#Data-Group-Separation)
		* [Motivation](#Motivation)
		* [Steps](#Steps)
* [Time Series Operations](#Time-Series-Operations)
	* [Frequency](#Frequency)
	* [Time Zones](#Time-Zones)
	* [Time Deltas](#Time-Deltas)
	* [Time Resampling](#Time-Resampling)
	* [Missing Values](#Missing-Values)
* [Computational Tools](#Computational-Tools)
* [Section Review](#Section-Review)


# Learning Objectives

After this notebook, the learner will be able to use pandas to:
* Perform exploratory data analysis of time series data
* Read, filter, and resample time series data
* Index time series data by time steps of different sizes
* Convert time zones in a time series
* Compute summary statistics and rolling statistics of time series data

# Pandas: Time Series

## Set-Up

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import pandas as pd
pd.options.display.max_rows = 12
pd.options.display.max_columns = 8
pd.options.display.width = 80

***

# Short Demo: Weather Data

## Read data

In [None]:
weather = pd.read_csv('data/pittsburgh2013.csv', parse_dates=True, index_col='Date')
weather.head()

In [None]:
# Inspect data columns

weather.columns

In [None]:
# Inspect data times

weather.index

In [None]:
# Simply plotting with matplotlib

weather['Mean TemperatureF'].plot()

In [None]:
# Fancy plotting with Bokeh
# !conda install -y bokeh=0.11

import bokeh 
bokeh.__version__

In [None]:
from bokeh.io import output_notebook, show
output_notebook()

In [None]:
from bokeh.plotting import figure

# construct figure object
fig = figure( x_axis_type="datetime", responsive=True, height=200 )

# plot line on figure
fig.line( weather.index, weather['Mean TemperatureF'] )

# Make pretty
fig.title            = 'Mean TemperatureF'
fig.xaxis.axis_label = 'Date'
fig.yaxis.axis_label = 'Temp °F'

show(fig)

## Resampling

In [None]:
# Recall the data frequency is daily

weather['Mean TemperatureF'].head(10)

In [None]:
# Resampling every 4 hours: upsampling daily data leaves the new time slots empty (NaN)

weather['Mean TemperatureF'].resample('4h').asfreq().head(13)

In [None]:
# Resampling with "forward-fill"

weather['Mean TemperatureF'].resample('4h').ffill().head(13)
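
To see what forward-fill does during upsampling, here is a minimal sketch on a made-up three-day series. The values and the names `daily`, `sparse`, and `filled` are illustrative assumptions, not part of the weather data.

```python
import pandas as pd

# Hypothetical three-day daily series standing in for the weather column
daily = pd.Series([30.0, 34.0, 28.0],
                  index=pd.date_range('2013-01-01', periods=3, freq='D'))

# Upsampling without a fill rule leaves the new 4-hour slots empty (NaN)
sparse = daily.resample('4h').asfreq()

# Forward-fill copies the last known daily value into each new slot
filled = daily.resample('4h').ffill()

print(sparse.head(7))
print(filled.head(7))
```

Only the original daily time stamps carry values in `sparse`, while `filled` repeats each daily value until the next one arrives.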

## Compute Rolling Mean

In [None]:
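# pd.rolling_mean() was removed from pandas: resample to weekly means, then average over a 2-week window
biweekly_mean_temp = weather['Mean TemperatureF'].resample('W').mean().rolling(window=2).mean()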
biweekly_mean_temp.head(13)

In [None]:
weekly_mean_temp = weather['Mean TemperatureF'].resample('W').mean() # aggregate each week with mean()
weekly_mean_temp.head(13)
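
The difference between a weekly mean and a two-week rolling mean is easier to check on numbers you can compute by hand. The sketch below assumes a synthetic series with one constant temperature per week; the names `temps`, `weekly`, and `biweekly` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures: 28 days, one constant value per week
idx = pd.date_range('2013-01-07', periods=28, freq='D')   # 2013-01-07 is a Monday
temps = pd.Series(np.repeat([10.0, 20.0, 30.0, 40.0], 7), index=idx)

# One mean per week ('W' bins end on Sunday)
weekly = temps.resample('W').mean()

# Two-week rolling mean: average each weekly mean with the previous one
biweekly = weekly.rolling(window=2).mean()

print(weekly)
print(biweekly)
```

The first rolling value is NaN because a window of two has only one observation available at that point.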

In [None]:
# Create a bokeh figure
from bokeh.plotting import figure
fig = figure(x_axis_type="datetime", responsive=True, height=200)

# Label the figure
fig.xaxis.axis_label='Date'
fig.yaxis.axis_label='Temp °F'
fig.title = 'Bi Weekly Rolling Mean'

# plot line, weekly mean
fig.line(weekly_mean_temp.index, weekly_mean_temp, 
         color='blue', legend='Weekly Mean')

# plot line, biweekly rolling mean
fig.line(biweekly_mean_temp.index, biweekly_mean_temp, 
         color='red', legend='Bi Weekly Rolling Mean')

show(fig)

# Long Demo: NYC Bicycle Share Data

This example is inspired by a blog post on Seattle bikeshare data written by Jake VanderPlas:

https://jakevdp.github.io/blog/2015/10/17/analyzing-pronto-cycleshare-data-with-python-and-pandas/

## Data Download

We have downloaded similar data for the NYC bicycle share program:
 * Data for all dates: https://www.citibikenyc.com/system-data
 * Data for 2015 September: https://s3.amazonaws.com/tripdata/201509-citibike-tripdata.zip

## Data Read

In [None]:
# Large file. This read takes 15-20 seconds on my MacBookPro
# (infer_datetime_format is deprecated in recent pandas, so it is omitted here)
df = pd.read_csv('data/201509-citibike-tripdata.csv.gz',
                  parse_dates=['starttime', 'stoptime'])

In [None]:
df.head()

## Data Resampling

Let's reduce the data to a single time-indexed column:
* select the `starttime` as the index
* select the `bikeid` as the only data column
* resample by counting the number of `bikeid` values in each hour

In [None]:
# Resample to produce **COUNTS** of bikeids in each hour
hourly_bike_count = (df
         .set_index('starttime')
         .bikeid
         .resample('h')
         .count()
        )

In [None]:
# Inspect the resulting object
hourly_bike_count.head()

In [None]:
type( hourly_bike_count )

*Notice the use pattern? Chaining a succession of `.dot` accessors to transform data is a common idiom in `pandas`. The enclosing parentheses only let the chain span several lines.*
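
The same chain can be traced step by step on a miniature table. This is a sketch under assumptions: the three trips below are invented, and only the column names `starttime` and `bikeid` come from the data set.

```python
import pandas as pd

# Hypothetical miniature trip table with the same two column names
trips = pd.DataFrame({
    'starttime': pd.to_datetime(['2015-09-01 00:05', '2015-09-01 00:40',
                                 '2015-09-01 02:15']),
    'bikeid': [101, 102, 101],
})

hourly = (trips
          .set_index('starttime')   # time stamps become the index
          .bikeid                   # keep a single column (a Series)
          .resample('h')            # group rows into hourly bins
          .count()                  # count the rows in each bin
         )
print(hourly)
```

The hour with no trips still appears in the result, with a count of zero.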

## Data Visualization

Often, plotting the data helps you answer questions and helps you ask new ones. Plot the number of riders as a function of start time.
* Notice the horizontal axis, range of values
* Look for periodicity and trends

In [None]:
# Plotting with Matplotlib

hourly_bike_count.plot(figsize=(12,3))

In [None]:
# Selecting just one week, take a closer look

hourly_bike_count.loc['20150907':'20150914'].plot(figsize=(12,3))

In [None]:
# Plotting with Bokeh provides interactive tools

fig = figure( x_axis_type="datetime", responsive=True, height=200 )

# plot line on figure
fig.line( hourly_bike_count.index, hourly_bike_count )

# Make pretty
fig.title            = 'NYC Bicycle Ride Share'
fig.xaxis.axis_label = 'Date'
fig.yaxis.axis_label = 'Hourly Bike Count'

show(fig)

Questions:
* Can you see a daily pattern?
* Can you tell weekdays from weekends?
* Are there signs of a holiday? A rainy day?

Now that the data is tidied up a bit, it is easy to plot subsets by slicing the index with square brackets `[]` to "zoom in".
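
As a small illustration of date-string slicing on a time index, the sketch below uses a synthetic hourly series; `counts`, `one_day`, and `one_week` are hypothetical names and the values are placeholders.

```python
import pandas as pd

# Hypothetical hourly series covering September 2015
idx = pd.date_range('2015-09-01', periods=30 * 24, freq='h')
counts = pd.Series(range(len(idx)), index=idx)

# A single date string selects every hour of that day
one_day = counts.loc['2015-09-07']

# A slice of date strings includes both end dates in full
one_week = counts.loc['2015-09-07':'2015-09-13']

print(len(one_day), len(one_week))
```

Unlike positional slicing, the end of a date-string slice is inclusive, so the week above covers seven full days.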

## Data Group Separation

### Motivation

We may have two populations mixed together, each with different trends:
* We need to separate riders by type, e.g. "casuals" versus "regular users".
* Concept of "split-apply-combine" work-flow or strategy in analysis. 
* Pandas supports groupby operations.
* Analogous to SQL's GROUP BY and the Excel Pivot Table.
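
A minimal sketch of split-apply-combine, assuming a tiny invented table: the `usertype` labels mirror the bike share data, while the `tripduration` values are made up for illustration.

```python
import pandas as pd

# Hypothetical rides: two rider populations mixed together
rides = pd.DataFrame({
    'usertype': ['Subscriber', 'Customer', 'Subscriber', 'Customer', 'Subscriber'],
    'tripduration': [600, 1800, 480, 2400, 720],
})

# Split by usertype, apply mean() to each group, combine into one Series
mean_duration = rides.groupby('usertype')['tripduration'].mean()
print(mean_duration)
```

The overall mean would hide the fact that the two groups behave very differently, which is the motivation for separating them.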

### Steps

Start with the full data set, create a new `DataFrame` as follows:
* Create a `list` of "layers" in a hierarchical index.
* Groupby that hierarchical index, first by `starttime`, then by `usertype`
* Use pivot to rearrange the `DataFrame` `Index` and `Columns` for ease of use.

In [None]:
step1 = df.groupby([pd.Grouper(key='starttime',freq='D'),'usertype'])
type(step1)

The result is a `DataFrameGroupBy` object
* can be used to perform various operations.
* operations return Series and DataFrames, etc.

In [None]:
# Example: Group by starttime and by usertype, and then count.
#    step1.starttime.count() returns the number of non-null starttime values in each group.
#    step1.starttime.size() returns the number of ALL records in each group.

step2 = step1.starttime.count()
print(type(step2))
step2

Object step2 is a `Series` with a hierarchical index called a `MultiIndex`.

We can get back to a `DataFrame` by resetting the index to one "level" of the `MultiIndex`

In [None]:
step3 = step2.reset_index(level='usertype')
print(type(step3))
step3

Well, that's kind of a mess:
* Would be nice to have the data with a single layer index `starttime`
* Would be nice to have the different values of `usertype` as columns
* Use pandas `pivot` to do this!

In [None]:
step4 = step3.pivot(columns='usertype')   # index is omitted, so the existing starttime index is kept
step4

And the representation could be better:

In [None]:
step5 = step4.T.reset_index(level=0,drop=True).T
step5

But the `pandas` idiomatic way of doing all that, putting it all together, is as follows.

Note that ``.reset_index()`` does not currently have an axis argument, so we must transpose, perform the op and transpose again to get the same effect.

In [None]:
tidy_df = (df
   .groupby([pd.Grouper(key='starttime',freq='D'),'usertype'])
   .starttime
   .count()
   .reset_index(level='usertype')
   .pivot(columns='usertype')   # index is omitted, so the existing starttime index is kept
   .T.reset_index(level=0,drop=True).T # .reset_index(level=0,axis=1)
 )
tidy_df
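
As a possible alternative to the pivot and double transpose, `unstack` can move one level of a `MultiIndex` into the columns directly. The sketch below is not the notebook's method; it assumes a small invented table and uses `.size()` to count records per group.

```python
import pandas as pd

# Hypothetical trips with the same two column names as the bike share data
trips = pd.DataFrame({
    'starttime': pd.to_datetime(['2015-09-01 08:00', '2015-09-01 09:00',
                                 '2015-09-01 10:00', '2015-09-02 08:00',
                                 '2015-09-02 18:00']),
    'usertype': ['Subscriber', 'Subscriber', 'Customer', 'Customer', 'Subscriber'],
})

tidy = (trips
        .groupby([pd.Grouper(key='starttime', freq='D'), 'usertype'])
        .size()                              # records per (day, usertype) group
        .unstack('usertype', fill_value=0)   # usertype values become columns
       )
print(tidy)
```

The result has a single-level `starttime` index and one column per user type, which is the same shape the pivot steps produce.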

Now plot it!

In [None]:
mpl_ax1 = tidy_df.plot(figsize=(12,3))
mpl_ax1.legend(loc=1)
mpl_ax1.grid(True)

*Note that `DataFrame.plot()` returns a Matplotlib Axes object! Very useful!*

***

# Time Series Operations

## Frequency

http://pandas.pydata.org/pandas-docs/stable/timeseries.html#dateoffset-objects

Create an `Index` that spans 5 days. Notice the defaults:
* "frequency" is the interval between entries, defaults to "D" for "day"
* "tz" is "timezone", and defaults to "None"

In [None]:
i = pd.date_range('20130101 09:00:00',periods=5)
i

Offset each element of the `Index` array by 1 hour

In [None]:
i + pd.offsets.Hour(1)

This time, explicitly set the "frequency" to "start of months" ("MS") by using the input parameter `freq`

In [None]:
i = pd.date_range('20130101 09:00:00',periods=5,freq='MS')
i

Again, add an offset to each element in the Index:
* Notice the behavior for February.

In [None]:
i + pd.offsets.MonthEnd()
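
A short sketch of how the `MonthEnd` offset moves dates, using three month starts; the variable names are hypothetical.

```python
import pandas as pd

# Three "start of month" time stamps at 09:00
month_starts = pd.date_range('2013-01-01 09:00:00', periods=3, freq='MS')

# MonthEnd rolls each date forward to the last day of its month;
# the time of day is preserved
month_ends = month_starts + pd.offsets.MonthEnd()
print(month_ends)

# A date already on a month end moves to the NEXT month end
next_end = pd.Timestamp('2013-01-31') + pd.offsets.MonthEnd()
print(next_end)
```

February lands on the 28th because 2013 is not a leap year; the offset follows the calendar rather than adding a fixed number of days.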

Very fine-grained offsets are supported. Example: 1 day plus 10 microseconds (10 $\mu s$) 

*Note: the microsecond alias is "us"; older pandas versions also accept "U".*

In [None]:
pd.date_range(i[0], periods=10, freq='1D10us')

## Time Zones

http://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-zone-handling

Pandas supports time zone conversions.

In [None]:
test_zones = pd.Series(pd.date_range('20130101 09:00:00',periods=5,tz='US/Eastern'))
test_zones

In [None]:
test_zones.dt.tz_convert('UTC')

In [None]:
# Drop the time zone: convert to UTC first, then remove the tz information
test_zones.dt.tz_convert('UTC').dt.tz_localize(None)
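
Two operations are easy to confuse: `tz_localize` attaches a time zone to naive time stamps, while `tz_convert` re-expresses the same instants in another zone. A minimal sketch, with hypothetical variable names:

```python
import pandas as pd

# Naive time stamps: no time zone attached
naive = pd.Series(pd.date_range('2013-01-01 09:00:00', periods=2))

# Declare that these wall-clock times are US/Eastern
eastern = naive.dt.tz_localize('US/Eastern')

# Same instants, expressed in UTC (Eastern is 5 hours behind UTC in January)
utc = eastern.dt.tz_convert('UTC')

# Passing None to tz_localize strips the zone and keeps the wall-clock time
utc_naive = utc.dt.tz_localize(None)

print(eastern.iloc[0], utc.iloc[0], utc_naive.iloc[0])
```

Converting between zones never changes the instant in time, so `eastern` and `utc` compare equal element by element.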

## Time Deltas

http://pandas.pydata.org/pandas-docs/stable/timedeltas.html

Create a `Series` of times 1 day plus 2N seconds, for N=0 to 4

In [None]:
test_deltas = pd.Series(pd.timedelta_range('1 day',periods=5,freq='2 s'))
test_deltas

In [None]:
test_deltas.iloc[0]

In [None]:
test_deltas.iloc[1]

In [None]:
# create deltas from a date_range
test_range = pd.date_range('20130101 09:00:00',periods=5,freq='MS')
test_range - test_range[0]

In [None]:
# Add two different sets of deltas
test_add = test_deltas + (test_range-test_range[0])
test_add

In [None]:
# Show representation as seconds
test_add.dt.total_seconds()

In [None]:
# Show time components, separating days, hours, minutes, seconds...
# Notice the days and seconds
test_add.dt.components
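
To make the relationship between total seconds and components concrete, here is a sketch on two hand-picked deltas; the values and names are illustrative assumptions.

```python
import pandas as pd

# Two hypothetical time deltas: days plus a few seconds
deltas = pd.Series(pd.to_timedelta(['1 days 00:00:02', '31 days 00:00:04']))

# Whole duration as one number of seconds
seconds = deltas.dt.total_seconds()

# Duration broken down into days, hours, minutes, seconds, ...
parts = deltas.dt.components

print(seconds)
print(parts[['days', 'seconds']])
```

The `seconds` component holds only the leftover seconds after days, hours, and minutes are removed, so it differs from the total.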

## Time Resampling

A demonstration of time series resampling that produces missing values.

In [None]:
# Create a DatetimeIndex with a step of one millisecond (ms):
rng = pd.date_range('20130101 09:30:00',periods=1000,freq='ms')
rng

Create a `Series` of noisy values around 50, indexed by time stamps drawn at random (with replacement) from `rng`, so some time stamps repeat and others are missing.

In [None]:
np.random.seed(1234)
test_series = pd.Series(np.random.randn(1000)*.1+50,
              index=rng.take(np.random.randint(0, len(rng), size=len(rng)))
             )
test_series

Now plot the resulting time series

In [None]:
test_series.sort_index().plot(figsize=(12,4))

In [None]:
resampled_series = test_series.resample('1ms').ohlc()
resampled_series

*Notice the NaN values... what to do?*

## Missing Values

Use the pandas `.ffill()` method to replace each NaN with the last valid observation.

In [None]:
test_series.resample('1ms').ohlc().ffill()

***

# Computational Tools

http://pandas.pydata.org/pandas-docs/stable/computation.html

Some commonly used computation tools for time series in pandas:

In [None]:
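# pd.rolling_mean() and pd.expanding_mean() were removed from pandas: resample first, then use .rolling() / .expanding()
binned = test_series.sort_index().resample('10ms').mean()
binned.rolling(window=1).mean().plot(figsize=(12,6))
binned.expanding().mean().plot(figsize=(12,6))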
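
The difference between a rolling window and an expanding window can be verified by hand on five numbers. This is a minimal sketch with made-up values:

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Rolling: mean of the current value and the two before it
rolling = values.rolling(window=3).mean()

# Expanding: mean of everything seen so far
expanding = values.expanding().mean()

print(pd.DataFrame({'rolling': rolling, 'expanding': expanding}))
```

The rolling mean forgets old observations once they leave the window, while the expanding mean keeps all of them and therefore changes more slowly over time.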

# Section Review

Short Demo: Weather Data
* Read and resampled data
* Computed a Rolling Mean
* Visualization with Matplotlib and Bokeh

Long Demo: NYC Bicycle Share Data
* Download and read data
* Resampling, cleaning, tidying data
* Group Separation
* Data visualization

Time Series Operations
* Frequency and resampling
* Time Zones
* Time Deltas
* Resampling and time step size
* Missing Values
* Computational Tools


***