## Basic Time Series Analysis

This notebook demonstrates basic time series analysis using scientific Python libraries such as [NumPy](http://www.numpy.org/). The example uses air temperature data stored in HydroShare to derive daily aggregated values and store them in a new HydroShare resource.

## 1. Script Setup and Preparation

Before we begin processing, we must import several libraries into this notebook. The `%matplotlib inline` command tells the notebook server to place plots and figures directly into the notebook.

**Note:** You may see some matplotlib warnings if this is the first time you are running this notebook. These warnings can be ignored.

In [1]:
import os
import pandas
import itertools as it
from functools import reduce
from datetime import datetime
import matplotlib.pyplot as plt

%matplotlib inline

This resource contains a file with observation data called `BeaverDivideTemp.csv`.

We'll use this file to derive daily minimum, maximum, and average air temperatures. Let's preview the Beaver Divide temperature data by looping over the first 10 lines of the downloaded CSV file.

In [2]:
# preview the content of the BeaverDivideTemp file
air_temp_csv = 'BeaverDivideTemp.csv'
with open(air_temp_csv) as f:
    for i in range(0, 10):
        print(f.readline())

Date,AirTemp-degC

10-30-2013 10:15:00, 20.91 

10-30-2013 12:00:00, 21.62 

10-30-2013 12:15:00, 21.87 

10-30-2013 12:30:00, 21.92 

10-31-2013 09:15:00, -174.4 

10-31-2013 09:30:00, -9999 

10-31-2013 09:45:00, -9999 

10-31-2013 10:30:00, -2.693 

10-31-2013 10:30:00, -2.693 
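
The blank lines in the preview above appear because each line read from the file already ends in a newline and `print` adds a second one. A minimal sketch of a preview helper that avoids this, using `itertools.islice`, is shown below. The `preview` function, the sample rows, and the `sample_temp.csv` file name are assumptions made for illustration and are not part of the original resource.

```python
import itertools as it
import os
import tempfile

# hypothetical sample that mimics the layout of BeaverDivideTemp.csv
sample = (
    "Date,AirTemp-degC\n"
    "10-30-2013 10:15:00, 20.91 \n"
    "10-30-2013 12:00:00, 21.62 \n"
    "10-31-2013 09:30:00, -9999 \n"
)


def preview(path, n=10):
    """Return the first n lines of a text file with the trailing newline removed."""
    with open(path) as f:
        return [line.rstrip("\n") for line in it.islice(f, n)]


# write the sample to a temporary file so the sketch is self-contained
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample_temp.csv")
    with open(sample_path, "w") as f:
        f.write(sample)
    preview_lines = preview(sample_path, n=3)

for line in preview_lines:
    print(line)
```

With the real data, `preview(air_temp_csv)` would list the same first 10 lines shown above, without the blank lines in between.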



## 2. Time Series Analysis

Pandas is a data analysis library that we will use to read and summarize our temperature data. To get started, load the CSV data into a Pandas DataFrame object using `read_csv`. This is a powerful function that allows us to skip commented lines, strip whitespace, and transform date strings into Python objects.

In [None]:
# read all of the data into pandas
dateparse = lambda x: datetime.strptime(x, '%m-%d-%Y %H:%M:%S')
df = pandas.read_csv(air_temp_csv, comment='#', delimiter=',', parse_dates=['Date'], date_parser=dateparse)

In [None]:
df
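
The `date_parser` argument used above is deprecated as of pandas 2.0. A sketch of an alternative that should work across pandas versions is to read the dates as plain strings and convert them afterwards with `pandas.to_datetime` and an explicit format. The sample rows below are made up to mimic the file layout, and `skipinitialspace=True` is added here only to show one way of stripping the whitespace after each comma.

```python
import io

import pandas

# hypothetical sample that mimics the layout of BeaverDivideTemp.csv
sample = io.StringIO(
    "Date,AirTemp-degC\n"
    "10-30-2013 10:15:00, 20.91\n"
    "10-31-2013 09:30:00, -9999\n"
)

# read the dates as plain strings first ...
sample_df = pandas.read_csv(sample, comment='#', skipinitialspace=True)

# ... then convert them using the same format string as the dateparse lambda
sample_df['Date'] = pandas.to_datetime(sample_df['Date'], format='%m-%d-%Y %H:%M:%S')

print(sample_df.dtypes)
```

Passing an explicit `format` keeps the month-first interpretation of strings such as `10-30-2013` unambiguous.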

Determine if nodata values are included. Often these are represented by a very large negative number. 

In [None]:
df.min()
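
`df.min()` reports only the single lowest value in each column. To see whether several suspicious values are present, it can help to list a few of the smallest readings and count how often a suspected nodata marker occurs. The sketch below uses made-up values taken from the file preview and assumes that -9999 is the nodata marker.

```python
import pandas

# hypothetical values that mimic the file preview
sample_df = pandas.DataFrame(
    {'AirTemp-degC': [20.91, 21.62, -174.4, -9999, -9999, -2.693]}
)

# the three lowest readings, smallest first
lowest = sample_df['AirTemp-degC'].nsmallest(3)

# assumption: -9999 marks missing observations
nodata_value = -9999
nodata_count = int((sample_df['AirTemp-degC'] == nodata_value).sum())

print(lowest)
print(nodata_count)
```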

Subset our dataset to exclude nodata values. We'll also set a lower limit of -50 °C to remove implausible readings, such as the -174.4 value seen in the preview.

In [None]:
df = df[df['AirTemp-degC'] > -50]
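
The filter above is a boolean mask: the comparison yields one `True` or `False` per row, and indexing with it keeps only the `True` rows. A small sketch on made-up rows shows how the mask can also be used to report how many rows were dropped. The sample values and variable names are assumptions for illustration.

```python
import pandas

# hypothetical rows that mimic the file preview
sample_df = pandas.DataFrame({
    'Date': pandas.to_datetime([
        '2013-10-30 12:30', '2013-10-31 09:15',
        '2013-10-31 09:30', '2013-10-31 10:30',
    ]),
    'AirTemp-degC': [21.92, -174.4, -9999, -2.693],
})

# True for rows that pass the -50 degree lower limit
valid = sample_df['AirTemp-degC'] > -50

filtered = sample_df[valid]
removed = len(sample_df) - len(filtered)

print(filtered)
print(removed)
```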

We can use built-in Pandas functions to derive daily aggregate temperatures. 

In [None]:
daily_min = df.groupby(pandas.Grouper(key='Date', freq='1D')).min()
daily_max = df.groupby(pandas.Grouper(key='Date', freq='1D')).max()
daily_ave = df.groupby(pandas.Grouper(key='Date', freq='1D')).mean()
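
As an alternative sketch, the three daily aggregates can be computed in one pass with `resample` and `agg`, which returns a single table with one column per statistic. The sample readings are made up, and the resulting column names (`min`, `max`, `mean`) differ from the separate DataFrames used in the rest of this notebook.

```python
import pandas

# hypothetical readings spread over two days
sample_df = pandas.DataFrame({
    'Date': pandas.to_datetime([
        '2013-10-30 10:15', '2013-10-30 12:00',
        '2013-10-31 10:30', '2013-10-31 11:00',
    ]),
    'AirTemp-degC': [20.0, 22.0, -3.0, -1.0],
})

# index by date, then compute all three daily statistics at once
daily = (
    sample_df.set_index('Date')['AirTemp-degC']
    .resample('1D')
    .agg(['min', 'max', 'mean'])
)

print(daily)
```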

In [None]:
daily_max

Visualize our derived data by using matplotlib.

In [None]:
# create a figure
fig, ax = plt.subplots(1,1,figsize=(15, 5))

# plot each temperature time series
tmax = daily_max.plot(ax=ax, style='r-')
tmin = daily_min.plot(ax=ax, style='b-')
tave = daily_ave.plot(ax=ax, style='g-')

# display a legend
ax.legend(['TMax', 'TMin', 'TAve'])

# set the figure title and label the y-axis
fig.suptitle('Beaver Divide Temperatures')
ax.set_ylabel('Temperature (degrees C)')

# show grid lines
ax.grid(True)

Combine our data and save it to a CSV file.

In [None]:
# rename columns
daily_min.rename(columns={'AirTemp-degC':'MinAirTemp'}, inplace=True)
daily_max.rename(columns={'AirTemp-degC':'MaxAirTemp'}, inplace=True)
daily_ave.rename(columns={'AirTemp-degC':'AveAirTemp'}, inplace=True)
        
# merge all the data
df_merged = reduce(lambda left, right: pandas.merge(left, right, on=['Date'], how='outer'), [daily_min, daily_max, daily_ave])

# save to csv
df_merged.to_csv('min_max_ave.csv', sep=',')
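
Because the three daily DataFrames share the same `Date` index, a sketch of an alternative to the `reduce` and `merge` combination is to place them side by side with `pandas.concat`. The sample frames below are made up and use the renamed column names from the cell above.

```python
import pandas

# hypothetical daily aggregates that share a Date index
dates = pandas.DatetimeIndex(['2013-10-30', '2013-10-31'], name='Date')
sample_min = pandas.DataFrame({'MinAirTemp': [20.0, -3.0]}, index=dates)
sample_max = pandas.DataFrame({'MaxAirTemp': [22.0, -1.0]}, index=dates)
sample_ave = pandas.DataFrame({'AveAirTemp': [21.0, -2.0]}, index=dates)

# align the frames on their index and join them column-wise
combined = pandas.concat([sample_min, sample_max, sample_ave], axis=1)

print(combined)
```

By default `concat` keeps every date that appears in any of the frames, which matches the `how='outer'` behaviour of the merge.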


---
## 3. Save the results back into HydroShare

Using the `hs_utils` library, the results of our time series analysis can be saved back into HydroShare. First, define all of the required metadata for resource creation, i.e. *title*, *abstract*, *keywords*, and *content files*. In addition, we must define the type of resource that will be created, in this case *genericresource*.

***Optional*** : define the resource from which this "new" content has been derived.  This is one method for tracking resource provenance.

In [None]:
# define HydroShare required metadata
title = 'Daily Aggregate Temperature for Beaver Divide'
abstract = 'Daily minimum, maximum, and average air temperatures for the Beaver Divide gauging station, which is maintained by iUtah researchers.'
keywords = ['Temperature', 'Beaver Divide', 'Time Series']

# create a list of files that will be added to the HydroShare resource.
data_files = ['BeaverDivideTemp.csv', 'min_max_ave.csv']  

# quote the title and abstract so the shell passes each one as a single argument
!hs create -t "{title}" -a "{abstract}" -k {' '.join(keywords)} -f {' '.join(data_files)}
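
Several of the metadata values contain spaces, so each one must reach the shell as a single argument. One way to check this, sketched below, is to build the command string with `shlex.quote` and inspect it before running anything. The flags simply mirror the `hs create` call above and are not verified against the tool's documentation. The sketch only builds and prints the string and does not contact HydroShare.

```python
import shlex

title = 'Daily Aggregate Temperature for Beaver Divide'
keywords = ['Temperature', 'Beaver Divide', 'Time Series']
data_files = ['BeaverDivideTemp.csv', 'min_max_ave.csv']

# assumption: the flags mirror the `hs create` call used in this notebook
parts = ['hs', 'create', '-t', title, '-k', *keywords, '-f', *data_files]

# quote every part so values containing spaces survive shell word splitting
command = ' '.join(shlex.quote(part) for part in parts)

print(command)
```

Splitting the printed string again with `shlex.split` should give back the original list of arguments, which is a quick way to confirm the quoting.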