# Timeseries Analysis using Pandas
-------------------------------------------------------------------

Kushal Keshavamurthy Raviprakash

kushalkr2992@gmail.com

This notebook is a part of the [Python for Earth and Atmospheric Sciences](https://github.com/Kushalkr/Python_for_Earth_and_Atmospheric_Sciences) workshop.

In [None]:
%matplotlib inline

## Introduction
-------------------------------------------------------------------

<img src="images/pandas.png" style="float: right" />**`Pandas`** is a Python package that provides fast, flexible and expressive data structures designed to make working with "relational" or "labelled" data easy and intuitive.

I find pandas very useful for timeseries analysis.

The pandas package is generally imported as:

```py
import pandas as pd
```

The most common data structures in `pandas` are:
* `Series` (1-Dimensional, labeled, homogeneous array)
* `DataFrame` (2-Dimensional, labeled with potentially heterogeneous columns)
* `Panel` (3-Dimensional, size-mutable array) [**DEPRECATED**; removed from pandas as of version 0.25]

In this lecture, we will look at the `Series` and `DataFrame` data structures in some detail.
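
As a minimal sketch of the two structures covered in this lecture, the snippet below builds a `Series` and a `DataFrame` by hand. The values, the row labels (`jan`, `feb`, `mar`) and the column names (`index_a`, `phase`) are made up for illustration and are not taken from the AO or NAO data.

```python
import pandas as pd

# A Series: one-dimensional values with a label attached to each entry.
# The numbers and labels are invented for illustration.
s = pd.Series([0.5, -1.2, 0.3], index=["jan", "feb", "mar"])

# A DataFrame: two-dimensional, one label per row and one name per column.
# Columns may hold different types (here floats and strings).
df = pd.DataFrame(
    {"index_a": [0.5, -1.2, 0.3], "phase": ["pos", "neg", "pos"]},
    index=["jan", "feb", "mar"],
)

print(s["feb"])        # look up a value by its label
print(df["index_a"])   # selecting one column returns a Series
print(df.dtypes)       # each column keeps its own dtype
```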

Let us first import the necessary packages.

In [None]:
# Import necessary packages

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

pd.__version__

We will use the [Arctic Oscillation (AO)](http://en.wikipedia.org/wiki/Arctic_oscillation) and [North Atlantic Oscillation (NAO)](http://en.wikipedia.org/wiki/North_Atlantic_oscillation) datasets as an example.

You can get the data from [here](http://www.cpc.ncep.noaa.gov/products/precip/CWlink/daily_ao_index/monthly.ao.index.b50.current.ascii). A sample copy is also included in the `data/` directory of my repository.

`pandas` has some very good I/O facilities, but for now we will stick to what we have learnt so far.

## Loading Data
-------------------------------------------------------------------

In [None]:
ao = np.loadtxt('data/monthly.ao.index.b50.current.ascii')
print(ao[0:2])

The data has three columns: year, month and the AO index value.
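
To make the three-column layout concrete without needing the data file, here is a small sketch that feeds `np.loadtxt` an in-memory text buffer. The three rows are invented placeholders in the same "year month value" layout, not real AO values.

```python
import io
import numpy as np

# Made-up rows in the "year month value" layout; not real AO values.
sample = io.StringIO(
    "1950 1 -0.50\n"
    "1950 2  0.25\n"
    "1950 3 -1.00\n"
)

arr = np.loadtxt(sample)        # loadtxt also accepts file-like objects

years = arr[:, 0].astype(int)   # loadtxt returns floats, so cast back
months = arr[:, 1].astype(int)
values = arr[:, 2]              # third column holds the index value

print(arr.shape)                # (rows, columns)
print(years, months, values)
```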

## Timeseries
-------------------------------------------------------------------

Timeseries of a single variable are usually put into the `Series` data structure provided by `pandas`.

To create a timeseries, we need to first create a time range for which we have the data and later use it to index the data. This is done with the help of the `date_range` function defined in the `pandas` module.

The data we have loaded ranges from January-1950 until May-2017. You may have to change the range if you obtain a newer version of the data. We use `freq='M'`, the month-end frequency, so each timestamp in the range falls on the last day of its month.

In [None]:
dates = pd.date_range('1950-01', '2017-01', freq='M')  # month-end stamps through Dec-2016; the file itself runs into 2017
dates

In [None]:
dates.shape
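
The shape of this range can be checked with a short sketch. It passes the offset object `pd.offsets.MonthEnd()` instead of the string alias `'M'`. That substitution is my own assumption, made for portability, since recent pandas releases prefer the alias `'ME'` for month-end frequency.

```python
import pandas as pd

# Month-end timestamps starting in January 1950. The end point 2017-01-01
# falls before the January 2017 month end, so the last stamp is in
# December 2016. MonthEnd() stands in for the 'M' alias (an assumption
# made for portability across pandas versions).
dates = pd.date_range("1950-01", "2017-01", freq=pd.offsets.MonthEnd())

print(dates[0], dates[-1])   # first and last timestamps
print(len(dates))            # 67 years * 12 months
```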

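We will use the data only until Dec-2016, that is, the first 804 rows (67 years × 12 months), so that the number of values matches the length of `dates`.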

In [None]:
AO = pd.Series(ao[:804,2], index=dates)
AO
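
The number of values passed to `pd.Series` has to match the length of the index. One way to avoid hard-coding `804` is to slice with `len(dates)`, as sketched below. The array `fake` is a hypothetical stand-in for the loaded file: its row count (809) and its random values are assumptions made only so the example runs on its own.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("1950-01", "2017-01", freq=pd.offsets.MonthEnd())

# Stand-in for the loaded array: 809 rows of (year, month, value).
# The row count and the random values are hypothetical.
rng = np.random.default_rng(0)
fake = np.column_stack([
    np.repeat(np.arange(1950, 2018), 12)[:809],
    np.tile(np.arange(1, 13), 68)[:809],
    rng.normal(size=809),
])

# Trim the values to the length of the index instead of hard-coding 804.
series = pd.Series(fake[:len(dates), 2], index=dates)

# Sanity check: the last row used should be December 2016.
print(fake[len(dates) - 1, :2])
print(series.tail(3))
```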

Let's see how the data looks.

In [None]:
AO.plot()

In [None]:
AO['1980':'1990'].plot()

In [None]:
AO['1992-01':'1993-08'].plot()

Indexing data is very intuitive. For example:

In [None]:
AO['1992']

In [None]:
AO['2016-07']
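
The date-string lookups above can be tried on placeholder data. The sketch below uses `.loc` explicitly (my choice, to keep the label-based intent obvious) on a monthly series whose values are just a running count, so only the number of selected entries is meaningful.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("1950-01", "2017-01", freq=pd.offsets.MonthEnd())
s = pd.Series(np.arange(len(dates), dtype=float), index=dates)  # placeholder values

one_year = s.loc["1992"]             # every month of 1992
one_month = s.loc["2016-07"]         # the single entry for July 2016
span = s.loc["1992-01":"1993-08"]    # both end points are included

print(len(one_year), len(one_month), len(span))
```

Unlike integer slicing, a label slice includes its end point, so the last selection runs through August 1993.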

You can also use boolean (fancy) indexing, as with NumPy arrays.

In [None]:
AO[np.logical_and((AO < 0.3),(AO > 0))]
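
A small sketch of how the mask works, using invented values: each comparison yields a boolean `Series`, the two are combined element-wise, and only the entries marked `True` are kept. The `&` operator is shown as an equivalent spelling of `np.logical_and`.

```python
import numpy as np
import pandas as pd

s = pd.Series([-0.4, 0.1, 0.25, 0.3, 0.9])   # invented values

mask = (s > 0) & (s < 0.3)    # element-wise AND; the parentheses are required
selected = s[mask]

# Same selection written with np.logical_and, as in the cell above.
same = s[np.logical_and(s < 0.3, s > 0)]

print(mask.tolist())
print(selected.tolist())
```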

## DataFrame
-------------------------------------------------------------------

2-Dimensional data are handled well by the data structure called `DataFrame`. We will use the [North Atlantic Oscillation (NAO)](http://en.wikipedia.org/wiki/North_Atlantic_oscillation) data as an example. This data is available [here](http://www.cpc.ncep.noaa.gov/products/precip/CWlink/pna/norm.nao.monthly.b5001.current.ascii).

Create a `Series` the same way as was done for the AO data.

In [None]:
nao = np.loadtxt('data/norm.nao.monthly.b5001.current.ascii.txt')
dates_nao = pd.date_range('1950-01', '2017-01',freq='M')
print(dates_nao.dtype)
NAO = pd.Series(nao[:804,2],index=dates_nao)

In [None]:
aonao = pd.DataFrame({'AO' : AO, 'NAO' : NAO})
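
When a `DataFrame` is built from a dictionary of `Series`, pandas lines the values up by index label. In the notebook both series share the same dates, so no gaps arise. The sketch below uses two short, invented series with only partly overlapping dates to show what happens otherwise.

```python
import pandas as pd

idx_a = pd.date_range("2000-01", periods=4, freq=pd.offsets.MonthEnd())  # Jan..Apr
idx_b = pd.date_range("2000-03", periods=4, freq=pd.offsets.MonthEnd())  # Mar..Jun

a = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx_a)
b = pd.Series([10.0, 20.0, 30.0, 40.0], index=idx_b)

# Rows are the union of both indexes; missing entries become NaN.
both = pd.DataFrame({"a": a, "b": b})

print(both)
print(both.dropna())   # keep only the dates present in both series
```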

In [None]:
aonao.plot()

In [None]:
aonao.head()

Slicing works on both columns at once.

In [None]:
aonao['1980':'1985']

You can add more complexity to the slicing this way.

In [None]:
import datetime
aonao.loc[(aonao.AO > 0) & (aonao.NAO < 0)
         & (aonao.index > datetime.datetime(1980,1,1))
         & (aonao.index < datetime.datetime(1989,1,1)), 'NAO'].plot(kind='barh')

Here I have used the [DataFrame.loc](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html?highlight=loc#pandas.DataFrame.loc) indexer. The row selection combines several conditions: AO > 0, NAO < 0, and dates after 1 January 1980 and before 1 January 1989 (built with the `datetime` module). The second argument, `'NAO'`, picks the column, and voila! You have the data sliced according to your requirements, ready to plot as a horizontal bar chart.
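
The same pattern can be followed step by step on a tiny, invented table. In this sketch the date bounds are written with `pd.Timestamp` rather than `datetime.datetime`; that is my substitution, and the five rows are made up so the outcome is easy to check by eye.

```python
import pandas as pd

idx = pd.to_datetime(
    ["1979-06-30", "1980-06-30", "1985-06-30", "1988-12-31", "1989-06-30"]
)
df = pd.DataFrame(
    {"AO": [1.0, 0.5, -0.2, 0.7, 0.9], "NAO": [-1.0, -0.3, -0.5, -0.1, -0.8]},
    index=idx,
)

# Strict inequalities: dates after 1980-01-01 and before 1989-01-01.
in_window = (df.index > pd.Timestamp("1980-01-01")) & (
    df.index < pd.Timestamp("1989-01-01")
)

# Combine the value conditions with the date window.
mask = (df["AO"] > 0) & (df["NAO"] < 0) & in_window

# Rows chosen by the mask, column chosen by name.
picked = df.loc[mask, "NAO"]
print(picked)
```

Only the 1980 and 1988 rows survive: 1979 and 1989 fall outside the window, and the 1985 row has a negative AO.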

## Statistics with Pandas
-------------------------------------------------------------------

In [None]:
aonao.min()

In [None]:
aonao.max()

In [None]:
aonao.mean()

In [None]:
aonao.median()

In [None]:
print(aonao.mean(axis=0))  # Column-wise mean (the default): one value per index
print("\nNow let's see the mean of the two indices for each month:\n")
print(aonao.mean(axis=1))  # Row-wise mean: one value per date
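
The difference between the two axes is easiest to see on a table small enough to compute by hand. The values below are invented.

```python
import pandas as pd

df = pd.DataFrame({"AO": [1.0, 3.0], "NAO": [2.0, 6.0]})  # invented values

col_means = df.mean(axis=0)   # one value per column: AO -> 2.0, NAO -> 4.0
row_means = df.mean(axis=1)   # one value per row: 1.5 and 4.5

print(col_means)
print(row_means)
```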

Or you can get all necessary statistics in one command.

In [None]:
aonao.describe()

Getting correlation coefficients is also very easy.

In [None]:
aonao.corr()
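
By default `corr()` computes the Pearson correlation coefficient between each pair of columns:

$$r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$

The sketch below checks this against a hand-written version on invented numbers; the column names `x` and `y` are placeholders.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 1.0, 4.0, 3.0]})

# Off-diagonal entry of the correlation matrix.
by_pandas = df.corr().loc["x", "y"]

# Pearson correlation written out: sum of products of deviations over the
# square root of the product of the summed squared deviations.
dx = df["x"] - df["x"].mean()
dy = df["y"] - df["y"].mean()
by_hand = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(by_pandas, by_hand)
```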

## Exercise
-------------------------------------------------------------------

Try out the basic pandas operations on some irregular data, such as the temperature and dew point temperature from the Bloomington station.

In [None]:
import datetime as dt

data = pd.read_csv('data/KBMG-2013.csv')

# Build one timestamp per row from the year, month, day, hour and minute columns.
# The int() casts guard against apply(axis=1) upcasting the row values to floats.
data['date'] = data.apply(
    lambda x: dt.datetime(int(x['yyyy']), int(x['mm']), int(x['dd']),
                          int(x['hh']), int(x['mm.1'])),
    axis=1)
data.set_index('date', inplace=True)

# Temperature and dew point temperature, both indexed by time
TTD = pd.DataFrame({'$T$': data['temp'], '$T_d$': data['dwpt']})

TTD.plot()
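
As an alternative to the row-by-row `apply`, `pd.to_datetime` can assemble timestamps from a table whose columns are named `year`, `month`, `day`, `hour` and `minute`. The sketch below assumes the column names used in the cell above (`yyyy`, `mm`, `dd`, `hh`, `mm.1`, `temp`, `dwpt`); the two rows of values are invented and do not come from the station file.

```python
import pandas as pd

# Two invented rows using the column names from the cell above.
data = pd.DataFrame({
    "yyyy": [2013, 2013], "mm": [1, 1], "dd": [1, 1],
    "hh": [0, 1], "mm.1": [53, 53],
    "temp": [1.5, 1.0], "dwpt": [-2.0, -2.5],
})

# pd.to_datetime expects columns named year/month/day/hour/minute,
# so rename the date parts first.
parts = data[["yyyy", "mm", "dd", "hh", "mm.1"]].rename(
    columns={"yyyy": "year", "mm": "month", "dd": "day",
             "hh": "hour", "mm.1": "minute"}
)
data.index = pd.to_datetime(parts)

print(data[["temp", "dwpt"]])
```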

## Further Reading
-------------------------------------------------------------------
* [Pandas Tutorials](http://pandas.pydata.org/pandas-docs/stable/tutorials.html) - Besides the official pandas tutorial, this page links to several other tutorials at the end.