This notebook covers the basics of operations on Series and DataFrames. Again, a quick check on whether you know the material.

1. Can you use numpy ufuncs (e.g. `np.exp`) on DataFrames?
2. What happens when you add two DataFrames with different indices?

If you know the answers already, you can skip to our first [application](#Application:-Economic-Timeseries). Maybe download some additional datasets from [FRED](https://research.stlouisfed.org) and play with those; try to figure out how to fix the economy. You've got about 5 minutes.

`DataFrame`s and `Series` support all the usual math operations. Additionally,
numpy [`ufuncs`](http://docs.scipy.org/doc/numpy/reference/ufuncs.html) (universal functions, e.g. `np.log`) can be used as expected.

In [None]:
import numpy as np
import pandas as pd

from ph2t import side_by_side  # came with the notebooks

# `ufuncs`

`ufunc`s operate on arrays elementwise.
They can be used just like you'd expect on a `Series` or `DataFrame`.

In [None]:
np.random.seed(42)
df = pd.DataFrame(np.random.uniform(0, 10, size=(3, 3)))
df

In [None]:
df + 1

In [None]:
df ** 2

In [None]:
np.log(df)

For any decent-sized `DataFrame`, using a `ufunc` will be much faster than applying the function to each row in a for loop or by using `DataFrame.apply`.
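
As a rough sketch of that comparison (the frame `big` and its size are made up for illustration), the cell below computes the same logarithm two ways. The row-wise version goes through a `lambda` so that `apply` really does call the function once per row; timing each line with `%timeit` in the notebook should make the gap visible.

```python
import numpy as np
import pandas as pd

# Toy data, not part of the notebook's datasets.
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.uniform(1, 10, size=(1_000, 3)), columns=list("abc"))

# Vectorized: a single ufunc call over all the values.
fast = np.log(big)

# Row by row: the lambda is called once for each of the 1,000 rows.
slow = big.apply(lambda row: np.log(row), axis=1)

# Both routes give the same numbers; only the speed differs.
same = np.allclose(fast.to_numpy(), slow.to_numpy())
print(same)
```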

# Alignment

When you do binary operations between two `numpy` arrays, like `+`, the arrays are in some sense aligned by position.
It's up to you to ensure that the correct values are in the same position across arrays.

In pandas, we rely on row and column *labels* to do the alignment for us.
Pandas automatically aligns by label when doing operations between `DataFrames` and `Series`.
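
A minimal sketch of the difference, using two made-up Series `a` and `b` that share labels but store them in opposite order:

```python
import pandas as pd

# Toy Series: same labels, reversed order.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])

# numpy adds by position: 1+10, 2+20, 3+30
by_position = a.to_numpy() + b.to_numpy()

# pandas adds by label: x is 1+30, y is 2+20, z is 3+10
by_label = a + b

print(by_position)
print(by_label)
```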

In [None]:
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [2, 4, 6], 'B': [1, 4, 9]})

side_by_side(df1, df2, "df1", "df2")

In [None]:
df1 + df2

In [None]:
# Note the index order
df2 = pd.DataFrame({'A': [6, 2, 4], 'B': [9, 1, 4]}, index=[2, 0, 1])
side_by_side(df1, df2, 'df1', 'df2')

In [None]:
df1 + df2

In [None]:
# Different index entirely
df3 = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
side_by_side(df1, df3, 'df1', 'df3')

In [None]:
df1 + df3
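
Labels that appear in only one of the two frames produce `NaN` in the result. If you would rather treat the missing side as a default value, the method form `DataFrame.add` accepts a `fill_value`. A small sketch with two made-up frames, `left` and `right`:

```python
import pandas as pd

# Toy frames: `right` has an extra row (2) and a different second column.
left = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
right = pd.DataFrame({"A": [10, 20, 30], "C": [1, 1, 1]})

# Plain addition: anything missing from either side becomes NaN.
plain = left + right

# With fill_value, a value missing from one side is treated as 0.
# A cell missing from *both* sides (row 2 of column B) is still NaN.
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```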

# Recap

- `ufuncs` work just fine
- Operations **align by label** and then are performed

# Application: Economic Timeseries

In [None]:
from itertools import chain

import pandas as pd

import seaborn as sns
import statsmodels.api as sm
from pandas_datareader import data

import matplotlib.pyplot as plt
from toolz import partitionby

%matplotlib inline
pd.options.display.max_rows = 10
sns.set_style('ticks')

Pandas recently split off some web-based data reading functionality into its own package `pandas-datareader`. Earlier, I used it to grab data from [FRED](http://research.stlouisfed.org).

```python
gdp = data.DataReader("GDP", data_source='fred', start='1929', end='2014').squeeze()
cpi = data.DataReader("CPIAUCSL", data_source='fred', start='1947-01', end='2015-05').squeeze()
rec = data.DataReader('USREC', data_source='fred', start='1854-12-01', end='2014-08-01').squeeze()
gdp.to_csv('data/gdp.csv', header=True)
cpi.to_csv('data/cpi.csv', header=True)
rec.to_csv('data/rec.csv', header=True)
```

In [None]:
with open('data/gdp.csv') as f:
    print(''.join(f.readlines(100)))

In [None]:
pd.read_csv?

In [None]:
gdp = pd.read_csv('data/gdp.csv', index_col='DATE', parse_dates=['DATE']).squeeze()
cpi = pd.read_csv('data/cpi.csv', index_col='DATE', parse_dates=['DATE']).squeeze()
rec = pd.read_csv('data/rec.csv', index_col='DATE', parse_dates=['DATE']).squeeze()

In [None]:
gdp.head(n=5)

In [None]:
gdp.tail()

In [None]:
cpi.head()

In [None]:
rec.head()

# Table Summarization

Pandas has a few methods for summarizing the contents of a DataFrame or Series.

In [None]:
print('mean:    ', gdp.mean())
print('std:     ', gdp.std())
print('quantile:', gdp.quantile(.66))
print('max:     ', gdp.max())

It can also be useful to get the argmax, or the index label where the maximum occurs.

In [None]:
gdp.idxmax()
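
To make the label-versus-position distinction concrete, here is a sketch on a made-up three-point Series `s`: `idxmax` returns the index label of the maximum, while `argmax` returns its integer position.

```python
import pandas as pd

# Toy quarterly values, not the real GDP data.
s = pd.Series(
    [3.0, 9.0, 4.0],
    index=pd.to_datetime(["2000-01-01", "2000-04-01", "2000-07-01"]),
)

largest = s.max()     # the maximum value itself
label = s.idxmax()    # the index label where it occurs
position = s.argmax() # the integer position where it occurs

print(largest, label, position)
```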

# Plotting

DataFrames and Series can make decent looking plots with a few lines of code.

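I wanted to include bars for recession indicators.
The only pandas-relevant bit is `Series.items` (called `Series.iteritems` in older versions of pandas; that name was removed in pandas 2.0), which you *almost* never want to use since it's relatively slow.

In [None]:
next(rec.items())

In [None]:
# Split the (date, value) pairs into consecutive runs of recession / non-recession months
groups = partitionby(lambda x: x[1] == 1, rec.items())
# Keep only the runs whose first observation is a recession month
recessions = filter(lambda x: x[0][1] == 1, groups)
# Each span is (first date, last date) of one recession
spans = [(months[0][0], months[-1][0]) for months in recessions]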
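
If the `partitionby` step is opaque, the same idea can be sketched with the standard library's `itertools.groupby` on a made-up indicator Series (`flags` and `toy_spans` are illustrative names, not part of the notebook):

```python
from itertools import groupby

import pandas as pd

# Toy monthly indicator: 1 marks a "recession" month.
flags = pd.Series(
    [0, 1, 1, 0, 0, 1, 0],
    index=pd.date_range("2000-01-01", periods=7, freq="MS"),
)

toy_spans = []
# groupby yields one group per consecutive run of equal keys.
for is_recession, run in groupby(flags.items(), key=lambda pair: pair[1] == 1):
    if is_recession:
        run = list(run)
        # Record the first and last date of the run.
        toy_spans.append((run[0][0], run[-1][0]))

print(toy_spans)
```

Here the runs of ones are February to March and June on its own, so `toy_spans` holds two (start, end) pairs.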

In [None]:
def add_rec_bars(ax=None):
    ax = ax or plt.gca()
    ylim = ax.get_ylim()
    for span in spans:
        ax.fill_between(span, *ax.get_ylim(), color='k', alpha=.25)
    ax.set_ylim(*ylim)
    return ax

sns.set(style="white", context="talk")
plt.rcParams['figure.figsize'] = (12, 6)

In [None]:
ax = gdp.plot(title='GDP', linewidth=3)
add_rec_bars(ax)
sns.despine()

In [None]:
ax = cpi.plot(title='CPI', linewidth=3)
add_rec_bars(ax)
sns.despine()

Let's put some of those operations to use.

# Exercise: Convert CPI to be base 2009

CPI is the Consumer Price Index. The Index part just means that it doesn't really have any units,
just some (arbitrary) time span that is set to 100, and every other observation is relative to that time.
The CPI we have is indexed to the average over 1982-1984.

**Convert the CPI to base 2009 by dividing the entire Series `cpi` by the average CPI in 2009.**

Try breaking the problem into pieces:
1. Select just the rows from 2009 (`.loc`)
  + Timeseries (dates in the index) have [special rules](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#datetimeindex-partial-string-indexing) for slicing. Pass in a string with the subset you want
    + `.loc['2010-01-01']` selects all observations from that day
    + `.loc['2010-01']` selects all observations from that month
2. Calculate the average on those rows
3. Divide `cpi` by that number (assign that to `cpi09`)
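
As a hint for step 1, here is a sketch of partial string indexing on a made-up daily Series `days` (deliberately not the CPI data, so the exercise is still yours to solve):

```python
import pandas as pd

# Toy data: 90 daily observations covering January through March 2010.
days = pd.Series(
    range(90),
    index=pd.date_range("2010-01-01", periods=90, freq="D"),
)

january = days.loc["2010-01"]        # every observation in January 2010
whole_year = days.loc["2010"]        # every observation in 2010
feb_mean = days.loc["2010-02"].mean()  # average over February only

print(len(january), len(whole_year), feb_mean)
```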

In [None]:
# Your code goes here
# Assign the result to cpi09

In [None]:
%load solutions/solutions_operations.py

In [None]:
ax = cpi09.plot(title='CPI-2009')
add_rec_bars(ax=ax)
sns.despine();

Next, try calculating real GDP, with 2009 as the base year.
This is defined to be `gdp` divided by the base-2009 CPI for that time period.

Call the result `rgdp` and try plotting it.

In [None]:
# Your code here
rgdp = ...

If you have trouble with the plotting, look into the `.fillna` and `.dropna` methods.
The next section explains what's going on.

# Real GDP: Alignment

Let's say we want real GDP (adjusted for inflation).

\begin{equation}
    rGDP_t = \frac{GDP_t}{CPI_t}
\end{equation}

Problem: our CPI is monthly but GDP is quarterly. Also the two Series have different start and end points. If you didn't have automatic label alignment, you'd have to jump through hoops to select the correct subset of each series.


```python
# this is boolean indexing, we'll see more later
gdp / cpi09[(cpi09.index.month % 3 == 1) & (cpi09.index.year <= 2014)]
```

... but that's unnecessary. The operations will automatically align for you.

In [None]:
rgdp = (gdp / cpi09)
rgdp

The `NaN`s are missing value indicators. `NaN`s can crop up for many reasons, but in this case it's because the labels didn't overlap perfectly.
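
The same effect is easy to reproduce on a small scale. In this sketch (the Series `monthly` and `quarterly` are made up), only the dates present in both indexes get a value:

```python
import pandas as pd

# Toy data: six monthly observations and two quarterly ones.
monthly = pd.Series(2.0, index=pd.date_range("2000-01-01", periods=6, freq="MS"))
quarterly = pd.Series(
    [10.0, 20.0],
    index=pd.date_range("2000-01-01", periods=2, freq="QS"),
)

# The result covers all six months, but only January and April
# exist in both Series, so the other four entries are NaN.
ratio = quarterly / monthly

print(ratio)
```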

Binary operations (addition, multiplication, ...) consist of two steps.

1. align by label
2. operation

In [None]:
l, r = gdp.align(cpi09)
side_by_side(l.head(), r.head())


While binary operations *propagate* missing values, aggregations like `sum` or `mean` will ignore them.
Pandas provides methods for detecting missing data (`.isnull`), filling missing data (`.fillna`),
or dropping it (`.dropna`), which is what I'll do here.

In [None]:
rgdp.dropna()

Pandas methods are non-mutating by default. This means that even though I called `.dropna()` above, `rgdp` still has missing values.

In [None]:
rgdp

To capture the change, assign a name to the result. In this case, I use a new name, `real_gdp`.

In [None]:
real_gdp = rgdp.dropna()
real_gdp.head()
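
The missing-data behaviour described above can be checked on a made-up three-element Series `s`. This is only a sketch of the methods already mentioned, not a full tour of the missing-data API:

```python
import numpy as np
import pandas as pd

# Toy Series with one missing value.
s = pd.Series([1.0, np.nan, 3.0])

total = s.sum()        # aggregations skip NaN by default
shifted = s + 1        # binary operations propagate NaN
mask = s.isnull()      # True where a value is missing
filled = s.fillna(0)   # replace NaN with 0
dropped = s.dropna()   # remove the NaN entry

# None of the calls above modified `s` itself.
still_missing = s.isnull().sum()

print(total, still_missing)
```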

In [None]:
ax = gdp.pct_change().plot(
    figsize=(15, 5), label='GDP', legend=True, linewidth=2
)
real_gdp.pct_change().plot(
    ax=ax, label='real GDP', legend=True, linewidth=2
)
add_rec_bars(ax)
sns.despine()

# Recap

DataFrames allow you to rapidly explore data, iterating on ideas.

- `ufunc`s work as expected
- operations between pandas objects automatically align on labels