# Alignment

This notebook covers [*alignment*](https://pandas.pydata.org/docs/user_guide/dsintro.html#dsintro-alignment), a feature of pandas that's crucial to using it well. It relies on pandas' handling of *labels*.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Goal: Compute Real GDP

Let's learn through an example: Gross Domestic Product (the total output of a country) is measured in dollars. This means we can't just compare the GDP from 1950 to the GDP from 2000, since the value of a dollar changed over that time (inflation).

In the US, the Bureau of Economic Analysis already provides an estimate of real GDP, but we'll calculate something similar using the formula:

$$
real\_GDP = \frac{nominal\_GDP}{price\_index}
$$
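
As a quick sanity check of the formula, here is a tiny worked example. The numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the FRED data.

```python
# Hypothetical numbers, chosen only to illustrate the formula.
nominal_gdp = 10_000.0  # current dollars (made up)
price_index = 125.0     # index level, where the base period equals 100 (made up)

# The formula as written in this notebook.
real_gdp = nominal_gdp / price_index
print(real_gdp)         # 80.0

# Rescaling by 100 expresses the result in base-period dollars.
print(100 * real_gdp)   # 8000.0
```

Because the price index is scaled so that the base period equals 100, the raw ratio is off by a constant factor of 100. That factor doesn't change the shape of the series, only its units.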

I've downloaded a couple of time series from [FRED](https://fred.stlouisfed.org), one for GDP and one for the Consumer Price Index.

* U.S. Bureau of Labor Statistics, Consumer Price Index for All Urban Consumers: All Items in U.S. City Average [CPIAUCSL], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/CPIAUCSL, October 31, 2020.
* U.S. Bureau of Economic Analysis, Gross Domestic Product [GDP], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/GDP, October 31, 2020.


We're going to do things the wrong way first.

In [None]:
gdp_bad = pd.read_csv("data/GDP.csv.gz", parse_dates=["DATE"])
cpi_bad = pd.read_csv("data/CPIAUCSL.csv.gz", parse_dates=["DATE"])

Our formula says `real_gdp = gdp / cpi`, so, let's try it!

In [None]:
%xmode plain

gdp_bad / cpi_bad

Whoops, what happened? We should probably look at our data:

In [None]:
gdp_bad

In [None]:
gdp_bad.dtypes

In [None]:
gdp_bad['DATE'][0]

So, we've tried to divide a datetime by a datetime, and pandas has correctly raised a type error. That raises another issue though. These two timeseries have different frequencies.

In [None]:
cpi_bad.head()

CPI is measured monthly, while GDP is quarterly. What we'd really need to do is *join* the two timeseries on the `DATE` variable, and then do the operation. We could do that, but let's do things the pandorable way first.
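
For comparison, the join-based approach could look roughly like this. It's a sketch on made-up toy frames (`gdp_toy` and `cpi_toy` are hypothetical names with invented values, not the real files), assuming an inner join on `DATE` is what we want:

```python
import pandas as pd

# Toy stand-ins for the real files; the values are made up.
gdp_toy = pd.DataFrame({
    "DATE": pd.to_datetime(["2000-01-01", "2000-04-01"]),
    "GDP": [100.0, 110.0],
})
cpi_toy = pd.DataFrame({
    "DATE": pd.to_datetime(
        ["2000-01-01", "2000-02-01", "2000-03-01", "2000-04-01"]
    ),
    "CPIAUCSL": [50.0, 51.0, 52.0, 55.0],
})

# An inner join keeps only the dates that appear in both tables.
merged = pd.merge(gdp_toy, cpi_toy, on="DATE", how="inner")

# With the rows matched up, the division is an ordinary column operation.
merged["real_gdp"] = merged["GDP"] / merged["CPIAUCSL"]
print(merged)
```

This works, but we had to spell out the matching step ourselves before doing any arithmetic.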

A DataFrame is a 2-D data structure composed of three components:

1. The *values*, the actual data
2. The *row labels*, stored in a `pandas.Index` class, accessible with `.index`
3. The *column labels*, stored in a `pandas.Index` class, accessible with `.columns`


![](https://pandas.pydata.org/docs/_images/01_table_dataframe1.svg)
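
The three components can be inspected directly. Here is a small sketch on a toy DataFrame (`df_toy` and its values are made up for illustration):

```python
import pandas as pd

# A toy DataFrame with dates as row labels; the values are made up.
df_toy = pd.DataFrame(
    {"GDP": [100.0, 110.0], "CPI": [50.0, 55.0]},
    index=pd.to_datetime(["2000-01-01", "2000-04-01"]),
)

print(df_toy.to_numpy())  # 1. the values, as a 2-D NumPy array
print(df_toy.index)       # 2. the row labels
print(df_toy.columns)     # 3. the column labels
```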

We'll use the *index* to store our *labels* (the dates). Then the only thing in the values is our observations (the GDP or CPI).

In [None]:
# Notice that we select the GDP column to convert the
# 1-column DataFrame to a 1D Series
gdp = pd.read_csv('data/GDP.csv.gz', index_col='DATE',
                  parse_dates=['DATE'])["GDP"]
gdp.head()

Notice that we selected the single column `"GDP"` using `[]`. This returns a `pandas.Series` object, a 1-D array *with row labels*.

![](https://pandas.pydata.org/docs/_images/01_table_series.svg)

In [None]:
type(gdp)

In [None]:
gdp.index

The actual values are a NumPy array of floats.

In [None]:
gdp.to_numpy()[:10]

Let's read in CPI as well.

In [None]:
cpi = pd.read_csv('data/CPIAUCSL.csv.gz', index_col='DATE',
                  parse_dates=['DATE'])["CPIAUCSL"]
cpi.head()

And let's try the formula again.

In [None]:
rgdp = gdp / cpi
rgdp

**What happened?**

We've gotten our answer, but is there anything in the output that's surprising? What are these `NaN`s?

In pandas, any time you do an operation involving multiple pandas objects (dataframes, series), pandas will *align* the inputs. Alignment is a two-step process:

1. Take the union of the labels
2. Reindex all the inputs to the union of the labels

Only after that does the operation (division in this case) happen.
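
Here is a minimal sketch of those two steps on toy data. The Series `a` and `b` and their labels are made up; the label `"x"` exists only in `a`, so it has nothing to be divided by.

```python
import pandas as pd

# Two toy Series whose labels only partly overlap.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# pandas aligns on the union of the labels before dividing.
result = a / b
print(result)
```

The result has one row per label in the union (`x`, `y`, `z`), and the row for `"x"` is `NaN` because `b` has no value for that label.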

Looking at the raw data, we see that CPI is measured monthly, while GDP is just measured quarterly. So pandas has aligned the two (to monthly frequency, since that's the union), inserting missing values where there weren't any previously.

In [None]:
# manual alignment, just for demonstration:

all_dates = gdp.index.union(cpi.index)
all_dates

In [None]:
gdp2 = gdp.reindex(all_dates)
gdp2

In [None]:
cpi2 = cpi.reindex(all_dates)
cpi2

In [None]:
rgdp2 = gdp2 / cpi2
rgdp2

So when we wrote

```python
rgdp = gdp / cpi
```

pandas performs

```python
all_dates = gdp.index.union(cpi.index)
rgdp = gdp.reindex(all_dates) / cpi.reindex(all_dates)
```

This behavior is somewhat peculiar to pandas, but once you're used to it, it's hard to go back. Having pandas handle the labels and alignment eliminates a class of errors that come from datasets not being aligned.
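
To see the kind of error this avoids, consider two toy Series (made-up values) that hold the same labels in a different order. This sketch contrasts label-based division in pandas with position-based division on the underlying NumPy arrays:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=["a", "b"])
s2 = pd.Series([20.0, 10.0], index=["b", "a"])  # same labels, different order

# pandas pairs "a" with "a" and "b" with "b", regardless of position.
by_label = s1 / s2

# NumPy arrays carry no labels, so values are paired by position.
by_position = s1.to_numpy() / s2.to_numpy()

print(by_label)
print(by_position)
```

The positional version silently divides the value for `"a"` by the value for `"b"`, which is exactly the mistake alignment protects against.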

## Missing Data

Just a quick aside on handling missing data: pandas provides tools for detecting and dealing with missing data. We'll use these throughout the tutorial.

In [None]:
rgdp.isna()

In [None]:
rgdp.dropna()

In [None]:
rgdp.ffill()  # forward-fill; or use rgdp.fillna(0) to fill with a scalar

## Exercise:

Normalize real GDP to year **2000** dollars.

Right now, the unit on the `CPI` variable is "Index 1982-1984=100". This means the index is scaled so that its average over 1982 through 1984 equals 100.

To *renormalize* an index like CPI, divide it by the average of a different timespan (say the year 2000) and multiply by 100.
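
As a warm-up, here is the same idea on a toy index. The series `index_toy` and its values are made up, and the new base year (1995) is chosen arbitrarily for illustration:

```python
import pandas as pd

# A toy price index with made-up values.
index_toy = pd.Series(
    [100.0, 150.0, 200.0],
    index=pd.to_datetime(["1990-01-01", "1995-01-01", "2000-01-01"]),
)

# Average of the index over the new base period.
new_base = index_toy.loc["1995"].mean()

# Divide by the new base and multiply by 100, so the base period reads 100.
rebased = 100 * index_toy / new_base
print(rebased)
```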

In [None]:
# use `.loc[start:end]` or `.loc["<year>"]` to slice a subset of *rows*
cpi.loc['1982':'1984'].mean()  # close enough to 100

In [None]:
# Get the mean CPI for the year 2000
cpi_2000_average = cpi.loc[...]...

# *renormalize* the entire `cpi` series to "Index 2000" units.
cpi_2000 = 100 * (... / ...)

# Compute real GDP again, this time in "year 2000 dollars".
rgdp_2000 = ...
rgdp_2000

In [None]:
%%file solutions/alignment-cpi2000.py
cpi_2000_average = cpi.loc["2000"].mean()

# *renormalize* the entire `cpi` series to "Index 2000" units.
cpi_2000 = 100 * (cpi / cpi_2000_average)

# Compute real GDP again, this time in "year 2000 dollars".
rgdp_2000 = gdp / cpi_2000
rgdp_2000

## Summary

In pandas, you generally want to have *meaningful row labels*. They should uniquely identify each observation.
Having a unique identifier is just good data hygiene. And since they're in the index, they stay out of the way in operations.
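
One way to check for unique labels is `Index.is_unique`. Here is a small sketch on a toy Series (the values and labels are made up):

```python
import pandas as pd

# A toy Series where the label "b" appears twice.
s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "b"])

print(s.index.is_unique)     # False
print(s.index.duplicated())  # marks repeats after the first occurrence
```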

## Next Steps

Now we'll discuss [Tidy Data](tidy.ipynb).