# Time Series and Cross-Section Alignment

One of the most time-consuming issues in working with financial data is the so-called data alignment problem. Two related time series may have indexes that don’t line up perfectly, or two DataFrame objects might have columns or row labels that don’t match. Users of MATLAB, R, and other matrix-programming languages often invest significant effort in wrangling data into perfectly aligned forms. In my experience, having to align data by hand (and worse, having to verify that data is aligned) is a far too rigid and tedious way to work. It is also rife with potential for bugs due to combining misaligned data.

pandas takes an alternative approach by automatically aligning data in arithmetic operations. In practice, this removes most of the manual alignment work and the bugs that come with it. As an example, let’s consider a couple of DataFrames containing time series of stock prices and volume:

In [1]:
import pandas as pd
import numpy as np
from pandas import DataFrame, Series

In [4]:
from pandas_datareader import data as web
import yfinance
import datetime

start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2017, 6, 1)

# Download daily data for each ticker into a dict of DataFrames
all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOGL']:
    all_data[ticker] = web.get_data_yahoo(ticker, start, end)

# Build one column per ticker, aligned on the date index
price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.items()})
volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.items()})

In [8]:
print(price.head(), '\n', volume.head())

                AAPL        IBM       MSFT      GOOGL
Date                                                 
2009-12-31  6.426000  81.321327  23.389154  15.515015
2010-01-04  6.526021  82.284233  23.749805  15.684434
2010-01-05  6.537303  81.290230  23.757488  15.615365
2010-01-06  6.433319  80.762161  23.611691  15.221722
2010-01-07  6.421424  80.482620  23.366137  14.867367 
                    AAPL        IBM        MSFT        GOOGL
Date                                                       
2009-12-31  352410800.0  4417676.0  31929700.0   48743208.0
2010-01-04  493729600.0  6438444.0  38409100.0   78169752.0
2010-01-05  601904800.0  7156104.0  49749600.0  120067812.0
2010-01-06  552160000.0  5863144.0  58182400.0  158988852.0
2010-01-07  477131200.0  6109268.0  50559700.0  256315428.0


In [11]:
price.head() * volume.head()

                    AAPL          IBM          MSFT         GOOGL
Date
2009-12-31  2264592000.0  359251300.0   746808700.0   756251600.0
2010-01-04  3222090000.0  529782400.0   912208700.0  1226048000.0
2010-01-05  3934834000.0  581721300.0  1181926000.0  1874903000.0
2010-01-06  3552221000.0  473520200.0  1373785000.0  2420084000.0
2010-01-07  3063862000.0  491689900.0  1181385000.0  3810735000.0
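
The same alignment applies when the two operands do not share all of their dates. The following is a minimal sketch with made-up numbers (the names `px`, `vol`, and `notional` are illustrative and not part of the original example): dates present in only one operand come out as NaN.

```python
import pandas as pd

# Two hypothetical series whose dates only partly overlap
px = pd.Series([10.0, 11.0, 12.0],
               index=pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-06']))
vol = pd.Series([100.0, 200.0, 300.0],
                index=pd.to_datetime(['2010-01-05', '2010-01-06', '2010-01-07']))

# Multiplication aligns on the union of the two indexes;
# a date missing from either operand yields NaN in the result
notional = px * vol
print(notional)
```

No manual reindexing is needed before the multiplication; the result carries the union of both date indexes.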


In [12]:
vwap = (price * volume).sum() / volume.sum()

In [14]:
vwap, vwap.dropna()

(AAPL      15.345857
 IBM      110.330778
 MSFT      29.922578
 GOOGL     20.255270
 dtype: float64,
 AAPL      15.345857
 IBM      110.330778
 MSFT      29.922578
 GOOGL     20.255270
 dtype: float64)
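
To see when `dropna` actually matters, here is a small sketch with invented data in which one price column has no matching volume column. The frames `price_demo` and `volume_demo` and all the numbers in them are hypothetical and stand in for the downloaded data.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2010-01-04', periods=3)

# Hypothetical data: 'SPX' has prices but no volume column
price_demo = pd.DataFrame({'AAPL': [6.5, 6.6, 6.4],
                           'SPX': [1130.0, 1136.0, 1137.0]}, index=dates)
volume_demo = pd.DataFrame({'AAPL': [100.0, 300.0, 200.0]}, index=dates)

# The product has an all-NaN 'SPX' column; its sum is 0.0 because sum()
# skips NaN, but dividing by the missing 'SPX' volume total gives NaN
vwap_demo = (price_demo * volume_demo).sum() / volume_demo.sum()
print(vwap_demo)

# Discard tickers for which no VWAP could be computed
print(vwap_demo.dropna())
```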

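In this example every ticker appears in both price and volume, so dropna leaves vwap unchanged. If price contained a column with no counterpart in volume (an index such as SPX, for instance), its VWAP would be NaN, and you could explicitly discard it at any point with dropna. Should you wish to align by hand, you can use DataFrame’s align method, which returns a tuple of reindexed versions of the two objects: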

In [17]:
price.align(volume, join='inner')

(                 AAPL         IBM       MSFT      GOOGL
 Date                                                   
 2009-12-31   6.426000   81.321327  23.389154  15.515015
 2010-01-04   6.526021   82.284233  23.749805  15.684434
 2010-01-05   6.537303   81.290230  23.757488  15.615365
 2010-01-06   6.433319   80.762161  23.611691  15.221722
 2010-01-07   6.421424   80.482620  23.366137  14.867367
 ...               ...         ...        ...        ...
 2017-05-25  36.365360  114.259567  65.071732  49.592999
 2017-05-26  36.303917  113.730026  65.389511  49.663502
 2017-05-30  36.318092  113.163223  65.810104  49.808498
 2017-05-31  36.103020  113.834473  65.277336  49.354500
 2017-06-01  36.202290  113.864311  65.520363  49.414501
 
 [1867 rows x 4 columns],
                    AAPL        IBM        MSFT        GOOGL
 Date                                                       
 2009-12-31  352410800.0  4417676.0  31929700.0   48743208.0
 2010-01-04  493729600.0  6438444.0  38409100.0 
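
Because price and volume above already share the same labels, the effect of the `join` argument is easier to see on two small frames that only partly overlap. This is a sketch on made-up data; `left` and `right` are illustrative names.

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=list('xyz'))
right = pd.DataFrame({'B': [7, 8], 'C': [9, 10]}, index=list('yw'))

# join='inner' keeps only the row and column labels common to both frames
l_in, r_in = left.align(right, join='inner')
print(l_in, '\n', r_in)

# The default join='outer' uses the union of labels and fills gaps with NaN
l_out, r_out = left.align(right)
print(l_out, '\n', r_out)
```

In both cases the two returned frames have identical row and column labels, so they can be combined element by element without further alignment.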

Another indispensable feature is constructing a DataFrame from a collection of potentially differently indexed Series:

In [18]:
s1 = Series(range(3), index=list('abc'))

s2 = Series(range(4), index=list('dbce'))

s3 = Series(range(3), index=list('fac'))

In [19]:
DataFrame({'one': s1, 'two': s2, 'three': s3})

   one  two  three
a  0.0  NaN    1.0
b  1.0  1.0    NaN
c  2.0  2.0    2.0
d  NaN  0.0    NaN
e  NaN  3.0    NaN
f  NaN  NaN    0.0


As you have seen earlier, you can also specify the index of the result explicitly; any data with labels outside that index is discarded:

In [20]:
DataFrame({'one': s1, 'two': s2, 'three': s3}, index=list('face'))

   one  two  three
f  NaN  NaN    0.0
a  0.0  NaN    1.0
c  2.0  2.0    2.0
e  NaN  3.0    NaN
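
One way to think about the `index` argument is as a reindex applied to the fully aligned frame. The sketch below checks that reading on the same three Series; the names `labels`, `direct`, and `via_reindex` are introduced here for illustration only.

```python
import pandas as pd

s1 = pd.Series(range(3), index=list('abc'))
s2 = pd.Series(range(4), index=list('dbce'))
s3 = pd.Series(range(3), index=list('fac'))

labels = list('face')

# Passing index= restricts and orders the rows at construction time
direct = pd.DataFrame({'one': s1, 'two': s2, 'three': s3}, index=labels)

# Building the full union first and reindexing afterwards
via_reindex = pd.DataFrame({'one': s1, 'two': s2, 'three': s3}).reindex(labels)

print(direct.equals(via_reindex))
```

Passing `index` up front avoids materializing the rows you are about to drop, which can matter when the Series are long.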
