# Splicing Together Data Sources 

Earlier, I described a number of strategies for merging two related data sets. In a financial or economic context, there are a few widely occurring use cases:

* Switching from one data source (a time series or collection of time series) to another at a specific point in time

* "Patching" missing values in a time series at the beginning, middle, or end using another time series

* Completely replacing the data for a subset of symbols (countries, asset tickers, and so on)

In the first case, switching from one data set to another at a specific instant, it is a matter of splicing together two Series or DataFrame objects using pandas.concat:

In [32]:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

In [33]:
data1 = DataFrame(np.ones((6, 3), dtype=float),
                  columns=list('abc'),
                  index=pd.date_range('6/12/2012', periods=6, freq='B'))

In [34]:
data2 = DataFrame(np.ones((6, 3), dtype=float) * 3,
                  columns=list('abc'),
                  index=pd.date_range('6/13/2012', periods=6, freq='B'))

In [35]:
data1, data2

(              a    b    c
 2012-06-12  1.0  1.0  1.0
 2012-06-13  1.0  1.0  1.0
 2012-06-14  1.0  1.0  1.0
 2012-06-15  1.0  1.0  1.0
 2012-06-18  1.0  1.0  1.0
 2012-06-19  1.0  1.0  1.0,
               a    b    c
 2012-06-13  3.0  3.0  3.0
 2012-06-14  3.0  3.0  3.0
 2012-06-15  3.0  3.0  3.0
 2012-06-18  3.0  3.0  3.0
 2012-06-19  3.0  3.0  3.0
 2012-06-20  3.0  3.0  3.0)

In [36]:
spliced = pd.concat([data1.loc[:'6/12/2012'], data2.loc['6/13/2012':]])

In [37]:
spliced

              a    b    c
2012-06-12  1.0  1.0  1.0
2012-06-13  3.0  3.0  3.0
2012-06-14  3.0  3.0  3.0
2012-06-15  3.0  3.0  3.0
2012-06-18  3.0  3.0  3.0
2012-06-19  3.0  3.0  3.0
2012-06-20  3.0  3.0  3.0
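
Label slicing with .loc includes both endpoints, so reusing the same date in both slices would emit that row twice. The following is a minimal sketch of a reusable splice that avoids this by using boolean masks; the helper name `splice_at` and the frames `old` and `new` are hypothetical and only illustrate the idea:

```python
import numpy as np
import pandas as pd


def splice_at(old, new, switch_date):
    """Rows of `old` strictly before switch_date, rows of `new` from it onward."""
    switch = pd.Timestamp(switch_date)
    # Strict "<" on one side and ">=" on the other guarantees that the
    # switch date itself is taken from `new` only.
    return pd.concat([old[old.index < switch], new[new.index >= switch]])


old = pd.DataFrame(np.ones((6, 3)), columns=list('abc'),
                   index=pd.date_range('2012-06-12', periods=6, freq='B'))
new = pd.DataFrame(np.ones((6, 3)) * 3, columns=list('abc'),
                   index=pd.date_range('2012-06-13', periods=6, freq='B'))

result = splice_at(old, new, '2012-06-13')
print(result)
```

The result has one row from `old` (2012-06-12) followed by all six rows of `new`, with no duplicated dates.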


Suppose in a similar example that data1 was missing a time series present in data2:

In [41]:
data2 = DataFrame(np.ones((6, 4), dtype=float) * 2,
                  columns=['a', 'b', 'c', 'd'],
                  index=pd.date_range('6/13/2012', periods=6))

In [42]:
spliced = pd.concat([data1.loc[:'2012-06-14'], data2.loc['2012-06-15':]])

In [43]:
spliced

              a    b    c    d
2012-06-12  1.0  1.0  1.0  NaN
2012-06-13  1.0  1.0  1.0  NaN
2012-06-14  1.0  1.0  1.0  NaN
2012-06-15  2.0  2.0  2.0  2.0
2012-06-16  2.0  2.0  2.0  2.0
2012-06-17  2.0  2.0  2.0  2.0
2012-06-18  2.0  2.0  2.0  2.0
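
The missing values in 'd' appear because pandas.concat keeps the union of the columns by default. As a small illustration (the frames `left` and `right` are made up for this sketch), passing join='inner' keeps only the columns both inputs share:

```python
import pandas as pd

left = pd.DataFrame(1.0, columns=list('abc'),
                    index=pd.date_range('2012-06-12', periods=2))
right = pd.DataFrame(2.0, columns=list('abcd'),
                     index=pd.date_range('2012-06-14', periods=2))

# Default join='outer': union of columns, 'd' is NaN for rows from `left`.
outer = pd.concat([left, right])

# join='inner': intersection of columns, so 'd' is dropped entirely.
inner = pd.concat([left, right], join='inner')

print(outer)
print(inner)
```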


Using combine_first, you can bring in data from before the splice point to extend the history for the 'd' item:

In [45]:
spliced_filled = spliced.combine_first(data2)

spliced_filled

              a    b    c    d
2012-06-12  1.0  1.0  1.0  NaN
2012-06-13  1.0  1.0  1.0  2.0
2012-06-14  1.0  1.0  1.0  2.0
2012-06-15  2.0  2.0  2.0  2.0
2012-06-16  2.0  2.0  2.0  2.0
2012-06-17  2.0  2.0  2.0  2.0
2012-06-18  2.0  2.0  2.0  2.0


Since data2 does not have any values for 2012-06-12, no values are filled on that day.
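
The same method also covers the "patching" use case from the list at the top, where gaps at the beginning, middle, or end of one series are filled from another. A minimal sketch, with made-up series named `primary` and `backup`:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2012-06-12', periods=5, freq='B')

# `primary` has holes at the start, in the middle, and at the end.
primary = pd.Series([np.nan, 10.0, np.nan, 12.0, np.nan], index=idx)
# `backup` is complete but disagrees with `primary` where both have data.
backup = pd.Series([9.0, 99.0, 11.0, 99.0, 13.0], index=idx)

# Values from `primary` win; `backup` is consulted only where `primary` is NaN.
patched = primary.combine_first(backup)
print(patched)
```

The 99.0 values in `backup` never appear in `patched`, because `primary` already has data on those dates.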

DataFrame has a related method, update, for performing in-place updates. You have to pass overwrite=False to make it fill only the holes:

In [55]:
spliced.update(data2, overwrite=False)

In [56]:
spliced

              a    b    c    d
2012-06-12  1.0  1.0  1.0  NaN
2012-06-13  1.0  1.0  1.0  2.0
2012-06-14  1.0  1.0  1.0  2.0
2012-06-15  2.0  2.0  2.0  2.0
2012-06-16  2.0  2.0  2.0  2.0
2012-06-17  2.0  2.0  2.0  2.0
2012-06-18  2.0  2.0  2.0  2.0
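
To make the effect of the overwrite flag concrete, here is a small side-by-side sketch on made-up frames (`base` and `other` are hypothetical names). With the default overwrite=True, existing values are replaced wherever `other` has data; with overwrite=False, only NaN cells are touched:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2012-06-12', periods=3)
base = pd.DataFrame({'a': [1.0, 1.0, 1.0],
                     'd': [np.nan, np.nan, 5.0]}, index=idx)
# `other` covers only the last two dates.
other = pd.DataFrame({'a': [2.0, 2.0], 'd': [2.0, 2.0]}, index=idx[1:])

holes_only = base.copy()
holes_only.update(other, overwrite=False)  # fills NaN cells only

overwritten = base.copy()
overwritten.update(other)                  # default overwrite=True

print(holes_only)
print(overwritten)
```

In both cases the first row is left alone, since `other` has no data for 2012-06-12. Note that update modifies the frame in place and returns None, which is why each variant works on its own copy.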


To replace the data for a subset of symbols, you can use any of the above techniques, but sometimes it’s simpler to just set the columns directly with DataFrame indexing:

In [57]:
cp_spliced = spliced.copy()

In [58]:
cp_spliced[['a', 'c']] = data1[['a', 'c']]

cp_spliced

              a    b    c    d
2012-06-12  1.0  1.0  1.0  NaN
2012-06-13  1.0  1.0  1.0  2.0
2012-06-14  1.0  1.0  1.0  2.0
2012-06-15  1.0  2.0  1.0  2.0
2012-06-16  NaN  2.0  NaN  2.0
2012-06-17  NaN  2.0  NaN  2.0
2012-06-18  1.0  2.0  1.0  2.0
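
The missing values on 2012-06-16 and 2012-06-17 come from index alignment: the assigned columns are matched to the target's index by date, and data1 has no rows for the weekend. The sketch below reproduces this on made-up frames (`daily` and `business` are hypothetical names) and shows one possible way to replace values only on the dates both frames share:

```python
import pandas as pd

# Friday through Monday, calendar days.
daily = pd.DataFrame(2.0, columns=list('abcd'),
                     index=pd.date_range('2012-06-15', periods=4))
# Friday and Monday only, business days.
business = pd.DataFrame(1.0, columns=list('abc'),
                        index=pd.date_range('2012-06-15', periods=2, freq='B'))

# Whole-column assignment aligns on the index: weekend rows become NaN.
aligned = daily.copy()
aligned[['a', 'c']] = business[['a', 'c']]

# Restricting the assignment to shared dates leaves the weekend rows untouched.
common = daily.index.intersection(business.index)
partial = daily.copy()
partial.loc[common, ['a', 'c']] = business.loc[common, ['a', 'c']]

print(aligned)
print(partial)
```

Which behavior is preferable depends on whether stale values from the old source should survive on dates the new source does not cover.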
