*Combining Datasets: Concat and Append*

Some of the most interesting studies of data come from combining different data sources. These operations can involve anything from very straightforward concatenation of two different datasets, to more complicated database-style joins and merges that correctly handle any overlaps between the datasets. Series and DataFrames are built with this type of operation in mind, and Pandas includes functions and methods that make this fast and straightforward.



In [1]:
import pandas as pd
import numpy as np

In [2]:
# define function that creates DataFrame of particular form
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

# Example DataFrame
make_df('ABC', range(3))

    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2


Concatenation of Series and DataFrame objects is very similar to the concatenation of NumPy arrays, which is done with the np.concatenate function:

In [3]:
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

The first argument is a list or tuple of arrays to concatenate. The function also takes an axis keyword that specifies the axis along which the arrays are concatenated:

In [4]:
x = [[1, 2], [3, 4]]
np.concatenate([x, x], axis=1)

array([[1, 2, 1, 2],
       [3, 4, 3, 4]])

*Simple Concatenation with pd.concat*

Pandas has a function, *pd.concat()*, which is similar to np.concatenate but offers a number of additional options:

In [None]:
# Signature of pd.concat in older pandas releases (join_axes was removed in pandas 1.0)
# pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
#           keys=None, levels=None, names=None, verify_integrity=False,
#           copy=True)

pd.concat can be used for a simple concatenation of Series or DataFrame objects, just as np.concatenate() can be used for simple concatenations of arrays:

In [6]:
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])

1    A
2    B
3    C
4    D
5    E
6    F
dtype: object

It also works to concatenate higher-dimensional objects, such as DataFrames:

In [7]:
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
print(df1); print(df2); print(pd.concat([df1, df2]))

    A   B
1  A1  B1
2  A2  B2
    A   B
3  A3  B3
4  A4  B4
    A   B
1  A1  B1
2  A2  B2
3  A3  B3
4  A4  B4


By default, the concatenations take place row-wise within the DataFrame (i.e., axis=0). Like np.concatenate, pd.concat allows specification of an axis along which concatenation will take place.

In [None]:
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
# Note: 'columns' is the string alias for axis=1; the abbreviation 'col' is not accepted by recent pandas
print(df3); print(df4); print(pd.concat([df3, df4], axis=1))
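
As a minimal sketch of column-wise concatenation, the following builds two small frames (the names left and right are illustrative, not from the notebook) and compares the integer and string spellings of the axis. It assumes a pandas release in which 'columns' is the documented alias for axis=1.

```python
import pandas as pd

# Two small frames sharing the same row index but with different columns
left = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
right = pd.DataFrame({"C": ["C0", "C1"], "D": ["D0", "D1"]}, index=[0, 1])

# axis=1 and axis='columns' both request column-wise concatenation
by_number = pd.concat([left, right], axis=1)
by_name = pd.concat([left, right], axis="columns")

print(by_number)
print(by_number.equals(by_name))
```

The result has the four columns A, B, C, D side by side, with rows aligned on the shared index.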

**Duplicate indices**

An important difference between np.concatenate and pd.concat is that Pandas concatenation preserves indices, even if the result will have duplicate indices:

In [15]:
x = make_df('AB', [0, 1])
y = make_df('AB', [0, 1])
y.index = x.index # make duplicate indices
print(x); print(y); print(pd.concat([x, y]))

    A   B
0  A0  B0
1  A1  B1
    A   B
0  A0  B0
1  A1  B1
    A   B
0  A0  B0
1  A1  B1
0  A0  B0
1  A1  B1


**Catching the repeats as an error.** To verify that the indices in the result of pd.concat() do not overlap, you can specify the verify_integrity flag. With this set to True, the concatenation will raise an exception if there are duplicate indices:

In [16]:
try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)

ValueError: Indexes have overlapping values: Int64Index([0, 1], dtype='int64')
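
If raising an exception is too strict, duplicates can also be inspected after the fact. The sketch below is one possible approach, not part of the original notebook: it uses Index.is_unique and Index.duplicated() on two illustrative frames, a and b.

```python
import pandas as pd

a = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
b = pd.DataFrame({"A": ["A2", "A3"]}, index=[0, 1])  # same labels as a

combined = pd.concat([a, b])

# Index.is_unique is False when any label repeats
print(combined.index.is_unique)

# Index.duplicated() marks every repeat after the first occurrence
print(combined.index[combined.index.duplicated()].tolist())
```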


**Ignoring the index.** Sometimes the index itself does not matter, and you would prefer it to be ignored. In that case, you can use the ignore_index flag. With this set to True, the concatenation will create a new integer index for the result:

In [17]:
print(x); print(y); print(pd.concat([x, y], ignore_index=True))


    A   B
0  A0  B0
1  A1  B1
    A   B
0  A0  B0
1  A1  B1
    A   B
0  A0  B0
1  A1  B1
2  A0  B0
3  A1  B1


**Adding MultiIndex keys.** Another alternative is to use the keys option to specify a label for each data source; the result will be a hierarchically indexed DataFrame containing the data:

In [18]:
print(x); print(y); print(pd.concat([x, y], keys=['x', 'y']))

    A   B
0  A0  B0
1  A1  B1
    A   B
0  A0  B0
1  A1  B1
      A   B
x 0  A0  B0
  1  A1  B1
y 0  A0  B0
  1  A1  B1
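
One benefit of the hierarchical index is that each source can be recovered by its key. The following sketch illustrates this with .loc; the frame contents are made up for the example so that the two sources can be told apart.

```python
import pandas as pd

x = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
y = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]}, index=[0, 1])

combined = pd.concat([x, y], keys=["x", "y"])

# Outer level: recover all rows that came from one source
print(combined.loc["y"])

# Both levels: a single value, addressed as (source, original label)
print(combined.loc[("x", 1), "A"])
```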


**Concatenation with joins**

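In the simple examples we just looked at, we were mainly concatenating DataFrames with shared column names. In practice, data from different sources might have different sets of column names. Consider the concatenation of the following two DataFrames, which have some (but not all) columns in common: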

In [20]:
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
print(df5); print(df6); print(pd.concat([df5, df6]))

    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   C   D
3  B3  C3  D3
4  B4  C4  D4
     A   B   C    D
1   A1  B1  C1  NaN
2   A2  B2  C2  NaN
3  NaN  B3  C3   D3
4  NaN  B4  C4   D4


By default, the entries for which no data is available are filled with NA values. To change this, we can specify the join parameter of pd.concat (older pandas releases also accepted a join_axes parameter). By default, the join is a union of the input columns (join='outer'), but we can change this to an intersection of the columns using join='inner':

In [21]:
print(df5); print(df6); print(pd.concat([df5, df6], join='inner'))

    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   C   D
3  B3  C3  D3
4  B4  C4  D4
    B   C
1  B1  C1
2  B2  C2
3  B3  C3
4  B4  C4
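
The same join logic applies to the row index when concatenating along columns. As a small illustration (the frames left and right are hypothetical), join='inner' keeps only the row labels present in both inputs, while join='outer' keeps them all and fills the gaps with NA values:

```python
import pandas as pd

left = pd.DataFrame({"A": ["A1", "A2", "A3"]}, index=[1, 2, 3])
right = pd.DataFrame({"B": ["B2", "B3", "B4"]}, index=[2, 3, 4])

# Column-wise concatenation aligns on the row index;
# join='inner' keeps only the labels shared by both frames
inner = pd.concat([left, right], axis=1, join="inner")
outer = pd.concat([left, right], axis=1, join="outer")

print(inner)
print(outer)
```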


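In older pandas releases, another option was to directly specify the index of the remaining columns using the join_axes argument, which takes a list of index objects. Here we specify that the returned columns should be the same as those of the first input. Because join_axes was removed in pandas 1.0, the call raises a TypeError on current releases: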

In [23]:
print(df5); print(df6); print(pd.concat([df5, df6], join_axes=[df5.columns]))

    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   C   D
3  B3  C3  D3
4  B4  C4  D4


TypeError: concat() got an unexpected keyword argument 'join_axes'
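
On pandas releases where join_axes is unavailable, one way to get the same effect is to concatenate with the default outer join and then select the wanted columns with reindex. This is a sketch of that workaround, with df5 and df6 rebuilt inline so the block is self-contained:

```python
import pandas as pd

df5 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1", "C2"]},
                   index=[1, 2])
df6 = pd.DataFrame({"B": ["B3", "B4"], "C": ["C3", "C4"], "D": ["D3", "D4"]},
                   index=[3, 4])

# Outer-join concat, then keep only the first frame's columns
result = pd.concat([df5, df6]).reindex(columns=df5.columns)
print(result)
```

The result keeps columns A, B, and C; column D is dropped, and the rows that came from df6 hold NA in column A.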

*The append() method*

Because direct array concatenation is so common, Series and DataFrame objects have an append method that can accomplish the same thing in fewer keystrokes. For example, rather than calling pd.concat([df1, df2]), you can simply call df1.append(df2):

In [24]:
print(df1); print(df2); print(df1.append(df2))

    A   B
1  A1  B1
2  A2  B2
    A   B
3  A3  B3
4  A4  B4
    A   B
1  A1  B1
2  A2  B2
3  A3  B3
4  A4  B4


Unlike the append() and extend() methods of Python lists, the append() method in Pandas does not modify the original object. Instead, it creates a new object with the combined data. It is also not a very efficient method, because it involves the creation of a new index and data buffer. If multiple append operations are planned, it is generally better to build a list of DataFrames and pass them all at once to the concat() function. Note that append() was deprecated in pandas 1.4 and removed in pandas 2.0, so on current releases pd.concat() is the way to combine objects.
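
The list-then-concat pattern might look like the following sketch. The loop and the batch and labels names are purely illustrative; in practice the pieces would come from files, queries, or other sources.

```python
import pandas as pd

# Collect the pieces in a plain Python list first...
pieces = []
for batch in range(3):
    labels = [2 * batch, 2 * batch + 1]
    pieces.append(
        pd.DataFrame(
            {"A": [f"A{i}" for i in labels], "B": [f"B{i}" for i in labels]},
            index=labels,
        )
    )

# ...then combine them with a single concat call
combined = pd.concat(pieces)
print(combined)
```

Appending to a Python list is cheap, so the index and data buffer of the combined result are built once, in the single concat call, rather than once per piece.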