# Combining Datasets: Concat and Append

In [2]:
import numpy as np
import pandas as pd

In [6]:
def make_df(cols, ind):
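    """Quickly make a DataFrame whose cells combine the column name and index label.

    The intermediate dict is equivalent to:
        data = {}
        for c in cols:
            data[c] = [str(c) + str(i) for i in ind]
    e.g. make_df('ABC', range(3)) builds
    {'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2'], 'C': ['C0', 'C1', 'C2']}
    """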
    data = {c: [str(c)+str(i) for i in ind]
               for c in cols}
    return pd.DataFrame(data, ind)
make_df('ABC', range(3))

Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2


In [46]:
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)
    
    

## Recall: Concatenation of NumPy Arrays

In [16]:
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x,y,z])

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [23]:
x = np.arange(1, 5).reshape((2,2))
np.concatenate([x, x], axis=1)

array([[1, 2, 1, 2],
       [3, 4, 3, 4]])

## Simple Concatenation with `pd.concat`

The signature of `pd.concat` in the pandas version used here (the `join_axes` parameter was removed in pandas 1.0):
```python
pd.concat(
    objs,
    axis=0,
    join='outer',
    join_axes=None,
    ignore_index=False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity=False,
    sort=None,
    copy=True,
)
```

In [33]:
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
print('ser1: \n', ser1)
print('ser2: \n', ser2)

ser1: 
 1    A
2    B
3    C
dtype: object
ser2: 
 4    D
5    E
6    F
dtype: object


In [36]:
pd.concat([ser1, ser2])

1    A
2    B
3    C
4    D
5    E
6    F
dtype: object

In [72]:
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')

Unnamed: 0,A,B
1,A1,B1
2,A2,B2

Unnamed: 0,A,B
3,A3,B3
4,A4,B4

Unnamed: 0,A,B
1,A1,B1
2,A2,B2
3,A3,B3
4,A4,B4


In [77]:
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
# pd.concat([df3, df4], sort=True, axis=1)
display('df3', 'df4', "pd.concat([df3, df4], sort=True, axis=1)")

Unnamed: 0,A,B
0,A0,B0
1,A1,B1

Unnamed: 0,C,D
0,C0,D0
1,C1,D1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1


### Duplicate indices

One important difference between `np.concatenate` and `pd.concat` is that Pandas concatenation **preserves** indices, even if the result has duplicate indices.

In [84]:
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
print(display('x', 'y'))

# give y the same index labels as x to create duplicate indices
y.index= x.index
display('x', 'y', 'pd.concat([x, y], sort=True)')


x
    A   B
0  A0  B0
1  A1  B1

y
    A   B
2  A2  B2
3  A3  B3


Unnamed: 0,A,B
0,A0,B0
1,A1,B1

Unnamed: 0,A,B
0,A2,B2
1,A3,B3

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
0,A2,B2
1,A3,B3


Notice the repeated indices in the result. While this is valid within `DataFrame`s, the outcome is often undesirable.  
`pd.concat()` provides a few ways to handle it.
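Before choosing among these options, it can help to check whether a concatenated result actually contains repeated labels. The sketch below rebuilds `x` and `y` by hand (rather than with `make_df`) so that it is self-contained, and uses the `Index.is_unique` and `Index.duplicated` attributes. The variable name `combined` is only illustrative.

```python
import pandas as pd

# Stand-ins for x and y above, with y already sharing x's index labels
x = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
y = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]}, index=[0, 1])

combined = pd.concat([x, y])

# False here, because labels 0 and 1 each appear twice
print(combined.index.is_unique)

# Boolean mask marking every label that occurs more than once
print(combined.index.duplicated(keep=False))

# A label-based lookup returns all matching rows, not a single row
print(combined.loc[0])
```

The last lookup illustrates why duplicates are often undesirable: `combined.loc[0]` yields a two-row `DataFrame` where a single row might have been expected.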

#### Catching the repeats as an error

If the `verify_integrity` flag is `True`, the concatenation raises an exception when there are duplicate indices.

In [88]:
try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print('ValueError :', e)

ValueError : Indexes have overlapping values: Int64Index([0, 1], dtype='int64')


#### Ignoring the index

In [91]:
display('x', 'y', 'pd.concat([x, y], ignore_index=True)')

Unnamed: 0,A,B
0,A0,B0
1,A1,B1

Unnamed: 0,A,B
0,A2,B2
1,A3,B3

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
2,A2,B2
3,A3,B3


#### Adding MultiIndex keys

Another option is to use the `keys` option to specify a label for each data source. The result is a hierarchically indexed `DataFrame` whose outer index level holds those labels.


In [93]:
display('x', 'y', 'pd.concat([x, y], keys=["x", "y"])')

Unnamed: 0,A,B
0,A0,B0
1,A1,B1

Unnamed: 0,A,B
0,A2,B2
1,A3,B3

Unnamed: 0,Unnamed: 1,A,B
x,0,A0,B0
x,1,A1,B1
y,0,A2,B2
y,1,A3,B3
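Once the keys are in place, the outer index level can be used to pull out the rows that came from one source. The following is a minimal sketch. The frames are rebuilt by hand so the block stands alone, and the level names `"source"` and `"row"` are hypothetical labels chosen for illustration.

```python
import pandas as pd

x = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
y = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]}, index=[0, 1])

# keys adds an outer index level; names (optional) labels both levels
combined = pd.concat([x, y], keys=["x", "y"], names=["source", "row"])

# Selecting on the outer level recovers the rows contributed by y
from_y = combined.loc["y"]
print(from_y)

# A full (source, row) tuple addresses a single cell
print(combined.loc[("x", 1), "A"])  # A1
```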


### Concatenation with joins

Consider the concatenation of the following two `DataFrame`s, which have **some (but not all) columns in common**.  
By default, the entries for which no data is available are filled with `NaN` values.


In [96]:
df5 = make_df('ABC', [1,2])
df6 = make_df('BCD', [3,4])
display('df5', 'df6', 'pd.concat([df5, df6], sort=True)')

Unnamed: 0,A,B,C
1,A1,B1,C1
2,A2,B2,C2

Unnamed: 0,B,C,D
3,B3,C3,D3
4,B4,C4,D4

Unnamed: 0,A,B,C,D
1,A1,B1,C1,
2,A2,B2,C2,
3,,B3,C3,D3
4,,B4,C4,D4


We can specify one of several options for the `join` and `join_axes` parameters of `pd.concat` to change this.  
By default, the join is a union of the input columns (`join='outer'`), but we can change it to an intersection of the columns using `join='inner'`.

In [99]:
display('df5', 'df6', 
        'pd.concat([df5, df6], sort=True, join="inner")')

Unnamed: 0,A,B,C
1,A1,B1,C1
2,A2,B2,C2

Unnamed: 0,B,C,D
3,B3,C3,D3
4,B4,C4,D4

Unnamed: 0,B,C
1,B1,C1
2,B2,C2
3,B3,C3
4,B4,C4


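Another option is to directly specify the index of the remaining columns using the `join_axes` argument, which takes a list of index objects. Note that `join_axes` was deprecated in pandas 0.25 and removed in 1.0, so the following cell only runs on older versions.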

In [101]:
display('df5', 'df6',
        'pd.concat([df5, df6], sort=True, join_axes=[df5.columns])')

Unnamed: 0,A,B,C
1,A1,B1,C1
2,A2,B2,C2

Unnamed: 0,B,C,D
3,B3,C3,D3
4,B4,C4,D4

Unnamed: 0,A,B,C
1,A1,B1,C1
2,A2,B2,C2
3,,B3,C3
4,,B4,C4
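On pandas 1.0 and later, where `join_axes` is no longer available, a similar result can be obtained by concatenating with the default outer join and then reindexing the columns. This is a sketch of one possible replacement, with `df5` and `df6` rebuilt by hand so the block is self-contained.

```python
import pandas as pd

df5 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1", "C2"]},
                   index=[1, 2])
df6 = pd.DataFrame({"B": ["B3", "B4"], "C": ["C3", "C4"], "D": ["D3", "D4"]},
                   index=[3, 4])

# Outer-join concat keeps columns A, B, C, D; reindex then keeps only df5's columns
result = pd.concat([df5, df6]).reindex(columns=df5.columns)
print(result)
```

Column `D` is dropped, and the rows that came from `df6` hold `NaN` in column `A`, matching the table above.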


### The `append()` method

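Rather than calling `pd.concat([df1, df2])`, we can simply call `df1.append(df2)`.  
Unlike the `append()` and `extend()` methods of Python lists, the `append()` method in Pandas **does not modify** the original object; instead, it creates a new object with the combined data. Because each call copies the data, building a `DataFrame` through repeated appends is inefficient. Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so the following cell only runs on older versions.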
 

In [103]:
display('df1', 'df2', 'df1.append(df2)')


Unnamed: 0,A,B
1,A1,B1
2,A2,B2

Unnamed: 0,A,B
3,A3,B3
4,A4,B4

Unnamed: 0,A,B
1,A1,B1
2,A2,B2
3,A3,B3
4,A4,B4
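On pandas 2.0 and later, the same result comes from `pd.concat`, which likewise returns a new object and leaves its inputs untouched. The sketch below rebuilds `df1` and `df2` by hand. The names `combined`, `pieces`, and `all_rows` are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]}, index=[1, 2])
df2 = pd.DataFrame({"A": ["A3", "A4"], "B": ["B3", "B4"]}, index=[3, 4])

# Equivalent of df1.append(df2): a new DataFrame is returned
combined = pd.concat([df1, df2])

print(len(df1))       # 2, df1 itself is unchanged
print(len(combined))  # 4

# When accumulating many pieces, collect them in a list and concatenate once
pieces = [df1, df2]
all_rows = pd.concat(pieces, ignore_index=True)
print(all_rows)
```

Collecting the pieces first and calling `pd.concat` once avoids copying the accumulated data on every step.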
