
# Combining Datasets: Concat and Append

Some of the most interesting studies of data come from combining different data sources. These operations range from simple *concatenation* of two datasets to more complex database-style *joins* and *merges* that correctly handle any overlaps between the datasets.

*Series* and *DataFrame* are built with these operations in mind, and Pandas includes functions and methods that make this kind of data wrangling fast and straightforward.

We'll first look at simple concatenation using `pd.concat()` on *Series* and *DataFrame* and later move on to more complex in-memory merges and joins.

In [None]:
import numpy as np
import pandas as pd

In [None]:
# helper function that creates a DataFrame
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)

In [None]:
#example DataFrame
make_df('ABC',range(3))

Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2


## Recall: Concatenation of NumPy arrays

Concatenating *Series* and *DataFrame* objects is similar to concatenating NumPy arrays with `np.concatenate()`.
Using it, we can combine two or more arrays into one:

In [None]:
x=np.arange(0,10)
y=np.arange(0,5)
z=np.arange(0,3)
k=np.concatenate([x,y,z])
k

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 0, 1, 2])

The first argument is a list or tuple of arrays to concatenate. The second argument, `axis`, specifies the axis along which the arrays are joined.

In [None]:
h=np.arange(0,12).reshape(3,4)
print(np.concatenate([h,h], axis=1))

[[ 0  1  2  3  0  1  2  3]
 [ 4  5  6  7  4  5  6  7]
 [ 8  9 10 11  8  9 10 11]]
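As a quick check of how the `axis` argument changes the result, the following sketch concatenates the same 3×4 array along each axis and compares the shapes. The variable names `rows` and `cols` are illustrative only.

```python
import numpy as np

h = np.arange(12).reshape(3, 4)

# axis=0 (the default) stacks the arrays vertically, so the row count doubles
rows = np.concatenate([h, h], axis=0)

# axis=1 places the arrays side by side, so the column count doubles
cols = np.concatenate([h, h], axis=1)

print(rows.shape)  # (6, 4)
print(cols.shape)  # (3, 8)
```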


## Simple concatenation with pd.concat

Pandas has a function, `pd.concat()`, that is similar to `np.concatenate()` but offers many more options.

Signature in Pandas v0.18 (later releases differ; for example, `join_axes` was removed in pandas 1.0):

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
 keys=None, levels=None, names=None, verify_integrity=False,
 copy=True)

In [None]:
pd.concat?

This displays the documentation for the `pd.concat()` function.

`pd.concat()` can be used for simple concatenation of Series or DataFrame objects, just as `np.concatenate()` is used for simple concatenation of arrays.

In [None]:
ser1=pd.Series(['a','b','c'],index=[1,2,3])
ser2=pd.Series(['k','l','m'],index=[8,5,6])
pd.concat([ser1,ser2])

1    a
2    b
3    c
8    k
5    l
6    m
dtype: object

It also works for higher-dimensional objects such as DataFrames.

In [None]:
df1=make_df('AB',[1,2])
df2=make_df('AB',[9,8,7])
pd.concat([df1,df2])

Unnamed: 0,A,B
1,A1,B1
2,A2,B2
9,A9,B9
8,A8,B8
7,A7,B7


By default, the concatenation takes place row-wise (`axis=0`), just as in `np.concatenate()`. As with `np.concatenate()`, we can pass the `axis` argument to `pd.concat()` to concatenate along a different axis.

In [None]:
df3=make_df('AB',[1,2])
df4=make_df('CD',[1,2])
pd.concat([df3,df4],axis=1)
# axis='columns' also works; the abbreviation axis='col' raises an error

Unnamed: 0,A,B,C,D
1,A1,B1,C1,D1
2,A2,B2,C2,D2
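When concatenating along the columns, the rows are aligned on their index labels rather than on their position. The sketch below, using two small hypothetical frames whose indices only partly overlap, shows that labels missing from one input are filled with NaN.

```python
import pandas as pd

left = pd.DataFrame({"A": ["A1", "A2"]}, index=[1, 2])
right = pd.DataFrame({"C": ["C2", "C3"]}, index=[2, 3])

# axis="columns" is the string equivalent of axis=1
side = pd.concat([left, right], axis="columns")

# Only label 2 exists in both inputs; labels 1 and 3 get NaN on one side
print(side)
```

The result has one row per label in the union of the two indices, which is the same default "outer" behavior described for columns later in this section.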


### Duplicate indices

One important difference between `pd.concat()` and `np.concatenate()` is that `pd.concat()` preserves indices, even if the result then contains duplicate indices.

In [None]:
df4=make_df('AB',[1,2])
df5=make_df('AB',[5,6])

In [None]:
df5.index = df4.index  # make the indices duplicate
pd.concat([df4,df5])

Unnamed: 0,A,B
1,A1,B1
2,A2,B2
1,A5,B5
2,A6,B6


A result with repeated indices is valid for DataFrames, but the outcome is often undesirable. Pandas provides a few ways to handle it.

### Catching the repeats as an error

If you simply want to verify that the indices in the result of `pd.concat()` do not overlap, you can specify the `verify_integrity` flag. If it is set to `True`, the concatenation raises an exception when there are duplicate indices.

We'll use a `try`/`except` block to catch and print the exception.

In [None]:
try:
    pd.concat([df4,df5],verify_integrity=True)
except Exception as e:
    print(e)

Indexes have overlapping values: Int64Index([1, 2], dtype='int64')
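If raising an exception is too strict, the duplicates can also be inspected after the fact. The following is a minimal sketch using the standard `Index.is_unique` and `Index.duplicated()` members; the frames `left` and `right` are hypothetical stand-ins for `df4` and `df5`.

```python
import pandas as pd

left = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]}, index=[1, 2])
right = pd.DataFrame({"A": ["A5", "A6"], "B": ["B5", "B6"]}, index=[1, 2])

combined = pd.concat([left, right])

# is_unique is False when any index label occurs more than once
print(combined.index.is_unique)

# duplicated() flags the second and later occurrences of each label
repeated = combined.index[combined.index.duplicated()].tolist()
print(repeated)

# verify_integrity=True reports the overlap as a ValueError
try:
    pd.concat([left, right], verify_integrity=True)
    raised = False
except ValueError:
    raised = True
print(raised)
```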


### Ignoring the index

Sometimes the index itself does not matter, and you would prefer to simply ignore it. You can do this with the `ignore_index` flag.

With this flag set to `True`, the concatenation discards the original indices and assigns a new integer index to the resulting DataFrame.

In [None]:
pd.concat([df4,df5],ignore_index=True)

Unnamed: 0,A,B
0,A1,B1
1,A2,B2
2,A5,B5
3,A6,B6


### Adding MultiIndex keys

Another alternative is to use the `keys` option to specify a label for each data source. The result is a hierarchically indexed DataFrame.

In [None]:
pd.concat([df4,df5],keys=['x','y'])

Unnamed: 0,Unnamed: 1,A,B
x,1,A1,B1
x,2,A2,B2
y,1,A5,B5
y,2,A6,B6


We can use the tools discussed in hierarchical indexing to transform this data into another format.
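As an illustration of those tools, the sketch below builds the same kind of keyed result, selects one source with `.loc`, and pivots the source level into the columns with `unstack()`. The level names `source` and `row` are hypothetical and are supplied through the `names` argument of `pd.concat()`.

```python
import pandas as pd

df_x = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]}, index=[1, 2])
df_y = pd.DataFrame({"A": ["A5", "A6"], "B": ["B5", "B6"]}, index=[1, 2])

# keys= labels each input; names= labels the two resulting index levels
stacked = pd.concat([df_x, df_y], keys=["x", "y"], names=["source", "row"])

# Select every row that came from the first source
from_x = stacked.loc["x"]
print(from_x)

# Move the 'source' level into the columns for a side-by-side view
wide = stacked.unstack("source")
print(wide)
```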

### Concatenation with joins

The simple examples so far concatenated DataFrames with identical column names. In practice, data from different sources may have different sets of column names, and `pd.concat()` offers several options for this case.

Consider concatenation of Dataframes which have some columns in common.

In [None]:
df5=make_df('AB',[1,2])
df6=make_df('BCD',[3,4,5])
print(df5)
print(df6)
print(pd.concat([df5,df6]))

    A   B
1  A1  B1
2  A2  B2
    B   C   D
3  B3  C3  D3
4  B4  C4  D4
5  B5  C5  D5
     A   B    C    D
1   A1  B1  NaN  NaN
2   A2  B2  NaN  NaN
3  NaN  B3   C3   D3
4  NaN  B4   C4   D4
5  NaN  B5   C5   D5


Entries for which no data is available, because the column is missing from one of the inputs, are filled with NaN. To change this, we can use the `join` parameter of the concatenation function (older Pandas versions also offered `join_axes`).

By default, `pd.concat()` takes the union of the input columns (`join='outer'`); to keep only the intersection of the columns, specify `join='inner'`.

In [None]:
print(pd.concat([df5,df6],join='inner'))

    B
1  B1
2  B2
3  B3
4  B4
5  B5
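The difference between the two settings can also be checked programmatically. This is a small sketch with hypothetical frames shaped like `df5` and `df6`; it compares the resulting column sets and counts the cells that the outer join had to fill with NaN.

```python
import pandas as pd

df_ab = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]}, index=[1, 2])
df_bcd = pd.DataFrame(
    {"B": ["B3", "B4"], "C": ["C3", "C4"], "D": ["D3", "D4"]}, index=[3, 4]
)

outer = pd.concat([df_ab, df_bcd], join="outer")  # union of columns
inner = pd.concat([df_ab, df_bcd], join="inner")  # intersection of columns

print(list(outer.columns))  # ['A', 'B', 'C', 'D']
print(list(inner.columns))  # ['B']

# The outer join fills missing cells with NaN; the inner join has none
missing_outer = int(outer.isna().sum().sum())
missing_inner = int(inner.isna().sum().sum())
print(missing_outer, missing_inner)  # 6 0
```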


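Another option is to directly specify the columns to keep. In older Pandas versions this was done with the `join_axes` argument, which takes a list of index objects. `join_axes` was removed in pandas 1.0, so passing it now raises a `TypeError`; reindexing the concatenated result on the desired columns gives the same outcome.

In [None]:
# join_axes=[df5.columns] no longer exists; reindex to df5's columns instead
print(pd.concat([df5, df6]).reindex(columns=df5.columns))

     A   B
1   A1  B1
2   A2  B2
3  NaN  B3
4  NaN  B4
5  NaN  B5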

The combination of options of the `pd.concat()` function allows a wide range of possible behaviors when you are joining two datasets; keep these in mind as you use these tools for your own data.

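### The `append()` method

Because direct concatenation of Series and DataFrame objects is so common, they have an `append()` method that accomplishes the same thing in fewer keystrokes. Note that `append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so the call below works only on older versions.

Rather than calling `pd.concat([df1, df2])`, we can call `df1.append(df2)`.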

In [None]:
df1.append(df2)


Unnamed: 0,A,B
1,A1,B1
2,A2,B2
9,A9,B9
8,A8,B8
7,A7,B7
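On pandas 2.0 and later, where `append()` is no longer available, the same result can be obtained with `pd.concat()`. The sketch below uses hypothetical frames shaped like `df1` and `df2`, and also shows the common pattern of collecting the pieces in a list and concatenating once.

```python
import pandas as pd

first = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]}, index=[1, 2])
second = pd.DataFrame(
    {"A": ["A9", "A8", "A7"], "B": ["B9", "B8", "B7"]}, index=[9, 8, 7]
)

# Equivalent of first.append(second); neither input is modified
combined = pd.concat([first, second])
print(combined.index.tolist())  # [1, 2, 9, 8, 7]

# With many pieces, collect them in a list and concatenate a single time
pieces = [first, second, first]
all_rows = pd.concat(pieces, ignore_index=True)
print(len(all_rows))  # 7
```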


In the next section, we'll look at another, more powerful approach to combining data from multiple sources: the database-style merges and joins implemented in `pd.merge()`.