# Merging Multiple Dataframes

## concat

In [25]:
import numpy as np
import pandas as pd

list1 = ['aaa' for i in range(6)]
list2 = ['bbb' for i in range(6)]
list3 = ['ccc' for i in range(6)]

In [4]:
list1

['aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa']

In [5]:
list2

['bbb', 'bbb', 'bbb', 'bbb', 'bbb', 'bbb']

In [6]:
list3

['ccc', 'ccc', 'ccc', 'ccc', 'ccc', 'ccc']

In [26]:
df1 = pd.DataFrame(np.array(list1).reshape(2,3))
df2 = pd.DataFrame(np.array(list2).reshape(2,3))
df3 = pd.DataFrame(np.array(list3).reshape(2,3))

In [8]:
df1

Unnamed: 0,0,1,2
0,aaa,aaa,aaa
1,aaa,aaa,aaa


In [9]:
df2

Unnamed: 0,0,1,2
0,bbb,bbb,bbb
1,bbb,bbb,bbb


In [10]:
df3

Unnamed: 0,0,1,2
0,ccc,ccc,ccc
1,ccc,ccc,ccc


The **concat** function generates a single dataframe from all the dataframes it is given.
By default, rows are stacked vertically and aligned by column name, so dataframes that share the same column names simply contribute more rows to the result.

In [27]:
pd.concat([df1,df2,df3])

Unnamed: 0,0,1,2
0,aaa,aaa,aaa
1,aaa,aaa,aaa
0,bbb,bbb,bbb
1,bbb,bbb,bbb
0,ccc,ccc,ccc
1,ccc,ccc,ccc


### ignore_index
Notice the repeated index values in the previous example. We can replace them with a fresh sequential index using the **ignore_index** parameter

In [30]:
pd.concat([df1,df2,df3], ignore_index=True)

Unnamed: 0,0,1,2
0,aaa,aaa,aaa
1,aaa,aaa,aaa
2,bbb,bbb,bbb
3,bbb,bbb,bbb
4,ccc,ccc,ccc
5,ccc,ccc,ccc


We can change the axis to place the dataframes side by side, aligning rows by their index value instead of aligning columns by name

In [28]:
pd.concat([df1,df2,df3], axis = 1)

Unnamed: 0,0,1,2,0,1,2,0,1,2
0,aaa,aaa,aaa,bbb,bbb,bbb,ccc,ccc,ccc
1,aaa,aaa,aaa,bbb,bbb,bbb,ccc,ccc,ccc
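
The example above uses dataframes with identical indexes, so the alignment is not visible. A minimal sketch of what happens when the indexes only partly overlap (the `left` and `right` frames are made up for illustration):

```python
import pandas as pd

# Hypothetical frames whose indexes only share the label 1
left = pd.DataFrame({'x': ['a0', 'a1']}, index=[0, 1])
right = pd.DataFrame({'y': ['b1', 'b2']}, index=[1, 2])

# axis=1 aligns rows by index label; a label missing from one side yields NaN
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
```

Only the row labelled 1 has values in both columns; rows 0 and 2 are padded with NaN.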


What will happen if the column names of the different dataframes are not the same?

In [31]:
df1.columns = ['a','b','c']
df1

Unnamed: 0,a,b,c
0,aaa,aaa,aaa
1,aaa,aaa,aaa


In [32]:
pd.concat([df1,df2,df3])

Unnamed: 0,0,1,2,a,b,c
0,,,,aaa,aaa,aaa
1,,,,aaa,aaa,aaa
0,bbb,bbb,bbb,,,
1,bbb,bbb,bbb,,,
0,ccc,ccc,ccc,,,
1,ccc,ccc,ccc,,,
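
As the output shows, the default behaviour is an outer join on the columns: every column name is kept and the gaps are filled with NaN. The **join** parameter of `pd.concat` can restrict the result to the columns all inputs share. A small sketch, with hypothetical frames `a` and `b`:

```python
import pandas as pd

# Hypothetical frames that share only the 'id' column
a = pd.DataFrame({'id': [1, 2], 'name': ['x', 'y']})
b = pd.DataFrame({'id': [3], 'price': [9.5]})

# Default join='outer': union of columns, missing cells become NaN
outer = pd.concat([a, b], ignore_index=True)

# join='inner': only the columns present in every dataframe survive
inner = pd.concat([a, b], join='inner', ignore_index=True)

print(outer)
print(inner)
```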


### keys
Let's reset df1 and this time use the **keys** parameter to mark each row with the dataframe it belongs to

In [34]:
df1 = pd.DataFrame(np.array(list1).reshape(2,3))
pd.concat([df1,df2,df3], keys=[1,2,3])

Unnamed: 0,Unnamed: 1,0,1,2
1,0,aaa,aaa,aaa
1,1,aaa,aaa,aaa
2,0,bbb,bbb,bbb
2,1,bbb,bbb,bbb
3,0,ccc,ccc,ccc
3,1,ccc,ccc,ccc


Keys can be strings as well

In [35]:
pd.concat([df1,df2,df3], keys=['df1','df2','df3'])

Unnamed: 0,Unnamed: 1,0,1,2
df1,0,aaa,aaa,aaa
df1,1,aaa,aaa,aaa
df2,0,bbb,bbb,bbb
df2,1,bbb,bbb,bbb
df3,0,ccc,ccc,ccc
df3,1,ccc,ccc,ccc
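
If the source label is more useful as an ordinary column than as an index level, one possible approach is to name the index levels with the **names** parameter and then move the outer level into a column. The level names `source` and `row` below are chosen for illustration only:

```python
import pandas as pd

# Small stand-in frames for illustration
df1 = pd.DataFrame({'v': ['aaa', 'aaa']})
df2 = pd.DataFrame({'v': ['bbb', 'bbb']})

# names= labels the two index levels created by keys=
tagged = pd.concat([df1, df2], keys=['df1', 'df2'], names=['source', 'row'])

# Move only the outer level into a regular column
flat = tagged.reset_index(level='source')
print(flat)
```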


We can use the **loc** indexer and pass it a tuple to select a specific row from the multi-index

In [38]:
concat_df = pd.concat([df1,df2,df3], keys=['df1','df2','df3'])
concat_df.loc[('df2',0)]

0    bbb
1    bbb
2    bbb
Name: (df2, 0), dtype: object

We can specify a column as well

In [40]:
concat_df.loc[('df2',0),2]

'bbb'
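
Passing only the outer key to `loc` returns every row that came from that dataframe, and `xs` can select by the inner level instead. A short sketch that rebuilds a smaller `concat_df` so it runs on its own:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.full((2, 3), 'aaa'))
df2 = pd.DataFrame(np.full((2, 3), 'bbb'))
concat_df = pd.concat([df1, df2], keys=['df1', 'df2'])

# Outer key only: all rows that came from df2 (the outer level is dropped)
block = concat_df.loc['df2']

# xs with level=1: the row labelled 0 from every source dataframe
first_rows = concat_df.xs(0, level=1)

print(block)
print(first_rows)
```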

### append()
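The **append** method works like **concat** but is called directly on a dataframe instead of on the pandas library. Note that **DataFrame.append** was deprecated in pandas 1.4 and removed in pandas 2.0, so the cells in this section only run on older versions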

In [45]:
df1.append(df2)

Unnamed: 0,0,1,2
0,aaa,aaa,aaa
1,aaa,aaa,aaa
0,bbb,bbb,bbb
1,bbb,bbb,bbb


In [46]:
df1.append([df2,df3], ignore_index=True)

Unnamed: 0,0,1,2
0,aaa,aaa,aaa
1,aaa,aaa,aaa
2,bbb,bbb,bbb
3,bbb,bbb,bbb
4,ccc,ccc,ccc
5,ccc,ccc,ccc
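
On pandas 2.0 and later, where `DataFrame.append` no longer exists, the same results can be obtained with `pd.concat`. A minimal sketch using small stand-in frames:

```python
import pandas as pd

df1 = pd.DataFrame({'v': [1, 2]})
df2 = pd.DataFrame({'v': [3, 4]})
df3 = pd.DataFrame({'v': [5, 6]})

# Equivalent of df1.append(df2)
pair = pd.concat([df1, df2])

# Equivalent of df1.append([df2, df3], ignore_index=True)
stacked = pd.concat([df1, df2, df3], ignore_index=True)

print(pair)
print(stacked)
```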


## merge()

In [47]:
sales = {'sale_id':[100,200,300,400,500], 'product_id':[1,2,1,2,4]}
products = {'product_id': [1,2,3], 'product_name': ['TV', 'Laptop', 'Keyboard']}

sales = pd.DataFrame(sales)
products = pd.DataFrame(products)

In [48]:
sales

Unnamed: 0,sale_id,product_id
0,100,1
1,200,2
2,300,1
3,400,2
4,500,4


In [49]:
products

Unnamed: 0,product_id,product_name
0,1,TV
1,2,Laptop
2,3,Keyboard


By default, **merge** performs an inner JOIN between two dataframes, matching on ALL the column names they have in common

In [50]:
pd.merge(sales, products)

Unnamed: 0,sale_id,product_id,product_name
0,100,1,TV
1,300,1,TV
2,200,2,Laptop
3,400,2,Laptop
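
The automatic matching only works when the key column has the same name on both sides. When the names differ, the **left_on** and **right_on** parameters can name each key separately. A sketch, assuming a hypothetical `catalog` frame whose key column is called `id`:

```python
import pandas as pd

sales = pd.DataFrame({'sale_id': [100, 200, 300], 'product_id': [1, 2, 1]})

# Hypothetical lookup table: the key is named 'id' rather than 'product_id'
catalog = pd.DataFrame({'id': [1, 2], 'product_name': ['TV', 'Laptop']})

# Match sales.product_id against catalog.id; both key columns are kept
merged = sales.merge(catalog, left_on='product_id', right_on='id')
print(merged)
```

Both key columns appear in the result, so one of them is usually dropped afterwards.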


We can specify the actual columns we want the join to be performed on, and also define the type of join we want

In [51]:
pd.merge(sales, products, on = 'product_id', how = 'outer')

Unnamed: 0,sale_id,product_id,product_name
0,100.0,1,TV
1,300.0,1,TV
2,200.0,2,Laptop
3,400.0,2,Laptop
4,500.0,4,
5,,3,Keyboard
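
In an outer join it is often useful to know which side each row came from. The **indicator** parameter adds a `_merge` column for that purpose. A minimal sketch that rebuilds the two frames so it runs on its own:

```python
import pandas as pd

sales = pd.DataFrame({'sale_id': [100, 200, 300, 400, 500],
                      'product_id': [1, 2, 1, 2, 4]})
products = pd.DataFrame({'product_id': [1, 2, 3],
                         'product_name': ['TV', 'Laptop', 'Keyboard']})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(sales, products, on='product_id', how='outer', indicator=True)

# Products that were never sold, and sales that reference an unknown product
unsold = merged.loc[merged['_merge'] == 'right_only', 'product_name'].tolist()
unknown = merged.loc[merged['_merge'] == 'left_only', 'sale_id'].tolist()

print(merged)
print(unsold, unknown)
```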


We don't need to see the product ID anymore...

In [53]:
pd.merge(sales, products, on = 'product_id', how = 'outer').drop('product_id', axis=1)

Unnamed: 0,sale_id,product_name
0,100.0,TV
1,300.0,TV
2,200.0,Laptop
3,400.0,Laptop
4,500.0,
5,,Keyboard


Display the number of sales of each product

In [52]:
pd.merge(sales, products, on = 'product_id', how = 'inner')['product_name'].value_counts()

Laptop    2
TV        2
Name: product_name, dtype: int64
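
The inner join leaves out products with no sales at all. One possible way to include them with a count of zero is a right join followed by a `groupby` count, which skips the NaN sale IDs. A sketch under that assumption:

```python
import pandas as pd

sales = pd.DataFrame({'sale_id': [100, 200, 300, 400, 500],
                      'product_id': [1, 2, 1, 2, 4]})
products = pd.DataFrame({'product_id': [1, 2, 3],
                         'product_name': ['TV', 'Laptop', 'Keyboard']})

# how='right' keeps every product, even those without a matching sale
per_product = pd.merge(sales, products, on='product_id', how='right')

# count() ignores NaN, so unsold products are reported as 0
counts = per_product.groupby('product_name')['sale_id'].count()
print(counts)
```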

The **merge** method can be called directly from the dataframe as well

In [66]:
sales.merge(products)

Unnamed: 0,sale_id,product_id,product_name
0,100,1,TV
1,300,1,TV
2,200,2,Laptop
3,400,2,Laptop


## join()
We use the **join** method to combine dataframes horizontally based on index values. Usually we can just pass the second dataframe, but the call raises an error when the two dataframes have columns that share the same name

In [56]:
# sales.join(products)  # ValueError: columns overlap but no suffix specified

To overcome this we can change the column names prior to the join operation or use the **lsuffix** / **rsuffix** parameters

In [58]:
sales.join(products, lsuffix='_s', rsuffix='')

Unnamed: 0,sale_id,product_id_s,product_id,product_name
0,100,1,1.0,TV
1,200,2,2.0,Laptop
2,300,1,3.0,Keyboard
3,400,2,,
4,500,4,,


Or simply omit the duplicate column from one of the dataframes

In [62]:
sales.join(products.drop('product_id', axis=1))

Unnamed: 0,sale_id,product_id,product_name
0,100,1,TV
1,200,2,Laptop
2,300,1,Keyboard
3,400,2,
4,500,4,


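We can use **join** with the **on** parameter as well. Note that **on** names a column of the calling dataframe, and its values are matched against the *index* of the other dataframe, not against the other dataframe's column of the same name. This is why product_id 1 is paired with the row at index 1 of products (Laptop) rather than with the TV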

In [63]:
sales.join(products, lsuffix='_s', rsuffix='_p', on = 'product_id')

Unnamed: 0,sale_id,product_id_s,product_id_p,product_name
0,100,1,2.0,Laptop
1,200,2,3.0,Keyboard
2,300,1,2.0,Laptop
3,400,2,3.0,Keyboard
4,500,4,,


In [65]:
sales.join(products, lsuffix='_s', rsuffix='_p', on = 'product_id', how = 'outer')

Unnamed: 0,product_id,sale_id,product_id_s,product_id_p,product_name
0,1,100.0,1.0,2.0,Laptop
2,1,300.0,1.0,2.0,Laptop
1,2,200.0,2.0,3.0,Keyboard
3,2,400.0,2.0,3.0,Keyboard
4,4,500.0,4.0,,
4,0,,,1.0,TV
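
Because **join** matches the **on** column against the other dataframe's index, the two outputs above pair each sale with the wrong product. A minimal sketch of one way to get the intended pairing is to make `product_id` the index of `products` before joining:

```python
import pandas as pd

sales = pd.DataFrame({'sale_id': [100, 200, 300, 400, 500],
                      'product_id': [1, 2, 1, 2, 4]})
products = pd.DataFrame({'product_id': [1, 2, 3],
                         'product_name': ['TV', 'Laptop', 'Keyboard']})

# With product_id as the index of products, on='product_id' lines up
# sales.product_id with the actual product IDs; no suffixes are needed
joined = sales.join(products.set_index('product_id'), on='product_id')
print(joined)
```

This is a left join by default, so the sale with the unknown product ID 4 is kept with a missing product name.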
