# Combining Dataframes

There are two ways to combine dataframes: concatenating and merging. Concatenating simply stacks the data, either vertically (adding rows) or horizontally (adding columns).

In [2]:
import pandas as pd
import numpy as np
from pandas import DataFrame
import os
os.chdir('C:\\Python Code\\Data Manipulation with Pandas\\Combining Dataframe')

## Concatenate 

The pandas library provides a function named 'concat' to combine dataframes.

To demonstrate this functionality, we will use the sales information for the months of September and October.

In [3]:
data1=pd.read_csv('Sales_Sep.csv',sep=',',header=0, encoding="latin")
print(data1.head())
print(data1.shape)

  Cust_ID  Sales_Amount
0     CS1          1000
1     CS2           900
2     CS3           875
3     CS4           230
4     CS5           987
(10, 2)


In [4]:
data2=pd.read_csv('Sales_Oct.csv',sep=',',header=0, encoding="latin")
print(data2.head())
print(data2.shape)

  Cust_ID  Sales_Amount
0     CS1          1000
1     CS2           890
2     CS3           900
3     CS4           450
4     CS8           980
(12, 2)


In [5]:
# By default, 'concat' stacks the two dataframes vertically (axis=0).
# It simply appends the rows of data2 (October sales) to the rows
# of data1 (September sales).

row_concat = pd.concat([data1, data2])

# Print the shape of row_concat. We can observe that the number 
# of rows is the sum of the two dataframes concatenated. 
print(row_concat.shape)

# Print the head of row_concat
print(row_concat.head())

(22, 2)
  Cust_ID  Sales_Amount
0     CS1          1000
1     CS2           900
2     CS3           875
3     CS4           230
4     CS5           987
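
Because 'concat' keeps each dataframe's original index, the combined result repeats the labels 0, 1, 2, and so on. The following is a minimal sketch of two common variations, using small made-up frames (the names `sep` and `oct_` and their values are illustrative assumptions, not the contents of the CSV files): `ignore_index=True` builds a fresh index, and `axis=1` stacks the frames side by side instead of on top of each other.

```python
import pandas as pd

# Hypothetical miniature versions of the two monthly sales files
sep = pd.DataFrame({'Cust_ID': ['CS1', 'CS2'], 'Sales_Amount': [1000, 900]})
oct_ = pd.DataFrame({'Cust_ID': ['CS1', 'CS8'], 'Sales_Amount': [1000, 980]})

# Vertical stacking (axis=0, the default); ignore_index=True renumbers the rows
stacked = pd.concat([sep, oct_], ignore_index=True)
print(stacked.shape)        # (4, 2)
print(list(stacked.index))  # [0, 1, 2, 3]

# Horizontal stacking (axis=1) aligns rows on the index and
# places the columns of both frames side by side
side_by_side = pd.concat([sep, oct_], axis=1)
print(side_by_side.shape)   # (2, 4)
```

Note that horizontal stacking matches rows by index label, not by 'Cust_ID', so it is rarely the right way to line up records for the same customer.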


## Merging Dataframes

Very often, different dataframes hold different kinds of information, so their column names differ. It is also common practice to source data from multiple places, which means that the relevant data may be spread across different files. In these cases, stacking the dataframes with 'concat' does not line up the related records. Instead, we use the 'merge' option.

Merge is performed using a 'common key' across the dataframes.  

Depending on the business needs, we can merge the two dataframes using different methods, for example outer merge, inner merge, left join or right join. The concept is similar to how joins work in SQL.


In [6]:
# To demonstrate, let us create two dissimilar dataframes. Both dataframes
# have a field named 'CustomerID' that holds the customer ids.

df1=DataFrame({'CustomerID':[1,2,3,4,5,6],
               'Product':['Toaster','Toaster','Toaster','Radio','Radio','Radio']})
df2=DataFrame({'CustomerID':[2,4,6],
               'State':['Alabama','Alabama','ohio']})

In [7]:
print(df1)

   CustomerID  Product
0           1  Toaster
1           2  Toaster
2           3  Toaster
3           4    Radio
4           5    Radio
5           6    Radio


In [8]:
print(df2)

   CustomerID    State
0           2  Alabama
1           4  Alabama
2           6     ohio


In [9]:
# If we use the 'outer' option, all the rows from both dataframes
# are available in the output, matched up wherever the key is common.
# The missing values for the columns are set to NaN.
# The 'on' argument is used to provide the common key.

pd.merge(df1,df2,how='outer',on='CustomerID')

   CustomerID  Product    State
0           1  Toaster      NaN
1           2  Toaster  Alabama
2           3  Toaster      NaN
3           4    Radio  Alabama
4           5    Radio      NaN
5           6    Radio     ohio


In [10]:
# If the 'inner' option is used, then only those
# customer ids that are present in both dataframes
# are available in the result.

pd.merge(df1,df2,how='inner',on='CustomerID')

   CustomerID  Product    State
0           2  Toaster  Alabama
1           4    Radio  Alabama
2           6    Radio     ohio


In [11]:
# If the 'left' option is chosen, then all the rows
# of the first dataframe are available in the result.
# For those customer ids that are not available in the
# second dataframe, the corresponding columns are set to
# NaN.

pd.merge(df1,df2,how='left',on='CustomerID')

   CustomerID  Product    State
0           1  Toaster      NaN
1           2  Toaster  Alabama
2           3  Toaster      NaN
3           4    Radio  Alabama
4           5    Radio      NaN
5           6    Radio     ohio


In [12]:
# If the 'right' option is chosen, then all the rows
# of the second dataframe are available in the result.
# For those customer ids that are not available in the
# first dataframe, the corresponding columns are set to
# NaN.

pd.merge(df1,df2,how='right',on='CustomerID')

   CustomerID  Product    State
0           2  Toaster  Alabama
1           4    Radio  Alabama
2           6    Radio     ohio
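
In the dataframes used so far, every customer id in the second dataframe also appears in the first, so 'outer' and 'left' return the same rows, and so do 'inner' and 'right'. The following is a minimal sketch with hypothetical frames (`left` and `right` are illustrative names and values, not part of the original data) in which each side has one key the other lacks, so the four options give different results. It also uses the `indicator=True` argument of `pd.merge`, which adds a '_merge' column naming the source of each row.

```python
import pandas as pd

# Hypothetical frames: key 1 exists only on the left, key 4 only on the right
left = pd.DataFrame({'CustomerID': [1, 2, 3],
                     'Product': ['Toaster', 'Radio', 'Radio']})
right = pd.DataFrame({'CustomerID': [2, 3, 4],
                      'State': ['Alabama', 'Ohio', 'Texas']})

# Count the rows produced by each merge method
row_counts = {}
for how in ['outer', 'inner', 'left', 'right']:
    merged = pd.merge(left, right, how=how, on='CustomerID')
    row_counts[how] = len(merged)
print(row_counts)  # {'outer': 4, 'inner': 2, 'left': 3, 'right': 3}

# indicator=True labels each row as 'left_only', 'right_only' or 'both'
flagged = pd.merge(left, right, how='outer', on='CustomerID', indicator=True)
print(flagged[['CustomerID', '_merge']])
```

The '_merge' column is a quick way to check which keys failed to match before deciding which merge method fits the business need.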


### Merging Dataframes with Dissimilar Key Names

In [13]:
# Very often, the column name for the key is not the same in the two
# dataframes. E.g. in the dataframes below, the name of the key
# differs in case: "CustomerId" and "CustomerID".

df1=DataFrame({'CustomerId':[1,2,3,4,5,6],'Product':['Television','Television','Television','Earphones','Earphones','Earphones']})
df2=DataFrame({'CustomerID':[2,4,6],'State':['Texas','Texas','Seattle']})

In [14]:
df1

   CustomerId     Product
0           1  Television
1           2  Television
2           3  Television
3           4   Earphones
4           5   Earphones
5           6   Earphones


In [15]:
df2

   CustomerID    State
0           2    Texas
1           4    Texas
2           6  Seattle


In [16]:
# In such cases, the names of the key in the left and right dataframes
# need to be specified explicitly using 'left_on' and 'right_on'.

pd.merge(df1,df2,how='inner',left_on='CustomerId',right_on='CustomerID')

   CustomerId     Product  CustomerID    State
0           2  Television           2    Texas
1           4   Earphones           4    Texas
2           6   Earphones           6  Seattle


In [17]:
# In the previous output, we can see that there are two columns corresponding
# to the keys as a result of the merge operation. Since they hold the same
# values, one of them can be removed with 'drop'.

pd.merge(df1,df2,how='inner',left_on='CustomerId',right_on='CustomerID').drop('CustomerID',axis=1)

   CustomerId     Product    State
0           2  Television    Texas
1           4   Earphones    Texas
2           6   Earphones  Seattle
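
An alternative to merging on two differently named keys is to rename one key before the merge, so that a single 'on' column can be used and no redundant key column appears. The following is a minimal sketch of that approach, assuming both columns hold the same kind of customer id; it rebuilds the two dataframes so that it can run on its own.

```python
import pandas as pd

df1 = pd.DataFrame({'CustomerId': [1, 2, 3, 4, 5, 6],
                    'Product': ['Television'] * 3 + ['Earphones'] * 3})
df2 = pd.DataFrame({'CustomerID': [2, 4, 6],
                    'State': ['Texas', 'Texas', 'Seattle']})

# Rename the key of the second dataframe to match the first, then merge on it.
# rename() returns a new dataframe, so df2 itself is left unchanged.
merged = pd.merge(df1, df2.rename(columns={'CustomerID': 'CustomerId'}),
                  how='inner', on='CustomerId')
print(list(merged.columns))  # ['CustomerId', 'Product', 'State']
print(len(merged))           # 3
```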
