### Python Pandas Tricks: 3 Best Methods To Join Datasets

The data required for a data-analysis task usually comes from multiple sources, so it is important to learn how to bring this data together.

In this article, I have listed the three best and most time-saving ways to combine multiple datasets using Python pandas methods.

__merge()__: To combine the datasets on a common column, an index, or both.

__concat()__: To combine the datasets across rows or columns.

__join()__: To combine the datasets on a key column or an index.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df1 = pd.read_excel("Dummy_course_data.xlsx", sheet_name="Fees")
df2 = pd.read_excel("Dummy_course_data.xlsx", sheet_name="Discounts")

In [5]:
df1

Unnamed: 0,Course,Country,Fee_USD
0,Maths,India,15500
1,Physics,Germany,16700
2,Applied Maths,Germany,11100
3,General Science,United Kingdom,18000
4,Social Science,Austria,18400
5,History,Poland,23000
6,Politics,India,21600
7,Computer Graphics,United States,27000


In [6]:
df2

Unnamed: 0,Course,Country,Discount_USD
0,Maths,India,1000
1,Physics,Germany,2300
2,German language,Germany,1500
3,Information Technology,United Kingdom,1200
4,Social Science,Austria,1500
5,History,Poland,3200
6,Marketing,India,2000
7,Computer Graphics,United States,2500
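
The Excel file itself is not included here, so as a minimal sketch the two tables can be rebuilt by hand from the values printed above. The names `df1` and `df2` match the article; building them with `pd.DataFrame` instead of `pd.read_excel` is an assumption made only so the examples can be run without the file.

```python
import pandas as pd

# Fees table, copied from the df1 output shown above.
df1 = pd.DataFrame({
    "Course": ["Maths", "Physics", "Applied Maths", "General Science",
               "Social Science", "History", "Politics", "Computer Graphics"],
    "Country": ["India", "Germany", "Germany", "United Kingdom",
                "Austria", "Poland", "India", "United States"],
    "Fee_USD": [15500, 16700, 11100, 18000, 18400, 23000, 21600, 27000],
})

# Discounts table, copied from the df2 output shown above.
df2 = pd.DataFrame({
    "Course": ["Maths", "Physics", "German language", "Information Technology",
               "Social Science", "History", "Marketing", "Computer Graphics"],
    "Country": ["India", "Germany", "Germany", "United Kingdom",
                "Austria", "Poland", "India", "United States"],
    "Discount_USD": [1000, 2300, 1500, 1200, 1500, 3200, 2000, 2500],
})

print(df1.shape, df2.shape)
```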


### Merge

__pandas merge()__ combines two datasets database-style, i.e. DataFrames are joined on common columns or indices 🚦

If the datasets are joined column on column, the DataFrame indexes are ignored.

To use __merge()__, you need to provide at least the two arguments below.

1. Left DataFrame
2. Right DataFrame

In [7]:
# For example, combine the two datasets above without specifying which columns to join on.

pd.merge(df1,df2)

Unnamed: 0,Course,Country,Fee_USD,Discount_USD
0,Maths,India,15500,1000
1,Physics,Germany,16700,2300
2,Social Science,Austria,18400,1500
3,History,Poland,23000,3200
4,Computer Graphics,United States,27000,2500


##### pd.merge() automatically detects the common columns between the two datasets and combines them on these columns.

##### In this case pd.merge() used the default settings and returned a final dataset which contains only the rows common to both datasets.
##### Therefore, this results in an inner join.

##### This is the simplest way to combine two datasets.
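
To make the default behavior visible, the sketch below spells out the same merge with explicit `on` and `how` arguments. The small `fees` and `discounts` frames are trimmed, hand-built copies of the first four rows of each table, introduced here only for illustration.

```python
import pandas as pd

# Trimmed, hand-built copies of the Fees and Discounts tables (illustrative only).
fees = pd.DataFrame({
    "Course": ["Maths", "Physics", "Applied Maths", "General Science"],
    "Country": ["India", "Germany", "Germany", "United Kingdom"],
    "Fee_USD": [15500, 16700, 11100, 18000],
})
discounts = pd.DataFrame({
    "Course": ["Maths", "Physics", "German language", "Information Technology"],
    "Country": ["India", "Germany", "Germany", "United Kingdom"],
    "Discount_USD": [1000, 2300, 1500, 1200],
})

# Default call: pandas joins on every column name the two frames share.
implicit = pd.merge(fees, discounts)

# The same merge with the join keys and join type written out.
explicit = pd.merge(fees, discounts, on=["Course", "Country"], how="inner")

print(implicit.equals(explicit))
print(explicit)
```

Writing the keys out is usually safer in real projects, because a new shared column name would otherwise silently become part of the join key.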

### However, merge() is the most flexible of the three, with a number of options for defining the behavior of the merge.

Although the list of options looks quite daunting, with practice you will master merging a variety of datasets.

At the moment, the important option to remember is how, which defines what kind of merge to make.
Its default value is ‘inner’; the other possible values are ‘outer’, ‘left’ and ‘right’.

 __how__ = __"outer"__

In [9]:
pd.merge(df1, df2, how="outer")

Unnamed: 0,Course,Country,Fee_USD,Discount_USD
0,Maths,India,15500.0,1000.0
1,Physics,Germany,16700.0,2300.0
2,Applied Maths,Germany,11100.0,
3,General Science,United Kingdom,18000.0,
4,Social Science,Austria,18400.0,1500.0
5,History,Poland,23000.0,3200.0
6,Politics,India,21600.0,
7,Computer Graphics,United States,27000.0,2500.0
8,German language,Germany,,1500.0
9,Information Technology,United Kingdom,,1200.0
10,Marketing,India,,2000.0


Some cells are filled with NaN because the corresponding rows have no matching record in the other dataset. For example, the courses German language, Information Technology and Marketing have no Fee_USD value in df1, so after merging, the Fee_USD column is filled with NaN for these courses.

As the second dataset df2 has 3 rows whose Course and Country values do not appear in df1, the final output after the merge contains 11 rows, i.e. 8 rows from df1 + 3 additional rows from df2.
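
One way to check this row count rather than work it out by hand is the `indicator=True` option of `pd.merge()`, which adds a `_merge` column recording where each row came from. The sketch below assumes `df1` and `df2` are rebuilt by hand from the tables shown earlier instead of being read from the Excel file.

```python
import pandas as pd

# Hand-built copies of the Fees and Discounts tables (assumption: same values as the Excel sheets).
df1 = pd.DataFrame({
    "Course": ["Maths", "Physics", "Applied Maths", "General Science",
               "Social Science", "History", "Politics", "Computer Graphics"],
    "Country": ["India", "Germany", "Germany", "United Kingdom",
                "Austria", "Poland", "India", "United States"],
    "Fee_USD": [15500, 16700, 11100, 18000, 18400, 23000, 21600, 27000],
})
df2 = pd.DataFrame({
    "Course": ["Maths", "Physics", "German language", "Information Technology",
               "Social Science", "History", "Marketing", "Computer Graphics"],
    "Country": ["India", "Germany", "Germany", "United Kingdom",
                "Austria", "Poland", "India", "United States"],
    "Discount_USD": [1000, 2300, 1500, 1200, 1500, 3200, 2000, 2500],
})

# indicator=True adds a "_merge" column with values "both", "left_only" or "right_only".
merged = pd.merge(df1, df2, how="outer", indicator=True)
counts = merged["_merge"].value_counts()

print(counts)
print("total rows:", len(merged))
```

The total is the number of matched rows plus the rows found only in the left frame plus the rows found only in the right frame.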

✔️ Data science job interviews often include the question of how many rows the output will contain after combining datasets with an outer join. The point above is the best way to answer it.

__how__=__"left"__

In [10]:
pd.merge(df1,df2, how="left")

Unnamed: 0,Course,Country,Fee_USD,Discount_USD
0,Maths,India,15500,1000.0
1,Physics,Germany,16700,2300.0
2,Applied Maths,Germany,11100,
3,General Science,United Kingdom,18000,
4,Social Science,Austria,18400,1500.0
5,History,Poland,23000,3200.0
6,Politics,India,21600,
7,Computer Graphics,United States,27000,2500.0


By definition, a left join returns all the rows from the left DataFrame and only the matching rows from the right DataFrame.

That is exactly what happened here: for the rows that have no value in the Discount_USD column, NaN is substituted.

Now let’s see the exact opposite using a right join.

__how__=__"right"__

In [11]:
pd.merge(df1, df2, how="right")

Unnamed: 0,Course,Country,Fee_USD,Discount_USD
0,Maths,India,15500.0,1000
1,Physics,Germany,16700.0,2300
2,German language,Germany,,1500
3,Information Technology,United Kingdom,,1200
4,Social Science,Austria,18400.0,1500
5,History,Poland,23000.0,3200
6,Marketing,India,,2000
7,Computer Graphics,United States,27000.0,2500


The right join returned all rows from the right DataFrame, i.e. df2, and only the matching rows from the left DataFrame, i.e. df1.

You can get the same rows by using how = ‘left’ as well. All you need to do is change the order of the DataFrames passed to pd.merge() from df1, df2 to df2, df1 .
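
As a quick check of this equivalence, the sketch below runs both variants on trimmed, hand-built copies of the two tables (the names `fees` and `discounts` are illustrative) and compares them after aligning the column order.

```python
import pandas as pd

# Trimmed, hand-built copies of the Fees and Discounts tables (illustrative only).
fees = pd.DataFrame({
    "Course": ["Maths", "Physics", "Applied Maths", "General Science"],
    "Country": ["India", "Germany", "Germany", "United Kingdom"],
    "Fee_USD": [15500, 16700, 11100, 18000],
})
discounts = pd.DataFrame({
    "Course": ["Maths", "Physics", "German language", "Information Technology"],
    "Country": ["India", "Germany", "Germany", "United Kingdom"],
    "Discount_USD": [1000, 2300, 1500, 1200],
})

right_join = pd.merge(fees, discounts, how="right")
left_swapped = pd.merge(discounts, fees, how="left")

# The rows are the same; only the column order differs.
print(list(right_join.columns))
print(list(left_swapped.columns))

# Reorder the columns of the swapped result before comparing.
same_rows = right_join.equals(left_swapped[right_join.columns])
print(same_rows)
```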

✔️ The order of the columns in the final output changes based on the order in which you pass the DataFrames to pd.merge()

__left_on__ and __right_on__

If you want to combine two datasets on columns with different names, i.e. the columns hold similar values but are named differently in the two datasets, then you must use these options.

You can pass the column name from the left dataset in left_on and the column name from the right dataset in right_on .
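
A minimal sketch of these two options follows. Since both tables in this article use the same column names, the right-hand frame is renamed first; the names `Subject` and `Nation` are hypothetical and exist only to create the mismatch.

```python
import pandas as pd

# Trimmed, hand-built copy of the Fees table (illustrative only).
fees = pd.DataFrame({
    "Course": ["Maths", "Physics", "Applied Maths", "General Science"],
    "Country": ["India", "Germany", "Germany", "United Kingdom"],
    "Fee_USD": [15500, 16700, 11100, 18000],
})

# Discounts table with hypothetical, differently named key columns.
discounts_renamed = pd.DataFrame({
    "Subject": ["Maths", "Physics", "German language", "Information Technology"],
    "Nation": ["India", "Germany", "Germany", "United Kingdom"],
    "Discount_USD": [1000, 2300, 1500, 1200],
})

# Pair the key columns up by position: Course with Subject, Country with Nation.
merged = pd.merge(
    fees,
    discounts_renamed,
    left_on=["Course", "Country"],
    right_on=["Subject", "Nation"],
)

# Both sets of key columns are kept, so the duplicates can be dropped afterwards.
tidy = merged.drop(columns=["Subject", "Nation"])
print(tidy)
```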

Admond Lee has very well explained all the pandas merge() use-cases in his article Why And How To Use Merge With Pandas in Python.
https://towardsdatascience.com/why-and-how-to-use-merge-with-pandas-in-python-548600f7e738

### Join

Unlike merge(), which is a function in the pandas module, join() is an instance method that operates on a DataFrame. This lets us pass only the one DataFrame to be combined with the current DataFrame.

In fact, pandas.DataFrame.join() and pandas.DataFrame.merge() are considered convenient ways of accessing the functionality of pd.merge(). Therefore join() is less flexible than merge() itself and offers fewer options.

In join(), only ‘other’ is a required parameter, and it can take a single DataFrame or a list of DataFrames. This gives you the flexibility to combine multiple datasets in a single statement. 💪🏻
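
The sketch below shows the list form of ‘other’. It assumes every frame is indexed by Course and that the frames share no column names, because suffixes are not supported when a list is passed. The `durations` table and its values are hypothetical and are not part of the article's data.

```python
import pandas as pd

# Small frames indexed by Course; values for fees and discounts come from the tables above.
fees = pd.DataFrame(
    {"Fee_USD": [15500, 16700, 11100]},
    index=pd.Index(["Maths", "Physics", "Applied Maths"], name="Course"),
)
discounts = pd.DataFrame(
    {"Discount_USD": [1000, 2300]},
    index=pd.Index(["Maths", "Physics"], name="Course"),
)
# Hypothetical third table, invented purely for this illustration.
durations = pd.DataFrame(
    {"Duration_months": [12, 18, 24]},
    index=pd.Index(["Maths", "Applied Maths", "History"], name="Course"),
)

# One statement joins both extra frames onto fees (left join on the index by default).
combined = fees.join([discounts, durations])
print(combined)
```

Rows of `fees` without a match in one of the other frames get NaN in that frame's columns, and rows that exist only in the other frames (History here) are dropped.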

For example, let’s combine df1 and df2 using join(). As both datasets have the columns Course and Country, we should use the lsuffix and rsuffix options as well.

In [13]:
df3 = df1.join(df2,
          lsuffix="_df1",
          rsuffix="_df2"
)
df3

Unnamed: 0,Course_df1,Country_df1,Fee_USD,Course_df2,Country_df2,Discount_USD
0,Maths,India,15500,Maths,India,1000
1,Physics,Germany,16700,Physics,Germany,2300
2,Applied Maths,Germany,11100,German language,Germany,1500
3,General Science,United Kingdom,18000,Information Technology,United Kingdom,1200
4,Social Science,Austria,18400,Social Science,Austria,1500
5,History,Poland,23000,History,Poland,3200
6,Politics,India,21600,Marketing,India,2000
7,Computer Graphics,United States,27000,Computer Graphics,United States,2500


By definition, join() combines two DataFrames on their index (by default), and that’s why the output contains all the rows and columns from both DataFrames.

✔️ If you want to join both DataFrames using the common column Course, you need to set Course as the index in both df1 and df2. It can be done as shown below.

In [14]:
df11=df1.copy()
df11.set_index("Course", inplace=True)

df22=df2.copy()
df22.set_index("Course", inplace=True)

# The code above makes the Course column the index in both datasets.

In [16]:
df4 = df11.join(df22,
          lsuffix="_df1",
          rsuffix="_df2",
          on="Course")
df4

Unnamed: 0_level_0,Country_df1,Fee_USD,Country_df2,Discount_USD
Course,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Maths,India,15500,India,1000.0
Physics,Germany,16700,Germany,2300.0
Applied Maths,Germany,11100,,
General Science,United Kingdom,18000,,
Social Science,Austria,18400,Austria,1500.0
History,Poland,23000,Poland,3200.0
Politics,India,21600,,
Computer Graphics,United States,27000,United States,2500.0


In [17]:
# The resultant DataFrame has Course as its index, as shown above.

## 🚩 Note: pandas.DataFrame.join() performs a ‘left’ join by default, whereas pandas.DataFrame.merge() and pandas.merge() perform an ‘inner’ join by default.
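
The difference in defaults is easy to see on a small example. The sketch below uses hand-built frames indexed by Course (the names `fees` and `discounts` are illustrative) and calls both methods without a `how` argument.

```python
import pandas as pd

# Small frames indexed by Course, with values taken from the tables above.
fees = pd.DataFrame(
    {"Fee_USD": [15500, 16700, 11100]},
    index=pd.Index(["Maths", "Physics", "Applied Maths"], name="Course"),
)
discounts = pd.DataFrame(
    {"Discount_USD": [1000, 2300]},
    index=pd.Index(["Maths", "Physics"], name="Course"),
)

# join() keeps every row of the caller (left join).
joined = fees.join(discounts)

# merge() on the same index keeps only rows present in both frames (inner join).
merged = fees.merge(discounts, left_index=True, right_index=True)

print(len(joined), len(merged))
```

Passing `how` explicitly to either method avoids relying on these defaults.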

### Concat

Only ‘objs’ is a required parameter, to which you pass the list of DataFrames to combine. With axis = 0 (the default), the DataFrames are combined along the rows, i.e. they are stacked one on top of the other, as shown below.

In [18]:
pd.concat([df1,df2], axis=0)

Unnamed: 0,Course,Country,Fee_USD,Discount_USD
0,Maths,India,15500.0,
1,Physics,Germany,16700.0,
2,Applied Maths,Germany,11100.0,
3,General Science,United Kingdom,18000.0,
4,Social Science,Austria,18400.0,
5,History,Poland,23000.0,
6,Politics,India,21600.0,
7,Computer Graphics,United States,27000.0,
0,Maths,India,,1000.0
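1,Physics,Germany,,2300.0
2,German language,Germany,,1500.0
3,Information Technology,United Kingdom,,1200.0
4,Social Science,Austria,,1500.0
5,History,Poland,,3200.0
6,Marketing,India,,2000.0
7,Computer Graphics,United States,,2500.0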


Unlike pandas.merge(), which combines DataFrames based on values in common columns, pandas.concat() simply stacks them vertically. Columns that are missing from one of the DataFrames get filled with NaN.
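
Note that the stacked result keeps the original index labels of each DataFrame, so the labels repeat. If a fresh 0..n-1 index is preferred, `pd.concat()` accepts `ignore_index=True`. A minimal sketch on trimmed, hand-built copies of the two tables (the names `fees` and `discounts` are illustrative):

```python
import pandas as pd

# Trimmed, hand-built copies of the Fees and Discounts tables (illustrative only).
fees = pd.DataFrame({
    "Course": ["Maths", "Physics", "Applied Maths"],
    "Country": ["India", "Germany", "Germany"],
    "Fee_USD": [15500, 16700, 11100],
})
discounts = pd.DataFrame({
    "Course": ["Maths", "Physics", "German language"],
    "Country": ["India", "Germany", "Germany"],
    "Discount_USD": [1000, 2300, 1500],
})

# Default: each frame keeps its own index labels, so 0, 1, 2 appear twice.
stacked = pd.concat([fees, discounts], axis=0)

# ignore_index=True discards the old labels and numbers the rows afresh.
renumbered = pd.concat([fees, discounts], axis=0, ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```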

In [19]:
# Both datasets can also be stacked side by side by setting axis = 1, as shown below.
pd.concat([df1, df2], axis=1)

Unnamed: 0,Course,Country,Fee_USD,Course.1,Country.1,Discount_USD
0,Maths,India,15500,Maths,India,1000
1,Physics,Germany,16700,Physics,Germany,2300
2,Applied Maths,Germany,11100,German language,Germany,1500
3,General Science,United Kingdom,18000,Information Technology,United Kingdom,1200
4,Social Science,Austria,18400,Social Science,Austria,1500
5,History,Poland,23000,History,Poland,3200
6,Politics,India,21600,Marketing,India,2000
7,Computer Graphics,United States,27000,Computer Graphics,United States,2500


Simple, right?? But wait,

How would I know, which data comes from which DataFrame ❓

That’s when hierarchical indexing comes into the picture, and pandas.concat() offers the best solution for it through the keys option. You can use it as below,

In [22]:
pd.concat([df1, df2], axis = 1,
             keys=["df1_data", "df2_data"])

Unnamed: 0_level_0,df1_data,df1_data,df1_data,df2_data,df2_data,df2_data
Unnamed: 0_level_1,Course,Country,Fee_USD,Course,Country,Discount_USD
0,Maths,India,15500,Maths,India,1000
1,Physics,Germany,16700,Physics,Germany,2300
2,Applied Maths,Germany,11100,German language,Germany,1500
3,General Science,United Kingdom,18000,Information Technology,United Kingdom,1200
4,Social Science,Austria,18400,Social Science,Austria,1500
5,History,Poland,23000,History,Poland,3200
6,Politics,India,21600,Marketing,India,2000
7,Computer Graphics,United States,27000,Computer Graphics,United States,2500


Such labeling of data actually makes it easy to extract the data corresponding to a particular DataFrame. ✔️
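
As a sketch of that extraction, selecting a key from the top level of the column index returns the columns that came from the corresponding DataFrame, and a (key, column) tuple returns a single column. The frames below are trimmed, hand-built copies of the two tables, and the names `fees` and `discounts` are illustrative.

```python
import pandas as pd

# Trimmed, hand-built copies of the Fees and Discounts tables (illustrative only).
fees = pd.DataFrame({
    "Course": ["Maths", "Physics", "Applied Maths"],
    "Country": ["India", "Germany", "Germany"],
    "Fee_USD": [15500, 16700, 11100],
})
discounts = pd.DataFrame({
    "Course": ["Maths", "Physics", "German language"],
    "Country": ["India", "Germany", "Germany"],
    "Discount_USD": [1000, 2300, 1500],
})

combined = pd.concat([fees, discounts], axis=1, keys=["df1_data", "df2_data"])

# Top-level key: all columns that came from the first DataFrame.
fees_part = combined["df1_data"]

# (key, column) tuple: one specific column from the second DataFrame.
discount_col = combined[("df2_data", "Discount_USD")]

print(fees_part)
print(discount_col.tolist())
```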

🚩 Note: The order of the labels in keys must match the order in which the DataFrames are listed in the first argument to pandas.concat()

That’s all!! 😎