# Merge DataFrames

When working with multiple datasets, you'll often need to join them to analyze the data as a whole. This template uses pandas' [`merge()`](https://pandas.pydata.org/docs/reference/api/pandas.merge.html) function to join two DataFrames on one or more columns. These skills are covered in DataCamp's [Joining Data with pandas](https://app.datacamp.com/learn/courses/joining-data-with-pandas) course if you want to learn more!

In [1]:
import pandas as pd

df1 = pd.read_csv("data/df1.csv")
df1.head()

Unnamed: 0,uid,reg_date,device,gender,country,age
0,54030035.0,2017-06-29,and,M,USA,19
1,72574201.0,2018-03-05,iOS,F,TUR,22
2,64187558.0,2016-02-07,iOS,M,USA,16
3,92513925.0,2017-05-25,and,M,BRA,41
4,99231338.0,2017-03-26,iOS,M,FRA,59


In [2]:
df2 = pd.read_csv("data/df2.csv")
df2.head()

Unnamed: 0,date,uid,sku,price
0,2017-07-10,41195147,sku_three_499,499
1,2017-07-15,41195147,sku_three_499,499
2,2017-11-12,41195147,sku_four_599,599
3,2017-09-26,91591874,sku_two_299,299
4,2017-12-01,91591874,sku_four_599,599


## Choose a merging strategy 

There are five types of joins available in pandas: inner, outer, cross, left, and right. The join type determines which rows appear in the result, so pick it based on your desired outcome. If you need a refresher on which to pick, check out chapter two of DataCamp's [Joining Data with pandas](https://app.datacamp.com/learn/courses/joining-data-with-pandas) course.
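
As a rough illustration of how the join type changes the result, the sketch below merges two tiny DataFrames with each `how` value and records the number of rows returned. The `users` and `purchases` DataFrames and their values are hypothetical toy data, not the CSV files loaded in this template.

```python
import pandas as pd

# Hypothetical toy data (not the template's CSV files)
users = pd.DataFrame({"uid": [1, 2, 3], "country": ["USA", "TUR", "BRA"]})
purchases = pd.DataFrame({"uid": [2, 3, 3, 4], "price": [499, 299, 599, 199]})

# Number of rows produced by each key-based join type
row_counts = {
    how: len(pd.merge(users, purchases, how=how, on="uid"))
    for how in ["inner", "left", "right", "outer"]
}

# A cross join pairs every left row with every right row, so it takes no key
row_counts["cross"] = len(pd.merge(users, purchases, how="cross"))

print(row_counts)
```

In this toy case, the inner join keeps only `uid` values found in both DataFrames, the left and right joins keep every row from one side, the outer join keeps everything, and the cross join returns every pairing of rows.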

### Joining on column(s) with the same name

The code below joins two DataFrames on columns sharing the same name in both DataFrames. 

In [3]:
join1 = pd.merge(
    left=df1,  # Specify DataFrame on the left to merge
    right=df2,  # Specify DataFrame on the right to merge
    how="left",  # Choose 'inner', 'outer', 'left', 'right', or 'cross' (omit `on` for 'cross')
    on=["uid"],  # List of column(s) to merge on (these must exist in both DataFrames)
    indicator=True,  # When True, adds a "_merge" column with the source of each row
)

print("Number of rows & columns:", join1.shape)
join1.head()

Number of rows & columns: (17684, 10)


Unnamed: 0,uid,reg_date,device,gender,country,age,date,sku,price,_merge
0,54030035.0,2017-06-29,and,M,USA,19,,,,left_only
1,72574201.0,2018-03-05,iOS,F,TUR,22,,,,left_only
2,64187558.0,2016-02-07,iOS,M,USA,16,,,,left_only
3,92513925.0,2017-05-25,and,M,BRA,41,2017-10-20,sku_three_499,499.0,both
4,92513925.0,2017-05-25,and,M,BRA,41,2017-05-29,sku_two_299,299.0,both


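### Joining on column(s) with different names

The code below joins two DataFrames on columns whose names differ between the two DataFrames. Here, `reg_date` in the left DataFrame is matched with `date` in the right DataFrame, alongside `uid`, which has the same name in both.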

In [4]:
join2 = pd.merge(
    left=df1,  # Specify DataFrame on the left to merge
    right=df2,  # Specify DataFrame on the right to merge
    how="inner",  # Choose 'inner', 'outer', 'left', 'right', or 'cross' (omit the keys for 'cross')
    left_on=["reg_date", "uid"],  # List of column(s) to merge on in the left DataFrame
    right_on=["date", "uid"],  # List of column(s) to merge on in the right DataFrame
    indicator=True,  # When True, adds a "_merge" column with the source of each row
)

print("Number of rows & columns:", join2.shape)
join2.head()

Number of rows & columns: (35, 10)


Unnamed: 0,uid,reg_date,device,gender,country,age,date,sku,price,_merge
0,61015529.0,2017-05-13,iOS,M,TUR,22,2017-05-13,sku_four_599,599,both
1,87948327.0,2016-06-11,iOS,F,TUR,15,2016-06-11,sku_one_199,199,both
2,69359991.0,2016-07-05,and,F,DEU,28,2016-07-05,sku_three_499,499,both
3,20320232.0,2018-03-16,and,F,USA,27,2018-03-16,sku_three_499,499,both
4,16501352.0,2017-04-21,and,F,TUR,15,2017-04-21,sku_three_499,499,both


For more information on arguments and output options, visit pandas' [`merge()`](https://pandas.pydata.org/docs/reference/api/pandas.merge.html) documentation.
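
When the key columns have different names, as with `reg_date` and `date` above, both columns are kept in the merged output. The sketch below reproduces this on hypothetical toy data (the `left` and `right` DataFrames and their values are made up for illustration) and shows one possible way to drop the redundant copy.

```python
import pandas as pd

# Hypothetical toy data that mirrors the column names used in this template
left = pd.DataFrame({"uid": [1, 2], "reg_date": ["2017-05-13", "2016-06-11"]})
right = pd.DataFrame(
    {"uid": [1, 2], "date": ["2017-05-13", "2016-07-01"], "price": [599, 199]}
)

merged = pd.merge(
    left,
    right,
    how="inner",
    left_on=["reg_date", "uid"],
    right_on=["date", "uid"],
)

# "uid" has the same name on both sides, so it appears once;
# "reg_date" and "date" are both kept and hold identical values after an inner join
print(merged.columns.tolist())

# Drop the right-hand key column if only one copy is needed
deduplicated = merged.drop(columns="date")
print(deduplicated)
```

Whether to drop the duplicate key is a judgment call: with an outer join, the two columns can differ (one side may be missing), so keeping both may be more informative there.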

If you're struggling to understand the different join types, it may help to experiment with different values for the `how` and `on` arguments of the `merge()` function. The `_merge` column added by `indicator=True` also helps you interpret the result: it shows whether each row comes from the left DataFrame only (`left_only`), the right DataFrame only (`right_only`), or both (`both`).
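
One way to put the `_merge` column to work is to count rows by origin and to filter on a single category. The sketch below does this on hypothetical toy data; the `users` and `purchases` DataFrames are assumptions for illustration, not the template's files.

```python
import pandas as pd

# Hypothetical toy data (not the template's CSV files)
users = pd.DataFrame({"uid": [1, 2, 3]})
purchases = pd.DataFrame({"uid": [2, 3, 3, 4], "price": [499, 299, 599, 199]})

# An outer join keeps unmatched rows from both sides
merged = pd.merge(users, purchases, how="outer", on="uid", indicator=True)

# Count how many rows came from the left only, the right only, or both
origin_counts = merged["_merge"].value_counts()
print(origin_counts)

# Keep only the users that have no matching purchase
no_purchase = merged.loc[merged["_merge"] == "left_only", "uid"].tolist()
print(no_purchase)
```

Filtering on `left_only` like this is a common way to find rows in one DataFrame that have no match in the other.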