# Joins

Joins combine two dataframes based on common keys. If you have never used joins before, it may be worthwhile to consult a more in-depth resource, as this one only gives a quick review of each type.

The options for joins are:

## Left Join

The left join keeps every index value of the left dataframe and merges in any matching records from the right dataframe. For left index values that have no match in the right index, the columns coming from the right dataframe are set to null.

## Right Join

The right join keeps every index value of the right dataframe and merges in any matching records from the left dataframe. For right index values that have no match in the left index, the columns coming from the left dataframe are set to null.

## Outer 

The outer join takes the union of the two indices and merges the dataframes on it. If an index value appears in only one of the dataframes, the columns coming from the other dataframe are set to null for that row.

## Inner

The inner join takes the intersection of the two indices for matching and merging. The join itself introduces no missing values, because only index values present in both dataframes are kept.

The default behavior is to join on the index of the two dataframes. A join is called like so: df1.join(df2), where df1 is the left and df2 is the right dataframe. The default is a left join, but the how argument changes the type of join.

Step 1 is creating our dummy data. Each dataframe has two columns that do not appear in the other, one index value that matches the other dataframe (A), and one index value that is unique to it (C for df1, B for df2).

In [1]:
import pandas as pd

#Create the data
df1 = pd.DataFrame([[95, 100],
                    [77, 34]],
                   index=['A', 'C'], columns=['Col 1', 'Col 2'])
df2 = pd.DataFrame([[10, 20],
                    [20, 30]],
                   index=['A', 'B'], columns=['Col 3', 'Col 4'])
print(df1)
print()
print(df2)

   Col 1  Col 2
A     95    100
C     77     34

   Col 3  Col 4
A     10     20
B     20     30


## Left Join

The left join keeps every index value of the left dataframe and merges in any matching records from the right dataframe. For left index values that have no match in the right index, the columns coming from the right dataframe are set to null.

Notice how the join below has null values in the new columns for C, but values for A.

In [2]:
#Left join df1 and df2
df3 = df1.join(df2)
print(df3)

   Col 1  Col 2  Col 3  Col 4
A     95    100   10.0   20.0
C     77     34    NaN    NaN


A left join is the default, so passing how='left' explicitly gives the same result.

In [3]:
#Left join df1 and df2
df3 = df1.join(df2, how='left')
print(df3)

   Col 1  Col 2  Col 3  Col 4
A     95    100   10.0   20.0
C     77     34    NaN    NaN
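
One detail in the output above is that Col 3 and Col 4 print as 10.0 and 20.0 rather than 10 and 20. A minimal sketch of why, reusing the same dummy data: the unmatched row needs NaN, and NaN cannot be stored in a plain integer column, so pandas upcasts those columns to float. The variable names joined and unmatched are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([[95, 100], [77, 34]],
                   index=['A', 'C'], columns=['Col 1', 'Col 2'])
df2 = pd.DataFrame([[10, 20], [20, 30]],
                   index=['A', 'B'], columns=['Col 3', 'Col 4'])

joined = df1.join(df2, how='left')

# Columns from df1 keep their integer dtype; columns from df2 are
# upcast to float because one of their rows has to hold NaN.
print(joined.dtypes)

# Index values from the left dataframe that found no match on the right
unmatched = joined[joined['Col 3'].isna()]
print(unmatched.index.tolist())
```

Filtering on isna() like this is a quick way to see which left rows had no partner in the right dataframe.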


## Right Join

The right join keeps every index value of the right dataframe and merges in any matching records from the left dataframe. For right index values that have no match in the left index, the columns coming from the left dataframe are set to null.

In this example the resulting index is A and B, and there are no values for B in Col 1 and Col 2.

In [4]:
#Right join df1 and df2
df3 = df1.join(df2, how='right')
print(df3)

   Col 1  Col 2  Col 3  Col 4
A   95.0  100.0     10     20
B    NaN    NaN     20     30


You might also notice that a right join is the same as putting df2 on the left and doing a left join. The only difference is the order of the columns, which is a minor detail: you can easily rearrange the columns after the join.

In [5]:
#Left join df2 and df1 (same rows as the right join above)
df3 = df2.join(df1, how='left')
print(df3)

   Col 3  Col 4  Col 1  Col 2
A     10     20   95.0  100.0
B     20     30    NaN    NaN
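
As a small sketch of the column rearrangement mentioned above, the swapped left join can be indexed with the column order of the right join, after which the two results should compare equal. The names right_join, swapped, and reordered are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame([[95, 100], [77, 34]],
                   index=['A', 'C'], columns=['Col 1', 'Col 2'])
df2 = pd.DataFrame([[10, 20], [20, 30]],
                   index=['A', 'B'], columns=['Col 3', 'Col 4'])

right_join = df1.join(df2, how='right')
swapped = df2.join(df1, how='left')

# Select the columns of the swapped join in the right join's order
reordered = swapped[right_join.columns]

# equals() compares values, dtypes, index, and columns
print(reordered.equals(right_join))
```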


## Outer 

The outer join takes the union of the two indices and merges the dataframes on it. If an index value appears in only one of the dataframes, the columns coming from the other dataframe are set to null for that row.

In [6]:
#Outer join df1 and df2
df3 = df1.join(df2, how='outer')
print(df3)

   Col 1  Col 2  Col 3  Col 4
A   95.0  100.0   10.0   20.0
B    NaN    NaN   20.0   30.0
C   77.0   34.0    NaN    NaN


## Inner

The inner join takes the intersection of the two indices for matching and merging. The join itself introduces no missing values, because only index values present in both dataframes are kept.

In [7]:
#Inner join df1 and df2
df3 = df1.join(df2, how='inner')
print(df3)

   Col 1  Col 2  Col 3  Col 4
A     95    100     10     20
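
To put the four join types side by side, the short sketch below loops over the how values and collects the index each join produces for the same dummy data. The dictionary name result_index is just an illustrative helper.

```python
import pandas as pd

df1 = pd.DataFrame([[95, 100], [77, 34]],
                   index=['A', 'C'], columns=['Col 1', 'Col 2'])
df2 = pd.DataFrame([[10, 20], [20, 30]],
                   index=['A', 'B'], columns=['Col 3', 'Col 4'])

# Record which index values survive each type of join
result_index = {}
for how in ['left', 'right', 'outer', 'inner']:
    result_index[how] = df1.join(df2, how=how).index.tolist()
    print(how, result_index[how])
```

The left and right joins keep one side's index, the outer join keeps the union, and the inner join keeps only the shared value A.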


### Joining with Duplicates

When an index value is duplicated, the join repeats rows so that every match is represented. The example below illustrates this behavior. We have one dataframe holding the costs of products and one holding the sales of products. There are two sales records for product 1 because it was sold on two different days. Notice what happens when we left join cost and sales.

In [8]:
#Create the data
cost = pd.DataFrame([["Product 1", 100],
                    ["Product 2", 50],
                    ["Product 3", 200]], columns=['Product', 'Cost'])
cost = cost.set_index('Product')
print(cost)

sales = pd.DataFrame([["Product 1", 1, 10],
                     ["Product 1", 2, 10],
                     ["Product 2", 1, 5],
                     ["Product 3", 2, 5]], columns=['Product', 'Day', 'Volume'])
sales = sales.set_index('Product')
print(sales)

           Cost
Product        
Product 1   100
Product 2    50
Product 3   200
           Day  Volume
Product               
Product 1    1      10
Product 1    2      10
Product 2    1       5
Product 3    2       5


In [9]:
#Left join the data
df3 = cost.join(sales)
print(df3)

           Cost  Day  Volume
Product                     
Product 1   100    1      10
Product 1   100    2      10
Product 2    50    1       5
Product 3   200    2       5


In this case, the result has more rows than the cost dataframe: the product 1 row was repeated because sales has two entries for product 1. If you do a right join, the cost row is repeated in the same way:

In [10]:
#Right join the data
df3 = cost.join(sales, how='right')
print(df3)

           Cost  Day  Volume
Product                     
Product 1   100    1      10
Product 1   100    2      10
Product 2    50    1       5
Product 3   200    2       5


The same thing happens when both dataframes have duplicates. If we add two more product 1 rows with arbitrary prices to the first dataframe, so that it has three in total, we get 3 x 2 = 6 rows for product 1, because every product 1 row on the left is matched with every product 1 row on the right.

In [11]:
#Create the data
cost = pd.DataFrame([["Product 1", 100],
                     ["Product 1", 50],
                     ["Product 1", 75],
                     ["Product 2", 50],
                     ["Product 3", 200]], columns=['Product', 'Cost'])
cost = cost.set_index('Product')
print(cost)

sales = pd.DataFrame([["Product 1", 1, 10],
                     ["Product 1", 2, 10],
                     ["Product 2", 1, 5],
                     ["Product 3", 2, 5]], columns=['Product', 'Day', 'Volume'])
sales = sales.set_index('Product')
print(sales)

           Cost
Product        
Product 1   100
Product 1    50
Product 1    75
Product 2    50
Product 3   200
           Day  Volume
Product               
Product 1    1      10
Product 1    2      10
Product 2    1       5
Product 3    2       5


In [12]:
#Left join the data
df3 = cost.join(sales)
print(df3)

           Cost  Day  Volume
Product                     
Product 1   100    1      10
Product 1   100    2      10
Product 1    50    1      10
Product 1    50    2      10
Product 1    75    1      10
Product 1    75    2      10
Product 2    50    1       5
Product 3   200    2       5


This is why we often want to require that the joining index is unique. Keep in mind when doing your joins that you can end up with extra rows if it is not.
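
One way to guard against this, sketched below under the assumption that cost is meant to have one row per product, is to check index.is_unique before joining, or to let pandas enforce the expectation through the validate argument of pd.merge. The rejected flag is only an illustrative helper, and pd.merge is used here in place of join.

```python
import pandas as pd

cost = pd.DataFrame(
    {'Cost': [100, 50, 75, 50, 200]},
    index=pd.Index(['Product 1', 'Product 1', 'Product 1',
                    'Product 2', 'Product 3'], name='Product'))
sales = pd.DataFrame(
    {'Day': [1, 2, 1, 2], 'Volume': [10, 10, 5, 5]},
    index=pd.Index(['Product 1', 'Product 1',
                    'Product 2', 'Product 3'], name='Product'))

# Quick manual check: is every index value on the left unique?
print(cost.index.is_unique)

# Which index values are repeated?
duplicated_keys = cost.index[cost.index.duplicated()].unique().tolist()
print(duplicated_keys)

# validate='one_to_many' asks pandas to raise if the left keys repeat
rejected = False
try:
    pd.merge(cost, sales, left_index=True, right_index=True,
             how='left', validate='one_to_many')
except pd.errors.MergeError as exc:
    rejected = True
    print('Join rejected:', exc)
```

Failing early like this is usually easier to debug than discovering unexpected extra rows later in an analysis.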

### Multi-Index Joins

Imagine a hypothetical scenario where we have 2 products, measured over 2 days. We have prices and sales once again, but what if the price had changed between the two days? In this case, maybe the business owner charges more on a given day. How can we join these 4 records together? The answer lies in the multi-index. First, create the data below.

In [13]:
#Create the data
cost = pd.DataFrame([["Product 1", 100, 1],
                     ["Product 1", 110, 2],
                     ["Product 2", 55, 1],
                     ["Product 2", 60, 2]], columns=['Product', 'Cost', 'Day'])
print(cost)

sales = pd.DataFrame([["Product 1", 1, 10],
                     ["Product 1", 2, 10],
                     ["Product 2", 1, 15],
                     ["Product 2", 2, 20]], columns=['Product', 'Day', 'Volume'])
print(sales)

     Product  Cost  Day
0  Product 1   100    1
1  Product 1   110    2
2  Product 2    55    1
3  Product 2    60    2
     Product  Day  Volume
0  Product 1    1      10
1  Product 1    2      10
2  Product 2    1      15
3  Product 2    2      20


To set a multi-index, we can pass a list of column names to set_index. Let's give both dataframes a multi-index of Product and Day.

In [14]:
#Set the multi-index
cost = cost.set_index(['Product', 'Day'])
print(cost)
print()

sales = sales.set_index(['Product', 'Day'])
print(sales)

               Cost
Product   Day      
Product 1 1     100
          2     110
Product 2 1      55
          2      60

               Volume
Product   Day        
Product 1 1        10
          2        10
Product 2 1        15
          2        20


Now if we do a simple left join, no additional rows are created: the (Product, Day) index is unique in both dataframes, so each of the 4 rows has exactly one match.

In [15]:
#Join the data
df3 = cost.join(sales)
print(df3)

               Cost  Volume
Product   Day              
Product 1 1     100      10
          2     110      10
Product 2 1      55      15
          2      60      20
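
For comparison, here is a hedged sketch of getting the same pairing without setting a multi-index first: pd.merge can match on several columns at once through its on argument. This is an alternative to the join shown above, and the name merged is illustrative.

```python
import pandas as pd

cost = pd.DataFrame([["Product 1", 100, 1],
                     ["Product 1", 110, 2],
                     ["Product 2", 55, 1],
                     ["Product 2", 60, 2]], columns=['Product', 'Cost', 'Day'])
sales = pd.DataFrame([["Product 1", 1, 10],
                      ["Product 1", 2, 10],
                      ["Product 2", 1, 15],
                      ["Product 2", 2, 20]], columns=['Product', 'Day', 'Volume'])

# Match rows on the combination of Product and Day
merged = pd.merge(cost, sales, on=['Product', 'Day'], how='left')
print(merged)
```

The key columns stay as ordinary columns in the result rather than becoming the index, which can be convenient if you do not need the multi-index afterwards.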


### Column Collisions

The last thing to cover with regard to joins is column name collisions. Often you will join two dataframes whose columns hold different data but share the same name. For our example, imagine two dataframes that each have a Sales column, where one holds sales from a physical store and the other holds sales from an online store. We can use a suffix to handle the matching column names. Let's start with an example.

In [16]:
#Create the data
instore_sales = pd.DataFrame([[100],[200], [300]], index=[1, 2, 3], columns=['Sales'])
print(instore_sales)
print()

online_sales = pd.DataFrame([[150],[100], [200]], index=[1, 2, 3], columns=['Sales'])
print(online_sales)

   Sales
1    100
2    200
3    300

   Sales
1    150
2    100
3    200


Notice what happens when you try to join these.

In [17]:
#Join the sales data
sales = instore_sales.join(online_sales)

ValueError: columns overlap but no suffix specified: Index(['Sales'], dtype='object')

The way to work around this is to use lsuffix and rsuffix, which specify the suffix to append to overlapping column names from the left and right dataframes.

In [18]:
#Join after using a suffix
sales = instore_sales.join(online_sales, lsuffix=' Instore', rsuffix=' Online')
print(sales)

   Sales Instore  Sales Online
1            100           150
2            200           100
3            300           200
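
Another option, sketched here as an alternative rather than the recommended approach, is to rename the colliding columns before joining so that no suffix is needed. The new column names Instore Sales and Online Sales are arbitrary choices for illustration.

```python
import pandas as pd

instore_sales = pd.DataFrame([[100], [200], [300]],
                             index=[1, 2, 3], columns=['Sales'])
online_sales = pd.DataFrame([[150], [100], [200]],
                            index=[1, 2, 3], columns=['Sales'])

# Give each Sales column a distinct name up front
instore = instore_sales.rename(columns={'Sales': 'Instore Sales'})
online = online_sales.rename(columns={'Sales': 'Online Sales'})

# With distinct names the plain join no longer raises a ValueError
sales = instore.join(online)
print(sales)
```

Renaming gives full control over the final column names, while lsuffix and rsuffix are quicker when many columns overlap.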
