# Merging DataFrames

Pandas provides the `merge` function for situations where you need to combine dataframes on one or more columns rather than on the index.

`merge` extends concatenation by providing the ability to align rows using multiple columns.

In [1]:
import pandas as pd
import numpy as np
from glob import glob

small_pop = pd.read_csv('./data/pop_small.csv')
large_pop = pd.read_csv('./data/pop_large.csv')

print(small_pop.shape)
print(large_pop.shape)

(5, 2)
(15, 3)


In [2]:
print(small_pop.columns)
print(large_pop.columns)

Index(['Zipcode', ' 2010 Census Population'], dtype='object')
Index(['Zipcode', 'City', 'State'], dtype='object')


What we want is to merge these two dataframes into one, linking `City` to the population figures based on `Zipcode`. The `merge` function does this by aligning the two dataframes on the `Zipcode` column.

`merge` joins on all the column labels that the two dataframes have in common, in this case only the `Zipcode` column. By default `merge` performs an **inner join**: it takes the rows whose key values appear in both dataframes, keeps their columns from the first dataframe argument, and appends horizontally the corresponding columns from the second.

**Only rows whose key values appear in both dataframes are kept, together with their columns from each.**

In [3]:
pd.merge(small_pop, large_pop)

Unnamed: 0,Zipcode,2010 Census Population,City,State
0,16855,282,MINERAL SPRINGS,PA
1,15681,5241,SALTSBURG,PA
2,18657,11985,TUNKHANNOCK,PA
3,17307,5899,BIGLERVILLE,PA
4,15635,220,HANNASTOWN,PA


In [4]:
pd.merge(large_pop, small_pop)

Unnamed: 0,Zipcode,City,State,2010 Census Population
0,17307,BIGLERVILLE,PA,5899
1,16855,MINERAL SPRINGS,PA,282
2,15635,HANNASTOWN,PA,220
3,15681,SALTSBURG,PA,5241
4,18657,TUNKHANNOCK,PA,11985


In [5]:
bronze = pd.read_csv('./data/Summer Olympic medals/Bronze.csv')
silver = pd.read_csv('./data/Summer Olympic medals/Silver.csv')
gold = pd.read_csv('./data/Summer Olympic medals/Gold.csv')

bronze.head(2)

Unnamed: 0,NOC,Country,Total
0,USA,United States,1052.0
1,URS,Soviet Union,584.0


In [6]:
silver.head(2)

Unnamed: 0,NOC,Country,Total
0,USA,United States,1195.0
1,URS,Soviet Union,627.0


In [7]:
gold.head(2)

Unnamed: 0,NOC,Country,Total
0,USA,United States,2088.0
1,URS,Soviet Union,838.0


By default `merge` uses all the columns that the two dataframes share as join keys. Here the rows in the merged dataframe are those where the values of the `NOC`, `Country` and `Total` columns are identical in both `bronze` and `silver`. If there are no matches, `merge` yields an empty dataframe.

In [8]:
pd.merge(bronze, silver) # rows common to both datasets

Unnamed: 0,NOC,Country,Total
0,IRL,Ireland,8.0
1,MAS,Malaysia,3.0
2,MDA,Moldova,3.0
3,LIB,Lebanon,2.0
4,CMR,Cameroon,1.0
5,DOM,Dominican Republic,1.0
6,SYR,Syria,1.0
7,KSA,Saudi Arabia,1.0
8,TJK,Tajikistan,1.0
9,ZAM,Zambia,1.0
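
The default key selection described above can be checked on two small inline frames. This is only a sketch: the frame names and values are hypothetical stand-ins, not the Olympic medal files.

```python
import pandas as pd

# Hypothetical toy frames; names and values are illustrative only.
left = pd.DataFrame({'NOC': ['USA', 'IRL'], 'Total': [1052.0, 8.0]})
right = pd.DataFrame({'NOC': ['USA', 'IRL'], 'Total': [1195.0, 8.0]})

# With no key specified, every shared column label ('NOC' and 'Total')
# is a join key, so only the row that agrees in both columns survives.
default_keys = pd.merge(left, right)

# No overlapping key values at all gives an empty dataframe, not an error.
other = pd.DataFrame({'NOC': ['FRA'], 'Total': [475.0]})
no_match = pd.merge(left, other)

print(default_keys)
print(no_match.empty)
```

The USA row is dropped even though `NOC` matches, because its `Total` differs between the two frames.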


To merge on a particular column, use the `on=<column_name>` argument. With `on='NOC'`, matches are sought only in the `NOC` column. The remaining columns are appended to the right. Where the two dataframes share a column label that is not a key, the labels are amended to show which dataframe they came from: `_x` for the first, `_y` for the second.

In [9]:
pd.merge(bronze, silver, on='NOC').head()

Unnamed: 0,NOC,Country_x,Total_x,Country_y,Total_y
0,USA,United States,1052.0,United States,1195.0
1,URS,Soviet Union,584.0,Soviet Union,627.0
2,GBR,United Kingdom,505.0,United Kingdom,591.0
3,FRA,France,475.0,France,461.0
4,GER,Germany,454.0,Germany,350.0


The result is a dataframe with a duplicate `Country` column. We can: 

* merge on multiple columns, e.g. `NOC` and `Country` columns (prevents duplicates).

* use the `suffixes` argument to rename the column labels.

In [10]:
pd.merge(bronze, silver, on=['NOC', 'Country'], suffixes=['_bronze', '_silver']).head()

Unnamed: 0,NOC,Country,Total_bronze,Total_silver
0,USA,United States,1052.0,1195.0
1,URS,Soviet Union,584.0,627.0
2,GBR,United Kingdom,505.0,591.0
3,FRA,France,475.0,461.0
4,GER,Germany,454.0,350.0


### Merging when column labels do not match

Where the dataframes hold matching values under different column labels, e.g. 'City_names' in one table vs 'City' in a second, we have to declare which column in each dataframe to merge on. We do this with the `left_on` and `right_on` arguments, which refer to the first (left) and second (right) dataframes passed to `merge`. Both key columns are present in the resulting dataframe.

In [11]:
cars = pd.read_csv('./data/cars3.csv')
brics = pd.read_csv('./data/brics2.csv')

cars.head()

Unnamed: 0,cars_per_cap,countries,drives_right
0,809,United States,True
1,731,Australia,False
2,588,Japan,False
3,18,India,False
4,200,Russia,True


In [12]:
brics.head()

Unnamed: 0,country,capital,area,population
0,Brazil,Brasilia,8.516,200.4
1,Russia,Moscow,17.1,143.5
2,India,New Delhi,3.286,1252.0
3,China,Beijing,9.597,1357.0
4,South Africa,Pretoria,1.221,52.98


In [13]:
pd.merge(cars, brics, left_on='countries', right_on='country')

Unnamed: 0,cars_per_cap,countries,drives_right,country,capital,area,population
0,18,India,False,India,New Delhi,3.286,1252.0
1,200,Russia,True,Russia,Moscow,17.1,143.5
2,23,Brazil,False,Brazil,Brasilia,8.516,200.4
3,34,China,True,China,Beijing,9.597,1357.0
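
Because both key columns survive a `left_on`/`right_on` merge, one of them is redundant after an inner join. The sketch below drops it. It uses hypothetical miniature frames rather than the CSV files above, so the values are illustrative only.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables; values are illustrative.
cars = pd.DataFrame({'countries': ['India', 'Russia', 'Japan'],
                     'cars_per_cap': [18, 200, 588]})
brics = pd.DataFrame({'country': ['Russia', 'India', 'China'],
                      'capital': ['Moscow', 'New Delhi', 'Beijing']})

merged = pd.merge(cars, brics, left_on='countries', right_on='country')

# After an inner join the two key columns hold identical values,
# so one of them can be dropped without losing information.
tidy = merged.drop(columns='country')
print(tidy)
```

This shortcut is only safe for an inner join. With a left, right or outer join the two key columns can differ, since one of them may be `NaN`.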


In [14]:
revenue = pd.read_csv('./data/revenue.csv')
managers = pd.read_csv('./data/managers.csv')

revenue

Unnamed: 0,city,branch_id,state,revenue
0,Austin,10,TX,100
1,Denver,20,CO,83
2,Springfield,30,IL,4
3,Mendocino,47,CA,200


In [15]:
managers

Unnamed: 0,branch,branch_id,state,manager
0,Austin,10,TX,Charlers
1,Denver,20,CO,Joel
2,Mendocino,47,CA,Brett
3,Springfield,31,MO,Sally


In [16]:
pd.merge(revenue, managers, left_on='city', right_on='branch')

Unnamed: 0,city,branch_id_x,state_x,revenue,branch,branch_id_y,state_y,manager
0,Austin,10,TX,100,Austin,10,TX,Charlers
1,Denver,20,CO,83,Denver,20,CO,Joel
2,Springfield,30,IL,4,Springfield,31,MO,Sally
3,Mendocino,47,CA,200,Mendocino,47,CA,Brett


In [17]:
pd.merge(revenue, managers, on=['branch_id', 'state'])

Unnamed: 0,city,branch_id,state,revenue,branch,manager
0,Austin,10,TX,100,Austin,Charlers
1,Denver,20,CO,83,Denver,Joel
2,Mendocino,47,CA,200,Mendocino,Brett


By default, `merge` uses an **inner join** when joining dataframes, i.e. the `how` argument defaults to `how='inner'`.

* `how='left'` keeps **all rows** from the left dataframe in the merged dataframe.

    * For rows in the left df with **matches in the right**, non-joining columns of the right df are appended to the left df.
    
    * For rows in the left df with **no matches in the right**, non-joining columns are filled with nulls.
    
* `how='right'` is the opposite of `left`.

We can employ left and right merges to preserve data and identify where data is missing.
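
One way to make the missing matches explicit is the `indicator` argument of `merge`, which adds a `_merge` column recording where each row came from. The following is a minimal sketch on hypothetical toy frames; the column values are illustrative and not taken from the CSV files.

```python
import pandas as pd

# Hypothetical toy frames loosely modelled on the branch tables.
left = pd.DataFrame({'branch_id': [10, 20, 30], 'revenue': [100, 83, 4]})
right = pd.DataFrame({'branch_id': [10, 20, 31], 'manager': ['A', 'B', 'C']})

# indicator=True adds a '_merge' column whose values are
# 'both', 'left_only' or 'right_only'.
flagged = pd.merge(left, right, on='branch_id', how='left', indicator=True)

# Rows of the left frame that found no partner in the right frame.
unmatched = flagged[flagged['_merge'] == 'left_only']
print(unmatched)
```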

**Left Join on the revenue and managers df**  

We can see that the 'Springfield, MO' row from the right df is dropped, while the 'Springfield, IL' row from the left df has `NaN` values for both `branch` and `manager`, since no row in the right df matches it.

In [18]:
pd.merge(
    revenue, 
    managers, 
    on=['branch_id', 'state'], # matching column labels
    how='left'
)

Unnamed: 0,city,branch_id,state,revenue,branch,manager
0,Austin,10,TX,100,Austin,Charlers
1,Denver,20,CO,83,Denver,Joel
2,Springfield,30,IL,4,,
3,Mendocino,47,CA,200,Mendocino,Brett


With a right join, the 'Springfield, IL' row from the left df is dropped, and the 'Springfield, MO' row has `NaN` values for both `city` and `revenue`, as that particular record is not found in the revenue table.

In [19]:
pd.merge(
    revenue, 
    managers, 
    on=['branch_id', 'state'], # matching column labels
    how='right'
)

Unnamed: 0,city,branch_id,state,revenue,branch,manager
0,Austin,10,TX,100.0,Austin,Charlers
1,Denver,20,CO,83.0,Denver,Joel
2,Mendocino,47,CA,200.0,Mendocino,Brett
3,,31,MO,,Springfield,Sally


Using `how='outer'`, which performs an **outer join**, we can preserve all rows from both the left and right df.

In [20]:
pd.merge(
    revenue, 
    managers, 
    on=['branch_id', 'state'], # matching column labels
    how='outer'
)

Unnamed: 0,city,branch_id,state,revenue,branch,manager
0,Austin,10,TX,100.0,Austin,Charlers
1,Denver,20,CO,83.0,Denver,Joel
2,Springfield,30,IL,4.0,,
3,Mendocino,47,CA,200.0,Mendocino,Brett
4,,31,MO,,Springfield,Sally


Pandas also provides the `join` method, which joins on the **index**. It supports the same `how` argument (with `inner`, `outer`, `left` and `right` options), but its default is `how='left'` rather than `inner`.
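
As a brief illustration of `join`, the sketch below aligns two frames on their index. The frames and their values are hypothetical, chosen only to show a label present in one frame and missing from the other.

```python
import pandas as pd

# Hypothetical frames indexed by a shared key; values are illustrative.
population = pd.DataFrame({'population': [143.5, 1252.0]},
                          index=['Russia', 'India'])
capitals = pd.DataFrame({'capital': ['Moscow', 'Beijing']},
                        index=['Russia', 'China'])

# DataFrame.join aligns on the index and keeps every row of the caller
# when `how` is not given (a left join).
left_joined = population.join(capitals)

# how='inner' keeps only index labels present in both frames.
inner_joined = population.join(capitals, how='inner')

print(left_joined)
print(inner_joined)
```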

In [21]:
sales = pd.read_csv('./data/sales5.csv')
sales

Unnamed: 0,city,state,units
0,Mendocino,CA,1
1,Denver,CO,4
2,Austin,TX,2
3,Springfield,MO,5
4,Springfield,IL,1


We want to employ `left` and `right` merge to preserve data and identify where data is missing using the `sales`, `revenue` and `managers` dfs.

By merging `revenue` and `sales` with a right merge, we can identify the missing revenue values (we don't need to specify `left_on` or `right_on` because the columns to merge on have matching labels).

In [22]:
revenue_and_sales = pd.merge(revenue, sales, how='right', on=['city', 'state'])
revenue_and_sales

Unnamed: 0,city,branch_id,state,revenue,units
0,Austin,10.0,TX,100.0,2
1,Denver,20.0,CO,83.0,4
2,Springfield,30.0,IL,4.0,1
3,Mendocino,47.0,CA,200.0,1
4,Springfield,,MO,,5


By merging `sales` and `managers` with a left merge, we can identify the missing manager. Here one of the key columns is labelled differently in the two dfs (`city` vs `branch`), so we must specify `left_on` and `right_on`.

In [23]:
sales_and_managers = pd.merge(
    sales, 
    managers, 
    how='left', 
    left_on=['city', 'state'], 
    right_on=['branch', 'state']
)
sales_and_managers

Unnamed: 0,city,state,units,branch,branch_id,manager
0,Mendocino,CA,1,Mendocino,47.0,Brett
1,Denver,CO,4,Denver,20.0,Joel
2,Austin,TX,2,Austin,10.0,Charlers
3,Springfield,MO,5,Springfield,31.0,Sally
4,Springfield,IL,1,,,


The two merged dfs contain enough information to construct a df with 5 rows with all known information correctly aligned and each branch listed only once. We'll merge these two dfs on all matching keys (which computes an inner join by default). We can compare the result to an outer join and also to an outer join with restricted subset of columns as keys.

In [24]:
# default is an inner join (how='inner')
pd.merge(sales_and_managers, revenue_and_sales)

Unnamed: 0,city,state,units,branch,branch_id,manager,revenue
0,Mendocino,CA,1,Mendocino,47.0,Brett,200.0
1,Denver,CO,4,Denver,20.0,Joel,83.0
2,Austin,TX,2,Austin,10.0,Charlers,100.0


In [25]:
# merge using an outer join
pd.merge(sales_and_managers, revenue_and_sales, how='outer')

Unnamed: 0,city,state,units,branch,branch_id,manager,revenue
0,Mendocino,CA,1,Mendocino,47.0,Brett,200.0
1,Denver,CO,4,Denver,20.0,Joel,83.0
2,Austin,TX,2,Austin,10.0,Charlers,100.0
3,Springfield,MO,5,Springfield,31.0,Sally,
4,Springfield,IL,1,,,,
5,Springfield,IL,1,,30.0,,4.0
6,Springfield,MO,5,,,,


Merge `sales_and_managers` with `revenue_and_sales` only on `['city', 'state']` using an outer join.

In [26]:
pd.merge(sales_and_managers, revenue_and_sales, how='outer', on=['city', 'state'])

Unnamed: 0,city,state,units_x,branch,branch_id_x,manager,branch_id_y,revenue,units_y
0,Mendocino,CA,1,Mendocino,47.0,Brett,47.0,200.0,1
1,Denver,CO,4,Denver,20.0,Joel,20.0,83.0,4
2,Austin,TX,2,Austin,10.0,Charlers,10.0,100.0,2
3,Springfield,MO,5,Springfield,31.0,Sally,,,5
4,Springfield,IL,1,,,,30.0,4.0,1


### Using merge_ordered

In [34]:
# parse_dates=True only tries to parse the index; name the column to get datetimes
hardware = pd.read_csv('./data/Sales/feb-sales-Hardware.csv', parse_dates=['Date'])
software = pd.read_csv('./data/Sales/feb-sales-Software.csv', parse_dates=['Date'])

hardware

Unnamed: 0,Date,Company,Product,Units
0,2015-02-04 21:52:45,Acme Coporation,Hardware,14
1,2015-02-07 22:58:10,Acme Coporation,Hardware,1
2,2015-02-19 10:59:33,Mediacore,Hardware,16
3,2015-02-02 20:54:49,Mediacore,Hardware,9
4,2015-02-21 20:41:47,Hooli,Hardware,3


In [35]:
software

Unnamed: 0,Date,Company,Product,Units
0,2015-02-16 12:09:19,Hooli,Software,10
1,2015-02-03 14:14:18,Initech,Software,13
2,2015-02-02 08:33:01,Hooli,Software,3
3,2015-02-05 01:53:06,Acme Coporation,Software,19
4,2015-02-11 20:03:08,Initech,Software,7
5,2015-02-09 13:09:55,Mediacore,Software,7
6,2015-02-11 22:50:44,Hooli,Software,4
7,2015-02-04 15:36:29,Streeplex,Software,13
8,2015-02-21 05:01:26,Mediacore,Software,3


If we merge the `software` and `hardware` dfs, we get an empty df - inner join is the default merge technique and there are no overlapping rows.

In [36]:
print(pd.merge(hardware, software))

Empty DataFrame
Columns: [Date, Company, Product, Units]
Index: []


Carrying out an outer join results in a df with all the rows from both dfs. We can also sort by the values in the `Date` column.

In [42]:
pd.merge(hardware, software, how='outer').sort_values('Date').head()

Unnamed: 0,Date,Company,Product,Units
7,2015-02-02 08:33:01,Hooli,Software,3
3,2015-02-02 20:54:49,Mediacore,Hardware,9
6,2015-02-03 14:14:18,Initech,Software,13
12,2015-02-04 15:36:29,Streeplex,Software,13
0,2015-02-04 21:52:45,Acme Coporation,Hardware,14


The `merge_ordered` function does the same in one step: it carries out an outer join (by default) between two dfs and sorts the result by the key columns.

In [44]:
pd.merge_ordered(hardware, software).head()

Unnamed: 0,Date,Company,Product,Units
0,2015-02-02 08:33:01,Hooli,Software,3
1,2015-02-02 20:54:49,Mediacore,Hardware,9
2,2015-02-03 14:14:18,Initech,Software,13
3,2015-02-04 15:36:29,Streeplex,Software,13
4,2015-02-04 21:52:45,Acme Coporation,Hardware,14


`merge_ordered` also accepts keywords such as `on`, `suffixes` and `fill_method`.

In [45]:
pd.merge_ordered(
    hardware, 
    software,
    on=['Date', 'Company'],
    suffixes=['_hardware', '_software']
)

Unnamed: 0,Date,Company,Product_hardware,Units_hardware,Product_software,Units_software
0,2015-02-02 08:33:01,Hooli,,,Software,3.0
1,2015-02-02 20:54:49,Mediacore,Hardware,9.0,,
2,2015-02-03 14:14:18,Initech,,,Software,13.0
3,2015-02-04 15:36:29,Streeplex,,,Software,13.0
4,2015-02-04 21:52:45,Acme Coporation,Hardware,14.0,,
5,2015-02-05 01:53:06,Acme Coporation,,,Software,19.0
6,2015-02-07 22:58:10,Acme Coporation,Hardware,1.0,,
7,2015-02-09 13:09:55,Mediacore,,,Software,7.0
8,2015-02-11 20:03:08,Initech,,,Software,7.0
9,2015-02-11 22:50:44,Hooli,,,Software,4.0


In [46]:
gdp = pd.read_csv('./data/GDP/gdp-2013.csv', delimiter=' ', parse_dates=['Date'])
stocks = pd.read_csv('./data/GDP/stocks-2013.csv', delimiter=' ', parse_dates=['Date'])

gdp

Unnamed: 0,Date,GDP
0,2012-03-31,15973.9
1,2012-06-30,16121.9
2,2012-09-30,16227.9
3,2012-12-31,16297.3
4,2013-03-31,16475.4
5,2013-06-30,16541.4
6,2013-09-30,16749.3
7,2013-12-31,16999.9


In [47]:
stocks

Unnamed: 0,Date,AAPL,IBM,CSCO,MSFT
0,2013-01-31,497.822381,197.271905,20.699524,27.236667
1,2013-02-28,456.808953,200.735788,20.988947,27.704211
2,2013-03-31,441.840998,210.978001,21.335,28.141
3,2013-04-30,419.764998,204.733636,20.914545,29.870909
4,2013-05-31,446.45273,205.263639,22.386364,33.950909
5,2013-06-30,425.537999,200.85,24.3755,34.6325
6,2013-07-31,429.157272,194.354546,25.378636,33.650454
7,2013-08-31,484.843635,187.125,24.948636,32.485
8,2013-09-30,480.184499,188.767,24.08,32.5235
9,2013-10-31,504.744783,180.710002,22.847391,34.382174


When merging the `stocks` and `gdp` dfs on the `Date` column, we end up with a large number of `NaN` values since many of the dates do not overlap. We can use `fill_method='ffill'` to fill some of these in by carrying the last known value forward. `ffill` cannot fix entries at the start of a time series, because there is no earlier value to carry forward.

In [49]:
pd.merge_ordered(stocks, gdp, on=['Date'], fill_method='ffill')

Unnamed: 0,Date,AAPL,IBM,CSCO,MSFT,GDP
0,2012-03-31,,,,,15973.9
1,2012-06-30,,,,,16121.9
2,2012-09-30,,,,,16227.9
3,2012-12-31,,,,,16297.3
4,2013-01-31,497.822381,197.271905,20.699524,27.236667,16297.3
5,2013-02-28,456.808953,200.735788,20.988947,27.704211,16297.3
6,2013-03-31,441.840998,210.978001,21.335,28.141,16475.4
7,2013-04-30,419.764998,204.733636,20.914545,29.870909,16475.4
8,2013-05-31,446.45273,205.263639,22.386364,33.950909,16475.4
9,2013-06-30,425.537999,200.85,24.3755,34.6325,16541.4


In [59]:
austin = pd.read_csv('./data/austin_weather.csv', delimiter=' ', parse_dates=['date'])
houston = pd.read_csv('./data/houston_weather.csv', delimiter=' ', parse_dates=['date'])

austin

Unnamed: 0,date,ratings
0,2016-01-01,Cloudy
1,2016-02-08,Cloudy
2,2016-01-17,Sunny


In [60]:
houston

Unnamed: 0,date,ratings
0,2016-01-04,Rainy
1,2016-01-01,Cloudy
2,2016-03-01,Sunny


In [61]:
# default outer join, sorted by date
pd.merge_ordered(austin, houston)

Unnamed: 0,date,ratings
0,2016-01-01,Cloudy
1,2016-01-04,Rainy
2,2016-01-17,Sunny
3,2016-02-08,Cloudy
4,2016-03-01,Sunny


In [62]:
# outer join on 'date', adding suffixes so we can distinguish between the two 'ratings' columns
pd.merge_ordered(austin, houston, on=['date'], suffixes=['_aus', '_hus'])

Unnamed: 0,date,ratings_aus,ratings_hus
0,2016-01-01,Cloudy,Cloudy
1,2016-01-04,,Rainy
2,2016-01-17,Sunny,
3,2016-02-08,Cloudy,
4,2016-03-01,,Sunny


In [63]:
# use 'ffill' to replace the 'NaN' values
pd.merge_ordered(
    austin, 
    houston, 
    on=['date'], 
    suffixes=['_aus', '_hus'], 
    fill_method='ffill'
)

Unnamed: 0,date,ratings_aus,ratings_hus
0,2016-01-01,Cloudy,Cloudy
1,2016-01-04,Cloudy,Rainy
2,2016-01-17,Sunny,Rainy
3,2016-02-08,Cloudy,Rainy
4,2016-03-01,Cloudy,Sunny


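Similar to `pd.merge_ordered()`, the `pd.merge_asof()` function merges values in order using the `on` column, and both DataFrames must already be sorted by that column. It performs a left join on the nearest key rather than on an exact match: by default, each row in the left DataFrame is paired with the last row in the right DataFrame whose `on` value is less than or equal to the left value.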

This function can be used to align disparate datetime frequencies without having to first resample.
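
A minimal sketch of this alignment, using hypothetical monthly and quarterly frames (the `Price` column and all values are illustrative assumptions, not the files loaded earlier):

```python
import pandas as pd

# Hypothetical series at different frequencies; values are illustrative.
monthly = pd.DataFrame({
    'Date': pd.to_datetime(['2013-01-31', '2013-04-30', '2013-07-31']),
    'Price': [1.0, 2.0, 3.0],
})
quarterly = pd.DataFrame({
    'Date': pd.to_datetime(['2012-12-31', '2013-03-31', '2013-06-30']),
    'GDP': [16297.3, 16475.4, 16541.4],
})

# Both frames must already be sorted by the `on` column.
# The default direction='backward' picks, for each left row, the most
# recent right row whose Date is less than or equal to the left Date.
aligned = pd.merge_asof(monthly, quarterly, on='Date')
print(aligned)
```

Each monthly row picks up the GDP figure from the most recent quarter end, and the result has exactly one row per row of the left frame.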