<img src="https://i.imgur.com/6U6q5jQ.png"/>

# Merging Data Frames in Python

The Merging stage is an operation at the data frame level (not a cell operation). 

Merging combines **TWO** data frames, provided each one has a common column (the **key**) whose values refer to the same entities and are written exactly the same way. By default, rows whose key has no match in the other data frame are left out of the output. If you have messy data, you need to clean at least those **key** columns for the match to work.
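As a quick illustration of the exact-match requirement, here is a minimal sketch with two toy data frames. The values and the column names `score` and `pct` are made up for this example; they are not the data used in the rest of this notebook.

```python
import pandas as pd

# Two toy data frames (made-up values) sharing the key column "Country"
left = pd.DataFrame({"Country": ["PERU", "CHILE", "Bolivia"], "score": [1, 2, 3]})
right = pd.DataFrame({"Country": ["PERU", "CHILE", "BOLIVIA"], "pct": [1.5, 2.5, 3.5]})

# The default merge is an inner join: only keys written identically
# on both sides make it into the result
merged = left.merge(right, on="Country")
print(merged)
```

"Bolivia" and "BOLIVIA" name the same country, but they are different strings, so that row is dropped.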

Let's see some data:

In [None]:
import pandas as pd
import os


allFree=pd.read_pickle("https://github.com/PythonVersusR/OperationsFormatting/raw/main/DataFiles/allFree.pkl")  
allFree

Now, let's bring in a second data set:

In [None]:
%%html

<iframe width="700" height="300" src="https://www.cia.gov/the-world-factbook/field/military-expenditures/country-comparison" allowfullscreen></iframe>


In [None]:
linkCIA="https://www.cia.gov/the-world-factbook/field/military-expenditures/country-comparison"
mil=pd.read_html(linkCIA,flavor='bs4')
# how many
len(mil)

In [None]:
# the only one
mil[0]

Let's check the format:

In [None]:
mil[0].info()

Let's keep the columns we need from the data frame:

In [None]:
mil[0]=mil[0].iloc[:,[1,2]]
mil[0].head()

Let's create a new data frame, while renaming the second column with a simpler name:

In [None]:
mili=mil[0].rename(columns={"% of GDP": "mili_pctGDP"})
mili.head()

## Deciding keys

Obviously, _Country_:


In [None]:
mili.columns, allFree.columns

In [None]:
#explore
allFree.Country.sort_values(),mili.Country.sort_values()

We should _normalize_ the **key** columns:

In [None]:
mili['Country']=mili.Country.str.upper()
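Upper-casing is only one possible normalization. The sketch below, on a made-up Series, also trims leading and trailing blanks and collapses repeated spaces. Whether the real key columns need these extra steps is an assumption you should check against your own data.

```python
import pandas as pd

# Made-up key values with typical formatting problems
raw = pd.Series(["  Peru", "chile ", "Costa  Rica"], name="Country")

clean = (raw.str.strip()                          # drop leading/trailing blanks
            .str.replace(r"\s+", " ", regex=True)  # collapse inner runs of spaces
            .str.upper())                          # same case on both sides
print(clean.tolist())
```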

## Basic merge

The basic merge works like this:

In [None]:
# Which data frame has more rows?
mili.shape[0],allFree.shape[0]

When row counts differ, you can expect the merge to return at most as many rows as the smaller data frame, as long as each key value appears only once in each data frame. Let's see:

In [None]:
mili.merge(allFree,left_on='Country',right_on='Country').shape[0]
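The "at most the smaller row count" expectation holds only when each key value appears once per data frame. Here is a small sketch with made-up data showing what happens with a repeated key, and how the `validate` argument of `merge` can catch it:

```python
import pandas as pd

# Toy data: "PERU" appears twice in the left data frame
a = pd.DataFrame({"Country": ["PERU", "PERU", "CHILE"], "x": [1, 2, 3]})
b = pd.DataFrame({"Country": ["PERU", "CHILE"], "y": [10, 20]})

# Quick uniqueness check on the key columns
print(a.Country.is_unique, b.Country.is_unique)

# Each "PERU" row on the left pairs with the "PERU" row on the right,
# so the result has more rows than the smaller data frame
many = a.merge(b, on="Country")
print(many.shape[0])

# validate="one_to_one" makes pandas raise an error instead of silently duplicating
try:
    a.merge(b, on="Country", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```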

## Fuzzy matching

That is the current number of rows in the merge. Let's explore the unmatched keys:

In [None]:
InMiliNotInFree=list(set(mili.Country)-set(allFree.Country))
sorted(InMiliNotInFree)

In [None]:
InFreeUnmatched=list(set(allFree.Country)-set(mili.Country))
sorted(InFreeUnmatched)
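The set differences above can also be obtained from pandas itself. This is a sketch on toy data (the values are made up), using an outer merge with `indicator=True`, which adds a `_merge` column telling where each key was found:

```python
import pandas as pd

# Toy data frames whose keys only partially agree
left = pd.DataFrame({"Country": ["PERU", "CZECH REPUBLIC"], "score": [1, 2]})
right = pd.DataFrame({"Country": ["PERU", "CZECHIA"], "pct": [1.5, 2.5]})

# An outer merge keeps every key; _merge is "both", "left_only" or "right_only"
check = left.merge(right, on="Country", how="outer", indicator=True)

left_only = sorted(check.loc[check["_merge"] == "left_only", "Country"])
right_only = sorted(check.loc[check["_merge"] == "right_only", "Country"])
print(left_only, right_only)
```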

Let's try to match strings that are NOT written the same way. You first need to install:
* thefuzz (use _pip install thefuzz_)
* python-Levenshtein  (use _pip install python-Levenshtein_)
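To get a feel for score-based matching before installing anything, here is a rough sketch that uses the standard library's `difflib` as a stand-in for thefuzz. The helpers `similarity` and `build_mapping` are hypothetical names written for this example, the country names are made up, and the scores will not equal the ones thefuzz reports because the scoring method is different.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Stand-in scorer on a 0-100 scale (not the scorer thefuzz uses)
    return round(100 * SequenceMatcher(None, a, b).ratio())

def build_mapping(unmatched, candidates, cutoff=90):
    # For each unmatched name, keep its best candidate only if the score
    # reaches the cutoff (90 is an arbitrary choice for this sketch)
    mapping = {}
    for name in sorted(unmatched):
        best = max(candidates, key=lambda c: similarity(name, c))
        if similarity(name, best) >= cutoff:
            mapping[name] = best
    return mapping

fixes = build_mapping(["GUINEA BISSAU", "NORTH KOREA"],
                      ["GUINEA-BISSAU", "KOREA, SOUTH"])
print(fixes)
```

Names that differ by a single character pass the cutoff, while "NORTH KOREA" finds no close enough candidate and is left for manual review.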

In [None]:
from thefuzz import process

[(country, process.extractOne(country,InMiliNotInFree )) for country in sorted(InFreeUnmatched)]

This exploration suggests we make some changes manually first:

In [None]:
#allFree=allFree[allFree.Country != "NORTH KOREA"] # bye no
manualFree={'REPUBLIC OF THE CONGO':'CONGO, REPUBLIC OF THE','CZECH REPUBLIC':'CZECHIA'}
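# assign the result back: an inplace replace on a selected column may not update allFree
allFree['Country']=allFree.Country.replace(manualFree)

# recompute the unmatched keys on each side
InMiliNotInFree=list(set(mili.Country)-set(allFree.Country))
InFreeUnmatched=list(set(allFree.Country)-set(mili.Country))

# search again for the best candidate of each unmatched key
[(country, process.extractOne(country,InMiliNotInFree )) for country in sorted(InFreeUnmatched)]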

Notice which matches reach a score of at least 90:

In [None]:
[(country, process.extractOne(country,InMiliNotInFree )) for country in sorted(InFreeUnmatched) 
 if process.extractOne(country,InMiliNotInFree)[1]>=90]

In [None]:
# then:
fuzzyFree={country: process.extractOne(country,InMiliNotInFree )[0] for country in sorted(InFreeUnmatched) 
 if process.extractOne(country,InMiliNotInFree)[1]>=90}
fuzzyFree

Apparently, those were all the reliable matches, so let's apply them:

In [None]:
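# assign the result back: an inplace replace on a selected column may not update allFree
allFree['Country']=allFree.Country.replace(fuzzyFree)

# recompute the unmatched keys on each side
InMiliNotInFree=list(set(mili.Country)-set(allFree.Country))
InFreeUnmatched=list(set(allFree.Country)-set(mili.Country))

# check what is still unmatched
[(country, process.extractOne(country,InMiliNotInFree )) for country in sorted(InFreeUnmatched)]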

You can also try:

In [None]:
#opposite search
[(country, process.extractOne(country, InFreeUnmatched)) for country in sorted(InMiliNotInFree)]

We have done the best we can, so let's merge:

In [None]:
# left_on/right_on also work when the key columns have different names:
freemili=allFree.merge(mili,left_on='Country', right_on='Country')
freemili

We can save this for R and Python:

In [None]:
# for Python

import os
freemili.to_pickle(os.path.join("FilesToMerge","FreeAndMili.pkl"))

In [None]:
# for R
import os

os.environ['R_HOME'] = '/Library/Frameworks/R.framework/Resources'

from rpy2.robjects import pandas2ri
pandas2ri.activate()

from rpy2.robjects.packages import importr

base = importr('base')
base.saveRDS(freemili,file=os.path.join('FilesToMerge','FreeAndMili.RDS'))