<center><img src="https://github.com/DACSS-PreProcessing/Week_1_main/blob/main/pics/LogoSimple.png?raw=true" width="700"></center>


# Concatenating Data Frames in Python

Concatenating is an operation at the data frame level. Vertical concatenation, where one data frame is stacked on top of another, is easy when all the data frames have the **same** column names in the same positions.
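
As a minimal sketch of that easy case, the two toy data frames below share the same columns in the same order. The names `df_2022`, `df_2023` and `stacked`, and all the values, are made up for illustration.

```python
import pandas as pd

# Two toy data frames with identical column names and positions (made-up values)
df_2022 = pd.DataFrame({"Country": ["Peru", "Chile"], "Total": [70.1, 40.2]})
df_2023 = pd.DataFrame({"Country": ["Peru", "Chile"], "Total": [69.8, 41.0]})

# axis=0 stacks one data frame on top of the other;
# ignore_index=True renumbers the rows from 0
stacked = pd.concat([df_2022, df_2023], axis=0, ignore_index=True)
print(stacked)
```

The result has four rows and the same two columns, with a fresh row index.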

For this example, there is a webpage at **fragilestatesindex.org** with links to several Excel files. Let me get all the links:

In [None]:
import requests
from bs4 import BeautifulSoup
 
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'}
url = "https://fragilestatesindex.org/excel/"
data = requests.get(url,headers=headers).text
soup = BeautifulSoup(data, 'html.parser')

allLinks=[]
for table in soup.find_all('table'):
    for a in table.find_all('a'):
        allLinks.append(a['href'].strip())
allLinks=set(allLinks)

In [None]:
allLinks
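
To see the link-collecting logic without going online, here is a rough offline sketch. It swaps BeautifulSoup for the standard library's `html.parser`, and the HTML snippet, the `example.org` addresses and the `TableLinkCollector` class are all hypothetical.

```python
from html.parser import HTMLParser

# Made-up HTML standing in for the real page (hypothetical file names)
sample_html = """
<table>
  <tr><td><a href=" https://example.org/fsi-2022.xlsx ">2022</a></td></tr>
  <tr><td><a href="https://example.org/fsi-2023.xlsx">2023</a></td></tr>
  <tr><td><a href="https://example.org/fsi-2023.xlsx">2023 (repeated)</a></td></tr>
</table>
<p><a href="https://example.org/about">outside any table</a></p>
"""

class TableLinkCollector(HTMLParser):
    """Collect the href of every <a> tag found inside a <table>."""
    def __init__(self):
        super().__init__()
        self.table_depth = 0
        self.links = set()  # a set drops repeated links

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif tag == "a" and self.table_depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.links.add(href.strip())  # remove stray spaces

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1

collector = TableLinkCollector()
collector.feed(sample_html)
sample_links = sorted(collector.links)
print(sample_links)
```

Only the two distinct links inside the table survive: the repeated link collapses into one, and the link outside the table is ignored.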

Now, I will create a list of data frames:

In [None]:
import pandas as pd

dfs=[] # a list
for link in allLinks:
    dfs.append(pd.read_excel(link,storage_options=headers))

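We are not merging, but we should still avoid duplicates after concatenating: the same column or the same country written in two different ways would be treated as two different things.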

What columns do we have?

In [None]:
# saving column names
allColumnNames=[]
for df in dfs:
    allColumnNames.append(set(df.columns))# list of sets!

In [None]:
# details of common columns
commonColumns=set.intersection(*allColumnNames) # expanding list of sets (*)
commonColumns

In [None]:
# all minus the common
set.union(*allColumnNames)-commonColumns
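
The two set operations above can be tried on a small example. The column names in `cols_by_file` are made up, and the list plays the role of **allColumnNames**.

```python
# One set of column names per (hypothetical) file
cols_by_file = [
    {"Country", "Year", "Rank", "Total"},
    {"Country", "Year", "Rank", "Total", "Change from Previous Year"},
    {"Country", "Year", "Total"},
]

# columns present in every file (* expands the list into separate arguments)
common = set.intersection(*cols_by_file)

# columns present in at least one file, but not in all of them
extra = set.union(*cols_by_file) - common

print(common)
print(extra)
```

Anything that shows up in `extra` will produce missing values after a vertical concatenation, because some files do not have it.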

All data frames have the Country column. Let's see how the country names look:

In [None]:
allKeys=[] # list for country names
for df in dfs:
    allKeys.append(set(df.Country))# list of sets!
#
commonKeys=set.intersection(*allKeys) 

# any weird pattern ?
set.union(*allKeys)-commonKeys

We see two simple problems that create false duplicates: trailing spaces and accent marks. Let's deal with them:

In [None]:
from unidecode import unidecode

for i in range(len(dfs)):
    dfs[i]['Country']=dfs[i].Country.str.strip().apply(unidecode)
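
To illustrate the effect of this cleaning step without installing anything, here is a small sketch. It uses the standard library's `unicodedata` as a stand-in for `unidecode` (which transliterates many more characters than just accented letters), and both the helper `strip_accents` and the names in `raw_names` are made up for the example.

```python
import unicodedata

def strip_accents(text):
    """Decompose each character (NFKD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Made-up variants of the same names: trailing space and accents
raw_names = ["Cote d'Ivoire ", "Côte d'Ivoire",
             "Sao Tome and Principe", "São Tomé and Príncipe"]

# strip the spaces first, then remove the accents
clean_names = {strip_accents(name.strip()) for name in raw_names}
print(sorted(clean_names))
```

Four different strings collapse into two names, which is what we want before comparing the countries across files.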

Let's see what we have:

In [None]:
allKeys=[]
for df in dfs:
    allKeys.append(set(df.Country))# list of sets!

commonKeys=set.intersection(*allKeys)
set.union(*allKeys)-commonKeys

The remaining problem is more complicated: some countries have been written in different ways across files. If we continue, the same country will appear under two names, and reshaping the data later will create unnecessary missing values. Let's deal with this.

In [None]:
missfits=list(set.union(*allKeys)-commonKeys)

from thefuzz import process as fz

[(c,fz.extract(c,missfits,limit=2)) for c in sorted(missfits)]

Each name is searched in the same list it belongs to, so the best match is always the name itself. Only the second best is useful. Then:

In [None]:
# keep the second-best match only when its score is at least 75
[(c,best2) for c in sorted(missfits) if (best2:=fz.extract(c,missfits,limit=2)[1])[1]>=75]
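
To see why the first match is skipped, here is a rough sketch with the standard library's `difflib` instead of `thefuzz`. Its scores go from 0 to 1 rather than from 0 to 100, so the numbers are not comparable, and the names in the list are just an illustration.

```python
from difflib import get_close_matches

# Illustrative list of names that did not match across files
names = ["Cabo Verde", "Cape Verde", "Kyrgyzstan", "Swaziland"]

# ask for the two closest matches to a name that is itself in the list
top_two = get_close_matches("Cape Verde", names, n=2, cutoff=0.6)

best = top_two[0]       # the name itself (perfect match)
runner_up = top_two[1]  # the candidate we actually care about
print(best, "|", runner_up)
```

The best match is "Cape Verde" itself, and the runner-up, "Cabo Verde", is the alternative spelling.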

We should prepare a dictionary of changes. Some of them will be entered manually:

In [None]:
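theFits=[(c,best2) for c in sorted(missfits) if (best2:=fz.extract(c,missfits,limit=2)[1])[1]>=75]
# each pair shows up in both directions; taking every other element keeps one of them
# (this assumes both directions are next to each other in theFits, so check the result)
allChanges={fit[0]:fit[1][0] for fit in theFits[0:-1:2]}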

allChanges.update({'Kyrgyzstan':'Kyrgyz Republic'})
allChanges.update({'Swaziland':'Eswatini'})
allChanges

Now, we make the changes in every data frame:

In [None]:
for i in range(len(dfs)):
    dfs[i].replace({'Country':allChanges}, inplace=True)
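
The nested dictionary in `replace` limits the changes to one column. A toy example, where the data frame `toy` and its `Note` column are made up:

```python
import pandas as pd

# "Swaziland" appears in two columns, but only Country should change
toy = pd.DataFrame({"Country": ["Swaziland", "Peru"],
                    "Note": ["Swaziland", "n/a"]})

# {column: {old value: new value}}
fixed = toy.replace({"Country": {"Swaziland": "Eswatini"}})
print(fixed)
```

The value in `Country` becomes "Eswatini", while the same text in `Note` is left alone.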

Now, you can use the **dfs** list to concatenate:

In [None]:
allDFs=pd.concat(objs=dfs, # DFs as a list
                 axis=0, # one DF on top of the other
                 ignore_index=True, #very important
                 copy=False)

allDFs.columns
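
Two details of this call are worth seeing on a small scale: what happens when a column is missing in one data frame, and what `ignore_index` does. The data frames `a` and `b` and their values are made up.

```python
import pandas as pd

# b has one column that a does not have (made-up values)
a = pd.DataFrame({"Country": ["Peru"], "Total": [70.1]})
b = pd.DataFrame({"Country": ["Peru"], "Total": [69.8],
                  "Change from Previous Year": [-0.3]})

kept_index = pd.concat([a, b], axis=0)                    # row labels repeat: 0, 0
new_index = pd.concat([a, b], axis=0, ignore_index=True)  # row labels: 0, 1

print(kept_index.index.tolist())
print(new_index)
```

The rows coming from `a` get a missing value in the column they did not have, and without `ignore_index=True` the row labels are repeated.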

Some basic cleaning in column names:

In [None]:
allDFs.columns=allDFs.columns.str.replace(r':\s|\s','_',regex=True)

allDFs
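
The pattern `:\s|\s` matches either a colon followed by a space, or a single space. Here it is applied with the standard library's `re.sub` to a few illustrative column names (the list `raw_columns` is an assumption, not the real header):

```python
import re

# Illustrative column names
raw_columns = ["C1: Security Apparatus", "Change from Previous Year", "Country"]

# ": " becomes one underscore; any remaining space becomes an underscore too
clean_columns = [re.sub(r":\s|\s", "_", name) for name in raw_columns]
print(clean_columns)
```

Putting `:\s` first in the pattern matters: the colon and the space after it are replaced together, so we do not end up with a leftover colon or a double underscore.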

In [None]:
allDFs.info()

We should drop the last column:

In [None]:
allDFs.drop(columns=['Change_from_Previous_Year'], inplace=True)

Pay attention to years:

In [None]:
allDFs.Year.value_counts()

Some years were read as integers and others as dates, so they need a common format. Let's keep only the year:

In [None]:
# integers stay as they are; for dates, keep only the year
allDFs['Year']=[y if type(y)==int else y.year for y in allDFs.Year]
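
The same list comprehension on a small, made-up list that mixes integers and dates:

```python
from datetime import datetime

# Made-up mix of plain integers and dates
mixed_years = [2006, datetime(2019, 1, 1), 2021, datetime(2023, 1, 1)]

# integers stay as they are; dates contribute only their year
clean_years = [y if type(y) == int else y.year for y in mixed_years]
print(clean_years)
```

After this step every value is a plain integer.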

In [None]:
allDFs.Year.value_counts()

Let's save the DF:

In [None]:
allDFs.to_csv('allDFs.csv',index=False)