<center><img src="https://github.com/DACSS-PreProcessing/Week_1_main/blob/main/pics/LogoSimple.png?raw=true" width="700"></center>


# Concatenating Data Frames in Python

Concatenating is an operation at the data frame level: it stacks whole data frames rather than combining individual columns. Vertical concatenation (one data frame on top of another) is easy when all the data frames have the **same** column names, in the same position.
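
As a minimal sketch of what vertical concatenation does, the toy data frames below (made-up names and values, not the index data) share the same column names. The last part shows what would happen if one column name differed:

```python
import pandas as pd

# Two toy data frames with identical column names (illustrative values only)
df_a = pd.DataFrame({"Country": ["A", "B"], "Total": [10.0, 20.0]})
df_b = pd.DataFrame({"Country": ["C"], "Total": [30.0]})

# Stack one on top of the other and renumber the rows from 0
stacked = pd.concat([df_a, df_b], axis=0, ignore_index=True)
print(stacked.shape)  # (3, 2)

# If a column name differs, pandas keeps both columns and fills the gaps with NaN
df_c = df_b.rename(columns={"Total": "total"})
mismatched = pd.concat([df_a, df_c], axis=0, ignore_index=True)
print(mismatched.columns.tolist())  # ['Country', 'Total', 'total']
```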

For this example, we will use a webpage on **fragilestatesindex.org** that links to several Excel files. Let me get all the links:

In [None]:
import requests
from bs4 import BeautifulSoup
 
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'}
url = "https://fragilestatesindex.org/excel/"
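response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status() # stop early if the page could not be retrieved
data = response.text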
soup = BeautifulSoup(data, 'html.parser')

allLinks=[]
for table in soup.find_all('table'):
    for a in table.find_all('a', href=True): # skip anchors without a link
        allLinks.append(a['href'].strip())
allLinks=set(allLinks) # keep unique links only

In [None]:
allLinks
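
A set drops duplicates but has no defined order, so the files may be read in a different order on each run. A small sketch, assuming the links of interest end in `.xlsx` or `.xls` (an assumption about the page; the URLs below are hypothetical placeholders), keeps only those and sorts them:

```python
# Hypothetical links standing in for the scraped ones
example_links = {
    "https://example.org/files/fsi-2020.xlsx",
    "https://example.org/files/fsi-2019.xlsx",
    "https://example.org/about/",
}

# Keep only Excel files and sort them so the order is reproducible
excel_links = sorted(
    link for link in example_links if link.lower().endswith((".xlsx", ".xls"))
)
print(excel_links)
```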

Now, I will read each file into a data frame and collect the data frames in a list:

In [None]:
import pandas as pd

dfs=[] # a list
for link in allLinks:
    dfs.append(pd.read_excel(link,storage_options=headers))
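
Before concatenating, it may be worth confirming that every data frame has the same column names in the same order. The sketch below uses small stand-in data frames (hypothetical, in place of the downloaded ones) and collects the distinct column layouts:

```python
import pandas as pd

# Stand-ins for the data frames read from the Excel files
frames = [
    pd.DataFrame({"Country": ["A"], "Year": [2019], "Total": [50.1]}),
    pd.DataFrame({"Country": ["B"], "Year": [2020], "Total": [60.2]}),
]

# One tuple of column names per data frame; a set keeps the distinct layouts
layouts = {tuple(df.columns) for df in frames}
same_columns = len(layouts) == 1
print(same_columns)  # True when all data frames share names and order
```

If more than one layout shows up, printing `layouts` makes it easy to spot which names differ.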

You can use the list to concatenate:

In [None]:
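allDFs=pd.concat(objs=dfs, # DFs as a list
                 axis=0, # one DF on top of the other
                 ignore_index=True, # very important: renumber the rows from 0 instead of repeating each file's index
                 copy=False) # avoid copying the data when possible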

allDFs.columns

Some basic cleaning of the column names (colons and spaces become underscores):

In [None]:
allDFs.columns=allDFs.columns.str.replace(r':\s|\s','_',regex=True)

allDFs
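
To see what the pattern does, here is a small sketch on two sample names. The first name is an assumed example of a "code: label" style header; the second matches a column that appears later in this notebook:

```python
import pandas as pd

# Illustrative column names (the first one is an assumption)
sample = pd.Index(["C1: Security Apparatus", "Change from Previous Year"])

# ':\s' matches a colon followed by whitespace; '\s' matches any remaining whitespace
cleaned = sample.str.replace(r':\s|\s', '_', regex=True)
print(cleaned.tolist())
# ['C1_Security_Apparatus', 'Change_from_Previous_Year']
```

Because `:\s` comes first in the alternation, a colon plus the space after it becomes a single underscore rather than two.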

Pay attention to the values in the `Year` column:

In [None]:
allDFs.Year.value_counts()

We can convert the column to datetime and keep only the year. Preview the result before overwriting anything:

In [None]:
pd.to_datetime(allDFs.Year).dt.year.value_counts()

Then, overwrite the column with the formatted values:

In [None]:
allDFs.Year=pd.to_datetime(allDFs.Year).dt.year

allDFs
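
One caveat with `pd.to_datetime`: a bare integer such as 2021 is interpreted as nanoseconds since 1970-01-01, not as a year. If some files stored `Year` as a plain number (an assumption worth checking in the value counts, not something confirmed here), a helper along these lines would handle both cases. The name `to_year` and the sample values are hypothetical:

```python
import numbers
import pandas as pd

def to_year(value):
    """Return the calendar year from a datetime-like value or a plain year number."""
    if pd.isna(value):
        return pd.NA
    if isinstance(value, numbers.Real):
        return int(value)  # already a year such as 2021
    return pd.to_datetime(value).year

# Hypothetical mix of formats in one column
years = pd.Series([pd.Timestamp("2006-01-01"), 2021, "2019-01-01"], dtype="object")
clean_years = years.map(to_year)
print(clean_years.tolist())  # [2006, 2021, 2019]

# For comparison: a bare integer is read as nanoseconds since 1970-01-01
print(pd.to_datetime(pd.Series([2021])).dt.year.tolist())  # [1970]
```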

Check missingness:

In [None]:
allDFs.info()
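
`info()` reports non-null counts; counting the missing values directly can be easier to read. A minimal sketch on a toy data frame (illustrative names and values only):

```python
import numpy as np
import pandas as pd

# Toy data frame with one mostly empty column
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Total": [10.0, 20.0, 30.0],
    "Change": [np.nan, np.nan, 1.5],
})

missing_counts = toy.isna().sum()   # missing values per column
missing_share = toy.isna().mean()   # same, as a fraction of the rows
print(missing_counts)
print(missing_share)
```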

We should drop the last column (`Change_from_Previous_Year`):

In [None]:
allDFs.drop(columns=['Change_from_Previous_Year'], inplace=True)

Let's save the data frame as a CSV file:

In [None]:
allDFs.to_csv('allDFs.csv',index=False)
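
As a quick sanity check of what `index=False` does, this sketch writes a toy data frame (made-up values) to an in-memory buffer instead of a file and reads it back:

```python
import io
import pandas as pd

toy = pd.DataFrame({"Country": ["A", "B"], "Year": [2019, 2020]})

# Write to an in-memory buffer, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())  # ['Country', 'Year']
```

Without `index=False`, the row index would be written as an extra unnamed first column.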