<center><img src="https://github.com/DACSS-PreProcessing/Week_1_main/blob/main/pics/LogoSimple.png?raw=true" width="700"></center>


# Concatenating Data Frames in Python

Concatenating is an operation at the data frame level. It is easy when all the data frames have the **same** column names, in the same positions, so that one data frame can be stacked on top of another (vertical concatenation).
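
As a minimal sketch of vertical concatenation, assume two small made-up data frames with identical column names (the names `df2021` and `df2022` and their values are invented for illustration, not taken from the real files):

```python
import pandas as pd

# two toy data frames with the same columns in the same order (made-up values)
df2021 = pd.DataFrame({"Country": ["A", "B"], "Total": [10.0, 20.0]})
df2022 = pd.DataFrame({"Country": ["A", "B"], "Total": [11.0, 19.0]})

# stack one on top of the other
stacked = pd.concat([df2021, df2022], axis=0, ignore_index=True)
print(stacked.shape)
```

The result keeps the two shared columns and has the rows of both inputs.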

For this example, the website **fragilestatesindex.org** has a page with links to several Excel files. Let me get all the links:

In [36]:
import requests
from bs4 import BeautifulSoup
 
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'}
url = "https://fragilestatesindex.org/excel/"
data = requests.get(url,headers=headers).text
soup = BeautifulSoup(data, 'html.parser')

allLinks=[]
for table in soup.find_all('table'):
    for a in table.find_all('a'):
        allLinks.append(a['href'].strip())
allLinks=set(allLinks)

In [37]:
allLinks

{'https://fragilestatesindex.org/wp-content/uploads/2018/04/fsi-2018.xlsx',
 'https://fragilestatesindex.org/wp-content/uploads/2019/04/fsi-2019.xlsx',
 'https://fragilestatesindex.org/wp-content/uploads/2020/05/fsi-2020.xlsx',
 'https://fragilestatesindex.org/wp-content/uploads/2021/05/fsi-2021.xlsx',
 'https://fragilestatesindex.org/wp-content/uploads/2022/07/fsi-2022-download.xlsx',
 'https://fragilestatesindex.org/wp-content/uploads/2023/06/FSI-2023-DOWNLOAD.xlsx',
 'https://fragilestatesindex.org/wp-content/uploads/data/fsi-2006.xlsx',
 'https://fragilestatesindex.org/wp-content/uploads/data/fsi-2007.xlsx',
 'https://fragilestatesindex.org/wp-content/uploads/data/fsi-2008.xlsx',
 'https://fragilestatesindex.org/wp-content/uploads/data/fsi-2009.xlsx',
 'https://fragilestatesindex.org/wp-content/uploads/data/fsi-2010.xlsx',
 'https://fragilestatesindex.org/wp-content/uploads/data/fsi-2011.xlsx',
 'https://fragilestatesindex.org/wp-content/uploads/data/fsi-2012.xlsx',
 'https://fragi

Now, I will create a list of data frames:

In [51]:
import pandas as pd

dfs=[]
for link in allLinks:
    dfs.append(pd.read_excel(link,storage_options=headers))

In [57]:
pd.concat(objs=dfs).info()

<class 'pandas.core.frame.DataFrame'>
Index: 3170 entries, 0 to 177
Data columns (total 17 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Country                           3170 non-null   object 
 1   Year                              3170 non-null   object 
 2   Rank                              3170 non-null   object 
 3   Total                             3170 non-null   float64
 4   C1: Security Apparatus            3170 non-null   float64
 5   C2: Factionalized Elites          3170 non-null   float64
 6   C3: Group Grievance               3170 non-null   float64
 7   E1: Economy                       3170 non-null   float64
 8   E2: Economic Inequality           3170 non-null   float64
 9   E3: Human Flight and Brain Drain  3170 non-null   float64
 10  P1: State Legitimacy              3170 non-null   float64
 11  P2: Public Services               3170 non-null   float64
 12  P3: Human Ri

In [58]:
allDFs=pd.concat(objs=dfs, # DFs as a list
                 axis=0, # one DF on top of the other
                 ignore_index=True, # very important: build a fresh 0..n-1 index
                 copy=False)

allDFs.columns

Index(['Country', 'Year', 'Rank', 'Total', 'C1: Security Apparatus',
       'C2: Factionalized Elites', 'C3: Group Grievance', 'E1: Economy',
       'E2: Economic Inequality', 'E3: Human Flight and Brain Drain',
       'P1: State Legitimacy', 'P2: Public Services', 'P3: Human Rights',
       'S1: Demographic Pressures', 'S2: Refugees and IDPs',
       'X1: External Intervention', 'Change from Previous Year'],
      dtype='object')
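
The result has 17 columns, although the first data frames printed further down show only 16: `pd.concat` keeps the union of the column names and fills the gaps with missing values. A small sketch of that behavior, and of what `ignore_index` does, using invented toy frames (`old`, `new`, and the short column name `Change` are assumptions for illustration):

```python
import pandas as pd

# toy frames: the second one has an extra column (made-up values)
old = pd.DataFrame({"Country": ["A"], "Total": [10.0]})
new = pd.DataFrame({"Country": ["A"], "Total": [11.0], "Change": [1.0]})

# union of columns; rows coming from `old` get NaN in "Change"
both = pd.concat([old, new], axis=0, ignore_index=True)
print(both)

# without ignore_index each frame keeps its own row labels, so labels repeat
without_reset = pd.concat([old, new], axis=0)
print(list(without_reset.index))
```

Repeated row labels are the reason the `info()` output above reports an index running only "0 to 177" for 3170 rows.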

In [63]:
allDFs.columns=allDFs.columns.str.replace(r':\s|\s','_',regex=True)

allDFs

Unnamed: 0,Country,Year,Rank,Total,C1_Security_Apparatus,C2_Factionalized_Elites,C3_Group_Grievance,E1_Economy,E2_Economic_Inequality,E3_Human_Flight_and_Brain_Drain,P1_State_Legitimacy,P2_Public_Services,P3_Human_Rights,S1_Demographic_Pressures,S2_Refugees_and_IDPs,X1_External_Intervention,Change_from_Previous_Year
0,Somalia,2012-01-01 00:00:00,1st,114.9,10.0,9.8,9.6,9.7,8.1,8.6,9.9,9.8,9.9,9.8,10.0,9.8,
1,Congo Democratic Republic,2012-01-01 00:00:00,2nd,111.2,9.7,9.5,9.3,8.8,8.9,7.4,9.5,9.2,9.7,9.9,9.7,9.6,
2,Sudan,2012-01-01 00:00:00,3rd,109.4,9.7,9.9,10.0,7.3,8.8,8.3,9.5,8.5,9.4,8.4,9.9,9.5,
3,South Sudan,2012-01-01 00:00:00,n/r,108.4,9.7,10.0,10.0,7.3,8.8,6.4,9.1,9.5,9.2,8.4,9.9,10.0,
4,Chad,2012-01-01 00:00:00,4th,107.6,8.9,9.8,9.1,8.3,8.6,7.7,9.8,9.5,9.3,9.3,9.5,7.8,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3165,Australia,2019-01-01 00:00:00,174th,19.7,2.7,1.7,3.3,1.6,1.6,1.0,1.0,1.5,1.7,1.2,1.7,0.7,-1.1
3166,Denmark,2019-01-01 00:00:00,175th,19.5,1.3,1.4,4.3,1.6,1.2,1.9,0.9,0.9,1.7,1.6,2.0,0.7,-0.3
3167,Switzerland,2019-01-01 00:00:00,176th,18.7,1.1,1.0,3.3,1.9,1.8,1.7,0.7,1.0,1.4,1.4,2.7,0.7,-0.5
3168,Norway,2019-01-01 00:00:00,177th,18.0,2.1,1.1,3.3,1.9,1.0,1.3,0.6,0.8,0.9,1.2,2.8,1.0,-0.3
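
To see what the regular expression in the renaming step does, here is a small sketch on three of the column names: the pattern replaces either a colon followed by whitespace, or a single whitespace character, with an underscore.

```python
import pandas as pd

# a few of the original column names
cols = pd.Index(["Country", "C1: Security Apparatus", "S2: Refugees and IDPs"])

# ':\s' is tried first, so ": " becomes one underscore, not two
clean = cols.str.replace(r":\s|\s", "_", regex=True)
print(list(clean))
```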


In [65]:
allDFs.Year.value_counts()

Year
2021                   179
2022-01-01 00:00:00    179
2023                   179
2012-01-01 00:00:00    178
2014-01-01 00:00:00    178
2020-01-01 00:00:00    178
2015-01-01 00:00:00    178
2019-01-01 00:00:00    178
2016-01-01 00:00:00    178
2017-01-01 00:00:00    178
2013-01-01 00:00:00    178
2018-01-01 00:00:00    178
2009-01-01 00:00:00    177
2008-01-01 00:00:00    177
2011-01-01 00:00:00    177
2010-01-01 00:00:00    177
2007-01-01 00:00:00    177
2006-01-01 00:00:00    146
Name: count, dtype: int64

In [71]:
pd.to_datetime(allDFs.Year).dt.year.value_counts()

Year
1970    358
2022    179
2012    178
2016    178
2014    178
2020    178
2015    178
2019    178
2017    178
2013    178
2018    178
2008    177
2007    177
2011    177
2010    177
2009    177
2006    146
Name: count, dtype: int64
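
The 358 rows reported as 1970 are the 179 + 179 rows whose year was stored as a plain integer (2021 and 2023): `pd.to_datetime` reads a bare integer as a count of nanoseconds since 1970-01-01. A sketch of the problem and of one possible way around it, on an invented toy column (`years` and `fixed` are illustrative names):

```python
import datetime as dt
import pandas as pd

# a bare integer is treated as nanoseconds since 1970-01-01
print(pd.to_datetime(2021))

# toy column mixing datetime values and integer years, like the Year column
years = pd.Series(
    [dt.datetime(2012, 1, 1), 2021, dt.datetime(2022, 1, 1), 2023],
    dtype=object,
)

# take .year where it exists, keep plain integers unchanged
fixed = years.map(lambda y: getattr(y, "year", y))
print(fixed.tolist())
```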

In [72]:
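# pd.to_datetime reads the integer years (2021, 2023) as nanoseconds since 1970,
# so take .year from the datetime values and keep the integer years as they are
allDFs.Year=allDFs.Year.map(lambda y: getattr(y,'year',y))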

allDFs

Unnamed: 0,Country,Year,Rank,Total,C1_Security_Apparatus,C2_Factionalized_Elites,C3_Group_Grievance,E1_Economy,E2_Economic_Inequality,E3_Human_Flight_and_Brain_Drain,P1_State_Legitimacy,P2_Public_Services,P3_Human_Rights,S1_Demographic_Pressures,S2_Refugees_and_IDPs,X1_External_Intervention,Change_from_Previous_Year
0,Somalia,2012,1st,114.9,10.0,9.8,9.6,9.7,8.1,8.6,9.9,9.8,9.9,9.8,10.0,9.8,
1,Congo Democratic Republic,2012,2nd,111.2,9.7,9.5,9.3,8.8,8.9,7.4,9.5,9.2,9.7,9.9,9.7,9.6,
2,Sudan,2012,3rd,109.4,9.7,9.9,10.0,7.3,8.8,8.3,9.5,8.5,9.4,8.4,9.9,9.5,
3,South Sudan,2012,n/r,108.4,9.7,10.0,10.0,7.3,8.8,6.4,9.1,9.5,9.2,8.4,9.9,10.0,
4,Chad,2012,4th,107.6,8.9,9.8,9.1,8.3,8.6,7.7,9.8,9.5,9.3,9.3,9.5,7.8,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3165,Australia,2019,174th,19.7,2.7,1.7,3.3,1.6,1.6,1.0,1.0,1.5,1.7,1.2,1.7,0.7,-1.1
3166,Denmark,2019,175th,19.5,1.3,1.4,4.3,1.6,1.2,1.9,0.9,0.9,1.7,1.6,2.0,0.7,-0.3
3167,Switzerland,2019,176th,18.7,1.1,1.0,3.3,1.9,1.8,1.7,0.7,1.0,1.4,1.4,2.7,0.7,-0.5
3168,Norway,2019,177th,18.0,2.1,1.1,3.3,1.9,1.0,1.3,0.6,0.8,0.9,1.2,2.8,1.0,-0.3


Let me check if the column names are the same:

In [52]:
for df in dfs:
    print(df.columns)

Index(['Country', 'Year', 'Rank', 'Total', 'C1: Security Apparatus',
       'C2: Factionalized Elites', 'C3: Group Grievance', 'E1: Economy',
       'E2: Economic Inequality', 'E3: Human Flight and Brain Drain',
       'P1: State Legitimacy', 'P2: Public Services', 'P3: Human Rights',
       'S1: Demographic Pressures', 'S2: Refugees and IDPs',
       'X1: External Intervention'],
      dtype='object')
Index(['Country', 'Year', 'Rank', 'Total', 'C1: Security Apparatus',
       'C2: Factionalized Elites', 'C3: Group Grievance', 'E1: Economy',
       'E2: Economic Inequality', 'E3: Human Flight and Brain Drain',
       'P1: State Legitimacy', 'P2: Public Services', 'P3: Human Rights',
       'S1: Demographic Pressures', 'S2: Refugees and IDPs',
       'X1: External Intervention'],
      dtype='object')
Index(['Country', 'Year', 'Rank', 'Total', 'C1: Security Apparatus',
       'C2: Factionalized Elites', 'C3: Group Grievance', 'E1: Economy',
       'E2: Economic Inequality', 'E3: Human F
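
Reading many printed indexes by eye is error-prone. A sketch of a programmatic check, assuming the first data frame is taken as the reference (the list `frames` below is a made-up stand-in for `dfs`):

```python
import pandas as pd

# stand-in for dfs: three empty frames, the last with an extra column
frames = [
    pd.DataFrame(columns=["Country", "Year"]),
    pd.DataFrame(columns=["Country", "Year"]),
    pd.DataFrame(columns=["Country", "Year", "Change"]),
]

# compare names and order of every frame against the first one
reference = list(frames[0].columns)
mismatched = [i for i, df in enumerate(frames) if list(df.columns) != reference]
print(mismatched)
```

The list holds the positions of the data frames whose columns differ from the reference.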

In this situation, I need to work on the column names of the last one:

In [None]:
# keep in the right order
dfs[3]['Combatants']=None
dfs[3]=dfs[3][['War', 'Deathrange', 'Date', 'Combatants', 'Location', 'Notes']]

Let's verify:

In [None]:
# do this again:
for df in dfs:
    print(df.columns)

Now we can concatenate them, and count the number of rows:

In [None]:
allWars=pd.concat(objs=dfs, # DFs as a list
                  axis=0, # one DF on top of the other
                  ignore_index=True, # very important: build a fresh 0..n-1 index
                  copy=False)
allWars.shape

You can save this now. 
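
One possible way to save, sketched on a toy data frame: write a CSV file without the row index and read it back. The file name `allDFs.csv` and the temporary folder are assumptions for illustration; any path would do.

```python
import os
import tempfile
import pandas as pd

# toy stand-in for the concatenated data frame (made-up values)
toy = pd.DataFrame({"Country": ["A", "B"], "Year": [2021, 2022]})

# hypothetical output path inside a temporary folder
out_path = os.path.join(tempfile.mkdtemp(), "allDFs.csv")

# index=False avoids writing the row labels as an extra column
toy.to_csv(out_path, index=False)

# read it back to confirm the round trip
back = pd.read_csv(out_path)
print(back.shape)
```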