<img src="https://i.imgur.com/6U6q5jQ.png"/>

# Operations on Data Frames


Let me get the data on dengue from [Peru](https://www.datosabiertos.gob.pe/dataset/vigilancia-epidemiol%C3%B3gica-de-dengue):

In [1]:
import pandas as pd
linkData="https://github.com/SocialAnalytics-StrategicIntelligence/TableOperations/raw/main/dataFiles/dengue_ok.pkl"
dengue = pd.read_pickle(linkData)
dengue.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398943 entries, 0 to 398942
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   departamento  398943 non-null  object        
 1   provincia     398943 non-null  object        
 2   distrito      398943 non-null  object        
 3   ano           398943 non-null  int64         
 4   semana        398943 non-null  int64         
 5   sexo          398943 non-null  object        
 6   edad          398943 non-null  int64         
 7   enfermedad    398943 non-null  category      
 8   year          398931 non-null  datetime64[ns]
dtypes: category(1), datetime64[ns](1), int64(3), object(4)
memory usage: 24.7+ MB


In [2]:
# some exploration (numeric columns only: the float format garbles the datetime column)
dengue.select_dtypes('number').describe().apply(lambda s: s.apply('{0:.5f}'.format))

Unnamed: 0,ano,semana,edad
count,398943.0,398943.0,398943.0
mean,2015.0617,22.61685,29.97476
std,6.14862,14.89333,18.5326
min,2000.0,1.0,0.0
25%,2011.0,11.0,15.0
50%,2016.0,19.0,27.0
75%,2020.0,34.0,42.0
max,2022.0,53.0,106.0


Each row is a person:

In [3]:
dengue.head()

Unnamed: 0,departamento,provincia,distrito,ano,semana,sexo,edad,enfermedad,year
0,HUANUCO,LEONCIO PRADO,LUYANDO,2000,47,M,9,SIN_SEÑALES,2000-01-01
1,HUANUCO,LEONCIO PRADO,LUYANDO,2000,40,F,18,SIN_SEÑALES,2000-01-01
2,HUANUCO,LEONCIO PRADO,JOSE CRESPO Y CASTILLO,2000,48,F,32,SIN_SEÑALES,2000-01-01
3,HUANUCO,LEONCIO PRADO,JOSE CRESPO Y CASTILLO,2000,37,F,40,SIN_SEÑALES,2000-01-01
4,HUANUCO,LEONCIO PRADO,MARIANO DAMASO BERAUN,2000,42,M,16,SIN_SEÑALES,2000-01-01


If we wanted to count people, creating a column of ones helps:

In [4]:
dengue=dengue.assign(case=1)
dengue.head()

Unnamed: 0,departamento,provincia,distrito,ano,semana,sexo,edad,enfermedad,year,case
0,HUANUCO,LEONCIO PRADO,LUYANDO,2000,47,M,9,SIN_SEÑALES,2000-01-01,1
1,HUANUCO,LEONCIO PRADO,LUYANDO,2000,40,F,18,SIN_SEÑALES,2000-01-01,1
2,HUANUCO,LEONCIO PRADO,JOSE CRESPO Y CASTILLO,2000,48,F,32,SIN_SEÑALES,2000-01-01,1
3,HUANUCO,LEONCIO PRADO,JOSE CRESPO Y CASTILLO,2000,37,F,40,SIN_SEÑALES,2000-01-01,1
4,HUANUCO,LEONCIO PRADO,MARIANO DAMASO BERAUN,2000,42,M,16,SIN_SEÑALES,2000-01-01,1
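
As a side note, the column of ones is one way to count, but not the only one. The sketch below uses a tiny made-up data frame (`toy` is an assumption, not part of the dengue data) to suggest that `groupby(...).size()` gives the same counts without the helper column:

```python
import pandas as pd

# made-up rows, only for illustration
toy = pd.DataFrame({'ano': [2000, 2000, 2001],
                    'sexo': ['M', 'F', 'F']})

# option 1: helper column of ones, then sum
withOnes = toy.assign(case=1).groupby('ano')['case'].sum()

# option 2: count the rows of each group directly
withSize = toy.groupby('ano').size()

print(withOnes.to_dict())
print(withSize.to_dict())
```

The column of ones is still handy when the count has to be computed together with other statistics in a single `agg` call.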


Let's start creating _data from these data_!

# Dengue by Year

## Aggregation

Since each row is a person, we can count people by some grouping variables, in this case year (_ano_) and dengue status (_enfermedad_).


In [5]:
indexList=['ano','enfermedad']
aggregator={'edad': ['mean','median'], 'case':['sum']}
ByYear_stats=dengue.groupby(indexList,observed=True).agg(aggregator)
ByYear_stats.head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,edad,edad,case
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,median,sum
ano,enfermedad,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2000,SIN_SEÑALES,29.508788,27.0,4324
2001,SIN_SEÑALES,30.634282,28.0,15851
2001,GRAVE,31.572614,28.0,241
2002,SIN_SEÑALES,26.960178,24.0,6278
2002,ALARMA,12.0,12.0,1
2002,GRAVE,21.928571,19.0,14
2003,SIN_SEÑALES,28.947719,27.0,2850
2003,GRAVE,38.0,30.0,15
2004,SIN_SEÑALES,28.863269,26.0,7928
2004,GRAVE,23.794118,19.0,34


In [6]:
# notice the column hierarchy: a MultiIndex
ByYear_stats.columns

MultiIndex([('edad',   'mean'),
            ('edad', 'median'),
            ('case',    'sum')],
           )

For easier manipulation outside Python, we could flatten the index hierarchy:

In [7]:
# ok?
["_".join(name) for name in ByYear_stats.columns]

['edad_mean', 'edad_median', 'case_sum']

In [8]:
# changing
ByYear_stats.columns=["_".join(name) for name in ByYear_stats.columns]
ByYear_stats.head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,edad_mean,edad_median,case_sum
ano,enfermedad,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000,SIN_SEÑALES,29.508788,27.0,4324
2001,SIN_SEÑALES,30.634282,28.0,15851
2001,GRAVE,31.572614,28.0,241
2002,SIN_SEÑALES,26.960178,24.0,6278
2002,ALARMA,12.0,12.0,1
2002,GRAVE,21.928571,19.0,14
2003,SIN_SEÑALES,28.947719,27.0,2850
2003,GRAVE,38.0,30.0,15
2004,SIN_SEÑALES,28.863269,26.0,7928
2004,GRAVE,23.794118,19.0,34


In [9]:
# final look:
ByYear_stats.reset_index(drop=False,inplace=True)
ByYear_stats.head(20)

Unnamed: 0,ano,enfermedad,edad_mean,edad_median,case_sum
0,2000,SIN_SEÑALES,29.508788,27.0,4324
1,2001,SIN_SEÑALES,30.634282,28.0,15851
2,2001,GRAVE,31.572614,28.0,241
3,2002,SIN_SEÑALES,26.960178,24.0,6278
4,2002,ALARMA,12.0,12.0,1
5,2002,GRAVE,21.928571,19.0,14
6,2003,SIN_SEÑALES,28.947719,27.0,2850
7,2003,GRAVE,38.0,30.0,15
8,2004,SIN_SEÑALES,28.863269,26.0,7928
9,2004,GRAVE,23.794118,19.0,34
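
If the hierarchy is not needed at all, pandas' named aggregation may be a shorter route to the same flat result. This is only a sketch on made-up data (`toy` and `flatStats` are hypothetical names), where each output column is declared as `newName=(column, function)`:

```python
import pandas as pd

# made-up rows, only for illustration
toy = pd.DataFrame({'ano': [2000, 2000, 2001, 2001],
                    'edad': [10, 30, 20, 40],
                    'case': [1, 1, 1, 1]})

# each keyword becomes a flat column name: no MultiIndex to flatten later
flatStats = (toy.groupby('ano')
                .agg(edad_mean=('edad', 'mean'),
                     edad_median=('edad', 'median'),
                     case_sum=('case', 'sum'))
                .reset_index())

print(flatStats.columns.to_list())
```

The dictionary style used above is still useful when the list of statistics is built programmatically.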


Notice a particular data type:

In [10]:
ByYear_stats.enfermedad.dtype

CategoricalDtype(categories=['SIN_SEÑALES', 'ALARMA', 'GRAVE'], ordered=True, categories_dtype=object)

Saving to CSV will erase that _dtype_ attribute (the ordered categories). To keep it, use the pickle file format:

In [11]:
ByYear_stats.to_pickle('dataFiles/ByYear_stats.pkl') # this can be read in R.
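
To see the difference between both formats, here is a small round trip on a made-up ordinal column (the `toy` data frame and the temporary file are assumptions for the example, not files from this lesson):

```python
import io
import os
import tempfile
import pandas as pd

levels = ['SIN_SEÑALES', 'ALARMA', 'GRAVE']
toy = pd.DataFrame({'enfermedad': pd.Categorical(['GRAVE', 'ALARMA'],
                                                 categories=levels,
                                                 ordered=True)})

# CSV round trip (in memory): the categories and their order are not stored
fromCSV = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# pickle round trip (temporary file): the dtype travels with the data
with tempfile.TemporaryDirectory() as folder:
    pklPath = os.path.join(folder, 'toy.pkl')
    toy.to_pickle(pklPath)
    fromPKL = pd.read_pickle(pklPath)

print(fromCSV.enfermedad.dtype)
print(fromPKL.enfermedad.dtype)
```

After reading a CSV you would have to rebuild the ordinal variable yourself, listing the categories in the right order again.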

## Reshaping

Notice the statistics are spread across three columns: **edad_mean**, **edad_median** and **case_sum**. We could reshape those columns into a long format:

In [12]:
theVarsAsIndex=['ano','enfermedad']

# stacking  and resetting index
ByYear_LongStats=ByYear_stats.set_index(theVarsAsIndex).stack().reset_index()

#result
ByYear_LongStats

Unnamed: 0,ano,enfermedad,level_2,0
0,2000,SIN_SEÑALES,edad_mean,29.508788
1,2000,SIN_SEÑALES,edad_median,27.000000
2,2000,SIN_SEÑALES,case_sum,4324.000000
3,2001,SIN_SEÑALES,edad_mean,30.634282
4,2001,SIN_SEÑALES,edad_median,28.000000
...,...,...,...,...
172,2022,ALARMA,edad_median,25.000000
173,2022,ALARMA,case_sum,7370.000000
174,2022,GRAVE,edad_mean,35.146226
175,2022,GRAVE,edad_median,32.000000


In [13]:
# just renaming
ByYear_LongStats.rename(columns={'level_2':'statsName',0:'statsValue'},inplace=True)
ByYear_LongStats

Unnamed: 0,ano,enfermedad,statsName,statsValue
0,2000,SIN_SEÑALES,edad_mean,29.508788
1,2000,SIN_SEÑALES,edad_median,27.000000
2,2000,SIN_SEÑALES,case_sum,4324.000000
3,2001,SIN_SEÑALES,edad_mean,30.634282
4,2001,SIN_SEÑALES,edad_median,28.000000
...,...,...,...,...
172,2022,ALARMA,edad_median,25.000000
173,2022,ALARMA,case_sum,7370.000000
174,2022,GRAVE,edad_mean,35.146226
175,2022,GRAVE,edad_median,32.000000
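
The same wide-to-long reshaping can probably be written with `melt`, which names the two new columns in one step. A minimal sketch on made-up values (`toy` and `toyLong` are hypothetical names):

```python
import pandas as pd

# made-up wide table, only for illustration
toy = pd.DataFrame({'ano': [2000, 2001],
                    'edad_mean': [29.5, 30.6],
                    'case_sum': [4324, 15851]})

# id_vars stay as they are; every other column becomes a name/value pair
toyLong = toy.melt(id_vars=['ano'],
                   var_name='statsName',
                   value_name='statsValue')
print(toyLong)
```

One difference to keep in mind: `melt` lists the rows statistic by statistic, while `stack` keeps the rows of each original row together.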


In [14]:
# still ordinal?
ByYear_LongStats.enfermedad.dtype

CategoricalDtype(categories=['SIN_SEÑALES', 'ALARMA', 'GRAVE'], ordered=True, categories_dtype=object)

In [15]:
ByYear_LongStats.to_pickle('dataFiles/ByYear_LongStats.pkl')

# Dengue by Location (Region vs Province)

## Aggregating

We can redo the previous process, adding _departamento_ and _provincia_: 

In [16]:
indexList=['ano','departamento','provincia','enfermedad']
aggregator={'case':['sum']}
ByYearPlace=dengue.groupby(indexList,observed=True).agg(aggregator)
ByYearPlace

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,case
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,sum
ano,departamento,provincia,enfermedad,Unnamed: 4_level_2
2000,AMAZONAS,BAGUA,SIN_SEÑALES,215
2000,AMAZONAS,UTCUBAMBA,SIN_SEÑALES,58
2000,CAJAMARCA,CUTERVO,SIN_SEÑALES,2
2000,CAJAMARCA,JAEN,SIN_SEÑALES,16
2000,HUANUCO,LEONCIO PRADO,SIN_SEÑALES,29
...,...,...,...,...
2022,UCAYALI,PADRE ABAD,SIN_SEÑALES,412
2022,UCAYALI,PADRE ABAD,ALARMA,87
2022,UCAYALI,PADRE ABAD,GRAVE,2
2022,UCAYALI,PURUS,SIN_SEÑALES,1


Before flattening the output data frame in long format, you could create a wide shape:

## Long to wide

In [17]:
#simply
ByYearPlace.unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,case,case,case
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,sum,sum,sum
Unnamed: 0_level_2,Unnamed: 1_level_2,enfermedad,SIN_SEÑALES,ALARMA,GRAVE
ano,departamento,provincia,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3
2000,AMAZONAS,BAGUA,215.0,,
2000,AMAZONAS,UTCUBAMBA,58.0,,
2000,CAJAMARCA,CUTERVO,2.0,,
2000,CAJAMARCA,JAEN,16.0,,
2000,HUANUCO,LEONCIO PRADO,29.0,,
...,...,...,...,...,...
2022,TUMBES,ZARUMILLA,89.0,5.0,
2022,UCAYALI,ATALAYA,542.0,92.0,2.0
2022,UCAYALI,CORONEL PORTILLO,2680.0,499.0,23.0
2022,UCAYALI,PADRE ABAD,412.0,87.0,2.0


In [18]:
# a more familiar look
ByYearPlace_wide=ByYearPlace.unstack().reset_index()
ByYearPlace_wide

Unnamed: 0_level_0,ano,departamento,provincia,case,case,case
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,sum,sum,sum
enfermedad,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,SIN_SEÑALES,ALARMA,GRAVE
0,2000,AMAZONAS,BAGUA,215.0,,
1,2000,AMAZONAS,UTCUBAMBA,58.0,,
2,2000,CAJAMARCA,CUTERVO,2.0,,
3,2000,CAJAMARCA,JAEN,16.0,,
4,2000,HUANUCO,LEONCIO PRADO,29.0,,
...,...,...,...,...,...,...
1305,2022,TUMBES,ZARUMILLA,89.0,5.0,
1306,2022,UCAYALI,ATALAYA,542.0,92.0,2.0
1307,2022,UCAYALI,CORONEL PORTILLO,2680.0,499.0,23.0
1308,2022,UCAYALI,PADRE ABAD,412.0,87.0,2.0


In [19]:
# zero instead of missing
ByYearPlace_wide.fillna(0,inplace=True)
ByYearPlace_wide

Unnamed: 0_level_0,ano,departamento,provincia,case,case,case
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,sum,sum,sum
enfermedad,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,SIN_SEÑALES,ALARMA,GRAVE
0,2000,AMAZONAS,BAGUA,215.0,0.0,0.0
1,2000,AMAZONAS,UTCUBAMBA,58.0,0.0,0.0
2,2000,CAJAMARCA,CUTERVO,2.0,0.0,0.0
3,2000,CAJAMARCA,JAEN,16.0,0.0,0.0
4,2000,HUANUCO,LEONCIO PRADO,29.0,0.0,0.0
...,...,...,...,...,...,...
1305,2022,TUMBES,ZARUMILLA,89.0,5.0,0.0
1306,2022,UCAYALI,ATALAYA,542.0,92.0,2.0
1307,2022,UCAYALI,CORONEL PORTILLO,2680.0,499.0,23.0
1308,2022,UCAYALI,PADRE ABAD,412.0,87.0,2.0
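
The missing values appear because some combinations were never observed. As an alternative sketch (made-up data; `toy`, `counts` and `toyWide` are hypothetical names), `unstack` accepts a `fill_value`, so the gaps can be filled while reshaping instead of in a later `fillna` step:

```python
import pandas as pd

# made-up cases: JAEN has no ALARMA case
toy = pd.DataFrame({'provincia': ['BAGUA', 'BAGUA', 'JAEN'],
                    'enfermedad': ['SIN_SEÑALES', 'ALARMA', 'SIN_SEÑALES'],
                    'case': [1, 1, 1]})

counts = toy.groupby(['provincia', 'enfermedad'])['case'].sum()

# unobserved combinations get 0 instead of NaN
toyWide = counts.unstack(fill_value=0)
print(toyWide)
```

Whether zero is the right replacement is a decision about the data: here a missing combination does mean "no cases reported".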


In [20]:
# as you may have expected: a MultiIndex
ByYearPlace_wide.columns

MultiIndex([(         'ano',    '',            ''),
            ('departamento',    '',            ''),
            (   'provincia',    '',            ''),
            (        'case', 'sum', 'SIN_SEÑALES'),
            (        'case', 'sum',      'ALARMA'),
            (        'case', 'sum',       'GRAVE')],
           names=[None, None, 'enfermedad'])

In [21]:
#prepare
["_".join(names) if names[1]!='' else names[0] for names in ByYearPlace_wide.columns]

['ano',
 'departamento',
 'provincia',
 'case_sum_SIN_SEÑALES',
 'case_sum_ALARMA',
 'case_sum_GRAVE']

In [22]:
# change
ByYearPlace_wide.columns=["_".join(names) if names[1]!='' else names[0] for names in ByYearPlace_wide.columns]
ByYearPlace_wide

Unnamed: 0,ano,departamento,provincia,case_sum_SIN_SEÑALES,case_sum_ALARMA,case_sum_GRAVE
0,2000,AMAZONAS,BAGUA,215.0,0.0,0.0
1,2000,AMAZONAS,UTCUBAMBA,58.0,0.0,0.0
2,2000,CAJAMARCA,CUTERVO,2.0,0.0,0.0
3,2000,CAJAMARCA,JAEN,16.0,0.0,0.0
4,2000,HUANUCO,LEONCIO PRADO,29.0,0.0,0.0
...,...,...,...,...,...,...
1305,2022,TUMBES,ZARUMILLA,89.0,5.0,0.0
1306,2022,UCAYALI,ATALAYA,542.0,92.0,2.0
1307,2022,UCAYALI,CORONEL PORTILLO,2680.0,499.0,23.0
1308,2022,UCAYALI,PADRE ABAD,412.0,87.0,2.0


What about finding the _provincia_ with the most _ALARMA_ cases in each _departamento_, year by year?

In [23]:
where = ByYearPlace_wide.groupby(['ano','departamento'])['case_sum_ALARMA'].idxmax()
worst_prov_year = ByYearPlace_wide.loc[where].reset_index(drop=True)
worst_prov_year

Unnamed: 0,ano,departamento,provincia,case_sum_SIN_SEÑALES,case_sum_ALARMA,case_sum_GRAVE
0,2000,AMAZONAS,BAGUA,215.0,0.0,0.0
1,2000,CAJAMARCA,CUTERVO,2.0,0.0,0.0
2,2000,HUANUCO,LEONCIO PRADO,29.0,0.0,0.0
3,2000,JUNIN,CHANCHAMAYO,4.0,0.0,0.0
4,2000,LA LIBERTAD,TRUJILLO,894.0,0.0,0.0
...,...,...,...,...,...,...
366,2022,PIURA,PIURA,3471.0,667.0,27.0
367,2022,PUNO,CARABAYA,25.0,0.0,0.0
368,2022,SAN MARTIN,SAN MARTIN,770.0,350.0,6.0
369,2022,TUMBES,TUMBES,515.0,28.0,0.0
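
Notice that several of the rows above have zero _ALARMA_ cases. The sketch below, on made-up data (`toy` and its column `alarma` are assumptions), shows why: when every value in a group is tied, `idxmax` still returns a row, namely the first one of that group.

```python
import pandas as pd

# made-up data: group A is a tie at zero, group B has a clear maximum
toy = pd.DataFrame({'departamento': ['A', 'A', 'B', 'B'],
                    'provincia': ['a1', 'a2', 'b1', 'b2'],
                    'alarma': [0.0, 0.0, 3.0, 7.0]})

# row label of the maximum within each group
where = toy.groupby('departamento')['alarma'].idxmax()
worst = toy.loc[where].reset_index(drop=True)
print(worst)
```

So a "winner" with zero cases is not really the most affected place, and it makes sense to keep only the rows with values above zero.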


In [24]:
# worst provinces
len(worst_prov_year.provincia.value_counts())

59

In [25]:
# worst >0
len(worst_prov_year[worst_prov_year.case_sum_ALARMA>0].provincia.value_counts())

43

## Filtering

Let's filter some rows based on what we just computed:

In [26]:
worst_ProvYear_alarma=worst_prov_year[worst_prov_year.case_sum_ALARMA>0].loc[:,['departamento','provincia']]
worst_ProvYear_alarma.reset_index(drop=True,inplace=True)
worst_ProvYear_alarma

Unnamed: 0,departamento,provincia
0,LORETO,MAYNAS
1,JUNIN,SATIPO
2,LORETO,MAYNAS
3,MADRE DE DIOS,TAMBOPATA
4,PIURA,PIURA
...,...,...
198,PASCO,OXAPAMPA
199,PIURA,PIURA
200,SAN MARTIN,SAN MARTIN
201,TUMBES,TUMBES


In [27]:
# adding a column of ones
worst_ProvYear_alarma['case']=1
worst_ProvYear_alarma

Unnamed: 0,departamento,provincia,case
0,LORETO,MAYNAS,1
1,JUNIN,SATIPO,1
2,LORETO,MAYNAS,1
3,MADRE DE DIOS,TAMBOPATA,1
4,PIURA,PIURA,1
...,...,...,...
198,PASCO,OXAPAMPA,1
199,PIURA,PIURA,1
200,SAN MARTIN,SAN MARTIN,1
201,TUMBES,TUMBES,1


## Frequency table

With the filtered data, let's create a frequency table:

In [28]:
indexList=['departamento','provincia']
aggregator={'case':['sum']}
worst_ProvYear_alarma_Frequency=worst_ProvYear_alarma.groupby(indexList,observed=True).agg(aggregator)
worst_ProvYear_alarma_Frequency

Unnamed: 0_level_0,Unnamed: 1_level_0,case
Unnamed: 0_level_1,Unnamed: 1_level_1,sum
departamento,provincia,Unnamed: 2_level_2
AMAZONAS,BAGUA,6
AMAZONAS,UTCUBAMBA,6
ANCASH,CASMA,5
ANCASH,SANTA,3
AREQUIPA,AREQUIPA,1
AYACUCHO,LA MAR,7
AYACUCHO,SUCRE,1
CAJAMARCA,CAJAMARCA,1
CAJAMARCA,JAEN,10
CALLAO,CALLAO,2


In [30]:
# we get a long format
worst_ProvYear_alarma_Frequency.reset_index()

Unnamed: 0_level_0,departamento,provincia,case
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,sum
0,AMAZONAS,BAGUA,6
1,AMAZONAS,UTCUBAMBA,6
2,ANCASH,CASMA,5
3,ANCASH,SANTA,3
4,AREQUIPA,AREQUIPA,1
5,AYACUCHO,LA MAR,7
6,AYACUCHO,SUCRE,1
7,CAJAMARCA,CAJAMARCA,1
8,CAJAMARCA,JAEN,10
9,CALLAO,CALLAO,2


In [31]:
# final look
worst_ProvYear_alarma_Frequency.columns=['case']
worst_ProvYear_alarma_Frequency.reset_index(inplace=True)
worst_ProvYear_alarma_Frequency

Unnamed: 0,departamento,provincia,case
0,AMAZONAS,BAGUA,6
1,AMAZONAS,UTCUBAMBA,6
2,ANCASH,CASMA,5
3,ANCASH,SANTA,3
4,AREQUIPA,AREQUIPA,1
5,AYACUCHO,LA MAR,7
6,AYACUCHO,SUCRE,1
7,CAJAMARCA,CAJAMARCA,1
8,CAJAMARCA,JAEN,10
9,CALLAO,CALLAO,2
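
For comparison, pandas also offers `value_counts` and `crosstab` for this kind of table. The sketch below uses made-up rows (`toy`, `freq` and `twoWay` are hypothetical names): the first result is the long frequency table, the second is a two-way table with one variable in the rows and the other in the columns.

```python
import pandas as pd

# made-up rows, one per "worst province" occurrence
toy = pd.DataFrame({'departamento': ['LORETO', 'LORETO', 'PIURA'],
                    'provincia': ['MAYNAS', 'MAYNAS', 'PIURA']})

# long frequency table, most frequent combination first
freq = toy.value_counts().reset_index(name='case')

# two-way table: departamento in rows, provincia in columns
twoWay = pd.crosstab(toy.departamento, toy.provincia)

print(freq)
print(twoWay)
```

With nested variables such as these, the two-way table is mostly zeros, so the long format is usually easier to read.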


Saving the frequencies in a file:

In [None]:
worst_ProvYear_alarma_Frequency.to_csv('dataFiles/worst_ProvYear_alarma_Frequency.csv',index=False)

# Dengue by Location and Year

## Aggregating

Let's check a previous data frame:

In [32]:
ByYearPlace_wide

Unnamed: 0,ano,departamento,provincia,case_sum_SIN_SEÑALES,case_sum_ALARMA,case_sum_GRAVE
0,2000,AMAZONAS,BAGUA,215.0,0.0,0.0
1,2000,AMAZONAS,UTCUBAMBA,58.0,0.0,0.0
2,2000,CAJAMARCA,CUTERVO,2.0,0.0,0.0
3,2000,CAJAMARCA,JAEN,16.0,0.0,0.0
4,2000,HUANUCO,LEONCIO PRADO,29.0,0.0,0.0
...,...,...,...,...,...,...
1305,2022,TUMBES,ZARUMILLA,89.0,5.0,0.0
1306,2022,UCAYALI,ATALAYA,542.0,92.0,2.0
1307,2022,UCAYALI,CORONEL PORTILLO,2680.0,499.0,23.0
1308,2022,UCAYALI,PADRE ABAD,412.0,87.0,2.0


This time, I want two variables:

In [33]:
indexList=['ano','departamento']
aggregator={'case_sum_SIN_SEÑALES':['sum'],'case_sum_ALARMA':['sum']}
ByYearPlace=ByYearPlace_wide.groupby(indexList,observed=True).agg(aggregator)
ByYearPlace.columns=['sum_SIN_SEÑALES','sum_ALARMA']
ByYearPlace.reset_index(inplace=True)
ByYearPlace

Unnamed: 0,ano,departamento,sum_SIN_SEÑALES,sum_ALARMA
0,2000,AMAZONAS,273.0,0.0
1,2000,CAJAMARCA,18.0,0.0
2,2000,HUANUCO,29.0,0.0
3,2000,JUNIN,7.0,0.0
4,2000,LA LIBERTAD,894.0,0.0
...,...,...,...,...
366,2022,PIURA,9296.0,1361.0
367,2022,PUNO,25.0,0.0
368,2022,SAN MARTIN,3229.0,907.0
369,2022,TUMBES,656.0,36.0


## Creating information

I will create a new variable:

In [34]:
ByYearPlace['rateAlarma']=(ByYearPlace['sum_ALARMA']/ByYearPlace['sum_SIN_SEÑALES'])
ByYearPlace['rateAlarma'].describe()

count    371.000000
mean            inf
std             NaN
min        0.000000
25%        0.000000
50%        0.018216
75%        0.134195
max             inf
Name: rateAlarma, dtype: float64

We got _inf_ values:

In [35]:
import numpy as np #identify with numpy
ByYearPlace[np.isinf(ByYearPlace.rateAlarma)]

Unnamed: 0,ano,departamento,sum_SIN_SEÑALES,sum_ALARMA,rateAlarma
186,2013,AYACUCHO,0.0,1.0,inf


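We need to make a decision. I assigned a rate of 1 to that case:

In [36]:
# set the infinite rate to 1 (found by condition, not by row number)
ByYearPlace.loc[np.isinf(ByYearPlace.rateAlarma),'rateAlarma']=1
ByYearPlace.drop(columns=['sum_SIN_SEÑALES','sum_ALARMA'],inplace=True)
ByYearPlace['rateAlarma'].describe()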

count    371.000000
mean       0.111113
std        0.202358
min        0.000000
25%        0.000000
50%        0.018216
75%        0.134195
max        1.583333
Name: rateAlarma, dtype: float64
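
A different decision would be to treat the undefined rate as missing instead of giving it a value. This is only a sketch on made-up numbers (`toy`, `rate` and `rateClean` are hypothetical names), not what was done above:

```python
import numpy as np
import pandas as pd

# made-up sums: the second row has a zero denominator
toy = pd.DataFrame({'sum_SIN_SEÑALES': [10.0, 0.0, 4.0],
                    'sum_ALARMA': [1.0, 1.0, 0.0]})

rate = toy['sum_ALARMA'] / toy['sum_SIN_SEÑALES']

# infinite values become missing values
rateClean = rate.replace([np.inf, -np.inf], np.nan)
print(rateClean.describe())
```

With this choice `describe()` simply skips the case, at the cost of having one observation less.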

### Discretizing

Sometimes you need a numerical variable as an ordinal variable: 

In [37]:
edges=[-1, .1, .25, .5,.75,1,2]
theLabels=["less10%","10-25%","25-50%","50-75%","75-100%","above100%"]
ByYearPlace["rateAlarma.cut"]=pd.cut(ByYearPlace['rateAlarma'], include_lowest=True,
                                     bins=edges,
                                     labels=theLabels,
                                     ordered=True)

In [38]:
# we have
ByYearPlace

Unnamed: 0,ano,departamento,rateAlarma,rateAlarma.cut
0,2000,AMAZONAS,0.000000,less10%
1,2000,CAJAMARCA,0.000000,less10%
2,2000,HUANUCO,0.000000,less10%
3,2000,JUNIN,0.000000,less10%
4,2000,LA LIBERTAD,0.000000,less10%
...,...,...,...,...
366,2022,PIURA,0.146407,10-25%
367,2022,PUNO,0.000000,less10%
368,2022,SAN MARTIN,0.280892,25-50%
369,2022,TUMBES,0.054878,less10%
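
It helps to know where the values that fall exactly on an edge go. By default `pd.cut` builds intervals that are open on the left and closed on the right, as this sketch with made-up rates and labels suggests:

```python
import pandas as pd

# made-up rates; 0.10 and 0.25 sit exactly on an edge
rates = pd.Series([0.0, 0.10, 0.11, 0.25, 1.5])
edges = [-1, .1, .25, 2]
labels = ['low', 'mid', 'high']   # made-up labels

# intervals are (-1, 0.1], (0.1, 0.25] and (0.25, 2]
binned = pd.cut(rates, bins=edges, labels=labels, ordered=True)
print(binned.to_list())
```

That is why a first edge below zero is convenient: a rate of exactly zero still falls inside the first interval.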


We could check the yearly behavior:

In [None]:
ByYearPlace.groupby('ano').describe()

Let's do some **filtering**:

In [None]:
ByYearPlace=ByYearPlace[ByYearPlace.ano>=2012]
ByYearPlace.reset_index(drop=True,inplace=True)
ByYearPlace

In [None]:
ByYearPlace.info()

In [None]:
# the categorical column should be exported as pickle

ByYearPlace.to_pickle("dataFiles/ByYearPlace.pkl")

# World Fragility Data

## Concatenating


Let's visit this website: https://fundforpeace.org/what-we-do/country-risk-and-fragility-data/

There, you will find several Excel files with the _Fragile States Index_ per year. Please create the folder **fragility** inside the folder **dataFiles**, and download the Excel files from 2006 to 2023 into it.

In [46]:
# Import libraries
import os
import glob
import pandas as pd

path = os.path.join('dataFiles','fragility','*.xlsx') # xlsx files in the folder
excel_files_names = glob.glob(path) # file names using Python's glob

# see the file names
excel_files_names


['dataFiles\\fragility\\fsi-2006.xlsx',
 'dataFiles\\fragility\\fsi-2007.xlsx',
 'dataFiles\\fragility\\fsi-2008.xlsx',
 'dataFiles\\fragility\\fsi-2009.xlsx',
 'dataFiles\\fragility\\fsi-2010.xlsx',
 'dataFiles\\fragility\\fsi-2011.xlsx',
 'dataFiles\\fragility\\fsi-2012.xlsx',
 'dataFiles\\fragility\\fsi-2013.xlsx',
 'dataFiles\\fragility\\fsi-2014.xlsx',
 'dataFiles\\fragility\\fsi-2015.xlsx',
 'dataFiles\\fragility\\fsi-2016.xlsx',
 'dataFiles\\fragility\\fsi-2017.xlsx',
 'dataFiles\\fragility\\fsi-2018.xlsx',
 'dataFiles\\fragility\\fsi-2019.xlsx',
 'dataFiles\\fragility\\fsi-2020.xlsx',
 'dataFiles\\fragility\\fsi-2021.xlsx',
 'dataFiles\\fragility\\fsi-2022-download.xlsx',
 'dataFiles\\fragility\\FSI-2023-DOWNLOAD.xlsx']

Let's open each file (make sure you have previously installed **openpyxl**):

In [47]:
allDFs=[] # all data frames (one per XLSX file) will be here!

for fileName in excel_files_names:
    currentFile=pd.read_excel(fileName)
    allDFs.append(currentFile)

In [48]:
# amount of rows and columns:
for df,year in zip(allDFs,range(2006,2024)):
    print(df.shape,year)

(146, 16) 2006
(177, 16) 2007
(177, 16) 2008
(177, 16) 2009
(177, 16) 2010
(177, 16) 2011
(178, 16) 2012
(178, 16) 2013
(178, 16) 2014
(178, 16) 2015
(178, 16) 2016
(178, 16) 2017
(178, 16) 2018
(178, 17) 2019
(178, 17) 2020
(179, 16) 2021
(179, 16) 2022
(179, 16) 2023


In [53]:
# dropping the first file (2006, the one with 146 rows)
allDFs_sub=allDFs[1:]

Putting all the dataframes column names into a list:

In [54]:
allColumnNames=[] # I will write every column 
for df in allDFs_sub:
    allColumnNames.append(set(df.columns))# list of sets!

# this is what we have
allColumnNames

[{'C1: Security Apparatus',
  'C2: Factionalized Elites',
  'C3: Group Grievance',
  'Country',
  'E1: Economy',
  'E2: Economic Inequality',
  'E3: Human Flight and Brain Drain',
  'P1: State Legitimacy',
  'P2: Public Services',
  'P3: Human Rights',
  'Rank',
  'S1: Demographic Pressures',
  'S2: Refugees and IDPs',
  'Total',
  'X1: External Intervention',
  'Year'},
 {'C1: Security Apparatus',
  'C2: Factionalized Elites',
  'C3: Group Grievance',
  'Country',
  'E1: Economy',
  'E2: Economic Inequality',
  'E3: Human Flight and Brain Drain',
  'P1: State Legitimacy',
  'P2: Public Services',
  'P3: Human Rights',
  'Rank',
  'S1: Demographic Pressures',
  'S2: Refugees and IDPs',
  'Total',
  'X1: External Intervention',
  'Year'},
 {'C1: Security Apparatus',
  'C2: Factionalized Elites',
  'C3: Group Grievance',
  'Country',
  'E1: Economy',
  'E2: Economic Inequality',
  'E3: Human Flight and Brain Drain',
  'P1: State Legitimacy',
  'P2: Public Services',
  'P3: Human Rights',

In [55]:
# common columns
commonColumns=set.intersection(*allColumnNames) # expanding list of sets (*)
commonColumns

{'C1: Security Apparatus',
 'C2: Factionalized Elites',
 'C3: Group Grievance',
 'Country',
 'E1: Economy',
 'E2: Economic Inequality',
 'E3: Human Flight and Brain Drain',
 'P1: State Legitimacy',
 'P2: Public Services',
 'P3: Human Rights',
 'Rank',
 'S1: Demographic Pressures',
 'S2: Refugees and IDPs',
 'Total',
 'X1: External Intervention',
 'Year'}

In [56]:
commonColumns.symmetric_difference(set.union(*allColumnNames))

{'Change from Previous Year'}
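
The result says which column is not shared, but not which files have it. A small sketch with made-up column sets (the dictionary `namesByYear` is an assumption) that keeps track of the year too:

```python
# made-up column names per year, only for illustration
namesByYear = {2019: {'Country', 'Total', 'Change from Previous Year'},
               2021: {'Country', 'Total'}}

# columns present in every year
common = set.intersection(*namesByYear.values())

# for each year, the columns that are not common (years without extras are skipped)
extras = {year: names - common
          for year, names in namesByYear.items()
          if names - common}
print(extras)
```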

In [57]:
allDFs_sameNames=[] # list of DFs (2007-2023), all with the same columns
colnamesSorted=sorted(list(commonColumns)) # columns names sorted - must turn 'set' into 'list'

# making list of DFs
for df in allDFs_sub:
    allDFs_sameNames.append(df.loc[:,colnamesSorted]) 

# here it is
allDFs_sameNames

[     C1: Security Apparatus  C2: Factionalized Elites  C3: Group Grievance  \
 0                       9.9                       9.7                 10.0   
 1                      10.0                       9.8                 10.0   
 2                      10.0                      10.0                  8.5   
 3                       9.5                       9.0                  8.8   
 4                       9.6                       9.7                  9.5   
 ..                      ...                       ...                  ...   
 172                     1.0                       1.0                  2.1   
 173                     1.0                       1.0                  1.0   
 174                     0.9                       1.0                  1.0   
 175                     0.9                       0.7                  1.0   
 176                     1.0                       1.0                  1.0   
 
          Country  E1: Economy  E2: Economic Inequ

In [70]:
# concatenating
allDFsConcat=pd.concat(allDFs_sameNames,ignore_index=True) # appending DFs using 'concat()'

#done!... see it:
allDFsConcat

Unnamed: 0,C1: Security Apparatus,C2: Factionalized Elites,C3: Group Grievance,Country,E1: Economy,E2: Economic Inequality,E3: Human Flight and Brain Drain,P1: State Legitimacy,P2: Public Services,P3: Human Rights,Rank,S1: Demographic Pressures,S2: Refugees and IDPs,Total,X1: External Intervention,Year
0,9.9,9.7,10.0,Sudan,7.7,9.1,9.0,10.0,9.5,10.0,1st,9.2,9.8,113.7,9.8,2007-01-01 00:00:00
1,10.0,9.8,10.0,Iraq,8.0,8.5,9.5,9.4,8.5,9.7,2nd,9.0,9.0,111.4,10.0,2007-01-01 00:00:00
2,10.0,10.0,8.5,Somalia,9.2,7.5,8.0,10.0,10.0,9.7,3rd,9.2,9.0,111.1,10.0,2007-01-01 00:00:00
3,9.5,9.0,8.8,Zimbabwe,10.0,9.5,9.1,9.5,9.6,9.7,4th,9.7,8.7,110.1,7.0,2007-01-01 00:00:00
4,9.6,9.7,9.5,Chad,8.3,9.0,7.9,9.5,9.1,9.2,5th,9.1,8.9,108.8,9.0,2007-01-01 00:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3019,1.4,1.0,2.1,Switzerland,1.6,2.4,1.0,0.3,1.6,0.4,175th,2.4,3.2,17.8,0.4,2023
3020,1.6,1.4,2.0,New Zealand,2.6,2.6,1.6,0.5,1.1,0.5,176th,1.1,1.2,16.7,0.5,2023
3021,2.0,1.4,0.3,Finland,2.7,1.6,1.5,0.4,1.0,0.5,177th,1.7,1.9,16.0,1.0,2023
3022,0.4,1.8,0.5,Iceland,2.6,1.5,1.6,0.4,0.9,0.4,178th,1.5,1.5,15.7,2.6,2023
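
As a side note, `pd.concat` can find the common columns by itself when asked for an inner join. This sketch uses two made-up data frames (`df2019` and `df2021` are hypothetical) where only one has the extra column:

```python
import pandas as pd

# made-up data frames: only the first one has the extra column
df2019 = pd.DataFrame({'Country': ['A'], 'Total': [50.0],
                       'Change from Previous Year': [-1.0]})
df2021 = pd.DataFrame({'Country': ['A'], 'Total': [48.0]})

# join='inner' keeps only the columns present in every data frame
both = pd.concat([df2019, df2021], join='inner', ignore_index=True)
print(both.columns.to_list())
```

Computing the intersection explicitly, as done above, has the advantage of showing which columns are about to be lost before concatenating.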


In [71]:
allDFsConcat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3024 entries, 0 to 3023
Data columns (total 16 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   C1: Security Apparatus            3024 non-null   float64
 1   C2: Factionalized Elites          3024 non-null   float64
 2   C3: Group Grievance               3024 non-null   float64
 3   Country                           3024 non-null   object 
 4   E1: Economy                       3024 non-null   float64
 5   E2: Economic Inequality           3024 non-null   float64
 6   E3: Human Flight and Brain Drain  3024 non-null   float64
 7   P1: State Legitimacy              3024 non-null   float64
 8   P2: Public Services               3024 non-null   float64
 9   P3: Human Rights                  3024 non-null   float64
 10  Rank                              3024 non-null   object 
 11  S1: Demographic Pressures         3024 non-null   float64
 12  S2: Re

In [72]:
# value_counts can be used in object type
allDFsConcat.Year.value_counts()

Year
2023                   179
2022-01-01 00:00:00    179
2021                   179
2016-01-01 00:00:00    178
2020-01-01 00:00:00    178
2019-01-01 00:00:00    178
2018-01-01 00:00:00    178
2017-01-01 00:00:00    178
2015-01-01 00:00:00    178
2014-01-01 00:00:00    178
2013-01-01 00:00:00    178
2012-01-01 00:00:00    178
2008-01-01 00:00:00    177
2011-01-01 00:00:00    177
2010-01-01 00:00:00    177
2009-01-01 00:00:00    177
2007-01-01 00:00:00    177
Name: count, dtype: int64

In [73]:
# keeping just the year value
yearAsNumber=[]
for y in allDFsConcat.Year:
    try:
        yearAsNumber.append(y.year)# the value from a date-time format
    except AttributeError: # not a datetime: it has no 'year' attribute
        yearAsNumber.append(y)

#verifying
pd.Series(yearAsNumber).value_counts()

2023    179
2022    179
2021    179
2016    178
2020    178
2019    178
2018    178
2017    178
2015    178
2014    178
2013    178
2012    178
2008    177
2011    177
2010    177
2009    177
2007    177
Name: count, dtype: int64
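
The same idea fits in one line with `getattr`, which returns the attribute if it exists and a default otherwise. A sketch on made-up values (`mixedYears` is a hypothetical list mixing datetimes and integers):

```python
import datetime as dt

# made-up column values: some are datetimes, some are plain integers
mixedYears = [dt.datetime(2007, 1, 1), 2021, dt.datetime(2022, 1, 1), 2023]

# take .year when the value has it, otherwise keep the value as it is
yearAsNumber = [getattr(y, 'year', y) for y in mixedYears]
print(yearAsNumber)
```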

In [74]:
# overwriting the year column
allDFsConcat['Year']=yearAsNumber

In [75]:
# current order
allDFsConcat.columns.to_list()

['C1: Security Apparatus',
 'C2: Factionalized Elites',
 'C3: Group Grievance',
 'Country',
 'E1: Economy',
 'E2: Economic Inequality',
 'E3: Human Flight and Brain Drain',
 'P1: State Legitimacy',
 'P2: Public Services',
 'P3: Human Rights',
 'Rank',
 'S1: Demographic Pressures',
 'S2: Refugees and IDPs',
 'Total',
 'X1: External Intervention',
 'Year']

In [76]:
# this is a trick: setting columns as index
allDFsConcat.set_index(['Country','Year','Total'],inplace=True)
allDFsConcat.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,C1: Security Apparatus,C2: Factionalized Elites,C3: Group Grievance,E1: Economy,E2: Economic Inequality,E3: Human Flight and Brain Drain,P1: State Legitimacy,P2: Public Services,P3: Human Rights,Rank,S1: Demographic Pressures,S2: Refugees and IDPs,X1: External Intervention
Country,Year,Total,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
Sudan,2007,113.7,9.9,9.7,10.0,7.7,9.1,9.0,10.0,9.5,10.0,1st,9.2,9.8,9.8
Iraq,2007,111.4,10.0,9.8,10.0,8.0,8.5,9.5,9.4,8.5,9.7,2nd,9.0,9.0,10.0
Somalia,2007,111.1,10.0,10.0,8.5,9.2,7.5,8.0,10.0,10.0,9.7,3rd,9.2,9.0,10.0
Zimbabwe,2007,110.1,9.5,9.0,8.8,10.0,9.5,9.1,9.5,9.6,9.7,4th,9.7,8.7,7.0
Chad,2007,108.8,9.6,9.7,9.5,8.3,9.0,7.9,9.5,9.1,9.2,5th,9.1,8.9,9.0


Reordering columns:

In [77]:
# dropping unneeded column
allDFsConcat.drop(columns='Rank',inplace=True)

In [78]:
# indexes will be columns
allDFsConcat.reset_index(drop=False,inplace=True)

# see
allDFsConcat.head()

Unnamed: 0,Country,Year,Total,C1: Security Apparatus,C2: Factionalized Elites,C3: Group Grievance,E1: Economy,E2: Economic Inequality,E3: Human Flight and Brain Drain,P1: State Legitimacy,P2: Public Services,P3: Human Rights,S1: Demographic Pressures,S2: Refugees and IDPs,X1: External Intervention
0,Sudan,2007,113.7,9.9,9.7,10.0,7.7,9.1,9.0,10.0,9.5,10.0,9.2,9.8,9.8
1,Iraq,2007,111.4,10.0,9.8,10.0,8.0,8.5,9.5,9.4,8.5,9.7,9.0,9.0,10.0
2,Somalia,2007,111.1,10.0,10.0,8.5,9.2,7.5,8.0,10.0,10.0,9.7,9.2,9.0,10.0
3,Zimbabwe,2007,110.1,9.5,9.0,8.8,10.0,9.5,9.1,9.5,9.6,9.7,9.7,8.7,7.0
4,Chad,2007,108.8,9.6,9.7,9.5,8.3,9.0,7.9,9.5,9.1,9.2,9.1,8.9,9.0


In [79]:
# better ?
allDFsConcat.columns.to_list()

['Country',
 'Year',
 'Total',
 'C1: Security Apparatus',
 'C2: Factionalized Elites',
 'C3: Group Grievance',
 'E1: Economy',
 'E2: Economic Inequality',
 'E3: Human Flight and Brain Drain',
 'P1: State Legitimacy',
 'P2: Public Services',
 'P3: Human Rights',
 'S1: Demographic Pressures',
 'S2: Refugees and IDPs',
 'X1: External Intervention']

In [80]:
# clean column names
allDFsConcat.columns=allDFsConcat.columns.str.replace(r':\s',"_",regex=True) # raw strings for regex
allDFsConcat.columns=allDFsConcat.columns.str.replace(r'\s',"",regex=True)
#see
allDFsConcat.columns.to_list()

['Country',
 'Year',
 'Total',
 'C1_SecurityApparatus',
 'C2_FactionalizedElites',
 'C3_GroupGrievance',
 'E1_Economy',
 'E2_EconomicInequality',
 'E3_HumanFlightandBrainDrain',
 'P1_StateLegitimacy',
 'P2_PublicServices',
 'P3_HumanRights',
 'S1_DemographicPressures',
 'S2_RefugeesandIDPs',
 'X1_ExternalIntervention']

In [81]:
# overwriting country
allDFsConcat['Country']=allDFsConcat.Country.str.upper()
allDFsConcat["Country"]=allDFsConcat.Country.str.strip()

## Reshaping after concatenation

We can find some problems that were created during the concatenation:

In [82]:
# seeing long shape
fragileLong=allDFsConcat.iloc[:,:3]
fragileLong

Unnamed: 0,Country,Year,Total
0,SUDAN,2007,113.7
1,IRAQ,2007,111.4
2,SOMALIA,2007,111.1
3,ZIMBABWE,2007,110.1
4,CHAD,2007,108.8
...,...,...,...
3019,SWITZERLAND,2023,17.8
3020,NEW ZEALAND,2023,16.7
3021,FINLAND,2023,16.0
3022,ICELAND,2023,15.7


In [83]:
# to wide
fragileWide=pd.pivot_table(fragileLong,
               values='Total', # values to use
               index=['Country'], # unit of analysis
               columns=['Year']) # the values for NEW column
# see wide
fragileWide.head()

Year,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
AFGHANISTAN,102.3,105.4,108.2,109.3,107.5,106.0,106.7,106.5,107.9,107.9,107.3,106.620768,105.0,102.901187,102.1,105.9,106.6
ALBANIA,70.5,69.7,70.0,67.1,66.1,66.1,65.2,63.6,61.9,61.2,60.5,60.079308,58.9,58.753811,59.0,56.7,56.8
ALGERIA,75.9,77.8,80.6,81.3,78.0,78.1,78.7,78.8,79.6,78.3,76.8,75.785052,75.4,74.575183,73.6,72.2,70.0
ANGOLA,84.9,83.8,85.0,83.7,84.6,85.1,87.1,87.4,87.9,90.5,91.1,89.440296,87.8,87.320039,89.0,88.1,86.9
ANTIGUA AND BARBUDA,65.7,64.1,62.8,60.9,59.9,58.9,58.0,59.0,57.8,56.2,54.8,55.611041,54.4,52.062352,54.9,54.2,53.8
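
A word of caution about `pivot_table`: if a Country/Year pair appears more than once, it does not complain, it aggregates (the mean, by default). The sketch below uses made-up values (`toy`, `wide` and `dups` are hypothetical names) to show the effect and a way to look for such repeated pairs beforehand:

```python
import pandas as pd

# made-up long data: country A appears twice in 2007
toy = pd.DataFrame({'Country': ['A', 'A', 'B'],
                    'Year': [2007, 2007, 2007],
                    'Total': [10.0, 20.0, 5.0]})

# the two values of A are silently averaged
wide = pd.pivot_table(toy, values='Total', index='Country', columns='Year')

# rows whose Country/Year pair is repeated
dups = toy[toy.duplicated(subset=['Country', 'Year'], keep=False)]

print(wide)
print(dups)
```

This matters whenever country names are recoded: two names merged into one could end up sharing a year.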


In [84]:
# missing values in long format
fragileLong[fragileLong.isna().any(axis=1)]

Unnamed: 0,Country,Year,Total


In [85]:
# what cells have missing values?
fragileWide[fragileWide.isna().any(axis=1)]

Year,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
CABO VERDE,,,,,,,,,,,,,,,64.2,61.4,
CAPE VERDE,81.1,80.7,78.5,77.2,75.8,74.7,73.7,74.1,73.5,71.5,70.1,68.0,66.6,64.778171,,,60.1
COTE D'IVOIRE,107.3,104.6,102.5,101.2,102.8,103.6,103.5,101.7,100.1,97.9,96.5,94.561519,92.1,89.722674,90.7,,87.1
CZECH REPUBLIC,42.1,42.1,42.6,41.5,42.4,39.5,39.9,39.4,37.4,40.8,40.1,39.047601,,35.741616,,,40.2
CZECHIA,,,,,,,,,,,,,37.6,,39.3,39.9,
CÔTE D'IVOIRE,,,,,,,,,,,,,,,,89.6,
ESWATINI,,,,,,,,,,,,,85.3,,82.5,,
ISRAEL,,,,,,,,,,,,,,,43.0,42.6,44.1
ISRAEL AND WEST BANK,79.6,83.6,84.6,84.6,84.4,82.2,80.8,79.5,79.4,79.7,78.9,78.53374,76.5,75.123972,,,
KYRGYZ REPUBLIC,88.2,88.8,89.1,88.4,91.8,87.4,85.7,83.9,82.2,81.1,80.3,78.634122,76.2,73.929364,,,75.6


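So we have a problem: the long table has no missing values, but the wide table does, because several countries appear under more than one name (for example, CZECH REPUBLIC and CZECHIA).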
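A minimal sketch of why this happens, using a toy table with invented values (the names `toyLong` and `toyWide` are not part of the notebook). The wide shape needs one cell per Country-Year pair, so every pair that never occurred in the long table shows up as `NaN`:

```python
import pandas as pd

# Toy long table (invented values): one country appears under two spellings
toyLong = pd.DataFrame({
    "Country": ["CZECH REPUBLIC", "CZECH REPUBLIC", "CZECHIA", "PERU", "PERU", "PERU"],
    "Year":    [2018, 2020, 2019, 2018, 2019, 2020],
    "Total":   [39.0, 35.7, 37.6, 70.1, 68.2, 67.6],
})

# the long format has no missing cells
missingInLong = int(toyLong.isna().sum().sum())

# the wide format creates one cell per Country-Year pair;
# pairs that never occurred are filled with NaN
toyWide = pd.pivot_table(toyLong, values="Total", index="Country", columns="Year")
missingInWide = int(toyWide.isna().sum().sum())

print(missingInLong, missingInWide)
print(toyWide)
```

Here the two spellings of the same country split its years across two rows, which is the pattern seen in the output above.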

In [87]:
# which countries have missing values?
fragileWide[fragileWide.isna().any(axis=1)].index

Index(['CABO VERDE', 'CAPE VERDE', 'COTE D'IVOIRE', 'CZECH REPUBLIC',
       'CZECHIA', 'CÔTE D'IVOIRE', 'ESWATINI', 'ISRAEL',
       'ISRAEL AND WEST BANK', 'KYRGYZ REPUBLIC', 'KYRGYZSTAN', 'MACEDONIA',
       'NORTH MACEDONIA', 'PALESTINE', 'SLOVAK REPUBLIC', 'SLOVAKIA',
       'SOUTH SUDAN', 'SWAZILAND'],
      dtype='object', name='Country')

In [88]:
# prepare changes as a dict (name to replace: name to keep):
changes={"CABO VERDE":"CAPE VERDE",
         "CÔTE D'IVOIRE":"COTE D'IVOIRE",
         "CZECHIA":"CZECH REPUBLIC",
         "SWAZILAND":"ESWATINI",
         "ISRAEL AND WEST BANK":"ISRAEL",
         "KYRGYZSTAN":"KYRGYZ REPUBLIC",
         "NORTH MACEDONIA":"MACEDONIA",
         "SLOVAKIA":"SLOVAK REPUBLIC"}

In [89]:
# make changes using 'replace' (assign the result back to the column):
allDFsConcat['Country']=allDFsConcat.Country.replace(to_replace=changes)
# recreate the long table:
fragileLong=allDFsConcat.iloc[:,:3]
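Recent pandas versions warn when `replace(..., inplace=True)` is called on a column picked with attribute access, because the change may not reach the original data frame. A small sketch of the assign-back pattern on toy data (the names `toy` and `toyChanges` are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"Country": ["CZECHIA", "PERU", "SWAZILAND"],
                    "Total": [37.6, 70.1, 85.3]})  # invented values
toyChanges = {"CZECHIA": "CZECH REPUBLIC", "SWAZILAND": "ESWATINI"}

# replace() returns a new Series; assigning it back updates the data frame.
# Values that are not keys of the dict (PERU) are left untouched.
toy["Country"] = toy["Country"].replace(toyChanges)

print(toy.Country.tolist())
```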

In [90]:
# to wide shape again
fragileWide=pd.pivot_table(fragileLong,
               values='Total',
               index=['Country'],
               columns=['Year']).\
            reset_index(drop=False).\
            rename_axis(index=None, columns=None)

# verify missing
fragileWide[fragileWide.isna().any(axis=1)] # remember you had an extra country

Unnamed: 0,Country,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
124,PALESTINE,,,,,,,,,,,,,,,86.0,85.6,87.9
150,SOUTH SUDAN,,,,,,108.4,110.6,112.9,114.5,113.8,113.9,113.357315,112.2,110.75219,109.4,108.4,108.5


In [91]:
# new subset: keep 2012 onwards (SOUTH SUDAN has no data before 2012)
allDFsConcat=allDFsConcat[allDFsConcat.Year>=2012].copy()

In [92]:
allDFsConcat

Unnamed: 0,Country,Year,Total,C1_SecurityApparatus,C2_FactionalizedElites,C3_GroupGrievance,E1_Economy,E2_EconomicInequality,E3_HumanFlightandBrainDrain,P1_StateLegitimacy,P2_PublicServices,P3_HumanRights,S1_DemographicPressures,S2_RefugeesandIDPs,X1_ExternalIntervention
885,SOMALIA,2012,114.9,10.0,9.8,9.6,9.7,8.1,8.6,9.9,9.8,9.9,9.8,10.0,9.8
886,CONGO DEMOCRATIC REPUBLIC,2012,111.2,9.7,9.5,9.3,8.8,8.9,7.4,9.5,9.2,9.7,9.9,9.7,9.6
887,SUDAN,2012,109.4,9.7,9.9,10.0,7.3,8.8,8.3,9.5,8.5,9.4,8.4,9.9,9.5
888,SOUTH SUDAN,2012,108.4,9.7,10.0,10.0,7.3,8.8,6.4,9.1,9.5,9.2,8.4,9.9,10.0
889,CHAD,2012,107.6,8.9,9.8,9.1,8.3,8.6,7.7,9.8,9.5,9.3,9.3,9.5,7.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3019,SWITZERLAND,2023,17.8,1.4,1.0,2.1,1.6,2.4,1.0,0.3,1.6,0.4,2.4,3.2,0.4
3020,NEW ZEALAND,2023,16.7,1.6,1.4,2.0,2.6,2.6,1.6,0.5,1.1,0.5,1.1,1.2,0.5
3021,FINLAND,2023,16.0,2.0,1.4,0.3,2.7,1.6,1.5,0.4,1.0,0.5,1.7,1.9,1.0
3022,ICELAND,2023,15.7,0.4,1.8,0.5,2.6,1.5,1.6,0.4,0.9,0.4,1.5,1.5,2.6


In [93]:
allDFsConcat.reset_index(drop=True, inplace=True)

In [94]:
allDFsConcat

Unnamed: 0,Country,Year,Total,C1_SecurityApparatus,C2_FactionalizedElites,C3_GroupGrievance,E1_Economy,E2_EconomicInequality,E3_HumanFlightandBrainDrain,P1_StateLegitimacy,P2_PublicServices,P3_HumanRights,S1_DemographicPressures,S2_RefugeesandIDPs,X1_ExternalIntervention
0,SOMALIA,2012,114.9,10.0,9.8,9.6,9.7,8.1,8.6,9.9,9.8,9.9,9.8,10.0,9.8
1,CONGO DEMOCRATIC REPUBLIC,2012,111.2,9.7,9.5,9.3,8.8,8.9,7.4,9.5,9.2,9.7,9.9,9.7,9.6
2,SUDAN,2012,109.4,9.7,9.9,10.0,7.3,8.8,8.3,9.5,8.5,9.4,8.4,9.9,9.5
3,SOUTH SUDAN,2012,108.4,9.7,10.0,10.0,7.3,8.8,6.4,9.1,9.5,9.2,8.4,9.9,10.0
4,CHAD,2012,107.6,8.9,9.8,9.1,8.3,8.6,7.7,9.8,9.5,9.3,9.3,9.5,7.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2134,SWITZERLAND,2023,17.8,1.4,1.0,2.1,1.6,2.4,1.0,0.3,1.6,0.4,2.4,3.2,0.4
2135,NEW ZEALAND,2023,16.7,1.6,1.4,2.0,2.6,2.6,1.6,0.5,1.1,0.5,1.1,1.2,0.5
2136,FINLAND,2023,16.0,2.0,1.4,0.3,2.7,1.6,1.5,0.4,1.0,0.5,1.7,1.9,1.0
2137,ICELAND,2023,15.7,0.4,1.8,0.5,2.6,1.5,1.6,0.4,0.9,0.4,1.5,1.5,2.6
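The two steps above (filter, then renumber) can be sketched on a toy frame. The names are illustrative; `.copy()` is added here on the assumption that the subset will be modified later, so it should be independent of the original:

```python
import pandas as pd

toy = pd.DataFrame({"Country": ["A", "B", "A", "B"],
                    "Year": [2011, 2011, 2012, 2012]})

# .copy() makes the subset an independent data frame;
# reset_index(drop=True) renumbers the rows from 0 and discards the old labels
toyRecent = toy[toy.Year >= 2012].copy().reset_index(drop=True)

print(toyRecent.index.tolist())
```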


In [99]:
# saving
allDFsConcat.to_csv(os.path.join("dataFiles","fragility","fragility2012_2023.csv"),index=False)

# Country Codes

## Merging

In [97]:
# make sure to install 'html5lib', 'beautifulsoup4' and 'lxml'

codesLink='https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes'

allTablesWiki=pd.read_html(codesLink, flavor='bs4')

In [98]:

allTablesWiki[0]

Unnamed: 0_level_0,ISO 3166[1] name[5],World Factbook[6] official state name[a],Sovereignty [6][7][8],ISO 3166-1[2],ISO 3166-1[2],ISO 3166-1[2],ISO 3166-2[3] subdivision codes link,TLD [9]
Unnamed: 0_level_1,ISO 3166[1] name[5],World Factbook[6] official state name[a],Sovereignty [6][7][8],A-2 [5],A-3 [5],Num. [5],ISO 3166-2[3] subdivision codes link,TLD [9]
0,Afghanistan,The Islamic Republic of Afghanistan,UN member,AF,AFG,004,ISO 3166-2:AF,.af
1,Åland Islands,Åland,Finland,AX,ALA,248,ISO 3166-2:AX,.ax
2,Albania,The Republic of Albania,UN member,AL,ALB,008,ISO 3166-2:AL,.al
3,Algeria,The People's Democratic Republic of Algeria,UN member,DZ,DZA,012,ISO 3166-2:DZ,.dz
4,American Samoa,The Territory of American Samoa,United States,AS,ASM,016,ISO 3166-2:AS,.as
...,...,...,...,...,...,...,...,...
266,Wallis and Futuna,The Territory of the Wallis and Futuna Islands,France,WF,WLF,876,ISO 3166-2:WF,.wf
267,Western Sahara [aj],The Sahrawi Arab Democratic Republic,Disputed [ak],EH,ESH,732,ISO 3166-2:EH,[al]
268,Yemen,The Republic of Yemen,UN member,YE,YEM,887,ISO 3166-2:YE,.ye
269,Zambia,The Republic of Zambia,UN member,ZM,ZMB,894,ISO 3166-2:ZM,.zm


In [100]:
# keep a copy of that table
countryCodes=allTablesWiki[0].copy()

In [101]:
# check names
countryCodes.columns

MultiIndex([(                     'ISO 3166[1] name[5]', ...),
            ('World Factbook[6] official state name[a]', ...),
            (                   'Sovereignty [6][7][8]', ...),
            (                           'ISO 3166-1[2]', ...),
            (                           'ISO 3166-1[2]', ...),
            (                           'ISO 3166-1[2]', ...),
            (    'ISO 3166-2[3] subdivision codes link', ...),
            (                                 'TLD [9]', ...)],
           )

In [102]:
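# keeping what is needed (name, alpha-2 and alpha-3 codes);
# .copy() gives an independent data frame that is safe to modify
countryCodes=countryCodes.iloc[:,[0,3,4]].copy()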

In [103]:
countryCodes.columns

MultiIndex([('ISO 3166[1] name[5]', 'ISO 3166[1] name[5]'),
            (      'ISO 3166-1[2]',             'A-2 [5]'),
            (      'ISO 3166-1[2]',             'A-3 [5]')],
           )

In [104]:
# brute-force renaming
newNames=["Country","iso2","iso3"]
countryCodes.columns=newNames
countryCodes

Unnamed: 0,Country,iso2,iso3
0,Afghanistan,AF,AFG
1,Åland Islands,AX,ALA
2,Albania,AL,ALB
3,Algeria,DZ,DZA
4,American Samoa,AS,ASM
...,...,...,...
266,Wallis and Futuna,WF,WLF
267,Western Sahara [aj],EH,ESH
268,Yemen,YE,YEM
269,Zambia,ZM,ZMB
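Assigning a plain list works because it replaces the two-level header with a single level. As a sketch on a toy table with a similar header shape (all names here are invented for illustration), another option is to keep only the lowest level with `get_level_values`:

```python
import pandas as pd

# Toy table with two header levels, similar in shape to the Wikipedia one
toyCols = pd.MultiIndex.from_tuples([("name", "name"),
                                     ("ISO 3166-1", "A-2"),
                                     ("ISO 3166-1", "A-3")])
toyCodes = pd.DataFrame([["Peru", "PE", "PER"], ["Chile", "CL", "CHL"]],
                        columns=toyCols)

# Option 1: read the names of the lowest header level
lastLevel = toyCodes.columns.get_level_values(-1).tolist()

# Option 2: assign a plain list of names, one per column
toyCodes.columns = ["Country", "iso2", "iso3"]

print(lastLevel)
print(toyCodes.columns.tolist())
```

The plain list is shorter when the scraped names carry footnote marks, as they do here.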


In [105]:
# remove accents and other non-ASCII symbols, then convert to upper case
countryCodes['Country']=countryCodes['Country'].str.normalize('NFKD').\
                        str.encode('ascii', errors='ignore').str.decode('utf-8').str.upper()
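The same chain can be tried on single strings with the standard library, which makes each step easier to inspect. This is a sketch; the helper name `toPlainUpper` is made up:

```python
import unicodedata

def toPlainUpper(text):
    # NFKD splits an accented letter into base letter + combining mark;
    # encoding to ASCII with errors='ignore' then drops the marks
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii").upper()

print(toPlainUpper("Côte d'Ivoire"))
print(toPlainUpper("Åland Islands"))
# symbols with no ASCII base, such as the dash below, are removed entirely
print(toPlainUpper("Burma – See Myanmar."))
```

Note that removed symbols leave their surrounding spaces behind, which explains the double spaces in some of the cleaned names.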

In [106]:
# check missing
countryCodes[countryCodes.isna().any(axis=1)]

Unnamed: 0,Country,iso2,iso3
164,NAMIBIA,,NAM


In [107]:
# easy fix
countryCodes.loc[countryCodes.Country=='NAMIBIA','iso2']="NA"

# something missing?
countryCodes[countryCodes.isna().any(axis=1)]

Unnamed: 0,Country,iso2,iso3
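A likely reason Namibia lost its code (an assumption, since the parsing internals are not shown here) is that pandas readers treat the text `NA` as a missing-value marker by default. The same may happen again when a saved CSV is read back. A small sketch with an in-memory CSV:

```python
import io
import pandas as pd

toyCsv = "Country,iso2,iso3\nNAMIBIA,NA,NAM\nPERU,PE,PER\n"

# default behaviour: the text 'NA' is read as a missing value
asDefault = pd.read_csv(io.StringIO(toyCsv))

# keep_default_na=False turns off the built-in list of missing-value markers
asText = pd.read_csv(io.StringIO(toyCsv), keep_default_na=False)

print(asDefault.iso2.isna().sum())
print(asText.iso2.tolist())
```

With `keep_default_na=False`, genuinely empty cells come back as empty strings, so this option suits tables where no real missing values are expected.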


In [108]:
# are these iso2 valid values?
[x for x in countryCodes.iso2 if len(x)>2]

['British Virgin Islands – See Virgin Islands (British).',
 'Burma – See Myanmar.',
 'Cape Verde – See Cabo Verde.',
 'Caribbean Netherlands – See Bonaire, Sint Eustatius and Saba.',
 'China, The Republic of – See Taiwan (Province of China).',
 "Democratic People's Republic of Korea – See Korea, The Democratic People's Republic of.",
 'Democratic Republic of the Congo – See Congo, The Democratic Republic of the.',
 'East Timor – See Timor-Leste.',
 'Great Britain – See United Kingdom, The.',
 "Ivory Coast – See Côte d'Ivoire.",
 'Jan Mayen – See Svalbard and Jan Mayen.',
 "North Korea – See Korea, The Democratic People's Republic of.",
 "People's Republic of China – See China.",
 'Republic of China – See Taiwan (Province of China).',
 'Republic of Korea – See Korea, The Republic of.',
 'Republic of the Congo – See Congo, The.',
 'Saba – See Bonaire, Sint Eustatius and Saba.',
 'Sahrawi Arab Democratic Republic – See Western Sahara.',
 'Sint Eustatius – See Bonaire, Sint Eustatius and S

In [109]:
# wrong rows

badValues=[x for x in countryCodes.iso2 if len(x)>2]

countryCodes[countryCodes.iso2.isin(badValues)]

Unnamed: 0,Country,iso2,iso3
33,BRITISH VIRGIN ISLANDS SEE VIRGIN ISLANDS (BR...,British Virgin Islands – See Virgin Islands (B...,British Virgin Islands – See Virgin Islands (B...
37,BURMA SEE MYANMAR.,Burma – See Myanmar.,Burma – See Myanmar.
43,CAPE VERDE SEE CABO VERDE.,Cape Verde – See Cabo Verde.,Cape Verde – See Cabo Verde.
44,"CARIBBEAN NETHERLANDS SEE BONAIRE, SINT EUSTA...","Caribbean Netherlands – See Bonaire, Sint Eust...","Caribbean Netherlands – See Bonaire, Sint Eust..."
50,"CHINA, THE REPUBLIC OF SEE TAIWAN (PROVINCE O...","China, The Republic of – See Taiwan (Province ...","China, The Republic of – See Taiwan (Province ..."
65,DEMOCRATIC PEOPLE'S REPUBLIC OF KOREA SEE KOR...,Democratic People's Republic of Korea – See Ko...,Democratic People's Republic of Korea – See Ko...
66,"DEMOCRATIC REPUBLIC OF THE CONGO SEE CONGO, T...","Democratic Republic of the Congo – See Congo, ...","Democratic Republic of the Congo – See Congo, ..."
71,EAST TIMOR SEE TIMOR-LESTE.,East Timor – See Timor-Leste.,East Timor – See Timor-Leste.
94,"GREAT BRITAIN SEE UNITED KINGDOM, THE.","Great Britain – See United Kingdom, The.","Great Britain – See United Kingdom, The."
120,IVORY COAST SEE COTE D'IVOIRE.,Ivory Coast – See Côte d'Ivoire.,Ivory Coast – See Côte d'Ivoire.


In [113]:
# dropping wrong rows
countryCodes=countryCodes[~countryCodes.iso2.isin(badValues)] # filtering

countryCodes.reset_index(drop=True,inplace=True) # needed when rows are dropped
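The filter can also be written without building a list of bad values, by testing the length of each code directly. A sketch on toy rows (names are illustrative), assuming every value is a string:

```python
import pandas as pd

toy = pd.DataFrame({"Country": ["PERU", "BURMA  SEE MYANMAR."],
                    "iso2": ["PE", "Burma – See Myanmar."]})

# a valid alpha-2 code has exactly two characters
validCode = toy.iso2.str.len() == 2

# keep the valid rows and renumber them
toyClean = toy[validCode].reset_index(drop=True)

print(toyClean)
```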

In [114]:
# how many countries in the fragility data?
allDFsConcat.Country.unique().shape

(179,)

In [115]:
# how many entries in the codes table?
countryCodes.Country.shape

(249,)

Let's use sets to find the country names that do not match between the two tables:

In [116]:
# only in countryCodes.Country NOT in allDFsConcat.Country
OnlyInCodes=set(countryCodes.Country)-set(allDFsConcat.Country)
OnlyInCodes

{'ALAND ISLANDS',
 'AMERICAN SAMOA',
 'ANDORRA',
 'ANGUILLA',
 'ANTARCTICA [B]',
 'ARUBA',
 'AUSTRALIA [C]',
 'BAHAMAS (THE)',
 'BERMUDA',
 'BOLIVIA (PLURINATIONAL STATE OF)',
 'BONAIRE  SINT EUSTATIUS  SABA',
 'BOUVET ISLAND',
 'BRITISH INDIAN OCEAN TERRITORY (THE)',
 'BRUNEI DARUSSALAM [F]',
 'CABO VERDE [G]',
 'CAYMAN ISLANDS (THE)',
 'CENTRAL AFRICAN REPUBLIC (THE)',
 'CHRISTMAS ISLAND',
 'COCOS (KEELING) ISLANDS (THE)',
 'COMOROS (THE)',
 'CONGO (THE DEMOCRATIC REPUBLIC OF THE)',
 'CONGO (THE) [H]',
 'COOK ISLANDS (THE)',
 "COTE D'IVOIRE [I]",
 'CURACAO',
 'CZECHIA [J]',
 'DOMINICA',
 'DOMINICAN REPUBLIC (THE)',
 'ESWATINI [K]',
 'FALKLAND ISLANDS (THE) [MALVINAS] [L]',
 'FAROE ISLANDS (THE)',
 'FRANCE [M]',
 'FRENCH GUIANA',
 'FRENCH POLYNESIA',
 'FRENCH SOUTHERN TERRITORIES (THE) [N]',
 'GAMBIA (THE)',
 'GIBRALTAR',
 'GREENLAND',
 'GUADELOUPE',
 'GUAM',
 'GUERNSEY',
 'GUINEA-BISSAU',
 'HEARD ISLAND AND MCDONALD ISLANDS',
 'HOLY SEE (THE) [O]',
 'HONG KONG',
 'IRAN (ISLAMIC REPUB

In [117]:
# only in allDFsConcat.Country NOT in countryCodes.Country
OnlyInConcat=set(allDFsConcat.Country)-set(countryCodes.Country)
OnlyInConcat

{'AUSTRALIA',
 'BAHAMAS',
 'BOLIVIA',
 'BRUNEI DARUSSALAM',
 'CAPE VERDE',
 'CENTRAL AFRICAN REPUBLIC',
 'COMOROS',
 'CONGO DEMOCRATIC REPUBLIC',
 'CONGO REPUBLIC',
 "COTE D'IVOIRE",
 'CZECH REPUBLIC',
 'DOMINICAN REPUBLIC',
 'ESWATINI',
 'FRANCE',
 'GAMBIA',
 'GUINEA BISSAU',
 'IRAN',
 'KYRGYZ REPUBLIC',
 'LAOS',
 'MACEDONIA',
 'MICRONESIA',
 'MOLDOVA',
 'MYANMAR',
 'NETHERLANDS',
 'NIGER',
 'NORTH KOREA',
 'PALESTINE',
 'PHILIPPINES',
 'RUSSIA',
 'SLOVAK REPUBLIC',
 'SOUTH KOREA',
 'SUDAN',
 'SYRIA',
 'TANZANIA',
 'TIMOR-LESTE',
 'TURKEY',
 'UNITED ARAB EMIRATES',
 'UNITED KINGDOM',
 'UNITED STATES',
 'VENEZUELA',
 'VIETNAM'}
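The set operations used above, shown on a few toy names (the variables `left` and `right` are made up for illustration):

```python
left = {"PERU", "CAPE VERDE", "TURKEY"}
right = {"PERU", "CABO VERDE [G]", "TURKIYE [AC]", "ARUBA"}

onlyLeft = left - right    # names that would find no partner in a merge
onlyRight = right - left   # names available only in the second table
inBoth = left & right      # names that already match exactly

print(sorted(onlyLeft))
print(sorted(onlyRight))
print(sorted(inBoth))
```

Only the names in `onlyLeft` need fixing before the merge. The names in `onlyRight` are the candidates to search for a match.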

## Fuzzy merging

We use the previous information to look for _possible_ matches (please install **thefuzz**):

In [118]:
from thefuzz import process as fz

[(f,fz.extractOne(f, OnlyInCodes)) for f in sorted(OnlyInConcat)]

[('AUSTRALIA', ('AUSTRALIA [C]', 95)),
 ('BAHAMAS', ('BAHAMAS (THE)', 90)),
 ('BOLIVIA', ('BOLIVIA (PLURINATIONAL STATE OF)', 90)),
 ('BRUNEI DARUSSALAM', ('BRUNEI DARUSSALAM [F]', 95)),
 ('CAPE VERDE', ('CABO VERDE [G]', 70)),
 ('CENTRAL AFRICAN REPUBLIC', ('CENTRAL AFRICAN REPUBLIC (THE)', 95)),
 ('COMOROS', ('COMOROS (THE)', 90)),
 ('CONGO DEMOCRATIC REPUBLIC', ('CONGO (THE DEMOCRATIC REPUBLIC OF THE)', 95)),
 ('CONGO REPUBLIC', ('DOMINICAN REPUBLIC (THE)', 86)),
 ("COTE D'IVOIRE", ("COTE D'IVOIRE [I]", 95)),
 ('CZECH REPUBLIC', ('DOMINICAN REPUBLIC (THE)', 86)),
 ('DOMINICAN REPUBLIC', ('DOMINICAN REPUBLIC (THE)', 95)),
 ('ESWATINI', ('ESWATINI [K]', 95)),
 ('FRANCE', ('FRANCE [M]', 90)),
 ('GAMBIA', ('GAMBIA (THE)', 90)),
 ('GUINEA BISSAU', ('GUINEA-BISSAU', 100)),
 ('IRAN', ('IRAN (ISLAMIC REPUBLIC OF)', 90)),
 ('KYRGYZ REPUBLIC', ('DOMINICAN REPUBLIC (THE)', 86)),
 ('LAOS', ('MACAO [S]', 64)),
 ('MACEDONIA', ('NORTH MACEDONIA [U]', 90)),
 ('MICRONESIA', ('MICRONESIA (FEDERATED S

In [121]:
# keeping only the good matches (score of 90 or more):

[(f,fz.extractOne(f, OnlyInCodes)) for f in sorted(OnlyInConcat)
 if fz.extractOne(f, OnlyInCodes)[1]>=90]

[('AUSTRALIA', ('AUSTRALIA [C]', 95)),
 ('BAHAMAS', ('BAHAMAS (THE)', 90)),
 ('BOLIVIA', ('BOLIVIA (PLURINATIONAL STATE OF)', 90)),
 ('BRUNEI DARUSSALAM', ('BRUNEI DARUSSALAM [F]', 95)),
 ('CENTRAL AFRICAN REPUBLIC', ('CENTRAL AFRICAN REPUBLIC (THE)', 95)),
 ('COMOROS', ('COMOROS (THE)', 90)),
 ('CONGO DEMOCRATIC REPUBLIC', ('CONGO (THE DEMOCRATIC REPUBLIC OF THE)', 95)),
 ("COTE D'IVOIRE", ("COTE D'IVOIRE [I]", 95)),
 ('DOMINICAN REPUBLIC', ('DOMINICAN REPUBLIC (THE)', 95)),
 ('ESWATINI', ('ESWATINI [K]', 95)),
 ('FRANCE', ('FRANCE [M]', 90)),
 ('GAMBIA', ('GAMBIA (THE)', 90)),
 ('GUINEA BISSAU', ('GUINEA-BISSAU', 100)),
 ('IRAN', ('IRAN (ISLAMIC REPUBLIC OF)', 90)),
 ('MACEDONIA', ('NORTH MACEDONIA [U]', 90)),
 ('MICRONESIA', ('MICRONESIA (FEDERATED STATES OF)', 90)),
 ('MOLDOVA', ('MOLDOVA (THE REPUBLIC OF)', 90)),
 ('MYANMAR', ('MYANMAR [T]', 95)),
 ('NETHERLANDS', ('NETHERLANDS, KINGDOM OF THE', 90)),
 ('NIGER', ('NIGER (THE)', 90)),
 ('PALESTINE', ('PALESTINE, STATE OF', 90)),
 (
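The cell above calls the matcher twice per name. A sketch of computing each best match once and then filtering follows. It uses the standard library's `difflib` instead of **thefuzz**, so the scores are NOT the ones shown above; `bestMatch` and the toy lists are made up for illustration:

```python
from difflib import SequenceMatcher

def bestMatch(name, candidates):
    # return (candidate, score from 0 to 100) with the highest similarity
    scored = [(c, round(100 * SequenceMatcher(None, name, c).ratio()))
              for c in candidates]
    return max(scored, key=lambda pair: pair[1])

toyCandidates = ["VIET NAM", "SLOVAKIA", "KYRGYZSTAN"]
toyNames = ["VIETNAM", "SLOVAK REPUBLIC"]

# compute each best match once...
matches = {n: bestMatch(n, toyCandidates) for n in toyNames}

# ...then keep only the confident ones, as a {found name: wanted name} dict
accepted = {cand: n for n, (cand, score) in matches.items() if score >= 90}

print(matches)
print(accepted)
```

Any threshold is a judgment call: a match below it may still be right, so the rejected pairs deserve a visual check.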

In [123]:
# prepare a dict of changes

changesInCodes1={fz.extractOne(f, OnlyInCodes)[0]:f 
                 for f in sorted(OnlyInConcat)
                 if fz.extractOne(f, OnlyInCodes)[1] >=90}
#the dict
changesInCodes1

{'AUSTRALIA [C]': 'AUSTRALIA',
 'BAHAMAS (THE)': 'BAHAMAS',
 'BOLIVIA (PLURINATIONAL STATE OF)': 'BOLIVIA',
 'BRUNEI DARUSSALAM [F]': 'BRUNEI DARUSSALAM',
 'CENTRAL AFRICAN REPUBLIC (THE)': 'CENTRAL AFRICAN REPUBLIC',
 'COMOROS (THE)': 'COMOROS',
 'CONGO (THE DEMOCRATIC REPUBLIC OF THE)': 'CONGO DEMOCRATIC REPUBLIC',
 "COTE D'IVOIRE [I]": "COTE D'IVOIRE",
 'DOMINICAN REPUBLIC (THE)': 'DOMINICAN REPUBLIC',
 'ESWATINI [K]': 'ESWATINI',
 'FRANCE [M]': 'FRANCE',
 'GAMBIA (THE)': 'GAMBIA',
 'GUINEA-BISSAU': 'GUINEA BISSAU',
 'IRAN (ISLAMIC REPUBLIC OF)': 'IRAN',
 'NORTH MACEDONIA [U]': 'MACEDONIA',
 'MICRONESIA (FEDERATED STATES OF)': 'MICRONESIA',
 'MOLDOVA (THE REPUBLIC OF)': 'MOLDOVA',
 'MYANMAR [T]': 'MYANMAR',
 'NETHERLANDS, KINGDOM OF THE': 'NETHERLANDS',
 'NIGER (THE)': 'NIGER',
 'PALESTINE, STATE OF': 'PALESTINE',
 'PHILIPPINES (THE)': 'PHILIPPINES',
 'RUSSIAN FEDERATION (THE) [W]': 'RUSSIA',
 'SUDAN (THE)': 'SUDAN',
 'SYRIAN ARAB REPUBLIC (THE) [Y]': 'SYRIA',
 'TANZANIA, THE UNITED

In [124]:
# make the changes (assign the result back to the column)
countryCodes['Country']=countryCodes.Country.replace(to_replace=changesInCodes1)

In [125]:
# second iteration

OnlyInCodes=set(countryCodes.Country)-set(allDFsConcat.Country)
OnlyInConcat=set(allDFsConcat.Country)-set(countryCodes.Country)

[(f,fz.extractOne(f, OnlyInCodes)) for f in sorted(OnlyInConcat)]

[('CAPE VERDE', ('CABO VERDE [G]', 70)),
 ('CONGO REPUBLIC', ("KOREA (THE DEMOCRATIC PEOPLE'S REPUBLIC OF) [P]", 86)),
 ('CZECH REPUBLIC', ("KOREA (THE DEMOCRATIC PEOPLE'S REPUBLIC OF) [P]", 86)),
 ('KYRGYZ REPUBLIC', ("KOREA (THE DEMOCRATIC PEOPLE'S REPUBLIC OF) [P]", 86)),
 ('LAOS', ('MACAO [S]', 64)),
 ('NORTH KOREA', ("KOREA (THE DEMOCRATIC PEOPLE'S REPUBLIC OF) [P]", 86)),
 ('SLOVAK REPUBLIC', ("KOREA (THE DEMOCRATIC PEOPLE'S REPUBLIC OF) [P]", 86)),
 ('SOUTH KOREA', ("KOREA (THE DEMOCRATIC PEOPLE'S REPUBLIC OF) [P]", 86)),
 ('TURKEY', ('TURKIYE [AC]', 75)),
 ('VIETNAM', ('VIET NAM [AG]', 77))]

Based on the last result, some changes need to be made manually:

In [127]:
# see the strings in countryCodes:

countryCodes[countryCodes.Country.str.contains('LAO|KOREA|CZECH|CONGO',regex=True)]

Unnamed: 0,Country,iso2,iso3
50,CONGO DEMOCRATIC REPUBLIC,CD,COD
51,CONGO (THE) [H],CG,COG
59,CZECHIA [J],CZ,CZE
118,KOREA (THE DEMOCRATIC PEOPLE'S REPUBLIC OF) [P],KP,PRK
119,KOREA (THE REPUBLIC OF) [Q],KR,KOR
122,LAO PEOPLE'S DEMOCRATIC REPUBLIC (THE) [R],LA,LAO


In [128]:
# manual changes

changesInCodes2={"KOREA (THE DEMOCRATIC PEOPLE'S REPUBLIC OF) [P]":'NORTH KOREA',
                 "KOREA (THE REPUBLIC OF) [Q]":"SOUTH KOREA",
                 "LAO PEOPLE'S DEMOCRATIC REPUBLIC (THE) [R]":"LAOS",
                 "CZECHIA [J]":'CZECH REPUBLIC',
                 "CONGO (THE) [H]":'CONGO REPUBLIC'}
countryCodes['Country']=countryCodes.Country.replace(to_replace=changesInCodes2)

Those changes now allow for a different result:

In [129]:
OnlyInCodes=set(countryCodes.Country)-set(allDFsConcat.Country)
OnlyInConcat=set(allDFsConcat.Country)-set(countryCodes.Country)

[(f,fz.extractOne(f, OnlyInCodes)) for f in sorted(OnlyInConcat)]

[('CAPE VERDE', ('CABO VERDE [G]', 70)),
 ('KYRGYZ REPUBLIC', ('KYRGYZSTAN', 68)),
 ('SLOVAK REPUBLIC', ('SLOVAKIA', 77)),
 ('TURKEY', ('TURKIYE [AC]', 75)),
 ('VIETNAM', ('VIET NAM [AG]', 77))]
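Part of the distance in these scores comes from the footnote marks that the scraped names still carry. An optional cleaning step, not part of the workflow above, could strip them before matching. A sketch with the standard library (`stripNotes` is a made-up helper):

```python
import re

def stripNotes(name):
    # drop a trailing footnote mark such as ' [G]' or ' [AC]'
    name = re.sub(r"\s*\[[A-Z]+\]\s*$", "", name)
    # drop a trailing ' (THE)'
    name = re.sub(r"\s*\(THE\)\s*$", "", name)
    return name

for raw in ["CABO VERDE [G]", "TURKIYE [AC]", "CONGO (THE) [H]", "BAHAMAS (THE)"]:
    print(raw, "->", stripNotes(raw))
```

Cleaning first would change which names differ and how they score, so the results would not be identical to the ones in this section.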

In [130]:
# every candidate above is the right match, so a low threshold accepts them all
changesInCodes3={fz.extractOne(f, OnlyInCodes)[0]:f 
                 for f in sorted(OnlyInConcat)
                 if fz.extractOne(f, OnlyInCodes)[1] >=52}
#dict of matches
changesInCodes3

{'CABO VERDE [G]': 'CAPE VERDE',
 'KYRGYZSTAN': 'KYRGYZ REPUBLIC',
 'SLOVAKIA': 'SLOVAK REPUBLIC',
 'TURKIYE [AC]': 'TURKEY',
 'VIET NAM [AG]': 'VIETNAM'}

In [131]:
# make the changes
countryCodes['Country']=countryCodes.Country.replace(to_replace=changesInCodes3)

In [132]:
# confirming

OnlyInConcat=set(allDFsConcat.Country)-set(countryCodes.Country)
OnlyInConcat

set()

Once every country in the fragility data has a match in the codes table, we are ready to merge:

In [134]:
fragilityCoded_2012_2023=allDFsConcat.merge(countryCodes,on='Country') # merge on Country
fragilityCoded_2012_2023

Unnamed: 0,Country,Year,Total,C1_SecurityApparatus,C2_FactionalizedElites,C3_GroupGrievance,E1_Economy,E2_EconomicInequality,E3_HumanFlightandBrainDrain,P1_StateLegitimacy,P2_PublicServices,P3_HumanRights,S1_DemographicPressures,S2_RefugeesandIDPs,X1_ExternalIntervention,iso2,iso3
0,SOMALIA,2012,114.9,10.0,9.8,9.6,9.7,8.1,8.6,9.9,9.8,9.9,9.8,10.0,9.8,SO,SOM
1,SOMALIA,2013,113.9,9.7,10.0,9.3,9.4,8.4,8.9,9.5,9.8,10.0,9.5,10.0,9.4,SO,SOM
2,SOMALIA,2014,112.6,9.4,10.0,9.3,9.1,8.7,8.9,9.1,9.6,9.8,9.5,10.0,9.2,SO,SOM
3,SOMALIA,2015,114.0,9.7,10.0,9.5,9.1,9.0,9.2,9.3,9.3,10.0,9.6,9.8,9.5,SO,SOM
4,SOMALIA,2016,114.0,9.7,10.0,9.4,9.0,9.3,9.5,9.5,9.0,9.7,9.7,9.7,9.5,SO,SOM
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2134,FINLAND,2022,15.1,2.2,1.4,0.3,2.6,1.3,1.5,0.3,1.3,0.3,2.0,1.6,0.3,FI,FIN
2135,FINLAND,2023,16.0,2.0,1.4,0.3,2.7,1.6,1.5,0.4,1.0,0.5,1.7,1.9,1.0,FI,FIN
2136,PALESTINE,2021,86.0,6.0,8.9,5.0,7.0,4.9,8.8,8.8,4.2,7.7,8.5,6.2,10.0,PS,PSE
2137,PALESTINE,2022,85.6,6.1,8.9,5.3,6.5,5.0,8.8,8.8,3.9,7.8,8.6,5.9,10.0,PS,PSE
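By default `merge` keeps only the rows found in both tables, so an unmatched country would disappear silently. A sketch of a more defensive merge on toy data (all names and values are invented), using the `how`, `indicator` and `validate` arguments of `merge`:

```python
import pandas as pd

toyScores = pd.DataFrame({"Country": ["PERU", "PERU", "CHILE", "ATLANTIS"],
                          "Year": [2022, 2023, 2023, 2023]})
toyCodes = pd.DataFrame({"Country": ["PERU", "CHILE", "ARUBA"],
                         "iso3": ["PER", "CHL", "ABW"]})

# how='left' keeps every score row; indicator=True adds a '_merge' column;
# validate='many_to_one' raises an error if a country is repeated in toyCodes
checked = toyScores.merge(toyCodes, on="Country", how="left",
                          indicator=True, validate="many_to_one")

# rows that found no code
unmatched = checked.loc[checked["_merge"] == "left_only", "Country"].tolist()

print(unmatched)
```

In this section the set difference was already empty before merging, so the extra check would only confirm that no row was lost.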


In [135]:
fragilityCoded_2012_2023.to_csv(os.path.join("dataFiles","fragility","fragilityCoded_2012_2023.csv"),index=False)