<center><img src="https://i.imgur.com/zRrFdsf.png" width="700"></center> 

_____

<a id='home'></a>

# Merging

Merging data sets requires the following considerations:

* Merging is done on two data frames.
* You need a column in each data frame that shares exactly the same values, and those values should be unique (no repeated keys). The column names need not be the same.
* By default, the merged table keeps only the rows whose key values match in both data frames; you can also request the unmatched values, which helps you detect whether extra cleaning is needed.
* Pandas jargon uses a **left** and a **right** data frame: **left**.merge(**right**).
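
As a minimal sketch of these points, the toy data frames below (made up for illustration, they are not the course data) have key columns with different names, so each key is named explicitly:

```python
import pandas as pd

# toy data frames (invented values, only for illustration)
left = pd.DataFrame({'name': ['PERU', 'CHILE', 'BOLIVIA'],
                     'co2': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'Country': ['PERU', 'CHILE', 'ECUADOR'],
                      'forest': [0.1, 0.2, 0.3]})

# the key columns are named differently, so we point to each one
both = left.merge(right, left_on='name', right_on='Country')

# by default, only keys present in both data frames survive
print(both)
```

BOLIVIA and ECUADOR are dropped because each appears in only one of the two key columns.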

At this stage, let me use other data frames we prepared previously:

In [2]:
import pandas as pd

co2Link='https://github.com/CienciaDeDatosEspacial/dataSets/raw/main/co2.csv'
forestLink='https://github.com/CienciaDeDatosEspacial/dataSets/raw/main/forestRev.csv'

co2=pd.read_csv(co2Link)
forest=pd.read_csv(forestLink)

Remember the number of rows of each DF:

In [3]:
co2.shape,forest.shape

((218, 4), (204, 3))

Also keep in mind the column names:

In [4]:
forest.columns,co2.columns

(Index(['Country', 'ForestRev_gdp', 'ForestRev_date'], dtype='object'),
 Index(['name', 'co2', 'co2_date', 'region'], dtype='object'))

Let me show you some merge approaches; for each one I will only show the shape (rows and columns) of the result:

1. You keep only what is common in both key columns:

This is the default. The final rows will be the ones where the key values in each data frame match exactly. As long as the key values are unique in each data frame, your count of rows will be at most the number of rows of the smaller data frame.

In [5]:
# how many resulting rows after inner merging
co2.merge(forest,how='inner',left_on='name',right_on='Country').shape

(197, 7)
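
The row-count bound for an inner merge relies on keys that are not repeated. A small sketch with invented data shows what happens when a key is duplicated, and how the `validate` argument of `merge` can be used to detect it:

```python
import pandas as pd

# 'PERU' appears twice in the left key column (toy data)
left = pd.DataFrame({'name': ['PERU', 'PERU', 'CHILE'],
                     'co2': [1.0, 1.5, 2.0]})
right = pd.DataFrame({'Country': ['PERU', 'CHILE'],
                      'forest': [0.1, 0.2]})

# each repeated key matches again, so the result has more rows
# than the smaller data frame
inner = left.merge(right, how='inner', left_on='name', right_on='Country')
print(inner.shape)

# validate= asks pandas to check the key relationship before merging
try:
    left.merge(right, left_on='name', right_on='Country',
               validate='one_to_one')
    uniqueKeys = True
except pd.errors.MergeError:
    uniqueKeys = False
print(uniqueKeys)
```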

2. You keep all the keys from one data frame:

The final rows will be all the rows from one data frame (here, the _left_ one). If a key value does not find a match, the key value is kept, but the columns coming from the other data frame will have missing values. In this case, your count of rows will be equal to the number of rows of the data frame to the left (again, as long as the keys on the right are not repeated). You can also use **right** so the same logic applies to the data frame to the right.



In [6]:
# how many resulting rows after left merging
co2.merge(forest,how='left',left_on='name',right_on='Country').shape

(218, 7)
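
To see where the missing values end up after a left merge, here is a short sketch with toy data (invented values): the unmatched left key is kept, and the columns coming from the right are empty for it.

```python
import pandas as pd

left = pd.DataFrame({'name': ['PERU', 'CHILE', 'BOLIVIA'],
                     'co2': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'Country': ['PERU', 'CHILE', 'ECUADOR'],
                      'forest': [0.1, 0.2, 0.3]})

kept = left.merge(right, how='left', left_on='name', right_on='Country')

# left rows without a partner have NaN in the right-hand columns
unmatched = kept.loc[kept['Country'].isna(), 'name'].tolist()
print(len(kept), unmatched)
```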

3. You keep all the rows from both data frames:

In this case you will obtain all possible rows: the matched values, and the unmatched values from both data frames. You will also generate missing values. In this case, your count of rows will be at least the number of rows of the data frame with the most rows.


In [7]:
# how many resulting rows after outer merging
co2.merge(forest,how='outer',left_on='name',right_on='Country').shape

(225, 7)

Why do the row counts differ? Let's see which key values appear in one data frame but not in the other:

In [8]:
set(co2.name)-set(forest.Country)

{'ANTARCTICA',
 'BERMUDA',
 'BRITISH VIRGIN ISLANDS',
 'COOK ISLANDS',
 'ERITREA',
 'FALKLAND ISLANDS (ISLAS MALVINAS)',
 'FRENCH POLYNESIA',
 'GIBRALTAR',
 'JERSEY',
 'KOREA, NORTH',
 'MONTSERRAT',
 'NEW CALEDONIA',
 'NIUE',
 'SAINT HELENA, ASCENSION, AND TRISTAN DA CUNHA',
 'SAINT PIERRE AND MIQUELON',
 'SOMALIA',
 'SOUTH AFRICA',
 'SYRIA',
 'TAIWAN',
 'VENEZUELA',
 'WAKE ISLAND'}

In [9]:
set(forest.Country)-set(co2.name)

{'ANDORRA',
 'CURACAO',
 'ISLE OF MAN',
 'LIECHTENSTEIN',
 'MONACO',
 'PALAU',
 'SAN MARINO'}
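
Set differences are one way to find the unmatched keys. As an alternative sketch (toy data, invented values), `merge` accepts `indicator=True`, which adds a `_merge` column telling where each row came from:

```python
import pandas as pd

left = pd.DataFrame({'name': ['PERU', 'CHILE', 'BOLIVIA'],
                     'co2': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'Country': ['PERU', 'CHILE', 'ECUADOR'],
                      'forest': [0.1, 0.2, 0.3]})

allRows = left.merge(right, how='outer', left_on='name',
                     right_on='Country', indicator=True)

# '_merge' takes the values 'both', 'left_only' or 'right_only'
counts = allRows['_merge'].value_counts()
onlyLeft = allRows.loc[allRows['_merge'] == 'left_only', 'name'].tolist()
onlyRight = allRows.loc[allRows['_merge'] == 'right_only', 'Country'].tolist()
print(counts)
print(onlyLeft, onlyRight)
```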

Apparently, the data is not available for every country. So, let's just continue:

In [10]:
# the default is inner merge
cia=co2.merge(forest,left_on='name',right_on='Country')
cia

Unnamed: 0,name,co2,co2_date,region,Country,ForestRev_gdp,ForestRev_date
0,CHINA,1.077325e+10,2019,EAST AND SOUTHEAST ASIA,CHINA,0.08,2018
1,UNITED STATES,5.144361e+09,2019,NORTH AMERICA,UNITED STATES,0.04,2018
2,INDIA,2.314738e+09,2019,SOUTH ASIA,INDIA,0.14,2018
3,RUSSIA,1.848070e+09,2019,CENTRAL ASIA,RUSSIA,0.29,2018
4,JAPAN,1.103234e+09,2019,EAST AND SOUTHEAST ASIA,JAPAN,0.02,2018
...,...,...,...,...,...,...,...
192,TONGA,1.710000e+05,2019,AUSTRALIA AND OCEANIA,TONGA,0.03,2018
193,KIRIBATI,7.600000e+04,2019,AUSTRALIA AND OCEANIA,KIRIBATI,0.04,2018
194,NAURU,6.600000e+04,2019,AUSTRALIA AND OCEANIA,NAURU,0.00,2018
195,NORTHERN MARIANA ISLANDS,0.000000e+00,2019,AUSTRALIA AND OCEANIA,NORTHERN MARIANA ISLANDS,0.00,2018


Let's bring back the data on fragility, but just for the year 2019:

In [11]:
import os

#read in:
FragilityAll=pd.read_csv(os.path.join("data","Fragility_corrected.csv"))

#subset
fragile2019=FragilityAll.loc[FragilityAll.Year==2019,:"Total"].copy()

# see
fragile2019

Unnamed: 0,Country,Year,Total
1068,YEMEN,2019,113.5
1069,SOMALIA,2019,112.3
1070,SOUTH SUDAN,2019,112.2
1071,SYRIA,2019,111.5
1072,CONGO DEMOCRATIC REPUBLIC,2019,110.2
...,...,...,...
1241,AUSTRALIA,2019,19.7
1242,DENMARK,2019,19.5
1243,SWITZERLAND,2019,18.7
1244,NORWAY,2019,18.0


We will practice **fuzzy merging** now, that is, matching key values that are similar but not identical.

In [12]:
# Countries in 'cia' but NOT in 'fragile2019' 
OnlyCia=set(cia.Country)-set(fragile2019.Country)
OnlyCia

{'AMERICAN SAMOA',
 'ARUBA',
 'BAHAMAS, THE',
 'BRUNEI',
 'BURMA',
 'CABO VERDE',
 'CAYMAN ISLANDS',
 'CONGO, DEMOCRATIC REPUBLIC OF THE',
 'CONGO, REPUBLIC OF THE',
 'CZECHIA',
 'DOMINICA',
 'FAROE ISLANDS',
 'GAMBIA, THE',
 'GAZA STRIP',
 'GREENLAND',
 'GUAM',
 'GUINEA-BISSAU',
 'HONG KONG',
 'KIRIBATI',
 'KOREA, SOUTH',
 'KOSOVO',
 'KYRGYZSTAN',
 'MACAU',
 'MARSHALL ISLANDS',
 'MICRONESIA, FEDERATED STATES OF',
 'NAURU',
 'NORTH MACEDONIA',
 'NORTHERN MARIANA ISLANDS',
 'PUERTO RICO',
 'SAINT KITTS AND NEVIS',
 'SAINT LUCIA',
 'SAINT VINCENT AND THE GRENADINES',
 'SLOVAKIA',
 'TONGA',
 'TURKEY (TURKIYE)',
 'TURKS AND CAICOS ISLANDS',
 'TUVALU',
 'VANUATU',
 'VIRGIN ISLANDS',
 'WEST BANK'}

In [13]:
# Countries in 'fragile2019' but NOT in 'cia' 
OnlyFragile=set(fragile2019.Country)-set(cia.Country)
OnlyFragile

{'BAHAMAS',
 'BRUNEI DARUSSALAM',
 'CAPE VERDE',
 'CONGO DEMOCRATIC REPUBLIC',
 'CONGO REPUBLIC',
 'CZECH REPUBLIC',
 'ERITREA',
 'GAMBIA',
 'GUINEA BISSAU',
 'KYRGYZ REPUBLIC',
 'MACEDONIA',
 'MICRONESIA',
 'MYANMAR',
 'NORTH KOREA',
 'SLOVAK REPUBLIC',
 'SOMALIA',
 'SOUTH AFRICA',
 'SOUTH KOREA',
 'SYRIA',
 'TURKEY',
 'VENEZUELA'}

Here, we should try to find which countries in _OnlyFragile_ may match the ones in _OnlyCia_. We need to use the **fuzzy merge** approach (please install **thefuzz** if not previously installed):

In [14]:
from thefuzz import process as fz

# take a country from OnlyFragile
# look for a country in OnlyCia and return the most similar
[(f,fz.extractOne(f, OnlyCia)) for f in sorted(OnlyFragile)]

[('BAHAMAS', ('BAHAMAS, THE', 90)),
 ('BRUNEI DARUSSALAM', ('BRUNEI', 90)),
 ('CAPE VERDE', ('CABO VERDE', 80)),
 ('CONGO DEMOCRATIC REPUBLIC', ('CONGO, DEMOCRATIC REPUBLIC OF THE', 95)),
 ('CONGO REPUBLIC', ('CONGO, REPUBLIC OF THE', 86)),
 ('CZECH REPUBLIC', ('CONGO, REPUBLIC OF THE', 86)),
 ('ERITREA', ('SAINT VINCENT AND THE GRENADINES', 51)),
 ('GAMBIA', ('GAMBIA, THE', 90)),
 ('GUINEA BISSAU', ('GUINEA-BISSAU', 100)),
 ('KYRGYZ REPUBLIC', ('CONGO, DEMOCRATIC REPUBLIC OF THE', 86)),
 ('MACEDONIA', ('NORTH MACEDONIA', 90)),
 ('MICRONESIA', ('MICRONESIA, FEDERATED STATES OF', 90)),
 ('MYANMAR', ('NORTHERN MARIANA ISLANDS', 51)),
 ('NORTH KOREA', ('KOREA, SOUTH', 78)),
 ('SLOVAK REPUBLIC', ('CONGO, DEMOCRATIC REPUBLIC OF THE', 86)),
 ('SOMALIA', ('SLOVAKIA', 67)),
 ('SOUTH AFRICA', ('KOREA, SOUTH', 66)),
 ('SOUTH KOREA', ('KOREA, SOUTH', 95)),
 ('SYRIA', ('NORTHERN MARIANA ISLANDS', 54)),
 ('TURKEY', ('TURKEY (TURKIYE)', 90)),
 ('VENEZUELA', ('VANUATU', 50))]

Above you have found _some_ good matches. Let's keep the best ones:

In [15]:
[(f,fz.extractOne(f, OnlyCia)) for f in sorted(OnlyFragile)
 if fz.extractOne(f, OnlyCia)[1]>=87]

[('BAHAMAS', ('BAHAMAS, THE', 90)),
 ('BRUNEI DARUSSALAM', ('BRUNEI', 90)),
 ('CONGO DEMOCRATIC REPUBLIC', ('CONGO, DEMOCRATIC REPUBLIC OF THE', 95)),
 ('GAMBIA', ('GAMBIA, THE', 90)),
 ('GUINEA BISSAU', ('GUINEA-BISSAU', 100)),
 ('MACEDONIA', ('NORTH MACEDONIA', 90)),
 ('MICRONESIA', ('MICRONESIA, FEDERATED STATES OF', 90)),
 ('SOUTH KOREA', ('KOREA, SOUTH', 95)),
 ('TURKEY', ('TURKEY (TURKIYE)', 90))]
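
To get an intuition of what a similarity score is, here is a rough stand-in built with Python's standard `difflib` module. This is **not** the scorer used by **thefuzz**, so the numbers will differ from the ones above; `similarity` and `best_match` are hypothetical helper names used only for this sketch.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 similarity score between two strings (stand-in scorer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def best_match(name, candidates):
    """Return the (candidate, score) pair with the highest score."""
    scored = ((c, similarity(name, c)) for c in candidates)
    return max(scored, key=lambda pair: pair[1])

candidates = ['GUINEA-BISSAU', 'CABO VERDE', 'VANUATU']
print(best_match('GUINEA BISSAU', candidates))
print(best_match('CAPE VERDE', candidates))
```

The idea is the same as with `extractOne`: every candidate gets a score, and the highest one is returned together with its score so you can decide whether to trust it.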

Once you have good matches, you have to create a dictionary like this:

In [16]:
changesFragile1={f:fz.extractOne(f, OnlyCia)[0] 
                 for f in sorted(OnlyFragile)
                 if fz.extractOne(f, OnlyCia)[1] >=87}
#dict of matches
changesFragile1

{'BAHAMAS': 'BAHAMAS, THE',
 'BRUNEI DARUSSALAM': 'BRUNEI',
 'CONGO DEMOCRATIC REPUBLIC': 'CONGO, DEMOCRATIC REPUBLIC OF THE',
 'GAMBIA': 'GAMBIA, THE',
 'GUINEA BISSAU': 'GUINEA-BISSAU',
 'MACEDONIA': 'NORTH MACEDONIA',
 'MICRONESIA': 'MICRONESIA, FEDERATED STATES OF',
 'SOUTH KOREA': 'KOREA, SOUTH',
 'TURKEY': 'TURKEY (TURKIYE)'}
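
The threshold matters. The sketch below uses a few pairs copied from the first attempt above and suggests why a slightly lower threshold would be risky: two different countries would be renamed to the same target. The helper `collisions` is a hypothetical name introduced only for this illustration.

```python
from collections import Counter

# a few (name, (best match, score)) pairs copied from the first attempt
firstTry = [
    ('CONGO DEMOCRATIC REPUBLIC', ('CONGO, DEMOCRATIC REPUBLIC OF THE', 95)),
    ('CONGO REPUBLIC', ('CONGO, REPUBLIC OF THE', 86)),
    ('CZECH REPUBLIC', ('CONGO, REPUBLIC OF THE', 86)),
    ('KYRGYZ REPUBLIC', ('CONGO, DEMOCRATIC REPUBLIC OF THE', 86)),
    ('GAMBIA', ('GAMBIA, THE', 90)),
]

def collisions(pairs, threshold):
    """Targets that two or more names would be renamed to at this threshold."""
    targets = Counter(match for _, (match, score) in pairs
                      if score >= threshold)
    return sorted(t for t, n in targets.items() if n > 1)

print(collisions(firstTry, 87))  # the threshold used above
print(collisions(firstTry, 86))  # one point lower
```

A check like this, run before the replacement, can save you from silently merging two countries into one.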

You can use that dict for the replacements:

In [17]:
# assign the result back so the change sticks in any pandas version
fragile2019['Country']=fragile2019.Country.replace(to_replace=changesFragile1)

Now the countries in fragile2019 have more matches. 
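
A small sketch of how `replace` behaves with a dict, using made-up rows: values not listed in the dict stay as they are. Here the result is assigned back to the column, which is one way to make sure the data frame really changes.

```python
import pandas as pd

# toy rows (invented values)
fragileToy = pd.DataFrame({'Country': ['BAHAMAS', 'GAMBIA', 'PERU'],
                           'Total': [50.0, 80.0, 70.0]})
changes = {'BAHAMAS': 'BAHAMAS, THE', 'GAMBIA': 'GAMBIA, THE'}

# only the keys of the dict are replaced; 'PERU' is untouched
fragileToy['Country'] = fragileToy['Country'].replace(changes)
print(fragileToy)
```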

This process can be done a few more times, and you can recover more rows for the merging process. Let's see:

In [18]:
# second try
OnlyCia=set(cia.Country)-set(fragile2019.Country)
OnlyFragile=set(fragile2019.Country)-set(cia.Country)
[(f,fz.extractOne(f, OnlyCia)) for f in sorted(OnlyFragile)]

[('CAPE VERDE', ('CABO VERDE', 80)),
 ('CONGO REPUBLIC', ('CONGO, REPUBLIC OF THE', 86)),
 ('CZECH REPUBLIC', ('CONGO, REPUBLIC OF THE', 86)),
 ('ERITREA', ('SAINT VINCENT AND THE GRENADINES', 51)),
 ('KYRGYZ REPUBLIC', ('CONGO, REPUBLIC OF THE', 66)),
 ('MYANMAR', ('NORTHERN MARIANA ISLANDS', 51)),
 ('NORTH KOREA', ('NORTHERN MARIANA ISLANDS', 62)),
 ('SLOVAK REPUBLIC', ('SLOVAKIA', 74)),
 ('SOMALIA', ('SLOVAKIA', 67)),
 ('SOUTH AFRICA', ('AMERICAN SAMOA', 59)),
 ('SYRIA', ('NORTHERN MARIANA ISLANDS', 54)),
 ('VENEZUELA', ('VANUATU', 50))]

In [19]:
# second dict of changes
# select a different threshold
changesFragile2={f:fz.extractOne(f, OnlyCia)[0] 
                 for f in sorted(OnlyFragile)
                 if 74<=fz.extractOne(f, OnlyCia)[1]<=80}

#dict of matches
changesFragile2

{'CAPE VERDE': 'CABO VERDE', 'SLOVAK REPUBLIC': 'SLOVAKIA'}

In [20]:
# add manually
changesFragile2.update({'CONGO REPUBLIC':'CONGO, REPUBLIC OF THE'})
changesFragile2

{'CAPE VERDE': 'CABO VERDE',
 'SLOVAK REPUBLIC': 'SLOVAKIA',
 'CONGO REPUBLIC': 'CONGO, REPUBLIC OF THE'}

In [21]:
# make the changes
fragile2019['Country']=fragile2019.Country.replace(to_replace=changesFragile2)

In [22]:
# third try
OnlyCia=set(cia.Country)-set(fragile2019.Country)
OnlyFragile=set(fragile2019.Country)-set(cia.Country)
[(f,fz.extractOne(f, OnlyCia)) for f in sorted(OnlyFragile)]

[('CZECH REPUBLIC', ('CZECHIA', 64)),
 ('ERITREA', ('SAINT VINCENT AND THE GRENADINES', 51)),
 ('KYRGYZ REPUBLIC', ('KYRGYZSTAN', 54)),
 ('MYANMAR', ('NORTHERN MARIANA ISLANDS', 51)),
 ('NORTH KOREA', ('NORTHERN MARIANA ISLANDS', 62)),
 ('SOMALIA', ('NORTHERN MARIANA ISLANDS', 61)),
 ('SOUTH AFRICA', ('AMERICAN SAMOA', 59)),
 ('SYRIA', ('NORTHERN MARIANA ISLANDS', 54)),
 ('VENEZUELA', ('VANUATU', 50))]

In [23]:
# third dict of changes
# new threshold
changesFragile3={f:fz.extractOne(f, OnlyCia)[0] 
                 for f in sorted(OnlyFragile)
                 if 64==fz.extractOne(f, OnlyCia)[1]}

changesFragile3.update({'KYRGYZ REPUBLIC':'KYRGYZSTAN'})
#dict of matches
changesFragile3

{'CZECH REPUBLIC': 'CZECHIA', 'KYRGYZ REPUBLIC': 'KYRGYZSTAN'}

In [24]:
# make changes
fragile2019['Country']=fragile2019.Country.replace(to_replace=changesFragile3)

# CIA uses a different name for this country
cia['Country']=cia.Country.replace(to_replace={'BURMA':'MYANMAR'})

In [25]:
# fourth try

OnlyCia=set(cia.Country)-set(fragile2019.Country)
OnlyFragile=set(fragile2019.Country)-set(cia.Country)
[(f,fz.extractOne(f, OnlyCia)) for f in sorted(OnlyFragile)]

[('ERITREA', ('SAINT VINCENT AND THE GRENADINES', 51)),
 ('NORTH KOREA', ('NORTHERN MARIANA ISLANDS', 62)),
 ('SOMALIA', ('NORTHERN MARIANA ISLANDS', 61)),
 ('SOUTH AFRICA', ('AMERICAN SAMOA', 59)),
 ('SYRIA', ('NORTHERN MARIANA ISLANDS', 54)),
 ('VENEZUELA', ('VANUATU', 50))]
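
Each attempt above repeats the same steps: compute what is left unmatched, propose candidates, keep the ones above a threshold. A possible way to wrap one pass in a function is sketched below. It uses `get_close_matches` from the standard `difflib` module as a stand-in for **thefuzz** (its cutoff goes from 0 to 1 and its scores are not comparable to the ones above), and `propose_changes` is a hypothetical name.

```python
from difflib import get_close_matches

def propose_changes(names, candidates, cutoff=0.85):
    """Map each name to its closest candidate, if any reaches the cutoff."""
    changes = {}
    for n in sorted(names):
        hit = get_close_matches(n, list(candidates), n=1, cutoff=cutoff)
        if hit:
            changes[n] = hit[0]
    return changes

onlyA = {'GUINEA BISSAU', 'VENEZUELA'}
onlyB = {'GUINEA-BISSAU', 'VANUATU'}
print(propose_changes(onlyA, onlyB))
```

You would still inspect the proposed dict before applying it, exactly as done in each attempt above.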

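The fourth attempt did not offer good results: the countries left in _OnlyFragile_ have no counterpart in _cia_ (they were among the countries missing from the forest data, so the inner merge dropped them). So we are ready: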

In [26]:
fragilecia=fragile2019.merge(cia) #merge on Country
fragilecia

Unnamed: 0,Country,Year,Total,name,co2,co2_date,region,ForestRev_gdp,ForestRev_date
0,YEMEN,2019,113.5,YEMEN,10158000.0,2019,MIDDLE EAST,0.04,2018
1,SOUTH SUDAN,2019,112.2,SOUTH SUDAN,1778000.0,2019,AFRICA,2.65,2015
2,"CONGO, DEMOCRATIC REPUBLIC OF THE",2019,110.2,"CONGO, DEMOCRATIC REPUBLIC OF THE",2653000.0,2019,AFRICA,8.72,2018
3,CENTRAL AFRICAN REPUBLIC,2019,108.9,CENTRAL AFRICAN REPUBLIC,285000.0,2019,AFRICA,8.99,2018
4,CHAD,2019,108.5,CHAD,1771000.0,2019,AFRICA,3.81,2018
...,...,...,...,...,...,...,...,...,...
167,AUSTRALIA,2019,19.7,AUSTRALIA,417870000.0,2019,AUSTRALIA AND OCEANIA,0.13,2018
168,DENMARK,2019,19.5,DENMARK,33850000.0,2019,EUROPE,0.02,2018
169,SWITZERLAND,2019,18.7,SWITZERLAND,38739000.0,2019,EUROPE,0.01,2018
170,NORWAY,2019,18.0,NORWAY,36731000.0,2019,EUROPE,0.05,2018


In [27]:
#checking:
fragilecia.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 172 entries, 0 to 171
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Country         172 non-null    object 
 1   Year            172 non-null    int64  
 2   Total           172 non-null    float64
 3   name            172 non-null    object 
 4   co2             172 non-null    float64
 5   co2_date        172 non-null    int64  
 6   region          172 non-null    object 
 7   ForestRev_gdp   172 non-null    float64
 8   ForestRev_date  172 non-null    int64  
dtypes: float64(3), int64(3), object(3)
memory usage: 13.4+ KB


Merging is a key process for producing analytics. So, it is always good to add some 'standard' information, such as ISO codes, to avoid the need for fuzzy merging. See this data table:


In [28]:
isoLink='https://github.com/CienciaDeDatosEspacial/dataSets/raw/main/isodata.csv'
isoCodes=pd.read_csv(isoLink)
isoCodes.head()

Unnamed: 0,Countryname,Officialstatename,InternetccTLD,iso2,iso3
0,Afghanistan,The Islamic Republic of Afghanistan,.af,AF,AFG
1,Åland Islands,Åland,.ax,AX,ALA
2,Albania,The Republic of Albania,.al,AL,ALB
3,Algeria,The People's Democratic Republic of Algeria,.dz,DZ,DZA
4,American Samoa,The Territory of American Samoa,.as,AS,ASM


We should add the **ISO** columns to our recent merged data frame:

In [29]:
# key columns are not spelled the same:
isoCodes.Countryname=isoCodes.Countryname.str.upper()
isoCodes.merge(fragilecia,left_on='Countryname',right_on='Country')

Unnamed: 0,Countryname,Officialstatename,InternetccTLD,iso2,iso3,Country,Year,Total,name,co2,co2_date,region,ForestRev_gdp,ForestRev_date
0,AFGHANISTAN,The Islamic Republic of Afghanistan,.af,AF,AFG,AFGHANISTAN,2019,105.0,AFGHANISTAN,7893000.0,2019,SOUTH ASIA,0.20,2018
1,ALBANIA,The Republic of Albania,.al,AL,ALB,ALBANIA,2019,58.9,ALBANIA,3794000.0,2019,EUROPE,0.18,2018
2,ALGERIA,The People's Democratic Republic of Algeria,.dz,DZ,DZA,ALGERIA,2019,75.4,ALGERIA,151633000.0,2019,AFRICA,0.10,2018
3,ANGOLA,The Republic of Angola,.ao,AO,AGO,ANGOLA,2019,87.8,ANGOLA,19362000.0,2019,AFRICA,0.36,2018
4,ANTIGUA AND BARBUDA,Antigua and Barbuda,.ag,AG,ATG,ANTIGUA AND BARBUDA,2019,54.4,ANTIGUA AND BARBUDA,729000.0,2019,CENTRAL AMERICA AND THE CARIBBEAN,0.00,2018
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
141,URUGUAY,The Oriental Republic of Uruguay,.uy,UY,URY,URUGUAY,2019,34.0,URUGUAY,6450000.0,2019,SOUTH AMERICA,1.56,2018
142,UZBEKISTAN,The Republic of Uzbekistan,.uz,UZ,UZB,UZBEKISTAN,2019,75.7,UZBEKISTAN,102965000.0,2019,CENTRAL ASIA,0.00,2018
143,YEMEN,The Republic of Yemen,.ye,YE,YEM,YEMEN,2019,113.5,YEMEN,10158000.0,2019,MIDDLE EAST,0.04,2018
144,ZAMBIA,The Republic of Zambia,.zm,ZM,ZMB,ZAMBIA,2019,85.7,ZAMBIA,6798000.0,2019,AFRICA,4.45,2018


We have lost several countries, so we redo the fuzzy merge:

In [30]:
onlyFrcia=set(fragilecia.Country)-set(isoCodes.Countryname)
onlyISO=set(isoCodes.Countryname)-set(fragilecia.Country)

[(f,fz.extractOne(f, onlyISO)) for f in sorted(onlyFrcia)]

[('BAHAMAS, THE', ('BAHAMAS (THE)', 100)),
 ('BOLIVIA', ('BOLIVIA (PLURINATIONAL STATE OF)', 90)),
 ('BRUNEI', ('BRUNEI DARUSSALAM', 90)),
 ('CENTRAL AFRICAN REPUBLIC', ('CENTRAL AFRICAN REPUBLIC (THE)', 95)),
 ('COMOROS', ('COMOROS (THE)', 90)),
 ('CONGO, DEMOCRATIC REPUBLIC OF THE',
  ('CONGO (THE DEMOCRATIC REPUBLIC OF THE)', 95)),
 ('CONGO, REPUBLIC OF THE', ('BRITISH INDIAN OCEAN TERRITORY (THE)', 86)),
 ("COTE D'IVOIRE", ("CÔTE D'IVOIRE", 96)),
 ('DOMINICAN REPUBLIC', ('DOMINICAN REPUBLIC (THE)', 95)),
 ('GAMBIA, THE', ('GAMBIA (THE)', 100)),
 ('IRAN', ('IRAN (ISLAMIC REPUBLIC OF)', 90)),
 ('KOREA, SOUTH', ("KOREA (THE DEMOCRATIC PEOPLE'S REPUBLIC OF)", 86)),
 ('LAOS', ('ANGUILLA', 60)),
 ('MICRONESIA, FEDERATED STATES OF',
  ('MICRONESIA (FEDERATED STATES OF)', 100)),
 ('MOLDOVA', ('MOLDOVA (THE REPUBLIC OF)', 90)),
 ('NETHERLANDS', ('NETHERLANDS (THE)', 95)),
 ('NIGER', ('NIGER (THE)', 90)),
 ('PHILIPPINES', ('PHILIPPINES (THE)', 95)),
 ('RUSSIA', ('RUSSIAN FEDERATION (THE)', 9

Prepare changes:

In [31]:
# first change
changesFrcia1={f:fz.extractOne(f, onlyISO)[0] 
                 for f in sorted(onlyFrcia)
                 if fz.extractOne(f, onlyISO)[1] >=87}
#dict of matches
changesFrcia1

{'BAHAMAS, THE': 'BAHAMAS (THE)',
 'BOLIVIA': 'BOLIVIA (PLURINATIONAL STATE OF)',
 'BRUNEI': 'BRUNEI DARUSSALAM',
 'CENTRAL AFRICAN REPUBLIC': 'CENTRAL AFRICAN REPUBLIC (THE)',
 'COMOROS': 'COMOROS (THE)',
 'CONGO, DEMOCRATIC REPUBLIC OF THE': 'CONGO (THE DEMOCRATIC REPUBLIC OF THE)',
 "COTE D'IVOIRE": "CÔTE D'IVOIRE",
 'DOMINICAN REPUBLIC': 'DOMINICAN REPUBLIC (THE)',
 'GAMBIA, THE': 'GAMBIA (THE)',
 'IRAN': 'IRAN (ISLAMIC REPUBLIC OF)',
 'MICRONESIA, FEDERATED STATES OF': 'MICRONESIA (FEDERATED STATES OF)',
 'MOLDOVA': 'MOLDOVA (THE REPUBLIC OF)',
 'NETHERLANDS': 'NETHERLANDS (THE)',
 'NIGER': 'NIGER (THE)',
 'PHILIPPINES': 'PHILIPPINES (THE)',
 'RUSSIA': 'RUSSIAN FEDERATION (THE)',
 'SUDAN': 'SUDAN (THE)',
 'TANZANIA': 'TANZANIA, THE UNITED REPUBLIC OF',
 'TURKEY (TURKIYE)': 'TURKEY',
 'UNITED ARAB EMIRATES': 'UNITED ARAB EMIRATES (THE)',
 'UNITED KINGDOM': 'UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND (THE)',
 'UNITED STATES': 'UNITED STATES OF AMERICA (THE)',
 'VIETNAM': '

In [32]:
# make changes
fragilecia['Country']=fragilecia.Country.replace(to_replace=changesFrcia1)
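
Some of the mismatches above, such as COTE D'IVOIRE versus CÔTE D'IVOIRE, differ only by an accent. A possible preprocessing step, sketched here with the standard `unicodedata` module, is to strip accents from both key columns before comparing them (`strip_accents` is a hypothetical helper name):

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks so that accented letters become plain ones."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("CÔTE D'IVOIRE"))
print(strip_accents('CURAÇAO'))
```

On a pandas column you could apply it with `.map(strip_accents)`; whether you want the accent-free spelling in your final table is a separate decision.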

Second iteration:

In [33]:
onlyFrcia=set(fragilecia.Country)-set(isoCodes.Countryname)
onlyISO=set(isoCodes.Countryname)-set(fragilecia.Country)

[(f,fz.extractOne(f, onlyISO)) for f in sorted(onlyFrcia)]

[('CONGO, REPUBLIC OF THE', ('BRITISH INDIAN OCEAN TERRITORY (THE)', 86)),
 ('KOREA, SOUTH', ("KOREA (THE DEMOCRATIC PEOPLE'S REPUBLIC OF)", 86)),
 ('LAOS', ('ANGUILLA', 60))]

This second iteration does not give useful results: the best candidates are wrong. Let's use a different function, `extract`, to get more than one candidate per country:

In [34]:
onlyFrcia=set(fragilecia.Country)-set(isoCodes.Countryname)
onlyISO=set(isoCodes.Countryname)-set(fragilecia.Country)

[(f,fz.extract(f, onlyISO)) for f in sorted(onlyFrcia)]

[('CONGO, REPUBLIC OF THE',
  [('BRITISH INDIAN OCEAN TERRITORY (THE)', 86),
   ('HOLY SEE (THE)', 86),
   ('VENEZUELA (BOLIVARIAN REPUBLIC OF)', 86),
   ("KOREA (THE DEMOCRATIC PEOPLE'S REPUBLIC OF)", 86),
   ('SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS', 86)]),
 ('KOREA, SOUTH',
  [("KOREA (THE DEMOCRATIC PEOPLE'S REPUBLIC OF)", 86),
   ('SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS', 86),
   ('KOREA (THE REPUBLIC OF)', 86),
   ('SOUTH AFRICA', 66),
   ('FRENCH SOUTHERN TERRITORIES (THE)', 62)]),
 ('LAOS',
  [('ANGUILLA', 60),
   ('CURAÇAO', 51),
   ('MONACO', 51),
   ('TOKELAU', 51),
   ('HOLY SEE (THE)', 45)])]

In [35]:
# remember you can use this for a particular case:
isoCodes.loc[isoCodes.Countryname.str.contains('LAO')]

Unnamed: 0,Countryname,Officialstatename,InternetccTLD,iso2,iso3
122,LAO PEOPLE'S DEMOCRATIC REPUBLIC (THE),The Lao People's Democratic Republic,.la,LA,LAO
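
A short sketch of the same lookup on a toy table (a few rows typed by hand for illustration). When the text you search for includes parentheses, passing `regex=False` makes `str.contains` treat it as plain text instead of a regular expression:

```python
import pandas as pd

# a few rows typed by hand for illustration
isoToy = pd.DataFrame({
    'Countryname': ["LAO PEOPLE'S DEMOCRATIC REPUBLIC (THE)",
                    'CONGO (THE)',
                    'CONGO (THE DEMOCRATIC REPUBLIC OF THE)',
                    'KOREA (THE REPUBLIC OF)'],
    'iso3': ['LAO', 'COG', 'COD', 'KOR']})

# regex=False: the parenthesis is matched literally, no escaping needed
hits = isoToy.loc[isoToy['Countryname'].str.contains('CONGO (', regex=False)]
print(hits)
```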


Then, just prepare manual changes:

In [36]:
lastChanges={'CONGO, REPUBLIC OF THE':'CONGO (THE)',
 'KOREA, SOUTH':'KOREA (THE REPUBLIC OF)',
'LAOS':"LAO PEOPLE'S DEMOCRATIC REPUBLIC (THE)"}

fragilecia['Country']=fragilecia.Country.replace(to_replace=lastChanges)

Then, we merge again:

In [37]:
fragciaiso=isoCodes.merge(fragilecia,left_on='Countryname',right_on='Country')
fragciaiso.head()

Unnamed: 0,Countryname,Officialstatename,InternetccTLD,iso2,iso3,Country,Year,Total,name,co2,co2_date,region,ForestRev_gdp,ForestRev_date
0,AFGHANISTAN,The Islamic Republic of Afghanistan,.af,AF,AFG,AFGHANISTAN,2019,105.0,AFGHANISTAN,7893000.0,2019,SOUTH ASIA,0.2,2018
1,ALBANIA,The Republic of Albania,.al,AL,ALB,ALBANIA,2019,58.9,ALBANIA,3794000.0,2019,EUROPE,0.18,2018
2,ALGERIA,The People's Democratic Republic of Algeria,.dz,DZ,DZA,ALGERIA,2019,75.4,ALGERIA,151633000.0,2019,AFRICA,0.1,2018
3,ANGOLA,The Republic of Angola,.ao,AO,AGO,ANGOLA,2019,87.8,ANGOLA,19362000.0,2019,AFRICA,0.36,2018
4,ANTIGUA AND BARBUDA,Antigua and Barbuda,.ag,AG,ATG,ANTIGUA AND BARBUDA,2019,54.4,ANTIGUA AND BARBUDA,729000.0,2019,CENTRAL AMERICA AND THE CARIBBEAN,0.0,2018


In [38]:
fragciaiso.drop(columns=['Country','name'],inplace=True)
fragciaiso.rename(columns={'Countryname':"Country",'Year':'fragility_date','Total':'fragility'},inplace=True)
fragciaiso

Unnamed: 0,Country,Officialstatename,InternetccTLD,iso2,iso3,fragility_date,fragility,co2,co2_date,region,ForestRev_gdp,ForestRev_date
0,AFGHANISTAN,The Islamic Republic of Afghanistan,.af,AF,AFG,2019,105.0,7893000.0,2019,SOUTH ASIA,0.20,2018
1,ALBANIA,The Republic of Albania,.al,AL,ALB,2019,58.9,3794000.0,2019,EUROPE,0.18,2018
2,ALGERIA,The People's Democratic Republic of Algeria,.dz,DZ,DZA,2019,75.4,151633000.0,2019,AFRICA,0.10,2018
3,ANGOLA,The Republic of Angola,.ao,AO,AGO,2019,87.8,19362000.0,2019,AFRICA,0.36,2018
4,ANTIGUA AND BARBUDA,Antigua and Barbuda,.ag,AG,ATG,2019,54.4,729000.0,2019,CENTRAL AMERICA AND THE CARIBBEAN,0.00,2018
...,...,...,...,...,...,...,...,...,...,...,...,...
167,UZBEKISTAN,The Republic of Uzbekistan,.uz,UZ,UZB,2019,75.7,102965000.0,2019,CENTRAL ASIA,0.00,2018
168,VIET NAM,The Socialist Republic of Viet Nam,.vn,VN,VNM,2019,66.1,249929000.0,2019,EAST AND SOUTHEAST ASIA,1.49,2018
169,YEMEN,The Republic of Yemen,.ye,YE,YEM,2019,113.5,10158000.0,2019,MIDDLE EAST,0.04,2018
170,ZAMBIA,The Republic of Zambia,.zm,ZM,ZMB,2019,85.7,6798000.0,2019,AFRICA,4.45,2018


Let's save what we have:

In [40]:
fragciaiso.to_csv(os.path.join("data","FragilityCia_isos.csv"), index=False)

The next step will be to merge _fragciaiso_ into an actual map.
Let's bring in the map:

In [41]:
import geopandas as gpd
import os

#read in:
mapWorld=gpd.read_file(os.path.join("maps","mapWorld.gpkg"),layer="countries_valid")

In [42]:
# as usual check dimensions:
mapWorld.shape, fragciaiso.shape

((251, 14), (172, 12))

As long as each ISO code appears only once in the map, the merge **cannot** give you more rows than *fragciaiso* has:

In [43]:
fragciaiso_geo=mapWorld.merge(fragciaiso,left_on='ISO_A3', right_on='iso3')

In [44]:
fragciaiso_geo.shape

(172, 26)
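
Before trusting a merge on codes, it may help to check which codes have no counterpart. The sketch below uses plain pandas data frames as a stand-in for the map (no geometry involved); the values, including the code 'XXX', are invented for illustration.

```python
import pandas as pd

# stand-in for the map attributes and for the data table (toy values)
mapToy = pd.DataFrame({'ISO_A3': ['PER', 'CHL', 'BOL'],
                       'shape': ['poly1', 'poly2', 'poly3']})
dataToy = pd.DataFrame({'iso3': ['PER', 'CHL', 'XXX'],
                        'fragility': [70.0, 40.0, 99.0]})

# codes in the data that the map does not have
missing = dataToy.loc[~dataToy['iso3'].isin(mapToy['ISO_A3']), 'iso3'].tolist()
print(missing)

# validate= raises an error if a code is repeated on either side
merged = mapToy.merge(dataToy, left_on='ISO_A3', right_on='iso3',
                      validate='one_to_one')
print(merged.shape)
```

If `missing` is empty and the row count equals the number of rows of the data table, every row found its polygon.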

With ISO codes, this step was easy. Let's save our map with added columns:

In [45]:
fragciaiso_geo.to_file(os.path.join("maps","mapWorld.gpkg"), layer='countries_valid_data', driver="GPKG")