<center><img src="https://i.imgur.com/zRrFdsf.png" width="700"></center> 

_____

<a id='home'></a>

# Merging

Merging data sets requires the following considerations:

* Merging is done on two data frames.
* You need a column in each data frame that holds the same key values, written exactly the same way and ideally without repetition. The column names need not be the same.
* By default, the merged table keeps only the rows whose keys match in both data frames; you can also request the unmatched values, which helps you detect where extra cleaning is needed.
* Pandas jargon uses a **left** and a **right** data frame: **left**.merge(**right**).
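
The considerations above can be sketched with two tiny data frames. The tables, values and variable names below are made up for illustration; they are not the data sets used in this session.

```python
import pandas as pd

# Toy data frames (hypothetical values, only for illustration)
emissions = pd.DataFrame({'name': ['PERU', 'CHILE', 'BOLIVIA'],
                          'value': [10, 20, 30]})
codes = pd.DataFrame({'Countryname': ['PERU', 'CHILE', 'BRAZIL'],
                      'iso3': ['PER', 'CHL', 'BRA']})

# The key columns have different names, so each one is named explicitly.
# emissions is the "left" data frame and codes is the "right" one.
merged = emissions.merge(codes, left_on='name', right_on='Countryname')
print(merged)
```

Only PERU and CHILE appear in both key columns, so only those two rows survive the default merge.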

At this stage, let me use other data frames we prepared previously:

In [1]:
import pandas as pd

co2Link='https://github.com/CienciaDeDatosEspacial/dataSets/raw/main/co2.csv'
co2=pd.read_csv(co2Link)

Remember the number of rows:

In [3]:
co2.shape

(218, 4)

Also keep in mind the column names:

In [4]:
co2.columns

Index(['name', 'co2', 'co2_date', 'region'], dtype='object')

Now, take a look:

In [6]:
co2

Unnamed: 0,name,co2,co2_date,region
0,CHINA,1.077325e+10,2019,EAST AND SOUTHEAST ASIA
1,UNITED STATES,5.144361e+09,2019,NORTH AMERICA
2,INDIA,2.314738e+09,2019,SOUTH ASIA
3,RUSSIA,1.848070e+09,2019,CENTRAL ASIA
4,JAPAN,1.103234e+09,2019,EAST AND SOUTHEAST ASIA
...,...,...,...,...
213,ANTARCTICA,2.800000e+04,2019,ANTARCTICA
214,"SAINT HELENA, ASCENSION, AND TRISTAN DA CUNHA",1.300000e+04,2019,AFRICA
215,NIUE,8.000000e+03,2019,AUSTRALIA AND OCEANIA
216,NORTHERN MARIANA ISLANDS,0.000000e+00,2019,AUSTRALIA AND OCEANIA


Let me show this table:

In [10]:
isoLink='https://github.com/CienciaDeDatosEspacial/dataSets/raw/main/isodata.csv'
isoCodes=pd.read_csv(isoLink)
isoCodes

Unnamed: 0,Countryname,Officialstatename,InternetccTLD,iso2,iso3
0,Afghanistan,The Islamic Republic of Afghanistan,.af,AF,AFG
1,Åland Islands,Åland,.ax,AX,ALA
2,Albania,The Republic of Albania,.al,AL,ALB
3,Algeria,The People's Democratic Republic of Algeria,.dz,DZ,DZA
4,American Samoa,The Territory of American Samoa,.as,AS,ASM
...,...,...,...,...,...
244,Wallis and Futuna,The Territory of the Wallis and Futuna Islands,.wf,WF,WLF
245,Western Sahara,The Sahrawi Arab Democratic Republic,,EH,ESH
246,Yemen,The Republic of Yemen,.ye,YE,YEM
247,Zambia,The Republic of Zambia,.zm,ZM,ZMB


Merging is a key process for producing analytics. So, it is always good to add some column with a universal key. In this case, we should add the **ISO** columns to the co2 data frame. Notice that the isoCodes data frame has the country names in title case, so we need to set them to upper case:

In [12]:
isoCodes['Countryname']=isoCodes.Countryname.str.upper()

Notice the number of rows before merging:

In [16]:
co2.shape[0], isoCodes.shape[0]

(218, 249)

Let me show you some merge approaches; in each result, pay attention to the number of rows produced:

1. You keep only what is common in both key columns:

This is the default. The final rows will be the ones where the key values in each data frame match exactly. If the keys are not repeated, your count of rows will be at most the number of rows of the smallest data frame.

In [14]:
# how many resulting rows after inner merging
co2.merge(isoCodes,how='inner',left_on='name',right_on='Countryname')

Unnamed: 0,name,co2,co2_date,region,Countryname,Officialstatename,InternetccTLD,iso2,iso3
0,CHINA,1.077325e+10,2019,EAST AND SOUTHEAST ASIA,CHINA,The People's Republic of China,.cn,CN,CHN
1,INDIA,2.314738e+09,2019,SOUTH ASIA,INDIA,The Republic of India,.in,IN,IND
2,JAPAN,1.103234e+09,2019,EAST AND SOUTHEAST ASIA,JAPAN,Japan,.jp,JP,JPN
3,GERMANY,7.268810e+08,2019,EUROPE,GERMANY,The Federal Republic of Germany,.de,DE,DEU
4,CANADA,6.120840e+08,2019,NORTH AMERICA,CANADA,Canada,.ca,CA,CAN
...,...,...,...,...,...,...,...,...,...
167,NAURU,6.600000e+04,2019,AUSTRALIA AND OCEANIA,NAURU,The Republic of Nauru,.nr,NR,NRU
168,MONTSERRAT,3.300000e+04,2019,CENTRAL AMERICA AND THE CARIBBEAN,MONTSERRAT,Montserrat,.ms,MS,MSR
169,ANTARCTICA,2.800000e+04,2019,ANTARCTICA,ANTARCTICA,All land and ice shelves south of the 60th par...,.aq,AQ,ATA
170,NIUE,8.000000e+03,2019,AUSTRALIA AND OCEANIA,NIUE,Niue,.nu,NU,NIU


2. You keep all the keys from one data frame:

The final rows will be all the rows from one data frame (here, the _left_ one). If a key value does not find a match, the key value is kept, but the columns coming from the other data frame will have missing values. If the keys of the right data frame are not repeated, your count of rows will be equal to the number of rows of the left data frame. You can also use **right**, so the same logic applies to the data frame to the right.



In [19]:
# how many resulting rows after left merging
co2.merge(isoCodes,how='left',left_on='name',right_on='Countryname')

Unnamed: 0,name,co2,co2_date,region,Countryname,Officialstatename,InternetccTLD,iso2,iso3
0,CHINA,1.077325e+10,2019,EAST AND SOUTHEAST ASIA,CHINA,The People's Republic of China,.cn,CN,CHN
1,UNITED STATES,5.144361e+09,2019,NORTH AMERICA,,,,,
2,INDIA,2.314738e+09,2019,SOUTH ASIA,INDIA,The Republic of India,.in,IN,IND
3,RUSSIA,1.848070e+09,2019,CENTRAL ASIA,,,,,
4,JAPAN,1.103234e+09,2019,EAST AND SOUTHEAST ASIA,JAPAN,Japan,.jp,JP,JPN
...,...,...,...,...,...,...,...,...,...
213,ANTARCTICA,2.800000e+04,2019,ANTARCTICA,ANTARCTICA,All land and ice shelves south of the 60th par...,.aq,AQ,ATA
214,"SAINT HELENA, ASCENSION, AND TRISTAN DA CUNHA",1.300000e+04,2019,AFRICA,,,,,
215,NIUE,8.000000e+03,2019,AUSTRALIA AND OCEANIA,NIUE,Niue,.nu,NU,NIU
216,NORTHERN MARIANA ISLANDS,0.000000e+00,2019,AUSTRALIA AND OCEANIA,,,,,


3. You keep all the rows from both data frames:

In this case you will obtain all possible rows: the matched values, and the unmatched values from both data frames. You will also generate missing values. Your count of rows will be at least the number of rows of the data frame with the most rows.


In [21]:
# how many resulting rows after outer merging
co2.merge(isoCodes,how='outer',left_on='name',right_on='Countryname')

Unnamed: 0,name,co2,co2_date,region,Countryname,Officialstatename,InternetccTLD,iso2,iso3
0,CHINA,1.077325e+10,2019.0,EAST AND SOUTHEAST ASIA,CHINA,The People's Republic of China,.cn,CN,CHN
1,UNITED STATES,5.144361e+09,2019.0,NORTH AMERICA,,,,,
2,INDIA,2.314738e+09,2019.0,SOUTH ASIA,INDIA,The Republic of India,.in,IN,IND
3,RUSSIA,1.848070e+09,2019.0,CENTRAL ASIA,,,,,
4,JAPAN,1.103234e+09,2019.0,EAST AND SOUTHEAST ASIA,JAPAN,Japan,.jp,JP,JPN
...,...,...,...,...,...,...,...,...,...
290,,,,,VIET NAM,The Socialist Republic of Viet Nam,.vn,VN,VNM
291,,,,,VIRGIN ISLANDS (BRITISH),The Virgin Islands,.vg,VG,VGB
292,,,,,VIRGIN ISLANDS (U.S.),The Virgin Islands of the United States,.vi,VI,VIR
293,,,,,WALLIS AND FUTUNA,The Territory of the Wallis and Futuna Islands,.wf,WF,WLF
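
To compare the three approaches side by side, here is a minimal sketch on made-up data (the two toy tables are assumptions, not the course files). It counts the rows each `how` option returns, and then uses the `indicator=True` argument of `merge`, which adds a `_merge` column telling where each row came from.

```python
import pandas as pd

# Toy data frames (hypothetical values, only for illustration)
left = pd.DataFrame({'name': ['PERU', 'CHILE', 'BOLIVIA'],
                     'value': [1, 2, 3]})
right = pd.DataFrame({'Countryname': ['PERU', 'CHILE', 'BRAZIL'],
                      'iso3': ['PER', 'CHL', 'BRA']})

# number of rows produced by each approach
rows = {how: len(left.merge(right, how=how,
                            left_on='name', right_on='Countryname'))
        for how in ['inner', 'left', 'right', 'outer']}
print(rows)

# indicator=True labels each row as 'both', 'left_only' or 'right_only'
checked = left.merge(right, how='outer',
                     left_on='name', right_on='Countryname',
                     indicator=True)
only_left = checked.loc[checked['_merge'] == 'left_only', 'name'].tolist()
only_right = checked.loc[checked['_merge'] == 'right_only', 'Countryname'].tolist()
print(only_left, only_right)
```

The unmatched keys reported by the indicator are the ones that would need cleaning before an inner merge can keep them.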


Ideally, your merge should not produce missing values, so the **inner** approach is in general preferred. But here that approach loses several rows, because many countries are spelled differently in each data frame.

In this situation, we can turn to **fuzzy merging**: matching keys that are similar but not identical.

**Step 1**: Detect which key values are present in one data frame but not in the other:

In [24]:
onlyCo2=set(co2.name)-set(isoCodes.Countryname)
onlyISO=set(isoCodes.Countryname)-set(co2.name)

In [25]:
onlyCo2

{'BAHAMAS, THE',
 'BOLIVIA',
 'BRITISH VIRGIN ISLANDS',
 'BRUNEI',
 'BURMA',
 'CAYMAN ISLANDS',
 'CENTRAL AFRICAN REPUBLIC',
 'COMOROS',
 'CONGO, DEMOCRATIC REPUBLIC OF THE',
 'CONGO, REPUBLIC OF THE',
 'COOK ISLANDS',
 "COTE D'IVOIRE",
 'DOMINICAN REPUBLIC',
 'FALKLAND ISLANDS (ISLAS MALVINAS)',
 'FAROE ISLANDS',
 'GAMBIA, THE',
 'GAZA STRIP',
 'IRAN',
 'KOREA, NORTH',
 'KOREA, SOUTH',
 'KOSOVO',
 'LAOS',
 'MACAU',
 'MARSHALL ISLANDS',
 'MICRONESIA, FEDERATED STATES OF',
 'MOLDOVA',
 'NETHERLANDS',
 'NIGER',
 'NORTHERN MARIANA ISLANDS',
 'PHILIPPINES',
 'RUSSIA',
 'SAINT HELENA, ASCENSION, AND TRISTAN DA CUNHA',
 'SUDAN',
 'SYRIA',
 'TAIWAN',
 'TANZANIA',
 'TURKEY (TURKIYE)',
 'TURKS AND CAICOS ISLANDS',
 'UNITED ARAB EMIRATES',
 'UNITED KINGDOM',
 'UNITED STATES',
 'VENEZUELA',
 'VIETNAM',
 'VIRGIN ISLANDS',
 'WAKE ISLAND',
 'WEST BANK'}

In [26]:
onlyISO

{'ANDORRA',
 'ANGUILLA',
 'BAHAMAS (THE)',
 'BOLIVIA (PLURINATIONAL STATE OF)',
 'BONAIRE\xa0SINT EUSTATIUS\xa0SABA',
 'BOUVET ISLAND',
 'BRITISH INDIAN OCEAN TERRITORY (THE)',
 'BRUNEI DARUSSALAM',
 'CAYMAN ISLANDS (THE)',
 'CENTRAL AFRICAN REPUBLIC (THE)',
 'CHRISTMAS ISLAND',
 'COCOS (KEELING) ISLANDS (THE)',
 'COMOROS (THE)',
 'CONGO (THE DEMOCRATIC REPUBLIC OF THE)',
 'CONGO (THE)',
 'COOK ISLANDS (THE)',
 'CURAÇAO',
 "CÔTE D'IVOIRE",
 'DOMINICAN REPUBLIC (THE)',
 'FALKLAND ISLANDS (THE)',
 'FAROE ISLANDS (THE)',
 'FRENCH GUIANA',
 'FRENCH SOUTHERN TERRITORIES (THE)',
 'GAMBIA (THE)',
 'GUADELOUPE',
 'GUERNSEY',
 'HEARD ISLAND AND MCDONALD ISLANDS',
 'HOLY SEE (THE)',
 'IRAN (ISLAMIC REPUBLIC OF)',
 'ISLE OF MAN',
 "KOREA (THE DEMOCRATIC PEOPLE'S REPUBLIC OF)",
 'KOREA (THE REPUBLIC OF)',
 "LAO PEOPLE'S DEMOCRATIC REPUBLIC (THE)",
 'LIECHTENSTEIN',
 'MACAO',
 'MARSHALL ISLANDS (THE)',
 'MARTINIQUE',
 'MAYOTTE',
 'MICRONESIA (FEDERATED STATES OF)',
 'MOLDOVA (THE REPUBLIC OF)',
 'M

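If the same country is written differently, we need to modify one of those columns so that both share the same spelling. Let's find, for each name found only in the ISO table, its closest candidate among the names found only in the co2 table: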

In [46]:
from thefuzz import process as fz

[(f,fz.extractOne(f, onlyCo2)) for f in sorted(onlyISO)]

[('ANDORRA', ('SAINT HELENA, ASCENSION, AND TRISTAN DA CUNHA', 51)),
 ('ANGUILLA', ('LAOS', 60)),
 ('BAHAMAS (THE)', ('BAHAMAS, THE', 100)),
 ('BOLIVIA (PLURINATIONAL STATE OF)', ('BOLIVIA', 90)),
 ('BONAIRE\xa0SINT EUSTATIUS\xa0SABA', ('UNITED STATES', 56)),
 ('BOUVET ISLAND', ('FAROE ISLANDS', 69)),
 ('BRITISH INDIAN OCEAN TERRITORY (THE)', ('BAHAMAS, THE', 86)),
 ('BRUNEI DARUSSALAM', ('BRUNEI', 90)),
 ('CAYMAN ISLANDS (THE)', ('CAYMAN ISLANDS', 95)),
 ('CENTRAL AFRICAN REPUBLIC (THE)', ('CENTRAL AFRICAN REPUBLIC', 95)),
 ('CHRISTMAS ISLAND', ('WAKE ISLAND', 67)),
 ('COCOS (KEELING) ISLANDS (THE)', ('VIRGIN ISLANDS', 86)),
 ('COMOROS (THE)', ('COMOROS', 90)),
 ('CONGO (THE DEMOCRATIC REPUBLIC OF THE)',
  ('CONGO, DEMOCRATIC REPUBLIC OF THE', 95)),
 ('CONGO (THE)', ('CONGO, DEMOCRATIC REPUBLIC OF THE', 86)),
 ('COOK ISLANDS (THE)', ('COOK ISLANDS', 95)),
 ('CURAÇAO', ('BURMA', 55)),
 ("CÔTE D'IVOIRE", ("COTE D'IVOIRE", 96)),
 ('DOMINICAN REPUBLIC (THE)', ('DOMINICAN REPUBLIC', 95)),
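
To get a feeling for what a similarity score does, here is a minimal sketch that uses Python's standard-library `difflib` instead of `thefuzz`. This is a swapped-in technique: the helper `similarity` is hypothetical, and its scores are **not** comparable to the ones shown above, since `thefuzz` computes its scores differently. The idea of picking the highest-scoring candidate is the same.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 score from difflib's ratio (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

# a few candidates taken from the names seen above
candidates = ['BAHAMAS, THE', 'BOLIVIA', 'BURMA']
target = 'BOLIVIA (PLURINATIONAL STATE OF)'

# keep the candidate with the highest score for this target
best = max(candidates, key=lambda c: similarity(target, c))
print(best, similarity(target, best))

# two spellings that differ in one letter score high
print(similarity('MACAO', 'MACAU'))
```

Whatever the scoring function, a high threshold keeps only the safest matches, and the remaining names need a manual look.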


Prepare changes:

In [48]:
#[(f,fz.extractOne(f, onlyCo2)) for f in sorted(onlyISO)]
# first change
changesIso_1={f:fz.extractOne(f, onlyCo2)[0] 
                 for f in sorted(onlyISO)
                 if fz.extractOne(f, onlyCo2)[1] >=90}
#dict of matches
changesIso_1

{'BAHAMAS (THE)': 'BAHAMAS, THE',
 'BOLIVIA (PLURINATIONAL STATE OF)': 'BOLIVIA',
 'BRUNEI DARUSSALAM': 'BRUNEI',
 'CAYMAN ISLANDS (THE)': 'CAYMAN ISLANDS',
 'CENTRAL AFRICAN REPUBLIC (THE)': 'CENTRAL AFRICAN REPUBLIC',
 'COMOROS (THE)': 'COMOROS',
 'CONGO (THE DEMOCRATIC REPUBLIC OF THE)': 'CONGO, DEMOCRATIC REPUBLIC OF THE',
 'COOK ISLANDS (THE)': 'COOK ISLANDS',
 "CÔTE D'IVOIRE": "COTE D'IVOIRE",
 'DOMINICAN REPUBLIC (THE)': 'DOMINICAN REPUBLIC',
 'FAROE ISLANDS (THE)': 'FAROE ISLANDS',
 'GAMBIA (THE)': 'GAMBIA, THE',
 'IRAN (ISLAMIC REPUBLIC OF)': 'IRAN',
 'MARSHALL ISLANDS (THE)': 'MARSHALL ISLANDS',
 'MICRONESIA (FEDERATED STATES OF)': 'MICRONESIA, FEDERATED STATES OF',
 'MOLDOVA (THE REPUBLIC OF)': 'MOLDOVA',
 'NETHERLANDS (THE)': 'NETHERLANDS',
 'NIGER (THE)': 'NIGER',
 'NORTHERN MARIANA ISLANDS (THE)': 'NORTHERN MARIANA ISLANDS',
 'PHILIPPINES (THE)': 'PHILIPPINES',
 'RUSSIAN FEDERATION (THE)': 'RUSSIA',
 'SAINT HELENA\xa0ASCENSION ISLAND\xa0TRISTAN DA CUNHA': 'SAINT HELENA, A

In [49]:
# make changes
isoCodes['Countryname']=isoCodes.Countryname.replace(to_replace=changesIso_1)
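
Depending on the pandas version, calling `replace` with `inplace=True` on a column selected from a data frame may raise a warning or leave the data frame unchanged (an assumption worth checking in your own installation). A minimal sketch of the safer pattern, on made-up data, assigns the result back to the column and then confirms that the set of unmatched names shrinks:

```python
import pandas as pd

# Toy lookup table and toy set of names from the other table (hypothetical)
lookup = pd.DataFrame({'Countryname': ['BAHAMAS (THE)', 'BRUNEI DARUSSALAM', 'PERU']})
other_names = {'BAHAMAS, THE', 'BRUNEI', 'PERU'}
changes = {'BAHAMAS (THE)': 'BAHAMAS, THE',
           'BRUNEI DARUSSALAM': 'BRUNEI'}

# names without a match before the change
before = set(lookup.Countryname) - other_names

# assign the replaced column back to the data frame
lookup['Countryname'] = lookup['Countryname'].replace(changes)

# names without a match after the change
after = set(lookup.Countryname) - other_names
print(before, after)
```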

Second iteration:

In [50]:
onlyCo2=set(co2.name)-set(isoCodes.Countryname)
onlyISO=set(isoCodes.Countryname)-set(co2.name)

[(f,fz.extractOne(f, onlyCo2)) for f in sorted(onlyISO)]

[('ANDORRA', ('KOREA, NORTH', 51)),
 ('ANGUILLA', ('LAOS', 60)),
 ('BONAIRE\xa0SINT EUSTATIUS\xa0SABA', ('WEST BANK', 48)),
 ('BOUVET ISLAND', ('WAKE ISLAND', 67)),
 ('BRITISH INDIAN OCEAN TERRITORY (THE)', ('CONGO, REPUBLIC OF THE', 86)),
 ('CHRISTMAS ISLAND', ('WAKE ISLAND', 67)),
 ('COCOS (KEELING) ISLANDS (THE)', ('WAKE ISLAND', 70)),
 ('CONGO (THE)', ('CONGO, REPUBLIC OF THE', 86)),
 ('CURAÇAO', ('BURMA', 55)),
 ('FALKLAND ISLANDS (THE)', ('WAKE ISLAND', 66)),
 ('FRENCH GUIANA', ('WAKE ISLAND', 42)),
 ('FRENCH SOUTHERN TERRITORIES (THE)', ('KOREA, SOUTH', 62)),
 ('GUADELOUPE', ('LAOS', 45)),
 ('GUERNSEY', ('BURMA', 36)),
 ('HEARD ISLAND AND MCDONALD ISLANDS', ('WAKE ISLAND', 86)),
 ('HOLY SEE (THE)', ('CONGO, REPUBLIC OF THE', 86)),
 ('ISLE OF MAN', ('CONGO, REPUBLIC OF THE', 86)),
 ("KOREA (THE DEMOCRATIC PEOPLE'S REPUBLIC OF)",
  ('CONGO, REPUBLIC OF THE', 86)),
 ('KOREA (THE REPUBLIC OF)', ('KOREA, NORTH', 86)),
 ("LAO PEOPLE'S DEMOCRATIC REPUBLIC (THE)", ('CONGO, REPUBLIC OF T

This second iteration gives odd results: several unrelated names get the same candidate with a score of 86. Let's use a different function to see more than one candidate per name:

In [51]:
[(f,fz.extract(f, onlyCo2)) for f in sorted(onlyISO)]

[('ANDORRA',
  [('KOREA, NORTH', 51),
   ('WAKE ISLAND', 49),
   ('CONGO, REPUBLIC OF THE', 39),
   ('LAOS', 36),
   ('BURMA', 33)]),
 ('ANGUILLA',
  [('LAOS', 60),
   ('WAKE ISLAND', 42),
   ('BURMA', 36),
   ('MACAU', 36),
   ('CONGO, REPUBLIC OF THE', 27)]),
 ('BONAIRE\xa0SINT EUSTATIUS\xa0SABA',
  [('WEST BANK', 48),
   ('KOREA, NORTH', 45),
   ('KOREA, SOUTH', 45),
   ('GAZA STRIP', 43),
   ('WAKE ISLAND', 40)]),
 ('BOUVET ISLAND',
  [('WAKE ISLAND', 67),
   ('LAOS', 45),
   ('WEST BANK', 45),
   ('BURMA', 36),
   ('KOREA, NORTH', 32)]),
 ('BRITISH INDIAN OCEAN TERRITORY (THE)',
  [('CONGO, REPUBLIC OF THE', 86),
   ('WEST BANK', 48),
   ('GAZA STRIP', 45),
   ('KOREA, NORTH', 45),
   ('KOREA, SOUTH', 45)]),
 ('CHRISTMAS ISLAND',
  [('WAKE ISLAND', 67),
   ('LAOS', 45),
   ('WEST BANK', 40),
   ('BURMA', 36),
   ('MACAU', 36)]),
 ('COCOS (KEELING) ISLANDS (THE)',
  [('WAKE ISLAND', 70),
   ('CONGO, REPUBLIC OF THE', 52),
   ('LAOS', 45),
   ('WEST BANK', 38),
   ('KOREA, NORTH', 3

In [35]:
# remember you can use this for a particular case:
isoCodes.loc[isoCodes.Countryname.str.contains('LAO')]

Unnamed: 0,Countryname,Officialstatename,InternetccTLD,iso2,iso3
122,LAO PEOPLE'S DEMOCRATIC REPUBLIC (THE),The Lao People's Democratic Republic,.la,LA,LAO


Then, just prepare manual changes:

In [52]:
changesIso_2={'CONGO (THE)':'CONGO, REPUBLIC OF THE',
              "KOREA (THE DEMOCRATIC PEOPLE'S REPUBLIC OF)":'KOREA, NORTH',
             'KOREA (THE REPUBLIC OF)':'KOREA, SOUTH',
             "LAO PEOPLE'S DEMOCRATIC REPUBLIC (THE)":'LAOS',
             'MACAO':'MACAU',
             'MYANMAR':"BURMA"}
isoCodes['Countryname']=isoCodes.Countryname.replace(to_replace=changesIso_2)

Then, check again what is left:

In [53]:
onlyCo2=set(co2.name)-set(isoCodes.Countryname)
onlyISO=set(isoCodes.Countryname)-set(co2.name)

[(f,fz.extractOne(f, onlyCo2)) for f in sorted(onlyISO)]

[('ANDORRA', ('WAKE ISLAND', 49)),
 ('ANGUILLA', ('WAKE ISLAND', 42)),
 ('BONAIRE\xa0SINT EUSTATIUS\xa0SABA', ('WEST BANK', 48)),
 ('BOUVET ISLAND', ('WAKE ISLAND', 67)),
 ('BRITISH INDIAN OCEAN TERRITORY (THE)', ('WEST BANK', 48)),
 ('CHRISTMAS ISLAND', ('WAKE ISLAND', 67)),
 ('COCOS (KEELING) ISLANDS (THE)', ('WAKE ISLAND', 70)),
 ('CURAÇAO', ('GAZA STRIP', 30)),
 ('FALKLAND ISLANDS (THE)', ('WAKE ISLAND', 66)),
 ('FRENCH GUIANA', ('WAKE ISLAND', 42)),
 ('FRENCH SOUTHERN TERRITORIES (THE)', ('WEST BANK', 38)),
 ('GUADELOUPE', ('GAZA STRIP', 30)),
 ('GUERNSEY', ('WEST BANK', 24)),
 ('HEARD ISLAND AND MCDONALD ISLANDS', ('WAKE ISLAND', 86)),
 ('HOLY SEE (THE)', ('KOSOVO', 30)),
 ('ISLE OF MAN', ('WAKE ISLAND', 52)),
 ('LIECHTENSTEIN', ('WEST BANK', 36)),
 ('MARTINIQUE', ('GAZA STRIP', 30)),
 ('MAYOTTE', ('WAKE ISLAND', 26)),
 ('MONACO', ('KOSOVO', 33)),
 ('NORFOLK ISLAND', ('WAKE ISLAND', 67)),
 ('PALAU', ('WAKE ISLAND', 38)),
 ('PALESTINE, STATE OF', ('WEST BANK', 40)),
 ('PITCAIRN', 

Notice that the ISO table has one entry for Palestine, while the co2 table has two related entries:

In [58]:
isoCodes.loc[isoCodes.Countryname.str.contains('PALES')]

Unnamed: 0,Countryname,Officialstatename,InternetccTLD,iso2,iso3
170,"PALESTINE, STATE OF",The State of Palestine,.ps,PS,PSE


In [60]:
co2.loc[co2.name.str.contains('GAZ|BANK')]

Unnamed: 0,name,co2,co2_date,region
144,GAZA STRIP,3341000.0,2019,MIDDLE EAST
145,WEST BANK,3341000.0,2019,MIDDLE EAST
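
The pattern `'GAZ|BANK'` works because `str.contains` treats its argument as a regular expression by default, so `|` means "either one". A minimal sketch on a made-up Series (the values are only for illustration) also shows the `na=False` argument, which makes missing names count as "no match":

```python
import pandas as pd

# Toy names, including a missing value (hypothetical data)
names = pd.Series(['GAZA STRIP', 'WEST BANK', 'PERU', None])

# regex alternation: rows whose name contains GAZ or BANK
mask = names.str.contains('GAZ|BANK', na=False)
print(names[mask].tolist())
```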


Both rows show the same co2 value, so we could rename one of them to match the ISO name:

In [62]:
changesCo2={'GAZA STRIP':'PALESTINE, STATE OF'}
co2['name']=co2.name.replace(to_replace=changesCo2)

Now, let's merge:

In [64]:
ciaiso=co2.merge(isoCodes,how='inner',left_on='name',right_on='Countryname')
ciaiso

Unnamed: 0,name,co2,co2_date,region,Countryname,Officialstatename,InternetccTLD,iso2,iso3
0,CHINA,1.077325e+10,2019,EAST AND SOUTHEAST ASIA,CHINA,The People's Republic of China,.cn,CN,CHN
1,UNITED STATES,5.144361e+09,2019,NORTH AMERICA,UNITED STATES,"Baker Island, Howland Island, Jarvis Island, J...",,UM,UMI
2,UNITED STATES,5.144361e+09,2019,NORTH AMERICA,UNITED STATES,The United States of America,.us,US,USA
3,INDIA,2.314738e+09,2019,SOUTH ASIA,INDIA,The Republic of India,.in,IN,IND
4,RUSSIA,1.848070e+09,2019,CENTRAL ASIA,RUSSIA,The Russian Federation,.ru,RU,RUS
...,...,...,...,...,...,...,...,...,...
211,ANTARCTICA,2.800000e+04,2019,ANTARCTICA,ANTARCTICA,All land and ice shelves south of the 60th par...,.aq,AQ,ATA
212,"SAINT HELENA, ASCENSION, AND TRISTAN DA CUNHA",1.300000e+04,2019,AFRICA,"SAINT HELENA, ASCENSION, AND TRISTAN DA CUNHA","Saint Helena, Ascension and Tristan da Cunha",.sh,SH,SHN
213,NIUE,8.000000e+03,2019,AUSTRALIA AND OCEANIA,NIUE,Niue,.nu,NU,NIU
214,NORTHERN MARIANA ISLANDS,0.000000e+00,2019,AUSTRALIA AND OCEANIA,NORTHERN MARIANA ISLANDS,The Commonwealth of the Northern Mariana Islands,.mp,MP,MNP


It is always good to check for duplicates in the ISO codes:

In [66]:
ciaiso[ciaiso.duplicated(subset=['iso3'])]

Unnamed: 0,name,co2,co2_date,region,Countryname,Officialstatename,InternetccTLD,iso2,iso3


There are no duplicated ISO codes, but we do have duplicated names:

In [67]:
ciaiso[ciaiso.duplicated(subset=['name'])]

Unnamed: 0,name,co2,co2_date,region,Countryname,Officialstatename,InternetccTLD,iso2,iso3
2,UNITED STATES,5144361000.0,2019,NORTH AMERICA,UNITED STATES,The United States of America,.us,US,USA


In [68]:
ciaiso[ciaiso.name=='UNITED STATES']

Unnamed: 0,name,co2,co2_date,region,Countryname,Officialstatename,InternetccTLD,iso2,iso3
1,UNITED STATES,5144361000.0,2019,NORTH AMERICA,UNITED STATES,"Baker Island, Howland Island, Jarvis Island, J...",,UM,UMI
2,UNITED STATES,5144361000.0,2019,NORTH AMERICA,UNITED STATES,The United States of America,.us,US,USA
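
A duplicated name after merging usually means the key was repeated in one of the tables before merging. Here is a minimal sketch, on made-up data, of how to check the key column of a lookup table; `keep=False` flags every copy of a repeated key, not only the later ones:

```python
import pandas as pd

# Toy lookup table where two codes ended up with the same name (hypothetical)
lookup = pd.DataFrame({'Countryname': ['UNITED STATES', 'UNITED STATES', 'PERU'],
                       'iso3': ['UMI', 'USA', 'PER']})

# all rows whose key appears more than once
repeated = lookup[lookup.duplicated(subset=['Countryname'], keep=False)]
print(repeated)
```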


We could drop some columns:

In [69]:
ciaiso.drop(columns=['Countryname','name'],inplace=True)
ciaiso

Unnamed: 0,co2,co2_date,region,Officialstatename,InternetccTLD,iso2,iso3
0,1.077325e+10,2019,EAST AND SOUTHEAST ASIA,The People's Republic of China,.cn,CN,CHN
1,5.144361e+09,2019,NORTH AMERICA,"Baker Island, Howland Island, Jarvis Island, J...",,UM,UMI
2,5.144361e+09,2019,NORTH AMERICA,The United States of America,.us,US,USA
3,2.314738e+09,2019,SOUTH ASIA,The Republic of India,.in,IN,IND
4,1.848070e+09,2019,CENTRAL ASIA,The Russian Federation,.ru,RU,RUS
...,...,...,...,...,...,...,...
211,2.800000e+04,2019,ANTARCTICA,All land and ice shelves south of the 60th par...,.aq,AQ,ATA
212,1.300000e+04,2019,AFRICA,"Saint Helena, Ascension and Tristan da Cunha",.sh,SH,SHN
213,8.000000e+03,2019,AUSTRALIA AND OCEANIA,Niue,.nu,NU,NIU
214,0.000000e+00,2019,AUSTRALIA AND OCEANIA,The Commonwealth of the Northern Mariana Islands,.mp,MP,MNP


This result is very good, but notice you can reorganize the column order like this:

In [73]:
ciaiso=ciaiso.set_index(['iso3','iso2','InternetccTLD']).reset_index()
ciaiso


Unnamed: 0,iso3,iso2,InternetccTLD,co2,co2_date,region,Officialstatename
0,CHN,CN,.cn,1.077325e+10,2019,EAST AND SOUTHEAST ASIA,The People's Republic of China
1,UMI,UM,,5.144361e+09,2019,NORTH AMERICA,"Baker Island, Howland Island, Jarvis Island, J..."
2,USA,US,.us,5.144361e+09,2019,NORTH AMERICA,The United States of America
3,IND,IN,.in,2.314738e+09,2019,SOUTH ASIA,The Republic of India
4,RUS,RU,.ru,1.848070e+09,2019,CENTRAL ASIA,The Russian Federation
...,...,...,...,...,...,...,...
211,ATA,AQ,.aq,2.800000e+04,2019,ANTARCTICA,All land and ice shelves south of the 60th par...
212,SHN,SH,.sh,1.300000e+04,2019,AFRICA,"Saint Helena, Ascension and Tristan da Cunha"
213,NIU,NU,.nu,8.000000e+03,2019,AUSTRALIA AND OCEANIA,Niue
214,MNP,MP,.mp,0.000000e+00,2019,AUSTRALIA AND OCEANIA,The Commonwealth of the Northern Mariana Islands
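
The index trick above is compact. An alternative, shown here as a minimal sketch on a made-up data frame, is to build the list of column names in the order you want and select with it:

```python
import pandas as pd

# Toy data frame (hypothetical values, only for illustration)
df = pd.DataFrame({'co2': [1.0], 'region': ['ASIA'],
                   'iso2': ['CN'], 'iso3': ['CHN']})

# columns that should go first, followed by all the others in their current order
first = ['iso3', 'iso2']
rest = [c for c in df.columns if c not in first]
df = df[first + rest]
print(list(df.columns))
```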


Let's save what we have:

In [75]:
import os
ciaiso.to_csv(os.path.join("data","ciaiso.csv"), index=False)
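
Writing the file assumes the _data_ folder already exists. A minimal sketch of a safer version creates the folder first with `os.makedirs`; it uses a temporary location and a made-up data frame so that it does not touch your project files:

```python
import os
import tempfile
import pandas as pd

# Toy data frame (hypothetical values, only for illustration)
df = pd.DataFrame({'iso3': ['CHN'], 'co2': [1.0]})

# throwaway base folder for this sketch
base = tempfile.mkdtemp()
folder = os.path.join(base, 'data')

# create the folder if it is missing; do nothing if it is already there
os.makedirs(folder, exist_ok=True)

path = os.path.join(folder, 'ciaiso.csv')
df.to_csv(path, index=False)

# read it back to confirm the file was written
back = pd.read_csv(path)
print(back.shape)
```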

A next step will be to merge _ciaiso_ into an actual map.
Let's bring the map:

In [78]:
import geopandas as gpd
import os

#read in:
mapWorld=gpd.read_file(os.path.join("maps","mapWorld.gpkg"),layer="countries_valid")

mapWorld.head()

Unnamed: 0,TYPE,FORMAL_EN,WB_NAME,NAME_EN,FIPS_10_,ISO_A2,ISO_A3,ISO_A3_EH,ISO_N3,UN_A3,WB_A2,WB_A3,REGION_UN,geometry
0,Sovereign country,Republic of Indonesia,Indonesia,Indonesia,ID,ID,IDN,IDN,360,360,ID,IDN,Asia,"MULTIPOLYGON (((117.70361 4.16341, 117.83855 4..."
1,Sovereign country,Malaysia,Malaysia,Malaysia,MY,MY,MYS,MYS,458,458,MY,MYS,Asia,"MULTIPOLYGON (((117.70361 4.16341, 117.90704 4..."
2,Sovereign country,Republic of Chile,Chile,Chile,CI,CL,CHL,CHL,152,152,CL,CHL,Americas,"MULTIPOLYGON (((-69.51009 -17.50659, -69.68390..."
3,Sovereign country,Plurinational State of Bolivia,Bolivia,Bolivia,BL,BO,BOL,BOL,68,68,BO,BOL,Americas,"POLYGON ((-69.51009 -17.50659, -69.49712 -17.6..."
4,Sovereign country,Republic of Peru,Peru,Peru,PE,PE,PER,PER,604,604,PE,PER,Americas,"MULTIPOLYGON (((-69.51009 -17.50659, -69.52260..."


In [77]:
# as usual check dimensions:
mapWorld.shape, ciaiso.shape

((251, 14), (216, 7))

If every ISO code appeared only once in each table, the merge could **not** give you more rows than *ciaiso* has:

In [79]:
ciaiso_geo=mapWorld.merge(ciaiso,left_on='ISO_A3', right_on='iso3')

In [80]:
ciaiso_geo.shape

(221, 21)

We got more rows than *ciaiso* has, which seems wrong. Let's check for repeated codes:

In [82]:
ciaiso_geo[ciaiso_geo.duplicated(subset=['ISO_A3'])]

Unnamed: 0,TYPE,FORMAL_EN,WB_NAME,NAME_EN,FIPS_10_,ISO_A2,ISO_A3,ISO_A3_EH,ISO_N3,UN_A3,...,WB_A3,REGION_UN,geometry,iso3,iso2,InternetccTLD,co2,co2_date,region,Officialstatename
212,Dependency,Jarvis Island,Jarvis Island (US),United States Minor Outlying Islands,-99,UM,UMI,UMI,581,-99,...,UMI,Seven seas (open ocean),"POLYGON ((-160.02998 -0.37615, -160.02477 -0.3...",UMI,UM,,5144361000.0,2019,NORTH AMERICA,"Baker Island, Howland Island, Jarvis Island, J..."
213,Dependency,Baker Island,Baker Island (US),United States Minor Outlying Islands,-99,UM,UMI,UMI,581,-99,...,UMI,Seven seas (open ocean),"MULTIPOLYGON (((-176.47529 0.19310, -176.47688...",UMI,UM,,5144361000.0,2019,NORTH AMERICA,"Baker Island, Howland Island, Jarvis Island, J..."
214,Dependency,Howland Island,Howland Island (US),United States Minor Outlying Islands,-99,UM,UMI,UMI,581,-99,...,UMI,Seven seas (open ocean),"POLYGON ((-176.63618 0.79019, -176.63610 0.803...",UMI,UM,,5144361000.0,2019,NORTH AMERICA,"Baker Island, Howland Island, Jarvis Island, J..."
215,Dependency,Wake Island,Wake Island (US),United States Minor Outlying Islands,-99,UM,UMI,UMI,581,-99,...,UMI,Seven seas (open ocean),"POLYGON ((166.61940 19.28164, 166.64422 19.275...",UMI,UM,,5144361000.0,2019,NORTH AMERICA,"Baker Island, Howland Island, Jarvis Island, J..."
216,Dependency,Midway Islands,Midway Islands (US),United States Minor Outlying Islands,-99,UM,UMI,UMI,581,-99,...,UMI,Seven seas (open ocean),"MULTIPOLYGON (((-177.32490 28.20498, -177.3289...",UMI,UM,,5144361000.0,2019,NORTH AMERICA,"Baker Island, Howland Island, Jarvis Island, J..."
217,Dependency,Navassa Island,Navassa Island (US),United States Minor Outlying Islands,-99,UM,UMI,UMI,581,-99,...,USA,Seven seas (open ocean),"POLYGON ((-75.02432 18.41726, -75.01781 18.396...",UMI,UM,,5144361000.0,2019,NORTH AMERICA,"Baker Island, Howland Island, Jarvis Island, J..."
218,Dependency,Palmyra Atoll,Palmyra Atoll (US),United States Minor Outlying Islands,-99,UM,UMI,UMI,581,-99,...,UMI,Seven seas (open ocean),"POLYGON ((-162.06086 5.88719, -162.07136 5.890...",UMI,UM,,5144361000.0,2019,NORTH AMERICA,"Baker Island, Howland Island, Jarvis Island, J..."
219,Dependency,Kingman Reef,Kingman Reef (US),United States Minor Outlying Islands,-99,UM,UMI,UMI,581,-99,...,UMI,Seven seas (open ocean),"POLYGON ((-162.40018 6.44514, -162.40018 6.430...",UMI,UM,,5144361000.0,2019,NORTH AMERICA,"Baker Island, Howland Island, Jarvis Island, J..."


These are the islands that share the UMI code, and each one is a different polygon in the map; that is why the merge returned extra rows. So, we are done!
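
When one code matches several rows on one side, you can state that expectation in the merge itself with the `validate` argument. The following is a minimal sketch with plain pandas data frames and made-up values (an assumption for illustration; the map in this session is a GeoDataFrame):

```python
import pandas as pd

# Toy tables: several shapes share one code, values has each code once (hypothetical)
shapes = pd.DataFrame({'ISO_A3': ['UMI', 'UMI', 'PER'],
                       'place': ['Jarvis Island', 'Baker Island', 'Peru']})
values = pd.DataFrame({'iso3': ['UMI', 'PER'],
                       'co2': [5.0, 1.0]})

# many shapes per code is accepted when we declare it
joined = shapes.merge(values, left_on='ISO_A3', right_on='iso3',
                      validate='many_to_one')
print(len(joined))

# asking for one-to-one raises an error because UMI is repeated on the left
try:
    shapes.merge(values, left_on='ISO_A3', right_on='iso3',
                 validate='one_to_one')
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False
print(one_to_one_ok)
```

This way an unexpected repetition of keys stops the merge instead of silently adding rows.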

With ISO codes, this step was easy. Let's save our map with added columns:

In [84]:
ciaiso_geo.to_file(os.path.join("maps","mapWorld.gpkg"), layer='countries_valid_co2', driver="GPKG")