# Country Name Map

Our main dataset contains country data from 2017. We are also supplementing it with data from the World Bank. However, the country names differ between the two datasets, for two reasons:
1. Difference in naming conventions
2. Changes in the name of countries over time

To resolve this problem, this notebook creates a hash map from the country names in our main dataset to the corresponding names used in the other datasets.

Data sources:
1. country_profile_variables.csv (main dataset)
2. Multiple CSV files from the World Bank (they all come from the same source, so they share one naming convention)
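
As a minimal sketch of the structure this notebook builds, the map is a plain Python dict keyed by the main dataset's name, with an empty string where no World Bank counterpart exists. The entries below are copied from later cells; `country_map_example` and the `to_world_bank` helper are hypothetical names used only for illustration.

```python
# Illustrative only: a tiny name map in the shape this notebook produces.
country_map_example = {
    "Saint Kitts and Nevis": "St. Kitts and Nevis",
    "Viet Nam": "Vietnam",
    "Anguilla": "",  # empty string: no World Bank counterpart
}


def to_world_bank(name, mapping):
    """Hypothetical helper: return the World Bank name, or None if unmapped."""
    value = mapping.get(name, "")
    return value or None


print(to_world_bank("Viet Nam", country_map_example))
print(to_world_bank("Anguilla", country_map_example))
```

Returning `None` for both missing keys and empty values lets a caller treat "not in the map" and "no counterpart" the same way.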

In [1]:
# imports
import pandas as pd
import numpy as np
from collections import Counter 

In [2]:
# load in the 2 datasets

# load in main dataset and extract country column into a series
main_data = pd.read_csv("./data/country_profile_variables.csv")
main_data = main_data["country"]

# load in one of the world bank csvs
# in this case we will use gni data
world_bank_countries = pd.read_csv('./data/gni_per_capita_constant_lcu.csv')
world_bank_countries = world_bank_countries["Country Name"]
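
Before any fuzzy matching, it can help to see how many names already match exactly. The sketch below uses short plain lists as stand-ins for the two name columns (the values are taken from this notebook's output); `main_names` and `wb_names` are assumed names, not variables from the cells above. With the real Series, `main_data[~main_data.isin(world_bank_countries)]` should give the same kind of result.

```python
# Stand-ins for the two name columns; values copied from this notebook's output.
main_names = ["Albania", "Bahamas", "Czechia", "Viet Nam"]
wb_names = ["Albania", "Bahamas, The", "Czech Republic", "Vietnam"]

# A set gives constant-time membership checks.
wb_lookup = set(wb_names)

# Names with no exact counterpart need fuzzy or manual mapping.
unmatched = [name for name in main_names if name not in wb_lookup]
print(unmatched)
```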

In [3]:
# we will use fuzzy string matching to speed up the process of finding the same countries
# Levenshtein distance for fuzzy string matching
def levenshtein_ratio_and_distance(s, t, ratio_calc = False):
    """ levenshtein_ratio_and_distance:
        Calculates levenshtein distance between two strings.
        If ratio_calc = True, the function computes the
        levenshtein distance ratio of similarity between two strings
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t
    """
    # Initialize matrix of zeros
    rows = len(s)+1
    cols = len(t)+1
    distance = np.zeros((rows,cols),dtype = int)

    # Populate the first column and first row with the prefix lengths (the base cases).
    # Two separate loops, so each border is filled even when the other string is empty.
    for i in range(1, rows):
        distance[i][0] = i
    for k in range(1, cols):
        distance[0][k] = k

    # Iterate over the matrix to compute the cost of deletions,insertions and/or substitutions    
    for col in range(1, cols):
        for row in range(1, rows):
            if s[row-1] == t[col-1]:
                cost = 0 # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
            else:
                # In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
                # the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
                if ratio_calc == True:
                    cost = 2
                else:
                    cost = 1
            distance[row][col] = min(distance[row-1][col] + 1,      # Cost of deletions
                                 distance[row][col-1] + 1,          # Cost of insertions
                                 distance[row-1][col-1] + cost)     # Cost of substitutions
    if ratio_calc == True:
        # Computation of the Levenshtein Distance Ratio
        # Index the last cell by rows-1/cols-1: the loop variables are undefined when either string is empty
        Ratio = ((len(s)+len(t)) - distance[rows-1][cols-1]) / (len(s)+len(t))
        return Ratio
    else:
        # print(distance) # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
        # insertions and/or substitutions
        # This is the minimum number of edits needed to convert string s to string t
        return "The strings are {} edits away".format(distance[rows-1][cols-1])


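def sorted_word_match(str1, str2):
    """Return True if both names contain the same words, ignoring case, commas and word order."""
    # Sets already ignore order, so no sorting is needed before comparing
    words1 = set(str1.lower().replace(',', '').split())
    words2 = set(str2.lower().replace(',', '').split())
    return words1 == words2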
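
To make the ratio concrete: with a substitution cost of 2, the ratio is `(len(s) + len(t) - distance) / (len(s) + len(t))`. The sketch below is a compact pure-Python restatement of that ratio (`lev_ratio` is a hypothetical name, not the function above), plus a copy of the word-set comparison, applied to names that appear in this notebook's output.

```python
def lev_ratio(s, t):
    """Similarity ratio where a substitution costs 2 and an insert/delete costs 1."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 2
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    total = len(s) + len(t)
    # Assumption: two empty strings are treated as identical.
    return (total - prev[-1]) / total if total else 1.0


def same_words(str1, str2):
    """Copy of the word-set comparison: ignores case, commas and word order."""
    clean = lambda text: set(text.lower().replace(",", "").split())
    return clean(str1) == clean(str2)


print(round(lev_ratio("gambia", "zambia"), 3))       # 0.833
print(round(lev_ratio("gambia", "gambia, the"), 3))  # 0.706
print(same_words("China, Hong Kong SAR", "Hong Kong SAR, China"))  # True
```

The first two lines show why a pure ratio can pick the wrong country: "Zambia" scores higher against "Gambia" than "Gambia, The" does, which is why some names need manual review. The word-set comparison catches reordered names such as the Hong Kong pair, but not added words such as "The".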




In [4]:
# from ratio analysis we can see that the following countries have been named differently:
# mismatch = {
#      "Saint Kitts and Nevis": "St. Kitts and Nevis",
#      "Saint Vincent and the Grenadines": "St. Vincent and the Grenadines",
#      "Viet Nam": "Vietnam"
# }
country_map = {}

# map all the countries that we are certain about first
for index, main in main_data.items():
    country_map[main] = []
    for secondary in world_bank_countries:
        ratio = levenshtein_ratio_and_distance(main.lower(), secondary.lower(), True)
        if ratio > 0.5:
            country_map[main].append([ratio, secondary])

for key in country_map:
    country = ""
    max_ratio = 0
    for item in country_map[key]:
        if item[0] > max_ratio:
            max_ratio = item[0]
            country = item[1]
    country_map[key] = country

count_values = Counter(country_map.values())
values_2_or_more = []
for key, value in count_values.items():
    if key == "":
        continue
    elif value > 1:
        values_2_or_more.append(key)

compare = {}
for country in country_map:
    if country_map[country] in values_2_or_more:
        if country_map[country] not in compare:
            compare[country_map[country]] = [country]
        else:
            compare[country_map[country]].append(country)

for key in compare:
    max_country = ""
    max_ratio = 0
    for item in compare[key]:
        ratio = levenshtein_ratio_and_distance(key, item, True)
        if (ratio > max_ratio):
            max_ratio = ratio
            max_country = item
    compare[key] = max_country

for key in country_map:
    if key in compare:
        if country_map[key] != compare[key]:
            country_map[key] = ""

print("Check empty values and those that did not have ratio of 1: ")
for key, value in country_map.items():
    if not value or levenshtein_ratio_and_distance(key, value, True) != 1:
        print("key: ", key, "value: ", value)


Check empty values and those that did not have ratio of 1: 
key:  Anguilla value:  Angola
key:  Bahamas value:  Bahamas, The
key:  Bolivia (Plurinational State of) value:  
key:  Bonaire, Sint Eustatius and Saba value:  Antigua and Barbuda
key:  China, Hong Kong SAR value:  Hong Kong SAR, China
key:  China, Macao SAR value:  Macao SAR, China
key:  Congo value:  Togo
key:  Cook Islands value:  Solomon Islands
key:  Czechia value:  China
key:  Democratic People's Republic of Korea value:  Korea, Dem. People's Rep.
key:  Democratic Republic of the Congo value:  Dominican Republic
key:  Egypt value:  
key:  Falkland Islands (Malvinas) value:  Virgin Islands (U.S.)
key:  French Guiana value:  French Polynesia
key:  Gambia value:  Zambia
key:  Guadeloupe value:  Guatemala
key:  Holy See value:  
key:  Iran (Islamic Republic of) value:  Iran, Islamic Rep.
key:  Kyrgyzstan value:  Kazakhstan
key:  Lao People's Democratic Republic value:  Central African Republic
key:  Martinique value:  Maurit
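
As an alternative sketch, not the method used above, the standard library's `difflib.get_close_matches` can pick the closest candidate above a cutoff. Its similarity score is a different measure from the cost-2 Levenshtein ratio, so borderline matches may differ from the output above. `best_match` and the short `wb` list are hypothetical.

```python
from difflib import get_close_matches


def best_match(name, candidates, cutoff=0.5):
    """Hypothetical helper: closest candidate by difflib similarity, or ""."""
    # Compare in lower case, but return the candidate in its original spelling.
    by_lower = {c.lower(): c for c in candidates}
    hits = get_close_matches(name.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else ""


wb = ["Vietnam", "Albania", "Vanuatu"]
print(best_match("Viet Nam", wb))
print(best_match("Holy See", wb))
```

Like the loop above, this returns an empty string when nothing clears the cutoff. Those names would still need a manual decision.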

In [5]:
# map remaining countries manually
manual = {
  'Anguilla' :"",
  'Bolivia (Plurinational State of)' :"Bolivia",
  'Bonaire, Sint Eustatius and Saba' :"",
  'Congo' :"Congo, Rep.",
  'Cook Islands' :"",
  'Czechia' :"Czech Republic",
  'Democratic Republic of the Congo' :"Congo, Dem. Rep.",
  'Egypt' :"Egypt, Arab Rep.",
  'Falkland Islands (Malvinas)' :"",
  'French Guiana' :"",
  'Gambia' :"Gambia, The",
  'Guadeloupe' :"",
  'Holy See' :"",
  'Kyrgyzstan' :"Kyrgyz Republic",
  "Lao People's Democratic Republic" :"Lao PDR",
  'Martinique' :"",
  'Mayotte' :"",
  'Montserrat' :"",
  'Niue' :"",
  'Republic of Korea' :"Korea, Rep.",
  'Saint Helena' :"",
  'Saint Pierre and Miquelon' :"",
  'Slovakia' :"",
  'State of Palestine' :"",
  'Swaziland' :"Eswatini",
  'The former Yugoslav Republic of Macedonia' :"North Macedonia",
  'Tokelau' :"",
  'United Republic of Tanzania' :"Tanzania",
  'United States Virgin Islands' :"Virgin Islands (U.S.)",
  "Venezuela (Bolivarian Republic of)" :"Venezuela, RB",
  'Wallis and Futuna Islands' :"",
  'Western Sahara' :""
}

country_map.update(manual)
for key, value in country_map.items():
    print("key: ", key, "value: ", value)


key:  Afghanistan value:  Afghanistan
key:  Albania value:  Albania
key:  Algeria value:  Algeria
key:  American Samoa value:  American Samoa
key:  Andorra value:  Andorra
key:  Angola value:  Angola
key:  Anguilla value:  
key:  Antigua and Barbuda value:  Antigua and Barbuda
key:  Argentina value:  Argentina
key:  Armenia value:  Armenia
key:  Aruba value:  Aruba
key:  Australia value:  Australia
key:  Austria value:  Austria
key:  Azerbaijan value:  Azerbaijan
key:  Bahamas value:  Bahamas, The
key:  Bahrain value:  Bahrain
key:  Bangladesh value:  Bangladesh
key:  Barbados value:  Barbados
key:  Belarus value:  Belarus
key:  Belgium value:  Belgium
key:  Belize value:  Belize
key:  Benin value:  Benin
key:  Bermuda value:  Bermuda
key:  Bhutan value:  Bhutan
key:  Bolivia (Plurinational State of) value:  Bolivia
key:  Bonaire, Sint Eustatius and Saba value:  
key:  Bosnia and Herzegovina value:  Bosnia and Herzegovina
key:  Botswana value:  Botswana
key:  Brazil value:  Brazil
key:

In [6]:
# export country_map
%store country_map

Stored 'country_map' (dict)
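
In another notebook, `%store -r country_map` restores the stored variable. A minimal sketch of how the map might then be used follows; the small dict is a stand-in with entries copied from the output above, and `mapped`, `unmapped` and `wb_to_main` are hypothetical names.

```python
# Stand-in for the restored map; entries copied from this notebook's output.
country_map = {
    "Albania": "Albania",
    "Bahamas": "Bahamas, The",
    "Anguilla": "",
    "Bolivia (Plurinational State of)": "Bolivia",
}

# Keep only the names that have a World Bank counterpart.
mapped = {main: wb for main, wb in country_map.items() if wb}

# Names that the World Bank data cannot supplement.
unmapped = sorted(main for main, wb in country_map.items() if not wb)

# Reverse lookup (World Bank name -> main dataset name), useful for renaming
# World Bank rows before joining on the main dataset's names.
wb_to_main = {wb: main for main, wb in mapped.items()}

print(unmapped)
print(wb_to_main["Bahamas, The"])
```

The reverse lookup is only safe because each World Bank name is assigned to at most one main-dataset name. If two keys shared a value, the later one would silently overwrite the earlier.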
