# Country Name Map

Our main dataset contains country data from 2017, and we supplement it with data from external sources. However, the country names are not the same across the datasets. There are two reasons for this:
1. Differences in naming conventions
2. Changes in country names over time

To resolve this problem, this notebook creates a hash map from each country name in our main dataset to the variant names used in the other datasets.
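
As a rough sketch of the intended structure, the map could be a plain Python dict keyed by the main dataset's name, with a list of variants as the value. The names `country_name_map` and `variant_to_main` are hypothetical, and the variant spellings below are illustrative assumptions, not values read from the datasets.

```python
# Hypothetical sketch: keys are names from the main dataset,
# values are variant spellings assumed to appear in other datasets.
country_name_map = {
    "Bolivia (Plurinational State of)": ["Bolivia"],
    "Cabo Verde": ["Cape Verde"],
}

# Invert the map so that a name from any dataset resolves to the main name.
variant_to_main = {}
for main_name, variants in country_name_map.items():
    variant_to_main[main_name] = main_name
    for variant in variants:
        variant_to_main[variant] = main_name

print(variant_to_main["Cape Verde"])  # Cabo Verde
```

Keeping the main name as the key means every external dataset can be normalized to one spelling before merging.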

Data sources:
1. country_profile_variables.csv (main dataset)
2. countries.csv (contains a list of countries ex. territories)
3. Multiple CSV files from the World Bank (all come from the same source, so they share one naming convention)

In [10]:
# imports
import pandas as pd
import numpy as np

In [13]:
# load in the 3 datasets

# load in main dataset and extract country column into a series
main_data = pd.read_csv("./data/country_profile_variables.csv")
main_countries = main_data["country"]

# load in countries.csv that contains a list of countries
countries_list = pd.read_csv('./data/countries.csv')
countries_list = countries_list["name"]

# load in one of the world bank csvs
# in this case we will use gni data
world_bank_countries = pd.read_csv('./data/gni_per_capita_constant_lcu.csv')
world_bank_countries = world_bank_countries["Country Name"]

In [12]:
# we will use fuzzy string matching to speed up the process of finding matching countries
# Levenshtein distance for fuzzy string matching
def levenshtein_ratio_and_distance(s, t, ratio_calc = False):
    """ levenshtein_ratio_and_distance:
        Calculates levenshtein distance between two strings.
        If ratio_calc = True, the function computes the
        levenshtein distance ratio of similarity between two strings
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t
    """
    # Initialize matrix of zeros
    rows = len(s)+1
    cols = len(t)+1
    distance = np.zeros((rows,cols),dtype = int)

    # Fill the first column and first row with the distances from the empty string.
    # Two separate loops so that both are set even when one string is empty.
    for i in range(1, rows):
        distance[i][0] = i
    for k in range(1, cols):
        distance[0][k] = k

    # Iterate over the matrix to compute the cost of deletions,insertions and/or substitutions    
    for col in range(1, cols):
        for row in range(1, rows):
            if s[row-1] == t[col-1]:
                cost = 0 # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
            else:
                # In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
                # the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
                if ratio_calc == True:
                    cost = 2
                else:
                    cost = 1
            distance[row][col] = min(distance[row-1][col] + 1,      # Cost of deletions
                                 distance[row][col-1] + 1,          # Cost of insertions
                                 distance[row-1][col-1] + cost)     # Cost of substitutions
    if ratio_calc == True:
        # Computation of the Levenshtein Distance Ratio
        total_length = len(s) + len(t)
        if total_length == 0:
            return 1.0 # Two empty strings are identical; avoids division by zero
        # Index the last cell explicitly instead of relying on leftover loop variables
        Ratio = (total_length - distance[rows-1][cols-1]) / total_length
        return Ratio
    else:
        # print(distance) # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
        # insertions and/or substitutions
        # This is the minimum number of edits needed to convert string s to string t
        return "The strings are {} edits away".format(distance[rows-1][cols-1])
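
To make the ratio concrete, here is a compact, self-contained restatement of the ratio branch (substitution cost 2), using two rolling rows instead of a full matrix. The name `levenshtein_ratio` is hypothetical, and this is only a sketch for checking values by hand, not a replacement for the function above.

```python
def levenshtein_ratio(s, t):
    """Similarity in [0, 1] with insert/delete cost 1 and substitution cost 2."""
    total_length = len(s) + len(t)
    if total_length == 0:
        return 1.0  # two empty strings are treated as identical
    # previous[j] holds the distance between the processed prefix of s and t[:j]
    previous = list(range(len(t) + 1))
    for i, s_char in enumerate(s, start=1):
        current = [i]
        for j, t_char in enumerate(t, start=1):
            cost = 0 if s_char == t_char else 2
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return (total_length - previous[-1]) / total_length

print(levenshtein_ratio("australia", "austria"))  # 0.875
```

A ratio of 0.875 is above a threshold of 0.8, so distinct countries with similar spellings can still be flagged as matches and need a manual check.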

In [18]:
# explore similarities between main_countries and countries_list
threshold = 0.8 # minimum levenshtein ratio to consider 2 strings as the same
searched = {} # some strings 
for index, main in main_countries.items():
    for secondary in countries_list:
        ratio = levenshtein_ratio_and_distance(main.lower(), secondary.lower(), True)
        if (ratio > threshold):
            print(index, main, secondary, ratio)

0 Afghanistan Afghanistan 1.0
1 Albania Albania 1.0
2 Algeria Algeria 1.0
4 Andorra Andorra 1.0
5 Angola Angola 1.0
7 Antigua and Barbuda Antigua and Barbuda 1.0
8 Argentina Argentina 1.0
9 Armenia Armenia 1.0
11 Australia Australia 1.0
11 Australia Austria 0.875
12 Austria Australia 0.875
12 Austria Austria 1.0
13 Azerbaijan Azerbaijan 1.0
14 Bahamas Bahamas 1.0
15 Bahrain Bahrain 1.0
16 Bangladesh Bangladesh 1.0
17 Barbados Barbados 1.0
18 Belarus Belarus 1.0
19 Belgium Belgium 1.0
20 Belize Belize 1.0
21 Benin Benin 1.0
23 Bhutan Bhutan 1.0
24 Bolivia (Plurinational State of) Bolivia (Plurinational State of) 1.0
26 Bosnia and Herzegovina Bosnia and Herzegovina 1.0
27 Botswana Botswana 1.0
28 Brazil Brazil 1.0
30 Brunei Darussalam Brunei Darussalam 1.0
31 Bulgaria Bulgaria 1.0
32 Burkina Faso Burkina Faso 1.0
33 Burundi Burundi 1.0
34 Cabo Verde Cabo Verde 1.0
35 Cambodia Cambodia 1.0
36 Cameroon Cameroon 1.0
37 Canada Canada 1.0
39 Central African Republic Central African Republic 1