# Location Corrector using Fuzzy Matching

## Objective:
Correct location values (misspellings and other inconsistencies) for a city, state, or country using fuzzy string matching with Python's fuzzywuzzy module (https://pypi.org/project/fuzzywuzzy/).

## Tasks:
- Download the country ISO codes and the world cities file from the applicable websites; these serve as the location crosswalks.
- Create a function that identifies the appropriate location and standardizes the output (proper capitalization, two-letter state codes, and spelled-out country names) across different scenarios, including but not limited to:
  - Spelled-out state.
  - Two- or three-letter country code.
  - State code that is also a two-letter country code (e.g., CA can be either California or Canada).
  - "OCONUS" instead of the actual country name.
  - Mismatched city and country (e.g., Calgary, Alberta vs. Calgary, Canada; Incheon, Korea vs. Incheon, South Korea).
  - City spelling variations that exist in different states (e.g., Pittsburg in CA, FL, OK, etc. vs. Pittsburgh, PA).
  - Misspelled state code (e.g., Pine Bluff, AK vs. Pine Bluff, AR).
- Apply the function to a file with city and state/country columns. In addition to the corrected city and state/country columns, a note column is created for each to indicate whether the identified city or state/country needs further verification, or whether no match was found.
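
The matching step in the tasks above can be sketched without external dependencies. The snippet below is only a rough, standard-library approximation of a token-sort similarity score with a cutoff, not fuzzywuzzy itself; `token_sort_score`, `closest_matches`, and the sample city list are hypothetical names introduced for illustration, and the scores may differ slightly from fuzzywuzzy's.

```python
from difflib import SequenceMatcher
import re


def token_sort_score(a, b):
    """Approximate a token-sort similarity score between 0 and 100."""
    def normalize(text):
        # Lowercase, replace non-alphanumerics with spaces, then sort the tokens.
        tokens = re.sub(r'[^0-9a-z]+', ' ', text.lower()).split()
        return ' '.join(sorted(tokens))
    return round(100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio())


def closest_matches(query, choices, cutoff=81):
    """Return the choices scoring above the cutoff, best match first."""
    scored = [(choice, token_sort_score(query, choice)) for choice in choices]
    kept = [pair for pair in scored if pair[1] > cutoff]
    return [choice for choice, _ in sorted(kept, key=lambda pair: pair[1], reverse=True)]


sample_cities = ['Manila', 'Arlington', 'Pittsburgh', 'San Diego']  # illustrative only
print(closest_matches('Manla', sample_cities))
print(closest_matches('SAN DIEGO', sample_cities))
```

Sorting the tokens before comparing makes the score insensitive to word order and letter case, which is why an all-caps input can still match a properly capitalized crosswalk entry.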

## Open in nbviewer:
https://nbviewer.jupyter.org/github/ChE99/projects/blob/master/Location%20Corrector%20using%20Fuzzy%20Matching.ipynb

## 1. Import Modules

In [1]:
import pandas as pd
import os
import re
import shutil
import urllib.request
import zipfile
import operator
from fuzzywuzzy import fuzz
from fuzzywuzzy import process



## 2. Download Location Files

In [2]:
# Read the country three-letter ISO codes from the website.
iso_codes = pd.read_html('https://www.iban.com/country-codes/')[0]

# Download the world cities file from the website.
# Note: MaxMind has since moved GeoLite2 downloads behind a free account and license key, so this URL may no longer work as-is.
url = 'https://geolite.maxmind.com/download/geoip/database/GeoLite2-City-CSV.zip'
urllib.request.urlretrieve(url, 'GeoLite2-City-CSV.zip')

# Extract to the current directory. os.sep keeps the path separator portable across operating systems.
zip_file = os.sep + 'GeoLite2-City-CSV.zip'
with zipfile.ZipFile(os.getcwd() + zip_file, 'r') as zip_extr:
    zip_extr.extractall()

# Locate the extracted folder and copy the applicable file to the current directory.
[path] = [file for file in os.listdir() if re.search(r'GeoLite2-City-CSV_', file) is not None]
loc_file = 'GeoLite2-City-Locations-en.csv'
shutil.copy(os.path.join(path, loc_file), os.getcwd())

# Read the file.
locations = pd.read_csv(loc_file)
# Set the applicable columns from the locations file.
cols = ['city_name', 'subdivision_1_name', 'subdivision_1_iso_code', 'country_name', 'country_iso_code']

# Create a function that collects the unique cities, states, countries, and state/country codes into lists.
def set_location(col):
    if col in cols[1:3]:
        # State names and state codes are limited to the United States.
        loc_list = list(locations[col][locations[cols[3]]=='United States'].unique())
    else:
        loc_list = list(locations[col].unique())
    # Drop missing values ('nan') from the list.
    return [loc for loc in loc_list if str(loc) != 'nan']

# Apply the function to create the city, state, state code, country, and country code lists.
cities, states, state_codes, countries, country_codes = map(set_location, cols)

# Delete the downloaded files since they're no longer needed.
shutil.rmtree(path)
os.remove(os.getcwd() + zip_file)
os.remove(os.path.join(os.getcwd(), loc_file))
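
As an alternative to extracting, copying, and then deleting files, the CSV could be read straight from the archive. The following is a minimal sketch using only the standard library, with a tiny in-memory archive standing in for the real download; the folder name `GeoLite2-City-CSV_example`, the `read_member` helper, and the sample rows are made up for illustration.

```python
import csv
import io
import zipfile

# Build a small in-memory archive that mimics the layout of the download
# (a dated folder containing the locations CSV). The contents are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr(
        'GeoLite2-City-CSV_example/GeoLite2-City-Locations-en.csv',
        'city_name,country_name\nManila,Philippines\nNice,France\n',
    )


def read_member(zip_source, suffix):
    """Return rows (as dicts) from the first archive member ending with suffix."""
    with zipfile.ZipFile(zip_source) as archive:
        member = next(name for name in archive.namelist() if name.endswith(suffix))
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding='utf-8', newline='')
            return list(csv.DictReader(text))


rows = read_member(buffer, 'GeoLite2-City-Locations-en.csv')
print(rows[0]['city_name'], rows[0]['country_name'])
```

Because nothing is written to the working directory, no cleanup step is needed afterwards. With pandas, passing the opened archive member to `pd.read_csv` should play the same role, since it accepts file-like objects.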

## 3. Function for Location Correction

In [3]:
def loc_check(row):
    # Define parameters.
    city = row['city']
    st_or_co = row['state/country']
    city_corrected = ''
    state_country_corrected = ''
    country_code = ''
    city_note = ''
    state_country_note = ''
    val = 0
    
    # Create city, state, and country lists.
    csco = [city] + [st_or_co]*4
    loc_list = [cities, states, state_codes, countries, country_codes]
    
    # Apply the fuzzywuzzy scorer. The token_sort_ratio scorer with a score above 81 was identified as the best setting for
    # extracting the closest matches for the cities, states, and countries. process.extract returns at most five candidates by default.
    city_list, state_list, state_code_list, country_list, country_code_list = map(
        lambda csco,loc_list: [loc[0] for loc in process.extract(csco,loc_list, scorer=fuzz.token_sort_ratio) if loc[1] > 81], 
        csco,loc_list)

    # Function for identifying the best match.
    def locator(op1, op2, param1, param2, param3, param4, param5):
        return locations[param1][op1(locations[param2], param3)][op2(locations[param4], param5)].unique().tolist()
    
    # Identify the state, state code, and country code based on three input types.
    # 1. Spelled-out state.
    if state_list:
        state = state_list[0]
        state_code_list = locator(
            operator.eq, operator.eq, 'subdivision_1_iso_code', 'subdivision_1_name', state, 'country_iso_code', 'US')
    
    # 2. Three-letter country code. Skip the lookup if the code is not in the ISO table.
    elif len(st_or_co) == 3:
        alph_co = iso_codes['Country'][iso_codes['Alpha-3 code'] == st_or_co].unique().tolist()
        if alph_co:
            country_list = [process.extractOne(alph_co[0], countries)[0]]
    
    # 3. State code also exists in country codes (e.g., CA).
    elif country_code_list:
        country_code = country_code_list[0]

    # For loop to identify country based on city.
    conus, country, oconus = [], [], []
    city_conus, city_country, city_oconus = {}, {}, {}
    for cloc in city_list:
        loc_con = locator(operator.eq, operator.eq, 'subdivision_1_iso_code', 'city_name', cloc, 'country_iso_code', 'US')
        loc_coun = locator(operator.eq, operator.eq, 'country_name', 'city_name', cloc, 'country_iso_code', country_code)
        loc_ocon = locator(operator.eq, operator.ne, 'country_name', 'city_name', cloc, 'country_iso_code', 'US')

        if (len(loc_con) != 0) or (len(loc_coun) !=0) or (len(loc_ocon) != 0):
            conus.append(loc_con)
            city_conus[cloc] = loc_con
            country.append(loc_coun)
            city_country[cloc] = loc_coun
            oconus.append(loc_ocon)
            city_oconus[cloc] = loc_ocon
        else:
            pass

    # Function to flatten a list of lists into a set of unique values.
    def sublist(var):
        return {item for sub in var for item in sub}
    
    # Apply sublist function to country lists.
    conus, country, oconus = map(sublist, [conus, country, oconus])

    # Define applicable state and country lists intersections.
    sco_cco = set(state_code_list).intersection(country_code_list)
    sco_con = set(state_code_list).intersection(conus)
    col_ocon = set(country_list).intersection(oconus)
    coun_ocon = set(country).intersection(oconus)
    
    # Function to identify corrected city and state or country.
    def corr_city():
        nonlocal city_list, city_corrected, state_country_corrected, city_oconus, city_note, state_country_note
        if len(city_list) == 1:           
            [city_corrected] = city_list
        else:
            city_corrected = city_list[0]
        [state_country_corrected] = city_oconus[city_corrected]
        city_note = 'for_review'
        state_country_note = 'for_review'
     
    # Function to indicate no matches found with the file crosswalk.
    def unk():
        nonlocal city_corrected, state_country_corrected, city_note, state_country_note
        city_corrected, state_country_corrected, city_note, state_country_note = city, st_or_co, 'unk', 'unk'
    
    # Function for identifying the city from a dictionary of cities and states or countries.
    def keys_city():
        nonlocal val, city_country, city_oconus, city_conus, state_country_corrected, city_corrected, city_note
        if val == 1:
            loc_city = city_country
        elif val == 2:
            loc_city = city_oconus
        else:
            loc_city = city_conus
        
        city_keys = [k for k,v in loc_city.items() if state_country_corrected in v]
        if len(city_keys) == 1:
            [city_corrected] = city_keys
            if fuzz.token_set_ratio(city, city_corrected) <= 95:
                city_note = 'for_review'
        else:
            city_corrected = city_keys[0]
            city_note = 'for_review'
    
    try:
        # Conditionals based on the following scenarios.
        # 1. Country = OCONUS (i.e., FAS)
        if (st_or_co == 'OCONUS'):
            if len(oconus) == 1:
                corr_city()
            else:
                unk()

        # 2. Mismatched city and country (e.g., Calgary, Alberta vs. Calgary, Canada; Incheon, Korea vs. Incheon, South Korea)
        elif not state_code_list and not country_list:
            corr_city()

        # 3. State code exists in country codes (e.g., CA, PA). CA, for example, can be California or Canada.
        elif sco_cco and coun_ocon and not sco_con:
            val = 1
            [state_country_corrected] = coun_ocon
            state_country_note = 'for_review'
            keys_city()

        # 4. Regular city and state code check; and
        # 5. City spelling variations exist in different states (e.g., Pittsburg in CA, FL, OK, etc. vs. Pittsburgh, PA)
        elif sco_con:
            [state_country_corrected] = sco_con
            keys_city()

        # 6. Misspelled state code (e.g., Pine Bluff, AK vs. Pine Bluff, AR)
        elif not sco_con and not col_ocon:
            [scol] = state_code_list
            state_country_corrected = process.extractOne(scol, list(conus), scorer=fuzz.token_sort_ratio)[0]
            state_country_note = 'for_review'
            keys_city()

        # 7. Regular city and country check.
        elif bool(col_ocon):
            val=2
            [state_country_corrected] = col_ocon
            keys_city()

        else:
            unk()

    except Exception:   
        unk()
    
    return city_corrected, city_note, state_country_corrected, state_country_note
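
The scenario logic above rests on set intersections between what the state/country input could mean and where the candidate city actually exists. The toy sketch below illustrates that idea for an ambiguous code such as CA. The lookup tables and the `resolve` helper are hypothetical and far smaller than the GeoLite2 crosswalk, and the sketch leaves out the fuzzy matching entirely.

```python
# Hypothetical mini crosswalk: city -> US state codes and non-US countries.
CITY_STATES = {'Vancouver': {'WA'}, 'San Diego': {'CA'}}
CITY_COUNTRIES = {'Vancouver': {'Canada'}, 'San Diego': set()}
STATE_CODES = {'CA', 'WA'}
COUNTRY_CODES = {'CA': 'Canada', 'PH': 'Philippines'}


def resolve(city, code):
    """Return (location, note) for a city plus a two-letter code."""
    as_state = {code} & STATE_CODES & CITY_STATES.get(city, set())
    country = COUNTRY_CODES.get(code)
    as_country = {country} & CITY_COUNTRIES.get(city, set()) if country else set()
    if as_state:
        # The city exists in that US state, so the state reading wins.
        return as_state.pop(), ''
    if as_country:
        # The code only fits as a country; flag the result for review.
        return as_country.pop(), 'for_review'
    # Neither reading is supported by the crosswalk.
    return code, 'unk'


print(resolve('San Diego', 'CA'))
print(resolve('Vancouver', 'CA'))
print(resolve('Nice', 'CA'))
```

Checking the state reading first mirrors the order of the conditionals in `loc_check`, where the country reading of an ambiguous code is used only when the city is not found in the matching US state.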

## 4. Open, Read, and Apply

In [4]:
# Open the file with the locations to be corrected.
df_loc = pd.read_csv('locations.csv')
df_loc

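| Unnamed: 0 | city | state/country |
|---|---|---|
| 0 | SAN DIEGO | CALIFORNIA |
| 1 | Vancouver | CA |
| 2 | Manla | PHL |
| 3 | NICE | OCONUS |
| 4 | Arlinton | Txas |
| 5 | pittsburg | Pensylbania |
| 6 | Calgary | ALBERTA |
| 7 | INCHEON | KOREA |
| 8 | Pine Bluf | AK |
| 9 | Lake Buena Vista | FL |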


In [5]:
# Apply the location corrector function on the file.
df_loc['city_corrected'], df_loc['city_note'], df_loc['state_country_corrected'], df_loc['state_country_note'] = zip(
    *df_loc[['city','state/country']].apply(loc_check, axis=1))
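
The assignment above relies on `zip(*...)` to turn a sequence of per-row result tuples into per-column tuples. The following is a small standard-library sketch of that transposition; the tuples mirror a few rows of this notebook's output, and the variable names are illustrative only.

```python
# Each tuple mimics one row's (city, city_note, location, location_note) result.
row_results = [
    ('San Diego', '', 'CA', ''),
    ('Manila', 'for_review', 'Philippines', ''),
    ('Nice', 'for_review', 'France', 'for_review'),
]

# zip(*rows) transposes the rows, yielding one tuple per output column.
city_col, city_note_col, loc_col, loc_note_col = zip(*row_results)

print(city_col)
print(loc_note_col)
```

Each resulting tuple has one entry per input row, so it can be assigned directly to a new DataFrame column.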

In [6]:
# View the updated file, which now includes the corrected locations and the note columns. Non-matching entries, i.e., those that
# didn't match the location crosswalk, are indicated by "unk" in the note columns.
df_loc

| Unnamed: 0 | city | state/country | city_corrected | city_note | state_country_corrected | state_country_note |
|---|---|---|---|---|---|---|
| 0 | SAN DIEGO | CALIFORNIA | San Diego | | CA | |
| 1 | Vancouver | CA | Vancouver | | Canada | for_review |
| 2 | Manla | PHL | Manila | for_review | Philippines | |
| 3 | NICE | OCONUS | Nice | for_review | France | for_review |
| 4 | Arlinton | Txas | Arlington | for_review | TX | |
| 5 | pittsburg | Pensylbania | Pittsburgh | for_review | PA | |
| 6 | Calgary | ALBERTA | Calgary | for_review | Canada | for_review |
| 7 | INCHEON | KOREA | Incheon | for_review | South Korea | for_review |
| 8 | Pine Bluf | AK | Pine Bluff | for_review | AR | for_review |
| 9 | Lake Buena Vista | FL | Lake Buena Vista | unk | FL | unk |
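
Rows flagged `for_review` or `unk` are the ones that need manual follow-up. A minimal sketch of tallying them is shown below, using the `city_note` values copied from the ten output rows above; the `'ok'` label for unflagged rows is an assumption made here for readability and is not produced by `loc_check`.

```python
from collections import Counter

# city_note values copied from the ten output rows ('' means no flag).
city_notes = ['', '', 'for_review', 'for_review', 'for_review',
              'for_review', 'for_review', 'for_review', 'for_review', 'unk']

# Count each note value, labeling unflagged rows as 'ok' (hypothetical label).
tally = Counter(note or 'ok' for note in city_notes)

# Row positions that carry any flag and therefore need a manual look.
needs_attention = [i for i, note in enumerate(city_notes) if note]

print(tally)
print(needs_attention)
```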
