## Geocoding of GeoAid Data V2

The original geoaid data data includes a git repo of shapefiles that are linked in a csv to project descriptions. Since only 30% of this data was geocoded, we need to find additional references to locations to geocode. We use city (lat, long) coordinates and country/continental shapefiles. This is done iterativly to provide the most accurate granular location for a particular financial expenditure. (ie, only if a city cannot be found would a project be geocoded at the country level). 

Expansions in this line of work would include: 
- automation of geocoding 
- more granular locations identified 

Geocoded data appears as such: 

*Public Opinion Data Set*

|Data Entity ID   |   Entity Attribute Fields  | Geocode Level 1 (country) ID | Geocode Level 2 (regional) ID  | Geocode Level 3 (Lat/Long) ID  | 
|---------|-----------------|--------------|---|-----------------|
|Respondant_1 |   age/gender/opinions/... | 100 | 1001 | 100101  | 
|Respondant_2  |    age/gender/opinions/... | 200 | 2001 | 2001001  | 

*Geocoded Data*

| Geocode Level   |   Name  | ID | geometry | Attribute Fields  |
|---------|-------| -----|-----|--------|
|1  |  USA | 100 | MultiPolygon(...) | gdp/bri_partnership/wealth/access/... | 
|3  |  Siegen | 2001 | Point(...)        | gdp/bri_partnership/wealth/access/... |
|2  |  Noth Rhine-Westphalia | 2001001 | MultiPolygon(...)  |  gdp/bri_partnership/wealth/access/... |

After entities are geocoded, we convert to a csv and then export for use in ArcGIS. 

__More information on geocoding conventions and processes for adding data to the location tables can be found [here](https://github.ncsu.edu/nakraft/LAS-BRI/blob/master/data_final/geocode.md).__

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import os
import osmapi as osm
import regex as re

# to retrieve geojson files from dataframe 
from urllib.request import urlopen
import json
import geojson
from geojson import Feature, FeatureCollection
import geopandas as gdp

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)

Set the location of the git repo for future use. 

In [2]:
git_repo_loc = os.path.dirname(os.path.realpath("AidDataV2_Parsing.ipynb"))
temp_directory = os.path.join(os.path.join(os.path.expanduser('~')), 'Desktop')

#### Needed Functions

In [3]:
def load_dict(path): 
    file = open(path, "r")
    contents = file.read()
    dictionary = json.loads(contents)
    file.close()
    return dictionary

## Data set collection

There are three different datasets available for collection. 
- df: features development, commercial, representational or mixed intent 
- mil: features military financial expenditures 
- hu: features huawei financial expenditures 

In [4]:
dev = pd.read_excel(git_repo_loc + "/AidDatasGlobalChineseDevelopmentFinanceDataset_v2.0.xlsx", sheet_name='Global_CDF2.0')
dev['data'] = 'Development'

In [5]:
mil = pd.read_excel(git_repo_loc + "/AidDatasGlobalChineseDevelopmentFinanceDataset_v2.0.xlsx", sheet_name='Military')
mil['data'] = 'Military'

In [6]:
hu = pd.read_excel(git_repo_loc + "/AidDatasGlobalChineseDevelopmentFinanceDataset_v2.0.xlsx", sheet_name='Huawei')
hu['data'] = 'Huawei'

The data collected has synonymous fields, with the exception if they are good for military, huawei, or developmental aggregates. 

In [7]:
df = pd.concat([dev, mil, hu])

In [8]:
df['aggregate'] = (df['Recommended For Aggregates'].fillna("No").map({"Yes":1, "No":0}) + 
                   df['Recommended For Military Aggregates'].fillna("No").map({"Yes":1, "No":0}) + 
                   df['Recommended For Huawei Aggregates'].fillna("No").map({"Yes":1, "No":0}))

In [9]:
print("These values describe the count of projects to be aggregated moving forward.")
df['aggregate'].value_counts()

These values describe the count of projects to be aggregated moving forward.


1    11394
0     2644
Name: aggregate, dtype: int64

In [10]:
# map the names to a common value for searching for the geometry 
# grabs it from txt file for consistency of country naming conventions

recipient_mapping = load_dict("../country_config.txt")
df['Recipient'] = df['Recipient'].replace(recipient_mapping)

In [11]:
wk = df

## Determining our geospatial capabilities

In [12]:
none_loc = df.loc[(df['geoJSON URL DL'].notna()) & (df['aggregate'] == 1)] 
print("There is " + str(none_loc['AidData TUFF Project ID'].count() * 100 / len(df[df['aggregate'] == 1])) + "% projects already geocoded in the dataset.")
print("We can geocode the rest of the data using our geocoding process.")

There is 27.87431981744778% projects already geocoded in the dataset.
We can geocode the rest of the data using our geocoding process.


In [73]:
# pull in the mapping of files with geocode available
geocode_map = pd.read_csv(git_repo_loc + "/geoaid_data_to_loc_mapping.csv").drop(columns='Unnamed: 0')
wk = wk[~wk['AidData TUFF Project ID'].isin(geocode_map['AidData TUFF Project ID'])].reset_index()

assert len(wk) == (len(df) - len(geocode_map))
print("There are still " + str(len(wk)) + " more entities to geocode")

AssertionError: 

## Use wk as a dataframe of non-geocoded entities

Narrow down the 'wk' to include key location data.

__Let's first geocode anything that has explicit url demonstrating a location__

In [14]:
wk = wk[['AidData TUFF Project ID', 'Recipient', 'Recipient Region', 'Geographic Location']]

In [15]:
api = osm.OsmApi()

In [16]:
wk['urls'] = ""
for inde, element in enumerate((wk['Geographic Location'].astype(str))):
    mask = (pd.notna(element)) and (("openstreetmap" in element) or ("google.com/maps" in element))
    wk['urls'][inde] = [x for x in element.split(" ") if ("openstreetmap" in x) or ("google.com/maps" in x)]

In [17]:
urls = wk.loc[(wk['urls'] != "") & (wk['urls'].str.len() != 0)].reset_index()

In [18]:
def find_coordinates_from_url(url, api): 
    
    if "https://www.openstreetmap.org/node" == url[:34]: 
        x = re.findall(r"https://www.openstreetmap.org/node/(.*)", url)[0]
        j = api.NodeGet(x)
        lat = j['lat']
        long = j['lon']
        try: 
            name = j['tag']['name']
        except: 
            name = ""
        return [name, lat, long]
    
    if "https://www.openstreetmap.org/way/" == url[:34]: 
        try: 
            x = re.findall(r"https://www.openstreetmap.org/way/(.*)", url)[0]
            j = api.WayGet(x)
            nodes = j['nd']
            loc = api.NodeGet(nodes[0])
            lat = loc['lat']
            long = loc['lon']
            try: 
                name = nodes['tag']['name']
            except: 
                name = ""
            return [name, lat, long]
        except: 
            pass
            #nothing happens
    
    r = {
        "https://www.openstreetmap.org/qu" : r"(.*)https://www.openstreetmap.org/query\?lat=(.*)&lon=(.*)", 
        "https://www.openstreetmap.org/wa" : r'https://www.openstreetmap.org/way/(.*)\#map=(?:.*)\/(.*)\/(.*)',
        "https://www.openstreetmap.org/re" : r"https://www.openstreetmap.org/relation/(.*)\#map=[\d]/(.*)/(.*)",
        "https://www.openstreetmap.org/se" : r"(.*)https://www.openstreetmap.org/search\?query=(?:.*)\#map=(?:.*)\/(.*)\/(.*)",
        "https://www.google.com/maps/plac" : r"https://www.google.com/maps/place/(.*)/@(.*),(.*),(?:.*)data=(?:.*)", 
        "https://www.google.com/maps/dir/" : r"https://www.google.com/maps/dir/(.*)/@(.*),(.*),", 
        "https://www.google.com/maps/d/u/" : r"(.*)https://www.google.com/maps/d/u/0/edit(?:.*)ll=(.*)\%2C(.*)&z=(?:.*)",  
    }
    
  #  "https://www.google.com/maps/dir/Utexrwa,+KG+15+Ave,+Kigali,+Rwanda/Kinyinya,+Kigali,+Rwanda/
  #  @-1.9215882,30.0667421,14z/data=!3m1!4b1!4m14!4m13!1m5!1m1!1s0x19dca6adc81ee515:0xe7149cb605da8846!2m2!1d30.075835!2d-1.9277261!1m5!1m1!1s0x19dca13db5bbf103:0xce704dade862c428!2m2!1d30.0954076!2d-1.9113692!3e0"
    # determine how to parse the data 
    # t will correspond to (name, lat, long) or (id, lat, long), where id can be parsed to get the name
    try: 
        regex = r[url[:32]]
    except: 
        # print("the " + str(url) + " was not matched.")
        # nothing occurs
        pass
        
    try:  
        t = list(re.findall(regex, url)[0])
        t[0] = t[0].replace("+", " ")
        return t  
    except: 
        # print("the regex " + regex + " had issues with url " + url)
        pass

In [19]:
urls['coord'] = [list() for x in range(len(urls))]
for i in range(0, len(urls)): 
    for e in range(0, len(urls['urls'][i])): 
        t = find_coordinates_from_url(urls['urls'][i][e], api)
        if t is not None: 
            urls['coord'][i].append(t)

('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))


In [20]:
urls['latitude'] = ""
urls['longitude'] = ""
urls['location_name'] = ""

for i in range(0, len(urls)): 
    # if the array is of length 1, use it as the location 
    # if it has multiple, where there is only 1 with an actual location name, use that as the location
    # if it has multiple, where they is more than 1 with a location name, use the project id and use the average of coordinates for point
    # if there is multiple but none of them have a name, average all the locations for the point 
    # if no locations identified, print error message 
    
    if (len(urls['coord'][i]) == 0) | (urls['coord'][i] == [""]):
        continue
        
    if len(urls['coord'][i]) == 1: 
        if urls['coord'][i][0][0] == "": 
            urls['location_name'][i] = ""
        else: 
            urls['location_name'][i] = urls['coord'][i][0][0]
        urls['latitude'][i] = urls['coord'][i][0][1]
        urls['longitude'][i] = urls['coord'][i][0][2]
        continue 
    
    if len(urls['coord'][i]) > 1: 
        temp = urls['coord'][i]
        first_ele = [x[0] for x in urls['coord'][i]]
        if first_ele.count("") == len(first_ele) - 1: 
            loc = [x for x in urls['coord'][i] if x[0] != ""][0]
            urls['location_name'][i] = loc[0]
            urls['latitude'][i] = loc[1]
            urls['longitude'][i] = loc[2]
            
        else: 
            urls['location_name'][i] = ""
            urls['latitude'][i] = np.mean([float(x[1]) for x in urls['coord'][i]])
            urls['longitude'][i] = np.mean([float(x[2]) for x in urls['coord'][i]])

# drop all empty data
urls = urls[urls['location_name'] != ""]

In [21]:
urls['location_name'] = [re.sub('(\%[a-zA-Z0-9]{2})', '', x).split(",")[0] for x in urls['location_name'].astype(str)]
urls['location_name'] = [re.sub("\d{1,3}'\d{2}.\d[NESW]", "", x).strip() for x in urls['location_name'].astype(str)]
urls['location_name'] = [re.sub("[NESW]\d", "", x) for x in urls['location_name'].astype(str)]
urls['location_name'] = [re.sub('[0-9] [0-9]', '', x) for x in urls['location_name']]
urls['location_name'] = [re.sub('[0-9]', '', x) if x.isdigit() else x for x in urls['location_name']]

# now we have coordinates. These need to go to geocoder to go from coordinates to location 
loc_data = temp_directory + "/city_reverse_geocoding_temp.csv"
urls = urls[['AidData TUFF Project ID', 'Recipient', 'latitude', 'longitude', 'location_name']]
urls.to_csv(loc_data, index=False)

#### We can run our geocoding through a unique script. This will ensure consistency across other entity geocoding. 

In [22]:
%cd ..
%run autogeocode.py /Users/natalie_kraft/Desktop/city_reverse_geocoding_temp.csv C location_name Recipient "AidData TUFF Project ID"

/Users/natalie_kraft/Documents/LAS/LAS-BRI/data_processing
Preparing system configuration.
Loading file to geocode
You are reverse geocoding cities. Begin geocoding.
Loading geocoded location entities.
Loading geocoded location entities.
Loading formatted geocoded file...
	found Centro Cultural Chins with id 63
	found 53281.0, near Outapi, NA with id 28
	found Levy Mwanawasa General Hospital with id 59
	found Levy Mwanawasa General Hospital with id 59
	This country Africa was not found
	The country Africa will not be added to the listing.
	found Mariel with id 50
	found 30481.0, near Sao Tome, ST with id 39
	found 73057.0, near Vestmannaeyjar, IS with id 9
	found 58671 near Aplahoue, BJ with id 51
	found Levy Mwanawasa General Hospital with id 59
	found Maina Soko Military Hospital with id 61
	found 58671 near Aplahoue, BJ with id 51
	found 30629.0, near Yandev, NG with id 86
	found Centre Hospitalier National de Pikine with id 32
	found Maina Soko Military Hospital with id 61
	found M

In [74]:
results = pd.read_csv(temp_directory + "/city_reverse_geocoding_temp_results.csv")
geocode_map = geocode_map.append(results[results['city_temp_id'] != "-1"])

In [24]:
# any result that has been geocoded, restrict from wk 
temp_wk = wk.merge(results, on='AidData TUFF Project ID', how ="left")
temp_wk = temp_wk[(pd.isna(temp_wk['city_temp_id'])) | (temp_wk['city_temp_id'] == "-1")]
wk = temp_wk.drop(columns='city_temp_id')

assert len(wk) == (len(df) - len(geocode_map))
print("There are still " + str(len(wk)) + " more entities to geocode")

There are still 10527 more entities to geocode


## Use wk as a dataframe of non-geocoded entities - no URL locations


__Next we can geocode anything that has any city information present demonstrating a location__
- inclusive of standard city naming conventions 
- and inclusive of "town of" and "city of" tags 


In [25]:
# create a city column for all of the cities listed in the "Geographic Location" columns 
# establishes more details on cities given the full description
wk['city'] = [re.findall("(?:town|city|City|located|commune|University) (?:of|in) (?:[A-Z]\w*(?:\s|\.|,))+", str(x)) for x in wk['Geographic Location']]
wk['city'] = [re.sub("(town|city|City|located) (of|in) ", "", x[0])[:-1] if len(x) !=0 else "" for x in wk['city']]

# establishes based on "located in"
# checks for standard city conventions, one word or all caps 'city, country'
p = re.compile("(?:[A-Z]\w*(?:\s|\.|,){0,2})+")
temp_city_list = [p.match(str(x)) for x in wk['Geographic Location']]
temp_city_list = [x.group() if x != None else "" for x in temp_city_list]
temp_city_list = [x if ("This " not in x) & ("The " not in x) & ("There " not in x) else "" for x in temp_city_list]

wk['city'] = wk['city'] + temp_city_list 

export = wk[wk['city'] != ""][['AidData TUFF Project ID', 'Recipient', 'city']]

loc_data = temp_directory + "/city_geocoding_temp.csv"
export.to_csv(loc_data, index=False)

In [29]:
%run autogeocode.py /Users/natalie_kraft/Desktop/city_geocoding_temp.csv gl3 city Recipient "AidData TUFF Project ID"

Preparing system configuration.
Loading file to geocode
You are geocoding cities. Begin geocoding.
Loading geocoded location entities.
Loading geocoded location entities.
	This country Asia was not found
Location Asia not added to mapping or location entities.
	This country Asia was not found
Location Asia not added to mapping or location entities.
	This country South America was not found
Location South America not added to mapping or location entities.
	This country Asia was not found
Location Asia not added to mapping or location entities.
	This country Africa was not found
Location Africa not added to mapping or location entities.
	This country Africa was not found
Location Africa not added to mapping or location entities.
	This country South America was not found
Location South America not added to mapping or location entities.
	This country South America was not found
Location South America not added to mapping or location entities.
	This country Africa was not found
Location Afr

In [75]:
results = pd.read_csv(temp_directory + "/city_geocoding_temp_results.csv").rename(columns={'cities_temp_id':'city_temp_id'})
geocode_map = geocode_map.append(results[results['city_temp_id'] != "-1"])

In [31]:
# any result that has been geocoded, restrict from wk 
temp_wk = wk.merge(results, on='AidData TUFF Project ID', how ="left")
temp_wk = temp_wk[(pd.isna(temp_wk['city_temp_id'])) | (temp_wk['city_temp_id'] == "-1")]
wk = temp_wk.drop(columns=['city_temp_id'])

assert len(wk) == (len(df) - len(geocode_map))

In [32]:
print("There are still " + str(len(wk)) + " more entities to geocode")

There are still 8883 more entities to geocode


## Regional geocoding 
__Next we can geocode anything that has any regional information present demonstrating a location__
- inclusive of standard region naming conventions --> we will consider all of wk['city'] non coded entities to carry over into regional localities as well
- and inclusive of "district of" and "province of" tags 
- and x "Region"/"Province"/"region"/"province"/"district"/"District"

In [45]:
# upon inspection, some of the cities actually are administration boundaries. We will run them through the
# administrative boundaries API to see if we pick up any localities 
wk['region'] = ""

# we will also identify our priority key words to register
regional_temp = [re.findall("(?:Region|region|Province|province|district|District|state|commune) (?:of|in) (?:[A-Z]\w*(?:\s|\.|,))+", str(x)) for x in wk['Geographic Location']]
regional_temp = [re.sub("(Region|region|Province|province|district|District|commune) (of|in) ", "", x[0])[:-1] if len(x) !=0 else "" for x in regional_temp]

regional_temp_2 = [re.findall("(?:[A-Z]\w*(?:\s|\.|,))+ (?:Region|region|Province|province|district|District|commune)", str(x)) for x in wk['Geographic Location']]
regional_temp_2 = [re.sub(" (Region|region|Province|province|district|District)", "", x[0])[:-1] if len(x) !=0 else "" for x in regional_temp_2]

# consolidate into one columns 
#wk['region'] = [x if len(x) != 0 else y for x,y in zip(regional_temp, regional_temp_2)]
regional_temp = regional_temp + regional_temp_2

wk['region'] = [x if len(x) != 0 else y for x,y in zip(regional_temp, wk['city'])]
wk['region'] = [re.sub("(Province|District|,)", "", x) for x in wk['region']]

export = wk[wk['region'] != ""][['AidData TUFF Project ID', 'Recipient', 'region']]

loc_data = temp_directory + "/region_geocoding_temp.csv"
export.to_csv(loc_data, index=False)

In [47]:
%run autogeocode.py /Users/natalie_kraft/Desktop/region_geocoding_temp.csv gl2 region Recipient "AidData TUFF Project ID"

Preparing system configuration.
Loading file to geocode
You are geocoding regions. Begin geocoding.
Loading geocoded location entities.
Loading geocoded location entities.
	This country Africa was not found
Location Africa not added to mapping or location entities.
	This country Asia was not found
Location Asia not added to mapping or location entities.
	This country Africa was not found
Location Africa not added to mapping or location entities.
	This country Asia was not found
Location Asia not added to mapping or location entities.
	This country Asia was not found
Location Asia not added to mapping or location entities.
	This country Africa was not found
Location Africa not added to mapping or location entities.
	This country Africa was not found
Location Africa not added to mapping or location entities.
	This country South America was not found
Location South America not added to mapping or location entities.
	This country Asia was not found
Location Asia not added to mapping or loc

In [76]:
results = pd.read_csv(temp_directory + "/region_geocoding_temp_results.csv").rename(columns={'regions_temp_id':'region_temp_id'})
geocode_map = geocode_map.append(results)

In [49]:
# any result that has been geocoded, restrict from wk 
temp_wk = wk.merge(results, on='AidData TUFF Project ID', how ="left")
temp_wk = temp_wk[(pd.isna(temp_wk['region_temp_id'])) | (temp_wk['region_temp_id'] == "-1")]
wk = temp_wk.drop(columns=['region_temp_id'])

assert len(wk) == (len(df) - len(geocode_map))

In [50]:
print("There are still " + str(len(wk)) + " more entities to geocode")

There are still 8602 more entities to geocode


## Country Level Geocoding 

__We have exhausted all other methods of geocoding__ 

All remaining locations will be geocoded at the country level.

In [58]:
wk = wk[['AidData TUFF Project ID', 'Recipient', 'Geographic Location']]
export = wk

loc_data = temp_directory + "/country_geocoding_temp.csv"
export.to_csv(loc_data, index=False)

In [61]:
%run autogeocode.py /Users/natalie_kraft/Desktop/country_geocoding_temp.csv gl1 Recipient Recipient "AidData TUFF Project ID"

Preparing system configuration.
Loading file to geocode
You are geocoding entities at the country level. Begin geocoding.
Loading geocoded location entities.
Index(['AidData TUFF Project ID', 'Recipient', 'Geographic Location',
       'country', 'country_id'],
      dtype='object')
Exporting mapping results.
Geocoding complete.


In [77]:
results = pd.read_csv(temp_directory + "/country_geocoding_temp_results.csv")
geocode_map = geocode_map.append(results)

In [64]:
# any result that has been geocoded, restrict from wk 
temp_wk = wk.merge(results, on='AidData TUFF Project ID', how ="left")
wk = temp_wk[(pd.isna(temp_wk['country_id'])) | (temp_wk['country_id'] == "-1")]
assert len(wk) == (len(df) - len(geocode_map))

print("There are still " + str(len(wk)) + " more entities to geocode")

There are still 168 more entities to geocode


Despite their being additional entities to geocode, they are defined above the country level. All non-geocoded data will be logged as an unknown point at the coordinates Long: -130.29 Lat: -33.819. 

## Finalize Geocoded Mapping & Export Geocoded Dataset

In [91]:
# update the column names to match final conventions
geocode_map = geocode_map.rename(columns={'city_temp_id': 'gl3_id', 'region_temp_id': 'gl2_id'})

# merge data on TUFF ID 
merged = df.merge(geocode_map, on='AidData TUFF Project ID', how='left')

# export original expenditure data with this map attached
merged.to_csv("geoaid/geocoded_data.csv")

In [100]:
# we have generated some temporary files. remove them. 
%cd
%cd /Users/natalie_kraft/Desktop

%rm city_geocoding_temp.csv city_geocoding_temp_results.csv 
%rm city_reverse_geocoding_temp.csv city_reverse_geocoding_temp_results.csv

%rm country_geocoding_temp.csv country_geocoding_temp_results.csv
%rm region_geocoding_temp.csv region_geocoding_temp_results.csv

/Users/natalie_kraft
/Users/natalie_kraft/Desktop
