Fetch the source data, in this case from an openly shared Google Drive folder. We cache the files in the `in` directory for further processing and to avoid re-fetching them each time this notebook is run.

Note that rdf-tabular appears to have a bug that requires the filenames of CSV files to be in lower case.
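
If a source file ever arrives with upper-case letters in its name, it can be renamed before processing. A minimal sketch, assuming the lower-case requirement noted above holds; `lowercase_csv_names` is a hypothetical helper, not part of rdf-tabular or of this notebook:

```python
from pathlib import Path


def lowercase_csv_names(folder):
    """Rename every CSV file in `folder` so that its name is lower case.

    Hypothetical helper for illustration. Returns the list of
    (old_name, new_name) pairs that were actually renamed.
    """
    renamed = []
    for path in sorted(Path(folder).iterdir()):
        is_csv = path.is_file() and path.suffix.lower() == '.csv'
        if not is_csv or path.name == path.name.lower():
            continue
        target = path.with_name(path.name.lower())
        # Refuse to overwrite a different file that already has the target name.
        if target.exists() and not target.samefile(path):
            raise FileExistsError(f'{target} already exists')
        path.rename(target)
        renamed.append((path.name, target.name))
    return renamed
```

Calling `lowercase_csv_names('in')` after the download step would leave already lower-case files such as `cn8_2012.csv` untouched.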

In [1]:
import requests
from pathlib import Path

sourceFolder = Path('in')
sourceFolder.mkdir(exist_ok=True)

sources = [
    ('cn8_2012.csv', '1P7YyFF6qXKXWVtR0Vt3kkvFPOjThMQH8'),
    ('cn8_2013.csv', '1de-Le9ungrbdoGyvWI_RwmEhNpTmR-70'),
    ('cn8_2014.csv', '1oC3jlItfsUshd54KOR7yn9NxpR83iCbC'),
    ('cn8_2015.csv', '1H54-FYrCFa1DylCBg38RAPAeCtkGq4la'),
    ('cn8_2016.csv', '11fLsnoiWzTcA1d3nSDWvyrKQEHwIf6Hz')
]

for filename, google_id in sources:
    sourceFile = sourceFolder / filename

    # Download only if the file is not already cached.
    if not sourceFile.is_file():
        response = requests.get(f'https://drive.google.com/uc?export=download&id={google_id}')
        response.raise_for_status()  # avoid caching an error page as data
        with open(sourceFile, 'wb') as f:
            f.write(response.content)

This data is already in [Tidy Data format](http://vita.had.co.nz/papers/tidy-data.pdf), but the country column needs some "reconciliation" in order to find the corresponding linked data resource for each country name.

We'll use [Getty's Thesaurus of Geographic Names](http://www.getty.edu/research/tools/vocabularies/tgn/index.html) (TGN), as it provides well-curated linked open data.

Getty offers a SPARQL interface with many example queries that could be used to reconcile the given country names. We'll use their [reconciliation service](http://vocab.getty.edu/queries#OpenRefine_Reconciliation_Service) for now.

Other queries worth considering:
* http://vocab.getty.edu/queries#Places_with_English_or_GVP_Label
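
The lookup URL used in the next cell is easier to read when the SPARQL query is written out and then URL-encoded. A minimal sketch, assuming the endpoint treats an encoded query the same way as the hand-written URL below; `build_tgn_query_url` is a hypothetical helper introduced here for illustration:

```python
from urllib.parse import urlencode

ENDPOINT = 'http://vocab.getty.edu/sparql.json'


def build_tgn_query_url(name):
    """Build a lookup URL for TGN concepts whose English label equals `name`."""
    # Escape characters that would otherwise end the SPARQL string literal.
    escaped = name.replace('\\', '\\\\').replace('"', '\\"')
    query = (
        'select distinct * {'
        '?x skos:inScheme tgn: ; '
        f'(xl:prefLabel|xl:altLabel)/gvp:term "{escaped}"@en'
        '}'
    )
    # urlencode percent-encodes the query and uses + for spaces.
    return ENDPOINT + '?' + urlencode({'query': query})


print(build_tgn_query_url('Bosnia & Herz.'))
```

Encoding the whole query, rather than only the country name, keeps characters such as `&` from being read as URL parameter separators.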

In [2]:
import pandas
import json
import sys
from urllib.parse import quote_plus

countriesFile = Path('metadata') / 'countries.json'
# Load previously reconciled names so that only new names are looked up.
with open(countriesFile) as f:
    countries = json.load(f)

for filename, google_id in sources:
    table = pandas.read_csv(sourceFolder / filename)
    countryNames = table['country'].unique()

    for country in countryNames:
        if country not in countries:
            try:
                response = requests.get(f'http://vocab.getty.edu/sparql.json?query=select+distinct*{{?x+skos:inScheme+tgn:;(xl:prefLabel|xl:altLabel)/gvp:term"{quote_plus(country)}"@en}}')
                countries[country] = [binding['x']['value'] for binding in response.json()['results']['bindings']]
            except Exception:
                print("Unexpected error for '%s':" % country, sys.exc_info()[0])

with open(countriesFile, 'w') as ctr:
    json.dump(countries, ctr, indent=2)
    
countries

{'Afghanistan': ['http://vocab.getty.edu/tgn/7016612'],
 'Albania': ['http://vocab.getty.edu/tgn/7006417'],
 'Algeria': ['http://vocab.getty.edu/tgn/7016752'],
 'American Samoa': ['http://vocab.getty.edu/tgn/7005667'],
 'Andorra': ['http://vocab.getty.edu/tgn/1000061'],
 'Angola': ['http://vocab.getty.edu/tgn/1000149'],
 'Anguilla': ['http://vocab.getty.edu/tgn/7004637'],
 'Antarctica': ['http://vocab.getty.edu/tgn/1000007'],
 'Antigua:Barbuda': ['http://vocab.getty.edu/tgn/1000009'],
 'Argentina': ['http://vocab.getty.edu/tgn/7006477'],
 'Armenia': ['http://vocab.getty.edu/tgn/7004538',
  'http://vocab.getty.edu/tgn/7593985',
  'http://vocab.getty.edu/tgn/7006651'],
 'Aruba': ['http://vocab.getty.edu/tgn/7004548'],
 'Australia': ['http://vocab.getty.edu/tgn/7000490'],
 'Azerbaijan': ['http://vocab.getty.edu/tgn/7006646'],
 'Bahamas': ['http://vocab.getty.edu/tgn/7005332'],
 'Bahrain': ['http://vocab.getty.edu/tgn/7016770'],
 'Bangladesh': ['http://vocab.getty.edu/tgn/1000105'],
 'Barb

In [3]:
# Keep the first candidate URI for each country and strip the prefix, leaving only the TGN ID.
tgn = {c: uris[0][len('http://vocab.getty.edu/tgn/'):] for c, uris in countries.items() if uris}
tgn

{'Afghanistan': '7016612',
 'Albania': '7006417',
 'Algeria': '7016752',
 'American Samoa': '7005667',
 'Andorra': '1000061',
 'Angola': '1000149',
 'Anguilla': '7004637',
 'Antarctica': '1000007',
 'Antigua:Barbuda': '1000009',
 'Argentina': '7006477',
 'Armenia': '7004538',
 'Aruba': '7004548',
 'Australia': '7000490',
 'Azerbaijan': '7006646',
 'Bahamas': '7005332',
 'Bahrain': '7016770',
 'Bangladesh': '1000105',
 'Barbados': '7004770',
 'Belarus': '7006657',
 'Belize': '7005346',
 'Benin': '1000160',
 'Bermuda': '7032026',
 'Bhutan': '7927800',
 'Bolivia': '7864299',
 'Bonaire': '7593113',
 'Bosnia & Herz.': '7006664',
 'Botswana': '1000150',
 'Bouvet Island': '7016858',
 'Br Ind Oc Terr': '7008651',
 'Br Virgin Is': '7004677',
 'Brazil': '1000047',
 'Brunei': '1000107',
 'Burkina': '1000208',
 'Burma': '1000108',
 'Burundi': '7001630',
 'Cambodia': '1000109',
 'Cameroon': '1000153',
 'Canada': '7743720',
 'Cape Verde': '7001632',
 'Cayman Islands': '7004623',
 'Cent Afr Rep': '10
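
Because only the first candidate is kept, names with several matches (such as Armenia above) or with none deserve a manual check. A minimal sketch of how they could be listed; the `candidates` dictionary is a small stand-in for `countries`, and the `Nowhere` entry is hypothetical:

```python
# Stand-in for the `countries` dictionary; 'Nowhere' is a made-up unmatched name.
candidates = {
    'Armenia': [
        'http://vocab.getty.edu/tgn/7004538',
        'http://vocab.getty.edu/tgn/7593985',
        'http://vocab.getty.edu/tgn/7006651',
    ],
    'Aruba': ['http://vocab.getty.edu/tgn/7004548'],
    'Nowhere': [],
}

# Names with more than one candidate URI: the first one is not necessarily the country.
ambiguous = {name: uris for name, uris in candidates.items() if len(uris) > 1}

# Names with no candidate at all: these are left out of the ID mapping.
unmatched = sorted(name for name, uris in candidates.items() if not uris)

print(sorted(ambiguous))  # ['Armenia']
print(unmatched)          # ['Nowhere']
```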

Next, each CSV file needs a JSON metadata file so that a CSVW processor can understand how to convert it to a linked data cube.

At the same time, we copy across the CSV files, replacing the country string with the corresponding TGN ID. We also fill out a template JSON metadata file for each CSV file, putting the results in the `out` directory for further processing.
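
The contents of `metadata/cn8_template.json` are not shown here. As a rough illustration only, a CSVW metadata document that turns the TGN ID in the `country` column back into a Getty URI could look like the hypothetical sketch below; everything apart from the `country` column name and the TGN URI prefix is an assumption:

```python
import copy
import json

# Hypothetical template; the real cn8_template.json may differ substantially.
template = {
    '@context': 'http://www.w3.org/ns/csvw',
    'url': None,  # filled in for each CSV file
    'tableSchema': {
        'columns': [
            {
                'name': 'country',
                'titles': 'country',
                # URI template: the cell value (a TGN ID) is substituted for {country}.
                'valueUrl': 'http://vocab.getty.edu/tgn/{country}',
            },
        ],
    },
}


def metadata_for(filename):
    """Return a copy of the template that points at `filename`."""
    metadata = copy.deepcopy(template)
    metadata['url'] = filename
    return metadata


print(json.dumps(metadata_for('cn8_2012.csv'), indent=2))
```

Copying the template for each file leaves the original untouched, which matters only if the template is reused after the loop.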

In [4]:
with open('metadata/cn8_template.json') as f:
    metadataTemplate = json.load(f)

destinationFolder = Path('out')
destinationFolder.mkdir(exist_ok=True)

for filename, google_id in sources:
    sourceFile = sourceFolder / filename
    
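    # Only the first 1000 rows of each file are converted.
    table = pandas.read_csv(sourceFile)[:1000]
    # Names without a TGN match become missing values.
    table['country'] = table['country'].map(lambda x: tgn.get(x))
    destFile = destinationFolder / filename
    # Drop rows whose country could not be matched.
    table[table.country.notna()].to_csv(destFile, index=False)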
    metadataFile = destinationFolder / (filename + '-metadata.json')
    metadataTemplate['url'] = filename
    with open(metadataFile, 'w') as meta:
        json.dump(metadataTemplate, meta, indent=2)
