# __Step 7.1: Get country info__

## ___Readme___

Goal
- Get country info out of each doc
- Get # of docs per country
- Get # of docs per continent
- Get # of docs per country over time
- Get # of docs per country per topic
- Get # of docs per country per topic over time

Approach:
- Get the right token for country info in the AD (address) field, some have email address as the last token
- `pycountry`: use both ISO3166 (Countries) and ISO3166-3 (deleted countries)
- Supplement dictionary: some special considerations, e.g., UK, Taiwan, etc.
- `geopy`: pass the location token directly. This is a powerful module but does not deal with historical country properly.

Deprecated:
- `uszipcode`: search for zip code if the token has two parts delimited by " " and the 2nd part is a number as is or after taking the 1st subtoken before "-". This is done by `geopy`.

Key info:
- Total records: 421658
- Medline available: 421626
  - No AD in medline record: 19851
  - With a3 based on pycountry/suppl dict: 329237
  - To geopy: 72502

Thoughts
- The addresses that do not work tend to be earlier records where only institution names are available or from addresses that no longer exists.

Issues:
- 3/13/23:
  - Gone through north america and asia, tried to use the merged europe1 file but encountered: `ERROR: Input data is not ordered: relation id 33702 appears more than once.` See `Working with Multiple Input Files` of the [Osmium manual](https://osm2pgsql.org/doc/manual.html#updating-an-existing-database) which suggest [merge and simplify first](https://osm2pgsql.org/doc/manual.html#merging-osm-change-files).
- 3/7/23:
  - Have gotten the North American Nominatim search to work. But looking into the one that addresses cannot be found, I realize that `get_location_str()` need to be further improved.
    - See `script_7_1_c_assess_not_found.ipynb` for considerations.
    - This leads to a modified `get_location_str()`.
- 3/1/23:
  - The North America file have been loading for 4 days and died this morning because I run out of virtual disc space. There was ~250Gb free space. Increase the virtual disc from 512Gb to 1024Gb and run Nominatim again...
- 2/27/23:
  - First thing in the morning, I notice the proess died with kill signal. Suspect it is a memory issue.
  - Revert back to do continent level. But this does not work for all because I ran into memory problems still for Europe. But North America works. So for Europe, exclude 3 combined regions, and consolidate the rest into two osm pbfs.
  - For efficiency, also combine africa, antarctica, australia-oceania, central-america, south-america into all_others.osm.pbf.
  - Originally was thinking about doing each successively so each time there are fewer searches needed. But realize that this is problematic because a search can turn up FPs. For example, if I search for Paris with the North America file, I will find some place with the top match but that will not be the one I am looking for. So instead, all `not_found` records need to go through all regions, then compare the importance score afterward.
- 2/24/23:
  - Figure out how to point the input planet file to docker.
  - Run nominatim container using local planet osm. Let it go over the weekend.
- 2/23/23:
  - Was planning to do the search continent by continent, then realize that there is osm file for the whole planet from [OpenStreeMap](https://planet.openstreetmap.org/). Download this instead.
  - Run into issue pointing the downloaded file to the docker image.  
- 2/22/23: 
  - Turned out that the docker command line only point to one region. Download OSM files from [Geofabrik](https://download.geofabrik.de/index.html).
  - There are some issues with email address containing address.
- 2/21/23:
  - Try to query locations using Nominatim servive (OpenStreetMap) via geopy but after couple hundred queries, the connection time out. Likely the service just refuse to handle too many request. Try install Nominatim locally. 
  - [Nominatim Docker version](https://github.com/mediagis/nominatim-docker/tree/master/4.2) and to [get Docker started](https://docs.docker.com/config/daemon/start/).
  - Got `docker: Got permission denied while trying to connect to the Docker daemon socket`. Try [this fix](https://www.digitalocean.com/community/questions/how-to-fix-docker-got-permission-denied-while-trying-to-connect-to-the-docker-daemon-socket).
  - See [this guide](https://www.linkedin.com/pulse/geocoding-geopy-your-own-nominatim-server-chonghua-yin?trk=related_artice_Geocoding%20with%20GeoPy%20and%20Your%20Own%20Nominatim%20Server_article-card_title).
- 2/20/23: 
  - The corpus dataset from 2_5_predict_pubmed does not have author or affiliation info. This needs to be done from the very beginning when I process the pubmed records.
  - In 
[MEDLINE/PubMed Data Element (Field) Descriptions](https://www.nlm.nih.gov/bsd/mms/medlineelements.html), there are several important info:
    - The affiliation of the authors, corporate authors and investigators appear in this repeating field.
      - 1988- The address of the first author's affiliation is included. The institution, city, and state including zip code for U.S. addresses, and country for countries outside of the United States, are included if provided in the journal; sometimes the street address is also included if provided in the journal.
      - 1995-2013 The designation USA is added at the end of the address when the first author's affiliation is in the fifty United States or the District of Columbia.
        - Q: Does this mean that this is not done for records before 1995?
      - 1996- The primary author's electronic mail (e-mail) address is included at the end of the Affiliation field, if present in the journal.
      - 2003- The complete first author address is entered as it appears in the article with no words omitted.
      - October 2013- Quality control of this field ceased in order to accommodate the affiliations for all authors and contributors.
      - December 2014- Multiple affiliations for each author or contributor are included.
        - __Because of this, only 1st author info is considered.__
  - For dealing with countries, there is the issue of historical country names, see [ISSO_3166-3](https://en.wikipedia.org/wiki/ISO_3166-3)

## ___Set up___

### Module import

In conda env `geo`

In [1]:
import pickle, pycountry, subprocess, requests, urllib
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from pathlib import Path
from tqdm import tqdm
from Bio import Entrez, Medline
from time import sleep
from geopy.geocoders import Nominatim
from bs4 import BeautifulSoup

### Key variables

In [2]:
# Reproducibility
seed = 20220609

# Setting working directory
proj_dir   = Path.home() / "projects/plant_sci_hist"
work_dir   = proj_dir / "7_countries"
work_dir.mkdir(parents=True, exist_ok=True)

# plant science corpus with date and other info
dir2        = proj_dir / "2_text_classify//2_5_predict_pubmed"
corpus_file = dir2 / "corpus_plant_421658.tsv.gz"

# timestamp bins
dir44            = proj_dir / "4_topic_model/4_4_over_time"
ts_for_bins_file = dir44 / "table4_4_bin_timestamp_date.tsv"

medline_dir = work_dir / "medline"
medline_dir.mkdir(parents=True, exist_ok=True)

# So PDF is saved in a format properly
mpl.rcParams['pdf.fonttype'] = 42
plt.rcParams["font.family"] = "sans-serif"

## ___Get PubMed records___

### Read plant science corpus and get pmids

In [3]:
pmid_file = work_dir / "pmids.pickle"

# PMID file does not exist
if not pmid_file.is_file():
  # Read corpus file
  corpus = pd.read_csv(corpus_file, compression='gzip', sep='\t')
  # get pmids
  pmids = corpus.PMID.values
  # save pmids
  with open(pmid_file, 'wb') as f:
    pickle.dump(pmids, f)
else:
  with open(pmid_file, "rb") as f:
    pmids = pickle.load(f)

print(pmids.shape)

(421658,)


### Get Pubmed docs using PMIDs


In [4]:
#https://stackoverflow.com/questions/59267992/biopython-how-to-download-all-of-the-peptide-sequences-or-all-records-associat

Entrez.email = 'shius@msu.edu'

id_list  = [str(pmid) for pmid in pmids]
post_xml = Entrez.epost(db='pubmed', id=','.join(id_list))
results  = Entrez.read(post_xml)
webenv   = results['WebEnv']
qkey     = results['QueryKey']

In [5]:
#http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec166

step    = 10000
for begin in tqdm(range(0, len(pmids), step)):
  # first check if this file is present
  medline_file = medline_dir / f"corpus_plant_421658_medline_{begin}.pickle"

  # Check if the file is already there, if so, continue to the next one
  if not medline_file.is_file():
    subset   = pmids[begin:begin+step]

    # Get Medline records for subset
    handle  = Entrez.efetch(db='pubmed', id=subset, rettype='medline', 
                            retmode='text', webenv=webenv, query_key=qkey)
    records  = Medline.parse(handle)
    rec_list = list(records)

    with open(medline_file, "wb") as f:
      pickle.dump(rec_list, f)


100%|██████████| 43/43 [00:00<00:00, 112.07it/s]


### Process PubMed Medline docs

In [6]:
# Read individuial pickle files and compile the full list
all_rec = []
for begin in tqdm(range(0, len(pmids), step)):
  medline_file = medline_dir / f"corpus_plant_421658_medline_{begin}.pickle"
  with open(medline_file, "rb") as f:
    rec_list = pickle.load(f)
  all_rec.extend(rec_list)

# The number of docs don't add up. Some records are not downloaded
len(all_rec)

100%|██████████| 43/43 [01:54<00:00,  2.67s/it]


421585

### Check what's missing

In [7]:
# Go thorugh all downloaded docs and get PMIDs
def check_missing(pmids, all_rec):
  '''
  Args:
    pmids (list): list of integer PMIDs
    all_rec (list): list of dictionary of medline records
  Return:
    id_list_missed (list): list of items in pmids but not all_rec
  '''

  # Downloaded
  pmids_dn = []
  for rec in tqdm(all_rec):
    pmids_dn.append(int(rec['PMID']))
  
  # Compare lists
  #https://stackoverflow.com/questions/15455737/python-use-set-to-find-the-different-items-in-list
  print("differnce:",len(pmids)-len(pmids_dn))

  pmids_ori_set = set(pmids)
  pmids_dn_set  = set(pmids_dn)
  missing = pmids_ori_set - pmids_dn_set
  print("# missing:", len(missing))

  id_list_missed = [str(pmid) for pmid in missing]

  return id_list_missed

In [8]:
# Get the missing records and add to all_rec
id_list_missed = check_missing(pmids, all_rec)

# Get Medline records for subset straight without epost
handle  = Entrez.efetch(db='pubmed', id=id_list_missed, rettype='medline', 
                        retmode='text')
records  = Medline.parse(handle)
rec_list = list(records)

# Can only get 41, so some still missing
print("Retrieved:", len(rec_list))

100%|██████████| 421585/421585 [00:00<00:00, 876778.67it/s]


differnce: 73
# missing: 72
Retrieved: 41


In [9]:
# Save the missing records as pickle
medline_file = medline_dir / "corpus_plant_421658_medline_missed.pickle"

with open(medline_file, "wb") as f:
  pickle.dump(rec_list, f)

In [10]:
# Add to all_rec, then check again
all_rec.extend(rec_list)

In [11]:
still_missing = check_missing(pmids, all_rec)
len(still_missing)

100%|██████████| 421626/421626 [00:00<00:00, 987372.51it/s]


differnce: 32
# missing: 31


31

### Check AU length

In [12]:
ad_len_dict = {}
for rec in tqdm(all_rec):
  if 'AD' in rec:
    ad_len = len(rec['AD'])
    if ad_len not in ad_len_dict:
      ad_len_dict[ad_len] = 1
    else:
      ad_len_dict[ad_len]+= 1

    #if ad_len == 2285:
    #  print(rec['AD'][0])

print(ad_len_dict)
        

100%|██████████| 421626/421626 [00:00<00:00, 792934.38it/s]

{1: 235025, 4: 20509, 5: 20868, 2: 13440, 9: 10572, 6: 20175, 3: 17537, 10: 8071, 8: 13912, 11: 5575, 7: 17230, 16: 1384, 12: 4400, 13: 2978, 15: 1766, 14: 2466, 17: 952, 23: 239, 21: 401, 19: 620, 18: 890, 59: 2, 25: 164, 30: 93, 32: 77, 22: 352, 20: 552, 34: 46, 36: 49, 43: 23, 28: 118, 54: 10, 35: 50, 44: 12, 27: 150, 29: 93, 24: 276, 31: 64, 26: 156, 45: 26, 38: 27, 71: 4, 50: 12, 37: 34, 55: 6, 86: 1, 41: 17, 80: 4, 82: 1, 33: 61, 47: 12, 64: 1, 42: 22, 58: 2, 39: 28, 69: 2, 52: 10, 51: 10, 40: 41, 62: 3, 46: 17, 96: 2, 48: 12, 65: 5, 57: 7, 72: 4, 128: 1, 60: 6, 114: 2, 78: 2, 76: 1, 56: 5, 95: 2, 136: 1, 63: 3, 126: 2, 112: 2, 162: 1, 216: 1, 77: 3, 105: 3, 66: 4, 70: 5, 88: 3, 153: 1, 135: 1, 200: 1, 111: 2, 152: 1, 75: 4, 130: 1, 273: 1, 85: 2, 68: 2, 92: 2, 137: 2, 61: 5, 53: 4, 113: 1, 366: 1, 67: 3, 93: 1, 120: 1, 164: 1, 110: 2, 89: 1, 143: 1, 83: 1, 49: 4, 131: 1, 101: 1, 103: 1, 73: 3, 179: 1, 886: 1, 74: 2, 2285: 1, 79: 1, 139: 1, 166: 1, 168: 1}





### Spot check AD fields

In [13]:
# Use "." as delimiter will work for most, but exceptions:
'''
['Instituto de Fitosanidad, Colegio de Postgraduados, km. 35.5 Carr. 
  Mexico-Texcoco, 56230-Texcoco, Edo. de Mexico, Mexico.']
['Botanisches Institut der Ludwig-Maximilians Universitat, Munchen, F.R.G.']
'''

for idx in range(0, len(all_rec), 1000):
  rec = all_rec[idx]
  if 'AD' in rec:
    print([rec['AD'][0]]) 

['Faculty of Pharmaceutical Sciences, Kumamoto University, Japan.']
["URA Centre National de la Recherche Scientifique 576, Departement de Biologie Moleculaire et Structurale, Centre d'Etudes Nucleaires de Grenoble, France."]
['Department of Biological Sciences, Stanford University, CA 94305-5020.']
['Botanisches Institut der Ludwig-Maximilians Universitat, Munchen, F.R.G.']
['Department of Biochemistry, Temple University School of Medicine, Philadelphia, PA 19140.']
['Department of Biochemistry, Johns Hopkins University, School of Hygiene and Public Health, Baltimore, Maryland 21205.']
['Institut de Biologie Moleculaire des Plantes du CNRS, Strasbourg, France.']
['Department of Agronomy, Purdue University, West Lafayette, Indiana 47907.']
["Departement de Biologie/Service de Biologie Cellulaire, Institut National de la Sante et de la Recherche Medicale U246, Centre d'Etudes Nucleaires de Saclay, Gif sur Yvette, France."]
['Ministry of Agriculture, Fisheries and Food, Slough Laboratory

## ___Search for country codes___

### Set up country dictionaries

In [14]:
# Build {country_name or official_name: alpha_3 code}
countries   = list(pycountry.countries)
cname_to_a3 = {}

for country in countries:
  name_a2    = country.alpha_2
  name_a3    = country.alpha_3
  name_short = country.name

  cname_to_a3[name_a2] = name_a3 # store this for situation like US
  cname_to_a3[name_a3] = name_a3 # store this for sitiation like USA
  cname_to_a3[name_short] = name_a3
  
  # put official name in
  try:
    name_offic = country.official_name
    cname_to_a3[name_offic] = name_a3
  except AttributeError:
    #print("No official name:", name_short)
    name_offic = "NA"

In [15]:
# Also build a dictionary for historical countries
countries_hist = list(pycountry.historic_countries)
cname_hist_to_a3 = {}

for country in countries_hist:

  # the name in historical countries are the official names
  name_offic = country.name
  cname_hist_to_a3[name_offic] = name_a3
  
  name_short = name_offic.split(",")[0]
  cname_hist_to_a3[name_short] = name_a3


In [16]:
# For some issues that arise
suppl_dict = {"UK":"GBR", "The Netherlands":"NLD", "Taiwan":"TWN", 
              "Republic of China":"TWN", "the Netherlands":"NLD"}

### Search for country info with pycountry

In [17]:
def get_location_str(AD, token_idx=-1, debug=0):
  '''Get the potential location string from AD
  Args:
    AD (str): The content in the AD field
    token_idx (int): -1, -2, or 0 (whole thing)
  Return
    location (str): the string that likely contain location info
    errflag (int): the AD info is empty and thus erroneous (1) or not (0)
  '''

  # The first element in the AD list is used (1st author)
  add_str  = AD[0]

  if debug: print("add_str:", add_str)

  # But there are 12 records where the AD field looks like:
  # ['.', '.', '.', '.', '.', '.']
  # So tokens will be "", dealt with in the if-else statement below.
  if add_str == "":
    loc = "NA"
    errflag = 1

  else:
    errflag = 0
    # Multipe authors:
    # ['From xxx, xxx, xxx, xxx, xxx (Ranade, Ganea, Razzak, and Garcia Gil)']
    # Another case:
    # [['xxx, xx, xxx, Maryland 20742 (M.H., G.F.D.).']
    if add_str[-1] == ")" or add_str[-2:] == ").":
      leftMargin = add_str.rfind(" (")
      add_str = add_str[:leftMargin]

    # Email field: Was contemplating using email for country, but some 1st
    # author emails are not the same as for the institution (see 2nd example).

    # 18930883, loc:China E-mail: suzhi1026@163.com
    # Some have ". Email:..."
    if add_str.find("E-mail:") != -1:
      tmp_str = add_str[:add_str.find("E-mail:")]
      # Use space as delimiter, then the empty space is taken care of later
      leftMargin = tmp_str.rfind(" ")
      add_str = add_str[:leftMargin]
      if debug: print("found: 'E-mail:", add_str)

    # 22016614, loc:Poland; E-Mails: agnieszka.pszczolkowska@uwm.edu.pl (A.P.); macieklojko@wp.p
    # 22072902, loc:China; E-Mail: yiruizao@163.com
    if add_str.find("; E-Mail") != -1:
      add_str = add_str[:add_str.find("; E-Mail")]
      if debug: print("found: '; E-Mail", add_str)

    # Below are more example for using @ field for parsing.
    #   case 1: 17296497 ['xxx, Denmark. blah@aki.ku.dk <blah@aki.ku.dk>']
    #   case 2: ?? ['Institut ..., France. achmustilli@libero.it']
    # The next one is weird, look like there is something not parsed properly
    #   case 3: 17632571 ['xxx, xxx, UK. ib 103@mole.bio.cam.ac.uk']
    # Also the next one, so cannot use "." as delimiter first.
    #   case 4: 18613594 ['xxx, ACT2601 Australia. rod.mahon@csiro.au']
    # Ok, some is missing space between country and email... Man...
    #   case 5: 9931476 ['PO Box 12, Rehovot 76100, Israel.cohenk@agri.huji.ac.il']
    if add_str.find("@") != -1:

      # Find where the 1st email address is and generate a temp_str
      tmp_str = add_str[:add_str.find("@")]
      
      # Originally using space, but the 3rd example shows that it is not good.
      # But then "." is regularly used in email address. So do space delimter
      # first, then use "."
      tmp_str = tmp_str[:tmp_str.rfind(" ")] # this takes care of case 4
      leftMargin = tmp_str.rfind(".")        # this takes care of case 3
      if leftMargin == -1:                   # this takes care of case 5
        leftMargin = tmp_str.rfind(",")
      add_str = add_str[:leftMargin]
      if debug: print("found: '@", add_str)

    # ISNI code in ~7k records
    #   30263677 ['27601 Republic of Korea. ISNI: 0000 0004 1775 9398. GRID: grid.444122.5'
    if add_str.find("ISNI:") != -1:
      tmp_str = add_str[:add_str.find("ISNI:")]
      leftMargin = tmp_str.rfind(".")
      add_str = add_str[:leftMargin]

    # Strip empty space before
    add_str = add_str.strip()

    if debug: print("final add_str:",[add_str])

    # Some just have email address so after parsing, add_str is "".
    #   25548975: '. kehrig@pharmazie.uni-kiel.de.'
    if add_str == "":
      loc = "NA"
      errflag = 1
    # if tokens ends with ".", rid of it
    else:
      if add_str[-1] == ".": 
        add_str = add_str[:-1]

      # Originally splot with ", " then "." but there are edge cases like this:
      #   17444520, loc:Hsinchu,Taiwan
      # So split with ",", if it does not exist, split with " "
      if "," not in add_str:
        # Only one large token, split with space instead
        tokens = add_str.split(" ")
        try:
          loc = tokens[token_idx]
        except IndexError:
          loc = "NA"
          errflag = 1
      else:
        tokens = add_str.split(",")
        try:
          # rid of space if present
          loc = tokens[token_idx].strip()
        except IndexError:
          loc = "NA"
          errflag = 1

      # 19140172, loc:IR Iran, there are other variations. OpenStreeMap cannot
      # find these so deal with them manually.
      if loc.endswith(" Iran"):
        loc = loc[loc.find("Iran"):]

      # 19651701, loc:Taiwan ROC
      if loc.endswith(" ROC"):
        loc = loc[:loc.rfind(" ")]

      # 21299880, loc:DF- 70770-917 - Brasil
      if loc.find("- ") != -1:
        loc = loc.split("- ")[-1]

      # 1915409, loc:Stuttgart/Bundesrepublik Deutschland
      if loc.find("/") != -1:
        loc = loc.split("/")[-1]

    if debug: print(loc)

  return loc, errflag

In [18]:
def get_a3(location, cname_to_a3, cname_hist_to_a3, suppl_dict):
  # Found current country name
  if location in cname_to_a3:
    a3 = cname_to_a3[location]
  # Found historical country name
  elif location in cname_hist_to_a3:
    a3 = cname_hist_to_a3[location]
  # Found name in suppl dict
  elif location in suppl_dict:
    a3 = suppl_dict[location]
  # Leave this for geopy in the next step
  else:
    a3 = 'NA'

  return a3

In [20]:
# Without country a3
# Before checking for US state: 24867

country_info = {} # {pmid:[first_AU, first_AD, alpha_3]}
not_found    = {} # {pmid:[AU, AD]}, for records with no a3 code
count_AD_NA  = 0  # count records without AD field
count_a3     = 0  # count records with a3 code

# Go through each record
for rec in tqdm(all_rec):
  pmid = rec["PMID"]
  a3   = "NA" # set default value

  # Deal with AU info
  try:
    AU = rec["AU"]
  except KeyError:
    AU = "NA"  

  # Deal with AD info
  try:
    AD = rec["AD"]
    debug = 0
    #if pmid == "1308645":
    #  debug = 1
    #else:
    

    loc, errflag = get_location_str(AD, -1, debug)
    if not errflag:
      a3 = get_a3(loc, cname_to_a3, cname_hist_to_a3, suppl_dict)
      # Leave this for geopy in the next step
      if a3 == "NA":
        not_found[pmid] = [AU, AD]
    # The AD field is effectively empty so AD is "NA"
    else:
      AD = "NA"
      count_AD_NA += 1

  # AD field does not exist
  except KeyError:
    AD = "NA"
    count_AD_NA += 1

  if a3 != "NA":
    count_a3 += 1

  # For AD is NA, no point in doing more, so include these in country_info. 
  # For those with a3 code, they are done, so include these in country info.
  if AD == "NA" or a3 != "NA":
    country_info[pmid] = [AU, AD, a3]

  #print(pmid, AD, a3)

print("Total   :", len(all_rec))
print("With a3 :", count_a3)
print("No AD   :", count_AD_NA)
print("To geopy:", len(not_found.keys()))

100%|██████████| 421626/421626 [00:03<00:00, 127189.47it/s]

Total   : 421626
With a3 : 329237
No AD   : 19851
To geopy: 72502





In [21]:
# Save country_info
country_info_file = work_dir / "country_info-pycountry.pickle"
with open(country_info_file, "wb") as f:
  pickle.dump(country_info, f)

### Spot check again

In [22]:
for idx, key in enumerate(not_found):
  print(key, not_found[key][1])
  if idx == 1000:
    break

803110 ['Division of Hematology and Oncology, Ohio State University College of Medicine, Columbus 43210.']
1279697 ['Department of Entomology, University of Illinois, Urbana 61801.']
1279702 ['Sandoz Agro, Inc., Palo Alto, CA 94304.']
1280165 ['Isotope and Structural Chemistry Division, Los Alamos National Laboratory, NM 87545.']
1280601 ['Institute of Biophysics, Czechoslovak Academy of Sciences, Brno.']
1280857 ['Department of Biological Chemistry and Biophysics, University of Michigan, Ann Arbor 48109.']
1281435 ['Plant Biology Division, Samuel Roberts Noble Foundation, Ardmore, OK 73402.']
1281438 ['Life Science Research Laboratory, Japan Tobacco Inc., Kanagawa.']
1281482 ['Department of Energy-Plant Research Laboratory, Michigan State University, East Lansing 48824.']
1281700 ['Department of Genetics, Harvard Medical School, Boston, Massachusetts.']
1281816 ['Department of Biology, Yale University, New Haven, Connecticut 06511.']
1282045 ['Department of Botany and Plant Pathology,

## ___Nominatim setup___

### Docker setup and initial Nominatim run

Tried a few options:
- [This tutorial](https://www.linkedin.com/pulse/geocoding-geopy-your-own-nominatim-server-chonghua-yin?trk=related_artice_Geocoding%20with%20GeoPy%20and%20Your%20Own%20Nominatim%20Server_article-card_title).
- Installation: quite challenging, did not finish. Use [Docker image](https://github.com/mediagis/nominatim-docker/blob/master/4.2/README.md) instead.
- Query through the Nominatim service resulted in time out. So use a docker version
  - Intial docker run: found out that only Monaco was included. 
- Look for ways to load the whole world.
  - Found and downloaded `planet-230213.osm.pbf` from [OpenStreeMap](https://planet.openstreetmap.org/).
  - [Not needed] Figure out [how to transfer files between local host and container](https://www.edureka.co/community/10534/copying-files-from-host-to-docker-container) and move all pbf files into docker container and commit the image with files.
  - Found out docker bind mount option (-v) so local file system can be mounted for use in the container.
  - Run out of memory
- Do each continent instead
  - Europe (26.4G):
    - It ends with error.
    - Try merge multiple smaller files instead with `osmium`, see [this post](https://gis.stackexchange.com/questions/242704/merging-osm-pbf-files) on install and merging osm.pbf files.
  - North America (12.2G): 
    - osm2pgsql took 8240s (2h 17m 20s) overall


In [None]:
# Install and start docker daemon:
'''bash
sudo apt-get update
sudo apt-get upgrade
sudo apt install docker.io
sudo dockerd
'''

# Test Nominatim docker image:
'''bash
sudo groupadd docker
sudo usermod -aG docker shius
su -s shius
docker run hello-world # testing
docker run -it \
  -e PBF_URL=https://download.geofabrik.de/europe/monaco-latest.osm.pbf \
  -e REPLICATION_URL=https://download.geofabrik.de/europe/monaco-updates/ \
  -p 8080:8080 \
  --name nominatim \
  mediagis/nominatim:4.2
'''

# Testing, in browser:
#-http://localhost:8080/search.php?q=avenue%20pasteur

'bash\nsudo groupadd docker\nsudo usermod -aG docker shius\nsu -s shius\ndocker run hello-world # testing\ndocker run -it   -e PBF_URL=https://download.geofabrik.de/europe/monaco-latest.osm.pbf   -e REPLICATION_URL=https://download.geofabrik.de/europe/monaco-updates/   -p 8080:8080   --name nominatim   mediagis/nominatim:4.2\n'

### Merge osm.pbf files

#### Get files

The Europe osm.pbf file (26.4Gb) cannot be loaded but the North America file work (12.2Gb). 
- So try to download individual country files then merge them into smaller files with osmium.
  - Found out that the Europe folder does not have Russian Federation file. Download it by itself and rerun below.
- [Extract all the URLs from the webpage Using Python](https://www.geeksforgeeks.org/extract-all-the-urls-from-the-webpage-using-python/)
- Save the european files into `/home/shius/data_nominatim/continent_osm/europe` by running `script_7_1b_get_europe_osm_pbfs.py`.
  - Note the following are excluded: `["alps", "britain-and-ireland", "dach"]`
  - Because of this, the data is now below 20Gb, so will break them into 2 subgroups.
- [Get linked file size](https://stackoverflow.com/questions/55226378/how-can-i-get-the-file-size-from-a-link-without-downloading-it-in-python)
- Then run the following to:
  - Get groups.
  - [Merge multiple osm pbf files](https://gis.stackexchange.com/questions/242704/merging-osm-pbf-files): try 2nd answer.

Also merge all other continents except North America, Asia, and Europe.


In [23]:
# Generate file name dictionary: {file size: name}
size_name_europe = {}

# Special subregions to exclude so they are not overlapping
exclude = ["alps-latest.osm.pbf", 
           "britain-and-ireland-latest.osm.pbf", 
           "dach-latest.osm.pbf"]

# get the soup obj
url = 'https://download.geofabrik.de/europe/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')

# Go through links
for link in soup.find_all('a'):
  name = link.get('href')
  if name.endswith('-latest.osm.pbf') and name not in exclude:
    #print(name)
    req = urllib.request.Request(f"{url}{name}", method='HEAD')
    f   = urllib.request.urlopen(req)
    
    # size in Gb
    size = int(f.headers['Content-Length'])/(1024*1024*1024)
    size_name_europe[size] = name

# The above does not include Runssian Federation, add it in manually
size_name_europe[3.2] = "russia-latest.osm.pbf"

In [24]:
# sort by size and break them into 2 groups with ~9Gb each
sizes = list(size_name_europe.keys())
sizes.sort()

groups         = [] # list of subgroups
subgroup       = [] # name of country in a subgroup
subgroup_total = 0  # total within subgroup
threshold_size = 12 # threshold size for a subgroup
for size in sizes:
  if subgroup_total < threshold_size:
    subgroup.append(size_name_europe[size])
    subgroup_total += size
  else:
    print(subgroup_total)
    groups.append(subgroup)
    # reset
    subgroup       = []
    subgroup_total = 0

# Add the last subgroup
if subgroup != []:
  print(subgroup_total)
  groups.append(subgroup)

13.08294640481472
12.736804478615522


In [25]:
for i in groups:
  print(i)

['monaco-latest.osm.pbf', 'andorra-latest.osm.pbf', 'liechtenstein-latest.osm.pbf', 'guernsey-jersey-latest.osm.pbf', 'isle-of-man-latest.osm.pbf', 'faroe-islands-latest.osm.pbf', 'malta-latest.osm.pbf', 'azores-latest.osm.pbf', 'macedonia-latest.osm.pbf', 'kosovo-latest.osm.pbf', 'cyprus-latest.osm.pbf', 'montenegro-latest.osm.pbf', 'luxembourg-latest.osm.pbf', 'albania-latest.osm.pbf', 'iceland-latest.osm.pbf', 'moldova-latest.osm.pbf', 'georgia-latest.osm.pbf', 'estonia-latest.osm.pbf', 'latvia-latest.osm.pbf', 'bosnia-herzegovina-latest.osm.pbf', 'bulgaria-latest.osm.pbf', 'serbia-latest.osm.pbf', 'croatia-latest.osm.pbf', 'lithuania-latest.osm.pbf', 'hungary-latest.osm.pbf', 'romania-latest.osm.pbf', 'slovakia-latest.osm.pbf', 'ireland-and-northern-ireland-latest.osm.pbf', 'slovenia-latest.osm.pbf', 'belarus-latest.osm.pbf', 'greece-latest.osm.pbf', 'portugal-latest.osm.pbf', 'switzerland-latest.osm.pbf', 'denmark-latest.osm.pbf', 'turkey-latest.osm.pbf', 'belgium-latest.osm.pbf',

#### Merge Europe files

When loading the merged pbf file, encountered a error:
- `ERROR: Input data is not ordered: relation id 33702 appears more than once`
- Solution, use osmium merge-changes
  - Run out of memory locally (64G) and job killed.
  - Try to run in HPC with a docker image
    - But HPC does no allow docker, need to use singularity
  - Europe1: In `dev-intel16-k80`
    - Peak memory: virtual-509g, physical-167Gb
  - Europe2 Process kill `dev-intel16-k80`, use `dev-amd20` instead.
    - Peak memory: virtual-601g, physical-209g

In [26]:
# In /home/shius/projects/plant_sci_hist/7_countries/continent_osm/europe

'''bash
mkdir europe1 europe2
mv italy-latest.osm.pbf russia-latest.osm.pbf germany-latest.osm.pbf france-latest.osm.pbf europe1/
mv *.pbf europe2/
'''

###############
# NOT WORKING #
###############
# cd europe1
# osmium merge * -o europe1.osm.pbf
# cd ../europe2
# osmium merge * -o europe2.osm.pbf

'\nmkdir europe1 europe2\nmv italy-latest.osm.pbf russia-latest.osm.pbf germany-latest.osm.pbf france-latest.osm.pbf europe1/\nmv *.pbf europe2/\n\ncd europe1\nosmium merge * -o europe1.osm.pbf\n\ncd ../europe2\nosmium merge * -o europe2.osm.pbf\n'

In [None]:
# Europe1 merging
# move files to HPC
# In HPC: dev-intel16-k80
'''bash
# Pull and run docker image
singularity pull docker://stefda/osmium-tool
singularity run docker://stefda/osmium-tool

# In singularity
cd /mnt/home/shius/projects/plant_sci_hist/7_countries/

# Merge
osmium merge-changes -s -v -o europe1.osm.pbf *.pbf
'''

In [None]:
# Europe2 merging
# Move merged file, delete individual country files, upload europe2 country
# files, in dev-amd20
'''bash
singularity run docker://stefda/osmium-tool

cd /mnt/home/shius/projects/plant_sci_hist/7_countries/

osmium merge-changes -s -v -o europe2.osm.pbf *.pbf
'''

#### Merge all other continents

North American, Europe, and Asia are big files. The other continents are:
- africa, antarctica, australia-oceania, central-america, south-america -- merged into all_others.osm.pbf

In [27]:
# In /home/shius/data_nominatim/continent_osm
'''
mkdir all_others
mv africa-latest.osm.pbf antarctica-latest.osm.pbf australia-oceania-latest.osm.pbf central-america-latest.osm.pbf south-america-latest.osm.pbf all_others/
cd all_others
'''

# move files to HPC
# pull and run docker image in singularity
'''
singularity run docker://stefda/osmium-tool

cd /mnt/home/shius/projects/plant_sci_hist/7_countries

osmium merge-changes -s -v -o all_others.osm.pbf *.pbf
'''

'\nmkdir all_others\nmv africa-latest.osm.pbf antarctica-latest.osm.pbf australia-oceania-latest.osm.pbf central-america-latest.osm.pbf south-america-latest.osm.pbf all_others/\ncd all_others\nosmium merge * -o all_others.osm.pbf\n'

## ___Continue to identify location using geopy___

### Search function

In [28]:
def call_geolocator(geolocator, AD, token_idx, suppl_dict, debug=0):
  '''Subrontine for calling geolocator
  Args
    AD (list): A list of addresses for authors
    token_idx (int): define which token the location string should be obtained
      default to -1 which is typically where the broadest info (e.g., city,
      zip code) is located. If this does not work, will try -2 token, or 0
      which means the entire AD string will be used for geolocator search.
  Return:
    a3 (string): the a3 country code, if not found, return empty string.
    geo (geolocator): the object returned from the search
    err_str (string): If an exception is thrown, this is the error string
  '''
  # Get location string
  loc, errflag   = get_location_str(AD, token_idx)

  if debug: print(loc, errflag)

  # This happens when there is only one token delimited by "," but the token_idx
  # is set to -2
  if errflag == 1:
    return "", None, "Only_1_token"

  # Call Nominatim to get a response:
  err_str = "NO_ERR"
  try:
    geo  = geolocator.geocode(loc, language='en')
  except Exception as ex:
    err_str = str(ex)
    geo = None

  if geo is not None:
    country = geo.raw['display_name'].split(", ")[-1]
    a3 = get_a3(country, cname_to_a3, cname_hist_to_a3, suppl_dict)
  else:
    a3 = "" 

  return a3, geo, err_str

In [29]:
def call_nominatim(nominatim_nf, not_found, suppl_dict, dir_pmid_log, dir_out,
                   sleep_time=0.5):
  '''Search entries against Nominatim server
  Args:
    nominatim_nf (dict): {pmid:[AU,AD]}, record still not found
    not_found (dict): {pmid:[AU, AD]}, for records with no a3 code after
      pycountry search
    suppl_dict (dict): regions that do not have proper pycountry info
    dir_pmid_log (Path): path to pmid log file of completed searches
    dir_out (Path): path to geolocator outputs
    sleep_time (int): time in second between queries.
  Return:
    nominatim_nf (dict): {pmid:[AU, AD]} for pmids with NA as search results
  '''

  # Access local Nominatim server
  geolocator = Nominatim(domain=f'localhost:8080', scheme='http')

  # Found info, because timeout keep happening, decide to save results as
  # things go.
  #nominatim_out = {} # {pmid:[AU, AD, a3, geo.raw]}

  # Info not found
  nominatim_nf  = {} # {pmid:[AU, AD]}

  # Because I keep getting time out, try to track what's working so I can
  # continue what what's not.
  # Create directory for output search result files
  dir_out.mkdir(parents=True, exist_ok=True)

  # Get the last pmid with result
  if not dir_pmid_log.is_file():
    out_names     = ""
    last_out_name = ""
    print("Starting with no output yet")
  else:
    with open(dir_pmid_log, "r") as f:
      out_names = f.readline()
      out_names_list = out_names.split(' ')
      out_names_list.sort()
      last_out_name = out_names_list[-1]
      print("Started, last_out_name:", last_out_name)

  # Save the log file again
  with open(dir_pmid_log, "w") as f:
    # Write the names aleady processed
    f.write(out_names)

    # sort pmids
    pmids = list(not_found.keys())
    pmids.sort()

    # determine where to restart
    if last_out_name == "":
      starting_idx = 0
    else:
      starting_idx = pmids.index(last_out_name)+1

    pmids_remaining = pmids[starting_idx:]

    # Go through records with no a3 info  
    for pmid in tqdm(pmids_remaining):

      err_str1, err_str2, err_str3 = "", "", ""

      # Get AU and AD
      [AU, AD] = not_found[pmid]

      # Get location string based on the first author's AD field
      a3, geo, err_str1 = call_geolocator(geolocator, AD, -1, suppl_dict)

      # Not found using the last field
      if geo is None:
        # Try the last second field
        a3, geo, err_str2 = call_geolocator(geolocator, AD, -2, suppl_dict)

        # Still not found
        if geo is None:
          # Try using the whole thing
          a3, geo, err_str3 = call_geolocator(geolocator, AD, 0, suppl_dict)
          

      if geo is None:
        nominatim_nf[pmid] = [AU, AD]
        geo_file = dir_out / f"{pmid}_na.txt"
        with open(geo_file, "w") as f_geo:
          f_geo.write(f"{AU}\t{AD}\tNA\t{None}\t{err_str1},{err_str2},{err_str3}")
      else:
        # Save result instead of put it in dictionary
        #nominatim_out[pmid] = [AU, AD, a3, geo.raw]
        geo_file = dir_out / f"{pmid}.txt"
        with open(geo_file, "w") as f_geo:
          f_geo.write(f"{AU}\t{AD}\t{a3}\t{geo.raw}")

      # Write the pmid of this record into after search result is returned
      f.write(f" {pmid}")

      # To reduce possibilities of timeout
      sleep(sleep_time)

  return nominatim_nf

In [30]:
# Deprecated because there are some issues with handling some weird records
# E.g., 16656381
'''
def recursive_nominatim(nominatim_nf, not_found, suppl_dict, dir_pmid_log, 
                                    dir_nominatim_na_out, err_recur=0):
  try:
    return call_nominatim(nominatim_nf, not_found, suppl_dict, dir_pmid_log, 
                          dir_nominatim_na_out, err_recur)
  except Exception as ex:
    #print("ERROR:", ex, err_recur, 1)
    return recursive_nominatim(nominatim_nf, not_found, suppl_dict, 
                               dir_pmid_log, dir_nominatim_na_out, 1)
'''

'\ndef recursive_nominatim(nominatim_nf, not_found, suppl_dict, dir_pmid_log, \n                                    dir_nominatim_na_out, err_recur=0):\n  try:\n    return call_nominatim(nominatim_nf, not_found, suppl_dict, dir_pmid_log, \n                          dir_nominatim_na_out, err_recur)\n  except Exception as ex:\n    #print("ERROR:", ex, err_recur, 1)\n    return recursive_nominatim(nominatim_nf, not_found, suppl_dict, \n                               dir_pmid_log, dir_nominatim_na_out, 1)\n'

### Search region 1: north america

#### Nominatim server setup

In [335]:
# Set up server
'''bash
docker run -it \
  -e PBF_PATH=/nominatim/data/north-america-latest.osm.pbf \
  -p 8080:8080 \
  -v /home/shius/data_nominatim/continent_osm/:/nominatim/data \
  --name nominatim_na \
  mediagis/nominatim:4.2
'''

# Export container
'''bash
# Find container ID
docker ps
docker export 3c9d0c835455 > nominatim_north-america.tar
'''

# Load container (did not try)
# REPOSITORY: mediagis/nominatim
# TAG: 4.2
'''bash
cat nominatim_north-america.tar | docker import - mediagis/nominatim:4.2
docker run -it \
  -e PBF_PATH=/nominatim/data/north-america-latest.osm.pbf \
  -p 8080:8080 \
  -v /home/shius/data_nominatim/continent_osm/:/nominatim/data \
  --name nominatim_na \
  mediagis/nominatim:4.2
'''

'bash\ndocker export 3c9d0c835455 > nominatim_north-america.tar\n'

#### Get nominatim search results

In [341]:
# call nominatim and get the dictionary for records not found still

# Define output dir
dir_nominatim_na_out = work_dir / "nominatim_na_out"

# Define pmid log file
dir_pmid_log = work_dir / "log_nominatim_na_pmids"

# nominatim search north america, record not found
nominatim_na_nf = {}

nominatim_na_nf = call_nominatim(nominatim_na_nf, not_found, suppl_dict, 
                                 dir_pmid_log, dir_nominatim_na_out, 0.1)


Started, last_out_name: 17295415


100%|██████████| 81225/81225 [4:32:01<00:00,  4.98it/s]   


In [342]:
# Save the nominatim_na obj

# nominatim north american with found records
# Decide not to create this. Generate output files for each records instead

#nominatim_na_file    = work_dir / "country_info-nominatim_na.pickle"
#with open(nominatim_na_file, "wb") as f:
#  pickle.dump(nominatim_na, f)

# nominatim north american records not found
nominatim_na_nf_file = work_dir / "country_info-nominatim_na_NF.pickle"
with open(nominatim_na_nf_file, "wb") as f:
  pickle.dump(nominatim_na_nf, f)

#### Search gain

In case some are not found because of timeout

In [343]:
# call nominatim and get the dictionary for records not found still

# Define output dir
dir_nominatim_na_out2 = work_dir / "nominatim_na_out2"

# Define pmid log file
dir_pmid_log2 = work_dir / "log_nominatim_na_pmids2"

# nominatim search north america, record not found
nominatim_na_nf2 = {}

nominatim_na_nf2 = call_nominatim(nominatim_na_nf2, nominatim_na_nf, suppl_dict, 
                                  dir_pmid_log2, dir_nominatim_na_out2, 0.1)


Starting with no output yet


100%|██████████| 14735/14735 [46:06<00:00,  5.33it/s] 


In [352]:
# Save the nominatim_na obj

# nominatim north american with found records
# Decide not to create this. Generate output files for each records instead

#nominatim_na_file    = work_dir / "country_info-nominatim_na.pickle"
#with open(nominatim_na_file, "wb") as f:
#  pickle.dump(nominatim_na, f)

# nominatim north american records not found
nominatim_na_nf_file2 = work_dir / "country_info-nominatim_na_2_NF.pickle"
with open(nominatim_na_nf_file2, "wb") as f:
  pickle.dump(nominatim_na_nf2, f)

In [353]:
len(nominatim_na_nf2.keys())

14674

### Search region 2: asia

#### Docker command

In [31]:
# Run nominatim and build database
'''bash
docker run -it \
  -e PBF_PATH=/nominatim/data/asia-latest.osm.pbf \
  -p 8080:8080 \
  -v /home/shius/data_nominatim/continent_osm/:/nominatim/data \
  --name nominatim_as \
  mediagis/nominatim:4.2
'''

# Export container
'''bash
docker ps
docker export 2828854c2700 > nominatim_asia.tar
'''

'bash\ndocker ps\ndocker export 2828854c2700 > nominatim_asia.tar\n'

#### Asia run 1

In [32]:
dir_nominatim_as_out = work_dir / "nominatim_as_out"       # output dir
dir_pmid_log_as      = work_dir / "log_nominatim_as_pmids" # pmid log file
nominatim_as_nf      = {}                                  # record not found

nominatim_as_nf = call_nominatim(nominatim_as_nf, not_found, suppl_dict, 
                                 dir_pmid_log_as, dir_nominatim_as_out, 0.1)


Starting with no output yet


100%|██████████| 72502/72502 [3:38:04<00:00,  5.54it/s]   


In [33]:
# Save the nominatim_na obj
nominatim_as_nf_file = work_dir / "country_info-nominatim_as_NF.pickle"
with open(nominatim_as_nf_file, "wb") as f:
  pickle.dump(nominatim_as_nf, f)

In [34]:
print(len(list(nominatim_as_nf.keys())))

6531


#### Asia run 2

In [35]:
dir_nominatim_as_out2 = work_dir / "nominatim_as_out2"       # output dir
dir_pmid_log_as2      = work_dir / "log_nominatim_as_pmids2" # pmid log file
nominatim_as_nf2      = {}                                  # record not found

nominatim_as_nf2 = call_nominatim(nominatim_as_nf2, nominatim_as_nf, suppl_dict, 
                                 dir_pmid_log_as2, dir_nominatim_as_out2, 0.1)


Starting with no output yet


100%|██████████| 6531/6531 [21:19<00:00,  5.10it/s]  


In [36]:
# Save the nominatim_na obj
nominatim_as_nf_file2 = work_dir / "country_info-nominatim_as_2_NF.pickle"
with open(nominatim_as_nf_file2, "wb") as f:
  pickle.dump(nominatim_as_nf2, f)

### Search region 3: europe subgroup 1

In [None]:
# Run nominatim and build database
'''bash
docker run -it \
  -e PBF_PATH=/nominatim/data/europe-latest.osm.pbf \
  -p 8080:8080 \
  -v /home/shius/data_nominatim/continent_osm/:/nominatim/data \
  --name nominatim_eu \
  mediagis/nominatim:4.2
'''

# Export container
'''bash
docker ps
docker export 3c9d0c835455 > nominatim_europe1.tar
'''

In [None]:
# Run nominatim and build database
'''bash
docker run -it \
  -e PBF_PATH=/nominatim/data/europe1.osm.pbf \
  -p 8080:8080 \
  -v /home/shius/data_nominatim/continent_osm/:/nominatim/data \
  --name nominatim_eu1 \
  mediagis/nominatim:4.2
'''

# Export container
'''bash
docker ps
docker export 3c9d0c835455 > nominatim_europe1.tar
'''

#### Europe subgroup 1 run 1

In [None]:
dir_nominatim_eu1_out = work_dir / "nominatim_eu1_out"       # output dir
dir_pmid_log_eu1      = work_dir / "log_nominatim_eu1_pmids" # pmid log file
nominatim_eu1_nf      = {}                                   # record not found

nominatim_eu1_nf = call_nominatim(nominatim_eu1_nf, not_found, suppl_dict, 
                                  dir_pmid_log_eu1, dir_nominatim_eu1_out, 0.1)


In [None]:
# Save the nominatim_na obj
nominatim_eu1_nf_file = work_dir / "country_info-nominatim_eu1_NF.pickle"
with open(nominatim_eu1_nf_file, "wb") as f:
  pickle.dump(nominatim_eu1_nf, f)

#### Europe subgroup 1 run 2

In [None]:
dir_nominatim_eu1_out2 = work_dir / "nominatim_eu1_out2"       # output dir
dir_pmid_log_eu12      = work_dir / "log_nominatim_eu1_pmids2" # pmid log file
nominatim_eu1_nf2      = {}                                  # record not found

nominatim_as_nf2 = call_nominatim(nominatim_eu1_nf2, nominatim_eu1_nf, suppl_dict, 
                                 dir_pmid_log_eu12, dir_nominatim_eu1_out2, 0.1)


In [None]:
# Save the nominatim_na obj
nominatim_eu1_nf_file2 = work_dir / "country_info-nominatim_eu1_2_NF.pickle"
with open(nominatim_eu1_nf_file2, "wb") as f:
  pickle.dump(nominatim_eu1_nf2, f)

### Search region 4: europe subgroup 2

In [None]:
# Run nominatim and build database
'''bash
docker run -it \
  -e PBF_PATH=/nominatim/data/europe2.osm.pbf \
  -p 8080:8080 \
  -v /home/shius/data_nominatim/continent_osm/europe/europe2:/nominatim/data \
  --name nominatim_eu2 \
  mediagis/nominatim:4.2
'''

# Export container
'''bash
docker export 3c9d0c835455 > nominatim_europe2.tar
'''

#### Europe subgroup 2 run 1

In [None]:
dir_nominatim_eu2_out = work_dir / "nominatim_eu2_out"       # output dir
dir_pmid_log_eu2      = work_dir / "log_nominatim_eu2_pmids" # pmid log file
nominatim_eu2_nf      = {}                                   # record not found

nominatim_eu2_nf = call_nominatim(nominatim_eu2_nf, not_found, suppl_dict, 
                                  dir_pmid_log_eu2, dir_nominatim_eu2_out, 0.1)


In [None]:
# Save the nominatim_na obj
nominatim_eu2_nf_file = work_dir / "country_info-nominatim_eu2_NF.pickle"
with open(nominatim_eu2_nf_file, "wb") as f:
  pickle.dump(nominatim_eu2_nf, f)

#### Europe subgroup 2 run 2

In [None]:
dir_nominatim_eu2_out2 = work_dir / "nominatim_eu2_out2"       # output dir
dir_pmid_log_eu22      = work_dir / "log_nominatim_eu2_pmids2" # pmid log file
nominatim_eu2_nf2      = {}                                  # record not found

nominatim_as_nf2 = call_nominatim(nominatim_eu2_nf2, nominatim_eu2_nf, suppl_dict, 
                                 dir_pmid_log_eu22, dir_nominatim_eu2_out2, 0.1)


In [None]:
# Save the nominatim_na obj
nominatim_eu2_nf_file2 = work_dir / "country_info-nominatim_eu2_2_NF.pickle"
with open(nominatim_eu2_nf_file2, "wb") as f:
  pickle.dump(nominatim_eu2_nf2, f)

### Search region 5: all others

`all_others.osm.pbf`
- africa
- antarctica
- australia-oceania
- central-america
- south-america
- total = ~10.4Gb

In [None]:
# Run nominatim and build database
'''bash
docker run -it \
  -e PBF_PATH=/nominatim/data/all_others.osm.pbf \
  -p 8080:8080 \
  -v /home/shius/data_nominatim/continent_osm/:/nominatim/data \
  --name nominatim_ao \
  mediagis/nominatim:4.2
'''

# Export container
'''bash
docker export 3c9d0c835455 > nominatim_all_others.tar
'''

#### All others run 1

In [None]:
dir_nominatim_ao_out = work_dir / "nominatim_ao_out"       # output dir
dir_pmid_log_ao      = work_dir / "log_nominatim_ao_pmids" # pmid log file
nominatim_ao_nf      = {}                                  # record not found

nominatim_ao_nf = call_nominatim(nominatim_ao_nf, not_found, suppl_dict, 
                                 dir_pmid_log_ao, dir_nominatim_ao_out, 0.1)


In [None]:
# Save the nominatim_na obj
nominatim_ao_nf_file = work_dir / "country_info-nominatim_ao_NF.pickle"
with open(nominatim_ao_nf_file, "wb") as f:
  pickle.dump(nominatim_ao_nf, f)

#### All others run 2

In [None]:
dir_nominatim_ao_out2 = work_dir / "nominatim_ao_out2"       # output dir
dir_pmid_log_ao2      = work_dir / "log_nominatim_ao_pmids2" # pmid log file
nominatim_ao_nf2      = {}                                  # record not found

nominatim_ao_nf2 = call_nominatim(nominatim_ao_nf2, nominatim_ao_nf, suppl_dict, 
                                 dir_pmid_log_ao2, dir_nominatim_ao_out2, 0.1)


In [None]:
# Save the nominatim_na obj
nominatim_ao_nf_file2 = work_dir / "country_info-nominatim_ao_2_NF.pickle"
with open(nominatim_ao_nf_file2, "wb") as f:
  pickle.dump(nominatim_ao_nf2, f)

## ___Test___

### pycountry and zip code

#### pycountry

In [None]:
list(pycountry.countries)[0]

In [None]:
rec = all_rec[4004]
au = rec["AU"]
ad = rec["AD"]
au, ad

In [None]:
pycountry.subdivisions.lookup("Urbana")

In [None]:
country = pycountry.countries.get(name="Yugoslavia")
print(country)

In [None]:
country = pycountry.historic_countries.get(name="Yugoslavia")
print(country)

In [None]:
list(pycountry.historic_countries)

#### uszipcode

In [None]:
from uszipcode import SearchEngine

sr = SearchEngine()
z = sr.by_zipcode("02167")
print(z)

#### zipcodes

In [None]:
import zipcodes

exact_zip = zipcodes.matching('02167')
print(exact_zip)

#### us

In [None]:
import us

print(us.states.lookup('24'))
print(us.states.lookup('bleh'))

### Checking medline parse

In [None]:
test_file = medline_dir / 'corpus_plant_421658_medline_100000.pickle'

with open(test_file, 'rb') as f:
    test_medline = pickle.load(f)

In [None]:
# Check those with country_info records with non NA country code
c = 0
for rec in test_medline:
    pmid = rec['PMID']
    if pmid in country_info and country_info[pmid][2] != 'NA':
        print("---\nPMID:", pmid)
        print("Country:", country_info[pmid][2])
        print("AD  :", rec['AD'])

        if c == 5:
            break
        c += 1

In [None]:
# Check those without country_info records
c = 0
for rec in test_medline:
    pmid = rec['PMID']
    if pmid not in country_info:
        print("---\nPMID:", pmid)
        if 'AD' in rec:
          print("AD  :", rec['AD'])
          print("loc :", not_found2[pmid])
        if c == 5:
            break
        c += 1

### Nominatim setup

#### [mediagis/nominatim-docker](https://github.com/mediagis/nominatim-docker/blob/master/4.2/README.md)

Starting out with this, realize that different regions need to be downloaded.

In [None]:
'''
TESTING
---
# Start Nominatim container in another terminal
docker run -it \
  -e PBF_URL=https://download.geofabrik.de/europe/monaco-latest.osm.pbf \
  -e REPLICATION_URL=https://download.geofabrik.de/europe/monaco-updates/ \
  -p 8080:8080 \
  --name nominatim \
  mediagis/nominatim:4.2

# Find container id in another terminal, for this instance: cce4a525a8e2
docker ps

# Open another terminal and copy planet osm psf file into container
# Actually, this is not necessary, because I can use bind mounts (docker run -v)
#docker cp ~/data_nominatim/planet-230213.osm.pbf cce4a525a8e2:/nominatim/

# Start bash shell for the container and check if the file is there
docker exec -it cce4a525a8e2 bash
ls /nominatim/ # <--- confirmed

# Commit the container
docker commit -m "With planet data" -a "Shinhan Shiu" cce4a525a8e2 shius/nominatim_planet
docker images

# Save image
docker save shius/nominatim_planet > nominatim_planet.tar

# Stop and remove docker container
docker stop cce4a525a8e2
docker rm cce4a525a8e2

# Run this again to test local .pbf
docker run -it \
  -e PBF_PATH=/nominatim/data/monaco-latest.osm.pbf \
  -p 8080:8080  \
  -v /osm-maps/data:/nominatim/data \
  --name nominatim \
  mediagis/nominatim:4.2

docker run -it \
  -e PBF_PATH=/nominatim/planet-230213.osm.pbf \
  -p 8080:8080  \
  -v /osm-maps/data:/nominatim/ \
  --name nominatim_planet \
  shius/nominatim_planet
'''

In [None]:
# Try michigan
'''bash
docker run -it \
  -e PBF_URL=https://download.geofabrik.de/north-america/us/michigan-latest.osm.pbf \
  -p 8080:8080  \
  --name nominatim_mi \
  mediagis/nominatim:4.2
'''


#### [Aximem/nominatim-docker-multiple-regions](https://github.com/Aximem/nominatim-docker-multiple-regions)

Not well documented and it did not work for me.

In [None]:
'''bash
# In /home/shius/github
git clone https://github.com/Aximem/nominatim-docker-multiple-regions.git
cd nominatim-docker-multiple-regions/

# Build image
docker build --pull --rm -t nominatim .

# update file: multiple_regions/init_multiple_regions.sh
# Issue: Not sure where to find this file.
#COUNTRIES="africa australia-oceania europe antarctica central-america south-america asia"

# init multiple regions: did not work...
docker run -t -v /home/shius/data_nominatim/:/data nominatim sh /app/multiple_regions/init.sh
'''

### Note on local Nominatim install

Using the Docker version I could just specify the different continent in different runs. But that is very clunky. So I am going to try install locally. Um... just too many dependent packages... Not doing this for now, stop at ICU install.

Following the [instruction](https://nominatim.org/release-docs/latest/admin/Installation/).

#### Install dependency

- cmake: `sudo apt install cmake`
- [expat](https://github.com/libexpat/libexpat/releases): configure, make, make install
- [proj](https://proj.org/):  `conda install -c conda-forge proj`
- [bzip2](https://sourceforge.net/projects/bzip2/): already in Ubuntu
- [Zlib](https://www.zlib.net/): [instruction](https://geeksww.com/tutorials/libraries/zlib/installation/installing_zlib_on_ubuntu_linux.php)
- [ICU](https://icu.unicode.org/): 
  - [source code access](https://icu.unicode.org/repository), confusing
  - [how to build and install on unix](https://unicode-org.github.io/icu/userguide/icu4c/build.html)
    - [libicu66 package download](https://www.ubuntuupdates.org/package/core/focal/main/base/libicu66): note the most recent version (72) not available
- [Boost libraries](https://www.boost.org/)

#### PostgreSQL

##### Install

```bash
# Create the file repository configuration:
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'

# Import the repository signing key:
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -

# Update the package lists:
sudo apt-get update

# Install the latest version of PostgreSQL.
# If you want a specific version, use 'postgresql-12' or similar instead of 'postgresql':
sudo apt-get -y install postgresql
```

##### Tuning

```bash
vim /etc/postgresql/15/main//postgresql.conf
```




### Nominatim call by geopy

#### geocoder test

See [this post](https://stackoverflow.com/questions/44208780/find-the-county-for-a-city-state)

In [None]:
import geocoder

results = geocoder.google("Chicago")
print(results)

In [None]:
help(geolocator.geocode)

#### Nominatim call by geocoder

In [None]:
geolocator = Nominatim(domain='localhost:8080', scheme='http')
location = geolocator.geocode('avenue pasteur')
location

#### From city to location

See [this post](https://www.tutorialspoint.com/how-to-get-the-longitude-and-latitude-of-a-city-using-python)

In [None]:
from geopy.geocoders import Nominatim

# initialize Nominatim API
#geolocator = Nominatim(user_agent="plant_sci_hist")
geolocator = Nominatim(domain='localhost:8080', scheme='http')
# Get location
location = geolocator.geocode("Ann Arbor")

print(location.latitude, location.longitude)


#### From location to country
- In the same post, codes directly copied. But that give a `ConfigurationError`.
- Use [this](https://www.geeksforgeeks.org/get-the-city-state-and-country-names-from-latitude-and-longitude-using-python/) instead.

In [None]:
loc_rev = geolocator.reverse(f'{location.latitude},{location.longitude}')
 
# Display
print(loc_rev)

#### Man, this works for many other things

Retrieve in English based on [this post](https://stackoverflow.com/questions/29360910/geopy-retrieving-country-names-in-english)

In [None]:
print(geolocator.geocode("Dusseldorf", language='en'))
print(geolocator.geocode("the Netherlands", language='en'))
print(geolocator.geocode("Academia Sinica", language='en'))
print(geolocator.geocode("Michigan"))
print(geolocator.geocode("Ingham"))
print(geolocator.geocode("Michigan 48823"))
print(geolocator.geocode("MI 48823"))
print(geolocator.geocode("48823"))
print(geolocator.geocode("East Lansing 48823"))
print(geolocator.geocode("Kunming", language='en'))

In [None]:
# This one is wrong
print(geolocator.geocode("Yugoslavia", language='en'))

In [None]:
# This return none
# Ok, this is because I only load Monaco into the server
print(geolocator.geocode("MI USA", language='en'))
print(geolocator.geocode("Republic of Korea", language='en'))
print(geolocator.geocode("PR China", language='en'))

#### Old call_nominatim function

In [None]:
# This is the original function simply update country_info if ANYTHING is found.
# But this is problematic because earlier searches can lead to results that are
# not as important and are false positives. So since I search with north
# america first, this will lead to exaggerated number of matches to north 
# america countries which will bias the result. So create another one
def call_nominatim_OLD(country_info, not_found, port):
  # Access local Nominatim server
  geolocator = Nominatim(domain=f'localhost:{port}', scheme='http')

  # for records that still have no location info
  not_found_local     = {} # {pmid:[AU, AD]}
  count_geopy_a3 = 0
  for pmid in tqdm(not_found):
    if pmid not in country_info:
      [AU, AD] = not_found[pmid]
      loc, _   = get_location_str(AD)
      geo      = geolocator.geocode(loc, language='en')

      # geocode return something useful
      if geo is not None:
        country = geo.raw['display_name'].split(", ")[-1]
        a3      = get_a3(country, cname_to_a3, cname_hist_to_a3, suppl_dict)
        
        country_info[pmid] = [AU, AD, a3]
        count_geopy_a3 += 1
      # nothing found
      else:
        not_found_local[pmid] = [AU, AD, loc]

  print("Still missing:", len(not_found_local.keys()))
  return not_found_local

### Pyosmium

In [None]:
dir(osmium)

In [None]:
help(osmium.osmium)

### Test download osm pbf data with url

Use osm.pbf data from three states.
- Follow [this post](https://askubuntu.com/questions/1160575/how-to-make-python-wait-for-a-program-to-stop-before-going-to-the-next-line-of-c) to make sure each process finish before the next one is called.

In [None]:
'''
# Load Europe data into docker
docker run -it \
  -e PBF_PATH=/nominatim/data/michigan-latest.osm.pbf \
  -p 8080:8080 \
  -v /home/shius/projects/plant_sci_hist/7_countries/test/:/nominatim/data \
  --name nominatim_mi \
  mediagis/nominatim:4.2
'''

In [None]:
# Get test data
test_data_url = "https://download.geofabrik.de/north-america/us/"
test_data_dir = work_dir / "test"
states = ["michigan", "ohio", "wisconsin"]

#https://stackoverflow.com/questions/10251391/suppressing-output-in-python-subprocess-call
for states in states:
  url = f"{test_data_url}{states}-latest.osm.pbf"
  subprocess.call(['wget', '-P', test_data_dir, url], 
                  stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

### Test call_nominatim

#### Test one address

In [None]:
not_found.keys()

dict_keys(['803110', '1279697', '1279702', '1280165', '1280601', '1280857', '1281435', '1281438', '1281482', '1281700', '1281816', '1282045', '1282044', '1282067', '1282087', '1282347', '1282396', '1283354', '1284656', '1284657', '1285798', '1288845', '1288849', '1291136', '1292497', '1292658', '1292668', '1292669', '1293889', '1294174', '1294178', '1294707', '1294810', '1294925', '1295516', '1295730', '1299140', '1300199', '1301212', '1301214', '1301216', '1302182', '1302203', '1302638', '1302641', '1303795', '1303798', '1304750', '1304756', '1305827', '1308645', '1309714', '1310005', '1310057', '1310058', '1310059', '1310086', '1310400', '1310521', '1310524', '1310692', '1310976', '1311200', '1311542', '1311699', '1311852', '1312237', '1312344', '1312350', '1312527', '1312532', '1312916', '1312950', '1312979', '1312981', '1312998', '1313232', '1313711', '1314161', '1314487', '1314662', '1314663', '1314801', '1314807', '1314808', '1314811', '1314816', '1315759', '1316191', '1316302', 

In [None]:
AD     = not_found['803110'][1]
loc, _ = get_location_str(AD)
print(loc)

Columbus 43210


In [None]:
geolocator = Nominatim(domain='localhost:8080', scheme='http')
geo = geolocator.geocode(loc)
geo

Location(Columbus, Franklin County, Ohio, United States, (39.9622601, -83.0007065, 0.0))

In [None]:
help(geo)

Help on Location in module geopy.location object:

class Location(builtins.object)
 |  Location(address, point, raw)
 |  
 |  Contains a parsed geocoder response. Can be iterated over as
 |  ``(location<String>, (latitude<float>, longitude<Float))``.
 |  Or one can access the properties ``address``, ``latitude``,
 |  ``longitude``, or ``raw``. The last
 |  is a dictionary of the geocoder's response for this item.
 |  
 |  Methods defined here:
 |  
 |  __eq__(self, other)
 |      Return self==value.
 |  
 |  __getitem__(self, index)
 |      Backwards compatibility with geopy<0.98 tuples.
 |  
 |  __getstate__(self)
 |  
 |  __init__(self, address, point, raw)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self)
 |  
 |  __len__(self)
 |  
 |  __ne__(self, other)
 |      Return self!=value.
 |  
 |  __repr__(self)
 |      Return repr(self).
 |  
 |  __setstate__(self, state)
 |  
 |  __str__(self)
 |      Return str(self).
 |  
 |  -------------

In [None]:
geo.address

'Columbus, Franklin County, Ohio, United States'

In [None]:
geo.raw

{'place_id': 64543031,
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
 'osm_type': 'relation',
 'osm_id': 182706,
 'boundingbox': ['39.8086936', '40.1573082', '-83.2101797', '-82.7713119'],
 'lat': '39.9622601',
 'lon': '-83.0007065',
 'display_name': 'Columbus, Franklin County, Ohio, United States',
 'class': 'boundary',
 'type': 'administrative',
 'importance': 0.4600099999999999}

#### Test call_nominatim

In [None]:
#https://stackoverflow.com/questions/5352546/extract-a-subset-of-key-value-pairs-from-dictionary
test_not_found = dict((k, not_found[k]) 
                      for k in ('803110', '1279697', '1279702', '1280165'))

call_nominatim(test_not_found, suppl_dict)


100%|██████████| 4/4 [00:00<00:00, 12.12it/s]


{'803110': [['Sagone AL Jr', 'Balcerzak SP', 'Metz EN'],
  ['Division of Hematology and Oncology, Ohio State University College of Medicine, Columbus 43210.'],
  'USA',
  {'place_id': 64543031,
   'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
   'osm_type': 'relation',
   'osm_id': 182706,
   'boundingbox': ['39.8086936', '40.1573082', '-83.2101797', '-82.7713119'],
   'lat': '39.9622601',
   'lon': '-83.0007065',
   'display_name': 'Columbus, Franklin County, Ohio, United States',
   'class': 'boundary',
   'type': 'administrative',
   'importance': 0.4600099999999999}],
 '1279697': [['Cohen MB', 'Schuler MA', 'Berenbaum MR'],
  ['Department of Entomology, University of Illinois, Urbana 61801.'],
  'USA',
  {'place_id': 64531114,
   'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
   'osm_type': 'relation',
   'osm_id': 126133,
   'boundingbox': ['40.069549', '40.157339', '-88.318616', '-88.15306'],
   'lat': '

#### Testing with limited info

In [None]:
test_not_found = {
  '1': [["AU1"], ['Division of Hematology and Oncology, Ohio State University College of Medicine, Columbus 43210']],
  '2': [["AU2"], ['Division of Hematology and Oncology, Ohio State University College of Medicine, Columbus']],
  '3': [["AU3"], ['Division of Hematology and Oncology, Ohio State University College of Medicine, 43210']],
  '4': [["AU4"], ['Division of Hematology and Oncology, Ohio State University College of Medicine']],
  '5': [["AU5"], ['Ohio State University College of Medicine, Division of Hematology and Oncology']],
  '6': [["AU6"], ['Ohio State University, College of Medicine, Division of Hematology and Oncology']],
}

In [None]:
test_nominatim_dict = call_nominatim(test_not_found, suppl_dict)
for pmid in test_nominatim_dict:
  print(pmid, test_nominatim_dict[pmid][-1])


100%|██████████| 6/6 [00:03<00:00,  1.87it/s]

1 {'place_id': 64585686, 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright', 'osm_type': 'relation', 'osm_id': 2528689, 'boundingbox': ['33.9444786', '34.4848477', '-79.0712116', '-78.1622008'], 'lat': '34.2814497', 'lon': '-78.666593', 'display_name': 'Columbus County, North Carolina, United States', 'class': 'boundary', 'type': 'administrative', 'importance': 0.55001}
2 {'place_id': 48186762, 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright', 'osm_type': 'way', 'osm_id': 636148229, 'boundingbox': ['41.3980482', '41.4019585', '-81.6505146', '-81.6457541'], 'lat': '41.399906200000004', 'lon': '-81.64807882250798', 'display_name': 'Kent State University College of Podiatric Medicine, Rockside Place, Independence, Cuyahoga County, Ohio, 44131, United States', 'class': 'amenity', 'type': 'university', 'importance': 0.6100099999999999}
3 {'place_id': 48186762, 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https:/




### Fix log_nominatim_pmids file

For some reason, it got stuck at 16656380 even though a result is generated.

In [261]:
# mv log_nominatim_na_pmids log_nominatim_na_pmids_BAK

with open(work_dir / "log_nominatim_na_pmids_BAK", "r") as f:
  pmids = f.readline().split(" ")[1:]
  pdict = {}
  for i in pmids:
    if i not in pdict:
      pdict[i] = 1

with open(work_dir / "log_nominatim_na_pmids", "w") as f:
  pmids_sorted = list(pdict.keys())
  pmids_sorted.sort()
  f.write(" ".join(pmids_sorted))


In [265]:
not_found['16656381']

[['Palmer JM'],
 ["Department of Botany, King's College, 68 Half Moon Lane, London, S.E.24."]]

### Deprecated functions

In [None]:
# Deal with timeout
# https://gis.stackexchange.com/questions/173569/avoid-time-out-error-nominatim-geopy-openstreetmap
# replace geopy.geocode with geolocator.geocode

# This give module not found error
#from geopy.exec import GeocoderTimedOut

def do_geocode(geolocator, address, attempt=1, max_attempts=5):
    try:
        return geolocator.geocode(address, language='en')
    except Exception as ex:
        #print("Exception:", ex)
        if attempt <= max_attempts:
            return do_geocode(geolocator, address, attempt=attempt+1)
        raise

In [None]:
# As of 3/7/23
def get_location_str(AD, token_idx=-1, debug=0):
  '''Get the potential location string from AD
  Args:
    AD (str): The content in the AD field
    token_idx (int): -1, -2, or 0 (whole thing)
  Return
    location (str): the string that likely contain location info
    errflag (int): the AD info is empty and thus erroneous (1) or not (0)
  '''

  # The first element in the AD list is used (1st author)
  add_str  = AD[0]
  if debug: print(AD, add_str)

  # But there are 12 records where the AD field looks like:
  # ['.', '.', '.', '.', '.', '.']
  # So tokens will be "", dealt with in the if-else statement below.
  if add_str == "":
    loc = "NA"
    errflag = 1

  else:
    errflag = 0
    # Multipe authors:
    # ['From xxx, xxx, xxx, xxx, xxx (Ranade, Ganea, Razzak, and Garcia Gil)']
    # Another case:
    # [['xxx, xx, xxx, Maryland 20742 (M.H., G.F.D.).']
    if add_str[-1] == ")" or add_str[-2:] == ").":
      leftMargin = add_str.rfind(" (")

    # Email fields: without ending "."
    # ['Institut ..., France. achmustilli@libero.it']
    elif add_str[-1] != ".":
      leftMargin = add_str.rfind(" ")

    # nothing to do, will take the whole thing
    else:
      leftMargin = len(add_str)

    add_str = add_str[:leftMargin] # rid of author info

    # if tokens ends with ".", rid of it
    if add_str[-1] == ".": 
      add_str = add_str[:-1]
    
    # Split with ", ", if it does not exist, split with " "
    if ", " not in add_str:
      # Only one large token, split with space instead
      tokens = add_str.split(" ")
      try:
        loc = tokens[token_idx]
      except IndexError:
        loc = "NA"
    else:
      tokens = add_str.split(", ")
      try:
        loc = tokens[token_idx]
      except IndexError:
        loc = "NA"

    if debug: print(loc)

  return loc, errflag