# __Step 7.1: Get country info__

## ___Readme___

### Goal
- Get country info out of each doc
- Get # of docs per country
- Get # of docs per continent
- Get # of docs per country over time
- Get # of docs per country per topic
- Get # of docs per country per topic over time

### Notes for writeup

Approach:
- Get the right token for country info in the AD (address) field, some have email address as the last token
- `pycountry`: use both ISO3166 (Countries) and ISO3166-3 (deleted countries)
- Supplement dictionary: some special considerations, e.g., UK, Taiwan, etc.
- `geopy`: pass the location token directly. This is a powerful module but does not deal with historical country properly.
- Brute-force search: Use ISO3166-1, 2, and 3.
- When consolidate results, the order of preference will be:
  - pycountry search in 7_1
  - brute force search in 7_1e
  - consolidated nominatim results from 7_1d

Deprecated:
- `uszipcode`: search for zip code if the token has two parts delimited by " " and the 2nd part is a number as is or after taking the 1st subtoken before "-". This is done by `geopy`.

Key info:
- Total records: 421658
- Medline available: 421585
- Unique PMID: 421276
  - No AD in medline record: 19851
  - With a3 based on pycountry/suppl dict: 329029
  - To geopy (if a3 not found via pycountry): 72396
- With brute force search of all docs except those with no AD:
  - Current country name: 361242
  - Historical country name: 16573
  - Subregion name: 279839
- Search based on ccTLD: 90799 with emails, 72640 with country domain info
  - .com: 16275
  - .org: 1107

Thoughts
- The addresses that do not work tend to be earlier records where only institution names are available or from addresses that no longer exists.
- There are also tons of examples where the delimiter is just crap.
- For historical countries, decide not to merge or modify names and just use as is to illustrate political transitions.

### Issues

- 3/21/23:
  - Found the historial country bit has a bug (name_a3 was not specified properly).
- 3/20/23:
  - Found out that the main issue with backup/restore volume containing databases is because the directory structure is wrong, as of 3/19, I finally got the test data to backup and restore properly. Apply that to the North America file today.
  - For the NA and AS run, found 16656795 with New Haven, Connecticut XXXX and this is a problem because for NA search results in a match with Connecticult XXXX resulted in a lower importance than AS searc result using New Haven because the last token return NA. This kinds of FP is problematic so I modify the function and will not submit the last second token and full add_str for search. This means I need to rerun search.
- 3/15/23:
  - Move all Nominatim testing bit to script_7_1a to test things out.
- 3/14/23:
  - Working on script_7_1d parsing some of the nominatim outputs and discovery some anomaly, e.g.:
    - PMID: 16656795
    - AD  =['The Connecticut Agricultural Experiment Station, New Haven, Connecticut 06504.']
    - nomi={0.11000999999999997: ['USA'], 0.4000099999999999: ['CHN']}
    - So CHN is more important than USA in this case, not sure why. 
  - Reload North American docker container tar, and see if I can run it again.
- 3/13/23:
  - Gone through north america and asia, tried to use the merged europe1 file but encountered: `ERROR: Input data is not ordered: relation id 33702 appears more than once.` See `Working with Multiple Input Files` of the [Osmium manual](https://osm2pgsql.org/doc/manual.html#updating-an-existing-database) which suggest [merge and simplify first](https://osm2pgsql.org/doc/manual.html#merging-osm-change-files).
- 3/7/23:
  - Have gotten the North American Nominatim search to work. But looking into the one that addresses cannot be found, I realize that `get_location_str()` need to be further improved.
    - See `script_7_1_c_assess_not_found.ipynb` for considerations.
    - This leads to a modified `get_location_str()`.
- 3/1/23:
  - The North America file have been loading for 4 days and died this morning because I run out of virtual disc space. There was ~250Gb free space. Increase the virtual disc from 512Gb to 1024Gb and run Nominatim again...
- 2/27/23:
  - First thing in the morning, I notice the proess died with kill signal. Suspect it is a memory issue.
  - Revert back to do continent level. But this does not work for all because I ran into memory problems still for Europe. But North America works. So for Europe, exclude 3 combined regions, and consolidate the rest into two osm pbfs.
  - For efficiency, also combine africa, antarctica, australia-oceania, central-america, south-america into all_others.osm.pbf.
  - Originally was thinking about doing each successively so each time there are fewer searches needed. But realize that this is problematic because a search can turn up FPs. For example, if I search for Paris with the North America file, I will find some place with the top match but that will not be the one I am looking for. So instead, all `not_found` records need to go through all regions, then compare the importance score afterward.
- 2/24/23:
  - Figure out how to point the input planet file to docker.
  - Run nominatim container using local planet osm. Let it go over the weekend.
- 2/23/23:
  - Was planning to do the search continent by continent, then realize that there is osm file for the whole planet from [OpenStreeMap](https://planet.openstreetmap.org/). Download this instead.
  - Run into issue pointing the downloaded file to the docker image.  
- 2/22/23: 
  - Turned out that the docker command line only point to one region. Download OSM files from [Geofabrik](https://download.geofabrik.de/index.html).
  - There are some issues with email address containing address.
- 2/21/23:
  - Try to query locations using Nominatim servive (OpenStreetMap) via geopy but after couple hundred queries, the connection time out. Likely the service just refuse to handle too many request. Try install Nominatim locally. 
  - [Nominatim Docker version](https://github.com/mediagis/nominatim-docker/tree/master/4.2) and to [get Docker started](https://docs.docker.com/config/daemon/start/).
  - Got `docker: Got permission denied while trying to connect to the Docker daemon socket`. Try [this fix](https://www.digitalocean.com/community/questions/how-to-fix-docker-got-permission-denied-while-trying-to-connect-to-the-docker-daemon-socket).
  - See [this guide](https://www.linkedin.com/pulse/geocoding-geopy-your-own-nominatim-server-chonghua-yin?trk=related_artice_Geocoding%20with%20GeoPy%20and%20Your%20Own%20Nominatim%20Server_article-card_title).
- 2/20/23: 
  - The corpus dataset from 2_5_predict_pubmed does not have author or affiliation info. This needs to be done from the very beginning when I process the pubmed records.
  - In 
[MEDLINE/PubMed Data Element (Field) Descriptions](https://www.nlm.nih.gov/bsd/mms/medlineelements.html), there are several important info:
    - The affiliation of the authors, corporate authors and investigators appear in this repeating field.
      - 1988- The address of the first author's affiliation is included. The institution, city, and state including zip code for U.S. addresses, and country for countries outside of the United States, are included if provided in the journal; sometimes the street address is also included if provided in the journal.
      - 1995-2013 The designation USA is added at the end of the address when the first author's affiliation is in the fifty United States or the District of Columbia.
        - Q: Does this mean that this is not done for records before 1995?
      - 1996- The primary author's electronic mail (e-mail) address is included at the end of the Affiliation field, if present in the journal.
      - 2003- The complete first author address is entered as it appears in the article with no words omitted.
      - October 2013- Quality control of this field ceased in order to accommodate the affiliations for all authors and contributors.
      - December 2014- Multiple affiliations for each author or contributor are included.
        - __Because of this, only 1st author info is considered.__
  - For dealing with countries, there is the issue of historical country names, see [ISSO_3166-3](https://en.wikipedia.org/wiki/ISO_3166-3)

## ___Set up___

### Module import

In conda env `geo`

In [3]:
import pickle, pycountry, subprocess, requests, urllib
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from pathlib import Path
from tqdm import tqdm
from Bio import Entrez, Medline
from time import sleep
from geopy.geocoders import Nominatim
from bs4 import BeautifulSoup

### Key variables

In [4]:
# Reproducibility
seed = 20220609

# Setting working directory
proj_dir   = Path.home() / "projects/plant_sci_hist"
work_dir   = proj_dir / "7_countries"
work_dir.mkdir(parents=True, exist_ok=True)

# plant science corpus with date and other info
dir2        = proj_dir / "2_text_classify//2_5_predict_pubmed"
corpus_file = dir2 / "corpus_plant_421658.tsv.gz"

# timestamp bins
dir44            = proj_dir / "4_topic_model/4_4_over_time"
ts_for_bins_file = dir44 / "table4_4_bin_timestamp_date.tsv"

dir71 = work_dir / "7_1_parse_countries"
dir71.mkdir(parents=True, exist_ok=True)

# 3/27/23: move wsl to m.2 ssd use a temp data folder for operations
dir_tmp = Path.home() / 'data_nominatim/tmp_out/'
dir_tmp.mkdir(parents=True, exist_ok=True)

medline_dir = dir71 / "medline"
medline_dir.mkdir(parents=True, exist_ok=True)

# So PDF is saved in a format properly
mpl.rcParams['pdf.fonttype'] = 42
plt.rcParams["font.family"] = "sans-serif"

## ___Get PubMed records___

### Read plant science corpus and get pmids

In [5]:
pmid_file = dir71 / "pmids.pickle"

# PMID file does not exist
if not pmid_file.is_file():
  # Read corpus file
  corpus = pd.read_csv(corpus_file, compression='gzip', sep='\t')
  # get pmids
  pmids = corpus.PMID.values
  # save pmids
  with open(pmid_file, 'wb') as f:
    pickle.dump(pmids, f)
else:
  with open(pmid_file, "rb") as f:
    pmids = pickle.load(f)

print(pmids.shape)

(421658,)


### Get Pubmed docs using PMIDs


In [6]:
#https://stackoverflow.com/questions/59267992/biopython-how-to-download-all-of-the-peptide-sequences-or-all-records-associat

Entrez.email = 'shius@msu.edu'

id_list  = [str(pmid) for pmid in pmids]
post_xml = Entrez.epost(db='pubmed', id=','.join(id_list))
results  = Entrez.read(post_xml)
webenv   = results['WebEnv']
qkey     = results['QueryKey']

In [7]:
#http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec166

step    = 10000
for begin in tqdm(range(0, len(pmids), step)):
  # first check if this file is present
  medline_file = medline_dir / f"corpus_plant_421658_medline_{begin}.pickle"

  # Check if the file is already there, if so, continue to the next one
  if not medline_file.is_file():
    subset   = pmids[begin:begin+step]

    # Get Medline records for subset
    handle  = Entrez.efetch(db='pubmed', id=subset, rettype='medline', 
                            retmode='text', webenv=webenv, query_key=qkey)
    records  = Medline.parse(handle)
    rec_list = list(records)

    with open(medline_file, "wb") as f:
      pickle.dump(rec_list, f)


100%|██████████| 43/43 [00:00<00:00, 132.78it/s]


### Process PubMed Medline docs

In [8]:
# Read individuial pickle files and compile the full list
all_rec = []
for begin in tqdm(range(0, len(pmids), step)):
  medline_file = medline_dir / f"corpus_plant_421658_medline_{begin}.pickle"
  with open(medline_file, "rb") as f:
    rec_list = pickle.load(f)
  all_rec.extend(rec_list)

# The number of docs don't add up. Some records are not downloaded
len(all_rec)

100%|██████████| 43/43 [00:43<00:00,  1.02s/it]


421585

### Check what's missing

In [9]:
# Go thorugh all downloaded docs and get PMIDs
def check_missing(pmids, all_rec):
  '''
  Args:
    pmids (list): list of integer PMIDs
    all_rec (list): list of dictionary of medline records
  Return:
    id_list_missed (list): list of items in pmids but not all_rec
  '''

  # Downloaded
  pmids_dn = []
  for rec in tqdm(all_rec):
    pmids_dn.append(int(rec['PMID']))
  
  # Compare lists
  #https://stackoverflow.com/questions/15455737/python-use-set-to-find-the-different-items-in-list
  print("differnce:",len(pmids)-len(pmids_dn))

  pmids_ori_set = set(pmids)
  pmids_dn_set  = set(pmids_dn)
  missing = pmids_ori_set - pmids_dn_set
  print("# missing:", len(missing))

  id_list_missed = [str(pmid) for pmid in missing]

  return id_list_missed

In [10]:
# Get the missing records and add to all_rec
id_list_missed = check_missing(pmids, all_rec)

# Get Medline records for subset straight without epost
handle  = Entrez.efetch(db='pubmed', id=id_list_missed, rettype='medline', 
                        retmode='text')
records  = Medline.parse(handle)
rec_list = list(records)

# Can only get 41, so some still missing
print("Retrieved:", len(rec_list))

100%|██████████| 421585/421585 [00:00<00:00, 1026264.42it/s]


differnce: 73
# missing: 72
Retrieved: 41


In [11]:
# Save the missing records as pickle
medline_file = medline_dir / "corpus_plant_421658_medline_missed.pickle"

with open(medline_file, "wb") as f:
  pickle.dump(rec_list, f)

In [12]:
# Add to all_rec, then check again
all_rec.extend(rec_list)

In [13]:
still_missing = check_missing(pmids, all_rec)
len(still_missing)

100%|██████████| 421626/421626 [00:00<00:00, 1095213.20it/s]


differnce: 32
# missing: 31


31

### Check AU length

In [14]:
ad_len_dict = {}
for rec in tqdm(all_rec):
  if 'AD' in rec:
    ad_len = len(rec['AD'])
    if ad_len not in ad_len_dict:
      ad_len_dict[ad_len] = 1
    else:
      ad_len_dict[ad_len]+= 1

    #if ad_len == 2285:
    #  print(rec['AD'][0])

print(ad_len_dict)
        

100%|██████████| 421626/421626 [00:00<00:00, 882162.34it/s]

{1: 235025, 4: 20509, 5: 20868, 2: 13440, 9: 10572, 6: 20175, 3: 17537, 10: 8071, 8: 13912, 11: 5575, 7: 17230, 16: 1384, 12: 4400, 13: 2978, 15: 1766, 14: 2466, 17: 952, 23: 239, 21: 401, 19: 620, 18: 890, 59: 2, 25: 164, 30: 93, 32: 77, 22: 352, 20: 552, 34: 46, 36: 49, 43: 23, 28: 118, 54: 10, 35: 50, 44: 12, 27: 150, 29: 93, 24: 276, 31: 64, 26: 156, 45: 26, 38: 27, 71: 4, 50: 12, 37: 34, 55: 6, 86: 1, 41: 17, 80: 4, 82: 1, 33: 61, 47: 12, 64: 1, 42: 22, 58: 2, 39: 28, 69: 2, 52: 10, 51: 10, 40: 41, 62: 3, 46: 17, 96: 2, 48: 12, 65: 5, 57: 7, 72: 4, 128: 1, 60: 6, 114: 2, 78: 2, 76: 1, 56: 5, 95: 2, 136: 1, 63: 3, 126: 2, 112: 2, 162: 1, 216: 1, 77: 3, 105: 3, 66: 4, 70: 5, 88: 3, 153: 1, 135: 1, 200: 1, 111: 2, 152: 1, 75: 4, 130: 1, 273: 1, 85: 2, 68: 2, 92: 2, 137: 2, 61: 5, 53: 4, 113: 1, 366: 1, 67: 3, 93: 1, 120: 1, 164: 1, 110: 2, 89: 1, 143: 1, 83: 1, 49: 4, 131: 1, 101: 1, 103: 1, 73: 3, 179: 1, 886: 1, 74: 2, 2285: 1, 79: 1, 139: 1, 166: 1, 168: 1}





### Spot check AD fields

In [15]:
# Use "." as delimiter will work for most, but exceptions:
'''
['Instituto de Fitosanidad, Colegio de Postgraduados, km. 35.5 Carr. 
  Mexico-Texcoco, 56230-Texcoco, Edo. de Mexico, Mexico.']
['Botanisches Institut der Ludwig-Maximilians Universitat, Munchen, F.R.G.']
'''

for idx in range(0, len(all_rec), 1000):
  rec = all_rec[idx]
  if 'AD' in rec:
    print([rec['AD'][0]]) 

['Faculty of Pharmaceutical Sciences, Kumamoto University, Japan.']
["URA Centre National de la Recherche Scientifique 576, Departement de Biologie Moleculaire et Structurale, Centre d'Etudes Nucleaires de Grenoble, France."]
['Department of Biological Sciences, Stanford University, CA 94305-5020.']
['Botanisches Institut der Ludwig-Maximilians Universitat, Munchen, F.R.G.']
['Department of Biochemistry, Temple University School of Medicine, Philadelphia, PA 19140.']
['Department of Biochemistry, Johns Hopkins University, School of Hygiene and Public Health, Baltimore, Maryland 21205.']
['Institut de Biologie Moleculaire des Plantes du CNRS, Strasbourg, France.']
['Department of Agronomy, Purdue University, West Lafayette, Indiana 47907.']
["Departement de Biologie/Service de Biologie Cellulaire, Institut National de la Sante et de la Recherche Medicale U246, Centre d'Etudes Nucleaires de Saclay, Gif sur Yvette, France."]
['Ministry of Agriculture, Fisheries and Food, Slough Laboratory

### Set up dict_pmid_au_ad

In [16]:
dict_pmid_au_ad = {} # {pmid:[AU, AD]}
pmid_count      = {} # {pmid:count}, see how many are redundant

connt_AU_NA = connt_AD_NA = 0
# Go through each record
for rec in tqdm(all_rec):
  pmid = rec["PMID"]

  if pmid not in pmid_count:
    pmid_count[pmid] = 1
  else:
    pmid_count[pmid]+= 1

  # Deal with AU info
  try:
    AU = rec["AU"]
  except KeyError:
    AU = "NA"
    connt_AU_NA += 1

  # Deal with AD info
  try:
    AD = rec["AD"]
  except KeyError:
    AD = "NA"
    connt_AD_NA += 1

  # populate dictionary
  dict_pmid_au_ad[pmid] = [AU, AD]

print(f"Total:{len(dict_pmid_au_ad)}, NO AU:{connt_AU_NA}, No AD:{connt_AD_NA}")

100%|██████████| 421626/421626 [00:01<00:00, 373696.60it/s]

Total:421276, NO AU:74, No AD:19850





In [17]:
# Get the number of PMIDs that are redundant

dict_count = {} # of redundant PMIDs: count}
for pmid in pmid_count:
  count_redun = pmid_count[pmid]
  if count_redun > 1:
    if count_redun not in dict_count:
      dict_count[count_redun] = 1
    else:
      dict_count[count_redun]+= 1

print(dict_count)

{2: 350}


In [18]:
# export dictionary
dict_pmid_au_ad_file = dir71 / "dict_pmid_AU_AD.pickle"
with open(dict_pmid_au_ad_file, "wb") as f:
  pickle.dump(dict_pmid_au_ad, f)

## ___Search for country codes___

### Set up country dictionaries

In [19]:
# Build {country_name or official_name: alpha_3 code}
countries   = list(pycountry.countries)
cname_to_a3 = {}

for country in countries:
  name_a2    = country.alpha_2
  name_a3    = country.alpha_3
  name_short = country.name

  cname_to_a3[name_a2]    = name_a3 # store this for situation like US
  cname_to_a3[name_a3]    = name_a3 # store this for sitiation like USA
  cname_to_a3[name_short] = name_a3
  
  # put official name in
  try:
    name_offic = country.official_name
    cname_to_a3[name_offic] = name_a3
  except AttributeError:
    #print("No official name:", name_short)
    name_offic = "NA"

In [20]:
# Also build a dictionary for historical countries

# Conversion for couple renamed countries
cnames_convert = {"BUR": "MMR", "ZAR": "COD"}

countries_hist = list(pycountry.historic_countries)
cname_hist_to_a3 = {}

for country in countries_hist:
  # the name in historical countries are the official names
  name_a3    = country.alpha_3
  name_offic = country.name
  name_short = name_offic.split(",")[0]

  if name_a3 in cnames_convert:
    name_a3 = cnames_convert[name_a3]
    
  cname_hist_to_a3[name_a3]    = name_a3
  cname_hist_to_a3[name_short] = name_a3
  cname_hist_to_a3[name_offic] = name_a3



In [21]:
# For some issues that arise
suppl_dict = {"UK":"GBR", "The Netherlands":"NLD", "Taiwan":"TWN", 
              "Republic of China":"TWN", "the Netherlands":"NLD"}

### Functions: get location and get a3

In [22]:
def get_location_str(add_str, token_idx=-1, debug=0):
  '''Get the potential location string from AD
  Args:
    add_str (str): The content in the 1st AD element (1st author address)
    token_idx (int): -1, -2, or 0 (whole thing)
  Return
    location (str): the string that likely contain location info
    errflag (int): the AD info is empty and thus erroneous (1) or not (0)
  '''

  if debug: print("add_str:", add_str)

  # But there are 12 records where the AD field looks like:
  # ['.', '.', '.', '.', '.', '.']
  # So tokens will be "", dealt with in the if-else statement below.
  if add_str == "":
    loc = "NA"
    errflag = 1

  else:
    errflag = 0
    # Multipe authors:
    # ['From xxx, xxx, xxx, xxx, xxx (Ranade, Ganea, Razzak, and Garcia Gil)']
    # Another case:
    # [['xxx, xx, xxx, Maryland 20742 (M.H., G.F.D.).']
    if add_str[-1] == ")" or add_str[-2:] == ").":
      leftMargin = add_str.rfind(" (")
      add_str = add_str[:leftMargin]

    # Email field: Was contemplating using email for country, but some 1st
    # author emails are not the same as for the institution (see 2nd example).

    # 18930883, loc:China E-mail: suzhi1026@163.com
    # Some have ". Email:..."
    if add_str.find("E-mail:") != -1:
      tmp_str = add_str[:add_str.find("E-mail:")]
      # Use space as delimiter, then the empty space is taken care of later
      leftMargin = tmp_str.rfind(" ")
      add_str = add_str[:leftMargin]
      if debug: print("found: 'E-mail:", add_str)

    # 22016614, loc:Poland; E-Mails: agnieszka.pszczolkowska@uwm.edu.pl (A.P.); macieklojko@wp.p
    # 22072902, loc:China; E-Mail: yiruizao@163.com
    elif add_str.find("; E-Mail") != -1:
      add_str = add_str[:add_str.find("; E-Mail")]
      if debug: print("found: '; E-Mail", add_str)

    # 24866837:Fax: (+31) 50-3636440  <-- AG Groningen (The Netherlands), Fax:
    elif add_str.find("Fax:") != -1:
      add_str = add_str[:add_str.find("Fax:")]
      if debug: print("found: 'Fax:", add_str)

    # Below are more example for using @ field for parsing.
    #   case 1: 17296497 ['xxx, Denmark. blah@aki.ku.dk <blah@aki.ku.dk>']
    #   case 2: ?? ['Institut ..., France. achmustilli@libero.it']
    # The next one is weird, look like there is something not parsed properly
    #   case 3: 17632571 ['xxx, xxx, UK. ib 103@mole.bio.cam.ac.uk']
    # Also the next one, so cannot use "." as delimiter first.
    #   case 4: 18613594 ['xxx, ACT2601 Australia. rod.mahon@csiro.au']
    # Ok, some is missing space between country and email... Man...
    #   case 5: 9931476 ['PO Box 12, Rehovot 76100, Israel.cohenk@agri.huji.ac.il']
    if add_str.find("@") != -1:

      # Find where the 1st email address is and generate a temp_str
      tmp_str = add_str[:add_str.find("@")]
      
      # Originally using space, but the 3rd example shows that it is not good.
      # But then "." is regularly used in email address. So do space delimter
      # first, then use "."
      tmp_str = tmp_str[:tmp_str.rfind(" ")] # this takes care of case 4
      leftMargin = tmp_str.rfind(".")        # this takes care of case 3
      if leftMargin == -1:                   # this takes care of case 5
        leftMargin = tmp_str.rfind(",")
      add_str = add_str[:leftMargin]
      if debug: print("found: '@", add_str)

    # ISNI code in ~7k records
    #   30263677 ['27601 Republic of Korea. ISNI: 0000 0004 1775 9398. GRID: grid.444122.5'
    if add_str.find("ISNI:") != -1:
      tmp_str = add_str[:add_str.find("ISNI:")]
      leftMargin = tmp_str.rfind(".")
      add_str = add_str[:leftMargin]

    # Strip empty space before
    add_str = add_str.strip()

    if debug: print("final add_str:",[add_str])

    # Some just have email address so after parsing, add_str is "".
    #   25548975: '. kehrig@pharmazie.uni-kiel.de.'
    if add_str == "":
      loc = "NA"
      errflag = 1
    # if tokens ends with ".", rid of it
    else:
      if add_str[-1] == ".": 
        add_str = add_str[:-1]

      # Originally splot with ", " then "." but there are edge cases like this:
      #   17444520, loc:Hsinchu,Taiwan
      # So split with ",", if it does not exist, split with " "
      if "," not in add_str:
        # Only one large token, split with space instead
        tokens = add_str.split(" ")
        try:
          loc = tokens[token_idx]
        except IndexError:
          loc = "NA"
          errflag = 1
      else:
        tokens = add_str.split(",")
        try:
          # rid of space if present
          loc = tokens[token_idx].strip()
        except IndexError:
          loc = "NA"
          errflag = 1

      # More edge cases with "(" and some with ")", examples:
      # 18636686:47023 Cesena (FC) Italy
      # 19704524:Ibaraki Japan; xxx (B & PMP); xxx; xxx; Montpellier France
      # 21665592:B-1860 Meise (Belgium);
      # 24828308:Japan (K.S	
      

      # 19140172, loc:IR Iran, there are other variations. OpenStreeMap cannot
      # find these so deal with them manually.
      if loc.endswith(" Iran"):
        loc = loc[loc.find("Iran"):]

      # 19651701, loc:Taiwan ROC
      if loc.endswith(" ROC"):
        loc = loc[:loc.rfind(" ")]

      # 21299880, loc:DF- 70770-917 - Brasil
      if loc.find("- ") != -1:
        loc = loc.split("- ")[-1]

      # 1915409, loc:Stuttgart/Bundesrepublik Deutschland
      if loc.find("/") != -1:
        loc = loc.split("/")[-1]

    if debug: print(loc)

  return loc, errflag

In [23]:
def get_a3(location, cname_to_a3, cname_hist_to_a3, suppl_dict):
  # Found current country name
  if location in cname_to_a3:
    a3 = cname_to_a3[location]
  # Found historical country name
  elif location in cname_hist_to_a3:
    a3 = cname_hist_to_a3[location]
  # Found name in suppl dict
  elif location in suppl_dict:
    a3 = suppl_dict[location]
  # Leave this for geopy in the next step
  else:
    a3 = 'NA'

  return a3

### Search for country info with pycountry

In [24]:
# Without country a3
# Before checking for US state: 24867

country_info = {} # {pmid:[first_AU, first_AD, alpha_3]}
not_found    = {} # {pmid:[AU, AD]}, for records with no a3 code
count_AD_NA  = 0  # count records without AD field
count_a3     = 0  # count records with a3 code

# Go through each record
for pmid in tqdm(dict_pmid_au_ad):
  a3 = "NA" # set default value
  AU = dict_pmid_au_ad[pmid][0]
  AD = dict_pmid_au_ad[pmid][1]
  
  if AD == "NA":
    count_AD_NA += 1
  else:
    # Get 1st author location string
    add_str = AD[0]
    loc, errflag = get_location_str(add_str, -1, 0)
    if not errflag:
      a3 = get_a3(loc, cname_to_a3, cname_hist_to_a3, suppl_dict)
      # Leave this for geopy in the next step
      if a3 == "NA":
        not_found[pmid] = [AU, AD]
    # The AD field is effectively empty so set it to be "NA"
    else:
      AD = "NA"
      count_AD_NA += 1

  if a3 != "NA":
    count_a3 += 1

  # For AD is NA, no point in doing more, so include these in country_info. 
  # For those with a3 code, they are done, so include these in country info.
  if AD == "NA" or a3 != "NA":
    country_info[pmid] = [AU, AD, a3]

  #print(pmid, AD, a3)

print("Total   :", len(dict_pmid_au_ad))
print("With a3 :", count_a3)
print("No AD   :", count_AD_NA)
print("To geopy:", len(not_found.keys()))

100%|██████████| 421276/421276 [00:02<00:00, 170421.86it/s]

Total   : 421276
With a3 : 329029
No AD   : 19851
To geopy: 72396





In [25]:
# Save country_info
country_info_file = work_dir / "country_info-pycountry.pickle"
with open(country_info_file, "wb") as f:
  pickle.dump(country_info, f)

In [26]:
# Save not_found
country_info_NF_file = work_dir / "country_info-pycountry_NF.pickle"
with open(country_info_NF_file, "wb") as f:
  pickle.dump(not_found, f)

### Spot check again

In [27]:
for idx, key in enumerate(not_found):
  print(key, not_found[key][1])
  if idx == 1000:
    break

803110 ['Division of Hematology and Oncology, Ohio State University College of Medicine, Columbus 43210.']
1279697 ['Department of Entomology, University of Illinois, Urbana 61801.']
1279702 ['Sandoz Agro, Inc., Palo Alto, CA 94304.']
1280165 ['Isotope and Structural Chemistry Division, Los Alamos National Laboratory, NM 87545.']
1280601 ['Institute of Biophysics, Czechoslovak Academy of Sciences, Brno.']
1280857 ['Department of Biological Chemistry and Biophysics, University of Michigan, Ann Arbor 48109.']
1281435 ['Plant Biology Division, Samuel Roberts Noble Foundation, Ardmore, OK 73402.']
1281438 ['Life Science Research Laboratory, Japan Tobacco Inc., Kanagawa.']
1281482 ['Department of Energy-Plant Research Laboratory, Michigan State University, East Lansing 48824.']
1281700 ['Department of Genetics, Harvard Medical School, Boston, Massachusetts.']
1281816 ['Department of Biology, Yale University, New Haven, Connecticut 06511.']
1282045 ['Department of Botany and Plant Pathology,

## ___Nominatim setup___

### Merge osm.pbf files

#### Get files

The Europe osm.pbf file (26.4Gb) cannot be loaded but the North America file work (12.2Gb). 
- So try to download individual country files then merge them into smaller files with osmium.
  - Found out that the Europe folder does not have Russian Federation file. Download it by itself and rerun below.
- [Extract all the URLs from the webpage Using Python](https://www.geeksforgeeks.org/extract-all-the-urls-from-the-webpage-using-python/)
- Save the european files into `/home/shius/data_nominatim/continent_osm/europe` by running `script_7_1b_get_europe_osm_pbfs.py`.
  - Note the following are excluded: `["alps", "britain-and-ireland", "dach"]`
  - Because of this, the data is now below 20Gb, so will break them into 2 subgroups.
- [Get linked file size](https://stackoverflow.com/questions/55226378/how-can-i-get-the-file-size-from-a-link-without-downloading-it-in-python)
- Then run the following to:
  - Get groups.
  - [Merge multiple osm pbf files](https://gis.stackexchange.com/questions/242704/merging-osm-pbf-files): try 2nd answer.

Also merge all other continents except North America, Asia, and Europe.


In [28]:
# Generate file name dictionary: {file size: name}
size_name_europe = {}

# Special subregions to exclude so they are not overlapping
exclude = ["alps-latest.osm.pbf", 
           "britain-and-ireland-latest.osm.pbf", 
           "dach-latest.osm.pbf"]

# get the soup obj
url = 'https://download.geofabrik.de/europe/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')

# Go through links
for link in soup.find_all('a'):
  name = link.get('href')
  if name.endswith('-latest.osm.pbf') and name not in exclude:
    #print(name)
    req = urllib.request.Request(f"{url}{name}", method='HEAD')
    f   = urllib.request.urlopen(req)
    
    # size in Gb
    size = int(f.headers['Content-Length'])/(1024*1024*1024)
    size_name_europe[size] = name

# The above does not include Runssian Federation, add it in manually
size_name_europe[3.2] = "russia-latest.osm.pbf"

In [29]:
# sort by size and break them into 2 groups with ~9Gb each
sizes = list(size_name_europe.keys())
sizes.sort()

groups         = [] # list of subgroups
subgroup       = [] # name of country in a subgroup
subgroup_total = 0  # total within subgroup
threshold_size = 12 # threshold size for a subgroup
for size in sizes:
  if subgroup_total < threshold_size:
    subgroup.append(size_name_europe[size])
    subgroup_total += size
  else:
    print(subgroup_total)
    groups.append(subgroup)
    # reset
    subgroup       = []
    subgroup_total = 0

# Add the last subgroup
if subgroup != []:
  print(subgroup_total)
  groups.append(subgroup)

13.131017194129527
12.76383773498237


In [30]:
for i in groups:
  print(i)

['monaco-latest.osm.pbf', 'andorra-latest.osm.pbf', 'liechtenstein-latest.osm.pbf', 'guernsey-jersey-latest.osm.pbf', 'isle-of-man-latest.osm.pbf', 'faroe-islands-latest.osm.pbf', 'malta-latest.osm.pbf', 'azores-latest.osm.pbf', 'macedonia-latest.osm.pbf', 'kosovo-latest.osm.pbf', 'cyprus-latest.osm.pbf', 'montenegro-latest.osm.pbf', 'luxembourg-latest.osm.pbf', 'albania-latest.osm.pbf', 'iceland-latest.osm.pbf', 'moldova-latest.osm.pbf', 'georgia-latest.osm.pbf', 'estonia-latest.osm.pbf', 'latvia-latest.osm.pbf', 'bosnia-herzegovina-latest.osm.pbf', 'bulgaria-latest.osm.pbf', 'serbia-latest.osm.pbf', 'croatia-latest.osm.pbf', 'lithuania-latest.osm.pbf', 'hungary-latest.osm.pbf', 'romania-latest.osm.pbf', 'slovakia-latest.osm.pbf', 'ireland-and-northern-ireland-latest.osm.pbf', 'slovenia-latest.osm.pbf', 'belarus-latest.osm.pbf', 'greece-latest.osm.pbf', 'portugal-latest.osm.pbf', 'denmark-latest.osm.pbf', 'switzerland-latest.osm.pbf', 'turkey-latest.osm.pbf', 'belgium-latest.osm.pbf',

#### Merge Europe files

When loading the merged pbf file, encountered a error:
- `ERROR: Input data is not ordered: relation id 33702 appears more than once`
- Solution, use osmium merge-changes
  - Run out of memory locally (64G) and job killed.
  - Try to run in HPC with a docker image
    - But HPC does no allow docker, need to use singularity
  - Europe1: In `dev-intel16-k80`
    - Peak memory: virtual-509g, physical-167Gb
  - Europe2 Process kill `dev-intel16-k80`, use `dev-amd20` instead.
    - Peak memory: virtual-601g, physical-209g

In [31]:
# In /home/shius/projects/plant_sci_hist/7_countries/continent_osm/europe

'''bash
mkdir europe1 europe2
mv italy-latest.osm.pbf russia-latest.osm.pbf germany-latest.osm.pbf france-latest.osm.pbf europe1/
mv *.pbf europe2/
'''

###############
# NOT WORKING #
###############
# cd europe1
# osmium merge * -o europe1.osm.pbf
# cd ../europe2
# osmium merge * -o europe2.osm.pbf

'bash\nmkdir europe1 europe2\nmv italy-latest.osm.pbf russia-latest.osm.pbf germany-latest.osm.pbf france-latest.osm.pbf europe1/\nmv *.pbf europe2/\n'

In [32]:
# Europe1 merging
# move files to HPC
# In HPC: dev-intel16-k80
'''bash
# Pull and run docker image
singularity pull docker://stefda/osmium-tool
singularity run docker://stefda/osmium-tool

# In singularity
cd /mnt/home/shius/projects/plant_sci_hist/7_countries/

# Merge
osmium merge-changes -s -v -o europe1.osm.pbf *.pbf
'''

'bash\n# Pull and run docker image\nsingularity pull docker://stefda/osmium-tool\nsingularity run docker://stefda/osmium-tool\n\n# In singularity\ncd /mnt/home/shius/projects/plant_sci_hist/7_countries/\n\n# Merge\nosmium merge-changes -s -v -o europe1.osm.pbf *.pbf\n'

In [33]:
# Europe2 merging
# Move merged file, delete individual country files, upload europe2 country
# files, in dev-amd20
'''bash
singularity run docker://stefda/osmium-tool

cd /mnt/home/shius/projects/plant_sci_hist/7_countries/

osmium merge-changes -s -v -o europe2.osm.pbf *.pbf
'''

'bash\nsingularity run docker://stefda/osmium-tool\n\ncd /mnt/home/shius/projects/plant_sci_hist/7_countries/\n\nosmium merge-changes -s -v -o europe2.osm.pbf *.pbf\n'

#### Merge all other continents

North American, Europe, and Asia are big files. The other continents are:
- africa, antarctica, australia-oceania, central-america, south-america -- merged into all_others.osm.pbf

In [34]:
# In /home/shius/data_nominatim/continent_osm
'''
mkdir all_others
mv africa-latest.osm.pbf antarctica-latest.osm.pbf australia-oceania-latest.osm.pbf central-america-latest.osm.pbf south-america-latest.osm.pbf all_others/
cd all_others
'''

# move files to HPC
# pull and run docker image in singularity
'''
singularity run docker://stefda/osmium-tool

cd /mnt/home/shius/projects/plant_sci_hist/7_countries

osmium merge-changes -s -v -o all_others.osm.pbf *.pbf
'''

'\nsingularity run docker://stefda/osmium-tool\n\ncd /mnt/home/shius/projects/plant_sci_hist/7_countries\n\nosmium merge-changes -s -v -o all_others.osm.pbf *.pbf\n'

## ___Continue to identify location using geopy___

### Search function

In [35]:
def call_geolocator(geolocator, AD, token_idx, suppl_dict, debug=0):
  '''Subrontine for calling geolocator
  Args
    AD (list): A list of addresses for authors
    token_idx (int): define which token the location string should be obtained
      default to -1 which is typically where the broadest info (e.g., city,
      zip code) is located. If this does not work, will try -2 token, or 0
      which means the entire AD string will be used for geolocator search.
  Return:
    a3 (string): the a3 country code, if not found, return empty string.
    geo (geolocator): the object returned from the search
    err_str (string): If an exception is thrown, this is the error string
  '''
  # Get location string for 1st author
  add_str = AD[0]
  loc, errflag   = get_location_str(add_str, token_idx)

  if debug: print(loc, errflag)

  # This happens when there is only one token delimited by "," but the token_idx
  # is set to -2
  if errflag == 1:
    return "", None, "Only_1_token"

  # Call Nominatim to get a response:
  err_str = "NO_ERR"
  try:
    geo  = geolocator.geocode(loc, language='en')
  except Exception as ex:
    err_str = str(ex)
    geo = None

  if geo is not None:
    country = geo.raw['display_name'].split(", ")[-1]
    a3 = get_a3(country, cname_to_a3, cname_hist_to_a3, suppl_dict)
  else:
    a3 = "" 

  return a3, geo, err_str, loc

In [36]:
def call_nominatim(nominatim_nf, not_found, suppl_dict, dir_pmid_log, dir_out,
                   sleep_time=0.5):
  '''Search entries against Nominatim server
  Args:
    nominatim_nf (dict): {pmid:[AU,AD]}, record still not found
    not_found (dict): {pmid:[AU, AD]}, for records with no a3 code after
      pycountry search
    suppl_dict (dict): regions that do not have proper pycountry info
    dir_pmid_log (Path): path to pmid log file of completed searches
    dir_out (Path): path to geolocator outputs
    sleep_time (int): time in second between queries.
  Return:
    nominatim_nf (dict): {pmid:[AU, AD]} for pmids with NA as search results
  '''

  # Access local Nominatim server
  geolocator = Nominatim(domain=f'localhost:8080', scheme='http')

  # Found info, because timeout keep happening, decide to save results as
  # things go.
  #nominatim_out = {} # {pmid:[AU, AD, a3, geo.raw]}

  # Info not found
  nominatim_nf  = {} # {pmid:[AU, AD]}

  # Because I keep getting time out, try to track what's working so I can
  # continue what what's not.
  # Create directory for output search result files
  dir_out.mkdir(parents=True, exist_ok=True)

  # Get the last pmid with result
  if not dir_pmid_log.is_file():
    out_names     = ""
    last_out_name = ""
    print("Starting with no output yet")
  else:
    with open(dir_pmid_log, "r") as f:
      out_names = f.readline()
      out_names_list = out_names.split(' ')
      out_names_list.sort()
      last_out_name = out_names_list[-1]
      print("Started, last_out_name:", last_out_name)

  # Save the log file again
  with open(dir_pmid_log, "w") as f:
    # Write the names aleady processed
    f.write(out_names)

    # sort pmids
    pmids = list(not_found.keys())
    pmids.sort()

    # determine where to restart
    if last_out_name == "":
      starting_idx = 0
    else:
      starting_idx = pmids.index(last_out_name)+1

    pmids_remaining = pmids[starting_idx:]

    # Go through records with no a3 info  
    for pmid in tqdm(pmids_remaining):

      err_str1, err_str2, err_str3 = "", "", ""

      # Get AU and AD
      [AU, AD] = not_found[pmid]

      # Get location string based on the first author's AD field
      a3, geo, err_str1, loc = call_geolocator(geolocator, AD, -1, suppl_dict)

      # Not used, too problematic
      '''
      # Not found using the last field
      if geo is None:
        # Try the last second field
        a3, geo, err_str2 = call_geolocator(geolocator, AD, -2, suppl_dict)

        # Still not found
        if geo is None:
          # Try using the whole thing
          a3, geo, err_str3 = call_geolocator(geolocator, AD, 0, suppl_dict)
      '''

      if geo is None:
        nominatim_nf[pmid] = [AU, AD]
        geo_file = dir_out / f"{pmid}_na.txt"
        with open(geo_file, "w") as f_geo:
          f_geo.write(f"{AU}\t{AD}\tNA\t{None}\t{err_str1},{err_str2},{err_str3}")
      else:
        # Save result instead of put it in dictionary
        #nominatim_out[pmid] = [AU, AD, a3, loc, geo.raw]
        geo_file = dir_out / f"{pmid}.txt"
        with open(geo_file, "w") as f_geo:
          f_geo.write(f"{AU}\t{AD}\t{a3}\t{loc}\t{geo.raw}")

      # Write the pmid of this record into after search result is returned
      f.write(f" {pmid}")

      # To reduce possibilities of timeout
      sleep(sleep_time)

  return nominatim_nf

In [None]:
#######################
# NOT USED/NOT TESTED #
#######################
# Should have done this but kind of late... Written when pretty much all are 
# done.
def nominatim_run(db, iter, dir_out, nf_dict, suppl_dict, sleep=0.1):
  '''
  Args:
    db (str): database name for output files, e.g., na for North America
    iter (int): iteration, run 1, 2, ...
    dir_out (Path): location where output files should be generated
    nf_dict (dict): dictionary with entries without country info {pmid:[AU, AD]}
    suppl_dict (dict): dictionary with hard-coded name to country info
    sleep (flota): time to sleep between calls to prevent timeout 
  Return:
    nominatim_nf (dict): dictionary with entries still without country info 
      after the nominatim run, {pmid:[AU, AD]}
  '''
  dir_nominatim_out = dir_out / f"nominatim_{db}_out{iter}"       # output dir
  dir_pmid_log      = dir_out / f"log_nominatim_{db}_pmids{iter}" # pmid log
  nominatim_nf      = {}                                   # record not found
  nominatim_nf      = call_nominatim(nominatim_nf, nf_dict, suppl_dict, 
                                     dir_pmid_log, dir_nominatim_out, sleep)
  nominatim_nf_file = dir_out / f"country_info-nominatim_{db}_NF{iter}.pickle"
  with open(nominatim_nf_file, "wb") as f:
    pickle.dump(nominatim_nf, f)

  print("Not found:", len(nominatim_nf))

  return nominatim_nf

### Search region 1: north america

#### Nominatim server setup

See 7_1a for details.

In [43]:
# Set up server
'''
docker run -it \
  -e PBF_PATH=/nominatim/data/north-america-latest.osm.pbf \
  -e FREEZE=true \
  -p 8080:8080 \
  -v nominatim-data-na:/var/lib/postgresql/14/main \
  -v /home/shius/data_nominatim/continent_osm/:/nominatim/data \
  --name nominatim-na \
  mediagis/nominatim:4.2
'''
# Start        : 2023-03-16 17:21:34 (container time)
# Close to end : 2023-03-19 10:00:55: Warming database caches
# Take 48+17=65 hours

# Backup volume
'''
docker ps                                 # Container ID
docker volume ls                          # Check volume is created
docker volume inspect nominatim-data-na   # inspect

cd /home/shius/projects/plant_sci_hist/7_countries/nominatim_volumes

# temp container for backup
docker run --rm \
  --volumes-from nominatim-na \
  -v $PWD:/backup \
  busybox tar cvf /backup/backup-nominatim-na.tar /var/lib/postgresql/14/main
'''

# Import archived volume
'''bash
# temp container: nomi_na_restore
docker run -v /var/lib/postgresql/14/main --name nomi_na_restore ubuntu /bin/bash

cd /home/shius/projects/plant_sci_hist/7_countries/7_1_parse_countries/nominatim_volumes/

docker run --rm \
  --volumes-from nomi_na_restore \
  -v $PWD:/backup \
  bash -c "tar xvf /backup/backup-nominatim-na.tar"

# Find volume ID
docker volume ls
docker volume create --name nominatim-data-na_restore

docker run --rm -it \
  -v [OLD_VOLUME_ID]:/from \
  -v nominatim-data-na_restore:/to \
  bash -c "cd /from ; cp -av . /to"

docker ps -a  # Check if there is container using the old volume
docker rm -f nomi_na_restore
docker volume rm [OLD_VOLUME_ID]
'''

'bash\n# temp container: nomi_na_restore\ndocker run -v /var/lib/postgresql/14/main --name nomi_na_restore ubuntu /bin/bash\n\ncd /home/shius/projects/plant_sci_hist/7_countries/7_1_parse_countries/nominatim_volumes/\n\ndocker run --rm   --volumes-from nomi_na_restore   -v $PWD:/backup   bash -c "tar xvf /backup/backup-nominatim-na.tar"\n\n# Find volume ID\ndocker volume ls\ndocker volume create --name nominatim-data-na_restore\n\ndocker run --rm -it   -v [OLD_VOLUME_ID]:/from   -v nominatim-data-na_restore:/to   bash -c "cd /from ; cp -av . /to"\n\ndocker ps -a  # Check if there is container using the old volume\ndocker rm -f nomi_na_restore\ndocker volume rm [OLD_VOLUME_ID]\n'

In [44]:
# Restart nominatim server using imported volume
# 3/20/23: This should not rebuild database. Tested on much smaller osm.pbf but 
# not this one yet.
'''
docker run -it \
  -e PBF_PATH=/nominatim/data/north-america-latestt.osm.pbf \
  -p 8080:8080 \
  -v data_andorra_restore:/var/lib/postgresql/14/main \
  -v /home/shius/data_nominatim/continent_osm:/nominatim/data \
  --name nominatim_andorra2 \
  mediagis/nominatim:4.2
'''

'\ndocker run -it   -e PBF_PATH=/nominatim/data/north-america-latestt.osm.pbf   -p 8080:8080   -v data_andorra_restore:/var/lib/postgresql/14/main   -v /home/shius/data_nominatim/continent_osm:/nominatim/data   --name nominatim_andorra2   mediagis/nominatim:4.2\n'

#### NA run 1

In [None]:
# call nominatim and get the dictionary for records not found still

# Define output dir
dir_nominatim_na_out = dir71 / "nominatim_na_out"

# Define pmid log file
dir_pmid_log = dir71 / "log_nominatim_na_pmids"

# nominatim search north america, record not found
nominatim_na_nf = {}

nominatim_na_nf = call_nominatim(nominatim_na_nf, not_found, suppl_dict, 
                                 dir_pmid_log, dir_nominatim_na_out, 0.1)


In [None]:
# Save the nominatim_na obj

# nominatim north american with found records
# Decide not to create this. Generate output files for each records instead

#nominatim_na_file    = dir71 / "country_info-nominatim_na.pickle"
#with open(nominatim_na_file, "wb") as f:
#  pickle.dump(nominatim_na, f)

# nominatim north american records not found
nominatim_na_nf_file = dir71 / "country_info-nominatim_na_NF.pickle"
with open(nominatim_na_nf_file, "wb") as f:
  pickle.dump(nominatim_na_nf, f)

#### NA run 2

In case some are not found because of timeout

In [None]:
dir_nominatim_na_out2 = dir71 / "nominatim_na_out2"
dir_pmid_log2         = dir71 / "log_nominatim_na_pmids2"
nominatim_na_nf2 = {}
nominatim_na_nf2 = call_nominatim(nominatim_na_nf2, nominatim_na_nf, suppl_dict, 
                                  dir_pmid_log2, dir_nominatim_na_out2, 0.1)

nominatim_na_nf_file2 = dir71 / "country_info-nominatim_na_2_NF.pickle"
with open(nominatim_na_nf_file2, "wb") as f:
  pickle.dump(nominatim_na_nf2, f)
len(nominatim_na_nf2.keys())

#### NA run 3

In case some are not found because of timeout
- No change in the files with NA.

In [None]:
dir_nominatim_na_out3 = dir71 / "nominatim_na_out3"
dir_pmid_log3         = dir71 / "log_nominatim_na_pmids3"
nominatim_na_nf3 = {}
nominatim_na_nf3 = call_nominatim(nominatim_na_nf3, nominatim_na_nf2, suppl_dict, 
                                  dir_pmid_log3, dir_nominatim_na_out3, 0.1)

nominatim_na_nf_file3 = dir71 / "country_info-nominatim_na_3_NF.pickle"
with open(nominatim_na_nf_file3, "wb") as f:
  pickle.dump(nominatim_na_nf3, f)
len(nominatim_na_nf3.keys())

### Search region 2: asia

#### Docker command

In [None]:
# Remove previous server
'''
docker rm -f nominatim-na
docker volume rm nominatim-data-na
'''

# Set up server
'''
docker run -it \
  -e PBF_PATH=/nominatim/data/asia-latest.osm.pbf \
  -e FREEZE=true \
  -p 8080:8080 \
  -v nominatim-data-as:/var/lib/postgresql/14/main \
  -v /home/shius/data_nominatim/continent_osm/:/nominatim/data \
  --name nominatim-as \
  mediagis/nominatim:4.2
'''
# Start        : 2023-03-21 14:03:32: Using project directory: /nominatim
# Close to end : Forgot to paste this, done after 2.5 days

# Backup volume
'''
docker ps                                 # Container ID
docker volume ls                          # Check volume is created
docker volume inspect nominatim-data-as   # inspect

cd /home/shius/projects/plant_sci_hist/7_countries/7_1_parse_countries/nominatim_volumes/

# temp container for backup
docker run --rm \
  --volumes-from nominatim-as \
  -v $PWD:/backup \
  busybox tar cvf /backup/backup-nominatim-as.tar /var/lib/postgresql/14/main
'''

# Restart
# After loading the server, run into IO issue and need to shutdown and restart
# WSL. So dockerd needs to be restarted also. To my surprise, the container
# is still there (volume is there also but this I know). Was going to restart
# the process by using the volume but turn out the restart command works!
'''
docker restart nominatim-as
'''

#### Asia runs

In [45]:
# Run 1
dir_nominatim_as_out = dir71 / "nominatim_as_out"       # output dir
dir_pmid_log_as      = dir71 / "log_nominatim_as_pmids" # pmid log file
nominatim_as_nf      = {}                               # record not found
nominatim_as_nf = call_nominatim(nominatim_as_nf, not_found, suppl_dict, 
                                 dir_pmid_log_as, dir_nominatim_as_out, 0.1)
nominatim_as_nf_file = dir71 / "country_info-nominatim_as_NF.pickle"
with open(nominatim_as_nf_file, "wb") as f:
  pickle.dump(nominatim_as_nf, f)

print("Not found:", len(nominatim_as_nf))

# Run 2
dir_nominatim_as_out2 = dir71 / "nominatim_as_out2"       # output dir
dir_pmid_log_as2      = dir71 / "log_nominatim_as_pmids2" # pmid log file
nominatim_as_nf2      = {}                                # record not found
nominatim_as_nf2 = call_nominatim(nominatim_as_nf2, nominatim_as_nf, suppl_dict, 
                                 dir_pmid_log_as2, dir_nominatim_as_out2, 0.1)
nominatim_as_nf_file2 = dir71 / "country_info-nominatim_as_2_NF.pickle"
with open(nominatim_as_nf_file2, "wb") as f:
  pickle.dump(nominatim_as_nf2, f)

print("Not found:", len(nominatim_as_nf2))

Starting with no output yet


100%|██████████| 72396/72396 [3:10:08<00:00,  6.35it/s]   


Not found: 25419
Starting with no output yet


100%|██████████| 25419/25419 [57:08<00:00,  7.41it/s]  


Not found: 25381


In [47]:
# Run 3
dir_nominatim_as_out3 = dir71 / "nominatim_as_out3"       # output dir
dir_pmid_log_as3      = dir71 / "log_nominatim_as_pmids3" # pmid log file
nominatim_as_nf3      = {}                                # record not found
nominatim_as_nf3 = call_nominatim(nominatim_as_nf3, nominatim_as_nf2, suppl_dict, 
                                 dir_pmid_log_as3, dir_nominatim_as_out3, 0.1)
nominatim_as_nf_file3 = dir71 / "country_info-nominatim_as_3_NF.pickle"
with open(nominatim_as_nf_file3, "wb") as f:
  pickle.dump(nominatim_as_nf3, f)

print("Not found:", len(nominatim_as_nf3))

Started, last_out_name: 


100%|██████████| 25381/25381 [57:45<00:00,  7.32it/s]  


Not found: 25381


### Search region 3: europe subgroup 1

- When I ran this part, I have installed a new M.2 SSD that should be much faster. So some of the procedures have changed from this point and on.
- Run into an error:
  - `psycopg2.errors.DiskFull: could not resize shared memory segment`
  - See [this post](https://stackoverflow.com/questions/56751565/pq-could-not-resize-shared-memory-segment-no-space-left-on-device)
    - Try the 1st solution by setting `--shm-size=1g`

#### Docker command

In [None]:
# Remove previous server
'''
docker rm -f nominatim-as
docker volume rm nominatim-data-as
'''

# Copy osm.psf from slow to fast drive where WSL is.
'''
cd /home/shius/data_nominatim
cp continent_osm/europe1.osm.pbf ./
'''

# Set up server
'''
docker run -it --shm-size=1g \
  -e PBF_PATH=/nominatim/data/europe1.osm.pbf \
  -e FREEZE=true \
  -p 8080:8080 \
  -v nominatim-data-eu1:/var/lib/postgresql/14/main \
  -v /home/shius/data_nominatim:/nominatim/data \
  --name nominatim-eu1 \
  mediagis/nominatim:4.2
'''
# Start        : 2023-03-28 12:04:29: Using project directory: /nominatim
# Close to end : 2023-03-29 02:59:11.279 UTC [6510] LOG:  database system is ready to accept connections
# Man, this is so much faster... Take 15 hours, 4 times faster

# Backup volume
'''
docker ps                                 # Container ID
docker volume ls                          # Check volume is created
docker volume inspect nominatim-data-eu1  # inspect

cd /home/shius/data_nominatim

# temp container for backup
docker run --rm \
  --volumes-from nominatim-eu1 \
  -v $HOME:/backup \
  busybox tar cvf /backup/backup-nominatim-eu1.tar /var/lib/postgresql/14/main

mv ~/backup-nominatim-eu1.tar /home/shius/projects/plant_sci_hist/7_countries/7_1_parse_countries/nominatim_volumes/
'''

#### Europe subgroup 1 runs

From here an on, generate output in WSL, then move them over later
- Decide that run 3 is stupid...

In [35]:
# Run 1
dir_nominatim_eu1_out = dir_tmp / "nominatim_eu1_out"       # output dir
dir_pmid_log_eu1      = dir_tmp / "log_nominatim_eu1_pmids" # pmid log file
nominatim_eu1_nf      = {}                                   # record not found
nominatim_eu1_nf = call_nominatim(nominatim_eu1_nf, not_found, suppl_dict, 
                                  dir_pmid_log_eu1, dir_nominatim_eu1_out, 0.1)
nominatim_eu1_nf_file = dir_tmp / "country_info-nominatim_eu1_NF.pickle"
with open(nominatim_eu1_nf_file, "wb") as f:
  pickle.dump(nominatim_eu1_nf, f)

print("Not found:", len(nominatim_eu1_nf))

# Run 2
dir_nominatim_eu1_out2 = dir_tmp / "nominatim_eu1_out2"       # output dir
dir_pmid_log_eu12      = dir_tmp / "log_nominatim_eu1_pmids2" # pmid log file
nominatim_eu1_nf2      = {}                                  # record not found
nominatim_eu1_nf2 = call_nominatim(nominatim_eu1_nf2, nominatim_eu1_nf, suppl_dict, 
                                 dir_pmid_log_eu12, dir_nominatim_eu1_out2, 0.1)
nominatim_eu1_nf_file2 = dir_tmp / "country_info-nominatim_eu1_2_NF.pickle"
with open(nominatim_eu1_nf_file2, "wb") as f:
  pickle.dump(nominatim_eu1_nf2, f)

print("Not found:", len(nominatim_eu1_nf2))

#Run 3
dir_nominatim_eu1_out3 = dir_tmp / "nominatim_eu1_out3"       # output dir
dir_pmid_log_eu13      = dir_tmp / "log_nominatim_eu1_pmids3" # pmid log file
nominatim_eu1_nf3      = {}                                  # record not found
nominatim_eu1_nf3 = call_nominatim(nominatim_eu1_nf3, nominatim_eu1_nf2, suppl_dict, 
                                 dir_pmid_log_eu13, dir_nominatim_eu1_out3, 0.1)
nominatim_eu1_nf_file3 = dir_tmp / "country_info-nominatim_eu1_3_NF.pickle"
with open(nominatim_eu1_nf_file3, "wb") as f:
  pickle.dump(nominatim_eu1_nf3, f)

print("Not found:", len(nominatim_eu1_nf3))

Starting with no output yet


100%|██████████| 72396/72396 [2:45:41<00:00,  7.28it/s]   


Not found: 32349
Starting with no output yet


100%|██████████| 32349/32349 [1:11:51<00:00,  7.50it/s]


Not found: 32345
Starting with no output yet


 18%|█▊        | 5957/32345 [13:04<57:53,  7.60it/s]   


KeyboardInterrupt: 

In [None]:
# Move files
'''
mv ~/data_nominatim/tmp_out/*  ~/projects/plant_sci_hist/7_countries/7_1_parse_countries/
'''

### Search region 4: europe subgroup 2

#### Docker command

In [36]:
# Remove previous server
'''
docker rm -f nominatim-eu1
docker volume rm nominatim-data-eu1
'''

# Copy osm.psf from slow to fast drive where WSL is.
'''
cd /home/shius/data_nominatim
rm europe1.osm.pbf
cp continent_osm/europe2.osm.pbf ./
'''

# Set up server
'''
docker run -it --shm-size=1g \
  -e PBF_PATH=/nominatim/data/europe2.osm.pbf \
  -e FREEZE=true \
  -p 8080:8080 \
  -v nominatim-data-eu2:/var/lib/postgresql/14/main \
  -v /home/shius/data_nominatim:/nominatim/data \
  --name nominatim-eu2 \
  mediagis/nominatim:4.2
'''
# Start        : Forgot to put down, at 5:30pm, 3/30
# Close to end : Just realize it is done, 10:20am, 3/31

# Backup volume
'''
docker ps                                 # Container ID
docker volume ls                          # Check volume is created
docker volume inspect nominatim-data-eu2  # inspect

cd /home/shius/data_nominatim

# temp container for backup
docker run --rm \
  --volumes-from nominatim-eu2 \
  -v $HOME:/backup \
  busybox tar cvf /backup/backup-nominatim-eu2.tar /var/lib/postgresql/14/main

mv ~/backup-nominatim-eu2.tar /home/shius/projects/plant_sci_hist/7_countries/7_1_parse_countries/nominatim_volumes/
'''

'\ndocker ps                                 # Container ID\ndocker volume ls                          # Check volume is created\ndocker volume inspect nominatim-data-eu2  # inspect\n\ncd /home/shius/data_nominatim\n\n# temp container for backup\ndocker run --rm   --volumes-from nominatim-eu2   -v $HOME:/backup   busybox tar cvf /backup/backup-nominatim-eu2.tar /var/lib/postgresql/14/main\n\nmv ~/backup-nominatim-eu2.tar /home/shius/projects/plant_sci_hist/7_countries/7_1_parse_countries/nominatim_volumes/\n'

#### Europe subgroup 2 runs

From here an on, generate output in WSL, then move them over later

In [37]:
# Run 1
dir_nominatim_eu2_out = dir_tmp / "nominatim_eu2_out"       # output dir
dir_pmid_log_eu2      = dir_tmp / "log_nominatim_eu2_pmids" # pmid log file
nominatim_eu2_nf      = {}                                   # record not found
nominatim_eu2_nf = call_nominatim(nominatim_eu2_nf, not_found, suppl_dict, 
                                  dir_pmid_log_eu2, dir_nominatim_eu2_out, 0.1)
nominatim_eu2_nf_file = dir_tmp / "country_info-nominatim_eu2_NF.pickle"
with open(nominatim_eu2_nf_file, "wb") as f:
  pickle.dump(nominatim_eu2_nf, f)

print("Not found:", len(nominatim_eu2_nf))

# Run 2
dir_nominatim_eu2_out2 = dir_tmp / "nominatim_eu2_out2"       # output dir
dir_pmid_log_eu22      = dir_tmp / "log_nominatim_eu2_pmids2" # pmid log file
nominatim_eu2_nf2      = {}                                  # record not found
nominatim_eu2_nf2 = call_nominatim(nominatim_eu2_nf2, nominatim_eu2_nf, suppl_dict, 
                                 dir_pmid_log_eu22, dir_nominatim_eu2_out2, 0.1)
nominatim_eu2_nf_file2 = dir_tmp / "country_info-nominatim_eu2_2_NF.pickle"
with open(nominatim_eu2_nf_file2, "wb") as f:
  pickle.dump(nominatim_eu2_nf2, f)

print("Not found:", len(nominatim_eu2_nf2))

#Run 3
dir_nominatim_eu2_out3 = dir_tmp / "nominatim_eu2_out3"       # output dir
dir_pmid_log_eu23      = dir_tmp / "log_nominatim_eu2_pmids3" # pmid log file
nominatim_eu2_nf3      = {}                                  # record not found
nominatim_eu2_nf3 = call_nominatim(nominatim_eu2_nf3, nominatim_eu2_nf2, suppl_dict, 
                                 dir_pmid_log_eu23, dir_nominatim_eu2_out3, 0.1)
nominatim_eu2_nf_file3 = dir_tmp / "country_info-nominatim_eu2_3_NF.pickle"
with open(nominatim_eu2_nf_file3, "wb") as f:
  pickle.dump(nominatim_eu2_nf3, f)

print("Not found:", len(nominatim_eu2_nf3))

Starting with no output yet


100%|██████████| 72396/72396 [2:57:12<00:00,  6.81it/s]   


Not found: 27968
Starting with no output yet


100%|██████████| 27968/27968 [1:02:55<00:00,  7.41it/s]


Not found: 27961
Starting with no output yet


100%|██████████| 27961/27961 [1:02:43<00:00,  7.43it/s]


Not found: 27960


### Search region 5: all others

`all_others.osm.pbf`
- africa
- antarctica
- australia-oceania
- central-america
- south-america
- total = ~10.4Gb

#### Docker command

In [None]:
# Remove previous server
'''
docker rm -f nominatim-eu2
docker volume rm nominatim-data-eu2
'''

# Copy osm.psf from slow to fast drive where WSL is.
'''
cd /home/shius/data_nominatim
rm europe2.osm.pbf
cp continent_osm/all_others.osm.pbf ./
'''

# Set up server
'''
docker run -it --shm-size=1g \
  -e PBF_PATH=/nominatim/data/all_others.osm.pbf \
  -e FREEZE=true \
  -p 8080:8080 \
  -v nominatim-data-ao:/var/lib/postgresql/14/main \
  -v /home/shius/data_nominatim:/nominatim/data \
  --name nominatim-ao \
  mediagis/nominatim:4.2
'''
# Start        : 2023-03-30 21:33:53: Using project directory: /nominatim
# Close to end : 2023-03-31 07:27:43.864 UTC [6733] LOG:  database system is ready to accept connections

# Backup volume
'''
docker ps                                 # Container ID
docker volume ls                          # Check volume is created
docker volume inspect nominatim-data-ao  # inspect

cd /home/shius/data_nominatim

# temp container for backup
docker run --rm \
  --volumes-from nominatim-ao \
  -v $HOME:/backup \
  busybox tar cvf /backup/backup-nominatim-ao.tar /var/lib/postgresql/14/main

mv ~/backup-nominatim-ao.tar /home/shius/projects/plant_sci_hist/7_countries/7_1_parse_countries/nominatim_volumes/
'''

# Stop all docker related activities
'''
docker rm -f nominatim-ao
docker volume rm nominatim-data-ao
'''
# Kill dockerd

#### All other runs

Found a typo in run2 output so run3 was not done. Given how few new results are recovered in run3 previously. Did not repeat.

In [37]:
# Run 1
dir_nominatim_ao_out = dir_tmp / "nominatim_ao_out"       # output dir
dir_pmid_log_ao      = dir_tmp / "log_nominatim_ao_pmids" # pmid log file
nominatim_ao_nf      = {}                                   # record not found
nominatim_ao_nf = call_nominatim(nominatim_ao_nf, not_found, suppl_dict, 
                                  dir_pmid_log_ao, dir_nominatim_ao_out, 0.1)
nominatim_ao_nf_file = dir_tmp / "country_info-nominatim_ao_NF.pickle"
with open(nominatim_ao_nf_file, "wb") as f:
  pickle.dump(nominatim_ao_nf, f)

print("Not found:", len(nominatim_ao_nf))

# Run 2
dir_nominatim_ao_out2 = dir_tmp / "nominatim_ao_out2"       # output dir
dir_pmid_log_ao2      = dir_tmp / "log_nominatim_ao_pmids2" # pmid log file
nominatim_ao_nf2      = {}                                  # record not found

### ERROR: 
###   below should be nominatim_ao_nf2, said as so run 3 did not work
nominatim_as_nf2 = call_nominatim(nominatim_ao_nf2, nominatim_ao_nf, suppl_dict, 
                                 dir_pmid_log_ao2, dir_nominatim_ao_out2, 0.1)
nominatim_ao_nf_file2 = dir_tmp / "country_info-nominatim_ao_2_NF.pickle"
with open(nominatim_ao_nf_file2, "wb") as f:
  pickle.dump(nominatim_ao_nf2, f)

print("Not found:", len(nominatim_ao_nf2))

#Run 3
dir_nominatim_ao_out3 = dir_tmp / "nominatim_ao_out3"       # output dir
dir_pmid_log_ao3      = dir_tmp / "log_nominatim_ao_pmids3" # pmid log file
nominatim_ao_nf3      = {}                                  # record not found

### ERROR: 
###   below should be nominatim_ao_nf3
nominatim_as_nf3 = call_nominatim(nominatim_ao_nf3, nominatim_ao_nf2, suppl_dict, 
                                 dir_pmid_log_ao3, dir_nominatim_ao_out3, 0.1)
nominatim_ao_nf_file3 = dir_tmp / "country_info-nominatim_ao_3_NF.pickle"
with open(nominatim_ao_nf_file3, "wb") as f:
  pickle.dump(nominatim_ao_nf3, f)

print("Not found:", len(nominatim_ao_nf3))

Starting with no output yet


100%|██████████| 72396/72396 [2:57:35<00:00,  6.79it/s]   


Not found: 25991
Starting with no output yet


100%|██████████| 25991/25991 [56:26<00:00,  7.67it/s]  


Not found: 0
Starting with no output yet


0it [00:00, ?it/s]

Not found: 0





## ___Test___

### pycountry and zip code

#### pycountry

In [None]:
list(pycountry.countries)[0]

In [None]:
rec = all_rec[4004]
au = rec["AU"]
ad = rec["AD"]
au, ad

In [None]:
pycountry.subdivisions.lookup("Urbana")

In [None]:
country = pycountry.countries.get(name="Yugoslavia")
print(country)

In [None]:
country = pycountry.historic_countries.get(name="Yugoslavia")
print(country)

In [None]:
list(pycountry.historic_countries)

#### uszipcode

In [None]:
from uszipcode import SearchEngine

sr = SearchEngine()
z = sr.by_zipcode("02167")
print(z)

#### zipcodes

In [None]:
import zipcodes

exact_zip = zipcodes.matching('02167')
print(exact_zip)

#### us

In [None]:
import us

print(us.states.lookup('24'))
print(us.states.lookup('bleh'))

### Checking medline parse

In [None]:
test_file = medline_dir / 'corpus_plant_421658_medline_100000.pickle'

with open(test_file, 'rb') as f:
    test_medline = pickle.load(f)

In [None]:
# Check those with country_info records with non NA country code
c = 0
for rec in test_medline:
    pmid = rec['PMID']
    if pmid in country_info and country_info[pmid][2] != 'NA':
        print("---\nPMID:", pmid)
        print("Country:", country_info[pmid][2])
        print("AD  :", rec['AD'])

        if c == 5:
            break
        c += 1

In [None]:
# Check those without country_info records
c = 0
for rec in test_medline:
    pmid = rec['PMID']
    if pmid not in country_info:
        print("---\nPMID:", pmid)
        if 'AD' in rec:
          print("AD  :", rec['AD'])
          print("loc :", not_found2[pmid])
        if c == 5:
            break
        c += 1

### Nominatim call by geopy

#### geocoder test

See [this post](https://stackoverflow.com/questions/44208780/find-the-county-for-a-city-state)

In [None]:
import geocoder

results = geocoder.google("Chicago")
print(results)

In [None]:
help(geolocator.geocode)

#### Nominatim call by geocoder

In [None]:
geolocator = Nominatim(domain='localhost:8080', scheme='http')
location = geolocator.geocode('avenue pasteur')
location

#### From city to location

See [this post](https://www.tutorialspoint.com/how-to-get-the-longitude-and-latitude-of-a-city-using-python)

In [None]:
from geopy.geocoders import Nominatim

# initialize Nominatim API
#geolocator = Nominatim(user_agent="plant_sci_hist")
geolocator = Nominatim(domain='localhost:8080', scheme='http')
# Get location
location = geolocator.geocode("Ann Arbor")

print(location.latitude, location.longitude)


#### From location to country
- In the same post, codes directly copied. But that give a `ConfigurationError`.
- Use [this](https://www.geeksforgeeks.org/get-the-city-state-and-country-names-from-latitude-and-longitude-using-python/) instead.

In [None]:
loc_rev = geolocator.reverse(f'{location.latitude},{location.longitude}')
 
# Display
print(loc_rev)

#### Man, this works for many other things

Retrieve in English based on [this post](https://stackoverflow.com/questions/29360910/geopy-retrieving-country-names-in-english)

In [None]:
print(geolocator.geocode("Dusseldorf", language='en'))
print(geolocator.geocode("the Netherlands", language='en'))
print(geolocator.geocode("Academia Sinica", language='en'))
print(geolocator.geocode("Michigan"))
print(geolocator.geocode("Ingham"))
print(geolocator.geocode("Michigan 48823"))
print(geolocator.geocode("MI 48823"))
print(geolocator.geocode("48823"))
print(geolocator.geocode("East Lansing 48823"))
print(geolocator.geocode("Kunming", language='en'))

In [None]:
# This one is wrong
print(geolocator.geocode("Yugoslavia", language='en'))

In [None]:
# This return none
# Ok, this is because I only load Monaco into the server
print(geolocator.geocode("MI USA", language='en'))
print(geolocator.geocode("Republic of Korea", language='en'))
print(geolocator.geocode("PR China", language='en'))

#### Old call_nominatim function

In [None]:
# This is the original function simply update country_info if ANYTHING is found.
# But this is problematic because earlier searches can lead to results that are
# not as important and are false positives. So since I search with north
# america first, this will lead to exaggerated number of matches to north 
# america countries which will bias the result. So create another one
def call_nominatim_OLD(country_info, not_found, port):
  # Access local Nominatim server
  geolocator = Nominatim(domain=f'localhost:{port}', scheme='http')

  # for records that still have no location info
  not_found_local     = {} # {pmid:[AU, AD]}
  count_geopy_a3 = 0
  for pmid in tqdm(not_found):
    if pmid not in country_info:
      [AU, AD] = not_found[pmid]
      loc, _   = get_location_str(AD)
      geo      = geolocator.geocode(loc, language='en')

      # geocode return something useful
      if geo is not None:
        country = geo.raw['display_name'].split(", ")[-1]
        a3      = get_a3(country, cname_to_a3, cname_hist_to_a3, suppl_dict)
        
        country_info[pmid] = [AU, AD, a3]
        count_geopy_a3 += 1
      # nothing found
      else:
        not_found_local[pmid] = [AU, AD, loc]

  print("Still missing:", len(not_found_local.keys()))
  return not_found_local

### Pyosmium

In [None]:
dir(osmium)

In [None]:
help(osmium.osmium)

### Test download osm pbf data with url

Use osm.pbf data from three states.
- Follow [this post](https://askubuntu.com/questions/1160575/how-to-make-python-wait-for-a-program-to-stop-before-going-to-the-next-line-of-c) to make sure each process finish before the next one is called.

In [None]:
'''
# Load Europe data into docker
docker run -it \
  -e PBF_PATH=/nominatim/data/michigan-latest.osm.pbf \
  -p 8080:8080 \
  -v /home/shius/projects/plant_sci_hist/7_countries/test/:/nominatim/data \
  --name nominatim_mi \
  mediagis/nominatim:4.2
'''

In [None]:
# Get test data
test_data_url = "https://download.geofabrik.de/north-america/us/"
test_data_dir = work_dir / "test"
states = ["michigan", "ohio", "wisconsin"]

#https://stackoverflow.com/questions/10251391/suppressing-output-in-python-subprocess-call
for states in states:
  url = f"{test_data_url}{states}-latest.osm.pbf"
  subprocess.call(['wget', '-P', test_data_dir, url], 
                  stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

### Test call_nominatim

#### Test one address

In [None]:
not_found.keys()

In [None]:
AD     = not_found['803110'][1]
loc, _ = get_location_str(AD)
print(loc)

In [None]:
geolocator = Nominatim(domain='localhost:8080', scheme='http')
geo = geolocator.geocode(loc)
geo

In [None]:
help(geo)

In [None]:
geo.address

In [None]:
geo.raw

#### Test call_nominatim

In [None]:
#https://stackoverflow.com/questions/5352546/extract-a-subset-of-key-value-pairs-from-dictionary
test_not_found = dict((k, not_found[k]) 
                      for k in ('803110', '1279697', '1279702', '1280165'))

call_nominatim(test_not_found, suppl_dict)


#### Testing with limited info

In [None]:
test_not_found = {
  '1': [["AU1"], ['Division of Hematology and Oncology, Ohio State University College of Medicine, Columbus 43210']],
  '2': [["AU2"], ['Division of Hematology and Oncology, Ohio State University College of Medicine, Columbus']],
  '3': [["AU3"], ['Division of Hematology and Oncology, Ohio State University College of Medicine, 43210']],
  '4': [["AU4"], ['Division of Hematology and Oncology, Ohio State University College of Medicine']],
  '5': [["AU5"], ['Ohio State University College of Medicine, Division of Hematology and Oncology']],
  '6': [["AU6"], ['Ohio State University, College of Medicine, Division of Hematology and Oncology']],
}

In [None]:
test_nominatim_dict = call_nominatim(test_not_found, suppl_dict)
for pmid in test_nominatim_dict:
  print(pmid, test_nominatim_dict[pmid][-1])


### Fix log_nominatim_pmids file

For some reason, it got stuck at 16656380 even though a result is generated.

In [None]:
# mv log_nominatim_na_pmids log_nominatim_na_pmids_BAK

with open(work_dir / "log_nominatim_na_pmids_BAK", "r") as f:
  pmids = f.readline().split(" ")[1:]
  pdict = {}
  for i in pmids:
    if i not in pdict:
      pdict[i] = 1

with open(work_dir / "log_nominatim_na_pmids", "w") as f:
  pmids_sorted = list(pdict.keys())
  pmids_sorted.sort()
  f.write(" ".join(pmids_sorted))


In [None]:
not_found['16656381']

### Deprecated functions

In [None]:
# Deal with timeout
# https://gis.stackexchange.com/questions/173569/avoid-time-out-error-nominatim-geopy-openstreetmap
# replace geopy.geocode with geolocator.geocode

# This give module not found error
#from geopy.exec import GeocoderTimedOut

def do_geocode(geolocator, address, attempt=1, max_attempts=5):
    try:
        return geolocator.geocode(address, language='en')
    except Exception as ex:
        #print("Exception:", ex)
        if attempt <= max_attempts:
            return do_geocode(geolocator, address, attempt=attempt+1)
        raise

In [None]:
# Written 3/20/23 but make things worse, so not used
def get_location_str(add_str, token_idx=-1, debug=0):
  '''Get the potential location string from AD
  Args:
    add_str (str): The content of the 1st AD element (1st author address)
    token_idx (int): -1, -2, or 0 (whole thing)
  Return
    location (str): the string that likely contain location info
    errflag (int): the AD info is empty and thus erroneous (1) or not (0)
  '''

  if debug: print("add_str:", add_str)

  # But there are 12 records where the AD field looks like:
  # ['.', '.', '.', '.', '.', '.']
  # So tokens will be "", dealt with in the if-else statement below.
  if add_str == "":
    loc = "NA"
    errflag = 1

  else:
    errflag = 0

    # Multipe authors:
    # ['From xxx, xxx, xxx, xxx, xxx (Ranade, Ganea, Razzak, and Garcia Gil)']
    # Another case:
    # [['xxx, xx, xxx, Maryland 20742 (M.H., G.F.D.).']
    if add_str[-1] == ")" or add_str[-2:] == ").":
      leftMargin = add_str.rfind(" (")
      add_str = add_str[:leftMargin]

    # Email field: Was contemplating using email for country, but some 1st
    # author emails are not the same as for the institution (see 2nd example).

    # 18930883, loc:China E-mail: suzhi1026@163.com
    # Some have ". Email:..."
    if add_str.find("E-mail:") != -1:
      tmp_str = add_str[:add_str.find("E-mail:")]
      # Use space as delimiter, then the empty space is taken care of later
      leftMargin = tmp_str.rfind(" ")
      add_str = add_str[:leftMargin]
      if debug: print("found: 'E-mail:", add_str)

    # 22016614, loc:Poland; E-Mails: agnieszka.pszczolkowska@uwm.edu.pl (A.P.); macieklojko@wp.p
    # 22072902, loc:China; E-Mail: yiruizao@163.com
    elif add_str.find("; E-Mail") != -1:
      add_str = add_str[:add_str.find("; E-Mail")]
      if debug: print("found: '; E-Mail", add_str)

    # 24866837:Fax: (+31) 50-3636440  <-- AG Groningen (The Netherlands), Fax:
    elif add_str.find("Fax:") != -1:
      add_str = add_str[:add_str.find("Fax:")]
      if debug: print("found: 'Fax:", add_str)

    # 28250584:GRID: grid.440587.a	XX, XX, XX, Para Brazil. GRID: grid.440587.a
    if add_str.find(" GRID:") != -1:
      add_str = add_str[:add_str.find(" GRID:")]
      if debug: print("found: ' GRID:", add_str)

    # ISNI code in ~7k records
    #   30263677 ['27601 Republic of Korea. ISNI: 0000 0004 1775 9398. GRID: grid.444122.5'
    if add_str.find("ISNI:") != -1:
      tmp_str = add_str[:add_str.find("ISNI:")]
      leftMargin = tmp_str.rfind(".")
      add_str = add_str[:leftMargin]
      
    # Below are more example for using @ field for parsing.
    #   case 1: 17296497 ['xxx, Denmark. blah@aki.ku.dk <blah@aki.ku.dk>']
    #   case 2: ?? ['Institut ..., France. achmustilli@libero.it']
    # The next one is weird, look like there is something not parsed properly
    #   case 3: 17632571 ['xxx, xxx, UK. ib 103@mole.bio.cam.ac.uk']
    # Also the next one, so cannot use "." as delimiter first.
    #   case 4: 18613594 ['xxx, ACT2601 Australia. rod.mahon@csiro.au']
    # Ok, some is missing space between country and email... Man...
    #   case 5: 9931476 ['PO Box 12, Rehovot 76100, Israel.cohenk@agri.huji.ac.il']
    #   case 6: 27172200:Tianjin P.R	XX, Tianjin P.R., 300384 China tongjiping@sina.com goodrice@263.net.
    if add_str.find("@") != -1:
      if debug: print("found: @", add_str)

      # Find where the 1st email address is and generate a temp_str
      tmp_str = add_str[:add_str.find("@")]
      if debug: print(" ", tmp_str)

      # Originally using space, but the 3rd example shows that it is not good.
      # But then "." is regularly used in email address. So do space delimter
      # first, then use "."
      tmp_str = tmp_str[:tmp_str.rfind(" ")] # this takes care of case 4, 6
      if debug: print(" ", tmp_str)

      # Will not deal with case 3 and 5, this break things, like case 6
      #leftMargin = tmp_str.rfind(".")        # this takes care of case 3
      #if leftMargin == -1:                   # this takes care of case 5
      #  leftMargin = tmp_str.rfind(",")
      #add_str = tmp_str[leftMargin+1:]
      add_str = tmp_str
      
    # Strip empty space before
    add_str = add_str.strip()

    if debug: print("final add_str:",[add_str])

    # Some just have email address so after parsing, add_str is "".
    #   25548975: '. kehrig@pharmazie.uni-kiel.de.'
    if add_str == "":
      loc = "NA"
      errflag = 1
    # if tokens ends with ".", rid of it
    else:
      if add_str[-1] == ".": 
        add_str = add_str[:-1]

      # Originally splot with ", " then "." but there are edge cases like this:
      #   17444520, loc:Hsinchu,Taiwan
      # So split with ",", if it does not exist, split with " "
      if "," not in add_str:
        # Only one large token, split with space instead
        tokens = add_str.split(" ")
        try:
          loc = tokens[token_idx]
        except IndexError:
          loc = "NA"
          errflag = 1
      else:
        tokens = add_str.split(",")
        try:
          # rid of space if present
          loc = tokens[token_idx].strip()
        except IndexError:
          loc = "NA"
          errflag = 1

      if debug: print("step1 loc:",[loc])

      # Edge cases with "XXX. XXX"
      # 17365182:110016. China	['Research Department of Natural Medicine, Shenyang Pharmaceutical University. Shenyang, 110016. China.']
      # 17518112:Lupaszigeti ut 4.-2011	['Gyogynoveny Kutato Intezet Zrt., Budakalasz, Lupaszigeti ut 4.-2011.']
      # 17851392:msvcrt.P. India	['School of Studies in Chemistry, Vikram University, Ujjain 456010, M.P. India. ksrao7709@rediffmail.com']
      if loc.find(".") != -1:
        loc = loc.split(". ")[-1]

      if debug: print("step2 loc:",[loc])

      # More edge cases with "(" and some with ")", examples:
      # case 1: 18636686:47023 Cesena (FC) Italy
      # case 2: 19704524:Ibaraki Japan; xxx (B & PMP); xxx; xxx; Montpellier France
      # case 3: 21665592:B-1860 Meise (Belgium);
      # case 4: 24828308:Japan (K.S	
      # case 5: 27789739:A.I.);	XX, Australia (M.W.M., S.K.-J., A.I.); monika.murcha@uwa.edu.au.
      if loc.find("(") != -1:
        if debug: print("step1 add_str:",[add_str])
        if loc.find(")") != -1:
          # case 1
          if loc.find(") ") != -1:
            loc = loc.split(") ")[-1]
          # case 2, 3
          else:
            loc = loc.split("(")[-1].split(")")[0]
        # case 4
        else:
          loc = loc.split(" (")[0]

      if debug: print("step3 loc:",[loc])

      # 19140172, loc:IR Iran, there are other variations. OpenStreeMap cannot
      # find these so deal with them manually.
      if loc.find("Iran") != -1:
        loc = "Iran"

      # 19651701, loc:Taiwan ROC
      if loc.endswith(" ROC"):
        loc = loc[:loc.rfind(" ")]

      # 21299880, loc:DF- 70770-917 - Brasil
      if loc.find("- ") != -1:
        loc = loc.split("- ")[-1]

      # 1915409, loc:Stuttgart/Bundesrepublik Deutschland
      if loc.find("/") != -1:
        loc = loc.split("/")[-1]

      if loc.endswith("."):
        loc = loc[:-1]

    if debug: print(loc)

  return loc, errflag


In [None]:
# As of 3/7/23
def get_location_str(AD, token_idx=-1, debug=0):
  '''Get the potential location string from AD
  Args:
    AD (str): The content in the AD field
    token_idx (int): -1, -2, or 0 (whole thing)
  Return
    location (str): the string that likely contain location info
    errflag (int): the AD info is empty and thus erroneous (1) or not (0)
  '''

  # The first element in the AD list is used (1st author)
  add_str  = AD[0]
  if debug: print(AD, add_str)

  # But there are 12 records where the AD field looks like:
  # ['.', '.', '.', '.', '.', '.']
  # So tokens will be "", dealt with in the if-else statement below.
  if add_str == "":
    loc = "NA"
    errflag = 1

  else:
    errflag = 0
    # Multipe authors:
    # ['From xxx, xxx, xxx, xxx, xxx (Ranade, Ganea, Razzak, and Garcia Gil)']
    # Another case:
    # [['xxx, xx, xxx, Maryland 20742 (M.H., G.F.D.).']
    if add_str[-1] == ")" or add_str[-2:] == ").":
      leftMargin = add_str.rfind(" (")

    # Email fields: without ending "."
    # ['Institut ..., France. achmustilli@libero.it']
    elif add_str[-1] != ".":
      leftMargin = add_str.rfind(" ")

    # nothing to do, will take the whole thing
    else:
      leftMargin = len(add_str)

    add_str = add_str[:leftMargin] # rid of author info

    # if tokens ends with ".", rid of it
    if add_str[-1] == ".": 
      add_str = add_str[:-1]
    
    # Split with ", ", if it does not exist, split with " "
    if ", " not in add_str:
      # Only one large token, split with space instead
      tokens = add_str.split(" ")
      try:
        loc = tokens[token_idx]
      except IndexError:
        loc = "NA"
    else:
      tokens = add_str.split(", ")
      try:
        loc = tokens[token_idx]
      except IndexError:
        loc = "NA"

    if debug: print(loc)

  return loc, errflag

In [None]:
# Deprecated because there are some issues with handling some weird records
# E.g., 16656381
'''
def recursive_nominatim(nominatim_nf, not_found, suppl_dict, dir_pmid_log, 
                                    dir_nominatim_na_out, err_recur=0):
  try:
    return call_nominatim(nominatim_nf, not_found, suppl_dict, dir_pmid_log, 
                          dir_nominatim_na_out, err_recur)
  except Exception as ex:
    #print("ERROR:", ex, err_recur, 1)
    return recursive_nominatim(nominatim_nf, not_found, suppl_dict, 
                               dir_pmid_log, dir_nominatim_na_out, 1)
'''

### Test run nominatim

For 16656795, it is New Heaven, Conneticu XXXX:
- NA search returns result correct, but for a location with importance=0.11
- AS search returns CHN, becaues multiuple token is searched and "New Heaven" is found in China with importance of 0.4.
- This means I cannot search with the add_str tokens other than the last one otherwise FP can happen.

In [None]:
test_pmid = '16656795'
add_str   = not_found[pmid][1][0]
add_str, get_location_str(add_str)

In [None]:

not_found_test = {}
test_list = ['16656734', '16656795']
for pmid in test_list:
  not_found_test[pmid] = not_found[pmid]
dir_test_out = dir71 / "test_out"
dir_pmid_log = dir71 / "log_test_pmids"
test_na_nf = {}
test_na_nf = call_nominatim(test_na_nf, not_found_test, suppl_dict, 
                                 dir_pmid_log, dir_test_out, 0.1)
test_na_nf