# __Step 7.1: Get country info__

Goal
- Get country info out of each doc
- Get # of docs per country
- Get # of docs per continent
- Get # of docs per country over time
- Get # of docs per country per topic
- Get # of docs per country per topic over time

Approach:
- Get the right token for country info in the AD (address) field; some records have an email address as the last token.
- `pycountry`: use both ISO 3166 (countries) and ISO 3166-3 (deleted countries).
- Supplementary dictionary: special cases, e.g., UK, Taiwan.
- `geopy`: pass the location token directly. This is a powerful module, but it does not handle historical countries properly.

Deprecated:
- `uszipcode`: search for a zip code if the token has two parts delimited by " " and the 2nd part is a number, either as is or after taking the 1st subtoken before "-". This is now handled by `geopy`.
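The token test described for the deprecated `uszipcode` step can be sketched as below. This is a minimal illustration of the rule as written, not the code that was actually run; the function name `looks_like_zip_token` is made up for this example. Note that the rule also accepts non-US postal codes such as "Kunming 650223".

```python
def looks_like_zip_token(token):
    """Return True if token has the form "<word> <number>" or "<word> <number>-<suffix>"."""
    parts = token.split(" ")
    # The rule only applies to tokens with exactly two space-delimited parts
    if len(parts) != 2:
        return False
    # Take the 1st subtoken before "-" so that ZIP+4 forms also pass
    first_sub = parts[1].split("-")[0]
    return first_sub.isdigit()


for tok in ["CA 94305-5020", "Indiana 47907", "France", "Kunming 650223"]:
    print(tok, "->", looks_like_zip_token(tok))
```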

Key info:
- Total records: 421658
- Medline available: 421626
  - No AD in medline record: 19862
  - With a3 based on pycountry/suppl dict: 289543
  - To geopy: 112165

Issues:
- 2/22/23: 
  - It turned out that the Docker command line only points to one region. Download OSM files from [Geofabrik](https://download.geofabrik.de/index.html).
  - There are some issues with AD fields that contain email addresses.
- 2/21/23:
  - Tried to query locations using the Nominatim service (OpenStreetMap) via geopy, but after a couple hundred queries the connection timed out. The service likely refuses to handle that many requests. Try installing Nominatim locally.
  - [Nominatim Docker version](https://github.com/mediagis/nominatim-docker/tree/master/4.2) and to [get Docker started](https://docs.docker.com/config/daemon/start/).
  - Got `docker: Got permission denied while trying to connect to the Docker daemon socket`. Try [this fix](https://www.digitalocean.com/community/questions/how-to-fix-docker-got-permission-denied-while-trying-to-connect-to-the-docker-daemon-socket).
  - See [this guide](https://www.linkedin.com/pulse/geocoding-geopy-your-own-nominatim-server-chonghua-yin?trk=related_artice_Geocoding%20with%20GeoPy%20and%20Your%20Own%20Nominatim%20Server_article-card_title).
- 2/20/23: 
  - The corpus dataset from 2_5_predict_pubmed does not have author or affiliation info. This needs to be done from the very beginning when I process the pubmed records.
  - In [MEDLINE/PubMed Data Element (Field) Descriptions](https://www.nlm.nih.gov/bsd/mms/medlineelements.html), there are several important notes on the affiliation (AD) field:
    - The affiliation of the authors, corporate authors and investigators appear in this repeating field.
      - 1988- The address of the first author's affiliation is included. The institution, city, and state including zip code for U.S. addresses, and country for countries outside of the United States, are included if provided in the journal; sometimes the street address is also included if provided in the journal.
      - 1995-2013 The designation USA is added at the end of the address when the first author's affiliation is in the fifty United States or the District of Columbia.
        - Q: Does this mean that this is not done for records before 1995?
      - 1996- The primary author's electronic mail (e-mail) address is included at the end of the Affiliation field, if present in the journal.
      - 2003- The complete first author address is entered as it appears in the article with no words omitted.
      - October 2013- Quality control of this field ceased in order to accommodate the affiliations for all authors and contributors.
      - December 2014- Multiple affiliations for each author or contributor are included.
        - __Because of this, only 1st author info is considered.__
  - For dealing with countries, there is the issue of historical country names; see [ISO 3166-3](https://en.wikipedia.org/wiki/ISO_3166-3).
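Since the 1996 note above says an e-mail address may be appended to the affiliation, one way to keep it out of the location token is to strip e-mail addresses before splitting. The sketch below is an assumption about how that could be done, not the approach used later in this notebook. The regular expression and the name `strip_emails` are illustrative, and labels such as "email:" or "Electronic address:" are not handled.

```python
import re

# Loose e-mail pattern for illustration; it is not a full RFC 5322 matcher
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def strip_emails(affiliation):
    """Remove e-mail addresses and leave the affiliation ending in a single '.'."""
    no_email = EMAIL_RE.sub("", affiliation)
    # Drop the whitespace and punctuation left behind at the end of the string
    trimmed = re.sub(r"[\s.;,]+$", "", no_email)
    return trimmed + "." if trimmed else ""


print(strip_emails("Chinese Academy of Sciences, Kunming 650223, China. djx@xtbg.ac.cn"))
```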

## ___Set up___

### Module import

In [42]:
import pickle, pycountry, sys
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from pathlib import Path
from tqdm import tqdm
from Bio import Entrez, Medline
from time import sleep
from geopy.geocoders import Nominatim

'''
from scipy.sparse import csr_matrix, lil_matrix, coo_matrix, dok_matrix
from time import time
from datetime import datetime
from dateutil.relativedelta import relativedelta
from collections import OrderedDict, Counter
from bisect import bisect
from mlxtend.preprocessing import minmax_scaling
from copy import deepcopy
'''


### Key variables

In [2]:
# Reproducibility
seed = 20220609

# Setting working directory
proj_dir   = Path.home() / "projects/plant_sci_hist"
work_dir   = proj_dir / "7_countries"
work_dir.mkdir(parents=True, exist_ok=True)

# plant science corpus with date and other info
dir2        = proj_dir / "2_text_classify/2_5_predict_pubmed"
corpus_file = dir2 / "corpus_plant_421658.tsv.gz"

# timestamp bins
dir44            = proj_dir / "4_topic_model/4_4_over_time"
ts_for_bins_file = dir44 / "table4_4_bin_timestamp_date.tsv"

# Embed fonts as TrueType (type 42) so text in saved PDFs stays editable
mpl.rcParams['pdf.fonttype'] = 42
plt.rcParams["font.family"] = "sans-serif"

## ___Get PubMed records___

### Read plant science corpus

In [3]:
corpus = pd.read_csv(corpus_file, compression='gzip', sep='\t')

In [4]:
corpus.head(2)

Unnamed: 0.1,Unnamed: 0,PMID,Date,Journal,Title,Abstract,QualifiedName,txt,reg_article,y_prob,y_pred
0,3,61,1975-12-11,Biochimica et biophysica acta,Identification of the 120 mus phase in the dec...,After a 500 mus laser flash a 120 mus phase in...,spinach,Identification of the 120 mus phase in the dec...,1,0.716394,1
1,4,67,1975-11-20,Biochimica et biophysica acta,Cholinesterases from plant tissues. VI. Prelim...,Enzymes capable of hydrolyzing esters of thioc...,plant,Cholinesterases from plant tissues. VI. Prelim...,1,0.894874,1


In [5]:
# Get all PMIDs
pmids = corpus.PMID.values
pmids.shape

(421658,)

### Get Pubmed docs using PMIDs


In [6]:
#https://stackoverflow.com/questions/59267992/biopython-how-to-download-all-of-the-peptide-sequences-or-all-records-associat

Entrez.email = 'shius@msu.edu'

id_list  = [str(pmid) for pmid in pmids]
post_xml = Entrez.epost(db='pubmed', id=','.join(id_list))
results  = Entrez.read(post_xml)
webenv   = results['WebEnv']
qkey     = results['QueryKey']

In [7]:
#http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec166

step    = 10000
for begin in tqdm(range(0, len(pmids), step)):
  # first check if this file is present
  medline_file = work_dir / f"corpus_plant_421658_medline_{begin}.pickle"

  # Check if the file is already there, if so, continue to the next one
  if not medline_file.is_file():
    subset   = pmids[begin:begin+step]

    # Get Medline records for subset
    handle  = Entrez.efetch(db='pubmed', id=subset, rettype='medline', 
                            retmode='text', webenv=webenv, query_key=qkey)
    records  = Medline.parse(handle)
    rec_list = list(records)

    with open(medline_file, "wb") as f:
      pickle.dump(rec_list, f)


100%|██████████| 43/43 [00:00<00:00, 412.95it/s]
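The 43 iterations in the progress bar follow directly from the corpus size and the step size. A small check of the batch boundaries, using only the numbers shown above (the helper name `batch_starts` is made up for this example):

```python
import math


def batch_starts(n_ids, step):
    """Start offsets used by the download loop: 0, step, 2*step, ..."""
    return list(range(0, n_ids, step))


n_ids, step = 421658, 10000
starts = batch_starts(n_ids, step)
n_batches = len(starts)
last_batch_size = n_ids - starts[-1]

print(n_batches, math.ceil(n_ids / step), last_batch_size)
```

The last file (`..._medline_420000.pickle`) therefore covers fewer PMIDs than the others.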


### Process PubMed Medline docs

In [8]:
# Read individual pickle files and compile the full list
all_rec = []
for begin in tqdm(range(0, len(pmids), step)):
  medline_file = work_dir / f"corpus_plant_421658_medline_{begin}.pickle"
  with open(medline_file, "rb") as f:
    rec_list = pickle.load(f)
  all_rec.extend(rec_list)

# The number of docs doesn't add up: some records were not downloaded
len(all_rec)

100%|██████████| 43/43 [00:30<00:00,  1.42it/s]


421585

### Check what's missing

In [9]:
# Go through all downloaded docs and get PMIDs
def check_missing(pmids, all_rec):
  '''
  Args:
    pmids (list): list of integer PMIDs
    all_rec (list): list of dictionary of medline records
  Return:
    id_list_missed (list): list of items in pmids but not all_rec
  '''

  # Downloaded
  pmids_dn = []
  for rec in tqdm(all_rec):
    pmids_dn.append(int(rec['PMID']))
  
  # Compare lists
  #https://stackoverflow.com/questions/15455737/python-use-set-to-find-the-different-items-in-list
  print("differnce:",len(pmids)-len(pmids_dn))

  pmids_ori_set = set(pmids)
  pmids_dn_set  = set(pmids_dn)
  missing = pmids_ori_set - pmids_dn_set
  print("# missing:", len(missing))

  id_list_missed = [str(pmid) for pmid in missing]

  return id_list_missed

In [10]:
# Get the missing records and add to all_rec
id_list_missed = check_missing(pmids, all_rec)

# Get Medline records for subset straight without epost
handle  = Entrez.efetch(db='pubmed', id=id_list_missed, rettype='medline', 
                        retmode='text')
records  = Medline.parse(handle)
rec_list = list(records)

# Can only get 41, so some still missing
print("Retrieved:", len(rec_list))

100%|██████████| 421585/421585 [00:00<00:00, 1208058.42it/s]


differnce: 73
# missing: 72
Retrieved: 41
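The list-length difference (73) and the set difference (72) disagree by one. One possible explanation, not verified here, is that the two lists contain different numbers of repeated PMIDs. The sketch below shows how such repeats could be found with `collections.Counter`; the toy lists and the name `find_duplicates` are illustrative only.

```python
from collections import Counter


def find_duplicates(ids):
    """Return {id: count} for every id that occurs more than once."""
    counts = Counter(ids)
    return {i: n for i, n in counts.items() if n > 1}


# Toy data: one PMID is requested twice and two are never downloaded
toy_requested = [61, 67, 67, 70, 75]
toy_downloaded = [61, 67]

len_diff = len(toy_requested) - len(toy_downloaded)
set_diff = len(set(toy_requested) - set(toy_downloaded))
print(len_diff, set_diff, find_duplicates(toy_requested))
```

On the real data, the same call would be `find_duplicates(pmids)` and `find_duplicates(int(rec['PMID']) for rec in all_rec)`.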


In [11]:
# Save the missing records as pickle
medline_file = work_dir / "corpus_plant_421658_medline_missed.pickle"

with open(medline_file, "wb") as f:
  pickle.dump(rec_list, f)

In [12]:
# Add to all_rec, then check again
all_rec.extend(rec_list)

In [13]:
still_missing = check_missing(pmids, all_rec)
len(still_missing)

100%|██████████| 421626/421626 [00:00<00:00, 1306550.32it/s]


differnce: 32
# missing: 31


31

### Check AD length

In [123]:
ad_len_dict = {}
for rec in tqdm(all_rec):
  if 'AD' in rec:
    ad_len = len(rec['AD'])
    if ad_len not in ad_len_dict:
      ad_len_dict[ad_len] = 1
    else:
      ad_len_dict[ad_len]+= 1

    #if ad_len == 2285:
    #  print(rec['AD'][0])

print(ad_len_dict)
        

100%|██████████| 421626/421626 [00:00<00:00, 1002805.60it/s]

{1: 235025, 4: 20509, 5: 20868, 2: 13440, 9: 10572, 6: 20175, 3: 17537, 10: 8071, 8: 13912, 11: 5575, 7: 17230, 16: 1384, 12: 4400, 13: 2978, 15: 1766, 14: 2466, 17: 952, 23: 239, 21: 401, 19: 620, 18: 890, 59: 2, 25: 164, 30: 93, 32: 77, 22: 352, 20: 552, 34: 46, 36: 49, 43: 23, 28: 118, 54: 10, 35: 50, 44: 12, 27: 150, 29: 93, 24: 276, 31: 64, 26: 156, 45: 26, 38: 27, 71: 4, 50: 12, 37: 34, 55: 6, 86: 1, 41: 17, 80: 4, 82: 1, 33: 61, 47: 12, 64: 1, 42: 22, 58: 2, 39: 28, 69: 2, 52: 10, 51: 10, 40: 41, 62: 3, 46: 17, 96: 2, 48: 12, 65: 5, 57: 7, 72: 4, 128: 1, 60: 6, 114: 2, 78: 2, 76: 1, 56: 5, 95: 2, 136: 1, 63: 3, 126: 2, 112: 2, 162: 1, 216: 1, 77: 3, 105: 3, 66: 4, 70: 5, 88: 3, 153: 1, 135: 1, 200: 1, 111: 2, 152: 1, 75: 4, 130: 1, 273: 1, 85: 2, 68: 2, 92: 2, 137: 2, 61: 5, 53: 4, 113: 1, 366: 1, 67: 3, 93: 1, 120: 1, 164: 1, 110: 2, 89: 1, 143: 1, 83: 1, 49: 4, 131: 1, 101: 1, 103: 1, 73: 3, 179: 1, 886: 1, 74: 2, 2285: 1, 79: 1, 139: 1, 166: 1, 168: 1}
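Here `len(rec['AD'])` is the number of affiliation entries in a record, not the length of a string. The same tally can be written more compactly with `collections.Counter`; this is an equivalent sketch on toy records (the records and the name `tally_ad_lengths` are made up for illustration).

```python
from collections import Counter


def tally_ad_lengths(records):
    """Count how many records have 1, 2, 3, ... entries in their AD list."""
    return Counter(len(rec["AD"]) for rec in records if "AD" in rec)


toy_records = [
    {"PMID": "1", "AD": ["Affiliation A."]},
    {"PMID": "2", "AD": ["Affiliation A.", "Affiliation B."]},
    {"PMID": "3"},  # no AD field, so it is skipped
    {"PMID": "4", "AD": ["Affiliation C."]},
]

ad_len_counts = tally_ad_lengths(toy_records)
print(sorted(ad_len_counts.items()))
```

Sorting the items makes the long tail in the real output easier to read than the insertion-ordered dictionary.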





### Spot check AD fields

In [125]:
# Using "." as the delimiter works for most records, but there are exceptions:
'''
['Instituto de Fitosanidad, Colegio de Postgraduados, km. 35.5 Carr. 
  Mexico-Texcoco, 56230-Texcoco, Edo. de Mexico, Mexico.']
['Botanisches Institut der Ludwig-Maximilians Universitat, Munchen, F.R.G.']
'''

for idx in range(0, len(all_rec), 1000):
  rec = all_rec[idx]
  if 'AD' in rec:
    print([rec['AD'][0]]) 

['Faculty of Pharmaceutical Sciences, Kumamoto University, Japan.']
["URA Centre National de la Recherche Scientifique 576, Departement de Biologie Moleculaire et Structurale, Centre d'Etudes Nucleaires de Grenoble, France."]
['Department of Biological Sciences, Stanford University, CA 94305-5020.']
['Botanisches Institut der Ludwig-Maximilians Universitat, Munchen, F.R.G.']
['Department of Biochemistry, Temple University School of Medicine, Philadelphia, PA 19140.']
['Department of Biochemistry, Johns Hopkins University, School of Hygiene and Public Health, Baltimore, Maryland 21205.']
['Institut de Biologie Moleculaire des Plantes du CNRS, Strasbourg, France.']
['Department of Agronomy, Purdue University, West Lafayette, Indiana 47907.']
["Departement de Biologie/Service de Biologie Cellulaire, Institut National de la Sante et de la Recherche Medicale U246, Centre d'Etudes Nucleaires de Saclay, Gif sur Yvette, France."]
['Ministry of Agriculture, Fisheries and Food, Slough Laboratory

## ___Search for country codes___

### Set up country dictionaries

In [14]:
# Build {country_name or official_name: alpha_3 code}
countries   = list(pycountry.countries)
cname_to_a3 = {}

for country in countries:
  name_a2    = country.alpha_2
  name_a3    = country.alpha_3
  name_short = country.name

  cname_to_a3[name_a2] = name_a3 # store this for situations like US
  cname_to_a3[name_a3] = name_a3 # store this for situations like USA
  cname_to_a3[name_short] = name_a3
  
  # put official name in
  try:
    name_offic = country.official_name
    cname_to_a3[name_offic] = name_a3
  except AttributeError:
    #print("No official name:", name_short)
    name_offic = "NA"

In [15]:
# Also build a dictionary for historical countries
countries_hist = list(pycountry.historic_countries)
cname_hist_to_a3 = {}

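for country in countries_hist:
  # alpha-3 code of this historical country
  name_a3 = country.alpha_3

  # the names in historical countries are the official names
  name_offic = country.name
  cname_hist_to_a3[name_offic] = name_a3
  
  # also store the short form before the 1st ",", e.g., "Burma"
  name_short = name_offic.split(",")[0]
  cname_hist_to_a3[name_short] = name_a3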


In [16]:
# For weird stuff
suppl_dict = {"UK":"GBR", "The Netherlands":"NLD", "Taiwan":"TWN"}
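The supplementary dictionary could be extended with other spellings that show up among the unresolved records later in this notebook, such as "PR China" and "Republic of Korea". The sketch below is a suggestion only. The `extra_aliases` entries and the helper `merge_aliases` are hypothetical additions, although the codes themselves are standard ISO 3166-1 alpha-3 values.

```python
suppl_dict = {"UK": "GBR", "The Netherlands": "NLD", "Taiwan": "TWN"}

# Hypothetical extra spellings, taken from unresolved tokens seen in spot checks
extra_aliases = {"PR China": "CHN", "Republic of Korea": "KOR"}


def merge_aliases(base, extra):
    """Return a new dict with extra aliases added; existing entries are kept."""
    merged = dict(base)
    for name, a3 in extra.items():
        merged.setdefault(name, a3)
    return merged


suppl_dict_ext = merge_aliases(suppl_dict, extra_aliases)
print(suppl_dict_ext)
```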

### Search for country info with pycountry

In [126]:
def get_location_str(AD):
  '''Get the potential location string from AD
  Args:
    AD (list): The content of the AD field, a list of affiliation strings
  Return:
    location (str): the string that likely contains location info
    errflag (int): the AD info is empty and thus erroneous (1) or not (0)
  '''

  # The first element in the AD list is used (1st author), rid of ending '.'
  tokens  = AD[0][:-1]

  # But there are 12 records where the AD field looks like:
  # ['.', '.', '.', '.', '.', '.']
  # So tokens will be "", dealt with in the if-else statement below.
  if tokens == "":
    location = "NA"
    errflag = 1
  else:
    errflag = 0
    # First deal with AD with multiple authors that have a format like:
    # ['From xxx, xxx, xxx, xxx, xxx (Ranade, Ganea, Razzak, and Garcia Gil)']
    if tokens[-1] == ")":
      leftMargin = tokens.rfind(" (")
      tokens = tokens[:leftMargin] # rid of author info

    # Split first on ".": the address should come before the 1st ".". If an
    # email is present in AD, it is expected in the 2nd token or later. A few
    # records have abbreviations ending with "."; those are dealt with later.
    # Then split on ", " and take the last field as the location field.
    tokens = tokens.split(".")[0].split(", ")

    # AD field has only one token
    if len(tokens) == 1:
      location = tokens[0]
    # For the rest, assume location is in the last token
    else:
      location = tokens[-1]

  return location, errflag
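
To make the abbreviation caveat in the comments concrete, the sketch below reproduces only the split step on a few affiliation strings from the spot check. The helper name `last_token_before_first_period` is made up for this illustration, and it leaves out the author-list and empty-field handling of `get_location_str`.

```python
def last_token_before_first_period(first_ad):
    """Apply the same split as above: drop the last character, keep the text
    before the 1st ".", then take the last ", "-delimited token."""
    body = first_ad[:-1]
    return body.split(".")[0].split(", ")[-1]


examples = [
    "Faculty of Pharmaceutical Sciences, Kumamoto University, Japan.",
    "Department of Biological Sciences, Stanford University, CA 94305-5020.",
    "Botanisches Institut der Ludwig-Maximilians Universitat, Munchen, F.R.G.",
]
for ad in examples:
    print(repr(last_token_before_first_period(ad)))
```

The first two strings give usable tokens, while "F.R.G." is cut down to "F" because the abbreviation contains periods.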

In [127]:
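def get_a3(location, cname_to_a3, cname_hist_to_a3, suppl_dict):
  '''Look up the alpha-3 code for a location string
  Args:
    location (str): location token from get_location_str
    cname_to_a3 (dict): {current country name or code: alpha_3}
    cname_hist_to_a3 (dict): {historical country name: alpha_3}
    suppl_dict (dict): {special-case name: alpha_3}
  Return:
    a3 (str): alpha-3 code, or "NA" if location is in none of the dicts
  '''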
  # Found current country name
  if location in cname_to_a3:
    a3 = cname_to_a3[location]
  # Found historical country name
  elif location in cname_hist_to_a3:
    a3 = cname_hist_to_a3[location]
  # Found name in suppl dict
  elif location in suppl_dict:
    a3 = suppl_dict[location]
  # Leave this for geopy in the next step
  else:
    a3 = 'NA'

  return a3
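
The lookup order (current names, then historical names, then the supplementary dictionary) can also be expressed with `collections.ChainMap`, which searches its mappings in the order given. This is an equivalent sketch on toy dictionaries, not a replacement that was run on the corpus; `get_a3_chained` and the `toy_*` names are illustrative.

```python
from collections import ChainMap


def get_a3_chained(location, *dicts):
    """Return the code from the first dict that contains location, else "NA"."""
    return ChainMap(*dicts).get(location, "NA")


# Toy stand-ins for cname_to_a3, cname_hist_to_a3 and suppl_dict
toy_current = {"Japan": "JPN", "France": "FRA"}
toy_hist = {"Czechoslovakia": "CSK"}
toy_suppl = {"UK": "GBR"}

for loc in ["Japan", "Czechoslovakia", "UK", "Urbana 61801"]:
    print(loc, "->", get_a3_chained(loc, toy_current, toy_hist, toy_suppl))
```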

In [128]:
# Without country a3
# Before checking for US state: 24867

country_info = {} # {pmid:[first_AU, first_AD, alpha_3]}
not_found    = {} # {pmid:[AU, AD]}, for records with no a3 code
count_AD_NA  = 0  # count records without AD field
count_a3     = 0  # count records with a3 code

# Go through each record
for rec in tqdm(all_rec):
  pmid = rec["PMID"]
  a3   = "NA" # set default value

  # Deal with AU info
  try:
    AU = rec["AU"]
  except KeyError:
    AU = "NA"  

  # Deal with AD info
  try:
    AD = rec["AD"]
    loc, errflag = get_location_str(AD)
    if not errflag:
      a3 = get_a3(loc, cname_to_a3, cname_hist_to_a3, suppl_dict)
      # Leave this for geopy in the next step
      if a3 == "NA":
        not_found[pmid] = [AU, AD]
    # The AD field is effectively empty so AD is "NA"
    else:
      AD = "NA"
      count_AD_NA += 1

  # AD field does not exist
  except KeyError:
    AD = "NA"
    count_AD_NA += 1

  if a3 != "NA":
    count_a3 += 1

  # For AD is NA, no point in doing more, so include these in country_info. 
  # For those with a3 code, they are done, so include these in country info.
  if AD == "NA" or a3 != "NA":
    country_info[pmid] = [AU, AD, a3]

print("Total   :", len(all_rec))
print("With a3 :", count_a3)
print("No AD   :", count_AD_NA)
print("To geopy:", len(not_found.keys()))

100%|██████████| 421626/421626 [00:08<00:00, 51817.33it/s] 

Total   : 421626
With a3 : 289543
No AD   : 19862
To geopy: 112165
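
A quick arithmetic check on the four numbers printed above shows that the three categories do not quite add up to the total. In the loop, `With a3` and `No AD` are counted per record, while `To geopy` is the number of keys in `not_found`. The gap would therefore correspond to records whose PMID was already a key in `not_found`, i.e., repeated PMIDs among the unresolved records. That reading is an inference from the code and has not been checked against the data.

```python
# Numbers copied from the output above
total, with_a3, no_ad, to_geopy = 421626, 289543, 19862, 112165

accounted = with_a3 + no_ad + to_geopy
gap = total - accounted

print("Accounted for:", accounted)
print("Gap          :", gap)
```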





## ___Continue to identify location using geopy___

### Docker setup and initial Nominatim run

Queries through the public Nominatim service timed out, so install a local version via Docker:

Install and start docker daemon:
```bash
sudo apt-get update
sudo apt-get upgrade
sudo apt install docker.io
sudo dockerd
```

Run Nominatim docker image:
```bash
sudo groupadd docker
sudo usermod -aG docker shius
su -s shius
docker run hello-world # testing
docker run -it \
  -e PBF_URL=https://download.geofabrik.de/europe/monaco-latest.osm.pbf \
  -e REPLICATION_URL=https://download.geofabrik.de/europe/monaco-updates/ \
  -p 8080:8080 \
  --name nominatim \
  mediagis/nominatim:4.2
```

Testing, in browser:
- http://localhost:8080/search.php?q=avenue%20pasteur

The code below follows [this tutorial](https://www.linkedin.com/pulse/geocoding-geopy-your-own-nominatim-server-chonghua-yin?trk=related_artice_Geocoding%20with%20GeoPy%20and%20Your%20Own%20Nominatim%20Server_article-card_title).


### Nominatim setup

Tried a few options:
- Installation: quite challenging, did not finish. Use the [Docker image](https://github.com/mediagis/nominatim-docker/blob/master/4.2/README.md) instead.
- Initial Docker run: found out that only Monaco was included. Look for ways to load the whole world.
- Downloaded the continent-level pbf files from [Geofabrik](https://www.geofabrik.de/data/download.html).
- Figured out [how to transfer files between local host and container](https://www.edureka.co/community/10534/copying-files-from-host-to-docker-container), moved all pbf files into the Docker container, and committed the image with the files.
- For each Nominatim run, a different region is loaded. The order is based on anticipated publication volume:
  - north-america-latest.osm.pbf
  - europe-latest.osm.pbf
  - asia-latest.osm.pbf
  - africa-latest.osm.pbf
  - australia-oceania-latest.osm.pbf
  - central-america-latest.osm.pbf
  - south-america-latest.osm.pbf
  - antarctica-latest.osm.pbf

In [None]:
'''
docker cp ~/data_nominatim/EACH_CONTINENT.pbf CONTAINER_ID:/nominatim/data
docker commit -m "With all continents" -a "Shinhan" CONTAINER_ID shius/nominatim_world
docker images

docker run -it \
  -e PBF_PATH=/nominatim/data/EACH_CONTINENT.pbf \
  -p 8080:8080  \
  --name nominatim_na \
  shius/nominatim_world
'''

### Search function

In [None]:
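def call_nominatim(country_info, not_found):
  '''Geocode records without an a3 code using the local Nominatim server
  Args:
    country_info (dict): {pmid:[AU, AD, a3]}, updated in place
    not_found (dict): {pmid:[AU, AD]}, records with no a3 code so far
  Return:
    not_found2 (dict): {pmid:[AU, AD, loc]}, records with no geocode result
  '''
  # Access local Nominatim server
  geolocator = Nominatim(domain='localhost:8080', scheme='http')

  # for records that still have no location info
  not_found2     = {} # {pmid:[AU, AD, loc]}
  count_geopy_a3 = 0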
  for pmid in tqdm(not_found):
    if pmid not in country_info:
      [AU, AD] = not_found[pmid]
      loc, _   = get_location_str(AD)
      geo      = geolocator.geocode(loc, language='en')

      # geocode return something useful
      if geo is not None:
        country = geo.raw['display_name'].split(", ")[-1]
        a3      = get_a3(country, cname_to_a3, cname_hist_to_a3, suppl_dict)
        
        country_info[pmid] = [AU, AD, a3]
        count_geopy_a3 += 1
      # nothing found
      else:
        not_found2[pmid] = [AU, AD, loc]

  print("Still missing:", len(not_found2.keys()))
  return not_found2
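
The function above takes the country from the last ", "-delimited piece of `display_name`. An alternative, to my understanding of geopy and Nominatim, is to call `geolocator.geocode(loc, language='en', addressdetails=True)`, in which case the raw result carries an `address` dictionary with a lower-case two-letter `country_code`. The sketch below only parses a raw dictionary, so it runs without a server. The two raw dictionaries are hand-written illustrations, not real server responses, and `country_from_raw` is a made-up helper name.

```python
def country_from_raw(raw):
    """Return (alpha_2 or None, country name or None) from a Nominatim raw result."""
    address = raw.get("address") or {}
    code = address.get("country_code")
    a2 = code.upper() if code else None

    # Fall back to the last piece of display_name for the country name
    display = raw.get("display_name", "")
    name = display.split(", ")[-1] if display else None
    return a2, name


# Hand-written examples shaped like Nominatim output (illustrative only)
raw_with_address = {
    "display_name": "East Lansing, Ingham County, Michigan, 48823, United States",
    "address": {"country_code": "us"},
}
raw_without_address = {"display_name": "Kunming, Yunnan, China"}

print(country_from_raw(raw_with_address))
print(country_from_raw(raw_without_address))
```

Because `cname_to_a3` also stores alpha-2 codes as keys, the upper-cased code could be passed to `get_a3` directly.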

### North America search

In [94]:
# Access local Nominatim server
geolocator = Nominatim(domain='localhost:8080', scheme='http')

# for records that still have no location info
not_found2     = {} # {pmid:[AU, AD, loc]}
count_geopy_a3 = 0
for pmid in tqdm(not_found):
  if pmid not in country_info:
    [AU, AD]     = not_found[pmid]
    loc, errflag = get_location_str(AD)
    geo          = geolocator.geocode(loc, language='en')

    # geocode return something useful
    if geo is not None:
      country = geo.raw['display_name'].split(", ")[-1]
      a3      = get_a3(country, cname_to_a3, cname_hist_to_a3, suppl_dict)
      
      country_info[pmid] = [AU, AD, a3]
      count_geopy_a3 += 1
    # nothing found
    else:
      not_found2[pmid] = [AU, AD, loc]

print("Still missing:", len(not_found2.keys()))

100%|██████████| 150840/150840 [41:01<00:00, 61.28it/s] 

Still missing: 147446





In [96]:
# Spot check
for pmid in list(not_found2.keys())[100087:100097]:
    print(not_found2[pmid][1:])

[['Max Planck Institute for Terrestrial Microbiology, D-35043 Marburg, Germany; email: loprestl@mpi-marburg.mpg.de , daniel.lanver@mpi-marburg.mpg.de , gabriel.schweizer@mpi-marburg.mpg.de , shigeyuki.tanaka@mpi-marburg.mpg.de , liangl@mpi-marburg.mpg.de , marie.tollot@mpi-marburg.mpg.de , zuccaro.alga@mpi-marburg.mpg.de , reissmas@mpi-marburg.mpg.de , kahmann@mpi-marburg.mpg.de.'], 'reissmas@mpi-marburg.mpg.de ']
[['Department of Chemical Engineering, Konkuk University, 1 Hwayang-Dong, Gwangjin-Gu, Seoul, 143-701, Republic of Korea.'], 'Republic of Korea']
[['Zhejiang Institute of Subtropical Crops, Zhejiang Academy of Agricultural Sciences, Wenzhou, 325005, P.R. China. wuzhigang177@126.com.', 'Zhejiang Institute of Subtropical Crops, Zhejiang Academy of Agricultural Sciences, Wenzhou, 325005, P.R. China. jiangwu8888@163.com.', 'School of Applied Sciences, Health Innovations Research Institute, RMIT University, Melbourne, VIC, Australia. nitin.mantri@rmit.edu.au.', 'Zhejiang Institute

In [38]:
not_found2['1304750']

[['Zhao SL', 'Suo JZ', 'Chen LZ'],
 ['Beijing Institute for Clinical Pharmacy Research.']]

## ___Get continent info___

In [None]:
#https://stackoverflow.com/questions/55910004/get-continent-name-from-country-using-pycountry
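
This section is still empty. As a placeholder, here is a minimal sketch of how the per-continent tally could work once an alpha-3 to continent mapping is available. The mapping below is a deliberately tiny, hand-written assumption for illustration, and `docs_per_continent` and the toy `country_info` are hypothetical; a complete mapping would come from a package or table like the one discussed in the linked post.

```python
from collections import Counter

# Tiny hand-written mapping for illustration only
toy_a3_to_continent = {
    "USA": "North America",
    "CHN": "Asia",
    "FRA": "Europe",
    "SWE": "Europe",
    "CMR": "Africa",
}

# Same layout as country_info: {pmid: [AU, AD, a3]}
toy_country_info = {
    "16465904": ["NA", "NA", "CHN"],
    "16465985": ["NA", "NA", "CMR"],
    "16466344": ["NA", "NA", "FRA"],
    "16466375": ["NA", "NA", "USA"],
    "16466377": ["NA", "NA", "SWE"],
    "0": ["NA", "NA", "NA"],  # record without a country code
}


def docs_per_continent(country_info, a3_to_continent):
    """Count records per continent; codes missing from the mapping go to 'Unknown'."""
    counts = Counter()
    for _au, _ad, a3 in country_info.values():
        counts[a3_to_continent.get(a3, "Unknown")] += 1
    return counts


continent_counts = docs_per_continent(toy_country_info, toy_a3_to_continent)
print(continent_counts.most_common())
```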



## ___Put country code info into corpus___

## ___Test___

### pycountry

In [73]:
list(pycountry.countries)[0]

Country(alpha_2='AW', alpha_3='ABW', flag='🇦🇼', name='Aruba', numeric='533')

In [85]:
rec = all_rec[4004]
au = rec["AU"]
ad = rec["AD"]
au, ad

(['Fouly HM', 'Domier LL', "D'Arcy CJ"],
 ['Department of Plant Pathology, University of Illinois, Urbana 61801.'])

In [87]:
pycountry.subdivisions.lookup("Urbana")

LookupError: Could not find a record for 'urbana'

In [131]:
country = pycountry.countries.get(name="Yugoslavia")
print(country)

None


In [136]:
country = pycountry.historic_countries.get(name="Yugoslavia")
print(country)

None


In [137]:
list(pycountry.historic_countries)

[Country(alpha_2='AI', alpha_3='AFI', alpha_4='AIDJ', name='French Afars and Issas', numeric='262', withdrawal_date='1977'),
 Country(alpha_2='AN', alpha_3='ANT', alpha_4='ANHH', name='Netherlands Antilles', numeric='530', withdrawal_date='1993-07-12'),
 Country(alpha_2='BQ', alpha_3='ATB', alpha_4='BQAQ', name='British Antarctic Territory', withdrawal_date='1979'),
 Country(alpha_2='BU', alpha_3='BUR', alpha_4='BUMM', name='Burma, Socialist Republic of the Union of', numeric='104', withdrawal_date='1989-12-05'),
 Country(alpha_2='BY', alpha_3='BYS', alpha_4='BYAA', name='Byelorussian SSR Soviet Socialist Republic', numeric='112', withdrawal_date='1992-06-15'),
 Country(alpha_2='CS', alpha_3='CSK', alpha_4='CSHH', name='Czechoslovakia, Czechoslovak Socialist Republic', numeric='200', withdrawal_date='1993-06-15'),
 Country(alpha_2='CS', alpha_3='SCG', alpha_4='CSXX', name='Serbia and Montenegro', numeric='891', withdrawal_date='2006-06-05'),
 Country(alpha_2='CT', alpha_3='CTE', alpha_

### uszipcode

In [2]:
from uszipcode import SearchEngine

sr = SearchEngine()
z = sr.by_zipcode("02167")
print(z)

None


### zipcodes

In [3]:
import zipcodes

exact_zip = zipcodes.matching('02167')
print(exact_zip)

[]


### us

In [9]:
import us

print(us.states.lookup('24'))
print(us.states.lookup('bleh'))

Maryland


TypeError: str argument expected

### geocoder

See [this post](https://stackoverflow.com/questions/44208780/find-the-county-for-a-city-state)

In [13]:
import geocoder

results = geocoder.google("Chicago")
print(results)

<[REQUEST_DENIED] Google - Geocode [empty]>


In [None]:
help(geolocator.geocode)

### Nominatim setup

#### [mediagis/nominatim-docker](https://github.com/mediagis/nominatim-docker/blob/master/4.2/README.md)

Started out with this, then realized that different regions need to be downloaded separately.

In [None]:
'''bash
docker run -it \
  -e PBF_URL=https://download.geofabrik.de/north-america/us/michigan-latest.osm.pbf \
  -p 8080:8080  \
  --name nominatim_na \
  mediagis/nominatim:4.2
'''


#### [Aximem/nominatim-docker-multiple-regions](https://github.com/Aximem/nominatim-docker-multiple-regions)

Not well documented and it did not work for me.

In [None]:
'''bash
# In /home/shius/github
git clone https://github.com/Aximem/nominatim-docker-multiple-regions.git
cd nominatim-docker-multiple-regions/

# Build image
docker build --pull --rm -t nominatim .

# update file: multiple_regions/init_multiple_regions.sh
# Issue: Not sure where to find this file.
#COUNTRIES="africa australia-oceania europe antarctica central-america south-america asia"

# init multiple regions: did not work...
docker run -t -v /home/shius/data_nominatim/:/data nominatim sh /app/multiple_regions/init.sh
'''

#### From city to location

See [this post](https://www.tutorialspoint.com/how-to-get-the-longitude-and-latitude-of-a-city-using-python)

In [118]:
from geopy.geocoders import Nominatim

# initialize Nominatim API
#geolocator = Nominatim(user_agent="plant_sci_hist")
geolocator = Nominatim(domain='localhost:8080', scheme='http')
# Get location
location = geolocator.geocode("Ann Arbor")

print(location.latitude, location.longitude)

AttributeError: 'NoneType' object has no attribute 'latitude'
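
`geocode` returns `None` when nothing matches, which is what triggers the `AttributeError` above. A likely cause here, though this is an assumption, is that the local server only had a small extract loaded at the time. The guard below is a minimal sketch that works with any geocode-like callable; `safe_lat_lon` and the stub `fake_geocode` are made up so the example runs without a server.

```python
from types import SimpleNamespace


def safe_lat_lon(geocode, query):
    """Return (latitude, longitude), or None if the geocoder found nothing."""
    location = geocode(query)
    if location is None:
        return None
    return (location.latitude, location.longitude)


def fake_geocode(query):
    """Stub standing in for geolocator.geocode; it knows a single place."""
    if query == "avenue pasteur":
        return SimpleNamespace(latitude=43.7308551, longitude=7.4149204)
    return None


print(safe_lat_lon(fake_geocode, "Ann Arbor"))
print(safe_lat_lon(fake_geocode, "avenue pasteur"))
```

With a real server, the call would be `safe_lat_lon(geolocator.geocode, "Ann Arbor")`.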


#### From location to country
- The code in the same post was copied directly, but that gives a `ConfigurationError`.
- Use [this](https://www.geeksforgeeks.org/get-the-city-state-and-country-names-from-latitude-and-longitude-using-python/) instead.

In [111]:
loc_rev = geolocator.reverse(f'{location.latitude},{location.longitude}')
 
# Display
print(loc_rev)

1107, Olivia Avenue, North Burns Park, Ann Arbor, Washtenaw County, Michigan, 48104, United States


#### Man, this works for many other things

Retrieve in English based on [this post](https://stackoverflow.com/questions/29360910/geopy-retrieving-country-names-in-english)

In [112]:
print(geolocator.geocode("Dusseldorf", language='en'))
print(geolocator.geocode("the Netherlands", language='en'))
print(geolocator.geocode("Academia Sinica", language='en'))
print(geolocator.geocode("Michigan"))
print(geolocator.geocode("Ingham"))
print(geolocator.geocode("Michigan 48823"))
print(geolocator.geocode("MI 48823"))
print(geolocator.geocode("48823"))
print(geolocator.geocode("East Lansing 48823"))
print(geolocator.geocode("Kunming", language='en'))

Dusseldorf, North Rhine-Westphalia, Germany
Netherlands
Academia Sinica, 128, Academia Road Section 2, Zhongyan Village, Nangang District, Jiuzhuang, Taipei, 11529, Taiwan
Michigan, United States
Ingham County, Michigan, United States
East Lansing, Ingham County, Michigan, 48823, United States
East Lansing, Ingham County, Michigan, 48823, United States
East Lansing, Ingham County, Michigan, 48823, United States
East Lansing, Ingham County, Michigan, United States
Kunming, Yunnan, China


In [113]:
# This one is wrong
print(geolocator.geocode("Yugoslavia", language='en'))

Yugoslavia, Reynold García (Pastorita), Peñas Altas, Ciudad de Matanzas, Matanzas, Cuba


In [114]:
# This return none
print(geolocator.geocode("MI USA", language='en'))
print(geolocator.geocode("Republic of Korea", language='en'))
print(geolocator.geocode("PR China", language='en'))

United States
South Korea
Praça China, Santana de Parnaíba, Região Imediata de São Paulo, Região Metropolitana de São Paulo, Região Geográfica Intermediária de São Paulo, São Paulo, Southeast Region, Brazil


### nominatim local access

In [29]:
geolocator = Nominatim(domain='localhost:8080', scheme='http')
location = geolocator.geocode('avenue pasteur')
location

Location(Avenue Pasteur, Fontvieille, Monaco, 98020, Monaco, (43.7308551, 7.4149204, 0.0))

### Checking medline parse

In [101]:
test_file = work_dir / 'corpus_plant_421658_medline_100000.pickle'

with open(test_file, 'rb') as f:
    test_medline = pickle.load(f)

In [102]:
# Check those with country_info records with non NA country code
c = 0
for rec in test_medline:
    pmid = rec['PMID']
    if pmid in country_info and country_info[pmid][2] != 'NA':
        print("---\nPMID:", pmid)
        print("Country:", country_info[pmid][2])
        print("AD  :", rec['AD'])

        if c == 5:
            break
        c += 1

---
PMID: 16465904
Country: CHN
AD  : ['The Key Laboratory of Industrial Biotechnology, Ministry of Education, Southern Yangtze University, Wuxi 214036, China.']
---
PMID: 16465985
Country: CMR
AD  : ['University of Ngaoundere, Ngaoundere, Cameroon.']
---
PMID: 16466344
Country: FRA
AD  : ['Laboratoire de Physiologie Cellulaire Vegetale, UMR5168 CNRS-CEA-INRA-Universite Joseph Fourier Grenoble I, Departement Reponse et Dynamique Cellulaires, CEA-Grenoble, 17 rue des Martyrs, F-38054 Grenoble Cedex 9, France.']
---
PMID: 16466375
Country: USA
AD  : ['Department of Plant Pathology, Washington State University, Pullman, WA 99164-6430, USA.']
---
PMID: 16466376
Country: CHE
AD  : ['Phytopathology Group, Institute of Plant Sciences, Swiss Federal Institute of Technology (ETH), Zurich, Switzerland.']
---
PMID: 16466377
Country: SWE
AD  : ['Department of Botany, Stockholm University, Stockholm, Sweden.']


In [103]:
# Check those without country_info records
c = 0
for rec in test_medline:
    pmid = rec['PMID']
    if pmid not in country_info:
        print("---\nPMID:", pmid)
        if 'AD' in rec:
          print("AD  :", rec['AD'])
          print("loc :", not_found2[pmid])
        if c == 5:
            break
        c += 1

---
PMID: 16465888
AD  : ['Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, Kunming 650223, China. djx@xtbg.ac.cn']
loc : [['Dou JX', 'Zhang YP', 'Feng ZW', 'Liu WJ'], ['Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, Kunming 650223, China. djx@xtbg.ac.cn'], 'Kunming 650223']
---
PMID: 16465991
AD  : ['State Key Laboratory for Agricultural Biotechnology, College of Biology, China Agricultural University, Beijing, PR China.']
loc : [['Yu J', 'Peng P', 'Zhang X', 'Zhao Q', 'Zhu D', 'Sun X', 'Liu J', 'Ao G'], ['State Key Laboratory for Agricultural Biotechnology, College of Biology, China Agricultural University, Beijing, PR China.'], 'PR China']
---
PMID: 16466532
AD  : ['Danish Institute of Agricultural Sciences, Research Centre Foulum, Tjele, Denmark. martint.soresen@agrsci.dk']
loc : [['Sorensen MT', 'Danielsen V'], ['Danish Institute of Agricultural Sciences, Research Centre Foulum, Tjele, Denmark. martint.soresen@agrsci.dk'], 'Tjele']
---
P

### Note on local Nominatim install

Using the Docker version, I could just specify a different continent in each run, but that is very clunky, so I tried to install Nominatim locally. There are just too many dependencies, so I am not doing this for now; stopped at the ICU install.

Following the [installation instructions](https://nominatim.org/release-docs/latest/admin/Installation/).

#### Install dependencies

- cmake: `sudo apt install cmake`
- [expat](https://github.com/libexpat/libexpat/releases): configure, make, make install
- [proj](https://proj.org/):  `conda install -c conda-forge proj`
- [bzip2](https://sourceforge.net/projects/bzip2/): already in Ubuntu
- [Zlib](https://www.zlib.net/): [instruction](https://geeksww.com/tutorials/libraries/zlib/installation/installing_zlib_on_ubuntu_linux.php)
- [ICU](https://icu.unicode.org/):  

#### PostgreSQL

##### Install

```bash
# Create the file repository configuration:
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'

# Import the repository signing key:
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -

# Update the package lists:
sudo apt-get update

# Install the latest version of PostgreSQL.
# If you want a specific version, use 'postgresql-12' or similar instead of 'postgresql':
sudo apt-get -y install postgresql
```

##### Tuning

```bash
vim /etc/postgresql/15/main/postgresql.conf
```


