## PubMed API - International Collaborations

Welcome to this tutorial! Today, we will use the PubMed Application Programming Interface (API) to automatically extract information from downloaded papers. PubMed is a repository of biomedical research papers run by the US National Institutes of Health: https://pubmed.ncbi.nlm.nih.gov/. More information on their API can be found here: https://www.ncbi.nlm.nih.gov/home/develop/api/.

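Please note that NCBI asks you to supply an email address when you access PubMed through the API. Registering for an API key is optional, but it raises the number of requests you may send per second: https://support.nlm.nih.gov/kbArticle/?pn=KA-05316.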

In this tutorial we will look at accessing PubMed papers which fall under the "obesity" search term over the last 10 years. We will extract the author information and make graphs using this information to determine who is collaborating with whom.

To do this, we make some assumptions:
1) All authors listed on the paper are assumed to have contributed to the paper and are considered working together,
2) We identify the authors' location based on the information they provide in the paper,
3) We remove authors who share the same country location,
4) Authors on the same paper from different countries are assumed to be international collaborators.
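
Assumptions 3 and 4 can be sketched on a single made-up paper. The country lists below are invented for illustration, and `international_partners` is a hypothetical helper name, not part of the code used later in this tutorial.

```python
def international_partners(author_countries):
    """Return the distinct countries on one paper if more than one is present."""
    distinct = sorted(set(author_countries))
    # A paper whose authors all share one country is not an international collaboration
    return distinct if len(distinct) > 1 else []

# Made-up author countries for two papers
print(international_partners(["Australia", "Australia", "Japan"]))
print(international_partners(["Brazil", "Brazil"]))
```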

### Import Modules

In [None]:
from Bio import Entrez
from Bio import Medline
import re
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline
import networkx as nx
import difflib
import os
import conda


conda_file_dir = conda.__file__
conda_dir = conda_file_dir.split('lib')[0]
proj_lib = os.path.join(os.path.join(conda_dir, 'share'), 'proj')
os.environ["PROJ_LIB"] = proj_lib

from mpl_toolkits.basemap import Basemap

### PubMed Search

We use Biopython to make our search a bit easier. More information on Biopython can be found here: https://biopython.org/docs/1.76/api/Bio.Entrez.html

Note you will need to fill in your email in the code below!

In [None]:
# Search PubMed for obesity articles published since 2014
Entrez.email = "[your email here]"
handle = Entrez.esearch(db="pubmed", term="obesity[biomedical]", mindate="2014", retmax="500", usehistory="y")
record = Entrez.read(handle)
handle.close()
# retmax caps the ID list at 500, so only the first 500 matches are fetched below
idlist = record["IdList"]
handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline", retmode="text")
records = list(Medline.parse(handle))
handle.close()
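
Because retmax is set to 500, only the first 500 matching IDs are returned. One way to go further would be to page through the result set with the retstart and retmax parameters that the E-utilities accept. The sketch below only computes the paging offsets and makes no network calls; `batch_offsets` is a hypothetical helper, and the total and batch size are example values.

```python
def batch_offsets(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` results."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)

# Example values only: 1,200 results requested 500 at a time
for retstart, retmax in batch_offsets(1200, 500):
    print(retstart, retmax)
```

Each pair could then be passed as retstart and retmax to the search or fetch call, keeping within NCBI's request rate limits.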

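Let's inspect record to find how many papers matching our search have been published since 2014. Our count is 490,503. That is a lot of papers! Note that only the first 500 of them are fetched, because retmax is set to 500.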

In [None]:
record
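
The parsed search result behaves like a dictionary. A minimal sketch of pulling out the fields of interest, using a hand-written stand-in for the real result (the IDs are placeholders, and the count is the one quoted above):

```python
# Stand-in for the parsed esearch result; the IDs are placeholders, not real PMIDs
example_record = {
    "Count": "490503",
    "RetMax": "500",
    "IdList": ["00000001", "00000002"],
}

total_matches = int(example_record["Count"])   # counts come back as strings
fetched_ids = example_record["IdList"]
print(f"{total_matches:,} matches, {len(fetched_ids)} IDs in this stand-in")
```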

Let's take a sneak peek at the first 5 records to see how our data is stored.

In [None]:
# First 5 records
for record in records[0:5]:
    print(record)

### Extracting Paper Information

Now that we have our papers, we want to extract some information about them. We want to extract the location information and see which country the authors belong to. For this we make use of some text files containing a list of all the countries in the world.

In [None]:
# Read the list of countries and the list of USA counties
# (the with block closes each file, so no explicit close is needed)
with open("Countries.txt", "r") as f:
    countries = [x.strip() for x in f]

with open("USA_Counties.txt", "r") as f:
    usa_counties = [x.strip() for x in f]

You will notice from the record information that the journal title is stored under 'JT'.

In [None]:
# Print journal
for record in records:
    print("Journal: ", record.get("JT","?"))

You will notice from the record information that the author location (affiliation) is stored under 'AD'.

In [None]:
# Print author location
for record in records:
    print("Author Location: ", record.get("AD","?"))

In [None]:
# Convert author location to dictionary
author_location_dic = dict()
for index,record in enumerate(records):
    author_location = record.get("AD","?")
    author_location_dic[index] = author_location

In [None]:
# Location of authors for each paper
final_author_location_list = []
for value in author_location_dic.values():
    author_country_list = []
    for affiliation in value:
        # Keep the text after the last comma, up to the first full stop
        info = re.split(r'[,]', affiliation)[-1]
        country_info = str(info.partition(".")[0])
        # Substring matching: a country is recorded if its name appears in the text
        for country in countries:
            if country in country_info:
                author_country_list.append(country)
        # Affiliations that match a USA county are counted as the United States
        for county in usa_counties:
            if county in country_info:
                author_country_list.append("United States")
    # Build the tuple once per paper, after all affiliations have been checked
    final_author_location_list.append(tuple(author_country_list))
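
Substring matching can over-match, since for example "Niger" is contained in "Nigeria". A possible alternative, sketched here under the assumption that country names should match as whole words, uses a regular expression with word boundaries. `extract_country` is a hypothetical helper, the affiliation string is made up, and unlike the loop above it returns only the first match.

```python
import re

def extract_country(affiliation, countries):
    """Return the first country found as a whole word in the last part of an affiliation."""
    last_part = affiliation.split(",")[-1]      # text after the final comma
    candidate = last_part.partition(".")[0]     # drop anything after the first full stop
    for country in countries:
        # \b anchors the match to word boundaries, so "Niger" does not match "Nigeria"
        if re.search(r"\b" + re.escape(country) + r"\b", candidate):
            return country
    return None

sample_countries = ["Niger", "Nigeria", "Japan"]
# Made-up affiliation string, for illustration only
sample_affiliation = "Dept. of Medicine, Example University, Lagos, Nigeria. someone@example.org"
print(extract_country(sample_affiliation, sample_countries))
```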

In [None]:
# Remove empty tuples
final_author_location_list = [x for x in final_author_location_list if x != ()]

Let's examine the final author location list to see the countries the authors come from. You can see many of them are from the same country! We're interested in those authors collaborating across countries. Let's do some more digging!

In [None]:
final_author_location_list

In [None]:
# Find collaborating countries
list_collaborators = []
for elem in final_author_location_list:
    # The last listed country is treated as the origin, the rest as candidate collaborators
    *collabs, origin = elem
    collaborators = []
    for country in collabs:
        # Exact comparison, so that e.g. "Niger" is not treated as part of "Nigeria"
        if country != origin and country not in collaborators:
            collaborators.append(country)
    list_collaborators.append((origin, tuple(collaborators)))
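
To see what the starred unpacking does, here is the same logic traced on one made-up tuple of author countries:

```python
# Made-up per-paper tuple of author countries, in the order they were found
paper_countries = ("Japan", "Australia", "Japan", "Canada", "Australia")

*collabs, origin = paper_countries   # origin takes the last entry, collabs the rest
collaborators = []
for country in collabs:
    # Skip the origin country itself and any country already recorded
    if country != origin and country not in collaborators:
        collaborators.append(country)

print(origin, tuple(collaborators))
```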

In [None]:
# Remove single origin countries
final_list_collaborators = []
for elem in list_collaborators:
    origin, collaborators = elem
    if collaborators != ():
        final_list_collaborators.append((origin,collaborators))

Let's check out our final list of collaborating authors. Here you will notice that authors from the same country have been removed.

In [None]:
final_list_collaborators

Now we want our final result as a dataframe!

In [None]:
# Convert list to dataframe
origin_countries = []
collab_countries = []
for elem in final_list_collaborators:
    origin, collabs = elem
    origin_countries.append(origin)
    #print(len(collabs))
    collab_countries.append(list(collabs))

In [None]:
# convert origin countries and collaborating countries to a dataframe
data = {"Origin Country":origin_countries,"Collaborating Countries":collab_countries}
collab_df = pd.DataFrame(data=data)

In [None]:
# One row per (origin, collaborator) pair, ready for the network graph
collab_df = collab_df.explode("Collaborating Countries")
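
In plain Python terms, the explode step turns each row holding an origin and a list of collaborators into one row per collaborator. A minimal sketch with made-up rows, written without pandas:

```python
# Made-up rows in the same shape as the dataframe before exploding
rows = [
    ("Australia", ["Japan", "Canada"]),
    ("Brazil", ["Portugal"]),
]

# One (origin, collaborator) pair per collaborator: the edge list a graph needs
edges = [(origin, collab) for origin, collabs in rows for collab in collabs]
print(edges)
```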

### Plotting Our Results

Yay! We have a list of international collaborating countries working on obesity. But a list on its own is not very informative, is it? We want a way of visualising our newly obtained information. For this next step, we use Basemap to see where our collaborators are. More information on Basemap can be found here: https://matplotlib.org/basemap/stable/

We first utilise the information in our text files to see which continent a country belongs to.

In [None]:
# Countries by continent
def read_country_list(filename):
    # Return the stripped lines of a text file as a list
    with open(filename, "r") as f:
        return [x.strip() for x in f]

north_america = read_country_list("North_America.txt")
south_america = read_country_list("South_America.txt")
europe = read_country_list("Europe.txt")
africa = read_country_list("Africa.txt")
oceania = read_country_list("Oceania.txt")
middle_east = read_country_list("Middle_East.txt")
asia = read_country_list("Asia.txt")

We now label each origin country and each collaborating country with its continent. Recall that the first entry of each element of final_list_collaborators is the origin country (taken from the last country listed on the paper) and the remaining entries are its collaborators.

In [None]:
# add continent information
def add_continent(data):
    if data in north_america:
        return "North America"
    elif data in south_america:
        return "South America"
    elif data in europe:
        return "Europe"
    elif data in africa:
        return "Africa"
    elif data in middle_east:
        return "Middle East"
    elif data in asia:
        return "Asia"
    elif data in oceania:
        return "Oceania"
    else:
        return "None Given" 
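
The chain of membership tests could also be written as a single dictionary lookup. This is an alternative sketch, not the function used in this tutorial; the continent lists are made up and deliberately tiny, and `lookup_continent` is a hypothetical name.

```python
# Made-up, deliberately tiny continent lists; the tutorial loads the real ones from text files
continent_members = {
    "Europe": ["France", "Germany"],
    "Asia": ["Japan"],
    "Oceania": ["Australia"],
}

# Invert to country -> continent so each lookup is a single dictionary access
country_to_continent = {
    country: continent
    for continent, members in continent_members.items()
    for country in members
}

def lookup_continent(country):
    """Return the continent for a country, or "None Given" if it is not listed."""
    return country_to_continent.get(country, "None Given")

print(lookup_continent("Japan"), lookup_continent("Atlantis"))
```

One difference to keep in mind: if a country appeared in two lists, the dictionary would keep the last one, whereas the if/elif chain returns the first match.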

In [None]:
# apply continent information to our dataframe
collab_df["Origin Continents"] = collab_df["Origin Country"].apply(add_continent)
collab_df["Collab Continents"] = collab_df["Collaborating Countries"].apply(add_continent)

In [None]:
# you can save the dataframe below for future reference!
collab_df.to_csv("Internal Collabs.csv",index=False)

In [None]:
# first get a blank basemap
m = Basemap(projection='robin',lon_0=0,resolution='l')
m.drawcountries(linewidth = 0.5)
m.fillcontinents(color='white',lake_color='white')
m.drawcoastlines(linewidth=0.5)

In [None]:
# our base graph: a directed graph with one edge per origin -> collaborator pair
G = nx.from_pandas_edgelist(df=collab_df, source="Origin Country", target="Collaborating Countries", edge_attr=True, create_using=nx.DiGraph())
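
A DiGraph holds at most one edge per ordered pair of nodes, so repeated collaborations between the same two countries collapse into a single edge. If the number of shared papers matters, one option is to count the pairs before building the graph. A minimal sketch with made-up pairs:

```python
from collections import Counter

# Made-up (origin, collaborator) pairs; a repeated pair means another shared paper
pairs = [
    ("Australia", "Japan"),
    ("Australia", "Japan"),
    ("Brazil", "Portugal"),
]

pair_counts = Counter(pairs)
for (origin, collab), n_papers in pair_counts.most_common():
    print(origin, "->", collab, n_papers)
```

Such counts could then be stored as an edge attribute, for example a weight, and used to scale edge widths when drawing.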

In [None]:
# load geographic coordinate system for countries
import csv

country = []
lat = []
lon = []
with open('LonLat.csv') as f:
    reader = csv.reader(f, delimiter=';')
    next(reader, None)    # skip the header row
    for row in reader:
        country.append(row[0].strip())    # clear spaces
        lat.append(float(row[1]))
        lon.append(float(row[2]))

# define position in basemap (map projection coordinates)
position = {}
for i in range(0, len(country)):
    position[country[i]] = m(lon[i], lat[i])

In [None]:
# the longitude and latitude for each country
country_long_lat = pd.read_csv("LonLat.csv",delimiter=';')

In [None]:
# get positions for each country using the longitudes and latitudes
position = {}
for row in country_long_lat.itertuples():
    position[row.Country.strip()] = (row.Longitude,row.Latitude)

In [None]:
# rename entries so the keys match the country names used in the graph
position["CROATIA"] = position.pop("CROATIA (HRVATSKA)")
position["SLOVAKIA"] = position.pop("SLOVAK REPUBLIC")
# Serbia is set by hand
position["SERBIA"] = [44,21]
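
When more names need aligning, an alias table keeps the renaming in one place. A minimal sketch with made-up coordinates; `aliases` and `example_position` are illustrative names only.

```python
# Alias table: name as it appears in the coordinates file -> name used in the graph
aliases = {
    "CROATIA (HRVATSKA)": "CROATIA",
    "SLOVAK REPUBLIC": "SLOVAKIA",
}

# Made-up coordinates, only to show the renaming
example_position = {
    "CROATIA (HRVATSKA)": (1.0, 2.0),
    "SLOVAK REPUBLIC": (3.0, 4.0),
    "JAPAN": (5.0, 6.0),
}

# Names without an alias are kept unchanged
renamed = {aliases.get(name, name): coords for name, coords in example_position.items()}
print(sorted(renamed))
```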

In [None]:
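# project each node's coordinates onto the map
position_dic = {}
for node in G:
    # A Basemap instance is called as m(longitude, latitude). The two stored values
    # are passed in reverse order, so the first stored value is expected to be the
    # latitude (as in the manual SERBIA entry, [44,21]).
    first, second = position[node.upper()]
    position_dic[node] = m(second, first)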
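
The lookup position[node.upper()] raises a KeyError for any country whose name is missing from the coordinate table. A quick pre-check can list such names first. The sketch below uses made-up stand-ins for the graph nodes and the coordinate table.

```python
# Made-up stand-ins for the graph's nodes and the coordinate table's keys
graph_nodes = ["Japan", "Australia", "Atlantis"]
known_positions = {"JAPAN": (0.0, 0.0), "AUSTRALIA": (1.0, 1.0)}

# Nodes are upper-cased before lookup, so compare in upper case here too
missing = [node for node in graph_nodes if node.upper() not in known_positions]
print(missing)
```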

In [None]:
# add the four map corners to our position dictionary, given as (longitude, latitude)
position_dic["NW"] = m(-180,90)
position_dic["NE"] = m(180,90)
position_dic["SW"] = m(-180,-90)
position_dic["SE"] = m(180,-90)

In [None]:
# draw the graph!
nx.draw_networkx_nodes(G, position_dic, nodelist = G.nodes(),node_color = 'r', alpha = 0.8, node_size = 10)
nx.draw_networkx_edges(G, position_dic, edge_color='g',alpha=0.2, arrows = True)
m.drawcoastlines()
m.drawcountries()
plt.savefig("International Collabs Biomedical.png",dpi=300,bbox_inches='tight')