_Adapted from notebook used by [nbclab.github.io](https://nbclab.github.io)._

# Retrieve new publications from PubMed 

This notebook searches PubMed for the latest publications by Dr. Khan using Biopython's Entrez tools. A publication-specific Markdown file is generated for each unique paper, with many elements set up automatically. As noted in the original notebook, you should generally check that the link to the Markdown file exists. Preprints cannot be found via this method, though they can be added manually. The process is automated and runs monthly using GitHub Actions.

## Steps (via GitHub or manual)

1. Run this notebook.
2. If any new papers were grabbed, check the following:
    1. The paper has either of the lab PIs as an author. Ensure that it isn't by *another* AR Khan.
    2. The paper is not a duplicate of a preprint or another version of the paper. If so, merge the two versions.
3. Save the changes to the notebook.
4. Push changes to the notebook and affected files to GitHub.
5. Open a pull request to khanlab/khanlab.github.io.
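
The duplicate check in step 2 is done by hand, but a rough first pass can be scripted. The following is a minimal sketch, not part of the original notebook: it assumes that two versions of a paper (for example a preprint and the journal article) differ only in capitalization and punctuation of the title. The helper names `normalize_title` and `group_possible_duplicates` are hypothetical, and the titles are made up for illustration.

```python
import re
from collections import defaultdict


def normalize_title(title):
    """Lower-case a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def group_possible_duplicates(titles):
    """Return lists of titles that share the same normalized form."""
    groups = defaultdict(list)
    for title in titles:
        groups[normalize_title(title)].append(title)
    return [group for group in groups.values() if len(group) > 1]


# Made-up titles, for illustration only
titles = [
    "A Hypothetical Study of Things.",
    "A hypothetical study of things",
    "An unrelated paper",
]
possible_duplicates = group_possible_duplicates(titles)
print(possible_duplicates)
```

Titles that were reworded between versions will not be caught this way, so a manual look is still needed.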

In [1]:
# Libraries
import os 

from Bio import Entrez, Medline
from datetime import datetime
from dateutil import parser
import pandas as pd

### Check existing publications

In [2]:
# First count number of articles from previous grab
pub_data = "_data/publications/publications.csv"

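# Load the existing publications file, if any, and count its rows
old_count = 0
df_old = pd.DataFrame(columns=["title"])  # empty fallback so later cells work on a first run
if os.path.isfile(pub_data):
    df_old = pd.read_csv(pub_data)
    old_count = len(df_old)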

### Perform new query

In [3]:
# Only grab papers from after the lab PI came to UWO
search_criteria = ['''"Khan AR"[AUTH] AND ("2015/01/01"[PDAT] : "3000/12/31"[PDAT]) AND
                    ("Western University"[AFFL] OR "University of Western Ontario"[AFFL] OR
                     "Brain and Mind Institute"[AFFL] OR "Robarts Research Institute"[AFFL])''']

# Email required to search
Entrez.email = ''
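
The search term above combines three parts: an author filter, a publication-date range, and a list of affiliations joined with `OR`. As a sketch only, the same string could be assembled from its parts, which may make it easier to add an affiliation later. The function name `build_pubmed_query` and its parameters are hypothetical and do not come from the original notebook or from Biopython.

```python
def build_pubmed_query(author, start_date, affiliations, end_date="3000/12/31"):
    """Assemble a PubMed search term from an author, a date range, and affiliations."""
    # Each affiliation becomes a quoted [AFFL] clause; the clauses are OR-ed together
    affiliation_clause = " OR ".join(f'"{name}"[AFFL]' for name in affiliations)
    return (
        f'"{author}"[AUTH] AND '
        f'("{start_date}"[PDAT] : "{end_date}"[PDAT]) AND '
        f"({affiliation_clause})"
    )


query = build_pubmed_query(
    author="Khan AR",
    start_date="2015/01/01",
    affiliations=[
        "Western University",
        "University of Western Ontario",
        "Brain and Mind Institute",
        "Robarts Research Institute",
    ],
)
print(query)
```

The result is a single-line version of the term in `search_criteria`, without the line breaks and indentation.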

In [4]:
rows = []

# Publications to skip (possibly due to another user with same initial)
skip_pmids = [32971934, 29641820, 29634829]
skip_pmids = [str(pmid) for pmid in skip_pmids]

for TERM in search_criteria:
    search = Entrez.esearch(db="pubmed", retmax="2", term=TERM)
    result = Entrez.read(search)
    print(f"Total number of publications containing {TERM}: {result['Count']}")
    
    search_all = Entrez.esearch(db="pubmed", term=TERM, retmax=result["Count"])
    result_all = Entrez.read(search_all)
    ids_all = result_all['IdList']
    pubs_all = Entrez.efetch(db="pubmed", id=ids_all, rettype='medline', retmode='text')
    records = Medline.parse(pubs_all)
    
    acceptable_formats = ["journal article", "comparative study", "editorial"]
    
    for record in records:
        # 'PT' may be missing from a record, so fall back to an empty list
        if any(type_.lower() in acceptable_formats for type_ in record.get('PT', [])):
            pmid = record.get("PMID")
            pmcid = record.get("PMC", "")
            
            doi = [aid for aid in record.get("AID", []) if aid.endswith(" [doi]")]
            if doi:
                doi = doi[0].replace(" [doi]", "")
            else:
                doi = ""
            
            title = record.get("TI", "").rstrip(".")
            authors = record.get("AU")
            
            pub_date = parser.parse(record.get("DP"))
            journal = record.get('TA')
            volume = record.get('VI', '')
            issue = record.get('IP', '')
            pages = record.get('PG', '')
            
            row = [pmid, pmcid, doi, title, authors, pub_date.year, pub_date.month,
                   pub_date.day, journal, volume, issue, pages]
            rows.append(row)
            
df = pd.DataFrame(columns=['pmid', 'pmcid', 'doi', 'title', 'authors',
                           'year', 'month', 'day',
                           'journal', 'volume', 'issue', 'pages'],
                  data=rows)
df = df[~df["pmid"].isin(skip_pmids)]

new_pubs = df[~df['title'].isin(df_old['title'])]

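# Append new pubs to the old ones to solve the date issue
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df = pd.concat([df_old, new_pubs], ignore_index=True)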
df = df.sort_values(by=['year', 'month', 'day'], ascending=False)
df = df.fillna('')

# Save all relevant info from articles to a csv.
print("Saving identified publications to csv...")
df.to_csv(pub_data, index=False)

Total number of publications containing "Khan AR"[AUTH] AND ("2015/01/01"[PDAT] : "3000/12/31"[PDAT]) AND
                    ("Western University"[AFFL] OR "University of Western Ontario"[AFFL] OR
                     "Brain and Mind Institute"[AFFL] OR "Robarts Research Institute"[AFFL]): 62


Saving identified publications to csv...
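
The record-handling logic in the loop above (filtering by publication type and pulling the DOI out of the `AID` field) can be illustrated on its own with a plain dictionary standing in for a parsed MEDLINE record. This is a sketch under that assumption: the helper names `has_acceptable_type` and `extract_doi` are hypothetical, and the sample record is made up rather than fetched from PubMed.

```python
ACCEPTABLE_TYPES = {"journal article", "comparative study", "editorial"}


def has_acceptable_type(record):
    """True if any publication type ('PT') of the record is in the accepted set."""
    return any(pt.lower() in ACCEPTABLE_TYPES for pt in record.get("PT", []))


def extract_doi(record):
    """Return the first article ID tagged '[doi]', or an empty string if none exists."""
    suffix = " [doi]"
    for aid in record.get("AID", []):
        if aid.endswith(suffix):
            return aid[: -len(suffix)]
    return ""


# Made-up record for illustration; real records come from Medline.parse
sample_record = {
    "PT": ["Journal Article", "Review"],
    "AID": ["EXAMPLE-PII [pii]", "10.1234/example.5678 [doi]"],
}
print(has_acceptable_type(sample_record), extract_doi(sample_record))
```

Both helpers tolerate records that lack the `PT` or `AID` field, which is why they use `.get` with an empty-list default.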


In [5]:
# Drop skipped PMIDs from the combined table. Rows loaded from the CSV may
# hold integer PMIDs, so compare everything as strings.
if len(skip_pmids) > 0:
    df = df[~df['pmid'].astype(str).isin(skip_pmids)]
        
print(f"{len(df)} total articles found.")
print(f"{len(df) - old_count} new articles found.")

print("\nNew publications found:")
for _, pub in new_pubs.iterrows():
    print(f"Title: {pub['title']} ({pub['journal']})")
    print(f"Authors: {pub['authors']}")
    print(f"Journal (Date): {pub['journal']} ({pub['month']}/{pub['day']}/{pub['year']})\n")


57 total articles found.
0 new articles found.

New publications found:
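
The cells above decide which papers are new by comparing titles against the existing CSV, so a small change in a title would make an old paper look new. One possible alternative, offered here as a sketch rather than as what the notebook does, is to compare PMIDs after converting them all to strings. The function name `find_new_pmids` and the example IDs are hypothetical.

```python
def find_new_pmids(fetched_pmids, known_pmids, skip_pmids=()):
    """Return fetched PMIDs (as strings) that are neither already known nor skipped."""
    known = {str(pmid) for pmid in known_pmids}
    skipped = {str(pmid) for pmid in skip_pmids}
    return [
        str(pmid)
        for pmid in fetched_pmids
        if str(pmid) not in known and str(pmid) not in skipped
    ]


# Made-up IDs: fetched values are strings, stored values are integers
fetched = ["111", "222", "333", "444"]
known = [111, 222]
new_pmids = find_new_pmids(fetched, known, skip_pmids=[333])
print(new_pmids)
```

Converting to strings on both sides avoids mismatches between the string PMIDs returned by the search and any integer PMIDs read back from the CSV.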


In [6]:
# Also output the df in case output limit exceeded
new_pubs

Unnamed: 0,pmid,pmcid,doi,title,authors,year,month,day,journal,volume,issue,pages
