<h1>Auto-Generate Papers Page</h1>

<i>This little script was written by <a href="www.drkenreid.github.com">Ken Reid</a> for the <a href="www.msuqg.github.io">MSU Quantitative Genetics website</a>. Please give it a star if you use it.</i>

The script works in three steps.

<ol>
    <li> Set up the author names used as search parameters.</li>
    <li> Run the scrape, keep only the useful information, and remove duplicate entries.</li>
    <li> Write the HTML page (this is customised for www.msuqg.github.io but can easily be modified for your own use). </li>
</ol>

<h2>Step One</h2>

Define the names of authors we wish to list on the page, using the template: <span style="font-family: Consolas; font-size: 16px">"lastName firstName[Author]"</span>

In [70]:
names = ["Gondro Cedric[Author]","Tempelman RJ[Author]"]
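
As a minimal sketch of the template above, the query strings can also be built from (last name, first name) pairs instead of being typed by hand. The helper name <span style="font-family: Consolas; font-size: 16px">make_author_query</span> and the <span style="font-family: Consolas; font-size: 16px">authors</span> list are hypothetical and not part of the original script.

```python
def make_author_query(last_name, first_name):
    """Build a PubMed author query in the "lastName firstName[Author]" form."""
    # Hypothetical helper: strips stray whitespace, then applies the template.
    return "%s %s[Author]" % (last_name.strip(), first_name.strip())

# Same two authors as in the cell above, given as (last, first) pairs.
authors = [("Gondro", "Cedric"), ("Tempelman", "RJ")]
names = [make_author_query(last, first) for last, first in authors]
print(names)
```

This produces the same list as the hand-written cell, so the rest of the notebook can use it unchanged.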

<h2>Step Two</h2>

<i>Thanks to <a href="https://gist.github.com/bonzanini/5a4c39e4c02502a8451d">Bonzanini</a> for the PubMed scraping code used (and modified) below. Thanks to <a href = "https://www.geeksforgeeks.org/python-merging-two-dictionaries/">geeksforgeeks</a> for the merge function.</i>

Import the libraries and define a few functions we'll use.

In [76]:
# you need to install Biopython:
# pip install biopython

# Full discussion:
# https://marcobonzanini.wordpress.com/2015/01/12/searching-pubmed-with-python/
import Bio
from Bio import Entrez
import json

def search(query):
    Entrez.email = 'your.email@example.com'
    handle = Entrez.esearch(db='pubmed', 
                            sort='relevance', 
                            retmax='20',
                            retmode='xml', 
                            term=query)
    results = Entrez.read(handle)
    return results

def fetch_details(id_list):
    ids = ','.join(id_list)
    Entrez.email = 'your.email@example.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    return results

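# Merge two result dicts, concatenating lists stored under a shared key
# (e.g. 'PubmedArticle') so the earlier author's papers are not overwritten.
def Merge(dict1, dict2):
    res = dict(dict1)
    for key, value in dict2.items():
        if isinstance(res.get(key), list) and isinstance(value, list):
            res[key] = list(res[key]) + list(value)
        else:
            res[key] = value
    return res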

# Function used to detect if we are attempting to add a paper
# more than once. This can happen if you search for two authors
# who have collaborated. `results` must be a list of title strings,
# so that each entry can be compared directly with `title`.
def isUnique(results, title):
    return title not in results
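
Exact string comparison will miss duplicates whose titles differ only in case, spacing or a trailing full stop. The following is a minimal sketch of a more forgiving check, run on made-up titles rather than real PubMed results. The names <span style="font-family: Consolas; font-size: 16px">normalise_title</span> and <span style="font-family: Consolas; font-size: 16px">unique_by_title</span> are hypothetical, and the normalisation rules are an assumption about what counts as "the same paper".

```python
def normalise_title(title):
    """Lower-case, collapse whitespace and drop a trailing full stop."""
    return " ".join(title.lower().split()).rstrip(".")

def unique_by_title(titles):
    """Return titles in their original order, keeping the first of each duplicate."""
    seen = set()
    unique = []
    for title in titles:
        key = normalise_title(title)
        if key not in seen:
            seen.add(key)
            unique.append(title)
    return unique

# Made-up titles for two authors who share one paper.
titles_a = ["Paper one.", "Shared paper."]
titles_b = ["Shared  paper", "Paper two."]
combined = unique_by_title(titles_a + titles_b)
print(combined)
```

A set keeps each lookup cheap, which matters little for 20 results per author but avoids rescanning the list as the number of authors grows.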

Next we scrape PubMed for the above set of authors and store <i>all</i> results in <span style="font-family: Consolas; font-size: 16px">papers</span>. Each result holds all of the available information about a paper except its actual content. So we first strip the enormous object of about 95% of its data and retain only some chosen information (authors, titles, journals, years).

In [79]:
if __name__ == '__main__':
    #Create an empty dictionary so we can use the Merge() function
    #on the first iteration and every other one, too.
    papers = dict() 
    for name in names:
        results = search(name)
        id_list = results['IdList']
        papers = Merge(papers, fetch_details(id_list))

####################################################################################
# Test this is running properly by running the following print commands:
#
# Print each:
#for i, paper in enumerate(papers['PubmedArticle']):
#   print("%d) %s" % (i+1, paper['MedlineCitation']['Article']['ArticleTitle']))
#
# Print 1st paper in JSON format:
#print(json.dumps(papers['PubmedArticle'][0], indent=2, separators=(',', ':')))
####################################################################################

# Create a new list to store all unique records, and only the fields we are interested in.
results = []
# Titles already added. Duplicates are detected on the bare title, because the
# "(n)" prefix differs for every record and would hide a repeated paper.
seen_titles = []
for i, paper in enumerate(papers['PubmedArticle']):
    article = paper['MedlineCitation']['Article']
    title = str(article['ArticleTitle'])
    if isUnique(seen_titles, title):
        seen_titles.append(title)
        item = []
        item.append("(%d) %s" % (i+1, title))
        item.append("(%d) %s" % (i+1, article['AuthorList']))
        item.append("(%d) %s" % (i+1, article['Journal']))
        item.append("(%d) %s" % (i+1, article['ArticleDate']))
        results.append(item)
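
The loop above stores each field as the string form of a nested object, which is awkward to print on a web page. Below is a minimal sketch of reducing one article record to four plain strings, using a toy record with made-up values. The author keys (<span style="font-family: Consolas; font-size: 16px">LastName</span>, <span style="font-family: Consolas; font-size: 16px">Initials</span>) are the ones visible in the printed output. The <span style="font-family: Consolas; font-size: 16px">Title</span> key under <span style="font-family: Consolas; font-size: 16px">Journal</span> and the <span style="font-family: Consolas; font-size: 16px">Year</span> key under <span style="font-family: Consolas; font-size: 16px">ArticleDate</span> are assumptions about the record layout and should be checked against a real record. The function names are hypothetical.

```python
def format_authors(author_list):
    """Join authors as "LastName Initials", skipping entries without a last name."""
    parts = []
    for author in author_list:
        if "LastName" in author:
            name = "%s %s" % (author["LastName"], author.get("Initials", ""))
            parts.append(name.strip())
    return ", ".join(parts)

def summarise(article):
    """Reduce one article record to four plain strings."""
    # Assumption: ArticleDate is a list of dicts, possibly empty.
    dates = article.get("ArticleDate", [])
    year = dates[0].get("Year", "") if dates else ""
    return {
        "title": str(article.get("ArticleTitle", "")),
        "authors": format_authors(article.get("AuthorList", [])),
        # Assumption: the journal name is stored under Journal -> Title.
        "journal": str(article.get("Journal", {}).get("Title", "")),
        "year": str(year),
    }

# Toy record; the values are made up for illustration only.
toy_article = {
    "ArticleTitle": "An example title.",
    "AuthorList": [
        {"LastName": "Li", "ForeName": "B", "Initials": "B"},
        {"LastName": "Fang", "ForeName": "L", "Initials": "L"},
    ],
    "Journal": {"Title": "Example Journal"},
    "ArticleDate": [{"Year": "2019", "Month": "01", "Day": "01"}],
}
summary = summarise(toy_article)
print(summary)
```

Using <span style="font-family: Consolas; font-size: 16px">.get()</span> with defaults means a record that lacks one of these fields yields an empty string instead of raising a <span style="font-family: Consolas; font-size: 16px">KeyError</span>.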

In [80]:
print(results)

[['(1) High-density genome-wide association study for residual feed intake in Holstein dairy cattle.', "(1) ListElement([DictElement({'Identifier': [], 'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'Animal Genomics and Improvement Laboratory, Agricultural Research Service, USDA, Beltsville, MD 20705-2350.'}], 'LastName': 'Li', 'ForeName': 'B', 'Initials': 'B'}, attributes={'ValidYN': 'Y'}), DictElement({'Identifier': [], 'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'Animal Genomics and Improvement Laboratory, Agricultural Research Service, USDA, Beltsville, MD 20705-2350; Department of Animal and Avian Sciences, University of Maryland, College Park 20742; Medical Research Council Human Genetics Unit at the Medical Research Council Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, EH4 2XU, United Kingdom.'}], 'LastName': 'Fang', 'ForeName': 'L', 'Initials': 'L'}, attributes={'ValidYN': 'Y'}), DictElement({'Identifier': [], 'AffiliationInf