<h1>Auto-Generate Papers Page</h1>

<i>This little script was written by <a href="www.drkenreid.github.com">Ken Reid</a> for the <a href="www.msuqg.github.io">MSU Quantitative Genetics website</a>. Please give it a star if you use it.</i>

This works in 3 steps.

<ol>
    <li> Set up author names for parameters</li>
    <li> Run the scrape, keep only the useful information, and remove non-unique entries.</li>
    <li> Write the HTML page (this is customised for <a href = "www.msuqg.github.io">www.msuqg.github.io</a> but can be easily modified for your own usage). </li>
</ol>

If you would rather skip most of step 3 and simply copy-paste the scraped and HTML-ized data, that's fine too; see the end of step 3.

<h2>Step One</h2>

Define the names of authors we wish to list on the page, using the template: <span style="font-family: Consolas; font-size: 16px">"lastName firstName[Author]"</span>

In [2]:
names = ["Gondro Cedric[Author]","Tempelman RJ[Author]"]
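
If the author list is kept elsewhere as separate last and first names, the search terms can be built rather than typed by hand. The sketch below is only an illustration: `author_term` and `people` are hypothetical names that do not appear in the original script, and the output simply follows the template described above.

```python
# Hypothetical helper (not part of the original script): build a PubMed
# author search term from a last name and a first name or initials.
def author_term(last_name, first_name):
    return f"{last_name.strip()} {first_name.strip()}[Author]"

# Example input; replace with your own authors.
people = [("Gondro", "Cedric"), ("Tempelman", "RJ")]

# Same structure as the hand-written `names` list above.
names = [author_term(last, first) for last, first in people]
print(names)
```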

<h2>Step Two</h2>

<i>Thanks to <a href="https://gist.github.com/bonzanini/5a4c39e4c02502a8451d">Bonzanini</a> for the pubmed scraping code utilised (and modified) below. Thanks to <a href = "https://www.geeksforgeeks.org/python-merging-two-dictionaries/">geeksforgeeks</a> for the merge function.</i>

Import libraries and define a couple of functions we'll use.

In [3]:
# you need to install Biopython:
# pip install biopython

# Full discussion:
# https://marcobonzanini.wordpress.com/2015/01/12/searching-pubmed-with-python/
import Bio
from Bio import Entrez
import json
import codecs
import datetime
from datetime import date

def search(query):
    Entrez.email = 'your.email@example.com'
    handle = Entrez.esearch(db='pubmed', 
                            sort='relevance', 
                            retmax='20',
                            retmode='xml', 
                            term=query)
    results = Entrez.read(handle)
    return results

def fetch_details(id_list):
    ids = ','.join(id_list)
    Entrez.email = 'your.email@example.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    return results

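# Merge two result dictionaries returned by fetch_details().
# A plain {**dict1, **dict2} lets dict2's 'PubmedArticle' list overwrite
# dict1's, which would keep only the last author's papers, so the list
# stored under each key is concatenated instead.
def Merge(dict1, dict2): 
    res = dict(dict1)
    for key, value in dict2.items():
        res[key] = list(res.get(key, [])) + list(value)
    return res 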

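# Function used to detect if we are attempting to add a paper
# more than once. This can happen if you search for two authors
# who have collaborated.
# Each stored record is a list whose first element is the numbered title,
# e.g. "\t\t(3) Some title", so compare only the text after the "(n) " prefix.
def isUnique(results, title):
    bareTitle = title.split(") ", 1)[-1]
    for result in results:
        # Skip the "<br><br>" separator strings stored between records.
        if isinstance(result, list) and result[0].split(") ", 1)[-1] == bareTitle:
            return False
    return True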

Next we scrape PubMed for the above set of authors, then store <i>all</i> results in <span style="font-family: Consolas; font-size: 16px">papers</span>. The results contain all available information about each paper, except for the actual content. So we first strip the enormous object of most of its data and retain only some chosen information (authors, titles, journals, years). If you wish to save some other information, look inside the commented code, uncomment one of the <span style="font-family: Consolas; font-size: 16px">print()</span> commands, and explore the <span style="font-family: Consolas; font-size: 16px">papers</span> object for the additional data you wish to use.

In [4]:
if __name__ == '__main__':
    #Create an empty dictionary so we can use the Merge() function
    #on the first iteration and every other one, too.
    papers = dict() 
    for name in names:
        results = search(name)
        id_list = results['IdList']
        papers = Merge(papers, fetch_details(id_list))

####################################################################################
# Test this is running properly by running the following print commands:
#
# Print each:
#for i, paper in enumerate(papers['PubmedArticle']):
#   print("%d) %s" % (i+1, paper['MedlineCitation']['Article']['ArticleTitle']))
#
# Print 1st paper in JSON format:
#print(json.dumps(papers['PubmedArticle'][0], indent=2, separators=(',', ':')))
####################################################################################

# Create a new list to store all unique records, and only the fields we are interested in.
reducedData = []
for i, paper in enumerate(papers['PubmedArticle']):
    # get title
    title = "\t\t(%d) %s" % (i+1, paper['MedlineCitation']['Article']['ArticleTitle'])
    if isUnique(reducedData, title):
        item = []
        item.append(title)
        
        # get authors
        rawAuthors = "(%d) %s" % (i+1, paper['MedlineCitation']['Article']['AuthorList'])
        splitString = rawAuthors.split("Name")
        splitString[0] = ""
        nameArray = []
        comma = True
        for chunk in splitString:
            chunkFromNameStart = chunk[4:]
            #print(chunkFromNameStart)
            endOfNameLocation = chunkFromNameStart.find("\'")
            #print(endOfNameLocation)
            charsOfInterest = chunkFromNameStart[:endOfNameLocation]
            if not charsOfInterest: #if string is empty, ignore and move to next. Some names are missing.
                continue
            #Ensure format is "LastName, FirstName; LastName2, FirstName2..."
            if comma == True:
                nameArray.append(charsOfInterest + ", ")
                comma = False
            else:
                nameArray.append(charsOfInterest + "; ")
                comma = True
        # Join the pieces and drop the trailing separator ("; " or ", ").
        # This is also safe when no names were found (empty nameArray).
        authors = "".join(nameArray).rstrip(";, ")
        item.append(authors)
        
        # get date
        date = "(%d) %s" % (i+1, paper['MedlineCitation']['Article']['ArticleDate'])
        locationOfYear = date.find("\'Year\': \'",0,len(date))
        # ArticleDate may be empty; find() then returns -1, so leave the year blank.
        year = date[locationOfYear+9:locationOfYear+13] if locationOfYear != -1 else ""
        item.append(", <b>"+year+"</b>, ") #4 digits because I doubt this code will be used past the year 9999
        
        # get journal
        journal = "(%d) %s" % (i+1, paper['MedlineCitation']['Article']['Journal'])
        journalLocation = journal.find("\'Title\':",0,len(journal)) + 10
        journalLocationEnd = journal.find("\',",journalLocation,len(journal))
        item.append("<i>"+journal[journalLocation:journalLocationEnd]+"</i>")
        
        # add all garnered data to reducedData as a record (string array, really)
        reducedData.append(item)
        reducedData.append("\n\n\t\t<br><br>\n\n")

# Finally, turn into a string for easy use next step.

paperData = ''.join(''.join(elems) for elems in reducedData)
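
The cell above pulls fields out of each record by converting it to a string and slicing around quote characters. Since the records are nested dictionaries and lists, the same fields can also be read by key. The sketch below illustrates that idea on a hand-written stand-in record, not on real PubMed output: `sample_paper` and `summarise` are hypothetical names, and the assumption is that real records follow the same nesting and may lack some fields.

```python
# Hand-written stand-in for one entry of papers['PubmedArticle'].
# Assumption: real records use the same nesting; some fields may be absent.
sample_paper = {
    "MedlineCitation": {
        "Article": {
            "ArticleTitle": "An example title",
            "AuthorList": [
                {"LastName": "Gondro", "ForeName": "Cedric"},
                {"CollectiveName": "Example Consortium"},  # no personal name
            ],
            "ArticleDate": [{"Year": "2020", "Month": "01", "Day": "15"}],
            "Journal": {"Title": "Example Journal"},
        }
    }
}

def summarise(paper):
    """Return the title, authors, year and journal of one record."""
    article = paper["MedlineCitation"]["Article"]
    # Keep only authors that have both name parts; skip the rest.
    authors = "; ".join(
        f"{a['LastName']}, {a['ForeName']}"
        for a in article.get("AuthorList", [])
        if "LastName" in a and "ForeName" in a
    )
    # The date list may be empty, so fall back to an empty year.
    dates = article.get("ArticleDate", [])
    year = dates[0]["Year"] if dates else ""
    return {
        "title": str(article["ArticleTitle"]),
        "authors": authors,
        "year": year,
        "journal": str(article["Journal"]["Title"]),
    }

summary = summarise(sample_paper)
print(summary)
```

Reading by key avoids depending on how the record happens to print, at the cost of having to handle missing fields explicitly.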
        


<h2>Step Three</h2>

Finally, we wish to show the information we scraped and cleaned inside an HTML page. By default, this script reads a file called <span style="font-family: Consolas; font-size: 16px">publications.html</span> in the same directory and looks for the first <span style="font-family: Consolas; font-size: 16px">&lt;/h2&gt;</span> tag. The list of papers is inserted directly after this tag.

In [5]:
# Using the codecs library, imported in step two.
# The with-block closes the file before we write to it again later.
with codecs.open("publications.html",'r') as file1:
    webpageCode = file1.read()

# 5 because </h2> is 5 characters long. We wish to insert after this.
location = webpageCode.find("</h2>",0,len(webpageCode)) + 5

# Python strings are immutable, so we build a new string instead.
prePaperPage = webpageCode[:location] + '\n\n<br><br>\n\n'
midPaperPage = prePaperPage + paperData
acknowledgementPaperPage = midPaperPage + "\n\n<br><br>\n\n\t\t<i>Generated on "+str(datetime.date.today())+ " by <a href=https://github.com/DrKenReid/GenerateHTMLPublicationsPage> this tool </a></i>."
newWebpageCode = acknowledgementPaperPage + webpageCode[location:]
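
One detail of the cell above is worth guarding against: `str.find` returns -1 when the tag is missing, so `location` would become 4 and the papers would be inserted near the top of the file. The sketch below wraps the insertion in a small function that fails loudly instead. The name `insert_after_first` and the sample strings are hypothetical, chosen only for illustration.

```python
def insert_after_first(html, marker, payload):
    """Return html with payload inserted right after the first marker."""
    index = html.find(marker)
    if index == -1:
        # str.find returns -1 when the marker is absent; raise rather than
        # insert at an arbitrary offset.
        raise ValueError(f"marker {marker!r} not found")
    cut = index + len(marker)
    return html[:cut] + payload + html[cut:]

# Small stand-in page; the real script reads publications.html instead.
page = "<h2>Publications</h2><p>Footer</p>"
result = insert_after_first(page, "</h2>", "<br>PAPERS<br>")
print(result)
```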

Great - that's the page generated. Now just to save it as an HTML file.

In [137]:
Html_file= open("publications.html","w")
Html_file.write(newWebpageCode)
Html_file.close()
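
Because the script reads and writes the same file, running it a second time inserts a second copy of the list after the first heading. One possible way around this, sketched below under stated assumptions, is to keep a pair of marker comments in the page and replace whatever sits between them. The marker strings and the function `replace_between` are hypothetical and are not part of the original tool.

```python
# Hypothetical marker comments that would have to be added to the page by hand.
START = "<!-- papers:start -->"
END = "<!-- papers:end -->"

def replace_between(html, payload):
    """Replace the text between START and END with payload."""
    begin = html.find(START)
    finish = html.find(END)
    if begin == -1 or finish == -1 or finish < begin:
        raise ValueError("marker comments not found")
    # Keep both markers so the next run can find them again.
    return html[:begin + len(START)] + payload + html[finish:]

page = "<h2>Publications</h2>" + START + "old list" + END + "<p>Footer</p>"
once = replace_between(page, "new list")
twice = replace_between(once, "new list")
print(once == twice)
```

With this approach a repeated run overwrites the previous list instead of adding to it.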

Alternatively, if you'd rather simply see the HTML output, run the following cell:

In [None]:
print(newWebpageCode)

Finally, if you'd rather see JUST the paper data, run the following cell:

In [None]:
print(paperData)