# Publications markdown generator for academicpages

Takes a set of BibTeX files of publications and converts them to markdown for use with [academicpages.github.io](academicpages.github.io). This is an interactive Jupyter notebook ([see more info here](http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html)).

The core Python code is also in `pubsFromBibs.py`.
Run either from the `markdown_generator` folder after updating the `publist` dictionary with:
* bib file names
* specific venue keys based on your bib file preferences
* any specific pre-text for specific files
* Collection Name (future feature)

TODO: Make this work with other citation databases.
TODO: Merge this with the existing TSV parsing solution.

In [93]:
from pybtex.database.input import bibtex
from time import strptime
import string
import html
import os
import re

In [115]:
#todo: support distinct collection types instead of a catch-all "publications" collection (requires changes to the template)
publist = {
    "proceeding": {
        "file" : "proceedings.bib",
        "venuekey": "booktitle",
        "venue-pretext": "In the proceedings of ",
        "catagory" : "conferences",
        "collection" : {"name":"publications",
                        "permalink":"/publication/"}
        
    },
    "journal":{
        "file": "pub.bib",
        "venuekey" : "journal",
        "venue-pretext" : "",
        "catagory" : "manuscripts",
        "collection" : {"name":"publications",
                        "permalink":"/publication/"}
    } 
}
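
Because the later cells index `publist` by fixed key names, a quick sanity check before parsing can surface a misspelled key or a missing bib file early. The following is a minimal sketch and not part of the original notebook: `validate_publist` and `REQUIRED_KEYS` are hypothetical names, and the key list simply mirrors the entries above (including the `catagory` spelling used there).

```python
import os

# Keys that each publist entry above defines (spelling kept as in the notebook).
REQUIRED_KEYS = ("file", "venuekey", "venue-pretext", "catagory", "collection")

def validate_publist(publist, check_files=True):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for source, cfg in publist.items():
        for key in REQUIRED_KEYS:
            if key not in cfg:
                problems.append(f"{source}: missing key '{key}'")
        collection = cfg.get("collection", {})
        for key in ("name", "permalink"):
            if key not in collection:
                problems.append(f"{source}: collection is missing '{key}'")
        # Optionally confirm the bib file exists relative to the working directory.
        if check_files and "file" in cfg and not os.path.isfile(cfg["file"]):
            problems.append(f"{source}: bib file not found: {cfg['file']}")
    return problems

# Illustrative input: one complete entry and one deliberately incomplete entry.
example = {
    "journal": {
        "file": "pub.bib",
        "venuekey": "journal",
        "venue-pretext": "",
        "catagory": "manuscripts",
        "collection": {"name": "publications", "permalink": "/publication/"},
    },
    "broken": {"file": "missing.bib", "venuekey": "booktitle"},
}
for problem in validate_publist(example, check_files=False):
    print(problem)
```

Running `validate_publist(publist)` from the `markdown_generator` folder before the parsing cell would report problems without writing any files.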

In [116]:
html_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;"
    }

def html_escape(text):
    """Produce entities within text."""
    return "".join(html_escape_table.get(c,c) for c in text)
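
For comparison, the standard library's `html.escape` (already imported above) covers a slightly different character set than the table: it also escapes `<` and `>`, and it writes the apostrophe as `&#x27;` rather than `&apos;`. The sketch below re-declares the helper so it runs on its own; the sample string is made up for illustration.

```python
import html

html_escape_table = {"&": "&amp;", '"': "&quot;", "'": "&apos;"}

def html_escape(text):
    """Replace each character found in the table with its HTML entity."""
    return "".join(html_escape_table.get(c, c) for c in text)

sample = "Crick & Watson's \"<DNA>\""
print(html_escape(sample))   # table-based: leaves < and > untouched
print(html.escape(sample))   # stdlib: also escapes < and >
```

Which one to prefer depends on whether titles or venues in the bib files can contain angle brackets.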

In [117]:
from pybtex import errors
errors.set_strict_mode(False)

for pubsource in publist:
    parser = bibtex.Parser()
    bibdata = parser.parse_file(publist[pubsource]["file"])

    #loop through the individual references in a given bibtex file
    for bib_id in bibdata.entries:
        #reset default date
        pub_year = "1900"
        pub_month = "01"
        pub_day = "01"
        
        b = bibdata.entries[bib_id].fields
        
        try:
            pub_year = f'{b["year"]}'

            #parse the month, which may be numeric ("3", "03") or a name ("mar", "March")
            if "month" in b.keys():
                month = str(b["month"]).strip()
                if month.isdigit():
                    pub_month = "{:02d}".format(int(month))
                else:
                    tmnth = strptime(month[:3],'%b').tm_mon
                    pub_month = "{:02d}".format(tmnth)
            if "day" in b.keys(): 
                pub_day = str(b["day"])
                
            #strip out {} as needed (some bibtex entries that maintain formatting)
            if "title" in b:
                clean_title = b["title"].replace("{", "").replace("}","").replace("\\","").replace(" ","-")
            else:
                clean_title = "unknown-title"

            pub_date = pub_year+"-"+pub_month+"-"+pub_day

            url_slug = re.sub("\\[.*\\]|[^a-zA-Z0-9_-]", "", clean_title)
            url_slug = url_slug.replace("--","-")

            md_filename = (str(pub_date) + "-" + url_slug + ".md").replace("--","-")
            html_filename = (str(pub_date) + "-" + url_slug).replace("--","-")

            #Build Citation from text
            citation = ""

            #citation authors - todo - add highlighting for primary author?
            author = bibdata.entries[bib_id].persons["author"][0]
            citation = citation+" "+author.first_names[0]+" "+author.last_names[0]+", "

            #citation title
            citation = citation + "\"" + html_escape(b["title"].replace("{", "").replace("}","").replace("\\","")) + ".\""

            #add venue logic depending on citation type
            venue = publist[pubsource]["venue-pretext"]+b[publist[pubsource]["venuekey"]].replace("{", "").replace("}","").replace("\\","")

            citation = citation + " " + html_escape(venue)
            citation = citation + ", " + pub_year + "."

            
            ## YAML variables
            md = "---\ntitle: \""   + html_escape(b["title"].replace("{", "").replace("}","").replace("\\","")) + '"\n'
            
            md += """collection: """ +  publist[pubsource]["collection"]["name"] + '\n'

            md += "category: manuscripts"

            md += """\npermalink: """ + publist[pubsource]["collection"]["permalink"]  + html_filename
            
            note = False
            if "note" in b.keys():
                if len(str(b["note"])) > 5:
                    md += "\nexcerpt: '" + html_escape(b["note"]) + "'"
                    note = True

            md += "\ndate: " + str(pub_date) 

            md += "\nvenue: '" + html_escape(venue) + "'"
            
            url = False
            if "url" in b.keys():
                if len(str(b["url"])) > 5:
                    md += "\npaperurl: '" + b["url"] + "'"
                    url = True

            md += "\ncitation: '" + html_escape(citation) + "'"

            md += "\n---"

            
            ## Markdown description for individual page
            if note:
                md += "\n" + html_escape(b["note"]) + "\n"

            if url:
                md += "\n[Access paper here](" + b["url"] + "){:target=\"_blank\"}\n" 
            else:
                md += "\nUse [Google Scholar](https://scholar.google.com/scholar?q="+html.escape(clean_title.replace("-","+"))+"){:target=\"_blank\"} for full citation"

            md_filename = os.path.basename(md_filename)

            with open("../_publications/" + md_filename, 'w', encoding="utf-8") as f:
                f.write(md)
            print(f'SUCESSFULLY PARSED {bib_id}: \"', b["title"][:60],"..."*(len(b['title'])>60),"\"")
        # field may not exist for a reference
        except KeyError as e:
            # use .get() so a missing title cannot raise a second KeyError here
            title = b.get("title", "")
            print(f'WARNING Missing Expected Field {e} from entry {bib_id}: \"', title[:30],"..."*(len(title)>30),"\"")
            continue


SUCESSFULLY PARSED ghosh2023abstract: " Abstract B171: BX-795 enhances the efficacy of crizotinib in ... "
SUCESSFULLY PARSED jensen2023abstract: " Abstract P3060: Modeling Immune Checkpoint Inhibitor Associa ... "
SUCESSFULLY PARSED abdulkareem2023abstract: " Abstract P6-10-19: Screening of GPCR-targeting drugs for rep ... "
SUCESSFULLY PARSED tahaney2022abstract: " Abstract GS1-09: Inhibition of GPX4 induces preferential dea ... "
SUCESSFULLY PARSED powell2020abstract: " Abstract B20: Unbiased high-throughput screenings to identif ... "
SUCESSFULLY PARSED tahaney2020abstract: " Abstract P3-03-08: Identification of drugs that induce the d ... "




SUCESSFULLY PARSED marreddy2024chemical: " Chemical genetic analysis of enoxolone inhibition of C. diff ... "
SUCESSFULLY PARSED liu2024cystic: " Cystic fibrosis cell models for high-throughput analysis and ... "
SUCESSFULLY PARSED ghosh2024enhancing: " Enhancing efficacy of the MEK inhibitor trametinib with pacl ... "
SUCESSFULLY PARSED joseph2024epha2: " EphA2 and HDAC-Targeted Combination Therapy in Endometrial C ... "
SUCESSFULLY PARSED rinkenbaugh2024pdx: " PDX models of TNBC established from pre-and post-therapy tum ... "
SUCESSFULLY PARSED johnson2024small: " Small Molecule Screening Identifies HSP90 as a Modifier of R ... "
SUCESSFULLY PARSED powell2024targeting: " Targeting neddylation and sumoylation in chemoresistant trip ... "
SUCESSFULLY PARSED ellis202335p: " 35P Enhancing efficacy of the MEK inhibitor trametinib in KR ... "
SUCESSFULLY PARSED junco2023novel: " A Novel Transgenic Mouse Model of Down Syndrome Acute Lympho ... "
SUCESSFULLY PARSED dasari2023combination: " C
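
The filename for each entry comes from the slug logic in the loop above: braces and backslashes are stripped, spaces become hyphens, bracketed text and any other character outside `[a-zA-Z0-9_-]` is dropped, and doubled hyphens are collapsed. The sketch below isolates those steps in a hypothetical helper, `make_slug`, which is not a function in the notebook. It also swaps the single `replace("--","-")` pass for a regex that collapses hyphen runs of any length. The sample title and the date prefix are invented.

```python
import re

def make_slug(title):
    """Turn a BibTeX title into a URL/filename-safe slug (illustrative helper)."""
    # Same cleanup as the notebook: drop braces/backslashes, hyphenate spaces.
    cleaned = title.replace("{", "").replace("}", "").replace("\\", "").replace(" ", "-")
    # Same pattern as the notebook: remove [bracketed] text and unsafe characters.
    slug = re.sub(r"\[.*\]|[^a-zA-Z0-9_-]", "", cleaned)
    # Assumption: collapse hyphen runs of any length, where the notebook
    # applies a single replace("--", "-") pass.
    return re.sub(r"-{2,}", "-", slug)

title = "{Deep} Learning [Review]: A Survey"
slug = make_slug(title)
print(f"2024-01-01-{slug}.md")
```

A single `replace("--", "-")` pass turns `---` into `--`, which is why the loop above applies the replacement a second time when building the filename.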

In [118]:
from pybtex import errors
from pybtex.database.input import bibtex
import os
import re
from time import strptime
import html

errors.set_strict_mode(False)

# Function to clean and format the title
def clean_bibtex_title(title):
    return title.replace("{", "").replace("}", "").replace("\\", "").replace(" ", "-")

# Function to parse the publication date
def parse_publication_date(b):
    # Fall back to a placeholder year so the result is still a valid YYYY-MM-DD date
    pub_year = b.get("year", "1900")
    pub_month = "01"
    pub_day = "01"

    # Attempt to parse month if available
    if "month" in b:
        month = b["month"].lower()
        try:
            if month.isdigit():
                pub_month = f"{int(month):02d}"
            else:
                pub_month = f"{strptime(month[:3], '%b').tm_mon:02d}"
        except ValueError:
            print(f"WARNING: Unable to parse month: {month}, setting to '01'")

    # Attempt to parse day if available
    if "day" in b:
        day = b["day"]
        if day.isdigit():
            pub_day = f"{int(day):02d}"

    return f"{pub_year}-{pub_month}-{pub_day}"

# Function to build citation from BibTeX entry
def build_citation(b, authors, venue, pub_year):
    citation = ", ".join([f"{author.first_names[0]} {author.last_names[0]}" for author in authors])
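    # Strip BibTeX braces and backslashes but keep spaces; clean_bibtex_title() hyphenates and is meant for slugs
    title = b.get("title", "Unknown Title").replace("{", "").replace("}", "").replace("\\", "")
    citation += f'. "{html.escape(title)}."'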
    citation += f' {html.escape(venue)}, {pub_year}.'
    return citation

# Main parsing loop
for pubsource in publist:
    parser = bibtex.Parser()
    try:
        bibdata = parser.parse_file(publist[pubsource].get("file", "unknown_file.bib"))
    except FileNotFoundError:
        print(f"ERROR: File not found: {publist[pubsource].get('file', 'unknown_file.bib')}")
        continue
    except Exception as e:
        print(f"ERROR: Failed to parse BibTeX file: {e}")
        continue

    # Loop through the individual references in a given BibTeX file
    for bib_id, entry in bibdata.entries.items():
        b = entry.fields
        authors = entry.persons.get("author", [])

        try:
            pub_date = parse_publication_date(b)
            clean_title = clean_bibtex_title(b.get("title", "Unknown Title"))

            url_slug = re.sub(r"\[.*\]|[^a-zA-Z0-9_-]", "", clean_title).replace("--", "-")
            md_filename = f"{pub_date}-{url_slug}.md".replace("--", "-")
            html_filename = f"{pub_date}-{url_slug}".replace("--", "-")

            # Build citation
            venue_key = publist[pubsource].get("venuekey", "journal")
            venue = publist[pubsource].get("venue-pretext", "") + b.get(venue_key, "Unknown Venue").replace("{", "").replace("}", "").replace("\\", "")
            citation = build_citation(b, authors, venue, b.get("year", "Unknown Year"))

            # YAML variables
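            # Keep spaces in the displayed title; clean_bibtex_title() hyphenates and is meant for slugs
            display_title = b.get("title", "Unknown Title").replace("{", "").replace("}", "").replace("\\", "")
            md = f"---\ntitle: \"{html.escape(display_title)}\"\n"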
            md += f"collection: \"{publist[pubsource]['collection'].get('name', 'Unknown Collection')}\"\n"
            md += "category: manuscripts\n"
            md += f"permalink: {publist[pubsource]['collection'].get('permalink', '/unknown/')}{html_filename}\n"
            md += f"date: {pub_date}\n"
            md += f"venue: '{html.escape(venue)}'\n"
            md += f"citation: '{html.escape(citation)}'\n"

            # Add optional fields
            if "note" in b and len(b["note"]) > 5:
                md += f"excerpt: '{html.escape(b['note'])}'\n"

            if "url" in b and len(b["url"]) > 5:
                md += f"paperurl: '{b['url']}'\n"

            md += "---\n"

            # Markdown description for individual page
            if "note" in b and len(b["note"]) > 5:
                md += f"\n{html.escape(b['note'])}\n"

            if "url" in b and len(b["url"]) > 5:
                md += f"\n[Access paper here]({b['url']}){{:target=\"_blank\"}}\n"
            else:
                md += f"\nUse [Google Scholar](https://scholar.google.com/scholar?q={html.escape(clean_title.replace('-', '+'))}){{:target=\"_blank\"}} for full citation"

            # Write to file
            output_path = os.path.join("../_publications/", os.path.basename(md_filename))
            with open(output_path, 'w', encoding="utf-8") as f:
                f.write(md)

            print(f'SUCCESSFULLY PARSED {bib_id}: "', b.get("title", "Unknown Title")[:60], "..." * (len(b.get('title', '')) > 60), "")

        except KeyError as e:
            print(f'WARNING: Missing Expected Field {e} from entry {bib_id}: "', b.get("title", "Unknown Title")[:30], "..." * (len(b.get('title', '')) > 30), "")
        except Exception as e:
            print(f'ERROR: Failed to process entry {bib_id}: {e}')

SUCCESSFULLY PARSED ghosh2023abstract: " Abstract B171: BX-795 enhances the efficacy of crizotinib in ... 
ERROR: Failed to process entry jensen2023abstract: list index out of range
ERROR: Failed to process entry abdulkareem2023abstract: list index out of range
ERROR: Failed to process entry tahaney2022abstract: list index out of range
SUCCESSFULLY PARSED powell2020abstract: " Abstract B20: Unbiased high-throughput screenings to identif ... 
SUCCESSFULLY PARSED tahaney2020abstract: " Abstract P3-03-08: Identification of drugs that induce the d ... 
SUCCESSFULLY PARSED marreddy2024chemical: " Chemical genetic analysis of enoxolone inhibition of C. diff ... 
SUCCESSFULLY PARSED liu2024cystic: " Cystic fibrosis cell models for high-throughput analysis and ... 
SUCCESSFULLY PARSED ghosh2024enhancing: " Enhancing efficacy of the MEK inhibitor trametinib with pacl ... 
ERROR: Failed to process entry joseph2024epha2: list index out of range
ERROR: Failed to process entry tahaney2024inhibition



SUCCESSFULLY PARSED johnson2024small: " Small Molecule Screening Identifies HSP90 as a Modifier of R ... 
ERROR: Failed to process entry powell2024targeting: list index out of range
SUCCESSFULLY PARSED ellis202335p: " 35P Enhancing efficacy of the MEK inhibitor trametinib in KR ... 
ERROR: Failed to process entry junco2023novel: list index out of range
ERROR: Failed to process entry dasari2023combination: list index out of range
ERROR: Failed to process entry fan2023combining: list index out of range
SUCCESSFULLY PARSED shah2023concurrent: " Concurrent inactivation of PI3K and PLK1 is synergistic and  ... 
SUCCESSFULLY PARSED ghosh2023crizotinib: " Crizotinib enhances the efficacy of BX-795 identified by hig ... 
ERROR: Failed to process entry yu2023evolution: list index out of range
ERROR: Failed to process entry foster2023systematic: list index out of range
SUCCESSFULLY PARSED welte2023targeting: " Targeting FOXL2C402G vulnerabilities in adult type ovarian g ... 
SUCCESSFULLY PARSED 
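
The repeated `list index out of range` errors are consistent with `build_citation` indexing `first_names[0]` and `last_names[0]` on an author whose name has no first or last part. A trailing `and others` in the BibTeX author field is one likely source, though that is an assumption about these bib files. A defensive sketch follows. `format_author`, `format_authors`, and the `FakePerson` stand-in are hypothetical names, used so the example runs without pybtex; only the `first_names` and `last_names` list attributes read in the cell above are assumed.

```python
from dataclasses import dataclass, field

@dataclass
class FakePerson:
    """Stand-in exposing the two list attributes the cell above reads."""
    first_names: list = field(default_factory=list)
    last_names: list = field(default_factory=list)

def format_author(person):
    """Join whichever name parts exist instead of indexing [0] directly."""
    parts = list(person.first_names[:1]) + list(person.last_names[:1])
    return " ".join(parts)

def format_authors(persons):
    """Comma-separate the authors, skipping any that have no usable name."""
    names = [format_author(p) for p in persons]
    return ", ".join(n for n in names if n)

authors = [
    FakePerson(["Ada"], ["Lovelace"]),
    FakePerson([], ["others"]),   # no first name: first_names[0] would raise IndexError
    FakePerson([], []),           # empty entry is skipped
]
print(format_authors(authors))
```

Using a helper like this in place of the list comprehension in `build_citation` would be one way to keep such entries from failing.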