# Publications Gathering Exploration
2023-11-10 ZD

Relevant JIRA ticket: [INS-786](https://tracker.nci.nih.gov/browse/INS-786)  

Exploratory notebook to investigate gathering Publications data for INS. The initial plan is to use the NIH RePORTER API to gather the PMIDs associated with each project ID in `project.tsv`. Those PMIDs will then be used as inputs to the Entrez E-utilities (accessed with Biopython) to gather additional Publication metadata. 

In [37]:
import pandas as pd
import numpy as np # for nan handling
import requests
from datetime import datetime
from time import sleep # for retrying API calls
from math import ceil # for pagination logging
from tqdm import tqdm # for progress bars

from Bio import Entrez
import json # for printing dicts during development


# Method to import from parent directory
import os # for accessing .env
import sys
root_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
sys.path.append(root_dir)

import config

### Load project.tsv to get list of project IDs

In [2]:
# # Define directory containing processed grant data
# processed_dir = '../'+config.PROCESSED_DIR

# # Define directory to store reports. Create if doesn't already exist
# reports_dir = '../'+config.REPORTS_DIR
# if not os.path.exists(reports_dir):
#     os.makedirs(reports_dir)

# Backup paths to allow for using same set of gathered data during notebook development
processed_dir = '../' + 'data/processed/2023-08-30/api-gathered-2023-11-01'

print(f"Project data pulled from: {processed_dir}")

Project data pulled from: ../data/processed/2023-08-30/api-gathered-2023-11-01


In [3]:
# Load project data
project_filename = os.path.join(processed_dir, 'project.tsv')
df_projects = pd.read_csv(project_filename, sep='\t')

In [4]:
df_projects

Unnamed: 0,project_id,queried_project_id,application_id,fiscal_year,project_title,abstract_text,keywords,principal_investigators,program_officers,award_amount,...,award_notice_date,project_start_date,project_end_date,opportunity_number,api_source_search,org_name,org_city,org_state,org_country,program.program_id
0,1R01CA239701-01A1,R01CA239701,9995648,2020,Admixture analysis of acute lymphoblastic leuk...,Abstract Children with substantial African anc...,Accounting;Acute Lymphocytic Leukemia;Admixtur...,"Michael E Scheurer, Logan G. Spector",Danielle L Daee,1898254,...,2020-07-23T12:07:00Z,2020-08-01T12:08:00Z,2023-07-31T12:07:00Z,PA-19-056,award_R01CA239701,UNIVERSITY OF MINNESOTA,MINNEAPOLIS,MN,UNITED STATES,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...
1,3R01CA239701-01A1S1,R01CA239701,10307680,2021,Admixture analysis of acute lymphoblastic leuk...,This attachment is not required for the divers...,Acute Lymphocytic Leukemia;Admixture;African A...,"Michael E Scheurer, Logan G. Spector",Danielle L Daee,11554,...,2021-09-15T12:09:00Z,2021-08-01T12:08:00Z,2022-07-31T12:07:00Z,PA-21-071,award_R01CA239701,UNIVERSITY OF MINNESOTA,MINNEAPOLIS,MN,UNITED STATES,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...
2,3R01CA239701-01A1S3,R01CA239701,10626271,2022,Admixture analysis of acute lymphoblastic leuk...,Abstract This application is being submitted i...,Acute Lymphocytic Leukemia;Administrative Supp...,"Joseph Lubega, Michael E Scheurer, Logan G. Sp...",Danielle L Daee,197648,...,2022-09-19T12:09:00Z,2020-08-01T12:08:00Z,2023-07-31T12:07:00Z,PA-20-272,award_R01CA239701,UNIVERSITY OF MINNESOTA,MINNEAPOLIS,MN,UNITED STATES,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...
3,4R01CA239701-02,R01CA239701,10902170,2023,Admixture analysis of acute lymphoblastic leuk...,Modified Project Summary/Abstract Section Abst...,Acute Lymphocytic Leukemia;Admixture;African;A...,"Michael E Scheurer, Logan G. Spector",Danielle L Daee,635018,...,2023-09-15T12:09:00Z,2020-08-01T12:08:00Z,2025-07-31T12:07:00Z,PA-19-056,award_R01CA239701,UNIVERSITY OF MINNESOTA,MINNEAPOLIS,MN,UNITED STATES,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...
4,1R21CA209848-01,R21CA209848,9185145,2016,Algorithms for Literature-Guided Multi-Platfor...,Project Abstract The development of accurate a...,Accounting;Address;Algorithms;Bayesian Modelin...,"Dongjun Chung, Linda E Kelemen",David J Miller,195098,...,2016-07-27T12:07:00Z,2016-08-01T12:08:00Z,2018-07-31T12:07:00Z,PAR-15-334,nofo_PAR-15-334,MEDICAL UNIVERSITY OF SOUTH CAROLINA,CHARLESTON,SC,UNITED STATES,ADVANCEDDEVELOPMENTOFINFORMATICSTECHNOLOGIESFO...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5994,5P50CA196530-04,P50CA196530,9529550,2018,Yale SPORE in Lung Cancer (YSILC): The Biology...,DESCRIPTION (provided by applicant): The Yale ...,American;Animal Disease Models;Animal Model;Be...,Roy S Herbst,Peter Ujhazy,2507598,...,2018-08-20T12:08:00Z,2015-08-26T12:08:00Z,2020-07-31T12:07:00Z,PAR-14-031,award_P50CA196530,YALE UNIVERSITY,NEW HAVEN,CT,UNITED STATES,YaleSPOREinLungCancerYSILCTheBiologyandPersona...
5995,5P50CA196530-05,P50CA196530,9767058,2019,Yale SPORE in Lung Cancer (YSILC): The Biology...,DESCRIPTION (provided by applicant): The Yale ...,American;Animal Disease Models;Animal Model;Be...,Roy S Herbst,Peter Ujhazy,2134771,...,2019-08-21T12:08:00Z,2015-08-26T12:08:00Z,2020-07-31T12:07:00Z,PAR-14-031,award_P50CA196530,YALE UNIVERSITY,NEW HAVEN,CT,UNITED STATES,YaleSPOREinLungCancerYSILCTheBiologyandPersona...
5996,5P50CA196530-07,P50CA196530,10203850,2021,Yale SPORE in Lung Cancer (YSILC): The Biology...,YALE SPORE IN LUNG CANCER (YSILC) OVERALL SUMM...,Address;Adoption;Animal Disease Models;Area;Aw...,Roy S Herbst,Peter Ujhazy,2003668,...,2021-09-01T12:09:00Z,2015-08-26T12:08:00Z,2025-07-31T12:07:00Z,PAR-18-313,award_P50CA196530,YALE UNIVERSITY,NEW HAVEN,CT,UNITED STATES,YaleSPOREinLungCancerYSILCTheBiologyandPersona...
5997,5P50CA196530-08,P50CA196530,10479786,2022,Yale SPORE in Lung Cancer (YSILC): The Biology...,YALE SPORE IN LUNG CANCER (YSILC) OVERALL SUMM...,Address;Adoption;Animal Disease Models;Area;Aw...,"Roy S Herbst, Katerina Abigail Politi",Peter Ujhazy,2046999,...,2022-08-23T12:08:00Z,2015-08-26T12:08:00Z,2025-07-31T12:07:00Z,PAR-18-313,award_P50CA196530,YALE UNIVERSITY,NEW HAVEN,CT,UNITED STATES,YaleSPOREinLungCancerYSILCTheBiologyandPersona...


In [6]:
# Show first row of df_projects
df_projects.loc[0]

project_id                                                 1R01CA239701-01A1
queried_project_id                                               R01CA239701
application_id                                                       9995648
fiscal_year                                                             2020
project_title              Admixture analysis of acute lymphoblastic leuk...
abstract_text              Abstract Children with substantial African anc...
keywords                   Accounting;Acute Lymphocytic Leukemia;Admixtur...
principal_investigators                 Michael E Scheurer, Logan G. Spector
program_officers                                             Danielle L Daee
award_amount                                                         1898254
nci_funded_amount                                                    1898254
award_notice_date                                       2020-07-23T12:07:00Z
project_start_date                                      2020-08-01T12:08:00Z

### Use the NIH RePORTER to get PMIDs associated with a project
Reuse code from `modules/nih_reporter_api.py` where appropriate

In [7]:
# Set some constants and search parameters

# Sample project for testing. Should have 7 pubs
queried_project_id = "R01CA263500" 

base_url = "https://api.reporter.nih.gov/v2/publications/search"
LIMIT = 500
MAX_ATTEMPTS = 5
RETRY_TIME = 2
offset = 0

params = {
    "criteria": {
        "core_project_nums": [queried_project_id]
    },
    "offset": offset,
    "limit": LIMIT,
    "sort_field":"core_project_nums",
    "sort_order":"desc"
 }

In [8]:
# Empty list to collect json responses
pmid_data = []

# Define response details
response = requests.post(base_url, 
                        json=params, 
                        headers={
                            "accept": "application/json", 
                            "Content-Type": "application/json"})

# If response is good, get results
if response.status_code == 200:
    pmids = response.json()

    # Add PMID records to running list
    pmid_data.extend(pmids['results'])

else:
    print(f"Error: received status code {response.status_code}")

df_pmids = pd.DataFrame(pmid_data)

In [9]:
# Check raw response
pmid_data

[{'coreproject': 'R01CA263500', 'pmid': 35130560, 'applid': 10679077},
 {'coreproject': 'R01CA263500', 'pmid': 36917953, 'applid': 10679077},
 {'coreproject': 'R01CA263500', 'pmid': 37138086, 'applid': 10679077},
 {'coreproject': 'R01CA263500', 'pmid': 36288726, 'applid': 10679077},
 {'coreproject': 'R01CA263500', 'pmid': 37059069, 'applid': 10679077},
 {'coreproject': 'R01CA263500', 'pmid': 37024595, 'applid': 10679077},
 {'coreproject': 'R01CA263500', 'pmid': 36734849, 'applid': 10679077}]

In [10]:
# Check as dataframe
df_pmids

Unnamed: 0,coreproject,pmid,applid
0,R01CA263500,35130560,10679077
1,R01CA263500,36917953,10679077
2,R01CA263500,37138086,10679077
3,R01CA263500,36288726,10679077
4,R01CA263500,37059069,10679077
5,R01CA263500,37024595,10679077
6,R01CA263500,36734849,10679077


Structure of the API response includes both `coreproject` and `pmid`, which means that it will be straightforward to make a main PMID dataframe. All of the project IDs can be fed into the API and then the results can be added to a running dataframe. The `applid` is not relevant for us, so we can drop it later.
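
A minimal sketch of that accumulation step in plain Python, using records shaped like the response above. The helper name `to_project_pmid_pairs` is hypothetical, and the third sample record (with a made-up `applid`) is an assumption added only to show deduplication. The helper keeps `coreproject` and `pmid`, discards `applid`, and removes repeated pairs.

```python
def to_project_pmid_pairs(results):
    """Reduce raw API result dicts to unique (coreproject, pmid) pairs.

    Hypothetical helper for illustration; `applid` is discarded.
    """
    seen = set()
    pairs = []
    for row in results:
        pair = (row["coreproject"], row["pmid"])
        if pair not in seen:  # keep first-seen order
            seen.add(pair)
            pairs.append(pair)
    return pairs


sample_results = [
    {"coreproject": "R01CA263500", "pmid": 35130560, "applid": 10679077},
    {"coreproject": "R01CA263500", "pmid": 36917953, "applid": 10679077},
    # Made-up record: same publication listed under a different applid
    {"coreproject": "R01CA263500", "pmid": 35130560, "applid": 10000001},
]

print(to_project_pmid_pairs(sample_results))
# [('R01CA263500', 35130560), ('R01CA263500', 36917953)]
```

The same reduction could be done in pandas by selecting the two columns and calling `drop_duplicates()`.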

`get_nih_reporter_grants` from `nih_reporter_api.py` has functionality to handle response pagination (when there are more than 500 results) and response errors. Copy that into the PMID query.  
There's an argument to be made for just adapting the `get_nih_reporter_grants` function rather than making a new one specific for publications, but I'll keep it separate for simplicity. 
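
The pagination logic on its own can be sketched separately from the HTTP call. This is an illustration rather than the code in `nih_reporter_api.py`: `collect_all_pages` and `fake_fetch_page` are hypothetical names, and the fake fetcher stands in for the API so the loop can be run offline.

```python
def collect_all_pages(fetch_page, limit=500):
    """Gather every record from an offset-paginated endpoint.

    `fetch_page(offset, limit)` is a hypothetical callable returning a dict
    shaped like the RePORTER response:
    {"meta": {"total": int}, "results": [...]}.
    """
    records = []
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        records.extend(page["results"])
        offset += limit
        # Stop once the next offset would start past the last record
        if offset >= page["meta"]["total"]:
            break
    return records


def fake_fetch_page(offset, limit):
    """Stand-in for the API call: serves 1,203 fake records in slices."""
    data = list(range(1203))
    return {"meta": {"total": len(data)},
            "results": data[offset:offset + limit]}


records = collect_all_pages(fake_fetch_page, limit=500)
print(len(records))  # 1203, gathered in three calls
```

Passing the fetcher in as an argument keeps retry and error handling out of the loop, which is one argument for a shared helper over two near-identical functions.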

In [11]:
def get_pmids_from_nih_reporter_api(project_id,
                                    print_meta=False):
    """Get PMIDs associated with a single provided Project ID. 
    
    :param project_id: String Core Project ID (e.g. 'R01CA263500')
    :param print_meta: boolean indicator. If True, print API gathering 
                        process results to console.
    :return: List of dicts, one per publication result, with keys 
                        'coreproject', 'pmid', and 'applid'
    """

    base_url = "https://api.reporter.nih.gov/v2/publications/search"
    pmid_data = []

    # Set default values for params not likely to change
    LIMIT = 500
    MAX_ATTEMPTS = 5
    RETRY_TIME = 2

    # Set starting value for counters
    offset = 0
    page = 0
    attempts = 0

    # RePORTER API sets a max limit of 500 records per call.
    # Keep looping each call in "pages" until the number of records 
    # gathered reaches the total number of records available. 

    # Set a cap on the number of attempts at a failed call before 
    # moving on to the next award.
    while attempts < MAX_ATTEMPTS:
        # Set parameters for API call
        params = {
            "criteria": {
                "core_project_nums": [project_id]
            },
            "offset": offset,
            "limit": LIMIT,
            "sort_field":"core_project_nums",
            "sort_order":"desc"
        }

        try: 
            # Define response details
            response = requests.post(base_url, 
                                    json=params, 
                                    headers={
                                        "accept": "application/json", 
                                        "Content-Type": "application/json"})

            # If response is good, get results
            if response.status_code == 200:
                pmids = response.json()

                # Add PMID records to running list
                pmid_data.extend(pmids['results'])

                # Increase offset by limit to get next "page"
                total_records = pmids['meta']['total']
                offset = offset + LIMIT
                page = page + 1

                # Optionally print the metadata for each page
                # Consider replacing this with proper logging
                if print_meta:
                    total_pages = max(ceil(total_records/LIMIT),1)
                    print(f"{project_id}: "
                          f"({page}/{total_pages}): {pmids['meta']}")

                # Stop looping if offset has reached total record count
                if offset >= total_records:
                    break

            # Handle 500 errors by retrying after 2 second delay
            elif response.status_code == 500:
                attempts = attempts + 1
                print(f"Received a 500 error for "
                        f"{project_id}'. "
                        f"Retrying after {RETRY_TIME} seconds. "
                        f"Attempt {attempts}/{MAX_ATTEMPTS}")
                sleep(RETRY_TIME)
            else:
                print(f"Error occurred while fetching PMIDs for "
                        f"{project_id}': "
                        f"{response.status_code}")
                break

        except requests.exceptions.RequestException as e:
            print(f"An error occurred while making the API call for "
                    f"{project_id}': {e}")
            break

    return pmid_data

In [12]:
# Define some test projects
test_project_id = "R01CA263500" # 7 publication results expected
test_project_id_list = ["R01CA263500", "R01CA222518", "U01CA209936"] # [7,6,38] publication results expected

In [13]:
# Get PMIDs for one test project
get_pmids_from_nih_reporter_api(test_project_id, print_meta=True)

R01CA263500: (1/1): {'search_id': None, 'total': 7, 'offset': 0, 'limit': 500, 'sort_field': 'core_project_nums', 'sort_order': 'desc', 'sorted_by_relevance': False, 'properties': {}}


[{'coreproject': 'R01CA263500', 'pmid': 36917953, 'applid': 10679077},
 {'coreproject': 'R01CA263500', 'pmid': 37138086, 'applid': 10679077},
 {'coreproject': 'R01CA263500', 'pmid': 35130560, 'applid': 10679077},
 {'coreproject': 'R01CA263500', 'pmid': 36288726, 'applid': 10679077},
 {'coreproject': 'R01CA263500', 'pmid': 37059069, 'applid': 10679077},
 {'coreproject': 'R01CA263500', 'pmid': 36734849, 'applid': 10679077},
 {'coreproject': 'R01CA263500', 'pmid': 37024595, 'applid': 10679077}]

In [16]:
# Get PMIDs for a list of test projects

# Create empty list to collect results
all_pmids = []

# Iterate through projects and add to running list of results
# (tqdm is imported directly with `from tqdm import tqdm`, so call it as tqdm(...))
for project_id in tqdm(test_project_id_list):
    results = get_pmids_from_nih_reporter_api(project_id, print_meta=False)
    all_pmids.extend(results)

# Reformat list of results as dataframe
df_pmids = pd.DataFrame(all_pmids)

total_pubs = len(df_pmids)
unique_pmids = df_pmids['pmid'].nunique()
unique_projects = df_pmids['coreproject'].nunique()

print(f"---\nComplete! PMIDs successfully gathered. \n"
      f"{total_pubs} total Publication results. \n"
      f"{unique_pmids} unique PMIDs found across \n"
      f"{unique_projects} unique Project IDs.")


100%|██████████| 3/3 [00:01<00:00,  1.80it/s]

---
Complete! PMIDs successfully gathered. 
51 total Publication results. 
51 unique PMIDs found across 
3 unique Project IDs.





In [17]:
# View full dataset
df_pmids

Unnamed: 0,coreproject,pmid,applid
0,R01CA263500,36917953,10679077
1,R01CA263500,37138086,10679077
2,R01CA263500,35130560,10679077
3,R01CA263500,36288726,10679077
4,R01CA263500,37059069,10679077
5,R01CA263500,36734849,10679077
6,R01CA263500,37024595,10679077
7,R01CA222518,33186151,10845929
8,R01CA222518,33777659,10845929
9,R01CA222518,33778776,10845929


### Test PMID gathering with real project data

In [18]:
# Get all unique project IDs from project.tsv (loaded earlier in notebook)
project_id_list = df_projects['queried_project_id'].unique().tolist()

print(f"{len(project_id_list)} total unique project IDs.")

2520 total unique project IDs.


In [19]:
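# Create empty list to collect results
all_pmids = []

# Iterate through projects and add to running list of results
# (tqdm is imported directly with `from tqdm import tqdm`, so call it as tqdm(...))
for project_id in tqdm(project_id_list):
    results = get_pmids_from_nih_reporter_api(project_id, print_meta=False)
    all_pmids.extend(results)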

# Reformat list of results as dataframe
df_pmids = pd.DataFrame(all_pmids)

total_pubs = len(df_pmids)
unique_pmids = df_pmids['pmid'].nunique()
unique_projects = df_pmids['coreproject'].nunique()

# Print results
print(f"---\nComplete! PMIDs successfully gathered. \n"
      f"{total_pubs} total Publication results. \n"
      f"{unique_pmids} unique PMIDs found across \n"
      f"{unique_projects} unique Project IDs.")


 72%|███████▏  | 1804/2520 [19:01<55:28,  4.65s/it]

Received a 500 error for P30CA008748'. Retrying after 2 seconds. Attempt 1/5
Received a 500 error for P30CA008748'. Retrying after 2 seconds. Attempt 2/5
Received a 500 error for P30CA008748'. Retrying after 2 seconds. Attempt 3/5
Received a 500 error for P30CA008748'. Retrying after 2 seconds. Attempt 4/5
Received a 500 error for P30CA008748'. Retrying after 2 seconds. Attempt 5/5


 72%|███████▏  | 1824/2520 [20:38<36:42,  3.17s/it]  

Received a 500 error for P30CA016672'. Retrying after 2 seconds. Attempt 1/5
Received a 500 error for P30CA016672'. Retrying after 2 seconds. Attempt 2/5
Received a 500 error for P30CA016672'. Retrying after 2 seconds. Attempt 3/5
Received a 500 error for P30CA016672'. Retrying after 2 seconds. Attempt 4/5
Received a 500 error for P30CA016672'. Retrying after 2 seconds. Attempt 5/5


100%|██████████| 2520/2520 [29:07<00:00,  1.44it/s]  

---
Complete! PMIDs successfully gathered. 
175527 total Publication results. 
144658 unique PMIDs found across 
1771 unique Project IDs.





In [82]:
# Save temporary results to csv for development
pmid_filename = 'gathered_pmids_20231110.csv'
df_pmids.to_csv(pmid_filename, index=False)
print(f"Interim PMID data saved to {pmid_filename}.")

Interim PMID data saved to gathered_pmids_20231110.csv.


In [5]:
# # Checkpoint loading instead of regathering data during development
# pmid_filename = 'gathered_pmids_20231110.csv'
# df_pmids = pd.read_csv(pmid_filename)

### Check gathered PMIDs

Two projects encountered 500 errors once the offset passed 9,500. The NIH RePORTER API appears to cap the offset so that no more than 10,000 records can be gathered for a single query.
- `P30CA008748` is the `SLOAN-KETTERING INST CAN RESEARCH` (Cancer Center support grant)
    - NIH RePORTER has records back to FY 1985 
    - API shows 18,812 associated PMIDs, but we can only gather 10,000
- `P30CA016672` is the `UNIVERSITY OF TX MD ANDERSON CAN CTR` (Cancer Center support grant)
    - NIH RePORTER has records back to FY 1985
    - API shows 24,774 associated PMIDs, but we can only gather 10,000
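
One way to catch this kind of truncation automatically is to compare the number of records gathered per project against the `total` reported in the response `meta`. The sketch below assumes those totals were stored in a dict during gathering, which `get_pmids_from_nih_reporter_api` does not currently do. `find_truncated_projects` is a hypothetical name and the data is a toy example.

```python
from collections import Counter


def find_truncated_projects(records, reported_totals):
    """Return {project: (gathered, reported)} for incomplete projects.

    `reported_totals` is assumed to map each core project ID to the
    `meta['total']` value seen while gathering.
    """
    gathered = Counter(row["coreproject"] for row in records)
    return {
        project: (gathered[project], total)
        for project, total in reported_totals.items()
        if gathered[project] < total
    }


# Toy data: project "A" was cut off at 3 of 5 records, "B" is complete
records = [{"coreproject": "A", "pmid": n} for n in range(3)]
records += [{"coreproject": "B", "pmid": n} for n in range(2)]
reported_totals = {"A": 5, "B": 2}

print(find_truncated_projects(records, reported_totals))
# {'A': (3, 5)}
```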


In [6]:
# Get new df of each core project and associated pmid count
df_pub_count = (df_pmids.groupby('coreproject').size().reset_index()
                            .rename(columns={0:'publications'})
                            .sort_values(by='publications', ascending=False))
df_pub_count

Unnamed: 0,coreproject,publications
78,P30CA008748,10000
98,P30CA016672,10000
99,P30CA021765,8052
92,P30CA016058,4363
77,P30CA006973,3814
...,...,...
1416,U01CA243007,1
629,R01CA258681,1
630,R01CA258682,1
811,R01CA262788,1


In [7]:
def create_summary_table(df, core_project_col, publications_col, bin_size):
    """Build a histogram table showing bins of publications per project.

    :param pd.DataFrame df: Dataframe of publication counts grouped by project
    :param str core_project_col: column header for core project values
    :param str publications_col: column header for publication counts
    :param int bin_size: size of range for grouping publications
    :return pd.DataFrame: DataFrame with grouped publication counts and the 
                number of projects associated with that number of publications
    """

    # Create left-closed bin edges (1, 1+bin_size, ...) that extend past the
    # largest publication count so that no project falls outside the bins
    max_count = int(df[publications_col].max())
    bins = list(range(1, max_count + bin_size + 1, bin_size))
    
    # Define labels for the bins, e.g. "1-100" covers counts 1 through 100
    labels = [f"{start}-{start+bin_size-1}" for start in bins[:-1]]
    
    # Add a new column to the dataframe with the bin labels.
    # right=False makes each bin include its lower edge and exclude its upper
    df['publication_range'] = pd.cut(df[publications_col], bins=bins, labels=labels, right=False)
    
    # Group by publication range and count the number of core projects in each range.
    # observed=False keeps empty bins in the table
    summary_table = df.groupby('publication_range', observed=False)[core_project_col].count().reset_index()
    
    # Rename the columns
    summary_table.columns = ['publication_range', 'core_project_count']
    
    return summary_table

In [8]:
create_summary_table(df_pub_count, 'coreproject', 'publications', 100).head(30)



Unnamed: 0,publication_range,core_project_count
0,1-100,1620
1,101-200,49
2,201-300,14
3,301-400,4
4,401-500,12
5,501-600,6
6,601-700,5
7,701-800,3
8,801-900,5
9,901-1000,4


In [9]:
create_summary_table(df_pub_count, 'coreproject', 'publications', 500)



Unnamed: 0,publication_range,core_project_count
0,1-500,1699
1,501-1000,23
2,1001-1500,16
3,1501-2000,9
4,2001-2500,12
5,2501-3000,3
6,3001-3500,3
7,3501-4000,2
8,4001-4500,1
9,4501-5000,0


## Use PubMed API to get publication details

#### Load Checkpoint data

In [3]:
# Checkpoint loading instead of regathering data during development
pmid_filename = 'gathered_pmids_20231110.csv'
df_pmids = pd.read_csv(pmid_filename)

In [4]:
df_pmids

Unnamed: 0,coreproject,pmid,applid
0,R01CA239701,36127808,10902170
1,R21CA209848,29074302,9321971
2,R21CA209848,31387361,9321971
3,R21CA209848,29027980,9321971
4,R21CA209848,29309429,9321971
...,...,...,...
175522,P50CA196530,35471840,10690040
175523,P50CA196530,35793873,10690040
175524,P50CA196530,36509758,10690040
175525,P50CA196530,36775354,10690040


### Use Entrez package from BioPython to get Publication data for each PMID

Biopython is a package that simplifies access to bioinformatics tools. One of these is the `Bio.Entrez` module, which provides access to the NCBI Entrez E-utilities.  
These can act as an API for retrieving PubMed information. 
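
With 144,658 unique PMIDs to look up, one request per PMID would be slow. `Entrez.efetch` accepts a comma-separated string of IDs, so PMIDs can be sent in batches. The sketch below is one possible approach, not the method used in this notebook. The helper names are hypothetical and the batch size is arbitrary. The Biopython import is deferred into the function so that the chunking helper can be tried without Biopython or network access.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_pubmed_batch(pmids):
    """Fetch parsed PubMed records for a batch of PMIDs in one request.

    Assumes Entrez.email (and optionally Entrez.api_key) is already set.
    """
    from Bio import Entrez  # deferred import; requires Biopython

    handle = Entrez.efetch(db="pubmed",
                           id=",".join(str(p) for p in pmids),
                           retmode="xml")
    try:
        return Entrez.read(handle)
    finally:
        handle.close()


pmids = [36127808, 29074302, 31387361, 29027980, 29309429]
batches = list(chunked(pmids, 2))
print(batches)
# [[36127808, 29074302], [31387361, 29027980], [29309429]]
```

Each returned record carries its own `PMID` under `MedlineCitation`, so results are best matched back to the input by that value rather than by position in the batch.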

In [6]:
# OUTDATED VERSION OF THIS FUNCTION
def get_publication_info_from_pmid(pmid):
    """
    Get Entrez PubMed publication information for a given PMID.

    :param pmid: PubMed ID (str)
    :return: Dictionary containing publication information
    """
    # Get user email from hidden local env file. Use default if not defined
    Entrez.email = os.environ.get('NCBI_EMAIL', 'your-email@example.com')

    try:
        handle = Entrez.efetch(db="pubmed", id=pmid, retmode="xml")
        records = Entrez.read(handle)
        handle.close()

        # Extract relevant information from the XML record
        record = records['PubmedArticle'][0]['MedlineCitation']['Article']
        publication_info = {
            'publication_id': pmid,
            'title': record['ArticleTitle'],
            'authors': ', '.join(author.get('LastName', '') for author in record.get('AuthorList', [])),
            'publication_date': record['Journal']['JournalIssue']['PubDate'].get('Year', ''),
            'citation_count': record.get('CitedIn', ''),
            'doi': record.get('ELocationID', ''),
            # Add more fields as needed
        }

        return publication_info

    except Exception as e:
        print(f"Error fetching information for PMID {pmid}: {e}")
        return None

In [7]:
# Try with test PMID
test_pmid = '36127808'
get_publication_info_from_pmid(test_pmid)

{'publication_id': '36127808',
 'title': 'Genetic ancestry, differential gene expression, and survival in pediatric B-cell acute lymphoblastic leukemia.',
 'authors': 'Barragan, Mills, Raduski, Marcotte, Grinde, Spector, Williams',
 'publication_date': '2023',
 'citation_count': '',
 'doi': [StringElement('10.1002/cam4.5266', attributes={'EIdType': 'doi', 'ValidYN': 'Y'})]}

Good start. The following still need fixing:  
- Author names should show first names and middle initials
- Citation count is blank. Should be 1-3 for this PMID
- DOI is returned as a list of `StringElement` objects rather than a plain string

In [5]:
def get_full_publication_record(pmid):
    """
    Get the full record for a given PMID without subselection.

    :param pmid: PubMed ID (str)
    :return: Full record (dictionary)
    """
    # Get user email and API key from hidden local env file.
    # Use default email if not defined; leave the API key unset (None) if missing
    Entrez.email = os.environ.get('NCBI_EMAIL', 'your-email@example.com')
    Entrez.api_key = os.environ.get('NCBI_API_KEY') or None

    try:
        handle = Entrez.efetch(db="pubmed", id=pmid, retmode="xml")
        records = Entrez.read(handle)
        handle.close()

        # Return the full record
        return records

    except Exception as e:
        print(f"Error fetching information for PMID {pmid}: {e}")
        return None

In [9]:
# Get the full record as text to browse fields and values
pmid_record = get_full_publication_record(test_pmid)
print(json.dumps(pmid_record, indent=2))

{
  "PubmedBookArticle": [],
  "PubmedArticle": [
    {
      "MedlineCitation": {
        "KeywordList": [
          [
            "acute lymphoblastic leukemia",
            "genetic ancestry",
            "survival disparities"
          ]
        ],
        "CitationSubset": [
          "IM"
        ],
        "GeneralNote": [],
        "SpaceFlightMission": [],
        "OtherAbstract": [],
        "OtherID": [],
        "PMID": "36127808",
        "DateCompleted": {
          "Year": "2023",
          "Month": "03",
          "Day": "02"
        },
        "DateRevised": {
          "Year": "2023",
          "Month": "05",
          "Day": "24"
        },
        "Article": {
          "Language": [
            "eng"
          ],
          "ELocationID": [
            "10.1002/cam4.5266"
          ],
          "ArticleDate": [
            {
              "Year": "2022",
              "Month": "09",
              "Day": "20"
            }
          ],
          "Journal": {
       

#### Helper function to reformat values returned from API call

In [6]:
def format_authors(author_list):
    """
    Format author names as 'FirstName LastName'.

    :param author_list: List of authors
    :return: Formatted author names
    """
    formatted_authors = []
    for author in author_list:
        last_name = author.get('LastName', '')
        first_name = author.get('ForeName', '')
        formatted_author = f"{first_name} {last_name}".strip()
        formatted_authors.append(formatted_author)

    return ', '.join(formatted_authors)

In [11]:
# Pull unformatted authors list
record = pmid_record['PubmedArticle'][0]['MedlineCitation']['Article']
authors = record.get('AuthorList', [])

authors

ListElement([DictElement({'Identifier': [StringElement('0000-0001-7317-0412', attributes={'Source': 'ORCID'})], 'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'Department of Mathematics, Statistics, and Computer Science, Macalester College, St. Paul, Minnesota, USA.'}, {'Identifier': [], 'Affiliation': 'Division of Epidemiology and Clinical Research, Department of Pediatrics, University of Minnesota, Minneapolis, Minnesota, USA.'}], 'LastName': 'Barragan', 'ForeName': 'Freddy A', 'Initials': 'FA'}, attributes={'ValidYN': 'Y'}), DictElement({'Identifier': [StringElement('0000-0002-8914-2587', attributes={'Source': 'ORCID'})], 'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'Division of Epidemiology and Clinical Research, Department of Pediatrics, University of Minnesota, Minneapolis, Minnesota, USA.'}], 'LastName': 'Mills', 'ForeName': 'Lauren J', 'Initials': 'LJ'}, attributes={'ValidYN': 'Y'}), DictElement({'Identifier': [StringElement('0000-0002-7069-6934', attributes={'S

In [12]:
# Test formatted authors list
formatted_authors = format_authors(authors)
formatted_authors

'Freddy A Barragan, Lauren J Mills, Andrew R Raduski, Erin L Marcotte, Kelsey E Grinde, Logan G Spector, Lindsay A Williams'

In [13]:
# doi needs to be converted from list and then to string
print(record.get('ELocationID', ''))

# ELocationID is a list. Try choosing the first entry
print(str(record.get('ELocationID', '')[0]))

[StringElement('10.1002/cam4.5266', attributes={'EIdType': 'doi', 'ValidYN': 'Y'})]
10.1002/cam4.5266


In [14]:
# OUTDATED VERSION OF THIS FUNCTION
def get_publication_info_from_pmid(pmid):
    """
    Get publication information for a given PMID.

    :param pmid: PubMed ID (str)
    :return: Dictionary containing publication information
    """
    # Get user email from hidden local env file. Use default if not defined
    Entrez.email = os.environ.get('NCBI_EMAIL', 'your-email@example.com')

    try:
        handle = Entrez.efetch(db="pubmed", id=pmid, retmode="xml")
        records = Entrez.read(handle)
        handle.close()

        # Access article details from full returned record
        record = records['PubmedArticle'][0]['MedlineCitation']['Article']

        # Isolate and format author names
        authors = record.get('AuthorList', [])
        formatted_authors = format_authors(authors)

        publication_info = {
            'publication_id': pmid,
            'title': record['ArticleTitle'],
            'authors': formatted_authors,
            'publication_date': record['Journal']['JournalIssue']['PubDate'].get('Year', ''),
            'doi': str(record.get('ELocationID', '')[0]),
        }

        return publication_info

    except Exception as e:
        print(f"Error fetching information for PMID {pmid}: {e}")
        return None

In [15]:
get_publication_info_from_pmid(test_pmid)

{'publication_id': '36127808',
 'title': 'Genetic ancestry, differential gene expression, and survival in pediatric B-cell acute lymphoblastic leukemia.',
 'authors': 'Freddy A Barragan, Lauren J Mills, Andrew R Raduski, Erin L Marcotte, Kelsey E Grinde, Logan G Spector, Lindsay A Williams',
 'publication_date': '2023',
 'doi': '10.1002/cam4.5266'}

Data formatting looks good on the test PMID. Try again with a larger sample to see if values vary.

In [16]:
# List of arbitrary PMIDs to test
test_pmid_list = [
    '36127808',
    '29074302',
    '31387361',
    '29027980',
    '29309429',
]

In [17]:
for pub in test_pmid_list:
    display(get_publication_info_from_pmid(pub))

{'publication_id': '36127808',
 'title': 'Genetic ancestry, differential gene expression, and survival in pediatric B-cell acute lymphoblastic leukemia.',
 'authors': 'Freddy A Barragan, Lauren J Mills, Andrew R Raduski, Erin L Marcotte, Kelsey E Grinde, Logan G Spector, Lindsay A Williams',
 'publication_date': '2023',
 'doi': '10.1002/cam4.5266'}

{'publication_id': '29074302',
 'title': 'Endogenous antibody responses to mucin 1 in a large multiethnic cohort of patients with breast cancer and healthy controls: Role of immunoglobulin and Fcγ receptor genes.',
 'authors': 'Janardan P Pandey, Aryan M Namboodiri, Bethany Wolf, Motoki Iwasaki, Yoshio Kasuga, Gerson S Hamada, Shoichiro Tsugane',
 'publication_date': '2018',
 'doi': '10.1016/j.imbio.2017.10.028'}

{'publication_id': '31387361',
 'title': 'Defects in the Exocyst-Cilia Machinery Cause Bicuspid Aortic Valve Disease and Aortic Stenosis.',
 'authors': 'Diana Fulmer, Katelynn Toomer, Lilong Guo, Kelsey Moore, Janiece Glover, Reece Moore, Rebecca Stairley, Glenn Lobo, Xiaofeng Zuo, Yujing Dang, Yanhui Su, Ben Fogelgren, Patrick Gerard, Dongjun Chung, Mahyar Heydarpour, Rupak Mukherjee, Simon C Body, Russell A Norris, Joshua H Lipschutz',
 'publication_date': '2019',
 'doi': '10.1161/CIRCULATIONAHA.119.038376'}

{'publication_id': '29027980',
 'title': 'The Plasticizer Bisphenol A Perturbs the Hepatic Epigenome: A Systems Level Analysis of the miRNome.',
 'authors': 'Ludivine Renaud, Willian A da Silveira, E Starr Hazard, Jonathan Simpson, Silvia Falcinelli, Dongjun Chung, Oliana Carnevali, Gary Hardiman',
 'publication_date': '2017',
 'doi': '269'}

{'publication_id': '29309429',
 'title': 'ShinyGPA: An interactive visualization toolkit for investigating pleiotropic architecture using GWAS datasets.',
 'authors': 'Emma Kortemeier, Paula S Ramos, Kelly J Hunt, Hang J Kim, Gary Hardiman, Dongjun Chung',
 'publication_date': '2018',
 'doi': 'e0190949'}

The gathered DOIs vary in format. The full PubMed record shows that the `ELocationID` field usually lists several identifiers, including the DOI, but their order is inconsistent, so taking the first entry can return a non-DOI identifier (e.g. `269`). The `ArticleIdList` field also appears to include the DOI, again with inconsistent ordering.  

For example, in PMID `29027980`:
```
...
  "ELocationID": [
    "269",
    "10.3390/genes8100269"
  ],
...
  "ArticleIdList": [
    "29027980",
    "PMC5664119",
    "10.3390/genes8100269",
    "genes8100269"
  ],
```  

While in PMID `29309429` we can see:
```
...
  "ELocationID": [
    "e0190949",
    "10.1371/journal.pone.0190949"
  ],
...
  "ArticleIdList": [
    "29309429",
    "PMC5757942",
    "10.1371/journal.pone.0190949",
    "PONE-D-17-37743"
  ],
```  
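
One possible way to handle the inconsistent ordering is to ignore position and pick the first entry that looks like a DOI. The sketch below is only an illustration: `pick_doi` and `DOI_PATTERN` are hypothetical names, not part of this notebook, and the pattern is a loose check on the usual `10.<registrant>/<suffix>` shape rather than a full DOI validator. The sample lists are copied from the PMID `29027980` record above.

```python
import re

# Loose DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This is an assumption for illustration, not a complete DOI validator.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def pick_doi(*id_lists):
    """Return the first DOI-shaped string found across the given lists, else ''."""
    for id_list in id_lists:
        for value in id_list:
            candidate = str(value).strip()
            if DOI_PATTERN.match(candidate):
                return candidate
    return ""


# Values copied from the PMID 29027980 record shown above
elocation_ids = ["269", "10.3390/genes8100269"]
article_ids = ["29027980", "PMC5664119", "10.3390/genes8100269", "genes8100269"]

print(pick_doi(elocation_ids, article_ids))
```

Checking `ELocationID` first and falling back to `ArticleIdList` keeps the lookup order explicit. If the parsed elements expose an ID-type attribute, filtering on that would likely be more robust than pattern matching.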

In [18]:
test = get_full_publication_record('29027980')
print(json.dumps(test, indent=2))

{
  "PubmedBookArticle": [],
  "PubmedArticle": [
    {
      "MedlineCitation": {
        "KeywordList": [
          [
            "bioinformatics",
            "bisphenol A",
            "epigenome",
            "microRNAs",
            "toxicology",
            "zebrafish"
          ]
        ],
        "CitationSubset": [],
        "GeneralNote": [],
        "SpaceFlightMission": [],
        "OtherAbstract": [],
        "OtherID": [],
        "PMID": "29027980",
        "DateRevised": {
          "Year": "2019",
          "Month": "11",
          "Day": "20"
        },
        "Article": {
          "Language": [
            "eng"
          ],
          "ELocationID": [
            "269",
            "10.3390/genes8100269"
          ],
          "ArticleDate": [
            {
              "Year": "2017",
              "Month": "10",
              "Day": "13"
            }
          ],
          "Journal": {
            "ISSN": "2073-4425",
            "JournalIssue": {
           

In [19]:
test = get_full_publication_record('29309429')
print(json.dumps(test, indent=2))

{
  "PubmedBookArticle": [],
  "PubmedArticle": [
    {
      "MedlineCitation": {
        "KeywordList": [],
        "CitationSubset": [
          "IM"
        ],
        "GeneralNote": [],
        "SpaceFlightMission": [],
        "OtherAbstract": [],
        "OtherID": [],
        "PMID": "29309429",
        "DateCompleted": {
          "Year": "2018",
          "Month": "03",
          "Day": "09"
        },
        "DateRevised": {
          "Year": "2023",
          "Month": "11",
          "Day": "12"
        },
        "Article": {
          "Language": [
            "eng"
          ],
          "ELocationID": [
            "e0190949",
            "10.1371/journal.pone.0190949"
          ],
          "ArticleDate": [
            {
              "Year": "2018",
              "Month": "01",
              "Day": "08"
            }
          ],
          "Journal": {
            "ISSN": "1932-6203",
            "JournalIssue": {
              "Volume": "13",
              "Issue": 

We can tackle `doi` and `citation_count` later if valuable.  
Publication date is tricky. Current INS uses `publication_date`, but PubMed generally uses the `PubDate` field, which can vary in format (see the [NCBI Guidance](https://www.ncbi.nlm.nih.gov/books/NBK3828/#publisherhelp.PubDate_R)). `PubDate` usually includes a Year (some records carry a free-text `MedlineDate` instead) and may include a month and/or day as either numeric or string values. We'll change `publication_date` to `publication_year` for simplicity and for consistency with the NIH RePORTER publication display. 
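
As a rough illustration of reducing a variable `PubDate` to a year, the sketch below reads `Year` when present and otherwise looks for a 4-digit year in a free-text `MedlineDate` value. `extract_publication_year` is a hypothetical helper, not something defined in this notebook, and the sample inputs are made-up dictionaries shaped like a parsed `PubDate`.

```python
import re


def extract_publication_year(pub_date):
    """Return a 4-digit year string from a PubDate-like mapping, or ''."""
    year = str(pub_date.get("Year", "")).strip()
    if year:
        return year
    # Fall back to the first 4-digit year (1800-2099) in a free-text MedlineDate
    medline_date = str(pub_date.get("MedlineDate", ""))
    match = re.search(r"\b(?:1[89]|20)\d{2}\b", medline_date)
    return match.group(0) if match else ""


# Made-up examples shaped like parsed PubDate values
print(extract_publication_year({"Year": "2018", "Month": "Jan"}))
print(extract_publication_year({"MedlineDate": "2000 Nov-Dec"}))
print(extract_publication_year({}))
```

Returning an empty string when no year is found matches the `.get('Year', '')` default used in the function below, so downstream filtering can treat both cases the same way.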

First, let's move forward with a shell of the publication workflow.

In [7]:
def get_publication_info_from_pmid(pmid):
    """
    Get publication information for a given PMID.

    :param pmid: PubMed ID (str)
    :return: Dictionary containing publication information
    """
    # Get user email from hidden local env file. Use default if not defined
    Entrez.email = os.environ.get('NCBI_EMAIL', 'your-email@example.com')
    Entrez.api_key = os.environ.get('NCBI_API_KEY', '')
    # Reduce delay between failed attempts from default 15
    Entrez.sleep_between_tries = 3

    try:
        handle = Entrez.efetch(db="pubmed", id=pmid, retmode="xml")
        records = Entrez.read(handle)
        handle.close()

        # Access article details from full returned record
        record = records['PubmedArticle'][0]['MedlineCitation']['Article']

        # Isolate and format author names
        authors = record.get('AuthorList', [])
        formatted_authors = format_authors(authors)

        publication_info = {
            'publication_id': pmid,
            'title': record['ArticleTitle'],
            'authors': formatted_authors,
            'publication_year': record['Journal']['JournalIssue']['PubDate'].get('Year', ''),
        }

        return publication_info

    except Exception as e:
        print(f"Error fetching information for PMID {pmid}: {e}")
        return None

In [7]:
# OUTDATED VERSION OF THIS FUNCTION
def get_publication_info_for_df(df_pmid):
    """
    Get publication information for each PMID in the input DataFrame.

    :param df_pmid: DataFrame containing 'coreproject', 'pmid', and 'applid' columns.
    :return: DataFrame containing 'coreproject', 'pmid', 'applid', 'title', 'authors', 'publication_year'.
    """
    df_pmid_info = pd.DataFrame()

    for index, row in df_pmid.iterrows():
        pmid = str(row['pmid'])
        publication_info = get_publication_info_from_pmid(pmid)

        if publication_info:
            # Combine the information with the original DataFrame
            df_pmid_info = pd.concat([df_pmid_info, pd.DataFrame([{
                'coreproject': row['coreproject'],
                'pmid': pmid,
                'applid': row['applid'],
                'title': publication_info['title'],
                'authors': publication_info['authors'],
                'publication_year': publication_info['publication_year']
            }])])

    return df_pmid_info

In [8]:
test_df_pmids = df_pmids.head(10)
test_df_pmids

Unnamed: 0,coreproject,pmid,applid
0,R01CA239701,36127808,10902170
1,R21CA209848,29074302,9321971
2,R21CA209848,31387361,9321971
3,R21CA209848,29027980,9321971
4,R21CA209848,29309429,9321971
5,R21CA209848,29432514,9321971
6,R21CA209848,30326636,9321971
7,R21CA209848,30755818,9321971
8,R21CA209848,32566544,9321971
9,R21CA209848,32753773,9321971


In [9]:
test_pmid_info = get_publication_info_for_df(test_df_pmids)
test_pmid_info

Unnamed: 0,coreproject,pmid,applid,title,authors,publication_year
0,R01CA239701,36127808,10902170,"Genetic ancestry, differential gene expression...","Freddy A Barragan, Lauren J Mills, Andrew R Ra...",2023
0,R21CA209848,29074302,9321971,Endogenous antibody responses to mucin 1 in a ...,"Janardan P Pandey, Aryan M Namboodiri, Bethany...",2018
0,R21CA209848,31387361,9321971,Defects in the Exocyst-Cilia Machinery Cause B...,"Diana Fulmer, Katelynn Toomer, Lilong Guo, Kel...",2019
0,R21CA209848,29027980,9321971,The Plasticizer Bisphenol A Perturbs the Hepat...,"Ludivine Renaud, Willian A da Silveira, E Star...",2017
0,R21CA209848,29309429,9321971,ShinyGPA: An interactive visualization toolkit...,"Emma Kortemeier, Paula S Ramos, Kelly J Hunt, ...",2018
0,R21CA209848,29432514,9321971,Improving SNP prioritization and pleiotropic a...,"Hang J Kim, Zhenning Yu, Andrew Lawson, Hongyu...",2018
0,R21CA209848,30326636,9321971,An Analytic Approach Using Candidate Gene Sele...,"Bethany J Wolf, Paula S Ramos, J Madison Hyer,...",2018
0,R21CA209848,30755818,9321971,Synergistic effects of SHP2 and PI3K pathway i...,"Bowen Sun, Nathaniel R Jensen, Dongjun Chung, ...",2019
0,R21CA209848,32566544,9321971,Ranking subjects based on paired compositional...,"Jin Hyun Nam, Aastha Khatiwada, Lois J Matthew...",2020
0,R21CA209848,32753773,9321971,hubViz: A Novel Tool for Hub-centric Visualiza...,"Jin Hyun Nam, Jonghyun Yun, Ick Hoon Jin, Dong...",2020


In [10]:
# Get first 10 pmids to use as test cases
test_pmid_list_2 = df_pmids.head(10)['pmid']
test_pmid_list_2

0    36127808
1    29074302
2    31387361
3    29027980
4    29309429
5    29432514
6    30326636
7    30755818
8    32566544
9    32753773
Name: pmid, dtype: int64

In [11]:
# Compare results of "_from_pmid" function to those of "_for_df" function
for pmid in test_pmid_list_2:
    display(get_publication_info_from_pmid(pmid))

{'publication_id': 36127808,
 'title': 'Genetic ancestry, differential gene expression, and survival in pediatric B-cell acute lymphoblastic leukemia.',
 'authors': 'Freddy A Barragan, Lauren J Mills, Andrew R Raduski, Erin L Marcotte, Kelsey E Grinde, Logan G Spector, Lindsay A Williams',
 'publication_year': '2023'}

{'publication_id': 29074302,
 'title': 'Endogenous antibody responses to mucin 1 in a large multiethnic cohort of patients with breast cancer and healthy controls: Role of immunoglobulin and Fcγ receptor genes.',
 'authors': 'Janardan P Pandey, Aryan M Namboodiri, Bethany Wolf, Motoki Iwasaki, Yoshio Kasuga, Gerson S Hamada, Shoichiro Tsugane',
 'publication_year': '2018'}

{'publication_id': 31387361,
 'title': 'Defects in the Exocyst-Cilia Machinery Cause Bicuspid Aortic Valve Disease and Aortic Stenosis.',
 'authors': 'Diana Fulmer, Katelynn Toomer, Lilong Guo, Kelsey Moore, Janiece Glover, Reece Moore, Rebecca Stairley, Glenn Lobo, Xiaofeng Zuo, Yujing Dang, Yanhui Su, Ben Fogelgren, Patrick Gerard, Dongjun Chung, Mahyar Heydarpour, Rupak Mukherjee, Simon C Body, Russell A Norris, Joshua H Lipschutz',
 'publication_year': '2019'}

{'publication_id': 29027980,
 'title': 'The Plasticizer Bisphenol A Perturbs the Hepatic Epigenome: A Systems Level Analysis of the miRNome.',
 'authors': 'Ludivine Renaud, Willian A da Silveira, E Starr Hazard, Jonathan Simpson, Silvia Falcinelli, Dongjun Chung, Oliana Carnevali, Gary Hardiman',
 'publication_year': '2017'}

{'publication_id': 29309429,
 'title': 'ShinyGPA: An interactive visualization toolkit for investigating pleiotropic architecture using GWAS datasets.',
 'authors': 'Emma Kortemeier, Paula S Ramos, Kelly J Hunt, Hang J Kim, Gary Hardiman, Dongjun Chung',
 'publication_year': '2018'}

{'publication_id': 29432514,
 'title': 'Improving SNP prioritization and pleiotropic architecture estimation by incorporating prior knowledge using graph-GPA.',
 'authors': 'Hang J Kim, Zhenning Yu, Andrew Lawson, Hongyu Zhao, Dongjun Chung',
 'publication_year': '2018'}

{'publication_id': 30326636,
 'title': 'An Analytic Approach Using Candidate Gene Selection and Logic Forest to Identify Gene by Environment Interactions (G × E) for Systemic Lupus Erythematosus in African Americans.',
 'authors': 'Bethany J Wolf, Paula S Ramos, J Madison Hyer, Viswanathan Ramakrishnan, Gary S Gilkeson, Gary Hardiman, Paul J Nietert, Diane L Kamen',
 'publication_year': '2018'}

{'publication_id': 30755818,
 'title': 'Synergistic effects of SHP2 and PI3K pathway inhibitors in GAB2-overexpressing ovarian cancer.',
 'authors': 'Bowen Sun, Nathaniel R Jensen, Dongjun Chung, Meixiang Yang, Amanda C LaRue, Hiu Wing Cheung, Qi Wang',
 'publication_year': '2019'}

{'publication_id': 32566544,
 'title': 'Ranking subjects based on paired compositional data with application to age-related hearing loss subtyping.',
 'authors': 'Jin Hyun Nam, Aastha Khatiwada, Lois J Matthews, Bradley A Schulte, Judy R Dubno, Dongjun Chung',
 'publication_year': '2020'}

{'publication_id': 32753773,
 'title': 'hubViz: A Novel Tool for Hub-centric Visualization.',
 'authors': 'Jin Hyun Nam, Jonghyun Yun, Ick Hoon Jin, Dongjun Chung',
 'publication_year': '2020'}

#### Gather publication info on full set of PMIDs

The first attempt at running `get_publication_info_for_df` ran overnight (~16h) before I was forced to shut it down. One way to speed it up is to register for an [NCBI API Key](https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/), which will increase the maximum calls per second from 3 to 10.

There are also a large number of duplicate PMIDs across multiple projects. Rather than gather duplicates multiple times, it could make sense to build a dataframe of unique PMIDs with publication info, THEN join that with the Project/PMID dataframe.  

Finally, publications from before 2000 also need to be removed, but that can only be done after the publication info has been gathered.
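
The plan above (fetch once per unique PMID, filter by year, then join back to the project/PMID pairs) could be sketched roughly as follows. This is only an outline under stated assumptions: `gather_unique_then_join` and `fetch_info_stub` are hypothetical names, and the stub returns fake data in place of a real `get_publication_info_from_pmid` call so the example runs without network access.

```python
import pandas as pd


def fetch_info_stub(pmid):
    """Hypothetical stand-in for get_publication_info_from_pmid (no network)."""
    fake_years = {"111": "1998", "222": "2021"}
    return {"title": f"Title {pmid}", "publication_year": fake_years.get(pmid, "")}


def gather_unique_then_join(df_pmid, fetch_info, min_year=2000):
    """Fetch info once per unique PMID, drop old records, join to projects."""
    rows = []
    for pmid in df_pmid["pmid"].unique():
        info = fetch_info(str(pmid))
        if info:
            rows.append({"pmid": pmid, **info})
    if not rows:
        # Nothing fetched: return an empty frame with the expected columns
        return pd.DataFrame(columns=[*df_pmid.columns, "title", "publication_year"])

    df_info = pd.DataFrame(rows)
    # Non-numeric or missing years become NaN and are dropped by the comparison
    years = pd.to_numeric(df_info["publication_year"], errors="coerce")
    df_info = df_info[years >= min_year]

    # One row per project/PMID pair that has surviving publication info
    return df_pmid.merge(df_info, on="pmid", how="inner")


df_links = pd.DataFrame({
    "coreproject": ["P1", "P2", "P2"],
    "pmid": [222, 222, 111],
})
result = gather_unique_then_join(df_links, fetch_info_stub)
print(result)
```

Collecting rows in a list and building the DataFrame once also avoids calling `pd.concat` inside the loop, which tends to slow down as the frame grows.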

#### Timing Estimates

In [15]:
# Test call of subset of PMIDs without an API key (baseline timing)
df_pub_info = get_publication_info_for_df(df_pmids.head(200))

In [16]:
# Test call of subset of PMIDs without an API key (baseline timing)
df_pub_info = get_publication_info_for_df(df_pmids.tail(200))

Error fetching information for PMID 26916412: HTTP Error 429: Too Many Requests
Error fetching information for PMID 31730848: HTTP Error 429: Too Many Requests
Error fetching information for PMID 36775354: list index out of range


The API key was then added to the function above (via `Entrez.api_key`) and the same tests were rerun.
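
Beyond adding an API key, the `429: Too Many Requests` errors seen above might also be reduced by retrying with an increasing delay. The sketch below is a generic illustration only: `call_with_backoff` and `flaky_fetch` are hypothetical names, and the flaky function simulates the error rather than calling NCBI.

```python
import time


def call_with_backoff(func, *args, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Call func(*args); on an exception, wait and retry with doubling delays."""
    for attempt in range(max_tries):
        try:
            return func(*args)
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a hypothetical flaky function that fails twice, then succeeds
calls = {"count": 0}


def flaky_fetch(pmid):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("HTTP Error 429: Too Many Requests")
    return {"publication_id": pmid}


delays = []  # record the requested delays instead of actually sleeping
result = call_with_backoff(flaky_fetch, "26916412", sleep=delays.append)
print(result, delays)
```

Two caveats if this were adapted here: `get_publication_info_from_pmid` catches its own exceptions and returns `None`, so the wrapper would need to go around the `Entrez.efetch` call itself, and the broad `except Exception` should probably be narrowed to HTTP errors so that parsing failures such as `list index out of range` are not retried.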

In [20]:
# Test call of subset of PMIDs with API key enabled in .env
df_pub_info = get_publication_info_for_df(df_pmids.head(200))

In [21]:
# Test call of subset of PMIDs with API key enabled in .env
df_pub_info = get_publication_info_for_df(df_pmids.tail(200))

Error fetching information for PMID 36775354: list index out of range


In [23]:
# Test call of subset of PMIDs with API key enabled in .env
df_pub_info = get_publication_info_for_df(df_pmids.head(1000))

Error fetching information for PMID 33579955: list index out of range
Error fetching information for PMID 33574288: list index out of range
Error fetching information for PMID 33574288: list index out of range


In [24]:
# Test call of subset of PMIDs with API key enabled in .env
df_pub_info = get_publication_info_for_df(df_pmids.head(2000))

Error fetching information for PMID 33579955: list index out of range
Error fetching information for PMID 33574288: list index out of range
Error fetching information for PMID 33574288: list index out of range


Without API Key:  
First 200 PMIDs: 00:01:16  
Last 200 PMIDs: 00:01:17 (With 3 errors)  

With API Key:  
First 200 PMIDs: 00:00:38  
Last 200 PMIDs: 00:00:38 (With 1 error)  
First 500 PMIDs: 00:01:36 (With 2 errors)  
First 1000 PMIDs: 00:03:39 (With 3 errors)  
First 2000 PMIDs: 00:06:22 (With 3 errors)  

**Average Rate: 0.198125 sec/PMID**  
At this rate, gathering info for all ~145,000 unique PMIDs will take ~8 hours.

In [27]:
df_pmids.pmid.nunique()

144658
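
The ~8 hour estimate can be sanity-checked from the recorded timings. The sketch below uses a pooled rate (total seconds divided by total PMIDs across the API-key runs), which is a slightly different aggregation from the average quoted above but lands in the same place. The variable names are illustrative only.

```python
# (PMIDs fetched, elapsed seconds) for the runs with an API key, from the timings above
runs = [(200, 38), (200, 38), (500, 96), (1000, 219), (2000, 382)]

total_pmids = sum(n for n, _ in runs)
total_seconds = sum(s for _, s in runs)
pooled_rate = total_seconds / total_pmids  # seconds per PMID

unique_pmids = 144_658  # value of df_pmids.pmid.nunique() shown above
estimated_hours = unique_pmids * pooled_rate / 3600

print(f"{pooled_rate:.3f} sec/PMID, about {estimated_hours:.1f} hours")
```

The estimate assumes the per-PMID rate stays flat over the full run, which the 200 to 2000 PMID tests suggest but do not guarantee.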

In [30]:
# Test call of subset of PMIDs with API key enabled in .env
df_pub_info = get_publication_info_for_df(df_pmids.head(200))

In [31]:
df_pub_info

Unnamed: 0,coreproject,pmid,applid,title,authors,publication_year
0,R01CA239701,36127808,10902170,"Genetic ancestry, differential gene expression...","Freddy A Barragan, Lauren J Mills, Andrew R Ra...",2023
0,R21CA209848,29074302,9321971,Endogenous antibody responses to mucin 1 in a ...,"Janardan P Pandey, Aryan M Namboodiri, Bethany...",2018
0,R21CA209848,31387361,9321971,Defects in the Exocyst-Cilia Machinery Cause B...,"Diana Fulmer, Katelynn Toomer, Lilong Guo, Kel...",2019
0,R21CA209848,29027980,9321971,The Plasticizer Bisphenol A Perturbs the Hepat...,"Ludivine Renaud, Willian A da Silveira, E Star...",2017
0,R21CA209848,29309429,9321971,ShinyGPA: An interactive visualization toolkit...,"Emma Kortemeier, Paula S Ramos, Kelly J Hunt, ...",2018
...,...,...,...,...,...,...
0,R21CA242861,35121878,10260680,Integrating molecular profiles into clinical f...,"Brendan Reardon, Nathanael D Moore, Nicholas S...",2021
0,R21CA242861,33230298,10260680,Integrated molecular drivers coordinate biolog...,"Jake R Conway, Felix Dietlein, Amaro Taylor-We...",2020
0,R21CA242861,32015527,10260680,Identification of cancer driver genes based on...,"Felix Dietlein, Donate Weghorn, Amaro Taylor-W...",2020
0,R21CA242933,36568126,10227447,A New Non-Linear Conjugate Gradient Algorithm ...,"Suvra Pal, Souvik Roy",2022


### Rework Publication gathering to reduce duplicative effort

In [13]:
# OUTDATED VERSION OF THIS FUNCTION
def get_publication_info_for_df(df_pmid):
    """
    Get publication information for each PMID in the input DataFrame.

    :param df_pmid: DataFrame containing 'coreproject', 'pmid', and 'applid' columns.
    :return: DataFrame containing 'coreproject', 'pmid', 'applid', 'title', 'authors', 'publication_year'.
    """
    unique_pmids = df_pmid['pmid'].unique()
    df_pmid_info = pd.DataFrame()

    for pmid in tqdm(unique_pmids, desc="Fetching Publication Info", unit=" PMID"):
        publication_info = get_publication_info_from_pmid(str(pmid))

        if publication_info:
            # Combine the information with the original DataFrame
            df_pmid_info = pd.concat([df_pmid_info, pd.DataFrame([{
                'coreproject': df_pmid[df_pmid['pmid'] == pmid]['coreproject'].values[0],
                'pmid': pmid,
                'applid': df_pmid[df_pmid['pmid'] == pmid]['applid'].values[0],
                'title': publication_info['title'],
                'authors': publication_info['authors'],
                'publication_year': publication_info['publication_year']
            }])])

    return df_pmid_info

In [18]:
df_pub_info = get_publication_info_for_df(df_pmids.head(2000))

Fetching Publication Info:  24%|██▍       | 442/1807 [01:52<04:32,  5.00 PMID/s]  

Error fetching information for PMID 33579955: list index out of range


Fetching Publication Info:  26%|██▌       | 470/1807 [01:58<05:23,  4.14 PMID/s]

Error fetching information for PMID 33574288: list index out of range


Fetching Publication Info: 100%|██████████| 1807/1807 [05:55<00:00,  5.08 PMID/s]


In [19]:
df_pub_info.pmid.nunique()

1805

In [25]:
df_pub_info.groupby('pmid').size().reset_index().sort_values(by=0, ascending=False)

Unnamed: 0,pmid,0
0,10387937,1
1241,34660940,1
1211,34518361,1
1210,34514469,1
1209,34508353,1
...,...,...
598,30864654,1
597,30854450,1
596,30836307,1
595,30828438,1


#### `df_pmids` has some duplicate coreproject-pmid combinations with different applids. Drop them.

In [46]:
# Drop 'applid' column and any resulting duplicate coreproject-pmid rows
df_pmids = df_pmids.drop(columns='applid').drop_duplicates(ignore_index = True)

In [47]:
df_pmids

Unnamed: 0,coreproject,pmid
0,R01CA239701,36127808
1,R21CA209848,29074302
2,R21CA209848,31387361
3,R21CA209848,29027980
4,R21CA209848,29309429
...,...,...
165983,P50CA196530,35471840
165984,P50CA196530,35793873
165985,P50CA196530,36509758
165986,P50CA196530,36775354


In [48]:
# Get sample list of PMIDs with more than one project
duplicate_pmid_list = df_pmids.groupby('pmid').size().reset_index().sort_values(by=0, ascending=False).head(10).pmid.tolist()

# Get rows of the df_pmid dataset including the duplicate pmids
duplicate_pmid_df = df_pmids[df_pmids['pmid'].isin(duplicate_pmid_list)].reset_index(drop=True)
duplicate_pmid_df

Unnamed: 0,coreproject,pmid
0,U01CA231840,33746047
1,U24CA209851,31344359
2,U24CA231877,32302568
3,U24CA248453,33626341
4,U24CA253531,36961012
...,...,...
147,U2CCA233238,35277708
148,U2CCA233284,32302568
149,U2CCA233284,35277708
150,P50CA058223,30683880


In [52]:
print(f"Unique PMIDs: {duplicate_pmid_df.pmid.nunique()}")
print(f"Unique Projects: {duplicate_pmid_df.coreproject.nunique()}")

Unique PMIDs: 10
Unique Projects: 82


In [61]:
# OUTDATED VERSION OF THIS FUNCTION
def get_publication_info_for_df(df_pmid):
    """
    Get publication information for each unique PMID in the input DataFrame.

    :param df_pmid: DataFrame containing 'coreproject', 'pmid' columns.
    :return: DataFrame containing 'coreproject', 'pmid', 'title', 'authors', 'publication_year'.
    """
    df_pmid_info = pd.DataFrame()

    # Get unique PMIDs
    unique_pmids = df_pmid['pmid'].unique()

    # Iterate through unique PMIDs
    for pmid in tqdm(unique_pmids, desc="Fetching Publications"):
        publication_info = get_publication_info_from_pmid(str(pmid))

        if publication_info:
            # Get rows in the original DataFrame corresponding to the current PMID
            rows = df_pmid[df_pmid['pmid'] == pmid]

            # Create a DataFrame for the current PMID
            df_current = pd.DataFrame({
                'coreproject': rows['coreproject'].values,
                'pmid': pmid,
                'title': publication_info['title'],
                'authors': publication_info['authors'],
                'publication_year': publication_info['publication_year']
            })

            # Concatenate the current DataFrame with df_pmid_info
            df_pmid_info = pd.concat([df_pmid_info, df_current])

    return df_pmid_info

In [63]:
# Test new function using duplicative input
duplicate_pub_info_df = get_publication_info_for_df(duplicate_pmid_df)
duplicate_pub_info_df

Fetching Publications: 100%|██████████| 10/10 [00:02<00:00,  4.76it/s]


Unnamed: 0,coreproject,pmid,title,authors,publication_year
0,U01CA231840,33746047,Association of clinical factors and recent ant...,"P Grivas, A R Khaki, T M Wise-Draper, B French...",2021
1,P30CA013330,33746047,Association of clinical factors and recent ant...,"P Grivas, A R Khaki, T M Wise-Draper, B French...",2021
2,P30CA013696,33746047,Association of clinical factors and recent ant...,"P Grivas, A R Khaki, T M Wise-Draper, B French...",2021
3,P30CA015704,33746047,Association of clinical factors and recent ant...,"P Grivas, A R Khaki, T M Wise-Draper, B French...",2021
4,P30CA016058,33746047,Association of clinical factors and recent ant...,"P Grivas, A R Khaki, T M Wise-Draper, B French...",2021
...,...,...,...,...,...
9,U24CA210990,34585150,Analytical protocol to identify local ancestry...,"Jian Carrot-Zhang, Seunghun Han, Wanding Zhou,...",2021
10,U24CA210999,34585150,Analytical protocol to identify local ancestry...,"Jian Carrot-Zhang, Seunghun Han, Wanding Zhou,...",2021
11,U24CA211000,34585150,Analytical protocol to identify local ancestry...,"Jian Carrot-Zhang, Seunghun Han, Wanding Zhou,...",2021
12,U24CA211006,34585150,Analytical protocol to identify local ancestry...,"Jian Carrot-Zhang, Seunghun Han, Wanding Zhou,...",2021


In [66]:
# Check that results of duplicative input have one set of info for each pmid
duplicate_pub_info_df.groupby(['pmid', 'title']).size().reset_index()

Unnamed: 0,pmid,title,0
0,30683880,Shared heritability and functional enrichment ...,15
1,31344359,Before and After: Comparison of Legacy and Har...,15
2,32302568,The Human Tumor Atlas Network: Charting Tumor ...,18
3,32396860,Comprehensive Analysis of Genetic Ancestry and...,14
4,33626341,Whole-genome characterization of lung adenocar...,22
5,33746047,Association of clinical factors and recent ant...,14
6,33982016,Integrative modeling identifies genetic ancest...,13
7,34585150,Analytical protocol to identify local ancestry...,14
8,35277708,MITI minimum information guidelines for highly...,13
9,36961012,Animal Models and Their Role in Imaging-Assist...,14


In [95]:
df_pub_info = get_publication_info_for_df(df_pmids.head(200))

Fetching Publications: 100%|██████████| 200/200 [00:40<00:00,  5.00it/s]


In [96]:
df_pub_info

Unnamed: 0,coreproject,pmid,title,authors,publication_year
0,R01CA239701,36127808,"Genetic ancestry, differential gene expression...","Freddy A Barragan, Lauren J Mills, Andrew R Ra...",2023
0,R21CA209848,29074302,Endogenous antibody responses to mucin 1 in a ...,"Janardan P Pandey, Aryan M Namboodiri, Bethany...",2018
0,R21CA209848,31387361,Defects in the Exocyst-Cilia Machinery Cause B...,"Diana Fulmer, Katelynn Toomer, Lilong Guo, Kel...",2019
0,R21CA209848,29027980,The Plasticizer Bisphenol A Perturbs the Hepat...,"Ludivine Renaud, Willian A da Silveira, E Star...",2017
0,R21CA209848,29309429,ShinyGPA: An interactive visualization toolkit...,"Emma Kortemeier, Paula S Ramos, Kelly J Hunt, ...",2018
...,...,...,...,...,...
0,R21CA242861,35121878,Integrating molecular profiles into clinical f...,"Brendan Reardon, Nathanael D Moore, Nicholas S...",2021
0,R21CA242861,33230298,Integrated molecular drivers coordinate biolog...,"Jake R Conway, Felix Dietlein, Amaro Taylor-We...",2020
0,R21CA242861,32015527,Identification of cancer driver genes based on...,"Felix Dietlein, Donate Weghorn, Amaro Taylor-W...",2020
0,R21CA242933,36568126,A New Non-Linear Conjugate Gradient Algorithm ...,"Suvra Pal, Souvik Roy",2022


In [97]:
# Check that results have only one set of info for each pmid
df_pub_info.groupby(['pmid', 'title']).size().reset_index().sort_values(by=0, ascending=False)

Unnamed: 0,pmid,title,0
0,25714012,Harnessing Noxa demethylation to overcome Bort...,1
137,33398948,Comprehensive review of surgical microscopes: ...,1
127,33163519,FIB-4 score and hepatocellular carcinoma risk ...,1
128,33213742,The Nobel Prize in Medicine 2020 for the Disco...,1
129,33230298,Integrated molecular drivers coordinate biolog...,1
...,...,...,...
69,31990579,Evidence-Based Network Approach to Recommendin...,1
70,32006268,Personalized Dosimetry for Liver Cancer Y-90 R...,1
71,32015527,Identification of cancer driver genes based on...,1
72,32025647,A virtual molecular tumor board to improve eff...,1


In [13]:
# OUTDATED VERSION OF THIS FUNCTION
def get_publication_info_for_df(df_pmid, chunk_size=10000, output_folder='pub_info_chunks'):
    """
    Get publication information for each PMID in the input DataFrame.

    :param df_pmid: DataFrame containing 'coreproject' and 'pmid' columns.
    :param chunk_size: Number of records to process in each batch.
    :param output_folder: Folder to store the output files.
    :return: DataFrame containing 'coreproject', 'pmid', 'title', 'authors', 
                and 'publication_year'.
    """
    # Create the output folder if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)

    df_pmid_info = pd.DataFrame()
    chunk_number = 0
    processed_pmids = set()
    
    for index, row in tqdm(df_pmid.iterrows(), total=df_pmid['pmid'].nunique()):
        pmid = str(row['pmid'])

        # Check if we've already processed this PMID
        if pmid in processed_pmids:
            continue

        try:
            publication_info = get_publication_info_from_pmid(pmid)

            # Add the current PMID to the set of processed PMIDs
            processed_pmids.add(pmid)

            if publication_info:
                # Combine the information with the original DataFrame
                df_current = pd.DataFrame({
                    'coreproject': row['coreproject'],
                    'pmid': pmid,
                    'title': publication_info['title'],
                    'authors': publication_info['authors'],
                    'publication_year': publication_info['publication_year']
                }, index=[0])

                # Add the current DataFrame to df_pmid_info
                df_pmid_info = pd.concat([df_pmid_info, df_current], ignore_index=True)

                # Check if df should be saved to file
                if len(df_pmid_info) >= chunk_size:
                    chunk_number += 1
                    output_file = f"{output_folder}/df_pmid_info_chunk_{chunk_number}.csv"
                    df_pmid_info.to_csv(output_file, index=False)

                    # Clear df_pmid_info for the next chunk
                    df_pmid_info = pd.DataFrame()

        except Exception as e:
            print(f"Error processing PMID {pmid}: {e}")
            # Fill in fields with "No Data Available"
            df_current = pd.DataFrame({
                'coreproject': row['coreproject'],
                'pmid': pmid,
                'title': 'No Data Available',
                'authors': 'No Data Available',
                'publication_year': 'No Data Available'
            }, index=[0])

            # Add the current DataFrame to df_pmid_info
            df_pmid_info = pd.concat([df_pmid_info, df_current], ignore_index=True)

    # Save the final chunk if present
    if len(df_pmid_info) >= 1:
        chunk_number += 1
        output_file = f"{output_folder}/df_pmid_info_chunk_{chunk_number}.csv"
        df_pmid_info.to_csv(output_file, index=False)

    return df_pmid_info


In [14]:
df_pub_info = get_publication_info_for_df(df_pmids.head(230), chunk_size=50, output_folder='pub_info_chunks')

  0%|          | 0/230 [00:00<?, ?it/s]

100%|██████████| 230/230 [00:53<00:00,  4.27it/s]


In [9]:
def load_all_directory_files_to_df(directory):
    """Load all identically-structured files within a folder into a single df."""
    
    # Build full paths with os.path.join for portability across platforms
    filepaths = [os.path.join(directory, file) for file in os.listdir(directory)]
    df = pd.concat(map(pd.read_csv, filepaths), ignore_index=True)
    
    return df

In [15]:
df_pubs = load_all_directory_files_to_df('pub_info_chunks')
df_pubs

Unnamed: 0,coreproject,pmid,title,authors,publication_year
0,R01CA239701,36127808,"Genetic ancestry, differential gene expression...","Freddy A Barragan, Lauren J Mills, Andrew R Ra...",2023
1,R21CA209848,29074302,Endogenous antibody responses to mucin 1 in a ...,"Janardan P Pandey, Aryan M Namboodiri, Bethany...",2018
2,R21CA209848,31387361,Defects in the Exocyst-Cilia Machinery Cause B...,"Diana Fulmer, Katelynn Toomer, Lilong Guo, Kel...",2019
3,R21CA209848,29027980,The Plasticizer Bisphenol A Perturbs the Hepat...,"Ludivine Renaud, Willian A da Silveira, E Star...",2017
4,R21CA209848,29309429,ShinyGPA: An interactive visualization toolkit...,"Emma Kortemeier, Paula S Ramos, Kelly J Hunt, ...",2018
...,...,...,...,...,...
230,R21CA209940,32795560,Perturbation of the circadian clock and pathog...,"Atish Mukherji, Mayssa Dachraoui, Thomas F Bau...",2020
231,R21CA209940,31399384,Uncovering the mechanism of action of aspirin ...,"Natascha Roehlen, Thomas F Baumert",2019
232,R21CA209940,32534107,Single-cell genomics and spatial transcriptomi...,"Antonio Saviano, Neil C Henderson, Thomas F Ba...",2020
233,R21CA209940,29331675,Module Analysis Captures Pancancer Genetically...,"Magali Champion, Kevin Brennan, Tom Croonenbor...",2018


In [16]:
# Set temp storage directory for chunked publication data
directory = 'pub_info_chunks'
input_df = df_pmids.head(500)
chunk_size = 100

# Gather publication data and store in chunks of defined size 
get_publication_info_for_df(input_df, chunk_size=chunk_size, output_folder=directory)

# Load all partial files back into a single df
df_pub_info = load_all_directory_files_to_df(directory)
df_pub_info

 91%|█████████ | 448/494 [02:08<00:08,  5.19it/s]

Error fetching information for PMID 33579955: list index out of range


 97%|█████████▋| 477/494 [02:14<00:02,  5.85it/s]

Error fetching information for PMID 33574288: list index out of range


500it [02:19,  3.59it/s]                         


Unnamed: 0,coreproject,pmid,title,authors,publication_year
0,R01CA239701,36127808,"Genetic ancestry, differential gene expression...","Freddy A Barragan, Lauren J Mills, Andrew R Ra...",2023
1,R21CA209848,29074302,Endogenous antibody responses to mucin 1 in a ...,"Janardan P Pandey, Aryan M Namboodiri, Bethany...",2018
2,R21CA209848,31387361,Defects in the Exocyst-Cilia Machinery Cause B...,"Diana Fulmer, Katelynn Toomer, Lilong Guo, Kel...",2019
3,R21CA209848,29027980,The Plasticizer Bisphenol A Perturbs the Hepat...,"Ludivine Renaud, Willian A da Silveira, E Star...",2017
4,R21CA209848,29309429,ShinyGPA: An interactive visualization toolkit...,"Emma Kortemeier, Paula S Ramos, Kelly J Hunt, ...",2018
...,...,...,...,...,...
487,U01CA239055,36281473,Automated analysis of computerized morphologic...,"Shayan Monabbati, Patrick Leo, Kaustav Bera, C...",2023
488,U01CA239055,31719058,Changes in CT Radiomic Features Associated wit...,"Mohammadhadi Khorrami, Prateek Prasanna, Amit ...",2020
489,U01CA239055,32722082,Radiomic Texture and Shape Descriptors of the ...,"Charlems Alvarez-Jimenez, Jacob T Antunes, Nit...",2020
490,U01CA239055,32816791,Radiomics-based assessment of ultra-widefield ...,"Prateek Prasanna, Vishal Bobba, Natalia Figuei...",2021


In [19]:
# Check to see if duplicate PMIDs are included across different projects
df_pub_info[df_pub_info['pmid'].duplicated()]

Unnamed: 0,coreproject,pmid,title,authors,publication_year


#### Try Publication info gathering with full dataset
Try a chunk size of 2,500 rows per file.

In [59]:
# Review original input df_pmids
df_pmids

Unnamed: 0,coreproject,pmid,applid
0,R01CA239701,36127808,10902170
1,R21CA209848,29074302,9321971
2,R21CA209848,31387361,9321971
3,R21CA209848,29027980,9321971
4,R21CA209848,29309429,9321971
...,...,...,...
175522,P50CA196530,35471840,10690040
175523,P50CA196530,35793873,10690040
175524,P50CA196530,36509758,10690040
175525,P50CA196530,36775354,10690040


In [60]:
# Set temp storage directory for chunked publication data
directory = 'publication_chunked_data'
input_df = df_pmids
chunk_size = 2500

# Gather publication data and store in chunks of defined size 
get_publication_info_for_df(input_df, chunk_size=chunk_size, output_folder=directory)

# Load all partial files back into a single df
df_pub_info = load_all_directory_files_to_df(directory)
df_pub_info

  0%|          | 447/175527 [01:43<9:47:18,  4.97it/s] 

Error fetching information for PMID 33579955: list index out of range


  0%|          | 476/175527 [01:50<9:40:08,  5.03it/s] 

Error fetching information for PMID 33574288: list index out of range


  1%|          | 889/175527 [03:18<40:32:55,  1.20it/s]

Error fetching information for PMID 32330068: Failed to find tag 'pubmed' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.read or Bio.Entrez.parse with validate=False.


  1%|          | 949/175527 [03:31<9:35:41,  5.05it/s] 

Error fetching information for PMID 33574288: list index out of range


  1%|▏         | 2592/175527 [09:01<9:23:51,  5.11it/s] 

Error fetching information for PMID 34233275: list index out of range


  1%|▏         | 2604/175527 [09:04<8:26:03,  5.70it/s] 

Error fetching information for PMID 31993221: list index out of range


  2%|▏         | 3348/175527 [11:39<8:58:12,  5.33it/s] 

Error fetching information for PMID 35554228: list index out of range


  2%|▏         | 3679/175527 [12:49<8:44:40,  5.46it/s] 

Error fetching information for PMID 35820070: list index out of range


  2%|▏         | 3726/175527 [12:59<9:59:04,  4.78it/s] 


ValueError: invalid literal for int() with base 10: ''
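
The `ValueError` above is what Python raises when an empty string is passed to `int()`, which would happen if a record comes back with `publication_year` set to `''`. A minimal sketch of a more tolerant conversion, assuming the year arrives as a string or is missing; `to_year` is a hypothetical helper, not one of the notebook's functions:

```python
import pandas as pd

def to_year(values):
    """Convert year-like values to a nullable integer Series.

    Empty strings and other non-numeric values become <NA> instead of
    raising, so a single bad record does not stop a long run.
    """
    years = pd.to_numeric(pd.Series(values), errors="coerce")
    return years.astype("Int64")

# Illustrative input: two valid years, one empty string, one missing value
years = to_year(["2017", "", None, "2023"])
print(years.tolist())
```

Rows with a missing year can then be filtered out with `isna()` before any comparison against a cutoff year.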

### Rename and rework `get_publication_info_for_df` to cycle through unique PMIDs

In [54]:
def build_pmid_info_data_chunks(df_pmid, chunk_size=10000, output_folder='pub_info_chunks'):
    """
    Get publication information for each unique PMID in the input DataFrame
    and export the results as separate CSV files.

    :param df_pmid: DataFrame containing a 'pmid' column. Other columns
                    (e.g. 'coreproject') are ignored.
    :param chunk_size: Number of records to store in each output file.
    :param output_folder: Folder to store the output files.
    :return: None. Files are written to output_folder for later loading.
    """
    # Create the output folder if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)

    df_pmid_info = pd.DataFrame()
    chunk_number = 0
    
    # Iterate through each unique PMID
    for pmid in tqdm(df_pmid['pmid'].unique()):
        try:
            publication_info = get_publication_info_from_pmid(pmid)

            if publication_info:
                # Build a single-row DataFrame for this PMID
                df_current = pd.DataFrame({
                    'pmid': pmid,
                    'title': publication_info['title'],
                    'authors': publication_info['authors'],
                    'publication_year': publication_info['publication_year']
                }, index=[0])

                # Add the current DataFrame to df_pmid_info
                df_pmid_info = pd.concat([df_pmid_info, df_current], ignore_index=True)

                # Check if df should be saved to file
                if len(df_pmid_info) >= chunk_size:
                    chunk_number += 1
                    output_file = f"{output_folder}/df_pmid_info_chunk_{chunk_number}.csv"
                    df_pmid_info.to_csv(output_file, index=False)

                    # Clear df_pmid_info for the next chunk
                    df_pmid_info = pd.DataFrame()

        except Exception as e:
            print(f"Error processing PMID {pmid}: {e}")
            # Fill in fields with numpy NaN if not available
            df_current = pd.DataFrame({
                'pmid': pmid,
                'title': np.nan,
                'authors': np.nan,
                'publication_year': np.nan
            }, index=[0])

            # Add the current DataFrame to df_pmid_info
            df_pmid_info = pd.concat([df_pmid_info, df_current], ignore_index=True)

    # Save the final chunk if present
    if len(df_pmid_info) >= 1:
        chunk_number += 1
        output_file = f"{output_folder}/df_pmid_info_chunk_{chunk_number}.csv"
        df_pmid_info.to_csv(output_file, index=False)

    return None


In [39]:
# Set temp storage directory for chunked publication data
directory = 'test_pub_chunks'
input_df = df_pmids.head(500)
chunk_size = 100

# Gather publication data and store in chunks of defined size 
build_pmid_info_data_chunks(input_df, chunk_size=chunk_size, output_folder=directory)

# Load all partial files back into a single df
df_pub_info = load_all_directory_files_to_df(directory)
df_pub_info

 89%|████████▉ | 442/494 [01:31<00:10,  5.15it/s]

Error fetching information for PMID 33579955: list index out of range


 95%|█████████▌| 471/494 [01:36<00:04,  5.15it/s]

Error fetching information for PMID 33574288: list index out of range


100%|██████████| 494/494 [01:41<00:00,  4.89it/s]


Unnamed: 0,pmid,title,authors,publication_year
0,36127808,"Genetic ancestry, differential gene expression...","Freddy A Barragan, Lauren J Mills, Andrew R Ra...",2023
1,29074302,Endogenous antibody responses to mucin 1 in a ...,"Janardan P Pandey, Aryan M Namboodiri, Bethany...",2018
2,31387361,Defects in the Exocyst-Cilia Machinery Cause B...,"Diana Fulmer, Katelynn Toomer, Lilong Guo, Kel...",2019
3,29027980,The Plasticizer Bisphenol A Perturbs the Hepat...,"Ludivine Renaud, Willian A da Silveira, E Star...",2017
4,29309429,ShinyGPA: An interactive visualization toolkit...,"Emma Kortemeier, Paula S Ramos, Kelly J Hunt, ...",2018
...,...,...,...,...
487,36281473,Automated analysis of computerized morphologic...,"Shayan Monabbati, Patrick Leo, Kaustav Bera, C...",2023
488,31719058,Changes in CT Radiomic Features Associated wit...,"Mohammadhadi Khorrami, Prateek Prasanna, Amit ...",2020
489,32722082,Radiomic Texture and Shape Descriptors of the ...,"Charlems Alvarez-Jimenez, Jacob T Antunes, Nit...",2020
490,32816791,Radiomics-based assessment of ultra-widefield ...,"Prateek Prasanna, Vishal Bobba, Natalia Figuei...",2021


In [40]:
test_df_pmids = df_pmids.head(500)

enriched_df = pd.merge(test_df_pmids, df_pub_info, on='pmid', how='left')
enriched_df

Unnamed: 0,coreproject,pmid,applid,title,authors,publication_year
0,R01CA239701,36127808,10902170,"Genetic ancestry, differential gene expression...","Freddy A Barragan, Lauren J Mills, Andrew R Ra...",2023.0
1,R21CA209848,29074302,9321971,Endogenous antibody responses to mucin 1 in a ...,"Janardan P Pandey, Aryan M Namboodiri, Bethany...",2018.0
2,R21CA209848,31387361,9321971,Defects in the Exocyst-Cilia Machinery Cause B...,"Diana Fulmer, Katelynn Toomer, Lilong Guo, Kel...",2019.0
3,R21CA209848,29027980,9321971,The Plasticizer Bisphenol A Perturbs the Hepat...,"Ludivine Renaud, Willian A da Silveira, E Star...",2017.0
4,R21CA209848,29309429,9321971,ShinyGPA: An interactive visualization toolkit...,"Emma Kortemeier, Paula S Ramos, Kelly J Hunt, ...",2018.0
...,...,...,...,...,...,...
495,U01CA239055,36281473,10392854,Automated analysis of computerized morphologic...,"Shayan Monabbati, Patrick Leo, Kaustav Bera, C...",2023.0
496,U01CA239055,31719058,10392854,Changes in CT Radiomic Features Associated wit...,"Mohammadhadi Khorrami, Prateek Prasanna, Amit ...",2020.0
497,U01CA239055,32722082,10392854,Radiomic Texture and Shape Descriptors of the ...,"Charlems Alvarez-Jimenez, Jacob T Antunes, Nit...",2020.0
498,U01CA239055,32816791,10392854,Radiomics-based assessment of ultra-widefield ...,"Prateek Prasanna, Vishal Bobba, Natalia Figuei...",2021.0


In [30]:
enriched_df[enriched_df['pmid'].duplicated(keep=False)]

Unnamed: 0,coreproject,pmid,applid,title,authors,publication_year
37,R21CA209940,30978357,9483712,"Combined Analysis of Metabolomes, Proteomes, a...","Joachim Lupberger, Tom Croonenborghs, Armando ...",2019.0
38,R21CA209940,31639029,9483712,Accuracy assessment of fusion transcript detec...,"Brian J Haas, Alexander Dobin, Bo Li, Nicolas ...",2019.0
39,R21CA209940,31297189,9483712,TFutils: Data structures for transcription fac...,"Benjamin J Stubbs, Shweta Gopaulakrishnan, Kim...",2019.0
43,R21CA209940,32383980,9483712,Imaging-AMARETTO: An Imaging Genomics Software...,"Olivier Gevaert, Mohsen Nabian, Shaimaa Bakr, ...",2020.0
99,R21CA220398,30311370,9923991,Adapting crowdsourced clinical cancer curation...,"Arpad M Danos, Deborah I Ritter, Alex H Wagner...",2018.0
312,U01CA209936,30311370,9654992,Adapting crowdsourced clinical cancer curation...,"Arpad M Danos, Deborah I Ritter, Alex H Wagner...",2018.0
349,U01CA214846,30978357,9677124,"Combined Analysis of Metabolomes, Proteomes, a...","Joachim Lupberger, Tom Croonenborghs, Armando ...",2019.0
350,U01CA214846,32383980,9677124,Imaging-AMARETTO: An Imaging Genomics Software...,"Olivier Gevaert, Mohsen Nabian, Shaimaa Bakr, ...",2020.0
353,U01CA214846,31639029,9677124,Accuracy assessment of fusion transcript detec...,"Brian J Haas, Alexander Dobin, Bo Li, Nicolas ...",2019.0
354,U01CA214846,31297189,9677124,TFutils: Data structures for transcription fac...,"Benjamin J Stubbs, Shweta Gopaulakrishnan, Kim...",2019.0
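
The output above shows the same PMID attached to more than one project, which is why publication details are fetched once per unique PMID and then left-merged back onto the project-PMID pairs. A small sketch with made-up rows (the project IDs, PMIDs, and titles are illustrative placeholders, not real data):

```python
import pandas as pd

# Illustrative project-to-PMID links; PMID 111 is shared by two projects
links = pd.DataFrame({
    "coreproject": ["P1", "P1", "P2"],
    "pmid": [111, 222, 111],
})

# One row of publication info per unique PMID (222 has no info available)
pub_info = pd.DataFrame({
    "pmid": [111],
    "title": ["Example title"],
})

# validate="many_to_one" raises if pub_info ever contains repeated PMIDs
merged = pd.merge(links, pub_info, on="pmid", how="left", validate="many_to_one")
print(merged)
```

Every project-PMID pair survives the merge, and a PMID with no fetched info shows up as NaN in the publication columns rather than disappearing.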


In [31]:
enriched_df['publication_year'].unique()

array([2023., 2018., 2019., 2017., 2020., 2022., 2016., 2021., 2015.,
         nan])

In [33]:
enriched_df[enriched_df['publication_year'].isna()]

Unnamed: 0,coreproject,pmid,applid,title,authors,publication_year
446,U01CA231840,33579955,9991804,,,
475,U01CA239055,33574288,10392854,,,


In [41]:
enriched_df[enriched_df['publication_year'] >= 2000 ]

Unnamed: 0,coreproject,pmid,applid,title,authors,publication_year
0,R01CA239701,36127808,10902170,"Genetic ancestry, differential gene expression...","Freddy A Barragan, Lauren J Mills, Andrew R Ra...",2023.0
1,R21CA209848,29074302,9321971,Endogenous antibody responses to mucin 1 in a ...,"Janardan P Pandey, Aryan M Namboodiri, Bethany...",2018.0
2,R21CA209848,31387361,9321971,Defects in the Exocyst-Cilia Machinery Cause B...,"Diana Fulmer, Katelynn Toomer, Lilong Guo, Kel...",2019.0
3,R21CA209848,29027980,9321971,The Plasticizer Bisphenol A Perturbs the Hepat...,"Ludivine Renaud, Willian A da Silveira, E Star...",2017.0
4,R21CA209848,29309429,9321971,ShinyGPA: An interactive visualization toolkit...,"Emma Kortemeier, Paula S Ramos, Kelly J Hunt, ...",2018.0
...,...,...,...,...,...,...
495,U01CA239055,36281473,10392854,Automated analysis of computerized morphologic...,"Shayan Monabbati, Patrick Leo, Kaustav Bera, C...",2023.0
496,U01CA239055,31719058,10392854,Changes in CT Radiomic Features Associated wit...,"Mohammadhadi Khorrami, Prateek Prasanna, Amit ...",2020.0
497,U01CA239055,32722082,10392854,Radiomic Texture and Shape Descriptors of the ...,"Charlems Alvarez-Jimenez, Jacob T Antunes, Nit...",2020.0
498,U01CA239055,32816791,10392854,Radiomics-based assessment of ultra-widefield ...,"Prateek Prasanna, Vishal Bobba, Natalia Figuei...",2021.0


### Add step to merge PMID info df with project-pmid info
Also clean the merged data and save any problematic publications for reference

In [50]:
def merge_and_clean_project_pmid_info(df_pmids, df_pub_info):
    """Merge publication information with the dataframe of projects and pmids. 
       Also perform some cleaning and store the removed publications.

    :param df_pmids: DataFrame containing 'coreproject' and 'pmid' columns.
    :param df_pub_info: DataFrame containing 'pmid', 'title', 'authors', and 
                    'publication_year'.
    :return: Tuple (df_merged, df_removed_publications). df_merged is the clean
             DataFrame with projects, pmids, and pub info.
             df_removed_publications holds the removed rows plus a 'reason' column.
    """
    # Merge the two DataFrames on the 'pmid' column
    df_merged = pd.merge(df_pmids, df_pub_info, on='pmid', how='left')

    # Remove rows with NaN values
    df_removed_publications = df_merged[df_merged.isnull().any(axis=1)].copy()
    df_merged = df_merged.dropna()

    # Remove rows where 'publication_year' is before 2000
    df_removed_before_2000 = df_merged[df_merged['publication_year']
                                       .astype(int) < 2000].copy()
    df_merged = df_merged[df_merged['publication_year'].astype(int) >= 2000]

    # Add reasons for removal to 'df_removed_publications'
    df_removed_publications = pd.concat([df_removed_publications, 
                                         df_removed_before_2000], 
                                         ignore_index=True)
    df_removed_publications['reason'] = ''

    # Add reasons for removal based on conditions
    df_removed_publications.loc[df_removed_publications
                                .isnull().any(axis=1), 'reason'] = 'No API info'
    df_removed_publications.loc[df_removed_publications['publication_year']
                                .astype(float) < 2000, 'reason'] = 'Published before 2000'

    return df_merged, df_removed_publications

In [51]:
df_publications, df_removed_publications = merge_and_clean_project_pmid_info(test_df_pmids, df_pub_info)

In [52]:
df_publications

Unnamed: 0,coreproject,pmid,applid,title,authors,publication_year
0,R01CA239701,36127808,10902170,"Genetic ancestry, differential gene expression...","Freddy A Barragan, Lauren J Mills, Andrew R Ra...",2023.0
1,R21CA209848,29074302,9321971,Endogenous antibody responses to mucin 1 in a ...,"Janardan P Pandey, Aryan M Namboodiri, Bethany...",2018.0
2,R21CA209848,31387361,9321971,Defects in the Exocyst-Cilia Machinery Cause B...,"Diana Fulmer, Katelynn Toomer, Lilong Guo, Kel...",2019.0
3,R21CA209848,29027980,9321971,The Plasticizer Bisphenol A Perturbs the Hepat...,"Ludivine Renaud, Willian A da Silveira, E Star...",2017.0
4,R21CA209848,29309429,9321971,ShinyGPA: An interactive visualization toolkit...,"Emma Kortemeier, Paula S Ramos, Kelly J Hunt, ...",2018.0
...,...,...,...,...,...,...
495,U01CA239055,36281473,10392854,Automated analysis of computerized morphologic...,"Shayan Monabbati, Patrick Leo, Kaustav Bera, C...",2023.0
496,U01CA239055,31719058,10392854,Changes in CT Radiomic Features Associated wit...,"Mohammadhadi Khorrami, Prateek Prasanna, Amit ...",2020.0
497,U01CA239055,32722082,10392854,Radiomic Texture and Shape Descriptors of the ...,"Charlems Alvarez-Jimenez, Jacob T Antunes, Nit...",2020.0
498,U01CA239055,32816791,10392854,Radiomics-based assessment of ultra-widefield ...,"Prateek Prasanna, Vishal Bobba, Natalia Figuei...",2021.0


In [53]:
df_removed_publications

Unnamed: 0,coreproject,pmid,applid,title,authors,publication_year,reason
0,U01CA231840,33579955,9991804,,,,No API info
1,U01CA239055,33574288,10392854,,,,No API info


In [55]:
# Set temp storage directory for chunked publication data
directory = 'test_pub_chunks'
input_df = df_pmids.head(5050)
chunk_size = 500

# Gather publication data and store in chunks of defined size 
build_pmid_info_data_chunks(input_df, chunk_size=chunk_size, output_folder=directory)

# Load all partial files back into a single df
df_pub_info = load_all_directory_files_to_df(directory)

# Build final publication df
df_publications, df_removed_publications = merge_and_clean_project_pmid_info(input_df, df_pub_info)

display(df_publications)
display(df_removed_publications)

 10%|▉         | 442/4557 [01:44<11:21,  6.03it/s]  

Error fetching information for PMID 33579955: list index out of range


 10%|█         | 471/4557 [01:50<12:05,  5.63it/s]

Error fetching information for PMID 33574288: list index out of range


 52%|█████▏    | 2367/4557 [08:38<06:39,  5.49it/s]

Error fetching information for PMID 34233275: list index out of range


 52%|█████▏    | 2379/4557 [08:41<08:40,  4.18it/s]

Error fetching information for PMID 31993221: list index out of range


 68%|██████▊   | 3095/4557 [11:42<04:34,  5.32it/s]  

Error fetching information for PMID 35554228: list index out of range


 74%|███████▍  | 3368/4557 [12:34<03:12,  6.16it/s]

Error fetching information for PMID 35820070: list index out of range


100%|██████████| 4557/4557 [18:17<00:00,  4.15it/s]  


Unnamed: 0,coreproject,pmid,applid,title,authors,publication_year
0,R01CA239701,36127808,10902170,"Genetic ancestry, differential gene expression...","Freddy A Barragan, Lauren J Mills, Andrew R Ra...",2023.0
1,R21CA209848,29074302,9321971,Endogenous antibody responses to mucin 1 in a ...,"Janardan P Pandey, Aryan M Namboodiri, Bethany...",2018.0
2,R21CA209848,31387361,9321971,Defects in the Exocyst-Cilia Machinery Cause B...,"Diana Fulmer, Katelynn Toomer, Lilong Guo, Kel...",2019.0
3,R21CA209848,29027980,9321971,The Plasticizer Bisphenol A Perturbs the Hepat...,"Ludivine Renaud, Willian A da Silveira, E Star...",2017.0
4,R21CA209848,29309429,9321971,ShinyGPA: An interactive visualization toolkit...,"Emma Kortemeier, Paula S Ramos, Kelly J Hunt, ...",2018.0
...,...,...,...,...,...,...
5045,R01CA271309,35927389,10584507,Improving plane wave ultrasound imaging throug...,"Josquin Foiret, Xiran Cai, Hanna Bendjador, Eu...",2022.0
5046,R01CA271309,35928598,10584507,Highly Integrated Multiplexing and Buffering E...,"Robert Wodnicki, Haochen Kang, Di Li, Douglas ...",2022.0
5047,R01CA271309,35836805,10584507,A theranostic 3D ultrasound imaging system for...,"Hanna Bendjador, Josquin Foiret, Robert Wodnic...",2022.0
5048,R01CA271309,37256942,10584507,Fast volumetric ultrasound facilitates high-re...,"Eun-Yeong Park, Xiran Cai, Josquin Foiret, Han...",2023.0


Unnamed: 0,coreproject,pmid,applid,title,authors,publication_year,reason
0,U01CA231840,33579955,9991804,,,,No API info
1,U01CA239055,33574288,10392854,,,,No API info
2,U24CA209851,28572459,10006080,AACR Project GENIE: Powering Precision Medicin...,,2017.0,No API info
3,U24CA215109,33574288,10227670,,,,No API info
4,U24CA231877,35446428,10908030,"The Galaxy platform for accessible, reproducib...",,2022.0,No API info
5,U54CA224019,34233275,10684101,,,,No API info
6,U54CA224019,31993221,10684101,,,,No API info
7,R01CA257505,35554228,10593130,,,,No API info
8,R01CA259046,35820070,10616492,,,,No API info
9,R01CA259188,36572537,10599170,Correction to: CD30 Regulation of IL-13-STAT6 ...,,2023.0,No API info
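
In the table above, rows that have a title and year but no authors receive the same 'No API info' label as rows with no record at all. One possible refinement, sketched under the assumption that the columns match those shown above; `label_removal_reason` and the extra reason strings are hypothetical, and the example rows are placeholders:

```python
import numpy as np
import pandas as pd

def label_removal_reason(df, min_year=2000):
    """Return a Series of removal reasons ('' means the row is kept)."""
    info_cols = ["title", "authors", "publication_year"]
    conditions = [
        df[info_cols].isna().all(axis=1).to_numpy(),
        df["authors"].isna().to_numpy(),
        df["publication_year"].isna().to_numpy(),
        (df["publication_year"] < min_year).to_numpy(),
    ]
    reasons = [
        "No API info",
        "Missing authors",
        "Missing publication year",
        f"Published before {min_year}",
    ]
    # np.select uses the first matching condition for each row
    return pd.Series(np.select(conditions, reasons, default=""), index=df.index)

# Placeholder rows covering each case
example = pd.DataFrame({
    "pmid": [1, 2, 3, 4, 5],
    "title": [np.nan, "A", "B", "C", "D"],
    "authors": [np.nan, np.nan, "X", "Y", "Z"],
    "publication_year": [np.nan, 2017.0, np.nan, 1998.0, 2021.0],
})
example["reason"] = label_removal_reason(example)
print(example[["pmid", "reason"]])
```

Separate labels would make it easier to decide later whether publications with organizational authors should be kept.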


## Workflow looks good. Ready to move to scripts

Notes:  
- Real-world data was gathered at roughly 0.2175 sec/PMID (about 4.6 PMIDs/sec). The estimate for 145,000 unique PMIDs is ~9h of runtime, which is why the chunks are valuable for keeping partial data. Consider adding the ability to gracefully restart the process partway through in the event of a failure. 
- Some of the publications removed due to missing API info may still be valuable:
    - Some have missing authors because PubMed lists the authors as an organization (e.g. [35446428](https://pubmed.ncbi.nlm.nih.gov/35446428/)). The record returned for PMID 28572459 has the same empty `authors` field:
        ```
        {'publication_id': 28572459,
        'title': 'AACR Project GENIE: Powering Precision Medicine through an International Consortium.',
        'authors': '',
        'publication_year': '2017'}
        ```
    - Others have a missing publication year but valid authors and title, for reasons not yet known (e.g. [35880942](https://pubmed.ncbi.nlm.nih.gov/35880942/)):
        ```
        {'publication_id': 35880942,
        'title': 'Adoptive Cellular Therapy for Pediatric Solid Tumors: Beyond Chimeric Antigen Receptor-T Cell Therapy.',
        'authors': 'Jonathan Hensel, Jonathan Metts, Ajay Gupta, Brian H Ladle, Shari Pilon-Thomas, John Mullinax',
        'publication_year': ''}
        ```


In [56]:
get_publication_info_from_pmid(28572459)

{'publication_id': 28572459,
 'title': 'AACR Project GENIE: Powering Precision Medicine through an International Consortium.',
 'authors': '',
 'publication_year': '2017'}

In [57]:
get_publication_info_from_pmid(35880942)

{'publication_id': 35880942,
 'title': 'Adoptive Cellular Therapy for Pediatric Solid Tumors: Beyond Chimeric Antigen Receptor-T Cell Therapy.',
 'authors': 'Jonathan Hensel, Jonathan Metts, Ajay Gupta, Brian H Ladle, Shari Pilon-Thomas, John Mullinax',
 'publication_year': ''}
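
Following up on the restart note above: one way to resume after a failure is to read whatever chunk files already exist and skip the PMIDs they contain. A minimal sketch, assuming the chunk files are CSVs with a `pmid` column as written by `build_pmid_info_data_chunks`; `remaining_pmids` and `estimate_hours` are hypothetical helpers, and the default rate is the 0.2175 sec/PMID figure from the notes:

```python
import glob
import os
import tempfile

import pandas as pd

def remaining_pmids(all_pmids, output_folder):
    """Return the unique PMIDs not yet present in any saved chunk file."""
    done = set()
    for path in glob.glob(os.path.join(output_folder, "*.csv")):
        done.update(pd.read_csv(path, usecols=["pmid"])["pmid"].tolist())
    # dict.fromkeys keeps first-seen order while dropping repeats
    return [p for p in dict.fromkeys(all_pmids) if p not in done]

def estimate_hours(n_pmids, seconds_per_pmid=0.2175):
    """Rough runtime estimate in hours at a fixed per-PMID rate."""
    return n_pmids * seconds_per_pmid / 3600

# Demonstration with a throwaway folder and placeholder PMIDs
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"pmid": [1, 2], "title": ["a", "b"]}).to_csv(
        os.path.join(tmp, "df_pmid_info_chunk_1.csv"), index=False
    )
    todo = remaining_pmids([1, 2, 3, 3, 4], tmp)

print(todo)
print(round(estimate_hours(145_000), 2))
```

Two caveats: PMIDs in an unsaved partial chunk, or ones that returned no info and were never written, would be fetched again, and `build_pmid_info_data_chunks` restarts its file numbering at 1, so a resumed run would need to continue the numbering or write to a new folder to avoid overwriting earlier chunks.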