# Explore the iCite API for gathering Publication data
2023-11-22 ZD  

# Part 1: Trying the iCite API
Part 2 (below): Using bulk download snapshot  

Relevant Jira Ticket: [INS-790](https://tracker.nci.nih.gov/browse/INS-790)  

Exploratory notebook to investigate gathering Publications data for INS from [the iCite API](https://icite.od.nih.gov/api). This will build upon the work in `notebooks/07_gather_publications.ipynb` and `modules/gather_publication_data.py`.  

The primary goal is to gather metrics specific to iCite: Citation Count and Relative Citation Ratio. The secondary goal is to explore whether the iCite API could replace the Biopython Entrez PubMed API, which is very slow. 

In [15]:
# Method to import from parent directory
import os
import sys
root_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
sys.path.append(root_dir)

import requests
import pandas as pd
from tqdm import tqdm

# Get all existing publication functions
import modules.gather_publication_data as gpub

In [5]:
# Test imported functions
gpub.get_pmids_from_nih_reporter_api('R01CA263500', print_meta=True)

R01CA263500: (1/1): {'search_id': None, 'total': 7, 'offset': 0, 'limit': 500, 'sort_field': 'core_project_nums', 'sort_order': 'desc', 'sorted_by_relevance': False, 'properties': {}}


[{'coreproject': 'R01CA263500', 'pmid': 37138086, 'applid': 10679077},
 {'coreproject': 'R01CA263500', 'pmid': 36288726, 'applid': 10679077},
 {'coreproject': 'R01CA263500', 'pmid': 36917953, 'applid': 10679077},
 {'coreproject': 'R01CA263500', 'pmid': 36734849, 'applid': 10679077},
 {'coreproject': 'R01CA263500', 'pmid': 37059069, 'applid': 10679077},
 {'coreproject': 'R01CA263500', 'pmid': 35130560, 'applid': 10679077},
 {'coreproject': 'R01CA263500', 'pmid': 37024595, 'applid': 10679077}]

In [11]:
gpub.get_publication_info_from_pmid('37138086')

{'publication_id': '37138086',
 'title': 'Glioblastoma remodelling of human neural circuits decreases survival.',
 'authors': 'Saritha Krishna, Abrar Choudhury, Michael B Keough, Kyounghee Seo, Lijun Ni, Sofia Kakaizada, Anthony Lee, Alexander Aabedi, Galina Popova, Benjamin Lipkin, Caroline Cao, Cesar Nava Gonzales, Rasika Sudharshan, Andrew Egladyous, Nyle Almeida, Yalan Zhang, Annette M Molinaro, Humsa S Venkatesh, Andy G S Daniel, Kiarash Shamardani, Jeanette Hyer, Edward F Chang, Anne Findlay, Joanna J Phillips, Srikantan Nagarajan, David R Raleigh, David Brang, Michelle Monje, Shawn L Hervey-Jumper',
 'publication_year': '2023'}

In [7]:
# Basic iCite python example

response = requests.get(
    "/".join([
        "https://icite.od.nih.gov/api",
        "pubs",
        "23456789",
    ]),
)
pub = response.json()
print(pub)

{'pmid': 23456789, 'year': 2013, 'title': 'Hospital volume is associated with survival but not multimodality therapy in Medicare patients with advanced head and neck cancer.', 'authors': 'Arun Sharma, Stephen M Schwartz, Eduardo Méndez', 'journal': 'Cancer', 'is_research_article': 'Yes', 'relative_citation_ratio': 1.77, 'nih_percentile': 70.8, 'human': 1.0, 'animal': 0.0, 'molecular_cellular': 0.0, 'apt': 0.75, 'is_clinical': 'No', 'citation_count': 45, 'citations_per_year': 4.5, 'expected_citations_per_year': 2.547166821310601, 'field_citation_rate': 5.361749145554551, 'provisional': 'No', 'x_coord': 0.0, 'y_coord': 1.0, 'cited_by_clin': [25488965, 29180076], 'cited_by': [30186960, 34399637, 30220318, 37564472, 34795020, 28606602, 24123512, 36746098, 29100787, 26777060, 26553389, 25488965, 30194691, 35792549, 33556919, 27061951, 24706437, 29794540, 25042524, 28079775, 35547406, 32600116, 24488549, 31334365, 30409307, 35868508, 26868285, 29079897, 33449369, 32191271, 30698823, 25681489

### Try a one-to-one replacement of the Entrez `get_publication_info_from_pmid` function

In [13]:
def get_publication_info_from_pmid_icite(pmid):
    """
    Get publication information for a given PMID using the iCite API.

    :param pmid: PubMed ID (str)
    :return: Dictionary containing publication information
    """
    try:
        # Use the iCite API to get publication data
        response = requests.get(f"https://icite.od.nih.gov/api/pubs/{pmid}")
        pub = response.json()

        # Extract relevant information
        publication_info = {
            'publication_id': pub.get('pmid', ''),
            'title': pub.get('title', ''),
            'authors': pub.get('authors', ''),
            'publication_year': pub.get('year', ''),
            'doi':pub.get('doi', ''),
            'citation_count': pub.get('citation_count', ''),
            'relative_citation_ratio': pub.get('relative_citation_ratio', ''),
        }

        return publication_info

    except Exception as e:
        # Use tqdm.write() instead of print() for long processes
        tqdm.write(f"Error fetching information for PMID {pmid} from iCite API: {e}")
        #print(f"Error fetching information for PMID {pmid} from iCite API: {e}")
        return None


In [28]:
get_publication_info_from_pmid_icite('37138086')

{'publication_id': 37138086,
 'title': 'Glioblastoma remodelling of human neural circuits decreases survival.',
 'authors': 'Saritha Krishna, Abrar Choudhury, Michael B Keough, Kyounghee Seo, Lijun Ni, Sofia Kakaizada, Anthony Lee, Alexander Aabedi, Galina Popova, Benjamin Lipkin, Caroline Cao, Cesar Nava Gonzales, Rasika Sudharshan, Andrew Egladyous, Nyle Almeida, Yalan Zhang, Annette M Molinaro, Humsa S Venkatesh, Andy G S Daniel, Kiarash Shamardani, Jeanette Hyer, Edward F Chang, Anne Findlay, Joanna J Phillips, Srikantan Nagarajan, David R Raleigh, David Brang, Michelle Monje, Shawn L Hervey-Jumper',
 'publication_year': 2023,
 'doi': '10.1038/s41586-023-06036-1',
 'citation_count': 21,
 'relative_citation_ratio': 10.5}

In [18]:
# Checkpoint loading instead of regathering data during development
pmid_filename = 'gathered_pmids_20231110.csv'
df_pmid = pd.read_csv(pmid_filename)

In [22]:
# Iterate through each unique PMID with tqdm progress bar
def get_pub_info_test_loop(df_pmid):

    df_pmid_info = pd.DataFrame()

    for pmid in tqdm(df_pmid['pmid'].unique(), 
                    #total=remaining_pmid_count, 
                    ncols=80):
        try:
            # Use the iCite API to get publication data
            publication_info = get_publication_info_from_pmid_icite(pmid)

            if publication_info:
                # Combine the information with the original DataFrame
                df_current = pd.DataFrame({
                    'pmid': pmid,
                    'title': publication_info['title'],
                    'authors': publication_info['authors'],
                    'publication_year': publication_info['publication_year'],
                    'doi': publication_info['doi'],
                    'citation_count': publication_info['citation_count'],
                    'relative_citation_ratio': publication_info['relative_citation_ratio']
                }, index=[0])

                # Add the current DataFrame to df_pmid_info
                df_pmid_info = pd.concat([df_pmid_info, df_current], ignore_index=True)

        except Exception as e:
            print(f"Error processing PMID {pmid}: {e}")
            # Fill in fields with NaN if not available
            df_current = pd.DataFrame({
                'pmid': pmid,
                'title': pd.NA,
                'authors': pd.NA,
                'publication_year': pd.NA,
                'doi': pd.NA,
                'citation_count': pd.NA,
                'relative_citation_ratio': pd.NA
            }, index=[0])

            # Add the current DataFrame to df_pmid_info
            df_pmid_info = pd.concat([df_pmid_info, df_current], ignore_index=True)

    return df_pmid_info

In [26]:
df_pmid_info_icite = get_pub_info_test_loop(df_pmid.head(1000))

100%|█████████████████████████████████████████| 914/914 [13:41<00:00,  1.11it/s]


In [27]:
df_pmid_info_icite

Unnamed: 0,pmid,title,authors,publication_year,doi,citation_count,relative_citation_ratio
0,36127808,"Genetic ancestry, differential gene expression...","Freddy A Barragan, Lauren J Mills, Andrew R Ra...",2023,10.1002/cam4.5266,4,
1,29074302,Endogenous antibody responses to mucin 1 in a ...,"Janardan P Pandey, Aryan M Namboodiri, Bethany...",2018,10.1016/j.imbio.2017.10.028,2,0.09
2,31387361,Defects in the Exocyst-Cilia Machinery Cause B...,"Diana Fulmer, Katelynn Toomer, Lilong Guo, Kel...",2019,10.1161/CIRCULATIONAHA.119.038376,34,1.95
3,29027980,The Plasticizer Bisphenol A Perturbs the Hepat...,"Ludivine Renaud, Willian A da Silveira, E Star...",2017,10.3390/genes8100269,20,1.17
4,29309429,ShinyGPA: An interactive visualization toolkit...,"Emma Kortemeier, Paula S Ramos, Kelly J Hunt, ...",2018,10.1371/journal.pone.0190949,2,0.08
...,...,...,...,...,...,...,...
909,32276990,Mesenchymal and MAPK Expression Signatures Ass...,"Josh Lewis Stern, Grace Hibshman, Kevin Hu, Sa...",2020,10.1158/1541-7786.MCR-19-1244,17,1.13
910,32525984,Tissue- and development-stage-specific mRNA an...,"Anshuman Panda, Anupama Yadav, Huwate Yeerna, ...",2020,10.1093/nar/gkaa485,12,0.76
911,36371231,Differential regulation of TNFα and IL-6 expre...,"Ida Deichaite, Timothy J Sears, Leisa Sutton, ...",2022,10.1186/s12967-022-03731-x,2,0.57
912,37682073,Transcriptional subtypes of glottic cancer cha...,"Bharat A Panuganti, Christine Carico, Harishan...",2023,10.1002/hed.27514,0,


### Try processing in batches for fewer API calls
The above approach works but is even slower per iteration than the Entrez API (13:41 for 914 unique PMIDs). Try a batching approach where multiple PMIDs are sent in a single call.

In [37]:
def get_publication_info_from_pmid_icite_batch(pmids):
    """
    Get publication information for a list of PMIDs using the iCite API.

    :param pmids: List of PubMed IDs (str)
    :return: DataFrame containing publication information
    """
    try:
        # Join PMIDs into a comma-separated string
        pmid_str = ','.join(pmids)
        
        # Use the iCite API to get publication data for all PMIDs
        response = requests.get(f"https://icite.od.nih.gov/api/pubs?pmids={pmid_str}")
        pubs = response.json()

        # Initialize an empty list to store publication information for each PMID
        publication_info_list = []

        for pub in pubs:
            # Extract relevant information
            publication_info = {
                'pmid': pub.get('pmid', ''),
                'title': pub.get('title', ''),
                'authors': pub.get('authors', ''),
                'publication_year': pub.get('year', ''),
                'doi': pub.get('doi', ''),
                'citation_count': pub.get('citation_count', ''),
                'relative_citation_ratio': pub.get('relative_citation_ratio', ''),
            }
            
            # Add data to running list
            publication_info_list.append(publication_info)

        return pd.DataFrame(publication_info_list)

    except Exception as e:
        print(f"Error fetching information for PMIDs {pmids} from iCite API: {e}")
        return pd.DataFrame()

In [39]:
def get_pub_info_batched(df_pmid, batch_size=10):
    df_pmid_info = pd.DataFrame()

    # Extract unique PMIDs
    unique_pmids = df_pmid['pmid'].unique()

    # Split PMIDs into batches
    pmid_batches = [
        unique_pmids[i : i + batch_size].astype(str) for i in range(0, len(unique_pmids), batch_size)
    ]

    for batch in tqdm(pmid_batches, ncols=80):
        try:
            # Fetch each PMID in the batch with a separate single-PMID request
            batch_info = [get_publication_info_from_pmid_icite(pmid) for pmid in batch]

            # Filter out None results (failed API calls)
            batch_info = [info for info in batch_info if info is not None]

            # Combine the information with the original DataFrame
            df_current = pd.DataFrame(batch_info)

            # Add the current DataFrame to df_pmid_info
            df_pmid_info = pd.concat([df_pmid_info, df_current], ignore_index=True)

        except Exception as e:
            print(f"Error processing batch of PMIDs: {e}")

    return df_pmid_info

In [42]:
# No batching
test = get_pub_info_test_loop(df_pmid.head(100))

100%|█████████████████████████████████████████| 100/100 [01:22<00:00,  1.22it/s]


In [40]:
# Batch size default 10
test = get_pub_info_batched(df_pmid.head(100))

100%|███████████████████████████████████████████| 10/10 [01:17<00:00,  7.77s/it]


#### Compare timing of gathering iCite PMID info in single vs batched calls
Table gathers data from cells below

| API    | PMIDs | Batch Size | Time (mm:ss) | Rate (s/PMID) |
| ------:| -----:| ----------:| ------------:| -------------:|
| Entrez | 100   | None       | 00:34        | 0.34          |
| iCite  | 100   | None       | 01:22        | 0.82          |
| iCite  | 100   | 1          | 01:18        | 0.78          |
| iCite  | 100   | 5          | 01:18        | 0.78          |
| iCite  | 100   | 10         | 01:17        | 0.78          |
| iCite  | 100   | 50         | 01:18        | 0.79          |
| Entrez | 500   | None       | 02:08        | 0.26          |
| iCite  | 500   | None       | 06:52        | 0.82          |
| iCite  | 500   | 100        | 07:09        | 0.86          |


#### Summary
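1. The Entrez API is roughly 2.5 to 3 times faster than the iCite API (0.26-0.34 s/PMID vs. 0.78-0.86 s/PMID).
2. Grouping PMIDs into batches does not significantly improve performance. Note that `get_pub_info_batched` still calls `get_publication_info_from_pmid_icite` once per PMID, so the batch size does not reduce the number of API requests in these tests.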
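
The `get_pub_info_batched` loop above fetches every PMID in a batch with its own request, so the timings do not measure a true multi-PMID call. Below is a minimal sketch of one request per batch. The helper names are hypothetical, and the response shape is an assumption: the parser accepts either a bare list of records or a dict that wraps them in a `data` list. It has not been timed against the live API.

```python
import requests

ICITE_PUBS_URL = "https://icite.od.nih.gov/api/pubs"


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def records_from_payload(payload):
    """Return the publication dicts from a parsed multi-PMID response.

    Assumption: the payload is either a list of records or a dict with
    the records under a 'data' key.
    """
    if isinstance(payload, dict):
        return payload.get("data", [])
    return list(payload)


def fetch_icite_batch(pmids, timeout=30):
    """Send ONE request covering the whole batch of PMIDs."""
    pmid_str = ",".join(str(pmid) for pmid in pmids)
    response = requests.get(f"{ICITE_PUBS_URL}?pmids={pmid_str}", timeout=timeout)
    response.raise_for_status()
    return records_from_payload(response.json())


def fetch_icite_all(pmids, batch_size=100, fetch=fetch_icite_batch):
    """Collect records for all PMIDs using one request per batch."""
    records = []
    for batch in chunked(pmids, batch_size):
        records.extend(fetch(batch))
    return records
```

The `fetch` argument is injectable so the batching logic can be checked with a stub before spending time on real requests.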

In [44]:
# 500 pmids, no batching
test = get_pub_info_test_loop(df_pmid.head(500))

 98%|████████████████████████████████████████▎| 486/494 [06:45<00:06,  1.30it/s]

In [43]:
# 500 pmids, batch size 100
test = get_pub_info_batched(df_pmid.head(500), batch_size=100)

100%|█████████████████████████████████████████████| 5/5 [07:09<00:00, 85.85s/it]


In [47]:
# 100 pmids, batch size 5
test = get_pub_info_batched(df_pmid.head(100), batch_size=5)

100%|███████████████████████████████████████████| 20/20 [01:18<00:00,  3.92s/it]


In [48]:
# 100 pmids, batch size 50
test = get_pub_info_batched(df_pmid.head(100), batch_size=50)

100%|█████████████████████████████████████████████| 2/2 [01:18<00:00, 39.35s/it]


In [49]:
# 100 pmids, batch size 1
test = get_pub_info_batched(df_pmid.head(100), batch_size=1)

100%|█████████████████████████████████████████| 100/100 [01:18<00:00,  1.27it/s]


In [50]:
# Iterate through each unique PMID with tqdm progress bar
def get_pub_info_test_loop_entrez(df_pmid):

    df_pmid_info = pd.DataFrame()

    for pmid in tqdm(df_pmid['pmid'].unique(), 
                    #total=remaining_pmid_count, 
                    ncols=80):
        try:
            # Use PubMed API to get publication data
            publication_info = gpub.get_publication_info_from_pmid(pmid)

            if publication_info:
                # Combine the information with the original DataFrame
                df_current = pd.DataFrame({
                    'pmid': pmid,
                    'title': publication_info['title'],
                    'authors': publication_info['authors'],
                    'publication_year': publication_info['publication_year'],
                    # 'doi': publication_info['doi'],
                    # 'citation_count': publication_info['citation_count'],
                    # 'relative_citation_ratio': publication_info['relative_citation_ratio']
                }, index=[0])

                # Add the current DataFrame to df_pmid_info
                df_pmid_info = pd.concat([df_pmid_info, df_current], ignore_index=True)

        except Exception as e:
            print(f"Error processing PMID {pmid}: {e}")
            # Fill in fields with NaN if not available
            df_current = pd.DataFrame({
                'pmid': pmid,
                'title': pd.NA,
                'authors': pd.NA,
                'publication_year': pd.NA,
                # 'doi': pd.NA,
                # 'citation_count': pd.NA,
                # 'relative_citation_ratio': pd.NA
            }, index=[0])

            # Add the current DataFrame to df_pmid_info
            df_pmid_info = pd.concat([df_pmid_info, df_current], ignore_index=True)

    return df_pmid_info

In [51]:
# 100 pmids, Entrez method
test = get_pub_info_test_loop_entrez(df_pmid.head(100))

100%|█████████████████████████████████████████| 100/100 [00:34<00:00,  2.89it/s]


In [53]:
# 500 pmids, Entrez method
test = get_pub_info_test_loop_entrez(df_pmid.head(500))

 89%|████████████████████████████████████▋    | 442/494 [01:52<00:10,  4.75it/s]

Error fetching information for PMID 33579955: list index out of range


 95%|███████████████████████████████████████  | 470/494 [02:02<00:04,  4.98it/s]

Error fetching information for PMID 33574288: list index out of range


100%|█████████████████████████████████████████| 494/494 [02:08<00:00,  3.86it/s]


#### See if pulling fewer fields from iCite is faster

In [61]:
def get_publication_info_from_pmid_icite(pmid, fields='all'):
    """
    Get publication information for a given PMID using the iCite API.

    :param pmid: PubMed ID (str)
    :param fields: 'all' to request every field, or a list of field names
                   to pass to the API's `fl` parameter
    :return: Dictionary containing publication information
    """
    try:
        # Use the iCite API to get publication data
        if fields == 'all':
            response = requests.get(f"https://icite.od.nih.gov/api/pubs/{pmid}")
        
        # If a list of fields is provided, include only those in the response
        else:
            field_str = ','.join(fields)
            # The query string must start with '?', not '&'
            response = requests.get(f"https://icite.od.nih.gov/api/pubs/{pmid}"
                                    f"?fl={field_str}")
        pub = response.json()

        # Extract relevant information
        publication_info = {
            'publication_id': pub.get('pmid', ''),
            # 'title': pub.get('title', ''),
            # 'authors': pub.get('authors', ''),
            # 'publication_year': pub.get('year', ''),
            'doi':pub.get('doi', ''),
            'citation_count': pub.get('citation_count', ''),
            'relative_citation_ratio': pub.get('relative_citation_ratio', ''),
        }

        return publication_info

    except Exception as e:
        # Use tqdm.write() instead of print() for long processes
        tqdm.write(f"Error fetching information for PMID {pmid} from iCite API: {e}")
        #print(f"Error fetching information for PMID {pmid} from iCite API: {e}")
        return None

# Iterate through each unique PMID with tqdm progress bar
def get_pub_info_test_loop(df_pmid, fields='all'):

    df_pmid_info = pd.DataFrame()

    for pmid in tqdm(df_pmid['pmid'].unique(), 
                    #total=remaining_pmid_count, 
                    ncols=80):
        try:
            # Use the iCite API to get publication data
            publication_info = get_publication_info_from_pmid_icite(pmid, fields)

            if publication_info:
                # Combine the information with the original DataFrame
                df_current = pd.DataFrame({
                    'pmid': pmid,
                    # 'title': publication_info['title'],
                    # 'authors': publication_info['authors'],
                    # 'publication_year': publication_info['publication_year'],
                    'doi': publication_info['doi'],
                    'citation_count': publication_info['citation_count'],
                    'relative_citation_ratio': publication_info['relative_citation_ratio']
                }, index=[0])

                # Add the current DataFrame to df_pmid_info
                df_pmid_info = pd.concat([df_pmid_info, df_current], ignore_index=True)

        except Exception as e:
            print(f"Error processing PMID {pmid}: {e}")
            # Fill in fields with NaN if not available
            df_current = pd.DataFrame({
                'pmid': pmid,
                # 'title': pd.NA,
                # 'authors': pd.NA,
                # 'publication_year': pd.NA,
                'doi': pd.NA,
                'citation_count': pd.NA,
                'relative_citation_ratio': pd.NA
            }, index=[0])

            # Add the current DataFrame to df_pmid_info
            df_pmid_info = pd.concat([df_pmid_info, df_current], ignore_index=True)

    return df_pmid_info

In [62]:
test = get_pub_info_test_loop(df_pmid.head(100), fields=['pmid', 'citation_count', 'doi', 'relative_citation_ratio'])

100%|█████████████████████████████████████████| 100/100 [01:19<00:00,  1.26it/s]


No notable change in runtime: 1:19 vs. 1:22 to complete 100 PMIDs when requesting a few fields versus all fields.
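
When limiting fields, the field list belongs in the query string, which starts with `?` after the path. The sketch below builds such a URL with the standard library so it can be inspected before any request is sent. `build_icite_url` is a hypothetical helper, and whether the single-PMID endpoint honors `fl` is an assumption that is not verified here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ICITE_PUBS_URL = "https://icite.od.nih.gov/api/pubs"


def build_icite_url(pmid, fields=None):
    """Build a single-PMID URL, optionally limiting the returned fields."""
    url = f"{ICITE_PUBS_URL}/{pmid}"
    if fields:
        # safe="," keeps commas literal so the list reads as fl=a,b,c
        url += "?" + urlencode({"fl": ",".join(fields)}, safe=",")
    return url


url = build_icite_url("37138086", ["pmid", "doi", "citation_count"])
print(url)
# Parse the query string back out to confirm the parameter is well formed
print(parse_qs(urlsplit(url).query))
```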

# Part 2: Using Bulk Snapshot Download  

Still using Jira Ticket: [INS-790](https://tracker.nci.nih.gov/browse/INS-790)  

Pulling data from the iCite API was too slow at the scale we need. Instead, try using the bulk download of the monthly database snapshot that iCite provides on NIH FigShare: https://nih.figshare.com/collections/iCite_Database_Snapshots_NIH_Open_Citation_Collection_/4586573

In [1]:
# Method to import from parent directory
import os
import sys
root_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
sys.path.append(root_dir)

#import requests
import pandas as pd
from tqdm import tqdm

# Get all existing publication functions
import modules.gather_publication_data as gpub

---TIMESTAMP OVERRIDE IN USE---
---Disable this with comments in config.py for default behavior---


In [2]:
# Checkpoint loading instead of regathering data during development
pmid_filename = 'gathered_pmids_20231110.csv'
df_pmid = pd.read_csv(pmid_filename)

In [3]:
df_pmid.head(5)

Unnamed: 0,coreproject,pmid,applid
0,R01CA239701,36127808,10902170
1,R21CA209848,29074302,9321971
2,R21CA209848,31387361,9321971
3,R21CA209848,29027980,9321971
4,R21CA209848,29309429,9321971


In [3]:
# Define path to icite bulk csv.zip file and verify 
icite_filepath = '../data/raw/icite/2023-10/icite_metadata.zip'
os.path.exists(icite_filepath)

True

In [8]:
# Set columns of interest to pull from iCite
cols = ['pmid','doi','title','authors','year','citation_count','relative_citation_ratio']

In [4]:
# Try reading first rows of the file
df_icite_sample = pd.read_csv(icite_filepath, compression='zip', nrows=5)
df_icite_sample

Unnamed: 0,pmid,doi,title,authors,year,journal,is_research_article,citation_count,field_citation_rate,expected_citations_per_year,...,molecular_cellular,x_coord,y_coord,apt,is_clinical,cited_by_clin,cited_by,references,provisional,last_modified
0,1,10.1016/0006-2944(75)90147-7,Formate assay in body fluids: application in m...,"A B Makar, K E McMartin, M Palese, T R Tephly",1975,Biochem Med,True,101,5.203998,0.793752,...,0.5,-0.144338,-0.25,0.25,False,,27354968 34013366 30849241 21923939 36162727 3...,4972128 4332837 13672941 14203183 14161139 143...,No,"10/31/2023, 08:00:32"
1,2,10.1016/0006-291x(75)90482-9,Delineation of the intimate details of the bac...,"K S Bose, R H Sarma",1975,Biochem Biophys Res Commun,True,39,5.963835,0.905675,...,1.0,-0.866025,-0.5,0.25,False,21542697.0,36225522 6267127 26376 37339783 37441102 37754...,4150960 4356257 4846745 4357832 4683494 441485...,No,"10/31/2023, 08:12:21"
2,3,10.1016/0006-291x(75)90498-2,Metal substitutions incarbonic anhydrase: a ha...,"R J Smith, R G Bryant",1975,Biochem Biophys Res Commun,True,17,5.35838,0.816492,...,0.57,-0.247436,-0.285714,0.25,False,37120631.0,25624746 33776281 32427033 28053241 35738176 2...,28089154 29927711 33870035 31115340 27536128 3...,No,"10/31/2023, 08:12:58"
3,4,10.1016/0006-291x(75)90506-9,Effect of chloroquine on cultured fibroblasts:...,"U N Wiesmann, S DiDonato, N N Herschkowitz",1975,Biochem Biophys Res Commun,True,75,5.586596,0.850108,...,0.8,-0.69282,-0.2,0.05,False,21975914.0,564972 8907731 7060838 6734624 36936276 312582...,13663253 4271529 5021451 4607946 4374680 14907...,No,"10/31/2023, 08:13:00"
4,5,10.1016/0006-291x(75)90508-2,Atomic models for the polypeptide backbones of...,"W A Hendrickson, K B Ward",1975,Biochem Biophys Res Commun,True,23,5.652509,0.859817,...,0.33,0.288675,-0.5,0.05,False,,7118409 2619971 364941 3380793 856811 8372226 ...,4882249 5059118 14834145 1056020 5509841,No,"10/31/2023, 08:13:09"


In [6]:
# Show first row to see columns and sample values
df_icite_sample.loc[0]

pmid                                                                           1
doi                                                 10.1016/0006-2944(75)90147-7
title                          Formate assay in body fluids: application in m...
authors                            A B Makar, K E McMartin, M Palese, T R Tephly
year                                                                        1975
journal                                                              Biochem Med
is_research_article                                                         True
citation_count                                                               101
field_citation_rate                                                     5.203998
expected_citations_per_year                                             0.793752
citations_per_year                                                      2.104167
relative_citation_ratio                                                     2.65
nih_percentile              

PMIDs appear to be assigned sequentially, so the lowest and highest PMIDs in our list give an idea of the count/range of rows that we'll need to access from the iCite data.

In [21]:
# Find lowest and highest PMID in the gathered PMID list.
print(f"Lowest PMID:   {df_pmid['pmid'].min():,}\n"
      f"Highest PMID: {df_pmid['pmid'].max():,}")

Lowest PMID:   1,279,509
Highest PMID: 37,947,614


In [22]:
print(f"Range: {37947614-1279509:,}")

Range: 36,668,105


Write a function that can iterate through this large dataset in chunks and pull out info for any relevant PMID

In [30]:
# THERE IS AN UPDATED VERSION OF THIS FUNCTION LATER
def enrich_df_with_icite(df_pmid, icite_filepath, cols=None, chunk_size=250000):
    """
    Enrich a DataFrame with iCite data using chunks.

    :param df_pmid (pd.DataFrame): DataFrame containing PMID-related data.
    :param icite_filepath (str): Path to the zipped iCite CSV file.
    :param cols (list): List of columns to select from df_pmid before merging.
                   If None, all columns from df_pmid are used.
    :param chunk_size (int): Number of rows to read per chunk.

    :return: pd.DataFrame: Enriched DataFrame with iCite data.
    """

    # If cols is not specified, use all columns from df_pmid
    if cols is None:
        cols = df_pmid.columns.tolist()

    # Initialize an empty DataFrame to store the enriched data
    df_enriched = pd.DataFrame()

    # Create a tqdm wrapper around the generator to track progress
    chunks = tqdm(pd.read_csv(icite_filepath, compression='zip', chunksize=chunk_size),
                  desc="Processing iCite", unit="chunk")

    # Iterate through chunks of the iCite DataFrame
    for chunk in chunks:
        # Merge only the specified columns from df_pmid
        df_merged = pd.merge(df_pmid, chunk[cols], how='left', on='pmid')

        # Append the merged DataFrame to the result
        df_enriched = pd.concat([df_enriched, df_merged], ignore_index=True)

    return df_enriched


In [32]:
# Make it go
df_pmid_enriched = enrich_df_with_icite(df_pmid, icite_filepath, cols)

Processing iCite: 146chunk [07:50,  3.22s/chunk]


In [35]:
df_pmid_enriched

Unnamed: 0,coreproject,pmid,applid,doi,title,authors,year,citation_count,relative_citation_ratio
0,R01CA239701,36127808,10902170,,,,,,
1,R21CA209848,29074302,9321971,,,,,,
2,R21CA209848,31387361,9321971,,,,,,
3,R21CA209848,29027980,9321971,,,,,,
4,R21CA209848,29309429,9321971,,,,,,
...,...,...,...,...,...,...,...,...,...
25626937,P50CA196530,35471840,10690040,,,,,,
25626938,P50CA196530,35793873,10690040,,,,,,
25626939,P50CA196530,36509758,10690040,,,,,,
25626940,P50CA196530,36775354,10690040,,,,,,


All iCite values are appearing blank, but other checks show that the iCite data is there.  
Running enrich_df_with_icite only took around 8 minutes, so I'll try loading the full iCite file as-is and manipulating afterwards.
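
One possible explanation for the mostly blank output: a left merge keeps every row of `df_pmid` for every chunk, whether or not that chunk contains a match. With 146 chunks and the 175,527 rows in `df_pmid`, that gives 146 × 175,527 = 25,626,942 rows, which is consistent with the index range shown above. A toy example with made-up data reproduces the effect:

```python
import io

import pandas as pd

# Made-up stand-ins for df_pmid and the iCite CSV
df_pmid_toy = pd.DataFrame({"pmid": [2, 5, 9], "coreproject": ["A", "B", "C"]})
icite_csv = io.StringIO(
    "pmid,citation_count\n"
    "1,10\n2,20\n3,30\n4,40\n5,50\n6,60\n"
)

merged_chunks = []
for chunk in pd.read_csv(icite_csv, chunksize=2):
    # how='left' keeps all 3 input rows for every chunk, matched or not
    merged_chunks.append(pd.merge(df_pmid_toy, chunk, how="left", on="pmid"))

df_left = pd.concat(merged_chunks, ignore_index=True)
n_rows = len(df_left)
n_matched = int(df_left["citation_count"].notna().sum())
print(n_rows, n_matched)
```

Three chunks times three input rows gives nine output rows, and only the rows whose PMID happened to fall inside the current chunk carry iCite values.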

In [39]:
def load_large_csv(filepath, chunk_size=250000):
    """
    Load a large CSV file into a single DataFrame using chunks.

    :param filepath (str): Path to the CSV file.
    :param chunk_size (int): Number of rows to read per chunk.

    :return: pd.DataFrame: Loaded DataFrame.
    """

    # Initialize an empty DataFrame to store the loaded data
    df = pd.DataFrame()

    # Create a tqdm wrapper around the generator to track progress
    chunks = tqdm(pd.read_csv(filepath, chunksize=chunk_size),
                  desc=f"Loading {filepath}", unit="chunk")

    # Iterate through chunks of the DataFrame
    for chunk in chunks:
        # Append the chunk to the result DataFrame
        df = pd.concat([df, chunk], ignore_index=True)

    return df

In [5]:
# Try loading entire iCite csv into a single df in chunks
# df_icite_full = load_large_csv(icite_filepath, chunk_size=250000)

The above attempt was on track to take a few hours. If I hadn't stopped it, it might have succeeded or it might have hit a memory error.  

I'll try to make the merging-by-chunks method work. 
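
Two changes might reduce the load time and memory use if a full load is attempted again: reading only the needed columns with `usecols`, and collecting chunks in a list so that `pd.concat` runs once instead of once per chunk. This is a sketch on a tiny in-memory CSV; `load_csv_columns` is a hypothetical name, and it has not been timed against the real iCite file.

```python
import io

import pandas as pd


def load_csv_columns(filepath_or_buffer, usecols, chunk_size=250000):
    """Load selected columns of a large CSV, concatenating chunks once."""
    reader = pd.read_csv(filepath_or_buffer, usecols=usecols, chunksize=chunk_size)
    # A single concat avoids re-copying the growing DataFrame on every chunk
    chunks = [chunk for chunk in reader]
    return pd.concat(chunks, ignore_index=True)


# Tiny in-memory CSV standing in for the iCite file
sample = io.StringIO(
    "pmid,doi,title,journal\n"
    "1,10.1/a,First,J1\n"
    "2,10.1/b,Second,J2\n"
    "3,10.1/c,Third,J3\n"
)
df_small = load_csv_columns(sample, usecols=["pmid", "doi"], chunk_size=2)
print(df_small)
```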

In [46]:
# Check that the pmid column is the same type in both dfs
df_pmid['pmid'].dtypes == df_icite_sample['pmid'].dtypes

True

Go back and look closely at the `df_pmid_enriched` output. Check whether all iCite values are NaN. If not, explore the rows that have data.

In [50]:
# Show rows without any NaN values
df_pmid_enriched[~df_pmid_enriched.isnull().any(axis=1)]

Unnamed: 0,coreproject,pmid,applid,doi,title,authors,year,citation_count,relative_citation_ratio
883502,R01CA047296,1511878,10650758,10.1016/0378-1119(92)90262-n,A convenient cloning vector containing the GAL...,"L Raycroft, G Lozano",1992.0,3.0,0.05
883508,R01CA053840,1339708,10683285,10.1101/sqb.1992.057.01.012,Protein tyrosine phosphatases: the problems of...,"N K Tonks, Q Yang, A J Flint, M F Gebbink, B R...",1992.0,31.0,0.55
892201,P01CA022443,1457207,10898452,10.1089/aid.1992.8.1611,Studies on the role of the V3 loop in human im...,"S H Chiou, E O Freed, A T Panganiban, W R Kenealy",1992.0,14.0,0.30
892202,P01CA022443,1501283,10898452,10.1128/JVI.66.9.5472-5478.1992,Identification and characterization of fusion ...,"E O Freed, D J Myers",1992.0,48.0,1.13
892220,P01CA022443,1373201,10898452,10.1128/JVI.66.5.3093-3100.1992,5-Azacytidine and RNA secondary structure incr...,"V K Pathak, H M Temin",1992.0,95.0,2.14
...,...,...,...,...,...,...,...,...,...
25499140,P30CA014520,37800093,10905065,10.1055/s-0039-3401815,Physician Burnout and Timing of Electronic Hea...,"Mark A Micek, Brian Arndt, Wen-Jan Tuan, Betsy...",2020.0,1.0,0.11
25577376,P30CA069533,37840910,10888579,10.15695/jstem/v5i2.12,Approaches for Measuring Inclusive Demographic...,"Megan A Mekinda, Sunita Chaudhary, Nathan L Va...",2022.0,0.0,0.00
25583019,P30CA082709,37829495,10695929,10.3389/adar.2022.10400,Sex-Dependent Synaptic Remodeling of the Somat...,"Gregory G Grecco, Jui Yen Huang, Braulio Muñoz...",2022.0,1.0,0.35
25598850,P50CA244688,37854304,10814693,10.1016/j.ssci.2022.105763,Methods to improve the translation of evidence...,"R J Guerin, R E Glasgow, A Tyler, B A Rabin, A...",2022.0,1.0,0.49


In [52]:
# Check number of rows in the input df of pmids with coreproject and applids
len(df_pmid)

175527

The goal should be for the output to have a row for each unique pmid. If no iCite data is available for a given pmid, there should still be a row with the pmid and other values listed as NaN.  
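
A sketch of one way to reach that shape: keep only the matching iCite rows from each chunk with `isin`, then left-merge once onto the unique PMIDs. This is an illustration with made-up data and a hypothetical function name, not the updated function referenced elsewhere in this notebook.

```python
import io

import pandas as pd


def icite_rows_for_pmids(icite_source, pmids, cols, chunk_size=250000):
    """Return one row per unique PMID, with NaN where iCite has no match."""
    unique_pmids = (
        pd.DataFrame({"pmid": list(pmids)}).drop_duplicates().reset_index(drop=True)
    )
    wanted = set(unique_pmids["pmid"])

    matches = []
    for chunk in pd.read_csv(icite_source, usecols=cols, chunksize=chunk_size):
        # Keep only the rows of this chunk whose PMID is in our list
        matches.append(chunk[chunk["pmid"].isin(wanted)])

    df_matches = pd.concat(matches, ignore_index=True)
    # A single left merge preserves PMIDs that iCite does not contain
    return unique_pmids.merge(df_matches, how="left", on="pmid")


# Made-up CSV standing in for the iCite snapshot
icite_csv = io.StringIO(
    "pmid,citation_count,journal\n"
    "1,10,J1\n2,20,J2\n3,30,J3\n4,40,J4\n5,50,J5\n6,60,J6\n"
)
result = icite_rows_for_pmids(
    icite_csv, pmids=[2, 5, 5, 9], cols=["pmid", "citation_count"], chunk_size=2
)
print(result)
```

The output length depends only on the number of unique PMIDs, not on the number of chunks.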

Find our goal number of rows:

In [53]:
# Check number of rows in a list of unique pmids
df_pmid.pmid.nunique()

144658

Try again with tweaks to the merging step

In [77]:
# THERE IS AN UPDATED VERSION OF THIS FUNCTION BELOW
def enrich_df_with_icite(df_pmid, icite_filepath, cols=None, chunk_size=250000):
    """
    Enrich a DataFrame with iCite data using chunks.

    :param df_pmid (pd.DataFrame): DataFrame containing PMID-related data.
    :param icite_filepath (str): Path to the zipped iCite CSV file.
    :param cols (list): List of columns to select from df_pmid before merging.
                   If None, all columns from df_pmid are used.
    :param chunk_size (int): Number of rows to read per chunk.

    :return: pd.DataFrame: Enriched DataFrame with iCite data.
    """

    # If cols is not specified, use all columns from df_pmid
    if cols is None:
        cols = df_pmid.columns.tolist()

    # Initialize an empty list to store the enriched data chunks
    df_enriched_chunks = []

    # Create a tqdm wrapper around the generator to track progress
    chunks = tqdm(pd.read_csv(icite_filepath, compression='zip', chunksize=chunk_size),
                  desc="Processing iCite", unit="chunk")

    # Iterate through chunks of the iCite DataFrame
    for chunk in chunks:
        # Merge only the specified columns from df_pmid where 'pmid' values match
        df_merged = pd.merge(df_pmid, chunk[cols], how='left', on='pmid')

        # Append the merged DataFrame to the list
        df_enriched_chunks.append(df_merged)

    # Concatenate all the chunks into the final enriched DataFrame
    df_enriched = pd.concat(df_enriched_chunks, ignore_index=True)

    return df_enriched


In [78]:
# Try again
df_pmid_enriched = enrich_df_with_icite(df_pmid, icite_filepath, cols, chunk_size=250000)

Processing iCite: 0chunk [00:00, ?chunk/s]

Processing iCite: 146chunk [06:07,  2.52s/chunk]


In [80]:
df_pmid_enriched

Unnamed: 0,coreproject,pmid,applid,doi,title,authors,year,citation_count,relative_citation_ratio
0,R01CA239701,36127808,10902170,,,,,,
1,R21CA209848,29074302,9321971,,,,,,
2,R21CA209848,31387361,9321971,,,,,,
3,R21CA209848,29027980,9321971,,,,,,
4,R21CA209848,29309429,9321971,,,,,,
...,...,...,...,...,...,...,...,...,...
25626937,P50CA196530,35471840,10690040,,,,,,
25626938,P50CA196530,35793873,10690040,,,,,,
25626939,P50CA196530,36509758,10690040,,,,,,
25626940,P50CA196530,36775354,10690040,,,,,,


Way too many rows are showing up (over 25 million, versus 175,527 in df_pmid), and most of them have NaN in every iCite column...

In [81]:
# Check for rows where at least one of the iCite col values is not NaN
df_pmid_enriched.dropna(subset=['doi',
                                'title',
                                'authors',
                                'year',
                                'citation_count',
                                'relative_citation_ratio'], how='all')

Unnamed: 0,coreproject,pmid,applid,doi,title,authors,year,citation_count,relative_citation_ratio
883502,R01CA047296,1511878,10650758,10.1016/0378-1119(92)90262-n,A convenient cloning vector containing the GAL...,"L Raycroft, G Lozano",1992.0,3.0,0.05
883508,R01CA053840,1339708,10683285,10.1101/sqb.1992.057.01.012,Protein tyrosine phosphatases: the problems of...,"N K Tonks, Q Yang, A J Flint, M F Gebbink, B R...",1992.0,31.0,0.55
891728,K12HD000849,1325845,10746928,,"Changes in Na,K-ATPase gene expression during ...","S K Chambers, M Gilmore-Hebert, B M Kacinski, ...",1992.0,11.0,0.27
892201,P01CA022443,1457207,10898452,10.1089/aid.1992.8.1611,Studies on the role of the V3 loop in human im...,"S H Chiou, E O Freed, A T Panganiban, W R Kenealy",1992.0,14.0,0.30
892202,P01CA022443,1501283,10898452,10.1128/JVI.66.9.5472-5478.1992,Identification and characterization of fusion ...,"E O Freed, D J Myers",1992.0,48.0,1.13
...,...,...,...,...,...,...,...,...,...
25625277,P50CA217674,37737674,10687031,10.14309/ajg.0000000000002508,A Study of Dietary Patterns Derived by Cluster...,"Xiaotao Zhang, Carrie R Daniel, Valeria Solter...",2023.0,0.0,
25625284,P50CA217674,37835569,10687031,10.3390/cancers15194875,The Gut Microbiome as a Biomarker and Therapeu...,"Betul Gok Yavuz, Saumil Datar, Shadi Chamseddi...",2023.0,0.0,
25625290,U54CA274367,37873404,10697365,10.1101/2023.09.30.560293,A Specialized Epithelial Cell Type Regulating ...,"Jia Li, Alan J Simmons, Sophie Chiron, Marisol...",2023.0,0.0,
25625295,U54CA274371,37745323,10708199,10.1101/2023.09.17.557982,Digitize your Biology! Modeling multicellular ...,"Jeanette A I Johnson, Genevieve L Stein-O'Brie...",2023.0,0.0,


In [82]:
# Get number of unique values in all cols in input
df_pmid.nunique()

coreproject      1771
pmid           144658
applid           1783
dtype: int64

In [86]:
# Check to see how many rows have each unique coreproject-pmid combination
df_pmid_enriched.groupby(['coreproject','pmid']).size().reset_index().sort_values(by=0, ascending = False)

Unnamed: 0,coreproject,pmid,0
64337,P30CA016672,29904738,292
95152,P30CA051008,9795182,292
8660,P30CA006973,19380450,292
18894,P30CA008748,35293090,292
95155,P30CA051008,9846989,292
...,...,...,...
56815,P30CA016520,23649625,146
56816,P30CA016520,23658517,146
56817,P30CA016520,23664401,146
56818,P30CA016520,23666239,146


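Still seeing many empty duplicate rows. The minimum count per coreproject-pmid combination is 146, which equals the number of chunks, and the maximum is 292 (2 x 146). The cause is the merge step: a left merge returns every row of df_pmid for every chunk, whether or not that chunk contains a matching pmid, so concatenating the chunks repeats each input row 146 times.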
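
A toy example reproduces the effect on a small scale. All values here are made up, and the two-row "chunks" stand in for the chunked iCite read. Left-merging the full input against each chunk and concatenating gives one copy of every input row per chunk, with real data in at most one copy.

```python
import pandas as pd

# Toy stand-ins for df_pmid and the iCite file (all values are made up)
toy_pmid = pd.DataFrame({"coreproject": ["A", "A", "B"],
                         "pmid": [1, 2, 3]})
toy_icite = pd.DataFrame({"pmid": [1, 2, 3, 4],
                          "citation_count": [10, 20, 30, 40]})

# Split the toy iCite table into 2 "chunks" of 2 rows each
toy_chunks = [toy_icite.iloc[i:i + 2] for i in range(0, len(toy_icite), 2)]

# Same pattern as the function above: left-merge the full input per chunk
merged = pd.concat(
    [pd.merge(toy_pmid, chunk, how="left", on="pmid") for chunk in toy_chunks],
    ignore_index=True,
)

# Each input row appears once per chunk, matched in at most one of them
rows_per_pair = merged.groupby(["coreproject", "pmid"]).size()
print(len(merged))              # 3 input rows x 2 chunks
print(rows_per_pair.tolist())
```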

In [9]:
# THIS FUNCTION IS IMPROVED AND REPLACED BY `get_icite_data_for_pmids` BELOW
def enrich_df_with_icite(df_pmid, icite_filepath, cols=None, chunk_size=250000):
    """
    Enrich a DataFrame with iCite data using chunks.

    :param df_pmid (pd.DataFrame): DataFrame containing PMID-related data.
    :param icite_filepath (str): Path to the zipped iCite CSV file.
    :param cols (list): List of columns to select from df_pmid before merging.
                   If None, all columns from df_pmid are used.
    :param chunk_size (int): Number of rows to read per chunk.

    :return: pd.DataFrame: Enriched DataFrame with iCite data.
    """

    # If cols is not specified, use all columns from df_pmid
    if cols is None:
        cols = df_pmid.columns.tolist()

    # Initialize an empty list to store the enriched data chunks
    df_enriched_chunks = []

    # Create a tqdm wrapper around the generator to track progress
    chunks = tqdm(pd.read_csv(icite_filepath, compression='zip', chunksize=chunk_size),
                  desc="Processing iCite", unit="chunk")

    # Iterate through chunks of the iCite DataFrame
    for chunk in chunks:
        # Merge only the specified columns from df_pmid where 'pmid' values match
        df_merged = pd.merge(df_pmid, chunk[cols], how='left', on='pmid')

        # Drop rows where all 'cols' except 'pmid' are blank
        df_merged = df_merged.dropna(subset=cols[1:], how='all')

        # Append the merged DataFrame to the list
        df_enriched_chunks.append(df_merged)

    # Concatenate all the chunks into the final enriched DataFrame
    df_enriched = pd.concat(df_enriched_chunks, ignore_index=True)

    return df_enriched


In [93]:
# Try again
df_pmid_enriched = enrich_df_with_icite(df_pmid, icite_filepath, cols, chunk_size=250000)
df_pmid_enriched

Processing iCite: 146chunk [06:25,  2.64s/chunk]


Unnamed: 0,coreproject,pmid,applid,doi,title,authors,year,citation_count,relative_citation_ratio
0,R01CA047296,1511878,10650758,10.1016/0378-1119(92)90262-n,A convenient cloning vector containing the GAL...,"L Raycroft, G Lozano",1992.0,3.0,0.05
1,R01CA053840,1339708,10683285,10.1101/sqb.1992.057.01.012,Protein tyrosine phosphatases: the problems of...,"N K Tonks, Q Yang, A J Flint, M F Gebbink, B R...",1992.0,31.0,0.55
2,K12HD000849,1325845,10746928,,"Changes in Na,K-ATPase gene expression during ...","S K Chambers, M Gilmore-Hebert, B M Kacinski, ...",1992.0,11.0,0.27
3,P01CA022443,1457207,10898452,10.1089/aid.1992.8.1611,Studies on the role of the V3 loop in human im...,"S H Chiou, E O Freed, A T Panganiban, W R Kenealy",1992.0,14.0,0.30
4,P01CA022443,1501283,10898452,10.1128/JVI.66.9.5472-5478.1992,Identification and characterization of fusion ...,"E O Freed, D J Myers",1992.0,48.0,1.13
...,...,...,...,...,...,...,...,...,...
175442,P50CA217674,37737674,10687031,10.14309/ajg.0000000000002508,A Study of Dietary Patterns Derived by Cluster...,"Xiaotao Zhang, Carrie R Daniel, Valeria Solter...",2023.0,0.0,
175443,P50CA217674,37835569,10687031,10.3390/cancers15194875,The Gut Microbiome as a Biomarker and Therapeu...,"Betul Gok Yavuz, Saumil Datar, Shadi Chamseddi...",2023.0,0.0,
175444,U54CA274367,37873404,10697365,10.1101/2023.09.30.560293,A Specialized Epithelial Cell Type Regulating ...,"Jia Li, Alan J Simmons, Sophie Chiron, Marisol...",2023.0,0.0,
175445,U54CA274371,37745323,10708199,10.1101/2023.09.17.557982,Digitize your Biology! Modeling multicellular ...,"Jeanette A I Johnson, Genevieve L Stein-O'Brie...",2023.0,0.0,


This looks better, but it still drops any pmid rows that have no iCite data.

In [99]:
# Merge enriched df back to the df_pmid to fill the blanks
df_pmid_icite = pd.merge(df_pmid, df_pmid_enriched, how='left', on=['coreproject','pmid','applid'])
df_pmid_icite

Unnamed: 0,coreproject,pmid,applid,doi,title,authors,year,citation_count,relative_citation_ratio
0,R01CA239701,36127808,10902170,10.1002/cam4.5266,"Genetic ancestry, differential gene expression...","Freddy A Barragan, Lauren J Mills, Andrew R Ra...",2023.0,4.0,
1,R21CA209848,29074302,9321971,10.1016/j.imbio.2017.10.028,Endogenous antibody responses to mucin 1 in a ...,"Janardan P Pandey, Aryan M Namboodiri, Bethany...",2018.0,2.0,0.09
2,R21CA209848,31387361,9321971,10.1161/CIRCULATIONAHA.119.038376,Defects in the Exocyst-Cilia Machinery Cause B...,"Diana Fulmer, Katelynn Toomer, Lilong Guo, Kel...",2019.0,34.0,1.95
3,R21CA209848,29027980,9321971,10.3390/genes8100269,The Plasticizer Bisphenol A Perturbs the Hepat...,"Ludivine Renaud, Willian A da Silveira, E Star...",2017.0,20.0,1.17
4,R21CA209848,29309429,9321971,10.1371/journal.pone.0190949,ShinyGPA: An interactive visualization toolkit...,"Emma Kortemeier, Paula S Ramos, Kelly J Hunt, ...",2018.0,2.0,0.08
...,...,...,...,...,...,...,...,...,...
194594,P50CA196530,35471840,10690040,10.1146/annurev-immunol-070621-030155,Resistance Mechanisms to Anti-PD Cancer Immuno...,"Matthew D Vesely, Tianxiang Zhang, Lieping Chen",2022.0,71.0,12.39
194595,P50CA196530,35793873,10690040,10.1136/jitc-2022-005025,Quantitative tissue analysis and role of myelo...,"Brian S Henick, Franz Villarroel-Espindola, Il...",2022.0,1.0,0.16
194596,P50CA196530,36509758,10690040,10.1038/s41467-022-34889-z,Brain metastatic outgrowth and osimertinib res...,"Sally J Adua, Anna Arnal-Estapé, Minghui Zhao,...",2022.0,1.0,0.19
194597,P50CA196530,36775354,10690040,10.1038/s41374-022-00796-6,Quantitative assessment of Siglec-15 expressio...,"Saba Shafi, Thazin Nwe Aung, Vasiliki Xirou, N...",2022.0,3.0,0.43


Too many rows appear when the original df_pmid is merged back with the enriched DataFrame: the result has over 194,000 rows, versus 175,527 in df_pmid...
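
One mechanism that would inflate the row count like this, not confirmed for this dataset, is a merge where the same key values are repeated on both sides. pandas then returns every pairing of the matching rows. The toy frames below use made-up values to show the effect.

```python
import pandas as pd

# Toy frames (made-up values) where the same key row appears twice on each side
left = pd.DataFrame({"coreproject": ["A", "A", "B"],
                     "pmid": [1, 1, 2]})
right = pd.DataFrame({"coreproject": ["A", "A", "B"],
                      "pmid": [1, 1, 2],
                      "citation_count": [10, 10, 5]})

merged = pd.merge(left, right, how="left", on=["coreproject", "pmid"])

# The duplicated key (A, 1) yields 2 x 2 = 4 rows; (B, 2) yields 1
print(len(left), len(merged))
print(merged.groupby(["coreproject", "pmid"]).size().tolist())
```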

In [4]:
df_pmid

Unnamed: 0,coreproject,pmid,applid
0,R01CA239701,36127808,10902170
1,R21CA209848,29074302,9321971
2,R21CA209848,31387361,9321971
3,R21CA209848,29027980,9321971
4,R21CA209848,29309429,9321971
...,...,...,...
175522,P50CA196530,35471840,10690040
175523,P50CA196530,35793873,10690040
175524,P50CA196530,36509758,10690040
175525,P50CA196530,36775354,10690040


In [5]:
# Check for duplicates of the same coreproject-pmid combination 
df_pmid.groupby(['coreproject','pmid']).size().reset_index().sort_values(by=0, ascending=False)

Unnamed: 0,coreproject,pmid,0
64337,P30CA016672,29904738,2
95152,P30CA051008,9795182,2
8660,P30CA006973,19380450,2
18894,P30CA008748,35293090,2
95155,P30CA051008,9846989,2
...,...,...,...
56815,P30CA016520,23649625,1
56816,P30CA016520,23658517,1
56817,P30CA016520,23664401,1
56818,P30CA016520,23666239,1


I'm still getting odd numbers and unintuitive counts when trying to merge the original pmid list with the data pulled from iCite. This may be due to the extra 'coreproject' and 'applid' columns in df_pmid causing duplications or other problems during the merge.  

Try a new approach focused on pulling unique pmids and adding iCite data for each. The output will not contain coreproject or applid. The coreproject column can be joined back in at the end using pmid.
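
As a sketch of that final join, the project-to-PMID links can be de-duplicated and then left-merged onto the per-PMID iCite table. The toy frames and the name `df_project_pubs` are illustrative assumptions, not part of the existing modules.

```python
import pandas as pd

# Toy stand-ins (made-up values): project links with a repeated pair,
# and a per-PMID table with exactly one row per pmid
toy_links = pd.DataFrame({"coreproject": ["A", "A", "B", "B"],
                          "pmid": [1, 1, 1, 2],
                          "applid": [100, 101, 200, 200]})
toy_icite = pd.DataFrame({"pmid": [1, 2],
                          "citation_count": [10.0, None]})

# Keep one row per coreproject-pmid pair before joining
pairs = toy_links[["coreproject", "pmid"]].drop_duplicates()

# many_to_one makes pandas raise if toy_icite ever has duplicate pmids
df_project_pubs = pd.merge(pairs, toy_icite, how="left", on="pmid",
                           validate="many_to_one")
print(df_project_pubs)
```

Because the right-hand table has one row per pmid, the result has exactly as many rows as there are unique coreproject-pmid pairs.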

In [36]:
# THERE IS AN UPDATED VERSION OF THIS FUNCTION BELOW
def get_icite_data_for_pmids(df_pmid, icite_filepath, cols, 
                             chunk_size=250000, chunk_count_est=None):
    """
    Get iCite data for unique PMIDs from a DataFrame using chunks.

    :param df_pmid (pd.DataFrame): DataFrame containing PMID-related data.
    :param icite_filepath (str): Path to the zipped iCite CSV file.
    :param cols (list): Columns to pull from iCite and include in output df
    :param chunk_size (int): Number of rows to read per chunk.
    :param chunk_count_est (int): Estimated number of chunks. If None, ignored

    :return: pd.DataFrame: DataFrame with specified columns
    """
    # Get unique PMIDs from df_pmid
    unique_pmids = df_pmid['pmid'].unique()

    # Initialize an empty list to store the enriched data chunks
    df_enriched_chunks = []

    # Create a tqdm wrapper around the generator to track progress
    chunks = tqdm(pd.read_csv(icite_filepath, compression='zip', chunksize=chunk_size),
                  desc="Processing iCite", unit="chunk", total=chunk_count_est)

    # Iterate through chunks of the iCite DataFrame
    for chunk in chunks:
        # Filter the chunk to include only rows with PMIDs in unique_pmids
        chunk_filtered = chunk[chunk['pmid'].isin(unique_pmids)]

        # Append the filtered DataFrame to the list with selected columns
        df_enriched_chunks.append(chunk_filtered[cols])

    # Concatenate all the chunks into the final enriched DataFrame
    df_enriched = pd.concat(df_enriched_chunks, ignore_index=True)

    return df_enriched



In [11]:
# Try the new function that doesn't include coreproject or applid in the output
df_pmid_icite = get_icite_data_for_pmids(df_pmid, icite_filepath, cols)
df_pmid_icite

Processing iCite: 0chunk [00:00, ?chunk/s]

Processing iCite: 146chunk [05:55,  2.44s/chunk]


Unnamed: 0,pmid,doi,title,authors,year,citation_count,relative_citation_ratio
0,1279509,10.1203/00006450-199210000-00018,Expression and regulation of L-selectin on eos...,"J B Smith, R D Kunjummen, T K Kishimoto, D C A...",1992,25,0.67
1,1280555,10.1002/cyto.990130707,Streptavidin-based quantitative staining of in...,"P Srivastava, T L Sladek, M N Goodman, J W Jac...",1992,14,0.43
2,1281066,10.1002/cyto.990130808,"Reticulocyte quantification by flow cytometry,...","K J Schimenti, K Lacerna, A Wamble, L Maston, ...",1992,38,1.45
3,1282437,10.1101/gr.2.2.137,Development of a sensitive reverse transcripta...,"S S Tan, J H Weis",1992,49,1.00
4,1283327,10.1002/gcc.2870050414,Sublocalization of the chromosome 5 breakpoint...,"S W Morris, J T Foust, M B Valentine, W M Robe...",1992,14,0.29
...,...,...,...,...,...,...,...
144591,37928187,10.3389/fcimb.2023.1270935,Co-infection and co-localization of Kaposi sar...,"Peter Julius, Guobin Kang, Stepfanie Siyumbwa,...",2023,0,
144592,37928542,10.3389/fimmu.2023.1236514,Features of the TCR repertoire associate with ...,"Mateusz Pospiech, Mukund Tamizharasan, Yu-Chun...",2023,0,
144593,37928819,10.1007/978-3-031-33842-7_6,Leveraging 2D Deep Learning ImageNet-trained m...,"Bhakti Baheti, Sarthak Pati, Bjoern Menze, Spy...",2023,0,
144594,37930127,10.1097/HC9.0000000000000315,Patient-reported symptoms and interest in symp...,"Andrew M Moon, Sarah Cook, Rachel M Swier, Han...",2023,0,


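Looks good at a glance. Runtime is about 6 min. The output has slightly fewer rows than the 144,658 unique PMIDs, because PMIDs not found in iCite are left out.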

In [12]:
# Check the expected row count again
df_pmid.pmid.nunique()

144658

In [13]:
# Check for duplicate pmids in the icite-enriched df
df_pmid_icite.groupby('pmid').size().reset_index().sort_values(by=0,ascending=False)

Unnamed: 0,pmid,0
0,1279509,1
96401,30131384,1
96395,30130351,1
96396,30130469,1
96397,30130544,1
...,...,...
48191,23153072,1
48190,23152935,1
48189,23152800,1
48188,23152566,1


In [14]:
# Try approach where unique pmids are a 1-column df instead of a list
df_unique_pmids = pd.DataFrame(df_pmid['pmid'].unique().tolist(), columns=['pmid'])
df_unique_pmids

Unnamed: 0,pmid
0,36127808
1,29074302
2,31387361
3,29027980
4,29309429
...,...
144653,28651562
144654,31887619
144655,32251621
144656,33148678


In [15]:
# Fill in the blanks in the enriched df with any pmids not found in icite
test = pd.merge(df_unique_pmids, df_pmid_icite, how='left', on='pmid')
test

Unnamed: 0,pmid,doi,title,authors,year,citation_count,relative_citation_ratio
0,36127808,10.1002/cam4.5266,"Genetic ancestry, differential gene expression...","Freddy A Barragan, Lauren J Mills, Andrew R Ra...",2023.0,4.0,
1,29074302,10.1016/j.imbio.2017.10.028,Endogenous antibody responses to mucin 1 in a ...,"Janardan P Pandey, Aryan M Namboodiri, Bethany...",2018.0,2.0,0.09
2,31387361,10.1161/CIRCULATIONAHA.119.038376,Defects in the Exocyst-Cilia Machinery Cause B...,"Diana Fulmer, Katelynn Toomer, Lilong Guo, Kel...",2019.0,34.0,1.95
3,29027980,10.3390/genes8100269,The Plasticizer Bisphenol A Perturbs the Hepat...,"Ludivine Renaud, Willian A da Silveira, E Star...",2017.0,20.0,1.17
4,29309429,10.1371/journal.pone.0190949,ShinyGPA: An interactive visualization toolkit...,"Emma Kortemeier, Paula S Ramos, Kelly J Hunt, ...",2018.0,2.0,0.08
...,...,...,...,...,...,...,...
144653,28651562,10.1186/s12859-017-1711-z,GRAPE: a pathway template method to characteri...,"Michael I Klein, David F Stern, Hongyu Zhao",2017.0,6.0,0.18
144654,31887619,10.1016/j.oraloncology.2019.104554,Transfer RNA methyltransferase gene NSUN2 mRNA...,"Lingeng Lu, Stephen G Gaffney, Vincent L Canna...",2020.0,15.0,0.94
144655,32251621,10.1016/S1470-2045(20)30111-X,Pembrolizumab for management of patients with ...,"Sarah B Goldberg, Kurt A Schalper, Scott N Get...",2020.0,276.0,18.48
144656,33148678,10.1158/1940-6207.CAPR-20-0394,Clearing the Haze: What Do We Still Need to Le...,"Lisa M Fucito, Hannah Malinosky, Stephen R Bal...",2021.0,0.0,0.00


In [16]:
# Make a list of columns specific to iCite data that excludes pmid
icite_cols =  [ 'doi',
                'title',
                'authors',
                'year',
                'citation_count',
                'relative_citation_ratio']

In [17]:
# Check for rows where at least one of the iCite col values is not NaN
test.dropna(subset=icite_cols, how='all')

Unnamed: 0,pmid,doi,title,authors,year,citation_count,relative_citation_ratio
0,36127808,10.1002/cam4.5266,"Genetic ancestry, differential gene expression...","Freddy A Barragan, Lauren J Mills, Andrew R Ra...",2023.0,4.0,
1,29074302,10.1016/j.imbio.2017.10.028,Endogenous antibody responses to mucin 1 in a ...,"Janardan P Pandey, Aryan M Namboodiri, Bethany...",2018.0,2.0,0.09
2,31387361,10.1161/CIRCULATIONAHA.119.038376,Defects in the Exocyst-Cilia Machinery Cause B...,"Diana Fulmer, Katelynn Toomer, Lilong Guo, Kel...",2019.0,34.0,1.95
3,29027980,10.3390/genes8100269,The Plasticizer Bisphenol A Perturbs the Hepat...,"Ludivine Renaud, Willian A da Silveira, E Star...",2017.0,20.0,1.17
4,29309429,10.1371/journal.pone.0190949,ShinyGPA: An interactive visualization toolkit...,"Emma Kortemeier, Paula S Ramos, Kelly J Hunt, ...",2018.0,2.0,0.08
...,...,...,...,...,...,...,...
144653,28651562,10.1186/s12859-017-1711-z,GRAPE: a pathway template method to characteri...,"Michael I Klein, David F Stern, Hongyu Zhao",2017.0,6.0,0.18
144654,31887619,10.1016/j.oraloncology.2019.104554,Transfer RNA methyltransferase gene NSUN2 mRNA...,"Lingeng Lu, Stephen G Gaffney, Vincent L Canna...",2020.0,15.0,0.94
144655,32251621,10.1016/S1470-2045(20)30111-X,Pembrolizumab for management of patients with ...,"Sarah B Goldberg, Kurt A Schalper, Scott N Get...",2020.0,276.0,18.48
144656,33148678,10.1158/1940-6207.CAPR-20-0394,Clearing the Haze: What Do We Still Need to Le...,"Lisa M Fucito, Hannah Malinosky, Stephen R Bal...",2021.0,0.0,0.00


In [18]:
# Get rows where ALL icite-specific values are NaN
test[test[icite_cols].isna().all(axis=1)]

Unnamed: 0,pmid,doi,title,authors,year,citation_count,relative_citation_ratio
844,37945900,,,,,,
2377,31993221,,,,,,
3020,37932419,,,,,,
3858,37947614,,,,,,
3880,37936688,,,,,,
...,...,...,...,...,...,...,...
132248,31026409,,,,,,
138240,37945902,,,,,,
140474,32123530,,,,,,
142744,25513417,,,,,,


In [19]:
# Get rows where ANY icite-specific values are NaN
test[test[icite_cols].isna().any(axis=1)]

Unnamed: 0,pmid,doi,title,authors,year,citation_count,relative_citation_ratio
0,36127808,10.1002/cam4.5266,"Genetic ancestry, differential gene expression...","Freddy A Barragan, Lauren J Mills, Andrew R Ra...",2023.0,4.0,
7,30755818,,Synergistic effects of SHP2 and PI3K pathway i...,"Bowen Sun, Nathaniel R Jensen, Dongjun Chung, ...",2019.0,13.0,0.64
11,35895854,10.1111/biom.13727,A Bayesian multivariate mixture model for high...,"Carter Allen, Yuzhou Chang, Brian Neelon, Won ...",2023.0,2.0,
23,37715500,10.1002/sim.9911,A Bayesian framework for pathway-guided identi...,"Zequn Sun, Dongjun Chung, Brian Neelon, Andrew...",2023.0,0.0,
70,37335961,10.1200/CCI.22.00138,BatMan: Mitigating Batch Effects Via Stratific...,"Ai Ni, Mengling Liu, Li-Xuan Qin",2023.0,0.0,
...,...,...,...,...,...,...,...
144593,35338489,10.1111/biom.13665,Clustering high-dimensional data via feature s...,"Tianqi Liu, Yu Lu, Biqing Zhu, Hongyu Zhao",2023.0,2.0,
144619,37121213,10.1016/j.lungcan.2023.107211,Comparative genomics between matched solid and...,"Gavitt A Woodard, Vivianne Ding, Christina Cho...",2023.0,1.0,
144633,36960400,10.1158/2767-9764.CRC-22-0334,"Quantitative, Spatially Defined Expression of ...","Thazin N Aung, Niki Gavrielatou, Ioannis A Vat...",2023.0,0.0,
144634,37365174,10.1038/s41467-023-39514-1,An optogenetic-phosphoproteomic study reveals ...,"Wenping Zhou, Wenxue Li, Shisheng Wang, Barbor...",2023.0,0.0,


### Detour to improve progress estimations with tqdm

In [23]:
import zipfile
import io

def get_csv_rowcount(filepath, encoding='utf-8'):
    """
    Get the number of rows in a CSV file inside a zip archive.

    :param filepath (str): Path to the zipped CSV file.
    :param encoding (str): Encoding of the CSV file. Default 'utf-8'.

    :return: int: Number of rows in the CSV file.
    """

    with zipfile.ZipFile(filepath, 'r') as zipped_file:
        # Assuming the CSV file is the first (or only) file in the zip archive
        csv_filename = zipped_file.namelist()[0]
        
        with zipped_file.open(csv_filename) as csv_file:
            # Use io.TextIOWrapper to handle decoding from bytes to text
            with io.TextIOWrapper(csv_file, encoding=encoding) as f:
                return sum(1 for line in f)

In [28]:
icite_file_rowcount = get_csv_rowcount(icite_filepath)

In [30]:
chunk_size = 250000
print(f"Row count:   {icite_file_rowcount:>10}\n"
      f"Chunk size:  {chunk_size:>10}\n"
      f"Chunk count: {-(-icite_file_rowcount // chunk_size):>10}") # Divide and use negatives to round up

Row count:     36458403
Chunk size:      250000
Chunk count:        146


Counting the rows takes longer than desired (~2.5 min), but could still be valuable. The total number of chunks can be passed to tqdm as `total` to give a more accurate progress bar. 
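
The chunk count can be derived from the line count with a ceiling division. This is a minimal sketch, assuming the counted lines include one header row that `pd.read_csv` does not return as data. `estimate_chunk_count` is a hypothetical helper name.

```python
import math

def estimate_chunk_count(line_count, chunk_size, header_lines=1):
    """Estimate how many chunks pd.read_csv will yield for a CSV file."""
    # Header lines are not returned as data rows, so exclude them
    data_rows = max(line_count - header_lines, 0)
    # Round up so a final partial chunk is counted
    return math.ceil(data_rows / chunk_size)

# Line count and chunk size taken from the cell output above
print(estimate_chunk_count(36458403, 250000))
```

For this file the header makes no difference to the result, but it would whenever the data row count sits exactly on a chunk boundary.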

Finalize the iCite function. Add step to use unique PMID dataframe so that any PMIDs without iCite data will still be listed (with NaN for all iCite columns).  
Also add the ability to specify estimated total chunk count for the tqdm progress bar. 

In [40]:
def get_icite_data_for_pmids(df_pmid, icite_filepath, cols, 
                             chunk_size=250000, chunk_count_est=None):
    """
    Get iCite data for unique PMIDs from a DataFrame using chunks.

    :param df_pmid (pd.DataFrame): DataFrame containing PMID-related data.
    :param icite_filepath (str): Path to the zipped iCite CSV file.
    :param cols (list): Columns to pull from iCite and include in output df
    :param chunk_size (int): Number of rows to read per chunk.
    :param chunk_count_est (int): Estimated number of chunks. If None, ignored

    :return: pd.DataFrame: DataFrame with specified columns
    """
    # Get unique PMIDs from df_pmid
    unique_pmids = df_pmid['pmid'].unique()

    # Initialize an empty list to store the enriched data chunks
    df_enriched_chunks = []

    # Create a tqdm wrapper around the generator to track progress
    chunks = tqdm(pd.read_csv(icite_filepath, compression='zip', 
                              chunksize=chunk_size),
                  desc="Processing iCite", unit="chunk", total=chunk_count_est)

    # Iterate through chunks of the iCite DataFrame
    for chunk in chunks:
        # Filter the chunk to include only rows with PMIDs in unique_pmids
        chunk_filtered = chunk[chunk['pmid'].isin(unique_pmids)]

        # Append the filtered DataFrame to the list with selected columns
        df_enriched_chunks.append(chunk_filtered[cols])

    # Concatenate all the chunks into the final enriched DataFrame
    df_enriched = pd.concat(df_enriched_chunks, ignore_index=True)

    # Include unique PMIDs not found in iCite with NaN in iCite columns
    df_unique_pmids = pd.DataFrame(df_pmid['pmid'].unique(), columns=['pmid'])
    df_pmids_with_icite = pd.merge(df_unique_pmids, df_enriched, 
                                   how='left', on='pmid')

    # Sort by PMID for consistent handling
    df_pmids_with_icite.sort_values(by='pmid', inplace=True, ignore_index=True)

    return df_pmids_with_icite


In [41]:
df_pmids_with_icite = get_icite_data_for_pmids(df_pmid, icite_filepath, cols, 
                                               chunk_size=250000, 
                                               chunk_count_est=146)

Processing iCite:   0%|          | 0/146 [00:00<?, ?chunk/s]

Processing iCite: 100%|██████████| 146/146 [05:50<00:00,  2.40s/chunk]


In [44]:
df_pmids_with_icite

Unnamed: 0,pmid,doi,title,authors,year,citation_count,relative_citation_ratio
0,1279509,10.1203/00006450-199210000-00018,Expression and regulation of L-selectin on eos...,"J B Smith, R D Kunjummen, T K Kishimoto, D C A...",1992.0,25.0,0.67
1,1280555,10.1002/cyto.990130707,Streptavidin-based quantitative staining of in...,"P Srivastava, T L Sladek, M N Goodman, J W Jac...",1992.0,14.0,0.43
2,1281066,10.1002/cyto.990130808,"Reticulocyte quantification by flow cytometry,...","K J Schimenti, K Lacerna, A Wamble, L Maston, ...",1992.0,38.0,1.45
3,1282437,10.1101/gr.2.2.137,Development of a sensitive reverse transcripta...,"S S Tan, J H Weis",1992.0,49.0,1.00
4,1283327,10.1002/gcc.2870050414,Sublocalization of the chromosome 5 breakpoint...,"S W Morris, J T Foust, M B Valentine, W M Robe...",1992.0,14.0,0.29
...,...,...,...,...,...,...,...
144653,37947333,,,,,,
144654,37947334,,,,,,
144655,37947335,,,,,,
144656,37947337,,,,,,


In [47]:
# Export as csv for development checkpoint use
df_pmids_with_icite.to_csv('pmids_with_icite_20231201.csv', index=False)

The CSV export behaves oddly when opened in Excel. The author list overflows into the next line, causing an isolated shift in that record.  
This seems to be due to very long author lists hitting the Excel cell character limit of 32,767.  

It is not a problem with the dataframe or CSV itself, and should not cause problems when loaded downstream. 
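
To list the affected records without opening Excel, the length of each author string can be compared against that limit. This is a minimal sketch on a toy frame with made-up values. `EXCEL_CELL_CHAR_LIMIT` and `find_overlong_cells` are illustrative names.

```python
import pandas as pd

EXCEL_CELL_CHAR_LIMIT = 32767  # maximum characters Excel stores in one cell

def find_overlong_cells(df, column, limit=EXCEL_CELL_CHAR_LIMIT):
    """Return rows of df whose text in `column` is longer than `limit`."""
    lengths = df[column].str.len()   # missing values stay missing
    return df[lengths > limit]       # missing lengths compare as False

# Toy frame: one short author list, one overlong, one missing
toy = pd.DataFrame({"pmid": [1, 2, 3],
                    "authors": ["A Author, B Author", "X" * 40000, None]})
print(find_overlong_cells(toy, "authors")["pmid"].tolist())
```

Run against `df_pmids_with_icite`, the same call would return the records that Excel is expected to split.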

In [54]:
# The author name 'Petrovski' is split in two when loaded in Excel. Check that it is present in df
df_pmids_with_icite[df_pmids_with_icite['pmid'] == 33634751]['authors'].str.contains('Goran Petrovski')

119879    True
Name: authors, dtype: bool