<a href="https://colab.research.google.com/github/justindlc/python-group-test/blob/main/pubmed_coauthorship_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup
This script uses the library [Metapub](https://github.com/metapub/metapub) to search the E-Utilities API and the [Bio.Entrez package](https://biopython.org/docs/1.75/api/Bio.Entrez.html) to retrieve article information from E-Utilities eSummary records. The script was designed for use in Google Colaboratory, but can also be downloaded to run locally (in Jupyter Notebook, or, with slight modifications, in other environments).
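
When the notebook runs outside Colab, the `google.colab` import and the Drive mount are not available. One possible way to detect the environment is sketched below; the helper name `in_colab` and both file paths are assumptions for illustration, not part of the original script.

```python
def in_colab():
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only available inside Colab)
    except ImportError:
        return False
    return True


# Hypothetical usage: choose where to look for the input CSV.
# Both paths are placeholders; replace them with your own locations.
csv_path = (
    '/content/drive/MyDrive/path-to-your-file.csv'
    if in_colab()
    else 'path-to-your-file.csv'
)
print(csv_path)
```

With a check like this, the Drive mount and `files.download` calls could be skipped when running locally.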

In [None]:
# Install metapub and Biopython (the package that provides the Bio module)
!pip install metapub
!pip install biopython

# Import libraries
import csv
import pandas as pd

from metapub import PubMedFetcher
from Bio import Entrez
from google.colab import drive, files

In [None]:
# Mount google drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Read in your CSV file and preview it
# Make sure to change the info in the quotes '' to the location of your CSV file
df = pd.read_csv('/content/drive/MyDrive/path-to-your-file.csv', skipinitialspace=True)
df

Unnamed: 0,FirstName,MI,LastName,Publishes As,OrcID
0,Genevieve,,Milliken,,0000-0002-3057-0659
1,Nicole,,Contaxis,,0000-0001-5279-1623
2,Alisa,,Surkis,,0000-0002-9777-2693
3,Fred,WZ,LaPolla,,0000-0002-3185-9753


# Author Variations
Given a CSV file containing a list of author names as your input, this script will create a search query to find co-authored articles. Format the CSV with column headers FirstName, MI, LastName, Publishes As, OrcID.
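
Before building queries, it can help to confirm that the input file has the expected headers. The following is a minimal sketch using only the standard library; the helper name `missing_columns` and the sample text are illustrative assumptions rather than part of the script.

```python
import csv
import io

# Column headers the script expects in the input CSV
REQUIRED_COLUMNS = ['FirstName', 'MI', 'LastName', 'Publishes As', 'OrcID']


def missing_columns(csv_text):
    """Return the required headers that are absent from the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    present = reader.fieldnames or []
    return [col for col in REQUIRED_COLUMNS if col not in present]


sample = (
    "FirstName,MI,LastName,Publishes As,OrcID\n"
    "Fred,WZ,LaPolla,,0000-0002-3185-9753\n"
)
print(missing_columns(sample))  # []
print(missing_columns("FirstName,LastName\nFred,LaPolla\n"))  # ['MI', 'Publishes As', 'OrcID']
```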

The function below creates variations of author names in order to build a robust search. For each author, it produces the variations:

* LastName, FirstName MiddleInitial
* LastName, FirstName
* LastName, FirstInitial MiddleInitial

There is also the option to include author aliases in the column titled "Publishes As." This can be useful to find authors publishing under alternative names, such as maiden names, or using their first initial with full middle name (like F. Scott Fitzgerald).
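
Multiple aliases for one author are separated with semicolons in the "Publishes As" column. As a rough illustration of how such a cell becomes search terms, here is a small sketch; the helper name `alias_terms` and the alias values are made up for the example.

```python
def alias_terms(publishes_as):
    """Turn a semicolon-separated alias string into PubMed author terms."""
    terms = []
    for alias in publishes_as.split(';'):
        alias = alias.strip()
        if alias:  # skip empty pieces such as a trailing semicolon
            terms.append(f'({alias}[Author])')
    return terms


# Illustrative aliases only
print(alias_terms('Fitzgerald, F Scott; Fitzgerald, Francis Scott'))
# ['(Fitzgerald, F Scott[Author])', '(Fitzgerald, Francis Scott[Author])']
```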



In [None]:
# Define a function to generate variations of author names.
def generate_pubmed_queries(df):
    # Create blank list to populate
    queries = []

    for _, author in df.iterrows():
        # Assign name variations from CSV
        last_name = author['LastName']
        first_name = author['FirstName']
        middle_name = author['MI']
        publishes_as = author['Publishes As']
        orcid = author['OrcID']

        # Create author name variations with string formatting
        query_variations = [
            f'({last_name}, {first_name} {middle_name}[Author])',
            f'({last_name}, {first_name}[Author])',
            f'({last_name}, {first_name[0]}{middle_name}[Author])',
        ]

        # Append names from the "Publishes As" column (semicolon-separated),
        # skipping empty entries and aliases that are already in the list
        if pd.notnull(publishes_as):
            for pub in publishes_as.split(';'):
                pub = pub.strip()
                alias_query = f'({pub}[Author])'
                if pub and alias_query not in query_variations:
                    query_variations.append(alias_query)

        # Append the ORCID iD, if one was provided
        if pd.notnull(orcid):
            query_variations.append(f'({orcid}[Author - Identifier])')

        # Join all queries with 'OR' and wrap in parentheses
        search_query = ' OR '.join(query_variations)
        search_query = f'({search_query})'

        # Add the search query to the list
        queries.append(search_query)

    # Return the 'queries' list for use outside of the function
    return queries

# Before running the function, replace any NaNs in the df MI column with blanks,
# otherwise the NaNs will mess up the query
df['MI'] = df['MI'].fillna('')

# Store the function's output (a list of queries) in variable 'search'
search = generate_pubmed_queries(df)

# print PubMed query to check it, and preview transformed dataframe with blank MIs
print(search)
df

['((Milliken, Genevieve [Author]) OR (Milliken, Genevieve[Author]) OR (Milliken, G[Author]) OR (0000-0002-3057-0659[Author - Identifier]))', '((Contaxis, Nicole [Author]) OR (Contaxis, Nicole[Author]) OR (Contaxis, N[Author]) OR (0000-0001-5279-1623[Author - Identifier]))', '((Surkis, Alisa [Author]) OR (Surkis, Alisa[Author]) OR (Surkis, A[Author]) OR (0000-0002-9777-2693[Author - Identifier]))', '((LaPolla, Fred WZ[Author]) OR (LaPolla, Fred[Author]) OR (LaPolla, FWZ[Author]) OR (0000-0002-3185-9753[Author - Identifier]))']


Unnamed: 0,FirstName,MI,LastName,Publishes As,OrcID
0,Genevieve,,Milliken,,0000-0002-3057-0659
1,Nicole,,Contaxis,,0000-0001-5279-1623
2,Alisa,,Surkis,,0000-0002-9777-2693
3,Fred,WZ,LaPolla,,0000-0002-3185-9753


# Boolean Combination
This part of the script defines a function that generates a single Boolean statement to find articles on which at least two authors from the list appear.

The Boolean statement uses this logic:



> (A OR B OR C OR D...)

>  NOT (A NOT B NOT C NOT D)

>  NOT (B NOT A NOT C NOT D)

>  NOT (C NOT A NOT B NOT D)

>  NOT (D NOT A NOT B NOT C)



where A, B, C, D are names of authors.

The first line searches for all articles authored by anyone on the list. Each following line removes the articles on which only one listed author appears, leaving the articles that have at least two authors from the list.
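
The logic can be checked on toy data by treating OR as set union and NOT as set difference, applied left to right. This is only a sketch: the article IDs are invented, and the helper name `coauthored` is hypothetical.

```python
# Toy data: each author maps to the set of (made-up) article IDs they wrote.
articles_by_author = {
    'A': {1, 2, 3},
    'B': {2, 4},
    'C': {3, 4, 5},
    'D': {6},
}


def coauthored(articles_by_author):
    """Apply the OR / NOT logic with set union and difference."""
    authors = list(articles_by_author)
    # First line: everything written by any listed author
    result = set().union(*articles_by_author.values())
    for name in authors:
        # (A NOT B NOT C NOT D): articles where `name` is the only listed author
        solo = set(articles_by_author[name])
        for other in authors:
            if other != name:
                solo -= articles_by_author[other]
        # Outer NOT: drop those single-author articles from the result
        result -= solo
    return result


print(sorted(coauthored(articles_by_author)))  # [2, 3, 4]
```

Articles 1, 5, and 6 each have a single listed author and are removed; articles 2, 3, and 4 each have two and remain.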

In [None]:
def generate_logical_expression(lst):
    # A co-authorship search needs at least two authors
    if len(lst) < 2:
        raise ValueError("Provide at least two author queries")

    # Create the first line of the Boolean statement: (A OR B OR C ...)
    expression = "(" + " OR ".join(lst) + ") "

    # Use a for loop to construct one NOT (...) line per author
    for i, current in enumerate(lst):

        # Use list comprehension to list every other author's query,
        # skipping the current one
        not_variables = [var for j, var in enumerate(lst) if j != i]

        # Build NOT (current NOT other1 NOT other2 ...), which removes the
        # articles where the current author is the only one from the list
        expression += "NOT (" + current + " NOT " + " NOT ".join(not_variables) + ") "

    # Return the value of our expression (search string) for use outside of the function
    return expression

# Create a PubMedFetcher instance and store it in variable 'fetch' for ease of use
fetch = PubMedFetcher()

# Run the function above on the 'search' list (the output of the previous function)
result = generate_logical_expression(search)

# Use metapub library to conduct the API search and retrieve no more than 1000 PMIDs
pmid_list = fetch.pmids_for_query(result, retmax=1000)
print(result)
print(pmid_list)
print(len(pmid_list))

(((Milliken, Genevieve [Author]) OR (Milliken, Genevieve[Author]) OR (Milliken, G[Author]) OR (0000-0002-3057-0659[Author - Identifier])) OR ((Contaxis, Nicole [Author]) OR (Contaxis, Nicole[Author]) OR (Contaxis, N[Author]) OR (0000-0001-5279-1623[Author - Identifier])) OR ((Surkis, Alisa [Author]) OR (Surkis, Alisa[Author]) OR (Surkis, A[Author]) OR (0000-0002-9777-2693[Author - Identifier])) OR ((LaPolla, Fred WZ[Author]) OR (LaPolla, Fred[Author]) OR (LaPolla, FWZ[Author]) OR (0000-0002-3185-9753[Author - Identifier]))) NOT (((Milliken, Genevieve [Author]) OR (Milliken, Genevieve[Author]) OR (Milliken, G[Author]) OR (0000-0002-3057-0659[Author - Identifier])) NOT ((Contaxis, Nicole [Author]) OR (Contaxis, Nicole[Author]) OR (Contaxis, N[Author]) OR (0000-0001-5279-1623[Author - Identifier])) NOT ((Surkis, Alisa [Author]) OR (Surkis, Alisa[Author]) OR (Surkis, A[Author]) OR (0000-0002-9777-2693[Author - Identifier])) NOT ((LaPolla, Fred WZ[Author]) OR (LaPolla, Fred[Author]) OR (LaP

# eSummary
[eSummary](https://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_ESummary_) is one of several tools the E-Utilities API provides for interacting with the information in PubMed. It provides brief summary data of records, and you can pull that information via PMIDs.
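
For orientation, an eSummary request is an HTTP call that names the database and lists the PMIDs. The sketch below only builds such a URL and does not send it; the helper name `build_esummary_url` and the PMIDs are placeholders, and the script itself relies on Bio.Entrez rather than hand-built URLs.

```python
from urllib.parse import urlencode

# eSummary endpoint of the NCBI E-Utilities API
ESUMMARY_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi'


def build_esummary_url(pmids, retmode='json'):
    """Build (but do not send) an eSummary request URL for a list of PMIDs."""
    params = {'db': 'pubmed', 'id': ','.join(pmids), 'retmode': retmode}
    # safe=',' keeps the commas between PMIDs readable in the URL
    return f'{ESUMMARY_URL}?{urlencode(params, safe=",")}'


# Placeholder PMIDs for illustration only
print(build_esummary_url(['12345678', '23456789']))
```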

In [None]:
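# Base URL for NCBI Entrez Utilities.
# Note: this and the other request settings in this cell are kept for reference;
# the Bio.Entrez calls in fetch_article_info() do not use them.
base_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'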

# Specifying the database to be queried
db = 'db=pubmed'

# Maximum number of records to retrieve
retmax = 50

# Starting point for retrieving records
retstart = 0

# Format for fetching data - text format
fetch_retmode = "&retmode=text"

# Type of data to retrieve - abstracts
fetch_rettype = "&rettype=abstract"

# Supply your email address inside the quotes "" for identification
Entrez.email = ""

def fetch_article_info(pmids):
    # Nothing to fetch if the search returned no PMIDs
    if not pmids:
        return []

    # Step 1: Use ePost to upload the list of PMIDs
    post_handle = Entrez.epost("pubmed", id=",".join(pmids))
    post_results = Entrez.read(post_handle)
    post_handle.close()
    web_env = post_results["WebEnv"]
    query_key = post_results["QueryKey"]

    # Step 2: Use eSummary to fetch summaries for the uploaded PMIDs
    summary_handle = Entrez.esummary(db="pubmed", webenv=web_env, query_key=query_key)
    summary_records = Entrez.read(summary_handle)
    summary_handle.close()

    # Parse the summary records and extract relevant information
    articles = []
    for record in summary_records:
        # Keep only the first token of PubDate (the year) when a date is present
        pub_date = record.get('PubDate', '')
        article_info = {
            'PMID': record['Id'],
            'Title': record['Title'],
            'Journal': record['FullJournalName'],
            # Join author names so the CSV cell holds plain text, not a Python list
            'AuthorList': '; '.join(record['AuthorList']),
            'PublicationYear': pub_date.split()[0] if pub_date.strip() else 'N/A'
        }
        articles.append(article_info)
    return articles

def save_to_csv(articles, filename):
    # Save fetched article information to a CSV file
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['PMID', 'Title', 'Journal', 'AuthorList', 'PublicationYear']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for article in articles:
            writer.writerow(article)

def main():
    # Fetch article information using provided PMIDs
    pmids = pmid_list
    articles = fetch_article_info(pmids)

    # Save fetched information to a CSV file
    save_to_csv(articles, 'article_info.csv')

if __name__ == "__main__":
    # Execute main function if the script is run directly
    main()

# Download File
The file should now be in your temporary Google Colaboratory file space. You can download it from the Colab menu (on the left side of the screen), or use this code to download it directly to your computer.

In [None]:
files.download('article_info.csv')
