# Demo Network Construction Notebook

This notebook constructs a PI network consisting of Authors, Publications, Awards, and MeSH Terms. It uses award information collected by BRIMR to query PubMed through Biopython's Entrez wrapper. The protocol paper breaks this notebook down into detailed steps.

In [None]:
import pandas as pd
from Bio import Entrez
import os
import itertools
import unittest
import random
from typing import Iterable, Dict, Any
from ast import literal_eval
import xml.etree.ElementTree as ET
import zipfile

# Protocol 1 - 13

The cell below creates the required directories: the construction/Network folder and the construction/Data folder. The construction/Network folder will eventually contain the triples required for ingestion into Neo4j. The construction/Data folder will contain the BRIMR Excel file, which holds the relevant award information, and the desc2025.xml file (read from desc2025.zip in a later cell), which is used to filter out MeSH headings that are too general. This corresponds to Protocol 1 - 12.

In [None]:

# Create the directories if they don't already exist.

# Create the construction/Data folder.
os.makedirs("./Data", exist_ok=True)

# Create the construction/Network folder.
os.makedirs("./Network", exist_ok=True)


Now that the two folders exist inside our construction folder, let us build the network.

## Retrieve the network data

A network data structure typically consists of nodes and edges. Nodes are the objects in the network, and edges are the relationships between those objects. In this demo we create a PI network: a network that focuses on the relationships between principal investigators/contact authors who share publications. Our network contains information on Authors, Publications, Awards, and MeSH terms. We obtain this data from three main sources (BRIMR, RePORTER, and PubMed), which together contain the information needed to create the demo network.

### Filter and format awards

We will now filter and format the awards dataframe. First, let us read in the BRIMR award information for the past year. How to obtain this information is described in Protocol 1 - 7 and 1 - 10. This section corresponds to Protocol 1 - 13.

In [None]:
# Let us declare a dataframe that will store all of the information from BRIMR 2024.
# This may take a few seconds to load into memory.
whole_df = pd.read_excel("./Data/Brimr_2024.xlsx", header = 1)
whole_df.head(5)


The dataframe has many columns, but the only ones we need are Organization, PROJECT NUMBER, PI NAME, and PROJECT TITLE. Next we filter the awards by institution.



#### Filter the dataframe by institution of your choice: (We utilize Icahn School of Medicine by default)

If you would like the <b>default</b> institution, MOUNT SINAI ICAHN SCHOOL OF MEDICINE, run the cell below.

In [None]:
institution = "MOUNT SINAI ICAHN SCHOOL OF MEDICINE"

If you would like a <b>random</b> institution instead, uncomment and run the cell below.

<b>WARNING:</b> A randomly selected institution may produce an empty network.

In [None]:
#institution = random.choice(whole_df['Organization'].unique())
#print(institution)

#### Perform filtering of the dataframe by selected institution
Here we filter the dataframe so that only awards from the given institution are included in the network.

In [None]:
# Filter the dataframe by the given institution.
filtered_df = whole_df[whole_df['Organization'] == institution]
filtered_df.head(5)

Now that we have filtered the dataframe by the given institution, let us filter out R&D contract awards.

In [None]:
# Exclude R&D contract awards.
filtered_df = filtered_df[filtered_df['FUNDING MECHANISM'] != 'R&D Contracts']

Now that we have filtered the dataframe by institution and funding mechanism, let us remove unnecessary columns from the dataframe (e.g. Congressional District, City, etc.).

In [None]:
# Keep only the columns we need. Copy the result so that later column assignments do not act on a view.
filtered_df = filtered_df[['Organization', 'PROJECT NUMBER', 'PI NAME', 'PROJECT TITLE']].copy()
filtered_df

This leaves ~727 awards from 2024.

#### Format the project number and query PubMed through Biopython's Entrez submodule

By default, the full project number cannot be used to query PubMed. To perform a PubMed search we need the part after the 4th character and before the dash. We write a function that does this and check that it modifies the project number correctly. More information on grant project numbers can be found at [Understanding Grant Numbers](https://www.era.nih.gov/eraHelp/commons/Commons/understandGrantNums.htm?TocPath=Commons%20Basics%7C_____2). More information on searching PubMed by grant number can be found in the [PubMed User Guide](https://pubmed.ncbi.nlm.nih.gov/help/#gr).

For example:
- 5R01DK046865-31 -> DK046865
- 1R21DK139543-01A1 -> DK139543
- 5R01EY029736-05 -> EY029736

In [None]:
from typing import Optional

# Function to modify the project number so that it can be used in a PubMed query.
def modify_pm(s: str) -> Optional[str]:
    """
    Modifies a string from the BRIMR dataset to a format suitable for PubMed queries.

    Args:
        s (str): The input string, typically a grant number.

    Returns:
        str: A modified string for PubMed query, or None if invalid.
    """
    try:
        s = s.split("-")[0]
        s = s[4:]
        return s
    except Exception as e:
        print(e)
        return None



# Unit tests that check modify_pm against the examples listed above.
class TestModifyPM(unittest.TestCase):

    def test_valid_inputs(self):
        self.assertEqual(modify_pm("5R01DK046865-31"), "DK046865")
        self.assertEqual(modify_pm("1R21DK139543-01A1"), "DK139543")
        self.assertEqual(modify_pm("5R01EY029736-05"), "EY029736")


# Run the tests in Jupyter
unittest.TextTestRunner().run(unittest.TestLoader().loadTestsFromTestCase(TestModifyPM))
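
The slicing rule above assumes that every project number follows the standard NIH layout. As a minimal sketch, the same core number can be extracted with a regular expression that rejects strings not matching that layout. The pattern and the name `parse_project_number` are illustrative assumptions and are not part of the protocol.

```python
import re
from typing import Optional

# Assumed layout: 1-digit application type, 3-character activity code,
# 2-letter institute code, 6-digit serial number, then "-" and a suffix.
PROJECT_NUMBER_RE = re.compile(r"^\d[A-Z0-9]{3}([A-Z]{2}\d{6})-\w+$")


def parse_project_number(s: str) -> Optional[str]:
    """Return the institute code plus serial number, or None if s does not match."""
    if not isinstance(s, str):
        # Missing values (e.g. NaN) are not strings.
        return None
    match = PROJECT_NUMBER_RE.match(s.strip())
    return match.group(1) if match else None
```

Unlike plain slicing, this returns None for identifiers that do not follow the assumed layout, so malformed rows can be inspected before querying PubMed.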

#### Modify the project number using modify_pm

In [None]:
# Keep the original project number as a backup, then shorten it so that it can be used in a PubMed query.
filtered_df['PROJECT NUMBER BAK'] = filtered_df['PROJECT NUMBER']
filtered_df['PROJECT NUMBER'] = filtered_df['PROJECT NUMBER'].apply(modify_pm)
filtered_df.head(5)

As a last sanity check, let us make sure that all project numbers are unique. That is, we remove all duplicate project numbers so that we do not accidentally collect the same project information twice.

In [None]:
filtered_df = filtered_df.drop_duplicates(subset="PROJECT NUMBER")

Write out filtered_df to a file so that we keep the list of project numbers. The dataframe should contain ~684 awards. The file is written to "./construction/grants.csv".

In [None]:
filtered_df.to_csv("grants.csv", index=False)

### Pubmed search
Here we use the PubMed Entrez tool to search for all of the PMIDs associated with each grant number. We do this through Biopython's Entrez wrapper so that we can query PubMed programmatically. This step corresponds to Protocol 1-13.

Let us first define a few helper functions:
1. get_pmids : A function that retrieves the list of PMIDs for a grant.
2. get_records : A function that retrieves the PubMed records for a list of PMIDs.

In [None]:
# Function to retrieve a list of pmids for a given grant number.
def get_pmids(grant_number: str) -> list[str]:
    """
    Gets and returns all PMIDs for a given grant number.

    Args:
        grant_number (str): Grant number to search in PubMed.

    Returns:
        List[str]: List of PMIDs (max 10,000) as strings.
    """
    handle = Entrez.esearch(db="pubmed", term=grant_number, retmax=10000)
    record = Entrez.read(handle)
    handle.close()

    pmids = record.get("IdList", [])
    return pmids

# Function to retrieve a list of PubMed records for the given PMIDs.
def get_records(pmids: list[str]) -> list[dict]:
    """
    Gets and returns a list of PubMed records for the given PMIDs.

    Args:
        pmids (List[str]): A list of PMIDs from the PubMed database.

    Returns:
        List[dict]: A list of records returned by Entrez.efetch.
    """
    handle = Entrez.efetch(db="pubmed", id=",".join(pmids), retmode="xml")
    records = Entrez.read(handle)
    handle.close()

    return records
    

Let us now use PubMed's Entrez tool through the Biopython wrapper. Please enter an email address in the cell below to comply with the NCBI usage policy for this tool.

In [None]:
email = "youremail@gmail.com"

Now that an email has been entered, we need to search for all PubMed IDs (PMIDs) associated with each grant. We construct a search for each award and add the result to a dictionary. We then use this dictionary to create a pandas DataFrame, which we save to "./construction/grant_pmids.csv".

In [None]:


# Set your email to comply with NCBI's usage policy
Entrez.email = email  # Replace with your email address

# Let us create a dictionary to store the associated PMID's with Awards.
grant_pmids = {}

# Iterate through the project numbers for the organization.
for i, PM in enumerate(filtered_df.itertuples()):

    # PM[0] is the index and PM[1] is the Organization, so PM[2] is the project number.
    grant_number = PM[2]

    # Print out the grant being processed and its project number.
    print(f"Processing grant {i + 1}: {grant_number}")

    # Create the query for pubmed.
    search_query = f'"{grant_number}"[gr]'
    
    try:
        # Search for PMIDs associated with the grant
        pmids = get_pmids(search_query)
        
        # Print out the associated grant number and pmids.
        print(grant_number, pmids)

        grant_pmids[grant_number] = pmids  # Store PMIDs for the grant.
    except Exception as e:
        print(f"Error processing grant {grant_number}: {e}")



# Save the results to a CSV file in the current (construction) folder.
result_df = pd.DataFrame(list(grant_pmids.items()), columns=['Grant Number', 'PMIDs'])
result_file = os.path.join("./", 'grant_pmids.csv')
result_df.to_csv(result_file, index=False)
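
The loop above logs a failed query and moves on, so a transient network error leaves that grant without PMIDs. A minimal sketch of a retry wrapper is shown below. The helper name `with_retries`, the number of attempts, and the delays are assumptions chosen for illustration; it wraps any callable and does not depend on Entrez.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(func: Callable[[], T], attempts: int = 3, delay: float = 1.0) -> T:
    """Call func until it succeeds or attempts run out, doubling the delay each time."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == attempts:
                # Out of attempts: let the caller see the last error.
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
            delay *= 2
    raise ValueError("attempts must be at least 1")
```

Inside the loop, the call could then be written as `pmids = with_retries(lambda: get_pmids(search_query))`, leaving the existing `try`/`except` in place as a final fallback.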



# Protocol 1 - 14

### Retrieve publication metadata from PubMed

Now that we have obtained all of the relevant PMIDs, let us obtain metadata for each publication: the article's title, journal, author names and affiliations, and MeSH terms. We do this with the following helper function, which retrieves metadata for 1000 publications at a time and associates it with each PubMed ID. This section corresponds to Protocol 1-14.

In [None]:

# Helper function to retrieve metadata for a given set of PMIDs.
def get_metadata(pmids: Iterable[str]) -> Dict[str, Dict[str, Any]]:
    """
    Retrieve metadata for a given set of PMIDs using Entrez.

    Parameters:
        pmids (Iterable[str]): A collection of PubMed IDs (PMIDs) to retrieve metadata for.

    Returns:
        Dict[str, Dict[str, Any]]: A dictionary where each key is a PMID and the value
        is another dictionary containing:
            - Title (str)
            - Authors (str): semicolon-separated
            - Affiliations (str): semicolon-separated
            - Journal (str)
            - Volume (str)
            - Issue (str)
            - Pages (str)
            - Year (str)
            - MeSH Terms (str): semicolon-separated
    """
    # Initialize an empty dictionary to store metadata.
    metadata = {}
    # Set the batch size to 1000.
    batch_size = 1000
    # Convert the PMIDs to a list.
    pmid_list = list(pmids)
    # Iterate through the PMIDs in batches of 1000.
    for i in range(0, len(pmid_list), batch_size):
        batch_pmids = ",".join(map(str, pmid_list[i:i + batch_size]))
        print(
            f"Fetching metadata for PMIDs {i + 1} to "
            f"{min(i + batch_size, len(pmid_list))}..."
        )
        try:
            # Fetch the metadata for the given PMIDs.
            handle = Entrez.efetch(
                db="pubmed",
                id=batch_pmids,
                rettype="medline",
                retmode="xml"
            )
            # Read the metadata for the given PMIDs.
            records = Entrez.read(handle)
            handle.close()
            print("Retrieved")
            # Iterate through the records.
            for article in records["PubmedArticle"]:
                citation = article["MedlineCitation"]
                # Store the PMID as a plain string so that it can serve as a dictionary key and index.
                pmid = str(citation["PMID"])
                article_data = citation["Article"]
                journal_info = article_data["Journal"]
                # Get the title, journal, issue, volume, year, and pages of the article.
                title = article_data.get("ArticleTitle", "N/A")
                journal = journal_info.get("Title", "N/A")
                issue = journal_info.get("JournalIssue", {}).get("Issue", "N/A")
                volume = journal_info.get("JournalIssue", {}).get("Volume", "N/A")
                year = journal_info.get("JournalIssue", {}).get("PubDate", {}).get("Year", "N/A")
                pages = article_data.get("Pagination", {}).get("MedlinePgn", "N/A")
                # Get the authors and affiliations of the article.
                authors = []
                affiliations = []
                if "AuthorList" in article_data:
                    for author in article_data["AuthorList"]:
                        if "LastName" in author and "ForeName" in author:
                            authors.append(f"{author['ForeName']} {author['LastName']}")
                        if "AffiliationInfo" in author:
                            affiliations.extend([
                                aff["Affiliation"]
                                for aff in author["AffiliationInfo"]
                                if "Affiliation" in aff
                            ])
                # Get the MeSH terms of the article.
                mesh_terms = []
                if "MeshHeadingList" in citation:
                    mesh_terms = [
                        mesh["DescriptorName"]
                        for mesh in citation["MeshHeadingList"]
                        if "DescriptorName" in mesh
                    ]
                # Add the metadata to the dictionary.
                metadata[pmid] = {
                    "Title": title,
                    "Authors": "; ".join(authors),
                    "Affiliations": "; ".join(affiliations),
                    "Journal": journal,
                    "Volume": volume,
                    "Issue": issue,
                    "Pages": pages,
                    "Year": year,
                    "MeSH Terms": "; ".join(mesh_terms)
                }
            # Print out the success message.
            print(
                f"Successfully retrieved metadata for batch {i + 1} to "
                f"{min(i + batch_size, len(pmid_list))}."
            )
        # Print out the error message.
        except Exception as e:
            print(f"Error retrieving metadata for PMIDs: {e}")
    # Return the metadata.
    return metadata
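
Some PubMed records store the publication date as a free-text `MedlineDate` (for example `2024 Jan-Feb`) instead of a separate `Year` field. For those records the lookup above yields "N/A". A minimal sketch of a fallback follows, assuming a hypothetical helper `extract_year` that receives the `PubDate` mapping; it is not part of the protocol.

```python
import re
from typing import Mapping


def extract_year(pub_date: Mapping[str, str]) -> str:
    """Return a 4-digit year from a PubDate mapping, or "N/A" if none is found."""
    if "Year" in pub_date:
        return str(pub_date["Year"])
    # Fall back to the first 4-digit number in the free-text MedlineDate.
    match = re.search(r"\b(\d{4})\b", str(pub_date.get("MedlineDate", "")))
    return match.group(1) if match else "N/A"
```

If adopted, the helper would take `journal_info.get("JournalIssue", {}).get("PubDate", {})` as its argument.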


Now let us retrieve the information using the helper function declared above. We save it to a file delimited by |, a character that rarely appears in titles or names (titles, by contrast, often contain commas).

In [None]:
# Read in the grant_pmids.csv file.
df = pd.read_csv("grant_pmids.csv")
# Initialize an empty set to store all unique PMIDs.
all_pmids = set()
# Iterate through the PMIDs in the dataframe.
for pmid_list in df['PMIDs']:
    # Parse the stored list safely instead of evaluating arbitrary code.
    pmid_list = literal_eval(pmid_list) if isinstance(pmid_list, str) else pmid_list
    all_pmids.update(pmid_list)
print(f"Collected {len(all_pmids)} unique PMIDs.")

# Retrieve metadata for all unique PMIDs
metadata = get_metadata(all_pmids)


# Build a dataframe from the metadata, indexed by PMID.
metadata_df = pd.DataFrame.from_dict(metadata, orient='index')

# Keep only publications from 2024 onward, since the awards are from 2024.
# Entries without a usable year (e.g. "N/A") are coerced to missing values and dropped by the filter.
metadata_df['Year'] = pd.to_numeric(metadata_df['Year'], errors='coerce').astype("Int64")
metadata_df = metadata_df[metadata_df['Year'] >= 2024]

# Save the metadata to a pipe-delimited CSV file.
metadata_file = os.path.join("./", 'pmid_metadata.csv')
metadata_df.to_csv(metadata_file, index_label='PMID', sep = "|")


print(f"Metadata retrieval complete. Results saved to {metadata_file}.")
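
Two details matter when these files are read back: the `PMIDs` column is stored as the string form of a Python list, and the metadata file must be read with the same `|` separator. The self-contained sketch below illustrates both with in-memory buffers. The grant number, PMIDs, and title are made-up placeholders, not real results.

```python
import io
from ast import literal_eval

import pandas as pd

# Placeholder data standing in for grant_pmids.csv.
toy = pd.DataFrame({"Grant Number": ["XX000001"], "PMIDs": [["111", "222"]]})
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# After the round trip the list comes back as a string and must be parsed.
restored = pd.read_csv(buffer)
restored["PMIDs"] = restored["PMIDs"].apply(literal_eval)

# Placeholder data standing in for pmid_metadata.csv, written with sep="|".
meta = pd.DataFrame({"Title": ["A title, with a comma"]}, index=["111"])
pipe_buffer = io.StringIO()
meta.to_csv(pipe_buffer, index_label="PMID", sep="|")
pipe_buffer.seek(0)

# Read PMID as a string so that it is not converted to an integer.
meta_back = pd.read_csv(pipe_buffer, sep="|", dtype={"PMID": str}).set_index("PMID")
```

Keeping PMIDs as strings on both sides avoids mismatches when they are later used as dictionary keys.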

Now that we have a list of grants and the publication metadata, let us process them into a network for Neo4j. We need to create CSV files for the nodes and for the edges. We will have four node types:
- Authors <id, label>
- Publications <id, label, Title, Journal, Authors>
- Awards <id, label, Title, Contact>
- MeSH <id, label>

We will also write out the following edges:
- Coauthors : Authors - Authors
- Publications : Authors - Publications
- MeSH : MeSH - Publications
- Awards : Awards - Authors
- Awards : Awards - Publications


# Protocol 1 - 15

## Create Network Files

### Create "Authors" node type

We first need to curate a list of authors. One of the primary challenges of working with publication and authorship networks is Author Name Disambiguation (AND): the task of deciding which name strings refer to the same person. For example, someone may have entered their name on one publication as First Name, Middle Initial, Last Name, but on another as First Name, Middle Name, Last Name. This is a complicated problem throughout bibliometrics, so for the sake of simplicity we assume that all authors with exactly the same name are the same person, and that authors with the same middle initial and matching first and last names are also the same person. Let us create our list of unique Author nodes. We get this information from the contact PI names found in [grants.csv](./grants.csv) as well as the publication author names stored in [pmid_metadata.csv](./pmid_metadata.csv). This section creates the file "./Network/Authors.nodes.csv" and corresponds to Protocol 1 - 15.

In [None]:
# Function to explode authors and affiliations into rows
def explode_authors_affiliations(df: pd.DataFrame) -> pd.DataFrame:
    records = []

    for _, row in df.iterrows():
        pmid = row.name  # assuming PMID is the index
        authors = [a.strip() for a in str(row['Authors']).split(';') if a.strip()]
        affiliations = [a.strip() for a in str(row['Affiliations']).split(';') if a.strip()]
        
        # Handle mismatch: align only up to the shortest length
        for author, affiliation in zip(authors, affiliations):
            records.append({
                "PMID": pmid,
                "Author": author,
                "Affiliation": affiliation
            })

    return pd.DataFrame(records)

author_df = explode_authors_affiliations(metadata_df[['Authors', 'Affiliations']])


Now we need to ensure that each author belongs to the selected institution, so we require the affiliation to contain a keyword. Please enter a keyword for the institution selected above (stored in the institution variable). For the Icahn School of Medicine at Mount Sinai, the keyword would be ICAHN.

In [None]:
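keyword = "ICAHN"
# Keep only authors whose affiliation mentions the keyword.
author_df = author_df[author_df['Affiliation'].str.contains(keyword, case=False, na=False)]
# Keep one row per author name.
author_df = author_df.drop_duplicates(subset="Author")
# Only the author name is needed from here on.
author_df = author_df.drop(columns=['PMID', 'Affiliation'])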


We also restrict the authors to contact PIs only, so that we don't obtain unwanted authors in the network.

In [None]:
# Clean PI names from "Last, First" to "First Last" and format
filtered_df = filtered_df.copy()
filtered_df['Cleaned PI Name'] = (
    filtered_df['PI NAME']
    .fillna('')  # handle NaNs
    .str.replace(".", "", regex=False)
    .str.split(",")
    .apply(lambda x: " ".join(x[::-1]).strip() if isinstance(x, list) and len(x) > 1 else x[0] if x else "")
    .str.title()
)

# Extract PI names into awards_authors
awards_authors = pd.DataFrame(filtered_df['Cleaned PI Name'].copy())
awards_authors.rename(columns={'Cleaned PI Name': 'Author'}, inplace=True)

# Build final authors node table (PIs only)
authors_df = awards_authors.copy()
authors_df['label'] = authors_df['Author'].str.strip()
authors_df = authors_df.drop(columns=["Author"])
authors_df = authors_df.drop_duplicates()

# Save to CSV
authors_df.to_csv(os.path.join("./Network", "Authors.nodes.csv"), index=False)
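
The chained pandas expression above converts BRIMR names from "LAST, FIRST" to "First Last". As a rough per-name sketch of the same idea, the function below applies the steps to a single string. The name `clean_pi_name` is hypothetical, and the function strips each part individually, which differs slightly from the cell above for names containing more than one comma.

```python
def clean_pi_name(raw: str) -> str:
    """Convert "LAST, FIRST M." to "First M Last"."""
    # Remove periods, split on commas, and trim each part.
    parts = [p.strip() for p in raw.replace(".", "").split(",")]
    # Reverse so that the given names come first, then title-case the result.
    return " ".join(reversed(parts)).strip().title()


print(clean_pi_name("SMITH, JOHN H."))
print(clean_pi_name("MCDONALD, ANN"))
```

Note that `str.title()` lowercases every letter after the first in each word, so MCDONALD becomes Mcdonald. The author matching later in the notebook lowercases names before comparing them, so this affects the node labels only.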

### Create "Publications" node type

Now we need to create the publications data from the metadata dataframe that we created earlier in the notebook. This section creates the file "./Network/Publications.nodes.csv".

In [None]:

publications_df = metadata_df.copy()
# Add label column from index
publications_df['label'] = publications_df.index

# Format Authors column: take first author and add "et al."
publications_df['Authors'] = (
    publications_df['Authors']
    .fillna('')
    .apply(lambda x: (x.split(';')[0].strip() + ' et al.') if x else '')
)

# Join the available journal fields, skipping values that are missing or "N/A".
def format_journal_row(row):
    parts = [
        row.get('Journal', ''),
        row.get('Volume', ''),
        row.get('Issue', ''),
        row.get('Pages', '')
    ]
    # Drop empty and "N/A" entries, then join the rest with commas.
    joined = ", ".join([str(part) for part in parts if part and part != "N/A"])
    year = row.get('Year', '')
    return f"{joined} ({year})" if pd.notna(year) and year != '' else joined

# Apply formatted journal info
publications_df['Journal'] = publications_df.apply(format_journal_row, axis=1)
publications_df = publications_df.drop(
    columns=['Affiliations', 'Volume', 'Issue', 'Pages', 'Year', 'MeSH Terms'],
    errors='ignore'  # in case any of them are already missing
)
publications_df = publications_df[['label', 'Title','Authors', 'Journal']]
publications_df.to_csv(os.path.join("./Network", "Publications.nodes.csv"), index=False)

We would also like a dictionary that tells us whether a publication is valid. For example, a publication from 2023 predates the 2024 award data and was removed by the year filter above. When a publication is not valid, we simply ignore it.

In [None]:
valid_pub = dict(zip(publications_df['label'], [True] * len(publications_df)))

### Create "Awards" node type

Here we create the "Awards" node type. This simply involves collecting all of the awards that we used to collect publications and authors. This section creates the file "./Network/Awards.nodes.csv".

In [None]:
# Create a copy of the filtered dataframe. 
awards_df = filtered_df.copy()

# Only get relevant columns from the dataframe.
awards_df = awards_df[['PROJECT NUMBER BAK', 'PROJECT TITLE', 'Cleaned PI Name']]

# Rename the relevant columns.
awards_df = awards_df.rename(columns={
    'PROJECT NUMBER BAK': 'label',
    'PROJECT TITLE': 'Title',
    'Cleaned PI Name': 'Contact'
})

# Create/format the node type for saving.
awards_df = awards_df[['label', 'Title', 'Contact']]
awards_df.to_csv(os.path.join("./Network", "Awards.nodes.csv"), index=False)


### Create "MeSH" node type

Here we create the "MeSH" node type from the metadata DataFrame. We have already collected this information; we just need to build the node list and filter out the high-level (overly general) MeSH terms by their depth in the MeSH tree. In the code below, a term is kept only if at least one of its tree numbers contains more than 5 dots.

In [None]:
# Collect the unique MeSH terms across all publications.
all_terms = (
    metadata_df['MeSH Terms']
    .dropna()
    .str.split(';')                  # split each row on semicolons
    .explode()                       # flatten into a single column
    .str.strip()                     # remove extra spaces
    .dropna()
    .unique()                        # get unique values
)

# Build the MeSH node table from the sorted unique terms.
mesh_df = pd.DataFrame()
mesh_df['label'] = sorted(all_terms.tolist())
mesh_df = mesh_df[mesh_df['label'] != ""]

with zipfile.ZipFile("./Data/desc2025.zip", mode="r") as z:
    with z.open("desc2025.xml") as xml_file:
        tree = ET.parse(xml_file)
        root = tree.getroot()

deep_mesh_terms = set()
for descriptor in root.findall('.//DescriptorRecord'):
    name_elem = descriptor.find('.//DescriptorName/String')
    treenumbers = descriptor.findall('.//TreeNumberList/TreeNumber')
    
    # Keep the descriptor if any of its tree numbers contains more than 5 dots (i.e. sits at least 7 levels deep).
    if any(tn.text.count(".") > 5 for tn in treenumbers):
        if name_elem is not None:
            deep_mesh_terms.add(name_elem.text.strip())

# Keep only the MeSH terms that are deep enough in the tree.
mesh_df['label'] = mesh_df['label'].astype(str)
mesh_df = mesh_df[mesh_df['label'].isin(deep_mesh_terms)]

to_drop = ['Humans', 'Mice', 'Rats', "New York City",
           'Prevalence', 'Medicare', 'Medicaid',
           'Risk Factors', 'Mice, Inbred C57BL',
           'Retrospective Studies', 'Longitudinal Studies',
           'Prospective Studies', 'COVID-19', 'Mice, Knockout',
           'Incidence', 'Risk Assessment', 'Vaccination',
           'Antibodies, Viral', 'Mice, Transgenic',
           'Antibodies, Neutralizing', 'Antibodies, Monoclonal']
mesh_df = mesh_df[~mesh_df['label'].isin(to_drop)]
# Create the node type for saving.
mesh_df.to_csv(os.path.join("./Network", "MeSH.nodes.csv"), index=False)

# Create a set of valid MeSH terms.
valid_mesh = set(mesh_df['label'])
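
To make the depth filter concrete, the sketch below applies the same dot-counting rule to a small inline XML document. The XML mimics the element names read from desc2025.xml above, but the descriptor names and tree numbers are made up, and the helper `max_depth_by_name` is an illustrative assumption.

```python
import xml.etree.ElementTree as ET

# Toy records; names and tree numbers are invented for illustration.
TOY_XML = """
<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorName><String>Shallow Term</String></DescriptorName>
    <TreeNumberList><TreeNumber>A01.100</TreeNumber></TreeNumberList>
  </DescriptorRecord>
  <DescriptorRecord>
    <DescriptorName><String>Deep Term</String></DescriptorName>
    <TreeNumberList>
      <TreeNumber>B01.100.200.300.400.500.600</TreeNumber>
      <TreeNumber>C02.100</TreeNumber>
    </TreeNumberList>
  </DescriptorRecord>
</DescriptorRecordSet>
"""


def max_depth_by_name(root: ET.Element) -> dict[str, int]:
    """Map each descriptor name to its deepest tree level (number of dots + 1)."""
    depths = {}
    for record in root.findall(".//DescriptorRecord"):
        name = record.find("./DescriptorName/String")
        levels = [
            tn.text.count(".") + 1
            for tn in record.findall("./TreeNumberList/TreeNumber")
            if tn.text
        ]
        if name is not None and name.text and levels:
            depths[name.text.strip()] = max(levels)
    return depths


root = ET.fromstring(TOY_XML)
depths = max_depth_by_name(root)
min_dots = 5  # same threshold as the cell above
deep_terms = {name for name, depth in depths.items() if depth - 1 > min_dots}
```

A term with several tree numbers is kept if any one of them is deep enough, which matches the `any(...)` check in the cell above.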

# Protocol 1 - 16

### Create Coauthor: Author - Author edges

Now we will create the edges for the network. The first edge type is the coauthor edge. We collect this information from the metadata DataFrame and assign an edge between two authors if they share a valid publication. We then drop all duplicate edges. This is in accordance with Protocol 1 - 16.

In [None]:
def normalize_author_name(name: str) -> str:
    # Normalize to "lastname initials" in lowercase, e.g. "John H. Smith" -> "smith jh".
    # Only initials are kept, so authors who share a last name and initials are treated as the same person.
    parts = name.strip().split()
    if not parts:
        return ''
    lastname = parts[-1]
    initials = ''.join(p[0] for p in parts[:-1])
    return f"{lastname.lower()} {initials.lower()}"

def compare_author_names(author: str, normalized_authors: set[str]) -> bool:
    # normalized_authors must already be normalized (see normalize_author_list).
    target = normalize_author_name(author)
    return target in normalized_authors

def normalize_author_list(list_authors: list[str]) -> set[str]:
    return set(normalize_author_name(a) for a in list_authors)


coauthors_edges = metadata_df.copy()
coauthors_edges['pmid'] = coauthors_edges.index
coauthors_edges = coauthors_edges[['Authors', 'pmid']]
# Collect coauthor edges between valid authors who share a valid publication.

edges = []

valid_authors = normalize_author_list(authors_df['label'])
for pmid, row in coauthors_edges.iterrows():
    if valid_pub.get(pmid, False):
        authors = [a.strip() for a in str(row['Authors']).split(';') if a.strip()]
        
        # Create all pairwise combinations
        for a1, a2 in itertools.combinations(authors, 2):
            if compare_author_names(a1, valid_authors) and compare_author_names(a2, valid_authors):
                edges.append({'source label': a1, 'target label': a2})
                edges.append({'source label': a2, 'target label': a1})  # Add reverse direction

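# Create the edge list DataFrame (explicit columns keep it valid even when there are no edges).
coauthors_edges_df = pd.DataFrame(edges, columns=['source label', 'target label'])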

coauthors_edges_df.drop_duplicates(subset = ["source label", "target label"], inplace = True)

del coauthors_edges
coauthors_edges_df.to_csv(os.path.join("./Network/", "Authors.Coauthors.Authors.edges.csv"), index=False)
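
The pairing step above can be illustrated on a single publication. The sketch below is a simplified version that matches author names exactly rather than through `normalize_author_name`; the function name `coauthor_pairs` and the sample names are assumptions for illustration.

```python
import itertools


def coauthor_pairs(authors: list[str], allowed: set[str]) -> list[tuple[str, str]]:
    """Return both directions of every pair of allowed authors on one paper."""
    # Drop authors who are not in the allowed set before pairing.
    kept = [a for a in authors if a in allowed]
    pairs = []
    for a1, a2 in itertools.combinations(kept, 2):
        pairs.append((a1, a2))
        pairs.append((a2, a1))  # reverse direction
    return pairs


pairs = coauthor_pairs(["Ann Lee", "Bo Chen", "Cy Diaz"], {"Ann Lee", "Cy Diaz"})
```

A paper with n allowed authors contributes n(n-1) directed edges, which is why duplicates across papers are dropped afterwards.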

### Create Author - Publications edges

Now we would like to create the edges between authors and their respective publications. We again utilize the metadata DataFrame we collected earlier to create edges between authors and their respective publications and drop duplicate edges. 

In [None]:
edges = []
for pmid, row in metadata_df.iterrows():
    if not valid_pub.get(pmid, False):
        continue
    authors = [a.strip() for a in str(row['Authors']).split(';') if a.strip()]
    
    for author in authors:
        if compare_author_names(author, valid_authors):
            edges.append({'source label': pmid, 'target label': author})

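# Create the edge list DataFrame (explicit columns keep it valid even when there are no edges).
edge_df = pd.DataFrame(edges, columns=['source label', 'target label'])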

# Drop duplicates.
edge_df.drop_duplicates(subset = ["source label", "target label"], inplace = True)


# Save to CSV with the required filename
edge_df.to_csv(os.path.join('./Network', 'Publications.Publications.Authors.edges.csv'), index=False)


cols = list(edge_df.columns)
i, j = cols.index('source label'), cols.index('target label')
cols[i], cols[j] = cols[j], cols[i]

edge_df = edge_df[cols]
edge_df.to_csv(os.path.join('./Network/', "Authors.Publications.Publications.edges.csv"), index=False)

authors_publications_edge_df = edge_df.copy()
del edge_df
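
One detail of the reversed edge file is worth checking. Reordering columns changes their position in the CSV, but each header stays attached to the same values. Whether that is enough depends on how the downstream ingestion step reads the file (by position or by header name), which this section does not specify. The toy sketch below contrasts reordering with renaming; the sample values are made up.

```python
import pandas as pd

edges = pd.DataFrame({"source label": ["111"], "target label": ["Jane Doe"]})

# Reordering columns changes only their position; each name keeps its data.
reordered = edges[["target label", "source label"]]

# Swapping the names changes which values count as source and target.
renamed = edges.rename(
    columns={"source label": "target label", "target label": "source label"}
)
renamed = renamed[["source label", "target label"]]
```

If the ingestion step reads columns by header name, the `rename` variant is the one that actually reverses the edge direction.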

### Create Awards edges

Now we would like to create edges between Awards and their contact principal investigators. Therefore we create edges between the "Awards" node type and the "Authors" node type. We also create edges between the "Awards" node type and the "Publications" node type. These edges represent the awards that are associated with particular publications.

#### Create Awards: Awards-Authors edges

Here we create the edges between the contact principal investigators and their awards. We do this by taking the dataframe that we wrote out to "./grants.csv" and using the data there.

In [None]:
awards_edges = filtered_df[['PROJECT NUMBER BAK', 'Cleaned PI Name']].copy()
awards_edges = awards_edges.rename(columns = {
    "PROJECT NUMBER BAK": "source label",
    "Cleaned PI Name" : "target label"
})
awards_edges.drop_duplicates(subset = ["source label", "target label"], inplace = True)
awards_edges['target label'] = awards_edges['target label'].str.strip()
awards_edges.to_csv(os.path.join("./Network/", "Awards.Awards.Authors.edges.csv"), index=False)
cols = list(awards_edges.columns)
i, j = cols.index('source label'), cols.index('target label')
cols[i], cols[j] = cols[j], cols[i]
awards_edges = awards_edges[cols]
awards_edges.to_csv(os.path.join("./Network/", "Authors.Awards.Awards.edges.csv"), index=False)

#### Create Awards-Publications edges

Here we create the edges between the "Awards" node type and the "Publications" node type. These edges represent the awards associated with each Publication.

In [None]:
# Rename 'PROJECT NUMBER' in filtered_df to match 'Grant Number'
result_df = pd.read_csv("grant_pmids.csv")
filtered_df_renamed = filtered_df.rename(columns={'PROJECT NUMBER': 'Grant Number'})

# Merge on 'Grant Number' and bring in the 'label'
merged_df = result_df.merge(
    filtered_df_renamed[['Grant Number', 'PROJECT NUMBER BAK']],
    on='Grant Number',
    how='inner'
)
print(merged_df)
# Build directed edges: from award label to each PMID
edges = []

merged_df['PMIDs'] = merged_df['PMIDs'].apply(literal_eval)
for _, row in merged_df.iterrows():
    source_label = row['PROJECT NUMBER BAK']
    pmids = row['PMIDs']

    if isinstance(pmids, list) and pmids:
        for pmid in pmids:
            # Keep only PMIDs that passed the publication filter.
            if valid_pub.get(pmid, False):
                edges.append({
                    'source label': source_label,
                    'target label': str(pmid)
                })

# Convert to DataFrame (explicit columns keep it valid even when there are no edges) and export
edges_df = pd.DataFrame(edges, columns=['source label', 'target label'])
edges_df.drop_duplicates(subset = ["source label", "target label"], inplace = True)
edges_df.to_csv(os.path.join("./Network", "Awards.Awards.Publications.edges.csv"), index=False)
awards_publications_edge_df = edges_df
del edges_df
print(awards_publications_edge_df)
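# Build the reverse-direction edges (Publications -> Awards) by swapping the column names.
# Reordering the columns alone would leave every value under its original header.
awards_publications_edge_df = awards_publications_edge_df.rename(
    columns={'source label': 'target label', 'target label': 'source label'}
)[['source label', 'target label']]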
awards_publications_edge_df.to_csv(os.path.join("./Network", "Publications.Awards.Awards.edges.csv"), index=False)


#### Create MeSH-Publications edges

Here we create edges between the "MeSH" node type and the "Publications" node type. Each edge links a publication to one of the MeSH terms assigned to it.
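
The splitting and filtering involved can be sketched on toy data as follows. The names `toy_metadata` and `toy_valid_mesh` and their values are hypothetical, and the sketch assumes, as the loop in the next cell does, that the DataFrame index holds the PMID and that MeSH terms are stored as one `;`-separated string.

```python
import pandas as pd

# Hypothetical stand-ins for metadata_df and valid_mesh.
toy_metadata = pd.DataFrame(
    {"MeSH Terms": ["Humans; Neoplasms ;", "Mice"]},
    index=["111", "222"],
)
toy_valid_mesh = {"Neoplasms", "Mice"}

toy_mesh_edges = []
for pmid, row in toy_metadata.iterrows():
    # Split on ';', trim whitespace, and discard empty fragments.
    terms = [t.strip() for t in str(row["MeSH Terms"]).split(";") if t.strip()]
    # Keep only the terms that passed the MeSH filter.
    toy_mesh_edges.extend(
        {"source label": pmid, "target label": t} for t in terms if t in toy_valid_mesh
    )
print(toy_mesh_edges)
```

Because the cell is converted with `str(...)`, a missing value becomes the string `'nan'`; it is discarded only because `'nan'` is not in the set of valid terms.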

In [None]:
mesh_edges = []

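for pmid, row in metadata_df.iterrows():
    # Skip publications that were not flagged as valid.
    if not valid_pub.get(pmid, False):
        continue
    # MeSH terms are stored as a single ';'-separated string.
    mesh_terms = [m.strip() for m in str(row['MeSH Terms']).split(';') if m.strip()]
    for mesh in mesh_terms:
        # Keep only terms that passed the MeSH filter.
        if mesh in valid_mesh:
            mesh_edges.append({
                'source label': pmid,
                'target label': mesh
            })

# Naming the columns explicitly keeps drop_duplicates from failing when no edges were found.
publications_mesh_edges_df = pd.DataFrame(mesh_edges, columns=['source label', 'target label'])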
publications_mesh_edges_df.drop_duplicates(subset = ["source label", "target label"], inplace = True)
publications_mesh_edges_df.to_csv(os.path.join("./Network/",'Publications.MeSH.MeSH.edges.csv'), index=False)
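# Build the reverse-direction edges (MeSH -> Publications) by swapping the column names.
# Reordering the columns alone would leave every value under its original header.
publications_mesh_edges_df = publications_mesh_edges_df.rename(
    columns={'source label': 'target label', 'target label': 'source label'}
)[['source label', 'target label']]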
publications_mesh_edges_df.to_csv(os.path.join("./Network/",'MeSH.MeSH.Publications.edges.csv'), index=False)

# Protocol 1 - 17

## Format files for neo4j ingestion

Now we format the files for Neo4j ingestion. The ingestion itself is performed in a later step by the scripts "./PINetwork/src/import_csv.py" and "./PINetwork/src/import_csv_with_merge.py". Those scripts require that every *.nodes.csv file has the fields id and label, which must be unique, and that every *.edges.csv file has the fields source, relation, and target, where source and target correspond to the id column of the node files. In Protocol 1 - 15 and 1 - 16 we wrote out the nodes and edges with only label, source label, and target label. Now we create unique ids for each node type and add the required fields to the edge files. This corresponds to Protocol 1 - 17 and 1 - 18.
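
The requirements above can be checked before ingestion with a small helper. This is only a sketch: `check_ingestion_format`, `NODE_FIELDS`, `EDGE_FIELDS`, and the toy tables are hypothetical names introduced for illustration, not part of the import scripts. Because edges connect nodes of different types, the node table passed in would in practice be the concatenation of all node files.

```python
import pandas as pd

NODE_FIELDS = {"id", "label"}
EDGE_FIELDS = {"source", "relation", "target"}


def check_ingestion_format(nodes: pd.DataFrame, edges: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the check passed."""
    problems = []
    missing_node = NODE_FIELDS - set(nodes.columns)
    missing_edge = EDGE_FIELDS - set(edges.columns)
    if missing_node:
        problems.append(f"nodes missing fields: {sorted(missing_node)}")
    if missing_edge:
        problems.append(f"edges missing fields: {sorted(missing_edge)}")
    if problems:
        return problems
    if nodes["id"].duplicated().any():
        problems.append("node ids are not unique")
    # Compare ids as strings so integer ids and label-based ids can coexist.
    known = set(nodes["id"].astype(str))
    for col in ("source", "target"):
        unknown = set(edges[col].astype(str)) - known
        if unknown:
            problems.append(f"{col} ids not found in nodes: {sorted(unknown)}")
    return problems


toy_nodes = pd.DataFrame({"id": [1, 2], "label": ["Doe, Jane", "R01-A"]})
toy_edges = pd.DataFrame(
    {"source": [2, 3], "relation": ["Awards", "Awards"], "target": [1, 1]}
)
print(check_ingestion_format(toy_nodes, toy_edges))
```

In the toy input the second edge points at id 3, which no node carries, so the helper reports it.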

### Format nodes for neo4j ingestion

First we format the node files for ingestion. We build a dictionary that maps each label to a unique id and write that id to the id field of each node file. Publications are the exception: their label is used directly as the id. The same dictionary is used later when we format the edges. This corresponds to Protocol 1 - 17.
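
The id assignment can be sketched on toy data as follows. The names `toy_labels` and `toy_label_to_id` and all values are hypothetical; the sketch iterates over node types in sorted order so that the ids come out the same on every run.

```python
# Hypothetical labels per node type, standing in for the *.nodes.csv files.
toy_labels = {
    "Authors": ["Doe, Jane", " Roe, Rich "],
    "Publications": ["111", "222"],
}

toy_label_to_id = {}
next_id = 1
for node_type in sorted(toy_labels):
    for label in toy_labels[node_type]:
        label = str(label).strip()
        if label in toy_label_to_id:
            continue  # label already has an id
        if node_type == "Publications":
            # Publications use their label directly as the id.
            toy_label_to_id[label] = label
        else:
            toy_label_to_id[label] = next_id
            next_id += 1
print(toy_label_to_id)
```

One consequence of keying a single dictionary by label is that two node types sharing an identical label would also share an id.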

In [None]:
network_folder = './Network'
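# Sort the file names so ids are assigned in the same order on every run (os.listdir order is arbitrary).
files = sorted(f for f in os.listdir(network_folder) if f.endswith('.nodes.csv'))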

node_type_to_id_dict = {}
label_to_id_dict = {}

current_id = 1

for file in files:
    node_type = file.replace('.nodes.csv', '')
    path = os.path.join(network_folder, file)

    df = pd.read_csv(path)
    label_to_id = {}
    ids = []

    for label in df['label']:
        label = str(label).strip()

        if 'Publications' in node_type:
            # Publications use label directly as ID
            ids.append(label)
            if label not in label_to_id_dict:
                label_to_id_dict[label] = label  # use label as ID
        else:
            if label not in label_to_id_dict:
                label_to_id_dict[label] = current_id
                label_to_id[label] = current_id
                ids.append(current_id)
                current_id += 1
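            else:
                # Label already has an id; reuse it and record it in this node type's mapping too.
                label_to_id[label] = label_to_id_dict[label]
                ids.append(label_to_id_dict[label])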
    
    df['id'] = ids

    # Save type-specific mapping only for non-Publication types
    if 'Publications' not in node_type:
        node_type_to_id_dict[node_type] = label_to_id

    # Save updated node file
    df.to_csv(path, index=False)

# Protocol 1 - 18

### Format edges

Now we format the edge files: we rename the fields source label and target label to source_label and target_label, and we create the source, relation, and target fields. This corresponds to Protocol 1 - 18.
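
A minimal sketch of this formatting on one toy edge table is shown below. The dictionary, file name, and labels are hypothetical. The final cast to pandas' nullable `Int64` dtype is an addition for illustration: `Series.map` returns NaN for labels without an id, which turns an integer column into floats, so ids would otherwise be written as `1.0` rather than `1`. The cast is shown only for an edge file whose endpoints both have integer ids; columns holding Publication ids, which are kept as labels, are not covered here.

```python
import pandas as pd

# Hypothetical label-to-id mapping and edge table.
toy_label_to_id = {"R01-A": 1, "Doe, Jane": 2}
toy_file = "Awards.Awards.Authors.edges.csv"
toy_edges = pd.DataFrame({
    "source label": ["R01-A", "R01-Z"],  # "R01-Z" has no id
    "target label": ["Doe, Jane", "Doe, Jane"],
})

toy_edges = toy_edges.rename(
    columns={"source label": "source_label", "target label": "target_label"}
)

# The relation is the middle part of a file name like A.B.C.edges.csv.
parts = toy_file.split(".")
toy_edges["relation"] = parts[1] if len(parts) == 5 else "Unknown"

# Map labels to ids and drop edges with an unmapped endpoint.
toy_edges["source"] = toy_edges["source_label"].map(toy_label_to_id)
toy_edges["target"] = toy_edges["target_label"].map(toy_label_to_id)
toy_edges = toy_edges.dropna(subset=["source", "target"])

# Restore integer ids after the float upcast caused by the unmapped label.
toy_edges = toy_edges.astype({"source": "Int64", "target": "Int64"})
print(toy_edges)
```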

In [None]:
# Path to edge files
network_folder = './Network'
files = [f for f in os.listdir(network_folder) if f.endswith('.edges.csv')]

# Assumes label_to_id_dict already exists
# If not, you should load or build it before this step

for file in files:
    print(file)
    path = os.path.join(network_folder, file)
    df = pd.read_csv(path)
    # Rename label columns if they exist
    if 'source label' in df.columns and 'target label' in df.columns:
        df.rename(columns={
            'source label': 'source_label',
            'target label': 'target_label'
        }, inplace=True)
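    # Node labels were stripped when ids were assigned, so strip edge labels the same way before mapping.
    df['source_label'] = df['source_label'].astype(str).str.strip()
    df['target_label'] = df['target_label'].astype(str).str.strip()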
    # Add relation based on filename like A.B.C.edges.csv → B
    parts = file.split('.')
    relation = parts[1] if len(parts) == 5 else 'Unknown'
    df['relation'] = relation
    # Map source and target IDs using label_to_id_dict
    df['source'] = df['source_label'].map(label_to_id_dict)
    df['target'] = df['target_label'].map(label_to_id_dict)


    # Drop edges whose source or target label has no id.
    df = df[df['source'].notnull() & df['target'].notnull()]

    # Save updated file
    df.to_csv(path, index=False)

# Protocol 1 - 19

### Create a merged network file

Here we create a merged network of edges that we can optionally load into Cytoscape later to visually inspect the whole network.
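
As a sketch of the merge on in-memory toy tables, the snippet below concatenates two edge tables and counts edges per relation as a quick sanity check. The table contents are hypothetical, and the `source_file` column is an assumption added here so that each row can be traced back to its file during visual inspection; the cell that follows does not add it.

```python
import pandas as pd

# Hypothetical edge tables keyed by file name.
toy_frames = {
    "Awards.Awards.Authors.edges.csv": pd.DataFrame(
        {"source": [1], "relation": ["Awards"], "target": [2]}
    ),
    "Publications.MeSH.MeSH.edges.csv": pd.DataFrame(
        {"source": ["111"], "relation": ["MeSH"], "target": [3]}
    ),
}

# Tag each table with its file name, then concatenate once.
merged = pd.concat(
    [frame.assign(source_file=name) for name, frame in toy_frames.items()],
    ignore_index=True,
)
print(merged["relation"].value_counts().to_dict())
```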

In [None]:
# Collect every edge file in ./Network and concatenate them into one table.
frames = [
    pd.read_csv(os.path.join("./Network", file))
    for file in sorted(os.listdir("./Network"))
    if file.endswith(".edges.csv")
]
df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
df.to_csv("./merged_network.csv", index=False)