# Gather NIH RePORTER API Data
Started 2023-07-27 ZD  

This notebook explores using the NIH RePORTER API (Endpoint: https://api.reporter.nih.gov/v2/projects/search; Docs: https://api.reporter.nih.gov/) to gather grants data. The goal is to build a system that uses the NOFOs (Notices of Funding Opportunity) and/or Awards listed in the Key Programs file to get all awards associated with each Key Program. **This notebook will not be used for production processing.**  

**Update:** All functionality from this notebook has been copied to `nih_reporter_api.py`, `clean_grants_data.py`, or `main.py`.

**NOTE**  
From a banner at the top of https://api.reporter.nih.gov/:  

`HHS has issued a directive for OPDIVS to implement standardized terminology in funding opportunities. As part of this initiative, the Office of Extramural Research (OER) at NIH is currently in the process of updating funding opportunity templates, websites, and other relevant resources to align with the revised terminology. This information can be found in the OER Policy Announcement 2023-02.`

`Effective August 31, 2023, there will be a change in the way FOA Numbers are handled through the API Service. Moving forward, only the long FOA Number (Full_FOA) will be available, and it will be referred to as the "Opportunity Number." Additionally, the short FOA Number will no longer be supported and will be removed from the API.`

In [1]:
import pandas as pd
import requests
from time import sleep # for retrying API calls
from math import ceil # for pagination logging

# Method to import from parent directory
import os
import sys
root_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
sys.path.append(root_dir)

import config

In [2]:
# Early attempt

# def get_nih_reporter_grants(foa_values, print_meta=False):
#     """Get grants info for provided foa_values"""

#     base_url = "https://api.reporter.nih.gov/v2/projects/search"
#     grants_data = []

#     for foa in foa_values:
#         params = {
#             "criteria": {
#                 "foa": [foa],
#                 "exclude_subprojects": True,
#                 "agencies": ["NCI"]
#             },
#             "limit": 500,
#             "sort_field":"ApplId",
#             "sort_order":"asc",
#         }

#         response = requests.post(base_url, 
#                                  json=params, 
#                                  headers={
#                                      "accept": "application/json", 
#                                      "Content-Type": "application/json"})

#         if response.status_code == 200:
#             grants = response.json()
#             grants_data.extend(grants['results'])
#         else:
#             print(f"Error occurred while fetching grants for FOA '{foa}': {response.status_code}")

#         if print_meta == True:
#             print(f"Response Metadata:\n {grants['meta']}")

#     return grants_data

Some troubleshooting findings not shown here that led to the `get_nih_reporter_grants_from_nofo` function:
* The `include_fields` and `exclude_fields` parameters do not seem to work as the documentation describes. They gave unpredictable and unintuitive results, so they were removed from the API call parameters. This means we gather more info than necessary and remove unneeded columns later. Not performant, but likely negligible at our low scale. 
* The `offset` and `limit` parameters need to be used together in a loop to create manual pagination in order to capture all results. The max limit on records within a response is 500, but some NOFOs have thousands of records. Without manual pagination, only the first 500 records will be gathered before moving on to the next NOFO. 
* Using `sort_field` with `ProjectNum` throws errors. Stick with default `ApplId`, even though that's less relevant to us. Maybe not all records have ProjectNums. 
* If a blank value is passed as the NOFO, the API tries to return ~150,000 records until the max `offset` value of 14,900 is reached. 
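The pagination finding above can be sketched as a small helper that lists the `offset` values needed to cover a result set. This is only an illustration: `page_offsets` is a hypothetical name, and the `max_offset` default simply reuses the 14,900 figure from the notes above rather than a value confirmed against the API docs.

```python
def page_offsets(total_records, limit=500, max_offset=14_900):
    """List the offsets needed to page through total_records results.

    limit: records per call (500 is the max noted above).
    max_offset: assumed cap on offset, taken from the notes above.
    """
    # Highest record index that is both present and reachable
    last_index = min(total_records - 1, max_offset)
    return list(range(0, last_index + 1, limit))


# A NOFO with 1,200 records needs three calls
print(page_offsets(1200))
```

In practice the total is only known after the first call, so a helper like this would be used from the second page onward.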

In [3]:
# OBSOLETE, REPLACED LATER BY get_nih_reporter_grants
def get_nih_reporter_grants_from_nofo(nofo_values: list, print_meta=False):
    """Get grants info for provided NOFOs.
    
    :param nofo_values: list of Notice of Funding Opportunity (NOFO) strings,
        e.g. the result of splitting a semicolon-separated string.

    :param print_meta: boolean indicator. If True, print API gathering process
        results to console"""

    base_url = "https://api.reporter.nih.gov/v2/projects/search"
    grants_data = []

    for nofo in nofo_values:
        # Check for blank NOFO values
        if not nofo:
            print("Blank NOFO value encountered. Skipping.")
            continue

        # Set default values for params not likely to change
        LIMIT = 500
        MAX_ATTEMPTS = 5
        RETRY_TIME = 2

        # Set starting value for counters
        offset = 0
        page = 0
        attempts = 0

        # RePORTER API sets a max limit of 500 records per call.
        # Keep looping each call in "pages" until the number of records 
        # gathered reaches the total number of records available. 
        
        # Set a cap on the number of attempts at a failed call before 
        # moving on to the next NOFO.
        while attempts < MAX_ATTEMPTS:
            # Set parameters for API call
            params = {
                "criteria": {
                    "foa": [nofo],
                    "exclude_subprojects": True,
                    "agencies": ["NCI"]
                },
                "limit": LIMIT,
                "offset": offset,
                "sort_field": "FiscalYear",
                "sort_order": "desc",
            }

            try: 
                # Define response details
                response = requests.post(base_url, 
                                        json=params, 
                                        headers={
                                            "accept": "application/json", 
                                            "Content-Type": "application/json"})

                # If response is good, get results
                if response.status_code == 200:
                    grants = response.json()
                    # Add API source indicator
                    for grant in grants['results']:
                        grant['api_source_search'] = f"NOFO_search: {nofo}"
                    # Add grants to running list
                    grants_data.extend(grants['results'])

                    # Increase offset by limit to get next "page"
                    total_records = grants['meta']['total']
                    offset = offset + LIMIT
                    page = page + 1

                    # Print paginated partial optional metadata
                    # Consider replacing this with proper logging
                    if print_meta:
                        total_pages = max(ceil(total_records/LIMIT),1)
                        print(f"NOFO Results: {nofo} ({page}/{total_pages}): "
                              f"{grants['meta']}")

                    # Stop looping if offset has reached total record count
                    if offset >= total_records:
                        break

                # Handle 500 errors by retrying after 2 second delay
                elif response.status_code == 500:
                    attempts = attempts + 1
                    print(f"Received a 500 error for NOFO '{nofo}'. "
                          f"Retrying after {RETRY_TIME} seconds. "
                          f"Attempt {attempts}/{MAX_ATTEMPTS}")
                    sleep(RETRY_TIME)
                else:
                    print(f"Error occurred while fetching grants for NOFO "
                          f"'{nofo}': {response.status_code}")
                    break

            except requests.exceptions.RequestException as e:
                print(f"An error occurred while making the API call for NOFO "
                      f"'{nofo}': {e}")
                break

    return grants_data
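The function above notes that its `print` calls could be replaced with proper logging. A minimal sketch of that idea follows, assuming the standard-library `logging` module; the logger name and the `log_page_progress` helper are hypothetical and not part of the notebook or the production scripts.

```python
import logging
from math import ceil

# Hypothetical logger name, for illustration only
logger = logging.getLogger("nih_reporter_sketch")


def log_page_progress(nofo, page, total_records, limit=500):
    """Log one pagination step and return the message that was logged."""
    # Always report at least one page, even when no records are found
    total_pages = max(ceil(total_records / limit), 1)
    message = (f"NOFO {nofo}: page {page}/{total_pages}, "
               f"{total_records} total records")
    logger.info(message)
    return message


# Show INFO-level messages on the console
logging.basicConfig(level=logging.INFO)
log_page_progress("RFA-CA-21-052", 1, 1200)
```

Logging levels would let the per-page messages be silenced without a `print_meta` flag, while 500-error retries could still be reported as warnings.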

In [4]:
# Copy/Paste the All of Us Qualtrics NOFO cell, and convert string to list
allOfUs_string = "HG21-041;LM21-002;MH22-200;OD21-144;OD22-150;OD22-153;OTA17-002;OTA18-001;OTA19-002;OTA19-003;OTA19-004;OTA20-001;OTA20-009;OTA22-006;OTA23-001;OTA23-003;A17-446;;PA18-91;PA20-145;PA20-185;PA20-272;PAR20-150;PM23-001;PM23-002;RM21-002;RM21-003;RM21-005;RM21-006;RM21-016;SM18-009;SM21-007;SP20-002"
allOfUs_listed = allOfUs_string.split(';')

# Use complicated long list of All of Us NOFOs as NOFO search
nofo_values = allOfUs_listed

# Simple NOFO for NOFO search
#nofo_values = ['RFA-CA-21-052']

# Run grants_data
grants_data = get_nih_reporter_grants_from_nofo(nofo_values, print_meta=True)

print(f"Total records gathered: {len(grants_data)}")

NOFO Results: HG21-041 (1/1): {'search_id': 'PmNufeXFdEeemdS9wQYsOA', 'total': 0, 'offset': 0, 'limit': 500, 'sort_field': 'FiscalYear', 'sort_order': 'desc', 'sorted_by_relevance': False, 'properties': {'URL': 'https:/reporter.nih.gov/search/PmNufeXFdEeemdS9wQYsOA/projects'}}
NOFO Results: LM21-002 (1/1): {'search_id': '9bYF0UqSbk-SD-r7Bb2F6A', 'total': 0, 'offset': 0, 'limit': 500, 'sort_field': 'FiscalYear', 'sort_order': 'desc', 'sorted_by_relevance': False, 'properties': {'URL': 'https:/reporter.nih.gov/search/9bYF0UqSbk-SD-r7Bb2F6A/projects'}}
NOFO Results: MH22-200 (1/1): {'search_id': '3mKwhA1ol0OwdS-uuL3b4A', 'total': 0, 'offset': 0, 'limit': 500, 'sort_field': 'FiscalYear', 'sort_order': 'desc', 'sorted_by_relevance': False, 'properties': {'URL': 'https:/reporter.nih.gov/search/3mKwhA1ol0OwdS-uuL3b4A/projects'}}
NOFO Results: OD21-144 (1/1): {'search_id': 'fIEZPTd42USdtTLaHyb-Hw', 'total': 0, 'offset': 0, 'limit': 500, 'sort_field': 'FiscalYear', 'sort_order': 'desc', 'sorted

In [5]:
# Display first 10 grants
for grant in grants_data[:10]:
    print(grant['project_num'])

5R21CA238356-02
1R21CA238356-01A1
5R21CA230879-02
1R21CA230879-01
5R01CA270483-02
1R01CA270483-01
5R01CA264984-03
5R01CA268597-02
5R01CA259469-02
5R01CA251545-03


In [6]:
# Display all data gathered for first grant
grants_data[0]

{'appl_id': 10064138,
 'subproject_id': None,
 'fiscal_year': 2021,
 'org_name': 'EMORY UNIVERSITY',
 'org_city': 'ATLANTA',
 'org_state': 'GA',
 'org_state_name': None,
 'dept_type': 'PUBLIC HEALTH & PREV MEDICINE',
 'project_num': '5R21CA238356-02',
 'project_serial_num': 'CA238356',
 'org_country': 'UNITED STATES',
 'organization': {'org_name': 'EMORY UNIVERSITY',
  'city': None,
  'country': None,
  'org_city': 'ATLANTA',
  'org_country': 'UNITED STATES',
  'org_state': 'GA',
  'org_state_name': None,
  'dept_type': 'PUBLIC HEALTH & PREV MEDICINE',
  'fips_country_code': None,
  'org_duns': ['066469933'],
  'org_ueis': ['S352L5PJLMP8'],
  'primary_duns': '066469933',
  'primary_uei': 'S352L5PJLMP8',
  'org_fips': 'US',
  'org_ipf_code': '2384501',
  'org_zipcode': '303221007',
  'external_org_id': 2384501},
 'award_type': '5',
 'activity_code': 'R21',
 'award_amount': 216667,
 'is_active': False,
 'is_territory': False,
 'project_num_split': {'appl_type_code': '5',
  'activity_code

### We've pulled data from the API. Now to determine how to unpack and format it.

Try to make sense of this JSON response by flattening it into a dataframe
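As a side note on flattening, `pd.json_normalize` is one way to expand nested dictionaries into dotted column names. The sketch below uses a trimmed, hand-built record based on the first grant shown above; it is not the approach this notebook ends up using, and lists of dictionaries (such as `principal_investigators`) are left as lists.

```python
import pandas as pd

# Trimmed sample based on the first grant shown above
sample_grants = [{
    "appl_id": 10064138,
    "project_num": "5R21CA238356-02",
    "organization": {
        "org_name": "EMORY UNIVERSITY",
        "org_city": "ATLANTA",
        "org_state": "GA",
    },
    "principal_investigators": [{"full_name": "COLLEEN M MC BRIDE"}],
}]

# Nested dicts become columns like "organization.org_name"
df_flat = pd.json_normalize(sample_grants)
print(df_flat.columns.tolist())
```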

In [7]:
# Load grants JSON as pandas dataframe
df_grants = pd.DataFrame(grants_data)

# Check that length matches above
print(f"Row count: {len(df_grants)}")

print(f"Matches total grants gathered? {len(df_grants) == len(grants_data)}")

# Display first 5 rows
df_grants.head()

Row count: 3186
Matches total grants gathered? True


Unnamed: 0,appl_id,subproject_id,fiscal_year,org_name,org_city,org_state,org_state_name,dept_type,project_num,project_serial_num,...,arra_funded,budget_start,budget_end,cfda_code,funding_mechanism,direct_cost_amt,indirect_cost_amt,project_detail_url,date_added,api_source_search
0,10064138,,2021,EMORY UNIVERSITY,ATLANTA,GA,,PUBLIC HEALTH & PREV MEDICINE,5R21CA238356-02,CA238356,...,N,2020-12-01T12:12:00Z,2022-11-30T12:11:00Z,393,Non-SBIR/STTR,144200,72467.0,https://reporter.nih.gov/project-details/10064138,2020-12-05T07:12:23Z,NOFO_search: A17-446
1,9893342,,2020,EMORY UNIVERSITY,ATLANTA,GA,,PUBLIC HEALTH & PREV MEDICINE,1R21CA238356-01A1,CA238356,...,N,2019-12-03T12:12:00Z,2020-11-30T12:11:00Z,393,Non-SBIR/STTR,119912,65411.0,https://reporter.nih.gov/project-details/9893342,2019-12-07T07:12:55Z,NOFO_search: A17-446
2,9728901,,2019,SLOAN-KETTERING INST CAN RESEARCH,NEW YORK,NY,,,5R21CA230879-02,CA230879,...,N,2019-06-01T12:06:00Z,2022-05-31T12:05:00Z,393,Non-SBIR/STTR,126585,100762.0,https://reporter.nih.gov/project-details/9728901,2019-06-01T07:06:09Z,NOFO_search: A17-446
3,9582452,,2018,SLOAN-KETTERING INST CAN RESEARCH,NEW YORK,NY,,,1R21CA230879-01,CA230879,...,N,2018-06-20T12:06:00Z,2019-05-31T12:05:00Z,393,Non-SBIR/STTR,108750,77974.0,https://reporter.nih.gov/project-details/9582452,2018-06-23T07:06:14Z,NOFO_search: A17-446
4,10610465,,2023,UNIVERSITY OF PENNSYLVANIA,PHILADELPHIA,PA,,OTHER HEALTH PROFESSIONS,5R01CA270483-02,CA270483,...,N,2023-04-01T12:04:00Z,2024-03-31T12:03:00Z,395,Non-SBIR/STTR,530940,111689.0,https://reporter.nih.gov/project-details/10610465,2023-05-04T04:05:27Z,NOFO_search: PA18-91


In [8]:
# Show all top-level columns captured by the dataframe conversion
df_grants.columns.tolist()

['appl_id',
 'subproject_id',
 'fiscal_year',
 'org_name',
 'org_city',
 'org_state',
 'org_state_name',
 'dept_type',
 'project_num',
 'project_serial_num',
 'org_country',
 'organization',
 'award_type',
 'activity_code',
 'award_amount',
 'is_active',
 'is_territory',
 'project_num_split',
 'principal_investigators',
 'contact_pi_name',
 'program_officers',
 'agency_ic_admin',
 'agency_ic_fundings',
 'cong_dist',
 'spending_categories',
 'project_start_date',
 'project_end_date',
 'organization_type',
 'all_text',
 'foa',
 'full_foa',
 'full_study_section',
 'award_notice_date',
 'is_new',
 'mechanism_code_dc',
 'core_project_num',
 'terms',
 'pref_terms',
 'abstract_text',
 'project_title',
 'phr_text',
 'spending_categories_desc',
 'awd_doc_num',
 'init_encumbrance_date',
 'can_task',
 'special_topic_code',
 'agency_code',
 'covid_response',
 'arra_funded',
 'budget_start',
 'budget_end',
 'cfda_code',
 'funding_mechanism',
 'direct_cost_amt',
 'indirect_cost_amt',
 'project_detai

In [9]:
# Define columns to keep. Select those that match info in the current INS projects.tsv.
# project_type, lead_doc, and award_amount_category will be derived later

select_cols = [
    'project_num',
    'core_project_num',
    'appl_id',
    'fiscal_year',
    'project_title',
    'abstract_text',
    'pref_terms',
    'org_name',
    'org_city',
    'org_state',
    'org_country',
    'principal_investigators',
    'program_officers',
    'award_amount',
    'agency_ic_fundings',
    'award_notice_date',
    'project_start_date',
    'project_end_date',
    'full_foa',
]

In [10]:
# Select only the columns of interest defined above in a new dataframe
df_grants_select = df_grants[select_cols]

In [11]:
# In the grants JSON above, the PI column was nested. Check how the column 
# appears within the dataframe. 
df_grants_select['principal_investigators']

0       [{'profile_id': 1873123, 'first_name': 'COLLEE...
1       [{'profile_id': 1873123, 'first_name': 'COLLEE...
2       [{'profile_id': 11856998, 'first_name': 'Jada'...
3       [{'profile_id': 11856998, 'first_name': 'Jada'...
4       [{'profile_id': 8950071, 'first_name': 'Rebecc...
                              ...                        
3181    [{'profile_id': 12302905, 'first_name': 'Benja...
3182    [{'profile_id': 10131112, 'first_name': 'Joe',...
3183    [{'profile_id': 15167932, 'first_name': 'Gavin...
3184    [{'profile_id': 15056131, 'first_name': 'Luisa...
3185    [{'profile_id': 12628430, 'first_name': 'Cheng...
Name: principal_investigators, Length: 3186, dtype: object

### Explore how to flatten and clean up the grants data for downstream use

The following columns are multi-level json and need to be handled: 
* `principal_investigators`
* `program_officers`
* `agency_ic_fundings`

Example Structures:   
'principal_investigators': [{'profile_id': 1873123,  
    'first_name': 'COLLEEN',  
    'middle_name': 'M',  
    'last_name': 'MC BRIDE',  
    'is_contact_pi': True,  
    'full_name': 'COLLEEN M MC BRIDE',    
    'title': 'GRACE CRUM ROLLINS PROFESSOR AND CHAIR',  
    'email': None}],  
...  
  'program_officers': [{'first_name': 'VERONICA',  
    'middle_name': '',  
    'last_name': 'CHOLLETTE',  
    'full_name': 'VERONICA  CHOLLETTE',  
    'email': None}],  
...  
  'agency_ic_fundings': [{'fy': 2020,  
    'code': 'CA',  
    'name': 'National Cancer Institute',  
    'abbreviation': 'NCI',  
    'total_cost': 185323.0}],  

  
We need to extract values of interest, combine them, and then clean them as necessary.
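Before collapsing these nested lists into strings, it can help to view them as one row per person. The sketch below does that with `pd.json_normalize` and its `record_path` and `meta` arguments. The second record is a made-up placeholder to show the multi-PI case; this view is for exploration only and is not the cleaning step used below.

```python
import pandas as pd

sample_grants = [
    {
        "project_num": "5R21CA238356-02",
        "principal_investigators": [
            {"profile_id": 1873123, "full_name": "COLLEEN M MC BRIDE",
             "is_contact_pi": True},
        ],
    },
    {
        # Hypothetical multi-PI record, for illustration only
        "project_num": "EXAMPLE-MULTI-PI",
        "principal_investigators": [
            {"profile_id": 1, "full_name": "Example  Person",
             "is_contact_pi": True},
            {"profile_id": 2, "full_name": "Another Q Person",
             "is_contact_pi": False},
        ],
    },
]

# One row per investigator, with the parent project_num carried along
df_pis = pd.json_normalize(
    sample_grants,
    record_path="principal_investigators",
    meta=["project_num"],
)
print(df_pis[["project_num", "full_name", "is_contact_pi"]])
```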

In [12]:
def concatenate_full_names(row):
    """Replace json row values with full names list from within json"""

    # Check for blank row, then get each full name
    if row:
        full_names = [item['full_name'] for item in row]

        # Concatenate with a comma space between
        return ', '.join(full_names)
    else:
        # Return blank output for blank input
        return ''

In [13]:
# Test function above. Get the full name of investigators from the nested values
df_grants_select['principal_investigators'].apply(concatenate_full_names)

0                                      COLLEEN M MC BRIDE
1                                      COLLEEN M MC BRIDE
2                                 Jada Gabrielle Hamilton
3                                 Jada Gabrielle Hamilton
4       Rebecca  Ashare, Salimah H. Meghani, Brooke  W...
                              ...                        
3181                           Benjamin Peter Kleinstiver
3182                                        Joe R Delaney
3183                                            Gavin  Ha
3184                                 Luisa  Escobar Hoyos
3185                                      Chengcheng  Jin
Name: principal_investigators, Length: 3186, dtype: object

In [14]:
# Get the full name of POs from the nested values
df_grants_select['program_officers'].apply(concatenate_full_names)

0       VERONICA  CHOLLETTE
1       VERONICA  CHOLLETTE
2             Wendy  Nelson
3             Wendy  Nelson
4            SHARON A. ROSS
               ...         
3181              Jerry  Li
3182           Stefan  Maas
3183       Miguel  Ossandon
3184               Yin  Liu
3185    Phillip J. Daschner
Name: program_officers, Length: 3186, dtype: object

The name columns (PI and PO) have inconsistent capitalization and spacing. Fix that for better presentation.  
Standardize names by capitalizing the first letter of each word and keeping all others lower-case.  

This method will lose some nuance in name capitalization, e.g.:
* "da Vinci" will become "Da Vinci"
* "McKinley" will become "Mckinley"
* "Mc Kinley" will stay "Mc Kinley"
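The caveats above can be checked directly against `str.title()`. The sketch below is a variant for illustration: `standardize_name` is a hypothetical helper that collapses any run of whitespace with `split()`/`join()`, which is slightly broader than replacing double spaces.

```python
def standardize_name(name):
    """Title-case a name and collapse runs of whitespace to single spaces."""
    return " ".join(name.title().split())


examples = ["da Vinci", "McKinley", "Mc Kinley", "VERONICA  CHOLLETTE"]
for raw in examples:
    print(f"{raw!r} -> {standardize_name(raw)!r}")
```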

In [15]:
def format_name_column(name_str):
    # Capitalize the first letter of each name
    formatted_name = name_str.title()
    
    # Remove double whitespaces between names
    formatted_name = formatted_name.replace('  ',' ')
    
    return formatted_name

In [16]:
# Check name concatenation and standardization (applied together) for investigators.
df_grants_select['principal_investigators'].apply(concatenate_full_names).apply(format_name_column)

0                                      Colleen M Mc Bride
1                                      Colleen M Mc Bride
2                                 Jada Gabrielle Hamilton
3                                 Jada Gabrielle Hamilton
4       Rebecca Ashare, Salimah H. Meghani, Brooke Wor...
                              ...                        
3181                           Benjamin Peter Kleinstiver
3182                                        Joe R Delaney
3183                                             Gavin Ha
3184                                  Luisa Escobar Hoyos
3185                                       Chengcheng Jin
Name: principal_investigators, Length: 3186, dtype: object

In [17]:
# Check name concatenation and standardization (applied together) for POs
df_grants_select['program_officers'].apply(concatenate_full_names).apply(format_name_column)

0        Veronica Chollette
1        Veronica Chollette
2              Wendy Nelson
3              Wendy Nelson
4            Sharon A. Ross
               ...         
3181               Jerry Li
3182            Stefan Maas
3183        Miguel Ossandon
3184                Yin Liu
3185    Phillip J. Daschner
Name: program_officers, Length: 3186, dtype: object

In [18]:
# Define columns with nested name values
name_cols = ['principal_investigators', 'program_officers']

# Copy dataframe to avoid SettingWithCopyWarning
df_formatted = df_grants_select.copy()

# Apply name corrections to all rows
df_formatted[name_cols] = (df_formatted[name_cols]
                               .apply(lambda col: col
                                      .apply(concatenate_full_names)
                                      .apply(format_name_column)))

In [19]:
# Check formatting of name columns within first ten rows of dataframe
df_formatted[['project_num', 'principal_investigators','program_officers']].head(10)

|  | project_num | principal_investigators | program_officers |
|---|---|---|---|
| 0 | 5R21CA238356-02 | Colleen M Mc Bride | Veronica Chollette |
| 1 | 1R21CA238356-01A1 | Colleen M Mc Bride | Veronica Chollette |
| 2 | 5R21CA230879-02 | Jada Gabrielle Hamilton | Wendy Nelson |
| 3 | 1R21CA230879-01 | Jada Gabrielle Hamilton | Wendy Nelson |
| 4 | 5R01CA270483-02 | Rebecca Ashare, Salimah H. Meghani, Brooke Wor... | Sharon A. Ross |
| 5 | 1R01CA270483-01 | Rebecca Ashare, Salimah H. Meghani, Brooke Wor... | Alexis Diane Bakos |
| 6 | 5R01CA264984-03 | Mikhail Nikiforov | Thomas K. Howcroft |
| 7 | 5R01CA268597-02 | Colin Goding, Constantinos Koumenis | Elizabeth Woodhouse |
| 8 | 5R01CA259469-02 | Nitin S Baliga, Charles S Cobbs, Parvinder Hothi | Joseph Kofi Agyin |
| 9 | 5R01CA251545-03 | Gokul M. Das | Joanna M. Watson |


Names are handled. Now to extract the NCI funding from the nested agency funding column.  
The API query above selects only grants where NCI is providing administrative support. This step will extract the actual funding NCI is providing, which in some cases may be zero.  

This means that the current approach is gathering NCI-supported grants, but not necessarily NCI-funded grants.  
We are not gathering the converse of this, where NCI is not the admin but is providing funding. 

In [20]:
# Check format of first value in nested funding column
df_formatted['agency_ic_fundings'][0]

[{'fy': 2021,
  'code': 'CA',
  'name': 'National Cancer Institute',
  'abbreviation': 'NCI',
  'total_cost': 216667.0}]

In [23]:
# Check for rows with more than one set of nested funding values
nested_values = df_formatted['agency_ic_fundings'].apply(len)
df_formatted[nested_values > 1]['agency_ic_fundings']

176     [{'fy': 2023, 'code': 'CA', 'name': 'National ...
444     [{'fy': 2023, 'code': 'CA', 'name': 'National ...
728     [{'fy': 2023, 'code': 'CA', 'name': 'National ...
1210    [{'fy': 2023, 'code': 'CA', 'name': 'National ...
1421    [{'fy': 2022, 'code': 'CA', 'name': 'National ...
1564    [{'fy': 2022, 'code': 'CA', 'name': 'National ...
1664    [{'fy': 2022, 'code': 'CA', 'name': 'National ...
1865    [{'fy': 2022, 'code': 'CA', 'name': 'National ...
2788    [{'fy': 2022, 'code': 'CA', 'name': 'National ...
2864    [{'fy': 2022, 'code': 'CA', 'name': 'National ...
2880    [{'fy': 2022, 'code': 'CA', 'name': 'National ...
2921    [{'fy': 2022, 'code': 'CA', 'name': 'National ...
2923    [{'fy': 2022, 'code': 'CA', 'name': 'National ...
2925    [{'fy': 2022, 'code': 'CA', 'name': 'National ...
2977    [{'fy': 2021, 'code': 'CA', 'name': 'National ...
3180    [{'fy': 2022, 'code': 'CA', 'name': 'National ...
Name: agency_ic_fundings, dtype: object

Some `agency_ic_fundings` values list multiple agencies with details for each.  
This is worth keeping for reference, but relies on API data that will be removed before exporting to project TSVs. (We only keep the NCI and total fundings)
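To make the "NCI and total" idea concrete, here is a small sketch that reads both numbers from one `agency_ic_fundings` list. It assumes the total is the sum of the per-IC `total_cost` values, which this notebook does not confirm (the API also provides `award_amount`). `nci_and_total_cost` is a hypothetical helper, and the sample reuses the NIGMS and NCI amounts from the multi-agency output shown further down, with other keys omitted.

```python
def nci_and_total_cost(fundings):
    """Return (NCI total_cost, summed total_cost across all listed ICs)."""
    # "CA" is the NCI code used in agency_ic_fundings
    nci_cost = sum(f["total_cost"] for f in fundings if f["code"] == "CA")
    total_cost = sum(f["total_cost"] for f in fundings)
    return int(nci_cost), int(total_cost)


# Sample with two funding ICs; amounts taken from the output in this notebook
sample_fundings = [
    {"code": "CA", "abbreviation": "NCI", "total_cost": 21745.0},
    {"code": "GM", "abbreviation": "NIGMS", "total_cost": 320000.0},
]
print(nci_and_total_cost(sample_fundings))
```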

In [138]:
def find_projects_with_multiple_agencies(df):
    """Find rows in grants dataframe that have multiple funding agencies"""
    
    selected_rows = []

    # Iterate through each row in the dataframe
    for index, row in df.iterrows():
        agency_fundings = row['agency_ic_fundings']

        # Check if there are multiple agency fundings for the project
        if len(agency_fundings) > 1:
            nci_funding = None
            other_funding = []

            # Separate NCI and other fundings
            for entry in agency_fundings:
                if entry['abbreviation'] == 'NCI':
                    nci_funding = entry['total_cost']
                else:
                    other_funding.append(entry)

            # Create a new row in selected_rows for each other agency funding
            for other_entry in other_funding:
                row_data = {
                    'project_num': row['project_num'],
                    'code': other_entry['code'],
                    'name': other_entry['name'],
                    'abbreviation': other_entry['abbreviation'],
                    'other_cost': other_entry['total_cost'],
                    'nci_cost': nci_funding
                }
                selected_rows.append(row_data)

    # Create a new dataframe from the selected rows
    selected_df = pd.DataFrame(selected_rows)

    return selected_df

In [141]:
# Run function to get df of funding from other agencies for each project
df_other_funding = find_projects_with_multiple_agencies(df_formatted)

df_other_funding.head(20)

|  | project_num | code | name | abbreviation | other_cost | nci_cost |
|---|---|---|---|---|---|---|
| 0 | 5R01CA267554-02 | GM | National Institute of General Medical Sciences | NIGMS | 320000.0 | 21745.0 |
| 1 | 5R01CA263504-02 | GM | National Institute of General Medical Sciences | NIGMS | 320000.0 | 24109.0 |
| 2 | 5R01CA266415-02 | GM | National Institute of General Medical Sciences | NIGMS | 313600.0 | 32047.0 |
| 3 | 5R01CA263747-02 | GM | National Institute of General Medical Sciences | NIGMS | 320000.0 | 11736.0 |
| 4 | 1R01CA267554-01A1 | GM | National Institute of General Medical Sciences | NIGMS | 320000.0 | 22189.0 |
| 5 | 1R01CA263504-01A1 | GM | National Institute of General Medical Sciences | NIGMS | 320000.0 | 31131.0 |
| 6 | 1R01CA263747-01A1 | GM | National Institute of General Medical Sciences | NIGMS | 320000.0 | 18506.0 |
| 7 | 1R01CA266415-01A1 | GM | National Institute of General Medical Sciences | NIGMS | 320000.0 | 33224.0 |
| 8 | 3P30CA076292-24S7 | OD | NIH Office of the Director | OD | 96290.0 | 53709.0 |
| 9 | 3P30CA016058-46S3 | OD | NIH Office of the Director | OD | 50000.0 | 50000.0 |


In [142]:
def extract_total_cost(fundings):
    """Extract the total NCI funding from IC award totals"""

    for funding in fundings:
        # CA is the NCI code
        if funding['code'] == 'CA':
            return int(funding['total_cost'])
    # Return 0 if no NCI funding found
    return 0

In [143]:
# Replace the nested column with one just for NCI funding amount
df_formatted['agency_ic_fundings'] = (df_formatted['agency_ic_fundings']
                                        .apply(extract_total_cost))

In [144]:
# Check first 10 rows of new funding column
df_formatted[['project_num','agency_ic_fundings']].head(10)

|  | project_num | agency_ic_fundings |
|---|---|---|
| 0 | 5R21CA238356-02 | 216667 |
| 1 | 1R21CA238356-01A1 | 185323 |
| 2 | 5R21CA230879-02 | 227347 |
| 3 | 1R21CA230879-01 | 186724 |
| 4 | 5R01CA270483-02 | 642629 |
| 5 | 1R01CA270483-01 | 699429 |
| 6 | 5R01CA261691-03 | 556856 |
| 7 | 1R01CA272991-01A1 | 526386 |
| 8 | 1R01CA273716-01A1 | 655228 |
| 9 | 1R01CA273357-01A1 | 634435 |


### Rename columns to match expected values within INS-ETL

In [145]:
# Create a dict with old:expected col names
expected_col_rename_dict = {
    "project_num": "project_id",
    "core_project_num": "queried_project_id",
    "appl_id": "application_id",
    "pref_terms": "keywords",
    "agency_ic_fundings": "nci_funded_amount"
}

In [148]:
# Check new INS names
df_formatted.rename(columns=expected_col_rename_dict).head()

Unnamed: 0,project_id,queried_project_id,application_id,fiscal_year,project_title,abstract_text,keywords,org_name,org_city,org_state,org_country,principal_investigators,program_officers,award_amount,nci_funded_amount,award_notice_date,project_start_date,project_end_date,full_foa
0,5R21CA238356-02,R21CA238356,10064138,2021,Evaluating Deliberative Democracy Approaches w...,PROJECT SUMMARY\nSeveral national organization...,Address;Adoption;African;Agreement;Area;Awaren...,EMORY UNIVERSITY,ATLANTA,GA,UNITED STATES,Colleen M Mc Bride,Veronica Chollette,216667,216667,2020-11-24T12:11:00Z,2019-12-03T12:12:00Z,2022-11-30T12:11:00Z,PA-17-446
1,1R21CA238356-01A1,R21CA238356,9893342,2020,Evaluating Deliberative Democracy Approaches w...,PROJECT SUMMARY\nSeveral national organization...,Address;Adoption;African;Agreement;Area;Awaren...,EMORY UNIVERSITY,ATLANTA,GA,UNITED STATES,Colleen M Mc Bride,Veronica Chollette,185323,185323,2019-12-02T12:12:00Z,2019-12-03T12:12:00Z,2021-11-30T12:11:00Z,PA-17-446
2,5R21CA230879-02,R21CA230879,9728901,2019,Responses to Genetic Risk Modifier Testing Amo...,PROJECT SUMMARY\nWomen with a BRCA1/2 gene mut...,Address;Adverse effects;Affect;Age;BRCA1 gene;...,SLOAN-KETTERING INST CAN RESEARCH,NEW YORK,NY,UNITED STATES,Jada Gabrielle Hamilton,Wendy Nelson,227347,227347,2019-05-17T12:05:00Z,2018-06-20T12:06:00Z,2022-05-31T12:05:00Z,PA-17-446
3,1R21CA230879-01,R21CA230879,9582452,2018,Responses to Genetic Risk Modifier Testing Amo...,PROJECT SUMMARY\nWomen with a BRCA1/2 gene mut...,Address;Adverse effects;Affect;Age;BRCA1 gene;...,SLOAN-KETTERING INST CAN RESEARCH,NEW YORK,NY,UNITED STATES,Jada Gabrielle Hamilton,Wendy Nelson,186724,186724,2018-06-20T12:06:00Z,2018-06-20T12:06:00Z,2020-05-31T12:05:00Z,PA-17-446
4,5R01CA270483-02,R01CA270483,10610465,2023,Cannabis use and outcomes in ambulatory patien...,PROJECT ABSTRACT\nBetween 24-40% of cancer pat...,Adjuvant Analgesic;Adult;African American;Amer...,UNIVERSITY OF PENNSYLVANIA,PHILADELPHIA,PA,UNITED STATES,"Rebecca Ashare, Salimah H. Meghani, Brooke Wor...",Sharon A. Ross,642629,642629,2023-05-02T12:05:00Z,2022-04-15T12:04:00Z,2027-03-31T12:03:00Z,PA-18-917


This looks good, though I still need to add in some INS idiosyncrasies later:  
* `project_type`=`Grant` (hardcoded for now)
* assign `lead_doc` using Key Player. Will get tricky for shared grants/NOFOs
* `award_amount_category` (e.g. $4M to $10M)
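Two of the items above could be sketched with `pd.cut` and a constant column. The bin edges and most labels below are assumptions for illustration; only the "$4M to $10M" label appears in the notes above, and the real INS categories may differ. `add_award_amount_category` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical bin edges and labels; only "$4M to $10M" comes from the notes
BIN_EDGES = [0, 1_000_000, 4_000_000, 10_000_000, float("inf")]
BIN_LABELS = ["<$1M", "$1M to $4M", "$4M to $10M", ">=$10M"]


def add_award_amount_category(df, amount_col="award_amount"):
    """Return a copy of df with project_type and award_amount_category added."""
    out = df.copy()
    # right=False makes each bin include its lower edge, e.g. [4M, 10M)
    out["award_amount_category"] = pd.cut(
        out[amount_col], bins=BIN_EDGES, labels=BIN_LABELS, right=False
    )
    # Hardcoded for now, as noted above
    out["project_type"] = "Grant"
    return out


df_example = pd.DataFrame({"award_amount": [216667, 5_000_000, 12_000_000]})
df_categorized = add_award_amount_category(df_example)
print(df_categorized)
```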

### Gathering and cleaning so far have used a single input
Try using the cleaned key programs file as input and gathering/cleaning all provided NOFO/Award values

In [149]:
# Set filepath using config
key_programs_path = f"../{config.CLEANED_KEY_PROGRAMS_CSV}"

# Load cleaned Key Programs CSV as a dataframe
df_key_programs = pd.read_csv(key_programs_path) 

In [150]:
# Check loaded dataframe
df_key_programs.head()

Unnamed: 0,program_name,program_acronym,focus_area,doc,contact_pi,contact_pi_email,contact_nih,contact_nih_email,nofo,award,program_link,data_link,cancer_type
0,Pancreatic Adenocarcinoma Stromal Reprograming...,PSRC/PASSCODE,DCC,"DCB,DCTD","MAITRA, ANIRBAN",amaitra@mdanderson.org,"Hildesheim, Jeff; UJHAZY, PETER",hildesheimj@mail.nih.gov; ujhazyp@mail.nih.gov,RFA-CA-21-041; RFA-CA-21-042,1 U24 CA274274-01,https://www.cancer.gov/about-nci/organization/...,,Pancreas Cancer
1,Oncology Models Forum (U24),OMF,"DCC,Cancer Moonshot",DCB,"BUTTE, ATUL J",atul.butte@ucsf.edu,Christine Nadeau;Joanna Watson,christine.nadeau@nih.gov;watsonjo@mail.nih.gov,PAR14-239; PAR-16-059; PAR-17-245; PAR-20-131,5U24CA195858,https://www.cancer.gov/about-nci/organization/...,,This program focuses on cancer broadly - not l...
2,Pediatric Preclinical in Vivo Testing (PIVOT),PIVOT,"Pediatric/AYA,DCC,Cancer Moonshot","CIB,CTEP,DCTD","BULT, CAROL J",carol.bult@jax.org,"Smith, Malcolm",Malcolm.Smith@nih.gov,RFA-CA-20-034; RFA-CA-14-018; RFA-CA-20-041,U24CA263963,https://ctep.cancer.gov/MajorInitiatives/Pedia...,,This program focuses on cancer broadly - not l...
3,Cancer Immunologic Data Commons,CIDC,Cancer Moonshot,DCTD,"CERAMI, ETHAN",cerami@jimmy.harvard.edu,"THURIN, MAGDALENA",thurinm@mail.nih.gov,RFA-CA-17-006; RFA-CA-22-038,1U24CA224316,https://dctd.cancer.gov/ResearchNetworks/cimac...,,This program focuses on cancer broadly - not l...
4,CANCER IMMUNE MONITORING AND ANALYSIS CENTERS,CIMAC,Cancer Moonshot,"CTEP,DCTD",,,"Thurin, Magdalena",thurinm@mail.nih.gov,RFA-CA-17-005;RFA-CA-22-038,,https://dctd.cancer.gov/ResearchNetworks/cimac...,,This program focuses on cancer broadly - not l...


In [31]:
# Try pulling and formatting the NOFO list from each program into a usable format
for index, row in df_key_programs.iterrows():
    program_name = row['program_name']
    nofo_str = row['nofo']

    # Check for NaN value to avoid error on split()    
    if pd.isna(nofo_str):
        print(f"{program_name}: No NOFOs defined.")
        
    # Separate NOFOs by semicolon, remove spaces, and then add to list
    else:
        nofo_list = [x.strip() for x in nofo_str.split(';')]
        print(f"{program_name}: {nofo_list}")

Pancreatic Adenocarcinoma Stromal Reprograming ConSortium (PSRC/PASSCODE): ['RFA-CA-21-041', 'RFA-CA-21-042']
Oncology Models Forum (U24): ['PAR14-239', 'PAR-16-059', 'PAR-17-245', 'PAR-20-131']
Pediatric Preclinical in Vivo Testing (PIVOT) : ['RFA-CA-20-034', 'RFA-CA-14-018', 'RFA-CA-20-041']
Cancer Immunologic Data Commons: ['RFA-CA-17-006', 'RFA-CA-22-038']
CANCER IMMUNE MONITORING AND ANALYSIS CENTERS: ['RFA-CA-17-005', 'RFA-CA-22-038']
Cellular Cancer Biology Imaging Research: ['RFA-CA-21-002']
Barrett’s Esophagus Translational Research Network (BETRNet): ['RFA-CA-16-007', 'RFA-CA-16-006', 'RFA-CA-10-014']
Fusion Oncoproteins in Childhood Cancers (FusOnc2): ['RFA-CA-17-049', 'RFA-CA-19-016']
The Early Detection Research Network (EDRN): ['CA-14-017', 'CA-21-034', 'CA-21-033', 'CA-21-035', 'CA-14-015', 'CA-14-016', 'CA-16-009']
Cancer Prevention-Interception Targeted Agent Discovery Program: ['RFA-CA-21-038', 'RFA-CA-22-055', 'RFA-CA-21-039']
 Small Cell Lung Cancer (SCLC) Consortiu

In [32]:
# Pull API fields to keep from config
select_cols = config.API_FIELDS
select_cols

['project_num',
 'core_project_num',
 'appl_id',
 'fiscal_year',
 'project_title',
 'abstract_text',
 'pref_terms',
 'org_name',
 'org_city',
 'org_state',
 'org_country',
 'principal_investigators',
 'program_officers',
 'award_amount',
 'agency_ic_fundings',
 'award_notice_date',
 'project_start_date',
 'project_end_date',
 'full_foa',
 'api_source_search']

In [152]:
# Remove spaces and non-alphanumeric characters from program name for filename
for program_name in df_key_programs['program_name']:
    newName = ''.join(filter(str.isalnum, program_name))
    print(newName)

PancreaticAdenocarcinomaStromalReprogramingConSortiumPSRCPASSCODE
OncologyModelsForumU24
PediatricPreclinicalinVivoTestingPIVOT
CancerImmunologicDataCommons
CANCERIMMUNEMONITORINGANDANALYSISCENTERS
CellularCancerBiologyImagingResearch
BarrettsEsophagusTranslationalResearchNetworkBETRNet
FusionOncoproteinsinChildhoodCancersFusOnc2
TheEarlyDetectionResearchNetworkEDRN
CancerPreventionInterceptionTargetedAgentDiscoveryProgram
SmallCellLungCancerSCLCConsortium
AcquiredResistancetoTherapyNetworkARTNet
AllofUs
ChildhoodCancerSurvivorStudy
TheAdjuvantLungCancerEnrichmentMarkerIdentificationandSequencingTrialsALCHEMIST
PaCMEN
AIDSandCancerSpecimenResourceACSR
SPOREinBladderCancer
SPORETargetedTherapiesforGlioma
TheMemorialSloanKetteringCancerCenterSPOREinLeukemia
TheUniversityofTexasMDAndersonCancerCenterSPOREinHepatocellularCarcinoma
BrainTumorSPOREGrant
VanderbiltIngramCancerCenterSPOREinGastrointestinalCancer
ADMIRALStudyAdmixtureanalysisofacutelymphoblasticleukemiainAfricanAmericanchildren

Not shown here: some ad-hoc analyses investigating odd gathering results for the ALCHEMIST Program.

The values in the NOFO column for ALCHEMIST are actually Award numbers. Looking back at `key_programs.tsv`, the columns on that row were shifted.   
I first suspected a csv -> df -> tsv conversion issue and switched all formats to csv. (Update: the column shift was present in the raw data, so it was not a filetype conversion issue.)    
Added a `qualtrics_output_2023-07-19_manual_fix.csv` in which the following are corrected:
* Column shift starting with NOFOs in ALCHEMIST
* Removed extra semicolon in All of Us NOFOs that split PA18-91 into PA18- and 91  


Updated config to point to this manually fixed csv instead of the original. Eventually, code will be needed to detect and/or fix these issues automatically.  
Manually editing the raw Qualtrics csv also causes some odd encoding behavior, especially in the last column name, which contains a newline ("check all that \n apply").
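
As a rough idea of what automatic detection could look like, the sketch below flags NOFO entries that look like award numbers (the ALCHEMIST column shift) or like fragments left by a stray semicolon (the All of Us case). The regular expression is an assumption based only on the award formats seen in this notebook, and `check_nofo_cell` is a hypothetical helper, not part of the pipeline.

```python
import re

# Assumed pattern for NIH award / core project numbers such as
# "1 U24 CA274274-01", "5U24CA195858", or "U01CA224146".
AWARD_PATTERN = re.compile(
    r"^\d?\s*[A-Z][A-Z0-9]{2}\s*[A-Z]{2}\d{6}(-\d{2}[A-Z0-9]*)?$"
)

def check_nofo_cell(nofo_str):
    """Return a list of (value, reason) pairs for suspicious NOFO entries."""
    issues = []
    # NaN cells are floats, so anything that is not a string has no entries
    if not isinstance(nofo_str, str):
        return issues
    for value in (v.strip() for v in nofo_str.split(";")):
        if not value:
            issues.append((value, "empty entry"))
        elif AWARD_PATTERN.match(value.upper()):
            issues.append((value, "looks like an award number"))
        elif value.endswith("-") or value.isdigit():
            issues.append((value, "looks like a fragment of a split NOFO"))
    return issues

# Illustrative inputs only; these are not the actual raw Qualtrics values
print(check_nofo_cell("RFA-CA-21-041; RFA-CA-21-042"))
print(check_nofo_cell("U01CA224146; PA18-; 91"))
```

A check like this only reports problems. Deciding how to repair a shifted row would still need a manual look or additional rules.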

## Add in Grants data when specified in qualtrics csv

Reuse and tweak the function used to call the API for NOFOs. There's probably a more clever way to do this.

In [154]:
# OBSOLETE, replaced later by get_nih_reporter_grants
def get_nih_reporter_grants_from_award(award_values:list, print_meta=False):
    """Get grants info for provided Award Numbers.
    
    :param award_values: list of core project and/or award numbers, 
        already split from the semicolon-separated string. 

    :param print_meta: boolean indicator. If True, print API gathering process
        results to console"""

    base_url = "https://api.reporter.nih.gov/v2/projects/search"
    grants_data = []

    for award in award_values:
        # Check for blank Award values
        if not award:
            print(f"Blank award value encountered. Skipping.")
            continue

        # Set default values for params not likely to change
        LIMIT = 500
        MAX_ATTEMPTS = 5
        RETRY_TIME = 2

        # Set starting value for counters
        offset = 0
        page = 0
        attempts = 0

        # RePORTER API sets a max limit of 500 records per call.
        # Keep looping each call in "pages" until the number of records 
        # gathered reaches the total number of records available. 
        
        # Set a cap on the number of attempts at a failed call before 
        # moving on to the next award.
        while attempts < MAX_ATTEMPTS:
            # Set parameters for API call
            params = {
                "criteria": {
                    "project_nums": [award],
                    "exclude_subprojects": True,
                    "agencies": ["NCI"]
                },
                "limit": LIMIT,
                "offset": offset,
                "sort_field": "FiscalYear",
                "sort_order": "desc",
            }

            try: 
                # Define response details
                response = requests.post(base_url, 
                                        json=params, 
                                        headers={
                                            "accept": "application/json", 
                                            "Content-Type": "application/json"})

                # If response is good, get results
                if response.status_code == 200:
                    grants = response.json()
                    # Add API source indicator
                    for grant in grants['results']:
                        grant['api_source_search'] = f"Award_search: {award}"
                    # Add grants to running list
                    grants_data.extend(grants['results'])

                    # Increase offset by limit to get next "page"
                    total_records = grants['meta']['total']
                    offset = offset + LIMIT
                    page = page + 1

                    # Print paginated partial optional metadata
                    # Consider replacing this with proper logging
                    if print_meta == True:
                        total_pages = max(ceil(total_records/LIMIT),1)
                        print(f"Award Results: {award} ({page}/{total_pages}):"
                              f" {grants['meta']}")

                    # Stop looping if offset has reached total record count
                    if offset >= total_records:
                        break

                # Handle 500 errors by retrying after 2 second delay
                elif response.status_code == 500:
                    attempts = attempts + 1
                    print(f"Received a 500 error for Award '{award}'. "
                          f"Retrying after {RETRY_TIME} seconds. "
                          f"Attempt {attempts}/{MAX_ATTEMPTS}")
                    sleep(RETRY_TIME)
                else:
                    print(f"Error occurred while fetching grants for Award "
                          f"'{award}': {response.status_code}")
                    break

            except requests.exceptions.RequestException as e:
                print(f"An error occurred while making the API call for "
                      f"Award '{award}': {e}")
                break

    return grants_data

Test the award-based gathering on a single Program with Awards listed

In [156]:
# Copy/paste the PaCMEN Qualtrics Award cell, and convert string to list
PaCMEN_string = "U01CA224146; U01CA224193-01; 4U01CA224348-02"
PaCMEN_listed = PaCMEN_string.split(';')

# Use list of PaCMEN awards as award_values
award_values = PaCMEN_listed

# Run grants_data
pacmen_grants_data = get_nih_reporter_grants_from_award(award_values, print_meta=True)

print(f"Total records gathered: {len(pacmen_grants_data)}")

Award Results: U01CA224146 (1/1): {'search_id': 'nyqgRG1wA0G3yTdXozjvdw', 'total': 3, 'offset': 0, 'limit': 500, 'sort_field': 'FiscalYear', 'sort_order': 'desc', 'sorted_by_relevance': True, 'properties': {'URL': 'https:/reporter.nih.gov/search/nyqgRG1wA0G3yTdXozjvdw/projects'}}
Award Results:  U01CA224193-01 (1/1): {'search_id': 'AoBk7GHFFUGb5XfJSya4Eg', 'total': 1, 'offset': 0, 'limit': 500, 'sort_field': 'FiscalYear', 'sort_order': 'desc', 'sorted_by_relevance': True, 'properties': {'URL': 'https:/reporter.nih.gov/search/AoBk7GHFFUGb5XfJSya4Eg/projects'}}
Award Results:  4U01CA224348-02 (1/1): {'search_id': 'UoEURctrv0qBX83NonhcqQ', 'total': 1, 'offset': 0, 'limit': 500, 'sort_field': 'FiscalYear', 'sort_order': 'desc', 'sorted_by_relevance': True, 'properties': {'URL': 'https:/reporter.nih.gov/search/UoEURctrv0qBX83NonhcqQ/projects'}}
Total records gathered: 5


In [159]:
# Quick test of converting award-gathered grants to dataframe
df_pacmen = pd.DataFrame(pacmen_grants_data)
df_pacmen.head()

Unnamed: 0,appl_id,subproject_id,fiscal_year,org_name,org_city,org_state,org_state_name,dept_type,project_num,project_serial_num,...,arra_funded,budget_start,budget_end,cfda_code,funding_mechanism,direct_cost_amt,indirect_cost_amt,project_detail_url,date_added,api_source_search
0,10250566,,2021,DANA-FARBER CANCER INST,BOSTON,MA,,,5U01CA224146-03,CA224146,...,N,2021-09-01T12:09:00Z,2022-08-31T12:08:00Z,353,Non-SBIR/STTR,350000,261615,https://reporter.nih.gov/project-details/10250566,2021-09-04T04:09:59Z,Award_search: U01CA224146
1,10242454,,2020,DANA-FARBER CANCER INST,BOSTON,MA,,,4U01CA224146-02,CA224146,...,N,2020-09-01T12:09:00Z,2021-08-31T12:08:00Z,353,Non-SBIR/STTR,350000,261615,https://reporter.nih.gov/project-details/10242454,2020-09-05T07:09:23Z,Award_search: U01CA224146
2,9449587,,2017,DANA-FARBER CANCER INST,BOSTON,MA,,,1U01CA224146-01,CA224146,...,N,2017-09-30T12:09:00Z,2020-08-31T12:08:00Z,353,Non-SBIR/STTR,1050000,795039,https://reporter.nih.gov/project-details/9449587,2017-10-01T02:10:16Z,Award_search: U01CA224146
3,9450411,,2017,FRED HUTCHINSON CANCER RESEARCH CENTER,SEATTLE,WA,,,1U01CA224193-01,CA224193,...,N,2017-09-30T12:09:00Z,2020-08-31T12:08:00Z,353,Non-SBIR/STTR,1389466,457363,https://reporter.nih.gov/project-details/9450411,2017-10-01T02:10:16Z,Award_search: U01CA224193-01
4,10242459,,2020,MASSACHUSETTS GENERAL HOSPITAL,BOSTON,MA,,,4U01CA224348-02,CA224348,...,N,2020-09-01T12:09:00Z,2021-08-31T12:08:00Z,353,Non-SBIR/STTR,350000,227095,https://reporter.nih.gov/project-details/10242459,2020-09-05T07:09:23Z,Award_search: 4U01CA224348-02


Looks promising. Use the stand-in `process_program` function to add award-gathered grants data to nofo-gathered grants data and clean all in a loop.

To iterate through each program, we need to gather and clean during the loop. Make a placeholder cleaning function.

In [160]:
def clean_grants_data(grants_data, select_cols):
    """Create clean dataframes from JSON API responses."""
    
    # Load JSON as dataframe
    df_grants = pd.DataFrame(grants_data)

    df_grants_selected = df_grants[select_cols]

    # Other cleaning steps 
    # ... 
    clean_grants = df_grants_selected.copy()
    #...

    return clean_grants
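
One thing to watch in the placeholder above: `df_grants[select_cols]` raises a `KeyError` if any selected column is absent, which is what happens when `grants_data` is an empty list. A possible workaround, sketched here with a hypothetical helper name, is to use `reindex` so missing columns are created and filled with NaN.

```python
import pandas as pd

def select_columns_safely(grants_data, select_cols):
    """Build a dataframe with exactly select_cols, even if some are missing."""
    df_grants = pd.DataFrame(grants_data)
    # reindex adds any missing columns (filled with NaN) instead of raising KeyError
    return df_grants.reindex(columns=select_cols)

# An empty API result still yields a dataframe with the expected columns
empty_result = select_columns_safely([], ["project_num", "fiscal_year"])

# A record missing one of the selected fields gets NaN for that field
partial_result = select_columns_safely(
    [{"project_num": "5U01CA224146-03"}], ["project_num", "fiscal_year"]
)
print(empty_result.columns.tolist())
print(partial_result)
```

Whether silently filling missing columns is preferable to failing loudly depends on how the downstream cleaning steps treat NaN values.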

In [161]:
# THIS FUNCTION IS OVERWRITTEN LATER IN THE NOTEBOOK
def process_program(program_name, nofo_list, award_list, select_cols):
    """Create grants TSVs for each key program."""

    # Get grants data using NIH RePORTER API
    nofo_grants_data = get_nih_reporter_grants_from_nofo(nofo_list, print_meta=True)
    award_grants_data = get_nih_reporter_grants_from_award(award_list, print_meta=True)

    # Combine data
    grants_data = nofo_grants_data + award_grants_data

    # Create dataframe and clean grants data
    clean_grants = clean_grants_data(grants_data, select_cols)

    # Export grants data as tsv for each program
    # TEST EXPORT DISABLED TO AVOID FUTURE CONFUSION
    # Use program names with only alphanumeric characters for filename
    # file_name = f"{''.join(filter(str.isalnum, program_name))}.csv" # Changed to csv
    # clean_grants.to_csv(file_name, index=False)

In [162]:
# Iterate through nofos and awards for each program
for index, row in df_key_programs.iterrows():
    program_name = row['program_name']
    nofo_str = row['nofo']
    award_str = row['award']

    print(f"---\n{program_name}")

    # Skip any programs where no funding is provided
    if pd.isna(nofo_str) and pd.isna(award_str):
        print(f"No NOFOs or Awards defined. Skipping.")
    else:
        nofo_list = [] if pd.isna(nofo_str) else [x.strip() for x in nofo_str.split(';')]
        award_list = [] if pd.isna(award_str) else [x.strip() for x in award_str.split(';')]

        if not nofo_list:
            process_program(program_name, nofo_list, award_list, select_cols)
        elif not award_list:
            process_program(program_name, nofo_list, award_list, select_cols)
        else:
            process_program(program_name, nofo_list, award_list, select_cols)


---
Pancreatic Adenocarcinoma Stromal Reprograming ConSortium (PSRC/PASSCODE)
NOFO Results: RFA-CA-21-041 (1/1): {'search_id': 'dcZHO0I8r0iXdqU3Euiv3g', 'total': 6, 'offset': 0, 'limit': 500, 'sort_field': 'FiscalYear', 'sort_order': 'desc', 'sorted_by_relevance': False, 'properties': {'URL': 'https:/reporter.nih.gov/search/dcZHO0I8r0iXdqU3Euiv3g/projects'}}
NOFO Results: RFA-CA-21-042 (1/1): {'search_id': 'AQtzkxzxDkyUqvDBBxTYtw', 'total': 1, 'offset': 0, 'limit': 500, 'sort_field': 'FiscalYear', 'sort_order': 'desc', 'sorted_by_relevance': False, 'properties': {'URL': 'https:/reporter.nih.gov/search/AQtzkxzxDkyUqvDBBxTYtw/projects'}}
Award Results: 1 U24 CA274274-01 (1/1): {'search_id': 'DSlmuUgSH0yOZPE9meEhLA', 'total': 1, 'offset': 0, 'limit': 500, 'sort_field': 'FiscalYear', 'sort_order': 'desc', 'sorted_by_relevance': True, 'properties': {'URL': 'https:/reporter.nih.gov/search/DSlmuUgSH0yOZPE9meEhLA/projects'}}
---
Oncology Models Forum (U24)
NOFO Results: PAR14-239 (1/1): {'sear

Success! Results will look good once the cleaning functions are properly added.  
First, let's clean up and combine the NOFO and Award API functions into one.

In [164]:
def get_nih_reporter_grants(search_values:list, search_type:str, print_meta=False):
    """Get grants info for either NOFOs or Awards as specified.
    
    :param search_values: list of values to search (already split from the
        semicolon-separated string)

    :param search_type: type of search (e.g., 'nofo' or 'award')

    :param print_meta: boolean indicator. If True, print API gathering 
                        process results to console.
    """

    # Set the search field based on the search_type
    if search_type == 'nofo':
        search_field = "foa"
    elif search_type == 'award':
        search_field = "project_nums"
    else:
        raise ValueError("Invalid search type.")

    base_url = "https://api.reporter.nih.gov/v2/projects/search"
    grants_data = []

    for search_value in search_values:
        # Check for blank search values
        if not search_value:
            print(f"Blank {search_type} value encountered. Skipping.")
            continue

        # Set default values for params not likely to change
        LIMIT = 500
        MAX_ATTEMPTS = 5
        RETRY_TIME = 2

        # Set starting value for counters
        offset = 0
        page = 0
        attempts = 0

        # RePORTER API sets a max limit of 500 records per call.
        # Keep looping each call in "pages" until the number of records 
        # gathered reaches the total number of records available. 
        
        # Set a cap on the number of attempts at a failed call before 
        # moving on to the next search value.
        while attempts < MAX_ATTEMPTS:
            # Set parameters for API call
            params = {
                "criteria": {
                    search_field: [search_value],
                    "exclude_subprojects": True,
                    "agencies": ["NCI"]
                },
                "limit": LIMIT,
                "offset": offset,
                "sort_field": "FiscalYear",
                "sort_order": "desc",
            }

            try: 
                # Define response details
                response = requests.post(base_url, 
                                        json=params, 
                                        headers={
                                            "accept": "application/json", 
                                            "Content-Type": "application/json"})

                # If response is good, get results
                if response.status_code == 200:
                    grants = response.json()
                    # Add API source indicator
                    for grant in grants['results']:
                        grant['api_source_search'] = f"{search_type}_{search_value}"
                    # Add grants to running list
                    grants_data.extend(grants['results'])

                    # Increase offset by limit to get next "page"
                    total_records = grants['meta']['total']
                    offset = offset + LIMIT
                    page = page + 1

                    # Print paginated partial optional metadata
                    # Consider replacing this with proper logging
                    if print_meta == True:
                        total_pages = max(ceil(total_records/LIMIT),1)
                        print(f"{search_type}: {search_value} "
                              f"{page}/{total_pages}): {grants['meta']}")

                    # Stop looping if offset has reached total record count
                    if offset >= total_records:
                        break

                # Handle 500 errors by retrying after 2 second delay
                elif response.status_code == 500:
                    attempts = attempts + 1
                    print(f"Received a 500 error for "
                          f"{search_type} '{search_value}'. "
                          f"Retrying after {RETRY_TIME} seconds. "
                          f"Attempt {attempts}/{MAX_ATTEMPTS}")
                    sleep(RETRY_TIME)
                else:
                    print(f"Error occurred while fetching grants for "
                          f"{search_type} '{search_value}': "
                          f"{response.status_code}")
                    break

            except requests.exceptions.RequestException as e:
                print(f"An error occurred while making the API call for "
                      f"{search_type} '{search_value}': {e}")
                break

    return grants_data
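
The pagination arithmetic used in the loop above can be checked without calling the API. The sketch below uses a hypothetical helper, `page_offsets`, that lists the offset each call would use for a given total, mirroring the `LIMIT`/`offset` logic and the `total_pages` calculation in the function.

```python
from math import ceil

def page_offsets(total_records, limit=500):
    """Return the offset used for each paginated call."""
    # At least one call is always made, even when there are no records
    total_pages = max(ceil(total_records / limit), 1)
    return [page * limit for page in range(total_pages)]

print(page_offsets(6))      # one page, as in the RFA-CA-21-041 result
print(page_offsets(501))    # just over one page
print(page_offsets(1200))   # three pages
```

The first call always happens at offset 0 because the total is only known after a response comes back.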

Rework `process_program` to use the new `get_nih_reporter_grants`, which handles both NOFO and Award searches.

In [165]:
def process_program(program_name, nofo_list, award_list, select_cols):
    """Create grants TSVs for each key program."""

    # Get grants data using NIH RePORTER API for NOFOs and Awards
    nofo_grants_data = get_nih_reporter_grants(nofo_list, 'nofo', print_meta=True)
    award_grants_data = get_nih_reporter_grants(award_list, 'award', print_meta=True)

    # Combine data
    combined_grants_data = nofo_grants_data + award_grants_data

    # Create dataframe and clean grants data
    clean_grants = clean_grants_data(combined_grants_data, select_cols)

    # Export grants data as tsv for each program
    # EXPORT DISABLED TO AVOID CONFUSION
    # Use program names with only alphanumeric characters for filename
    # file_name = f"{''.join(filter(str.isalnum, program_name))}.csv"
    # clean_grants.to_csv(file_name, index=False)

In [166]:
for index, row in df_key_programs.iterrows():
    program_name = row['program_name']
    nofo_str = row['nofo']
    award_str = row['award']

    print(f"---\n{program_name}")

    if pd.isna(nofo_str) and pd.isna(award_str):
        print(f"No NOFOs or Awards defined. Skipping.")
    elif pd.isna(nofo_str):
        award_list = [x.strip() for x in award_str.split(';')]
        process_program(program_name, [], award_list, select_cols)
    elif pd.isna(award_str):
        nofo_list = [x.strip() for x in nofo_str.split(';')]
        process_program(program_name, nofo_list, [], select_cols)
    else:
        nofo_list = [x.strip() for x in nofo_str.split(';')]
        award_list = [x.strip() for x in award_str.split(';')]
        process_program(program_name, nofo_list, award_list, select_cols)

---
Pancreatic Adenocarcinoma Stromal Reprograming ConSortium (PSRC/PASSCODE)
nofo: RFA-CA-21-041 1/1): {'search_id': 'X7dvS4WqO06mT-I5XBDLWg', 'total': 6, 'offset': 0, 'limit': 500, 'sort_field': 'FiscalYear', 'sort_order': 'desc', 'sorted_by_relevance': False, 'properties': {'URL': 'https:/reporter.nih.gov/search/X7dvS4WqO06mT-I5XBDLWg/projects'}}
nofo: RFA-CA-21-042 1/1): {'search_id': 'jJdCM8xcrE6gmTroy1l0Ng', 'total': 1, 'offset': 0, 'limit': 500, 'sort_field': 'FiscalYear', 'sort_order': 'desc', 'sorted_by_relevance': False, 'properties': {'URL': 'https:/reporter.nih.gov/search/jJdCM8xcrE6gmTroy1l0Ng/projects'}}
award: 1 U24 CA274274-01 1/1): {'search_id': 'KIVKsorZW0WzmI9Bvp1UNA', 'total': 1, 'offset': 0, 'limit': 500, 'sort_field': 'FiscalYear', 'sort_order': 'desc', 'sorted_by_relevance': True, 'properties': {'URL': 'https:/reporter.nih.gov/search/KIVKsorZW0WzmI9Bvp1UNA/projects'}}
---
Oncology Models Forum (U24)
nofo: PAR14-239 1/1): {'search_id': 'mcqaPlAZqE2YONF1Y7jC-A', 't

In [167]:
# Rework the string-to-list conversion to reduce redundant redundancy
for index, row in df_key_programs.iterrows():
    program_name = row['program_name']
    nofo_str = row['nofo']
    award_str = row['award']

    print(f"---\n{program_name}")
    if isinstance(award_str, str): 
        award_list = [x.strip() for x in award_str.split(';')]
    else: award_list = []
    if isinstance(nofo_str, str): 
        nofo_list = [x.strip() for x in nofo_str.split(';')]
    else: nofo_list = []

    print(award_list)
    print(nofo_list)

---
Pancreatic Adenocarcinoma Stromal Reprograming ConSortium (PSRC/PASSCODE)
['1 U24 CA274274-01']
['RFA-CA-21-041', 'RFA-CA-21-042']
---
Oncology Models Forum (U24)
['5U24CA195858']
['PAR14-239', 'PAR-16-059', 'PAR-17-245', 'PAR-20-131']
---
Pediatric Preclinical in Vivo Testing (PIVOT) 
['U24CA263963']
['RFA-CA-20-034', 'RFA-CA-14-018', 'RFA-CA-20-041']
---
Cancer Immunologic Data Commons
['1U24CA224316']
['RFA-CA-17-006', 'RFA-CA-22-038']
---
CANCER IMMUNE MONITORING AND ANALYSIS CENTERS
[]
['RFA-CA-17-005', 'RFA-CA-22-038']
---
Cellular Cancer Biology Imaging Research
[]
['RFA-CA-21-002']
---
Barrett’s Esophagus Translational Research Network (BETRNet)
[]
['RFA-CA-16-007', 'RFA-CA-16-006', 'RFA-CA-10-014']
---
Fusion Oncoproteins in Childhood Cancers (FusOnc2)
[]
['RFA-CA-17-049', 'RFA-CA-19-016']
---
The Early Detection Research Network (EDRN)
[]
['CA-14-017', 'CA-21-034', 'CA-21-033', 'CA-21-035', 'CA-14-015', 'CA-14-016', 'CA-16-009']
---
Cancer Prevention-Interception Targeted

Check that searching with an empty NOFO or Award list does not produce unexpected results.

In [168]:
# Use pacmen awards and blank nofo
pac_str = "U01CA224146; U01CA224193-01; 4U01CA224348-02"
pac_list = [award.strip() for award in pac_str.split(';')]
nofo_list = []

# Gather grants from API for pacmen awards and blank nofos
award_data = get_nih_reporter_grants(pac_list,'award',True)
nofo_data = get_nih_reporter_grants(nofo_list, 'nofo',True)

# Combine into a single grants list
grants_data = award_data + nofo_data

print(f"Total records found: {len(grants_data)}")

award: U01CA224146 1/1): {'search_id': 'kv_-tkwWkEmIGB_LnjxPIw', 'total': 3, 'offset': 0, 'limit': 500, 'sort_field': 'FiscalYear', 'sort_order': 'desc', 'sorted_by_relevance': True, 'properties': {'URL': 'https:/reporter.nih.gov/search/kv_-tkwWkEmIGB_LnjxPIw/projects'}}
award: U01CA224193-01 1/1): {'search_id': 'Uiu5MSe8zkuiHURByOANcQ', 'total': 1, 'offset': 0, 'limit': 500, 'sort_field': 'FiscalYear', 'sort_order': 'desc', 'sorted_by_relevance': True, 'properties': {'URL': 'https:/reporter.nih.gov/search/Uiu5MSe8zkuiHURByOANcQ/projects'}}
award: 4U01CA224348-02 1/1): {'search_id': 'Wn3PaQp4Nku7vZGTcB7gjw', 'total': 1, 'offset': 0, 'limit': 500, 'sort_field': 'FiscalYear', 'sort_order': 'desc', 'sorted_by_relevance': True, 'properties': {'URL': 'https:/reporter.nih.gov/search/Wn3PaQp4Nku7vZGTcB7gjw/projects'}}
Total records found: 5
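
The empty NOFO list adds no records because the `for` loop over an empty list never runs, so no API call is made. The string-to-list conversion that feeds those lists could be wrapped in a small helper. `split_values` below is a hypothetical name, and dropping empty entries is an added assumption so that a stray semicolon does not produce a blank search value.

```python
def split_values(cell):
    """Convert a semicolon-separated cell into a list of stripped, non-empty values."""
    # NaN cells from pandas are floats, so any non-string becomes an empty list
    if not isinstance(cell, str):
        return []
    return [value.strip() for value in cell.split(";") if value.strip()]

print(split_values("U01CA224146; U01CA224193-01; 4U01CA224348-02"))
print(split_values(float("nan")))
print(split_values("PA-18-917;;"))
```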


## All the pieces look good
Time to refactor and add to `data_preparation.py`, `clean_grants_data.py`, and `main.py`