# Summary Statistics 
2023-08-08 ZD  

This notebook explores options for gathering summary statistics and other reporting data from the processed grants data.

In [1]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import re
from itertools import combinations


# Method to import from parent directory
import os
import sys
root_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
sys.path.append(root_dir)

import config

### Load processed Key Program grants data into a single dataframe for stats

In [2]:
# Define directory containing processed grant data
processed_dir = '../'+config.PROCESSED_DIR

# Define directory to store reports
reports_dir = '../'+config.REPORTS_DIR

# # Backup paths to allow for using same set of gathered data during notebook development
# processed_dir = '../' + 'data/processed/2023-07-19/api-gathered-2023-08-25'
# reports_dir = '../' + 'reports/2023-07-19/api-gathered-2023-08-25'

# Create the reports directory if it doesn't already exist
os.makedirs(reports_dir, exist_ok=True)

print(f"Project data pulled from: {processed_dir}")
print(f"Reports will be output to: {reports_dir}")

Project data pulled from: ../data/processed/2023-07-19/api-gathered-2023-08-30
Reports will be output to: ../reports/2023-07-19/api-gathered-2023-08-30


In [3]:
# Load grants data
grants_filename = os.path.join(processed_dir, 'project.tsv')
df = pd.read_csv(grants_filename, sep='\t')

# Rename program.program_id column
df.rename(columns={'program.program_id':'program'}, inplace=True)

### Quick detour to handle abstract text cleaning

In [4]:
# Pull one abstract to check for odd character encoding
text = df.abstract_text[1]
text

'IMMUNE MONITORING AND ANALYSIS OF CANCER AT STANFORD (IMACS) Abstract The Center for Immune Monitoring and Analysis of Cancer at Stanford (IMACS) will perform highly comprehensive assays of immune phenotype and function for NCI-identified clinical trials. These will include standardized assays already developed on CyTOF, high-dimensional flow cytometry, Luminex, TCRseq, and RNAseq platforms. As part of the program we will also standardize and offer as assays Stanford-invented technologies under development, including Multiplexed Ion Beam Imaging (MIBI) and Assays of Transposon- Accessible Chromatin (ATAC-seq). We have designed our center structure to work with investigators to define the assays best suited to the immunological questions being posed, and match these with the required sample types. We will perform quality control measures on all assays, as well as generate a standard report for each assay and project. Data will be organized via our online database, Stanford Data Miner, 

The raw text contains odd characters such as "\xad" (soft hyphen) and "\xa0" (non-breaking space). Double spaces were an issue before as well.  
Define a function that cleans the column with regex and space replacement.


In [5]:
def clean_abstract(text):
    # Define a list of characters to be removed or replaced
    chars_to_remove = ['\xad']
    chars_to_space = ['  ']

    if isinstance(text, str):
        # Remove unwanted characters (no space added)
        for char in chars_to_remove:
            text = text.replace(char, '')
        # Replace unwanted characters with a space
        for char in chars_to_space:
            text = text.replace(char, ' ')

        # Remove non-breaking spaces, newlines, and other unwanted characters
        cleaned_text = re.sub(r'[\s\xa0]+', ' ', text).strip()
        return cleaned_text
        
    # Return original value if it is NaN or float
    else: 
        return text

In [6]:
# Try it on the example abstract from before
clean = clean_abstract(text)
clean

'IMMUNE MONITORING AND ANALYSIS OF CANCER AT STANFORD (IMACS) Abstract The Center for Immune Monitoring and Analysis of Cancer at Stanford (IMACS) will perform highly comprehensive assays of immune phenotype and function for NCI-identified clinical trials. These will include standardized assays already developed on CyTOF, high-dimensional flow cytometry, Luminex, TCRseq, and RNAseq platforms. As part of the program we will also standardize and offer as assays Stanford-invented technologies under development, including Multiplexed Ion Beam Imaging (MIBI) and Assays of Transposon- Accessible Chromatin (ATAC-seq). We have designed our center structure to work with investigators to define the assays best suited to the immunological questions being posed, and match these with the required sample types. We will perform quality control measures on all assays, as well as generate a standard report for each assay and project. Data will be organized via our online database, Stanford Data Miner, 

In [7]:
# Clean abstract column
df['abstract_text'] = df['abstract_text'].apply(clean_abstract)

In [8]:
# Check top 10 abstracts 
for abstract in df['abstract_text'][0:10]: print(abstract)

ABSTRACT Advances in immunotherapy have shown efficacy in various cancers. Immunotherapeutic approaches are becoming the new treatment modality in some cancer types. However, clinical efficacy is still limited to a marginal number of patients due to a newly developing understanding of the complex tumor microenvironment and a limited ability to appropriately select patients who have specific biomarker signatures and have the potential to optimally respond to specific immunotherapy or combination strategies. An additional challenge in assessment of biomarkers that show clinical utility to predict patient benefit includes the use of different methodologies and platforms that make it difficult to make accurate conclusions on the biomarkers in question. Thus, optimized biomarker strategies that can overcome immune barriers will allow tailoring of the therapeutic approaches and result in bold interventions that would be most beneficial to individual patients. The National Cancer Moonshot was

No sign of the problem characters. There is likely a more general way to clean this without listing each character explicitly, but this works well enough to remove the egregious encoding oddities and can be expanded as others are identified.
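
One less explicit option is to lean on Unicode metadata rather than a hand-maintained character list. The sketch below is only an assumption about what that could look like, not a tested replacement for `clean_abstract`. The name `clean_abstract_generic` is hypothetical. NFKC normalization folds compatibility characters such as "\xa0" into plain equivalents, and characters in the Unicode "Cf" (format) category, which includes the soft hyphen "\xad", are dropped. Note that NFKC also rewrites other compatibility characters (ligatures, superscripts), which may or may not be wanted for abstracts.

```python
import re
import unicodedata


def clean_abstract_generic(text):
    """Hypothetical variant of clean_abstract that avoids listing characters."""
    # Leave NaN / non-string values untouched, as the original function does
    if not isinstance(text, str):
        return text
    # Fold compatibility characters (e.g. non-breaking space) to plain forms
    text = unicodedata.normalize("NFKC", text)
    # Drop invisible format characters such as the soft hyphen
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    # Collapse any run of whitespace into a single space
    return re.sub(r"\s+", " ", text).strip()


example = "Trans\xadposon\xa0 Accessible\n Chromatin"
print(clean_abstract_generic(example))
```

It could be applied the same way as the original, for example `df['abstract_text'].apply(clean_abstract_generic)`.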

### Start exploring for patterns and stats to report out

In [9]:
# Look at a single row in detail
df.loc[0]

project_num                                                  1U24CA224285-01
core_project_num                                                 U24CA224285
appl_id                                                              9455402
fiscal_year                                                             2017
project_title              Translational Cancer Immune Monitoring and Ana...
abstract_text              ABSTRACT Advances in immunotherapy have shown ...
pref_terms                 Affect;Antibodies;Basic Science;Bioinformatics...
org_name                                UNIVERSITY OF TX MD ANDERSON CAN CTR
org_city                                                             HOUSTON
org_state                                                                 TX
org_country                                                    UNITED STATES
principal_investigators    Gheath Al-Atrash, Cara L Haymaker, Ignacio I. ...
program_officers                                            Magdalena Thurin

In [10]:
df.dtypes

project_num                object
core_project_num           object
appl_id                     int64
fiscal_year                 int64
project_title              object
abstract_text              object
pref_terms                 object
org_name                   object
org_city                   object
org_state                  object
org_country                object
principal_investigators    object
program_officers           object
award_amount                int64
agency_ic_fundings          int64
award_notice_date          object
project_start_date         object
project_end_date           object
full_foa                   object
api_source_search          object
program                    object
dtype: object

In [11]:
# Check fiscal years of all grants
df['fiscal_year'].value_counts().reset_index().sort_values(by='index')

Unnamed: 0,index,fiscal_year
23,2000,1
22,2001,2
19,2002,3
18,2003,3
17,2004,3
12,2005,6
21,2006,2
15,2007,5
20,2008,2
14,2009,5


In [12]:
# Check columns
df.columns.tolist()

['project_num',
 'core_project_num',
 'appl_id',
 'fiscal_year',
 'project_title',
 'abstract_text',
 'pref_terms',
 'org_name',
 'org_city',
 'org_state',
 'org_country',
 'principal_investigators',
 'program_officers',
 'award_amount',
 'agency_ic_fundings',
 'award_notice_date',
 'project_start_date',
 'project_end_date',
 'full_foa',
 'api_source_search',
 'program']

In [13]:
# Get number of core projects for each program
df.groupby('program')['core_project_num'].nunique().reset_index()

Unnamed: 0,program,core_project_num
0,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...,1
1,AcquiredResistancetoTherapyNetworkARTNet,6
2,AllofUs,1866
3,BarrettsEsophagusTranslationalResearchNetworkB...,4
4,BrainTumorSPOREGrant,1
5,CANCERIMMUNEMONITORINGANDANALYSISCENTERS,4
6,CancerPreventionInterceptionTargetedAgentDisco...,2
7,CellularCancerBiologyImagingResearch,4
8,FredHutchinsonCancerResearchCenterLungSPORE,1
9,FusionOncoproteinsinChildhoodCancersFusOnc2,9


### Build a summary of all Programs and the number of projects for each

In [14]:
# Get the year from the date objects
# Lose some resolution but make the data easier to read for stats
df['project_start_date_year'] = df['project_start_date'].apply(lambda x: int(x[:4]))
df['project_end_date_year'] = df['project_end_date'].apply(lambda x: int(x[:4]))
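
The slicing above assumes every date is a non-missing string that starts with a four-digit year. If missing dates ever appear, a more forgiving variant could parse the column and keep gaps as missing values. This is a sketch on made-up dates; the ISO-style date format is an assumption, since the raw values are not shown in this notebook.

```python
import pandas as pd

# Made-up dates for illustration; None stands in for a missing date
toy_dates = pd.Series(["2017-09-30", "2022-06-01", None])

# errors="coerce" turns missing or unparseable values into NaT.
# "Int64" (capital I) is the nullable integer dtype, so gaps stay <NA>
# instead of forcing the whole column to float.
toy_years = pd.to_datetime(toy_dates, errors="coerce").dt.year.astype("Int64")
print(toy_years)
```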

In [15]:
# Copy fiscal year column for later min and max stats
df['fiscal_year_copy'] = df['fiscal_year']

In [16]:
# Define functions to apply to each column
agg_funcs = {
    'api_source_search': 'nunique',
    'core_project_num': 'nunique',
    'project_num': 'nunique',
    'agency_ic_fundings': 'sum', # This will be a little off due to duplicates!
    'project_start_date_year': 'min',
    'project_end_date_year': 'max',
    'fiscal_year': 'min',
    'fiscal_year_copy': 'max'
}

In [17]:
# # Define column titles better suited for reporting
# rename_dict = {
#     "program": "Program",
#     "core_project_num": "Core Project Count",
#     "project_num": "Grant/Award Count",
#     'agency_ic_fundings': "Total NCI Funding (since 2000)", # This will be a little off due to duplicates! 
#     "api_source_search": "Provided NOFOs/Awards with Associated Grants",
#     "project_start_date_year": "Earliest Project Start Date",
#     "project_end_date_year": "Latest Project End Date",
#     "fiscal_year": "Earliest Fiscal Year",
#     "fiscal_year_copy": "Latest Fiscal Year"
# }

In [18]:
# Group by 'program' and apply aggregation functions defined above
summary_stat_df = df.groupby('program').agg(agg_funcs).reset_index()

# # Rename columns for better presentation as defined above
# summary_stat_df.rename(columns=rename_dict, inplace=True)

# # Store program summary 
# program_summary_filename = reports_dir + '/' + 'programSummaryStats.csv'
# summary_stat_df.to_csv(program_summary_filename, index=False)

In [19]:
summary_stat_df

Unnamed: 0,program,api_source_search,core_project_num,project_num,agency_ic_fundings,project_start_date_year,project_end_date_year,fiscal_year,fiscal_year_copy
0,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...,1,1,4,2107456,2020,2023,2020,2022
1,AcquiredResistancetoTherapyNetworkARTNet,2,6,6,7357128,2017,2027,2022,2022
2,AllofUs,6,1866,3208,1440658962,1981,2028,2018,2023
3,BarrettsEsophagusTranslationalResearchNetworkB...,3,4,40,38637870,2011,2024,2011,2021
4,BrainTumorSPOREGrant,1,1,26,44541554,2002,2023,2002,2022
5,CANCERIMMUNEMONITORINGANDANALYSISCENTERS,2,4,8,55551765,2017,2028,2017,2023
6,CancerPreventionInterceptionTargetedAgentDisco...,1,2,2,2412236,2022,2027,2022,2022
7,CellularCancerBiologyImagingResearch,1,4,8,12810400,2021,2026,2021,2023
8,FredHutchinsonCancerResearchCenterLungSPORE,1,1,6,12339863,2019,2024,2019,2023
9,FusionOncoproteinsinChildhoodCancersFusOnc2,2,9,16,73063983,2018,2024,2018,2022


For the most part, the earliest project start year aligns with the earliest fiscal year. The outliers are the All of Us, ALCHEMIST, and EDRN programs, whose earliest start dates fall well before their earliest fiscal year. This might indicate projects that received additional funds or supplements many years after the project began. Could this be data deposition or a similar modernization effort?
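
To make these outliers easier to spot than by eye, the gap between the earliest fiscal year and the earliest start year can be computed directly. This is a minimal sketch using three rows copied from the summary table above; the 10-year cutoff (`GAP_THRESHOLD`) is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

# Three rows copied from the program summary table above
toy_summary = pd.DataFrame({
    "program": [
        "AllofUs",
        "BrainTumorSPOREGrant",
        "AcquiredResistancetoTherapyNetworkARTNet",
    ],
    "project_start_date_year": [1981, 2002, 2017],
    "fiscal_year": [2018, 2002, 2022],
})

# Years between the earliest project start and the earliest fiscal year gathered
toy_summary["start_to_fy_gap"] = (
    toy_summary["fiscal_year"] - toy_summary["project_start_date_year"]
)

# Arbitrary cutoff (an assumption) for what counts as a "very early" start
GAP_THRESHOLD = 10
outliers = toy_summary[toy_summary["start_to_fy_gap"] > GAP_THRESHOLD]
print(outliers[["program", "start_to_fy_gap"]])
```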

## Enrich Key Program CSV with Grants data
This will begin to mock up the INS Key Programs page as requested by ODS. 

In [20]:
# Use config to get path and load clean key programs csv 
key_programs_filename = '../' + config.CLEANED_KEY_PROGRAMS_CSV
key_programs_df = pd.read_csv(key_programs_filename)

In [21]:
# Make a temporary program ID column to use for mapping stats to key programs
key_programs_df['program'] = key_programs_df['program_name'].apply(lambda name: ''.join(filter(str.isalnum, name)))

In [22]:
# Check columns in key programs df
key_programs_df.columns.tolist()

['program_name',
 'program_acronym',
 'focus_area',
 'doc',
 'contact_pi',
 'contact_pi_email',
 'contact_nih',
 'contact_nih_email',
 'nofo',
 'award',
 'program_link',
 'data_link',
 'cancer_type',
 'program']

In [23]:
# Check columns in program summary stats df
summary_stat_df.columns.tolist()

['program',
 'api_source_search',
 'core_project_num',
 'project_num',
 'agency_ic_fundings',
 'project_start_date_year',
 'project_end_date_year',
 'fiscal_year',
 'fiscal_year_copy']

In [24]:
# Add stats to key programs df using temp program ID as connector
key_programs_enriched = pd.merge(left=key_programs_df, 
                                right=summary_stat_df[['program',
                                                       'core_project_num', 
                                                       'project_num',
                                                       'agency_ic_fundings']], 
                                on='program')
# Drop temp program ID column
key_programs_enriched.drop(columns='program', inplace=True)

# Rename stat columns 
stat_renamer_dict = {
    'core_project_num': 'core_project_count',
    'project_num': 'grant_count',
    'agency_ic_fundings': 'total_nci_funding',
}
key_programs_enriched.rename(columns=stat_renamer_dict, inplace=True)

In [25]:
key_programs_enriched.columns.tolist()

['program_name',
 'program_acronym',
 'focus_area',
 'doc',
 'contact_pi',
 'contact_pi_email',
 'contact_nih',
 'contact_nih_email',
 'nofo',
 'award',
 'program_link',
 'data_link',
 'cancer_type',
 'core_project_count',
 'grant_count',
 'total_nci_funding']

In [26]:
# Store enriched Key Programs Stats Table
key_program_stats_filename = reports_dir + '/' + 'keyProgramStats.csv'
key_programs_enriched.to_csv(key_program_stats_filename, index=False)
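
Note that `pd.merge` defaults to an inner join, so any Key Program whose temporary ID has no match in the grant stats is silently dropped from the enriched table. If those programs should stay visible, one option is a left join with `indicator=True`. This is a sketch on made-up rows; the program names and counts are hypothetical.

```python
import pandas as pd

# Hypothetical key programs; the second one has no gathered grants
toy_programs = pd.DataFrame({
    "program_name": ["Example Program (EP)", "Program Without Grants"],
})
# Same temporary ID rule as above: keep only alphanumeric characters
toy_programs["program"] = toy_programs["program_name"].apply(
    lambda name: "".join(filter(str.isalnum, name))
)

# Hypothetical stats keyed by the same temporary ID
toy_stats = pd.DataFrame({
    "program": ["ExampleProgramEP"],
    "project_num": [8],
})

# A left join keeps every key program; _merge records whether stats were found
toy_enriched = pd.merge(
    toy_programs, toy_stats, on="program", how="left", indicator=True
)
unmatched = toy_enriched.loc[
    toy_enriched["_merge"] == "left_only", "program_name"
].tolist()
print(unmatched)
```

Programs listed in `unmatched` either have no grants gathered or have a name that does not reduce to the same ID, which is worth checking before publishing the table.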

## Continue exploring data report options

In [27]:
# Get a df of all grants with project start dates before year 2000
early_project_df = df[df['project_start_date_year'] < 2000]

In [28]:
# # Export to csv for quick ad-hoc analysis
# early_project_df.to_csv('earlyProjectReport.csv',index=False)

In [29]:
# Group to find patterns and common sources for the projects with very early start dates
early_project_df.groupby(['program', 'api_source_search', 'core_project_num', 'project_title'])['project_num'].nunique().reset_index()

Unnamed: 0,program,api_source_search,core_project_num,project_title,project_num
0,AllofUs,nofo_PA20-185,R01CA031845,Synthetic Studies Related to Cancer Research/T...,2
1,AllofUs,nofo_PA20-185,R01CA047296,A Pathway of Tumor Suppression,2
2,AllofUs,nofo_PA20-185,R01CA053840,Protein Tyrosine Dephosphorylation & Signal Tr...,2
3,AllofUs,nofo_PA20-185,R01CA067007,Mismatch Repair and Carcinogenesis,2
4,AllofUs,nofo_PA20-185,R01CA067985,DNA Damage Repair by MUTYH and MUTYH Variants ...,3
...,...,...,...,...,...
100,AllofUs,nofo_PA20-272,P30CA082103,Project HOPE: The Pediatric/AYA Omics Project,1
101,AllofUs,nofo_PA20-272,P30CA082709,The Big Ten Electronic Health Record Consortiu...,1
102,AllofUs,nofo_PA20-272,U24CA055727,CHILDHOOD CANCER SURVIVOR STUDY: Somatic and G...,1
103,TheAdjuvantLungCancerEnrichmentMarkerIdentific...,award_U10CA031946,U10CA031946,CANCER AND LEUKEMIA GROUP B,5


### Check for "Bottleneck Effect" grants

In [30]:
def find_rows_with_different_values(df, shared_column, compare_column):
    """Find rows that share a value in a specified column
    but have different values in another specified column.
    """

    grouped = df.groupby(shared_column)[compare_column].transform('nunique')
    selected_row_df = df[grouped > 1]

    return selected_row_df
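
A quick check of the helper on made-up data, repeating the function so the snippet runs on its own: project X1 appears under two programs and is returned, while X2 appears under a single program and is filtered out.

```python
import pandas as pd


def find_rows_with_different_values(df, shared_column, compare_column):
    """Find rows that share a value in one column but differ in another."""
    # Per-row count of distinct compare_column values within each shared group
    grouped = df.groupby(shared_column)[compare_column].transform("nunique")
    return df[grouped > 1]


# Made-up rows for illustration
toy = pd.DataFrame({
    "core_project_num": ["X1", "X1", "X2", "X2"],
    "program": ["ProgA", "ProgB", "ProgA", "ProgA"],
})

toy_shared = find_rows_with_different_values(toy, "core_project_num", "program")
print(toy_shared)
```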

In [31]:
# Check for Core Projects found in multiple programs and compare sources

shared_column = 'core_project_num'
compare_column = 'program'

# Get grant-level rows with same project but different program
df_shared = find_rows_with_different_values(df, shared_column, compare_column)

# Group with the provided search value and count unique grants 
df_shared_projects = (df_shared.groupby(
                        ['api_source_search', shared_column, compare_column])
                        .size().reset_index()
                        .rename(columns={0:'grant_count'}))

# Store shared programs
shared_projects_filename = reports_dir + '/' + 'sharedProjects.csv'
df_shared_projects.to_csv(shared_projects_filename, index=False)

For any Core Project gathered for more than one program (e.g. shared NOFOs or a NOFO in one program and an Award in another), the grants and all downstream outputs will need to be associated with BOTH programs. It's important to identify these. 
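
One consequence is that a shared grant appears once per program in `df`, which is a likely source of the double counting flagged in the funding sums earlier. A possible way to handle both needs is to keep the duplicates for per-program totals and drop them for portfolio-wide totals. This is a sketch on made-up numbers, and it assumes `project_num` uniquely identifies a grant.

```python
import pandas as pd

# Made-up rows: grant G1 was gathered for both ProgA and ProgB
toy = pd.DataFrame({
    "project_num": ["G1", "G1", "G2"],
    "program": ["ProgA", "ProgB", "ProgA"],
    "agency_ic_fundings": [100, 100, 50],
})

# Per-program totals keep the shared grant under BOTH programs
per_program = toy.groupby("program")["agency_ic_fundings"].sum()

# Portfolio-wide total counts each grant once (assumes project_num is unique per grant)
overall = toy.drop_duplicates(subset="project_num")["agency_ic_fundings"].sum()

print(per_program)
print(overall)
```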

In [32]:
# Show core projects found in multiple programs
df_shared_projects

Unnamed: 0,api_source_search,core_project_num,program,grant_count
0,award_R01CA239701,R01CA239701,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...,4
1,award_U10CA180821,U10CA180821,TheAdjuvantLungCancerEnrichmentMarkerIdentific...,35
2,nofo_PA20-272,R01CA239701,AllofUs,2
3,nofo_PA20-272,U10CA180821,AllofUs,8
4,nofo_PA20-272,U24CA163056,AllofUs,1
5,nofo_PA20-272,U54CA163004,AllofUs,1
6,nofo_PA20-272,U54CA163059,AllofUs,1
7,nofo_PA20-272,U54CA163060,AllofUs,1
8,nofo_PA20-272,U54CA224019,AllofUs,1
9,nofo_PA20-272,U54CA231630,AllofUs,2


Some Key Programs specify Awards for grants rather than the entire NOFO. Only the Award and its downstream grants should be associated with that Program, rather than everything downstream of the larger NOFO.  
However, all grants gathered from NIH RePORTER necessarily have a 'full_foa' (aka NOFO) value within the RePORTER data. This is a potential source of complications.  

We need to check for two things:  
1. Did we search NIH RePORTER with any NOFOs that do not match the full_foa within grants returned? 
    - That would be unexpected and indicate an API gathering problem.
    - Would appear in the table below as rows with "nofo_..." in the api_source_search column
2. Did we search NIH RePORTER with any Awards with differing full_foa values within grants returned? 
    - That's acceptable - it could indicate that the particular Core Project received funding from many sources (e.g. data sharing NOFO effort)
    - It's still important to note. We need to be careful not to accidentally make "upstream" connections from grants back to full_foas back to programs for projects derived from specified Award gathering.
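
The first check can also be made explicit rather than read off a table. The sketch below compares the searched NOFO against the returned `full_foa` for NOFO-based searches only. The ID formats mirror values seen elsewhere in this notebook, but the row pairings are made up, `normalize_nofo` is a hypothetical helper, and stripping hyphens is an assumption about how the two formats relate.

```python
import pandas as pd


def normalize_nofo(value):
    """Hypothetical helper: strip the 'nofo_' prefix and hyphens for comparison."""
    if value.startswith("nofo_"):
        value = value[len("nofo_"):]
    return value.replace("-", "").upper()


# Toy rows: the second pairing is invented to show what a mismatch looks like
toy = pd.DataFrame({
    "api_source_search": ["nofo_PA20-272", "nofo_PA20-272", "award_P50CA097257"],
    "full_foa": ["PA-20-272", "PA-21-071", "PAR-00-087"],
})

# Check 1 only concerns rows gathered through a NOFO search
nofo_rows = toy[toy["api_source_search"].str.startswith("nofo_")].copy()
nofo_rows["searched_nofo"] = nofo_rows["api_source_search"].map(normalize_nofo)
nofo_rows["returned_nofo"] = nofo_rows["full_foa"].map(normalize_nofo)

# Any row left here came back under a different NOFO than the one searched
mismatches = nofo_rows[nofo_rows["searched_nofo"] != nofo_rows["returned_nofo"]]
print(mismatches[["api_source_search", "full_foa"]])
```

An empty `mismatches` frame on the real data would support the expectation in item 1; Award-based rows are deliberately excluded because differing `full_foa` values are acceptable there.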


In [33]:
# Group by core project and combine programs into unordered set
grouped = df.groupby("core_project_num")["program"].apply(set)

# Create combos of programs for each core project with itertools combinations
df_shared_programs = grouped.apply(lambda x: list(combinations(x, 2)))

# Flatten the program combos and count occurrences
df_shared_programs = df_shared_programs.explode().value_counts().reset_index()

# Split program combo column into two separate columns
# NOTE combinations(x, 2) always yields pairs, so a project in more than 2 programs contributes one row per pair
df_shared_programs[['program_1','program_2']] = df_shared_programs['index'].apply(pd.Series)

# Reorder and rename
df_shared_programs.rename(columns={'program':'shared_project_count'}, inplace=True)
df_shared_programs = df_shared_programs[['program_1','program_2','shared_project_count']]

# Export as report
shared_programs_filename = reports_dir + '/' + 'sharedProjectsByProgramPair.csv'
df_shared_programs.to_csv(shared_programs_filename, index=False)

# Show table of program combos and number of shared projects between them
df_shared_programs

Unnamed: 0,program_1,program_2,shared_project_count
0,FusionOncoproteinsinChildhoodCancersFusOnc2,AllofUs,5
1,AllofUs,BarrettsEsophagusTranslationalResearchNetworkB...,4
2,CellularCancerBiologyImagingResearch,AllofUs,2
3,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...,AllofUs,1
4,AllofUs,TheAdjuvantLungCancerEnrichmentMarkerIdentific...,1
5,AllofUs,AcquiredResistancetoTherapyNetworkARTNet,1
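
Because the programs for each project are held in a set, the order inside each pair is not guaranteed, so the same two programs can in principle be counted under both (A, B) and (B, A). Sorting each set before building combinations gives every pair one canonical order. This is a sketch on made-up data using `collections.Counter` in place of `value_counts`; it is an alternative, not the method used above.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Made-up rows: P4 belongs to three programs, P3 to only one
toy = pd.DataFrame({
    "core_project_num": ["P1", "P1", "P2", "P2", "P3", "P4", "P4", "P4"],
    "program": ["B", "A", "A", "B", "A", "C", "A", "B"],
})

program_sets = toy.groupby("core_project_num")["program"].apply(set)

# Sort each set so a given pair is always emitted in the same order
pair_counter = Counter()
for programs in program_sets:
    pair_counter.update(combinations(sorted(programs), 2))

toy_shared_programs = pd.DataFrame(
    [(p1, p2, count) for (p1, p2), count in pair_counter.items()],
    columns=["program_1", "program_2", "shared_project_count"],
)
print(toy_shared_programs)
```

A project in three programs simply contributes three pairs, and a project in one program contributes none.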


In [34]:
# Check for rows with the same ODS-provided NOFO but different FOA/NOFO according to RePORTER 
shared_column = 'api_source_search'
compare_column = 'full_foa'

# Select only rows with different values
df_shared = find_rows_with_different_values(df, shared_column, compare_column)
# Group and summarize
df_shared.groupby([shared_column, compare_column]).size().reset_index().rename(columns={0:'grant_count'})

Unnamed: 0,api_source_search,full_foa,grant_count
0,award_P50CA097257,PAR-00-087,9
1,award_P50CA097257,PAR-05-156,7
2,award_P50CA097257,PAR-10-003,5
3,award_P50CA097257,PAR-14-353,5
4,award_P50CA165962,PA-18-906,1
5,award_P50CA165962,PA-21-071,1
6,award_P50CA165962,PAR-10-003,5
7,award_P50CA165962,PAR-18-313,6
8,award_P50CA196530,PAR-14-031,6
9,award_P50CA196530,PAR-18-313,4


## Build visualizations in Plotly
The goal is to communicate patterns and highlight limitations or oddities within the grants data in order to provide feedback and capabilities to ODS.  
Visualizations may be a better route than spreadsheets to summarize this. 

### Program Funding

#### Pie Chart

In [35]:
# # Grouping data to calculate the sum of agency_ic_fundings for each program
# program_agency_funding = df.groupby('program')['agency_ic_fundings'].sum().reset_index()

# # Creating the Pie Chart
# fig = px.pie(program_agency_funding, 
#              names='program', 
#              values='agency_ic_fundings', 
#              title='Distribution of Agency IC Fundings Across Programs',
#              height=600,
#              width=1200,
#              )

# # Show the Pie Chart
# fig.show()

#### TreeMap Block Chart

In [36]:
# # Grouping data to calculate the sum of agency_ic_fundings for each program
# # program_core_funding = df.groupby(['program', 'core_project_num'])['agency_ic_fundings'].sum().reset_index()

# # Creating the Treemap
# fig = px.treemap(df, 
#                  path=['program', 'project_title', 'project_num'], 
#                  values='agency_ic_fundings',
#                  title='Distribution of NCI Funding by Program, Project, and Award',
#                  height=1200,
#                  width=1800)

# # Show the Treemap
# fig.show()

In [37]:
# Grouping data to calculate the sum of agency_ic_funding for each org_state
state_funding = df.groupby('org_state', dropna=False)['agency_ic_fundings'].sum().reset_index()
state_funding

Unnamed: 0,org_state,agency_ic_fundings
0,AL,9095775
1,AR,2305873
2,AZ,19291308
3,BC,1313155
4,CA,276507528
5,CO,9434500
6,CT,46490803
7,DC,17706480
8,DE,823703
9,FL,47141005


#### Choropleth (US Map with funding by state)

In [38]:
# # Grouping data to calculate the sum of agency_ic_funding for each org_state
# state_funding = df.groupby('org_state',dropna=False)['agency_ic_fundings'].sum().reset_index()

# # Creating the Choropleth
# fig = px.choropleth(state_funding, 
#                     locations='org_state', 
#                     locationmode='USA-states', 
#                     color='agency_ic_fundings',
#                     scope='usa',
#                     color_continuous_scale='Blues',
#                     title='Total NCI Funding by State since 2000',
#                     height=600,
#                     width=1200)

# # Show the Choropleth
# fig.show()

In [39]:
program_fy_funding = df.groupby(['program','fiscal_year'])['agency_ic_fundings'].sum().reset_index()
program_fy_funding

Unnamed: 0,program,fiscal_year,agency_ic_fundings
0,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...,2020,1898254
1,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...,2021,11554
2,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...,2022,197648
3,AcquiredResistancetoTherapyNetworkARTNet,2022,7357128
4,AllofUs,2018,186724
...,...,...,...
131,YaleSPOREinLungCancerYSILCTheBiologyandPersona...,2019,2134771
132,YaleSPOREinLungCancerYSILCTheBiologyandPersona...,2020,2207890
133,YaleSPOREinLungCancerYSILCTheBiologyandPersona...,2021,2003668
134,YaleSPOREinLungCancerYSILCTheBiologyandPersona...,2022,2046999


#### Stacked Bar Chart (Program funding per year)

In [40]:
# # Group data
# program_fy_funding = df.groupby(['program','fiscal_year'])['agency_ic_fundings'].sum().reset_index()

# # Create the stacked bar chart
# fig = px.bar(program_fy_funding, 
#               x='fiscal_year', 
#               y='agency_ic_fundings', 
#               color='program', 
#               title='NCI Funding Since Fiscal Year 2000 by Key Program',
#               labels={'fiscal_year': 'Fiscal Year', 'agency_ic_fundings': 'NCI Funding', 'program':'Key Program'},
#               height=600,
#               width=1800
#             )
# # Show the stacked bar chart
# fig.show()

#### Sankey Diagram
There was a lot of trial and error that is not fully shown here.

In [41]:
# Sum NCI funding for each program, NOFO, and project combo
sankey_summary = df.groupby(['program', 'full_foa', 'core_project_num'])['agency_ic_fundings'].sum().reset_index()
sankey_summary

Unnamed: 0,program,full_foa,core_project_num,agency_ic_fundings
0,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...,PA-19-056,R01CA239701,1898254
1,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...,PA-20-272,R01CA239701,197648
2,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...,PA-21-071,R01CA239701,11554
3,AcquiredResistancetoTherapyNetworkARTNet,RFA-CA-21-052,U54CA224019,1308998
4,AcquiredResistancetoTherapyNetworkARTNet,RFA-CA-21-052,U54CA224081,1119180
...,...,...,...,...
1997,TheUniversityofTexasMDAndersonCancerCenterSPOR...,PAR-18-313,P50CA217674,9063325
1998,VanderbiltIngramCancerCenterSPOREinGastrointes...,PAR-18-313,P50CA236733,11590546
1999,YaleSPOREinLungCancerYSILCTheBiologyandPersona...,PAR-14-031,P50CA196530,12004456
2000,YaleSPOREinLungCancerYSILCTheBiologyandPersona...,PAR-18-313,P50CA196530,8328403


In [42]:
# Build connections from programs to full_foas
links_program_nofo = df.groupby(['program','full_foa'])['agency_ic_fundings'].sum().reset_index()
# Rename with standard cols
links_program_nofo.columns = ['source','target','value']


# Build connections from full_foas to project_nums
links_nofo_project = df.groupby(['full_foa','core_project_num'])['agency_ic_fundings'].sum().reset_index()
# Rename with standard cols
links_nofo_project.columns = ['source','target','value']

# Combine different links dataframes
links = pd.concat([links_program_nofo, links_nofo_project])

# Remove links with no funding
links = links[links['value'] > 0]
links

Unnamed: 0,source,target,value
0,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...,PA-19-056,1898254
1,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...,PA-20-272,197648
2,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...,PA-21-071,11554
3,AcquiredResistancetoTherapyNetworkARTNet,RFA-CA-21-052,6306724
4,AcquiredResistancetoTherapyNetworkARTNet,RFA-CA-21-053,1050404
...,...,...,...
1987,RFA-CA-22-038,U24CA224285,1924560
1988,RFA-CA-22-038,U24CA224309,1836054
1989,RFA-CA-22-038,U24CA224319,1983570
1990,RFA-CA-22-038,U24CA224331,1973873


In [43]:
# Create nodes dataframe from links dataframe
nodes_data = pd.concat([links['source'], links['target']]).unique()
nodes = pd.DataFrame({
    'node': nodes_data,
    'node_id': range(len(nodes_data))
})

# Create a dictionary to map node names to node IDs
node_id_mapping = dict(zip(nodes['node'], nodes['node_id']))

# Map the node names to their corresponding node IDs in the links dataframe
links['source'] = links['source'].map(node_id_mapping)
links['target'] = links['target'].map(node_id_mapping)

links

Unnamed: 0,source,target,value
0,0,23,1898254
1,0,25,197648
2,0,26,11554
3,1,58,6306724
4,1,59,1050404
...,...,...,...
1987,60,1916,1924560
1988,60,1917,1836054
1989,60,1918,1983570
1990,60,1919,1973873


In [44]:
# # Create Sankey diagram
# fig = go.Figure(go.Sankey(
#     node=dict(
#         pad=15,
#         thickness=20,
#         line=dict(color="black", width=0.5),
#         label=nodes['node'],  # Use the node names as labels

#     ),
#     link=dict(
#         source=links['source'],
#         target=links['target'],
#         value=links['value']
#     )
# ))

# # Customize layout
# fig.update_layout(
#     title_text="NCI Key Program Funding Flow",
#     font_size=14,
#     height=1800,
#     width=1200
# )

# # Display the figure
# fig.show()

## Program Keywords (from Grants)

In [45]:
# Get top keywords across all grants gathered
df['pref_terms'].str.split(';').explode().reset_index().groupby('pref_terms').size().reset_index().sort_values(by=0, ascending=False).rename(columns={0:'keyword_count'}).head(50)

Unnamed: 0,pref_terms,keyword_count
5006,Malignant Neoplasms,2813
6241,Patients,2722
2301,Data,2504
11625,novel,2483
8129,Testing,2482
13064,tumor,2437
3590,Goals,2186
1510,Cells,2158
10764,improved,2075
2413,Development,2068


In [46]:
# Split keywords and explode
keywords_df = df.assign(pref_terms=df['pref_terms'].str.split(';')).explode('pref_terms')

# Count keyword frequencies per program
keyword_counts = keywords_df.groupby(['program', 'pref_terms']).size().reset_index(name='keyword_count')

keyword_counts

Unnamed: 0,program,pref_terms,keyword_count
0,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...,12 year old,1
1,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...,Accounting,1
2,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...,Acute Lymphocytic Leukemia,4
3,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...,Administrative Supplement,2
4,ADMIRALStudyAdmixtureanalysisofacutelymphoblas...,Admixture,4
...,...,...,...
20092,YaleSPOREinLungCancerYSILCTheBiologyandPersona...,tumor,11
20093,YaleSPOREinLungCancerYSILCTheBiologyandPersona...,tumor initiation,1
20094,YaleSPOREinLungCancerYSILCTheBiologyandPersona...,tumor microenvironment,4
20095,YaleSPOREinLungCancerYSILCTheBiologyandPersona...,tumor progression,11


Keywords seem valuable but it's difficult to find an appropriate visualization. I'd rather not rely on WordCloud. 
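
One alternative to a word cloud is to reduce the per-program counts to a short top-N list, which can then feed a faceted bar chart or a plain table. This is a sketch on made-up counts that reuses the column names from `keyword_counts`; `TOP_N` and the toy values are assumptions for illustration.

```python
import pandas as pd

# Made-up keyword counts shaped like keyword_counts above
toy_counts = pd.DataFrame({
    "program": ["ProgA", "ProgA", "ProgA", "ProgB", "ProgB", "ProgB"],
    "pref_terms": ["tumor", "novel", "Admixture", "tumor", "Patients", "Lung"],
    "keyword_count": [11, 4, 2, 9, 7, 1],
})

# Number of keywords to keep per program (arbitrary choice)
TOP_N = 2

# Sort so the most frequent keywords come first within each program,
# then keep the first TOP_N rows of every program group
top_keywords = (
    toy_counts
    .sort_values(["program", "keyword_count"], ascending=[True, False])
    .groupby("program")
    .head(TOP_N)
)
print(top_keywords)
```

Generic terms such as "tumor" or "novel" will likely dominate every program, so filtering out keywords that are common across all grants may be needed before the lists become distinctive.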