# USG grants crawl
## Agency-specific replication from Lee & Chung (2022)

### Previously

In the previous chapter we looked at how often a selected set of open-science infrastructure related terms showed up in [grants.gov](https://www.grants.gov/web/grants) grant descriptions, and which agencies' grants they were showing up in.  

The keyword list we used was one of our own making. However, we are certainly not the first to explore this topic. Indeed, [Lee & Chung (2022)](https://doi.org/10.47989/irpaper949) looked precisely at the question of which keywords are most associated with it. We can use their keywords to replicate our previous analysis with an expanded and empirically supported list of terms. This involves only minor changes to the previous notebook.

### Loading the database once more

Let's begin by loading the database extract provided by grants.gov, which is stored in XML format.
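The loading cell hard-codes a local path and leaves a note about checking for a local file first. A minimal sketch of that check, using only the standard library, might look like the following. The function name `find_local_extract` and the glob pattern are assumptions made for illustration (the pattern is inferred from the single file name used in this notebook), and the download step is deliberately left out.

```python
from pathlib import Path
from typing import Optional


def find_local_extract(search_dir: str, pattern: str = "GrantsDBExtract*.xml") -> Optional[Path]:
    """Return the last matching extract in search_dir (by file name), or None if there is none.

    Hypothetical helper: the pattern assumes extracts are named like
    GrantsDBExtract20230113v2.xml, with the date embedded as YYYYMMDD.
    """
    candidates = sorted(Path(search_dir).glob(pattern))
    # because the date is zero-padded, sorting by name also sorts by date
    return candidates[-1] if candidates else None
```

If the function returns `None`, the extract would need to be downloaded and unzipped by hand (or by a further step not sketched here) before continuing.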

In [1]:
from bs4 import BeautifulSoup
import xmltodict
import sys

# FUTURE NOTE: it may be possible to do a check for a local file meeting the relevant criterion and conditionally 
# download from https://www.grants.gov/extract/ (and extract compressed file) in the event a local target isn't found.
# For the moment though...

# load up the xml file; hard-path to local file.  Adjust as necessary
pathToXML='C://Users//dbullock//Documents//code//gitDir//USG_grants_crawl//inputData//GrantsDBExtract20230113v2.xml'

# read the raw xml text once
with open(pathToXML, 'r') as f:
    govGrantData_raw = f.read()

# convert the xml text to a dictionary
govGrantData_dictionary = xmltodict.parse(govGrantData_raw)

# quick size legibility function generated by code-davinci-002
def convert_bytes(num_bytes):
    # the parameter is named num_bytes to avoid shadowing the built-in bytes type
    if num_bytes < 1024:
        return str(num_bytes) + " B"
    elif num_bytes < 1048576:
        return str(round(num_bytes/1024, 1)) + " KB"
    elif num_bytes < 1073741824:
        return str(round(num_bytes/1048576, 1)) + " MB"
    elif num_bytes < 1099511627776:
        return str(round(num_bytes/1073741824, 1)) + " GB"
    else:
        return str(round(num_bytes/1099511627776, 1)) + " TB"
    
# terminal reports
print('Dictionary conversion successful')
print('\n' + str(len(govGrantData_dictionary['Grants']['OpportunitySynopsisDetail_1_0'])) + ' grant entries found, totalling '+ convert_bytes(sys.getsizeof(govGrantData_raw)))
print('\n and with dictionary keys:\n')
print(govGrantData_dictionary['Grants']['OpportunitySynopsisDetail_1_0'][0].keys())

Dictionary conversion successful

70330 grant entries found, totalling 256.2 MB

 and with dictionary keys:

dict_keys(['OpportunityID', 'OpportunityTitle', 'OpportunityNumber', 'OpportunityCategory', 'FundingInstrumentType', 'CategoryOfFundingActivity', 'CategoryExplanation', 'CFDANumbers', 'EligibleApplicants', 'AdditionalInformationOnEligibility', 'AgencyCode', 'AgencyName', 'PostDate', 'CloseDate', 'LastUpdatedDate', 'AwardCeiling', 'AwardFloor', 'EstimatedTotalProgramFunding', 'ExpectedNumberOfAwards', 'Description', 'Version', 'CostSharingOrMatchingRequirement', 'ArchiveDate', 'GrantorContactEmail', 'GrantorContactEmailDescription', 'GrantorContactText'])


### Keywords and terms

Although we aren't going to inspect the keywords and agencies on their own this time, we still need to collect them. Once we have loaded them, we can determine which words are occurring in which grants, and which agencies those grants are associated with. The resulting information can be placed in a dictionary, where the relevant information can be accessed by using the [tuple](https://www.w3schools.com/python/python_tuples.asp) corresponding to the desired agency and keyword (e.g. ('agency','keyword')).
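As a rough illustration of the tuple-keyed dictionary described above, here is a small self-contained sketch. The records are invented for the example, and `defaultdict` is a convenience swapped in here; the notebook's own code uses a plain dictionary and checks for the key before appending.

```python
from collections import defaultdict

# hypothetical (grant ID, agency code, matched keyword) records, for illustration only
matches = [
    ("1001", "HHS-NIH11", "reproducibility"),
    ("1002", "NSF", "reproducibility"),
    ("1003", "HHS-CDC", "reproducibility"),
    ("1004", "NSF", "transparency"),
]

# each (agency, keyword) tuple maps to the list of grant IDs found for that pair
grants_by_cell = defaultdict(list)
for grant_id, agency_code, keyword in matches:
    # keep only the top-level agency label, i.e. the part before the first hyphen
    agency = agency_code.split("-")[0]
    grants_by_cell[(agency, keyword)].append(grant_id)

print(grants_by_cell[("HHS", "reproducibility")])  # ['1001', '1003']
# .get() returns a default for an empty cell without creating a new key
print(grants_by_cell.get(("NSF", "ethics"), []))   # []
```

Because tuples are hashable, they can serve directly as dictionary keys, so no nested dictionary or two-dimensional table is needed to look up a given agency and keyword pair.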

In [2]:
import json
import seaborn as sns
import itertools
import pandas as pd
import matplotlib.pyplot as plt
import re

#HERE'S THE CHANGE FROM THE PREVIOUS NOTEBOOK
# open the keywords csv file
inputKeywords=pd.read_csv('OSterms_LeeChung2022.csv')
print(inputKeywords)

# split it into a list.  Each term is kept on a separate line
keywords=inputKeywords['terms'].tolist()

grantFindsOut={}

# iterate through the keywords
for iKeywords in keywords:
    # create a blank list to store the IDs of the grants with the keyword in the description
    grantsFound=[]
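    # escape the keyword so that any regex metacharacters in it are matched literally, and strip its hyphens,
    # because hyphens are also stripped from the descriptions below (otherwise e.g. 'COVID-19' could never match)
    compiledSearch=re.compile('\\b'+re.escape(iKeywords.lower().replace('-',''))+'\\b')
    for iListing in govGrantData_dictionary['Grants']['OpportunitySynopsisDetail_1_0']:
        # maybe it doesn't have a description field
        try:
            # case insensitive regex search for the keyword
            if compiledSearch.search(iListing['Description'].lower().replace('-','')):
                #append the ID if found
                grantsFound.append(iListing['OpportunityID'])
        except (KeyError, AttributeError):
            # do nothing, if there's no (or an empty) description field, then the word can't be found
            pass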
            
    # store the found entries in the output dictionary.  Use the keyword as the key (with spaces replaced with underscores),
    # and the value being the list of grant IDs
    grantFindsOut[iKeywords.replace(' ','_')]=grantsFound

# no need to save it

import numpy as np
def getGrantAgencies(listOfGrantStrucs):
    # generate a vector for the agency names
    agencyNameVec=[[] for iGrant in range(len(listOfGrantStrucs)) ]
    # iterate through the grants
    for iIndex,iListing in enumerate(listOfGrantStrucs):
        # this time we're just getting the relevant agency label
        # yes, we're redoing what occurred in the previous block
        # why are you like this government agencies
        try:
            # in the normal case
            nameHold=iListing['AgencyCode'].split('-')[0]
        except (KeyError, AttributeError):
            try:
                # if it's not there, get the full name
                agencyName=iListing['AgencyName']
                # and join its capital letters into a single acronym string
                nameHold=''.join([char for char in agencyName if char.isupper()])
            except (KeyError, TypeError):
                # well, if you can't adhere to a formatting standard, then you get lumped into other
                nameHold='other'
        # set it in the corresponding item in the list
        agencyNameVec[iIndex]=nameHold
    return agencyNameVec

def getGrantValues(listOfGrantStrucs):
    grantValVec=[[] for iGrant in range(len(listOfGrantStrucs)) ]
    for iIndex,iListing in enumerate(listOfGrantStrucs):
        try:
            # if you can find the expected program funding value, add it to vector while forcing the string to an int
            grantValVec[iIndex]=np.int64(iListing['EstimatedTotalProgramFunding'])
        except (KeyError, TypeError, ValueError):
            # if you can't
            try:
                # try and infer a value, if the data is available
                # do this by estimating the mean grant value, and multiplying by the expected number of grant awards
                # (every field is stored as a string, so each one is converted before doing arithmetic)
                meanAward=np.divide(np.int64(iListing['AwardCeiling'])+np.int64(iListing['AwardFloor']),2)
                totalAvgValue=np.multiply(meanAward,np.int64(iListing['ExpectedNumberOfAwards']))
                # add that value to the val vec
                grantValVec[iIndex]=totalAvgValue
            except (KeyError, TypeError, ValueError):
                # just add zero, as a place holder
                grantValVec[iIndex]=np.int64(0)
    return grantValVec

def getGrantIDs(listOfGrantStrucs):
    #extremely simple; uses the list passed in rather than the global dictionary
    grantIDsVec=[iGrant['OpportunityID'] for iGrant in listOfGrantStrucs]
    return grantIDsVec

grantAgencies=getGrantAgencies(govGrantData_dictionary['Grants']['OpportunitySynopsisDetail_1_0'])
grantIDs=getGrantIDs(govGrantData_dictionary['Grants']['OpportunitySynopsisDetail_1_0'])
grantAgenciesUnique=np.unique(grantAgencies)
#create dictionary holder
dataHolder={}

#create a dataframe
#grantIDsDF=pd.DataFrame(data=blankData,columns=grantAgenciesUnique,index=list(grantFindsOut.keys()),dtype=object)
#grantHoldStruc=np.zeros((len(grantAgenciesUnique),len(list(grantFindsOut.keys()))))
for matrix_keywordIndex, iKeywords in enumerate(keywords):
    currentGrants=grantFindsOut[iKeywords.replace(' ','_')]
    for iCurrentGrants in currentGrants:
        #find out what it's index is
        currentGrantIndex=grantIDs.index(iCurrentGrants)
        #find out what agency that is
        currentAgency=grantAgencies[currentGrantIndex]
        #place it in the dataframe
        matrix_agencyIndex=list(grantAgenciesUnique).index(currentAgency)
        #(row, column)
        #grantIDsDF.loc[currentAgency,iKeywords]=grantIDsDF.loc[iKeywords,currentAgency].append(iCurrentGrants)
        tupleKey=tuple([currentAgency,iKeywords])
        #if it's not there, make it a blank
        if tupleKey not in dataHolder:
            dataHolder[tupleKey]=[]
            
        dataHolder[tupleKey].append(iCurrentGrants)

# create a count matrix
countMatrix=np.zeros([len(keywords),len(grantAgenciesUnique)])
for matrix_keywordIndex, iKeywords in enumerate(keywords):
    for matrix_agencyIndex, iAgency in enumerate(grantAgenciesUnique):
        tupleKey=tuple([iAgency,iKeywords])
        #try and index into it
        try:
            currVal=len(dataHolder[tupleKey])
        except:
        #if it's not there, then there aren't any grants in that cell
            currVal=0
        countMatrix[matrix_keywordIndex,matrix_agencyIndex]=currVal

                                       categories                    terms
0        pre-registrations and registered reports       replication crisis
1        pre-registrations and registered reports              methodology
2        pre-registrations and registered reports          preregistration
3        pre-registrations and registered reports              replication
4        pre-registrations and registered reports       registered reports
5                                       preprints                preprints
6                                       preprints          social sciences
7                                 reproducibility          reproducibility
8                                 reproducibility             transparency
9                                 reproducibility            replicability
10                                reproducibility                 COVID-19
11                                reproducibility                   ethics
12                       

### A small wait

Because the previous analysis isn't coded particularly efficiently, it can take a moment to complete. Part of this has to do with the inefficiency of indexing back into the database, as well as the inefficient storage method for the information we are collecting (i.e. appending to lists in a large dictionary).
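One likely contributor to the wait is the repeated use of `grantIDs.index(...)`, which rescans the list of grant IDs from the start on every call. A possible alternative, sketched below with invented data, is to build an ID-to-position dictionary once and look positions up from it. This assumes that grant IDs are unique; if they are not, the dictionary keeps the last position whereas `list.index` returns the first.

```python
# hypothetical stand-ins for the notebook's grantIDs and grantAgencies lists
grant_ids = ["1001", "1002", "1003"]
agencies = ["HHS", "NSF", "HHS"]

# list.index walks the list from the beginning each time it is called
slow_index = grant_ids.index("1003")

# building a dictionary once makes each later lookup a single hash lookup
index_by_id = {grant_id: position for position, grant_id in enumerate(grant_ids)}
fast_index = index_by_id["1003"]

print(slow_index, fast_index, agencies[fast_index])  # 2 2 HHS
```

With tens of thousands of grants and several keywords, replacing the repeated list scans with dictionary lookups should reduce the work noticeably, although the exact saving has not been measured here.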

In any case, once we have the relevant data structure we can look at which agencies are using which terms, and also receive an output of the [grants.gov](https://www.grants.gov/web/grants) IDs associated with those grants.

In [3]:
# chat-davinci-002 prompt
# an iteractive jupyer notebook widget that returns two subplot windows.  The input is a numerical matrix.  The interface features two dropdown menus that allow you to select a row (i) and column (j) from the matrix.  On the left side of the subplot outputs, a matrix heatmap plotting the numerical data.  On the right side of the subplot outputs, a blank plot that is used to display text indicating the value found in the specific matrix (i,j) entry selected in the dropdown menus.

import ipywidgets as widgets
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
import pandas as pd
import seaborn as sns
from IPython.display import clear_output
from matplotlib.colors import LogNorm

def heatmap_plot(matrix, heatmap_ax, row, column):
    """
    Plots the heatmap with a crosshair at the desired location
    """
    # if row is empty, default to column
    if row == '':
        row = column
    # if column is empty, default to row
    if column == '':
        column = row
    # if both are empty, no outline
    if row == '' and column == '':
        row = 0
        column = 0
    # if both are not empty, only highlight the relevant cell
    if row != '' and column != '':
        row = row
        column = column
    # create the heatmap plot
    sns.heatmap(matrix, ax=heatmap_ax, norm=LogNorm(), cmap='viridis', cbar=True, xticklabels=list(grantAgenciesUnique) , yticklabels=list(keywords))
    # create the outline
    heatmap_ax.axvline(x=column+.5, color='red', linewidth=2)
    heatmap_ax.axhline(y=row+.5, color='red', linewidth=2)
    # show the plot

def plot_list(axis, list_of_text, font_size=None, font_color='black', font_family='sans-serif') :
    """
    A function for plotting a list of text elements evenly across a passed in axis.  The function begins by taking in the passed in axis and measuring the space available.  The function then uses those dimensions to determine both the font size and how the list elements should be split into rows and columns so as to take up the maximum amount of space available within the axis, without overlapping.  The function then plots those list elements to the axis space.  Finally the plot is displayed.  The function does not alter the size of the input axes or resultant figure.

    Parameters
    ----------
    axis : matplotlib.axes.Axes
        The axis to plot the list of text elements to.
    list_of_text : list
        A list of text elements to plot to the axis.
    font_size : int, optional
        The font size to use for the text elements.  If not passed in, the function will calculate the font size based on the size of the axis.
    font_color : str, optional
        The color of the text elements.  The default is 'black'.
    font_family : str, optional
        The font family to use for the text elements.  The default is 'sans-serif'.

    testBox:
    
    aaaaa
    aaaaa
    aaaaa
    

    Returns
    -------
    None.

    """
    import math


    # get the axis dimensions
    #first get the figure handle
    fig=axis.get_figure()
    bbox = axis.get_window_extent().transformed(fig.dpi_scale_trans.inverted())
    axis_width, axis_height = bbox.width, bbox.height
    # returns in pixels, for some reason
    #axis_width = axis.get_window_extent().width
    #axis_height = axis.get_window_extent().height

    # no need to get units for these axes sizes as we can safely assume they are in inches.
    spaceNum=4

    #assumed aspect ratio, how many characters can you fit along x amount of space vertically : horizontally; see text box for demo
    textAspectRatio=5/5
    
    # calculate the maximum number of characters in the text elements and use this to establish the expected character width of columns
    max_characters = max([len(x) for x in list_of_text])

    # create a list of spaces to add to the end of each list element
    spaces = [' ' * (max_characters - len(x)) for x in list_of_text]

    # join the list of text elements with the list of spaces
    list_of_text = [x + y for x, y in zip(list_of_text, spaces)]

    nonSpaceCount=len(list_of_text) - list_of_text.count(' ')
    spacaceCount=list_of_text.count(' ')
    #estimate total character footprint
    charFootprint=nonSpaceCount+(spacaceCount/2)
    #quasi math: rows=3/5*cols; squareform=cols^2; totalChars=(3/5*cols)*cols
    colNum=math.ceil(math.sqrt(charFootprint*5/3))
    rowNum=math.ceil(colNum*(textAspectRatio))
    
    #element_per_row=math.ceil(colNum/(max_characters+2))
    # number of list elements to place on each row of text
    elementsPerRow=math.ceil(charFootprint/rowNum)

    mergedText=''
    # conditional appending
    for iTextIndex, iTextElements in enumerate(list_of_text):
        #if it's divisible by the number of elements per row
        if (iTextIndex+1) % elementsPerRow == 0:
            mergedText=mergedText + iTextElements + '\n'
        else:
            mergedText=mergedText + iTextElements + spaceNum * ' '

    #how many chars per row, spaces only count as half, it seems
    #rowCharNumber= (rows_element_num * (max_characters  + math.ceil(spaceNum/2)))-math.ceil(spaceNum/2)  

    #72 points doesn't actually seem to be equal to one inch here, hence the scale factor
    fontScaleFactor=.3

    # calculate the maximum allowable font size based on the both the height and width axes, such that no text from list_of_text will exceed the axes boundaries.  Assume 1 point of font is equal to 1/72 inches.
    maxWidthFont=(axis_width / (colNum / (72 * fontScaleFactor)))
    maxHeightFont=(axis_height /( rowNum / (32* fontScaleFactor)))
    max_font_size = min([maxWidthFont, maxHeightFont])

    # if a font size was passed in, use it.  Otherwise, use the calculated font size.
    if font_size is None :
        font_size = max_font_size

    # plot the list of text elements to the axis
    axis.text(0, 0, mergedText, fontsize=font_size, color=font_color, family=font_family)

    # display the plot
    plt.show()


# create a function that updates the heatmap
def heatmap_and_text(countMatrix,rowSelect,columnSelect):
    """
    Plots both the heatmap and the textbox of grants in a 2 by 1 subplot (heatmap above, grant list below)
    """


    fig, ax = plt.subplots(2, 1, figsize=(10, 20))

    # plot the heatmap
    heatmap_plot(countMatrix, heatmap_ax=plt.gcf().get_axes()[0], row=rowSelect, column=columnSelect)
    keyTuple=tuple([col_menu.value,row_menu.value])
    try:
        list_to_plot=dataHolder[keyTuple]
    except:
        list_to_plot=['No grants found']
    plot_list(plt.gcf().get_axes()[1],list_to_plot)
    # show the plot

def update_plots(rowSelectName,columnSelectName):
    """
    Performs the updating
    """
    
    rowIndex=keywords.index(rowSelectName)
    colIndex=list(grantAgenciesUnique).index(columnSelectName)
    heatmap_and_text(countMatrix,rowIndex,colIndex)
    
    
# link the dropdown menus to the update functions
#row_menu.observe(update_heatmap, names='value')
#col_menu.observe(update_heatmap, names='value')
# display the widgets
#display(row_menu)
#display(col_menu)

# update the heatmap
#update_heatmap(None)
# create a dropdown menu for the rows
row_menu = widgets.Dropdown(
    options=keywords,
    #value=,
    description='Row:',
    disabled=False,
)
# create a dropdown menu for the columns
col_menu = widgets.Dropdown(
    options=grantAgenciesUnique,
    #value='',
    description='Column:',
    disabled=False,
    )


%matplotlib inline
from ipywidgets import interact
#establishes interactivity
interact(update_plots,rowSelectName=row_menu,columnSelectName=col_menu)

interactive(children=(Dropdown(description='Row:', options=('replication crisis', 'methodology', 'preregistrat…

<function __main__.update_plots(rowSelectName, columnSelectName)>

### Interacting with the plot 

The widget should allow you to select which terms to work with. For the moment (i.e. the early stages of this notebook) the interface is relatively rudimentary, but the heatmap plot should feature a crosshair indicating which agency and term you are looking at. The plot beneath it should include a list of the grants.gov IDs. In many cases no grants are found meeting the criteria, and so a large text indicator should appear stating this. However, in the event that grants are found, they should be listed. Currently the text scaling for this feature is rudimentary, and so if too many are found their font might be extremely small (future [modifications](https://stackoverflow.com/questions/55729075/matplotlib-how-to-autoscale-font-size-so-that-text-fits-some-bounding-box) could address this). Additionally, the text elements themselves may be [capable of being hyperlinks](https://matplotlib.org/stable/gallery/misc/hyperlinks_sgskip.html).

In any case, we can also attempt to replicate this process and look at the value of the grants as well.  As before, this computation will take a moment.
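The value assigned to each grant follows a simple fallback rule: use the estimated total program funding if it is present, otherwise estimate it as the mean of the award ceiling and floor multiplied by the expected number of awards, and otherwise fall back to zero. A minimal standard-library sketch of that rule is given below. The function name `estimate_total_funding` is hypothetical, and plain `int` is used in place of the `np.int64` conversions in the notebook's own `getGrantValues`.

```python
def estimate_total_funding(listing: dict) -> int:
    """Return a funding estimate for one grant listing (all fields arrive as strings)."""
    try:
        # preferred: the total program funding reported by the agency
        return int(listing["EstimatedTotalProgramFunding"])
    except (KeyError, TypeError, ValueError):
        pass
    try:
        # fallback: mean award size multiplied by the expected number of awards
        mean_award = (int(listing["AwardCeiling"]) + int(listing["AwardFloor"])) / 2
        return int(mean_award * int(listing["ExpectedNumberOfAwards"]))
    except (KeyError, TypeError, ValueError):
        # nothing usable was reported, so use zero as a placeholder
        return 0


print(estimate_total_funding({"EstimatedTotalProgramFunding": "500000"}))  # 500000
print(estimate_total_funding(
    {"AwardCeiling": "100000", "AwardFloor": "50000", "ExpectedNumberOfAwards": "4"}
))  # 300000
print(estimate_total_funding({}))  # 0
```

Note that the zero placeholder means cells with no reported funding are indistinguishable from cells with no grants at all, which is worth keeping in mind when reading the value heatmap.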

Turning to the plot itself, it's clear that the inclusion of "R" is throwing off the analysis. This is likely because the search is returning any instance of the letter R, independent of any relevance to R, the analysis program. One way to deal with this would be to alter the search to make use of regex and [word boundaries](https://www.regular-expressions.info/wordboundaries.html).
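To make the difference concrete, the sketch below compares a plain substring test with a word-boundary search for a one-letter keyword. The two description strings are invented for the example. `re.escape` is included so that a keyword containing regex metacharacters would still be matched literally.

```python
import re

keyword = "R"
# hypothetical grant descriptions, for illustration only
with_r = "Research training for reproducibility; analyses are run in R and Python."
without_r = "Research training for reproducibility."

# a substring test matches the letter anywhere, including inside other words
substring_hits = [keyword.lower() in text.lower() for text in (with_r, without_r)]

# \b only matches at the edge of a word, so the keyword must stand alone
pattern = re.compile(r"\b" + re.escape(keyword.lower()) + r"\b")
boundary_hits = [bool(pattern.search(text.lower())) for text in (with_r, without_r)]

print(substring_hits)  # [True, True]
print(boundary_hits)   # [True, False]
```

Because the text is lowercased before searching, a standalone lowercase "r" would still count as a hit; distinguishing the program name from other uses of the letter would need a case-sensitive search or manual review.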

In [None]:
#now do it again with value

#get the values
allGrantVals=getGrantValues(govGrantData_dictionary['Grants']['OpportunitySynopsisDetail_1_0'])

# create a count matrix
valueMatrix=np.zeros([len(keywords),len(grantAgenciesUnique)])
for matrix_keywordIndex, iKeywords in enumerate(keywords):
    for matrix_agencyIndex, iAgency in enumerate(grantAgenciesUnique):
        tupleKey=tuple([iAgency,iKeywords])
        #try and index into it
        try:
            currentGrants=dataHolder[tupleKey]
        except:
        #if it's not there, then there aren't any grants in that cell
            currentGrants=[]
            
        for iGrants in currentGrants:
            #find the ID index
            currentAllIndex=grantIDs.index(iGrants)
            #find the value associated with this index
            currentGrantValue=allGrantVals[currentAllIndex]
            #add it to the matrix
            valueMatrix[matrix_keywordIndex,matrix_agencyIndex]=valueMatrix[matrix_keywordIndex,matrix_agencyIndex]+currentGrantValue
        

### Plotting the values

We'll reuse much of the same code as before, except this time we'll be redefining the section where we took in the count matrix. The interactivity of the resulting plot should be much the same as the previous one.



In [None]:
#set 0 to nan in the plot to avoid bad log color issues
valueMatrix[valueMatrix==0]=np.nan 

def update_plots(rowSelectName,columnSelectName):
    """
    Performs the updating
    """
    
    rowIndex=keywords.index(rowSelectName)
    colIndex=list(grantAgenciesUnique).index(columnSelectName)
    heatmap_and_text(valueMatrix,rowIndex,colIndex)
    
# update the heatmap
#update_heatmap(None)
# create a dropdown menu for the rows
row_menu = widgets.Dropdown(
    options=keywords,
    #value=,
    description='Row:',
    disabled=False,
)
# create a dropdown menu for the columns
col_menu = widgets.Dropdown(
    options=grantAgenciesUnique,
    #value='',
    description='Column:',
    disabled=False,
    )


%matplotlib inline
from ipywidgets import interact
#establishes interactivity
interact(update_plots,rowSelectName=row_menu,columnSelectName=col_menu)