# USG grants crawl
## Ingest

### Motive

Imagine that we were curious about how the federal government had fostered or otherwise encouraged [open science](https://open.science.gov/) and associated infrastructure in recent years.  How would we even begin to explore this issue?

One potential resource is [grants.gov](https://www.grants.gov/web/grants), which serves as an online data resource for government grants in the United States (primarily federal, but also some state).  With this resource it's possible to [explore details about federal grants](https://www.grants.gov/web/grants/search-grants.html), including which agencies are offering them, what they are targeting, and how much funding is available.  It's also possible to [download](https://www.grants.gov/xml-extract.html) much (but not all) of this database for local use.

### Initial database load

Let's begin by loading the database export provided by the website, which is stored in XML format.

In [1]:
from bs4 import BeautifulSoup
import xmltodict
import sys

# FUTURE NOTE: it may be possible to do a check for a local file meeting the relevant criterion and conditionally 
# download from https://www.grants.gov/extract/ (and extract compressed file) in the event a local target isn't found.
# For the moment though...

# load up the xml file; hard-path to local file.  Adjust as necessary
pathToXML='C://Users//dbullock//Documents//code//gitDir//USG_grants_crawl//inputData//GrantsDBExtract20230113v2.xml'

# open and read the file once, keeping the raw text for the size report below
with open(pathToXML, 'r') as f:
    govGrantData_raw = f.read()

# convert xml to dictionary, reusing the text that was already read
govGrantData_dictionary = xmltodict.parse(govGrantData_raw)

# quick size legibility function generated by code-davinci-002
# (parameter renamed from 'bytes' so it doesn't shadow the Python builtin)
def convert_bytes(numBytes):
    if numBytes < 1024:
        return str(numBytes) + " B"
    elif numBytes < 1048576:
        return str(round(numBytes/1024, 1)) + " KB"
    elif numBytes < 1073741824:
        return str(round(numBytes/1048576, 1)) + " MB"
    elif numBytes < 1099511627776:
        return str(round(numBytes/1073741824, 1)) + " GB"
    else:
        return str(round(numBytes/1099511627776, 1)) + " TB"
    
# terminal reports
print('Dictionary conversion successful')
print('\n' + str(len(govGrantData_dictionary['Grants']['OpportunitySynopsisDetail_1_0'])) + ' grant entries found, totalling '+ convert_bytes(sys.getsizeof(govGrantData_raw)))
#print('\n and with dictionary keys:\n')
#print(govGrantData_dictionary['Grants']['OpportunitySynopsisDetail_1_0'][0].keys())

Dictionary conversion successful

70330 grant entries found, totalling 256.2 MB
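The FUTURE NOTE in the cell above suggests checking for a local copy of the extract and downloading it only when it's missing. Here is a minimal sketch of that idea. It assumes the extract is distributed as a zip archive containing an XML file; the helper name `ensure_local_extract` and its parameters are hypothetical, and the exact archive URL is left to the caller rather than guessed here.

```python
import os
import shutil
import urllib.request
import zipfile


def ensure_local_extract(xml_path, archive_url):
    """Return xml_path, downloading and unzipping archive_url first if needed.

    Hypothetical helper: assumes archive_url points at a zip file whose
    first .xml member is the database extract.
    """
    if os.path.exists(xml_path):
        # a local copy is already present, so skip the download
        return xml_path

    target_dir = os.path.dirname(os.path.abspath(xml_path))
    os.makedirs(target_dir, exist_ok=True)
    zip_path = xml_path + '.zip'

    # fetch the compressed extract
    urllib.request.urlretrieve(archive_url, zip_path)

    # unpack the first XML member and write it to the requested location
    with zipfile.ZipFile(zip_path) as archive:
        xml_members = [name for name in archive.namelist() if name.lower().endswith('.xml')]
        if not xml_members:
            raise ValueError('no XML file found in ' + archive_url)
        with archive.open(xml_members[0]) as source, open(xml_path, 'wb') as destination:
            shutil.copyfileobj(source, destination)

    # the zip is no longer needed once the XML has been extracted
    os.remove(zip_path)
    return xml_path
```

With something like this in place, `pathToXML` could be passed through the helper before the `open` call, so the notebook would still run from a fresh checkout.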


### What does a grant record look like

To get a sense of what any given grant record looks like in the XML structure / metadata scheme / python dictionary, we can select one arbitrarily and view it.

In [2]:
print("{" + "\n".join("{!r}: {!r},".format(k, v) for k, v in  govGrantData_dictionary['Grants']['OpportunitySynopsisDetail_1_0'][1].items()) + "}")

{'OpportunityID': '262149',
'OpportunityTitle': 'Eradication of Yellow Crazy Ants on Johnston Atoll NWR',
'OpportunityNumber': 'F14AS00402',
'OpportunityCategory': 'D',
'FundingInstrumentType': 'CA',
'CategoryOfFundingActivity': ['AG', 'ENV', 'NR'],
'CFDANumbers': '15.608',
'EligibleApplicants': '99',
'AdditionalInformationOnEligibility': 'The recipient has already been selected for this award.  Please see attached Notice of Intent to Award for specifics.',
'AgencyCode': 'DOI-FWS',
'AgencyName': 'Fish and Wildlife Service',
'PostDate': '08152014',
'CloseDate': '08222014',
'LastUpdatedDate': '08152014',
'AwardCeiling': '0',
'AwardFloor': '0',
'EstimatedTotalProgramFunding': '0',
'Description': 'Funds under this award are to be used for the eradication of Yellow Crazy Ants from Johnston Atoll National Wildlife Refuge.',
'Version': 'Synopsis 1',
'CostSharingOrMatchingRequirement': 'No',
'ArchiveDate': '08232014',
'AdditionalInformationURL': 'http://www.grants.gov/',
'AdditionalInformation
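The date fields in the record above ('PostDate', 'CloseDate', 'ArchiveDate') are stored as eight-digit strings, and the values shown are consistent with an MMDDYYYY layout. Here is a small sketch of converting them to proper dates, assuming that layout holds across the dataset (this hasn't been verified for every record). The helper name `parse_grant_date` is hypothetical.

```python
from datetime import datetime


def parse_grant_date(raw_value):
    """Convert an 'MMDDYYYY' string to a date, or return None if it does not fit."""
    try:
        return datetime.strptime(raw_value, '%m%d%Y').date()
    except (TypeError, ValueError):
        # missing or oddly formatted entries are reported as None
        return None


# values copied from the example record shown above
record = {'PostDate': '08152014', 'CloseDate': '08222014'}
post_date = parse_grant_date(record['PostDate'])
close_date = parse_grant_date(record['CloseDate'])

print(post_date.isoformat())           # 2014-08-15
print((close_date - post_date).days)   # 7
```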

### Quick inspection of the dataset

Now that the data has been loaded, let's take a moment to look at the broad scope of the database.  To do this, we'll take an agency-based perspective and see how many grants each agency has on record, as well as their total funding.

Keep in mind that, although this is a fairly comprehensive database, it may not include all grants, and even the grant records it _does_ contain may not all be formatted in a standard way (and thus may be overlooked by the method we employ here).

We'll also save down this output to the working directory, in a file named 'agencyGrantsSummary.csv'.  
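The per-agency summary described here is essentially a group-and-aggregate operation. As a compact point of comparison, here is a sketch of the same idea using pandas `groupby` on a few toy records. The toy values are made up for illustration, and the sketch covers only the simple case where 'AgencyCode' and 'EstimatedTotalProgramFunding' are present; it doesn't handle the fallbacks needed for irregular records.

```python
import pandas as pd

# toy records mimicking the fields used here; values are illustrative only
toy_records = [
    {'AgencyCode': 'DOI-FWS', 'EstimatedTotalProgramFunding': '100'},
    {'AgencyCode': 'DOI-NPS', 'EstimatedTotalProgramFunding': '300'},
    {'AgencyCode': 'HHS-NIH', 'EstimatedTotalProgramFunding': '600'},
]

toy_df = pd.DataFrame.from_records(toy_records)
# top-level agency label: the part of the code before the first hyphen
toy_df['AgencyName'] = toy_df['AgencyCode'].str.split('-').str[0]
# force the funding strings to numbers; unparseable entries become 0
toy_df['Funding'] = pd.to_numeric(toy_df['EstimatedTotalProgramFunding'], errors='coerce').fillna(0)

# count grants and sum funding per agency
summary = toy_df.groupby('AgencyName').agg(
    GrantCount=('Funding', 'size'),
    TotalValue=('Funding', 'sum'),
).reset_index()
# express both as proportions of the overall totals
summary['PortionOfTotal'] = summary['GrantCount'] / summary['GrantCount'].sum()
summary['TotalValuePortion'] = summary['TotalValue'] / summary['TotalValue'].sum()
print(summary)
```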

In [3]:
# let's do some initial dataset overview
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import itables
from itertools import compress

#NOTE: need to use int 64 due to overflow of max value for int 32

# generate a vector for the agency names
agencyNameVec=[[] for iGrant in range(len(govGrantData_dictionary['Grants']['OpportunitySynopsisDetail_1_0'])) ]
# iterate through the grants
for iIndex,iListing in enumerate(govGrantData_dictionary['Grants']['OpportunitySynopsisDetail_1_0']):
    # this time we're just getting the relevant agency label
    # yes, we're redoing what occurred in the previous block
    # why are you like this government agencies
    try:
        # in the normal case, take the part of the code before the first hyphen
        nameHold=iListing['AgencyCode'].split('-')[0]
    except (KeyError, AttributeError):
        try:
            # if it's not there, get the full name
            agencyName=iListing['AgencyName']
            # and build an acronym from the capital letters
            # (joined into a string so np.unique receives uniform entries)
            nameHold=''.join([char for char in agencyName if char.isupper()])
        except (KeyError, TypeError):
            # well, if you can't adhere to a formatting standard, then you get lumped into other
            nameHold='other'
    # set it in the corresponding item in the list
    agencyNameVec[iIndex]=nameHold

#do the same but for the grant values
grantValVec=[[] for iGrant in range(len(govGrantData_dictionary['Grants']['OpportunitySynopsisDetail_1_0'])) ]
for iIndex,iListing in enumerate(govGrantData_dictionary['Grants']['OpportunitySynopsisDetail_1_0']):
    try:
        # if you can find the expected program funding value, add it to vector while forcing the string to an int
        grantValVec[iIndex]=np.int64(iListing['EstimatedTotalProgramFunding'])
    except (KeyError, TypeError, ValueError):
        # if you can't
        try:
            # try and infer a value, if the data is available
            # do this by estimating the mean grant value, and multiplying by the expected number of grant awards
            # (all three fields are strings in the source, so each is converted before the arithmetic)
            meanAwardValue=np.divide(np.int64(iListing['AwardCeiling'])+np.int64(iListing['AwardFloor']),2)
            totalAvgValue=np.multiply(meanAwardValue,np.int64(iListing['ExpectedNumberOfAwards']))
            # add that value to the val vec
            grantValVec[iIndex]=totalAvgValue
        except (KeyError, TypeError, ValueError):
            # just add zero, as a place holder
            grantValVec[iIndex]=np.int64(0)

# get the unique entries and their counts
unique_elements, counts_elements = np.unique(agencyNameVec, return_counts=True)
#get the proportions thereof
countPortions=np.divide(counts_elements,np.sum(counts_elements))

#use that information to get the total value for each agency
#initialize a vector for the totals
agencyTotals=np.zeros(len(unique_elements))
for iAgencyIndex,iUniqueAgencies in enumerate(unique_elements):
    agencyVecMask=[iUniqueAgencies==iAgencies for iAgencies in agencyNameVec]
    agencyTotals[iAgencyIndex]=np.sum(list(compress(grantValVec,agencyVecMask)))

#get the proportions thereof
valueProportion=np.divide(agencyTotals,np.sum(agencyTotals))
            
#initialize the dataframe
grantCountDF=pd.DataFrame(data=zip(unique_elements,counts_elements,countPortions,agencyTotals,valueProportion),columns=['AgencyName','GrantCount','PortionOfTotal','TotalValue','TotalValuePortion'])

# save it down
grantCountDF.to_csv('agencyGrantsSummary.csv')
# interactive display, we'll display 100 MB worth, max
itables.show(grantCountDF,  maxBytes=104857600)

#from earlier attempt
# quick and dirty pie chart from chat GPT
# (parameter renamed from 'list' so it doesn't shadow the Python builtin)
def plot_pie(labelList):
    plt.figure(figsize=(12,8), dpi= 100)
    unique_elements, counts_elements = np.unique(labelList, return_counts=True)
    plt.pie(counts_elements, labels=unique_elements, autopct='%1.1f%%', shadow=True, startangle=140, explode=[.1 for iAgency in unique_elements])
    plt.axis('equal')
    plt.show()
# not a good visualization, so don't do it
# plot_pie(agencyNameVec)



### Convert XML to csv?

One may be interested in converting the XML file into a CSV file.  However, as it turns out, this is not as straightforward as one might hope, because certain fields in grant records contain more complexly structured information (e.g. lists) rather than a simple string or numeric entry.

The code below is provided as an illustration of how to explore this issue, and does not run by default.  As it is currently structured, setting 'defaultOffSiwtch' to True will result in an error.

In [4]:
# NOTEBOOK USAGE NOTE FOR THIS BLOCK
# due to the heterogeneity of the grant xml structure, it's advised that you NOT run this block -- it will error
# certain keys/fields (e.g. 'EligibleApplicants' and 'CFDANumbers') have been observed to have multiple records
# within a single grant.  Thus, the associated value for these (when collapsed into a dictionary-like structure)
# is not a single item (e.g. string, int, or float), but rather a list.
# this makes concatenation difficult to achieve in a principled manner
# (or in a way that permits saving to a conventional output)

# use this as a switch to control whether this box runs or not, 
# when defaultOffSiwtch=False , the content of this box will not run
defaultOffSiwtch=False

if defaultOffSiwtch:
    # theoretically, you might wish to convert the entire grant archive xml to an output file
    # this is how you would do that
    grantsDataFrame=pd.DataFrame.from_records(govGrantData_dictionary['Grants']['OpportunitySynopsisDetail_1_0'])
    # NOTE THOUGH: some of the fields (e.g. description) include characters that would otherwise be delimiters (e.g. ',')
    # if you want to save this, you'll have to be clever
    
    # in any case
    # replace the na values with 0
    grantsDataFrame=grantsDataFrame.fillna(0)
    
    # next, convert the numeric columns to actual numbers
    # start by getting the column names, you'll be iterating across them
    columnNames=list(grantsDataFrame.columns.values)
    # now iterate across them and see if they are numeric, and do so robustly
    for iColumns in columnNames:
        print(iColumns)
        # if the first value is numeric-ish (e.g. ignoring a single decimal and negative sign)
        if grantsDataFrame[iColumns].iloc[0].replace('.','',1).replace('-','',1).isdigit():
            # convert it in a type-appropriate way
            # if it has a decimal
            if '.' in grantsDataFrame[iColumns].iloc[0]:
                # convert it to float
                grantsDataFrame[iColumns]=grantsDataFrame[iColumns].astype(float)
            else:
                # otherwise, I guess it's an int
                grantsDataFrame[iColumns]=grantsDataFrame[iColumns].astype(int)
            grantsDataFrame[iColumns]
            
    # interactive display
    itables.show(grantsDataFrame)
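One way around the list-valued fields described above is to join each list into a single delimited string before writing. Here is a minimal sketch on toy records using only the standard library. The helper `flatten_value` and the ';' separator are hypothetical choices, and the toy records share the same keys, whereas real grant records vary in which fields they contain, so the field names would need to be collected across all records first.

```python
import csv
import io


def flatten_value(value, separator=';'):
    """Join list-valued fields into a single string; pass other values through."""
    if isinstance(value, list):
        return separator.join(str(item) for item in value)
    return value


# toy records; values are illustrative only
toy_records = [
    {'OpportunityID': '1', 'CategoryOfFundingActivity': ['AG', 'ENV', 'NR'], 'Description': 'Ants, eradicated'},
    {'OpportunityID': '2', 'CategoryOfFundingActivity': 'ST', 'Description': 'No commas here'},
]

flat_records = [{key: flatten_value(value) for key, value in record.items()} for record in toy_records]

# csv.DictWriter quotes any field containing the delimiter, so commas in free text are safe
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(flat_records[0].keys()))
writer.writeheader()
writer.writerows(flat_records)
csv_text = buffer.getvalue()
print(csv_text)
```

The trade-off is that the list structure is only recoverable by splitting on the separator again, so the separator should be a character that doesn't occur in the field values.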

### What about open science?

In this notebook we have looked at the data contained within the [grants.gov](https://www.grants.gov/web/grants) export.  In the next notebook ('Open Science Overview'), we'll take a first look at this data more specifically through the lens of open science infrastructure.