# USG grants crawl
## Agency-specific replication from Lee & Chung (2022)

### Previously

In the previous chapter we replicated some of our earlier analysis, which had used a list of open-science terms of our own creation, this time using an empirically derived list of terms from [Lee & Chung (2022)](https://doi.org/10.47989/irpaper949). While the results look compelling, and at least appear plausible relative to our intuitions, we should remain skeptical of them until we have done some thorough sanity checks.

In this chapter we'll implement some sanity checks. These will include confirming that the extreme ends of the term-matching distributions (e.g. many term matches vs. no term matches) do indeed reflect the sorts of grants we would expect. We'll also check that our quantitative analyses of these descriptions actually return the sorts of results we would expect.
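
As a rough illustration of the first kind of check, the sketch below counts how many terms from a list appear in each description and then pulls out the two ends of that distribution for manual reading. The term list, the descriptions, and the function name `count_term_matches` are all hypothetical stand-ins rather than the Lee & Chung (2022) terms or the real grant data.

```python
import re

# Hypothetical term list and grant descriptions, for illustration only
terms = ["open data", "data sharing", "reproducibility"]
descriptions = {
    "A": "Supports open data, data sharing, and reproducibility in genomics.",
    "B": "Eradication of invasive ants on a remote atoll.",
    "C": "Encourages data sharing among partner institutions.",
}

def count_term_matches(text, terms):
    """Return how many distinct terms occur in text (case-insensitive, whole phrases)."""
    lowered = text.lower()
    return sum(
        1 for term in terms
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
    )

match_counts = {key: count_term_matches(text, terms)
                for key, text in descriptions.items()}

# Sort from most to fewest matches so both tails are easy to inspect by eye
ranked = sorted(match_counts.items(), key=lambda item: item[1], reverse=True)
most_matches = ranked[0]
fewest_matches = ranked[-1]
print("most matches:", most_matches)
print("fewest matches:", fewest_matches)
```

Reading the full text of the descriptions at each extreme is the actual check; the counts only tell us which descriptions to read.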

### Loading the database once more

Let's begin by loading the database provided by grants.gov, which is stored in XML format.
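
For readers curious about what turning an XML export into tabular records involves, here is a minimal sketch using only the standard library. The XML layout (a `Grants` root with `Opportunity` children) and the helper name `xml_to_records` are assumptions made for illustration; the real file's structure, and whatever `detectLocalGrantData` does internally, may well differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the downloaded XML file
sample_xml = """
<Grants>
  <Opportunity>
    <OpportunityID>1</OpportunityID>
    <AgencyCode>DOI-FWS</AgencyCode>
    <Description>Funds under this award are to be used for habitat work.</Description>
  </Opportunity>
  <Opportunity>
    <OpportunityID>2</OpportunityID>
    <AgencyCode>HHS-NIH11</AgencyCode>
  </Opportunity>
</Grants>
""".strip()

def xml_to_records(xml_text, fields):
    """Return one dict per child element of the root, keeping only the named fields."""
    root = ET.fromstring(xml_text)
    records = []
    for opportunity in root:
        # findtext returns None when a field is absent, which marks it as missing
        records.append({field: opportunity.findtext(field) for field in fields})
    return records

fields = ["OpportunityID", "AgencyCode", "Description"]
records = xml_to_records(sample_xml, fields)
for record in records:
    print(record)
```

A list of dictionaries like this can be passed directly to `pandas.DataFrame` to obtain a table resembling the one shown below.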

In [1]:
# import our helper functions
import sys
import os
sys.path.insert(0, os.path.abspath('../src'))
import grantsGov_utilities

# local data storage directory
localDataDir='inputData'

grantsDF=grantsGov_utilities.detectLocalGrantData(localPath='../'+localDataDir,forceDownload=True)
grantsDF

Unnamed: 0,OpportunityID,OpportunityTitle,OpportunityNumber,AgencyCode,AgencyName,AwardCeiling,AwardFloor,EstimatedTotalProgramFunding,ExpectedNumberOfAwards,Description
0,262148,Establishment of the Edmund S. Muskie Graduate...,SCAPPD-14-AW-161-SCA-08152014,DOS-SA,Bureau of South and Central Asian Affairs,600000,400000,600000,1,The Office of Press and Public Diplomacy of th...
1,262149,Eradication of Yellow Crazy Ants on Johnston A...,F14AS00402,DOI-FWS,Fish and Wildlife Service,0,0,0,,Funds under this award are to be used for the ...
2,131073,"Cooperative Ecosystem Studies Unit, Piedmont S...",G12AS20003,DOI-USGS1,Geological Survey,0,0,31900,1,The USGS Southeast Ecological Science Center s...
3,131094,Plant Feedstock Genomics for Bioenergy: A Joi...,DE-FOA-0000598,PAMS-SC,Office of Science,500000,200000,6000000,10,The U.S. Department of Energy&apos;s Office of...
4,131095,Management of HIV-Related Lung Disease and Car...,RFA-HL-12-034,HHS-NIH11,National Institutes of Health,400000,,2000000,,This FOA invites clinical trials planning gran...
...,...,...,...,...,...,...,...,...,...,...
70325,262109,2014/2015 Social Responsibility through Englis...,RELO-BP-MOB,DOS-HUN,U.S. Mission to Hungary,152418,152418,152418,1,"In close consultation with RELO Budapest, the ..."
70326,262108,"Notice of Intent to Award - Fort McHenry, Balt...",NPS-14-NERO-0124,DOI-NPS,National Park Service,65000,65000,65000,1,United States Department of the Interior Natio...
70327,262112,Fish and Wildlife Coordination Act,R14AS00070,DOI-BOR,Bureau of Reclamation,525000,525000,525000,1,"To provide financial assistance, through grant..."
70328,131053,USAID/Uganda Literacy and Health Education Pro...,RFA-617-12-000001,USAID-UGA,Uganda USAID-Kampala,57000000,0,57000000,1,Literacy Program is a 5-year program to improv...


### Cleaning
As before, we need to do a bit of cleaning, so let's do a more comprehensive version of that here.

Note: this may take a moment.

In [2]:
grantsDF=grantsGov_utilities.prepareGrantsDF(grantsDF, repair=True)
grantsDF

OpportunityID                    int64
OpportunityTitle                object
OpportunityNumber               object
AgencyCode                      object
AgencyName                      object
AwardCeiling                     int64
AwardFloor                       int64
EstimatedTotalProgramFunding     int64
ExpectedNumberOfAwards           int64
Description                     object
dtype: object
62144 grant agency name or code value records altered


  warn('NOTE: this function CHANGES the values / content of the grantsDF from the information contained on grants.gov, including but not limited to adding data columns, replacing null/empty values, and/or inferring missing values.')


28027 grant funding value records repaired


Unnamed: 0,OpportunityID,OpportunityTitle,OpportunityNumber,AgencyCode,AgencySubCode,AgencyName,AwardCeiling,AwardFloor,EstimatedTotalProgramFunding,ExpectedNumberOfAwards,Description
0,262148,Establishment of the Edmund S. Muskie Graduate...,SCAPPD-14-AW-161-SCA-08152014,DOS,SA,Bureau of South and Central Asian Affairs,600000,400000,600000,1,The Office of Press and Public Diplomacy of th...
1,262149,Eradication of Yellow Crazy Ants on Johnston A...,F14AS00402,DOI,FWS,Fish and Wildlife Service,0,0,0,0,Funds under this award are to be used for the ...
2,131073,"Cooperative Ecosystem Studies Unit, Piedmont S...",G12AS20003,DOI,USGS1,Geological Survey,0,0,31900,1,The USGS Southeast Ecological Science Center s...
3,131094,Plant Feedstock Genomics for Bioenergy: A Joi...,DE-FOA-0000598,PAMS,SC,Office of Science,500000,200000,6000000,10,The U.S. Department of Energy&apos;s Office of...
4,131095,Management of HIV-Related Lung Disease and Car...,RFA-HL-12-034,HHS,NIH11,National Institutes of Health,400000,0,2000000,0,This FOA invites clinical trials planning gran...
...,...,...,...,...,...,...,...,...,...,...,...
70325,262109,2014/2015 Social Responsibility through Englis...,RELO-BP-MOB,DOS,HUN,U.S. Mission to Hungary,152418,152418,152418,1,"In close consultation with RELO Budapest, the ..."
70326,262108,"Notice of Intent to Award - Fort McHenry, Balt...",NPS-14-NERO-0124,DOI,NPS,National Park Service,65000,65000,65000,1,United States Department of the Interior Natio...
70327,262112,Fish and Wildlife Coordination Act,R14AS00070,DOI,BOR,Bureau of Reclamation,525000,525000,525000,1,"To provide financial assistance, through grant..."
70328,131053,USAID/Uganda Literacy and Health Education Pro...,RFA-617-12-000001,USAID,UGA,Uganda USAID-Kampala,57000000,0,57000000,1,Literacy Program is a 5-year program to improv...
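
Comparing the two tables, the cleaning step appears to split the combined agency code (e.g. `DOI-FWS`) into a top-level `AgencyCode` and an `AgencySubCode`, and to replace empty funding values with 0. The sketch below reproduces only that visible behavior on a single row; the function names are hypothetical, and this is not the implementation of `prepareGrantsDF`, which (per its warning) may also infer missing values in ways not shown here.

```python
def split_agency_code(full_code):
    """Split 'DOI-FWS' into ('DOI', 'FWS'); a code without a hyphen gets an empty sub-code."""
    top_level, _, sub_code = full_code.partition("-")
    return top_level, sub_code

def fill_missing_amount(value):
    """Treat empty or missing funding values as 0, otherwise convert to int."""
    if value is None or str(value).strip() == "":
        return 0
    return int(value)

# One raw row, modeled on the HHS-NIH11 record in the table above
raw_row = {"AgencyCode": "HHS-NIH11", "AwardCeiling": "400000", "AwardFloor": ""}

agency_code, agency_sub_code = split_agency_code(raw_row["AgencyCode"])
clean_row = {
    "AgencyCode": agency_code,
    "AgencySubCode": agency_sub_code,
    "AwardCeiling": fill_missing_amount(raw_row["AwardCeiling"]),
    "AwardFloor": fill_missing_amount(raw_row["AwardFloor"]),
}
print(clean_row)
```

Note that filling a missing value with 0 makes it indistinguishable from a genuine 0, which is worth keeping in mind when interpreting funding totals later.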


In [15]:
# from https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0
import gensim
from gensim.utils import simple_preprocess
import nltk
import os
nltk.download('stopwords')
from nltk.corpus import stopwords

# localPath must be defined before the stopword file paths below are built
localPath = os.getcwd()

# start from the standard English stopwords
stop_words = stopwords.words('english')

def load_stopword_file(filename):
    # read a comma-separated stopword file stored one directory above localPath,
    # stripping stray whitespace/newlines and skipping empty entries
    filepath = os.path.join(os.path.split(localPath)[0], filename)
    with open(filepath, 'r') as f:
        return [word.strip() for word in f.read().split(',') if word.strip()]

# load up the custom grant-related stopword list
# (old way, kept for reference)
#stop_words.extend(['http','quot','new','notice','please', 'may','award','awards','application','foa','announcement','must','applications','proposal','applications','funding','provide','support','opportunity','grant','include','includes','eligible'])
# (new way)
grantStopwords = load_stopword_file('grantSpecificStopwords.txt')
stop_words.extend(grantStopwords)

# extend with abbreviations
abbrvStopwords = load_stopword_file('grantSpecificStopwords_abbreviations.txt')
stop_words.extend(abbrvStopwords)

# optionally extend with a more aggressive stopword list
aggressive = True
if aggressive:
    aggressiveStopwords = load_stopword_file('grantSpecificStopwords_aggressive.txt')
    stop_words.extend(aggressiveStopwords)

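# create a function to tokenize each document (lowercases and drops punctuation)
def sent_to_words(documents):
    for document in documents:
        # deacc=True strips accent marks; simple_preprocess itself discards punctuation
        yield simple_preprocess(str(document), deacc=True)
# create a function to remove stopwords from already-tokenized documents
def remove_stopwords(texts):
    # a set makes the membership test much faster than scanning the list
    stop_word_set = set(stop_words)
    return [[word for word in doc if word not in stop_word_set] for doc in texts]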
# get the descriptions as a list
grantDescriptions=list(grantsDF['Description'].values)
# remove the punctuation
grantDescriptions=sent_to_words(grantDescriptions)
# remove the stopwords
grantDescriptions=remove_stopwords(grantDescriptions)
print('Text cleaned')

# now for the stuff that takes a long time
import gensim.corpora as corpora
# Create Dictionary
id2word = corpora.Dictionary(grantDescriptions)
# Create Corpus
texts = grantDescriptions
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
print('Corpus creation complete')
# View
print(corpus[:1][0][:30])

from pprint import pprint
# number of topics
num_topics = 60
# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics)
# Print the top keywords for a sample of the topics (print_topics shows 20 topics by default)
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dbullock\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Text cleaned
Corpus creation complete
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 2), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 4), (29, 1)]
[(37,
  '0.015*"research" + 0.007*"health" + 0.006*"data" + 0.006*"system" + '
  '0.005*"systems" + 0.005*"science" + 0.004*"government" + 0.004*"wildlife" + '
  '0.004*"treatment" + 0.003*"part"'),
 (32,
  '0.033*"research" + 0.017*"education" + 0.013*"start" + 0.012*"head" + '
  '0.012*"science" + 0.009*"families" + 0.009*"children" + 0.009*"services" + '
  '0.005*"training" + 0.005*"additional"'),
 (51,
  '0.009*"research" + 0.006*"cultural" + 0.005*"health" + 0.005*"hsr" + '
  '0.005*"care" + 0.004*"history" + 0.004*"data" + 0.004*"sites" + '
  '0.004*"education" + 0.003*"services"'),
 (9,
  '0.010*"justice" + 0.009*"youth" + 0.009*"system" + 0.008*"research" + '
  '0.007*"data" + 0.006*"
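
The list of pairs printed after "Corpus creation complete" is a bag-of-words representation: each pair is (token id, number of times that token occurs in the document). The sketch below mimics that idea with the standard library only, to make the format easier to read. The function names are hypothetical, and gensim's `Dictionary` may assign ids in a different order than this toy version does.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id

def to_bag_of_words(doc, token_to_id):
    """Return sorted (token id, count) pairs for one tokenized document."""
    counts = Counter(token_to_id[token] for token in doc if token in token_to_id)
    return sorted(counts.items())

# Two tiny, made-up tokenized documents
docs = [["wildlife", "habitat", "wildlife"], ["health", "research", "habitat"]]
vocab = build_vocabulary(docs)
bows = [to_bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

Word order is discarded in this representation, which is why the topic model can only pick up on which words co-occur within descriptions, not on phrasing.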

In [None]:
# note: newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
import pyLDAvis.gensim
import pickle 
import pyLDAvis
localPath=os.getcwd()
# Visualize the topics
pyLDAvis.enable_notebook()
LDAvis_data_filepath = os.path.join(os.path.split(localPath)[0],'results','ldavis_prepared_'+str(num_topics))
# this step is a bit time consuming: leave the condition True to run the
# visualization prep yourself, or set it to False to load a previously pickled result
if True:
    LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
    with open(LDAvis_data_filepath + '.pkl', 'wb') as f:
        pickle.dump(LDAvis_prepared, f)
# load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath + '.pkl', 'rb') as f:
    LDAvis_prepared = pickle.load(f)
pyLDAvis.save_html(LDAvis_prepared, LDAvis_data_filepath +'.html')
LDAvis_prepared
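
The cell above follows a compute-then-cache pattern: run the slow preparation step, pickle the result, and reload it from disk. A slightly more automatic variant, sketched here with a hypothetical helper `load_or_compute` and a stand-in for the expensive step, only recomputes when no cached file exists or when recomputation is forced.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute, force=False):
    """Return the cached result at cache_path, computing and saving it if needed."""
    if force or not os.path.exists(cache_path):
        result = compute()
        with open(cache_path, "wb") as f:
            pickle.dump(result, f)
        return result
    with open(cache_path, "rb") as f:
        return pickle.load(f)

# Stand-in for a slow step such as the visualization prep; records each call
calls = []
def expensive_step():
    calls.append(1)
    return {"num_topics": 60}

cache_file = os.path.join(tempfile.mkdtemp(), "prepared.pkl")
first = load_or_compute(cache_file, expensive_step)   # computes and saves
second = load_or_compute(cache_file, expensive_step)  # loads from disk
print(first == second, len(calls))
```

One caveat with any such cache: if the model or corpus changes, the cached file is stale and should be regenerated (here, by passing `force=True`).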