# USG grants crawl
## Co-Occurrence frequency analysis, within-agency

### Previously

In the previous chapter we looked at how often a selected set of open-science infrastructure related terms from [Lee & Chung (2022)](https://doi.org/10.47989/irpaper949) showed up in [grants.gov](https://www.grants.gov/web/grants) grant descriptions, and which agencies' grants they were showing up in.  

For the purposes of our investigation, though, we might be curious how frequently certain words occur together _within a specific agency_.

### Loading the database once more

Let's begin by loading up the NSF grant database, which the helper functions read from a local JSON file.

In [2]:
# standard-library imports
import sys
import os
import glob
import json
import subprocess


# find the head directory of the repo
def getGitRoot():
    # ask git for the top-level directory and strip the trailing newline
    return subprocess.check_output(['git', 'rev-parse', '--show-toplevel']).decode('utf-8').strip()

sys.path.insert(0, getGitRoot() + '/src')
sys.path.insert(0, getGitRoot() + '/inputData')
# import our helper functions
import grantsGov_utilities as grantsGov_utilities

expectedDataDir=getGitRoot() + '/inputData'

# find the path to the local json file if it exists
nsfGrantsJSONpath=grantsGov_utilities.detectLocalNSFData(dataDirectory=expectedDataDir+'/NSF_grant_data/')

# take the json file path, load it, and convert it to a dataframe
nsfGrantsDF=grantsGov_utilities.NSFjson2DF(nsfGrantsJSONpath)
# print the first 10 rows of the dataframe
nsfGrantsDF.head(10)

The local NSF grant data was found at /media/dan/HD4/coding/gitDir/USG_grants_crawl/inputData/NSF_grant_data/NSF_grants.json.
Attempting load of /media/dan/HD4/coding/gitDir/USG_grants_crawl/inputData/NSF_grant_data/NSF_grants.json
Loading .json file/media/dan/HD4/coding/gitDir/USG_grants_crawl/inputData/NSF_grant_data/NSF_grants.json


Unnamed: 0,AwardTitle,AGENCY,AwardEffectiveDate,AwardExpirationDate,AwardTotalIntnAmount,AwardAmount,AwardInstrument,Organization,ProgramOfficer,AbstractNarration,...,Investigator,Institution,Performance_Institution,ProgramElement,ProgramReference,FUND_OBLG,Appropriation,Fund,POR,FoaInformation
0,Systematics of Arachnids Using Whole Mitochond...,NSF,09/01/2004,08/31/2009,0.0,317548,{'Value': 'Standard Grant'},"{'Code': '08010206', 'Directorate': {'Abbrevia...","{'SignBlockName': 'Judith Skog', 'PO_EMAI': No...",\n\nABSTRACT\nDEB 0416628\nMasta\n\nA grant ha...,...,"[{'FirstName': 'Susan', 'LastName': 'Masta', '...","{'Name': 'Portland State University', 'CityNam...","{'Name': 'Portland State University', 'CityNam...","[{'Code': '1171', 'Text': 'PHYLOGENETIC SYSTEM...","[{'Code': '1171', 'Text': 'PHYLOGENETIC SYSTEM...","[2004~299998, 2005~11550, 2006~6000]",,,,
1,CAREER: Rapid host-parasite evolution and its ...,NSF,09/01/2012,04/30/2017,678721.0,746459,{'Value': 'Continuing Grant'},"{'Code': '08010208', 'Directorate': {'Abbrevia...","{'SignBlockName': 'Douglas Levey', 'PO_EMAI': ...",As rates of parasitism increase and species in...,...,"{'FirstName': 'Meghan', 'LastName': 'Duffy', '...",{'Name': 'Regents of the University of Michiga...,"{'Name': 'University of Michigan Ann Arbor', '...","[{'Code': '1182', 'Text': 'POP & COMMUNITY ECO...","[{'Code': '1045', 'Text': 'CAREER-Faculty Erly...","[2011~82637, 2012~157512, 2013~211950, 2014~15...","[{'Code': '0111', 'Name': 'NSF RESEARCH & RELA...","[{'Code': '01001112DB', 'Name': 'NSF RESEARCH ...","{'DRECONTENT': '<div class=""porColContainerWBG...",
2,DISSERTATION RESEARCH: The influence of wildfi...,NSF,06/01/2014,05/31/2015,19590.0,19590,{'Value': 'Standard Grant'},"{'Code': '08010209', 'Directorate': {'Abbrevia...","{'SignBlockName': 'Henry L. Gholz', 'PO_EMAI':...",Although wildfires are important disturbances ...,...,"[{'FirstName': 'Mažeika', 'LastName': 'Sullivá...","{'Name': 'Ohio State University', 'CityName': ...","{'Name': 'Ohio State University', 'CityName': ...","{'Code': '1181', 'Text': 'ECOSYSTEM STUDIES'}","[{'Code': '9169', 'Text': 'BIODIVERSITY AND EC...",2014~19590,"{'Code': '0114', 'Name': 'NSF RESEARCH & RELAT...","{'Code': '01001415DB', 'Name': 'NSF RESEARCH &...","{'DRECONTENT': '<div class=""porColContainerWBG...",
3,Direct Conversion of Carbon into Q-carbon and ...,NSF,09/01/2017,08/31/2020,238995.0,286495,{'Value': 'Standard Grant'},"{'Code': '03070000', 'Directorate': {'Abbrevia...","{'SignBlockName': 'Lynnette Madsen', 'PO_EMAI'...",NON-TECHNICAL DESCRIPTION: This project focuse...,...,"{'FirstName': 'Jagdish', 'LastName': 'Narayan'...","{'Name': 'North Carolina State University', 'C...","{'Name': 'North Carolina State University', 'C...","{'Code': '1774', 'Text': 'CERAMICS'}","[{'Code': '7237', 'Text': 'NANO NON-SOLIC SCI ...","[2017~238995, 2020~47500]","[{'Code': '0117', 'Name': 'NSF RESEARCH & RELA...","[{'Code': '01001718DB', 'Name': 'NSF RESEARCH ...","{'DRECONTENT': '<div class=""porColContainerWBG...",
4,MRI: Acquisition of a GPU Accelerated Vermont ...,NSF,09/01/2018,08/31/2020,893120.0,893120,{'Value': 'Standard Grant'},"{'Code': '05090000', 'Directorate': {'Abbrevia...","{'SignBlockName': 'Alejandro Suarez', 'PO_EMAI...",This project will enable interdisciplinary sci...,...,"[{'FirstName': 'Joshua', 'LastName': 'Bongard'...",{'Name': 'University of Vermont & State Agricu...,{'Name': 'University of Vermont & State Agricu...,"{'Code': '1189', 'Text': 'Major Research Instr...","[{'Code': '026Z', 'Text': 'NSCI: National Stra...",2018~893120,"[{'Code': '0117', 'Name': 'NSF RESEARCH & RELA...","[{'Code': '01001718DB', 'Name': 'NSF RESEARCH ...","{'DRECONTENT': '<div class=""porColContainerWBG...",
5,The Origins And Impact Of Modern Human Diets,NSF,02/01/2015,01/31/2017,34269.0,34269,{'Value': 'Standard Grant'},"{'Code': '04040000', 'Directorate': {'Abbrevia...","{'SignBlockName': 'John Yellen', 'PO_EMAI': 'j...",This research focuses on the origins and devel...,...,"[{'FirstName': 'Curtis', 'LastName': 'Marean',...","{'Name': 'Arizona State University', 'CityName...","{'Name': 'Arizona State University', 'CityName...","{'Code': '1391', 'Text': 'Archaeology'}","{'Code': '1391', 'Text': 'ARCHAEOLOGY'}",2015~34269,"{'Code': '0115', 'Name': 'NSF RESEARCH & RELAT...","{'Code': '01001516DB', 'Name': 'NSF RESEARCH &...","{'DRECONTENT': '<div class=""porColContainerWBG...",
6,The Application of Resilient Bearings to a Cra...,NSF,10/01/1981,03/31/1982,28775.0,28775,{'Value': 'Standard Grant'},"{'Code': '07040000', 'Directorate': {'Abbrevia...","{'SignBlockName': 'name not available', 'PO_EM...",,...,"{'FirstName': 'Natan', 'LastName': 'Parsons', ...","{'Name': 'Cambridge Collaborative', 'CityName'...","{'Name': None, 'CityName': None, 'StateCode': ...","{'Code': '5370', 'Text': 'SBIR/STTR Operations'}",,1981~28775,,,,"[{'Code': '0106000', 'Name': 'Materials Resear..."
7,A Mathematical Model for Intramolecular Diffusion,,06/01/1993,05/31/1995,28011.0,28011,{'Value': 'Standard Grant'},"{'Code': '08080205', 'Directorate': {'Abbrevia...",{'SignBlockName': 'Deborah A. Joseph'},The award will support the development of a ne...,...,"{'FirstName': 'Fred', 'LastName': 'Cohen', 'Em...",{'Name': 'University of California-San Francis...,,"[{'Code': '1107', 'Text': 'COMPUTATIONAL BIOLO...","[{'Code': '1107', 'Text': 'COMPUTATIONAL BIOLO...",,,,,"[{'Code': '0510301', 'Name': 'Structure & Func..."
8,"Enabling Large-Scale, High-Resolution, and Rea...",NSF,10/01/2009,09/30/2013,38610.0,38610,{'Value': 'Standard Grant'},"{'Code': '05090000', 'Directorate': {'Abbrevia...","{'SignBlockName': 'Irene Qualters', 'PO_EMAI':...",This award facilitates scientific research usi...,...,"[{'FirstName': 'Liqiang', 'LastName': 'Wang', ...","{'Name': 'University of Wyoming', 'CityName': ...","{'Name': 'University of Wyoming', 'CityName': ...","{'Code': '7781', 'Text': 'Leadership-Class Com...","[{'Code': '7781', 'Text': 'PETASCALE - TRACK 1...",2009~38610,"{'Code': '0109', 'Name': 'NSF RESEARCH & RELAT...","{'Code': '01000910DB', 'Name': 'NSF RESEARCH &...",,
9,Collaborative Research: Industry University...,NSF,09/15/2004,08/31/2010,0.0,642605,{'Value': 'Continuing Grant'},"{'Code': '07050000', 'Directorate': {'Abbrevia...","{'SignBlockName': 'Rathindra DasGupta', 'PO_EM...",This collaborative Industry/University Coopera...,...,"[{'FirstName': 'David', 'LastName': 'Goodman',...","{'Name': 'Polytechnic University of New York',...","{'Name': 'Polytechnic University of New York',...","[{'Code': '5761', 'Text': 'IUCRC-Indust-Univ C...","[{'Code': '0000', 'Text': 'UNASSIGNED'}, {'Cod...","[2004~160000, 2006~90000, 2007~150000, 2008~24...","[{'Code': '0108', 'Name': 'NSF RESEARCH & RELA...","[{'Code': '01000809DB', 'Name': 'NSF RESEARCH ...",,"{'Code': '0400000', 'Name': 'Industry Universi..."


### Keywords and terms

Although we aren't going to inspect the keywords and agencies on their own this time, we still need to collect them.  Once we have loaded them, we can determine which words are occurring in which grants, and which agencies those grants are associated with.  The resulting information can be placed in a dictionary, where the relevant information can be accessed by using the [tuple](https://www.w3schools.com/python/python_tuples.asp) corresponding to the desired agency and keyword (e.g. `('agency','keyword')`).
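To make the tuple-keyed dictionary idea concrete, here is a minimal sketch on toy data. The grant IDs, agency names, and variable names are all hypothetical, and this is not necessarily how `grantsGov_utilities.evalGrantCoOccurrence` builds its output; it only illustrates the lookup pattern described above.

```python
# Hypothetical toy data: grant IDs found for each keyword, and grant IDs per agency
keywordToGrants = {
    'reproducibility': ['G1', 'G2', 'G4'],
    'preprints': ['G2', 'G3'],
}
agencyToGrants = {
    'Agency A': ['G1', 'G2'],
    'Agency B': ['G3', 'G4'],
}

# build a dictionary keyed by (agency, keyword) tuples
agencyKeywordGrants = {}
for agency, agencyGrants in agencyToGrants.items():
    for keyword, keywordGrants in keywordToGrants.items():
        # grants that both belong to the agency and mention the keyword
        shared = sorted(set(agencyGrants) & set(keywordGrants))
        agencyKeywordGrants[(agency, keyword)] = shared

# look up a single agency / keyword combination using its tuple key
print(agencyKeywordGrants[('Agency A', 'reproducibility')])
# the count for a cell is just the length of the stored list
print(len(agencyKeywordGrants[('Agency B', 'preprints')]))
```

Storing the list of grant IDs (rather than only a count) means the same dictionary can later be reduced to counts with `len`, or used to retrieve the grants themselves.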

In [4]:
import json
import copy
import seaborn as sns
import itertools
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
import ipywidgets as widgets
import re
import numpy as np

# HERE'S THE CHANGE FROM THE PREVIOUS NOTEBOOK
# open the keywords csv file
inputKeywords=pd.read_csv('../OSterms_LeeChung2022.csv')
print(inputKeywords)

# split it into a list.  Each term is kept on a separate line
keywords=inputKeywords['terms'].tolist()

# specify the column corresponding to the grant abstract
abstractColumn='AbstractNarration'
# specify the column corresponding to the grant ID number
grantIDColumn='AwardID'
# specify the column and dictionary structure corresponding to the directorate
# this is a nested dictionary
directorateColumn='Organization'
directorateFieldFirst='Directorate'
directorateFieldSecond='LongName'

# get a vector of the grant IDs
grantIDs=nsfGrantsDF[grantIDColumn].tolist()
# create a dictionary that maps the dataframe index to the grant IDs
grantID2index={grantIDs[i]:i for i in range(len(grantIDs))}

# find the unique directorates
#uniqueDirectorates=nsfGrantsDF['Organization'].map(lambda x: x[directorateFieldFirst]).unique()
uniqueDirectorates=nsfGrantsDF[directorateColumn].map(lambda x: x.get(directorateFieldFirst).get(directorateFieldSecond)).unique()

# create a dictionary that maps the directorate to a list of grants
directorate2grants={directorate:[] for directorate in uniqueDirectorates}
# loop over the grants using the grantID2index dictionary
for grantID in grantID2index:
    # get the dataframe index for the grant
    index=grantID2index[grantID]
    # get the directorate for the grant
    #directorate=nsfGrantsDF.loc[index]['Organization'][directorateFieldFirst]
    directorate=nsfGrantsDF.loc[index][directorateColumn].get(directorateFieldFirst).get(directorateFieldSecond)
    # add the grant ID to the list of grants for the directorate
    directorate2grants[directorate].append(grantID)

# search the abstracts for the keywords
grantKeywordFindsOut=grantsGov_utilities.searchInputListsForKeywords(nsfGrantsDF[abstractColumn],keywords)

# use evalGrantCoOccurrence to get a co-occurrence tuple dictionary
termBYdirectorate_dictionary=grantsGov_utilities.evalGrantCoOccurrence([grantKeywordFindsOut,directorate2grants])

# use the co-occurrence tuple dictionary to create an array of co-occurrence counts
countArray=grantsGov_utilities.tupleDictionaries_to_NDarray(termBYdirectorate_dictionary,operation=len)

#now make an interactive figure
def plotCoOccurance_Matrix(inputMatrix,inputAxis,axisItemLabels):           
    # mask out the diagonal so it doesn't overwhelm the plot
    diagonalMask=np.eye(len(axisItemLabels),dtype=bool)
    # copy the matrix so it can be modified 
    plotMatrix=copy.deepcopy(inputMatrix)
    # set the diagonal to zero
    plotMatrix[diagonalMask]=np.zeros(len(axisItemLabels))
    sns.heatmap(data=plotMatrix,ax=inputAxis, yticklabels=axisItemLabels,xticklabels=axisItemLabels, cmap='viridis',  norm=LogNorm(),cbar_kws={'label': 'Grant Count\n(log-scaled)'})
    
    # return the plot matrix, if necessary
    return plotMatrix

    
def heatmap_plot(matrix, heatmap_ax, row, column):
    """
    Plots the heatmap with a crosshair at the desired location
    """
    # if both are empty, default the crosshair to the first cell
    if row == '' and column == '':
        row = 0
        column = 0
    # if only row is empty, default to column
    elif row == '':
        row = column
    # if only column is empty, default to row
    elif column == '':
        column = row
    # create the heatmap plot
    # NOTE: uniqueDirectorates and keywords = calls outside of function inputs
    sns.heatmap(matrix, ax=heatmap_ax, norm=LogNorm(), cmap='viridis', cbar=True, xticklabels=list(uniqueDirectorates) , yticklabels=list(keywords), cbar_kws={'label': 'Grant Count\n(log-scaled)'})
    # create the crosshair
    heatmap_ax.axvline(x=column+.5, color='red', linewidth=2)
    heatmap_ax.axhline(y=row+.5, color='red', linewidth=2)



# create a function that updates the heatmap
def heatmap_and_coOccurance(countMatrix,rowSelect,columnSelect):
    """
    Plots both the heatmap and the term co-occurrence matrix in a 2 by 1 subplot
    """
    fig, ax = plt.subplots(2, 1, figsize=(10, 20))

    # sum it along one of the keyword dimensions
    keywordByAgencyArray=countMatrix.sum(axis=0)
    # plot the heatmap
    heatmap_plot(keywordByAgencyArray, heatmap_ax=ax[0], row=rowSelect, column=columnSelect)

    # get the keyword by keyword slice for the selected directorate
    coOccurance_Matrix=countMatrix[:,:,columnSelect]

    # NOTE: keywords and uniqueDirectorates = calls outside of function inputs
    # str() guards against the None entry in uniqueDirectorates
    directorateName=str(uniqueDirectorates[columnSelect])
    if not keywordByAgencyArray[rowSelect,columnSelect] > 0:
        # display a warning if no grants match the selection
        ax[1].text(0.5, 0.5, 'No grants for\n\n ' + keywords[rowSelect] + ' & ' + directorateName, horizontalalignment='center', verticalalignment='center', transform=ax[1].transAxes,  fontsize=30)
    else:
        plotCoOccurance_Matrix(coOccurance_Matrix, inputAxis=ax[1], axisItemLabels=keywords)

    # change title
    ax[1].set_title('Term co-occurrences\nfor ' + directorateName)

def update_plots(rowSelectName,columnSelectName):
    """
    Performs the updating
    """
    # NOTE: uniqueDirectorates, keywords, and countArray = calls outside of function inputs
    rowIndex=keywords.index(rowSelectName)
    colIndex=list(uniqueDirectorates).index(columnSelectName)
    heatmap_and_coOccurance(countArray,rowIndex,colIndex)
    
    
# link the dropdown menus to the update functions
#row_menu.observe(update_heatmap, names='value')
#col_menu.observe(update_heatmap, names='value')
# display the widgets
#display(row_menu)
#display(col_menu)

# update the heatmap
#update_heatmap(None)
# create a dropdown menu for the rows
row_menu = widgets.Dropdown(
    options=keywords,
    #value=,
    description='Row:',
    disabled=False,
)
# create a dropdown menu for the columns
col_menu = widgets.Dropdown(
    options=uniqueDirectorates,
    #value='',
    description='Column:',
    disabled=False,
    )


%matplotlib inline
from ipywidgets import interact
#establishes interactivity
interact(update_plots,rowSelectName=row_menu,columnSelectName=col_menu)



                                       categories                    terms
0        pre-registrations and registered reports       replication crisis
1        pre-registrations and registered reports              methodology
2        pre-registrations and registered reports          preregistration
3        pre-registrations and registered reports              replication
4        pre-registrations and registered reports       registered reports
5                                       preprints                preprints
6                                       preprints          social sciences
7                                 reproducibility          reproducibility
8                                 reproducibility             transparency
9                                 reproducibility            replicability
10                                reproducibility                 COVID-19
11                                reproducibility                   ethics
12                       

AttributeError: 'function' object has no attribute 'keys'

In [None]:
print(uniqueDirectorates)

['Direct For Biological Sciences'
 'Direct For Mathematical & Physical Scien'
 'Direct For Computer & Info Scie & Enginr'
 'Direct For Social, Behav & Economic Scie' 'Directorate For Engineering'
 'Office Of The Director' 'Dir for Tech, Innovation, & Partnerships'
 'Directorate For Geosciences' 'Direct For Education and Human Resources'
 'Directorate for STEM Education'
 'Office of Budget, Finance, & Award Management'
 'Natl Nanotechnology Coordinating Office' None
 'Office Of Information & Resource Mgmt'
 'Directorate for Computer & Information Science & Engineering'
 'Directorate for Social, Behavioral & Economic Sciences'
 'Office Of Polar Programs' 'Directorate for Biological Sciences'
 'National Coordination Office' 'Directorate for Geosciences'
 'OFFICE OF THE DIRECTOR' 'Directorate for Engineering'
 'Directorate for Mathematical & Physical Sciences'
 'Directorate for Education & Human Resources']


In [None]:

# find the grants that are associated with these keywords
grantFindsOut=grantsGov_utilities.searchGrantsDF_for_keywords(grantsDF,keywords)
# find the agencies associated with these
grantAgenciesOut=grantsGov_utilities.grants_by_Agencies(grantsDF)

# get a dataframe with the keyword by agency information
keywordsByAgency_dictionary=grantsGov_utilities.evalGrantCoOccurrence([grantFindsOut,grantFindsOut,grantAgenciesOut],formatOut='dictionary')
# get the counts for all of these
#keywordsByAgency_count_DF=keywordsByAgency_DF.applymap(lambda x: len(x))

NameError: name 'grantsDF' is not defined

### A small wait

Because the previous analysis isn't coded particularly efficiently, it can take a moment to complete.  Part of this has to do with the inefficiency of indexing back into the database, as well as the inefficient storage method for the information we are getting (i.e. appending to lists in a large dictionary).
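As a rough illustration of one way to reduce that overhead, the sketch below stores grant IDs as sets and counts co-occurrences with set intersections, rather than appending to lists. The data and names here are hypothetical, and this is an alternative sketch under those assumptions, not the approach used in `grantsGov_utilities`.

```python
from itertools import product

# Hypothetical toy inputs, stored as sets so intersections are cheap
keywordToGrants = {
    'reproducibility': {'G1', 'G2', 'G4'},
    'transparency': {'G2', 'G4', 'G5'},
}
agencyToGrants = {
    'Agency A': {'G1', 'G2'},
    'Agency B': {'G4', 'G5'},
}

# counts keyed by (keyword, keyword, agency) tuples
coOccurrenceCounts = {}
for (kw1, grants1), (kw2, grants2) in product(keywordToGrants.items(), repeat=2):
    # intersect the two keyword sets once, then reuse the result for every agency
    bothKeywords = grants1 & grants2
    for agency, agencyGrants in agencyToGrants.items():
        coOccurrenceCounts[(kw1, kw2, agency)] = len(bothKeywords & agencyGrants)

# number of Agency A grants mentioning both terms
print(coOccurrenceCounts[('reproducibility', 'transparency', 'Agency A')])
```

Note that this variant keeps only the counts; if the grant IDs themselves are needed later, the intersected sets would have to be stored instead.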

In any case, once we have the relevant data structure we can look at which agencies are using which terms, and also receive an output of the [grants.gov](https://www.grants.gov/web/grants) IDs associated with those grants.

In [None]:
# chat-davinci-002 prompt
# an iteractive jupyer notebook widget that returns two subplot windows.  The input is a numerical matrix.  The interface features two dropdown menus that allow you to select a row (i) and column (j) from the matrix.  On the left side of the subplot outputs, a matrix heatmap plotting the numerical data.  On the right side of the subplot outputs, a blank plot that is used to display text indicating the value found in the specific matrix (i,j) entry selected in the dropdown menus.

import ipywidgets as widgets
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
import pandas as pd
import seaborn as sns
from IPython.display import clear_output
from matplotlib.colors import LogNorm
import copy

def tupleDictionaries_to_NDarray(tupleDictionary,operation=len):
    """
    This function converts a dictionary with permuted tuples as the keys (e.g. keys = [list 1, list 2, list 3, etc.])
    to an ND array by applying an operation to each value (e.g. len(tupleDictionary[iKey]) for iKey in list(tupleDictionary.keys()))
    
    Think of this as pandas.DataFrame.applymap(), but for dictionaries.

    Parameters
    ----------
    tupleDictionary: dictionary
        A dictionary with permuted tuples as the keys (e.g. keys = [list 1, list 2, list 3, etc.])
    operation: function
        The function applied to each dictionary value; the default, len, produces counts
    
    Returns
    -------
    ndArrayHolder : numpy array
        An N-dimensional count array
    """
    import numpy as np
    # convert the keys to an array
    keysArray=np.asarray(list(tupleDictionary.keys()))
    # create a list to hold the unique labels
    uniqueDimLabels=[]
    # iterate through the sets of key elements
    for iDims in range(keysArray.shape[1]):
        # append the unique key values for each dimension to the holder
        uniqueDimLabels.append(list(np.unique(keysArray[:,iDims])))
    # create a array holder for this 
    ndArrayHolder=np.zeros([len(iDems) for iDems in  uniqueDimLabels],dtype=np.int32)
    # iterate through the keys
    for iKeys in list(tupleDictionary.keys()):
        # get the current coords associated with the given key
        indexCoords=[ uniqueDimLabels[iCoords].index(iKeys[iCoords]) for iCoords in range(len(iKeys))]
        # do the relevant operation and place the output it in the relevant space
        ndArrayHolder[tuple(indexCoords)]=operation(tupleDictionary[iKeys])
    return ndArrayHolder
        
#get the count array
countArray=tupleDictionaries_to_NDarray(keywordsByAgency_dictionary,operation=len)
# get the unique agency names
uniqueAgencies=grantsDF['AgencyCode'].unique()





### Interacting with the plot 

The widget should allow you to select which terms to work with.  For the moment (i.e. early stages of this notebook) the interface is relatively rudimentary, but the heatmap plot should feature a crosshair indicating which agency and term you are looking at.  The plot beneath that should include a list of the grants.gov IDs.  In many cases no grants are found meeting the criteria, and so a large text indicator should appear stating this.  However, in the event that grants are found, they should be listed.  Currently the text scaling for this feature is rudimentary, and so if too many are found their font might be extremely small (future [modifications](https://stackoverflow.com/questions/55729075/matplotlib-how-to-autoscale-font-size-so-that-text-fits-some-bounding-box) could address this).  Additionally, the text elements themselves may be [capable of being hyperlinks](https://matplotlib.org/stable/gallery/misc/hyperlinks_sgskip.html).

Specific to the plot itself, it's clear to see that the inclusion of "research" is throwing off the analysis.  This is likely because of how generic this term is. 


### How similar are agencies' usages of terms?

One question we might ask is whether agencies are mentioning these terms in different ways, that is, whether the patterns of co-occurrence are _different_ for different agencies.  To answer this question we can take the co-occurrence matrix _for each agency_ and compare them to one another (thus resulting in _another_ matrix, this time agency by agency).  A suitable tool for this is the [cosine distance or cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).  It provides a measure of distance between two collections of equally sized / shaped quantifications (in this case, the [unrolled](https://numpy.org/doc/stable/reference/generated/numpy.ravel.html) co-occurrence matrices).  Also, because we don't want the total number of grants for a particular agency to impact this analysis, we'll normalize the vectors (this is strictly unnecessary, because cosine distance does not change when a vector is rescaled by a positive constant).
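The following minimal sketch illustrates the cosine distance on two made-up count vectors, and checks that dividing each vector by its sum leaves the distance unchanged. The helper names and the example vectors are hypothetical; the notebook's own analysis uses `scipy.spatial.distance.cosine` rather than this hand-written version.

```python
import math

def cosineDistance(vecA, vecB):
    # 1 minus the cosine of the angle between the two vectors
    dot = sum(a * b for a, b in zip(vecA, vecB))
    normA = math.sqrt(sum(a * a for a in vecA))
    normB = math.sqrt(sum(b * b for b in vecB))
    return 1.0 - dot / (normA * normB)

def sumNormalize(vec):
    # divide each count by the vector total (assumes a non-zero total)
    total = sum(vec)
    return [v / total for v in vec]

# hypothetical unrolled co-occurrence counts for two agencies
agencyX = [4, 0, 2, 2]
agencyY = [1, 3, 0, 1]

rawDistance = cosineDistance(agencyX, agencyY)
normalizedDistance = cosineDistance(sumNormalize(agencyX), sumNormalize(agencyY))

# the two distances agree up to floating point error
print(math.isclose(rawDistance, normalizedDistance))
```

Because only the direction of each vector matters, an agency with ten times as many grants but the same relative pattern of co-occurrences would sit at a distance of (approximately) zero from the original.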

#### Interpreting the plot

Overall, the coloration of the plot indicates the degree of similarity or difference in the patterns of term co-occurrence for open-science related terms.  Implicitly, we might assume that a high degree of similarity would reflect agencies talking about open science topics in the same way, or focusing on the same aspects.  A high degree of difference would indicate using the terms in differing ways, potentially reflecting differing foci, or even different senses of the words being used (e.g. not in a sense related to open science).  Given that we are plotting _distance_, a value of 0 indicates overlap, and thus maximal similarity.  For this same reason, a value of 1 would be the most extreme distance (the count vectors here are non-negative, so the distance cannot exceed 1), and thus reflect maximal difference.

In [None]:
# import the submodule explicitly so scipy.spatial.distance is available
import scipy.spatial.distance

normalizeVecs=True

# quick definition of normalize function
def normalizeVector(inputVec):
    # note, these are counts so they are necessarily positive
    # if it's not empty
    if not np.sum(inputVec)==0:
        normalizedVector=np.divide(inputVec,np.sum(inputVec))
    else: 
        # otherwise
        normalizedVector=inputVec
    return normalizedVector

# create a holder for the cosine distance analysis
cosineDists_agency=np.zeros([len(uniqueAgencies),len(uniqueAgencies)])

#once the co-occurrence stack is complete, perform the cosine analysis
for iIndexX, iAgenciesX in enumerate(uniqueAgencies):
    for iIndexY, iAgenciesY in enumerate(uniqueAgencies):
        # get the stack slice for each agency
        agencyX_slice=countArray[:,:,iIndexX]
        agencyY_slice=countArray[:,:,iIndexY]
        
        # flatten it into a single vector for each
        agencyX_vec=np.ravel(agencyX_slice)
        agencyY_vec=np.ravel(agencyY_slice)
        
        if (not np.sum(agencyX_vec)==0) and (not np.sum(agencyY_vec)==0):
        
            # if we want to normalize, do that
            if normalizeVecs:
                agencyX_vec=normalizeVector(agencyX_vec)
                agencyY_vec=normalizeVector(agencyY_vec)

            # in either case, perform the cosine analysis
            currentDistance=scipy.spatial.distance.cosine(agencyX_vec,agencyY_vec)
            # set it in the output matrix
            cosineDists_agency[iIndexX,iIndexY]=currentDistance
        else:
            cosineDists_agency[iIndexX,iIndexY]=np.nan
        
fig = plt.figure(figsize=(10, 10))
fig.suptitle('Agency differences in co-occurrence patterns for\n open science-related terms')
# plot the result
sns.heatmap(data=cosineDists_agency, cmap='viridis', yticklabels=uniqueAgencies,xticklabels=uniqueAgencies,cbar_kws={'label': 'Cosine Distance of\nOS-related term usage'})
