# Extract NSF Project Publications

This notebook uses the helper module [`crossref`](./crossref.py) to retrieve citation metadata, including **citation counts**, which are used in further analysis.

In [1]:
from crossref import get_metadata, get_doi, get_doi_bib_lookup, get_bib_citation
import time

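BE_NICE = .60  # seconds to sleep between API requests

# can be deleted: the definitions below shadow the helpers imported from crossref above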

def get_doi(doi_like):
    import re
    pattern = r".*(10\.\d+\/.*)"
    m = re.match(pattern, doi_like)
    if m:
        return m[1]
    else:
        return ''

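def get_doi_bib_lookup(citation):
    import requests
    import re

    # pass the query through params so requests URL-encodes the citation
    r = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": citation.strip()},
    )

    if r.status_code == 200:
        items = r.json()['message']['items']
        fields = citation.split('~')
        # nothing to compare: no hits, no title field in the citation, or an untitled hit
        if not items or len(fields) < 2 or not items[0].get('title'):
            return None, None, None

        # assigned first item to hit
        hit = items[0]

        t1 = re.sub(r'[,.:]', ' ', fields[1]).lower().split()
        t2 = re.sub(r'[,.:]', ' ', hit['title'][0]).lower().split()

        # accept the hit only if every word of its title appears in the NSF title
        if not set(t2)-set(t1):
            return hit['DOI'], hit['is-referenced-by-count'], hit
        else:
            return None, None, None
    else:
        return None, None, None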
    
def get_bib_citation(doi):
    import requests
    header = { "Accept": "text/x-bibliography; style=american-meteorological-society" }
    r = requests.get(f"https://doi.org/{doi}", headers=header)
    
    if r.status_code == 200:
        return r.content
    # explicit None when the request fails, so callers can test the result
    return None
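
The free-text lookup defined above (`get_doi_bib_lookup`) accepts a crossref hit only when every word of the hit's title also appears in the NSF title. The following is a minimal offline sketch of that comparison, with no network access; the helper names `title_tokens` and `titles_match` are hypothetical and only restate the set-difference test for illustration.

```python
import re

def title_tokens(text):
    """Lowercase a title and split it into words, treating , . : as spaces."""
    return set(re.sub(r"[,.:]", " ", text).lower().split())

def titles_match(nsf_title, hit_title):
    """True when the hit title has no word that is absent from the NSF title."""
    return not (title_tokens(hit_title) - title_tokens(nsf_title))

# Same words, different punctuation and capitalisation: accepted.
print(titles_match(
    "Data management, sharing, and reuse in experimental geomorphology: challenges.",
    "Data management, sharing, and reuse in experimental geomorphology: Challenges",
))

# The hit introduces a word ("hydrology") that the NSF title lacks: rejected.
print(titles_match("Data management in geomorphology", "Data management in hydrology"))
```

The check is one-directional: a hit whose title uses only a subset of the NSF title's words also passes, so it may be worth auditing accepted matches against the stored `cr_meta`.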

## NSF EXTRACTION

Now that we have the appropriate code to get citations and validate DOIs, we can begin the data extraction from NSF.

This requires the [NSF Awards API](); in the following computation, we extract **ONLY the DOIs** found in the NSF data.

In [2]:
import json

### LOAD NSF AWARDS DATA DUMP FILE

In [3]:
# use a context manager so the file handle is closed after loading
with open("../outputs/nsf/data_full_dump.json") as fi:
    nsf_json = json.load(fi)

In [4]:
len(nsf_json.keys())

215

* 215 TOTAL PROJECTS (UNIQUE AWARD IDS)

### BUILD ID -> PUBLICATION MAPPING

Now that we have verified the Award ID count checks out, we will search the NSF data for publications. The dataset includes two keys, `publicationResearch` and `publicationConference`, each of which contains a list of publications in a form similar to:

```
Hsu, L., B. McElroy, R.L. Martin, W. Kim~Building a Sediment Experimentalist Network (SEN): sharing best practices for experimental methods and data management~The Sedimentary Record~11~2013~9~~doi:10.2110/sedred.2013.4~0~ ~0~ ~03/02/2022 04:30:09.530000000
```

Note that `~` separates the fields and that field 8 contains the DOI (if it exists at all).

* A first pass will extract DOIs that exist and look them up directly. 
* A second pass will perform a **freetext lookup on the title and authors** in subsequent cells. 
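
As a minimal sketch of the record layout described above (not part of the pipeline itself), the snippet below splits the sample record on `~` and pulls the DOI out of field 8 (index 7) with the same pattern used by `get_doi`. The helper name `split_nsf_citation` and the keys of the returned dictionary are assumptions made for illustration.

```python
import re

# Sample record copied from the example above.
SAMPLE = (
    "Hsu, L., B. McElroy, R.L. Martin, W. Kim~Building a Sediment Experimentalist Network (SEN): "
    "sharing best practices for experimental methods and data management~The Sedimentary Record"
    "~11~2013~9~~doi:10.2110/sedred.2013.4~0~ ~0~ ~03/02/2022 04:30:09.530000000"
)

def split_nsf_citation(record):
    """Hypothetical helper: split a '~'-delimited NSF record into a few named parts."""
    fields = record.split("~")
    # Field 8 (index 7) holds the DOI when present; guard against short records.
    raw_doi = fields[7] if len(fields) > 7 else ""
    m = re.match(r".*(10\.\d+\/.*)", raw_doi)  # same pattern as get_doi
    return {
        "authors": fields[0],
        "title": fields[1] if len(fields) > 1 else "",
        "doi": m[1] if m else "",
    }

parsed = split_nsf_citation(SAMPLE)
print(parsed["title"])
print(parsed["doi"])
```

Records with an empty field 8 come back with `doi` set to `''`, which is the case the second pass is meant to handle.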

In [5]:
%%time

'''
NOTE THIS CELL ONLY PARSES THE DOIS ALREADY PRESENT IN THE
NSF DATA PAYLOAD (NO NETWORK LOOKUPS), SO IT RUNS IN
MILLISECONDS.  THE CROSSREF LOOKUPS HAPPEN IN THE JSON FILE
CREATION CELL BELOW.
'''
id_map = {}

for nsfid in nsf_json.keys():
    data = nsf_json[nsfid]

    # research and conference publications share the same '~'-delimited format
    for key in ('publicationResearch', 'publicationConference'):
        for p in data.get(key) or []:
            id_map.setdefault(nsfid, []).append(
                {'citation': p,
                 'doi': get_doi(p.split('~')[7])
                }
            )
            print(".", end="")


.........................................................................................................................................................................................................................................................................................................................................................................................................................CPU times: total: 0 ns
Wall time: 2.99 ms


In [6]:
id_map.keys()

dict_keys(['1324760', '1340233', '1340265', '1340301', '1343760', '1343785', '1343800', '1343802', '1343811', '1343813', '1354693', '1440133', '1440081', '1440084', '1440109', '1440116', '1440139', '1440195', '1440202', '1440312', '1440293', '1440181', '1440213', '1440294', '1440315', '1440333', '1440323', '1440291', '1440332', '1440327', '1440351', '1540966', '1540979', '1541002', '1541015', '1541008', '1540849', '1541036', '1541029', '1540996', '1541039', '1540998', '1541049', '1541043', '1540542', '1541047', '1540938', '1541044', '1541057', '1541007', '1541010', '1661918', '1541390', '1542058', '1632211', '1639588', '1639614', '1639683', '1639698', '1639707', '1639714', '1639734', '1639749', '1639710', '1639716', '1639738', '1639748', '1639759', '1639547', '1639696', '1639764', '1639557', '1639775', '1740595', '1740581', '1740648', '1740693', '1740694', '1740704', '1740627', '1743321', '1927578', '1928288', '1928369', '1928315', '1928305', '1928389', '1928403', '1928406', '2026932',

In [7]:
id_map['2126449']

[{'citation': "Xie, Yiqun and Jia, Xiaowei and Bao, Han and Zhou, Xun and Yu, Jia and Ghosh, Rahul and Ravirathinam, Praveen~Spatial-Net: A Self-Adaptive and Model-Agnostic Deep Learning Framework for Spatially Heterogeneous Datasets~Proceedings of the 29th International Conference on Advances in Geographic Information Systems (SIGSPATIAL'21)~~2021~~~https://doi.org/10.1145/3474717.3483970~10329517~313 to 323~10329517~OSTI~11/08/2022 21:03:23.293000000",
  'doi': '10.1145/3474717.3483970'}]

## JSON FILE CREATION
Once we have the data, we will enrich it with the citation counts and metadata from crossref, retrieved through their [content negotiation](https://citation.crosscite.org/docs.html) service.

Ultimately, the output file ([outputs/nsf/nsf_mapping.json](outputs/nsf/nsf_mapping.json)) will be **the definitive source** for an individual run.

The sample below shows the format of the mapping file: each key is an NSF ID, and each value is a list of found publications. Every publication includes the original NSF citation (the "source" citation), the DOI, the crossref citation count, the crossref metadata (the "source" metadata) and the formatted AMS citation. This allows some auditing against the source, as well as rapid retrieval of citation counts and styled / formatted citations.

```json
{
   "1324760": [
        {
            "citation": "Hsu, L., R. Martin, B. McElroy, W. Kim~Data management, sharing, and reuse in experimental geomorphology: challenges, strategies, and scientific opportunities.~Geomorphology~244~2015~180~~~0~ ~0~ ~29/01/2019 18:06:33.923000000",
            "doi": "10.1016/j.geomorph.2015.03.039",
            "cites": 19,
            "cr_meta": { ... 
            },
            "ams_bib": "Hsu, L., R. L. Martin, B. McElroy, K. Litwin-Miller, and W. Kim, 2015: Data management, sharing, and reuse in experimental geomorphology: Challenges, strategies, and scientific opportunities. Geomorphology, 244, 180\u2013189, https://doi.org/10.1016/j.geomorph.2015.03.039.\n"
        }, 
        ...
}
```
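
To illustrate how a file of this shape might be consumed, the sketch below totals citation counts per award from an in-memory dictionary shaped like the sample above. The function name `total_cites_by_award` is hypothetical, and counting missing or negative `cites` values as zero is an assumption made for this example.

```python
def total_cites_by_award(mapping):
    """Hypothetical helper: sum citation counts for each NSF award ID."""
    totals = {}
    for award_id, pubs in mapping.items():
        # Unresolved entries may lack a 'cites' key, and -1 marks a missing count;
        # both are counted as zero here (an assumption of this sketch).
        totals[award_id] = sum(max(p.get("cites", 0), 0) for p in pubs)
    return totals

# Tiny stand-in for the mapping file; the second entry is illustrative only.
example = {
    "1324760": [
        {"citation": "...", "doi": "10.1016/j.geomorph.2015.03.039", "cites": 19},
        {"citation": "...", "doi": ""},  # publication with no resolved DOI
    ],
}
print(total_cites_by_award(example))
```

With the real file, the same function could be applied to the dictionary returned by `json.load` on the mapping file.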

In [8]:
%%time
'''
NOTE: THIS IS LIVE DATA AND REQUIRES RELOADING IF THE LATEST DATA IS NECESSARY.  
EXPECT THIS CELL TO RUN FOR SOME TIME (~25-40min).
'''

# get the cites AND cr_meta for all the data
for k in id_map.keys():
    for i in id_map[k]:
        if i['doi'].strip():
            doi = i['doi'].strip()
            cr_meta = get_metadata(doi)

            if cr_meta:
                try:
                    i['cites'] = cr_meta['is-referenced-by-count']
                except (KeyError, TypeError):
                    # -1 flags metadata without a citation count
                    i['cites'] = -1

                i['cr_meta'] = cr_meta
                # get_bib_citation returns None when the request fails
                bib = get_bib_citation(doi)
                i['ams_bib'] = bib.decode('utf-8') if bib else None

                print(".", end="")
                time.sleep(BE_NICE)
        else:
            doi, cites, cr_meta = get_doi_bib_lookup(i['citation'])
            if doi:
                i['doi'], i['cites'] = doi, cites
                i['cr_meta'] = cr_meta
                bib = get_bib_citation(doi)
                i['ams_bib'] = bib.decode('utf-8') if bib else None

                print(".", end="")
            else:
                print("X", end="")
            # pause after the free-text lookup as well
            time.sleep(BE_NICE)
print()

......X.....XXX..X........X.XX.XX.......XX....XXX....XX
[warn]: crossref API sent code = 400
	- input: Meng, H., Kommineni, R., Pham, Q., Gardner, R., Malik, T., & Thain, D.~An invariant framework for conducting reproducible computational science.~Journal of Computational Science~9~2015~137~~~0~ ~0~ ~03/02/2022 04:30:09.53000000
X...X.X..
[warn]: crossref API sent code = 400
	- input: Regalia, B., Janowicz, K., & Gao, S.~VOLT: A Provenance-Producing, Transparent SPARQL Proxy for the On-Demand Computation of Linked Data and its Application to Spatiotemporally Dependent Data.~ESWC: Extended Semantic Web Conference 201~~2016~~~~0~ ~0~ ~01/09/2016 15:06:21.460000000
X....
[warn]: crossref API sent code = 400
	- input: Vardeman II, C. F., Krisnadhi, A. A., Cheatham, M., Janowicz, K., Ferguson, H., Hitzler, P., & Buccellato, A. P.~An Ontology Design Pattern and Its Use Case for Modeling Material Transformation~Semantic Web~~2016~~~~0~ ~0~ ~13/09/2017 04:51:26.273000000
X
[warn]: crossref API

## STORE FINAL DATA

The final output file is:

* [`outputs/nsf/nsf_mapping.json`](/outputs/nsf/nsf_mapping.json).

In [9]:
with open("../outputs/nsf/nsf_mapping.json", "w") as fo:
    json.dump(id_map, fo, indent=4)

---

### TODO: DELETE ALL BELOW

### GET CITATIONS FOR ALL FOUND DOI

Now that we have crucial data, we will store an [AMS style citation](https://www.ametsoc.org/index.cfm/ams/publications/author-information/formatting-and-manuscript-components/references/) along with the metadata from crossref.

**For convenience, we dump all NSF IDs, DOIs and citation counts to:**

* FILE [`outputs/citations.tsv`](/outputs/citations.tsv) (DUPLICATES INCLUDED) 
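
A minimal sketch of such a dump is shown below, assuming a mapping shaped like `nsf_mapping.json`. The helper name `write_citations_tsv` is hypothetical, and recording `-1` when the stored AMS citation is missing or looks like an HTML page (starts with `<!DOC`) is an assumption made for this example.

```python
import csv
import io

def write_citations_tsv(mapping, fo):
    """Hypothetical helper: write NSF ID, DOI and citation count as tab-separated rows."""
    writer = csv.writer(fo, delimiter="\t", lineterminator="\n")
    rows = 0
    for award_id, pubs in mapping.items():
        for pub in pubs:
            if not pub.get("doi"):
                continue  # skip publications with no resolved DOI
            bib = pub.get("ams_bib") or ""
            # A missing citation or an HTML page suggests the lookup failed; record -1.
            ok = bool(bib) and not bib.startswith("<!DOC")
            writer.writerow([award_id, pub["doi"], pub.get("cites", -1) if ok else -1])
            rows += 1
    return rows

# In-memory demonstration; pass an open file object to write to disk instead.
buf = io.StringIO()
demo = {"1324760": [{"doi": "10.1016/j.geomorph.2015.03.039", "cites": 19, "ams_bib": "Hsu, L., ..."}]}
write_citations_tsv(demo, buf)
print(buf.getvalue(), end="")
```

Because rows are written per award, a DOI shared by several awards appears once for each of them, which matches the "duplicates included" note above.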

In [10]:
# MOVED TO 00g_full_data_mapping.ipynb

# '''
# THIS LOADS THE DATA STORED FROM THE FILE IN THE PREVIOUSLY 
# EXECUTED CELL
# '''

# from datetime import datetime
# import json
# # datetime.now().strftime('%Y%d%m_%H%M%S')}
# f = json.load(open("../outputs/nsf/nsf_mapping.json"))
# with open(f"../outputs/citations.tsv", "w", encoding='utf-8') as fo:
#     for k in f.keys():
#         for c in f[k]:
#             if c['doi']:
#                 if c['ams_bib'] and c['ams_bib'][:5] != '<!DOC':
#                     fo.write(f"{k}\t{c['doi']}\t{c['cites']}\n") 
#                     print(".", end="")
#                 else:
#                     fo.write(f"{k}\t{c['doi']}\t-1\n")
#                     print(".", end="")