# Full Data Mapping

Because there are two input sources (the data directly from the NSF and the data from the EC community publication survey), it is important to reconcile the differences between them.

This notebook takes as input:

* the original DOIs found and their project ID mapping: [nsf_doi_project_summary.tsv](../outputs/nsf/nsf_doi_project_summary.tsv)
* the original survey spreadsheet DOIs: [ec_survey_project_doi_mapping.tsv](../outputs/ec_survey_project_doi_mapping.tsv)
* the validated DOI list from all found DOIs: [doi_project_program_detail.csv](../outputs/doi_project_program_detail.csv)
* the original NSF mapping (the authoritative source of all metadata): [nsf_mapping.json](../outputs/nsf/nsf_mapping.json)


In [1]:
import pandas as pd
import datacite
import crossref
import json

## Finding Survey DOIs not in NSF list

In [2]:
df_nsf = pd.read_csv("../outputs/nsf/nsf_doi_project_summary.tsv", sep='\t')
df_survey = pd.read_csv("../outputs/ec_survey_project_doi_mapping.tsv", sep='\t')

In [3]:
df_all_found_dois = \
    pd.read_csv("../outputs/nsf/nsf_doi_project_summary.tsv", sep='\t')\
    .loc[:,['doi', 'cites']]\
    .drop_duplicates()
df_all_found_dois.cites = df_all_found_dois.cites.replace('--', -1).astype(int)

In [4]:
# DOIs in the original NSF extraction that carry the -1 flag
orphan_doi_list  =  df_all_found_dois.query("cites <0").doi.drop_duplicates().dropna().to_list()

In [5]:
# dois only in the survey that weren't in the original nsf extraction
survey_only_doi_list =  set(df_survey.doi).difference(set(df_all_found_dois.doi))

### THIS IS WHAT I WAS MISSING!!!
set(dc_original_list).difference(set(orphan_doi_list).union(survey_only_doi_list))

In [6]:
# the union of the orphan DOIs and the survey-only DOIs is the set we're looking for
missing_dois = set(orphan_doi_list).union(survey_only_doi_list)

# #for testing only
# dc_original_list = list(json.load(open("../inputs/datacite_metadata_20220525174139.json")).keys())
# set(dc_original_list).difference(missing_dois)

len(missing_dois) # number of missing DOIs

35
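
The set arithmetic in the cells above can be tried on toy data. This is a minimal sketch: the DOIs and citation counts below are invented for illustration and do not come from the project files.

```python
# Toy stand-ins for df_all_found_dois and df_survey (all values are made up).
found_cites = {"10.1000/a": 12, "10.1000/b": -1, "10.1000/c": 3}  # doi -> cites, -1 = flagged
survey_dois = {"10.1000/c", "10.1000/d"}

# DOIs from the NSF extraction that carry the -1 flag
orphan_dois = {doi for doi, cites in found_cites.items() if cites < 0}

# DOIs that appear in the survey but not in the NSF extraction
survey_only_dois = survey_dois - set(found_cites)

# Everything that still has to be restored to the project mapping
missing = orphan_dois | survey_only_dois
print(sorted(missing))
```

A DOI ends up in `missing` if it satisfies either condition, which is why the two sets are combined with a union rather than a difference.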

### (MOVE) FINDING DATACITE DOIS

* find the DOIs that are in the survey but NOT in the NSF data

for doi in survey_only_doi_list:
    # NSF project IDs that the survey associates with this DOI
    nsfids = df_survey[df_survey.doi == doi].nsfid.values
    # try DataCite first, then fall back to Crossref
    dc_meta = datacite.get_metadata(doi)
    if dc_meta:
        print(f"{doi}: has datacite meta (NSF: {nsfids})")
    else:
        cr_meta = crossref.get_metadata(doi)
        if cr_meta:
            print(f"{doi}: has crossref meta (NSF: {nsfids})")
        else:
            print(f"{doi}: NO meta")
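
The lookup order in the loop above (DataCite first, then Crossref) can be expressed as a small "first source that answers" helper. The sketch below uses hypothetical stand-in lookups built from dictionaries. It makes no claims about the real `datacite` or `crossref` modules beyond what the loop shows, namely that `get_metadata` returns something truthy when metadata exists.

```python
def first_source_with_metadata(doi, lookups):
    """Return the name of the first source whose lookup returns metadata, else None."""
    for name, lookup in lookups:
        if lookup(doi):
            return name
    return None

# Hypothetical stand-ins: each returns a dict when the DOI is known, else None.
fake_datacite = {"10.1000/data": {"type": "dataset"}}.get
fake_crossref = {"10.1000/paper": {"type": "journal-article"}}.get

# Order matters: earlier sources take precedence.
lookups = [("datacite", fake_datacite), ("crossref", fake_crossref)]

for doi in ["10.1000/data", "10.1000/paper", "10.1000/unknown"]:
    print(doi, "->", first_source_with_metadata(doi, lookups) or "NO meta")
```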

## Restore missing DOIs to the original NSF map data

In [7]:
df_project_mapping = json.load(open("../outputs/nsf/nsf_mapping.json"))

In [8]:
%%time
for doi in missing_dois:
    nsfid = str(df_survey[df_survey.doi==doi].nsfid.values[0])
    if not nsfid:
        print(doi, "does not have an nsfid")
    try:
        current_dois = [d['doi'].lower() for d in df_project_mapping[nsfid]] 
        if doi in current_dois:
            print(f"[skipping] {doi} is a duplicate")
        else:
            cr_meta = crossref.get_metadata(doi)
            if cr_meta: 
                item = {}

                item['citation'] = "--"
                item['doi'] = doi
                item['cr_meta'] = cr_meta
                item['cites'] = cr_meta['is-referenced-by-count']
                item['ams_bib'] = crossref.get_bib_citation(doi).decode('utf-8')

                df_project_mapping[nsfid].append(item)
                print(f"+{nsfid} ==> {doi}")
            else:
                print("[] not a cr DOI")
    except Exception:
        # reached when the lookup above fails, e.g. the project has no entry in the mapping yet
        cr_meta = crossref.get_metadata(doi)
        if cr_meta:
            item = {}

            item['citation'] = "--"
            item['doi'] = doi
            item['cr_meta'] = cr_meta
            item['cites'] = cr_meta.get('is-referenced-by-count', -1)
            item['ams_bib'] = crossref.get_bib_citation(doi).decode('utf-8')

            # create the project's list on first use, then append
            df_project_mapping.setdefault(nsfid, []).append(item)

            print(f"*{nsfid} ==> {doi}")
        else:
            print("[] not a cr DOI")

*1639750 ==> 10.1142/s1793351x22400025
[skipping] 10.6084/m9.figshare.14848713.v1 is a duplicate
[skipping] 10.5555/3319379.3319381 is a duplicate
*1541390 ==> 10.18739/a24m9198b
*1929773 ==> 10.3389/fclim.2021.763420
+1743321 ==> 10.3389/fspas.2022.816523
*1440066 ==> 10.13140/rg.2.1.4908.4561
+1343785 ==> 10.1111/1752-1688.12437
*1639675 ==> 10.1109/igarss.2017.8126974
[skipping] 10.1594/ieda/100691 is a duplicate
[skipping] 10.2110/sedred.2013.4 is a duplicate
*1639764 ==> 10.1594/pangaea.892680
[skipping] 10.5281/zenodo.5496306 is a duplicate
[] not a cr DOI
+1740595 ==> 10.1002/rse2.240
*1440351 ==> 10.6084/m9.figshare.4272164.v1
[skipping] 10.1594/ieda/100709 is a duplicate
[skipping] 10.1002/hyp10899 is a duplicate
+1541057 ==> 10.1051/swsc/2020011
+1343802 ==> 10.1109/bigdata.2015.7363976
+1324760 ==> 10.1029/2021ef002088
*1928393 ==> 10.5281/zenodo.4558266
*1639694 ==> 10.5065/p2jj-9878
+1324760 ==> 10.1029/2017jf004576
*1440221 ==> 10.1007/978-3-319-33245-1
+1440066 ==> 10.17
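
The restore loop builds the same record in two places. A possible refactoring, shown here only as a sketch, moves the duplicate check and record construction into one helper. `add_item` is a hypothetical name, and the Crossref metadata and citation text are passed in as plain arguments so that the example runs without network access.

```python
def add_item(mapping, nsfid, doi, cr_meta, bib):
    """Append a DOI record to a project's list unless the DOI is already present.

    Returns True when a record was added, False when the DOI was a duplicate.
    """
    items = mapping.setdefault(nsfid, [])  # create the project's list on first use
    if any((d.get("doi") or "").lower() == doi.lower() for d in items):
        return False
    items.append({
        "citation": "--",
        "doi": doi,
        "cr_meta": cr_meta,
        "cites": cr_meta.get("is-referenced-by-count", -1),
        "ams_bib": bib,
    })
    return True

# Toy mapping; the IDs and DOIs are made up.
mapping = {"1": [{"doi": "10.1000/A"}]}
print(add_item(mapping, "1", "10.1000/a", {}, ""))  # same DOI, different case
print(add_item(mapping, "2", "10.1000/b", {"is-referenced-by-count": 4}, "bib"))
```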

### Check for malformed or unresolving DOIs

In [9]:
resolved_doi_list = []

for k in df_project_mapping.keys():
    for p in df_project_mapping[k]:
        if p['doi'] and p.get('cr_meta'):
            # t.append(p['cr_meta']['type'])
            resolved_doi_list.append(p['doi'].lower())
        else:
            if p['doi']:
                print(k, "==>", p['doi'])
                
resolved_doi_list = set(resolved_doi_list)
print(len(resolved_doi_list), "total DOIs checked")

1343802 ==> 10.1002/hyp10899
1639759 ==> 10.5555/3319379.3319381
255 total DOIs checked


Two DOIs are malformed or do not resolve:

* `10.1002/hyp10899`
* `10.5555/3319379.3319381`

The first is a typo; the correct DOI is:

* `10.1002/hyp.10899`

The second is not registered with Crossref, although the ACM Digital Library serves a page for it:

* https://dl.acm.org/doi/10.5555/3319379.3319381

It **looks** like a DOI, but **does NOT resolve** at:

* https://doi.org/10.5555/3319379.3319381

We will fix the first and skip the second.
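
A format check alone would not have caught either problem, which is why the cell above relies on whether Crossref metadata came back. The sketch below illustrates this with a loose syntax pattern. The regular expression is an assumption (a common approximation, not an official DOI grammar), and `looks_like_doi` is a hypothetical helper.

```python
import re

# Loose syntax check: "10.", a 4-9 digit registrant code, "/", then a non-empty suffix.
DOI_SYNTAX = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(value):
    """Return True when the string has the general shape of a DOI."""
    return bool(DOI_SYNTAX.match(value.strip()))

candidates = [
    "10.1002/hyp10899",         # the typo
    "10.1002/hyp.10899",        # the corrected DOI
    "10.5555/3319379.3319381",  # well-formed, but not resolvable at doi.org
    "hyp.10899",                # no prefix at all
]
for candidate in candidates:
    print(candidate, looks_like_doi(candidate))
```

Both problem DOIs pass the pattern, so a syntax check can only screen out obviously broken strings. Confirming that a DOI is registered still requires a lookup against a registry.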

In [10]:
old_doi = '10.1002/hyp10899'
new_doi = '10.1002/hyp.10899'

cr_meta = crossref.get_metadata(new_doi)

item = {}
item['citation'] = "--"
item['doi'] = new_doi
item['cr_meta'] = cr_meta 
item['cites'] = cr_meta.get('is-referenced-by-count', -1)

cr_bib = crossref.get_bib_citation(new_doi).decode('utf-8')
if cr_bib:
    item['ams_bib'] = cr_bib # = crossref.get_bib_citation(doi)

df_project_mapping['1343802'].append(item)

df_project_mapping['1343802'] = \
    [d for d in df_project_mapping['1343802'] if d['doi']!=old_doi]

### Add Datacite metadata to the project mapping

This checks every DOI in the project mapping for DataCite metadata and, where it exists, stores it on the entry under the `dc_meta` key.

In [11]:
%%time
for k in df_project_mapping.keys():
    for i in df_project_mapping[k]:
        if i['doi']:
            dc_meta = datacite.get_metadata(i['doi'].lower().strip())
            if dc_meta:
                i['dc_meta'] = dc_meta
                print(k, "==>", i['doi'], "datacite meta added")

1440351 ==> 10.1594/IEDA/100709 datacite meta added
1440351 ==> 10.1594/IEDA/100691 datacite meta added
1440351 ==> 10.6084/m9.figshare.4272164.v1 datacite meta added
1541390 ==> 10.18739/a24m9198b datacite meta added
1639683 ==> 10.6084/m9.figshare.14848713.v1 datacite meta added
1639764 ==> 10.1594/pangaea.892680 datacite meta added
1639557 ==> 10.1594/PANGAEA.892680 datacite meta added
1928406 ==> 10.5281/zenodo.5496306 datacite meta added
1440066 ==> 10.13140/rg.2.1.4908.4561 datacite meta added
1928393 ==> 10.5281/zenodo.4558266 datacite meta added
1928393 ==> 10.5281/zenodo.6369184 datacite meta added
1639694 ==> 10.5065/p2jj-9878 datacite meta added
CPU times: total: 9.03 s
Wall time: 3min 31s
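
The output above lists `10.1594/pangaea.892680` and `10.1594/PANGAEA.892680` under two different projects. DOI names are case-insensitive, so these are the same identifier, which is why the cells here lower-case DOIs before comparing them. A minimal normalization helper might look like the sketch below; `normalize_doi` and the list of prefixes it strips are assumptions for illustration, not part of the notebook.

```python
def normalize_doi(raw):
    """Lower-case a DOI, trim whitespace, and strip a leading resolver or 'doi:' prefix."""
    doi = raw.strip().lower()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi

variants = [
    "10.1594/PANGAEA.892680",
    " 10.1594/pangaea.892680 ",
    "https://doi.org/10.1594/PANGAEA.892680",
]
print({normalize_doi(v) for v in variants})
```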


In [12]:
doi_list = []
for k in df_project_mapping.keys():
    for i in df_project_mapping[k]:
        if i['doi']  and i.get('cr_meta'):
            doi_list.append(i['doi'].lower())

In [13]:
len(set(doi_list))

256

In [14]:
with open("../outputs/full_data_map.json", "w") as fo:
    json.dump(df_project_mapping, fo)

In [15]:
dc_doi = []
nsfid_type_map = {}

for k in df_project_mapping.keys():
    for i in df_project_mapping[k]:
        dc_meta = i.get('dc_meta') 
        if dc_meta:
            t_general = dc_meta['attributes']['types'].get('resourceTypeGeneral', '--').lower()
            t_schemaorg = dc_meta['attributes']['types'].get('schemaOrg', '--').lower()
            
            doi = i['doi'].lower()
            if doi not in dc_doi:
                dc_doi.append(doi)
                
                if nsfid_type_map.get(k):
                    if not nsfid_type_map[k].get(doi):
                        nsfid_type_map[k][doi] =  {}                    
                        nsfid_type_map[k][doi]['cr_meta'] = i['cr_meta']
                        nsfid_type_map[k][doi]['dc_meta'] = dc_meta
                        
                else:
                    nsfid_type_map[k] = {doi: {}}                    
                    nsfid_type_map[k][doi]['cr_meta'] = i['cr_meta']
                    nsfid_type_map[k][doi]['dc_meta'] = dc_meta

with open("../outputs/datacite/datacite_data_map.json", "w") as fo:
    json.dump(nsfid_type_map, fo)
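
The nested-dictionary bookkeeping in the cell above can be written more compactly with `dict.setdefault` and a set of already-seen DOIs. This is a sketch of the same idea on toy data, not a drop-in replacement. `build_datacite_map` is a hypothetical name, and it uses `.get('cr_meta')` so that entries without Crossref metadata do not raise.

```python
def build_datacite_map(project_mapping):
    """Map nsfid -> {doi: {'cr_meta', 'dc_meta'}} for DOIs that have DataCite metadata.

    Each DOI is recorded once, under the first project that lists it.
    """
    seen = set()
    out = {}
    for nsfid, items in project_mapping.items():
        for item in items:
            dc_meta = item.get("dc_meta")
            if not dc_meta or not item.get("doi"):
                continue
            doi = item["doi"].lower()
            if doi in seen:
                continue
            seen.add(doi)
            out.setdefault(nsfid, {})[doi] = {
                "cr_meta": item.get("cr_meta"),
                "dc_meta": dc_meta,
            }
    return out

# Toy mapping; the metadata dictionaries are placeholders.
toy = {
    "1639764": [{"doi": "10.1594/pangaea.892680", "dc_meta": {"id": "x"}}],
    "1639557": [
        {"doi": "10.1594/PANGAEA.892680", "dc_meta": {"id": "x"}},  # same DOI, other case
        {"doi": "10.1000/no-datacite"},                             # no dc_meta, skipped
    ],
}
result = build_datacite_map(toy)
print(list(result))
```

Using a set for `seen` keeps the membership test constant-time, whereas the list used in the cell is scanned on every check.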

### STORE `citations.tsv` FOR FUTURE USE

In [16]:
'''
THIS LOADS THE DATA STORED FROM THE FILE IN THE PREVIOUSLY 
EXECUTED CELL
'''

import json

f = json.load(open("../outputs/full_data_map.json"))
with open("../outputs/citations.tsv", "w", encoding='utf-8') as fo:
    for k in f.keys():
        for c in f[k]:
            if c['doi']:
                if c.get('ams_bib') and c['ams_bib'][:5] != '<!DOC':
                    fo.write(f"{k}\t{c['doi']}\t{c['cites']}\n") 
                    print(".", end="")
                else:
                    fo.write(f"{k}\t{c['doi']}\t-1\n")
                    print(".", end="")

.........................................................................................................................................................................................................................................................................................................................................................................
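
The row logic of the cell above can be separated from file handling and written with the standard `csv` module. The sketch below assumes that an `ams_bib` value starting with `<!DOC` is an HTML page returned in place of a citation, which is what the check in the cell suggests. `citation_rows` is a hypothetical helper, and it writes to an in-memory buffer rather than `citations.tsv`.

```python
import csv
import io

def citation_rows(mapping):
    """Yield (nsfid, doi, cites) rows; cites is -1 when no usable citation text exists."""
    for nsfid, items in mapping.items():
        for item in items:
            if not item.get("doi"):
                continue
            bib = item.get("ams_bib") or ""
            usable = bool(bib) and not bib.startswith("<!DOC")
            yield nsfid, item["doi"], (item.get("cites", -1) if usable else -1)

# Toy mapping; all values are made up.
toy = {
    "1": [
        {"doi": "10.1000/a", "cites": 5, "ams_bib": "A. Author, Title."},
        {"doi": "10.1000/b", "cites": 7, "ams_bib": "<!DOCTYPE html>"},
        {"doi": None},
    ]
}

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerows(citation_rows(toy))
print(buf.getvalue(), end="")
```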

### STORE `full_nsf_doi_project_summary.tsv` FOR FUTURE USE

In [17]:
f = json.load(open("../outputs/full_data_map.json"))
with open("../outputs/full_nsf_doi_project_summary.tsv", "w", encoding='utf-8') as fo:
    fo.write("{}\t{}\t{}\t{}\t{}\t{}\n"
             .format("nsfid", "doi", "title", "ams_bib", "cites", "year"))
    for k in f.keys():
        print(".", end="")
        for i in f[k]:
            if i.get('cr_meta'):
                if not i.get('cites'):
                    cites =  i['cr_meta'].get('is-referenced-by-count', -1)
                else:
                    cites = i['cites']
                    
                if type(i['cr_meta']['title']) is list:
                    try:
                        title = i['cr_meta']['title'][0]
                    except:
                        title = i['cr_meta']['title']
                else:
                    title = i['cr_meta']['title'] 

            try:
                fo.write("{}\t{}\t{}\t{}\t{}\t{}\n".format(
                  k, i['doi'].lower(), 
                  title.replace("\n", ""), 
                  i['ams_bib'].replace("\n", ""), cites, i['cr_meta']['issued']['date-parts'][0][0])
                 )
            except Exception as e:
                # the title is empty or there is no legit DOI or cr_meta
                pass

........................................................................................................
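
The title and year handling in the cell above depends on the shape of a Crossref work record, where `title` is a list of strings and `issued['date-parts']` is a list of lists. A defensive extraction could be sketched as follows; `title_and_year` is a hypothetical helper, and unlike the cell it replaces newlines with a space and returns `None` for a missing year instead of skipping the row.

```python
def title_and_year(cr_meta):
    """Return (title, year) from a Crossref-style record, tolerating missing fields."""
    title = cr_meta.get("title")
    if isinstance(title, list):
        title = title[0] if title else ""
    title = (title or "").replace("\n", " ")

    try:
        year = cr_meta["issued"]["date-parts"][0][0]
    except (KeyError, IndexError, TypeError):
        year = None
    return title, year

# Toy records in the Crossref shape; the values are made up.
print(title_and_year({"title": ["A two-line\ntitle"], "issued": {"date-parts": [[2021, 5]]}}))
print(title_and_year({"title": []}))
```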