### Generating `publications.json` partitions

This is a template notebook for generating metadata on publications, most importantly the linkage between each publication and its datasets (datasets are enumerated in `datasets.json`).

The process goes as follows:
1. Import the CSV with publication-dataset linkages. At a minimum, your CSV should have the following fields (spelled exactly as below):
    * `dataset` to hold the dataset IDs, and
    * `title` for the publication title.

   Update the CSV to use these field names to ensure this code will run. We then read in the CSV, dedupe it, and format the titles.
2. Match to `datasets.json` and alert if a given dataset doesn't exist yet.
3. Generate a list of dicts with publication metadata.
4. Write to a `publications.json` file.
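
Because the rest of the notebook depends on the exact field names `dataset` and `title`, it can help to check for them right after loading. The following is a minimal sketch, not part of the original notebook: `REQUIRED_FIELDS`, `missing_required_fields`, and the in-memory sample CSV are hypothetical names and data used only for illustration.

```python
import io

import pandas as pd

# Field names the notebook relies on (from the list above)
REQUIRED_FIELDS = {"dataset", "title"}


def missing_required_fields(df):
    """Return the required column names that are absent from df, sorted."""
    return sorted(REQUIRED_FIELDS - set(df.columns))


# Hypothetical in-memory CSV standing in for a real linkages file.
# Note the capitalised "Title": column matching is case-sensitive.
sample_csv = io.StringIO("dataset,Title\ndataset-026,Some title\n")
sample_df = pd.read_csv(sample_csv)

missing = missing_required_fields(sample_df)
print(missing)  # ['title']
```

Running a check like this before the cleaning step gives a clearer message than the `AttributeError` or `KeyError` that a misnamed column would otherwise produce later on.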

#### Import CSV containing publication-dataset linkages

Set `linkages_path` to the location of the CSV containing dataset-publication linkages, then read in the CSV.

In [24]:
import pandas as pd
import os
import datetime

In [14]:
file_name = 'snap_linkages_cleaned.csv'
rcm_subfolder = '20190717_usda_snap'

linkages_path =  os.path.join('/Users/andrewnorris/RichContextMetadata/metadata',rcm_subfolder,file_name)
# linkages_path =  os.path.join(os.getcwd(),'SNAP_DATA_DIMENSIONS_SEARCH_DEMO.csv')
linkages_csv = pd.read_csv(linkages_path)

Format and clean the linkage data: define `scrub_unicode`, which is applied to the `title` field below.

In [15]:
import unicodedata

In [16]:
def scrub_unicode(text):
    """
    try to handle the unicode edge cases encountered in source text,
    as best as possible
    """
    x = " ".join(map(lambda s: s.strip(), text.split("\n"))).strip()

    x = x.replace('“', '"').replace('”', '"')
    x = x.replace("‘", "'").replace("’", "'").replace("`", "'")
    x = x.replace("`` ", '"').replace("''", '"')
    x = x.replace('…', '...').replace("\\u2026", "...")
    x = x.replace("\\u00ae", "").replace("\\u2122", "")
    x = x.replace("\\u00a0", " ").replace("\\u2022", "*").replace("\\u00b7", "*")
    x = x.replace("\\u2018", "'").replace("\\u2019", "'").replace("\\u201a", "'")
    x = x.replace("\\u201c", '"').replace("\\u201d", '"')

    x = x.replace("\\u20ac", "€")
    x = x.replace("\\u2212", " - ") # minus sign

    x = x.replace("\\u00e9", "é")
    x = x.replace("\\u017c", "ż").replace("\\u015b", "ś").replace("\\u0142", "ł")    
    x = x.replace("\\u0105", "ą").replace("\\u0119", "ę").replace("\\u017a", "ź").replace("\\u00f3", "ó")

    x = x.replace("\\u2014", " - ").replace('–', '-').replace('—', ' - ')
    x = x.replace("\\u2013", " - ").replace("\\u00ad", " - ")

    x = str(unicodedata.normalize("NFKD", x).encode("ascii", "ignore").decode("utf-8"))

    # some content returns text in bytes rather than as a str ?
    try:
        assert type(x).__name__ == "str"
    except AssertionError:
        print("not a string?", type(x), x)

    return x
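
The explicit `replace` calls in `scrub_unicode` matter because the final NFKD-plus-ASCII step only decomposes accented letters; characters with no ASCII decomposition, such as curly quotes, are silently dropped. The sketch below isolates that last step to show the difference. `ascii_fold` is a hypothetical helper name, and the sample string is made up for illustration.

```python
import unicodedata


def ascii_fold(text):
    """Decompose accented characters (NFKD) and drop anything non-ASCII."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


# Curly double quotes around SNAP, followed by "café"
raw = "\u201cSNAP\u201d caf\u00e9"

# Folding alone drops the curly quotes entirely
folded_only = ascii_fold(raw)

# Mapping the quotes to ASCII first preserves them
mapped_first = ascii_fold(raw.replace("\u201c", '"').replace("\u201d", '"'))

print(repr(folded_only))   # 'SNAP cafe'
print(repr(mapped_first))  # '"SNAP" cafe'
```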

Drop rows with a null `dataset` or `title`, scrub problematic characters from the titles, and dedupe.

In [17]:
linkages_csv.head()

| Unnamed: 0 | dataset | doi | journal | title | url |
|---|---|---|---|---|---|
| 0 | dataset-026 | 10.1016/j.jneb.2017.06.002 | Journal of Nutrition Education and Behavior | SNAP-Based Incentive Programs at Farmers' Mark... | www.doi.org/10.1016/j.jneb.2017.06.002 |
| 1 | dataset-026 | 10.1016/j.ypmed.2018.03.015 | Preventive Medicine | Where do U.S. households purchase healthy food... | www.doi.org/10.1016/j.ypmed.2018.03.015 |
| 2 | dataset-026 | 10.1016/j.amepre.2017.10.005 | American Journal of Preventive Medicine | Doubling Up on Produce at Detroit Farmers Mark... | www.doi.org/10.1016/j.amepre.2017.10.005 |
| 3 | dataset-026 | 10.1377/hlthaff.2018.05265 | Health Affairs | Loss Of SNAP Is Associated With Food Insecurit... | www.doi.org/10.1377/hlthaff.2018.05265 |
| 4 | dataset-026 | 10.1016/j.jneb.2014.12.008 | Journal of Nutrition Education and Behavior | Farmers' Markets and the Local Food Environmen... | www.doi.org/10.1016/j.jneb.2014.12.008 |


In [18]:
# Drop rows that are missing a dataset ID or a title
linkages_csv = linkages_csv.dropna(subset=['dataset', 'title']).copy()
# Scrub titles before deduping so rows that differ only in quote or dash style collapse together
linkages_csv['title'] = linkages_csv['title'].apply(scrub_unicode)
linkages_csv = linkages_csv.drop_duplicates()
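
One thing to watch for when deduping: `drop_duplicates()` compares every column, so a saved index column such as the `Unnamed: 0` seen in the `head()` output makes every row unique. The sketch below uses a made-up two-row dataframe to show the effect; restricting the comparison with `subset` is one possible workaround, not something the original notebook does.

```python
import pandas as pd

# Hypothetical rows: the same linkage twice, plus a saved index column
df = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "dataset": ["dataset-026", "dataset-026"],
    "title": ["Same title", "Same title"],
})

# Compares every column, so the differing index values keep both rows
whole_row = df.drop_duplicates()

# Compares only the linkage itself, so the repeat is removed
by_linkage = df.drop_duplicates(subset=["dataset", "title"])

print(len(whole_row), len(by_linkage))  # 2 1
```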

In [19]:
pub_metadata_fields = ['title']
original_metadata_cols = list(set(linkages_csv.columns.values.tolist()) - set(pub_metadata_fields)-set(['dataset']))

#### Generate list of dicts of metadata

Read in `datasets.json`. Update `datasets_path` to your local path.

In [20]:
import json

In [21]:
datasets_path = '/Users/andrewnorris/RCDatasets/datasets.json'

with open(datasets_path) as json_file:
    datasets = json.load(json_file)

Create a list of dictionaries of publication metadata. `create_pub_dict` iterates through the `linkages_csv` dataframe and splits the `dataset` field (for when multiple datasets are listed); it prints a warning if a dataset doesn't exist and needs to be added to `datasets.json`.

In [22]:
def create_pub_dict(linkages_dataframe, datasets):
    """Build one metadata dict per row of the linkages dataframe."""
    pub_dict_list = []
    for i, r in linkages_dataframe.iterrows():
        r['title'] = scrub_unicode(r['title'])
        # A row may list several comma-separated dataset IDs; drop empty entries
        ds_id_list = [f for f in [d.strip() for d in r['dataset'].split(",")] if f not in [""," "]]
        for ds in ds_id_list:
            # Warn (without stopping) when an ID has no entry in datasets.json
            check_ds = [b for b in datasets if b['id'] == ds]
            if len(check_ds) == 0:
                print("dataset {} isn't listed in datasets.json. Please add it to the file.".format(ds))
        # Required fields go at the top level of the record
        required_metadata = r[pub_metadata_fields].to_dict()
        required_metadata.update({'datasets':ds_id_list})
        pub_dict = required_metadata
        # Any remaining CSV columns are kept under 'original', with a timestamp
        if len(original_metadata_cols) > 0:
            original_metadata = r[original_metadata_cols].to_dict()
            original_metadata.update({'date_added':datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')})
            pub_dict.update({'original':original_metadata})
        pub_dict_list.append(pub_dict)
    return pub_dict_list
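
The splitting and lookup steps inside `create_pub_dict` can also be tried in isolation. This is a minimal sketch under stated assumptions: `split_dataset_ids`, `find_unknown_ids`, and `datasets_sample` are hypothetical names, and the only thing assumed about `datasets.json` is that each entry has an `id` key, as the function above does.

```python
def split_dataset_ids(field):
    """Split a comma-separated dataset field into clean, non-empty IDs."""
    return [d.strip() for d in field.split(",") if d.strip()]


def find_unknown_ids(ds_ids, datasets):
    """Return the IDs that have no entry with a matching 'id' in datasets."""
    known = {d["id"] for d in datasets}
    return [ds for ds in ds_ids if ds not in known]


# Hypothetical stand-in for the parsed contents of datasets.json
datasets_sample = [{"id": "dataset-026"}, {"id": "dataset-027"}]

ids = split_dataset_ids("dataset-026, dataset-999,, ")
unknown = find_unknown_ids(ids, datasets_sample)

print(ids)      # ['dataset-026', 'dataset-999']
print(unknown)  # ['dataset-999']
```

Building a set of known IDs once avoids rescanning the full dataset list for every ID, which may matter if `datasets.json` grows large.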

Generate publication metadata and export to json

In [25]:
linkage_list = create_pub_dict(linkages_csv,datasets)

Update `json_pub_path` so that the file name is `<name_of_subfolder>_publications.json`.

In [26]:
json_pub_path = os.path.join('/Users/andrewnorris/RCPublications/partitions/',rcm_subfolder+'_publications.json')

In [27]:
with open(json_pub_path, 'w') as outfile:
    json.dump(linkage_list, outfile, indent=2)
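
After writing the partition, a quick round trip can confirm that the file parses and that every record has the fields downstream steps expect. The sketch below writes to a temporary directory rather than the real partitions folder; the record contents and the `example_publications.json` file name are made up, with only the `title` / `datasets` / `original` shape taken from `create_pub_dict`.

```python
import json
import os
import tempfile

# Hypothetical record in the shape create_pub_dict produces
records = [{
    "title": "An example publication title",
    "datasets": ["dataset-026"],
    "original": {"journal": "Example Journal", "date_added": "2019-07-17 12:00:00"},
}]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "example_publications.json")
    with open(out_path, "w") as outfile:
        json.dump(records, outfile, indent=2)
    # Read the file back to confirm it is valid JSON
    with open(out_path) as infile:
        reloaded = json.load(infile)

# Every record should carry a title and at least one dataset ID
incomplete = [r for r in reloaded if not r.get("title") or not r.get("datasets")]
print(len(reloaded), len(incomplete))  # 1 0
```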