### Generating `publications.json` partitions

This is a template notebook for generating publication metadata, most importantly the linkage between each publication and the dataset(s) it uses. Datasets are enumerated in `datasets.json`.

The process is as follows:
1. Import the CSV with publication-dataset linkages, then format and clean it
2. Match each linkage to `datasets.json` and alert if a given dataset doesn't exist yet
3. Generate list of dicts with publication metadata
4. Write to a `publications.json` file

Questions? Email or Slack Sophie at sr2661@nyu.edu.

#### Import CSV containing publication-dataset linkages

Set `linkages_path` to the location of the CSV containing dataset-publication linkages, then read it in.

In [1]:
import pandas as pd

In [2]:
linkages_path = '/Users/sophierand/RichContextMetadata/metadata/20190913_usda_excel/producing_metadata/usda_linkages.csv'
linkages_csv = pd.read_csv(linkages_path)

Format/clean linkage data
1. Update `pub_metadata_fields` with the fields that exist in your linkage file, e.g.:
    * `title` (required)
    * `dataset_id` (required)
    * `url` OR `doi` (at least one is required for every linkage; the example file used below names this column `pub_url`)
    * `journal`

2. Remove possible null values in the `dataset_id` field (update field name to match your csv if needed)
3. Deduplicate
4. Apply `scrub_unicode` to `title` field.
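The field requirements above can be checked before cleaning. The following is a minimal sketch, not part of the original notebook: `find_invalid_linkages` and `is_missing` are hypothetical helpers that take plain dict records (for example from `linkages_csv.to_dict('records')`) and assume the columns are named `title`, `dataset_id`, `url`, and `doi`. Adjust the names to match your file.

```python
import math

def is_missing(value):
    """Treat None, NaN, and blank strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""

def find_invalid_linkages(records, link_fields=("url", "doi")):
    """Return (row_index, reason) pairs for records that break the rules above.

    Hypothetical helper: the field names are assumptions, not a fixed schema.
    """
    problems = []
    for i, rec in enumerate(records):
        for required in ("title", "dataset_id"):
            if is_missing(rec.get(required)):
                problems.append((i, "missing " + required))
        # at least one of url / doi must be present
        if all(is_missing(rec.get(f)) for f in link_fields):
            problems.append((i, "missing " + " or ".join(link_fields)))
    return problems

# made-up records for illustration only
records = [
    {"title": "Paper A", "dataset_id": "dataset-001", "url": "https://example.org/a"},
    {"title": "Paper B", "dataset_id": float("nan"), "doi": "10.0000/example"},
    {"title": "", "dataset_id": "dataset-002"},
]
problems = find_invalid_linkages(records)
print(problems)
```

If your file uses `pub_url` as the link column, pass `link_fields=("pub_url",)`.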

In [23]:
import unicodedata

In [24]:
def scrub_unicode(text):
    """
    Try to handle the unicode edge cases encountered in source text,
    as best as possible.
    """
    x = " ".join(map(lambda s: s.strip(), text.split("\n"))).strip()

    x = x.replace('“', '"').replace('”', '"')
    x = x.replace("‘", "'").replace("’", "'").replace("`", "'")
    x = x.replace("`` ", '"').replace("''", '"')
    x = x.replace('…', '...').replace("\\u2026", "...")
    x = x.replace("\\u00ae", "").replace("\\u2122", "")
    x = x.replace("\\u00a0", " ").replace("\\u2022", "*").replace("\\u00b7", "*")
    x = x.replace("\\u2018", "'").replace("\\u2019", "'").replace("\\u201a", "'")
    x = x.replace("\\u201c", '"').replace("\\u201d", '"')

    x = x.replace("\\u20ac", "€")
    x = x.replace("\\u2212", " - ") # minus sign

    x = x.replace("\\u00e9", "é")
    x = x.replace("\\u017c", "ż").replace("\\u015b", "ś").replace("\\u0142", "ł")    
    x = x.replace("\\u0105", "ą").replace("\\u0119", "ę").replace("\\u017a", "ź").replace("\\u00f3", "ó")

    x = x.replace("\\u2014", " - ").replace('–', '-').replace('—', ' - ')
    x = x.replace("\\u2013", " - ").replace("\\u00ad", " - ")

    # decompose accented characters, then drop anything that is still non-ASCII
    x = unicodedata.normalize("NFKD", x).encode("ascii", "ignore").decode("ascii")

    # some content returns text in bytes rather than as a str?
    if not isinstance(x, str):
        print("not a string?", type(x), x)

    return x
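Note the final normalization step in `scrub_unicode`: NFKD decomposition followed by ASCII encoding with `"ignore"` drops every character that has no ASCII decomposition. The short sketch below isolates that one step (`to_ascii` is a name introduced here for illustration, not part of the notebook) to show what happens to accented titles.

```python
import unicodedata

def to_ascii(text):
    """Isolated copy of the last normalization step used in scrub_unicode."""
    decomposed = unicodedata.normalize("NFKD", text)
    # combining marks and any other non-ASCII characters are silently dropped
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("café"))   # accent is stripped: cafe
print(to_ascii("Łódź"))   # Ł has no decomposition, so it is dropped: odz
print(to_ascii("10 €"))   # the euro sign is dropped, leaving "10 "
```

As a result, the earlier replacements that map escape sequences to accented letters or `€` end up as plain ASCII, or are removed, in the returned title.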

In [7]:
pub_metadata_fields = ['title','pub_url','dataset_id']

In [22]:
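linkages_csv = linkages_csv[pub_metadata_fields]
# drop rows with no dataset_id, then deduplicate; .copy() avoids a SettingWithCopyWarning on the next line
linkages_csv = linkages_csv.dropna(subset=['dataset_id']).drop_duplicates().copy()
linkages_csv['title'] = linkages_csv['title'].apply(scrub_unicode)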

#### Generate list of dicts of metadata

Read in `datasets.json`. Update `datasets_path` to point to your local copy.

In [12]:
import json

In [15]:
datasets_path = '/Users/sophierand/RCDatasets/datasets.json'

with open(datasets_path) as json_file:
    datasets = json.load(json_file)

Create a list of dictionaries of publication metadata. `format_linkages` iterates through the `linkages_csv` dataframe and splits the `dataset_id` field on commas (for when multiple datasets are listed). It prints a warning if a dataset doesn't exist and needs to be added to `datasets.json`.

In [27]:
def format_linkages(linkages_csv, datasets):
    # ids already present in datasets.json, for quick membership checks
    known_ids = {d['id'] for d in datasets}
    linkage_list = []
    for _, r in linkages_csv.iterrows():
        # a row may list several comma-separated dataset ids
        ds_id_list = [d.strip() for d in r['dataset_id'].split(",")]
        ds_id_dict_list = [{'dataset_id': d} for d in ds_id_list]
        pub_dict = {'title': r['title'], 'url': r['pub_url'], 'related_dataset': ds_id_dict_list}
        for ds in ds_id_list:
            if ds not in known_ids:
                print("dataset {} isn't listed in datasets.json. Please add it to the file".format(ds))
        linkage_list.append(pub_dict)
    return linkage_list
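To make the output shape concrete, here is a small pandas-free sketch of the same logic applied to plain dict records. `build_linkages` and the sample ids are hypothetical and only mirror `format_linkages`. Unlike the original, it also returns the missing ids instead of only printing them.

```python
def build_linkages(rows, datasets):
    """Mirror of format_linkages for plain dict rows (illustrative only)."""
    known_ids = {d["id"] for d in datasets}
    linkage_list, missing = [], []
    for row in rows:
        # split comma-separated ids and trim surrounding whitespace
        ds_ids = [d.strip() for d in row["dataset_id"].split(",")]
        missing.extend(d for d in ds_ids if d not in known_ids)
        linkage_list.append({
            "title": row["title"],
            "url": row["pub_url"],
            "related_dataset": [{"dataset_id": d} for d in ds_ids],
        })
    return linkage_list, missing

# made-up inputs for illustration
datasets = [{"id": "dataset-001"}, {"id": "dataset-002"}]
rows = [{"title": "Paper A", "pub_url": "https://example.org/a",
         "dataset_id": "dataset-001, dataset-999"}]

linkage_list, missing = build_linkages(rows, datasets)
print(linkage_list)
print("not in datasets.json:", missing)
```

Each publication becomes one dict whose `related_dataset` list holds one `{'dataset_id': ...}` entry per linked dataset.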

Generate the publication metadata and export it to JSON.

In [31]:
linkage_list = format_linkages(linkages_csv,datasets)

Update `json_pub_path` to be: 
`<name_of_subfolder>_publications.json`

In [None]:
json_pub_path = '20190920_excel_usda_publications.json'

In [None]:
with open(json_pub_path, 'w') as outfile:
    json.dump(linkage_list, outfile, indent=2)
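After writing, it can be worth reading the file back to confirm that it parses and that every entry has the expected keys. A minimal sketch, using a temporary file and made-up data rather than the real output path:

```python
import json
import os
import tempfile

# made-up entry with the same keys format_linkages produces
linkage_list = [{
    "title": "Paper A",
    "url": "https://example.org/a",
    "related_dataset": [{"dataset_id": "dataset-001"}],
}]

with tempfile.TemporaryDirectory() as tmp_dir:
    json_pub_path = os.path.join(tmp_dir, "example_publications.json")
    with open(json_pub_path, "w") as outfile:
        json.dump(linkage_list, outfile, indent=2)

    # read the file back and check its structure
    with open(json_pub_path) as infile:
        reloaded = json.load(infile)

expected_keys = {"title", "url", "related_dataset"}
round_trip_ok = reloaded == linkage_list
keys_ok = all(set(entry) == expected_keys for entry in reloaded)
print(round_trip_ok, keys_ok)
```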