# Recipe: Reusing a Chemical Dataset. Part 1 - Finding Datasets
## Scenario
Our group has a set of thermophysical data on over 8000 chemical substances. We want to integrate another physical property dataset into it so that we can analyze how the thermophysical data correlate with the chosen physical property for the substances that are common to both sets.

Criteria for picking the physical property dataset: high quality, trusted, large, and available under an open license, so that we can publish the results and make the derived dataset open.
- **High quality means**: unambiguous identification of each chemical substance, and enough contextual information (metadata) to make the values scientifically useful, i.e., at least the composition of the solvent, the temperature and, for volatile substances, the pressure.
- **Trusted means**: the provenance chain is reported with the data, it shows that the data comes from one or more reputable sources, and any aggregation and/or processing is documented in enough detail that the community can understand how the dataset was created and provided.
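
The "high quality" criterion above can be turned into a simple completeness check. The following is a minimal sketch, not a definitive implementation: the record layout and the field names (`identifier`, `solvent_composition`, `temperature_K`, `pressure_kPa`, `volatile`) are hypothetical and would need to be mapped onto whatever the candidate dataset actually provides.

```python
# Hypothetical record layout: the field names are illustrative, not from a real dataset.
REQUIRED_FIELDS = ("identifier", "solvent_composition", "temperature_K")


def missing_metadata(record):
    """Return the names of the required fields that are absent or empty."""
    required = list(REQUIRED_FIELDS)
    # Pressure is only required for volatile substances.
    if record.get("volatile"):
        required.append("pressure_kPa")
    return [name for name in required if record.get(name) in (None, "")]


# Made-up records used only to exercise the check.
records = [
    {"identifier": "EXAMPLE-KEY-1", "solvent_composition": "water",
     "temperature_K": 298.15, "volatile": False},
    {"identifier": "EXAMPLE-KEY-2", "solvent_composition": "water",
     "temperature_K": 298.15, "volatile": True},
    {"identifier": "", "solvent_composition": "ethanol",
     "temperature_K": 293.15, "volatile": False},
]

for record in records:
    print(record["identifier"] or "<no identifier>", missing_metadata(record))
```

A record that returns an empty list meets the minimum metadata requirement; anything else lists what would have to be recovered from the source before the value is scientifically useful.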

## Searching PubChem for datasets
PubChem houses a lot of data about chemical substances, compounds and bioassays. Over time, external organizations have worked with PubChem to include their data in one of two ways:
- data that has been integrated into PubChem pages (e.g., [CCDC](https://pubchem.ncbi.nlm.nih.gov/source/941) -> [Example](https://pubchem.ncbi.nlm.nih.gov/compound/241))
- data that is not available in a PubChem page but is available via the data sources section of the site as 'annotations' (e.g., [RCSB PDB](https://pubchem.ncbi.nlm.nih.gov/source/15751) -> [Example](https://pubchem.ncbi.nlm.nih.gov/source/15751#data=Annotations))

The data available may not be structured and/or clearly described. However, if the source has a website with an API, you are likely to get better quality metadata from the linked site.
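
Before running the live query, it can help to see the shape of the source table that the cell below filters. This is a minimal offline sketch, assuming the `Table` -> `Row` -> `Cell` layout and the column positions (0 for the name, 8 for the category, 9 for the URL) that the cell relies on. The helper names `filter_sources` and `make_row` are hypothetical, and the sample rows are made up for illustration.

```python
def filter_sources(table, category, category_col=8, name_col=0, url_col=9):
    """Return name/url pairs for rows whose category cell contains `category`."""
    hits = []
    for row in table["Table"]["Row"]:
        cells = row["Cell"]
        if category in cells[category_col]:
            hits.append({"name": cells[name_col], "url": cells[url_col]})
    return hits


def make_row(name, category, url):
    """Build a made-up row with ten cells; unused columns are left empty."""
    cells = [""] * 10
    cells[0], cells[8], cells[9] = name, category, url
    return {"Cell": cells}


# Made-up rows that imitate the assumed layout of the source table.
sample = {"Table": {"Row": [
    make_row("Example Source A", "Curation Efforts", "https://a.example.org/"),
    make_row("Example Source B", "Chemical Vendors", "https://b.example.org/"),
    make_row("Example Source C", "Curation Efforts, Research and Development",
             "https://c.example.org/"),
]}}

curated = filter_sources(sample, "Curation Efforts")
print(curated)
```

Keeping the column positions as parameters makes it easy to adjust the filter if the layout of the table turns out to differ from what is assumed here.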

In [6]:
import requests
import json
# This URL is the metadata about the data sources in PubChem
url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/sourcetable/all/JSON/?response_type=display'
response = requests.get(url)
srcs = response.json()
results = []
search = 'Curation Efforts'  # i.e., a repository, or other type of data source (this is in index 8 of the data list for each source)
rows = srcs['Table']['Row']
for row in rows:
    cells = row['Cell']
    # keep the name (index 0) and URL (index 9) of every source in the chosen category
    if search in cells[8]:
        results.append({'name': cells[0], 'url': cells[9]})
print(json.dumps(results, indent=4))

[
    {
        "name": "Alliance of Genome Resources",
        "url": "https://www.alliancegenome.org/"
    },
    {
        "name": "Barrie Walker, BARK Information Services",
        "url": "https://uk.linkedin.com/in/barrie-walker-85b4a510"
    },
    {
        "name": "BindingDB",
        "url": "http://www.bindingdb.org/rwd/bind/"
    },
    {
        "name": "BioCyc",
        "url": "https://biocyc.org/"
    },
    {
        "name": "BioGRID",
        "url": "https://thebiogrid.org/"
    },
    {
        "name": "CAMEO Chemicals",
        "url": "https://cameochemicals.noaa.gov/"
    },
    {
        "name": "Catalogue of Life (COL)",
        "url": "https://www.catalogueoflife.org/"
    },
    {
        "name": "CCSbase",
        "url": "https://ccsbase.net/"
    },
    {
        "name": "CDC-ATSDR Toxic Substances Portal",
        "url": "https://www.atsdr.cdc.gov/features/toxicsubstances/index.html"
    },
    {
        "name": "Cell Line Ontology (CLO)",
        "url": "http

## Searching FAIRsharing for datasets
FAIRsharing is a database of FAIR resources and not, per se, a database of datasets. However, you might find a repository here that has the kind of data you are looking for.

In [None]:
from hiddensettings import fs_user, fs_pass  # local module holding the credentials
import requests
# see https://fairsharing.org/API_doc for instructions on how to search the API
# user login: the credentials must be sent as a JSON body (json=, not data=)
url = 'https://api.fairsharing.org/users/sign_in'
hdrs = {'Accept': 'application/json', 'Content-Type': 'application/json'}
data = {'user': {'login': fs_user, 'password': fs_pass}}
response = requests.post(url, headers=hdrs, json=data)
# assumption: the sign-in response carries the token under 'jwt', as described in the API docs
hdrs['Authorization'] = 'Bearer ' + response.json()['jwt']
searchurl = 'https://api.fairsharing.org/search/fairsharing_records?q=chemistry'
search = requests.post(searchurl, headers=hdrs)
print(search.content)
# not yet confirmed to work

Author: Stuart Chalk
Date: Dec. 12, 2022