# Interconversion between PubChem records

S. Kim, J. Cuadros
September 24th, 2019

Hunter Tiner

## Introduction

PUG-REST can be used to retrieve PubChem records related to other PubChem records. PUG-REST takes an input list of records in
one of the three PubChem databases (Compound, Substance, and BioAssay) and returns a list of related records in the same or a different
database. The relationship between the input and output records can be specified with an optional parameter. This allows
one to do various tasks, including (but not limited to) looking up:
- Depositor-provided records (i.e., substances) that are standardized to a given compound.
- Mixture compounds that contain a given component compound.
- Stereoisomers/isotopomers of a given compound.
- Compounds that are tested to be active in a given assay.
- Compounds that have similar structures to a given compound.
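
The request structure described above (an input specification, an operation naming the output record type, an output format, and optional parameters) can be sketched as a small helper. This is only an illustration: the function name build_pug_url and its arguments are hypothetical and not part of PUG-REST; the base URL and the list_return option are the ones used later in this notebook.

```r
# Hypothetical helper (not part of PUG-REST): assemble a request URL from its parts.
build_pug_url <- function(input, operation, output = "txt", options = NULL) {
  base <- "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
  url <- paste(base, input, operation, output, sep = "/")
  if (length(options) > 0) {
    # options is a named character vector, e.g. c(list_return = "grouped")
    query <- paste(names(options), options, sep = "=", collapse = "&")
    url <- paste(url, query, sep = "?")
  }
  url
}

# Example: SIDs for one CID, in JSON, with an optional parameter attached.
url_grouped <- build_pug_url("compound/cid/129825914", "sids", "json",
                             options = c(list_return = "grouped"))
print(url_grouped)
```

The code cells in this notebook build the same kind of URL step by step with paste, which keeps each part visible.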

## 1. Getting depositor-provided records for a given compound

First, let’s retrieve the substance records associated with a given CID (CID 129825914).

In [None]:
pugrest <- 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'
pugin <- 'compound/cid/129825914'
pugoper <- 'sids'
pugout <- 'txt'

url <- paste(pugrest,pugin,pugoper,pugout,sep="/")
readLines(url)

It is also possible to provide a comma-separated list of CIDs as input identifiers.

In [None]:
pugrest <- 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'
pugin <- 'compound/cid/129825914,129742624,129783988'
pugoper <- 'sids'
pugout <- 'txt'

url <- paste(pugrest,pugin,pugoper,pugout,sep="/")
readLines(url)
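
Because readLines returns a character vector, it can be convenient to convert the txt output to integers before further processing. The following is a minimal sketch under that assumption; read_id_list is a hypothetical helper name, and a temporary file with made-up identifiers stands in for the PUG-REST response so that the example runs without a network connection.

```r
# Hypothetical helper: read a one-identifier-per-line listing into an integer vector.
read_id_list <- function(source) {
  lines <- trimws(readLines(source, warn = FALSE))
  lines <- lines[nzchar(lines)]   # drop blank lines
  as.integer(lines)
}

# Stand-in for a PUG-REST txt response: a temporary file with made-up identifiers.
tmp <- tempfile(fileext = ".txt")
writeLines(c("101", "102", "", "103"), tmp)
ids <- read_id_list(tmp)
print(ids)
```

Since readLines accepts a URL as well as a file path, the same helper could be called with the url built in the cell above.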

In the example above, the input list has three CIDs, but the PUG-REST request returned five SIDs. This means that at least one CID is associated
with multiple SIDs, but it is hard to see which one. Therefore, we want the SIDs grouped by the corresponding CIDs. This can be done using
the optional parameter list_return=grouped and changing the output format to json.

In [None]:
pugrest <- 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'
pugin <- 'compound/cid/129825914,129742624,129783988'
pugoper <- 'sids'
pugout <- 'json'
pugopt <- "list_return=grouped"

url <- paste(paste(pugrest,pugin,pugoper,pugout,sep="/"),pugopt,sep="?")
readLines(url)

This JSON can be imported into an R data.frame by using the jsonlite package.

In [None]:
if(!require("jsonlite",quietly=TRUE)) {
 install.packages("jsonlite", repos="https://cloud.r-project.org/",
 quiet=TRUE, type="binary")
 library("jsonlite")
}

library("jsonlite")

In [None]:
lisData <- fromJSON(url)
lisData[[1]][[1]]
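
For further analysis, the grouped result is often easier to handle as one row per CID-SID pair. The sketch below assumes the response has the layout InformationList > Information, with one record per CID holding a CID field and a list of SIDs, and that every record has at least one SID. The nested list is a made-up stand-in for what jsonlite::fromJSON(url, simplifyVector = FALSE) would return; the identifiers are illustrative, not real PubChem data.

```r
# Made-up stand-in for jsonlite::fromJSON(url, simplifyVector = FALSE);
# the CID and SID values are illustrative only.
grouped <- list(InformationList = list(Information = list(
  list(CID = 1001, SID = list(5001, 5002)),
  list(CID = 1002, SID = list(5003)),
  list(CID = 1003, SID = list(5004, 5005))
)))

# Build one row per CID-SID pair (the single CID is recycled across its SIDs).
records <- grouped$InformationList$Information
pairs <- do.call(rbind, lapply(records, function(rec) {
  data.frame(CID = rec$CID, SID = unlist(rec$SID))
}))
print(pairs)

# Number of SIDs linked to each CID.
sid_counts <- table(pairs$CID)
print(sid_counts)
```

The counts make it easy to spot which input CID is associated with more than one SID.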

Note that the json output format is used in the above request. The txt output format in PUG-REST returns data in a single column, and the
grouped result from the above request does not fit into a single column.

If you want the output records to be “flattened”, rather than grouped by the input identifiers, use list_return=flat. This is what happens
behind the scenes for the txt output format.

In [None]:
pugrest <- 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'
pugin <- 'compound/cid/129825914,129742624,129783988'
pugoper <- 'sids'
pugout <- 'json'
pugopt <- "list_return=flat"

url <- paste(paste(pugrest,pugin,pugoper,pugout,sep="/"),pugopt,sep="?")
readLines(url)

The default value for the “list_return” parameter is:
- flat when the output format is TXT, or
- grouped when the output format is JSON or XML.

It is also possible to specify the input list implicitly, rather than providing the input identifiers explicitly. For example, the following request uses a
chemical name to specify the input list.

In [None]:
pugrest <- 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'
pugin <- 'compound/name/d-lactose'
pugoper <- 'cids'
pugout <- 'txt'

url <- paste(pugrest,pugin,pugoper,pugout,sep="/")
cids <- readLines(url)
paste("# CIDs returned:", length(cids))

In [None]:
paste(cids, collapse=", ")

In this case, the input CIDs are specified implicitly by a chemical name.
Changing the operation to sids converts these CIDs to the associated SIDs.

In [None]:
pugrest <- 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'
pugin <- 'compound/name/d-lactose'
pugoper <- 'sids'
pugout <- 'txt'

url <- paste(pugrest,pugin,pugoper,pugout,sep="/")
sids <- readLines(url)
paste("# SIDs returned:", length(sids))

A third alternative is to search the chemical name directly against the Substance database.

In [None]:
pugrest <- 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'
pugin <- 'substance/name/d-lactose'
pugoper <- 'sids'
pugout <- 'txt'

url <- paste(pugrest,pugin,pugoper,pugout,sep="/")
sids <- readLines(url)
paste("# SIDs returned:", length(sids))

The above example illustrates how the list conversion works.
- In the first request, the name “d-lactose” is searched for against the Compound database and the resulting 4 CIDs are returned.
- If you change the operation part from “cids” to “sids” (as in the second request), the same name search is done first against the Compound database, followed by the list conversion from the resulting CIDs to the associated 198 SIDs.
- In the third request, the name search is performed against the Substance database, and the resulting 13 SIDs are returned.

**Exercise 1:** Statins are a class of drugs that lower cholesterol levels in the blood. Retrieve in JSON the substance records associated with the
compounds whose names contain the string “statin”.
- Make only one PUG-REST request.
- For partial name matching, set the name_type parameter to “word” (See the PUG-REST document for an example).
- Group the substances by the corresponding compound records.

In [None]:
# Write your code here

## 2. Getting mixture/component molecules for a given molecule

The list interconversion may be used to retrieve mixtures that contain a given molecule as a component. To do this, the input molecule should be a
single-component compound (that is, with only one covalently-bound unit), and the optional parameter cids_type=component should be provided.

In [None]:
pugrest <- 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'
pugin <- 'compound/name/tylenol'
pugoper <- 'cids'
pugout <- 'txt'
pugopt <- 'cids_type=component'

url <- paste(paste(pugrest,pugin,pugoper,pugout,sep="/"),pugopt,sep="?")
cids <- readLines(url)
paste("# CIDs returned:", length(cids))

In [None]:
paste(cids, collapse=", ")

It should be noted that, if the input molecule is a multi-component compound, the option
cids_type=component returns the components of that compound. For example, the following code shows how to get all components of the first
mixture in the list generated in the previous example.

In [None]:
pugrest <- 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'
pugin <- paste('compound/cid/', cids[1], sep="")
pugoper <- 'cids'
pugout <- 'txt'
pugopt <- 'cids_type=component'

url <- paste(paste(pugrest,pugin,pugoper,pugout,sep="/"),pugopt,sep="?")
component_cids <- readLines(url)
paste("CID:", cids[1])

In [None]:
paste("# of components:", length(component_cids))

In [None]:
paste(component_cids, collapse=", ")

**Exercise 2a:** Many over-the-counter drugs contain more than one active ingredient. In this exercise, we want to find component molecules that
occur with three common painkillers (aspirin, tylenol, advil) as a mixture.

**Step 1.** Define a list that contains three drug names (aspirin, tylenol, advil).

In [None]:
# Write your code here

**Step 2.** Using a for loop, retrieve the PubChem CIDs corresponding to the three drugs and store them in a new list. In order not to overload the
PubChem servers, pause the program for 0.2 seconds in each iteration of the for loop (using Sys.sleep).

In [None]:
# Write your code here

**Step 3.** Using another for loop, do the following things for each drug:
- Get the PubChem CIDs of the mixture compounds that contain each drug and store them in a list.
- Get the PubChem CIDs of the components that occur in any of the returned mixtures, by setting the “list_return” parameter to “flat”. This can be done with a single request.
- Print all the components.
- Pause the code for 0.2 seconds using Sys.sleep each time a PUG-REST request is made.

In [None]:
# Write your code here

## 3. Getting compounds tested in a given assay

PUG-REST may be used to retrieve compounds tested in a given assay. For example, the following code cell shows how to get all compounds
tested in AID 1207599.

In [None]:
pugrest <- 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'
pugin <- 'assay/aid/1207599'
pugoper <- 'cids'
pugout <- 'txt'

url <- paste(pugrest,pugin,pugoper,pugout,sep="/")
cids <- readLines(url)
paste("# CIDs returned:", length(cids))

In [None]:
paste(cids, collapse=", ")

If you are interested in only the compounds that are tested “active” in a given assay, set the cids_type parameter to active, as shown in the
code below.

In [None]:
pugrest <- 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'
pugin <- 'assay/aid/1207599'
pugoper <- 'cids'
pugout <- 'txt'
pugopt <- 'cids_type=active'

url <- paste(paste(pugrest,pugin,pugoper,pugout,sep="/"),pugopt,sep="?")
cids <- readLines(url)
paste("# CIDs returned:", length(cids))

In [None]:
paste(cids, collapse=", ")
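
The two listings above can be compared locally once they are stored under different names (both cells above assign to cids, so the first result is overwritten by the second). The following is a minimal sketch with made-up identifiers standing in for the two results; the variable names tested_cids and active_cids are assumptions for illustration.

```r
# Made-up identifiers standing in for the "all tested" and "active" listings.
tested_cids <- c("11", "12", "13", "14", "15")
active_cids <- c("12", "15")

# Compounds that were tested but not flagged active in this assay.
not_active_cids <- setdiff(tested_cids, active_cids)
print(not_active_cids)

# Fraction of tested compounds that were flagged active.
active_fraction <- length(active_cids) / length(tested_cids)
print(active_fraction)
```

Note that “not active” here simply means absent from the active list; it does not by itself distinguish inactive compounds from those with other outcomes.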

It is also possible to specify the input assay list implicitly. For example, the following code cell retrieves compounds tested in any assay targeting
human carbonic anhydrase 2 (CA2), whose accession number is P00918.

In [None]:
pugrest <- 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'
pugin <- 'assay/target/accession/P00918'
pugoper <- 'cids'
pugout <- 'txt'

url <- paste(pugrest,pugin,pugoper,pugout,sep="/")
cids <- readLines(url)

paste("# CIDs returned:", length(cids))

**Exercise 3a:** Find compounds that are tested to be active against human acetylcholinesterase (accession: P08173) and retrieve SMILES strings
for those compounds.
- Split the CID list into smaller chunks (with a chunk size of 100).
- Place the retrieved data in a data.frame (CID and SMILES strings in the first and second columns, respectively).
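
As a hint for the chunking step, one common way to split a vector into fixed-size chunks in R is shown below. This is a generic sketch with made-up identifiers, not a solution to the exercise; the variable names are assumptions.

```r
# Made-up CIDs standing in for a long identifier list.
cids <- as.character(1:250)
chunk_size <- 100

# Assign each element a chunk number (1, 1, ..., 2, 2, ...) and split on it.
chunks <- split(cids, ceiling(seq_along(cids) / chunk_size))
print(lengths(chunks))

# Each chunk can be collapsed into the comma-separated form used in a request.
chunk_strings <- vapply(chunks, paste, character(1), collapse = ",")
print(substr(chunk_strings[[3]], 1, 15))
```

Each element of chunk_strings can then be placed in the input part of a request, with a pause between requests.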

In [None]:
# Write your code here