In [1]:
# apply Jupyter notebook style
from IPython.core.display import HTML

from custom.styles import style_string

HTML(style_string)


# Chemical Data Sources

<div class="overview admonition"> 
<p class="admonition-title">Overview</p>

Questions:

* What kinds of chemical data are available online?

* How can I use Python to retrieve information from chemical databases?
    
* What is a REST API?

Objectives:

* Learn about chemical data sources available on the web.
    
* Use Python requests and web APIs to programmatically retrieve information.

</div>

## Web Databases

In this lesson, we will explore online chemical databases and learn how to use Python to retrieve information from them. These databases offer data on chemical structures, properties, biological activities, and more. By the end of this lesson, you will be familiar with several popular chemical data sources and be able to access their data using Python and REST APIs.

Some popular chemical databases along with their websites are listed below:

[PubChem](https://pubchem.ncbi.nlm.nih.gov/): PubChem is a comprehensive public database of chemical molecules and their biological activities. It is maintained by the National Center for Biotechnology Information (NCBI) and contains information on compound structures, properties, bioassays, and more.

[ChEMBL](https://www.ebi.ac.uk/chembl/): ChEMBL is a large-scale bioactivity database for drug discovery. It is maintained by the European Bioinformatics Institute (EMBL-EBI), part of the European Molecular Biology Laboratory (EMBL), and contains information on bioactive molecules, targets, and assay data, mainly curated from scientific literature.

[Protein Data Bank (PDB)](https://www.rcsb.org/): The Protein Data Bank is a repository for 3D structural data of proteins, nucleic acids, and their complexes. It provides information on atomic coordinates, experimental details, and biological context. The PDB is managed by the Research Collaboratory for Structural Bioinformatics (RCSB).

[Materials Project](https://materialsproject.org/): The Materials Project is a collaborative effort to provide an open-source database of materials properties using high-throughput computational methods. It offers a variety of data, including crystal structures, electronic properties, and phase diagrams for various materials.

[ChemSpider](http://www.chemspider.com/): ChemSpider is a free chemical structure database owned by the Royal Society of Chemistry. It aggregates data from various sources, providing information on chemical structures, properties, and associated biological activities.

[NIST Chemistry WebBook](https://webbook.nist.gov/chemistry/): The Chemistry WebBook is maintained by the National Institute of Standards and Technology (NIST) and provides chemical and physical property data for molecular systems.

[ZINC](http://zinc.docking.org/): ZINC is a free database of commercially available compounds for virtual screening. It contains information on millions of purchasable molecules, including 3D structures, properties, and vendor information. 

[DrugBank](https://www.drugbank.ca/): DrugBank is a comprehensive database containing information on drugs and drug targets. It combines detailed drug data with comprehensive drug target information, including sequence, structure, and pathway data.


[Crystallography Open Database (COD)](http://www.crystallography.net/cod/): COD is an open-access database of crystal structures, containing information on organic, inorganic, and metal-organic compounds. It provides crystallographic data, such as atomic coordinates, unit cell parameters, and space groups.

Many more chemical data sources exist than we can list here.

<div class="exercise admonition">
<p class="admonition-title">Exercise</p>
    <p>In the next section, we will work with PubChem programmatically. First, familiarize yourself with the PubChem website. 
        Go to <a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a> and look up aspirin. Record the molecular weight, number of hydrogen bond
        acceptors, and topological polar surface area, and save them in variables.</p>
    <p>Next, retrieve the same information using RDKit.</p>
</div>

In [2]:
aspirin_smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
# Fill in the values you found on PubChem
#aspirin_mw = 
#aspirin_h_acceptors = 
#aspirin_tpsa = 

In [None]:
# Retrieve the same information with RDKit


All of the databases mentioned so far are accessible on the web through your browser. 
Some of them also provide application programming interfaces (APIs) that allow you to access them programmatically.
Today, we will access databases that provide REST APIs, using a Python library for retrieving information from the web.

## REST APIs
Many of these databases can be accessed programmatically through something called a REST API.
REST stands for **R**epresentational **S**tate **T**ransfer. API stands for **A**pplication **P**rogramming **I**nterface. 
A REST API is a type of web API that is used to allow different software systems to communicate with each other over the internet. 

Usually a REST API is accessed by varying parameters in a URL.

We will work with the PubChem REST API first to demonstrate what this means.

PubChem has a few REST APIs and the first we will look at is called [PUG REST](https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest#section=URL-based-API).

To retrieve information using PUG REST, you add parameters to the base URL:


```
https://pubchem.ncbi.nlm.nih.gov/rest/pug
```

For example, to retrieve information about aspirin using SMILES, we would add

`/compound/smiles/{aspirin_smiles}`

The full URL will be

```
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CC(=O)OC1=CC=CC=C1C(=O)O
```
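The same URL can be assembled in Python by joining the base URL and the path. This is a minimal sketch; the variable names `base_url` and `compound_url` are illustrative and are not part of PUG REST itself.

```python
# Base URL for PUG REST, as given above
base_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# SMILES string for aspirin
aspirin_smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"

# Append the path that selects a compound by its SMILES string
compound_url = f"{base_url}/compound/smiles/{aspirin_smiles}"
print(compound_url)
```

Changing the value of `aspirin_smiles` changes the URL, which is what "varying parameters in a URL" means in practice.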

Go to this URL in your browser.
The output will be information about the molecule, in a format called XML.
We would rather have our data in a format called JSON, which is easier to work with in Python.
To request JSON, we add `/JSON` to the end of our URL.
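As a rough sketch of what requesting JSON could look like from Python, the snippet below appends `/JSON` to the URL and defines a small helper that downloads and parses the response. It uses only Python's standard library (`urllib` and `json`) so that it runs without extra installs. The helper name `fetch_json` and the 30-second timeout are illustrative choices, and a third-party library such as `requests` could be used instead.

```python
import json
from urllib.request import urlopen

aspirin_smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
compound_url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/{aspirin_smiles}"

# Ask for JSON instead of the default XML
json_url = compound_url + "/JSON"


def fetch_json(url, timeout=30):
    """Download a URL and parse the response body as JSON (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Uncomment to send the request (requires an internet connection):
# data = fetch_json(json_url)
# print(list(data.keys()))
```

The parsed result is an ordinary Python dictionary, so it can be explored with the usual indexing and looping tools.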
