<a href="https://colab.research.google.com/github/D3TaLES/databases_demo/blob/main/notebooks/no_sql_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**To view this demonstration, click the play button beside each code cell to run it. The cells should be run in order. We also recommend running this notebook in Colab.**

# Demonstration of a No-SQL Database
In this notebook, we demonstrate a No-SQL database. No-SQL structures contain one or more collections of records (called documents in many types of No-SQL databases). Within a collection, all documents share a schema. Schemas have a tree-like structure. Each document contains a series of attributes (branches in the tree), each of which may contain a value or a list. An attribute may also contain embedded attributes, i.e., smaller branches off the main branch. The figure below shows the nested nature of a No-SQL schema for the example `UVVis_Data`. In this schema, each document corresponds to a molecule with attributes such as `smiles` and `molecular_weight`. Some attributes, such as `uvvis_data`, are nested. Accordingly, `uvvis_data` has sub-attributes such as `absorbance_data` and `optical_gap`.

**Note**: This schema is not a complete picture of the schema for the data in this notebook; it is only a partial schema. 

<img src='https://drive.google.com/uc?export=view&id=1UzghTXD3Kjh5brw_Zk-Pdol8kRUfwdcW' width="800" height="300">
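
As a rough illustration of this nesting, a single document could be written as a Python dictionary like the one below. The field names follow the schema described above. The `_id`, the optical gap, and the absorbance points are made-up placeholders rather than real data, and the shape of `absorbance_data` (a list of wavelength/absorbance points) is an assumption.

```python
# Hypothetical document: field names follow the schema above, values are placeholders.
example_doc = {
    "_id": "example_molecule",
    "smiles": "C1=CC=CC=C1",
    "molecular_weight": 78.11,
    "uvvis_data": {
        "optical_gap": 4.5,  # placeholder value, not a measurement
        "absorbance_data": [
            {"wavelength": 250, "absorbance": 0.10},
            {"wavelength": 260, "absorbance": 0.25},
        ],
    },
}

# Top-level attributes are read directly; nested ones are reached by chaining keys.
smiles = example_doc["smiles"]
optical_gap = example_doc["uvvis_data"]["optical_gap"]
n_points = len(example_doc["uvvis_data"]["absorbance_data"])
print(smiles, optical_gap, n_points)
```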

After first initializing the database, we load the No-SQL schema and show how it can validate experimental data. We then insert validated data into the database for several molecules and for different data types (computational and experimental). Finally, we give example database queries and show how to easily plot queried data.

## Install and Import Needed Code

Here we use `apt install` and `pip install` to install several packages for use in this notebook. We also pull the file processing code and the example data files from our [GitHub repository](https://github.com/D3TaLES/databases_demo/). Then we import the packages so they can be used.

**This may take a few minutes.**

**Note**: Colab normally has [pandas](https://pandas.pydata.org/), [numpy](https://numpy.org/), [matplotlib](https://matplotlib.org/), [scipy](https://scipy.org/), and [jsonschema](https://json-schema.org/) pre-installed. If you do not have these packages installed, you will need to install them. 

In [None]:
%%capture
! apt install mongodb > log  # Install the No-SQL database architecture MongoDB
! service mongodb start  # Start MongoDB
! pip install pymatgen  # Install Pymatgen for Gaussian file parsing
! pip install pubchempy  # Install PubChem Python API for molecule information
! pip install rdkit-pypi  # Install RDKit for molecule transformations

In [None]:
! rm -rf databases_demo/ # Remove the databases_demo directory if it already exists
! git clone https://github.com/D3TaLES/databases_demo.git # Get Processing code from GitHub

In [None]:
# Import required packages (many of which you just installed)
import json  # Used below to load the schema file
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pymongo import MongoClient
from jsonschema import validate
from databases_demo.file_parser import *

## 1. Initialize the database 
Here we initialize a [MongoDB](https://www.mongodb.com/) database, which is a document-based No-SQL database. This involves connecting to the MongoDB client, then initializing a database called `test_db`.

In [None]:
# Create database
client = MongoClient()
db = client['test_db']

## 2. Load Schema and use it to validate example data
Here we begin the Extract, Transform, and Load (ETL) process to insert data into our newly created database. In this step we complete the extraction and transformation. (In step 3, we will load.) 

<img src='https://drive.google.com/uc?export=view&id=1pp7NHtPk7n4lTwGlb6MiXdgB0nPtwUNA' width="800" height="300">

We must first generate general molecular information about our molecule, in this case biphenyl. We use the `GenerateMolInfo` module defined in our [processing code](https://github.com/D3TaLES/databases_demo/blob/main/file_parser.py). This module both generates (or extracts) general molecule information for biphenyl and transforms it to our defined schema. Synonyms for the molecule (here, `biphenyl`) are supplied to the same module through the `names` argument. Next we use [jsonschema](https://json-schema.org/) and our No-SQL schema to validate the data.

In [None]:
# EXTRACT/generate basic data for biphenyl and TRANSFORM to schema
mol_data = GenerateMolInfo(smiles="C1=C(c2ccccc2)CCCC1", source='our_lab', names=['biphenyl'], sql=False).data
# Get the generated primary key for biphenyl
bp_id = mol_data.get('_id')

In [None]:
# Get Schema by extracting schema from schema file
with open('databases_demo/schema/no-sql_schema.json') as fn:
    schema = json.load(fn)
# Validate data
validate(instance=mol_data, schema=schema)

In [None]:
mol_data
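
To make the validation step more concrete, here is a minimal sketch of how `jsonschema.validate` behaves on a toy schema. The toy schema and both documents are assumptions made up for illustration; the schema file loaded above is larger. `validate` returns nothing when a document conforms and raises `ValidationError` when it does not.

```python
from jsonschema import validate, ValidationError

# Toy schema (an assumption for illustration; the notebook's real schema is larger).
toy_schema = {
    "type": "object",
    "properties": {
        "_id": {"type": "string"},
        "smiles": {"type": "string"},
        "molecular_weight": {"type": "number"},
    },
    "required": ["_id", "smiles"],
}

# Hypothetical documents: the second has a string where a number is expected.
good_doc = {"_id": "mol_1", "smiles": "C1=CC=CC=C1", "molecular_weight": 78.11}
bad_doc = {"_id": "mol_2", "smiles": "C1=CC=CC=C1", "molecular_weight": "heavy"}


def is_valid(doc, schema):
    """Return True if doc conforms to schema, False otherwise."""
    try:
        validate(instance=doc, schema=schema)
        return True
    except ValidationError:
        return False


good_ok = is_valid(good_doc, toy_schema)
bad_ok = is_valid(bad_doc, toy_schema)
print(good_ok, bad_ok)
```

Catching the error this way is one option for batch processing, where a single malformed record should be reported rather than stop the whole run.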

## 3. Insert validated data into the database
In this section, we complete the final step of the ETL process by loading the generated data (from section 2) into the database. Here this means inserting the validated document into the `molecules` collection.

Then, we repeat the entire ETL process for three additional molecules. Finally, we complete the ETL process for different types of data, namely computational data and experimental UV-Vis data. Inserting computational/experimental data into the database requires both raw data files and more complex file parsing. Here we use [these raw data files](https://github.com/D3TaLES/databases_demo/tree/main/raw_data), and the parsing is performed by the `ProcessDFT` and `ProcessUvVis` modules defined in our [processing code](https://github.com/D3TaLES/databases_demo/blob/main/file_parser.py). Basic processing demonstrations that reflect the parsing done in these modules can be found [in this Colab notebook](https://github.com/D3TaLES/databases_demo/blob/main/notebooks/processing_notebook.ipynb).

In [None]:
# LOAD molecule into database
db["molecules"].insert_one(mol_data)

### ETL for different molecules

Here we loop through a dictionary of molecule names and their SMILES strings. For each molecule, we generate general molecule data, validate the data, and insert it into the database.

In [None]:
# ETL for Benzene, Nitrobenzene, and Anthracene
extra_mols = {'benzene': "C1=CC=CC=C1", 'nitrobenzene': "C1=CC=C(C=C1)[N+](=O)[O-]", 'anthracene': "C1=CC=C2C=C3C=CC=CC3=CC2=C1"}
extra_mol_ids = {}
for name, smiles in extra_mols.items():
  # Extract and transform 
  mol_data = GenerateMolInfo(smiles, source='our_lab', names=[name], sql=False).data
  validate(instance=mol_data, schema=schema)
  # Load
  db["molecules"].insert_one(mol_data)

  # Record molecule id
  extra_mol_ids[name] = mol_data.get('_id')


### ETL for different types of data

Here we extract and transform computational data from a Gaussian DFT [log file](https://github.com/D3TaLES/databases_demo/tree/main/raw_data/tddft_biphenyl.log), then load the data into the database.

In [None]:
# EXTRACT and TRANSFORM Gaussian DFT data
gaussian_data = ProcessDFT('databases_demo/raw_data/tddft_biphenyl.log', mol_id=bp_id, sql=False).data

# Validate data
validate(instance={"_id": bp_id, "dft_data": gaussian_data}, schema=schema)

# LOAD DFT data into database
db["molecules"].update_one({"_id": bp_id}, {"$set": {"dft_data": gaussian_data}}, upsert=True)
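
The effect of `update_one` with `$set` and `upsert=True` can be pictured with a small in-memory stand-in. This is only a sketch of the semantics for top-level fields, as used in this notebook; it is not how MongoDB is implemented. The function `set_fields`, the dictionary-based "collection", and the numeric values are all hypothetical.

```python
import copy


def set_fields(collection, doc_id, fields):
    """In-memory stand-in for update_one({"_id": doc_id}, {"$set": fields}, upsert=True)."""
    # Upsert: create a bare document with this _id if none exists yet.
    doc = collection.setdefault(doc_id, {"_id": doc_id})
    # $set: overwrite only the named top-level fields, leave the rest untouched.
    for key, value in fields.items():
        doc[key] = copy.deepcopy(value)
    return doc


# Hypothetical collection keyed by _id, holding one existing document.
collection = {"mol_1": {"_id": "mol_1", "smiles": "C1=CC=CC=C1"}}

# Existing document: dft_data is added, smiles is preserved.
set_fields(collection, "mol_1", {"dft_data": {"first_excitation": 5.0}})  # placeholder value
# Missing document: the upsert creates it before setting the field.
set_fields(collection, "mol_2", {"dft_data": {"first_excitation": 4.0}})  # placeholder value

print(collection["mol_1"])
print(collection["mol_2"])
```

This is why the general molecule information inserted earlier survives when DFT data is added to the same document.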

Here we extract and transform experimental UV-Vis data from a UV-Vis [output CSV file](https://github.com/D3TaLES/databases_demo/tree/main/raw_data/uvvis_biphenyl.csv), then load the data into the database. A demonstration of the parsing done here can be found [in this Colab notebook](https://github.com/D3TaLES/databases_demo/blob/main/notebooks/processing_notebook.ipynb).

In [None]:
# EXTRACT and TRANSFORM UV-Vis data
uvvis_data = ProcessUvVis('databases_demo/raw_data/uvvis_biphenyl.csv', mol_id=bp_id, sql=False).data

# Validate data
validate(instance={"_id": bp_id, "uvvis_data": uvvis_data}, schema=schema)

# LOAD UV-Vis data into database
db["molecules"].update_one({"_id": bp_id}, {"$set": {"uvvis_data": uvvis_data}}, upsert=True)

Here we loop through the dictionary of molecule names and molecule IDs recorded above. For each molecule, we generate computational and experimental data, validate the data, and insert it into the database.

In [None]:
# Insert DFT and UV-Vis data for other molecules 

for name, mol_id in extra_mol_ids.items(): 
  # Extract and transform data
  gaussian_data = ProcessDFT('databases_demo/raw_data/tddft_'+name+'.log', mol_id=mol_id, sql=False).data
  uvvis_data = ProcessUvVis('databases_demo/raw_data/uvvis_'+name+'.csv', mol_id=mol_id, sql=False).data
  # Validate 
  validate(instance={"_id": mol_id, "dft_data": gaussian_data, "uvvis_data": uvvis_data}, schema=schema)
  # Load
  db["molecules"].update_one({"_id": mol_id}, {"$set": {"dft_data": gaussian_data, "uvvis_data": uvvis_data}}, upsert=True)

## 4. Query the database

### Basic Queries

Here we demonstrate basic database queries and basic data plotting using [pandas](https://pandas.pydata.org/) and [matplotlib](https://matplotlib.org/). A basic query contains two parts: selection and projection. The selection portion filters the data record(s) (documents for No-SQL) that will be returned. The projection specifies the record attribute(s) (fields for No-SQL) that will be shown. For example, imagine a researcher wants to know the SMILES strings for all molecules in a database that have a molecular weight more than 100 g/mol. The selection would stipulate only data records with a molecular weight greater than 100 g/mol, while the projection would specify the return of the SMILES attribute. Alternatively, the researcher might like to list the lowest-lying excited state energy for every molecule or find and count all molecules with more than ten atoms. Basic queries like this are quick and easy in both SQL and No-SQL databases, even when tens of thousands of molecules are present. 

<img src='https://drive.google.com/uc?export=view&id=1UMVTFtGqeuZUiS5tgIr-GsKOOO6s1KDX' width="700" height="500">
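
Selection and projection can be sketched in plain Python on a list of dictionaries, before involving the database at all. The SMILES strings below are the ones used in this notebook and the molecular weights are standard values; the `_id` values are simplified placeholders, since the real IDs are generated by `GenerateMolInfo`.

```python
# Small in-memory stand-in for a collection (placeholder _id values).
molecules = [
    {"_id": "benzene", "smiles": "C1=CC=CC=C1", "molecular_weight": 78.11},
    {"_id": "nitrobenzene", "smiles": "C1=CC=C(C=C1)[N+](=O)[O-]", "molecular_weight": 123.11},
    {"_id": "anthracene", "smiles": "C1=CC=C2C=C3C=CC=CC3=CC2=C1", "molecular_weight": 178.23},
]

# Selection: keep only documents with a molecular weight above 100 g/mol.
selected = [doc for doc in molecules if doc["molecular_weight"] > 100]

# Projection: return only the _id and smiles fields of each selected document.
projected = [{"_id": doc["_id"], "smiles": doc["smiles"]} for doc in selected]

for row in projected:
    print(row)
```

In the MongoDB calls below, the first argument to `find` plays the role of the selection and the second argument plays the role of the projection. MongoDB includes `_id` in the projected result unless it is explicitly excluded, which is why it appears here as well.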

In [None]:
# Get Molecules data
query = db["molecules"].find({})

# Use Pandas DataFrame package to view the results of your query 
pd.DataFrame(list(query))

In [None]:
# Count the number of molecules in the database
db["molecules"].count_documents({})

In [None]:
# Get molecules with more than 10 atoms
query = db["molecules"].find({"number_of_atoms": { "$gt": 10}})

# Use Pandas DataFrame package to view the results of your query 
pd.DataFrame(list(query))

In [None]:
# Get molecules with greater than 10 atoms, showing only molecule IDs
query = db["molecules"].find({"number_of_atoms": { "$gt": 10}}, {"_id": 1})

# Use Pandas DataFrame package to view the results of your query 
pd.DataFrame(list(query))

In [None]:
# Get all the SMILES strings in the molecules collection where the molecular weight is greater than 100
query = db["molecules"].find({"molecular_weight": {"$gt": 100}}, {"smiles": 1})

# Use Pandas DataFrame package to view the results of your query 
pd.DataFrame(list(query))

In [None]:
# Search for all singlet excitation values in the database
query = db["molecules"].find({}, {"dft_data.first_excitation": 1})
pd.DataFrame(list(query))
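
The dotted field path used in the last query (`dft_data.first_excitation`) names a field inside an embedded document. The helper below is a hypothetical sketch of how such a path can be followed through a nested Python dictionary, for example to pull a single value out of a returned document. It is not part of pymongo, and the document shown uses placeholder values.

```python
def get_by_path(doc, dotted_path, default=None):
    """Follow a dotted field path such as 'dft_data.first_excitation' through nested dicts."""
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default  # path does not exist in this document
        current = current[key]
    return current


# Hypothetical document with one nested attribute.
doc = {"_id": "mol_1", "dft_data": {"first_excitation": 5.0}}  # placeholder values

value = get_by_path(doc, "dft_data.first_excitation")
missing = get_by_path(doc, "uvvis_data.optical_gap")
print(value, missing)
```

Returning a default for missing paths is one way to handle documents that have not yet been given a particular kind of data.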

### Plotting

Here we demonstrate the data analysis examples from the paper: (1) Comparing computationally-estimated singlet excitation and experimentally-measured optical gap and (2) plotting spectrum only when the singlet excitation energy is greater than 4 eV.

In [None]:
# Get the absorption spectrum data for cyclohexen-1-ylbenzene
query = db["molecules"].find({"_id":"cyclohexen-1-ylbenzene"}, {"uvvis_data.absorbance_data": 1})
# Convert data to a Pandas DataFrame for plotting
df = pd.DataFrame(query[0]['uvvis_data']['absorbance_data'])
# Plot data
df.plot(x='wavelength', y='absorbance')

#### EXAMPLE 1: Comparing computationally-estimated singlet excitation and experimentally-measured optical gap

<img src='https://drive.google.com/uc?export=view&id=1x3aewF8CECgWGbpmTEZ7wFrmm54Q43bu' width="400" height="550">

In [None]:
# Gather data
query = db["molecules"].find({}, {"dft_data.first_excitation": 1, 
                                  "uvvis_data.optical_gap": 1})
# Plot data
fig, ax = plt.subplots(figsize=(4,3))
for mol in query: 
  ax.scatter(mol["uvvis_data"]['optical_gap'], mol["dft_data"]['first_excitation'], label=mol['_id'])

# Add plot labels 
plt.legend()
plt.xlabel('Optical Gap (eV)')
plt.ylabel('Singlet Excitation (eV)')
plt.tight_layout()
plt.savefig('plot1.png', dpi=300)

#### EXAMPLE 2: Plotting spectrum for only molecules where the singlet excitation is greater than 4 eV

<img src='https://drive.google.com/uc?export=view&id=1rK6b6dph5siCMWoP_Sz59VDfhB93O2nj' width="400" height="550">

In [None]:
# Search for all singlet excitation values in the database
query = db["molecules"].find({}, {"dft_data.first_excitation": 1})
pd.DataFrame(list(query))

In [None]:
# Get the molecules with a singlet excitation greater than 4 eV
query = db["molecules"].find({"dft_data.first_excitation": {"$gt": 4}})

# Plot absorption spectra for the molecules queried 
fig, ax = plt.subplots(figsize=(4.2,3))
for mol in query: 
  plot_df = pd.DataFrame(mol["uvvis_data"]['absorbance_data'])
  ax.plot(plot_df.wavelength, plot_df.absorbance, label=mol['_id'])

# Add details 
plt.legend()
plt.xlabel('Wavelength (nm)')
plt.ylabel('Absorption')
plt.tight_layout()
plt.savefig('abs2.png', dpi=300)

# !!! Reset Database !!!

In [None]:
client.drop_database('test_db')