# Example 1: Recompile MolNet

## Objectives

This notebook shows the workflow that compiles the data from one published dataset, the `molecule_net.tar.gz` archive of GEOM, into the molli `.clib` format.

Key references:
1. [Axelrod, S., Gómez-Bombarelli, R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Sci Data 9, 185 (2022).](https://doi.org/10.1038/s41597-022-01288-4)
2. [Axelrod, Simon; Gomez-Bombarelli, Rafael, 2021, "GEOM", Harvard Dataverse, V4; molecule_net.tar.gz [fileName]](https://doi.org/10.7910/DVN/JNGTDF)

## Prerequisites

- `pandas`
- `py3Dmol`

No additional files besides this notebook are required.
However, if you prefer, you can download the `molecule_net.tar.gz` file from the server manually and place it next to this notebook, thereby bypassing the download in Step 1.

## Hardware Specification for Rerun

Desktop workstation with 2x AMD EPYC 7702 (64-core) processors, for a total of 128 physical and 256 logical cores, and 1024 GB of DDR4 memory, running Ubuntu 22.04 LTS.

In [12]:
# Imports required to execute this notebook
import molli as ml
try:
    import ujson as json  # optional, faster drop-in replacement
except ImportError:
    import json
import pickle
from pathlib import Path
from tqdm.notebook import tqdm
import tarfile
ml.visual.configure(bgcolor="white")


## Step 1. Download the `molecule_net.tar.gz` archive

In [13]:
# Definitions of key paths
molnet_targz = Path("molecule_net.tar.gz")
molnet_root = Path("molecule_net")

Download the required molecule_net dataset. This is done *manually* in Python code in this notebook so that the workflow is reproducible on both Windows and Linux.

In [14]:
if not molnet_targz.is_file():
    import requests
    with requests.get("https://dataverse.harvard.edu/api/access/datafile/5858506", stream=True) as rq:
        rq.raise_for_status()
        with open(molnet_targz, "wb") as f:
            for chunk in rq.iter_content(128*1024*1024): # iterate over data in 128 MiB chunks
                f.write(chunk)
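
An interrupted download leaves a partial file that still passes the `is_file()` check above. One way to catch this is to record a checksum after a known-good download and compare it on later runs. The following is a minimal sketch using only the standard library; the helper name `sha256_of_file` is hypothetical, and no reference checksum is given here, so the value to compare against is one you record yourself.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops at the first empty read (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Print the digest only if the archive is already present
archive = Path("molecule_net.tar.gz")
if archive.is_file():
    print(sha256_of_file(archive))
```

Reading in chunks keeps memory use bounded regardless of the archive size.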

In [15]:
if not molnet_root.is_dir():
    with tarfile.open(molnet_targz, "r:gz") as tf:
        tf.extractall()
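
Calling `extractall()` trusts every path stored in the archive. For archives from a source you have not verified, a more defensive variant checks each member path before extracting. The sketch below is an illustration, not part of the original workflow: `safe_extract` is a hypothetical helper, and it only guards against absolute paths and `..` components, not against symbolic links. On Python 3.12 and later, `tf.extractall(filter="data")` offers a built-in alternative.

```python
import tarfile
from pathlib import Path


def safe_extract(archive_path, dest="."):
    """Extract a tar archive after checking that every member stays inside dest."""
    dest_root = Path(dest).resolve()
    with tarfile.open(archive_path, "r:*") as tf:
        for member in tf.getmembers():
            target = (dest_root / member.name).resolve()
            # Reject absolute paths and ".." components that escape dest
            if target != dest_root and dest_root not in target.parents:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        tf.extractall(dest_root)


# Extract only if the archive exists and has not been unpacked yet
archive = Path("molecule_net.tar.gz")
if archive.is_file() and not Path("molecule_net").is_dir():
    safe_extract(archive)
```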

## Step 2. Convert the data to molli `.clib` format

Now that we have the raw data, we will reimport it in the molli format. The advantages of this storage format are:
1. Lightweight file format: the converted data has the same disk footprint as the compressed `.tar.gz` archive.
2. Molecular properties are stored *within* the molecule objects in the `ensemble.attrib` attribute of the `ConformerEnsemble` instance.
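
The conversion cell in this step assembles the attribute dictionary by filtering the unpickled record and merging it with the matching `summary.json` entry. The sketch below reproduces that merge on toy data with plain dictionaries. The keys `charge`, `conformers`, `boltzmannweight`, and `pickle_path` come from the workflow; every other key and all values are hypothetical.

```python
# Toy stand-ins for one unpickled record and its summary.json entry.
# "totalconfs", "source", and all values are hypothetical.
pkl = {
    "charge": 0,
    "conformers": [{"boltzmannweight": 0.7}, {"boltzmannweight": 0.3}],
    "totalconfs": 2,
    "source": "pickle",
}
entry = {"pickle_path": "example/mol.pickle", "source": "summary"}

# Drop the fields that are passed to the ensemble separately, then merge.
# With the | operator (Python 3.9+), the right-hand dict wins on shared keys.
attrib = {k: v for k, v in pkl.items() if k not in {"charge", "conformers"}} | entry

print(attrib)
```

Because `entry` is on the right-hand side of `|`, a key present in both sources takes its value from the summary entry.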

In [16]:
from molli.external.rdkit import from_rdmol
library = ml.ConformerLibrary("molnet.clib", overwrite=False, readonly=False)

with open(molnet_root / "summary.json", "rt") as f:
    summary = json.load(f)

with library.writing():
    for smi, entry in tqdm(
        summary.items(),
        total=len(summary),
        desc="Importing molecule_net rdkit molecular data",
    ):
        pkl_path = Path(entry["pickle_path"])

        # In lieu of better naming for the files, we opted to use the
        # pickle file names. This is not required, and the user may choose
        # their own naming scheme.
        name = pkl_path.stem
        if not name:
            continue

        # Guard: skip entries that already exist in the destination library.
        if name in library.keys():
            continue

        with open(molnet_root / pkl_path, "rb") as f:
            pkl = pickle.load(f)

        charge = pkl["charge"]
        
        # Each rdkit molecule conformer is now converted into a molli.chem.Molecule instance
        conformers = [from_rdmol(c["rd_mol"]) for c in pkl["conformers"]]

        weights = [c["boltzmannweight"] for c in pkl["conformers"]]

        pkl_attrib = {
            k: v for k, v in pkl.items() if k not in {"charge", "conformers"}
        } | entry

        ensemble = ml.ConformerEnsemble(
            conformers, name=name, charge=charge, weights=weights, attrib=pkl_attrib
        )

        # This step writes the ensemble into the library file.
        library[name] = ensemble


## Step 3. Enjoy the concise syntax for operating with the molecule objects

In [17]:
!molli stats "m.n_conformers" molnet.clib

count    16865.000000
mean       179.863979
std        442.747245
min          1.000000
25%          8.000000
50%         47.000000
75%        173.000000
max       7461.000000
dtype: float64
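
The fields printed above (count, mean, std, min, quartiles, max) are those of a standard descriptive summary. How `molli stats` computes them is not stated here. As an illustration only, the sketch below computes the same fields with the standard library on a small list of hypothetical conformer counts; the function name `describe` is a local helper, and the `inclusive` quantile method (linear interpolation between data points) is an assumption about the intended convention.

```python
import statistics


def describe(values):
    """Summarize a list of numbers with the same fields as the output above."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }


# Hypothetical conformer counts for five ensembles
stats = describe([1, 2, 3, 4, 5])
print(stats)
```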


In [18]:
# This Jupyter magic displays a given conformer ensemble
%clib_view molnet.clib AAAQFGUYHFJNHI-VGUBEVBKNA-N