# AiiDA COF data query


## AiiDA profile setup
The first step is to install the required AiiDA packages on your machine and then set up a profile. Installation and profile setup guide: [https://aiida.readthedocs.io/projects/aiida-core/en/latest/intro/install_system.html#intro-get-started-system-wide-install](https://aiida.readthedocs.io/projects/aiida-core/en/latest/intro/install_system.html#intro-get-started-system-wide-install)

## Downloading the data
The COF database can be downloaded from the Materials Cloud Archive: [https://archive.materialscloud.org/record/2021.100](https://archive.materialscloud.org/record/2021.100). Check the bottom of the webpage to make sure that you are on the latest version. The data to download is the .aiida file in the Files section in the middle of the page.

Once you have downloaded the .aiida file, you must import the archive into your profile. To do so, run this command from the command line (with your AiiDA virtual environment activated):

$ verdi archive import /path/to/filename.aiida

The import can take a long time, especially if you do not have a solid-state drive. It also takes up about 40 GB of space, so make sure you have enough free disk space.
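Before starting the import, it can help to confirm that the target drive has room. The following is a minimal sketch using only the Python standard library. The 40 GB threshold is the rough estimate quoted above, and the path `"."` is a placeholder for wherever your AiiDA database and file repository live (an assumption; adjust it to your setup). The helper name `has_enough_space` is made up for illustration.

```python
import shutil

# ~40 GB, the rough footprint of the imported archive quoted above
REQUIRED_BYTES = 40 * 10**9


def has_enough_space(path=".", required=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has at least `required` free bytes."""
    free = shutil.disk_usage(path).free
    return free >= required


# "." is a placeholder; point this at the drive that stores your AiiDA data
print("Enough free space:", has_enough_space("."))
```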

After the import is complete, we can test whether it worked with a simple query:

In [49]:
from aiida import load_profile
from aiida.orm import Dict, QueryBuilder

load_profile()

qb = QueryBuilder()
qb.append(Dict)
qb.first()

[<Dict: uuid: 00006e52-015c-410e-aebb-4ffacb137c5a (pk: 5)>]

If this returns `None` instead of a node, the data was not imported.

# Querying Data

I will run through a quick example of how to query data, but more information (maybe too much) can be found here [https://aiida.readthedocs.io/projects/aiida-core/en/latest/howto/query.html#how-to-query](https://aiida.readthedocs.io/projects/aiida-core/en/latest/howto/query.html#how-to-query) (basic querying) and here [https://aiida.readthedocs.io/projects/aiida-core/en/latest/topics/database.html](https://aiida.readthedocs.io/projects/aiida-core/en/latest/topics/database.html) (advanced querying).

### Imports needed

In [2]:
# imports
from aiida import load_profile
from aiida.orm import QueryBuilder
from aiida.orm import load_node
from aiida.orm import Node, Group, Data, Dict, CifData
import pandas as pd
import numpy as np
# load AiiDA profile
load_profile()

<aiida.manage.configuration.profile.Profile at 0x7f2412c512e0>

### Query Builder and Groups
The QueryBuilder() is the main tool we will use to fetch data from the database. Once we have a QueryBuilder() object, we can append to it the kinds of nodes from our database that we are interested in retrieving.

In the COF database each Group contains the data and calculation nodes for an individual COF (currently there are 648 Groups for 648 COFs). We will start with a simple query of all COFs (Groups).

In [68]:
# make QueryBuilder() object
qb = QueryBuilder()

# append Group(all COFs)
# filter out core.imports (these are other Groups attached to my profile that have no useful COF information)
qb.append(Group, filters={'type_string': {'!==': 'core.import'}})

#iterate over all nodes (cof groups) in our query
cof_list = []
for x in qb.iterall():
    cof_list.append(x)
cof_list[0:10] # show just the first 10

[[<Group: "discover_curated_cofs/16032N2" [type core], of user daniele.ongari@gmail.com>],
 [<Group: "discover_curated_cofs/20150N2" [type core], of user daniele.ongari@gmail.com>],
 [<Group: "discover_curated_cofs/16142N2" [type core], of user daniele.ongari@gmail.com>],
 [<Group: "discover_curated_cofs/16370N2" [type core], of user daniele.ongari@gmail.com>],
 [<Group: "discover_curated_cofs/21130N2" [type core], of user daniele.ongari@gmail.com>],
 [<Group: "discover_curated_cofs/19142N2" [type core], of user daniele.ongari@gmail.com>],
 [<Group: "discover_curated_cofs/19250N3" [type core], of user daniele.ongari@gmail.com>],
 [<Group: "discover_curated_cofs/20142N2" [type core], of user daniele.ongari@gmail.com>],
 [<Group: "discover_curated_cofs/16410C2" [type core], of user daniele.ongari@gmail.com>],
 [<Group: "discover_curated_cofs/16240C2" [type core], of user daniele.ongari@gmail.com>]]

In [64]:
print(len(cof_list), "total COFs")

648 total COFs


Above we queried for all COFs in the CURATED COFs database. The identifier code of each COF can be read like this:
![](https://raw.githubusercontent.com/danieleongari/CURATED-COFs/master/images/figure1.gif)
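The group labels returned by the query have the form `discover_curated_cofs/<id>`, so the COF identifier can be separated from the prefix with plain string handling. This is a small sketch, assuming every label follows that pattern. The helper name `cof_id_from_label` is made up for illustration. On a real `Group` node the label is available as `group.label`.

```python
def cof_id_from_label(label, prefix="discover_curated_cofs/"):
    """Strip the group prefix and return the bare COF identifier."""
    if not label.startswith(prefix):
        # Assumption: labels that do not follow the pattern are not COF groups
        raise ValueError(f"unexpected group label: {label!r}")
    return label[len(prefix):]


# Sample labels copied from the query output above
labels = ["discover_curated_cofs/16032N2", "discover_curated_cofs/19250N3"]
cof_ids = [cof_id_from_label(label) for label in labels]
print(cof_ids)  # ['16032N2', '19250N3']
```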

Next we will explore some of the properties that we might want to query for a COF. Then we will iterate over every COF and collect the properties into a DataFrame.

## Querying for .cif file name
This is not really a COF property, but you might want it, and it's a good first example.

First we query all COFs and give the query a 'group' tag for later reference.

In the next append we query all CifData nodes that are members of the groups matched above (with_group='group').

Finally, the project='attributes' argument specifies that we want to return the attributes of the CifData nodes. Other things we could project from these nodes include 'description', 'id', 'extras', and 'label'. The best way I could find to see what can be projected is to pass an invalid value (project="invalid"); the resulting error lists all values that can be projected.

In [53]:
qb = QueryBuilder()
qb.append(Group, tag='group')  # tag group gives us a reference to use later
qb.append(CifData, project='attributes', with_group='group')   # all possible attributes we could query on
qb.limit(3) # just want .cif for first 3 COFs
qb.all()


[[{'md5': '4daedb3510d03703d51342d2713f6a29',
   'filename': '20610N2.cif',
   'formulae': [None],
   'scan_type': 'standard',
   'parse_policy': 'eager',
   'spacegroup_numbers': [None]}],
 [{'md5': '571897abbb7701c08cf087e326ac99ab',
   'filename': 'tmpDog3zJ.cif',
   'formulae': None,
   'scan_type': 'flex',
   'parse_policy': 'lazy',
   'spacegroup_numbers': None}],
 [{'md5': 'db94a7abf5c4e44e232a70b980e86c29',
   'filename': '21052N2.cif',
   'formulae': [None],
   'scan_type': 'standard',
   'parse_policy': 'eager',
   'spacegroup_numbers': [None]}]]

For some reason this returns two CIF files for the first COF (20610N2.cif and tmpDog3zJ.cif). We only want one, so we add the filter {'extras.tag4': 'orig_cif'}.

We also only want the file name, so we use project='attributes.filename'.

In [54]:
# use the 'extras.tag4': 'orig_cif' filter to get just one set of cif data per COF.
# get the cif filename with project='attributes.filename'.

qb = QueryBuilder()
qb.append(Group, tag='group')  
qb.append(CifData, project='attributes.filename', with_group='group', filters={'extras.tag4': 'orig_cif'})
qb.limit(3)
qb.all()

[['20610N2.cif'], ['21052N2.cif'], ['20473N2.cif']]

This returns the file names for three COFs.

These files can be found here: https://github.com/danieleongari/CURATED-COFs/tree/master/cifs
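Each row returned by `qb.all()` is itself a list with one entry per projection, which is why the file names come back wrapped in single-element lists. Below is a minimal sketch of flattening such a result in plain Python, using the output of the query as sample data. AiiDA's `qb.all(flat=True)` should give the same flat list directly, though it is worth checking that against your installed version.

```python
from itertools import chain
from pathlib import Path

# Sample rows copied from the query output: one list per row, one entry per projection
rows = [['20610N2.cif'], ['21052N2.cif'], ['20473N2.cif']]

# Flatten the single-element rows into a flat list of file names
filenames = list(chain.from_iterable(rows))

# Drop the .cif extension to recover the COF identifiers
cof_ids = [Path(name).stem for name in filenames]

print(filenames)
print(cof_ids)
```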

## querying for COF property data

The .cif file name is useful to have in our DataFrame in case someone wants to look more closely at the files, but the properties themselves are more important. Below we show how to query for the COF properties we want. It is very similar to what we just did, with a few more filters.

In [66]:
qb = QueryBuilder()
qb.append(Group, tag='group')

# extras.tag4 labels what kind of data a node holds, so it can be used to filter nodes
# for example, we can use it to select only the O2 isotherm data for every COF
qb.append(Node, with_group='group', project='extras.tag4') 
tag4 = qb.all()

# remove duplicates 
tag4list = []
for element in tag4:
    tag4list.append(element[0])
tag4_no_dup = list(dict.fromkeys(tag4list))

print(tag4_no_dup)

# list of all possible properties

['appl_ch4storage', 'orig_cif', 'orig_zeopp', 'opt_cif_ddec', 'isot_o2', 'isot_n2', 'appl_pecoal', 'kh_h2o', 'isot_ch4', 'appl_h2sh2osel', 'appl_o2storage', 'kh_kr', 'dftopt', 'opt_zeopp', 'kh_h2s', 'appl_h2storage', 'kh_xe', 'isotmt_h2', 'appl_xekrsel', 'isot_co2', 'appl_peng']
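With 21 tags in the list, grouping them by the part before the first underscore makes it easier to see which families of data exist. The sketch below works on the printed list only. Reading the prefixes as categories (for example `isot` for isotherms or `kh` for Henry coefficients) is an inference from the names, not something confirmed by the database itself.

```python
from collections import defaultdict

# Tag list copied from the output above
tags = ['appl_ch4storage', 'orig_cif', 'orig_zeopp', 'opt_cif_ddec', 'isot_o2',
        'isot_n2', 'appl_pecoal', 'kh_h2o', 'isot_ch4', 'appl_h2sh2osel',
        'appl_o2storage', 'kh_kr', 'dftopt', 'opt_zeopp', 'kh_h2s',
        'appl_h2storage', 'kh_xe', 'isotmt_h2', 'appl_xekrsel', 'isot_co2',
        'appl_peng']

# Group the tags by the text before the first underscore
grouped = defaultdict(list)
for tag in tags:
    prefix = tag.split('_', 1)[0]
    grouped[prefix].append(tag)
tag_groups = dict(grouped)

for prefix, members in tag_groups.items():
    print(f"{prefix}: {members}")
```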


Say we want to gather H2 isotherm data. We use the filter 'extras.tag4': 'isotmt_h2' and project='attributes':

In [67]:
qb = QueryBuilder()
qb.append(Group, tag='group')
qb.append(Node, with_group='group', project='attributes',
         filters={'extras.tag4': 'isotmt_h2', 'attributes.is_porous': True})
qb.first()

# this should return all of the h2 isotherm data for the first COF.

[{'Density': 0.529438,
  'Input_ha': 'DEF',
  'POAV_A^3': 5640.19,
  'isotherm': [{'pressure': [1.0, 5.0, 25, 50, 75, 100],
    'pressure_unit': 'bar',
    'loading_absolute_dev': [0.05443631561662,
     0.11904642877507,
     0.14654898125614,
     0.26874135955225,
     0.15836664597714,
     0.12787611781091],
    'loading_absolute_unit': 'mol/kg',
    'loading_absolute_average': [8.3756258759282,
     16.257456585865,
     27.086870821482,
     32.843406010691,
     36.557411232671,
     38.954840526852],
    'enthalpy_of_adsorption_dev': [0.27347804577919,
     0.42885499654952,
     0.32330705145793,
     0.04489093977867,
     0.2423067604533,
     0.26538314501176],
    'enthalpy_of_adsorption_unit': 'kJ/mol',
    'enthalpy_of_adsorption_average': [-4.9238424251495,
     -4.0401085031553,
     -2.8370616846674,
     -2.8232868557365,
     -2.8302521762989,
     -2.8983591496013]},
   {'pressure': [1.0, 5.0, 25, 50, 75, 100],
    'pressure_unit': 'bar',
    'loading_absolute_dev
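In the attributes shown above, the `pressure` list and the `loading_absolute_average` list line up by position, so a loading value can be looked up by its pressure rather than by a hard-coded index. This is a minimal sketch using the first entry of the `isotherm` list from the output as sample data. The helper name `loading_at_pressure` is illustrative, not part of AiiDA.

```python
def loading_at_pressure(isotherm, pressure):
    """Return the average absolute loading at the given pressure point."""
    # list.index raises ValueError if the pressure was not simulated
    idx = isotherm['pressure'].index(pressure)
    return isotherm['loading_absolute_average'][idx]


# Values copied from the first isotherm in the output above
isotherm = {
    'pressure': [1.0, 5.0, 25, 50, 75, 100],
    'pressure_unit': 'bar',
    'loading_absolute_unit': 'mol/kg',
    'loading_absolute_average': [8.3756258759282, 16.257456585865,
                                 27.086870821482, 32.843406010691,
                                 36.557411232671, 38.954840526852],
}

print(loading_at_pressure(isotherm, 5.0), isotherm['loading_absolute_unit'])
```

Looking the value up by pressure makes explicit that index 1 of the loading list corresponds to 5 bar.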

## Querying properties for all COFs

Now we have all the QueryBuilder tools required to put together a query for a property of every COF in the database. For the next two examples we will query the O2 isotherm data and the CO2 Henry coefficient data for all COFs in the database.

### O2 at 5 bar (example)

In [39]:
# collect one single-row DataFrame per COF and combine them after the loop
frames = []

missing = 0  # counts COFs with no usable result (avoids shadowing the built-in sum)

# get a list of uuids for all cofs (used to iterate over every cof one at a time)
group_qb = QueryBuilder()
group_qb.append(Group, tag='group', project='uuid', filters={'type_string': {'!==': 'core.import'}})

# iterate over all COFs using uuid.
for cof_uuid in group_qb.iterall():
    # for each uuid make a new query on that cof
    qb = QueryBuilder()
    qb.append(Group, tag='group', filters={'uuid': cof_uuid[0]})
    
    # get (project) the .cif filename
    qb.append(CifData, with_group='group', project='attributes.filename', filters={'extras.tag4': 'orig_cif'})
    
    # get (project) the O2 isotherm data
    qb.append(Node, with_group='group', project='attributes', filters={
        'extras.tag4': 'isot_o2',
        'attributes.is_porous': True})
    
    res = qb.all()
    try:
        # res[0][0] has the file name.
        # res[0][1]['isotherm']['loading_absolute_average'][1] has the o2_5bar data.
        d = {'cof': [res[0][0]], 'o2_5bar': [res[0][1]['isotherm']['loading_absolute_average'][1]]}
        frames.append(pd.DataFrame(data=d))
    except (IndexError, KeyError, TypeError):
        # the query returned no row, or the expected keys are absent
        missing += 1

# DataFrame.append was removed in pandas 2.0, so combine the rows with pd.concat
o2_df = pd.concat(frames) if frames else pd.DataFrame({'cof': [], 'o2_5bar': []})
print(missing, "missing cofs")

57


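57 COFs have no o2_5bar value. The loop counts a COF as missing whenever the query returns no row (for example, when no node passes the 'attributes.is_porous': True filter) or the expected keys are absent from the attributes.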
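To find out why a given COF is skipped, the extraction step can be pulled into a helper that reports what went wrong instead of only counting a failure. This is a sketch under the assumption that `res` has the shape used in the loop above (a list of `[filename, attributes]` rows). The helper name and the status strings are illustrative, and the first loading value in the sample data is a placeholder.

```python
def extract_o2_5bar(res):
    """Return (filename, loading, status) for one COF's query result."""
    if not res:
        # the query matched nothing, e.g. no node passed the filters
        return None, None, 'no matching nodes'
    filename, attributes = res[0]
    try:
        loading = attributes['isotherm']['loading_absolute_average'][1]
    except (KeyError, IndexError, TypeError) as exc:
        return filename, None, f'unexpected attributes: {exc!r}'
    return filename, loading, 'ok'


# Sample result: 1.254577 is taken from the table in this section, 0.3 is a placeholder
good = [['16032N2.cif', {'isotherm': {'loading_absolute_average': [0.3, 1.254577]}}]]
no_isotherm = [['16032N2.cif', {'Density': 0.529438}]]

print(extract_o2_5bar(good))
print(extract_o2_5bar([]))
print(extract_o2_5bar(no_isotherm))
```

Collecting the status strings for all COFs would show how many are missing because of the filters and how many because of unexpected data.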

In [34]:
o2_df

| | cof | o2_5bar |
|---|---|---|
| 0 | 16032N2.cif | 1.254577 |
| 0 | 20150N2.cif | 1.382063 |
| 0 | 16142N2.cif | 1.476718 |
| 0 | 16370N2.cif | 2.175336 |
| 0 | 21130N2.cif | 1.416540 |
| ... | ... | ... |
| 0 | 20050N2.cif | 1.275172 |
| 0 | 14080N2.cif | 1.215057 |
| 0 | 20120N2.cif | 2.030808 |
| 0 | 21271N3.cif | 2.145694 |


### CO2 Henry coefficient

In [44]:
# collect one single-row DataFrame per COF and combine them after the loop
frames = []

missing = 0  # counts COFs with no usable result (avoids shadowing the built-in sum)

group_qb = QueryBuilder()
group_qb.append(Group, tag='group', project='uuid', filters={'type_string': {'!==': 'core.import'}})

# iterate over all COFs using uuid
for cof_uuid in group_qb.iterall():
    # for each uuid make a new query
    qb = QueryBuilder()
    qb.append(Group, tag='group', filters={'uuid': cof_uuid[0]})
    
    # get the .cif filename
    qb.append(CifData, with_group='group', project='attributes.filename', filters={'extras.tag4': 'orig_cif'})
    
    # get the co2 isotherm data by filtering nodes with 'isot_co2'
    qb.append(Node, with_group='group', project='attributes', filters={
        'extras.tag4': 'isot_co2',
        'attributes.is_porous': True})
    
    res = qb.all()
    try:
        d = {'cof': [res[0][0]], 'co2_henry': [res[0][1]['henry_coefficient_average']]}
        frames.append(pd.DataFrame(data=d))
    except (IndexError, KeyError, TypeError):
        # the query returned no row, or the expected keys are absent
        missing += 1

# DataFrame.append was removed in pandas 2.0, so combine the rows with pd.concat
co2_df = pd.concat(frames) if frames else pd.DataFrame({'cof': [], 'co2_henry': []})
print(missing, "missing cofs")

56


In [45]:
co2_df

| | cof | co2_henry |
|---|---|---|
| 0 | 16032N2.cif | 0.000022 |
| 0 | 20150N2.cif | 0.000019 |
| 0 | 16142N2.cif | 0.000009 |
| 0 | 16370N2.cif | 0.000092 |
| 0 | 21130N2.cif | 0.000017 |
| ... | ... | ... |
| 0 | 20050N2.cif | 0.000011 |
| 0 | 14080N2.cif | 0.000006 |
| 0 | 20120N2.cif | 0.000051 |
| 0 | 21271N3.cif | 0.000072 |
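Since both tables are keyed by the `cof` file name, they can be combined into one record per COF. Below is a sketch in plain Python using a few values from the two tables. One O2 value is deliberately left out here to show how a missing property is handled. With pandas, `o2_df.merge(co2_df, on='cof', how='outer')` should produce the equivalent table.

```python
# A few values copied from the two tables above
# (the O2 value for 16142N2.cif is omitted on purpose to illustrate a gap)
o2_5bar = {'16032N2.cif': 1.254577, '20150N2.cif': 1.382063}
co2_henry = {'16032N2.cif': 0.000022, '20150N2.cif': 0.000019, '16142N2.cif': 0.000009}

# Outer join on the file name: every COF seen in either table gets one record,
# with None standing in for a property that is missing
records = [
    {'cof': cof, 'o2_5bar': o2_5bar.get(cof), 'co2_henry': co2_henry.get(cof)}
    for cof in sorted(set(o2_5bar) | set(co2_henry))
]

for record in records:
    print(record)
```

A list of records like this can be passed straight to `pd.DataFrame(records)` if a DataFrame is needed afterwards.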
