# Structure Property Visualizer

This is the second of two deliverables for the SiSc-Lab2020 project.

Authors = 

Supervisors: Dr. Jens Bröder, Dr. Daniel Wortmann, Johannes Wasmer, Prof. Dr. Stefan Blügel.

# Instructions by supervisors

## Jens

You have to implement this notebook.

In the end, only Markdown text and output results should be visible when the code cells are hidden (hide_code extension).

That way, the notebook can easily be exported into a nice PDF file.

Also, the amount of Python code in this notebook should be minimal.

Instead, import the functions you use from a separate file.

Optional: dump query results into a file from which they are reread for speed, i.e. cache the results.
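
A minimal sketch of such a cache, assuming the query result is a pandas `DataFrame` and that a JSON file is an acceptable storage format. The names `load_or_query` and `run_query` are hypothetical and not part of `helpers.py`.

```python
from pathlib import Path

import pandas as pd


def load_or_query(cache_file, run_query):
    """Return the cached DataFrame if cache_file exists, else query and cache.

    `run_query` is any zero-argument callable that returns a DataFrame
    (hypothetical stand-in for the actual database query).
    """
    path = Path(cache_file)
    if path.exists():
        # convert_dates=False: by default pandas tries to parse columns whose
        # names end in '_time' (e.g. 'total_wall_time') as dates.
        # dtype=False: keep strings such as short uuids as strings.
        return pd.read_json(path, orient="records",
                            convert_dates=False, dtype=False)
    df = run_query()
    df.to_json(path, orient="records")
    return df
```

The second call with the same file name skips the query entirely, so a stale cache file has to be deleted by hand when the database changes.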

______________

An example of what we are ultimately aiming for can be found here:

https://www.materialscloud.org/discover/mofs#mcloudHeader

Clicking on one of the prerendered plots opens an interactive bokeh app.

The code for this can be found here: https://github.com/materialscloud-org/structure-property-visualizer/blob/master/figure/main.py

For your SiSc project this is too much, so do not view it as something you have to deliver, but as something that we will build for our purposes
with the help of your work in this project.

So do not worry about an app.

_________________

**Your tasks are as follows:**

1. Implement a general interactive bokeh scatter plot with linked histograms (see the static version in examples/MP_convergence_scf_clean_150_240.png).

   We will gladly help you with it. This function goes into helpers.py and is imported here for use. It should not contain any aiida methods.
2. Extract the float data you find in certain `Dict` nodes into one or several pandas objects, which then serve as the data source for this notebook.

   This notebook should therefore not depend directly on any aiida methods (beyond load_profile).
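
As a rough illustration of task 2, the sketch below turns plain Python dicts into a pandas `DataFrame` and keeps only the numeric entries. The dicts stand in for the attributes of `Dict` nodes (values copied from the attribute listing further down); the function name `float_columns` is hypothetical, and no aiida call is involved.

```python
import pandas as pd


def float_columns(records):
    """Build a DataFrame from dicts, keeping only int/float values as floats."""
    rows = []
    for rec in records:
        rows.append({
            key: float(value)
            for key, value in rec.items()
            # bool is a subclass of int, so exclude it explicitly
            if isinstance(value, (int, float)) and not isinstance(value, bool)
        })
    return pd.DataFrame(rows)


# Stand-ins for the attributes of two Dict nodes
records = [
    {'workflow_version': '0.4.2', 'total_energy': -971.2916432694,
     'force_largest': 0.0, 'distance_charge': None, 'total_wall_time': 176,
     'total_energy_units': 'Htr'},
    {'workflow_version': '0.2.2', 'force': 1.241e-06,
     'energy': -15784.56376617, 'energy_units': 'eV'},
]
df = float_columns(records)
```

Keys that a record lacks end up as `NaN` in the corresponding column, which is convenient for filtering before plotting.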


## Johannes

Nothing much to add.

I would just add a bonus: if all this works, one could think about some actual data analysis, such as cluster analysis or dimensionality reduction, as we learned in Data Analysis & Visualization. For example, do [PCA](https://blog.exploratory.io/an-introduction-to-principal-component-analysis-pca-with-2018-world-soccer-players-data-810d84a14eab) with the quantities magnetic moment, energy, band gap, Fermi energy, (structure, core levels). I.e., are some of these quantities (linearly) correlated in a data subset (e.g. all output nodes)? Since the data is already mostly preprocessed, this should be relatively easy: just throw the data into [scikit-learn](https://scikit-learn.org/stable/).
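
To make the PCA idea concrete, here is a small sketch. It uses a plain NumPy SVD instead of scikit-learn so that it runs without extra dependencies, and it operates on synthetic random numbers, not on data from the database. The column names only mirror the quantities mentioned above, and the correlation between `energy` and `fermi_energy` is made up for the demonstration.

```python
import numpy as np


def pca(X, n_components=2):
    """PCA via SVD of the standardized data matrix (rows = samples)."""
    X = np.asarray(X, dtype=float)
    # Standardize so that quantities with different units are comparable
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)      # explained variance ratio
    components = Vt[:n_components]       # principal axes, one per row
    scores = Z @ components.T            # data projected onto those axes
    return scores, components, explained[:n_components]


# Synthetic data: two strongly correlated columns and one independent column
rng = np.random.default_rng(0)
energy = rng.normal(size=200)
fermi_energy = 0.9 * energy + 0.1 * rng.normal(size=200)
bandgap = rng.normal(size=200)
X = np.column_stack([energy, fermi_energy, bandgap])

scores, components, explained = pca(X)
```

If two quantities are linearly correlated, the first component carries a disproportionate share of the variance, which is what `explained` reports.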

# Imports

In [1]:
# magics:
# # autoreload imports.
# # intent: if I change something in an imported module, I don't have to restart the kernel. Enable only for development.
%load_ext autoreload
%autoreload 2
# # choose matplotlib backend. backend 'notebook' allows interactive plots if your env allows it.
%matplotlib notebook

In [2]:
# python imports:
from collections import Counter
import time
import numpy as np
import pandas as pd
#from pprint import pprint

In [8]:
# aiida imports:
from aiida import load_profile
profile = load_profile()

# add further imports if necessary

In [4]:
# project imports:
import helpers
# from .helpers import bokeh_struc_prop_vis
# from .helpers import generate_structure_property_pandas_source

# equivalent ('.' is the sisc_lab directory):
# from . import helpers
# alternative:
# from .helpers import print_bold
# from .helpers import * ('*' import everything; use of '*' is considered bad style)

In [10]:
# (example:)
helpers.print_bold(f"This notebook/dashboard will visualize the contents from the database of profile {profile.name}")

This notebook/dashboard will visualize the contents from the database of profile seconddb


# Subtask D2.a: Data acquisition

Task: check which output `Dict` nodes exist in the database that were returned by workflows with `StructureData` nodes as inputs.

For example, for a successful `FleurScfWorkChain`, there are two return `Dict` nodes, one is linked with
`last_fleur_calc_output` and one with `output_scf_wc_para`.

If a `StructureData` is an input of such a workflow, you can extract the formula, `uuid` and other information you need from the
`StructureData`, which is always linked into workflows via the link name `structure`.

All the user should have to write is:
```python
source = generate_structure_property_pandas_source('<workflow_name>')
```


### Check workflows and versions

In [25]:
# Preprocessing: Set formula attributes for all the structure nodes
helpers.set_structure_formula()

In [30]:
# workflow_name = 'fleur_scf_wc' # Filter workflow
# workflowdictlst = helpers.get_structure_workflow_dict(workflow_filters={'attributes.process_label':workflow_name})
#or
workflow_name = None # No restriction. Querying by default
workflowdictlst = helpers.get_structure_workflow_dict(timing=True, check_version=True)

print("Number of workflows: ", len(workflowdictlst), '\n')
print("Workflows: ")
workflowdictlst[:2]

Elapsed time:  1.8527133464813232 s

Versions and frequency:
 [('0.4.2', 149), ('0.2.2', 130), ('AiiDA Fleur Parser v0.3.0', 100), ('AiiDA Fleur Parser v0.3.1', 41), ('AiiDA Fleur Parser v0.3.2', 3), ('0.3.0', 2)] 

Number of workflows:  425 

Workflows: 


[{'structure': ['0ccacebb-861b-4909-8bc1-83d3187bf56b', 'Al4'],
  'workflow': ['82d8046e-9bea-4d92-8f90-02832e5bc565', 'FleurScfWorkChain'],
  'dict': ['f453f49c-da1c-42f0-a9ba-e4806ef5fd2b', '0.4.2', None]},
 {'structure': ['0ccacebb-861b-4909-8bc1-83d3187bf56b', 'Al4'],
  'workflow': ['82d8046e-9bea-4d92-8f90-02832e5bc565', 'FleurScfWorkChain'],
  'dict': ['939a487f-e06a-4a0b-af71-37e8ee333d34',
   None,
   'AiiDA Fleur Parser v0.3.0']}]
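
A version tally like the one printed above could be computed along the following lines. This is a sketch that only assumes the record layout shown in the output (`'dict': [uuid, workflow_version, parser_info]`); the function name `count_versions` is hypothetical and the actual implementation in `helpers.py` may differ.

```python
from collections import Counter

# The two records shown in the output above
records = [
    {'structure': ['0ccacebb-861b-4909-8bc1-83d3187bf56b', 'Al4'],
     'workflow': ['82d8046e-9bea-4d92-8f90-02832e5bc565', 'FleurScfWorkChain'],
     'dict': ['f453f49c-da1c-42f0-a9ba-e4806ef5fd2b', '0.4.2', None]},
    {'structure': ['0ccacebb-861b-4909-8bc1-83d3187bf56b', 'Al4'],
     'workflow': ['82d8046e-9bea-4d92-8f90-02832e5bc565', 'FleurScfWorkChain'],
     'dict': ['939a487f-e06a-4a0b-af71-37e8ee333d34', None,
              'AiiDA Fleur Parser v0.3.0']},
]


def count_versions(records):
    """Tally the version label of each Dict entry, most frequent first."""
    versions = []
    for rec in records:
        _uuid, workflow_version, parser_info = rec['dict']
        # A Dict node carries either a workflow version or a parser info string
        version = workflow_version if workflow_version is not None else parser_info
        if version is not None:
            versions.append(version)
    return Counter(versions).most_common()


version_counts = count_versions(records)
```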

### Check attributes

In [None]:
# dict_project=['uuid','attributes'] # Attributes of dict nodes
# workflowdictlst = helpers.get_structure_workflow_dict(dict_project=dict_project)
# workflowdictlst[:20]
#or
# structure_project=['uuid', 'extras','attributes.kinds'] # Attributes of structure nodes
# workflowdictlst = helpers.get_structure_workflow_dict(structure_project=structure_project)
# workflowdictlst[:20]

Available `Dict` node attributes for the different workflow and parser versions:


1. workflow 0.4.2
- 'workflow_version': '0.4.2',
- 'total_energy': -971.2916432694,
- 'force_largest': 0.0,
- 'distance_charge': None,
- 'total_wall_time': 176,
- 'total_energy_units': 'Htr',
- 'distance_charge_units': 'me/bohr^3',
- 'total_wall_time_units': 's'

2. Parser
- 'parser_info': 'AiiDA Fleur Parser v0.3.0',
- 'energy': -26430.191843004,
- 'bandgap': 0.0177798418,
- 'walltime': 176,
- 'energy_units': 'eV',
- 'fermi_energy': 0.2778502713,
- 'bandgap_units': 'eV',
- 'energy_hartree': -971.2916432694,
- 'walltime_units': 'seconds',
- 'fermi_energy_units': 'Htr',
- 'energy_hartree_units': 'Htr'

3. workflow 0.2.2
- 'workflow_version': '0.2.2',
- 'force': 1.241e-06,
- 'energy': -15784.56376617,
- 'energy_units': 'eV'
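
Note that the three variants report the energy under different keys and in different units (`total_energy` in Htr, `energy` in eV). If they are to share one plot axis, they could be harmonized as sketched below. The function name `energy_in_ev` is hypothetical, and the conversion factor is the CODATA 2018 Hartree energy in eV, which may differ in the last digits from the constant used by the parser.

```python
import pandas as pd

HARTREE_TO_EV = 27.211386245988  # CODATA 2018 value


def energy_in_ev(df):
    """Return one energy Series in eV, preferring an existing eV column.

    Assumes 'energy' is given in eV and 'total_energy' in Htr, as the
    units entries in the listing above indicate.
    """
    ev = pd.Series(float('nan'), index=df.index, dtype=float)
    if 'energy' in df.columns:
        ev = ev.fillna(df['energy'].astype(float))
    if 'total_energy' in df.columns:
        ev = ev.fillna(df['total_energy'].astype(float) * HARTREE_TO_EV)
    return ev


# One row per variant, values taken from the listing above
df = pd.DataFrame([
    {'total_energy': -971.2916432694},   # workflow 0.4.2, Htr
    {'energy': -26430.191843004},        # parser, eV
    {'energy': -15784.56376617},         # workflow 0.2.2, eV
])
ev = energy_in_ev(df)
```

The same caution applies to `fermi_energy` (Htr) versus `bandgap` (eV) in the parser output.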

### Structure nodes

In [23]:
structure_project=['uuid', 'extras.formula']
structurenodes = helpers.generate_structure_property_pandas_source(
            workflow_name, 
            structure_project=structure_project,
            filename='structure_property.json')
structurenodes.head()

Unnamed: 0,structure_uuid,formula
0,0ccacebb,Al4
1,0ccacebb,Al4
2,0ccacebb,Al4
3,0ccacebb,Al4
4,2c639ddf,Fe2


### Dict nodes

In [14]:
# Dict nodes with workflow_version=0.4.2
dict_project_wf042=['uuid', 'attributes.workflow_version', 'attributes.total_energy',
                    'attributes.total_energy_units', 'attributes.distance_charge',
                    'attributes.distance_charge_units', 'attributes.total_wall_time',
                    'attributes.total_wall_time_units']
dictnodes_wf042 = helpers.generate_dict_property_pandas_source(
        workflow_name, 
        dict_project=dict_project_wf042, 
        filename='dict_property_workflow042.json')
dictnodes_wf042.head()

Unnamed: 0,dict_uuid,workflow_version,total_energy,total_energy_units,distance_charge,distance_charge_units,total_wall_time,total_wall_time_units
0,d8cab742,0.4.2,-2545.579023,Htr,4.9e-05,me/bohr^3,15.0,s
1,98265865,,,,,,,
2,93bbe25d,0.4.2,-971.290635,Htr,,me/bohr^3,7.0,s
3,1e42c371,,,,,,,
4,5bc3c4d9,0.2.2,,,,,,


In [20]:
# Dict nodes from the parser (any version)
dict_project_parser=['uuid', 'attributes.parser_info', 'attributes.energy', 'attributes.energy_units', 
                     'attributes.fermi_energy', 'attributes.fermi_energy_units', 'attributes.energy_hartree', 
                     'attributes.energy_hartree_units', 'attributes.bandgap', 'attributes.bandgap_units', 
                     'attributes.walltime', 'attributes.walltime_units']
dictnodes_parser = helpers.generate_dict_property_pandas_source(
        workflow_name, 
        dict_project=dict_project_parser, 
        filename='dict_property_parser.json')
dictnodes_parser.head()

Unnamed: 0,dict_uuid,parser_info,energy,energy_units,fermi_energy,fermi_energy_units,energy_hartree,energy_hartree_units,bandgap,bandgap_units,walltime,walltime_units
0,d8cab742,,,,,,,,,,,
1,98265865,AiiDA Fleur Parser v0.3.1,-69268.734001,eV,0.353498,Htr,-2545.579023,Htr,0.007221,eV,15.0,seconds
2,93bbe25d,,,,,,,,,,,
3,1e42c371,AiiDA Fleur Parser v0.3.0,-26430.16442,eV,0.299326,Htr,-971.290635,Htr,0.163845,eV,7.0,seconds
4,5bc3c4d9,,-26430.080872,eV,,,,,,,,


### Combine the two kinds of nodes

In [21]:
# Combined nodes with workflow_version=0.4.2
combinednodes_wf042 = helpers.generate_combination_property_pandas_source(
        workflow_name, 
        dict_project=dict_project_wf042, 
        structure_project=structure_project,
        filename='combined_property_wf042.json')
combinednodes_wf042.head()

Unnamed: 0,dict_uuid,workflow_version,total_energy,total_energy_units,distance_charge,distance_charge_units,total_wall_time,total_wall_time_units,structure_uuid,formula
0,d8cab742,0.4.2,-2545.579023,Htr,4.9e-05,me/bohr^3,15.0,s,02e6640d,
1,98265865,,,,,,,,02e6640d,
2,93bbe25d,0.4.2,-971.290635,Htr,,me/bohr^3,7.0,s,03bc06be,
3,1e42c371,,,,,,,,03bc06be,
4,5bc3c4d9,0.2.2,,,,,,,03bc06be,


In [22]:
# Combined nodes from the parser (any version)
combinednodes_parser = helpers.generate_combination_property_pandas_source(
        workflow_name, 
        dict_project=dict_project_parser, 
        structure_project=structure_project,
        filename='combined_property_parser.json')
combinednodes_parser.head()

Unnamed: 0,dict_uuid,parser_info,energy,energy_units,fermi_energy,fermi_energy_units,energy_hartree,energy_hartree_units,bandgap,bandgap_units,walltime,walltime_units,structure_uuid,formula
0,d8cab742,,,,,,,,,,,,02e6640d,
1,98265865,AiiDA Fleur Parser v0.3.1,-69268.734001,eV,0.353498,Htr,-2545.579023,Htr,0.007221,eV,15.0,seconds,02e6640d,
2,93bbe25d,,,,,,,,,,,,03bc06be,
3,1e42c371,AiiDA Fleur Parser v0.3.0,-26430.16442,eV,0.299326,Htr,-971.290635,Htr,0.163845,eV,7.0,seconds,03bc06be,
4,5bc3c4d9,,-26430.080872,eV,,,,,,,,,03bc06be,
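
For readers unfamiliar with combining tables in pandas, the following sketch shows one generic way to attach structure information to dict rows with `DataFrame.merge`. It is not the implementation of `helpers.generate_combination_property_pandas_source`; the join key `workflow_uuid` and all values are made up for illustration.

```python
import pandas as pd

# Made-up example tables; 'workflow_uuid' is a hypothetical join key
dict_df = pd.DataFrame({
    'workflow_uuid': ['wf-a', 'wf-a', 'wf-b'],
    'dict_uuid': ['d1', 'd2', 'd3'],
    'energy': [None, -1.0, -2.0],
})
structure_df = pd.DataFrame({
    'workflow_uuid': ['wf-a', 'wf-b'],
    'structure_uuid': ['s1', 's2'],
    'formula': ['Al4', 'Fe2'],
})

# Left join: keep every dict row and add the matching structure columns.
# validate='many_to_one' raises if a workflow had more than one structure row.
combined = dict_df.merge(structure_df, on='workflow_uuid',
                         how='left', validate='many_to_one')
```

A left join keeps dict rows even when no structure matches, which would leave an empty `formula` for those rows.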


# Subtask D2.b: Interactive plot

Allow the user to choose which properties to plot on which axis.

```python
xdata = source['distance']
ydata = source['energy']
```

A single bokeh scatter plot with histograms on both sides. The hover tool should show 'input structure, formula,
structure_uuid and dictnode uuid', as long as this information is available.

```python
bokeh_struc_prop_vis(xdata, ydata, src=source, **kwargs)
```
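
The side histograms need bar coordinates for each axis. The sketch below computes them with NumPy only; the resulting arrays have the shape expected by bokeh's `quad` glyph (`left`, `right`, `top`, `bottom`), but no bokeh call is made here. The function name `histogram_quads` is hypothetical and not taken from `helpers.py`.

```python
import numpy as np


def histogram_quads(values, bins=20):
    """Return bar coordinates of a histogram, ignoring NaN entries."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]
    counts, edges = np.histogram(values, bins=bins)
    return {
        'left': edges[:-1],               # lower edge of each bin
        'right': edges[1:],               # upper edge of each bin
        'top': counts,                    # bar height = number of points
        'bottom': np.zeros_like(counts),  # bars start at zero
    }


quads = histogram_quads([0.0, 1.0, 1.0, 2.0, float('nan')], bins=2)
```

For the histogram along the y axis, the roles are swapped: the bin edges go to `bottom`/`top` and the counts to `right`, with `left` at zero.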

### Check data source before plotting

In [8]:
df = helpers.read_json_file('combined_property_parser.json')
df

Unnamed: 0,dict_uuid,parser_info,energy,energy_units,fermi_energy,fermi_energy_units,energy_hartree,energy_hartree_units,bandgap,bandgap_units,walltime,walltime_units,structure_uuid,formula
0,f453f49c,,,,,,,,,,,,0ccacebb,Al4
1,939a487f,AiiDA Fleur Parser v0.3.0,-26430.191843,eV,0.277850,Htr,-971.291643,Htr,0.017780,eV,176.0,seconds,0ccacebb,Al4
2,fd402121,,-26430.120856,eV,,,,,,,,,0ccacebb,Al4
3,fd402121,,-26430.120856,eV,,,,,,,,,0ccacebb,Al4
4,1fde0902,AiiDA Fleur Parser v0.3.0,-69269.571275,eV,0.351294,Htr,-2545.609813,Htr,0.005679,eV,96.0,seconds,2c639ddf,Fe2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
420,54ffd65a,,-15784.558154,eV,,,,,,,,,fb3d7bd9,Si2
421,7c6a9ff9,,-15784.559902,eV,,,,,,,,,fcb8cd9d,Si2
422,89a75332,,,,,,,,,,,,fcb8cd9d,Si2
423,e77ebc42,AiiDA Fleur Parser v0.3.0,-15784.723533,eV,0.177246,Htr,-580.077895,Htr,1.094787,eV,36.0,seconds,fcb8cd9d,Si2


In [9]:
filtered_df, xdata, ydata = helpers.filter_missing_value(df,'energy', 'fermi_energy')
filtered_df

Unnamed: 0,dict_uuid,parser_info,energy,energy_units,fermi_energy,fermi_energy_units,energy_hartree,energy_hartree_units,bandgap,bandgap_units,walltime,walltime_units,structure_uuid,formula
0,939a487f,AiiDA Fleur Parser v0.3.0,-26430.191843,eV,0.277850,Htr,-971.291643,Htr,1.777984e-02,eV,176.0,seconds,0ccacebb,Al4
1,1fde0902,AiiDA Fleur Parser v0.3.0,-69269.571275,eV,0.351294,Htr,-2545.609813,Htr,5.679005e-03,eV,96.0,seconds,2c639ddf,Fe2
2,36e8e184,AiiDA Fleur Parser v0.3.0,-69269.511590,eV,0.410214,Htr,-2545.607620,Htr,2.286816e-03,eV,8.0,seconds,2e6d2ce2,Fe2
3,c7187212,AiiDA Fleur Parser v0.3.1,-69268.734001,eV,0.353498,Htr,-2545.579023,Htr,7.221355e-03,eV,16.0,seconds,3a6a57f6,Fe2
4,73261351,AiiDA Fleur Parser v0.3.0,-26430.121907,eV,0.255820,Htr,-971.289065,Htr,1.011752e-02,eV,205.0,seconds,48475c16,Al4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
139,e6ecbc0f,AiiDA Fleur Parser v0.3.1,-69268.638764,eV,0.399115,Htr,-2545.575523,Htr,1.464338e-02,eV,2.0,seconds,f540a37f,Fe2
140,bba5ea0a,AiiDA Fleur Parser v0.3.0,-26430.091615,eV,0.334367,Htr,-971.287960,Htr,1.140000e-08,eV,1895.0,seconds,f67d62d7,Al4
141,9142422f,AiiDA Fleur Parser v0.3.0,-69269.472305,eV,0.359201,Htr,-2545.606176,Htr,9.668811e-04,eV,74.0,seconds,f999a276,Fe2
142,984358bd,AiiDA Fleur Parser v0.3.0,-15784.731259,eV,0.193149,Htr,-580.078179,Htr,9.299812e-01,eV,23.0,seconds,fb3d7bd9,Si2
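
Judging from the output above, `helpers.filter_missing_value` drops the rows in which either of the two chosen columns is missing and renumbers the index. A minimal pandas sketch of that behaviour could look as follows; the function name `drop_missing` is hypothetical and the real helper may do more.

```python
import pandas as pd


def drop_missing(df, xcol, ycol):
    """Drop rows with a missing value in xcol or ycol and renumber the index."""
    filtered = df.dropna(subset=[xcol, ycol]).reset_index(drop=True)
    return filtered, filtered[xcol], filtered[ycol]


# First three rows of the table above, reduced to the relevant columns
df = pd.DataFrame({
    'dict_uuid': ['f453f49c', '939a487f', 'fd402121'],
    'energy': [None, -26430.191843, -26430.120856],
    'fermi_energy': [None, 0.277850, None],
})
filtered, xdata, ydata = drop_missing(df, 'energy', 'fermi_energy')
```

Only the row that has both an `energy` and a `fermi_energy` survives, which matches the drop from 424 to 143 rows seen above.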


### Interactive plot by Bokeh

In [10]:
# Workflow_version=0.4.2
helpers.bokeh_struc_prop_vis('combined_property_wf042.json','total_energy', 'distance_charge',"vis_wf042.html")

In [11]:
# Parser
helpers.bokeh_struc_prop_vis('combined_property_parser.json','energy', 'fermi_energy',"vis_parser.html")

### Interactive plot using Bokeh server application

In [24]:
# In the VS Code terminal:
# bokeh serve --show --port 5001 bokehplotting.py