# Structure Property visualizer

This is the second of two deliverables for the SiSc-Lab2020 project.

Authors = 

Supervisors: Dr. Jens Bröder, Dr. Daniel Wortmann, Johannes Wasmer, Prof. Dr. Stefan Blügel.

# Instructions by supervisors

## Jens

You have to implement this notebook.

In the end, only Markdown text and output results should be visible when the code cells are hidden (hide_code extension).

That way it can easily be exported into a nice PDF file.

Also, the amount of Python code in this notebook should be minimal.

Instead, import the functions you use from a separate file.

Optional: dump query results into a file from which they are reread for speed, i.e. cache the results.
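
The optional caching step could look roughly like the following sketch. The names `load_or_query` and `query_func` are hypothetical and not part of `helpers.py`; the sketch only assumes that the query returns a pandas `DataFrame` and that JSON with `orient='records'` is an acceptable cache format.

```python
from pathlib import Path

import pandas as pd


def load_or_query(cache_path, query_func):
    """Return the cached DataFrame if the cache file exists, else query and cache.

    `query_func` is a hypothetical zero-argument callable that runs the
    (slow) database query and returns a pandas DataFrame.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Fast path: reread the results of an earlier query from disk
        return pd.read_json(cache_path, orient="records")
    # Slow path: run the query once and store the result for next time
    df = query_func()
    df.to_json(cache_path, orient="records")
    return df
```

Deleting the cache file forces a fresh query on the next call, which is needed whenever the database content has changed.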

______________

An example of what we are ultimately aiming for can be found here:

https://www.materialscloud.org/discover/mofs#mcloudHeader

Clicking on one of the prerendered plots opens an interactive bokeh app.

The code for this can be found here: https://github.com/materialscloud-org/structure-property-visualizer/blob/master/figure/main.py

For your SiSc project this is too much, so do not view this as something you have to deliver, but as something that we will build
for our purposes with the help of your work in this project.

So do not worry about an app.

_________________

**Your tasks are as follows:**

1. Implement a general interactive bokeh scatter plot with linked histograms (see the static version in examples/MP_convergence_scf_clean_150_240.png).

   We will gladly help you with it. This function goes into helpers.py and is imported here for use. It should not contain any aiida methods.
2. Extract the float data you find in certain 'Dict' nodes into one or several pandas objects, which will then be the data source for this notebook.

   This notebook should therefore not depend directly on any aiida methods (beyond load_profile).


## Johannes

Nothing much to add.

I'd just add a bonus: if all this works, one could think about some actual data analysis, like clustering analysis or dimensionality reduction, as we learned in Data Analysis & Visualization. For example, do [PCA](https://blog.exploratory.io/an-introduction-to-principal-component-analysis-pca-with-2018-world-soccer-players-data-810d84a14eab) with the quantities magnetic moment, energy, band gap, Fermi energy, (structure, core levels). I.e., are some of these quantities (linearly) correlated in a data subset (e.g. all output nodes)? Since the data is mostly preprocessed already, this should be relatively easy: just throw the data into [scikit-learn](https://scikit-learn.org/stable/).
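
To make the PCA idea concrete, here is a minimal numpy-only sketch of what such an analysis computes. The function name `pca` and the input data are made up for illustration and do not come from the database; on the real data, scikit-learn's `PCA` class would likely be the more convenient route.

```python
import numpy as np


def pca(data, n_components=2):
    """Project the rows of `data` onto the leading principal components.

    Returns (scores, explained_variance_ratio). Assumes no column is constant.
    """
    X = np.asarray(data, dtype=float)
    # Standardize each column so quantities with different units are comparable
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardized data: the rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    scores = X @ Vt[:n_components].T
    return scores, explained[:n_components]


# Made-up illustration data (NOT from the database): the second column is
# almost a linear function of the first, the third is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
data = np.column_stack([a, 2.0 * a + 0.01 * rng.normal(size=200), rng.normal(size=200)])

scores, ratio = pca(data)
print(ratio)  # fraction of the total variance captured by each component
```

If two components already capture nearly all of the variance of three quantities, as in this constructed example, at least two of the quantities are close to linearly correlated.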

# Imports

In [1]:
# magics:
# # autoreload imports. 
# # intent: if I change something in an imported module, I don't have to restart the kernel. Enable only for development.
%load_ext autoreload
%autoreload 2
# # choose matplotlib backend. backend 'notebook' allows interactive plots if your env allows it.
%matplotlib notebook

In [2]:
# python imports:
from collections import Counter
import time
import numpy as np
import pandas as pd
#from pprint import pprint

In [3]:
# aiida imports:
from aiida import load_profile
profile = load_profile()

# add further imports if needed

In [4]:
# project imports:
import helpers
# from .helpers import bokeh_struc_prop_vis
# from .helpers import generate_structure_property_pandas_source

# equivalent ('.' is the sisc_lab directory):
# from . import helpers
# alternative:
# from .helpers import print_bold
# from .helpers import * ('*' import everything; use of '*' is considered bad style)

In [5]:
# (example:)
helpers.print_bold(f"This notebook/dashboard will visualize the contents from the database of profile {profile.name}")

This notebook/dashboard will visualize the contents from the database of profile generic


# Subtask D2.a: Data acquisition

Task: check which output `Dict` nodes, returned by workflows that had `StructureData` nodes as inputs, exist in the database.

For example, for a successful `FleurScfWorkChain`, there are two return `Dict` nodes, one is linked with
`last_fleur_calc_output` and one with `output_scf_wc_para`.

If a `StructureData` is an input of such a workflow you can extract the formula, `uuid` and other information you need from the 
`StructureData` which is always linked into workflows via the link name `structure`.

All the user should have to write is:
```python
source = generate_structure_property_pandas_source('<workflow_name>')
```


In [6]:
#!verdi plugin list aiida.workflows

In [7]:
#workflow_name = 'fleur_scf_wc'
#workflowdictlst = helpers.get_structure_workflow_dict(workflow_filters={'attributes.process_label':workflow_name})
# or
workflow_name = None
workflowdictlst = helpers.get_structure_workflow_dict()

print(len(workflowdictlst))
workflowdictlst[:2]

98


[{'structure': ['Be2', 'b11cbf68-b4d7-4670-870d-ccd655bc24c3'],
  'workflow': ['fleur_scf_wc', '2ce04141-4b50-4df2-bce7-155bc2126a89'],
  'dict': ['dbb7f256-b633-4e1f-9bb6-bf5b9eaecf3e']},
 {'structure': ['Be2', 'b11cbf68-b4d7-4670-870d-ccd655bc24c3'],
  'workflow': ['fleur_scf_wc', '2ce04141-4b50-4df2-bce7-155bc2126a89'],
  'dict': ['001114f1-6fa4-435a-a00d-d45b00985145']}]
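
The records printed above can be flattened into a pandas table with one row per `Dict` node, roughly as in the following sketch. The function name `records_to_frame` and the column names are hypothetical, and the sketch assumes (from the printed output only) that each `structure` entry is `[formula, uuid]`, each `workflow` entry is `[process label, uuid]`, and each `dict` entry holds a single uuid.

```python
import pandas as pd

# The two records printed above, copied verbatim
records = [
    {'structure': ['Be2', 'b11cbf68-b4d7-4670-870d-ccd655bc24c3'],
     'workflow': ['fleur_scf_wc', '2ce04141-4b50-4df2-bce7-155bc2126a89'],
     'dict': ['dbb7f256-b633-4e1f-9bb6-bf5b9eaecf3e']},
    {'structure': ['Be2', 'b11cbf68-b4d7-4670-870d-ccd655bc24c3'],
     'workflow': ['fleur_scf_wc', '2ce04141-4b50-4df2-bce7-155bc2126a89'],
     'dict': ['001114f1-6fa4-435a-a00d-d45b00985145']},
]


def records_to_frame(records):
    """Flatten structure/workflow/dict records into one row per Dict node."""
    rows = [
        {
            "formula": rec["structure"][0],         # assumed: first entry is the formula
            "structure_uuid": rec["structure"][1],  # assumed: second entry is the uuid
            "workflow": rec["workflow"][0],
            "workflow_uuid": rec["workflow"][1],
            "dict_uuid": rec["dict"][0],
        }
        for rec in records
    ]
    return pd.DataFrame(rows)


lookup = records_to_frame(records)
```

Merging such a table with the `Dict` property table on the uuid column (for instance with `DataFrame.merge`) would give every data point its formula and structure uuid, which is the information the hover tool needs.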

In [8]:
# Some dict attributes: energy, total_energy, fermi_energy, bandgap, charge_density, distance_charge, 
dict_project=['uuid', 'attributes.energy', 'attributes.total_energy', 'attributes.distance_charge',  'attributes.bandgap']
dictpd = helpers.generate_dict_property_pandas_source(workflow_name, dict_project=dict_project)
dictpd.to_json('dict_property.json', orient='records')
dictpd.head()

| | uuid | energy | total_energy | distance_charge | bandgap |
|---|---|---|---|---|---|
| 0 | dbb7f256-b633-4e1f-9bb6-bf5b9eaecf3e | -803.817208 | | | 0.0817051 |
| 1 | 001114f1-6fa4-435a-a00d-d45b00985145 | | -29.539738 | 1.1e-05 | |
| 2 | 7c20771d-9130-4ff0-83c8-a8658417e73c | | -48075.622249 | | {'Be4Ti2': 0.0095235898} |
| 3 | b0154180-34f7-4898-bcfa-2cde3bd56c20 | -53303.114313 | | | 0.00640117 |
| 4 | 76825505-1e7c-4b2f-a244-91d43456f242 | | -1958.853337 | 1.7e-05 | |


In [9]:
structure_project=['uuid', 'extras.formula']
structurepd = helpers.generate_structure_property_pandas_source(workflow_name, structure_project)
structurepd.to_json('structure_property.json', orient='records')
structurepd.head()

| | uuid | formula |
|---|---|---|
| 0 | b11cbf68-b4d7-4670-870d-ccd655bc24c3 | Be2 |
| 1 | b11cbf68-b4d7-4670-870d-ccd655bc24c3 | Be2 |
| 2 | 51b53332-12f2-4014-8c5b-0e3669369d49 | Be4Ti2 |
| 3 | 4bda0721-9202-4b1e-ae35-53d59d63d153 | Be17Ti2 |
| 4 | 4bda0721-9202-4b1e-ae35-53d59d63d153 | Be17Ti2 |


# Subtask D2.b: Interactive plot

Allow the user to choose which properties to plot on which axis.

```python
xdata = source['distance']
ydata = source['energy']
```

A single bokeh scatter plot with histograms on both sides. The hover tool should show 'input structure, formula,
structure_uuid and dictnode uuid', as long as this information is available.

```python
bokeh_struc_prop_vis(xdata, ydata, src=source, **kwargs)
```
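
The histograms on both sides need binned counts of the same data that the scatter plot shows. The following numpy-only sketch prepares those numbers; the function name `histogram_quads` is hypothetical and this is not the actual content of `helpers.py`. The returned keys are chosen to match the `top`, `left` and `right` arguments of bokeh's `quad` glyph, with `bottom` fixed at 0 for a vertical-bar histogram.

```python
import numpy as np


def histogram_quads(values, bins=20):
    """Return bar heights and left/right bin edges for one marginal histogram."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing entries
    counts, edges = np.histogram(values, bins=bins)
    return {"top": counts, "left": edges[:-1], "right": edges[1:]}


# The total_energy column of the table shown earlier, including its missing values
total_energy = [np.nan, -29.539738, -48075.622249, np.nan, -1958.853337]
quads = histogram_quads(total_energy, bins=4)
```

For the histogram beside the y axis the same numbers can be used with the roles of the axes swapped, so that the counts become bar lengths along x and the bin edges run along y.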

In [10]:
# Read the dataset as a DataFrame and drop rows with NaN values
dictall = pd.read_json('dict_property.json', orient='records')
# Chain dropna/reset_index on the selection instead of modifying a slice in place
# (in-place changes on a slice trigger pandas' SettingWithCopyWarning)
dictdata = dictall[['total_energy', 'distance_charge']].dropna(axis=0, how='any').reset_index(drop=True)

xdata = dictdata['total_energy']
ydata = dictdata['distance_charge']
#print(xdata, '\n')
#print(ydata)

In [11]:
from helpers import bokeh_struc_prop_vis
bokeh_struc_prop_vis(xdata, ydata)