# Statistical bird's-eye view of the contents of an AiiDA database

This is the first of two deliverables for the SiSc-Lab2020 project.

Authors = 

Supervisors: Jens Bröder, Dr. Daniel Wortmann, Johannes Wasmer, Prof. Dr. Stefan Blügel.

In [None]:
# Instructions by supervisors

## Jens
a= """
You have to implement this notebook.

In the end only text (markdown) cells and output results of code cells should be seen if one hides the code cells (hide_code extension).

That can easily be exported into a nice PDF file (google it; you will probably find something with `nbconvert`).

Also the amount of python code in this notebook should be minimal.

Rather, export the functions you use to python file(s) and import them here (hide complexity).

Optional: dump query results in a file, from which results will be reread for speed, i.e. cache results.
"""

## Johannes
a = '''
After talking with Jens about it, here are some additional thoughts on the code structure and implementation, for both deliverables.

The **primary** goal is of course that the code should work, produce nice output (and helpful error messages), obviously.

The **secondary** goal is speed. How long do you expect your code to run on a dataset of a given size? Are there multiple paths to a goal, but with differing performance?

You can break the runtime down into several steps: data acquisition, data transformation (or preprocessing), data analysis, data visualization. In this project, we will rename/replace these steps to: **querying, de-/serialization, analysis=visualization**.

**Querying the database.** Performance considerations:
- Performance measurement: use the magics `%time` and `%timeit`.
- Query evaluations: queries (in general) use 'lazy evaluation'.
  - *Query building* methods build the query but do not execute it. These are chainable methods like `append()`, `get_outgoing()`, etc.
  - *Query execution* methods send the query to the database to be evaluated. There are two kinds:
    - non-iterator methods: e.g. `all()`, `first()`, etc. These return a result `list`: all items are loaded into memory.
    - iterator methods: e.g. `iterall()`, `iterdict()`. These return a result `Generator`: only one item at a time is loaded into memory.
    
**De-/serialization**, i.e. writing and reading it to/from a file. *Keep in mind: if you come to the conclusion this is unnecessary, then justify it!* Considerations:
- Necessity: we assume 'yes'. So you need serialization/deserialization routine(s).
- Code design: we recommend to write a serializer that moves *all* data needed from aiida to file (perform query & serialization). Then the visualization methods are decoupled from aiida and load data from that file. Advantages: a) only needs to be called when data in database changed, b) similar queries for different visualizations can be performed only once. One design option is this:
  ```python
  serialize = sisclab.Serializer(profile)
  serialize.to_file(filepath)
  visualize = sisclab.Visualizer(filepath)
  visualize.histogram(cumulative=True, **plot_options)
  # plots histogram
  ```
- Serialization format: there are two practical options (maybe more):
  - `dict`: tree-like. JSON format. One `dict` per file. Choose the key-value structure (nested?) based on the use case. In general, `uuid` is a good key.
  - `pandas.DataFrame`: could be preferable in some cases.
- Serialization location:
  - one file or several files?
  - we recommend to de/serialize from/to `sisclab/data/` folder. It is included in the project's `.gitignore` file, so nothing in it gets committed to/from git (git is for code, not for data; the code generates the data).
- Transformation:
  - if needed, decide where to put needed data transformations (before serialization or after deserialization) to minimize them.
- Deserialization: 
  - a class (as above) might help to define the deserialization format only once for all visualization methods.
  

**Visualization**:
- Prefer `bokeh` to `matplotlib` or other libs wherever possible, unless you have a good justification.
- In `D1`, static plots are okay, interactive plots are a bonus.
- List results (when a plot is overkill) will look nicer in a notebook if they are a `pandas.Series` or a `pandas.DataFrame`.
- Think about function signatures. Can you generalize them to make a nice interface? For example, a signature for SubtaskD1.c might look like this:
  ```python
  def node_type_summary(user_list : list = [], node_basetype : Node = Data,
                        chart_type : bokeh.chart_type = bokeh.pie_chart, plot : bool = True):
    """
    :param user_list: list of users. empty list = all users = default.
    :param node_basetype: subdivides chart into subtypes. Valid base type examples: ProcessNode, CalculationNode, WorkflowNode, Data, ArrayData.
    :param chart_type: bokeh visualization type. pie chart = default.
    :param plot: True: show plot, don't return data. False: don't plot, return data.
    :return: stats: a dictionary {node_subtype : node_count}, insertion-order sorted in descending order.
    :rtype: dict.
    """
  ```
'''

In [None]:
# Imports

In [None]:
# magics:
# # autoreload imports. 
# # intent: if i change sth in import, i don't have to restart kernel. enable only for development.
%load_ext autoreload
%autoreload 2
# # choose matplotlib backend. backend 'notebook' allows interactive plots if your env allows it.
%matplotlib notebook


In [None]:
# python imports:
from collections import Counter
import time
#from pprint import pprint

#%pylab inline
#figuresize=(18, 4)
from math import pi
import pandas as pd
from bokeh.io import output_file,output_notebook, show
from bokeh.layouts import column
from bokeh.palettes import Category20c
from bokeh.plotting import figure
from bokeh.transform import cumsum
from bokeh.models import Legend,LegendItem

# aiida imports:
from aiida import load_profile
profile = load_profile()

# add further imports if necessary
from aiida.orm import QueryBuilder as QB
from aiida.orm import Node, User, CalcJobNode, Computer, Code
from aiida.plugins import DataFactory

from aiida.common.constants import elements as PeriodicTableElements

# project imports:
import helpers
# from aiida_jutools.sisc_lab import helpers
# from aiida_jutools.sisc_lab import HelpersPackage
# equivalent ('.' is the sisc_lab directory):
# from . import helpers
# alternative:
# from .helpers import print_bold
# from .helpers import * ('*' import everything; use of '*' is considered bad style)

In [None]:
output_notebook()

In [None]:
# (example:)
helpers.print_bold(f"This notebook/dashboard will visualize the contents from the database of profile {profile.name}")

# Database overview:
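The supervisors' notes suggest caching query results in a file so that the visualizations do not need to hit the database on every run. The following is a minimal sketch of that idea using only the standard library. The names `cached_rows`, `fake_query` and the cache path are hypothetical and not part of this project or of AiiDA. Datetime values are written as ISO 8601 strings because JSON has no datetime type, so cached rows contain strings where the query returned datetimes.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def cached_rows(cache_path, run_query):
    """Return rows from `cache_path` if it exists, else run the query and cache it.

    `run_query` is any zero-argument callable returning a list of rows
    (a hypothetical stand-in for something like `lambda: q.all()`).
    """
    cache_path = Path(cache_path)
    if not cache_path.exists():
        rows = run_query()
        # datetimes are not JSON types, so they are written as ISO 8601 strings;
        # other non-serializable objects would need their own handling
        cache_path.write_text(json.dumps(rows, default=lambda obj: obj.isoformat()))
    # always return what is stored in the file, so both code paths yield the same types
    return json.loads(cache_path.read_text())


# Demo with made-up rows in place of a real database query
calls = []


def fake_query():
    calls.append(1)
    return [[1, datetime(2020, 5, 1, 12, 0), "data.dict.Dict.", "user@example.com"]]


demo_path = Path(tempfile.mkdtemp()) / "nodes.json"
first = cached_rows(demo_path, fake_query)   # runs the query and writes the file
second = cached_rows(demo_path, fake_query)  # reads the file, query is not run again
```

Deleting the cache file forces a fresh query, which is the step to take whenever the database content has changed.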

In [None]:
# SubtaskD1.a: Node information
#Task:

In [None]:
# query for all nodes
print('Information on nodes in the DB: \n')
now = time.strftime("%c")
print('last executed on {}'.format(now))
q = QB()
q.append(Node, project=['id', 'ctime', 'mtime', 'node_type'], tag='node')
q.append(User, with_node='node', project='email')
# execute the query and time only the execution
t = time.time()
res = q.all()
elapsed = time.time() - t
totalnodes = len(res)
print("Total number of nodes in the database: {} (retrieved in {} s.)".format(totalnodes, elapsed))

## User information:
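Counting nodes per user is a plain frequency count. As a small sketch on made-up rows (the e-mail addresses are illustrative only), `collections.Counter.most_common()` returns the pairs already sorted by count, which avoids building and reversing a list of `(count, email)` tuples by hand.

```python
from collections import Counter

# Made-up (node_id, email) rows standing in for the query result
rows = [
    (1, "alice@example.com"),
    (2, "bob@example.com"),
    (3, "alice@example.com"),
    (4, "alice@example.com"),
]

users = Counter(email for _, email in rows)

# most_common() yields (email, count) pairs, highest count first
lines = ["* {} created {} nodes".format(email, count) for email, count in users.most_common()]
print("Users:")
print("\n".join(lines))
```

For users with equal counts, `most_common()` keeps the order in which they were first seen, whereas sorting `(count, email)` tuples orders ties by e-mail address.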

In [None]:
# SubtaskD1.b: Users
a = '''
Task: print out a list of Users and how many nodes belong to them

for example

```
Users:
- j.broeder@fz-juelich.de created 182 nodes
- tests@aiida.mail created 104 nodes
```
'''

In [None]:
users = Counter([r[4] for r in res])
print("Users:")
for count, email in sorted((v, k) for k, v in users.items())[::-1]:
    print("* {} created {} nodes".format(email, count))

## Node types distribution:
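The two pie charts in this section rest on two small steps: splitting the node type strings into data and process nodes, and converting counts into wedge angles. A sketch of both steps without any plotting follows. The helper names `split_node_types` and `wedge_angles` are hypothetical, and the type strings are illustrative values in the dotted format with a trailing dot that the notebook code assumes.

```python
from math import pi


def split_node_types(type_counts):
    """Split {node_type: count} into ({data class: count}, {process class: count})."""
    data_nodes, process_nodes = {}, {}
    targets = {"data": data_nodes, "process": process_nodes}
    for node_type, count in type_counts.items():
        parts = node_type.rstrip(".").split(".")
        target = targets.get(parts[0])
        if target is not None:
            # the last component is the most specific class name
            target[parts[-1]] = target.get(parts[-1], 0) + count
    return data_nodes, process_nodes


def wedge_angles(counts):
    """Map each label to its share of the full circle, in radians."""
    total = sum(counts.values())
    return {label: value / total * 2 * pi for label, value in counts.items()}


type_counts = {
    "data.dict.Dict.": 40,
    "data.structure.StructureData.": 10,
    "process.calculation.calcjob.CalcJobNode.": 25,
    "process.workflow.workchain.WorkChainNode.": 5,
}
data_nodes, process_nodes = split_node_types(type_counts)
angles = wedge_angles(data_nodes)
```

Summing counts per class name, rather than appending labels to a list, keeps the labels unique even if two type strings end in the same class name.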

In [None]:
# SubtaskD1.c: Node types
a = '''
Task: plot node information in two pie charts.

One showing which data nodes there are (with their lowest class names (node_type)), i.e. Dict, K-pointsData, CifData, FleurinpData, ...

And one chart showing the process nodes (with their lowest class names (process_type)), i.e. CalcjobNodes: FleurCalcjob, FleurinputgenCalcjob, ...

WorkChain nodes: FleurSCFWorkchain, FleurBandDosWorkchain, ...; for calcfunction and workfunction nodes it is fine not to show the lowest class names.
'''

In [None]:
#node types
types = Counter([r[3] for r in res])
print("Node types:")

for count, typestring in sorted((v, k) for k, v in types.items())[::-1]:
    print("* {}: {} nodes".format(typestring, count))

In [None]:
#split data nodes and process nodes
labelst_1,labelst_2=[],[]
sizest_1,sizest_2=[],[]
#labelst = [label.split('.')[0]=='data' for label in types.keys()]
#sizest = [nnodes for nnodes in types.values()]
for k,v in types.items():
    if k.split('.')[0]=='data':
        labelst_1.append(k.split('.')[-2])
        sizest_1.append(v)
    elif k.split('.')[0]=='process':
        labelst_2.append(k.split('.')[-2])
        sizest_2.append(v)
        
#plot data nodes
#output_file("pie.html")
output_notebook()
x = dict(zip(labelst_1,sizest_1)) 
data=pd.DataFrame.from_dict(dict(x),orient='index').reset_index().rename(index=str,columns={0:'value','index':'data_nodes'})
data['angle'] = data['value']/sum(x.values()) * 2*pi
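# Category20c only has palettes for 3 to 20 colors, so slice the largest one (works for 1 to 20 categories)
data['color'] = list(Category20c[20][:len(x)])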
p = figure(plot_height=350, title="Data Nodes", toolbar_location=None,
           tools="hover", tooltips="@data_nodes: @value")
p.wedge(x=0, y=1, radius=0.4,
        start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),
        line_color="white", fill_color='color', legend_field='data_nodes', source=data)
p.axis.axis_label=None
p.axis.visible=False
p.grid.grid_line_color = None 
#show(p)                        

#plot process node

x1 = dict(zip(labelst_2,sizest_2)) 
data=pd.DataFrame.from_dict(dict(x1),orient='index').reset_index().rename(index=str,columns={0:'value','index':'process_nodes'})
data['angle'] = data['value']/sum(list(x1.values())) * 2*pi
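# Category20c only has palettes for 3 to 20 colors, so slice the largest one (works for 1 to 20 categories)
data['color'] = list(Category20c[20][:len(x1)])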
p1 = figure(plot_height=350, title="Process Nodes", toolbar_location=None,
           tools="hover", tooltips="@process_nodes: @value")
p1.wedge(x=0, y=1, radius=0.4,
        start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),
        line_color="white", fill_color='color', legend_field='process_nodes', source=data)
p1.axis.axis_label=None
p1.axis.visible=False
p1.grid.grid_line_color = None 
show(column(p,p1))


## Database time evolution:
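A cumulative line plot needs nothing more than the sorted timestamps on the x axis and a running count on the y axis. The sketch below shows this on made-up creation times. `cumulative_counts` is a hypothetical helper name.

```python
from datetime import datetime


def cumulative_counts(timestamps):
    """Return (sorted timestamps, running node count) for a cumulative line plot."""
    ordered = sorted(timestamps)
    # the i-th timestamp (1-based) is the moment at which i nodes existed
    return ordered, list(range(1, len(ordered) + 1))


# Made-up creation times, deliberately out of order
ctimes = [datetime(2020, 3, 2), datetime(2020, 1, 15), datetime(2020, 2, 1)]
x, y = cumulative_counts(ctimes)
```

Starting the count at 1 makes the last point equal to the total number of nodes. A zero-based index, such as a default `DataFrame` index, ends one below the total.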

In [None]:
# SubtaskD1.d: Histogram
# Task: Cumulative Histogram/ or line plot by ctime & mtime of all nodes over time

In [None]:
# line plot by ctime & mtime

ctimes = sorted(r[1] for r in res)
mtimes = sorted(r[2] for r in res)
num_nodes_integrated = range(len(ctimes))
df = pd.DataFrame({'A':ctimes,"B":mtimes})

#print(df.head())
#df = pd.DataFrame({'A':np.random.randn(100).cumsum(),"B":np.random.randn(100).cumsum()})

#plot multiline
p = figure(plot_width=900, plot_height=300, x_axis_type='datetime')
r=p.multi_line([df['A'], df['B']],  
               [df.index, df.index],   
               color=["firebrick", "navy"],   
               alpha=[0.8, 0.6],     
               line_width=[2,1],     
              )

legend=Legend(items=[
    LegendItem(label="ctime",renderers=[r],index=0),
    LegendItem(label="mtime",renderers=[r],index=1),
])
p.add_layout(legend)
p.xaxis.axis_label = 'Date'
p.yaxis.axis_label = 'Number of nodes'

show(p)

## Codes:

In [None]:
# SubtaskD1.e: Codes
#Task: List Code names, sorted by how many calcjobs were run with each

In [None]:
codes = Code.objects.all()
result = {code.label : len(code.get_outgoing(node_class=CalcJobNode).all_nodes()) for code in codes}
result_df=pd.Series(result).sort_values(ascending=False)
result_df

## Groups:
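Excluding import and export groups comes down to a substring test on each group's type string. The sketch below separates that filtering from the printing so the result can be reused, for example as input to a `pandas.Series`. `FakeGroup`, `group_sizes` and the type strings are made up for illustration and do not reflect the actual AiiDA `Group` API.

```python
from collections import namedtuple

# Hypothetical stand-in for a group: label, type string and number of nodes
FakeGroup = namedtuple("FakeGroup", ["label", "type_string", "size"])


def group_sizes(groups, exclude=()):
    """Return [(label, size)], largest first, skipping groups whose
    type_string contains any of the substrings in `exclude`."""
    kept = [g for g in groups if not any(ex in g.type_string for ex in exclude)]
    return sorted(((g.label, g.size) for g in kept), key=lambda item: item[1], reverse=True)


groups = [
    FakeGroup("my_structures", "core", 12),
    FakeGroup("20200501-import", "core.import", 300),
    FakeGroup("relaxed", "core", 40),
]
table = group_sizes(groups, exclude=["import", "export"])
```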

In [None]:
# SubtaskD1.f: Groups
#Task: List all group names with how many nodes they contain (verdi group list -C) (exclude import and export groups)

In [None]:
def ListGroup(groups: list, exclude: list = None):
    """Print the label and size of every group, skipping excluded group types.

    :param groups: query result rows, each a one-element list holding a Group.
    :param exclude: substrings; a group is skipped if its type_string contains any of them.
    """
    exclude = exclude or []
    print('{:<52}{:6}'.format('Group names:','sizes:'))
    for row in groups:
        group = row[0]
        if any(ex in group.type_string for ex in exclude):
            continue
        ## other available properties: group.user, group.type_string, group.description
        print('{:<50}|{:5}'.format(group.label,len(group.nodes)))

In [None]:
#load_profile()
#!verdi group list --all
from aiida.orm import QueryBuilder
from aiida.orm import load_node, Node, Group, Computer, User, CalcJobNode, Code, StructureData

qb = QueryBuilder()
qb.append(Group)
#print(qb.all())

group = qb.all()
ListGroup(group,exclude=['export','import'])

## Structure Analysis:
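Both structure plots reduce to two numbers per structure: how many atoms it contains and which elements occur in it. The sketch below derives both from formula strings with a small regular-expression parser. It is an assumption-laden illustration and not the implementation of the project's `StrucDataForm`: it handles only flat formulas such as `Fe2O3`, without brackets or fractional counts. `parse_formula` and `structures_per_element` are hypothetical names.

```python
import re
from collections import Counter

# an element symbol (capital letter, optional lowercase letter) followed by an optional count
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Parse a flat formula such as 'Fe2O3' into {'Fe': 2, 'O': 3}."""
    composition = Counter()
    for symbol, number in _TOKEN.findall(formula):
        composition[symbol] += int(number) if number else 1
    return dict(composition)


def structures_per_element(formulas):
    """Count, for each element, how many formulas contain it at least once."""
    counts = Counter()
    for formula in formulas:
        counts.update(parse_formula(formula).keys())
    return dict(counts)


formulas = ["Fe2O3", "Si", "SiO2", "Fe"]
per_element = structures_per_element(formulas)
atoms_per_structure = [sum(parse_formula(f).values()) for f in formulas]
```

Here `per_element` feeds the element bar chart and `atoms_per_structure` feeds the atom-count chart.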

In [None]:
# SubtaskD1.g: Structures
a = '''
Task: Further analyze what structures are in the DB.

Number of StructureData nodes versus how many atoms they contain.

This should be interactive, with a bokeh hover tool showing the structure formula and uuid.

Number of StructureData nodes versus elements as a bokeh bar chart. Since there are over 
100 elements in the periodic table, you can split it over several plots, or just use the charge number as in 
'example/element_content.png', but then make it interactive so that once one hovers 
with the mouse over a bar it tells you what element it is and how many structures contain this element.
'''

In [None]:
from aiida.orm import QueryBuilder
from aiida.orm import load_node, Node, Group, Computer, User, CalcJobNode, Code, StructureData
from HelperPackage import DataProcessing
from DataProcessing.DataVisu import StrucDataForm

def NumStructureNode():
    """Return a pd.DataFrame with the elements and the number of each element, one row per structure."""
    import pandas as pd
    StrucList = []
    qb = QueryBuilder()
    qb.append(StructureData)
    print('number of StructureData Nodes:',qb.count())

    for struc, in qb.all():
        form = struc.get_formula()
        struct = StrucDataForm(form)
        StrucList.append(struct.FormAnalyse())
    return pd.DataFrame(StrucList).fillna(0)

def ShowElements(Data):
    from bokeh.io import output_file, show
    from bokeh.models import ColumnDataSource,HoverTool
    from bokeh.plotting import figure
    from bokeh.io import output_notebook
    from bokeh.palettes import inferno

    output_file("ShowingElements.html")
    #data = NumStructureNode()
    data = Data
    elements = list(data.columns)
    counts = list(data.astype(bool).sum(axis=0))
    #print(counts)
    #print(elements)
    
    source = ColumnDataSource(data=dict(elements=elements, counts=counts,color=inferno(len(elements))))
    
    TOOLTIPS = [
    ("element", "@elements"),
    ("(x,y)", "($x, $y)"),
    ("Number of Structures containing this element", "@counts"),
    ]

    p = figure( y_range=elements,x_range=(0,500), plot_width=800, plot_height=800, title="Elements Counts",tools = [HoverTool(mode='hline')], tooltips=TOOLTIPS)
    #print('step figure done')
    p.hbar(y="elements", right="counts", height=0.5, left=0, color='color',  source=source)
    #print('step hbar done')
    
    output_notebook()
    p.xgrid.grid_line_color = None
    #p.legend = False
    show(p)
    
    
def ShowFormula(Data):
    from bokeh.io import output_file, show
    from bokeh.models import ColumnDataSource,HoverTool
    from bokeh.plotting import figure
    from bokeh.io import output_notebook
    from bokeh.palettes import inferno

    data = Data
    elements = list(data.keys())
    counts = list(len(data[key])/2 for key in data.keys())
    formulas = list(data[key] for key in data.keys())

    length = len(elements)
    source = ColumnDataSource(data=dict(elements=elements, counts=counts,formulas=formulas,color=inferno(length)))
    
    TOOLTIPS = [
    ("Number of Atoms", "@elements"),
    ("(x,y)", "($x, $y)"),
    ("Number of Nodes", "@counts"),
    ("id and formula", "@formulas"),
    ]

    p = figure( y_range=(0,160),x_range=(0,1000), plot_width=800, plot_height=800, title="Atoms Count",tools = [HoverTool(mode='hline')], tooltips=TOOLTIPS)
    #print('step figure done')
    p.hbar(y="elements", right="counts", height=0.5, left=0, color='color',  source=source)
    #print('step hbar done')
    
    output_notebook()
    p.xgrid.grid_line_color = None
    #p.legend = False
    show(p)
    
def GetFormulaDict():
    ## may not be necessary: return a dict mapping each formula to the uuids of the nodes with this formula
    from aiida.orm import QueryBuilder,StructureData
    qb = QueryBuilder()
    qb.append(StructureData)

    Newdict = {}
    for struc, in qb.all():
        # get_formula must be called; comparing the bound method itself never matches a key
        formula = struc.get_formula()
        Newdict.setdefault(formula, []).append(struc.uuid)
    return Newdict

def AtomsNumNodes():
    from aiida.orm import QueryBuilder,StructureData
    import numpy as np
    qb = QueryBuilder()
    qb.append(StructureData)
    StructDatas = qb.all()
    Newdict = {}
    
    for data, in StructDatas:
        CompositionDict = data.get_composition()
        NumAtom = np.sum(list(CompositionDict.values()))
        if NumAtom in Newdict.keys():
            Newdict[NumAtom].append(data.uuid)
            Newdict[NumAtom].append(data.get_formula())
        else:
            Newdict[NumAtom] = [data.uuid,data.get_formula()]
    return Newdict

In [None]:
dic = AtomsNumNodes()
ShowFormula(dic)

In [None]:
data = NumStructureNode()
ShowElements(data)

In [None]:
# initialize the Structure data
from aiida.orm import QueryBuilder
from aiida.orm import load_node, Node, Group, Computer, User, CalcJobNode, Code, StructureData
from HelperPackage import DataProcessing
from DataProcessing.DataVisu import StrucDataForm

qb = QueryBuilder()
qb.append(StructureData)

structures = qb.all()

for structure in structures[:]:
    formula = structure[0].get_formula()
    struct = StrucDataForm(formula)
    structure[0].get_composition()
    print(struct.FormAnalyse())

## Processes:
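A stacked histogram per process type needs a table of counts per type and outcome. The sketch below builds that table from `(label, exit_status)` pairs. It assumes the convention that exit status 0 means success and `None` means the process did not finish. The labels and status values are made up, and `outcome_table` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def outcome_table(records):
    """Aggregate (process_label, exit_status) records into {label: {outcome: count}}.

    exit_status 0 counts as 'finished_ok', None as 'not_finished',
    any other value as 'exit_<status>'.
    """
    table = defaultdict(Counter)
    for label, exit_status in records:
        if exit_status is None:
            outcome = "not_finished"
        elif exit_status == 0:
            outcome = "finished_ok"
        else:
            outcome = "exit_{}".format(exit_status)
        table[label][outcome] += 1
    return {label: dict(counts) for label, counts in table.items()}


# Made-up records standing in for one query over all process nodes
records = [
    ("FleurCalculation", 0),
    ("FleurCalculation", 0),
    ("FleurCalculation", 302),
    ("FleurScfWorkChain", 0),
    ("FleurScfWorkChain", None),
]
table = outcome_table(records)
```

Each inner dictionary corresponds to one stacked bar: the keys are the stack segments and the values are their heights.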

In [None]:
# SubtaskD1.h: Calculations
a = '''
Task: more detailed analysis of Calculations

`print('\n\nMore detailed analysis of Calculations \n')`

List, stacked histogram of Calculation types and the state each ended up in: finished, failed, exit codes, exit messages

More detailed analysis of WorkChains

`print('\n\nMore detailed analysis of WorkChains \n')`

List, stacked histogram for each WorkChain type and the state it ended up in: finished, failed, exit codes, exit messages
'''

In [None]:
def GetCalNode():
    """Collect process state, exit message, pk and a success flag (1/0) for every node in the global CalcNode."""
    exit_state = []
    exit_message = []
    index = []
    exit_state_digit = []
    for node, in CalcNode:
        print(str(node.process_state))
        print(node.pk)
        exit_state.append(str(node.process_state))
        exit_message.append(str(node.exit_message))
        index.append(node.pk)
        # 1 if the calculation finished successfully, else 0
        exit_state_digit.append(1 if node.is_finished_ok else 0)
    return exit_state,exit_message,index,exit_state_digit

def ShowCalNode(exit_state,exit_message,index,exit_state_digit):
    from bokeh.io import output_file, show
    from bokeh.models import ColumnDataSource
    from bokeh.plotting import figure
    from bokeh.io import output_notebook
    from bokeh.palettes import inferno

    output_file("ShowingCal.html")

    index = index
    exit_message = exit_message
    exit_state_string = exit_state
    exit_state_digit = exit_state_digit
      
    source = ColumnDataSource(data=dict(index=index, exit_state_digit=exit_state_digit,exit_message = exit_message,exit_state_string = exit_state_string, color=inferno(len(index))))
    
    TOOLTIPS = [
    ("Exit information", "@exit_message"),
    ("(x,y)", "($x, $y)"),
    ("Node index", "@index"),
    ("Node status", "@exit_state_string"),   
    ]

    p = figure( y_range=(0,1), x_range=(0,3000), plot_width=800, plot_height=800, title="CalcNode Information", tooltips=TOOLTIPS)
    #print('step figure done')
    p.vbar(x="index", top="exit_state_digit", bottom=0, width=1, color='color',  source=source)
    #print('step hbar done')
    
    output_notebook()
    p.xgrid.grid_line_color = None
    #p.legend = False
    show(p)
    
def GetWorkflowNode(Name):
    from aiida.orm import WorkflowNode
    from aiida.orm import QueryBuilder
    qb = QueryBuilder()
    qb.append(Name)
    WNode = qb.all()

    Newdict = {}
    for node, in WNode:
        if node.is_finished_ok:
            Newdict[node.node_type+'_succeed'] = Newdict.get(node.node_type+'_succeed',0) + 1
            Newdict[node.node_type+'_not_succeed'] = Newdict.get(node.node_type+'_not_succeed',0) + 0
        else:
            Newdict[node.node_type+'_not_succeed'] = Newdict.get(node.node_type+'_not_succeed',0) + 1
            Newdict[node.node_type+'_succeed'] = Newdict.get(node.node_type+'_succeed',0) + 0
    return Newdict
    

In [None]:
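# WorkflowNode is needed at notebook level for the GetWorkflowNode(WorkflowNode) call further down
from aiida.orm import CalcJobNode, WorkflowNode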
from aiida.orm import QueryBuilder
qb = QueryBuilder()
qb.append(CalcJobNode)
CalcNode = qb.all()

exit_state,exit_message,index,exit_state_digit = GetCalNode()


In [None]:
ShowCalNode(exit_state,exit_message,index,exit_state_digit)

In [None]:
Newdict1 = GetWorkflowNode(WorkflowNode)
Newdict2 = GetWorkflowNode(CalcJobNode)
print(Newdict1)
print(Newdict2)

In [None]:
def ShowWorkflow(WorkflowDict):
    from bokeh.io import output_file, show
    from bokeh.models import ColumnDataSource,HoverTool
    from bokeh.plotting import figure
    from bokeh.io import output_notebook
    from bokeh.palettes import inferno

    output_file("ShowingCal.html")

    index = list(WorkflowDict.keys())
    counts = list(WorkflowDict.values())
    #exit_message = exit_message
    #exit_state_string = exit_state
    #exit_state_digit = exit_state_digit
      
    source = ColumnDataSource(data=dict(index=index, counts=counts, color=inferno(len(index))))
    
    TOOLTIPS = [
    ("Node number", "@counts"),
    ("(x,y)", "($x, $y)"),
    ("Node status", "@index"),   
    ]
   
    HT = HoverTool(
    tooltips=TOOLTIPS,

    mode='vline'
    )
    
    # pass the configured hover tool instead of creating a second, unconfigured one
    p = figure( y_range=(0,50), x_range=index, plot_width=800, plot_height=800, title="Process Node Information",tools = [HT])
    #print('step figure done')
    p.vbar(x="index", top="counts", bottom=0, width=1, color='color',  source=source)
    #print('step hbar done')
    
    output_notebook()
    p.xgrid.grid_line_color = None
    #p.legend = False
    show(p)

ShowWorkflow(Newdict1)
ShowWorkflow(Newdict2)

# Data provenance health indicators:
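The three health indicators can also be computed with set operations once the links are available as `(source, target)` pairs, which avoids asking each node for its links one at a time. The sketch below uses made-up node ids and links. `link_health` is a hypothetical helper, and how the pairs are fetched from the database is left open.

```python
def link_health(node_ids, links):
    """Count nodes without incoming links, without outgoing links, and without any links.

    `links` is an iterable of (source_id, target_id) pairs.
    """
    links = list(links)  # allow generators, since the pairs are traversed twice
    has_outgoing = {source for source, _ in links}
    has_incoming = {target for _, target in links}
    nodes = set(node_ids)
    no_incoming = nodes - has_incoming
    no_outgoing = nodes - has_outgoing
    return {
        "No_Incoming": len(no_incoming),
        "No_Outgoing": len(no_outgoing),
        "No_In&Out": len(no_incoming & no_outgoing),
    }


# Made-up graph: 1 -> 2 -> 3, 1 -> 3, and two isolated nodes 4 and 5
node_ids = [1, 2, 3, 4, 5]
links = [(1, 2), (2, 3), (1, 3)]
health = link_health(node_ids, links)
```

As in the notebook's counting, isolated nodes are included in all three numbers, so the first two counts each contain the third.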

In [None]:

# SubtaskD1.i: Provenance
#Task: Database and provenance health: display the number of nodes that have no incoming and no outgoing links, no incoming links (any number outgoing), and no outgoing links (any number incoming)

In [None]:
def Count_In_Out():
    from aiida.orm import QueryBuilder
    qb = QueryBuilder()
    qb.append(Node)
    Nodes = qb.all()
    Namelist = ['No_Incoming','No_Outgoing','No_In&Out']

    # start every counter at zero so that all three keys exist even if a count is 0
    Mydict = {name: 0 for name in Namelist}
    for n, in Nodes:
        no_incoming = not n.get_incoming().all_nodes()
        no_outgoing = not n.get_outgoing().all_nodes()
        if no_incoming:
            Mydict[Namelist[0]] += 1
        if no_outgoing:
            Mydict[Namelist[1]] += 1
        if no_incoming and no_outgoing:
            Mydict[Namelist[2]] += 1

    return Mydict

def Show_In_Out(Mydict):
    from bokeh.io import output_file, show
    from bokeh.models import ColumnDataSource,HoverTool
    from bokeh.plotting import figure
    from bokeh.io import output_notebook
    from bokeh.palettes import Category20

    output_file("Show_In_Out.html")

    index = list(Mydict.keys())
    counts = list(Mydict.values())
      
    source = ColumnDataSource(data=dict(index=index, counts=counts, color=Category20[len(index)]))
    
    TOOLTIPS = [
    ("Node number", "@counts"),
    ("(x,y)", "($x, $y)"),
    ("Node status", "@index"),   
    ]
   
    HT = HoverTool(
    tooltips=TOOLTIPS,

    mode='vline'
    )
    
    # pass the configured hover tool instead of creating a second, unconfigured one
    p = figure( y_range=(0,6000), x_range=index, plot_width=800, plot_height=800, title="Node Link Information",tools = [HT])
    #print('step figure done')
    p.vbar(x="index", top="counts", bottom=0, width=1, color='color',  source=source)
    #print('step hbar done')
    
    output_notebook()
    p.xgrid.grid_line_color = None
    #p.legend = False
    show(p)

In [None]:
Mydict = Count_In_Out()
print(Mydict)

In [None]:
Show_In_Out(Mydict)