# Load SiGraph Database to Neo4j

## Core Concepts

### Load a sigraph db xml file
* The compressed (gzip) binary file is read, sent to Neo4j, and parsed with apoc.load.xml, which accepts a compressed file as input
* The XML is parsed to JSON, and from that the following structure is created <br/> (Database)-[:HAS_FILE]->(File)-[:HAS_ALLOCATION]->(Allocation)
* The Allocation has an allocKey composed of dbname + "_" + fileName + "#" + allocation.oid
* The JSON-parsed XML part of the Allocation is saved on the Allocation node
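
As a minimal sketch of how the key is composed, the helper below mirrors the concatenation used in the Cypher of `saveTopStructure` (`src + '_' + file.name + '#' + value.oid`). The function name `make_alloc_key` and the sample oid are hypothetical and only serve as illustration.

```python
def make_alloc_key(dbname: str, file_name: str, oid: str) -> str:
    """Build an allocation key the same way the load query concatenates it."""
    # database name and file name are joined with "_", the oid is appended after "#"
    return f"{dbname}_{file_name}#{oid}"


# hypothetical oid, real-looking database and file names from the load log
example_key = make_alloc_key("R_2016R3", "data_2153.xml.gz", "12345")
print(example_key)
```

Because the key embeds the database and file name, the same oid in two files yields two distinct Allocation nodes.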

### Process Allocation json XML structure
* The Allocation gets the dynClass as a Label if available, otherwise the statClass is used
* All the xml classes and their members below the Allocation xml will be 'flattened' on the Allocation node
* When there are complex member properties then a 'dot' notation is used
  ```
  position.Geo_point.x: 16.00000000
  ```
  This is needed when an Allocation has more than one 'x' value in the member structure
* For successor and predecessor xlink:href references, only the successor is used (the predecessor is skipped) to create the IS_SUCCESSOR relationship
* All other xlink:href references are processed.
    * The XML element name of the element with the xlink:href attribute is used to create the relationship type. When there is a scoped element (local/remote), the element name of the element with the scope attribute is used.
* When a member has no value, its property name is stored in the '_emptyProps' property of the Allocation. Example:
  ```
  _emptyProps: Pobject,Occurrence_Frame_occ.Frame_occ,Attached_graphic_container_Graphic,attached_ud_occs,attached_ud_occs_std_txt,position.Geo_point.Geo_object.Pobject,Conn_point_occ_Conn_segm_occ,Conn_point_occ_Conn_segm_occ.Conn_segm_occ,Pin_occ_Function_pin.Function_pin,El_pin_occ_Singleline_occ.Singleline_occ,El_pin_occ_Legend_occ.Legend_occ
  ```
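
The flattening rules above can be sketched as follows. This is a simplified illustration, not the notebook's `handleMemberLevel`: it ignores xlink:href references and element attributes, and assumes the `_type` / `_children` / `_text` keys that apoc.load.xml produces. The function name `flatten_members` and the sample `y` value are hypothetical.

```python
def flatten_members(children, path=""):
    """Return (props, empty_props) for a list of apoc.load.xml-style elements."""
    props, empty_props = {}, []
    for element in children:
        # extend the dotted path with the current element name
        current = f"{path}.{element['_type']}" if path else element["_type"]
        if "_children" in element:
            # nested structure: recurse and merge the results
            child_props, child_empty = flatten_members(element["_children"], current)
            props.update(child_props)
            empty_props.extend(child_empty)
        elif "_text" in element:
            # leaf with a value becomes a property on the Allocation
            props[current] = element["_text"]
        else:
            # leaf without a value is only remembered by name
            empty_props.append(current)
    return props, empty_props


sample = [
    {"_type": "position", "_children": [
        {"_type": "Geo_point", "_children": [
            {"_type": "x", "_text": "16.00000000"},
            {"_type": "y", "_text": "4.00000000"},
            {"_type": "Pobject"},
        ]},
    ]},
]
props, empty_props = flatten_members(sample)
print(props)
print(empty_props)
```

The dotted path keeps `position.Geo_point.x` distinct from any other `x` member on the same Allocation.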

## Neo4j configuration

### Memory settings
The process is memory intensive, so you need to assign enough memory to the Neo4j process.
The following settings work well (these are the minimum settings; a heap of 16G and a page cache of 2G is better):
```
dbms.memory.heap.initial_size=6G
dbms.memory.heap.max_size=6G
dbms.memory.pagecache.size=2G
```
Note that this is enough to run one database load at a time. To run 4 loads at the same time you need to reserve more memory, and the Neo4j server needs at least 4 CPUs available.

### Transaction log settings
This is for development only. Loading big datasets multiple times causes heavy write traffic on the database. By default the tx log files are retained for 2 days, so with very intensive loads in a dev environment you will end up with many gigabytes of tx log data.

```
# dev only!!!
db.tx_log.rotation.retention_policy=1G size
```
With this setting the tx log on disk will not grow bigger than ~1.5G.
For more information see: https://neo4j.com/docs/operations-manual/current/configuration/configuration-settings/#config_db.tx_log.rotation.retention_policy

In [31]:
# initialising
import os
import sys
# time.perf_counter_ns() is used for the timing measurements further down
import time

sys.path.append("../")

# python driver object

from neo4j import GraphDatabase

# simple utility for firing Cypher statements
from neo4jutils import *

In [43]:
# settings
isDebug = False
isInfo = False
# The uri to connect to the database
dburi = "neo4j://localhost:7687"
dbusr = "neo4j"
dbpwd = "neoneoneo"
# The database name where the data must be loaded to
dbname = "rxml"
cypherOutput = "none"
# to process all the files for a database use
# fileFilter = '.xml.gz'
fileFilter = ".xml.gz"
noValueString = "null"
# when True all the empty member elements become properties with the value "null"
# my advice use False here because the empty elements are now kept in the _emptyProps property on a node
useNoValueString = False


In [44]:
# connect to the database and get the driver object
#
driver = GraphDatabase.driver(dburi, auth=(dbusr,dbpwd ))

In [45]:
# db session
session = driver.session(database=dbname)


In [5]:
# function to do output
def pdebug(msg):
    if isDebug == True:
        print(msg)

def pinfo(msg):
    if isInfo == True:
        print(msg)


# function to read a file and return the binary object
def readXMLFileBinary(path):
    # print("readXMLFile file " + path)
    # read the still-compressed bytes; the context manager closes the file handle
    with open(path, "rb") as bfile:
        xmlgz = bfile.read()
    # print(" compressed length: " + str(len(xmlgz)) )
    return xmlgz

In [6]:
def get_file_name(file_path):
    return file_path.split("/")[-1]

In [7]:
# the storage_interop_services_source directory
sourceDir = "/Users/keesv/work/bClearer/sigraph/xml-data/"
#dbDirNames = ["content_A_E/AE_2016R3"
#             ]
# "R_2016R3",
dbDirNames = ["R_2016R3",
             ]
#              ,"content_A_E/AE_2016R3"
#              ,"content_F_I/F_2016R3"
#              ,"content_F_I/F_SPS_2016R3"
#              ,"content_F_I/G_2016R3"
#              ,"content_F_I/HK_E_2016R3"
#              ,"content_F_I/HK_MSR_2016R3"
#              ,"content_F_I/HR_2016R3"
#              ,"content_F_I/HSA090_2016R3"
#              ,"content_F_I/HSA100_2016R3"
#              ,"content_F_I/HSA250_2016R3"
#              ,"content_F_I/HSA280_2016R3"
#              ,"content_F_I/ICI_2016R3"
#              ,"content_F_I/IH2-E_2016R3"
#              ,"content_F_I/IMS_2016R3"
#              ,"content_F_I/IS_2016R3"
#              ,"content_F_I/I_2016R3"


In [8]:
# creating constraints
# run_cypher(session,"CREATE CONSTRAINT ukSource IF NOT EXISTS FOR (n:Source) REQUIRE n.key IS UNIQUE")
run_cypher(session,"CREATE CONSTRAINT ukAllocation IF NOT EXISTS FOR (n:Allocation) REQUIRE n.allocKey IS UNIQUE")
run_cypher(session,"CREATE CONSTRAINT ukDatabase IF NOT EXISTS FOR (n:Database) REQUIRE n.name IS UNIQUE")
run_cypher(session,"CREATE INDEX indOID IF NOT EXISTS FOR (n:Allocation) ON (n.oid) ")

run_cypher : CREATE CONSTRAINT ukAllocation IF NOT EXISTS FOR (n:Allocation) REQUIRE n.al...
(no changes, no records)
run_cypher : CREATE CONSTRAINT ukDatabase IF NOT EXISTS FOR (n:Database) REQUIRE n.name I...
(no changes, no records)
run_cypher : CREATE INDEX indOID IF NOT EXISTS FOR (n:Allocation) ON (n.oid)...
(no changes, no records)


In [9]:
#
# In this block we initialise the structures to process data back to neo4j
#
cychildren = []
# { "parent": parent, "props": {} }
cyparentProperties = {}
# { "parent": parent, labels:[] }
cyparentLabels = {}
cysuccessorRels = []
cyinstanceRels = []
cyOtherRels = []
cyEmptyProps = {}

def clearCy():
    pinfo("clearing the prepared data for neo4j")
    cychildren.clear()
    cyparentProperties.clear()
    cyparentLabels.clear()
    cysuccessorRels.clear()
    cyinstanceRels.clear()
    cyOtherRels.clear()
    cyEmptyProps.clear()

def addChild(rec):
    # for now simple
    # if we are going to use this on a higher level for multiple parents we need another structure
    # usage   addChild(parent, { "parent": parent, "children" : element['_children'], "order": order , "props": props , "elmType" : elmType }):
    pinfo(f" addChild parent rec : {rec['elmType']}")
    cychildren.append(rec)

def addEmptyProp(parent, aString):
    if aString is not None:
        empProps = []
        if parent in cyEmptyProps:
            empProps = cyEmptyProps[parent]
        if aString not in empProps:
            empProps.append(aString)
        cyEmptyProps[parent] = empProps

def addNodeProps(parent, props):
    # {"parent",{props:props}}
    if props is not None:
        parentProps = {}
        if parent in cyparentProperties:
            parentProps = cyparentProperties[parent]
        for key in props.keys():
            parentProps[key] = props[key]
        cyparentProperties[parent] = parentProps

def addNodeLabel(parent, label):
    if label is not None:
        labels = []
        if parent in cyparentLabels:
            labels = cyparentLabels[parent]
        if label not in labels:
            labels.append(label)
        cyparentLabels[parent] = labels

def addSuccessorRel(rec):
    # (parent,{ "parent": parent, "elm": elm, "index" : index, "ref" : ref})
    # the parent is already in the record
    cysuccessorRels.append(rec)

def addInstanceRel(rec):
    # addInstanceRel(parent,{ "parent": parent, "elm": elm, "relProps" : {"rel_name": scopedElement['rel_name'], "dest_attr": scopedElement['dest_attr'] }, "ref" : ref, "rel_type": rel_type })
    cyinstanceRels.append(rec)

def addOtherRel(rec):
    # addOtherRel(parent,{ "parent": parent, "elm": elm, "ref" : ref, "rel_type": rel_type })
    pinfo(f"addOtherRel {rec}")
    cyOtherRels.append(rec)

def processChilds():
    pdebug(f" processChilds start : {len(cychildren)} ")
    # record { "parent": parent, "children" : element['_children'], "order": order , "props": props , "elmType" : elmType }
    if len(cychildren) > 0:
        childCypher = """
           UNWIND $childrecs as row
           MATCH (parent) WHERE elementId(parent) = row.parent 
           CREATE (parent)-[:HAS_CHILD]->(child {order: row.order})
           SET child += row.props
           SET child:PrcElement:PrcStructure
           WITH child, row
           CALL apoc.create.addLabels(child, [row.elmType]) yield node
           WITH child, row
           CALL apoc.convert.setJsonProperty(child, '_children', row.children) 
           RETURN count(*) as cnt
        """
        run_cypher(session, childCypher, { "childrecs":cychildren },cypherOutput)
    else:
        pdebug(" processChilds no children to add ")

def processParentLabelAndProperties():
    pinfo(" processParentProperties start   ")
    # { parent, "props": {} }
    # cyparentProperties = {}
    # { parent, labels:[] }
    # cyparentLabels = {}
    inputrows = []
    propparents = cyparentProperties.keys()
    emptypropparents = cyEmptyProps.keys()
    labelparents = cyparentLabels.keys()

    for key in propparents:
        labels = []
        if key in labelparents:
            labels = cyparentLabels[key]
        prps = cyparentProperties[key]
        if key in emptypropparents:
            prps["_emptyProps"] = cyEmptyProps[key]
        inputrows.append({"parent": key, "props":prps, "labels": labels} )
    for lkey in labelparents:
        if lkey not in propparents:
            lprps = {}
            if lkey in emptypropparents:
                lprps["_emptyProps"] = cyEmptyProps[lkey]
            inputrows.append({"parent": lkey, "props":lprps, "labels": cyparentLabels[lkey]} )

    # should we use apoc.periodic.commit here? the average amount of PrcElement labels < 10000
    updateParentQuery = """
        UNWIND $inputrows AS row
        MATCH (parent) WHERE elementId(parent) = row.parent
        SET parent += row.props
        WITH parent, row
        CALL apoc.create.addLabels(parent, row.labels) YIELD node
        RETURN count(*) AS cnt
    """
    run_cypher(session, updateParentQuery, {"inputrows" : inputrows}, cypherOutput)

def processSuccessorRels():
    pinfo(f" processSuccessorRels start {len(cysuccessorRels)} to process ")

    if len(cysuccessorRels) > 0:
        query = """
            UNWIND $successorRels as row
            MATCH (a:Allocation { allocKey : row.ref })
            WITH a, row
            MATCH (n) where elementId(n) = row.parent
            WITH a, n, row
            MERGE (n)-[r:IS_SUCCESSOR]->(a)
            SET n.type = row.elm
            ,   n.index = row.index
            SET r+= row.relProps
        """
        run_cypher(session, query, { "successorRels": cysuccessorRels },cypherOutput)

def processInstanceRels():
    pinfo(f" instanceRels start {len(cyinstanceRels)} to process")
    if len(cyinstanceRels) > 0:
        # now process the instanceRels
        query = """
           UNWIND $instanceRels as row
           MATCH (a:Allocation { allocKey : row.ref })
           WITH a, row
           MATCH (n) where elementId(n) = row.parent 
           WITH a, n, row
           CALL apoc.create.relationship(n, row.rel_type, row.relProps , a) yield rel
           RETURN count(rel) AS count;
        """
        parameters = {"instanceRels": cyinstanceRels}
        run_cypher(session, query, parameters,cypherOutput)

def processOtherRels():
    pinfo(f" otherRels start {len(cyOtherRels)} to process")
    pdebug(f" otherRels {cyOtherRels} ")
    if len(cyOtherRels) > 0:
        # now process the otherRels
        query = """
           UNWIND $otherRels as row
           MATCH (a:Allocation { allocKey : row.ref })
           WITH a, row
           MATCH (n) where elementId(n) = row.parent 
           WITH a, n, row
           CALL apoc.create.relationship(n, row.rel_type, row.relProps , a) yield rel
           RETURN count(rel) AS count;
        """
        parameters = {"otherRels": cyOtherRels}
        run_cypher(session, query, parameters, cypherOutput)

def writeToNeo4j(parentIDS):
    pinfo(">>> write to neo in batches ")
    pinfo(f" {len(parentIDS)} PrcElements to process ")
    wtStart = time.perf_counter_ns()
    #
    # remove the PrcElement Label and the _children property from the parent node
    #
    query = """
        UNWIND $parentIds AS pid
        MATCH (parent) WHERE elementId(parent) = pid
        REMOVE parent:PrcElement
        REMOVE parent.`_children`
        RETURN count(*) AS cnt
    """
    run_cypher(session, query, { "parentIds": parentIDS },cypherOutput)
    wt1 = time.perf_counter_ns()
    tremovelabelsduration = ( wt1 - wtStart) // 1000000

    # now we have to update the parent node, e.g. remove the PrcElement label and remove the _children property
    processParentLabelAndProperties()
    wt2 = time.perf_counter_ns()
    tprocespropertiesandlabels = ( wt2 - wt1) // 1000000
    # now process the children
    processChilds()
    wt3 = time.perf_counter_ns()
    tproceschilds = ( wt3 - wt2) // 1000000

    # now process the successorRels
    processSuccessorRels()
    wt4 = time.perf_counter_ns()
    tprocesSucessors = ( wt4 - wt3) // 1000000
    # now process the instanceRels
    processInstanceRels()
    wt5 = time.perf_counter_ns()
    tprocesInstanceRels = ( wt5 - wt4) // 1000000
    # now process the OtherRels
    processOtherRels()
    wt6 = time.perf_counter_ns()
    # keep a separate timer so the instanceRels timing is not overwritten
    tprocesOtherRels = ( wt6 - wt5) // 1000000

    wduration = (time.perf_counter_ns() - wtStart) // 1000000
    pinfo(f"""written to neo4j in {wduration} ms :
      - removeLabels {tremovelabelsduration} ms - parentLabelsAndProps {tprocespropertiesandlabels} ms - processChildren {tproceschilds} ms - tprocesSucessors {tprocesSucessors} ms - tprocesInstanceRels {tprocesInstanceRels} ms - tprocesOtherRels {tprocesOtherRels} ms
    """)
    # now clear the stuff otherwise we get unwanted effects
    clearCy()
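
The comment in `processParentLabelAndProperties` asks whether the `UNWIND` payload should be committed periodically. One client-side alternative, sketched here as an assumption rather than something this notebook does, is to split the prepared rows into fixed-size chunks and send each chunk as its own statement. The helper name `chunked` and the batch size of 10 are hypothetical.

```python
from typing import Iterator, List, Sequence


def chunked(rows: Sequence[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of `rows` with at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield list(rows[start:start + size])


# stand-in rows shaped like the input of the parent update query
rows = [{"parent": f"node-{i}", "props": {}, "labels": []} for i in range(25)]
batch_sizes = [len(batch) for batch in chunked(rows, 10)]
print(batch_sizes)
```

Each batch could then be passed as the `inputrows` parameter of a separate `run_cypher` call, which keeps individual transactions small at the cost of more round trips.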



In [10]:
def saveTopStructure(path, src, dbsess):
    params = {"xml": readXMLFileBinary(path) , "storage_interop_services_source": src, "fileName" : get_file_name(path) }
    query = """
    CALL {
    	CALL apoc.load.xml($xml, '//*[starts-with(name(),"xdm")]|//*[starts-with(name(),"xsm")]', {compression: 'GZIP'}, false ) yield value
    	return "member" as xtype, value 
    	UNION 
        CALL apoc.load.xml($xml, '//*[starts-with(name(),"xsc")]', {compression: 'GZIP'}, false ) yield value 
        return "xsc" as xtype, value      
        UNION 
        CALL apoc.load.xml($xml, '//*[starts-with(name(),"xdc")]', {compression: 'GZIP'}, false ) yield value 
        return "xdc" as xtype, value      
    } WITH distinct xtype, value
      WITH collect( CASE WHEN xtype = "member" THEN value._type ELSE null END ) as members
      ,    collect( distinct CASE WHEN xtype = "xsc" THEN value._type END) as staticClasses	
      ,    collect( distinct CASE WHEN xtype = "xdc" THEN value._type END) as dynamicClasses	
      WITH members, staticClasses, dynamicClasses
      MERGE (db:Database {name: $storage_interop_services_source})
      MERGE (db)-[:HAS_FILE]->(file:File {name: $fileName} )
      SET file.members = members
      ,   file.staticClasses = staticClasses
      ,   file.dynamicClasses = dynamicClasses
      WITH file, $storage_interop_services_source as src
       call apoc.load.xml($xml, '/*/*/allocation', {compression: 'GZIP'}, false) yield value 
       with file, src
       , value.stat_class as stC
       , coalesce(value.dyn_class,"") as dynC 
       , src + '_' + file.name + '#' + value.oid as allocationKey
       , value.oid as oid
       , src + '_' + value.ref as ref
       , value._children as children
       , value
       MERGE (allocation:Allocation { allocKey: allocationKey })
       SET allocation.statClass = stC
         , allocation.dynClass = dynC
         , allocation.oid = oid
         , allocation.storage_interop_services_source = src
       MERGE (file)-[:HAS_ALLOCATION]->(allocation)  
       FOREACH ( y in CASE WHEN ref IS NULL THEN [] ELSE [] END |
          MERGE (rAlloc:Allocation { allocKey: ref })
          MERGE (allocation)-[:HAS_REFERENCE_TO]->(rAlloc)
        )
        WITH allocation, children, value 
        SET allocation:PrcElement:PrcStructure 
        WITH children, allocation, value 
        CALL apoc.convert.setJsonProperty(allocation, '_children', case when children is null then [value] else children end)  
       return count(*) as allocationCount
    """
    run_cypher(dbsess, query, params, "none")

In [11]:
def getKeys(jsonElement):
    keysList = list(jsonElement.keys())
    rrr = []
    for ee in keysList:
        if ee not in skipJSONKeys():
            rrr.append(ee)
    return rrr


In [12]:
def skipJSONKeys():
    return ["origin", "online"]

In [13]:
def getAttributes(element):
    # everything except _type and _children and more
    # we use now the element name as prefix for the property(this can be changed according to Mesbah's wishes)
    # to catch the event if you have multiple elements with the same attributes overwriting each other
    # named attrs to avoid shadowing the built-in dict type
    attrs = {}
    keys = getKeys(element)
    elm = element["_type"]
    for k in keys:
        if k not in ["_type","_children","_text","role","origin","scope"]:
            attrs[elm + "__" + k] = element[k]
    return attrs
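
To make the prefixing rule concrete, here is a small self-contained sketch of the same idea. It is an illustrative re-implementation, not the notebook function: the name `prefixed_attributes` and the constant `SKIPPED_KEYS` are hypothetical, and the skipped keys combine the list in `getAttributes` with those from `skipJSONKeys`. The sample element is the `BS_Float_unit` example quoted in the comments of `handleMember`.

```python
# keys that never become properties (structure keys plus the skipped JSON keys)
SKIPPED_KEYS = {"_type", "_children", "_text", "role", "origin", "scope", "online"}


def prefixed_attributes(element: dict) -> dict:
    """Prefix every remaining attribute with '<element name>__'."""
    prefix = element["_type"]
    return {f"{prefix}__{key}": value
            for key, value in element.items()
            if key not in SKIPPED_KEYS}


element = {
    "_type": "BS_Float_unit",
    "_text": "+0",
    "origin": "0",
    "timestamp": "2013-03-25T14:00:17Z",
    "unit": "V",
    "group": "voltage",
}
attrs = prefixed_attributes(element)
print(attrs)
```

The prefix prevents two different elements that both carry, say, a `unit` attribute from overwriting each other on the same node.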

In [14]:
def hasValues(element, level):
    # check if there are values behind this we look a bit ahead in the xml structure
    # now there are children
    if level == 4:
        return True
    keys = getKeys(element)
    # pdebug(f"\n\nkeys level  {keys}")
    if "_text" in keys:
        return True
    if "_children" in keys:
        # next level
        for c2 in element["_children"]:
            hav = hasValues(c2, level + 1)
            if hav == True:
                return True
    return False

In [15]:
def addChildNode(parent, elmType, element, order=1):
    pdebug(f"addChildNode parent {parent} elmType {elmType} order {order} ")
    if hasValues(element, 1):
        props = getAttributes(element)
        addChild( { "parent": parent, "children" : element["_children"], "order": order , "props": props , "elmType" : elmType })

In [16]:
def getScopedElement(scope, children):
    for cc in children:
        if cc["_type"] == scope:
            return cc

In [17]:
def processNextLevel(parent, file, element, statClassAlloc, dynClassAlloc, level, siblingCount, sibling, source):
    # we now want to gather the properties of the structure or find the references
    # We also use a default value ('null') for a property when there is no value
    # we are only interested in the members (properties) of the structure, we don't need to create a HAS_CHILD structure now

    # when an element is a Class element (static or dynamic) we have to go to the next level to get the Members (properties)
    # do we take only the Class element name as a prefix for a member?

    elmType = element["_type"]
    keysList = getKeys(element)
    # this will be a list of childs

    if isDebug == True:
        print("  ")
        print(f"================= ProcessNextLevel ============={level} ===={elmType}")
        print( f"statClassAlloc: {statClassAlloc}" )
        print( f"dynClassAlloc: {dynClassAlloc}" )
        print(f"elmType {elmType} level {level} ")
        print( f"keysList: {keysList}" )

    isDynamicClass = elmType in file["dynamicClasses"]
    isMember = elmType in file["members"]
    isStaticClass = elmType in file["staticClasses"]
    pdebug(f" processNextLevel2: \n siblingCount/sibling={siblingCount}/{sibling} \nelmType={elmType} \nisStaticClass={isStaticClass} \nisDynamicClass={isDynamicClass} \nisMember={isMember} ")

    #
    # Rule 1 If the current element is not a member just go to the next level
    #
    #
    # Rule 2 If the current element is a Member get the properties or the relationship
    #
    if isMember == True:
        # handle Member
        handleMember(parent, element, source)
    else:
        # this is a class go a level deeper
        siblingc = 0
        if "_children" in keysList:
            for child in element["_children"]:
                siblingc = siblingc + 1
                processNextLevel(parent, file, child, statClassAlloc, dynClassAlloc, level + 1, len(element["_children"]),siblingc, source )
        else:
            handleMember(parent, element, source)

In [18]:
def getRef(aRef):
    # when there is an xpointer then we have to extract the oid differently
    if aRef.find("xpointer") > -1:
        # xpointer found
        splOne = aRef.split("#")
        splTwo = splOne[1].split("'")
        return splOne[0] + "#" + splTwo[1]
    return aRef
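
A short usage sketch of the reference handling follows. The function is copied under the hypothetical name `get_ref` so the block runs on its own, and both href strings are assumed shapes chosen to match the string operations (split on `#`, then on `'`); the actual SiGraph export may format its xpointer expressions differently.

```python
def get_ref(a_ref: str) -> str:
    """Reduce an xpointer-style href to '<file>#<oid>'; pass other hrefs through."""
    if a_ref.find("xpointer") > -1:
        file_part, pointer_part = a_ref.split("#")[0], a_ref.split("#")[1]
        # the oid is the first quoted value inside the xpointer expression
        return file_part + "#" + pointer_part.split("'")[1]
    return a_ref


# assumed href shapes, for illustration only
plain = get_ref("data_1.xml.gz#42")
pointer = get_ref("data_1.xml.gz#xpointer(id('42'))")
print(plain, pointer)
```

Both forms resolve to the same `<file>#<oid>` string, which is what the `allocKey` lookup needs once the database name is prefixed.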

In [19]:
def handleMemberLevel(parent, children, source, path=""):
    pinfo(f" handleMemberLevel path {path}")
    for inst in children:
        pinfo(f" ref element {inst}")
        instKeys = getKeys(inst)
        instType = inst["_type"]
        thepath = path + "." + instType
        if "xlink:href" in instKeys:
            ref = source + "_" + getRef(inst["xlink:href"])
            index = -1
            if "index" in getKeys(inst):
                index = int(inst["index"])
            rel_type = "HAS_" + instType.upper()
            relProps = {}
            relProps["xmlpath"] = thepath
            relProps["index"] = index
            addOtherRel({"loc": "handleMemberLevel " + path, "parent": parent, "elm": instType, "ref" : ref, "rel_type": rel_type, "relProps": relProps })
        elif "_children" in instKeys:
            handleMemberLevel(parent,inst["_children"], source, thepath)
        elif "_text" in instKeys:
            # there is a value
            props = getAttributes(inst)
            props[thepath] = inst["_text"]
            addNodeProps(parent, props)
        else:
            # no value?
            addEmptyProp(parent,thepath)
            if useNoValueString == True:
                props = getAttributes(inst)
                props[thepath] = noValueString
                addNodeProps(parent, props)


In [20]:
def handleMember(parent, element, source):
    # with the use of static standard classes more than one property can be
    # generated
    # example <xsc:BS_Float_unit origin="0" timestamp="2013-03-25T14:00:17Z" unit="V" group="voltage">+0</xsc:BS_Float_unit>
    # the properties unit and group will be added as an extra property
    # elm + '_' + unit : value
    props = {}
    elementKeys = getKeys(element)
    elm = element["_type"]
    # we now determine what kind of member this is
    # when there are no _children and no _text
    # we skip this member
    if "_children" not in elementKeys and "_text" not in elementKeys:
        if "UID" in elementKeys:
            props = getAttributes(element)
            addNodeProps(parent, props)

        else:
            pdebug(f"handleMember no children or value for {elm} ")
            # add a null here...
            addEmptyProp(parent,elm)
            if useNoValueString == True:
                # pass the element dict, not its name (a str has no keys())
                props = getAttributes(element)
                props[elm] = noValueString
                addNodeProps(parent, props)
        return
    #
    # if there is a text value then this will be added to the parent
    #
    if "_text" in elementKeys:
        # there is a value, we take the element name as property _text as value
        # when there are additional attribute values these will also be added TODO
        pdebug(f" value found for {element}")
        props = getAttributes(element)
        props[elm] = element["_text"]
        addNodeProps(parent, props)
        return

    #
    # Now we have children elements
    # there can be two scenarios (where we are interested in)
    # role="successor"
    # Question: there are more roles than successor and predecessor that has a xlink:href, what to do with them?
    if "role" in elementKeys:
        if element["role"] == "successor":
            # bingo a successor role possibly a relationship IS_SUCCESSOR can be created to the xlink:href referenced class
            for inst in element["_children"]:
                pdebug(f" ref element {inst}")
                ref = source + "_" + inst["xlink:href"]
                index = int(inst["index"])
                relProps = {}
                relProps["element"] = inst["_type"]
                relProps["parentElement"]= element["_type"]
                addSuccessorRel({ "parent": parent, "elm": elm, "index" : index, "ref" : ref, "relProps" : relProps})
            return
        if element["role"] != "predecessor":
            # bingo an "other" role possibly a relationship HAS_PARENTELEM can be created to the xlink:href referenced class
            handleMemberLevel(parent, element["_children"], source, element["_type"])
            return
        if element["role"] == "predecessor":
            # Skip because role=predecessor
            pinfo(f" this role (predecessor) is skipped {element['role']}")
            return
        pinfo(f" this role is needs attention XXX {element['role']}")
            #addEmptyProp(parent,elm)
            #if useNoValueString == True:
            #    props = getAttributes(elm)
            #    props[elm] = noValueString
            #    addNodeProps(parent, props)
    #
    # Handle now the scoped elementKeys, rel_name is connected to the scope
    if "scope" in elementKeys:
        scope = element["scope"]
        # get the scoped element local or remote
        scopedElement = getScopedElement(scope, element["_children"])
        # check now if there is a value
        valueElement = scopedElement["_children"][0]
        objectType = valueElement["_type"]
        pdebug(f" scopedElement {scopedElement} ")
        pdebug(f" valueElement {valueElement} \n objectType {objectType} ")
        if "_text" in getKeys(valueElement):
            memberValue = valueElement["_text"]
            valueKeys = getKeys(valueElement)
            pdebug(f" memberValue {memberValue} ")
            if objectType == "BS_Float_unit":
                props[elm] = float(memberValue)
                if "unit" in valueKeys:
                    props[elm + "_unit"] = valueElement["unit"]
                if "group" in valueKeys:
                    props[elm + "_group"] = valueElement["group"]
            elif objectType == "BS_Int32_unit":
                props[elm] = int(memberValue)
                if "unit" in valueKeys:
                    props[elm + "_unit"] = valueElement["unit"]
                if "group" in valueKeys:
                    props[elm + "_group"] = valueElement["group"]
            else:
                props[elm] = memberValue

            addNodeProps(parent, props)
            return
        if "rel_name" in getKeys(scopedElement):
            if objectType == "BS_Instance" and "xlink:href" in getKeys(valueElement):
                # there is a HAS_INSTANCE_RELATION to another object
                # the attributes of the parent element (remote) will be on the relationship
                # for instance: rel_name="CS_Loop_CS_Key" dest_attr="CS_loop_id"
                ref = source + "_" + valueElement["xlink:href"]
                rel_type = "HAS_" + objectType.upper()
                relProps = {}
                relProps["element"] = objectType
                relProps["parentElement"]= element["_type"]
                relProps["rel_name"]= scopedElement["rel_name"]
                relProps["dest_attr"]= scopedElement["dest_attr"]
                addInstanceRel({ "parent": parent, "elm": elm, "relProps" : relProps , "ref" : ref, "rel_type": rel_type })
                return
            pinfo(f" this member {element} situation needs attention AAA")
            return
        # this is a scoped element local without a value
        # we have to create a property for this
        # pinfo(f" this scoped member {element} situation needs attention BBB ")
        addEmptyProp(parent,elm)
        if useNoValueString == True:
            props[element["_type"]] = noValueString
            addNodeProps(parent, props)
        return
    # handle these in a special way after lunch!!!!!
    # check
    # there are children
    # walk through it no
    # this can be nested
    handleMemberLevel(parent, element["_children"], source, element["_type"])
    return




In [21]:
def processLevelElement(parentNodeId):
    tStart = time.perf_counter_ns()
    query="""
    MATCH (x)<-[:HAS_CHILD*0..]-(alloc:Allocation)<-[:HAS_ALLOCATION]-(file:File)<-[:HAS_FILE]-(db:Database)
    WHERE elementId(x) = $nid
    WITH x , file, alloc
    ,     apoc.convert.getJsonProperty(x,'_children','') as children 
    ,     alloc.statClass as statClassAlloc
    ,     alloc.dynClass as dynClassAlloc
    ,     db.name as storage_interop_services_source
    ,     alloc.oid as allocationOid
    WITH elementId(x) as parent, file, children, statClassAlloc, dynClassAlloc, allocationOid, storage_interop_services_source
    return parent, children, properties(file) as file, statClassAlloc, dynClassAlloc, allocationOid, storage_interop_services_source
    """
    # this query will return one record for now
    record = session.run(query, { "nid" : parentNodeId }).single()
    parent = record["parent"]
    file = record["file"]
    statClassAlloc = record["statClassAlloc"]
    dynClassAlloc = record["dynClassAlloc"]
    recordchildren = record["children"]
    allocationOid = record["allocationOid"]
    elementCount = len(recordchildren)
    source = record["storage_interop_services_source"]
    pinfo(f"node {parent} on allocation oid={allocationOid} from sigraph file {file['name']} START. ")
    classLabel = "Classlabel"
    if dynClassAlloc != "":
        classLabel = dynClassAlloc
    else:
        classLabel = statClassAlloc

    addNodeLabel(parentNodeId, classLabel)


    #
    #
    # Processing the staticClasses
    #  processNextLevel(parent, file, element, elmType, statClassAlloc, dynClassAlloc):
    # -- If The current _type is the statClass this is then the parent of the dynClass
    pdebug(f" children count {elementCount}")
    sibling = 0
    for child in recordchildren:
        sibling = sibling + 1
        processNextLevel(parent, file, child, statClassAlloc, dynClassAlloc, 1, elementCount, sibling, source)



    duration = (time.perf_counter_ns() - tStart) // 1000000
    pinfo(f"node {parent} on allocation oid={allocationOid} from sigraph file {file['name']} in {duration} ms. ")



In [22]:
def processElements():
    # processing until it is done
    query = """
        MATCH (n:PrcElement) return collect(distinct elementId(n)) as ids
    """
    ids = session.run(query).single()["ids"]
    if len(ids) > 0:
        for id in ids:
            pdebug(f" calling now processLevelElement({id})")
            processLevelElement(id)
        writeToNeo4j(ids)
        processElements()
    return len(ids)
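
`processElements` calls itself once per nesting level and returns only the count of the first level. As a hedged alternative, the same "repeat until nothing is left" pattern can be written as a loop. The sketch below uses in-memory stand-ins instead of a database; `drain`, `fake_fetch`, `fake_process` and `max_rounds` are hypothetical names, not part of the notebook.

```python
def drain(fetch_ids, process_batch, max_rounds=1000):
    """Fetch and process batches until fetch_ids() returns nothing."""
    total = 0
    for _ in range(max_rounds):
        ids = fetch_ids()
        if not ids:
            return total
        process_batch(ids)
        total += len(ids)
    raise RuntimeError("still work left after max_rounds rounds")


# in-memory stand-in: each inner list plays the role of one level of PrcElement nodes
pending = [["a1", "a2"], ["c1", "c2", "c3"], ["g1"]]
seen = []


def fake_fetch():
    return pending[0] if pending else []


def fake_process(ids):
    seen.extend(ids)
    pending.pop(0)  # processing a level exposes the next one


total = drain(fake_fetch, fake_process)
print(total, seen)
```

In the notebook, `fetch_ids` would correspond to the `MATCH (n:PrcElement)` query and `process_batch` to the `processLevelElement` loop followed by `writeToNeo4j`.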


In [23]:
def processAllocations():
    filecount = 0
    tAllStart = time.perf_counter_ns()
    for xdir in dbDirNames:
        xmldirpath = sourceDir + xdir
        sourcename = get_file_name(xdir)
        for entry in sorted(os.listdir(xmldirpath)):
            if os.path.isfile(os.path.join(xmldirpath, entry)):
                tStart = time.perf_counter_ns()
                xmlfile = os.path.join(xmldirpath, entry)
                if xmlfile.endswith(fileFilter):
                    filecount = filecount + 1
                    saveTopStructure(xmlfile, sourcename, session)
                    processElements()
                    # clear_empty_nodes()
                    # clear_elements()
                    duration = (time.perf_counter_ns() - tStart) // 1000000
                    print(f"processed file {sourcename}/{entry} in {duration} ms")
                    #if filecount == 1:
                    #    break

    allduration = (time.perf_counter_ns() - tAllStart) // 1000000
    return f"processed {filecount} files in {allduration // 1000}s"


In [24]:
processingmessage = processAllocations()

processed file R_2016R3/data_2153.xml.gz in 15731 ms
processed file R_2016R3/data_2154.xml.gz in 6148 ms
processed file R_2016R3/data_2155.xml.gz in 6044 ms
processed file R_2016R3/data_2156.xml.gz in 6701 ms
processed file R_2016R3/data_2157.xml.gz in 5680 ms
processed file R_2016R3/data_2158.xml.gz in 5683 ms
processed file R_2016R3/data_2159.xml.gz in 6470 ms
processed file R_2016R3/data_216.xml.gz in 5807 ms
processed file R_2016R3/data_2160.xml.gz in 5019 ms
processed file R_2016R3/data_2161.xml.gz in 7797 ms
processed file R_2016R3/data_2162.xml.gz in 6427 ms
processed file R_2016R3/data_2163.xml.gz in 6915 ms
processed file R_2016R3/data_2164.xml.gz in 6707 ms
processed file R_2016R3/data_2165.xml.gz in 7299 ms
processed file R_2016R3/data_2166.xml.gz in 7129 ms
processed file R_2016R3/data_2167.xml.gz in 7318 ms
processed file R_2016R3/data_2168.xml.gz in 6852 ms
processed file R_2016R3/data_2169.xml.gz in 5220 ms
processed file R_2016R3/data_217.xml.gz in 7410 ms
...

In [25]:
print(processingmessage)

processed 4399 files in 34267s


# Analyzing data

## Counting database artifacts

In [37]:
q2="""
MATCH (n) return "node" as entity, count(n) as count
UNION ALL
MATCH ()-[r]->() return "relationship" as entity, count(r) as count
UNION ALL
CALL db.labels() yield label return "label" as entity, count(label) as count
UNION ALL 
CALL db.relationshipTypes() yield relationshipType return "relationshipType" as entity, count(relationshipType) as count
UNION ALL 
CALL db.propertyKeys() yield propertyKey return "propertyKey" as entity, count(propertyKey) as count
"""
run_cypher(session, q2)

run_cypher : MATCH (n) return "node" as entity, count(n) as countUNION ALLMATCH ()-[r]-...
Results available after 49ms, finished query after 742ms


entity,count
node,5687323
relationship,16486017
label,324
relationshipType,92
propertyKey,3319


#### Size on disk: 4.3G (`du -h .` in the [neo4j_data_home]/databases directory)
#### Processing one db file with Python takes about 8 seconds, so processing 5682 files takes about 12.6 hours
A compiled language would probably be somewhat faster.
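
The timing figures can be cross-checked with a little arithmetic. The sketch below uses the totals printed by the run in this notebook (4399 files in 34267 s) and the rounded 8 seconds per file from the note above; the variable names are made up for illustration.

```python
# Totals reported by processAllocations() in this notebook
files_in_run = 4399
seconds_in_run = 34267

# Average processing time per file in the measured run
seconds_per_file = seconds_in_run / files_in_run

# Estimate for the full set of 5682 files at the rounded 8 s per file
full_set_files = 5682
estimated_hours = full_set_files * 8 / 3600

print(f"measured average: {seconds_per_file:.1f} s per file")
print(f"estimated total for {full_set_files} files: {estimated_hours:.1f} hours")
```

The measured average comes out slightly below 8 seconds, so the 12.6 hour figure is a mildly conservative estimate.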

## Checking relationship structures
When exploring the graph we found a lot of bidirectional relationships between classes.
The query below counts the combinations of relationship types that occur between two such nodes.
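
As a rough illustration of the counting logic, here is a pure-Python sketch on a toy edge list. It mimics a pattern of the form `(a)-[r1]->(b)-[r2]->(a)` followed by sorting the two relationship types. The node names and the `count_type_combinations` helper are hypothetical, and the sketch makes no claim about how Neo4j executes the query.

```python
from collections import Counter
from typing import List, Tuple

Edge = Tuple[str, str, str]  # (start node, relationship type, end node)

# Toy directed edges (assumption): one bidirectional pair, one one-way edge
edges: List[Edge] = [
    ("sheet1", "HAS_SHEET_INFO", "info1"),
    ("info1", "HAS_BASE_SHEET", "sheet1"),
    ("sheet1", "HAS_VOID", "void1"),  # no edge back, so never matched
]


def count_type_combinations(edges: List[Edge]) -> Counter:
    """Count sorted (type, type) pairs for every a -> b -> a match."""
    counts: Counter = Counter()
    for i, (a, t1, b) in enumerate(edges):
        for j, (b2, t2, a2) in enumerate(edges):
            # r1 and r2 must be different relationships, and r2 must lead
            # from b back to a
            if i != j and b2 == b and a2 == a:
                counts[tuple(sorted((t1, t2)))] += 1
    return counts


combos = count_type_combinations(edges)
for combo, cnt in sorted(combos.items()):
    print(list(combo), cnt)
```

Note that a pattern like this matches every bidirectional pair twice, once starting from each end, so the counts are twice the number of distinct node pairs.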

In [42]:
q1 = """
    match (a:Allocation)-[r1]->(b:Allocation)-[r2]->(a)
    with  a, b, apoc.coll.sort([type(r1),type(r2)]) as types
    return types as typeCombination, count(*) as cnt order by typeCombination
"""
run_cypher(session, q1)

run_cypher : match (a:Allocation)-[r1]->(b:Allocation)-[r2]->(a)    with  a, b, apoc.col...
Results available after 46ms, finished query after 11300ms


typeCombination,cnt
"[HAS_ACCESSORY_OCC, HAS_CABINET_OCC]",22
"[HAS_ACCESSORY_OCC, HAS_FRAME_OCC]",22
"[HAS_ACCESSORY_OCC, HAS_FUNCTION_DESCRIPTION_BLOCK]",48
"[HAS_AREA_FRAME_OCC, HAS_AREA_IDENT_BLOCK]",328
"[HAS_BASE_SHEET, HAS_SHEET_INFO]",7332
"[HAS_BASE_SHEET, HAS_VOID]",7318
"[HAS_BR_PIN_OCC, HAS_FUNCTION_OCC]",47420
"[HAS_BS_INSTANCE, IS_SUCCESSOR]",24294
"[HAS_CABINET_OCC, HAS_FUNCTION_DESCRIPTION_BLOCK]",2414
"[HAS_CABINET_OCC, HAS_FUNCTION_IDENT_BLOCK]",3950
