## CI Components LinkML to GraphML Converter
Hard-coded converter that moves the CI Components Catalog to GraphML. It creates nodes and relationships that Neo4j can then read in.

### Next Steps and Alternatives
 - This code does no validation. It expects users to validate their data with LinkML first and then run this converter.
 - This code could be generalized, but it would be a chore. The user would have to read the LinkML schema, extract the relationships, create nodes as usual, and then use that dynamic relationship information to create the relationships.
     - If generalization is the goal, creating Pydantic models from the LinkML schema, validating against them, and then creating nodes and relationships is probably the way to go. No entity or component name can be hard-coded because everything is dynamic, which throws in some curveballs.
 - Building on others' work, there is existing code that takes the schema and data and outputs an RDF/OWL file in Turtle format. It appears to require annotations, but I couldn't get it to work, so I'm not sure how it works or whether the annotations are optional.
     - Neo4j can ingest OWL/RDF files, though I'm not sure how the formatting will work.
     - https://linkml.io/linkml-owl/usage/
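
The generalization idea above (typed models that validate the data before nodes and relationships are created) could look roughly like the following. This is a minimal sketch, not part of the converter: it uses standard-library dataclasses as a dependency-free stand-in for Pydantic models, and the names `Component`, `Relationship`, and `parse_component` are hypothetical. Only the field names `id`, `hasDependentComponents`, `related_to`, and `relationship_type` come from the catalog data.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Relationship:
    related_to: str
    relationship_type: str


@dataclass
class Component:
    id: str
    relationships: list[Relationship] = field(default_factory=list)
    properties: dict[str, Any] = field(default_factory=dict)


def parse_component(raw: dict[str, Any]) -> Component:
    """Validate one raw component dict and split it into properties and relationships."""
    if not isinstance(raw.get("id"), str) or not raw["id"]:
        raise ValueError(f"component is missing a string 'id': {raw!r}")

    relationships = []
    for rel in raw.get("hasDependentComponents") or []:
        # Every relationship needs a target id and a type.
        missing = {"related_to", "relationship_type"} - set(rel)
        if missing:
            raise ValueError(f"relationship on {raw['id']!r} is missing {sorted(missing)}")
        relationships.append(Relationship(rel["related_to"], rel["relationship_type"]))

    # Everything that is not the id or a relationship becomes a node property.
    properties = {k: v for k, v in raw.items() if k not in ("id", "hasDependentComponents")}
    return Component(raw["id"], relationships, properties)
```

A fully general version would derive the field names from the LinkML schema rather than hard-coding them as this sketch does.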


In [6]:
# Data url and linkml schema url
data_url = "https://raw.githubusercontent.com/ICICLE-ai/CI-Components-Catalog/master/components-data.yaml"
schema_url = "https://raw.githubusercontent.com/ICICLE-ai/CI-Components-Catalog/master/ci-component.yaml"

In [18]:
# We parse through the linkml data and create networkx nodes and edges to create a working networkx model.
# Once we have that model we can export to graphml. Neo4j is able to read graphml using the apoc plugin.
import networkx as nx
import yaml
import requests


# Taken from Yamei's code. Loads YAML from a URL when `github` is True, otherwise from a local file path.
def loadYAML(url, github):
    if github:
        r = requests.get(url)
        r.raise_for_status()  # Fail early on a bad URL instead of parsing an error page.
        data_linkml = yaml.safe_load(r.content)
        print("Loaded URL.")
    else:
        with open(url, "r") as stream:
            data_linkml = yaml.safe_load(stream)
    return data_linkml

github = True
schema = loadYAML(schema_url, github)
data = loadYAML(data_url, github)

G = nx.Graph()

# Node ids are needed both to create nodes and to reference them in relationships, so we keep track of node_id here.
# To define a relationship between two nodes, we use node_id and second_node_id, doing a quick search
# through all components to find the one whose `id` matches.
# In this case, we hardcode 'components', 'hasDependentComponents', 'related_to', 'relationship_type', and 'id'.
node_id = 1
for component in data['components']:
    # We add node_id to the component so we can search for it later when resolving relationships. node_id is not added to the nodes, only used in this script.
    component['node_id'] = node_id
    component_vars = component.copy()
    del component_vars['node_id']
    component_vars['labels'] = ":Component" # This allows us to use readLabels later to define node type via this attr.
    #del component_vars['id']
    relationships = component_vars.pop("hasDependentComponents", None) or []
    ## Create node with all vars we have minus `node_id` and relationships, in this case `hasDependentComponents`.
    for key, value in component_vars.items():
        if value is None:
            component_vars[key] = "" # networkx's GraphML writer does not accept None values.
    
    G.add_node(node_id, **component_vars)
    print(f"\n\n\nnode#{node_id}: {component_vars}")
    
    ## Get relationship second_node_id
    for relationship in relationships:
        related_to = relationship['related_to']
        relationship_type = relationship['relationship_type']

        # Find the node_id of the referenced component so we can link node_id and second_node_id.
        # Only components the loop has already reached have a node_id, so the target must not appear later in the list.
        second_node_id = None
        for item in data['components']:
            if item.get('node_id', None):
                if related_to == item['id']:
                    second_node_id = item['node_id']
                    break
        if not second_node_id:
            raise KeyError(f"Could not find a component with the id referenced by this relationship. related_to: {related_to}, node: {node_id}")
        
        ## Create edge between node_id and second_node_id
        # This is a small example, but type could determine directionality
        if relationship_type == 'DependsOn':
            relationship_attrs = {'label': relationship_type} # This allows us to use readLabels later to define relationship type via this attr.
            G.add_edge(node_id, second_node_id, **relationship_attrs)
            print(f"\nrelation: (node#{node_id} -> node#{second_node_id}) attrs: {relationship_attrs}")
    
    node_id += 1
    
# We dump to graphml.
# named_key_ids names keys after their attributes instead of using generated references; this allows us to use
# the readLabels option of apoc.import.graphml later to set relationship/node types.
nx.write_graphml(G, "temp.graphml", named_key_ids=True)
# After that, upload the file to GitHub; we'll use the raw GitHub URL for the Neo4j ingest.

Loaded URL.
Loaded URL.



node#1: {'id': 'TapisBase', 'owner': 'Joe Stubbs', 'primaryThrust': 'core/Software', 'name': 'Base ICICLE Tapis Software', 'status': 'ProductionRelease', 'website': 'tapis-project.org', 'description': 'Hosted, web-based API for managing data and executing software for research computing', 'componentVersion': '1.3.0', 'targetIcicleRelease': '2023-04', 'licenseUrl': 'https://github.com/tapis-project/tapis-shared-java/blob/prod/LICENSE', 'publicAccess': True, 'sourceCodeUrl': 'https://github.com/tapis-project', 'releaseNotesUrl': 'https://github.com/tapis-project/tapis-deployer/blob/main/CHANGELOG.md', 'citation': 'Joe Stubbs, Richard Cardone, Mike Packard, Anagha Jamthe, Smruti Padhy, Steve Terry, Julia Looney, Joseph Meiring, Steve Black, Maytal Dahan, Sean Cleveland, Gwen Jacobs. (2021) Tapis: An API Platform for Reproducible, Distributed Computational Research. In: Arai K. (eds) Advances in Information and Communication. FICC 2021. Advances in Intelligent Sy
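
The loop above assigns `node_id` while it iterates, so a relationship that points at a component appearing later in `data['components']` is not found and raises the `KeyError`. One way around that is to number all components first and resolve relationships in a second pass. The sketch below assumes every component carries a unique `id`. The function name `build_nodes_and_edges` is hypothetical, and its output is plain Python data that can be fed to `G.add_node` and `G.add_edge`.

```python
def build_nodes_and_edges(components):
    """Return (nodes, edges) for a list of component dicts.

    nodes maps node_id -> attribute dict; edges is a list of
    (source_node_id, target_node_id, attrs) tuples.
    """
    # Pass 1: give every component a node_id up front so forward references resolve.
    id_to_node = {c["id"]: n for n, c in enumerate(components, start=1)}

    nodes, edges = {}, []
    # Pass 2: build node attributes and resolve each relationship with a dict lookup.
    for component in components:
        node_id = id_to_node[component["id"]]
        attrs = {
            key: ("" if value is None else value)
            for key, value in component.items()
            if key != "hasDependentComponents"
        }
        attrs["labels"] = ":Component"
        nodes[node_id] = attrs

        for rel in component.get("hasDependentComponents") or []:
            target = id_to_node.get(rel["related_to"])
            if target is None:
                raise KeyError(f"unknown related_to {rel['related_to']!r} on node {node_id}")
            edges.append((node_id, target, {"label": rel["relationship_type"]}))
    return nodes, edges


# Made-up example data: "A" refers to "B", which appears later in the list.
example = [
    {"id": "A", "hasDependentComponents": [{"related_to": "B", "relationship_type": "DependsOn"}]},
    {"id": "B", "description": None},
]
nodes, edges = build_nodes_and_edges(example)
```

Unlike the loop above, this sketch keeps every relationship type rather than only `DependsOn`. The dict lookup also replaces the inner search over all components.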

In [1]:
# Initialize Neo4j driver
from neo4j import GraphDatabase

url = "bolt+s://username.pods.icicle.tapis.io:443"
user = "username"
passw = "password"
neo = GraphDatabase.driver(url,
                           auth=(user, passw),
                           max_connection_lifetime=30)

In [17]:
# Use the apoc plugin to import graphml from a GitHub URL (point it at wherever you uploaded the `nx.write_graphml()` output from earlier).
with neo.session() as session:
    result = session.run('CALL apoc.import.graphml("https://raw.githubusercontent.com/NotChristianGarcia/CI-Components-Catalog/master/file.graphml", {readLabels: true})')
    print(result)

<neo4j.work.result.Result object at 0x7f81491f1820>
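
Printing `result` only shows the `Result` object, not what the import did. Below is a minimal sketch of reading the rows the procedure returns, assuming the APOC plugin is installed and the GraphML file is reachable from the database. The helper name `import_graphml` is hypothetical, and the URL is passed as a query parameter instead of being embedded in the Cypher string.

```python
IMPORT_QUERY = "CALL apoc.import.graphml($url, {readLabels: true})"


def import_graphml(session, graphml_url):
    """Run the APOC GraphML import and return its result rows as plain dicts."""
    result = session.run(IMPORT_QUERY, url=graphml_url)
    # Records have to be read while the session is still open.
    return [record.data() for record in result]
```

With the driver from the earlier cell, this could be called inside `with neo.session() as session:` as `print(import_graphml(session, graphml_url))`, where `graphml_url` stands for the raw GitHub URL of the uploaded file.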
