
A Source can be implemented for any file, local store, or remote store that contains a graph. A Source is responsible for reading nodes and edges from that graph.

A source must subclass the kgx.source.source.Source class and must implement the following methods:

  • parse
  • read_nodes
  • read_edges

parse method

  • Responsible for parsing a graph from a file/store
  • Must return a generator that iterates over the node and edge records from the graph

read_nodes method

  • Responsible for reading nodes from the file/store
  • Must return a generator that iterates over the node records
  • Each node record must be a 2-tuple (node_id, node_data) where,
    • node_id is the node CURIE
    • node_data is a dictionary that represents the node properties

read_edges method

  • Responsible for reading edges from the file/store
  • Must return a generator that iterates over the edge records
  • Each edge record must be a 4-tuple (subject_id, object_id, edge_key, edge_data) where,
    • subject_id is the subject node CURIE
    • object_id is the object node CURIE
    • edge_key is the unique key for the edge
    • edge_data is a dictionary that represents the edge properties
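
The interface described above can be sketched as follows. This is a minimal illustration, not KGX code: the local ``Source`` class is a stand-in so the example runs without KGX installed (a real implementation would subclass kgx.source.source.Source), and ``InMemorySource``, the example CURIEs, and the edge key format are all hypothetical.

```python
from typing import Any, Dict, Generator, List, Tuple

NodeRecord = Tuple[str, Dict[str, Any]]
EdgeRecord = Tuple[str, str, str, Dict[str, Any]]


class Source:
    """Stand-in for kgx.source.source.Source (assumption, for illustration only)."""

    def parse(self, **kwargs: Any) -> Generator:
        raise NotImplementedError


class InMemorySource(Source):
    """Hypothetical source that reads a graph held in plain Python lists."""

    def __init__(self, nodes: List[Dict[str, Any]], edges: List[Dict[str, Any]]) -> None:
        self.nodes = nodes
        self.edges = edges

    def parse(self, **kwargs: Any) -> Generator:
        # Yield every node record first, then every edge record.
        yield from self.read_nodes()
        yield from self.read_edges()

    def read_nodes(self) -> Generator[NodeRecord, None, None]:
        for node in self.nodes:
            # 2-tuple: (node_id, node_data)
            yield node["id"], node

    def read_edges(self) -> Generator[EdgeRecord, None, None]:
        for edge in self.edges:
            # The key format below is an assumption; any unique key would do.
            key = f'{edge["subject"]}-{edge["predicate"]}-{edge["object"]}'
            # 4-tuple: (subject_id, object_id, edge_key, edge_data)
            yield edge["subject"], edge["object"], key, edge


source = InMemorySource(
    nodes=[{"id": "EX:gene1", "name": "gene one"}, {"id": "EX:disease1", "name": "disease one"}],
    edges=[{"subject": "EX:gene1", "predicate": "biolink:related_to", "object": "EX:disease1"}],
)
records = list(source.parse())
```

Because every method returns a generator, a consumer can process one record at a time without holding the whole graph in memory.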


Base class for all Sources in KGX.

.. automodule:: kgx.source.source


GraphSource is responsible for reading from an instance of kgx.graph.base_graph.BaseGraph and must use only the methods exposed by BaseGraph to access the graph.

.. automodule:: kgx.source.graph_source


TsvSource is responsible for reading from KGX formatted CSV or TSV files using Pandas. Every flat file is treated as a Pandas DataFrame, and data are read from it in chunks.

KGX expects two separate files - one for nodes and another for edges.
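
The idea of reading a flat file in chunks can be illustrated with the sketch below. It deliberately uses the standard library ``csv`` module rather than Pandas so that it runs without extra dependencies, so it is not how TsvSource itself is written. The ``read_in_chunks`` helper, the column names, and the sample rows are assumptions made for illustration.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List

# Hypothetical contents of a nodes file; an edges file would be read the same way.
NODES_TSV = (
    "id\tcategory\tname\n"
    "EX:1\tbiolink:Gene\tgene one\n"
    "EX:2\tbiolink:Gene\tgene two\n"
    "EX:3\tbiolink:Disease\tdisease three\n"
)


def read_in_chunks(handle, chunk_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most chunk_size rows, so only one chunk is in memory at a time."""
    reader = csv.DictReader(handle, delimiter="\t")
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


chunks = list(read_in_chunks(io.StringIO(NODES_TSV), chunk_size=2))

# Turn each row into a (node_id, node_data) record.
node_records = [(row["id"], row) for chunk in chunks for row in chunk]
```

With three data rows and a chunk size of two, the file is consumed as one chunk of two rows followed by one chunk of a single row.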

.. automodule:: kgx.source.tsv_source


JsonSource is responsible for reading data from KGX formatted JSON using the ijson library, which allows data to be streamed from the file.

.. automodule:: kgx.source.json_source


JsonlSource is responsible for reading data from a KGX formatted JSON Lines using the jsonlines library.

KGX expects two separate JSON Lines files - one for nodes and another for edges.
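
As a rough illustration of the two-file layout, the sketch below reads one JSON object per line from a nodes file and from an edges file. It uses the standard library ``json`` module in place of the jsonlines library, and the ``read_jsonl`` helper, field names, and sample records are assumptions rather than part of KGX.

```python
import io
import json
from typing import Any, Dict, Iterator


def read_jsonl(handle) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical file contents, held in memory so the example is self-contained.
nodes_file = io.StringIO(
    '{"id": "EX:1", "name": "gene one"}\n'
    '{"id": "EX:2", "name": "disease two"}\n'
)
edges_file = io.StringIO(
    '{"subject": "EX:1", "predicate": "biolink:related_to", "object": "EX:2"}\n'
)

nodes = [(record["id"], record) for record in read_jsonl(nodes_file)]
edges = list(read_jsonl(edges_file))
```

Each line is parsed independently, so the reader never needs to load a whole file at once.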

.. automodule:: kgx.source.jsonl_source


TrapiSource is responsible for reading data from Translator Reasoner API (TRAPI) formatted JSON.

.. automodule:: kgx.source.trapi_source


ObographSource is responsible for reading data from OBOGraphs in JSON.

.. automodule:: kgx.source.obograph_source


SssomSource is responsible for reading data from SSSOM formatted files.

.. automodule:: kgx.source.sssom_source


NeoSource is responsible for reading data from a local or remote Neo4j instance.

.. automodule:: kgx.source.neo_source


RdfSource is responsible for reading data from RDF N-Triples.

This source makes use of a custom kgx.parsers.ntriples_parser.CustomNTriplesParser for parsing N-Triples, which extends rdflib.plugins.parsers.ntriples.NTriplesParser.

To ensure proper parsing of N-Triples and a relatively low memory footprint, it is recommended that the N-Triples be sorted based on the subject IRIs.

.. code-block:: bash

    sort -k 1,2 -t ' ' data.nt > data_sorted.nt

.. automodule:: kgx.source.rdf_source
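
The sketch below shows the general idea behind the sorting recommendation: when triples are sorted by subject, all statements about one subject are adjacent and can be grouped and emitted in a single streaming pass. It does not reflect the internals of RdfSource; the ``split_line`` and ``group_by_subject`` helpers are hypothetical, and the line splitting is simplified to IRI-only triples.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]


def split_line(line: str) -> Triple:
    """Split a simplified N-Triples line (IRI terms only) into subject, predicate, object."""
    subject, predicate, obj = line.rstrip(" .\n").split(" ", 2)
    return subject, predicate, obj


def group_by_subject(lines: Iterable[str]) -> Iterator[Tuple[str, List[Triple]]]:
    """Group consecutive triples that share a subject; only one group is held at a time."""
    triples = (split_line(line) for line in lines if line.strip())
    for subject, group in groupby(triples, key=lambda triple: triple[0]):
        yield subject, list(group)


sorted_lines = [
    "<http://example.org/a> <http://example.org/name> <http://example.org/n1> .",
    "<http://example.org/a> <http://example.org/type> <http://example.org/T> .",
    "<http://example.org/b> <http://example.org/name> <http://example.org/n2> .",
]
# Same triples, but the two statements about subject "a" are no longer adjacent.
unsorted_lines = [sorted_lines[0], sorted_lines[2], sorted_lines[1]]

sorted_groups = list(group_by_subject(sorted_lines))
unsorted_groups = list(group_by_subject(unsorted_lines))
```

On the sorted input each subject yields exactly one group, whereas the unsorted input splits subject ``a`` into two partial groups. A reader facing unsorted input would have to buffer earlier subjects to merge them, which is where the extra memory goes.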


OwlSource is responsible for parsing an OWL ontology.

When parsing an OWL ontology, this source also adds OwlStar annotations to certain OWL axioms.

.. automodule:: kgx.source.owl_source


SparqlSource has yet to be implemented.

In principle, SparqlSource should be able to read data from a local or remote SPARQL endpoint.

.. automodule:: kgx.source.sparql_source