---


### **KGX Format Overview**

KGX (Knowledge Graph Exchange) is a Python library and set of utilities for exchanging knowledge graphs (KGs) that conform to the Biolink Model. It provides tools for converting, validating, and exchanging knowledge graphs in various formats, including JSON, TSV, RDF, and Neo4j.

#### **Core Features**
- **Property Graph Representation**: Internally represented as a `networkx.MultiDiGraph`.
- **Biolink Model Compliance**: Ensures nodes and edges conform to the Biolink Model, including valid categories, predicates, and properties.
- **Supported Formats**:
  - RDF (read/write) and SPARQL endpoints (read).
  - Neo4j endpoints (read) or dumps (write).
  - CSV/TSV and JSON.
  - Reasoner Standard API format.
  - OBOGraph JSON format.


### **KGX Format Details**

#### **Node Record**
Each node in a KGX graph is represented as a **Node Record** with the following elements:

- **Required Elements**:
  - `id`: A CURIE uniquely identifying the node.
  - `category`: A list of Biolink Model categories describing the node.

- **Optional Elements**:
  - **Biolink Model Properties**: e.g., `name`, `description`, `xref`, `provided_by`.
  - **Non-Biolink Properties**: Custom properties not defined in the Biolink Model.

#### **Edge Record**
Each edge in a KGX graph is represented as an **Edge Record** with the following elements:

- **Required Elements**:
  - `subject`: The source node's `id`.
  - `predicate`: The relationship type (from the Biolink `related_to` hierarchy).
  - `object`: The target node's `id`.

- **Optional Elements**:
  - **Biolink Model Properties**: e.g., `category`, `publications`.
  - **Edge Provenance**: e.g., `primary_knowledge_source`, `supporting_data_source`.
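
The required elements listed above can be checked with a few lines of plain Python. This is a minimal sketch, not KGX's own validator: the helper name `missing_fields` and the two constants are hypothetical, and it only tests that each required key is present and non-empty, not that the values are valid Biolink terms.

```python
# Required keys taken from the Node Record and Edge Record lists above.
NODE_REQUIRED = ("id", "category")
EDGE_REQUIRED = ("subject", "predicate", "object")


def missing_fields(record, required):
    """Return the required keys that are absent or empty in a record."""
    return [key for key in required if not record.get(key)]


node = {"id": "HGNC:11603", "category": ["biolink:Gene"], "name": "TBX4"}
edge = {"subject": "HGNC:11603", "predicate": "biolink:contributes_to"}

print(missing_fields(node, NODE_REQUIRED))  # []
print(missing_fields(edge, EDGE_REQUIRED))  # ['object']
```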


### **KGX Format Examples**

#### **KGX JSON Format**
The JSON format represents the graph as a dictionary with `nodes` and `edges` arrays.



In [1]:
{
  "nodes": [
    {
      "id": "HGNC:11603",
      "name": "TBX4",
      "category": ["biolink:Gene"],
      "provided_by": ["MonarchArchive:gwascatalog"]
    },
    {
      "id": "MONDO:0005002",
      "name": "chronic obstructive pulmonary disease",
      "category": ["biolink:Disease"],
      "provided_by": ["MonarchArchive:gwascatalog"]
    }
  ],
  "edges": [
    {
      "id": "urn:uuid:5b06e86f-d768-4cd9-ac27-abe31e95ab1e",
      "subject": "HGNC:11603",
      "predicate": "biolink:contributes_to",
      "object": "MONDO:0005002",
      "relation": "RO:0003304",
      "category": ["biolink:GeneToDiseaseAssociation"],
      "primary_knowledge_source": ["MonarchArchive:gwascatalog"],
      "publications": ["PMID:26634245", "PMID:26634244"]
    }
  ]
}




#### **KGX TSV Format**
The TSV format separates nodes and edges into two files: `nodes.tsv` and `edges.tsv`.

**nodes.tsv**:
| id            | category                                                                 | name                                  | provided_by               |
|---------------|--------------------------------------------------------------------------|---------------------------------------|---------------------------|
| HGNC:11603    | biolink:NamedThing\|biolink:BiologicalEntity\|biolink:Gene               | TBX4                                  | MonarchArchive:gwascatalog |
| MONDO:0005002 | biolink:NamedThing\|biolink:BiologicalEntity\|biolink:DiseaseOrPhenotypicFeature\|biolink:Disease | chronic obstructive pulmonary disease | MonarchArchive:gwascatalog |

**edges.tsv**:
| id                                    | subject     | predicate                  | object         | relation   | primary_knowledge_source | category                        | publications               |
|---------------------------------------|-------------|----------------------------|----------------|------------|---------------------------|---------------------------------|---------------------------|
| urn:uuid:5b06e86f-d768-4cd9-ac27-abe31e95ab1e | HGNC:11603  | biolink:contributes_to    | MONDO:0005002  | RO:0003304 | MonarchArchive:gwascatalog | biolink:GeneToDiseaseAssociation | PMID:26634245\|PMID:26634244 |
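
As the tables show, multivalued fields such as `category` and `publications` are written as a single pipe-delimited cell. A minimal sketch of that flattening, using only the standard library; `records_to_tsv` is a hypothetical helper written for illustration and is not part of KGX.

```python
import csv
import io


def records_to_tsv(records, columns):
    """Serialize records to TSV text, joining list values with '|'."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for record in records:
        row = []
        for column in columns:
            value = record.get(column, "")
            if isinstance(value, list):
                # Multivalued fields become one pipe-delimited cell.
                value = "|".join(str(item) for item in value)
            row.append(value)
        writer.writerow(row)
    return buffer.getvalue()


nodes = [
    {
        "id": "HGNC:11603",
        "category": ["biolink:Gene"],
        "name": "TBX4",
        "provided_by": ["MonarchArchive:gwascatalog"],
    }
]
tsv_text = records_to_tsv(nodes, ["id", "category", "name", "provided_by"])
print(tsv_text)
```

The same helper can be applied to edge records by passing the edge column names instead.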

---

### **Key Points**
- **Validation**: KGX ensures that nodes and edges conform to the Biolink Model.
- **Flexibility**: Supports both Biolink and non-Biolink properties.
- **Interoperability**: Facilitates exchange between different graph systems and formats.

Documentation: [KGX documentation](https://github.com/biolink/kgx).



---

Packages

In [1]:
from bmt import Toolkit
import requests
import pprint
import biothings_client
import json


from tqdm import tqdm

# Initialize Biolink Model Toolkit
BMT = Toolkit() # only want to initialize once

---

### **KGX Format Pipeline Process**

This pipeline processes data from a SmartAPI source, maps it to the Biolink Model, and formats it into a KGX-compatible JSON file. Below is a step-by-step description of the process:

---

#### **1. Query SmartAPI Metadata**
- **Input**: SmartAPI ID (`api_id`) and API name (`api_name`).
- **Process**:
  - Construct a SmartAPI query URL using the `api_id`.
  - Send a GET request to retrieve metadata about the API.
- **Output**: Metadata in JSON format (`data`).
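
The URL construction in this step can be isolated in a small function. A sketch under stated assumptions: the query string mirrors the one used in this notebook's code example, while the function name `build_metakg_url` and the `size` parameter are introduced here for illustration.

```python
SMARTAPI_METAKG = "https://smart-api.info/api/metakg/"


def build_metakg_url(api_id, size=100):
    """Build the MetaKG query URL for one SmartAPI registration ID."""
    return (
        f"{SMARTAPI_METAKG}?q=api.smartapi.id:{api_id}"
        f"&bte=1&consolidated=0&size={size}"
    )


url = build_metakg_url("b772ebfbfa536bba37764d7fddb11d6f")
print(url)
```

The resulting string can then be passed to `requests.get(url)` and decoded with `.json()` to obtain `data`.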

---

#### **2. Create Biolink Mappings**
- **Input**: SmartAPI metadata (`data`).
- **Process**:
  - Extract nodes and edges from the metadata.
  - Map entities (nodes) and relationships (edges) to Biolink Model URIs using the Biolink Model Toolkit (BMT).
  - Deduplicate and structure the mappings.
- **Output**: `node_mappings` and `edge_mappings`.

---

#### **3. Query BioThings API**
- **Input**: `node_mappings`, `edge_mappings`, and a BioThings API client.
- **Process**:
  - Query the BioThings API for all data (`__all__`).
  - Extract node and edge information based on the Biolink mappings.
  - Format nodes and edges into dictionaries.
- **Output**: Lists of `nodes` and `edges`.

---

#### **4. Format Data into KGX**
- **Input**: `nodes` and `edges`.
- **Process**:
  - Combine nodes and edges into a KGX-compatible dictionary.
  - Deduplicate nodes and edges.
- **Output**: A KGX dictionary (`kgx_dict`).
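
One way to combine and deduplicate in this step is sketched below. It assumes nodes are unique by `id` and edges by their (subject, predicate, object) triple; the function name `build_kgx_dict` is hypothetical and may differ from what the pipeline actually uses.

```python
def build_kgx_dict(nodes, edges):
    """Combine node and edge records, keeping the first record per key."""
    unique_nodes = {}
    for node in nodes:
        unique_nodes.setdefault(node["id"], node)

    unique_edges = {}
    for edge in edges:
        key = (edge["subject"], edge["predicate"], edge["object"])
        unique_edges.setdefault(key, edge)

    return {
        "nodes": list(unique_nodes.values()),
        "edges": list(unique_edges.values()),
    }


nodes = [
    {"id": "HGNC:11603", "category": ["biolink:Gene"]},
    {"id": "HGNC:11603", "category": ["biolink:Gene"]},  # duplicate node
    {"id": "MONDO:0005002", "category": ["biolink:Disease"]},
]
edges = [
    {"subject": "HGNC:11603", "predicate": "biolink:contributes_to", "object": "MONDO:0005002"},
    {"subject": "HGNC:11603", "predicate": "biolink:contributes_to", "object": "MONDO:0005002"},
]
kgx_dict = build_kgx_dict(nodes, edges)
print(len(kgx_dict["nodes"]), len(kgx_dict["edges"]))  # 2 1
```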

---

#### **5. Write KGX to JSON | TSV**
- **Input**: `kgx_dict`.
- **Process**:
  - Serialize the KGX dictionary into a JSON file.
  - Save the file locally (e.g., `raresource_kgx_test2.json`).
- **Output**: A JSON file in KGX format.
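
The code example later in this notebook calls `write_kgx_to_json`, whose definition is not shown here. A plausible minimal version is sketched below; the signature is assumed from that call, and the return value and indentation setting are illustrative choices.

```python
import json
import os
import tempfile


def write_kgx_to_json(kgx_dict, path):
    """Write a KGX dictionary to a JSON file and return the path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(kgx_dict, handle, indent=2)
    return path


# Round-trip a tiny graph through a temporary file to check the output.
demo = {"nodes": [{"id": "HGNC:11603", "category": ["biolink:Gene"]}], "edges": []}
out_path = os.path.join(tempfile.mkdtemp(), "kgx_demo.json")
write_kgx_to_json(demo, out_path)

with open(out_path, encoding="utf-8") as handle:
    reloaded = json.load(handle)
print(reloaded == demo)
```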

---

#### **Code Example**


In [None]:
api_name = "rare_source"
api_id = "b772ebfbfa536bba37764d7fddb11d6f"
client = biothings_client.get_client(url=f"https://biothings.ci.transltr.io/{api_name}")

# Query SmartAPI
smartapi_url = f"https://smart-api.info/api/metakg/?q=api.smartapi.id:{api_id}&bte=1&consolidated=0&size=100"
response = requests.get(smartapi_url)
data = response.json()

# Run the KGX pipeline
kgx_dict = run_kgx_format_pipeline(data, client)

# Write to JSON
write_kgx_to_json(kgx_dict, "raresource_kgx_test2.json")

---

********************** Draft ********************** 

In [146]:
# Look up the Biolink element mapped to an Orphanet CURIE (exploring how to get the ID prefix priority list for biolink:Disease)
element = BMT.get_element_by_mapping("orphanet:39041")
print(element)
# preferred_prefixes = element

# print(preferred_prefixes)

None


In [97]:
# 'disease_prefix': get_preferred_prefix('orphanet:39041')  # → 'MONDO'
# get_preferred_id('NCBIGene:100')

'NCBIGene:100'

In [2]:
# BMT.get_element_by_mapping("orphanet:277")

In [5]:
# node_norm = f"https://nodenorm.ci.transltr.io/1.5/get_normalized_nodes?curie={CURIE}&conflate=false&drug_chemical_conflate=false&description=false&individual_types=false"

In [11]:
def get_preferred_id(curie):
    url = f"https://nodenorm.ci.transltr.io/1.5/get_normalized_nodes?curie={curie}&conflate=false&drug_chemical_conflate=false&description=false&individual_types=false"

    response = requests.get(url)
    if response.ok:
        data = response.json()
        norm = data.get(curie)
        if norm and 'id' in norm:
            identifier = norm['id']['identifier']
            # return identifier.split(":")[0]  # e.g., MONDO
            return identifier
    return curie
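

The response handling in `get_preferred_id` can be exercised without a network call by separating it from the request. The sketch below assumes only the `{curie: {"id": {"identifier": ...}}}` shape that the function above already relies on; the helper name `extract_preferred_id` and the sample payload are made up for illustration and are not real service output.

```python
def extract_preferred_id(payload, curie):
    """Return the normalized identifier for a CURIE, or the CURIE itself."""
    entry = payload.get(curie)
    if entry and "id" in entry:
        return entry["id"].get("identifier", curie)
    return curie


# Hand-written payload for illustration only, not real service output.
sample = {
    "EX:1": {"id": {"identifier": "EX:0001", "label": "example"}},
    "EX:2": None,  # the CURIE was looked up but has no normalized entry
}

print(extract_preferred_id(sample, "EX:1"))
print(extract_preferred_id(sample, "EX:2"))
```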


********************** Original Code ********************** 

In [56]:
def get_biothings_api(node_mappings, edge_mappings, bt_data): # abstract out for one record
    # Iterate through BioThings API data    
    # Extract node data based on Biolink mappings
    # print('[INFO] BT data:', bt_data)

    for pred_uri, value in edge_mappings.items():
        # print(f"[INFO] Predicate URI: {pred_uri} | Value: {value}")
    #     print(value.keys())
        s_uri = value["subject"]
        o_uri = value["object"]
        # print(f"[INFO] Subject URI: {s_uri} | Object URI: {o_uri}")
        s_data = node_mappings[s_uri]
        o_data = node_mappings[o_uri]
        # print(f"[INFO] Subject data: {s_data} | Object data: {o_data}")
        s_identifier_key = s_data['identifier']
        o_identifier_key = o_data['identifier']
        s_prefix = s_data['prefix']
        o_prefix = o_data['prefix']
        s_category = [key for key,value in node_mappings.items() if value['prefix'] == s_prefix]
        o_category = [key for key,value in node_mappings.items() if value['prefix'] == o_prefix]
    #     s_prefix = s_data['prefix']
    #     s_category

        # try:
        s_identifiers = get_nested_value(bt_data,s_identifier_key)
        o_identifiers = get_nested_value(bt_data,o_identifier_key)
        if o_identifiers is None or s_identifiers is None:
            # print(f"[INFO] Found subject identifiers: {s_identifiers} key: {s_identifier_key} \n")
            # print(f"[INFO] Found object identifiers:{o_identifiers} key: {o_identifier_key}")
            continue
        # print(f"[INFO] Found subject identifiers: {s_identifiers} key: {s_identifier_key} \n")
        # ******************************************
        # Create Nodes
        for s_ in s_identifiers:
            s_node_id =  f"{s_prefix}:{s_[0]}"
            s_node_name = s_[1] 
            # s_node_norm_id = get_preferred_id(s_node_id)  # Use '=' for assignment
            # print(f"[INFO] {s_node_id} | {s_node_name}")
            if s_node_id not in nodes:
                nodes[s_node_id] = {
                    "id": s_node_id,
                    "name": s_node_name,
                    "category": s_category
                    # "provided_by": [node_mappings[s_uri]['provided_by']]
                }
                for prop_key, prop_value in node_mappings[s_uri]['properties'].items():
                    # print(prop_key)
                    if "." in prop_value:
                        pass
                    else:
                        if prop_key in bt_data:
                            v_ = bt_data[prop_key]
                            nodes[s_node_id][prop_key] = v_
                        
                # pprint.pprint(nodes[s_node_id])
            for o_ in o_identifiers:
                o_node_id =  f"{o_prefix}:{o_[0]}"
                o_node_name = o_[1]
                o_node_norm_id = get_preferred_id(o_node_id)
                # print(f"[INFO] {o_node_id} | {o_node_name}")
                if o_node_id not in nodes:
                    nodes[o_node_id] = {
                        "id": o_node_id,
                        "name": o_node_name,
                        "category": o_category
                        # "provided_by": [node_mappings[o_uri]['provided_by']]
                    }
                    for prop_key, prop_value in node_mappings[o_uri]['properties'].items():
                        if "." in prop_value:                            
                            nested_ids = get_nested_value(bt_data, prop_value)
                            if nested_ids is not None:
                                for nest_id in nested_ids:
                                    nodes[o_node_id][prop_key.split(".")[-1]] = nest_id[0]
                        else:
                            if prop_key in bt_data:
                                v_ = bt_data[prop_key]
                                nodes[o_node_id][prop_key] = v_
                # pprint.pprint(nodes[o_node_id])
                # ******************************************
                # Create Edges
                is_canonical = BMT.is_translator_canonical_predicate(pred_uri)
                is_symmetric = BMT.is_symmetric(pred_uri)

                if is_canonical:
                    print(f"[INFO] Predicate is canonical: {pred_uri} | {is_canonical}")
                    # if is_symmetric:
                    #     print(f"[INFO] {pred_uri} is also symmetric (this is unusual for canonical predicates).")
                    # else:
                    #     print(f"[INFO] {pred_uri} is NOT symmetric (expected for canonical predicates).")
                    rel = f"{s_node_id}-{pred_uri}-{o_node_id}"
                    if rel not in edges:
                        ref_url = get_nested_value(bt_data, edge_mappings[pred_uri]['properties']['ref_url'])
                        print(ref_url)
                        edges[rel] = {
                            "subject": s_node_id,
                            "predicate": pred_uri,
                            "object": o_node_id,
                            "knowledge_level": edge_mappings[pred_uri]['knowledge_level'],
                            "agent_type": edge_mappings[pred_uri]['agent_type'],
                            "primary_knowledge_source": edge_mappings[pred_uri]['primary_knowledge_source'],
                            "ref_url": [y_[0][0] for y_ in ref_url],
                        }

        # except Exception as error:
        #     print(f"\n[INFO] ERROR: {error}\n")
        #     pass

    return nodes,edges

In [75]:
# # def get_nested_value(data, key_path):
# #     """
# #     # Biothings Util function? 
# #     Retrieve a value from a nested dictionary using a dot-separated key path.
# #     Handles lists if encountered during traversal.
# #     Example: key_path = "raresource.disease.orphanet" will return the value of
# #     data["raresource"]["disease"][0]["orphanet"] if "disease" is a list.
# #     """

# #     keys = key_path.split(".")  # Split the key path into individual keys
# #     temp_data=data.copy() # for reference
# #     loop_ct=0

# #     # print('here')
# #     for i, key in enumerate(keys):
# #         # print(key)
# #         # print("in loop...")
# #         loop_ct+=1
# #         is_final_key = (i == len(keys) - 1)
# #         # data=data[key]
# #         # print(key, type(data), loop_ct)
# #         # print(f"Starting data: {type(data)} {type(temp_data)} {key}" )
# #         # *********************************************************************************************************
# #         if isinstance(data, dict) and key in data:
# #             # print(f"[INFO] Dictionary loop on key: {key}")
# #             if key in data:
# #                 temp_data = data[key]
# #                 if isinstance(temp_data, str): # found a string ID
# #                     id_ = temp_data
# #                     # print(f"[INFO] Inside string ID instance: {key}, {id_}")
# #                     # print(data.keys())
# #                     for search_key in ["SYMBOL", "symbol", "NAME", "name", "drug_name"]: # **MAKE REGEX** (drug_name, etc...)
# #                         if search_key in data:
# #                             name = data[search_key]
# #                             break
# #                         else:
# #                             name = None
# #                     return [(id_, name)]
# #                 data = data[key]
# #             else:
# #                 # print(f"Key {key} not found in dictionary.")
# #                 return None
# #             # print(f"Ending data: {type(data)} {type(temp_data)} {key}\n" )
# #         # *********************************************
# #         elif isinstance(data, list):
# #             # print(f"[INFO] List loop on key: {key}")
# #             # If the current data is a list, assume we want the first element
# #             if len(data) > 0:
# #                 # print(data)
# #                 if is_final_key:
# #                     # print("[INFO] is final key", key)
# #                     id_list = []
# #                     for data_dict in data:
# #                         if key in data_dict:
# #                             # print("[INFO] found key ", key)
# #                             id_ = data_dict[key]
# #                             # print("\n", id_, key)
# #                             # pprint.pprint(data_dict)
# #                             name = None
# #                             for search_key in ["SYMBOL", "symbol", "NAME", "name","drug_name"]: # **MAKE REGEX** (drug_name, etc...)
# #                                 if search_key in data_dict:
# #                                     name = data_dict[search_key]
# #                                     break
# #                             id_list.append((id_, name))
# #                         else:
# #                             return None # key not found -- this does happen , i.e orphanet in raresource
# #                     # print(f"Returning from list: {id_list}")
# #                     return id_list
# #                 # temp_data = temp_data[key]
# #                 # data = data[key]
# #             else:
# #                 print("Empty list encountered.")
# #                 return None
# #         # *********************************************
# #         elif isinstance(data, str):
# #             print(f"[INFO] String loop on key: {key}")

# #             # If this is the final key, check for specific keys in the list element
# #             if is_final_key and isinstance(temp_data, dict) and key in temp_data:
# #                 for search_key in ["SYMBOL", "symbol", "NAME", "name", "drug_name"]: # **MAKE REGEX** (drug_name, etc...)
# #                     if search_key in temp_data:
# #                         id_ = data
# #                         name = temp_data[search_key]
# #                         # print(f"Returning from list: {id_} | {name}")
# #                         return [(id_, name)]
# #             return(data, None)
# #         # *********************************************
# #         else:
# #             ...
# #         # *********************************************************************************************************
        

# #     return None

# def get_nested_value(data, key_path):
#     """
#     Retrieve a value from a nested dictionary using a dot-separated key path.
#     Handles lists if encountered during traversal.
#     Example: key_path = "raresource.disease.orphanet" will return the value of
#     data["raresource"]["disease"][0]["orphanet"] if "disease" is a list.
#     """

#     keys = key_path.split(".")  # Split the key path into individual keys
#     temp_data = data.copy()  # For reference
#     loop_ct = 0

#     for i, key in enumerate(keys):
#         loop_ct += 1
#         is_final_key = (i == len(keys) - 1)

#         if isinstance(data, dict) and key in data:
#             temp_data = data[key]
#             if isinstance(temp_data, str):  # Found a string ID
#                 id_ = temp_data
#                 for search_key in ["SYMBOL", "symbol", "NAME", "name", "drug_name"]:
#                     if search_key in data:
#                         name = data[search_key]
#                         break
#                     else:
#                         name = None
#                 # Return the entire dictionary at this level
#                 return {"id": id_, "name": name, "all_keys": data}
#             data = data[key]

#         elif isinstance(data, list):
#             if len(data) > 0:
#                 if is_final_key:
#                     id_list = []
#                     for data_dict in data:
#                         if key in data_dict:
#                             id_ = data_dict[key]
#                             name = None
#                             for search_key in ["SYMBOL", "symbol", "NAME", "name", "drug_name"]:
#                                 if search_key in data_dict:
#                                     name = data_dict[search_key]
#                                     break
#                             # Return the entire dictionary at this level
#                             id_list.append({"id": id_, "name": name, "all_keys": data_dict})
#                         else:
#                             return None  # Key not found
#                     return id_list
#             else:
#                 print("Empty list encountered.")
#                 return None

#         elif isinstance(data, str):
#             if is_final_key and isinstance(temp_data, dict) and key in temp_data:
#                 for search_key in ["SYMBOL", "symbol", "NAME", "name", "drug_name"]:
#                     if search_key in temp_data:
#                         id_ = data
#                         name = temp_data[search_key]
#                         # Return the entire dictionary at this level
#                         return {"id": id_, "name": name, "all_keys": temp_data}
#             return {"id": data, "name": None, "all_keys": temp_data}

#     return None

In [4]:
# def create_edge_mappings(edge_maps, provided_by):
#     # Initialize the output dictionary
#     relations = {}
#     # Process each triplet
#     for subject, predicate, obj in edge_maps:
#         bl_pred = BMT.get_element(predicate)
#         if bl_pred:
#             predicate = bl_pred["slot_uri"]
#         if predicate not in relations:
#             relations[predicate] = {"from": [], "to": [], "provided_by": provided_by}
        
#         # Add the subject to the "from" list if not already present
#         if subject not in relations[predicate]["from"]:
#             relations[predicate]["from"].append(subject)
        
#         # Add the object to the "to" list if not already present
#         if obj not in relations[predicate]["to"]:
#             relations[predicate]["to"].append(obj)

#     # Print the resulting dictionary
#     return relations


*****************************************  
In use code 

In [3]:
def get_nested_value(data, key_path, return_full=True):
    """
    Retrieve a value from a nested dictionary using a dot-separated key path.
    Handles lists if encountered during traversal.
    Example: key_path = "raresource.disease.orphanet" will return the value of
    data["raresource"]["disease"][0]["orphanet"] if "disease" is a list.

    Args:
        data (dict): The input dictionary to search.
        key_path (str): The dot-separated key path to traverse.
        return_full (bool): If True, return the full structure (id, name, all_keys).
                            If False, return just the ID value.

    Returns:
        dict, list, or str: The full structure or just the ID, depending on `return_full`.
    """
    keys = key_path.split(".")  # Split the key path into individual keys
    temp_data = data.copy()  # For reference
    loop_ct = 0

    for i, key in enumerate(keys):
        loop_ct += 1
        is_final_key = (i == len(keys) - 1)

        if isinstance(data, dict) and key in data:
            temp_data = data[key]
            if isinstance(temp_data, str):  # Found a string ID
                id_ = temp_data
                if not return_full:
                    return id_  # Return just the ID if return_full is False
                for search_key in ["SYMBOL", "symbol", "NAME", "name", "drug_name"]:
                    if search_key in data:
                        name = data[search_key]
                        break
                    else:
                        name = None
                return {"id": id_, "name": name, "all_keys": data}
            data = data[key]

        elif isinstance(data, list):
            if len(data) > 0:
                if is_final_key:
                    id_list = []
                    for data_dict in data:
                        if key in data_dict:
                            id_ = data_dict[key]
                            if not return_full:
                                id_list.append(id_)  # Append just the ID if return_full is False
                                continue
                            name = None
                            for search_key in ["SYMBOL", "symbol", "NAME", "name", "drug_name"]:
                                if search_key in data_dict:
                                    name = data_dict[search_key]
                                    break
                            id_list.append({"id": id_, "name": name, "all_keys": data_dict})
                        else:
                            return None  # Key not found
                    return id_list
            else:
                print("Empty list encountered.")
                return None

        elif isinstance(data, str):
            if is_final_key and isinstance(temp_data, dict) and key in temp_data:
                for search_key in ["SYMBOL", "symbol", "NAME", "name", "drug_name"]:
                    if search_key in temp_data:
                        id_ = data
                        if not return_full:
                            return id_  # Return just the ID if return_full is False
                        name = temp_data[search_key]
                        return {"id": id_, "name": name, "all_keys": temp_data}
            return {"id": data, "name": None, "all_keys": temp_data}

    return None
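

The core idea of the traversal, following a dot-separated path and fanning out when a list is encountered, can be shown in a much smaller form. This is a simplified sketch, not a replacement for `get_nested_value`: it returns a flat list of values rather than `id`/`name`/`all_keys` dictionaries, and it skips elements that lack the key instead of returning `None`. The helper name `collect_path` and the sample record are made up for illustration.

```python
def collect_path(data, key_path):
    """Collect every value reachable along a dot-separated path."""
    current = [data]
    for key in key_path.split("."):
        next_level = []
        for item in current:
            # Fan out over lists so each element is searched for the key.
            candidates = item if isinstance(item, list) else [item]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_level.append(candidate[key])
        current = next_level
    return current


# Made-up record shaped like the docstring example above.
record = {
    "raresource": {
        "disease": [
            {"orphanet": "101", "name": "disease A"},
            {"orphanet": "102", "name": "disease B"},
        ]
    }
}
print(collect_path(record, "raresource.disease.orphanet"))  # ['101', '102']
```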

In [4]:
def make_biolink_data_maps(data):
    """
    Create Biolink node and edge mappings from input data.
    """
    node_mappings = {}
    edge_mappings_updated = {}
    edge_maps = []

    for mkg_hit in data.get("hits", {}).get("hits", []):
        hit = mkg_hit["_source"]

        subject = hit["subject"]
        subject_prefix = hit["subject_prefix"]
        s_uri = BMT.get_element(subject)["class_uri"]
        # Not reliable -- currently using to get subject identification path, 
        # response mapping gives object details of edge, not subject.
        s_id_ref = hit["api"]["bte"]["query_operation"]["request_body"]["body"].get("scopes")

        object_ = hit["object"]
        object_prefix = hit["object_prefix"]
        o_uri = BMT.get_element(object_)["class_uri"]
        # Get fields from the query operation for the object
        o_properties = hit["api"]["bte"]["query_operation"]["params"].get("fields", "")
        o_properties = [field.strip() for field in o_properties.split(",")]

        # In the response mapping we get the object identification path,
        # and the edge ref_url. 
        # Should we extract out extra?
        o_id_ref = None
        ref_url_path = None  # stays None when the response mapping has no ref_url
        for key, value in hit["api"]["bte"]["response_mapping"].items():
            if isinstance(value, dict):
                for k, v in value.items():
                    if object_prefix in v:
                        o_id_ref = v
                    elif "ref_url" == k:
                        ref_url_path = v
                break

        predicate = hit["predicate"]
        pred_element = BMT.get_element(predicate)
        pred_uri = pred_element["slot_uri"] if pred_element else None

        # edge properties
        provided_by = hit["api"]["provided_by"]
        agent_type = hit["api"]["bte"]["query_operation"].get("agent_type")
        knowledge_level = hit["api"]["bte"]["query_operation"].get("knowledge_level")

        # ========================
        # EDGE MAPPING
        # ========================
        if pred_uri and BMT.is_translator_canonical_predicate(pred_uri):
            if pred_uri not in edge_mappings_updated:
                edge_maps.append((s_uri, pred_uri, o_uri))
                edge_mappings_updated[pred_uri] = {
                    "subject": s_uri,
                    "object": o_uri,
                    "primary_knowledge_source": provided_by,
                    "agent_type": agent_type,
                    "knowledge_level": knowledge_level,
                    "ref_url": ref_url_path
                }

        # ========================
        # NODE MAPPING: SUBJECT
        # ========================
        if s_uri not in node_mappings:
            node_mappings[s_uri] = {
                "prefix": subject_prefix,
                "identifier": s_id_ref,
                "path": s_id_ref,
                "properties": {}
            }

        # ========================
        # NODE MAPPING: OBJECT
        # ========================
        if o_uri not in node_mappings:
            node_mappings[o_uri] = {
                "prefix": object_prefix,
                "identifier": o_id_ref,
                "path": o_id_ref,
                "properties": {}
            }

        for field in o_properties:
            field_key = field.split(".")[-1]
            node_mappings[o_uri]["properties"][field_key] = field 


    return node_mappings, edge_maps, edge_mappings_updated


In [65]:
def extract_biothings_nodes(node_mappings, biothings_api_data):
    # by data doc
    # print(api_data['entrezgene'])

    for node_uri, node_dict in node_mappings.items():
        key_identifier = node_dict["identifier"]
        node_prefix = node_dict["prefix"]
        # Change all instance to node_uri == node_cat
        # node_cat = node_uri #[key for key,value in node_mappings.items() if value['prefix'] == node_prefix] # why are we going through items here?
        node_api_data = get_nested_value(biothings_api_data, key_identifier)
        # pprint.pprint(node_api_data)

        # print(f"[INFO] Node Typ: {node_uri} | Node ID: {key_identifier} | Prefix: {node_prefix}")
        # print(f"[INFO] Node category: {node_cat}")
        # print(f"[INFO] type: {type(node_ids)}")

        if isinstance(node_api_data, dict):
            unique_node_id =  f"{node_prefix}:{node_api_data['id']}"
            unique_node_name = node_api_data["name"] # Update logic

            # print(f"[INFO] Unique Node ID: {unique_node_id} | Unique Node Name: {unique_node_name}")

            if unique_node_id not in nodes:
                nodes[unique_node_id] = {
                    "id": unique_node_id,
                    "name": unique_node_name,
                    "category": [node_uri]
                    # "provided_by": [node_mappings[s_uri]['provided_by']]
                }

                for prop_key, prop_value in node_dict['properties'].items():
                    # Dotted and plain field paths are both looked up by their last key segment.
                    if prop_key in node_api_data["all_keys"]:
                        nodes[unique_node_id][prop_key] = node_api_data["all_keys"][prop_key]

        if isinstance(node_api_data, list):
            for id_ in node_api_data:
                unique_node_id =  f"{node_prefix}:{id_['id']}"
                unique_node_name = id_["name"] 
                if unique_node_id not in nodes:
                    nodes[unique_node_id] = {
                        "id": unique_node_id,
                        "name": unique_node_name,
                        "category": [node_uri],
                    }

                    for prop_key in node_dict['properties']:
                        # Test membership in "all_keys", the same dict the value is read from,
                        # so a key present only at the top level cannot raise a KeyError
                        if prop_key in id_["all_keys"]:
                            nodes[unique_node_id][prop_key] = id_["all_keys"][prop_key]
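
The node-building logic above can be summarised in a small standalone sketch. It is an illustration only: `build_node_records` and the sample records are hypothetical and are not part of the notebook's pipeline. It accepts either a single record or a list of records and keeps the first record seen for each CURIE.

```python
def build_node_records(records, prefix, category):
    """Return {curie: node_record}, keeping the first record seen per CURIE."""
    nodes_by_id = {}
    # Accept a single record (dict) or several records (list)
    if isinstance(records, dict):
        records = [records]
    for record in records:
        curie = f"{prefix}:{record['id']}"
        if curie in nodes_by_id:
            continue  # already recorded; skip the duplicate
        nodes_by_id[curie] = {
            "id": curie,
            "name": record["name"],
            "category": [category],  # KGX expects a list of categories
        }
    return nodes_by_id


# Hypothetical sample data (placeholder names, not real API output)
sample = [
    {"id": "277", "name": "Disease A"},
    {"id": "39041", "name": "Disease B"},
    {"id": "277", "name": "Disease A (repeat)"},
]
demo_nodes = build_node_records(sample, "orphanet", "biolink:Disease")
```

Keying the dictionary by CURIE is what makes the `not in nodes` check cheap, and it means the final node list can be produced with `list(demo_nodes.values())`.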

In [88]:
def extract_biothings_edges(edge_relations, edge_mappings, node_mappings, biothings_api_data):
    print("[INFO] Building edges ...")
    for p_uri, edge_map in edge_mappings.items():
        s_uri = edge_map["subject"]
        s_node_map = node_mappings[s_uri]
        s_api_data = get_nested_value(biothings_api_data, s_node_map["identifier"])
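        if isinstance(s_api_data, dict):  # a single subject record is expected here
            s_id = f"{s_node_map['prefix']}:{s_api_data['id']}"
        else:
            # Without a subject record, s_id would be undefined (or left over from a
            # previous iteration) when the edge key is built below, so skip this mapping
            continue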

        o_uri = edge_map["object"]
        o_node_map = node_mappings[o_uri]
        o_api_data = get_nested_value(biothings_api_data, o_node_map["identifier"])

        # ref_url_dict = get_nested_value(biothings_api_data, edge_map['ref_url'])
        ref_url_path = edge_map['ref_url'].split(".")[-1]

        if not isinstance(o_api_data, list):
            o_api_data = [o_api_data]

        for n in o_api_data:
            pprint.pprint(n)
            o_id = f"{o_node_map['prefix']}:{n['id']}"
            # One edge per subject-predicate-object triple; repeats overwrite the same key
            rel = f"{s_id}-{p_uri}-{o_id}"
            edges[rel] = {
                "subject": s_id,
                "predicate": p_uri,
                "object": o_id,
                "knowledge_level": edge_map['knowledge_level'],
                "agent_type": edge_map['agent_type'],
                "primary_knowledge_source": edge_map['primary_knowledge_source'],
                "attributes": {
                    "attribute_type_id":  "biolink:publications",
                    "value": [n["all_keys"][ref_url_path]],  # assumes the ref_url path is always among the object's keys
                    "value_type_id": "linkml:Uriorcurie"
                }
            }
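
As a minimal sketch of the edge-building step, the function below builds one edge per object record and keys each edge by its subject-predicate-object triple. `build_edge_records` is a hypothetical helper written for illustration; it includes only the three required KGX edge fields and leaves out the provenance fields used in the notebook.

```python
def build_edge_records(subject_id, predicate, object_records, object_prefix):
    """Return {key: edge_record} for one subject and one or more object records."""
    # Normalise a single object record to a one-element list
    if not isinstance(object_records, list):
        object_records = [object_records]
    edges_by_key = {}
    for record in object_records:
        object_id = f"{object_prefix}:{record['id']}"
        # The triple is the key, so a repeated triple overwrites rather than duplicates
        key = f"{subject_id}-{predicate}-{object_id}"
        edges_by_key[key] = {
            "subject": subject_id,
            "predicate": predicate,
            "object": object_id,
        }
    return edges_by_key


demo_edges = build_edge_records(
    "NCBIGene:100",
    "biolink:gene_associated_with_condition",
    [{"id": "39041"}, {"id": "277"}, {"id": "277"}],
    "orphanet",
)
```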



In [6]:
def format_kgx_dict(nodes, edges):
    kgx_dict = {
        "nodes": list(nodes.values()),
        "edges": list(edges.values())
    }
    return kgx_dict

# def get_nodes_and_edges(node_mappings, edge_mappings,client):
#     nodes,edges = get_biothings_api(node_mappings, client, edge_mappings)
#     return nodes, edges

# # def remove_duplicates(list_of_dicts):
# #     Remove duplicate dictionaries from a list of dictionaries.



In [81]:
# def extract_biothings_apis(node_mappings, edge_relations, edge_mappings, client):
#     print(f"[INFO] Extracting BioThings APIs.. ")
#     ct=0
#     # https://biothings.ci.transltr.io/rare_source/gene/100
#     for biothings_api_data in tqdm(client.query(q="__all__", fetch_all=True)):
#         # pprint.pprint(biothings_api_data['entrezgene'])
#         #     # print(ct)
#         #     print("*"*100)
#         # print("[INFO] Extracting nodes from BioThings API data...")
#         extract_biothings_nodes(node_mappings, biothings_api_data)
#         # print(f"[INFO] Unique Nodes: {len([x for x in nodes])}")
#         # pprint.pprint(nodes)
#         # print(f"[INFO] Extracting edges from BioThings API data...")
#         extract_biothings_edges(edge_relations, edge_mappings, node_mappings, biothings_api_data) 
#         # print(f"[INFO] Unique Edges: {len([x for x in edges])}")           
#         ct+=1
#         if ct >= 2:
#             break
#         # print("*"*100)
#         # print()

def extract_biothings_apis(node_mappings, edge_relations, edge_mappings, client):
    print("Extracting BioThings APIs...")
    # Fetch a single hard-coded test record; `client` is not used in this version
    url = "https://biothings.ci.transltr.io/rare_source/gene/100"
    print(f"[INFO] Extracting from URL: {url}")
    response = requests.get(url)

    # Extract only when the request succeeded
    if response.status_code == 200:
        biothings_api_data = response.json()

        extract_biothings_nodes(node_mappings, biothings_api_data)
        extract_biothings_edges(edge_relations, edge_mappings, node_mappings, biothings_api_data)
    else:
        print(f"[WARN] Request failed with status {response.status_code}; nothing extracted")

    # Results accumulate in the global `nodes` and `edges` dicts, so nothing is returned

In [55]:
def run_kgx_pipeline(data, client, debug=False):
    # === Global Initialization ===
    global nodes, edges
    nodes = {}
    edges = {}

    # === Step 1: Create Biolink mappings from SmartAPI metadata ===
    node_mappings, edge_relations, edge_mappings = make_biolink_data_maps(data)

    if debug:
        print("\n[DEBUG] Node Mappings:")
        pprint.pprint(node_mappings, indent=4)
        print("\n[DEBUG] Edge Relations:")
        pprint.pprint(edge_relations, indent=4)
        print("\n[DEBUG] Edge Mappings:")
        pprint.pprint(edge_mappings, indent=4)

    # === Step 1.1: Add hardcoded property mappings for Disease and Gene ===
    disease_dict = {
        "omim": "raresource.disease.omim",
        "gard": "raresource.disease.gard",
        "umls": "raresource.disease.umls",
        "mesh": "raresource.disease.mesh",
        "name": "raresource.disease.name",
        "icd10cm": "raresource.disease.icd10cm"
    }

    gene_dict = {
        "hgnc": "hgnc",
        "ensemblgene": "ensemblgene",
        "symbol": "symbol",
        "description": "description"
    }

    for s_uri, mapping in node_mappings.items():
        props = mapping.get("properties", {})

        if "Disease" in s_uri:
            for key, val in disease_dict.items():
                props.setdefault(key, val)

        if "Gene" in s_uri:
            for key, val in gene_dict.items():
                props.setdefault(key, val)

        # Remove overly generic 'disease' field
        props.pop("disease", None)
        mapping["properties"] = props

    # === Step 2: Extract nodes and edges using BioThings API ===
    extract_biothings_apis(node_mappings, edge_relations, edge_mappings, client)

    if debug:
        print(f"\n[DEBUG] Extracted {len(nodes)} nodes and {len(edges)} edges")

    # === Step 3: Format the graph as KGX-compatible dictionary ===
    kgx_dict = format_kgx_dict(nodes, edges)

    print("[INFO] Formatted dictionary")
    print("-"*100)
    return kgx_dict
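
Step 1.1 relies on `dict.setdefault`, which adds a key only when it is missing, so property paths that came from the SmartAPI metadata take precedence over the hardcoded defaults. A small sketch with made-up values (the `some.other.path` entry is purely illustrative):

```python
# Properties as they might arrive from the SmartAPI-derived mapping
mapping_props = {
    "umls": "raresource.disease.umls",
    "orphanet": "raresource.disease.orphanet",
}

# Hardcoded defaults; "umls" deliberately conflicts with the existing entry
defaults = {
    "umls": "some.other.path",
    "omim": "raresource.disease.omim",
}

for key, val in defaults.items():
    # Existing keys are left untouched; only missing keys are filled in
    mapping_props.setdefault(key, val)
```

After the loop, `umls` still points at `raresource.disease.umls` and `omim` has been added.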


In [62]:
def write_kgx_to_json(kgx_data, output_file):
    # Write the dictionary to a JSON file
    with open(output_file, "w") as json_file:
        json.dump(kgx_data, json_file, indent=4)  # Use indent for pretty formatting
    print(f"\n[INFO] KGX data written to {output_file}")
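
A quick way to check a written file is to read it back and confirm that every node still carries the required `id` and `category` fields. The sketch below does this with a temporary directory; the file name and the one-node graph are placeholders, not output of the pipeline.

```python
import json
import os
import tempfile

kgx_example = {
    "nodes": [{"id": "NCBIGene:100", "category": ["biolink:Gene"]}],
    "edges": [],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_kgx.json")
    with open(path, "w") as fh:
        json.dump(kgx_example, fh, indent=4)
    with open(path) as fh:
        reloaded = json.load(fh)

# Nodes that lack one of the required KGX fields
incomplete = [n for n in reloaded["nodes"] if not {"id", "category"} <= n.keys()]
```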

In [89]:
import json

api_name = "rare_source"
api_id = "b772ebfbfa536bba37764d7fddb11d6f"
client = biothings_client.get_client(url=f"https://biothings.ci.transltr.io/{api_name}")

# Construct the SmartAPI query URL
smartapi_url = f"https://smart-api.info/api/metakg/?q=api.smartapi.id:{api_id}&bte=1&meta=1&consolidated=0&size=100"
print(f"Querying SmartAPI: {smartapi_url}")

# Send the request and retrieve data
response = requests.get(smartapi_url)
data = response.json()
kgx_dict = run_kgx_pipeline(data, client, debug=True)

write_kgx_to_json(kgx_dict, "raresource_kgx_1-doc_example.json")


Querying SmartAPI: https://smart-api.info/api/metakg/?q=api.smartapi.id:b772ebfbfa536bba37764d7fddb11d6f&bte=1&meta=1&consolidated=0&size=100

[DEBUG] Node Mappings:
{   'biolink:Disease': {   'identifier': 'raresource.disease.orphanet',
                           'path': 'raresource.disease.orphanet',
                           'prefix': 'orphanet',
                           'properties': {   'cooccurrence_url': 'raresource.disease.cooccurrence_url',
                                             'orphanet': 'raresource.disease.orphanet',
                                             'umls': 'raresource.disease.umls'}},
    'biolink:Gene': {   'identifier': 'entrezgene',
                        'path': 'entrezgene',
                        'prefix': 'NCBIGene',
                        'properties': {   'disease': 'raresource.disease',
                                          'entrezgene': 'entrezgene',
                                          'symbol': 'symbol'}}}

[DEBUG] Edge Relati

---

In [39]:
# pprint.pprint(kgx_dict)

In [79]:
import csv

def write_kgx_to_tsv(kgx_dict, nodes_file, edges_file):
    """
    Write a KGX dictionary to TSV files for nodes and edges.

    Multivalued fields are joined with a pipe ('|'). Keys that are not listed
    in the column headers are skipped instead of raising a ValueError.
    """
    node_fields = ["id", "category", "name"]
    edge_fields = ["id", "subject", "predicate", "object", "relation",
                   "primary_knowledge_source", "category", "publications"]

    # Write nodes to nodes.tsv
    with open(nodes_file, mode="w", newline="", encoding="utf-8") as nodes_tsv:
        node_writer = csv.DictWriter(nodes_tsv, fieldnames=node_fields,
                                     delimiter="\t", extrasaction="ignore")
        node_writer.writeheader()  # Write the header row
        for node in kgx_dict["nodes"]:
            # Build a new row so the records in kgx_dict are not mutated
            row = {**node, "category": "|".join(node.get("category", []))}
            node_writer.writerow(row)

    # Write edges to edges.tsv
    with open(edges_file, mode="w", newline="", encoding="utf-8") as edges_tsv:
        edge_writer = csv.DictWriter(edges_tsv, fieldnames=edge_fields,
                                     delimiter="\t", extrasaction="ignore")
        edge_writer.writeheader()  # Write the header row
        for edge in kgx_dict["edges"]:
            row = {
                **edge,
                "category": "|".join(edge.get("category", [])),
                "publications": "|".join(edge.get("publications", [])),
            }
            edge_writer.writerow(row)
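
Two details of `csv.DictWriter` matter for KGX records: list values have to be joined into one string before writing, and a row containing keys outside `fieldnames` raises a `ValueError` unless `extrasaction="ignore"` is passed. The sketch below writes to an in-memory buffer so it can be run without touching the file system; the extra `symbol` key and its value are placeholders.

```python
import csv
import io

rows = [
    {
        "id": "NCBIGene:100",
        "category": ["biolink:Gene", "biolink:NamedThing"],
        "symbol": "EXAMPLE",  # not a declared column, so it is dropped
    }
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["id", "category"],
    delimiter="\t",
    extrasaction="ignore",   # skip keys that are not in fieldnames
    lineterminator="\n",
)
writer.writeheader()
for row in rows:
    # Join the multivalued field into a single pipe-delimited string
    writer.writerow({**row, "category": "|".join(row["category"])})

tsv_text = buffer.getvalue()
```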

In [119]:
BMT.is_translator_canonical_predicate("biolink:gene_associated_with_condition")
BMT.is_symmetric("biolink:gene_associated_with_condition")

False

In [58]:

seen_edges = set()
canonical_edges = []

for edge in kgx_dict["edges"]:
    subj = edge["subject"]
    obj = edge["object"]
    pred = edge["predicate"]

    # Get predicate element
    pred_element = BMT.get_element(pred)
    # print(dir(pred_element))
    # print(pred_element.name)

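    # pred_element.name is the space-separated label (e.g. "gene associated with condition"),
    # not the biolink: CURIE, so the predicate written back below is no longer a CURIE
    canonical_pred = pred_element.name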

    # Check for inverse
    inverse_pred = pred_element.inverse
    symmetric = pred_element.symmetric
    print(inverse_pred, symmetric)

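    # Hash keys. `inverse` reuses the same predicate label, so an edge stated with
    # the inverse predicate (see inverse_pred) is not matched by the checks below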
    direct = (subj, canonical_pred, obj)
    inverse = (obj, canonical_pred, subj)

    if symmetric:
        if direct in seen_edges or inverse in seen_edges:
            continue
    else:
        if inverse in seen_edges:
            continue

    seen_edges.add(direct)
    edge["predicate"] = canonical_pred
    canonical_edges.append(edge)

None None
None None
gene associated with condition None
gene associated with condition None


In [56]:
canonical_edges

[{'subject': 'NCBIGene:100',
  'predicate': 'gene associated with condition',
  'object': 'orphanet:39041',
  'knowledge_level': 'knowledge_assertion',
  'agent_type': 'manual_agent'},
 {'subject': 'NCBIGene:100',
  'predicate': 'gene associated with condition',
  'object': 'orphanet:277',
  'knowledge_level': 'knowledge_assertion',
  'agent_type': 'manual_agent'},
 {'subject': 'orphanet:39041',
  'predicate': 'condition associated with gene',
  'object': 'NCBIGene:100',
  'knowledge_level': 'knowledge_assertion',
  'agent_type': 'manual_agent'},
 {'subject': 'orphanet:277',
  'predicate': 'condition associated with gene',
  'object': 'NCBIGene:100',
  'knowledge_level': 'knowledge_assertion',
  'agent_type': 'manual_agent'}]
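
All four edges survive the loop above because each inverse edge is compared under its own predicate label. One possible way to collapse inverse pairs is to rewrite every edge into a single direction before de-duplicating. The sketch below is an assumption-laden illustration: `INVERSE_OF` is a hand-written table rather than a lookup through `BMT`, and it assumes that `biolink:gene_associated_with_condition` is the direction to keep.

```python
# Hypothetical inverse table; in the notebook this would come from the Biolink Model
INVERSE_OF = {
    "biolink:condition_associated_with_gene": "biolink:gene_associated_with_condition",
}


def to_canonical(edge):
    """Return a copy of the edge, flipped if its predicate has a preferred inverse."""
    pred = edge["predicate"]
    if pred in INVERSE_OF:
        return {
            **edge,
            "subject": edge["object"],
            "predicate": INVERSE_OF[pred],
            "object": edge["subject"],
        }
    return dict(edge)


def dedupe_edges(edge_list):
    """Keep the first edge seen for each canonical (subject, predicate, object) triple."""
    seen = set()
    result = []
    for edge in edge_list:
        canon = to_canonical(edge)
        key = (canon["subject"], canon["predicate"], canon["object"])
        if key not in seen:
            seen.add(key)
            result.append(canon)
    return result


example_edges = [
    {"subject": "NCBIGene:100", "predicate": "biolink:gene_associated_with_condition", "object": "orphanet:39041"},
    {"subject": "NCBIGene:100", "predicate": "biolink:gene_associated_with_condition", "object": "orphanet:277"},
    {"subject": "orphanet:39041", "predicate": "biolink:condition_associated_with_gene", "object": "NCBIGene:100"},
    {"subject": "orphanet:277", "predicate": "biolink:condition_associated_with_gene", "object": "NCBIGene:100"},
]
deduped = dedupe_edges(example_edges)
```

With this table, the four example edges reduce to two, both stated from gene to disease, and the predicates stay in CURIE form.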

---

In [87]:
# def validate_kgx(kgx_file):
!kgx validate -i json raresource_kgx_1-doc_example.json

{
    "ERROR": {
        "INVALID_NODE_PROPERTY_VALUE": {
            "Node property 'id' has a value 'orphanet:39041' with a CURIE prefix 'orphanet' is not represented in Biolink Model JSON-LD context": [
                "orphanet:39041",
                "orphanet:39041"
            ],
            "Node property 'id' has a value 'orphanet:277' with a CURIE prefix 'orphanet' is not represented in Biolink Model JSON-LD context": [
                "orphanet:277",
                "orphanet:277"
            ]
        },
        "INVALID_EDGE_PROPERTY_VALUE": {
            "Edge property 'object' has a value 'orphanet:39041' with a CURIE prefix 'orphanet' that is not represented in Biolink Model JSON-LD context": [
                "NCBIGene:100->orphanet:39041",
                "NCBIGene:100->orphanet:39041"
            ],
            "Edge property 'object' has a value 'orphanet:277' with a CURIE prefix 'orphanet' that is not represented in Biolink Model JSON-LD context": [
               

--

In [29]:
nodes = {}
edges={} 

api_name = "ddinter"
client = biothings_client.get_client("gene", url=f"https://biothings.ci.transltr.io/{api_name}")
url = "https://smart-api.info/api/metakg/?q=api.smartapi.id:00fb85fc776279163199e6c50f6ddfc6&bte=1&consolidated=0&size=100&meta=1"
print(f"Querying SmartAPI: {url}")
response = requests.get(url)
data = response.json()

kgx_dict = run_kgx_format_pipeline(data, client)

write_kgx_to_json(kgx_dict, "ddinter_kgx_1-doc-0509.json")


Querying SmartAPI: https://smart-api.info/api/metakg/?q=api.smartapi.id:00fb85fc776279163199e6c50f6ddfc6&bte=1&consolidated=0&size=100&meta=1

[INFO] Node Mappings:
{'biolink:SmallMolecule': {'identifier': 'drug_a.drugbank',
                           'path': 'drug_a.drugbank',
                           'prefix': 'DRUGBANK',
                           'properties': {'agent_type': 'manual_validation_of_automated_agent',
                                          'knowledge_level': 'knowledge_assertion'}}}
[INFO] Edge Mappings:
{'biolink:interacts_with': {'from': ['biolink:SmallMolecule'],
                            'provided_by': 'infores:ddinter',
                            'to': ['biolink:SmallMolecule']}}


160001it [01:54, 1795.52it/s]No more results to return.
160235it [01:54, 1400.44it/s]

Retrieved Nodes: 1 and Edges: 1
[INFO] Formatted dictionary





In [18]:
!kgx validate -i json ddinter_kgx_test2.json

{
    "ERROR": {
        "INVALID_NODE_PROPERTY_VALUE": {
            "Node property 'id' is expected to be of type 'CURIE'": [
                "biolink:SmallMolecule:DB00270",
                "biolink:SmallMolecule:DB00270"
            ]
        },
        "INVALID_CATEGORY": {
            "Category '{'category': ['biolink:SmallMolecule']}' is not in CamelCase form": [
                "biolink:SmallMolecule:DB00270",
                "biolink:SmallMolecule:DB00270"
            ],
            "Category '{'category': ['biolink:SmallMolecule']}' is unknown in the current Biolink Model": [
                "biolink:SmallMolecule:DB00270",
                "biolink:SmallMolecule:DB00270"
            ]
        },
        "MISSING_EDGE_PROPERTY": {
            "Required edge property 'knowledge_level' is missing": [
                "biolink:SmallMolecule:DB00270->biolink:SmallMolecule:DB00270",
                "biolink:SmallMolecule:DB00270->biolink:SmallMolecule:DB00270"
            ],
       

In [24]:
nodes = {}
edges={} 

api_name = "dgidb"
client = biothings_client.get_client(url=f"https://biothings.ci.transltr.io/{api_name}")
url = "https://smart-api.info/api/metakg/?q=api.smartapi.id:e3edd325c76f2992a111b43a907a4870&bte=1&consolidated=0&size=1000&meta=1"
response = requests.get(url)
data = response.json()

kgx_dict = run_kgx_format_pipeline(data, client)

write_kgx_to_json(kgx_dict, "dgidb_kgx_1-doc.json")
# # Print the results
# print("API Name:", api_name)
# print("*"*50)
# print("\nNode Mappings:") 
# pprint.pprint(node_mappings)
# print("\nEdge Mappings:")
# pprint.pprint(edge_map)


[INFO] Node Mappings:
{'biolink:Gene': {'agent_type': 'automated_agent',
                  'identifier': 'object.NCBIGene',
                  'path': 'object.NCBIGene',
                  'prefix': 'NCBIGene',
                  'provided_by': 'infores:dgidb'},
 'biolink:SmallMolecule': {'agent_type': 'automated_agent',
                           'identifier': 'subject.CHEMBL_COMPOUND',
                           'path': 'subject.CHEMBL_COMPOUND',
                           'prefix': 'CHEMBL.COMPOUND',
                           'provided_by': 'infores:dgidb'}}
[INFO] Edge Mappings:
{'biolink:affected_by': {'from': ['biolink:Gene'],
                         'to': ['biolink:SmallMolecule']},
 'biolink:affects': {'from': ['biolink:SmallMolecule'], 'to': ['biolink:Gene']},
 'biolink:interacts_with': {'from': ['biolink:SmallMolecule', 'biolink:Gene'],
                            'to': ['biolink:Gene', 'biolink:SmallMolecule']},
 'biolink:physically_interacts_with': {'from': ['biolink:SmallM

58001it [00:42, 1192.07it/s]No more results to return.
58690it [00:42, 1383.71it/s]

Retrieved Nodes: 2 and Edges: 6
[INFO] Formatted dictionary





---
Draft

In [None]:
nodes_file = "raresource_nodes.tsv"
edges_file = "raresource_edges.tsv"
write_kgx_to_tsv(kgx_dict, nodes_file, edges_file)

In [107]:
# Combine node_mappings and edge_map into a single dictionary
node_mappings = {**node_mappings, **edge_map}

pprint.pprint(node_mappings)

{'biolink:Disease': {'identifier': 'raresource.disease.orphanet',
                     'path': 'raresource.disease.orphanet',
                     'prefix': 'orphanet'},
 'biolink:Gene': {'identifier': 'entrezgene',
                  'path': 'entrezgene',
                  'prefix': 'NCBIGene'},
 'biolink:condition_associated_with_gene': {'from': ['biolink:Disease'],
                                            'to': ['biolink:Gene']},
 'biolink:gene_associated_with_condition': {'from': ['biolink:Gene'],
                                            'to': ['biolink:Disease']}}


In [98]:
nodes

[{'id': 'entrezgene:100', 'category': {'category': ['biolink:Gene']}},
 {'id': 'orphanet:39041', 'category': {'category': ['biolink:Disease']}},
 {'id': 'entrezgene:100', 'category': {'category': ['biolink:Gene']}},
 {'id': 'orphanet:39041', 'category': {'category': ['biolink:Disease']}},
 {'id': 'entrezgene:100', 'category': {'category': ['biolink:Gene']}},
 {'id': 'orphanet:39041', 'category': {'category': ['biolink:Disease']}},
 {'id': 'entrezgene:100', 'category': {'category': ['biolink:Gene']}},
 {'id': 'orphanet:39041', 'category': {'category': ['biolink:Disease']}},
 {'id': 'entrezgene:100', 'category': {'category': ['biolink:Gene']}},
 {'id': 'orphanet:39041', 'category': {'category': ['biolink:Disease']}},
 {'id': 'entrezgene:100', 'category': {'category': ['biolink:Gene']}},
 {'id': 'orphanet:39041', 'category': {'category': ['biolink:Disease']}}]
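
The list above repeats the same two ids and nests each category inside an extra dict, which is the kind of problem `kgx validate` later reports. A lightweight check run before writing the file can catch these early. The sketch below is a heuristic and not a replacement for KGX validation: `check_nodes` is a hypothetical helper, and the regular expression is a simplified CURIE test that only requires a single `prefix:local` split.

```python
import re

# Simplified CURIE shape: one prefix, one colon, one local id (an assumption, not the full spec)
CURIE_PATTERN = re.compile(r"^[A-Za-z_][\w.\-]*:[^\s:]+$")


def check_nodes(node_list):
    """Return a list of human-readable problems found in the node records."""
    problems = []
    seen_ids = set()
    for node in node_list:
        node_id = node.get("id", "")
        if not CURIE_PATTERN.match(node_id):
            problems.append(f"{node_id}: id is not a simple prefix:local CURIE")
        category = node.get("category")
        if not (isinstance(category, list) and all(isinstance(c, str) for c in category)):
            problems.append(f"{node_id}: category is not a list of strings")
        if node_id in seen_ids:
            problems.append(f"{node_id}: duplicate id")
        seen_ids.add(node_id)
    return problems


suspect_nodes = [
    {"id": "entrezgene:100", "category": {"category": ["biolink:Gene"]}},
    {"id": "entrezgene:100", "category": {"category": ["biolink:Gene"]}},
    {"id": "biolink:SmallMolecule:DB00270", "category": ["biolink:SmallMolecule"]},
]
found = check_nodes(suspect_nodes)
```

Here the first two records are flagged for the nested category (and the second for the repeated id), while the third is flagged because its id contains two colons.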

---