### KGX Converter Workflow – Step List
1. Select a BioThings API

2. Fetch SmartAPI metadata

3. Define Biolink mapping schema

4. Query the API

5. Preprocess and parse data

6. Build KGX node and edge records

7. Write nodes.tsv and edges.tsv

8. (Optional) Validate the KGX output

## 🧪 BioThings Data Fetcher
- Query BioThings APIs by the given API names (other selectors to be decided)

- Parse results to get:

    - Input NODE (e.g., gene)

    - Output NODE (e.g., drug)

    - Optional label fields

- Respect input/output semantic types

`📁 fetch_data.py`
```python
def call_api(api_url, input_id):
    # For GET: use biothings client
    # Parse subject and object based on SmartAPI info
    return [
        {
            "subject_id": "HGNC:1234",
            "subject_category": "biolink:Gene",
            "object_id": "CHEMBL:4567",
            "object_category": "biolink:Drug",
            "predicate": "biolink:interacts_with",
            "source": "DGIdb"
        }
    ]
```

In [4]:
import biothings_client
import requests

from tqdm import tqdm
import csv, os

import pprint

---

In [7]:
def generate_kgx_records(client, api_name=None, output_dir="kgx_output", max_records=3, print_data=False):
    node_map = {}  # {node_id: {id, name, category}}
    edge_list = []  # [{subject, predicate, object, relation}]
    for i, data in enumerate(tqdm(client.query(q="__all__", fetch_all=True))): #fields="object,subject,predicate",
        if i >= max_records:
            break
        pprint.pprint(data)
        print()
        if print_data:
            subject_dict = data["subject"]
            s_label = next(iter(subject_dict))  # Get the first key dynamically
            s_name = subject_dict["id"]

            object_dict = data["object"]
            o_label = next(iter(object_dict))
            o_name = object_dict["id"]

            associations = data["association"]

            print(s_label, o_label, s_name, o_name, associations)
            

In [52]:
test_client = biothings_client.get_client("gene", url="https://biothings.ci.transltr.io/semmeddb")

In [57]:
generate_kgx_records(test_client)

5it [00:01,  2.94it/s]

{'_id': 'C0042210-TREATS-C5547365',
 '_score': 1.0,
 'object': {'name': 'Pediatric population',
            'novelty': 1,
            'semantic_type_abbreviation': 'popg',
            'semantic_type_name': 'Population Group',
            'umls': 'C5547365'},
 'pmid_count': 3,
 'predicate': 'TREATS',
 'predication': [{'object_score': 1000,
                  'object_text': 'pediatric population',
                  'pmid': 35196125,
                  'predication_id': 199100602,
                  'sentence': 'However, as breast milk transmission of HIV '
                              'still occurs at an unacceptable rate, there '
                              'remains a need to develop an effective vaccine '
                              'for the pediatric population.',
                  'sentence_id': 380229252,
                  'subject_score': 888,
                  'subject_text': 'vaccine'},
                 {'object_score': 1000,
                  'object_text': 'pediatric populati




In [2]:
# Draft 

#### DGIDB

https://smart-api.info/api/metakg/?q=api.smartapi.id:e3edd325c76f2992a111b43a907a4870&bte=1&consolidated=0&subject=%22SmallMolecule%22

In [3]:
api_name = "dgidb"
subject = "SmallMolecule"
client = biothings_client.get_client(url=f"https://biothings.ci.transltr.io/{api_name}")

In [None]:
metakg_edges = [
    ("biolink:Drug", "biolink:affects", "biolink:Gene"),
    ("biolink:Gene", "biolink:affected_by", "biolink:Drug"),
]


In [None]:
biolink_map = {
    # Entities
    "biolink:Drug": {
        "path": "subject",
        "identifier": "CHEMBL_COMPOUND",
        "id_prefix": "CHEMBL.COMPOUND",
        "properties": {
            "name": "drug_name"
        }
    },
    "biolink:Gene": {
        "path": "object",
        "identifier": "NCBIGene",
        "id_prefix": "NCBIGene",
        "properties": {
            "symbol": "SYMBOL"
        }
    },

    # Predicates (Relations)
    "biolink:affects": {
        "from": ["biolink:Drug"],
        "to": ["biolink:Gene"]
    },
    "biolink:affected_by": {
        "from": ["biolink:Gene"],
        "to": ["biolink:Drug"]
    }
}
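
A minimal sketch of how an entity entry in `biolink_map` might be applied to one API record. The record shape (a `subject` object holding `CHEMBL_COMPOUND` and `drug_name`, an `object` holding `NCBIGene` and `SYMBOL`), the placeholder symbol value, and the helper name `build_node` are assumptions for illustration, not confirmed DGIdb output.

```python
# Entity part of the mapping above, repeated so the sketch is self-contained.
entity_map = {
    "biolink:Drug": {
        "path": "subject",
        "identifier": "CHEMBL_COMPOUND",
        "id_prefix": "CHEMBL.COMPOUND",
        "properties": {"name": "drug_name"},
    },
    "biolink:Gene": {
        "path": "object",
        "identifier": "NCBIGene",
        "id_prefix": "NCBIGene",
        "properties": {"symbol": "SYMBOL"},
    },
}

# Hypothetical record; the real document layout may differ.
sample_record = {
    "subject": {"CHEMBL_COMPOUND": "CHEMBL266510", "drug_name": "FLINDOKALNER"},
    "object": {"NCBIGene": "9132", "SYMBOL": "EXAMPLE_SYMBOL"},
}


def build_node(record, category, mapping):
    """Build one KGX-style node dict for `category` from a single record."""
    spec = mapping[category]
    source = record[spec["path"]]
    raw_id = str(source[spec["identifier"]])
    prefix = spec["id_prefix"]
    # Avoid doubling the prefix when the API already returns a full CURIE.
    curie = raw_id if raw_id.startswith(prefix + ":") else f"{prefix}:{raw_id}"
    node = {"id": curie, "category": [category]}
    # Copy optional properties only when the record actually has them.
    for kgx_field, source_field in spec["properties"].items():
        if source_field in source:
            node[kgx_field] = source[source_field]
    return node


drug_node = build_node(sample_record, "biolink:Drug", entity_map)
gene_node = build_node(sample_record, "biolink:Gene", entity_map)
```

Keeping the field names in the mapping rather than in the function means a new API only needs a new mapping entry.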


---

In [26]:
def create_biolink_mappings(data):
    # Initialize the dictionary for entity mappings
    entity_mappings = {}
    unique_prefixes = set()
    # Loop over the hits and extract the relevant prefixes
    for hit in data["hits"]:
        subject_prefix = hit["subject_prefix"]
        object_prefix = hit["object_prefix"]
        # Accumulate prefixes across hits instead of overwriting the set each time
        unique_prefixes.update([subject_prefix, object_prefix])

        # Map subject_prefix to Biolink entity (if it's not already in the dictionary)
        if subject_prefix == "CHEMBL.COMPOUND":
            entity_mappings[subject_prefix] = "biolink:Drug"
        elif subject_prefix == "NCBIGene":
            entity_mappings[subject_prefix] = "biolink:Gene"
        elif subject_prefix == "UniProt":
            entity_mappings[subject_prefix] = "biolink:Protein"
        elif subject_prefix == "GO":
            entity_mappings[subject_prefix] = "biolink:BiologicalProcess"

        # Map object_prefix to Biolink entity (if it's not already in the dictionary)
        if object_prefix == "CHEMBL.COMPOUND":
            entity_mappings[object_prefix] = "biolink:Drug"
        elif object_prefix == "NCBIGene":
            entity_mappings[object_prefix] = "biolink:Gene"
        elif object_prefix == "UniProt":
            entity_mappings[object_prefix] = "biolink:Protein"
        elif object_prefix == "GO":
            entity_mappings[object_prefix] = "biolink:BiologicalProcess"

    # Print the resulting entity mappings
    # print(entity_mappings)
    print(f"Unique mappings: {len(entity_mappings)}")
    return entity_mappings    


In [58]:
def get_biothings_api(subject_id, node_dict, node_set):
    # Query by subject ID; relies on the module-level `client`.
    bt_query = f'subject.id:"{subject_id}"'

    for data in tqdm(client.query(q=bt_query, fetch_all=True)):
        node_id = data["subject"]["id"]
        node_name = data["subject"]["drug_name"]
        node_dict.update({"id": node_id, "name": node_name})
        # Dicts are unhashable, so store each node as a frozenset of its items.
        node_set.add(frozenset(node_dict.items()))

    return node_set

In [59]:
def get_smartapi_data(data, api_name, biolink_mapping, node_set, edge_list):
    for hit in data["hits"]:
        node_dict = {}
        edge_dict = {}
        subject_prefix = hit["subject_prefix"]
        object_prefix = hit["object_prefix"]
        predicate = hit["predicate"]
        full_subject = hit['api']['bte']['query_operation']['testExamples'][0]['qInput']
        full_object = hit['api']['bte']['query_operation']['testExamples'][0]['oneOutput']
        # Update Node Dict
        for prefix in [subject_prefix, object_prefix]:
            if prefix not in biolink_mapping:
                print(f"Warning: {prefix} not found in biolink mapping.")
                continue
            biolink_entity = biolink_mapping.get(prefix)
            node_dict["category"] =  (biolink_entity,)
            node_set = get_biothings_api(full_subject, node_dict, node_set)

        # Update Edge Dict
        # # edge_dict["id"] = # CONFIRM THIS ID
        edge_dict["subject"] = full_subject
        edge_dict["predicate"] = f"biolink:{predicate}"
        edge_dict["object"] = full_object
        edge_list.append(edge_dict)
    node_list = [dict(node) for node in node_set]
    # If needed, convert back to a list of dictionaries
    return node_list,edge_list


In [60]:
import json

def write_to_json_file(data, output_file):
        
    # Write the dictionary to a JSON file
    with open(output_file, "w") as json_file:
        json.dump(data, json_file, indent=4)  # Use indent for pretty formatting



In [61]:
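def create_lists():
    node_set = set()
    edge_list = []
    biolink_mapping = create_biolink_mappings(data)
    # get_smartapi_data returns a (node_list, edge_list) tuple; unpack it explicitly.
    node_list, edge_list = get_smartapi_data(data, api_name, biolink_mapping, node_set, edge_list)
    return node_list, edge_list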

In [62]:
api_name = "dgidb"
subject = "SmallMolecule"
client = biothings_client.get_client("gene", url=f"https://biothings.ci.transltr.io/{api_name}")
url = "https://smart-api.info/api/metakg/?q=api.smartapi.id:e3edd325c76f2992a111b43a907a4870&bte=1&consolidated=0&subject=%22SmallMolecule%22&size=100"
response = requests.get(url)
data = response.json()

In [63]:
nodes_list, edge_list = create_lists()

Entity mappings: {'CHEMBL.COMPOUND': 'biolink:Drug', 'NCBIGene': 'biolink:Gene'}
Unique mappings: 2
Unique prefixes encountered: 2


1it [00:00,  3.68it/s]No more results to return.
9it [00:00, 11.28it/s]
1it [00:00,  2.92it/s]No more results to return.
9it [00:00, 15.63it/s]
1it [00:00,  4.14it/s]No more results to return.
10it [00:00, 15.05it/s]
1it [00:00,  6.35it/s]No more results to return.
10it [00:00, 18.71it/s]
1it [00:00,  2.17it/s]No more results to return.
5it [00:00,  5.17it/s]
1it [00:00,  3.41it/s]No more results to return.
5it [00:00,  8.09it/s]
1it [00:00,  2.13it/s]No more results to return.
1it [00:00,  1.16it/s]
1it [00:00,  6.59it/s]No more results to return.
1it [00:00,  3.01it/s]
1it [00:00,  3.70it/s]No more results to return.
3it [00:00,  5.78it/s]
1it [00:00,  3.80it/s]No more results to return.
3it [00:00,  6.04it/s]
1it [00:00,  3.66it/s]No more results to return.
15it [00:00, 29.15it/s]
1it [00:00,  2.72it/s]No more results to return.
15it [00:01, 14.68it/s]
1it [00:00,  1.68it/s]No more results to return.
55it [00:00, 66.71it/s]
1it [00:00,  2.78it/s]No more results to return.
55it [00:0

In [64]:
for node in nodes_list:
    pprint.pprint(node)

{'category': ('biolink:Drug',),
 'id': 'CHEMBL.COMPOUND:CHEMBL3545038',
 'name': 'S-237648'}
{'category': ('biolink:Drug',),
 'id': 'CHEMBL.COMPOUND:CHEMBL1200986',
 'name': 'HALOPERIDOL DECANOATE'}
{'category': ('biolink:Gene',),
 'id': 'CHEMBL.COMPOUND:CHEMBL266510',
 'name': 'FLINDOKALNER'}
{'category': ('biolink:Drug',),
 'id': 'CHEMBL.COMPOUND:CHEMBL3305985',
 'name': 'ALCURONIUM'}
{'category': ('biolink:Gene',),
 'id': 'CHEMBL.COMPOUND:CHEMBL1200986',
 'name': 'HALOPERIDOL DECANOATE'}
{'category': ('biolink:Drug',),
 'id': 'CHEMBL.COMPOUND:CHEMBL266510',
 'name': 'FLINDOKALNER'}
{'category': ('biolink:Gene',),
 'id': 'CHEMBL.COMPOUND:CHEMBL3305985',
 'name': 'ALCURONIUM'}
{'category': ('biolink:Gene',),
 'id': 'CHEMBL.COMPOUND:CHEMBL3301626',
 'name': 'BASIMGLURANT'}
{'category': ('biolink:Drug',),
 'id': 'CHEMBL.COMPOUND:CHEMBL3301626',
 'name': 'BASIMGLURANT'}
{'category': ('biolink:Drug',),
 'id': 'CHEMBL.COMPOUND:CHEMBL1200833',
 'name': 'DIPIVEFRIN HYDROCHLORIDE'}
{'category
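
In the node output above, the same `CHEMBL.COMPOUND` IDs appear under both `biolink:Drug` and `biolink:Gene`, because the loop in `get_smartapi_data` looks up `full_subject` for both prefixes. One possible fix, sketched here without any API call, pairs each prefix with its own example CURIE before a category is assigned. The helper name `pair_nodes` is hypothetical; the hit shape follows the fields already used above.

```python
def pair_nodes(hit, biolink_mapping):
    """Return one (curie, category) pair per endpoint of a MetaKG hit."""
    example = hit["api"]["bte"]["query_operation"]["testExamples"][0]
    # Keep each prefix next to the CURIE it actually describes.
    endpoints = [
        (hit["subject_prefix"], example["qInput"]),
        (hit["object_prefix"], example["oneOutput"]),
    ]
    pairs = []
    for prefix, curie in endpoints:
        category = biolink_mapping.get(prefix)
        if category is None:
            print(f"Warning: {prefix} not found in biolink mapping.")
            continue
        pairs.append((curie, category))
    return pairs


# Sample hit assembled from values that appear in the outputs of this notebook.
sample_hit = {
    "subject_prefix": "CHEMBL.COMPOUND",
    "object_prefix": "NCBIGene",
    "api": {"bte": {"query_operation": {"testExamples": [
        {"qInput": "CHEMBL.COMPOUND:CHEMBL266510", "oneOutput": "NCBIGene:9132"}
    ]}}},
}
sample_mapping = {"CHEMBL.COMPOUND": "biolink:Drug", "NCBIGene": "biolink:Gene"}
pairs = pair_nodes(sample_hit, sample_mapping)
```

Each pair can then be sent to its own lookup, so a drug ID is never labelled as a gene.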

In [65]:
for edge in edge_list[:4]:
    pprint.pprint(edge)

{'object': 'NCBIGene:9132',
 'predicate': 'biolink:affects',
 'subject': 'CHEMBL.COMPOUND:CHEMBL266510'}
{'object': 'NCBIGene:155',
 'predicate': 'biolink:affects',
 'subject': 'CHEMBL.COMPOUND:CHEMBL1200833'}
{'object': 'NCBIGene:1132',
 'predicate': 'biolink:affects',
 'subject': 'CHEMBL.COMPOUND:CHEMBL3305985'}
{'object': 'NCBIGene:4889',
 'predicate': 'biolink:affects',
 'subject': 'CHEMBL.COMPOUND:CHEMBL3545038'}


In [66]:
# Create a dictionary with the required structure
data = {
    "nodes": nodes_list,
    "edges": edge_list
}

In [67]:
pprint.pprint(data)

{'edges': [{'object': 'NCBIGene:9132',
            'predicate': 'biolink:affects',
            'subject': 'CHEMBL.COMPOUND:CHEMBL266510'},
           {'object': 'NCBIGene:155',
            'predicate': 'biolink:affects',
            'subject': 'CHEMBL.COMPOUND:CHEMBL1200833'},
           {'object': 'NCBIGene:1132',
            'predicate': 'biolink:affects',
            'subject': 'CHEMBL.COMPOUND:CHEMBL3305985'},
           {'object': 'NCBIGene:4889',
            'predicate': 'biolink:affects',
            'subject': 'CHEMBL.COMPOUND:CHEMBL3545038'},
           {'object': 'NCBIGene:80380',
            'predicate': 'biolink:physically_interacts_with',
            'subject': 'CHEMBL.COMPOUND:CHEMBL4297570'},
           {'object': 'NCBIGene:7226',
            'predicate': 'biolink:affects',
            'subject': 'CHEMBL.COMPOUND:CHEMBL91'},
           {'object': 'NCBIGene:673',
            'predicate': 'biolink:affects',
            'subject': 'CHEMBL.COMPOUND:CHEMBL1229517'},
         

In [68]:
outfile = "dgidb_smallmolecule_kgx.json"
write_to_json_file(data, outfile)

In [72]:
## ADD TSV OUTPUT
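
A minimal sketch of the missing TSV step, using only the standard-library `csv` module. The column choices and the helper name `write_kgx_tsv` are assumptions. Multi-valued fields such as `category` are joined with `|`, which I understand to be the list delimiter in KGX TSV files; worth confirming against the KGX format docs.

```python
import csv
import io


def write_kgx_tsv(records, columns, handle):
    """Write dict records as tab-separated rows with a header line."""
    writer = csv.DictWriter(
        handle, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        row = {}
        for column in columns:
            value = record.get(column, "")
            # Flatten multi-valued fields (e.g. category tuples) into one cell.
            if isinstance(value, (list, tuple)):
                value = "|".join(str(v) for v in value)
            row[column] = value
        writer.writerow(row)


# Sample records taken from the node and edge outputs above.
sample_nodes = [
    {"id": "CHEMBL.COMPOUND:CHEMBL266510", "name": "FLINDOKALNER", "category": ("biolink:Drug",)}
]
sample_edges = [
    {"subject": "CHEMBL.COMPOUND:CHEMBL266510", "predicate": "biolink:affects", "object": "NCBIGene:9132"}
]

# In-memory buffers keep the sketch side-effect free.
nodes_buffer = io.StringIO()
write_kgx_tsv(sample_nodes, ["id", "category", "name"], nodes_buffer)
edges_buffer = io.StringIO()
write_kgx_tsv(sample_edges, ["subject", "predicate", "object"], edges_buffer)
```

To write to disk, pass `open("nodes.tsv", "w", newline="")` (and likewise for `edges.tsv`) as `handle` instead of a buffer.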

In [46]:
!pip install kgx


Collecting kgx
  Downloading kgx-2.4.2-py3-none-any.whl.metadata (8.9 kB)
Collecting SPARQLWrapper>=1.8.2 (from kgx)
  Downloading SPARQLWrapper-2.0.0-py3-none-any.whl.metadata (2.0 kB)
Collecting docker<7.0.0,>=6.0.0 (from kgx)
  Downloading docker-6.1.3-py3-none-any.whl.metadata (3.5 kB)
Collecting ijson<4.0.0,>=3.1.3 (from kgx)
  Downloading ijson-3.3.0-cp310-cp310-macosx_10_9_x86_64.whl.metadata (21 kB)
Collecting inflection<0.6.0,>=0.5.1 (from kgx)
  Downloading inflection-0.5.1-py2.py3-none-any.whl.metadata (1.7 kB)
Collecting jsonlines<5.0.0,>=4.0.0 (from kgx)
  Downloading jsonlines-4.0.0-py3-none-any.whl.metadata (1.6 kB)
Collecting jsonstreams<0.7.0,>=0.6.0 (from kgx)
  Downloading jsonstreams-0.6.0-py2.py3-none-any.whl.metadata (3.8 kB)
Collecting linkml<2.0.0,>=1.7.7 (from kgx)
  Downloading linkml-1.9.1-py3-none-any.whl.metadata (3.7 kB)
Collecting mypy (from kgx)
  Downloading mypy-1.15.0-cp310-cp310-macosx_10_9_x86_64.whl.metadata (2.1 kB)
Collecting neo4j>=4.4.10 (from 

In [49]:
!kgx validate -i json dgidb_smallmolecule_kgx.json


{
    "ERROR": {
        "MISSING_EDGE_PROPERTY": {
            "Required edge property 'knowledge_level' is missing": [
                "CHEMBL.COMPOUND:CHEMBL30219->NCBIGene:2565",
                "CHEMBL.COMPOUND:CHEMBL30219->NCBIGene:2565"
            ],
            "Required edge property 'agent_type' is missing": [
                "CHEMBL.COMPOUND:CHEMBL30219->NCBIGene:2565",
                "CHEMBL.COMPOUND:CHEMBL30219->NCBIGene:2565"
            ]
        }
    }
}
[KGX][__init__.py][    validate_wrapper] ERROR: kgx.validate() errors encountered... check the error log
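
The validator reports two required edge properties as missing. A minimal sketch of filling them in before the file is written follows. `not_provided` is used as a placeholder because I believe it is a permitted value in both Biolink enumerations; the correct values for each source are an assumption that still needs confirming. The helper name `add_required_edge_properties` is hypothetical.

```python
def add_required_edge_properties(edges, knowledge_level="not_provided",
                                 agent_type="not_provided"):
    """Return copies of the edges with the two required properties set."""
    completed = []
    for edge in edges:
        updated = dict(edge)  # copy, so the input list is left untouched
        # setdefault keeps any value an edge already carries.
        updated.setdefault("knowledge_level", knowledge_level)
        updated.setdefault("agent_type", agent_type)
        completed.append(updated)
    return completed


# Sample edge taken from the edge output above.
sample_edges = [
    {"subject": "CHEMBL.COMPOUND:CHEMBL266510",
     "predicate": "biolink:affects",
     "object": "NCBIGene:9132"}
]
completed_edges = add_required_edge_properties(sample_edges)
```

Re-running `kgx validate` on the rewritten file would show whether anything else is still missing.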


___

#### DDINTER

In [50]:
def create_biolink_mappings(data):
    # Define a mapping of prefixes to Biolink entities
    prefix_to_biolink = {
        "CHEMBL.COMPOUND": "biolink:Drug",
        "NCBIGene": "biolink:Gene",
        "UniProt": "biolink:Protein",
        "GO": "biolink:BiologicalProcess",
        "DRUGBANK": "biolink:Drug"
    }
    
    # Initialize the dictionary for entity mappings and set for unique prefixes
    entity_mappings = {}
    unique_prefixes = set()
    
    try:
        # Loop over the hits and extract the relevant prefixes
        for hit in data["hits"]:
            subject_prefix = hit.get("subject_prefix")
            object_prefix = hit.get("object_prefix")
            
            # Add prefixes to unique_prefixes set
            if subject_prefix:
                unique_prefixes.add(subject_prefix)
            if object_prefix:
                unique_prefixes.add(object_prefix)
            
            # Map prefixes to Biolink entities if they exist in the mapping
            for prefix in [subject_prefix, object_prefix]:
                if prefix in prefix_to_biolink:
                    entity_mappings[prefix] = prefix_to_biolink[prefix]
        
        # Print the results
        print(f"Entity mappings: {entity_mappings}")
        print(f"Unique mappings: {len(entity_mappings)}")
        print(f"Unique prefixes encountered: {len(unique_prefixes)}")
        
        return entity_mappings
    
    except KeyError as e:
        print(f"Error: Missing key in data - {e}")
        return {}

In [118]:
def get_biothings_api(subject_id, node_dict, node_data):
    parts = subject_id.split(":")
    subject_id_lowercased = f"{parts[0].lower()}:{parts[1]}"
    bt_query = f'drug_a.{subject_id_lowercased}'
    print(bt_query)
    bt_data = []

    for data in tqdm(client.query(q=bt_query, fetch_all=True)):
        for drug_key in ["drug_a", "drug_b"]:
            drug = data.get(drug_key, {})
            if not drug:
                continue

            curie_id = None
            for prefix in ["drugbank", "chembl", "pubchem"]:
                if prefix in drug:
                    curie_id = f"{prefix}:{drug[prefix]}"
                    break

            if not curie_id:
                print(f"No valid CURIE found for {drug_key} in entry: {data.get('_id')}")
                continue

            node = {
                "id": curie_id,
                "name": drug.get("name", "Unknown"),
                "category": node_dict.get("category", "biolink:Drug")
            }

            # Add to dict by CURIE key
            node_data[curie_id] = node

    return node_data

In [119]:
def get_smartapi_data(data, api_name, biolink_mapping, node_data, edge_list):
    for hit in data["hits"]:
        node_dict = {}
        edge_dict = {}
        subject_prefix = hit["subject_prefix"]
        object_prefix = hit["object_prefix"]
        predicate = hit["predicate"]
        full_subject = hit['api']['bte']['query_operation']['testExamples'][0]['qInput']
        full_object = hit['api']['bte']['query_operation']['testExamples'][0]['oneOutput']
        
        # Update Node Dict
        for prefix in [subject_prefix, object_prefix]:
            if prefix not in biolink_mapping:
                print(f"Warning: {prefix} not found in biolink mapping.")
                continue
            biolink_entity = biolink_mapping.get(prefix)
            node_dict["category"] =  (biolink_entity,)
            node_set = get_biothings_api(full_subject, node_dict, node_data)

        # Update Edge Dict
        # # edge_dict["id"] = # CONFIRM THIS ID
        edge_dict["subject"] = full_subject
        edge_dict["predicate"] = f"biolink:{predicate}"
        edge_dict["object"] = full_object
        edge_list.append(edge_dict)
    # node_list = [dict(node) for node in node_set]
    # If needed, convert back to a list of dictionaries
    return node_data,edge_list


In [120]:
api_name = "ddinter"
subject = "SmallMolecule"
client = biothings_client.get_client("gene", url=f"https://biothings.ci.transltr.io/{api_name}")
url = "https://smart-api.info/api/metakg/?q=api.smartapi.id:00fb85fc776279163199e6c50f6ddfc6&bte=1&consolidated=0&size=100"
response = requests.get(url)
data = response.json()

In [122]:
biolink_mapping = create_biolink_mappings(data)


Entity mappings: {'DRUGBANK': 'biolink:Drug'}
Unique mappings: 1
Unique prefixes encountered: 1


In [123]:
node_data = {}
edge_list=[]
node_list,edge_list = get_smartapi_data(data, api_name, biolink_mapping, node_data, edge_list)


drug_a.drugbank:DB00244


1it [00:01,  1.16s/it]No more results to return.
268it [00:01, 187.10it/s]


drug_a.drugbank:DB00244


1it [00:00,  1.80it/s]No more results to return.
268it [00:00, 313.42it/s]


drug_a.drugbank:DB00540


1it [00:00,  2.14it/s]No more results to return.
218it [00:01, 209.15it/s]


drug_a.drugbank:DB00540


1it [00:00,  2.69it/s]No more results to return.
218it [00:00, 307.83it/s]


In [124]:
node_list

{'drugbank:DB00244': {'id': 'drugbank:DB00244',
  'name': 'Mesalazine',
  'category': ('biolink:Drug',)},
 'drugbank:DB00959': {'id': 'drugbank:DB00959',
  'name': 'Methylprednisolone',
  'category': ('biolink:Drug',)},
 'drugbank:DB00994': {'id': 'drugbank:DB00994',
  'name': 'Neomycin',
  'category': ('biolink:Drug',)},
 'drugbank:DB00646': {'id': 'drugbank:DB00646',
  'name': 'Nystatin',
  'category': ('biolink:Drug',)},
 'drugbank:DB00642': {'id': 'drugbank:DB00642',
  'name': 'Pemetrexed',
  'category': ('biolink:Drug',)},
 'drugbank:DB00738': {'id': 'drugbank:DB00738',
  'name': 'Pentamidine',
  'category': ('biolink:Drug',)},
 'drugbank:DB00958': {'id': 'drugbank:DB00958',
  'name': 'Carboplatin',
  'category': ('biolink:Drug',)},
 'drugbank:DB00291': {'id': 'drugbank:DB00291',
  'name': 'Chlorambucil',
  'category': ('biolink:Drug',)},
 'drugbank:DB00515': {'id': 'drugbank:DB00515',
  'name': 'Cisplatin',
  'category': ('biolink:Drug',)},
 'drugbank:DB13867': {'id': 'drugbank:D

In [125]:
# Create a dictionary with the required structure
data = {
    "nodes": node_list,
    "edges": edge_list
}

In [126]:
pprint.pprint(data)

{'edges': [{'object': 'DRUGBANK:DB00414',
            'predicate': 'biolink:interacts_with',
            'subject': 'DRUGBANK:DB00244'},
           {'object': 'DRUGBANK:DB00451',
            'predicate': 'biolink:interacts_with',
            'subject': 'DRUGBANK:DB00540'}],
 'nodes': {'drugbank:DB00244': {'category': ('biolink:Drug',),
                                'id': 'drugbank:DB00244',
                                'name': 'Mesalazine'},
           'drugbank:DB00254': {'category': ('biolink:Drug',),
                                'id': 'drugbank:DB00254',
                                'name': 'Doxycycline'},
           'drugbank:DB00264': {'category': ('biolink:Drug',),
                                'id': 'drugbank:DB00264',
                                'name': 'Metoprolol'},
           'drugbank:DB00266': {'category': ('biolink:Drug',),
                                'id': 'drugbank:DB00266',
                                'name': 'Dicoumarol'},
           'drugbank

In [127]:
outfile = "ddinter_kgx.json"
write_to_json_file(data, outfile)
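
In the DDInter output above, node IDs use a lowercase `drugbank:` prefix while edge subjects and objects use `DRUGBANK:`, and `nodes` is a dict keyed by CURIE rather than a list. A possible cleanup step before writing is sketched below, assuming the uppercase form from the MetaKG examples is the one to keep. `normalize_curie` and `nodes_as_list` are hypothetical helper names.

```python
def normalize_curie(curie, prefix_case=str.upper):
    """Re-case the prefix of a CURIE; leave strings without a colon alone."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        return curie
    return f"{prefix_case(prefix)}:{local_id}"


def nodes_as_list(node_data):
    """Turn a {curie: node} dict into a de-duplicated list of nodes."""
    nodes = {}
    for node in node_data.values():
        fixed = dict(node, id=normalize_curie(node["id"]))
        nodes[fixed["id"]] = fixed  # keyed by normalized ID to drop duplicates
    return list(nodes.values())


# Sample entry taken from the node output above.
sample_node_data = {
    "drugbank:DB00244": {
        "id": "drugbank:DB00244",
        "name": "Mesalazine",
        "category": ("biolink:Drug",),
    },
}
clean_nodes = nodes_as_list(sample_node_data)
```

With this step, `data = {"nodes": clean_nodes, "edges": edge_list}` would give node IDs that match the edge endpoints.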

---

#### RARE_SOURCE

In [21]:
from bmt import Toolkit

#### Initial Variables

In [54]:
api_name = "rare_source"
api_id = "b772ebfbfa536bba37764d7fddb11d6f"
client = biothings_client.get_client("gene", url=f"https://biothings.ci.transltr.io/{api_name}")
url = f"https://smart-api.info/api/metakg/?q=api.smartapi.id:{api_id}&bte=1&consolidated=0&size=100"
print(url)
response = requests.get(url)
data = response.json()
BMT = Toolkit()

https://smart-api.info/api/metakg/?q=api.smartapi.id:b772ebfbfa536bba37764d7fddb11d6f&bte=1&consolidated=0&size=100


In [55]:
def create_biolink_mappings(data):
    # Define a mapping of prefixes to Biolink entities
    # prefix_to_biolink = {
    #     "CHEMBL.COMPOUND": "biolink:Drug",
    #     "NCBIGENE": "biolink:Gene",
    #     "UNIPROT": "biolink:Protein",
    #     "GO": "biolink:BiologicalProcess",
    #     "DRUGBANK": "biolink:Drug",
    #     "UMLS": "biolink:Disease",
    #     "ORPHANET": "biolink:Disease"
    # }

    # Initialize the dictionary for entity mappings and set for unique prefixes
    entity_mappings = {}
    prefix_mappings = {}
    unique_prefixes = set()
    try:
        # Loop over the hits and extract the relevant prefixes
        for hit in data["hits"]:
            subject = hit['subject']
            object = hit['object']
            subject_prefix = hit.get("subject_prefix")#.upper()
            object_prefix = hit.get("object_prefix")#.upper()
            
            # Add prefixes to unique_prefixes set
            if subject_prefix:
                unique_prefixes.add(subject_prefix)
            if object_prefix:
                unique_prefixes.add(object_prefix)
            
            # # Map prefixes to Biolink entities if they exist in the mapping
            # for prefix in [subject_prefix, object_prefix]:
            #     if prefix in prefix_to_biolink:
            #         entity_mappings[prefix] = prefix_to_biolink[prefix]
            bl_element = BMT.get_element(subject)
            entity_mappings[subject] = bl_element["class_uri"]
            # prefix_element = BMT.get_element_by_prefix(subject_prefix)
            # prefix_mappings[subject_prefix] = prefix_element
            
        # Print the results
        print(f"Unique mappings: {len(entity_mappings)}")
        print(f"Unique prefixes encountered: {len(unique_prefixes)}")
        # print(f"Entity mappings: {entity_mappings}")
        # print(f"Prefix Mappings: {prefix_mappings}")
        # print(unique_prefixes)


        return entity_mappings
    
    except KeyError as e:
        print(f"Error: Missing key in data - {e}")
        return {}

In [56]:
def get_biothings_api(subject_id, node_dict, node_data):
    if "NCBIGene" in subject_id:
        query_term = f"entrezgene:{subject_id.split(':')[1]}"
    elif "orphanet" in subject_id:
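        query_term = f"raresource.disease.orphanet:{subject_id.split(':')[1]}"
    else:
        # Fail early: without this branch, query_term would be unbound below.
        raise ValueError(f"Unsupported CURIE prefix: {subject_id}")
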

    for data in tqdm(client.query(q=query_term, fetch_all=True)):
        if "NCBIGene" in subject_id:
            node_name = data["description"]
        elif "orphanet" in subject_id:
            for data_dict in data["raresource"]["disease"]:
                if "orphanet" in data_dict and data_dict["orphanet"] == subject_id.split(":")[1]:
                    node_name = f'ORPHA:{data_dict["orphanet"]}'
                    break

        node = {
            "id": subject_id,
            "name": node_name,
            "category": node_dict
        }

        node_data[subject_id] = node

    return node_data

In [57]:
def get_smartapi_data(data, api_name, biolink_mapping):
    node_data = {}
    edge_list=[]
    for hit in data["hits"][:2]:
        subject = hit['subject']
        object =  hit['object']
        node_dict = {}
        edge_dict = {}
        subject_prefix = hit["subject_prefix"]
        object_prefix = hit["object_prefix"]
        predicate = hit["predicate"]
        
        
        full_subject = hit['api']['bte']['query_operation']['testExamples'][0]['qInput']
        full_object = hit['api']['bte']['query_operation']['testExamples'][0]['oneOutput']
        print(f"[NODES]: {full_subject} - {predicate} - {full_object}")
        # Update Node Dict
        for prefix in [subject, object]:
            if prefix not in biolink_mapping:
                print(f"Warning: {prefix} not found in biolink mapping.")
                continue
            biolink_entity = biolink_mapping.get(subject)
            node_dict["category"] =  (biolink_entity,)
            node_set = get_biothings_api(full_subject, node_dict, node_data)

        # Update Edge Dict
        # # edge_dict["id"] = # CONFIRM THIS ID
        if "orphanet" in full_subject:
            full_subject = f"ORPHA:{full_subject.split(':')[1]}"
        else:
            edge_dict["subject"] = full_subject
        edge_dict["predicate"] = f"biolink:{predicate}"
        if "orphanet" in full_object:
            full_object = f"ORPHA:{full_object.split(':')[1]}"
        else:
            edge_dict["object"] = full_object
        edge_list.append(edge_dict)
    # node_list = [dict(node) for node in node_set]
    # If needed, convert back to a list of dictionaries
    return node_data,edge_list


In [58]:
import json

def write_to_json_file(data, output_file):
        
    # Write the dictionary to a JSON file
    with open(output_file, "w") as json_file:
        json.dump(data, json_file, indent=4)  # Use indent for pretty formatting



In [59]:
biolink_mapping = create_biolink_mappings(data)
node_data, edge_list = get_smartapi_data(data, api_name, biolink_mapping)

Unique mappings: 2
Unique prefixes encountered: 3
[NODES]: NCBIGene:100 - gene_associated_with_condition - orphanet:39041


1it [00:00,  3.70it/s]No more results to return.
1it [00:00,  2.24it/s]
1it [00:00,  5.49it/s]No more results to return.
1it [00:00,  2.79it/s]


[NODES]: orphanet:110 - condition_associated_with_gene - NCBIGene:10806


1it [00:00,  3.79it/s]No more results to return.
26it [00:00, 58.44it/s]
1it [00:00,  3.75it/s]No more results to return.
26it [00:00, 58.54it/s]


In [60]:
node_data

{'NCBIGene:100': {'id': 'NCBIGene:100',
  'name': 'adenosine deaminase',
  'category': {'category': ('biolink:Gene',)}},
 'orphanet:110': {'id': 'orphanet:110',
  'name': 'ORPHA:110',
  'category': {'category': ('biolink:Disease',)}}}
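
The `category` value above is nested one level too deep (`{'category': (...)}`) because the whole `node_dict` is stored under the `category` key. A small sketch of a flatter node builder follows. `make_node` is a hypothetical helper, and emitting `category` as a list is an assumption based on it being a multi-valued field.

```python
def make_node(curie, name, node_dict):
    """Build a node whose category is a flat list of Biolink classes."""
    category = node_dict.get("category", ())
    # Accept a bare string as well as a tuple or list.
    if isinstance(category, str):
        category = (category,)
    return {"id": curie, "name": name, "category": list(category)}


# Values taken from the node_data output above.
gene_node = make_node(
    "NCBIGene:100", "adenosine deaminase", {"category": ("biolink:Gene",)}
)
```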

In [61]:
biolink_mapping

{'Gene': 'biolink:Gene', 'Disease': 'biolink:Disease'}

In [62]:
edge_list

[{'subject': 'NCBIGene:100',
  'predicate': 'biolink:gene_associated_with_condition'},
 {'predicate': 'biolink:condition_associated_with_gene',
  'object': 'NCBIGene:10806'}]
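
Each edge above is missing either `subject` or `object`. In `get_smartapi_data` the assignment sits in the `else` branch, so it is skipped whenever the CURIE contains `orphanet`. Below is a sketch of an edge builder that rewrites the prefix first and then always assigns both ends. The `ORPHA:` rewrite mirrors the existing code; `to_orpha` and `build_edge` are hypothetical names.

```python
def to_orpha(curie):
    """Rewrite an orphanet CURIE to the ORPHA prefix; pass others through."""
    prefix, sep, local_id = curie.partition(":")
    if sep and prefix.lower() == "orphanet":
        return f"ORPHA:{local_id}"
    return curie


def build_edge(full_subject, predicate, full_object):
    """Always set subject, predicate and object, whatever the prefixes are."""
    return {
        "subject": to_orpha(full_subject),
        "predicate": f"biolink:{predicate}",
        "object": to_orpha(full_object),
    }


# The two hits printed by the [NODES] lines above.
edges = [
    build_edge("NCBIGene:100", "gene_associated_with_condition", "orphanet:39041"),
    build_edge("orphanet:110", "condition_associated_with_gene", "NCBIGene:10806"),
]
```

If edges use `ORPHA:`, the node IDs would need the same rewrite so that every edge endpoint resolves to a node.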

In [63]:
# Create a dictionary with the required structure
data = {
    "nodes": [node_data],
    "edges": edge_list
}

In [64]:
pprint.pprint(data)

{'edges': [{'predicate': 'biolink:gene_associated_with_condition',
            'subject': 'NCBIGene:100'},
           {'object': 'NCBIGene:10806',
            'predicate': 'biolink:condition_associated_with_gene'}],
 'nodes': [{'NCBIGene:100': {'category': {'category': ('biolink:Gene',)},
                             'id': 'NCBIGene:100',
                             'name': 'adenosine deaminase'},
            'orphanet:110': {'category': {'category': ('biolink:Disease',)},
                             'id': 'orphanet:110',
                             'name': 'ORPHA:110'}}]}


In [65]:
outfile = "raresource_kgx.json"
write_to_json_file(data,outfile)

In [66]:
!kgx validate -i json raresource_kgx.json

{}


---

In [43]:
from bmt.util import guess_casing, pascal_to_snake, snake_to_pascal

In [44]:
guess_casing("NCBIGene")

'pascal'

In [48]:
new_case  = pascal_to_snake("NCBIGene")

In [51]:
BMT.get_element_by_mapping(new_case)

In [46]:
guess_casing("orphanet")

'snake'

In [47]:
snake_to_pascal("orphanet")

'Orphanet'

In [None]:
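def get_biothings_api(subject_id, node_dict, node_data):
    # Assumption: the UMLS ID is the part of the CURIE after the colon,
    # mirroring the orphanet query in the earlier version of this function.
    bt_query = f"raresource.disease.umls:{subject_id.split(':')[1]}"

    for data in tqdm(client.query(q=bt_query, fetch_all=True)):
        pass  # TODO: iterate here
    return node_data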

In [23]:
len(data)

4

---- 
Draft with `bmt-lite`

In [4]:
from bmt.util import pascal_to_snake

In [16]:
test_case = "Gene"

In [5]:
from bmt import Toolkit

In [3]:
print(bmt.__file__)

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/bmt/__init__.py


In [6]:
BMT = Toolkit()

In [17]:
test_element = BMT.get_element_by_prefix(test_case)

no biolink class found for the given curie: Gene, try get_element_by_mapping?


In [18]:
BMT.get_descendants(test_case)

['gene']

In [19]:
BMT.get_element(test_case)

ClassDefinition({
  'name': 'gene',
  'id_prefixes': ['NCBIGene', 'ENSEMBL', 'HGNC', 'MGI', 'ZFIN', 'dictyBase', 'WB', 'WormBase',
    'FB', 'RGD', 'SGD', 'PomBase', 'OMIM', 'KEGG.GENES', 'UMLS', 'Xenbase', 'AspGD',
    'PHARMGKB.GENE'],
  'description': ('A region (or regions) that includes all of the sequence elements necessary '
     'to encode a functional transcript. A gene locus may include regulatory '
     'regions, transcribed regions and/or other functional sequence regions.'),
  'in_subset': ['translator_minimal', 'model_organism_database'],
  'from_schema': 'https://w3id.org/biolink/biolink-model',
  'exact_mappings': ['SO:0000704', 'SIO:010035', 'WIKIDATA:Q7187', 'dcid:Gene'],
  'narrow_mappings': ['bioschemas:gene'],
  'broad_mappings': ['NCIT:C45822'],
  'is_a': 'biological entity',
  'mixins': ['gene or gene product', 'genomic entity',
    'chemical entity or gene or gene product', 'physical essence',
    'ontology class'],
  'slots': ['symbol', 'xref'],
  'class_uri': 

Let's test on a real example: traverse a document, see whether we can grab the relevant info, and record if and how many times we cannot. Test this on all three APIs.
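
One way to run that test, sketched with a stand-in resolver so it runs without `bmt` installed: pass in any lookup callable (for example `BMT.get_element`, assuming it returns `None` for unknown names) and tally how often it comes back empty. `count_lookup_failures` and `fake_lookup` are hypothetical names.

```python
from collections import Counter


def count_lookup_failures(labels, lookup):
    """Return (number of successful lookups, Counter of labels that failed)."""
    misses = Counter()
    hits = 0
    for label in labels:
        if lookup(label) is None:
            misses[label] += 1
        else:
            hits += 1
    return hits, misses


# Stand-in for a Biolink lookup; swap in the real toolkit call when testing.
known = {"Gene": "biolink:Gene", "Disease": "biolink:Disease"}


def fake_lookup(label):
    return known.get(label)


hits, misses = count_lookup_failures(
    ["Gene", "Disease", "orphanet", "Gene", "orphanet"], fake_lookup
)
```

Running the same tally over the subjects, objects, and prefixes of each API's MetaKG hits would give a per-API count of unresolved labels.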