# Query and filter data from NeDRex via the API

## 1. Query a filtered PPI from NeDRex
### Interactive Exercise: Explore the API Documentation

Before we start querying data, let's explore the NeDRex API documentation to understand the available parameters:

1. **Go to the NeDRex API documentation**: https://exbio.wzw.tum.de/repo4eu_nedrex_open/
2. **Navigate to PPI routes** (check the parameters of the protein-protein interaction endpoint)
3. **Find the correct parameter names for**:
   - Filtering by **reviewed proteins only** (hint: look for reviewed/curated protein parameters)
   - Filtering by **methods score**
   - Filtering by **evidence type**
4. **Update the `ppi_body` dictionary below** with the correct parameter names

### Current Filter Criteria
- Only reviewed proteins: True
- Evidence type: experimental (exp)
- Method score threshold: 13.75

In [None]:
import requests

class NeDRexService:
    API_LINK = "https://exbio.wzw.tum.de/repo4eu_nedrex_open/"


def getEdges(body):
    all_edges_reviewed_proteins = []
    upper_limit = body.get("limit", 10000)
    offset = 0

    while True:
        # "skip" is the pagination offset: it starts at 0 (skip no entries) and is
        # advanced by upper_limit after every full page to request the next set of edges
        body["skip"] = offset
        try:
            # POST request to the ppi endpoint with the body
            response = requests.post(
                url=f"{NeDRexService.API_LINK}/ppi", json=body, headers={"content-type": "application/json"}
            )
            response.raise_for_status()
            data = response.json()
            # collect all edges in a list
            all_edges_reviewed_proteins.extend(data)
            if len(data) < upper_limit:
                # This means that we have reached the end of the edges
                break
            # increase the offset by the upper limit to get the next set of edges
            offset += upper_limit
        except requests.exceptions.RequestException as e:
            print(f"HTTP request failed: {e}")
            return None
    return all_edges_reviewed_proteins
        
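def createNedrexGraph(filename, body):
    edges_review_proteins = getEdges(body)
    # getEdges returns None if a request failed; an empty list means no edge matched the filters
    if not edges_review_proteins:
        print("No edges retrieved; nothing written.")
        return
    print("Number of queried edges overall: ", len(edges_review_proteins))
    print("Example edge: ", edges_review_proteins[0])

    # Write the edges to a csv file in the format "memberOne,memberTwo"
    # str.removeprefix (Python 3.9+) drops the exact "uniprot." prefix;
    # str.lstrip would strip a set of characters instead of a prefix
    with open(filename, "w") as f:
        for edge in edges_review_proteins:
            member_one = str(edge["memberOne"]).removeprefix("uniprot.")
            member_two = str(edge["memberTwo"]).removeprefix("uniprot.")
            f.write(member_one + "," + member_two)
            f.write("\n")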


if __name__ == "__main__":
    # TODO: Update this body with the correct parameter names from the API documentation
    # Visit: https://exbio.wzw.tum.de/repo4eu_nedrex_open/ 
    # Look at PPI routes to find the correct parameter names for:
    # - Reviewed proteins filtering
    # - Methods score filtering
    # - Evidence type filtering
    ppi_body = {
        "limit": 10000,
        # Add your parameters here after checking the documentation:
        # "PARAMETER_NAME_FOR_REVIEWED_PROTEINS": [True],
        # "PARAMETER_NAME_FOR_EVIDENCE": ["exp"],
        # "PARAMETER_NAME_FOR_METHODS_SCORE": 13.75
    }
    createNedrexGraph(filename="../../data/NeDRex_api/filtered_ppi_only_reviewed_proteins.csv", body=ppi_body)

### 🔍 Solution (Hidden - Expand to see the answer)

<details>
<summary>Click here to see the solution after you've tried exploring the documentation</summary>

The correct parameter names are:
- **Reviewed proteins filtering**: `reviewed_proteins`
- **Evidence filtering**: `iid_evidence` 
- **Methods score filtering**: `methods_score_cutoff`

Complete working body:
```python
ppi_body = {
    "limit": 10000,
    "reviewed_proteins": [True],
    "iid_evidence": ["exp"],
    "methods_score_cutoff": 13.75
}
```

**Explanation**:
- `reviewed_proteins`: Boolean list to include only reviewed/curated proteins from UniProt
- `iid_evidence`: List of evidence types to filter by (e.g., ["exp"] for experimental evidence)
- `methods_score_cutoff`: Minimum score threshold for methods confidence
- `limit`: Maximum number of edges to return per request (pagination)

</details>
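
The `skip`/`limit` loop used in `getEdges` above can be tried out without touching the API. The following is a minimal sketch of the same pagination pattern, assuming a page source that returns at most `limit` records per call; `fetch_all_pages`, `fake_fetch_page`, and `fake_edges` are hypothetical names introduced only for this illustration and are not part of NeDRex.

```python
from typing import Callable, Dict, List


def fetch_all_pages(
    fetch_page: Callable[[int, int], List[Dict]], limit: int = 10000
) -> List[Dict]:
    """Collect all records from a skip/limit paginated source."""
    records: List[Dict] = []
    skip = 0
    while True:
        page = fetch_page(skip, limit)
        records.extend(page)
        # A page shorter than the limit means there is nothing left to fetch
        if len(page) < limit:
            break
        # Otherwise move the offset forward by one full page
        skip += limit
    return records


# Hypothetical in-memory stand-in for the API: 25 made-up edges
fake_edges = [
    {"memberOne": f"uniprot.P{i:05d}", "memberTwo": f"uniprot.Q{i:05d}"}
    for i in range(25)
]


def fake_fetch_page(skip: int, limit: int) -> List[Dict]:
    # Mimics a server that honours skip and limit
    return fake_edges[skip : skip + limit]


all_edges = fetch_all_pages(fake_fetch_page, limit=10)
print(len(all_edges))
```

With 25 records and a limit of 10, the loop requests three pages (10, 10, and 5 records) and stops after the short third page. If the total is an exact multiple of the limit, one extra request returns an empty page and ends the loop.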

## 2. Query PDIs from NeDRex

In addition to PPIs, NeDRex contains many other node types and relationships.

Next, let's query some protein-drug interactions (PDIs).

Check the available edge types (https://exbio.wzw.tum.de/repo4eu_nedrex_open/list_edge_collections) and fill in the correct type below.


In [14]:
import requests

class NeDRexService:
    API_LINK = "https://exbio.wzw.tum.de/repo4eu_nedrex_open/"

# Function to get the edges from the NeDRex API
# type: type of the edges to query
# body: request body with the source/target domain id filters, the requested
#       attributes, and the pagination limit ("skip" is set by this function)
def getEdges(
    type: str, body
):
    all_edges = []
    upper_limit = body.get("limit", 10000)

    offset = 0
    while True:
        body["skip"] = offset
        try:
            response = requests.post(
                url=f"{NeDRexService.API_LINK}/{type}/attributes/json", json=body
            )
            response.raise_for_status()
            data = response.json()
            all_edges.extend(data)
            if len(data) < upper_limit:
                break
            offset += upper_limit
        except requests.exceptions.RequestException as e:
            print(f"HTTP request failed: {e}")
            return None
    return all_edges

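def createNedrexGraph(filename, type: str, body):
    edges_pdi = getEdges(type, body)
    # getEdges returns None if a request failed; an empty list means no edge was found
    if not edges_pdi:
        print("No edges retrieved; nothing written.")
        return
    print("Number of queried edges overall: ", len(edges_pdi))
    print("Example edge: ", edges_pdi[0])

    # Write the edges to a csv file in the format "sourceDomainId,targetDomainId" (drug,protein)
    # str.removeprefix (Python 3.9+) drops the exact prefix;
    # str.lstrip would strip a set of characters instead of a prefix
    with open(filename, "w") as f:
        for edge in edges_pdi:
            drug_id = str(edge["sourceDomainId"]).removeprefix("drugbank.")
            protein_id = str(edge["targetDomainId"]).removeprefix("uniprot.")
            f.write(drug_id + "," + protein_id)
            f.write("\n")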

if __name__ == "__main__":

    pdi_body = {
        "source_domain_id": [],
        "target_domain_id": [],
        "attributes": ["sourceDomainId", "targetDomainId"],
        "skip": 0,
        "limit": 1000,
    }
    
    # TODO: fill in the correct edge type to query PDIs
    createNedrexGraph(filename="../../data/NeDRex_api/pdi.csv", type="TODO", body=pdi_body)

### 🔍 Solution (Hidden - Expand to see the answer)

<details>
<summary>Click here to see the solution after you've tried exploring the documentation</summary>

The correct type is `drug_has_target`:

```python
createNedrexGraph(filename="../../data/NeDRex_api/pdi.csv", type="drug_has_target", body=pdi_body)
```

</details>
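
The IDs returned by the API carry a database prefix such as `drugbank.` or `uniprot.`, which is removed before writing the CSV. One detail worth knowing when doing this: `str.lstrip` interprets its argument as a set of characters, not as a prefix, so it can remove more than intended. The sketch below illustrates the difference with a small helper; `strip_prefix` and the lowercase ID `uniprot.tp53` are hypothetical and chosen only to expose the behaviour.

```python
def strip_prefix(identifier: str, prefix: str) -> str:
    """Remove `prefix` from the start of `identifier` if it is present."""
    if identifier.startswith(prefix):
        return identifier[len(prefix):]
    return identifier


# Hypothetical lowercase ID, chosen so that the characters after the
# prefix also occur in the character set passed to lstrip
example_id = "uniprot.tp53"

# lstrip removes every leading character contained in ".uniprot"
print(example_id.lstrip(".uniprot"))        # 53

# The helper removes only the exact prefix
print(strip_prefix(example_id, "uniprot."))  # tp53
print(strip_prefix("drugbank.DB12530", "drugbank."))  # DB12530
```

UniProt accessions, DrugBank IDs, and Entrez IDs start with uppercase letters or digits, so `lstrip` happens to give the expected result for them; prefix-based removal simply avoids relying on that coincidence.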

## 3. Get seed genes from NeDRex for Huntington's disease
### To do
- Check the documentation to find the correct edge type identifier for seed genes (genes associated with a disorder): https://exbio.wzw.tum.de/repo4eu_nedrex_open/
- Check the directionality of that edge type in the NeDRex graph: we start from the MONDO ID for Huntington's disease and want to retrieve the seed genes, so decide whether the disease ID is the source or the target of the edge
- Query the seed genes with the correct parameters, using the functions provided below
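
The directionality question comes down to which field of the request body receives the known IDs. As a minimal sketch, the helper below builds a body with the field names used in this notebook (`source_domain_id`, `target_domain_id`, `attributes`, `skip`, `limit`); `build_edge_body` itself is a hypothetical helper, and it assumes that an empty list on one side means "no filter on that side".

```python
from typing import Dict, List, Optional


def build_edge_body(
    ids: List[str],
    ids_are_sources: bool,
    attributes: Optional[List[str]] = None,
    limit: int = 10000,
) -> Dict:
    """Build a request body that filters edges on one side only."""
    return {
        # Known IDs go on the source side or the target side, never both
        "source_domain_id": list(ids) if ids_are_sources else [],
        "target_domain_id": [] if ids_are_sources else list(ids),
        "attributes": attributes or ["sourceDomainId", "targetDomainId"],
        "skip": 0,
        "limit": limit,
    }


# Assumption for illustration: the edge points from gene to disorder,
# so a disorder ID is placed on the target side
body = build_edge_body(["mondo.0007739"], ids_are_sources=False)
print(body["source_domain_id"], body["target_domain_id"])
```

If the IDs are placed on the wrong side of a directed edge type, the filter is unlikely to match the intended edges, so it is worth inspecting one example edge before running a large query.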

In [6]:
import requests
from gprofiler import GProfiler
from collections import Counter

gp = GProfiler(return_dataframe=True)

class NeDRexService:
    API_LINK = "https://exbio.wzw.tum.de/repo4eu_nedrex_open/"

# Function to get the edges from the NeDRex API
# type: type of the edges to query
# source_domain_ids: list of source domain ids to query for the edges
# target_domain_ids: list of target domain ids to query for the edges
# extra_attributes: additional attributes to query for the edges
def getEdges(
    type: str, source_domain_ids=None, target_domain_ids=None, extra_attributes=None
):
    # None is used instead of mutable default arguments; missing values become empty lists
    all_edges = []
    attributes = ["sourceDomainId", "targetDomainId", "dataSources", "type"]
    attributes.extend(extra_attributes or [])
    upper_limit = 10000

    body = {
        "source_domain_id": source_domain_ids or [],
        "target_domain_id": target_domain_ids or [],
        "attributes": attributes,
        "skip": 0,
        "limit": upper_limit,
    }
    offset = 0
    while True:
        body["skip"] = offset
        try:
            response = requests.post(
                url=f"{NeDRexService.API_LINK}/{type}/attributes/json", json=body
            )
            response.raise_for_status()
            data = response.json()
            all_edges.extend(data)
            if len(data) < upper_limit:
                break
            offset += upper_limit
        except requests.exceptions.RequestException as e:
            print(f"HTTP request failed: {e}")
            return None
    return all_edges

# Function to get the edges from the NeDRex API in batches
# type: type of the edges to query
# ids: list of node ids to query for the edges
# sources: boolean to indicate whether the node ids are source or target domain ids
# extra_attributes: additional attributes to query for the edges
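def getEdgesInBatches(type: str, ids, sources=True, extra_attributes=None):
    # None is used instead of a mutable default argument; getEdges always receives a list
    extra_attributes = list(extra_attributes or [])
    batch_size = 10000
    all_edges = []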
    for i in range(0, len(ids), batch_size):
        batch_nodes = ids[i : i + batch_size]
        if sources:
            edges_batch = getEdges(
                type, source_domain_ids=batch_nodes, extra_attributes=extra_attributes
            )
        else:
            edges_batch = getEdges(
                type, target_domain_ids=batch_nodes, extra_attributes=extra_attributes
            )
        if edges_batch is not None:
            all_edges.extend(edges_batch)
        else:
            print(f"Error occurred while fetching batch starting at index {i}.")
            break
    return all_edges

if __name__ == "__main__":
    ############# EXAMPLE CALLS TO THE NEDREX API ##################
    
    # Example call to get the edges for the protein_has_go_annotation type -> by protein ids as source domain ids
    edges_to_get_proteins = ["uniprot.A0A024RBG1", "uniprot.A0A075B6H9", "uniprot.A0A075B6I6"]
    edges_go = getEdgesInBatches(
        "protein_has_go_annotation",
        ids=edges_to_get_proteins,
        extra_attributes=["qualifiers"],
    )
    print("Example edge: ",edges_go[0])
    
    # Example call to get the edges for the protein_has_go_annotation type -> by go ids as target domain ids
    edges_to_get_go = ["go.0003723", "go.0005829", "go.0046872"]
    edges_go = getEdgesInBatches(
        "protein_has_go_annotation",
        ids=edges_to_get_go,
        sources=False,
        extra_attributes=["qualifiers"],
    )
    print("Example edge: ",edges_go[0])
    
    # Example call to get the edges for the drug_has_contraindication type -> by drug ids as source domain ids and disorder ids as target domain ids
    drug_ids = ["drugbank.DB12530", "drugbank.DB08893"]
    disorder_ids = ["mondo.0005344", "mondo.0005044"]
    edges_contraindications = getEdges(
        "drug_has_contraindication",
        source_domain_ids=drug_ids,
        target_domain_ids=disorder_ids,
    )
    print("Example edge: ",edges_contraindications[0])
    
    #############################################################
    
    # MONDO disease ID for Huntington's disease
    disease_ids = ["mondo.0007739"]
    edges_seed_genes = getEdgesInBatches(
        "gene_associated_with_disorder",
        ids=disease_ids,
        sources=False
    )
    print("Number of queried edges for seed genes: ", len(edges_seed_genes))
    print("Example edge: ", edges_seed_genes[0])
    
    # Count data sources
    source_counter = Counter()

    for edge in edges_seed_genes:
        if "dataSources" in edge:
            for src in edge["dataSources"]:
                source_counter[src] += 1
        else:
            source_counter["<no_source_info>"] += 1

    # Print overview
    print("\nEdge counts per data source:")
    for source, count in source_counter.most_common():
        print(f"{source}: {count}")
        
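    # str.removeprefix (Python 3.9+) drops the exact "entrez." prefix; lstrip would strip a character set
    seeds_entrez = [edge["sourceDomainId"].removeprefix("entrez.") for edge in edges_seed_genes]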
    
    ############################### CONVERTING ENTREZ IDs TO UNIPROT IDs ###############################
    converted = gp.convert(organism='hsapiens',
        query=list(seeds_entrez),
        target_namespace='UNIPROTSWISSPROT_ACC', numeric_namespace="ENTREZGENE_ACC")

    seeds_uniprot = set(converted['converted'])
    
    #############################################################
    
    output_file = "../../data/NeDRex_api/seed_genes_huntingtons_disease.csv"
    print(f"Writing {len(seeds_uniprot)} seed genes to {output_file}")
    # Write the seed genes to a file
    with open(output_file, 'w') as file:
        for gene in seeds_uniprot:
            file.write(gene + '\n')
    

Example edge:  {'sourceDomainId': 'uniprot.A0A024RBG1', 'targetDomainId': 'go.0003723', 'dataSources': ['go'], 'type': 'ProteinHasGOAnnotation', 'qualifiers': ['enables']}
Example edge:  {'sourceDomainId': 'uniprot.A0A024RBG1', 'targetDomainId': 'go.0003723', 'dataSources': ['go'], 'type': 'ProteinHasGOAnnotation', 'qualifiers': ['enables']}
Example edge:  {'sourceDomainId': 'drugbank.DB08893', 'targetDomainId': 'mondo.0005044', 'dataSources': ['drugcentral'], 'type': 'DrugHasContraindication'}
Number of queried edges for seed genes:  3694
Example edge:  {'sourceDomainId': 'entrez.6515', 'targetDomainId': 'mondo.0007739', 'dataSources': ['orphanet', 'opentargets_orphanet', 'opentargets_europepmc'], 'type': 'GeneAssociatedWithDisorder'}

Edge counts per data source:
opentargets_europepmc: 3618
opentargets_impc: 98
opentargets_chembl: 90
opentargets_ot_genetics_portal: 4
orphanet: 2
opentargets_orphanet: 2
opentargets_eva: 1
opentargets_clingen: 1
opentargets_uniprot_literature: 1
openta
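
The counts above show that most of the 3694 edges list `opentargets_europepmc` among their data sources. If a narrower seed set is wanted, one option is to keep only edges supported by selected sources. The following is a minimal sketch of such a filter; `filter_edges_by_source` is a hypothetical helper, the sample records are illustrative rather than real API output, and the choice of `orphanet` as the allowed source is an assumption for demonstration, not a recommendation.

```python
from typing import Dict, Iterable, List


def filter_edges_by_source(
    edges: Iterable[Dict], allowed_sources: Iterable[str]
) -> List[Dict]:
    """Keep edges that list at least one of the allowed data sources."""
    allowed = set(allowed_sources)
    return [
        edge
        for edge in edges
        # Edges without a "dataSources" field are treated as having no source
        if allowed.intersection(edge.get("dataSources", []))
    ]


# Illustrative records only; the data source assignments are made up
sample_edges = [
    {"sourceDomainId": "entrez.1", "dataSources": ["orphanet", "opentargets_europepmc"]},
    {"sourceDomainId": "entrez.2", "dataSources": ["opentargets_europepmc"]},
    {"sourceDomainId": "entrez.3"},
]

curated = filter_edges_by_source(sample_edges, ["orphanet"])
print([edge["sourceDomainId"] for edge in curated])
```

An edge is kept as soon as one of its sources is in the allowed set, even if it also lists other sources.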

In [11]:
# Read the TSV file with seed genes from NeDRex web interface
import pandas as pd
from gprofiler import GProfiler

# Initialize GProfiler
gp = GProfiler(return_dataframe=True)

# Read the TSV file
tsv_file = "../../data/NeDRex_api/seeds_huntington_nedrexWeb.tsv"
print(f"Reading seed genes from: {tsv_file}")

# Read the file: skip its first line and assign the column names explicitly
df = pd.read_csv(tsv_file, sep='\t', skiprows=1, names=['ID', 'Name'])
print(f"Found {len(df)} seed genes in the file")

# Extract Entrez IDs (remove the "entrez." prefix)
entrez_ids = [gene_id.replace('entrez.', '') for gene_id in df['ID']]
print(f"\nEntrez IDs to convert: {entrez_ids}")

# Convert Entrez IDs to UniProt IDs using GProfiler
print("\nConverting Entrez IDs to UniProt IDs...")
converted = gp.convert(
    organism='hsapiens',
    query=entrez_ids,
    target_namespace='UNIPROTSWISSPROT_ACC',
    numeric_namespace="ENTREZGENE_ACC"
)

# Get the successfully converted UniProt IDs
seeds_uniprot = set(converted['converted'].dropna())
print(f"\nSuccessfully converted {len(seeds_uniprot)} genes to UniProt IDs")

# Write the UniProt seed genes to a new file
output_file = "../../data/NeDRex_api/seed_genes_huntingtons_disease.csv"
print(f"Writing {len(seeds_uniprot)} seed genes to {output_file}")

with open(output_file, 'w') as file:
    for gene in seeds_uniprot:
        file.write(gene + '\n')

print(f"Successfully wrote seed genes to: {output_file}")
print(f"UniProt seed genes: {seeds_uniprot}")

Reading seed genes from: ../../data/NeDRex_api/seeds_huntington_nedrexWeb.tsv
Found 17 seed genes in the file

Entrez IDs to convert: ['6515', '1268', '2596', '5649', '1827', '3064', '4968', '140767', '2668', '2166', '9131', '4128', '51083', '4129', '51447', '56616', '2030']

Converting Entrez IDs to UniProt IDs...

Successfully converted 17 genes to UniProt IDs
Writing 17 seed genes to ../../data/NeDRex_api/seed_genes_huntingtons_disease.csv
Successfully wrote seed genes to: ../../data/NeDRex_api/seed_genes_huntingtons_disease.csv
UniProt seed genes: {'P11169', 'P17677', 'O15527', 'Q99808', 'P22466', 'Q9UHH9', 'P27338', 'P53805', 'P21554', 'P39905', 'O00519', 'P21397', 'P42858', 'O95831', 'Q8IZ57', 'Q9NR28', 'P78509'}
