__Graph Databanks and Knowledge Graphs in Life Sciences - Python-Exercise__

***

# Exercise 6: Querying RDFs with SPARQL
***

Name:
<br>
Matrikel-Nr.: 
***

# Exercise 6:

<div>
     <img src="./figures/uniprotSPARQL_Endpoint.jpg" width=60%/>
</div>

UniProt/Swiss-Prot is a widely used knowledgebase of protein information that provides an endpoint for SPARQL queries at https://sparql.uniprot.org/uniprot. The figure shows the main concepts of the linked data in the UniProt database. Further subgraphs and their properties, as well as links to external databases, can be inspected at https://sparql.uniprot.org/.well-known/void.

# Task 0:
> Describe the semantics and general principles of GO annotations

GO (Gene Ontology) annotations describe the biological role of gene products (proteins, RNAs) using controlled vocabulary terms. The semantics are based on three main aspects:

1. **Molecular Function**: The biochemical activity of a gene product
2. **Biological Process**: The larger biological programs accomplished by multiple molecular activities  
3. **Cellular Component**: The locations where a gene product performs its function

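GO annotations follow these general principles:
- Each annotation links a gene product to a GO term and is supported by an evidence code
- GO terms are organized hierarchically as a directed acyclic graph, so an annotation to a child term implies annotation to all of its parent terms (the "true path rule")
- Evidence codes indicate the type of evidence supporting the annotation, e.g. IDA (inferred from direct assay) or IEA (inferred from electronic annotation)
- Annotations can have qualifiers that modify or refine the meaning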
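
The hierarchical principle can be illustrated with a minimal sketch in plain Python. The term hierarchy below is a simplified toy example (not the real GO graph), and the protein name `P_example` is hypothetical; the point is only that a direct annotation to a child term implies annotations to every ancestor.

```python
# Simplified toy hierarchy (child -> list of parents); NOT the real GO graph.
PARENTS = {
    "hexokinase activity": ["kinase activity"],
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
    "molecular_function": [],
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct_annotations, parents):
    """Expand direct annotations so each protein also carries all ancestor terms."""
    expanded = {}
    for protein, terms in direct_annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[protein] = full
    return expanded


# Hypothetical protein with a single direct annotation
direct = {"P_example": {"hexokinase activity"}}
implied = propagate(direct, PARENTS)
print(sorted(implied["P_example"]))
```
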

> What are annotation qualifiers?

In [None]:
# Annotation qualifiers are additional information that refine or modify 
# the meaning of a GO annotation. Common qualifiers include:
#
# - NOT: Indicates the gene product is NOT associated with the GO term
# - contributes_to: Used when a gene product contributes to but is not 
#   solely responsible for a molecular function
# - colocalizes_with: Used for cellular component annotations when the 
#   gene product is only transiently associated with an organelle
#
# These qualifiers help provide more precise biological information about 
# the relationship between a gene product and its GO term annotation.

print("Annotation qualifiers refine GO annotation meanings")
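
As a small illustration of why qualifiers matter when processing annotations, the sketch below models annotations as records and drops those carrying the `NOT` qualifier. The record layout, protein names, and term identifiers are hypothetical placeholders; only the qualifier names and the evidence codes `IDA` and `IMP` come from GO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    protein: str          # hypothetical protein identifier
    go_term: str          # placeholder term identifier
    evidence: str         # GO evidence code, e.g. "IDA"
    qualifier: str = ""   # "", "NOT", "contributes_to", "colocalizes_with"


annotations = [
    Annotation("P_A", "TERM_1", "IDA"),
    Annotation("P_B", "TERM_1", "IMP", qualifier="NOT"),
    Annotation("P_C", "TERM_2", "IDA", qualifier="contributes_to"),
]


def positive_annotations(records):
    """Keep annotations that assert an association (drop NOT-qualified ones)."""
    return [a for a in records if a.qualifier != "NOT"]


kept = positive_annotations(annotations)
for a in kept:
    print(a.protein, a.go_term, a.qualifier or "(no qualifier)")
```

Ignoring the `NOT` qualifier would wrongly count `P_B` as associated with `TERM_1`, which is why it is filtered before any counting or enrichment step.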

# Task 1:
> Construct and locally save a graph which contains all human proteins and their corresponding GO terms and their specified labels. <br>
> All proteins should be represented as entities of the type "human Protein".

In [None]:
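from SPARQLWrapper import SPARQLWrapper, RDFXML
from rdflib import Graph

# Set up SPARQL endpoint for UniProt
sparql = SPARQLWrapper("https://sparql.uniprot.org/sparql")
# Request RDF/XML so that the CONSTRUCT result is converted to an rdflib Graph
sparql.setReturnFormat(RDFXML)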

# CONSTRUCT query to get human proteins with their GO terms and labels
sparql.setQuery("""
    PREFIX up: <http://purl.uniprot.org/core/>
    PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    
    CONSTRUCT {
        ?protein rdf:type <http://example.org/HumanProtein> .
        ?protein up:classifiedWith ?go .
        ?go rdfs:label ?goLabel .
    }
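    WHERE {
        ?protein a up:Protein ;
                 up:organism taxon:9606 ;
                 up:classifiedWith ?go .
        ?go rdfs:label ?goLabel .
        # up:classifiedWith also links to non-GO classifications; keep GO terms only
        FILTER(STRSTARTS(STR(?go), "http://purl.obolibrary.org/obo/GO_"))
    }
    # NOTE: LIMIT caps the number of matches to keep the request small;
    # the saved graph is therefore only a sample, not all human proteins
    LIMIT 1000
""")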

# Execute query and get results as RDF graph
results = sparql.queryAndConvert()

# Save the graph locally
results.serialize(destination="human_proteins_go.ttl", format="turtle")

print(f"Graph constructed and saved with {len(results)} triples")

# Task 2:
> Perform a query to receive the number of proteins that are contained in the database.

In [None]:
from rdflib import Graph

# Load the graph we created
g = Graph()
g.parse("human_proteins_go.ttl", format="turtle")

# SPARQL query to count proteins
count_query = """
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    
    SELECT (COUNT(DISTINCT ?protein) as ?proteinCount)
    WHERE {
        ?protein rdf:type <http://example.org/HumanProtein> .
    }
"""

# Execute query
qres = g.query(count_query)

# Display result
for row in qres:
    print(f"Number of proteins in the database: {row.proteinCount}")

# Task 3:
> Are there proteins connected to __GO:0046330__?

In [None]:
from rdflib import Graph

# Load the graph
g = Graph()
g.parse("human_proteins_go.ttl", format="turtle")

# Query to check if proteins are connected to GO:0046330
query = """
    PREFIX up: <http://purl.uniprot.org/core/>
    PREFIX obo: <http://purl.obolibrary.org/obo/>
    
    SELECT ?protein
    WHERE {
        ?protein up:classifiedWith obo:GO_0046330 .
    }
"""

qres = g.query(query)
results = list(qres)

if len(results) > 0:
    print(f"Yes, there are {len(results)} proteins connected to GO:0046330")
    print("\nFirst few proteins:")
    for i, row in enumerate(results[:5]):
        print(f"  {i+1}. {row.protein}")
else:
    print("No proteins are connected to GO:0046330 in our local graph")

> Research this GO term and give an insight into the process that it is involved in.

**GO:0046330 - positive regulation of JNK cascade**

This GO term belongs to the Biological Process aspect and refers to any process that activates or increases the frequency, rate, or extent of signal transduction mediated by the JNK cascade. The JNK (c-Jun N-terminal kinase) cascade is one of the mitogen-activated protein kinase (MAPK) signaling pathways.

The process involves:
- A kinase relay in which upstream MAP kinase kinase kinases activate the MAP kinase kinases MKK4 and MKK7, which in turn phosphorylate and activate JNK
- Activation by cellular stress (e.g., UV radiation, osmotic or oxidative stress) and by pro-inflammatory cytokines
- Phosphorylation of transcription factors such as c-Jun, which changes gene expression
- Regulation of apoptosis, inflammation, cell proliferation, and differentiation

Proteins associated with this GO term are typically upstream activators of the pathway, such as receptors, adaptor proteins, and kinases, that promote JNK signaling.

# Task 4:
> Delete all proteins that are not connected to the GO term label "chronic inflammatory response".

In [None]:
from rdflib import Graph

# Load the graph
g = Graph()
g.parse("human_proteins_go.ttl", format="turtle")

print(f"Graph size before deletion: {len(g)} triples")

# Delete all proteins that are NOT connected to "chronic inflammatory response"
delete_query = """
    PREFIX up: <http://purl.uniprot.org/core/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    
    DELETE {
        ?protein ?p ?o .
    }
    WHERE {
        ?protein rdf:type <http://example.org/HumanProtein> .
        ?protein ?p ?o .
        
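        # Match only proteins that have NO GO term labeled "chronic inflammatory response";
        # the triples of these proteins are the ones that get deleted
        FILTER NOT EXISTS {
            ?protein up:classifiedWith ?go .
            ?go rdfs:label ?label .
            # Exact, case-insensitive label match (CONTAINS would also hit longer labels)
            FILTER(LCASE(STR(?label)) = "chronic inflammatory response")
        }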
    }
"""

g.update(delete_query)

print(f"Graph size after deletion: {len(g)} triples")
g.serialize(destination="human_proteins_go.ttl", format="turtle")

# Task 5:
> Establish a relationship which links the remaining proteins as a potential cause of an entity of Inflammatory bowel disease.

In [None]:
from rdflib import Graph

# Load the graph
g = Graph()
g.parse("human_proteins_go.ttl", format="turtle")

# The disease entity ex:InflammatoryBowelDisease
# (http://example.org/InflammatoryBowelDisease) is introduced by the INSERT below

# Insert triples linking proteins to IBD
insert_query = """
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ex: <http://example.org/>
    
    INSERT {
        ?protein ex:potentialCauseOf ex:InflammatoryBowelDisease .
    }
    WHERE {
        ?protein rdf:type ex:HumanProtein .
    }
"""

print(f"Graph size before insertion: {len(g)} triples")
g.update(insert_query)
print(f"Graph size after insertion: {len(g)} triples")

# Save the updated graph
g.serialize(destination="human_proteins_go.ttl", format="turtle")

# Verify the relationships were added
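verify_query = """
    PREFIX ex: <http://example.org/>
    
    SELECT (COUNT(?protein) AS ?proteinCount)
    WHERE {
        ?protein ex:potentialCauseOf ex:InflammatoryBowelDisease .
    }
"""

qres = g.query(verify_query)
for row in qres:
    # The variable is deliberately not named ?count: result rows are tuples,
    # so row.count would return the built-in tuple method, not the value
    print(f"\nNumber of proteins linked to IBD: {row.proteinCount}")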

# Task 6:

> - Find GO terms that are unique to proteins (i.e., GO terms associated with only one protein).
> - List the unique GO terms, their labels, and the corresponding protein identifiers.

In [None]:
from rdflib import Graph

# Load the graph
g = Graph()
g.parse("human_proteins_go.ttl", format="turtle")

# Query to find GO terms associated with exactly one protein
unique_go_query = """
    PREFIX up: <http://purl.uniprot.org/core/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    
    SELECT ?go ?goLabel
           (COUNT(DISTINCT ?protein) AS ?proteinCount)
           (SAMPLE(?protein) AS ?singleProtein)
    WHERE {
        ?protein rdf:type <http://example.org/HumanProtein> .
        ?protein up:classifiedWith ?go .
        ?go rdfs:label ?goLabel .
    }
    # One group per GO term; only groups with exactly one protein are kept,
    # so SAMPLE returns that single protein
    GROUP BY ?go ?goLabel
    HAVING (COUNT(DISTINCT ?protein) = 1)
    ORDER BY ?goLabel
"""

qres = g.query(unique_go_query)
results = list(qres)

print(f"Found {len(results)} unique GO terms (associated with only one protein):\n")
print(f"{'GO Term':<50} {'Label':<60} {'Protein'}")
print("-" * 150)

for i, row in enumerate(results[:20]):  # Display first 20
    go_short = str(row.go).split("/")[-1] if row.go else "N/A"
    label_short = str(row.goLabel)[:57] + "..." if len(str(row.goLabel)) > 60 else str(row.goLabel)
    protein_short = str(row.singleProtein).split("/")[-1] if row.singleProtein else "N/A"
    print(f"{go_short:<50} {label_short:<60} {protein_short}")

if len(results) > 20:
    print(f"\n... and {len(results) - 20} more unique GO terms")
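
The grouping logic of the query can be cross-checked in plain Python. The sketch below works on a hand-written list of hypothetical (protein, GO term) pairs rather than on the real graph: it collects the set of proteins per term and keeps the terms whose set has exactly one member.

```python
from collections import defaultdict

# Hypothetical (protein, GO term) pairs standing in for up:classifiedWith triples
pairs = [
    ("P1", "GO_A"),
    ("P2", "GO_A"),
    ("P1", "GO_B"),
    ("P3", "GO_C"),
]

# Group proteins by GO term; a set removes duplicate pairs automatically
proteins_by_term = defaultdict(set)
for protein, term in pairs:
    proteins_by_term[term].add(protein)

# Keep terms with exactly one protein and record that protein
unique_terms = {
    term: next(iter(proteins))
    for term, proteins in proteins_by_term.items()
    if len(proteins) == 1
}
print(unique_terms)
```
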