# Assignment 5 SPARQL queries

I would like you to create the SPARQL query that will answer each of these questions.  Please submit the queries as a Jupyter notebook with the SPARQL kernel activated.  NO programming is required! Submit to GitHub as usual, WITH THE ANSWERS STILL VISIBLE IN THE NOTEBOOK.   Thanks!

For many of these you will need to look up how to use the SPARQL functions ‘COUNT’ and ‘DISTINCT’ (we used ‘DISTINCT’ in class), and probably a few others.
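As a minimal sketch of how the two differ, the query below counts a small set of made-up inline values (the `VALUES` data and the variable names `?x`, `?all` and `?unique` are illustrative assumptions, not UniProt content). It should run against any SPARQL 1.1 endpoint once `%endpoint` has been set.

```sparql
# Three inline rows, two of them identical
SELECT (COUNT(?x) AS ?all) (COUNT(DISTINCT ?x) AS ?unique)
WHERE {
    VALUES ?x { "a" "a" "b" }
}
```

Under standard SPARQL 1.1 semantics, `?all` counts every row (3) while `?unique` counts each different value once (2).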


UniProt SPARQL Endpoint:  http://sparql.uniprot.org/sparql

### Q1: 1 POINT  How many protein records are in UniProt? 

In [13]:
%endpoint http://sparql.uniprot.org/sparql
%format JSON

PREFIX up: <http://purl.uniprot.org/core/> 

SELECT (COUNT(DISTINCT ?protein) as ?count)
WHERE{
    ?protein a up:Protein .
}

count
360157660


### Q2: 1 POINT How many Arabidopsis thaliana protein records are in UniProt?

In [1]:
%endpoint http://sparql.uniprot.org/sparql
%format JSON

PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX up: <http://purl.uniprot.org/core/>

SELECT (COUNT(DISTINCT ?protein) as ?count)
WHERE {
    ?protein a up:Protein ;
               up:organism ?taxon .
    ?taxon a up:Taxon ;
             up:scientificName "Arabidopsis thaliana"
}

count
136782


### Q3: 1 POINT  Retrieve pictures of Arabidopsis thaliana from UniProt.

In [16]:
%endpoint http://sparql.uniprot.org/sparql
%format JSON

PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX foaf:<http://xmlns.com/foaf/0.1/>

SELECT ?name ?image
WHERE {
       ?taxon    foaf:depiction  ?image .
       ?taxon    up:scientificName   ?name .
       FILTER(CONTAINS(?name, "Arabidopsis thaliana"))
}

name,image
Arabidopsis thaliana,https://upload.wikimedia.org/wikipedia/commons/3/39/Arabidopsis.jpg
Arabidopsis thaliana,https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Arabidopsis_thaliana_inflorescencias.jpg/800px-Arabidopsis_thaliana_inflorescencias.jpg


### Q4: 1 POINT:  What is the description of the enzyme activity of UniProt Protein Q9SZZ8?

In [5]:
%endpoint http://sparql.uniprot.org/sparql
%format JSON

PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?description
WHERE {
    uniprotkb:Q9SZZ8 a up:Protein ;
                       up:enzyme ?enzyme .
    # The activity is reached through the enzyme class; its label holds the reaction description
    ?enzyme up:activity ?activity .
    ?activity a up:Catalytic_Activity ;
                rdfs:label ?description .
}

description
Beta-carotene + 4 reduced ferredoxin [iron-sulfur] cluster + 2 H(+) + 2 O(2) = zeaxanthin + 4 oxidized ferredoxin [iron-sulfur] cluster + 2 H(2)O.


### Q5: 1 POINT:  Retrieve the protein IDs and submission dates for proteins that have been added to UniProt this year (HINT: Google for “SPARQL FILTER by date”)

In [1]:
%endpoint http://sparql.uniprot.org/sparql
%format JSON

PREFIX up:<http://purl.uniprot.org/core/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT ?id ?date
WHERE {
    ?protein a up:Protein ;
               up:mnemonic ?id ;
               up:created ?date .
    # >= so that entries created on 1 January are included
    FILTER (?date >= "2022-01-01"^^xsd:date)
}

id,date
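The result table above shows no rows. If the full query is too slow or returns more rows than the notebook can display, one option is to bound the date range on both sides and cap the output. The sketch below assumes 2022 as "this year", as in the query above; the upper bound and the `LIMIT` value of 20 are arbitrary choices made for illustration.

```sparql
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?id ?date
WHERE {
    ?protein a up:Protein ;
             up:mnemonic ?id ;
             up:created ?date .
    # Keep only entries created during the chosen calendar year
    FILTER (?date >= "2022-01-01"^^xsd:date && ?date < "2023-01-01"^^xsd:date)
}
# Return a small sample rather than every matching row
LIMIT 20
```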


### Q6: 1 POINT  How many species are in the UniProt taxonomy?

In [18]:
%endpoint http://sparql.uniprot.org/sparql
%format JSON

PREFIX up:<http://purl.uniprot.org/core/>

SELECT (COUNT (DISTINCT ?taxon) as ?count)
WHERE{
    ?taxon a up:Taxon ;
             up:rank up:Species
}

count
2029846


### Q7: 2 POINTS  How many species have at least one protein record? (This might take a long time to execute, so do this one last!)

In [1]:
%endpoint http://sparql.uniprot.org/sparql
%format JSON

PREFIX up:<http://purl.uniprot.org/core/>

SELECT (COUNT (DISTINCT ?taxon) as ?count)
WHERE {
    ?protein a up:Protein ;
               up:organism ?taxon.
    ?taxon a up:Taxon;
             up:rank up:Species             
}

count
1057158


### Q8: 3 POINTS:  Find the AGI codes and gene names for all Arabidopsis thaliana proteins that have a protein function annotation description that mentions “pattern formation”

In [67]:
%endpoint  http://sparql.uniprot.org/sparql
%format JSON

PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?agis ?genename
WHERE {
    ?protein a up:Protein ;
               up:organism ?taxon ;
               up:annotation ?annotation ;
               up:encodedBy ?gene .
    ?taxon a up:Taxon ;
             up:scientificName "Arabidopsis thaliana" .    
    ?gene up:locusName ?agis ;
          skos:prefLabel ?genename .
    ?annotation a up:Function_Annotation ;
                  rdfs:comment ?function 
    FILTER(CONTAINS(str(?function), "pattern formation"))
}

agis,genename
At3g54220,SCR
At4g21750,ATML1
At1g13980,GN
At5g40260,SWEET8
At1g69670,CUL3B
At1g63700,YDA
At2g46710,ROPGAP3
At1g26830,CUL3A
At3g09090,DEX1
At4g37650,SHR


### From the MetaNetX metabolic networks for metagenomics database SPARQL Endpoint: https://rdf.metanetx.org/sparql
(this slide deck will make it much easier for you!  https://www.metanetx.org/cgi-bin/mnxget/mnxref/MetaNetX_RDF_schema.pdf)


### Q9: 4 POINTS:  What is the MetaNetX Reaction identifier (starts with “mnxr”) for the UniProt Protein uniprotkb:Q18A79?

In [11]:
%endpoint  https://rdf.metanetx.org/sparql
%format JSON

PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

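SELECT DISTINCT ?rid
WHERE {
    # Same graph pattern as before; variables renamed to follow the chain
    # peptide -> catalyst -> gene-protein-reaction record -> reaction
    ?pept mnx:peptXref uniprotkb:Q18A79 .
    ?cata mnx:pept ?pept .
    ?gpr mnx:cata ?cata ;
         mnx:reac ?reaction .
    ?reaction rdfs:label ?rid .
}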

rid
mnxr165934
mnxr145046c3


### FEDERATED QUERY - UniProt and MetaNetX

### Q10: 5 POINTS:  What is the official Gene ID (UniProt calls this a “mnemonic”) and the MetaNetX Reaction identifier (mnxr…..) for the protein that has “Starch synthase” catalytic activity in Clostridium difficile (taxon 272563)?


In [2]:
%endpoint  http://sparql.uniprot.org/sparql
%format JSON

PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>

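SELECT DISTINCT ?id ?rid
WHERE {
    # UniProt side: proteins of taxon 272563 whose function annotation mentions starch synthase
    ?protein a up:Protein ;
               up:organism taxon:272563 ;
               up:mnemonic ?id ;
               up:annotation ?annotation .
    ?annotation a up:Function_Annotation ;
                  rdfs:comment ?function .
    # Lower-case the text so that "Starch synthase" and "starch synthase" both match
    FILTER(CONTAINS(LCASE(STR(?function)), "starch synthase"))

    SERVICE <https://rdf.metanetx.org/sparql> {
        # ?protein is already the UniProt IRI, so it is used directly as the cross-reference.
        # A prefixed name cannot be built from a variable (uniprotkb:?id is not valid SPARQL).
        ?pept mnx:peptXref ?protein .
        ?cata mnx:pept ?pept .
        ?gpr mnx:cata ?cata ;
             mnx:reac ?reaction .
        ?reaction rdfs:label ?rid .
    }
}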