# Assignment 5: SPARQL

Q1. How many protein records are in UniProt? 

In [1]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX up:<http://purl.uniprot.org/core/> 

SELECT (COUNT(?protein) as ?count)
WHERE {
    ?protein a up:Protein .
}

count
360157660
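
The cells in this notebook request `%format JSON`. When the same query is run outside the notebook, the raw response has to be unpacked by hand. The following is a minimal sketch, assuming the response follows the W3C SPARQL 1.1 Query Results JSON format. The sample payload is hand-written for illustration: only the count is taken from the output above, the datatype IRI is an assumption, and `rows_from_sparql_json` is a hypothetical helper name.

```python
import json

# Hand-written sample shaped like a SPARQL 1.1 JSON result.
# The count is copied from the Q1 output; the datatype IRI is an assumption.
SAMPLE = """
{
  "head": {"vars": ["count"]},
  "results": {"bindings": [
    {"count": {"type": "literal",
               "datatype": "http://www.w3.org/2001/XMLSchema#integer",
               "value": "360157660"}}
  ]}
}
"""

INTEGER_TYPES = {
    "http://www.w3.org/2001/XMLSchema#integer",
    "http://www.w3.org/2001/XMLSchema#int",
    "http://www.w3.org/2001/XMLSchema#long",
}


def rows_from_sparql_json(payload):
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts."""
    variables = payload["head"]["vars"]
    rows = []
    for binding in payload["results"]["bindings"]:
        row = {}
        for var in variables:
            cell = binding.get(var)  # unbound variables are absent from a binding
            if cell is None:
                row[var] = None
            elif cell.get("datatype") in INTEGER_TYPES:
                row[var] = int(cell["value"])  # values always arrive as strings
            else:
                row[var] = cell["value"]
        rows.append(row)
    return rows


rows = rows_from_sparql_json(json.loads(SAMPLE))
print(rows)
```

Every value in this format is delivered as a string, so the numeric conversion is driven by the `datatype` field rather than by guessing from the text.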


Q2.  How many Arabidopsis thaliana protein records are in UniProt? 

In [1]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>

SELECT (COUNT(DISTINCT ?protein) as ?countA)
WHERE 
{
   ?protein a up:Protein .
   ?protein up:organism taxon:3702 .
}


countA
136782


Q3. Retrieve pictures of Arabidopsis thaliana from UniProt.

In [2]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX up:<http://purl.uniprot.org/core/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?name ?image
WHERE {
       ?taxon  foaf:depiction  ?image .
       ?taxon up:scientificName ?name .
  FILTER regex(?name, '^Arabidopsis.*', 'i') .
}

name,image
Arabidopsis thaliana,https://upload.wikimedia.org/wikipedia/commons/3/39/Arabidopsis.jpg
Arabidopsis thaliana,https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Arabidopsis_thaliana_inflorescencias.jpg/800px-Arabidopsis_thaliana_inflorescencias.jpg


Q4. What is the description of the enzyme activity of UniProt protein Q9SZZ8?

In [3]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?description
WHERE 
{
	uniprotkb:Q9SZZ8 a up:Protein ;
             up:enzyme ?prot .
	?prot up:activity ?acti .
	?acti rdfs:label ?description
}

description
Beta-carotene + 4 reduced ferredoxin [iron-sulfur] cluster + 2 H(+) + 2 O(2) = zeaxanthin + 4 oxidized ferredoxin [iron-sulfur] cluster + 2 H(2)O.


Q5. Retrieve the protein IDs and dates of submission for proteins that have been added to UniProt this year (HINT: Google for “SPARQL FILTER by date”).

In [None]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX up:<http://purl.uniprot.org/core/> 

SELECT ?id ?date
WHERE{
  ?protein a up:Protein . 
  ?protein up:mnemonic ?id .
  ?protein up:created ?date .
  FILTER (contains(STR(?date), "2021"))
}
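
The substring test on `STR(?date)` works because the date is serialized as `YYYY-MM-DD`. An alternative that matches the hint more literally is a typed date comparison. The sketch below builds such a query for any year; it assumes `up:created` values are typed as `xsd:date`, and `created_in_year_query` is a hypothetical helper name, not part of any library.

```python
from datetime import date


def created_in_year_query(year):
    """Return a SPARQL query for proteins whose up:created date falls in `year`."""
    start = date(year, 1, 1).isoformat()
    end = date(year + 1, 1, 1).isoformat()
    # Assumption: up:created is typed as xsd:date, so the comparison is a date
    # comparison rather than a string comparison.
    return f"""PREFIX up: <http://purl.uniprot.org/core/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?id ?date
WHERE {{
  ?protein a up:Protein ;
           up:mnemonic ?id ;
           up:created ?date .
  # Half-open interval: first day of the year up to, but excluding, the next year.
  FILTER (?date >= "{start}"^^xsd:date && ?date < "{end}"^^xsd:date)
}}"""


query = created_in_year_query(2021)
print(query)
```

The printed text can be pasted into a cell after the `%endpoint` and `%format` lines. Using a half-open interval avoids having to know the last day of the year.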

Q6. How many species are in the UniProt taxonomy?

In [1]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX up:<http://purl.uniprot.org/core/>

SELECT (COUNT(DISTINCT ?taxon) AS ?countSpecies)
WHERE{
  ?taxon a up:Taxon .
  ?taxon up:rank up:Species 
}

countSpecies
2029846


Q7. How many species have at least one protein record? (this might take a long time to execute, so do this one last!)

In [None]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX up:<http://purl.uniprot.org/core/>

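SELECT (COUNT(DISTINCT ?taxon) AS ?countSpecies)
WHERE{
  ?protein a up:Protein .
  # Tie each protein to its taxon; a protein pattern that does not mention
  # ?taxon would not restrict which species are counted.
  ?protein up:organism ?taxon .
  ?taxon a up:Taxon .
  ?taxon up:rank up:Species .
}
# No result recorded: the query took too long to run.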
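
Counting species "with at least one protein record" only works when the protein pattern shares a variable with the taxon patterns. The toy sketch below uses invented triples (not UniProt data) to show the difference: unlinked patterns pair every species with every protein, while linked patterns keep only species that some protein points to.

```python
# Toy data, invented for illustration: (subject, predicate, object) triples.
TRIPLES = [
    ("taxon:1", "rank", "Species"),
    ("taxon:2", "rank", "Species"),
    ("taxon:3", "rank", "Species"),
    ("prot:A", "organism", "taxon:1"),
    ("prot:B", "organism", "taxon:1"),
    ("prot:C", "organism", "taxon:3"),
]

species = {s for s, p, o in TRIPLES if p == "rank" and o == "Species"}
proteins = {s for s, p, o in TRIPLES if p == "organism"}

# Unlinked patterns: every species is combined with every protein
# (a cross product), so no species is ever filtered out.
unlinked = {(t, prot) for t in species for prot in proteins}
count_unlinked = len({t for t, _ in unlinked})

# Linked patterns: a solution exists only where a protein's organism is the taxon.
linked = {(o, s) for s, p, o in TRIPLES if p == "organism" and o in species}
count_linked = len({t for t, _ in linked})

print(len(unlinked), count_unlinked)  # 9 solutions, 3 species
print(len(linked), count_linked)      # 3 solutions, 2 species
```

On a real endpoint the cross product is also what makes such a query expensive, since the number of intermediate solutions grows with the product of both sets.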

Q8. Find the AGI codes and gene names for all Arabidopsis thaliana proteins that have a protein function annotation description that mentions “pattern formation”.

In [2]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX taxon:<http://purl.uniprot.org/taxonomy/> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
PREFIX skos:<http://www.w3.org/2004/02/skos/core#> 

SELECT DISTINCT ?AGI ?name
WHERE{
  ?protein a up:Protein .
  ?protein up:organism taxon:3702 .
  ?protein up:annotation ?annotation .
  ?annotation rdfs:comment ?description .
  FILTER (contains(STR(?description), "pattern formation")).
  
  ?protein up:encodedBy ?gene .
  ?gene up:locusName ?AGI .
  ?gene skos:prefLabel ?name
}

AGI,name
At3g54220,SCR
At1g13980,GN
At4g21750,ATML1
At5g40260,SWEET8
At1g69670,CUL3B
At1g63700,YDA
At2g46710,ROPGAP3
At1g26830,CUL3A
At1g55325,MED13
At3g09090,DEX1


Q9. What is the MetaNetX Reaction identifier (starts with “mnxr”) for the UniProt protein uniprotkb:Q18A79?

In [4]:
%endpoint https://rdf.metanetx.org/sparql
%format JSON

PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
SELECT DISTINCT ?reacID
WHERE{
    ?pept mnx:peptXref uniprotkb:Q18A79 .
    ?cata mnx:pept ?pept .
    ?gpr mnx:cata ?cata ;
         mnx:reac ?reac .
    ?reac rdfs:label ?reacID
}

reacID
mnxr165934
mnxr145046c3


Q10. What is the official Gene ID (UniProt calls this a “mnemonic”) and the MetaNetX Reaction identifier (mnxr…) for the protein that has “Starch synthase” catalytic activity in Clostridium difficile (taxon 272563)?

In [5]:
%endpoint https://rdf.metanetx.org/sparql
%format JSON
 
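PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>

SELECT DISTINCT ?geneID ?reacID
WHERE {
    # This cell targets MetaNetX, so the UniProt patterns are sent to the
    # UniProt endpoint as a federated sub-query.
    SERVICE <https://sparql.uniprot.org/sparql> {
        ?protein a up:Protein ;
                 up:organism taxon:272563 ;
                 up:mnemonic ?geneID ;   # the mnemonic belongs to the protein, not the enzyme
                 up:enzyme ?enzyme .
        # Assumption: the enzyme name is exposed as skos:prefLabel on the enzyme.
        ?enzyme skos:prefLabel 'Starch synthase' .
    }
    # MetaNetX side: peptide cross-reference -> catalyst -> reaction label.
    ?pept mnx:peptXref ?protein .
    ?cata mnx:pept ?pept .
    ?gpr mnx:cata ?cata ;
         mnx:reac ?reac .
    ?reac rdfs:label ?reacID
}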

geneID,reacID
