## ASSIGNMENT 5 - SPARQL QUERIES
#### ANDREA ESCOLAR PEÑA

In [1]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

How many protein records are in UniProt? 

In [2]:
PREFIX up: <http://purl.uniprot.org/core/>

SELECT (COUNT(?protein) AS ?n)
WHERE
{
    ?protein a up:Protein .
}

n
360157660


How many Arabidopsis thaliana protein records are in UniProt?

In [3]:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX up: <http://purl.uniprot.org/core/>

SELECT (COUNT (DISTINCT ?protein) AS ?n) 
WHERE
{
    ?protein a up:Protein .
    ?protein ?pred1 ?taxon .
    ?taxon a up:Taxon .
    ?taxon ?pred2 "Arabidopsis thaliana" 
}

n
136782
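
The query above leaves both predicates as variables (`?pred1`, `?pred2`), so the endpoint has to consider every property that could link a protein to a taxon. A possible tighter variant is sketched below. It assumes that `up:organism` and `up:scientificName` from the UniProt core vocabulary are the properties actually being matched. This variant was not run for the assignment, so its count may differ from the figure above.

```sparql
PREFIX up: <http://purl.uniprot.org/core/>

# Sketch only: names the predicates instead of leaving them as variables.
# up:scientificName is an assumption; it does not appear in the original query.
SELECT (COUNT(DISTINCT ?protein) AS ?n)
WHERE
{
    ?protein a up:Protein ;
             up:organism ?taxon .
    ?taxon up:scientificName "Arabidopsis thaliana" .
}
```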


Retrieve pictures of Arabidopsis thaliana from UniProt

In [4]:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX up: <http://purl.uniprot.org/core/>

SELECT ?picture
WHERE
{
    ?taxon a up:Taxon .
    ?taxon ?pred1 "Arabidopsis thaliana" .
    ?taxon ?pred2 ?picture .
    ?picture a foaf:Image
}

picture
https://upload.wikimedia.org/wikipedia/commons/3/39/Arabidopsis.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Arabidopsis_thaliana_inflorescencias.jpg/800px-Arabidopsis_thaliana_inflorescencias.jpg


What is the description of the enzyme activity of UniProt Protein Q9SZZ8?

In [5]:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>

SELECT ?description
WHERE
{
  uniprotkb:Q9SZZ8 ?pred ?enzyme .
  ?enzyme up:activity ?activity .
  ?activity a up:Catalytic_Activity .
  ?activity rdfs:label ?description
 
}

description
Beta-carotene + 4 reduced ferredoxin [iron-sulfur] cluster + 2 H(+) + 2 O(2) = zeaxanthin + 4 oxidized ferredoxin [iron-sulfur] cluster + 2 H(2)O.


Retrieve the protein IDs and dates of submission for proteins that have been added to UniProt this year (HINT: Google for “SPARQL FILTER by date”)

In [6]:
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX up: <http://purl.uniprot.org/core/>

SELECT ?id ?date
WHERE
{
  ?protein a up:Protein .
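  # The accession is everything after the 32-character prefix http://purl.uniprot.org/uniprot/
  BIND (SUBSTR(STR(?protein), 33) AS ?id)
  ?protein up:created ?date .
  # The dates returned are plain calendar dates, so compare against full xsd:date literals
  FILTER (?date >= "2021-01-01"^^xsd:date && ?date < "2022-01-01"^^xsd:date) .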
} LIMIT 10

id,date
A0A1H7ADE3,2021-06-02
A0A1V1AIL4,2021-06-02
A0A2Z0L603,2021-06-02
A0A4J5GG53,2021-04-07
A0A6G8SU52,2021-02-10
A0A6G8SU69,2021-02-10
A0A7C9JLR7,2021-02-10
A0A7C9JMZ7,2021-02-10
A0A7C9KUQ4,2021-02-10
A0A7D4HP61,2021-02-10


The result is limited to 10 rows because, without the LIMIT, the query takes a very long time in the notebook (I was not able to get the results at all). On the UniProt SPARQL web page, the results are returned quickly.

How many species are in the UniProt taxonomy?

In [7]:
PREFIX up: <http://purl.uniprot.org/core/>

SELECT (COUNT (DISTINCT ?species) AS ?count) 
WHERE
{
  ?species up:rank up:Species .
} 

count
2029846


How many species have at least one protein record? (This might take a long time to execute, so do this one last!)

In [9]:
PREFIX up: <http://purl.uniprot.org/core/>

SELECT (COUNT (DISTINCT ?species) AS ?count)
WHERE
{

  ?protein a up:Protein .
  ?protein ?pred ?species .
  ?species up:rank up:Species .
  
} 

This query did not return a result in the Jupyter notebook. Executed on the UniProt SPARQL web page, it returns 1057158.
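
One likely reason for the slow execution is that the predicate between `?protein` and `?species` is a variable, which forces the endpoint to scan every property of every protein. A minimal sketch with the predicate fixed to `up:organism` (the same property used later in this assignment) is shown below. It was not run here, so it may still time out in the notebook and its count is not guaranteed to match the one above.

```sparql
PREFIX up: <http://purl.uniprot.org/core/>

# Sketch only: same question, with the protein-to-taxon link named explicitly.
SELECT (COUNT(DISTINCT ?species) AS ?count)
WHERE
{
    ?protein a up:Protein ;
             up:organism ?species .
    ?species up:rank up:Species .
}
```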

Find the AGI codes and gene names for all Arabidopsis thaliana proteins that have a protein function annotation description that mentions “pattern formation”

In [8]:
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>


SELECT ?locus_name ?gene_name
WHERE
{
    ?protein a up:Protein .
    ?protein ?pred1 ?taxon .
    ?taxon a up:Taxon .
    ?taxon ?pred2 "Arabidopsis thaliana" .

    ?protein ?pred3 ?gene .
    ?gene a up:Gene .
    ?gene up:locusName ?locus_name .
    ?gene skos:prefLabel ?gene_name .

    ?protein ?pred4 ?annotation .
    ?annotation a up:Function_Annotation .
    ?annotation rdfs:comment ?description .
    FILTER regex(?description, "pattern formation") .

} 

locus_name,gene_name
At3g54220,SCR
At1g13980,GN
At4g21750,ATML1
At5g40260,SWEET8
At1g69670,CUL3B
At1g63700,YDA
At2g46710,ROPGAP3
At1g26830,CUL3A
At3g09090,DEX1
At5g55250,IAMT1


What is the MetaNetX Reaction identifier (starts with “mnxr”) for the UniProt Protein uniprotkb:Q18A79?

In [None]:
%endpoint https://rdf.metanetx.org/sparql

In [10]:
PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?MNX_id

WHERE{
  
  ?pept mnx:peptXref uniprotkb:Q18A79 .
  
  ?cata mnx:pept ?pept .

  ?gpr mnx:cata ?cata .

  ?gpr mnx:reac ?reac  .
    
  ?reac mnx:mnxr ?reacR .

  ?reacR rdfs:label ?MNX_id
}

MNX_id
MNXR145046
MNXR165934


What is the official Gene ID (UniProt calls this a “mnemonic”) and the MetaNetX Reaction identifier (mnxr…..) for the protein that has “Starch synthase” catalytic activity in Clostridium difficile (taxon 272563)?

In [11]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX up: <http://purl.uniprot.org/core/>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?mnemonic ?MNX_id
WHERE
{
    ?protein a up:Protein .
    ?protein up:organism taxon:272563 .

    ?protein up:classifiedWith ?goTerm .
    ?goTerm rdfs:subClassOf <http://purl.obolibrary.org/obo/GO_0003674> .
    ?goTerm rdfs:label ?moltype .

    FILTER regex(?moltype, "starch synthase") .
    ?protein up:mnemonic ?mnemonic .

    SERVICE <https://rdf.metanetx.org/sparql> {
        ?pept a mnx:PEPT .
        ?pept mnx:peptXref ?protein .
        ?cata mnx:pept ?pept .
        ?gpr mnx:cata ?cata .
        ?gpr mnx:reac ?reac .
        ?reac mnx:mnxr ?reacR .
        ?reacR rdfs:label ?MNX_id
    }
  
} 

mnemonic,MNX_id
GLGA_CLOD6,MNXR145046
GLGA_CLOD6,MNXR165934
