ASSIGNMENT 5: SPARQL queries

AUTHOR: **LUCÍA SÁNCHEZ GONZÁLEZ**


In [1]:
# First, set the endpoint and the result format used by the queries below:

%endpoint https://sparql.uniprot.org/sparql
%format JSON 
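
The `%endpoint` and `%format` magics only exist inside the SPARQL notebook kernel. As a rough sketch of what they correspond to outside the notebook, the snippet below builds a plain HTTP request following the SPARQL 1.1 Protocol: the query goes in a `query` URL parameter and JSON results are requested through the `Accept` header. The helper name `build_request` is hypothetical, and the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://sparql.uniprot.org/sparql"  # same endpoint as the %endpoint magic


def build_request(query, endpoint=ENDPOINT):
    """Build (but do not send) an HTTP GET request for a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the SPARQL JSON results format, mirroring `%format JSON`.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request("SELECT * WHERE { ?s ?p ?o } LIMIT 1")
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without network access.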

# Q1: 1 POINT  How many protein records are in UniProt? 

In [2]:

PREFIX up: <http://purl.uniprot.org/core/>

SELECT 
    (COUNT(?protein) AS ?protein_count)
WHERE
{   
  ?protein rdf:type up:Protein .
}


protein_count
360157660
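
With `%format JSON`, the endpoint answers in the W3C SPARQL 1.1 Query Results JSON Format, and the kernel renders it as the small table above. The sketch below shows how such a document could be flattened by hand. The JSON text is written out for illustration (it is not a captured response), using the count reported above; the helper name `rows` is hypothetical.

```python
import json

# Illustrative document in the SPARQL 1.1 JSON results shape;
# the value is the count shown in the notebook output above.
raw = """
{
  "head": {"vars": ["protein_count"]},
  "results": {"bindings": [
    {"protein_count": {"type": "literal",
                       "datatype": "http://www.w3.org/2001/XMLSchema#integer",
                       "value": "360157660"}}
  ]}
}
"""


def rows(document):
    """Flatten SPARQL JSON results into one {variable: value} dict per row."""
    parsed = json.loads(document)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in parsed["results"]["bindings"]
    ]


# Values arrive as strings, so numeric results need an explicit conversion.
count = int(rows(raw)[0]["protein_count"])
print(f"{count:,} protein records")
```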


# Q2: 1 POINT How many Arabidopsis thaliana protein records are in UniProt?

In [6]:
#### FIRST OPTION:
# The disadvantage of this query is that you must
# already know the taxon ID of the organism (3702 here)

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>

SELECT (COUNT(DISTINCT(?protein)) AS ?protein_count)
WHERE
{   
    ?protein rdf:type up:Protein ;
        up:organism taxon:3702 .        
}


protein_count
136782


In [9]:
#### SECOND OPTION:
# Here the query matches the organism by its scientific name, so no taxon ID is needed

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX up: <http://purl.uniprot.org/core/>

SELECT
    (COUNT(DISTINCT(?protein)) AS ?protein_count)
WHERE
{   
    ?protein rdf:type up:Protein ;
        up:organism ?taxon .
    ?taxon rdf:type up:Taxon ;
       up:scientificName "Arabidopsis thaliana" .        
}


protein_count
136782
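
The two options differ only in how the organism is pinned down: by taxon ID or by scientific name. A minimal sketch of that difference, written as a hypothetical Python helper `count_query` (not part of the notebook) that emits either form of the query text:

```python
PREFIXES = (
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
    "PREFIX up: <http://purl.uniprot.org/core/>\n"
    "PREFIX taxon: <http://purl.uniprot.org/taxonomy/>\n"
)


def count_query(taxon_id=None, scientific_name=None):
    """Return a protein-count query restricted by taxon ID or by scientific name."""
    if (taxon_id is None) == (scientific_name is None):
        raise ValueError("pass exactly one of taxon_id or scientific_name")
    if taxon_id is not None:
        # First option: the taxon IRI is written directly into the pattern.
        organism = f"up:organism taxon:{int(taxon_id)} ."
    else:
        # Second option: look the taxon up by name (escape quotes and backslashes).
        escaped = scientific_name.replace("\\", "\\\\").replace('"', '\\"')
        organism = (
            "up:organism ?taxon .\n"
            "    ?taxon rdf:type up:Taxon ;\n"
            f'        up:scientificName "{escaped}" .'
        )
    return (
        PREFIXES
        + "SELECT (COUNT(DISTINCT ?protein) AS ?protein_count)\n"
        + "WHERE {\n"
        + "    ?protein rdf:type up:Protein ;\n"
        + f"        {organism}\n"
        + "}\n"
    )


print(count_query(scientific_name="Arabidopsis thaliana"))
```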


# Q3: 1 POINT Retrieve pictures of Arabidopsis thaliana from UniProt.

Here I used the property foaf:depiction, whose values are image URLs.

In [11]:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX up: <http://purl.uniprot.org/core/>

SELECT ?picture
WHERE
{
   ?taxon rdf:type up:Taxon ;
       up:scientificName "Arabidopsis thaliana" ;
       foaf:depiction  ?picture .
                
}

picture
https://upload.wikimedia.org/wikipedia/commons/3/39/Arabidopsis.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Arabidopsis_thaliana_inflorescencias.jpg/800px-Arabidopsis_thaliana_inflorescencias.jpg


# Q4:  1 POINT  What is the description of the enzyme activity of UniProt Protein Q9SZZ8

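The uniprotkb: prefix lets the query refer to the protein Q9SZZ8 directly by its accession. From there, the query follows up:enzyme to the enzyme, up:activity to its catalytic activity, and rdfs:label to the description of that activity.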

In [12]:
PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX uniprotkb:<http://purl.uniprot.org/uniprot/> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
PREFIX skos:<http://www.w3.org/2004/02/skos/core#>

SELECT ?description
WHERE
{
  uniprotkb:Q9SZZ8 rdf:type up:Protein ;
          up:enzyme ?enzyme . 
  ?enzyme up:activity ?enzyme_activity . 
  ?enzyme_activity rdfs:label ?description . 
                 
}


description
Beta-carotene + 4 reduced ferredoxin [iron-sulfur] cluster + 2 H(+) + 2 O(2) = zeaxanthin + 4 oxidized ferredoxin [iron-sulfur] cluster + 2 H(2)O.


# Q5: 1 POINT  Retrieve the proteins ids, and date of submission, for proteins that have been added to UniProt this year.

When I run this query on the UniProt SPARQL web page it works correctly, but in the Jupyter notebook it fails with the following error: "Error: Query processing error: [Errno 104] Connection reset by peer". To show the results, I therefore include a screenshot of the results from the SPARQL page, called "dates_protein_id.png".

This answer is inspired by: https://stackoverflow.com/questions/24051435/filter-by-date-range-in-sparql

To obtain the protein IDs and creation dates for this year, I filter on up:created, comparing it against a cutoff written as a typed literal in the standard xsd:dateTime format.

In [3]:
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 

SELECT ?protein_id ?date_submission
WHERE
{
    ?protein rdf:type up:Protein .
    ?protein up:mnemonic ?protein_id .
    ?protein up:created ?date_submission  .
    FILTER (?date_submission > "2021-01-01T10:00:00+00:00"^^xsd:dateTime)
             
}
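
The cutoff in the query above is hard-coded for 2021, so it has to be edited by hand every year. As a sketch of one way around that, the hypothetical helper below (`created_since_year_start`, not part of the notebook) derives the cutoff from the current date and writes it as the same kind of xsd:dateTime literal. Two things are assumptions here: the cutoff is set to midnight UTC on 1 January with `>=`, and an optional `LIMIT` is offered on the guess that the connection reset may be related to the size of the result set.

```python
from datetime import date


def created_since_year_start(today=None, limit=None):
    """Return the query text with the cutoff set to 1 January of the current year."""
    today = today or date.today()
    cutoff = f"{today.year}-01-01T00:00:00+00:00"  # midnight UTC (assumption)
    lines = [
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>",
        "PREFIX up: <http://purl.uniprot.org/core/>",
        "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>",
        "",
        "SELECT ?protein_id ?date_submission",
        "WHERE",
        "{",
        "    ?protein rdf:type up:Protein ;",
        "        up:mnemonic ?protein_id ;",
        "        up:created ?date_submission .",
        f'    FILTER (?date_submission >= "{cutoff}"^^xsd:dateTime)',
        "}",
    ]
    if limit is not None:
        lines.append(f"LIMIT {int(limit)}")  # optional cap on the number of rows
    return "\n".join(lines)


print(created_since_year_start(date(2021, 6, 15), limit=10))
```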


# Q6: 1 POINT How  many species are in the UniProt taxonomy?

I used the object property up:rank, which gives the rank of a taxon, and kept only the taxa whose rank is up:Species.

In [2]:
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 

SELECT 
(COUNT (DISTINCT ?specie_taxon) AS ?count_taxon)

WHERE
{
  ?specie_taxon rdf:type up:Taxon;
          up:rank up:Species .
}


count_taxon
2029846


# Q7: 2 POINT  How many species have at least one protein record? 

Starting from the species in the UniProt taxonomy (see previous query), count those that are the organism of at least one protein record.

In [13]:
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 

SELECT (COUNT (DISTINCT ?specie_rank) AS ?count_species)

WHERE
{
  ?protein rdf:type up:Protein ;
    up:organism ?specie_rank .
  ?specie_rank rdf:type up:Taxon;
          up:rank up:Species .
  
}


count_species
1057158
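
Putting this count next to the one from Q6 gives a rough idea of coverage. The short calculation below uses only the two numbers reported in this notebook; the variable names are mine.

```python
species_in_taxonomy = 2029846   # result of Q6
species_with_protein = 1057158  # result of Q7

# Share of species-rank taxa that are the organism of at least one protein record.
share = species_with_protein / species_in_taxonomy
print(f"{share:.1%} of species have at least one protein record")
```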


# Q8: 3 POINTS: Find the AGI codes and gene names for all Arabidopsis thaliana proteins that have a protein function annotation description that mentions “pattern formation”

To retrieve the gene name and the AGI code, I used the object property up:encodedBy, which links a protein to the gene that encodes it. The AGI code comes from the up:locusName property of the up:Gene class, and the gene name from skos:prefLabel. Finally, I used FILTER with contains to keep only the proteins whose function annotation description contains the words "pattern formation".

In [15]:
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?agi_code ?gene_names
WHERE
{
    ?protein rdf:type up:Protein ;
        up:encodedBy ?genes ;
        up:annotation ?annotation ;
        up:organism ?taxon .

    ?taxon rdf:type up:Taxon ;
        up:scientificName "Arabidopsis thaliana" .

    ?genes rdf:type up:Gene ;
        up:locusName ?agi_code ;
        skos:prefLabel ?gene_names .

    ?annotation rdf:type up:Function_Annotation ;
        rdfs:comment ?description .

    FILTER (contains(?description, "pattern formation")) .

}


agi_code,gene_names
At3g54220,SCR
At4g21750,ATML1
At1g13980,GN
At5g40260,SWEET8
At1g69670,CUL3B
At1g63700,YDA
At2g46710,ROPGAP3
At1g26830,CUL3A
At3g09090,DEX1
At4g37650,SHR
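
As a quick sanity check that up:locusName returned AGI locus codes rather than gene symbols, the rows above can be matched against the usual AGI pattern: `At`, a chromosome identifier (1 to 5, C or M), `g`, and five digits. The regular expression is an assumption based on that naming convention, not something taken from UniProt.

```python
import re

# Assumed AGI locus pattern, e.g. At3g54220 (case-insensitive).
AGI_CODE = re.compile(r"^At[1-5CM]g\d{5}$", re.IGNORECASE)

# Rows copied from the query output above: (agi_code, gene_name).
results = [
    ("At3g54220", "SCR"), ("At4g21750", "ATML1"), ("At1g13980", "GN"),
    ("At5g40260", "SWEET8"), ("At1g69670", "CUL3B"), ("At1g63700", "YDA"),
    ("At2g46710", "ROPGAP3"), ("At1g26830", "CUL3A"), ("At3g09090", "DEX1"),
    ("At4g37650", "SHR"),
]

# Collect any code that does not look like an AGI locus identifier.
invalid = [code for code, _name in results if not AGI_CODE.match(code)]
print(invalid)
```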


# Q9: 4 POINTS: What is the MetaNetX Reaction identifier (starts with “mnxr”) for the UniProt Protein uniprotkb:Q18A79?

This query is based on example query number 12 from the MetaNetX SPARQL web page.


In [4]:
#From the MetaNetX metabolic networks for metagenomics database SPARQL: 
%endpoint https://rdf.metanetx.org/sparql
%format JSON 

In [5]:
PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX up: <http://purl.uniprot.org/uniprot/>

SELECT DISTINCT ?identifier_mnxr
WHERE {
    
    ?protein rdf:type mnx:PEPT .
    ?protein mnx:peptXref up:Q18A79 .

    ?catalyst mnx:pept ?protein .

    ?gene_prot_r mnx:cata ?catalyst ;
         mnx:reac ?reaction .

    ?reaction rdfs:label ?identifier_mnxr .
}


identifier_mnxr
mnxr165934
mnxr145046c3
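
The query walks a chain of links: peptide cross-reference, catalyst, gene-protein-reaction node, reaction, label. The toy example below traces the same chain over a handful of in-memory triples to make the direction of each link explicit. The intermediate node names (`pept1`, `cata1`, `gpr1`, `reac1`) are invented for illustration; only Q18A79 and mnxr165934 come from the notebook, and the real MetaNetX graph is of course much larger.

```python
# Toy triples (subject, predicate, object); intermediate names are made up.
TRIPLES = [
    ("pept1", "mnx:peptXref", "up:Q18A79"),
    ("cata1", "mnx:pept", "pept1"),
    ("gpr1", "mnx:cata", "cata1"),
    ("gpr1", "mnx:reac", "reac1"),
    ("reac1", "rdfs:label", "mnxr165934"),
]


def subjects_of(predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the graph."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


def objects_of(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def reaction_labels(protein):
    """Follow peptXref -> pept -> cata -> reac -> label, as the query does."""
    labels = set()
    for pept in subjects_of("mnx:peptXref", protein):
        for cata in subjects_of("mnx:pept", pept):
            for gpr in subjects_of("mnx:cata", cata):
                for reac in objects_of(gpr, "mnx:reac"):
                    labels.update(objects_of(reac, "rdfs:label"))
    return sorted(labels)


print(reaction_labels("up:Q18A79"))
```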


# Q10: 5 POINTS:  What is the official Gene ID (UniProt calls this a “mnemonic”) and the MetaNetX Reaction identifier (mnxr…..) for the protein that has “Starch synthase” catalytic activity in Clostridium difficile (taxon 272563).

This answer is based on example query nº 23 from the UniProt SPARQL web page and on the previous query. To find the protein with "Starch synthase" catalytic activity, I looked up the GO term for starch synthase (GO:0009011) and selected the proteins classified with it.



In [7]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON 

In [9]:
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX GO: <http://purl.obolibrary.org/obo/GO_>
PREFIX mnx: <https://rdf.metanetx.org/schema/>

SELECT ?gene_id ?identifier_mnxr

WHERE
{
  ?protein rdf:type up:Protein ;
    up:organism  taxon:272563 ;
    up:mnemonic ?gene_id ;
    up:classifiedWith GO:0009011 .
   
  SERVICE <https://rdf.metanetx.org/sparql> {
      ?prot2 rdf:type mnx:PEPT ;
         mnx:peptXref ?protein .
      
      ?catalyst mnx:pept ?prot2 .

      ?gene_prot_r mnx:cata ?catalyst ;
         mnx:reac ?reaction .
      ?reaction rdfs:label ?identifier_mnxr .
 }

} GROUP BY ?gene_id ?identifier_mnxr


gene_id,identifier_mnxr
GLGA_CLOD6,mnxr145046c3
GLGA_CLOD6,mnxr165934
