# Assignment 5 SPARQL queries

I would like you to create the SPARQL query that will answer each of these questions.  Please submit the queries as a Jupyter notebook with the SPARQL kernel activated.  NO programming is required! Submit to GitHub as usual, WITH THE ANSWERS STILL VISIBLE IN THE NOTEBOOK.   Thanks!

For many of these you will need to look up how to use the SPARQL functions ‘COUNT’ and ‘DISTINCT’ (we used ‘distinct’ in class), and probably a few others...


*UniProt SPARQL Endpoint:  http://sparql.uniprot.org/sparql*


----------------------------------------------------------------

### Q1: How many protein records are in UniProt? 

In [None]:
PREFIX up:<http://purl.uniprot.org/core/> 

SELECT (STR(COUNT(?prot)) AS ?prot_number) # Count every protein record and return the total as a string named "prot_number"
WHERE{ 
        ?prot a up:Protein
}

Result: prot_number
"360157660"xsd:string

### Q2: How many Arabidopsis thaliana protein records are in UniProt? 


In [None]:
PREFIX taxon:<http://purl.uniprot.org/taxonomy/> 
PREFIX up:<http://purl.uniprot.org/core/> 

SELECT (COUNT(DISTINCT ?prot) AS ?count_arabidopsis_protein) #Count every protein (with no repetition) and save the result as "count_arabidopsis_protein"
WHERE
{
        ?prot a up:Protein ; # Keep only protein records
        up:organism taxon:3702 . # Keep only organism = Arabidopsis thaliana (taxon ID = 3702)
}

Result: count_arabidopsis_protein
"136782"xsd:int

### Q3: Retrieve pictures of Arabidopsis thaliana from UniProt? 

In [None]:
PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX foaf:<http://xmlns.com/foaf/0.1/>
SELECT DISTINCT (?image AS ?photo) # Show the images of every taxon whose scientific name matches Arabidopsis thaliana
WHERE {
       ?taxon    foaf:depiction  ?image ; # Select the image
           up:scientificName   ?name .
       FILTER(CONTAINS(?name, "Arabidopsis thaliana")) # Keep only the names which contain "Arabidopsis thaliana"
}

The results of this query were downloaded in CSV format; see result_query_3.csv.
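
Because the name filter in the query for Q3 matches any taxon whose scientific name contains the string, it may also return related taxa such as hybrids. A minimal alternative sketch asks for the taxon directly. It assumes that taxon:3702 (the Arabidopsis thaliana ID used in Q2) carries its own foaf:depiction triples, which I have not verified here:

```sparql
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Assumption: the images are attached directly to the taxon:3702 node
SELECT DISTINCT ?photo
WHERE {
    taxon:3702 foaf:depiction ?photo .
}
```

If this returns fewer images than the name filter, the extra rows from the filter belong to other taxa whose names contain "Arabidopsis thaliana".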

### Q4: What is the description of the enzyme activity of UniProt Protein Q9SZZ8 

In [None]:
PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX uniprot:<http://purl.uniprot.org/uniprot/> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
PREFIX label:<http://www.w3.org/2004/02/skos/core#>

SELECT ?activity_description
WHERE {
  uniprot:Q9SZZ8 up:enzyme ?enzyme . # Look up the enzyme linked to protein Q9SZZ8
  ?enzyme up:activity ?activity .
  ?activity rdfs:label ?activity_description # Take only the activity description (its label)
}

Result: activity_description
"Beta-carotene + 4 reduced ferredoxin [iron-sulfur] cluster + 2 H(+) + 2 O(2) = zeaxanthin + 4 oxidized ferredoxin [iron-sulfur] cluster + 2 H(2)O."xsd:string

### Q5: Retrieve the proteins ids, and date of submission, for proteins that have been added to UniProt this year   (HINT Google for “SPARQL FILTER by date”)

In [None]:
PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX uniprot:<http://purl.uniprot.org/uniprot/> 

SELECT ?id ?date 
WHERE
{
  ?protein a up:Protein ; # Select ids and date of creation of the entry (submission date)
           up:mnemonic ?id ;
           up:created ?date .

  FILTER (contains(STR(?date), "2021")) # Keep only the entries whose creation date contains "2021" (the current year)
}

The results of this query were extremely long and could not be downloaded. Instead, I took a screenshot of the first page of results (result_query_5.png).
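
The string match on "2021" works, but it converts every date to text first. Here is a sketch of the same filter as a typed date comparison. It assumes that up:created values are xsd:date literals and that the endpoint can compare them (common in practice, though the SPARQL 1.1 operator table only guarantees it for xsd:dateTime). The LIMIT is an addition, there only to keep the result small enough to download:

```sparql
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?id ?date
WHERE {
    ?protein a up:Protein ;
             up:mnemonic ?id ;
             up:created ?date .
    # Assumption: ?date is an xsd:date, so it can be compared with typed literals
    FILTER (?date >= "2021-01-01"^^xsd:date && ?date < "2022-01-01"^^xsd:date)
}
LIMIT 100   # arbitrary page size, chosen only for illustration
```

Removing the LIMIT should give the same rows as the string-based filter, provided the assumptions about the date type hold.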

### Q6: How  many species are in the UniProt taxonomy?

In [None]:
PREFIX taxon:<http://purl.uniprot.org/taxonomy/> 
PREFIX up:<http://purl.uniprot.org/core/> 

SELECT (COUNT(DISTINCT ?taxon) AS ?different_species) # Count different species and save as "different_species"
WHERE
{
  ?taxon a up:Taxon ;
           up:rank up:Species . # Keep only the taxa whose rank is species
}

# Reference: https://www.vedantu.com/question-answer/differentiate-between-species-and-taxon-class-11-biology-cbse-5fcef165bd52b90ffc9d9e9d
# Difference between taxon and species:
# A taxon is a group of one or more populations of an organism at any level of the taxonomic hierarchy. A species is a group of individuals with the same morphological characters that can breed among themselves. Species is therefore only one rank of taxon, which is why the query filters on up:rank up:Species.


Result: different_species
"2029846"xsd:int

### Q7: How many species have at least one protein record? 

In [None]:
PREFIX uniprot: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX up:<http://purl.uniprot.org/core/> 

SELECT (COUNT (DISTINCT ?protein_organism) AS ?num_protein_records) # Count, without repetition, the species that have protein records
WHERE
{
    ?protein a up:Protein ; # Select protein records ("a" is shorthand for rdf:type, so no rdf: prefix is needed)
            up:organism ?protein_organism . # Select their organisms
    ?protein_organism up:rank up:Species . # Keep only the organisms whose rank is species
}

Result: 
num_protein_records
"1057158"xsd:int

### Q8: Find the AGI codes and gene names for all Arabidopsis thaliana  proteins that have a protein function annotation description that mentions “pattern formation”

In [None]:
PREFIX skos:<http://www.w3.org/2004/02/skos/core#> 
PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX taxon:<http://purl.uniprot.org/taxonomy/> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 

SELECT ?agi_code ?name
WHERE{ 
  ?protein a up:Protein ;              # Search for proteins
           up:organism taxon:3702 ;    # from Arabidopsis thaliana
           up:annotation ?annotation ; # Get the annotations (filtered below)
           up:encodedBy ?gene .        # Select the gene

  ?gene up:locusName ?agi_code ;
        skos:prefLabel ?name . # Select the AGI code and the name of the gene

  ?annotation a up:Function_Annotation ; # Keep only function annotations, which carry the descriptions
      rdfs:comment ?annot_comment .

  FILTER CONTAINS(str(?annot_comment), 'pattern formation') . # Keep only the descriptions that mention "pattern formation"
}

The results of this query were downloaded in CSV format; see result_query_8.csv.

# MetaNetX
From the MetaNetX metabolic networks for metagenomics database SPARQL Endpoint: https://rdf.metanetx.org/sparql

(this slide deck will make it much easier for you!  https://www.metanetx.org/cgi-bin/mnxget/mnxref/MetaNetX_RDF_schema.pdf)

### Q9: What is the MetaNetX Reaction identifier (starts with “mnxr”) for the UniProt Protein uniprotkb:Q18A79

In [None]:
PREFIX uniprot: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX mnx:<https://rdf.metanetx.org/schema/>

SELECT DISTINCT ?MetaNetX_reaction_identifiers

where {
        ?polypeptide a mnx:PEPT ; # Select an mnx:PEPT node (a gene or gene product)
            mnx:peptRefer uniprotkb:Q18A79 . # MetaNetX stores the corresponding UniProt identifier, when available, in mnx:peptRefer
                                             # Here that identifier is Q18A79
        ?catalyst a mnx:CATA ;      # Once we have the polypeptide, we look for the catalyst (mnx:CATA) that contains it
            mnx:pept ?polypeptide . # An mnx:CATA node is joined to its mnx:PEPT node by mnx:pept
        ?gpr a mnx:GPR ; # An mnx:GPR node (Gene-Protein-Reaction) is a particular reaction with zero, one, or several catalysts, in the context of a particular GSMN
            mnx:cata ?catalyst ;      # So we keep the GPRs that use the catalyst selected above
            mnx:reac ?reactions .     # and select their reaction through mnx:reac
        ?reactions a mnx:REAC ;        # Once we know the reaction, we need to find its identifier
            rdfs:label  ?MetaNetX_reaction_identifiers . # One option is to read its label

}

Result:

| MetaNetX_reaction_identifiers | 
| -- | 
| mnxr165934 | 
| mnxr145046c3 | 


# FEDERATED QUERY - UniProt and MetaNetX

### Q10: What is the official Gene ID (UniProt calls this a “mnemonic”) and the MetaNetX Reaction identifier (mnxr…..) for the protein that has “Starch synthase” catalytic activity in Clostridium difficile (taxon 272563).

In [None]:
PREFIX uniprot: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX mnx:<https://rdf.metanetx.org/schema/>

# This query needs two services (http://sparql.uniprot.org/sparql and https://rdf.metanetx.org/sparql)
# I followed the federated-query syntax described in these references:
# References (https://stackoverflow.com/questions/35260421/sparql-query-using-multiple-datasources
#             and https://www.w3.org/TR/sparql11-federated-query/)

SELECT DISTINCT ?gene_id ?MetaNetX_reaction_identifiers # Variables we need to show as result of the different queries
WHERE {
    SERVICE <http://sparql.uniprot.org/sparql> { # First service (uniprot)
         select distinct ?prot ?gene_id         # Keep ?prot (used to join with the MetaNetX block below) and the gene ID shown in the result
         where {
            ?prot a uniprot:Protein ;           # Select proteins
                uniprot:organism taxon:272563 ; # from the given taxon
                uniprot:enzyme ?enz ;           # Get the enzyme of the protein, to check it is starch synthase
                uniprot:mnemonic ?gene_id .     # Store the gene ID

            ?enz rdfs:comment ?name .           # Look for the enzyme starch synthase
            FILTER regex(?name, "starch synthase") 
         }
    }
     SERVICE <https://rdf.metanetx.org/sparql> {
         OPTIONAL {     # The SPARQL federation spec does not require OPTIONAL here; it is kept because this is the form that produced the result below
            # Reusing code from query 9
             ?polypeptide a mnx:PEPT ;
                mnx:peptRefer ?prot . # Here ?prot is the UniProt protein found in the first block
             ?catalyst a mnx:CATA ;
                mnx:pept ?polypeptide .
             ?gpr a mnx:GPR ;
                mnx:cata ?catalyst ;
                mnx:reac ?reactions .
             ?reactions a mnx:REAC ;
                rdfs:label  ?MetaNetX_reaction_identifiers.
            }
     }
} 


Result:

| gene_id | MetaNetX_reaction_identifiers | 
| -- | -- |
| "GLGA_CLOD6" | "mnxr165934" |
| "GLGA_CLOD6" | "mnxr145046c3" |
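
When a federated query returns nothing, it can help to run each half on its own endpoint first. Here is a sketch of the UniProt half of Q10 as a standalone query, reusing the same patterns. The `"i"` flag (a standard SPARQL regex option) is an addition that makes the match case-insensitive, in case the enzyme comment is capitalized differently:

```sparql
PREFIX uniprot: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Run against the UniProt endpoint only; no SERVICE clause is needed here
SELECT DISTINCT ?prot ?gene_id
WHERE {
    ?prot a uniprot:Protein ;
          uniprot:organism taxon:272563 ;
          uniprot:enzyme ?enz ;
          uniprot:mnemonic ?gene_id .
    ?enz rdfs:comment ?name .
    FILTER regex(?name, "starch synthase", "i")   # "i" = ignore case (added for illustration)
}
```

If this returns the expected protein, the remaining problem is likely in the MetaNetX block or in the join on ?prot.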