Assignment 5 - SPARQL queries

Student: Álvaro García López

In [1]:
%endpoint https://sparql.uniprot.org/sparql
# Set the endpoint to UniProtKB, the database that the following queries run against

%format JSON
# Return the results as JSON

Q1: 1 POINT  How many protein records are in UniProt? 

In [2]:
PREFIX up: <http://purl.uniprot.org/core/>
# PREFIX declarations shorten the IRIs used in the WHERE clause ("syntactic sugar")

SELECT (COUNT (?protein) AS ?protein_records) 
# COUNT returns the number of protein records in the UniProtKB database

# The only restriction is that the resource must be a UniProt protein record,
# so we match every resource whose rdf:type ("a") is up:Protein
WHERE {
    ?protein a up:Protein .
}

protein_records
360157660
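
Because the kernel was asked for JSON, a count like the one above reaches the client as a SPARQL JSON results document (a `head.vars` list plus `results.bindings`). The sketch below, in Python and outside the SPARQL kernel, shows one way to flatten such a document into rows. The sample text is hand-written in the W3C layout rather than captured from the endpoint (the datatype shown is an assumption), and `rows_from_sparql_json` is a hypothetical helper name.

```python
import json

# Hand-written sample in the W3C SPARQL 1.1 JSON results layout.
# It is NOT a captured response; the datatype IRI is an assumption.
SAMPLE = """
{
  "head": {"vars": ["protein_records"]},
  "results": {"bindings": [
    {"protein_records": {
      "type": "literal",
      "datatype": "http://www.w3.org/2001/XMLSchema#integer",
      "value": "360157660"}}
  ]}
}
"""


def rows_from_sparql_json(text):
    """Return one dict per result row, mapping variable name to its string value."""
    doc = json.loads(text)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # Unbound variables are absent from a binding, so .get() yields None for them.
        rows.append({v: binding.get(v, {}).get("value") for v in variables})
    return rows


rows = rows_from_sparql_json(SAMPLE)
print(int(rows[0]["protein_records"]))
```

Values arrive as strings regardless of datatype, so numeric results such as this count need an explicit conversion on the client side.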


Q2: 1 POINT How many Arabidopsis thaliana protein records are in UniProt? 

In [3]:
PREFIX up: <http://purl.uniprot.org/core/>

SELECT (COUNT (?protein) AS ?Arabidopsis_thaliana_records)

# Same idea, but also requiring that the protein's organism has the scientific name "Arabidopsis thaliana"
WHERE {
    ?protein a up:Protein .
    ?protein up:organism ?specie .
    ?specie up:scientificName "Arabidopsis thaliana" .
}

Arabidopsis_thaliana_records
136782


Q3: 1 POINT Retrieve pictures of Arabidopsis thaliana from UniProt

In [4]:
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?pictures

# Ask for the pictures (foaf:depiction) associated with the species "Arabidopsis thaliana"
WHERE {
  ?specie up:scientificName "Arabidopsis thaliana" .
  ?specie foaf:depiction ?pictures .
}

pictures
https://upload.wikimedia.org/wikipedia/commons/3/39/Arabidopsis.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Arabidopsis_thaliana_inflorescencias.jpg/800px-Arabidopsis_thaliana_inflorescencias.jpg


Q4: 1 POINT:  What is the description of the enzyme activity of UniProt Protein Q9SZZ8 

In [5]:
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX uniprotkb:<http://purl.uniprot.org/uniprot/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 

SELECT ?reaction

# Follow UniProt entry Q9SZZ8 to its enzyme (up:enzyme; bound to ?protein below), then to that
# enzyme's activity, and return the label of the activity (the reaction catalyzed)
WHERE { 
    uniprotkb:Q9SZZ8 up:enzyme ?protein .
    ?protein up:activity ?activity .
    ?activity rdfs:label ?reaction .
}

reaction
Beta-carotene + 4 reduced ferredoxin [iron-sulfur] cluster + 2 H(+) + 2 O(2) = zeaxanthin + 4 oxidized ferredoxin [iron-sulfur] cluster + 2 H(2)O.


Q5: 1 POINT:  Retrieve the proteins ids, and date of submission, for proteins that have been added to UniProt this year   (HINT Google for “SPARQL FILTER by date”)

In [6]:
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?protein ?creation_date

# Keep proteins whose creation date falls between 2021-01-01 and 2021-12-31 (both inclusive);
# LIMIT 5 returns only a small sample of the matches
WHERE {
    ?protein a up:Protein .
    ?protein up:created ?creation_date .
    FILTER (?creation_date >= xsd:date("2021-01-01") && ?creation_date <= xsd:date("2021-12-31")) .
} LIMIT 5

protein,creation_date
http://purl.uniprot.org/uniprot/A0A1H7ADE3,2021-06-02
http://purl.uniprot.org/uniprot/A0A1V1AIL4,2021-06-02
http://purl.uniprot.org/uniprot/A0A2Z0L603,2021-06-02
http://purl.uniprot.org/uniprot/A0A4J5GG53,2021-04-07
http://purl.uniprot.org/uniprot/A0A6G8SU52,2021-02-10
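
The query above hard-codes 2021, so it has to be edited by hand every year. As a minimal sketch, assuming the query text is assembled in Python before being sent to the endpoint, the date bounds can be generated for any year. It uses typed literals (`"..."^^xsd:date`), which should be equivalent to the `xsd:date(...)` casts in the cell above. `year_filter` and `build_query` are hypothetical helper names, and the code only builds the query string without contacting UniProt.

```python
from datetime import date


def year_filter(variable, year):
    """Build a SPARQL FILTER keeping xsd:date values that fall within `year`."""
    start = date(year, 1, 1).isoformat()
    end = date(year, 12, 31).isoformat()
    return (
        f'FILTER (?{variable} >= "{start}"^^xsd:date '
        f'&& ?{variable} <= "{end}"^^xsd:date)'
    )


def build_query(year, limit=5):
    """Return the Q5 query text for the given year (same shape as the cell above)."""
    date_filter = year_filter("creation_date", year)
    return f"""PREFIX up: <http://purl.uniprot.org/core/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?protein ?creation_date
WHERE {{
    ?protein a up:Protein .
    ?protein up:created ?creation_date .
    {date_filter}
}} LIMIT {limit}"""


# Build the query for whatever the current year is at run time.
print(build_query(date.today().year))
```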


Q6: 1 POINT How  many species are in the UniProt taxonomy?

In [7]:
PREFIX up: <http://purl.uniprot.org/core/>

SELECT (COUNT (?specie) AS ?number_of_species)

# Count every resource whose rdf:type ("a") is up:Taxon
# Note: up:Taxon covers nodes of every rank, so this counts all taxa, not only those of species rank
WHERE {
  ?specie a up:Taxon .
}

number_of_species
2848758


Q7: 2 POINT  How many species have at least one protein record? (this might take a long time to execute, so do this one last!)

In [11]:
PREFIX up: <http://purl.uniprot.org/core/>

SELECT (COUNT (DISTINCT ?specie) AS ?count)

WHERE {
  ?protein a up:Protein .
  ?protein up:organism ?specie .
  ?specie a up:Taxon .
  ?specie up:rank up:Species .
}

count
0


Q8: 3 points:  find the AGI codes and gene names for all Arabidopsis thaliana  proteins that have a protein function annotation description that mentions “pattern formation”

In [8]:
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT ?agi ?gene_label

WHERE {    
    ?protein a up:Protein . # all proteins
    ?protein up:organism ?specie . # get the organism (species)
    ?protein up:encodedBy ?gene . # get the encoding gene
    ?protein up:annotation ?f_annotation . # get the annotations (any type; not restricted to up:Function_Annotation)

    ?specie up:scientificName "Arabidopsis thaliana" .
    # keep only the proteins whose organism has the scientific name "Arabidopsis thaliana"

    ?f_annotation rdfs:comment ?f_annotation_desc . # get the annotation description
    FILTER REGEX (?f_annotation_desc, "pattern formation", "i") .
    # keep only the descriptions that match the regular expression "pattern formation" ("i" = case-insensitive)

    ?gene up:locusName ?agi . # get the AGI code of the gene
    ?gene skos:prefLabel ?gene_label . # get the gene label (name)
}

agi,gene_label
At3g54220,SCR
At4g21750,ATML1
At1g13980,GN
At5g40260,SWEET8
At1g69670,CUL3B
At1g63700,YDA
At2g46710,ROPGAP3
At1g26830,CUL3A
At1g55325,MED13
At3g09090,DEX1
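
The `"i"` flag in `FILTER REGEX` is what lets the query match the phrase regardless of capitalization. A rough Python analogue, outside the SPARQL kernel, is sketched below. The annotation texts are invented for illustration and are not UniProt data, `mentions` is a hypothetical helper name, and Python's `re` dialect is only assumed to behave like SPARQL's regex for a plain phrase such as this one.

```python
import re


def mentions(description, phrase="pattern formation"):
    """Unanchored, case-insensitive match, similar in spirit to REGEX(?text, phrase, "i")."""
    return re.search(phrase, description, flags=re.IGNORECASE) is not None


# Hypothetical annotation texts, written for illustration only.
samples = [
    "Involved in radial Pattern Formation in the root.",
    "Sugar transporter.",
]

for text in samples:
    print(mentions(text), text)
```

As in the SPARQL filter, the match is not anchored, so the phrase may appear anywhere in the description.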


Q9: 4 POINTS:  what is the MetaNetX Reaction identifier (starts with “mnxr”) for the UniProt Protein uniprotkb:Q18A79

In [9]:
# Switch to the MetaNetX endpoint for the following queries
%endpoint https://rdf.metanetx.org/sparql

In [10]:
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?mnxr_label

WHERE {
    ?mnx_pept mnx:peptXref uniprotkb:Q18A79 . # get the mnx_pept of the UniProtKB protein Q18A79
    ?mnx_cata mnx:pept ?mnx_pept . # get the mnx_cata from the previous mnx_pept
    ?mnx_gpr mnx:cata ?mnx_cata . # get mnx_gpr using mnx_cata
    ?mnx_gpr mnx:reac ?mnx_reac . # get mnx_reac
    ?mnx_reac rdfs:label ?mnxr_label . # get the label of that mnx_reac (mnxr_label)
}

mnxr_label
mnxr165934
mnxr145046c3
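
The query above is a chain of joins: from the UniProt cross-reference to a peptide, then to a catalyst, then to a GPR node, then to a reaction and its label. The Python sketch below, run outside the SPARQL kernel, walks the same chain over a tiny in-memory list of triples to make the direction of each step explicit. The node names (`pept1`, `cata1`, `gpr1`, `reac1`) are invented placeholders and do not come from MetaNetX; only the predicate names and the shape of the chain come from the query.

```python
# Hypothetical triples: the node names are invented placeholders.
# Only the predicates and the chain shape are taken from the query above.
TRIPLES = [
    ("pept1", "mnx:peptXref", "uniprotkb:Q18A79"),
    ("cata1", "mnx:pept", "pept1"),
    ("gpr1", "mnx:cata", "cata1"),
    ("gpr1", "mnx:reac", "reac1"),
    ("reac1", "rdfs:label", "mnxr165934"),
]


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is a triple."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is a triple."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def reaction_labels(protein):
    """Follow peptXref -> pept -> cata -> reac -> label, as the WHERE clause does."""
    labels = set()  # a set mirrors SELECT DISTINCT
    for pept in subjects("mnx:peptXref", protein):
        for cata in subjects("mnx:pept", pept):
            for gpr in subjects("mnx:cata", cata):
                for reac in objects(gpr, "mnx:reac"):
                    labels.update(objects(reac, "rdfs:label"))
    return sorted(labels)


print(reaction_labels("uniprotkb:Q18A79"))
```

The first three steps look up the subject of a triple from its object, while the last two go from subject to object, which is why the query variables alternate between the left and right side of each pattern.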


Q10: 5 POINTS:  What is the official Gene ID (UniProt calls this a “mnemonic”) and the MetaNetX Reaction identifier (mnxr…..) for the protein that has “Starch synthase” catalytic activity in Clostridium difficile (taxon 272563).