# SPARQL queries

This notebook demonstrates how SPARQL queries can be composed programmatically, with (almost) no knowledge of SPARQL. The examples use an existing dataset and pre-prepared ontology networks. See `tools4rdf/network/ontology.py` for how to read in ontologies and create networks from them.

In [1]:
from tools4rdf.network.ontology import read_ontology
from rdflib import Graph

In [2]:
onto = read_ontology()

In [3]:
#kg = KnowledgeGraph.unarchive('dataset.tar.gz')
kg = Graph()
kg.parse("dataset/triples", format="turtle")

<Graph identifier=Na4730e087d2641698162513edb40c7eb (<class 'rdflib.graph.Graph'>)>

SPARQL queries can, of course, be run directly. For example:

In [4]:
query = """
PREFIX cmso: <http://purls.helmholtz-metadaten.de/cmso/>
SELECT DISTINCT ?sample ?symbol ?number 
WHERE {
    ?sample cmso:hasMaterial ?material .
    ?material cmso:hasStructure ?structure .
    ?structure cmso:hasSpaceGroupSymbol ?symbol .
    ?sample cmso:hasNumberOfAtoms ?number .
FILTER (?number="4"^^xsd:integer)
}"""

The above query finds the space group symbol of every sample that has four atoms.
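To make the meaning of the graph pattern concrete, the sketch below evaluates the same chain of triple patterns by hand over a small list of triples. The triples and identifiers are made up for illustration and are not taken from the dataset; a real SPARQL engine does this matching for you.

```python
# Made-up triples standing in for the dataset; identifiers are illustrative only.
triples = [
    ("sample:a", "cmso:hasMaterial", "material:a"),
    ("material:a", "cmso:hasStructure", "structure:a"),
    ("structure:a", "cmso:hasSpaceGroupSymbol", "Fm-3m"),
    ("sample:a", "cmso:hasNumberOfAtoms", 4),
    ("sample:b", "cmso:hasMaterial", "material:b"),
    ("material:b", "cmso:hasStructure", "structure:b"),
    ("structure:b", "cmso:hasSpaceGroupSymbol", "Im-3m"),
    ("sample:b", "cmso:hasNumberOfAtoms", 2),
]


def objects(subject, predicate):
    """Return every object o such that (subject, predicate, o) is a triple."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def four_atom_space_groups():
    """Follow sample -> material -> structure -> symbol, keeping 4-atom samples."""
    rows = set()  # a set mirrors SELECT DISTINCT
    samples = {s for s, p, _ in triples if p == "cmso:hasMaterial"}
    for sample in samples:
        for number in objects(sample, "cmso:hasNumberOfAtoms"):
            if number != 4:  # plays the role of the FILTER clause
                continue
            for material in objects(sample, "cmso:hasMaterial"):
                for structure in objects(material, "cmso:hasStructure"):
                    for symbol in objects(structure, "cmso:hasSpaceGroupSymbol"):
                        rows.add((sample, symbol, number))
    return sorted(rows)


print(four_atom_space_groups())
```

Only `sample:a` survives, because `sample:b` has two atoms and is rejected by the filter step.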

In [5]:
results = kg.query(query)

In [6]:
for row in results:
    print(row)

(rdflib.term.URIRef('sample:10ffd2cc-9e92-4f04-896d-d6c0fdb9e55f'), rdflib.term.Literal('Pm-3m', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string')), rdflib.term.Literal('4', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')))
(rdflib.term.URIRef('sample:1f6b1b0f-446a-4ad8-877e-d2e6176797df'), rdflib.term.Literal('Fm-3m', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string')), rdflib.term.Literal('4', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')))
(rdflib.term.URIRef('sample:286c3974-962b-4333-a2bb-d164ae645454'), rdflib.term.Literal('Fm-3m', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string')), rdflib.term.Literal('4', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')))
(rdflib.term.URIRef('sample:67be61c7-f9c7-4d46-a61d-5350fd0ee246'), rdflib.term.Literal('Fm-3m', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string')), rdflib.term.Literal('4'

The same query can also be composed programmatically, as shown in the next cell.

Tab completion on `onto.terms` helps to find ontology terms.

In [7]:
query = onto.create_query(onto.terms.cmso.AtomicScaleSample, [onto.terms.cmso.hasSpaceGroupSymbol, onto.terms.cmso.hasNumberOfAtoms==4])

In [8]:
print(query)

PREFIX cmso: <http://purls.helmholtz-metadaten.de/cmso/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?AtomicScaleSample ?hasSpaceGroupSymbolvalue ?hasNumberOfAtomsvalue
WHERE {
    ?AtomicScaleSample cmso:hasMaterial ?cmso_Material .
    ?cmso_Material cmso:hasStructure ?cmso_CrystalStructure .
    ?cmso_CrystalStructure cmso:hasSpaceGroupSymbol ?hasSpaceGroupSymbolvalue .
    ?AtomicScaleSample cmso:hasNumberOfAtoms ?hasNumberOfAtomsvalue .
    ?AtomicScaleSample rdf:type cmso:AtomicScaleSample .
FILTER (?hasNumberOfAtomsvalue="4"^^xsd:integer)
}


The generated query can now be executed:

In [9]:
results = kg.query(query)
for row in results:
    print(row)

(rdflib.term.URIRef('sample:10ffd2cc-9e92-4f04-896d-d6c0fdb9e55f'), rdflib.term.Literal('Pm-3m', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string')), rdflib.term.Literal('4', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')))
(rdflib.term.URIRef('sample:286c3974-962b-4333-a2bb-d164ae645454'), rdflib.term.Literal('Fm-3m', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string')), rdflib.term.Literal('4', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')))
(rdflib.term.URIRef('sample:8fc8e47b-acee-40f8-bcbf-fc298cc31f05'), rdflib.term.Literal('Fm-3m', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string')), rdflib.term.Literal('4', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')))
(rdflib.term.URIRef('sample:9f0f48d1-5ebf-4f7a-b241-5e7aa273f5a0'), rdflib.term.Literal('Fm-3m', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string')), rdflib.term.Literal('4'

`OntologyNetwork` also has a `query` method, which returns a pandas DataFrame for convenience:

In [10]:
onto.query(kg, onto.terms.cmso.AtomicScaleSample, [onto.terms.cmso.hasSpaceGroupSymbol, onto.terms.cmso.hasNumberOfAtoms==4])

| | AtomicScaleSample | hasSpaceGroupSymbolvalue | hasNumberOfAtomsvalue |
|---|---|---|---|
| 0 | sample:10ffd2cc-9e92-4f04-896d-d6c0fdb9e55f | Pm-3m | 4 |
| 1 | sample:286c3974-962b-4333-a2bb-d164ae645454 | Fm-3m | 4 |
| 2 | sample:8fc8e47b-acee-40f8-bcbf-fc298cc31f05 | Fm-3m | 4 |
| 3 | sample:9f0f48d1-5ebf-4f7a-b241-5e7aa273f5a0 | Fm-3m | 4 |
| 4 | sample:e54c0e91-52ec-4c47-8ba3-43979a1ebe2e | Fm-3m | 4 |
| 5 | sample:1f6b1b0f-446a-4ad8-877e-d2e6176797df | Fm-3m | 4 |
| 6 | sample:67be61c7-f9c7-4d46-a61d-5350fd0ee246 | Fm-3m | 4 |
| 7 | sample:721b7447-8363-4e65-9515-9da2581d7124 | Fm-3m | 4 |
| 8 | sample:a3cf6d97-c922-4c4d-8517-e784df83b71e | Fm-3m | 4 |
| 9 | sample:ab2bea57-39ea-49ea-ad3f-c1c40b013154 | Fm-3m | 4 |


The following sections discuss how such a query is built programmatically. The function needs a source and one or more destinations. A destination can have conditions attached to it, for example on the number of atoms. The first step is to find the right terms, for which tab completion can be used.
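The idea of a source, destinations, and attached conditions can be sketched in plain Python. The function `build_query` below and its parameters are hypothetical and written only for illustration. It is not the implementation used by `onto.create_query`, which works out the connecting triples from the ontology network, whereas here they are passed in by hand.

```python
# Prefixes used in the queries shown in this notebook.
PREFIXES = {
    "cmso": "http://purls.helmholtz-metadaten.de/cmso/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def build_query(source, destinations, patterns, filters=()):
    """Assemble a SELECT query string (hypothetical helper).

    source       -- (variable name, class), e.g. ("sample", "cmso:AtomicScaleSample")
    destinations -- variable names to return in addition to the source
    patterns     -- (subject variable, predicate, object variable) triples
    filters      -- SPARQL filter expressions given as strings
    """
    lines = [f"PREFIX {name}: <{iri}>" for name, iri in PREFIXES.items()]
    variables = " ".join(f"?{v}" for v in [source[0], *destinations])
    lines.append(f"SELECT DISTINCT {variables}")
    lines.append("WHERE {")
    for s, p, o in patterns:
        lines.append(f"    ?{s} {p} ?{o} .")
    # Restrict the source variable to the requested class.
    lines.append(f"    ?{source[0]} rdf:type {source[1]} .")
    for expression in filters:
        lines.append(f"    FILTER ({expression})")
    lines.append("}")
    return "\n".join(lines)


query = build_query(
    source=("sample", "cmso:AtomicScaleSample"),
    destinations=["symbol", "number"],
    patterns=[
        ("sample", "cmso:hasMaterial", "material"),
        ("material", "cmso:hasStructure", "structure"),
        ("structure", "cmso:hasSpaceGroupSymbol", "symbol"),
        ("sample", "cmso:hasNumberOfAtoms", "number"),
    ],
    filters=['?number = "4"^^xsd:integer'],
)
print(query)
```

The resulting string has the same shape as the generated query printed earlier in this notebook: prefixes, a `SELECT DISTINCT` line, one triple pattern per step of the path, a type restriction on the source, and a `FILTER` for the condition.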

In [11]:
onto.terms

brick, csvw, dc, dcat, dcmitype, dcterms, dcam, doap, foaf, geo, odrl, org, prof, prov, qb, schema, sh, skos, sosa, ssn, time, vann, void, wgs, owl, rdf, rdfs, xsd, xml, obo, asmo, calculation, cmso, pldo, podo, qudt, ldo

These are all the ontologies whose terms are available. Each one can be explored further:

In [12]:
onto.terms.cmso

NormalVector, Length, SimulationCell, AmorphousMaterial, ComputationalSample, Basis, ChemicalElement, UnitCell, ChemicalComposition, CrystalDefect, MacroscaleSample, Microstructure, AtomicScaleSample, AtomicForce, ChemicalSpecies, SimulationCellLength, MicroscaleSample, LatticePlane, LatticeAngle, CalculatedProperty, LatticeParameter, Material, LatticeVector, NanoscaleSample, Atom, AtomAttribute, SpaceGroup, SimulationCellAngle, CrystallineMaterial, Plane, SimulationCellVector, Vector, Occupancy, Angle, CoordinationNumber, Structure, Molecule, CrystalStructure, AtomicPosition, AtomicVelocity, MesoscaleSample, hasAngle, hasBasis, hasStructure, hasUnitCell, hasLatticeParameter, hasElement, hasLength, hasDefect, hasSpecies, hasAttribute, isDefectOf, hasSpaceGroup, hasVector, hasNormalVector, hasUnit, hasSimulationCell, isCalculatedPropertyOf, hasMaterial, hasCalculatedProperty, isMaterialOf, hasLength_y, hasAtomicPercent, hasComponent_y, hasAngle_alpha, hasRepetition_x, hasReference, hasN

Individual terms can then be selected from there.

In [13]:
onto.terms.cmso.AtomicScaleSample

cmso:AtomicScaleSample
Atomic scale sample is a computational sample in the atomic length scale.

Domains and ranges can also be checked

In [14]:
onto.terms.cmso.hasSpaceGroupSymbol.domain, onto.terms.cmso.hasSpaceGroupSymbol.range

(['cmso:CrystalStructure'], ['string'])
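Domain and range information is what allows a chain of properties to be found between a source class and a destination property. The sketch below shows one way this could work, using a breadth-first search over a small property table. The table is a simplified assumption: only the domain and range of `cmso:hasSpaceGroupSymbol` are taken from the output above, and the function `find_path` is hypothetical rather than part of tools4rdf.

```python
from collections import deque

# Simplified, assumed fragment of the ontology: property -> (domain, range).
PROPERTIES = {
    "cmso:hasMaterial": ("cmso:AtomicScaleSample", "cmso:Material"),
    "cmso:hasStructure": ("cmso:Material", "cmso:CrystalStructure"),
    "cmso:hasSpaceGroupSymbol": ("cmso:CrystalStructure", "string"),
    "cmso:hasNumberOfAtoms": ("cmso:AtomicScaleSample", "integer"),
}


def find_path(source, target_property):
    """Breadth-first search for the shortest property chain ending in target_property."""
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        node, path = queue.popleft()
        for prop, (domain, range_) in PROPERTIES.items():
            if domain != node:
                continue  # this property cannot be applied to the current class
            if prop == target_property:
                return path + [prop]
            if range_ not in visited:
                visited.add(range_)
                queue.append((range_, path + [prop]))
    return None  # no chain of properties connects source and target


print(find_path("cmso:AtomicScaleSample", "cmso:hasSpaceGroupSymbol"))
```

With this table, the search returns the chain `cmso:hasMaterial`, `cmso:hasStructure`, `cmso:hasSpaceGroupSymbol`, which matches the triple patterns in the generated query shown earlier.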

Constraints can be applied through basic comparison operators.

## Basic comparison operations

The basic comparison operators `<`, `>`, `<=`, `>=`, and `==` are supported.

These operators are useful for adding conditions to the SPARQL query. When one of them is applied to a term, the resulting condition is stored in the term's condition string. No other changes are needed.

In [15]:
onto.terms.cmso

NormalVector, Length, SimulationCell, AmorphousMaterial, ComputationalSample, Basis, ChemicalElement, UnitCell, ChemicalComposition, CrystalDefect, MacroscaleSample, Microstructure, AtomicScaleSample, AtomicForce, ChemicalSpecies, SimulationCellLength, MicroscaleSample, LatticePlane, LatticeAngle, CalculatedProperty, LatticeParameter, Material, LatticeVector, NanoscaleSample, Atom, AtomAttribute, SpaceGroup, SimulationCellAngle, CrystallineMaterial, Plane, SimulationCellVector, Vector, Occupancy, Angle, CoordinationNumber, Structure, Molecule, CrystalStructure, AtomicPosition, AtomicVelocity, MesoscaleSample, hasAngle, hasBasis, hasStructure, hasUnitCell, hasLatticeParameter, hasElement, hasLength, hasDefect, hasSpecies, hasAttribute, isDefectOf, hasSpaceGroup, hasVector, hasNormalVector, hasUnit, hasSimulationCell, isCalculatedPropertyOf, hasMaterial, hasCalculatedProperty, isMaterialOf, hasLength_y, hasAtomicPercent, hasComponent_y, hasAngle_alpha, hasRepetition_x, hasReference, hasN

In [16]:
onto.terms.cmso.hasElementRatio==1.0

cmso:hasElementRatio
A data property linking a chemical element with the ratio or fraction of it in the material.
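The mechanism of storing a condition on a term can be illustrated with Python's operator overloading. The `Term` class below is a toy stand-in written for this explanation. Its attribute names and the format of the condition string are assumptions, not the tools4rdf implementation.

```python
class Term:
    """Toy stand-in for an ontology term that remembers a condition string."""

    def __init__(self, name):
        self.name = name
        self.condition = None

    def _compare(self, operator, value):
        # Quote strings and leave numbers bare.
        literal = f'"{value}"' if isinstance(value, str) else str(value)
        self.condition = f"?{self.name}value {operator} {literal}"
        return self  # hand back the term itself rather than True or False

    def __eq__(self, value):
        return self._compare("=", value)

    def __lt__(self, value):
        return self._compare("<", value)

    def __le__(self, value):
        return self._compare("<=", value)

    def __gt__(self, value):
        return self._compare(">", value)

    def __ge__(self, value):
        return self._compare(">=", value)

    # Defining __eq__ disables hashing by default, so restore it explicitly.
    __hash__ = object.__hash__


number_of_atoms = Term("hasNumberOfAtoms") == 4
print(number_of_atoms.condition)
```

In this sketch the comparison returns the term rather than a boolean, which is consistent with the cell above displaying the term after `==` is applied.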

## Logical operators

The logical operators currently supported are `&` and `|`. When applied, these operators combine the conditions of two terms.

In [17]:
(onto.terms.cmso.hasChemicalSymbol=='Al') & (onto.terms.cmso.hasElementRatio==1.0)

cmso:hasChemicalSymbol
A data property linking an element with its chemical symbol.

In [18]:
(onto.terms.cmso.hasChemicalSymbol=='Al') | (onto.terms.cmso.hasElementRatio==1.0)

cmso:hasChemicalSymbol
A data property linking an element with its chemical symbol.
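Combining conditions can be sketched in the same spirit. The toy `Term` class below is an assumption made for illustration and not the library's code. It maps `&` and `|` onto the SPARQL operators `&&` and `||`.

```python
class Term:
    """Toy term whose conditions can be combined with & and |."""

    def __init__(self, name, condition=None):
        self.name = name
        self.condition = condition

    def __eq__(self, value):
        literal = f'"{value}"' if isinstance(value, str) else str(value)
        return Term(self.name, f"?{self.name}value = {literal}")

    # Defining __eq__ disables hashing by default, so restore it explicitly.
    __hash__ = object.__hash__

    def _combine(self, other, operator):
        # The combined condition is attached to a copy of the left-hand term.
        return Term(self.name, f"({self.condition} {operator} {other.condition})")

    def __and__(self, other):
        return self._combine(other, "&&")

    def __or__(self, other):
        return self._combine(other, "||")


both = (Term("hasChemicalSymbol") == "Al") & (Term("hasElementRatio") == 1.0)
either = (Term("hasChemicalSymbol") == "Al") | (Term("hasElementRatio") == 1.0)
print(both.condition)
print(either.condition)
```

The parentheses around each comparison are required in Python, because `&` and `|` bind more tightly than `==`.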

## @ operator

The final class of operator is the `@` operator. It can be used to resolve terms that can be reached through multiple paths, for example `rdfs:label`, which many entities can have.

To specify the label for the InputParameter, it can be done like this:

In [19]:
onto.terms.rdfs.label@onto.terms.asmo.hasInputParameter

rdfs:label

Conditions can also be applied on top:

In [20]:
onto.terms.rdfs.label@onto.terms.asmo.hasInputParameter=='label_string'

rdfs:label
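In Python, the `@` operator is implemented through the `__matmul__` method. The sketch below shows how a term could record which parent it is attached to and then accept a condition. The `Term` class, its `parent` attribute, and the variable naming scheme are hypothetical and chosen only for illustration.

```python
class Term:
    """Toy term that can be attached to a parent term with the @ operator."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.condition = None

    def __matmul__(self, other):
        # Remember which term this one should be resolved through.
        self.parent = other
        return self

    @property
    def variable(self):
        # Prefix the variable with the parent name so that paths stay distinct.
        prefix = f"{self.parent.name}_" if self.parent is not None else ""
        return f"{prefix}{self.name}value"

    def __eq__(self, value):
        self.condition = f'?{self.variable} = "{value}"'
        return self

    # Defining __eq__ disables hashing by default, so restore it explicitly.
    __hash__ = object.__hash__


label = Term("label") @ Term("hasInputParameter") == "label_string"
print(label.variable)
print(label.condition)
```

No parentheses are needed around the `@` expression, because `@` binds more tightly than `==` in Python, so the term is attached to its parent before the condition is applied.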

That summarises all the available options. Now we put these building blocks together to formulate some more complex queries.