# Custom queries on LiPD objects

## Authors

[Deborah Khider](https://orcid.org/0000-0001-7501-8430)

## Preamble

`PyLiPD` is a Python package that allows you to read, manipulate, and write [LiPD](https://cp.copernicus.org/articles/12/1093/2016/cp-12-1093-2016-discussion.html#discussion) formatted datasets. In this tutorial, we will demonstrate how to build your own custom queries.

### Goals

* Learn how to build SPARQL queries

Reading Time: 5 minutes

### Keywords

LiPD, SPARQL

### Pre-requisites

* [Understand the basics of RDF and SPARQL](http://linked.earth/pylipdTutorials/graph.html)
* [Loading LiPD objects](L0_loading_lips_datasets.md)
* [Getting information from LiPD objects](L1_getting_information.md)
* [Filtering by certain criteria](L1_filtering.md)

### Relevant Packages

pylipd

## Data Description

This notebook uses the following datasets, in LiPD format:

- McCabe-Glynn, S., Johnson, K., Strong, C. et al. Variable North Pacific influence on drought in southwestern North America since AD 854. Nature Geosci 6, 617–621 (2013). https://doi.org/10.1038/ngeo1862

- Lawrence, K. T., Liu, Z. H., & Herbert, T. D. (2006). Evolution of the eastern tropical Pacific through Plio-Pleistocene glaciation. Science, 312(5770), 79-83.

- PAGES2k Consortium., Emile-Geay, J., McKay, N. et al. A global multiproxy database for temperature reconstructions of the Common Era. Sci Data 4, 170088 (2017). doi:10.1038/sdata.2017.88

## Demonstration

### Understanding the LiPD object

Let's start by importing our favorite package and loading our datasets.

In [1]:
from pylipd.lipd import LiPD

In [2]:
path = '../data/Pages2k/'

D = LiPD()
D.load_from_dir(path)

Loading 16 LiPD files


100%|████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 37.50it/s]

Loaded..





By now, you should be familiar with our `load` functionalities. But you need to understand a little bit about what's happening under the hood to truly appreciate this tutorial and why everything will function the way it does. 

- The first step is to expand the dataset stored in LiPD format. Remember that LiPD essentially consists of data tables stored in csv format and a JSON-LD file that contains the metadata.
- The second step is to map the JSON file to RDF using our ontology as the schema.
- The third (and **unique**) step is to load the data contained in the CSV files directly into the graph. This approach is distinctive to LinkedEarth. If you work with other knowledge bases, you'll find that most of the time the actual data are stored in another format, which you will need to learn to handle in conjunction with the metadata.

Also, each dataset gets its own graph! So you can think of the `LiPD` object as a collection of graphs, each of which represents a particular dataset.

Let's have a close look at our object, shall we?

In [3]:
D.copy().__dict__

{'graph': <Graph identifier=Ned1080da185c444fa3d3836b882df696 (<class 'rdflib.graph.ConjunctiveGraph'>)>}

In particular, our graph object is a `ConjunctiveGraph` from the [rdflib library](https://rdflib.readthedocs.io/en/stable/). Why should you care?

1. If you isolate the graph, then you can use the `rdflib` package directly. Since this is a well-maintained community package with a larger user base than `PyLiPD`, chances are that Stack Overflow or the `rdflib` documentation will provide you with some answers on how to work with your graph.
2. Why a `ConjunctiveGraph`? We use this particular type because, although each dataset is stored in its own graph, we often want to query across datasets (i.e., across graphs), and this particular object in `rdflib` allows us to do so.

Ok, so let's try to create our own queries.

### Constructing a query

Let's return to the query described in [this primer](http://linked.earth/pylipdTutorials/graph.html) in which we try to gather all the dataset names. Let's modify it to also return the variable `ds` so you can see the difference between the object and the name:

In [13]:
query ="""

PREFIX le:<http://linked.earth/ontology#>

SELECT ?ds ?dsname WHERE{

?ds a le:Dataset .
?ds le:name ?dsname .}

"""

Notice that this is just a string containing the query in the SPARQL language. Now, let's run it:

In [14]:
res, res_df = D.query(query)

This returns two objects:
* `res`, which is a `SPARQLResult` object from `rdflib`
* `res_df`, which holds the `SPARQLResult` processed into a DataFrame. This is often the one you will want to look at. Let's have a look at it:

In [15]:
res_df.head()

| | ds | dsname |
|---|---|---|
| 0 | http://linked.earth/lipd/Ocn-RedSea.Felis.2000 | Ocn-RedSea.Felis.2000 |
| 1 | http://linked.earth/lipd/Ant-WAIS-Divide.Sever... | Ant-WAIS-Divide.Severinghaus.2012 |
| 2 | http://linked.earth/lipd/Asi-SourthAndMiddleUr... | Asi-SourthAndMiddleUrals.Demezhko.2007 |
| 3 | http://linked.earth/lipd/Ocn-AlboranSea436B.Ni... | Ocn-AlboranSea436B.Nieto-Moreno.2013 |
| 4 | http://linked.earth/lipd/Eur-SpannagelCave.Man... | Eur-SpannagelCave.Mangini.2005 |


First, notice that the columns of the DataFrame correspond to the names of the variables that you specified in the SPARQL query. You can choose any name you want, as long as it starts with a question mark, which tells SPARQL that this is a variable to be filled in from the database. Second, notice the difference between `ds` and `dsname`. `ds` is the dataset object, and the query returns the URI of that object. `dsname` is a string that contains the name of the dataset. In general, we form our URIs from the names of the objects, so don't be surprised if you sometimes end up with the URI instead of the name (we all sometimes forget to ask for the name).
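
The link between query variables and column names can be checked with a few lines of plain Python. The helper below, `select_variables`, is hypothetical and not part of `PyLiPD`. It is a rough sketch that only handles simple queries of the form `SELECT ?a ?b WHERE{...}`, and it will not cope with `SELECT *` or with expressions such as `(COUNT(?x) AS ?n)`.

```python
import re

query = """
PREFIX le:<http://linked.earth/ontology#>

SELECT ?ds ?dsname WHERE{

?ds a le:Dataset .
?ds le:name ?dsname .}
"""

def select_variables(sparql):
    """Return the variable names listed between SELECT and WHERE (simple queries only)."""
    match = re.search(r"SELECT\s+(.*?)\s+WHERE", sparql, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return []
    # Variables are the tokens that start with a question mark.
    return re.findall(r"\?(\w+)", match.group(1))

columns = select_variables(query)
print(columns)
```

Renaming `?dsname` to, say, `?name` in the query would rename the corresponding column in the same way.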

One thing to consider is query speed. If you're doing exploratory work with SPARQL, you may wish to return only a few results to check that your query is doing what you expect. In this case, the number of datasets is quite small:

In [16]:
len(res_df.index)

16

But if the query returns thousands of rows, it may take a few minutes to run, and you might not want to wait. If you only want the first 10 rows, add a `LIMIT` clause to the query:

In [17]:
query ="""

PREFIX le:<http://linked.earth/ontology#>

SELECT ?ds ?dsname WHERE{

?ds a le:Dataset .
?ds le:name ?dsname .}

LIMIT 10

"""

res, res_df = D.query(query)

len(res_df.index)

10
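
When iterating on a query, it can be convenient to switch the limit on and off without editing the string by hand. The helper below, `with_limit`, is a hypothetical convenience function, not something `PyLiPD` provides. It assumes the query does not already end with a `LIMIT` clause.

```python
def with_limit(sparql, n):
    """Append a LIMIT clause to a SPARQL SELECT query string."""
    if not isinstance(n, int) or n < 1:
        raise ValueError("n must be a positive integer")
    return f"{sparql.rstrip()}\nLIMIT {n}\n"

base_query = """
PREFIX le:<http://linked.earth/ontology#>

SELECT ?ds ?dsname WHERE{

?ds a le:Dataset .
?ds le:name ?dsname .}
"""

# Use test_query while exploring, then switch back to base_query for the full run.
test_query = with_limit(base_query, 10)
print(test_query)
```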

Please note that the number of rows doesn't necessarily correspond to the number of datasets. For instance, if I ask for each variable in a dataset, then each variable will be returned in its own row.
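
The distinction between rows and datasets is easy to see on a toy example. The rows below are made up for illustration (they are not the output of a real query); they are shaped like the result of a query that returns one row per variable.

```python
# Made-up rows: one row per (dataset, variable) pair.
rows = [
    {"dsname": "DatasetA", "varname": "year"},
    {"dsname": "DatasetA", "varname": "temperature"},
    {"dsname": "DatasetB", "varname": "depth"},
    {"dsname": "DatasetB", "varname": "age"},
    {"dsname": "DatasetB", "varname": "d18O"},
]

n_rows = len(rows)
# A set keeps each dataset name only once.
n_datasets = len({row["dsname"] for row in rows})
print(f"{n_rows} rows describe {n_datasets} datasets")
```

On the DataFrame returned by `D.query`, the equivalent check would be `res_df['dsname'].nunique()`, assuming the query returns a `dsname` column. This also means that a `LIMIT` applies to rows, so a limited query may cover fewer datasets than the limit suggests.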

### Optional variables

Let's pop one of the PAGES2k datasets out into a new LiPD object, `D2`, and add the ODP846 dataset for the purpose of this demonstration. Why ODP846? The file was created before LiPD had datasetIDs, so I know this dataset doesn't have one.

In [18]:
D.get_all_dataset_names()

['Ocn-RedSea.Felis.2000',
 'Ant-WAIS-Divide.Severinghaus.2012',
 'Asi-SourthAndMiddleUrals.Demezhko.2007',
 'Ocn-AlboranSea436B.Nieto-Moreno.2013',
 'Eur-SpannagelCave.Mangini.2005',
 'Ocn-FeniDrift.Richter.2009',
 'Eur-LakeSilvaplana.Trachsel.2010',
 'Ocn-PedradeLume-CapeVerdeIslands.Moses.2006',
 'Ocn-SinaiPeninsula_RedSea.Moustafa.2000',
 'Eur-NorthernSpain.Martin-Chivelet.2011',
 'Arc-Kongressvatnet.D_Andrea.2012',
 'Eur-CoastofPortugal.Abrantes.2011',
 'Eur-SpanishPyrenees.Dorado-Linan.2012',
 'Eur-FinnishLakelands.Helama.2014',
 'Eur-NorthernScandinavia.Esper.2012',
 'Eur-Stockholm.Leijonhufvud.2009']

In [20]:
D2 = D.pop('Ocn-RedSea.Felis.2000')

In [21]:
D2.load('../data/ODP846.Lawrence.2006.lpd')
D2.get_all_dataset_names()

Loading 1 LiPD files


100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.44s/it]

Loaded..





['Ocn-RedSea.Felis.2000', 'ODP846.Lawrence.2006']

Let's perform a similar query, but asking for the datasetID in addition to returning the dataset name:

In [25]:
query ="""

PREFIX le:<http://linked.earth/ontology#>

SELECT ?ds ?dsname ?datasetID WHERE{

?ds a le:Dataset .
?ds le:name ?dsname .
?ds le:datasetId ?datasetID}

"""

res, res_df = D2.query(query)

res_df

| | ds | dsname | datasetID |
|---|---|---|---|
| 0 | http://linked.earth/lipd/Ocn-RedSea.Felis.2000 | Ocn-RedSea.Felis.2000 | 4fZQAHmeuJn8ipLfurWv |


As you can see, only the PAGES2k dataset was returned by this query. The ODP846 dataset was omitted because it doesn't have a datasetID. If you want to make this part of the query optional, all you need to do is place the corresponding pattern inside an `OPTIONAL{ }` block:

In [26]:
query ="""

PREFIX le:<http://linked.earth/ontology#>

SELECT ?ds ?dsname ?datasetID WHERE{

?ds a le:Dataset .
?ds le:name ?dsname .
OPTIONAL{?ds le:datasetId ?datasetID.}}

"""

res, res_df = D2.query(query)

res_df

| | ds | dsname | datasetID |
|---|---|---|---|
| 0 | http://linked.earth/lipd/Ocn-RedSea.Felis.2000 | Ocn-RedSea.Felis.2000 | 4fZQAHmeuJn8ipLfurWv |
| 1 | http://linked.earth/lipd/ODP846.Lawrence.2006 | ODP846.Lawrence.2006 | |


In this case, you at least get the dataset name. When to use `OPTIONAL` is really up to you. If the information associated with a specific property is critical to your analysis, leave the pattern as required: datasets that lack the property are then filtered out directly. If the information is not critical, wrap the pattern in `OPTIONAL` so that those datasets are still returned.
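
If the difference between a required and an optional pattern still feels abstract, the sketch below mimics it in plain Python. It is only an analogy and does not run any SPARQL. The two records reuse the dataset names and datasetID shown in the outputs above, and the `match` function is a hypothetical stand-in for the query engine.

```python
# Records standing in for the two datasets in D2; None marks a missing datasetID.
datasets = [
    {"dsname": "Ocn-RedSea.Felis.2000", "datasetID": "4fZQAHmeuJn8ipLfurWv"},
    {"dsname": "ODP846.Lawrence.2006", "datasetID": None},
]

def match(records, optional_id):
    """Mimic a required vs. OPTIONAL pattern on datasetID."""
    results = []
    for rec in records:
        if rec["datasetID"] is None and not optional_id:
            # A required pattern that cannot be matched drops the whole row.
            continue
        # With OPTIONAL, the row is kept and the variable is left unbound.
        results.append((rec["dsname"], rec["datasetID"]))
    return results

required = match(datasets, optional_id=False)
optional = match(datasets, optional_id=True)
print(len(required), len(optional))
```

In database terms, a required pattern behaves like an inner join, while `OPTIONAL` behaves like a left join that keeps the dataset and leaves the missing value empty.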