/
searching-dataone.Rmd
147 lines (113 loc) · 7.07 KB
/
searching-dataone.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
---
title: "Searching DataONE Data Holdings"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Searching DataONE Data Holdings}
%\VignetteEngine{knitr::rmarkdown}
%\usepackage[utf8]{inputenc}
---
The primary method to locate data in the DataONE network is to use the web based search tool located at https://search.dataone.org.
Data can also be discovered programatically from R using the *dataone* R package method `query`. Both the web page and the R method
access the same
underlying search mechanism, the Apache Foundation Solr search engine that runs on the
[DataONE Coordinating Node](https://www.dataone.org/coordinating-nodes). The DataONE Solr index, similar to a catalog or database, contains
information for every dataset that is available from the DataONE network.
The same search mechanism is available on [DataONE Member Nodes](https://www.dataone.org/current-member-nodes). However, the search
index on a member node only contains dataset information for datasets that are contained
on that member node.
Information about the Solr search engine can be obtained at [Solr Resources](http://lucene.apache.org/solr/resources.html).
Additional information about searching DataONE can be viewed at [Content Discovery](https://purl.dataone.org/architecture/design/SearchMetadata.html).
## Using The `query` Method
Some familiarity with Solr is helpful when using the `query` method
effectively, however basic queries can be very powerful and the examples in this document can provide a starting point. As an alternative to
composing queries using Solr syntax, a simplier search mechanism is available with the `query` method that uses name, value pairs (See the section: *A Simplied Search").
Additional information about the `query` method can be obtained from the R help facility, e.g. `?query` from the R command line.
### Query and return an R list
The following example queries DataONE and returns values in an R list, with each value converted from the Solr result to the appropriate R data type:
```{r,eval=F}
library(dataone)
cn <- CNode("PROD")
queryParamList <- list(q="id:doi*", rows="5", fq="abstract:carbon", fl="id,title,dateUploaded,abstract,datasource,size")
result <- query(cn, solrQuery=queryParamList, as="list")
```
The 'solrQuery' argument takes a list of query parameters that are sent to the DataONE Solr search engine and control how the
search is performed. The name of each list element is a Solr keyword which is combined with the list element value to create
each Solr query term. All these terms are combined to create the complete Solr query. The `queryParamlist` in the example
above will be used to construct this Solr query:
```
?q=id:doi*&rows=5,&fq=abstract:carbon&fl=id,title,dateUploaded,abstract,datasource,size
```
The `q=id:doi*` is the main query term and specifies that all DataONE objects that have an `id` field
value beginning with `doi*` should be returned.
The `&fq=abstract:carbon` term is a `filter query`, which filters the results so that only results with an `abstract` field containing
the word `carbon` will be returned.
The `&fl=id,title,dateUploaded,abstract,datasource,size` term is a `field` specifier, so only those fields in the list will be
included in the result set.
The `&rows=5` term specifies that a maximum of five results will be returned.
The `result` object contains all the data values found and returned from the query. Each element of the returned list contains information
for one dataset. Each returned attribute for a dataset can be accessed with the appropriate element name, for example, to
access the title information of the first dataset returned, use the R statement:
```{r,eval=F}
result[[1]]$title
```
To print out selected information for all returned values, use:
```{r,eval=F}
ids <- lapply(result, function(x) {
message(sprintf("id: %s", x$id))
message(sprintf("origin member node: %s", x$datasource))
message(sprintf("title: %s", x$title))
message(sprintf("date uploaded: %s", x$dateUploaded))
x$id
})
```
The complete list of possible searchable values stored for a dataset can be viewed at [DataONE Solr Engine Description](https://cn.dataone.org/cn/v2/query/solr). However,
the values available for a particular dataset may be a subset of these, depending on the metadata provided when the dataset is uploaded to DataONE.
The DataONE Coordinating Node (CN) contains metadata about datasets from all Member Nodes (MN) in the network.
As the above example shows, sending a query to the CN may find matching datasets located on potentially any Member Node in the network.
### Return values as a data frame
To return results as a data frame, use the `as` parameter as shown below. In addition all values will be stored as `character`, because `parse=FALSE` is specified:
```{r,eval=F}
cn <- CNode("PROD")
result <- query(cn, queryParamList, as="data.frame", parse=FALSE)
```
The `result` is a data frame with each row containing information for one dataset. To print all returned DataONE identifiers from the result:
```{r, eval=F}
result[,'id']
```
### Using a Simplified Search.
A search may be performed by just specifying query fields and values using the *searchTerms* parameter.
For example, to search for datasets that mention 'kelp' in the abstract and have an attribute discription that contains the word 'biomass', the following query could be used:
```{r, eval=F}
cn <- CNode("PROD")
mn <- getMNode(cn, "urn:node:KNB")
mySearchTerms <- list(abstract="kelp", attribute="biomass")
result <- query(cn, searchTerms=mySearchTerms, as="data.frame")
```
Using the *searchTerms* parameter causes the *query* method to construct a Solr query
based on the list, that is passed on to the DataONE node specified.
The names in the *searchTerm* list are the query field names available from the Solr query engine
being used. These names can be determined using the *queryEngineDescription* method, or simply
using a web browser to view https://cn.dataone.org/cn/v2/query/solr.
### Search just a member node
If it is known which Member Node holds the data of interest, then a search can be limited to just that
MN by sending the search directly to that MN instead of the CN. For example, if the dataset of interest
is on The Knowledge Network for Biocomplexity (KNB) Member Node, then the search is performed with the statments:
```{r, eval=F}
# Query the data holdings on a member node
cn <- CNode("PROD")
mn <- getMNode(cn, "urn:node:KNB")
queryParams <- list(q="abstract:habitat", fl="id,title,abstract")
result <- query(mn, queryParams, as="data.frame", parse=FALSE)
# Choose the first matchin PID
pid <- result[1,'id']
```
### Filter query results for a specific member node
An alternative way to locate datasets on a particular node is to send the query to the CN but limit the returned results to data holdings that originated from the specific member node by using Solr filter query parameter (`&fq`) can be used
for the `datasource` field:
```{r,eval=F}
cn <- CNode("PROD")
queryParams <- 'q=id:*&fl=id,title&fq=datasource:"urn:node:KNB"&rows=5'
result <- query(cn, queryParams, as="data.frame", parse=FALSE)
result[,'id']
```