Commit 3ce8549

author
John Waller
committed
updating occ_search docs
1 parent 1a0afda commit 3ce8549

5 files changed

+192
-3
lines changed

R/occ_count.r

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -225,7 +225,6 @@ occ_count <- function(...,occurrenceStatus="PRESENT", curlopts = list()) {
225225
publishedByGbifRegion = args$publishedByGbifRegion,
226226
island = args$island,
227227
islandGroup = args$islandGroup,
228-
recordedById = args$recordedById,
229228
taxonId = args$taxonId,
230229
taxonConceptId = args$taxonConceptId,
231230
taxonomicStatus = args$taxonomicStatus,

R/occ_search.r

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -108,7 +108,6 @@ occ_search <- function(taxonKey = NULL,
108108
publishedByGbifRegion = NULL,
109109
island = NULL,
110110
islandGroup = NULL,
111-
recordedById = NULL,
112111
taxonId = NULL,
113112
taxonConceptId = NULL,
114113
taxonomicStatus = NULL,

man-roxygen/occsearch.r

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -225,6 +225,8 @@
225225
#' a specimen.
226226
#' @param datasetId (character) The ID of the dataset. Parameter may be
227227
#' repeated. Example : https://doi.org/10.1594/PANGAEA.315492
228+
#' @param datasetName (character) The exact name of the dataset. Not the same as
229+
#' dataset title.
228230
#' @param publishedByGbifRegion (character) GBIF region based on the owning
229231
#' organization's country.
230232
#' @param island (character) The name of the island on or near which the

man/occ_search.Rd

Lines changed: 3 additions & 1 deletion
Lines changed: 187 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,187 @@
1+
---
2+
title: "Effectively using occ_search"
3+
author: "John Waller"
4+
date: "2024-05-08"
5+
output: rmarkdown::html_vignette
6+
vignette: >
7+
%\VignetteIndexEntry{effectively_using_occ_search}
8+
%\VignetteEngine{knitr::rmarkdown}
9+
%\VignetteEncoding{UTF-8}
10+
---
11+
12+
GBIF's [occurrence search](https://www.gbif.org/occurrence/search) is a powerful and versatile tool for accessing GBIF mediated data. This vignette gives an overview of the `occ_search()` function, with examples and advice on how to use it effectively and on when **not** to use it.
13+
14+
> The function `occ_search()` (and related legacy function `occ_data()`) **should not** be used for serious research. Users sometimes find it easier to use `occ_search()` rather than `occ_download()` because they do not need to supply a username or password, and also do not need to wait for a download to finish. However, any serious research project should always use `occ_download()` instead.
15+
16+
`occ_search()` is a quick way to get a non-random sample of occurrences from the GBIF mediated data. It is useful for quickly exploring the data, but it is not suitable for serious research because users are **limited to 100,000 records** per search combination.
17+
18+
And, even if your search returns fewer than 100,000 records, it is **still** not recommended to use `occ_search()` to retrieve all the records for a serious research project. This is because data obtained this way cannot easily be [cited](https://docs.ropensci.org/rgbif/articles/gbif_citations.html).
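
For a research project, the download route might look roughly like the sketch below. `download_occurrences()` is a hypothetical wrapper name, the two predicates are only an illustration, and the credentials are assumed to be passed in explicitly; check the `occ_download()` reference for the exact arguments in your installed version of rgbif.

```r
# Hypothetical wrapper around the rgbif download workflow (a sketch under
# the assumptions stated above, not the package's official recipe).
# Nothing is sent to GBIF until the function is called with valid credentials.
download_occurrences <- function(taxon_key, user, pwd, email) {
  request <- rgbif::occ_download(
    rgbif::pred("taxonKey", taxon_key),
    rgbif::pred("hasCoordinate", TRUE),
    format = "SIMPLE_CSV",
    user = user, pwd = pwd, email = email
  )
  rgbif::occ_download_wait(request)          # block until GBIF has prepared the file
  meta <- rgbif::occ_download_meta(request)  # download metadata, including the DOI
  records <- rgbif::occ_download_import(rgbif::occ_download_get(meta$key))
  list(records = records, doi = meta$doi)
}
```

The DOI returned alongside the records is what makes the data citable.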
19+
20+
Here are some examples of **good** usages of `occ_search()`:
21+
22+
- Quickly exploring occurrence data
23+
- Getting occurrence counts and statistics (see also `occ_count()` and article [here](https://docs.ropensci.org/rgbif/articles/occ_counts.html))
24+
- Testing out search parameters before downloading data
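
As a small illustration of testing parameters before a download, one option is to ask only for the number of matching records. `count_matches()` is a hypothetical helper name; the sketch assumes a single value per parameter, so that the result has one `meta` element.

```r
# Hypothetical helper: return only the number of matching records.
# limit = 0 requests no records, so only the search metadata comes back.
count_matches <- function(...) {
  rgbif::occ_search(..., limit = 0)$meta$count
}

# Example (needs network access and the rgbif package):
# count_matches(taxonKey = 1427067, basisOfRecord = "PRESERVED_SPECIMEN")
```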
25+
26+
And here are some examples of **bad** usages of `occ_search()`:
27+
28+
- Looping through a large number of species to extract occurrence data (See article [here](https://docs.ropensci.org/rgbif/articles/downloading_a_long_species_list.html) instead)
29+
- Treating the data as a random sample
30+
- Using `occ_search()` data for citable research
31+
32+
## basisOfRecord
33+
34+
One of the more useful fields to search on is `basisOfRecord`, which gives roughly the origin of the occurrence record. Most records on GBIF are either `PRESERVED_SPECIMEN` (museum/herbarium records) or `HUMAN_OBSERVATION` (usually citizen science, but sometimes research observations).
35+
36+
Other interesting `basisOfRecord` values are `FOSSIL_SPECIMEN` and `LIVING_SPECIMEN` (zoos or botanical gardens), because people typically want to exclude these from their downloads.
37+
38+
Keep in mind that `basisOfRecord` values are not guaranteed to be filled in accurately by the publisher. Sometimes records are misclassified, given a `basisOfRecord` that you would not expect, or have a [complicated provenance](https://data-blog.gbif.org/post/living-specimen-to-preserved-specimen-understanding-basis-of-record/).
39+
40+
``` r
41+
occ_search(basisOfRecord="PRESERVED_SPECIMEN") # museum and herbarium records
42+
occ_search(basisOfRecord="HUMAN_OBSERVATION") # citizen science and research observations
43+
occ_search(basisOfRecord="FOSSIL_SPECIMEN") # fossil records
44+
occ_search(basisOfRecord="LIVING_SPECIMEN") # zoo and botanical garden records
45+
occ_search(basisOfRecord="PRESERVED_SPECIMEN;HUMAN_OBSERVATION") # museum/herbarium and citizen science/research observations
46+
occ_search(basisOfRecord="MACHINE_OBSERVATION") # machine observations (e.g. camera traps, acoustic recorders, etc.)
47+
```
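
If records have already been retrieved, unwanted `basisOfRecord` values can also be dropped afterwards. The sketch below uses a small made-up data frame as a stand-in for the `$data` element of an `occ_search()` result; the keys and species are invented for illustration.

```r
# Toy stand-in for occurrence data; all values are made up.
occ <- data.frame(
  key = c(101, 102, 103, 104),
  species = c("Calopteryx splendens", "Calopteryx splendens",
              "Ginkgo biloba", "Ginkgo biloba"),
  basisOfRecord = c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN",
                    "FOSSIL_SPECIMEN", "LIVING_SPECIMEN")
)

# Drop fossils and living (zoo / botanical garden) specimens.
unwanted <- c("FOSSIL_SPECIMEN", "LIVING_SPECIMEN")
kept <- occ[!occ$basisOfRecord %in% unwanted, ]

# How many records of each kind were there before filtering?
table(occ$basisOfRecord)
```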
48+
49+
## Searching with scientificName
50+
51+
Users are sometimes attracted to `occ_search()` because it is possible to supply a `scientificName` rather than a `taxonKey`. Note that in the background a call is made to the species match service (similar to `name_backbone()`) in order to retrieve a GBIF taxonKey. Because of this, a user can occasionally receive poorly matched occurrences, particularly if authorship is not supplied.
52+
53+
``` r
54+
occ_search(scientificName="Calopteryx splendens")
55+
# Or better
56+
occ_search(scientificName="Calopteryx splendens (Harris, 1780)")
57+
```
58+
59+
This is equivalent to doing the following:
60+
61+
``` r
62+
occ_search(taxonKey=name_backbone("Calopteryx splendens")$usageKey)
63+
# OR
64+
occ_search(taxonKey=1427067)
65+
```
66+
67+
If your name happens to be a [homotypic synonym](https://docs.ropensci.org/rgbif/articles/taxonomic_names.html#too-many-choices-problem) of another name, you may get back occurrences for the other name, no results, or results for a higher-rank match. Therefore, it is usually safer to use the GBIF taxonKey.
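
One cautious pattern is to inspect the name match yourself and only search when it is exact. The helper below is hypothetical; it assumes a one-row data frame shaped like the output of `name_backbone()`, with `usageKey` and `matchType` columns, and is demonstrated on made-up rows rather than live results.

```r
# Hypothetical helper: accept a name match only when it is exact.
exact_taxon_key <- function(match) {
  if (nrow(match) != 1 || !isTRUE(as.character(match$matchType) == "EXACT")) {
    return(NA_integer_)
  }
  as.integer(match$usageKey)
}

# Made-up rows standing in for name_backbone() results
good  <- data.frame(usageKey = 1427067L, matchType = "EXACT")
fuzzy <- data.frame(usageKey = 1427067L, matchType = "FUZZY")

exact_taxon_key(good)   # 1427067
exact_taxon_key(fuzzy)  # NA
```

A returned `NA` is a signal to look at the match by hand before passing any key to `occ_search()`.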
68+
69+
## Non-interpreted fields
70+
71+
Some fields in the GBIF mediated data are "interpreted" by GBIF, meaning that they are standardized in some way. For example, the field `basisOfRecord` is standardized to a controlled vocabulary. Therefore, only a few values are returned no matter what the publisher has supplied. For instance, "pinned insect", "fish specimen", and "herbarium sheet", will all get mapped to `PRESERVED_SPECIMEN` by GBIF.
72+
73+
Other fields are "non-interpreted", meaning that they are not standardized in any way. For example, the field `recordedBy` is a free text field. If you search for `recordedBy="John Smith"`, you may not get back occurrences where the `recordedBy` field is some variant such as `J. Smith`, `Smith, J.`, `Smith, John`, etc.
74+
75+
One strategy for determining whether a search term is free text is to use `occ_count(facet="<search term>")`. See the article on `occ_count()` [here](https://docs.ropensci.org/rgbif/articles/occ_counts.html).
76+
77+
``` r
78+
occ_count(facet="recordedBy")
79+
occ_count(facet="basisOfRecord")
80+
```
81+
82+
If many unique values are returned, then it is likely that the field is free text.
83+
84+
## Unintentional mass data removal from NULL values
85+
86+
Some fields are often `NULL` because the publisher did not supply them. In general, `occ_search()` terms that are not required fields or are not filled in by GBIF during interpretation are often `NULL`. For example, even though `coordinateUncertaintyInMeters` [theoretically applies](https://docs.gbif.org/georeferencing-best-practices/1.0/en/) to all occurrences with coordinates, it is often `NULL` because the publishers choose not to supply this information or it is unknown. Similarly, `sex` is left `NULL` more often than one might naively expect.
87+
88+
Other columns with more `NULL`s than one might expect:
89+
90+
- `stateProvince`
91+
- `elevation`
92+
- `establishmentMeans`
93+
- `coordinateUncertaintyInMeters`
94+
95+
Keep in mind that filtering on a field removes every record where that field is `NULL`.
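
The effect is easy to see on a small made-up table, where `NA` plays the role of a value the publisher never supplied. This is only a local illustration of the principle, not a call to GBIF.

```r
# Made-up records standing in for occurrence data.
occ <- data.frame(
  key = 1:5,
  coordinateUncertaintyInMeters = c(30, NA, 2500, NA, 100)
)

# Keep records with an uncertainty of at most 1 km.
# subset() treats NA as "no match", so rows with a missing value are dropped.
precise <- subset(occ, coordinateUncertaintyInMeters <= 1000)

n_too_coarse <- sum(occ$coordinateUncertaintyInMeters > 1000, na.rm = TRUE)
n_missing    <- sum(is.na(occ$coordinateUncertaintyInMeters))

nrow(precise)  # 2 records kept
n_too_coarse   # 1 dropped for being too coarse
n_missing      # 2 dropped only because the value was missing
```

Here more records are lost to missing values than to the threshold itself.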
96+
97+
## Searching for locations
98+
99+
Location searching can sometimes be challenging for new users. In particular, searching for `stateProvince` can be tricky because the field is free text when one might expect it to come from a controlled vocabulary. `stateProvince="California"` will not return occurrences where the publisher supplied values such as `CA`, `Calif.`, or `Cal.`. Additionally, records with coordinates falling within California may not have been supplied with a `stateProvince` value by the publisher.
100+
101+
``` r
102+
occ_search(stateProvince="California")
103+
occ_search(stateProvince="CA") # will return different number of records
104+
occ_search(stateProvince="CA;California") # search both variants at the same time
105+
```
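
When several spelling variants are known, the semicolon-separated form used above can be built from a vector. `multi_value()` is a hypothetical helper name, and the list of variants is only an example.

```r
# Hypothetical helper: join several values into the "a;b;c" form
# that the examples in this vignette pass to occ_search().
multi_value <- function(values) {
  paste(unique(values), collapse = ";")
}

california_variants <- c("California", "CA", "Calif.", "Cal.", "CA")
multi_value(california_variants)  # "California;CA;Calif.;Cal."

# Example (needs network access and the rgbif package):
# rgbif::occ_search(stateProvince = multi_value(california_variants))
```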
106+
107+
Searching by `gadmGid` is usually a better choice than searching by `stateProvince`. The term `gadmGid` is a GBIF interpreted field that is filled by GBIF when coordinates are available. Looking up `gadmGid`s can be done on the GBIF [occurrence search page](https://www.gbif.org/occurrence/map?continent=NORTH_AMERICA&has_coordinate=true&has_geospatial_issue=false&gadm_gid=USA.5_1).
108+
109+
``` r
110+
occ_search(gadmGid="USA.5_1") # search for California
111+
occ_search(gadmGid="JPN.12_1") # search for Hokkaido Japan
112+
occ_search(gadmGid="USA.5_1;USA.6_1") # search for California and Colorado
113+
occ_search(gadmGid="PHL.10_1") # Bataan Philippines
114+
occ_search(gadmGid="USA") # United States "just land" without EEZ area
115+
```
116+
117+
Searching by `country` is typically straightforward because the field is standardized and filled by GBIF when coordinates are available. Two letter country codes are used when searching occurrences. These codes can be looked up using `enumeration_country()`.
118+
119+
``` r
120+
occ_search(country="US") # search for United States
121+
occ_search(country="JP") # search for Japan
122+
occ_search(country="PH") # search for Philippines
123+
occ_search(country="SE") # search for Sweden
124+
occ_search(country="US;JP") # search for United States and Japan
125+
```
126+
127+
Searching by `continent` is also possible, but unlike `country`, this value is **not** filled in when coordinates are available, and instead relies on the publisher filling in this field. So if the publisher has not filled in a value, then this field will be `NULL`, even if it obviously lies on a continent.
128+
129+
The field is, however, standardized by GBIF, so supplied values are all mapped to a controlled vocabulary (e.g. "Europa", "Euroopa", "EUR", "Eu" -\> EUROPE; "Afrique", "Afr.", "AF" -\> AFRICA).
130+
131+
``` r
132+
occ_search(continent="EUROPE") # search for Europe
133+
occ_search(continent="AFRICA") # search for Africa
134+
occ_search(continent="EUROPE;AFRICA") # search for Europe and Africa
135+
```
136+
137+
If you need to get all occurrences from a certain continent, I would use the `gadmGid` filter or supply a bounding box or WKT polygon to `geometry`. When using `geometry` make sure that your polygon is wound in the correct order (anti-clockwise). When in doubt, using the GBIF [web UI](https://www.gbif.org/occurrence/map) to draw and debug the polygon can be a good option. Only POLYGON and MULTIPOLYGON are accepted WKT types.
138+
139+
``` r
140+
occ_search(geometry="POLYGON((13.42436 69.86167,4.6469 67.01976,-8.26114 67.2205,-19.62021 67.81281,-28.39768 64.25374,-27.88135 53.09437,-17.55493 44.99691,-16.52228 30.81969,3.61426 32.57676,19.62021 30.37524,38.72411 32.14062,54.21375 33.87246,66.60546 43.14228,72.80133 50.54193,70.21972 62.16009,38.20778 72.6752,23.23447 73.42765,13.42436 69.86167))") # rough polygon around Europe
141+
```
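
Winding order can be checked locally before a polygon is sent to GBIF. The sketch below uses the shoelace formula on a closed ring (first point repeated at the end); the function names are hypothetical, and it assumes plain planar longitude/latitude coordinates for a ring that does not cross the antimeridian.

```r
# Signed area of a closed ring via the shoelace formula.
# Positive means anti-clockwise, negative means clockwise.
ring_signed_area <- function(x, y) {
  n <- length(x)
  sum(x[-n] * y[-1] - x[-1] * y[-n]) / 2
}

is_anticlockwise <- function(x, y) ring_signed_area(x, y) > 0

# Build a single-ring WKT polygon, reversing the points if needed
# so that the ring is wound anti-clockwise.
ring_to_wkt <- function(x, y) {
  if (!is_anticlockwise(x, y)) {
    x <- rev(x)
    y <- rev(y)
  }
  paste0("POLYGON((", paste(x, y, collapse = ","), "))")
}

# A unit square given in clockwise order
x <- c(0, 0, 1, 1, 0)
y <- c(0, 1, 1, 0, 0)

is_anticlockwise(x, y)  # FALSE
ring_to_wkt(x, y)       # "POLYGON((0 0,1 0,1 1,0 1,0 0))"
```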
142+
143+
Sometimes it can be useful to select everything **but** a [certain region](https://www.gbif.org/occurrence/map?has_coordinate=true&has_geospatial_issue=false&geometry=POLYGON((-180%20-90,-90%20-90,0%20-90,90%20-90,180%20-90,180%2090,90%2090,0%2090,-90%2090,-180%2090,-180%20-90),(-5%20-5,-5%205,5%205,5%20-5,-5%20-5))&occurrence_status=present), also known as a "polygon with hole in it". This can be done by formatting your WKT with enough interpolated points.
144+
145+
```
146+
POLYGON(
147+
(-180 -90,-90 -90,0 -90,90 -90,180 -90,180 90,90 90,0 90,-90 90,-180 90,-180 -90),
148+
(-5 -5,-5 5,5 5,5 -5,-5 -5)
149+
)
150+
```
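
A polygon like the one above can be generated for any rectangular hole. `wkt_world_minus_box()` is a hypothetical helper; it reuses the interpolated outer ring from the example and winds the hole in the same order as the example does.

```r
# Hypothetical helper: WKT for "the whole world except this box".
wkt_world_minus_box <- function(xmin, ymin, xmax, ymax) {
  # Outer ring covering the globe, with interpolated points along the edges
  outer <- "(-180 -90,-90 -90,0 -90,90 -90,180 -90,180 90,90 90,0 90,-90 90,-180 90,-180 -90)"
  # Inner ring (the hole), closed by repeating the first point
  hole_x <- c(xmin, xmin, xmax, xmax, xmin)
  hole_y <- c(ymin, ymax, ymax, ymin, ymin)
  hole <- paste0("(", paste(hole_x, hole_y, collapse = ","), ")")
  paste0("POLYGON(", outer, ",", hole, ")")
}

wkt <- wkt_world_minus_box(-5, -5, 5, 5)

# Example (needs network access and the rgbif package):
# rgbif::occ_search(geometry = wkt)
```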
151+
152+
## Searching for dates
153+
154+
Some records on GBIF can be quite old (1600s), so it is sometimes useful to filter by `year` to remove these records. Year is typically the collection event or the observation event of the record. Almost all occurrences on GBIF supply a `year` value. Therefore, filtering by `year` is typically safe from unintentional mass data removal caused by `NULL` values.
155+
156+
157+
```r
158+
occ_search(year=1998) # search for occurrences from 1998
159+
occ_search(year="1998,2024") # search for occurrences from 1998-2024
160+
occ_search(year="1900;2000") # search for occurrences from 1900 and 2000
161+
occ_search(year="1950,2024") # search for somewhat modern records
162+
```
163+
164+
## Other record ids
165+
166+
Sometimes users come to GBIF looking for a specific museum record, but they don't know the `gbifid` of the record. In these cases, searching by `occurrenceId`, `catalogNumber`, `recordNumber` or `institutionCode` can be useful. Keep in mind that many of these fields may not be unique across all of GBIF. For example, a few institutions might use the same `institutionCode` but actually be different institutions. Usually combining a few of these values can get you close to the record you are looking for.
167+
168+
```r
169+
occ_search(institutionCode="KU")
170+
occ_search(catalogNumber="KU 110")
171+
172+
```
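
Combining identifiers might look like the sketch below. `find_record()` is a hypothetical helper name, and the `limit` of 20 is an arbitrary choice for eyeballing a short list of candidates.

```r
# Hypothetical helper: narrow a search by combining two identifiers.
find_record <- function(institution_code, catalog_number) {
  res <- rgbif::occ_search(
    institutionCode = institution_code,
    catalogNumber = catalog_number,
    limit = 20
  )
  res$data  # candidate records to inspect by hand
}

# Example (needs network access and the rgbif package):
# find_record("KU", "KU 110")
```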
173+
174+
## DWCA extensions
175+
176+
New users might not be aware that some data publishers supply additional data beyond simple "when-what-where" data. Richer extra data usually comes in the form of `dwcaExtensions`. While `occ_search()` does not return the values from these extensions, it is possible to filter by extension type to see which publishers have published extensions of interest.
177+
178+
```r
179+
occ_search(dwcaExtension="http://rs.gbif.org/terms/1.0/Multimedia")
180+
occ_search(dwcaExtension="http://rs.tdwg.org/dwc/terms/MeasurementOrFact")
181+
occ_search(dwcaExtension="http://rs.gbif.org/terms/1.0/DNADerivedData")
182+
```
183+
184+
## Further reading
185+
186+
[GBIF tech docs](https://techdocs.gbif.org/en/openapi/v1/occurrence#/Searching%20occurrences/searchOccurrence)
187+
