A workflow to integrate ecological monitoring data from different sources

This repository contains code template illustrating the workflow presented in the article:

Wicquart, J., Gudka, M., Obura, D., Logan, M., Staub, F., Souter, D., Planes, S. (2022). A workflow to integrate ecological monitoring data from different sources. Ecological Informatics, 68.

1. Abstract

Programs and initiatives aiming to protect biodiversity and ecosystems have increased over the last decades in response to their decline. Most of these are based on monitoring data to quantitatively describe trends in biodiversity and ecosystems. The estimation of such trends, at large scales, requires the integration of numerous data from multiple monitoring sites. However, due to the high heterogeneity of data formats and the resulting lack of interoperability, the data integration remains sparsely used and synthetic analyses are often limited to a restricted part of the data available. Here we propose a workflow, comprising four main steps, from data gathering to quality control, to better integrate ecological monitoring data and to create a synthetic dataset that will make it possible to analyse larger sets of monitoring data, including unpublished data. The workflow was designed and applied in the production of the Status of Coral Reefs of the World: 2020 report, where more than two hundred individual datasets were integrated to assess the status and trends of hard coral cover at the global scale. The workflow was applied to two case studies and associated R codes, based on the experience acquired during the production of this report. The proposed workflow allows for the integration of datasets with different levels of taxonomic and spatial precision, with a high degree of reproducibility. It provides a conceptual and technical framework for the integration of ecological monitoring data, allowing for the estimation of temporal trends in biodiversity and ecosystems or to test ecological hypotheses at larger scales.

2. How to download this project?

On the project main page on GitHub, click on the green button Code and then click on Download ZIP

3. Path 3.A - Taxonomic re-categorization

3.1 Context

The first case study correspond to the path 3A of the workflow and illustrate the integration of data from monitoring of benthic communities (sessile organisms) in coral reefs. Because the taxonomic identification is difficult, broad categories are often used (e.g. algae, hard living coral) in most of monitoring programs. Thus, during the data integration, a taxonomic re-categorization must be done to insure the use of common categories across the different datasets integrated. We highlight that the different datasets were created to illustrate the worklow and they do not correspond to any existing real datasets.

3.2 Project organization

The folder path_3a contains the folders data and R. The data folder regroup all the data files with a numbering corresponding to their level of advancement in the workflow. Hence, the folder 01_raw contains the different raw data files as received by data contributors, the folder 02_reformatted contains the individually reformatted datasets, the file 03_synthetic-dataset correspond to the grouped data with taxonomic assignment done, and the file 04_final-synthetic-dataset correspond to the final synthetic dataset.

Those data files and folders numbering is used correspondingly in the R folder where three scripts are present. The first one (step-2_individual-data-reformatting.Rmd) correspond to the individual data reformatting (step 2 of the workflow) with one or more chunk code by data contributor. The second one (step-3_data-grouping-tax-assignment.Rmd) correspond to the data grouping and taxonomic assignment (step 3 of the workflow), and the last one (step-4_quality-assurance-quality-control.Rmd) correspond to the quality assurance and quality control (step 4 of the workflow).

The .Rmd format (rmarkdown) was chosen for the different R scripts because it allows a better segmentation and annotation of the code and process (necessary for the step 2) and the exportation of code and output to an HTML file which may include interactive tables, plots and maps (necessary for steps 3 and 4). The HTML files can be opened with a search engine (e.g. Google Chrome) and an internet connection is necessary for the visualization of interactive maps.

3.3 Raw datasets description

The 01_raw folder includes five folders corresponding to the data shared by five different data contributors. Each of them represent a specific case:

📄 data_contributor_1: One .xlsx file containing two sheets, the first one with the main data and the second one with the substrate codes.
📄 data_contributor_2: One .xlsx file containing two sheets, the first one with the main data in wide format and the second one with site coordinates.
📄 data_contributor_3: One .xlsx file containing three sheets with same columns names corresponding to three different sites.
📄 data_contributor_4: Three .csv files where the two first files contains data for the same site but for two different years, and the third file contains substrate codes.
📄 data_contributor_5: Three .xlsx files where the two first files contains data in wide format with different columns names, and the third file contains site coordinates.

3.4 Variables selected

The first step of the workflow is to select the variables that will have to be present in the final synthetic dataset. The variables selected for the first case study are described in the Table 1.

Table 1. Variables selected for the benthic synthetic dataset. The icons for the variables categories (Cat.) represents 📝 = description variables, 🌐 = spatial variables, 📆 = temporal variables, 📏 = methodological variables, 🦀 = taxonomic variables, 📈 = metric variables. Variables names in parentheses correspond to the DarwinCore (DwC) terms.

	Variable (DwC)	Cat.	Type	Unit	Description
1	DatasetID (datasetID)	📝	Factor		Dataset ID
2	Area (higherGeography)	🌐	Factor		Biogeographic area
3	Country (country)	🌐	Factor		Country
4	Archipelago (islandGroup)	🌐	Factor		Archipelago
5	Location (stateProvince)	🌐	Factor		Location or island within the country
6	Site (locality)	🌐	Factor		Site within the location
7	Replicate (parentEventID)	🌐	Integer		Replicate ID
8	Zone (habitat)	🌐	Factor		Reef zone
9	Latitude (decimalLatitude)	🌐	Numeric		Latitude of the site (decimal format)
10	Longitude (decimalLongitude)	🌐	Numeric		Longitude of the site (decimal format)
11	Depth (verbatimDepth)	🌐	Numeric	m	Mean depth
12	Year (year)	📆	Integer		Year
13	Date (eventDate)	📆	Date		Date (YYYY-MM-DD)
14	Method (samplingProtocol)	📏	Factor		Description of the method used
15	Observer	📏	Factor		Name of the diver
16	Category	🦀	Factor		See Table 2
17	Group	🦀	Factor		See Table 2
18	Family (family)	🦀	Factor		Family
19	Genus (genus)	🦀	Factor		Genus
20	Species (scientificName)	🦀	Factor		Species
21	Cover (measurementValue)	📈	Numeric	%	Cover percentage

Table 2. Factor levels of variables Category and Group used for the re-categorization.

Category	Group
Abiotic	Rock
	Rubble
	Sand
	Silt
Algae	Coralline algae
	Cyanophyceae
	Macroalgae
	Turf algae
Hard bleached coral
Hard dead coral
Hard living coral
Other fauna	Actiniaria
	Alcyonacea
	Antipatharia
	Asteroidea
	Bivalvia
	Bryozoa
	Corallimorpharia
	Crinoidea
	Echinoidea
	Gastropoda
	Holothuroidea
	Hydrozoa
	Ophiuroidea
	Polychaeta
	Porifera
	Tunicata
	Zoantharia
Seagrass

4. Path 3.B - Taxonomical verification

4.1 Context

The second case study correspond to the path 3B of the workflow and illustrate the integration of data from monitoring of fish communities (vagile organisms) in coral reefs. Because the monitoring of fish is based on true taxonomical levels (e.g. species, genus) instead of broad categories, a taxonomical verification must be assessed during the data integration to avoid misspelling names and include recent update in taxonomy. We highlight that the different datasets were created to illustrate the worklow and they do not correspond to any existing real datasets.

4.2 Project organization

The folder path_3b contains the folders data and R. The data folder regroup all the data files with a numbering corresponding to their level of advancement in the workflow. Hence, the folder 01_raw contains the different raw data files as received by data contributors, the folder 02_reformatted contains the individually reformatted datasets, the file 03_synthetic-dataset correspond to the grouped data with taxonomic assignment done, and the file 04_final-synthetic-dataset correspond to the final synthetic dataset.

Those data files and folders numbering is used correspondingly in the R folder where three scripts are present. The first one (step-2_individual-data-reformatting.Rmd) correspond to the individual data reformatting (step 2 of the workflow) with one or more chunk code by data contributor. The second one (step-3_data-grouping-tax-assignment.Rmd) correspond to the data grouping and taxonomic assignment (step 3 of the workflow), and the last one (step-4_quality-assurance-quality-control.Rmd) correspond to the quality assurance and quality control (step 4 of the workflow).

The .Rmd format (rmarkdown) was chosen for the different R scripts because it allows a better segmentation and annotation of the code and process (necessary for the step 2) and the exportation of code and output to an HTML file which may include interactive tables, plots and maps (necessary for steps 3 and 4). The HTML files can be opened with a search engine (e.g. Google Chrome) and an internet connection is necessary for the visualization of interactive maps.

4.3 Raw datasets description

The 01_raw folder includes five folders corresponding to the data shared by five different data contributors. Each of them represent a specific case:

📄 data_contributor_1: One .xlsx file containing two sheets, the first one with the main data and the second one with the site coordinates.
📄 data_contributor_2: Two .csv files, the first contains the main data in wide format and the second the sites coordinates.
📄 data_contributor_3: One .xlsx file containing three sheets, the first contains the main data, the second the site coordinates and the third the species codes.
📄 data_contributor_4: Two files, one in .xlsx and one in .csv. The .xlsx file contains two sheets with same column names, corresponding to two different sites. The .csv file contains the site coordinates.
📄 data_contributor_5: Four .xlsx files with one sheet, the first three contains the main data with same column names for three different sites, the fourth contains sites coordinates data.

4.4 Variables selected

The first step of the workflow is to select the variables that will have to be present in the final synthetic dataset. The variables selected for the second case study are described in the Table 3.

Table 3. Variables selected for the fish synthetic dataset. The factor levels of the variable Size_type are Total length, Fork length and Standard length. The icons for the variables categories (Cat.) represents 📝 = description variables, 🌐 = spatial variables, 📆 = temporal variables, 📏 = methodological variables, 🦀 = taxonomic variables, 📈 = metric variables.Variables names in parentheses correspond to the DarwinCore (DwC) terms.

	Variable (DwC)	Cat.	Type	Unit	Description
1	DatasetID (datasetID)	📝	Factor		Dataset ID
2	Area (higherGeography)	🌐	Factor		Biogeographic area
3	Country (country)	🌐	Factor		Country
4	Archipelago (islandGroup)	🌐	Factor		Archipelago
5	Location (stateProvince)	🌐	Factor		Location or island within the country
6	Site (locality)	🌐	Factor		Site within the location
7	Replicate (parentEventID)	🌐	Integer		Replicate ID
8	Zone (habitat)	🌐	Factor		Reef zone
9	Latitude (decimalLatitude)	🌐	Numeric		Latitude of the site (decimal format)
10	Longitude (decimalLongitude)	🌐	Numeric		Longitude of the site (decimal format)
11	Depth (verbatimDepth)	🌐	Numeric	m	Mean depth
12	Year (year)	📆	Integer		Year
13	Date (eventDate)	📆	Date		Date (YYYY-MM-DD)
14	Method (samplingProtocol)	📏	Factor		Description of the method used
15	Observer	📏	Factor		Name of the diver
16	Family (family)	🦀	Factor		Family
17	Genus (genus)	🦀	Factor		Genus
18	Species (scientificName)	🦀	Factor		Species
19	Density	📈	Numeric	n. ind. 100 m-2	Number of individuals
20	Size	📈	Numeric	cm	Size of individuals
21	Size_type (measurementType)	📏	Factor		Size type used to measure the size

5. How to report issues?

Please report any bugs or issues HERE.

6. Reproducibility parameters

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252    LC_MONETARY=French_France.1252
[4] LC_NUMERIC=C                   LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Hmisc_4.5-0       Formula_1.2-4     survival_3.2-11   lattice_0.20-44   rfishbase_3.1.8  
 [6] leaflet_2.0.4.1   DT_0.18           formattable_0.2.1 lubridate_1.7.10  readxl_1.3.1     
[11] forcats_0.5.1     stringr_1.4.0     dplyr_1.0.6       purrr_0.3.4       readr_1.4.0      
[16] tidyr_1.1.3       tibble_3.1.2      ggplot2_3.3.4     tidyverse_1.3.1  

loaded via a namespace (and not attached):
 [1] fs_1.5.0            RColorBrewer_1.1-2  progress_1.2.2      httr_1.4.2          gh_1.3.0           
 [6] tools_4.1.0         backports_1.2.1     utf8_1.2.1          R6_2.5.0            rpart_4.1-15       
[11] DBI_1.1.1           colorspace_2.0-1    nnet_7.3-16         withr_2.4.2         tidyselect_1.1.1   
[16] gridExtra_2.3       prettyunits_1.1.1   curl_4.3.1          compiler_4.1.0      cli_2.5.0          
[21] rvest_1.0.0         htmlTable_2.2.1     xml2_1.3.2          scales_1.1.1        checkmate_2.0.0    
[26] digest_0.6.27       foreign_0.8-81      rmarkdown_2.9       base64enc_0.1-3     jpeg_0.1-8.1       
[31] pkgconfig_2.0.3     htmltools_0.5.1.1   dbplyr_2.1.1        fastmap_1.1.0       htmlwidgets_1.5.3  
[36] rlang_0.4.11        rstudioapi_0.13     generics_0.1.0      jsonlite_1.7.2      crosstalk_1.1.1    
[41] magrittr_2.0.1      Matrix_1.3-4        Rcpp_1.0.6          munsell_0.5.0       fansi_0.5.0        
[46] lifecycle_1.0.0     stringi_1.6.2       yaml_2.2.1          grid_4.1.0          crayon_1.4.1       
[51] haven_2.4.1         splines_4.1.0       hms_1.1.0           knitr_1.33          pillar_1.6.1       
[56] reprex_2.0.0        glue_1.4.2          evaluate_0.14       arkdb_0.0.12        latticeExtra_0.6-29
[61] data.table_1.14.0   modelr_0.1.8        vctrs_0.3.8         png_0.1-7           cellranger_1.1.0   
[66] gtable_0.3.0        assertthat_0.2.1    cachem_1.0.5        xfun_0.24           broom_0.7.7        
[71] memoise_2.0.0       cluster_2.1.2       ellipsis_0.3.2

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
path_3a		path_3a
path_3b		path_3b
.gitignore		.gitignore
README.md		README.md
monitoring_workflow.Rproj		monitoring_workflow.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

path_3a

path_3a

path_3b

path_3b

.gitignore

.gitignore

README.md

README.md

monitoring_workflow.Rproj

monitoring_workflow.Rproj

Repository files navigation

A workflow to integrate ecological monitoring data from different sources

1. Abstract

2. How to download this project?

3. Path 3.A - Taxonomic re-categorization

3.1 Context

3.2 Project organization

3.3 Raw datasets description

3.4 Variables selected

4. Path 3.B - Taxonomical verification

4.1 Context

4.2 Project organization

4.3 Raw datasets description

4.4 Variables selected

5. How to report issues?

6. Reproducibility parameters

About

Releases

Packages

Languages

JWicquart/monitoring_workflow

Folders and files

Latest commit

History

Repository files navigation

A workflow to integrate ecological monitoring data from different sources

1. Abstract

2. How to download this project?

3. Path 3.A - Taxonomic re-categorization

3.1 Context

3.2 Project organization

3.3 Raw datasets description

3.4 Variables selected

4. Path 3.B - Taxonomical verification

4.1 Context

4.2 Project organization

4.3 Raw datasets description

4.4 Variables selected

5. How to report issues?

6. Reproducibility parameters

About

Topics

Resources

Stars

Watchers

Forks

Languages