Download of wrong species list #705

Open
francescorota93 opened this issue Feb 20, 2024 · 16 comments

@francescorota93

francescorota93 commented Feb 20, 2024

I need to download data for more than 2500 species. I am using a script similar to this one:

https://docs.ropensci.org/rgbif/articles/downloading_a_long_species_list.html

However, the query built from my species list downloads far more species than I requested, and moreover the data I got back for many species are missing from large areas.

I also tried splitting the species list into shorter lists (50 lists of 50 species each); this time the occurrences download better, but in some cases I still get data for species that I did not put in the list.

Could you please help with this?

Session Info
@jhnwllr
Collaborator

jhnwllr commented Feb 21, 2024

@francescorota93 Could you upload the list of species you are using here? It is likely that synonyms are being included, or that the names you are using match to a different taxon than you expect.
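
A quick way to see what the names match to is something like this (a sketch; it assumes species_list is the character vector of names you feed to name_backbone_checklist(), and the column names are those returned by recent rgbif versions):

library(rgbif)
library(dplyr)

m <- name_backbone_checklist(species_list)
# show everything that is not a clean, exact, accepted match
m %>%
  filter(matchType != "EXACT" | status != "ACCEPTED") %>%
  select(verbatim_name, scientificName, matchType, status)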

@francescorota93
Author

Hi, thanks for your quick response.

Here is my species list (zip file that should contain an rds list).
list_species.zip.

This is my script:

# match the names

gbif_taxon_keys <- species_list %>%
name_backbone_checklist() %>% # match to backbone
filter(!matchType == "NONE") %>% # get matched names
pull(usageKey)

# check the names

check_names <- list()
for (i in seq_along(gbif_taxon_keys)) {
  nu <- name_usage(key = gbif_taxon_keys[i])
  check_names[[length(check_names) + 1]] <- nu$data$species
}

I tried several approaches. One big request with the full list downloads species that I did not request, so I also tried splitting the list into chunks and making a separate request per chunk. Here is the full-list query:

# check how many species are kept

length(gbif_taxon_keys) ## 2519 species

# make unique

gbif_taxon_keys1 <- unique(gbif_taxon_keys) ## 2492 species

# remove 7707728

match(7707728, gbif_taxon_keys1) ## find the position of key 7707728 (Tracheophyta), which was downloading all tracheophytes

gbif_taxon_keys1 <- gbif_taxon_keys1[-780] ## drop it (position 780)

occ_data <- occ_download(
pred_in("taxonKey", gbif_taxon_keys1),
pred("hasCoordinate", TRUE),
pred("hasGeospatialIssue", FALSE),
pred_within("POLYGON((-31.3 32.4, -31.3 81.9, 69.1 81.9, 69.1 32.4, -31.3 32.4))"), ## "POLYGON((-11 34, 41 34, 41 72, -11 72, -11 34))" , clockwise (wrong) "POLYGON((-31.3 32.4, 69.1 32.4, 69.1 81.9, -31.3 81.9, -31.3 32.4))"
pred_not(pred_in("basisOfRecord",c("FOSSIL_SPECIMEN","LIVING_SPECIMEN"))),
format = "SIMPLE_CSV",
user=user,pwd=pwd,email=email
)

occ_data

This is the DOI of one of the last downloads: https://doi.org/10.15468/dl.t8am54

I guess there are issues both with the species names and probably with the polygon.

Let me know, thanks :)

Best

Francesco

@jhnwllr
Collaborator

jhnwllr commented Feb 21, 2024

@francescorota93

I think your main problem is that some of your names are matching to "higherrank", which basically means the names aren't present in the GBIF backbone. I would remove any names that match to higherrank, and this will probably reduce your download size a lot.

# A tibble: 4 × 2
# Groups:   status [4]
  status       n
  <chr>    <int>
1 ACCEPTED  2238
2 DOUBTFUL     2
3 SYNONYM    279
4 NA           5

  matchType      n
  <chr>      <int>
1 EXACT       2440
2 FUZZY          5
3 HIGHERRANK    74
4 NONE           5
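
For reference, counts like these can be reproduced with something like (a sketch, using the same species_list as above):

name_backbone_checklist(species_list) %>% group_by(status) %>% count()
name_backbone_checklist(species_list) %>% group_by(matchType) %>% count()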

@francescorota93
Author

OK, thanks. I will try to remove the HIGHERRANK matches and keep only the accepted names; I will let you know.

@jhnwllr jhnwllr added this to the 3.8.0 milestone Feb 21, 2024
@francescorota93
Author

I used this now:
gbif_taxon_keys <- species_list %>%
  name_backbone_checklist() %>%                           # match to backbone
  filter(matchType == "EXACT", status == "ACCEPTED") %>%  # keep exact, accepted matches only
  pull(usageKey)

gbif_taxon_keys1 <- unique(gbif_taxon_keys) ## 2161 species

occ_data <- occ_download(
pred_in("taxonKey", gbif_taxon_keys1), # important to use pred_in
pred("hasCoordinate", TRUE),
pred("hasGeospatialIssue", FALSE),
pred_within("POLYGON((-31.3 32.4, -31.3 81.9, 69.1 81.9, 69.1 32.4, -31.3 32.4))"), ## "POLYGON((-11 34, 41 34, 41 72, -11 72, -11 34))" , clockwise (wrong) "POLYGON((-31.3 32.4, 69.1 32.4, 69.1 81.9, -31.3 81.9, -31.3 32.4))"
pred_not(pred_in("basisOfRecord",c("FOSSIL_SPECIMEN","LIVING_SPECIMEN"))),
format = "SIMPLE_CSV",
user=user,pwd=pwd,email=email
)

d <- occ_download_get('0007908-240216155721649') %>%
occ_download_import()

The number of species is fine now; however, for many of them the occurrences are incomplete (check e.g. Abies alba or Picea abies, which come back with far fewer occurrences than I would expect). Should I also keep SYNONYM in status and FUZZY in matchType?
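
One option I am considering (a sketch, assuming name_backbone_checklist() fills in acceptedUsageKey for synonym matches, so they can be downloaded under their accepted keys):

m <- name_backbone_checklist(species_list)
gbif_taxon_keys <- m %>%
  filter(matchType %in% c("EXACT", "FUZZY"),
         status %in% c("ACCEPTED", "SYNONYM")) %>%
  mutate(key = coalesce(acceptedUsageKey, usageKey)) %>%  # accepted key where present
  pull(key) %>%
  unique()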

@jhnwllr
Collaborator

jhnwllr commented Feb 21, 2024

There might still be an issue with your polygon winding the wrong way.

polygon from your code
POLYGON((-31.3 32.4, 69.1 32.4, 69.1 81.9, -31.3 81.9, -31.3 32.4))

polygon from download
POLYGON((-31.3 32.4, -31.3 81.9, 69.1 81.9, 69.1 32.4, -31.3 32.4))
https://www.gbif.org/occurrence/download/0007908-240216155721649
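
A quick way to avoid hand-ordering the ring is gbif_bbox2wkt() from rgbif, which should emit the counter-clockwise ring GBIF expects:

gbif_bbox2wkt(minx = -31.3, miny = 32.4, maxx = 69.1, maxy = 81.9)
# should give "POLYGON((-31.3 32.4,69.1 32.4,69.1 81.9,-31.3 81.9,-31.3 32.4))"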

@francescorota93
Author

Thanks, I fixed the polygon and made another query:

occ_data <- occ_download(
pred_in("taxonKey", gbif_taxon_keys1), # important to use pred_in
pred("hasCoordinate", TRUE),
pred("hasGeospatialIssue", FALSE),
pred_within(gbif_bbox2wkt(minx = -31.3, miny = 32.4, maxx = 69.1, maxy = 81.8)), ## "POLYGON((-11 34, 41 34, 41 72, -11 72, -11 34))" , clockwise (wrong) "POLYGON((-31.3 32.4, 69.1 32.4, 69.1 81.9, -31.3 81.9, -31.3 32.4))"
pred_not(pred_in("basisOfRecord",c("FOSSIL_SPECIMEN","LIVING_SPECIMEN"))),
format = "SIMPLE_CSV",
user=user,pwd=pwd,email=email
)

occ_data

d <- occ_download_get('0008830-240216155721649') %>%
occ_download_import()

However, the number of occurrences is still lower than expected:

down_occ <- table(as.factor(d$species))
head(down_occ)

                         Abies alba      Acer campestre         Acer opalus 
           1843                3052               13820                1506 
   Acer platanoides Acer pseudoplatanus 
              13291               22743 

I will try again with the proper request and polygon, but splitting the list into chunks with fewer species.
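
Roughly like this (a sketch of the chunking loop; occ_download_wait() blocks until each download finishes, which also keeps me under the limit on concurrent downloads):

chunks <- split(gbif_taxon_keys1, ceiling(seq_along(gbif_taxon_keys1) / 50))

results <- list()
for (i in seq_along(chunks)) {
  dl <- occ_download(
    pred_in("taxonKey", chunks[[i]]),
    pred("hasCoordinate", TRUE),
    pred("hasGeospatialIssue", FALSE),
    pred_within(gbif_bbox2wkt(minx = -31.3, miny = 32.4, maxx = 69.1, maxy = 81.8)),
    pred_not(pred_in("basisOfRecord", c("FOSSIL_SPECIMEN", "LIVING_SPECIMEN"))),
    format = "SIMPLE_CSV",
    user = user, pwd = pwd, email = email
  )
  occ_download_wait(dl)   # wait for this request before starting the next
  results[[i]] <- occ_download_get(dl) %>% occ_download_import()
}
d <- dplyr::bind_rows(results)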

@francescorota93
Author

Splitting into chunks with fewer species works fine now, at least for the first chunks:

table(as.factor(d$species))

           Abies alba            Acer campestre               Acer opalus 
               201352                    436076                     56684 
     Acer platanoides       Acer pseudoplatanus       Achillea erba-rotta 
               327219                    694768                      8576 
 Achillea macrophylla      Achillea millefolium          Achillea nobilis 
                 5775                   1033674                     11727 
    Achillea ptarmica          Achillea setacea        Achillea tomentosa 
               242224                      5213                      3068 

Achnatherum calamagrostis       Aconitum lycoctonum         Aconitum napellus 
                    68214                     48626                     41959 
      Aconitum variegatum        Acrocordia gemmata            Actaea spicata 
                     6891                      9662                     96543 
      Actinidia chinensis     Adenophora liliifolia     Adenostyles alliariae 
                      602                      2370                     39418 
       Adenostyles alpina Adiantum capillus-veneris           Adonis vernalis 
                    28410                     22208                     22452 
      Adoxa moschatellina     Aegopodium podagraria    Aesculus hippocastanum 
                   115235                    459150                    163052 
         Aethusa cynapium         Agaricus augustus       Agaricus sylvaticus 
                   108623                      6310                      8621 
       Agaricus sylvicola         Agonimia allobata       Agonimia tristicula 
                     9558                      1009                      5142 
      Agrimonia eupatoria         Agrimonia procera           Agrostis canina 
                   311111                     22856                    191572 
      Agrostis capillaris         Agrostis gigantea        Agrostis rupestris 
                   832968                    108800                     22180 
    Agrostis schraderiana      Agrostis stolonifera       Ailanthus altissima 
                     8867                    752792                     93875 
        Aira caryophyllea         Ajuga chamaepitys          Ajuga genevensis 
                    83626                     37629                     36173 
        Ajuga pyramidalis             Ajuga reptans               Alcea rosea 
                    70224                    428209                     29269 
        Alchemilla mollis      Alchemilla monticola 
                    21629                     58794 

I will let you know whether it worked for all the species once the process finishes.

@jhnwllr
Collaborator

jhnwllr commented Feb 22, 2024

@francescorota93

I made this download with just Abies alba, and it seems fine. I honestly don't think splitting into chunks is going to make a difference.

https://www.gbif.org/occurrence/download/0010747-240216155721649

occ_download(
pred_in("taxonKey", 2685484), 
pred("hasCoordinate", TRUE),
pred("hasGeospatialIssue", FALSE),
pred_within("POLYGON((-31.3 32.4,69.1 32.4,69.1 81.8,-31.3 81.8,-31.3 32.4))"),
pred_not(pred_in("basisOfRecord",c("FOSSIL_SPECIMEN","LIVING_SPECIMEN"))),
format = "SIMPLE_CSV"
)

@francescorota93
Author

francescorota93 commented Feb 22, 2024

I don't know either why a long list downloads fewer occurrences than expected. With only one species it is fine, but with a large list it does not work properly; with 50 species per chunk, by contrast, it works as it should.

See my comments above: with the full list, Abies alba got 3052 occurrences, while with chunks it got 201352.

@jhnwllr
Collaborator

jhnwllr commented Feb 22, 2024

https://www.gbif.org/occurrence/download/0010766-240216155721649
I checked this BIG download with all of the taxonKeys, and it appears Abies alba gets the right count: 200877.
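
For a single key you can also sanity-check the count without a download (a sketch; occ_search() with limit = 0 just returns the count, though it cannot express the basisOfRecord exclusion, so it will count slightly high):

wkt <- "POLYGON((-31.3 32.4,69.1 32.4,69.1 81.8,-31.3 81.8,-31.3 32.4))"
occ_search(taxonKey = 2685484, geometry = wkt,
           hasCoordinate = TRUE, hasGeospatialIssue = FALSE,
           limit = 0)$meta$count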

@francescorota93
Author

The download you were checking gives this result for me:

It downloads data for 2313 species

length(table(as.factor(d$species)))
[1] 2313

but for Abies alba I have only 4346 observations:

table(as.factor(d$species))["Abies alba"]
Abies alba
4346

@jhnwllr
Collaborator

jhnwllr commented Feb 23, 2024

This is what I did on our GBIF servers.

There could be some problem with importing the really large file.

val df = spark.read.
  option("header", "true").
  option("delimiter", "\t").
  csv("0010766-240216155721649.csv")

df.count()
df.printSchema()

df.filter($"taxonKey" === "2685484").count()

Then again, there could be something deeper going on. Maybe occ_download_import isn't importing the entire download, and that is why your chunks method works.
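
If the import is the problem, reading the extracted TSV directly is one way to check (a sketch; the zip contains <downloadKey>.csv, which is tab-delimited, and quote = "" can matter because stray quote characters in fields may otherwise swallow rows):

library(data.table)
d <- fread("0010766-240216155721649.csv", sep = "\t", quote = "")
d[taxonKey == 2685484, .N]  # compare against the server-side count (~200877)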

@jhnwllr
Collaborator

jhnwllr commented Feb 23, 2024

Just a thought, but since you are working with a really large download, you might want to try another file format. It isn't well documented, but you can also download parquet files directly.

https://data-blog.gbif.org/post/apache-arrow-and-parquet/

occ_download(
pred_in("taxonKey", long_list_of_keys), 
pred("hasCoordinate", TRUE),
pred("hasGeospatialIssue", FALSE),
pred_within("POLYGON((-31.3 32.4,69.1 32.4,69.1 81.8,-31.3 81.8,-31.3 32.4))"),
pred_not(pred_in("basisOfRecord",c("FOSSIL_SPECIMEN","LIVING_SPECIMEN"))),
format = "SIMPLE_PARQUET"
)

https://api.gbif.org/v1/occurrence/download/request/0013314-240216155721649.zip
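
Reading it back could look like this (a sketch; if I remember right, the zip unpacks to an occurrence.parquet folder and the column names are lower-cased in this format, but check the blog post above):

library(arrow)
library(dplyr)

occ_download_get("0013314-240216155721649")  # saves 0013314-240216155721649.zip
unzip("0013314-240216155721649.zip", exdir = "parquet_dl")
ds <- open_dataset("parquet_dl/occurrence.parquet")  # folder of parquet parts
ds %>% filter(taxonkey == 2685484) %>% summarise(n = n()) %>% collect()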

@francescorota93
Author

OK, great, thanks for your response and help. I will also check this parquet format.

@jhnwllr
Collaborator

jhnwllr commented Mar 5, 2024

gbif/occurrence#340

@jhnwllr jhnwllr removed this from the 3.8.0 milestone May 17, 2024